
Survey: Structured Representation for Generalizable Manipulation Skill Modeling

1. Introduction & Problem Definition

Robotic manipulation remains one of the central unsolved problems in embodied AI. A robot that can pick up a mug in a lab demo is, in the most optimistic framing, a robot that can pick up that mug, from that pose, under that lighting. The field's core aspiration — manipulation skills that generalize across objects, configurations, tasks, and environments — has proven stubbornly difficult. This survey argues that the choice of structured representation is the decisive factor determining whether a manipulation system generalizes or merely memorizes, and that the field is converging on this insight from multiple independent directions.

Why Generalization is Hard

The difficulty of generalizable manipulation stems from a combinatorial explosion along multiple axes simultaneously. Objects vary in shape, size, material, and articulation. Scenes vary in clutter, occlusion, and spatial arrangement. Tasks vary in goal specification, horizon length, and required contact dynamics. A policy that encodes all of this into a monolithic neural representation — whether an image embedding, a point cloud feature, or a language-conditioned latent — must learn invariances that are extraordinarily difficult to extract from finite data.

End-to-end approaches (RT-1/2, OpenVLA, \(\pi_0\)) have attempted to brute-force this problem with scale: more data, bigger models, broader task distributions. The results are impressive but fragile. As [35] demonstrate, policies achieving near-perfect single-skill performance collapse to 0–50% success when those same skills are composed in cluttered scenes. [12] show that a fine-tuned 8B-parameter model with explicit scene graph reasoning outperforms GPT-5 on complex manipulation tasks — a striking result suggesting that representation structure matters more than model scale.

The Structured Representation Thesis

This survey covers work that addresses generalization through structured intermediate representations — scene graphs, affordance maps, keypoints, relational predicates, object-centric decompositions, and compositional world models — that impose inductive biases reflecting the physical and semantic structure of manipulation. The thesis is that these representations succeed precisely because they decompose the monolithic perception-to-action mapping into components that mirror the compositional structure of the physical world: objects are distinct entities, interactions are pairwise, tasks decompose into subtasks, and spatial relationships are the language of manipulation.

We survey 58 papers spanning 2016–2026, organized along two primary dimensions: the type of structure imposed (graphs, affordances, keypoints, symbols) and the role the structure plays in the manipulation pipeline (perception, planning, policy learning, world modeling). The survey is written for researchers working at the intersection of robot learning, task and motion planning, and foundation models for embodied AI.

2. Taxonomy of Approaches

We organize the field into six families, distinguished by how they structure the representation and where that structure enters the manipulation pipeline.

Structured Representations for Generalizable Manipulation
  • A Dense Spatial & Keypoint Representations
    • A1. Dense visual descriptors (DON, correspondence-based)
    • A2. Semantic keypoints (kPAM, KETO)
    • A3. Spatial action maps (Transporter Networks, CLIPort)
  • B Affordance-Based Representations
    • B1. Per-point affordance prediction (Where2Act, VAT-Mart, AdaAfford)
    • B2. Human-video affordance transfer (VRB, Robo-ABC, General Flow)
    • B3. Affordance-guided policies (AffordDP, SAGA, PALM)
  • C Scene Graph Representations
    • C1. Scene graphs for task planning (Zhu et al. 2020, cg+, ConceptGraphs, MoMa-LLM, DovSG)
    • C2. Scene graphs as policy inputs (Compose by Focus, KG-M3PO, GF-VLA, GSR)
    • C3. Action-conditioned / executable graphs (RoboEXP, UniManip)
  • D Object-Centric Decompositions
    • D1. Relational spatial reasoning (SORNet, StructFormer, StructDiffusion)
    • D2. Segmented 3D representations (GROOT, DreMa)
    • D3. Object-centric world models (3D-OES, OCWM, RoboDreamer)
  • E Symbolic & Neuro-Symbolic Representations
    • E1. Skill-induced symbols (Konidaris et al. 2018)
    • E2. Learned operators & predicates (Silver et al. 2021, Ahmetoglu et al. 2024, InterPreT)
    • E3. Neuro-symbolic predicates (VisualPredicator)
  • F Relational Dynamics & Graph-Based Skill Models
    • F1. Graph neural dynamics (Interaction Networks, GN physics engines)
    • F2. Relational RL for manipulation (Li et al. 2020)
    • F3. Compositional factor graphs (Generative Factor Chaining)
    • F4. Heterogeneous/knowledge graphs (SoftGPT, skill libraries, GTN)
Table 1: Taxonomy summary with representative papers and key properties.

Family | Representative Papers | Structure Type | Generalization Axis | Manual Design?
A. Spatial/Keypoint | DON, kPAM, Transporter | Per-pixel/point features, keypoints | Object instance, viewpoint | Partial (keypoint selection)
B. Affordance | Where2Act, VRB, SAGA | Contact points, trajectories, heatmaps | Object category, cross-embodiment | Minimal (self-supervised)
C. Scene Graph | ConceptGraphs, GSR, Compose by Focus | Nodes=objects, edges=relations | Scene configuration, task composition | Moderate (relation vocabulary)
D. Object-Centric | GROOT, StructDiffusion, DreMa | Per-object features/point clouds | Background, viewpoint, novel objects | Minimal (segmentation-based)
E. Symbolic | LOFT, VisualPredicator, InterPreT | PDDL predicates/operators | Task horizon, novel goals | Moderate (predicate space)
F. Relational Dynamics | IN, GN, Points2Plans | Object-relation graphs for dynamics | Object count, configuration | Low (graph structure from physics)

A key observation: these families are not mutually exclusive, and the strongest recent systems combine elements from multiple families. [6] uses affordance heatmaps (B) grounded via foundation model features over 3D point clouds (D). [12] reasons over scene graphs (C) using learned symbolic transitions (E). [35] feeds scene graphs (C) into diffusion policies with 3D object-centric encodings (D). The trend is toward hybrid structured representations that layer multiple forms of structure.

3. Timeline & Evolution

Phase 1: Learning Structured Perception for Manipulation (2016–2019)

The field's foundations were laid by two parallel threads. In computer vision, [5] introduced Interaction Networks, establishing that factoring dynamics into objects and pairwise relations — a graph structure — enables generalization across system sizes. [22] and [44] developed scene graph generation methods that, while not targeting manipulation, created the perceptual infrastructure that later manipulation work would consume. [2] extended scene graphs to 3D, a critical step for robotics where 2D representations are viewpoint-dependent.

In manipulation, [7] introduced Dense Object Nets (DON), demonstrating that self-supervised dense visual descriptors could serve as task-agnostic representations enabling category-level grasp transfer. [31] built on DON with kPAM, proposing semantic 3D keypoints as a sparse, task-relevant structured representation — the first clean formalization of keypoint affordances for category-level pick-and-place. Concurrently, [18] provided the theoretical foundation for how motor skills induce symbolic representations suitable for planning, establishing the "skills entail symbols" principle.

Inflection point: DON and kPAM demonstrated that less representation (sparse keypoints vs. full pose) can yield more generalization — a counterintuitive insight that oriented the field away from full pose estimation toward task-relevant structured abstractions.

Phase 2: Structured Action Spaces and Affordance Prediction (2020–2022)

[47] introduced Transporter Networks, achieving extraordinary sample efficiency (100% from 1 demo on block insertion) by structuring the action space as spatial displacements with built-in translational equivariance. This was not just a better architecture — it was a demonstration that structuring the action representation matters as much as structuring the observation. [37] extended this with CLIPort, fusing CLIP's semantic structure with Transporter's spatial structure, showing that internet-pretrained semantic knowledge could substitute for task-specific training data.

Simultaneously, the affordance prediction line emerged. [33] introduced Where2Act, predicting per-pixel actionability for articulated objects — framing manipulation affordance as a dense visual prediction task rather than requiring explicit part models. [41] extended this with VAT-Mart, adding trajectory proposals to static affordance maps. [40] contributed AdaAfford, showing that affordance priors could be adapted online through strategic interaction — a critical insight that static affordance representations are insufficient for objects with hidden kinematic properties.

On the scene graph planning side, [48] introduced dual-level geometric/symbolic scene graphs with GNN processing for hierarchical manipulation planning, achieving 70%+ real-robot success while being nearly 4 orders of magnitude faster than PDDLStream. [46] (SORNet) and [27] (StructFormer) demonstrated transformer-based relational reasoning over objects for rearrangement, with StructFormer showing that joint multi-object reasoning outperforms pairwise approaches.

Inflection point: Where2Act and VAT-Mart established affordance as a dense spatial prediction rather than a categorical label, creating a representation that naturally interfaces with both learning-based and optimization-based downstream planners.

Phase 3: Foundation Models Meet Structured Representations (2023–2024)

The arrival of large vision-language models (CLIP, SAM, GPT-4V, Stable Diffusion) fundamentally changed the economics of structured representation. [8] (ConceptGraphs) showed that open-vocabulary 3D scene graphs could be built by composing off-the-shelf foundation models (SAM + CLIP + LLaVA + GPT-4), without any 3D dataset finetuning. This made scene graphs practical as a general-purpose interface between perception and LLM-based planning.

[3] (VRB) demonstrated that visual affordances extracted from human video could drive real robot learning across four distinct paradigms (imitation, exploration, goal-conditioned, RL), establishing affordance as a versatile cross-embodiment representation. [17] (Robo-ABC) showed that diffusion model features contain emergent semantic correspondence enabling zero-shot cross-category affordance transfer — no task-specific training needed.

On the planning side, [11] (MoMa-LLM) proved that structured scene graph encodings dramatically outperform raw representations when grounding LLMs for mobile manipulation planning. [15] (RoboEXP) introduced action-conditioned scene graphs that encode not just spatial relations but how to act on them, making the graph itself an executable plan.

[49] (GROOT) validated that object-centric 3D representations — segmented point clouds in the robot frame — provide strong invariance to backgrounds, viewpoints, and novel object instances with minimal architectural overhead. [28] (StructDiffusion) showed that diffusion models over structured object-centric pose spaces outperform autoregressive approaches for multi-object rearrangement.

Inflection point: Foundation models made open-vocabulary structured representations practical. The question shifted from "can we build scene graphs/affordance maps?" to "how should we use them for manipulation?"

Phase 4: Structured Representations as the Core Skill Interface (2025–2026)

The most recent work treats structured representations not as auxiliary perception modules but as the primary interface for manipulation skill learning and composition.

[35] (Compose by Focus) demonstrated that scene graphs as direct policy inputs — not just for high-level planning — enable compositional generalization: skills trained in isolation compose in cluttered scenes because the graph filters out irrelevant visual variation. This is arguably the clearest evidence that representation structure, not training scale, is the key to compositional manipulation.

[12] (GSR) models world-state evolution as explicit transitions over scene graphs, achieving 92.5% on RLBench kitchen operations with an 8B model — outperforming GPT-5 and Gemini-2.5-Pro. [25] (UniManip) introduced a bi-level agentic operational graph that simultaneously serves as world model, task plan, and recovery memory, achieving 93.75% on zero-shot pick-and-place with a single wrist-mounted camera.

[6] (SAGA) proposed structured affordance grounding as a scalable bridge between foundation model reasoning and low-level control, training on only 2,410 demonstrations — two orders of magnitude fewer than prior generalist policies. [29] (PALM) showed that predicting future affordances as intermediate latent representations stabilizes long-horizon control, achieving 91.8% on LIBERO-LONG.

Inflection point: The field has reached a point where structured representations consistently outperform end-to-end approaches of much larger scale. The debate is no longer whether to use structure, but what kind and where in the pipeline.

4. Key Papers & Contributions

4.1 Dense Spatial & Keypoint Representations

Dense Object Nets [7] established that learned per-pixel descriptors could serve as a task-agnostic representation for manipulation, achieving category-level generalization via self-supervised contrastive learning on RGBD data. The key insight — that dense descriptors eliminate the need for explicit pose estimation while retaining spatial precision — opened the entire line of correspondence-based manipulation.
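
To make the mechanism concrete, here is a minimal numpy sketch (with random arrays standing in for network outputs) of how a dense-descriptor system transfers a point of interest: a pixel chosen on a source object is re-located on a new instance by nearest-neighbor search in descriptor space.

import numpy as np

def best_match(desc_src, desc_tgt, uv):
    """Find the pixel in the target image whose descriptor is nearest
    (in L2 distance) to the descriptor at pixel uv in the source image.
    desc_src, desc_tgt: (H, W, D) dense descriptor maps, e.g. the output
    of a DON-style network; uv: (row, col) query pixel in the source."""
    query = desc_src[uv[0], uv[1]]                     # (D,)
    dists = np.linalg.norm(desc_tgt - query, axis=-1)  # (H, W)
    return np.unravel_index(np.argmin(dists), dists.shape)

# Toy usage: random "descriptor maps" stand in for real network outputs.
rng = np.random.default_rng(0)
d_src = rng.normal(size=(64, 64, 3))
d_tgt = rng.normal(size=(64, 64, 3))
print(best_match(d_src, d_tgt, (10, 20)))   # matched (row, col) in target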

kPAM [31] refined this to semantic 3D keypoints, showing that manipulation goals could be formulated as geometric constraints on a small set of task-relevant points rather than full SE(3) poses. The 98% success rate on shoe placement across 20 diverse instances (including boots and sneakers with vastly different topology) demonstrated that keypoints capture what matters for the task while discarding irrelevant geometric detail. The limitation — manual keypoint definition per category — remains an active challenge.
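
In its simplest unconstrained form, the keypoint formulation reduces to rigid alignment: find the transform that carries the detected keypoints to their task-specified target locations. The sketch below solves that least-squares version with the standard Kabsch/SVD method; kPAM itself solves a more general constrained optimization, so this illustrates the idea rather than the paper's solver.

import numpy as np

def kabsch_se3(p, q):
    """Least-squares rigid transform (R, t) mapping keypoints p onto
    their target locations q; both are (N, 3) arrays."""
    pc, qc = p.mean(axis=0), q.mean(axis=0)
    H = (p - pc).T @ (q - qc)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, qc - R @ pc

# Shoe-on-rack toy: carry heel/toe/tongue keypoints to their targets.
detected = np.array([[0.5, 0.1, 0.0], [0.7, 0.1, 0.0], [0.6, 0.1, 0.08]])
target   = np.array([[0.0, 0.4, 0.2], [0.2, 0.4, 0.2], [0.1, 0.4, 0.28]])
R, t = kabsch_se3(detected, target)
print(R @ detected[0] + t)                   # approximately target[0]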

Transporter Networks [47] contributed the most sample-efficient structured action representation: spatial feature transport via cross-correlation, achieving 100% from a single demonstration. The architectural insight that manipulation can be decomposed into "where to attend" and "where to transport" with built-in equivariance remains influential, though the restriction to SE(2) tabletop settings limits applicability. CLIPort [37] extended this with CLIP's semantic stream, demonstrating that frozen internet-pretrained features can substitute for task-specific demonstrations — the multi-task model matched or beat single-task in 57% of evaluations.
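
The "where to transport" step fits in a few lines of numpy: a feature crop around the pick location is slid over the scene's feature map as a cross-correlation kernel, so every candidate placement is scored by the same template. That shared template is exactly where the translational equivariance comes from. A minimal sketch, not the paper's fully convolutional implementation:

import numpy as np

def transport_place(feat, crop):
    """Score placements by cross-correlating a feature crop (taken
    around the pick point) against the full feature map, and return
    the (row, col) of the best-scoring placement.
    feat: (H, W, D) scene features; crop: (h, w, D) kernel."""
    H, W, _ = feat.shape
    h, w, _ = crop.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            scores[i, j] = np.sum(feat[i:i + h, j:j + w] * crop)
    return np.unravel_index(np.argmax(scores), scores.shape)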

Assessment: This family provides the strongest sample efficiency but is limited to relatively simple action spaces (pick-and-place, SE(2)). The keypoint/descriptor paradigm is most powerful when task-relevant structure can be captured by a small number of spatial correspondences.

4.2 Affordance-Based Representations

The affordance family has evolved through three generations, each expanding what "affordance" means for manipulation.

Generation 1: Static per-point affordance. [33] (Where2Act) pioneered dense affordance prediction for articulated objects, but with modest absolute success rates (a sample success rate of ~29% for pushing) and no trajectory information. [30] showed that visual affordance maps from human thermal contact data could guide dexterous grasping RL, achieving 63% success with 3x better sample efficiency than pure RL — the key insight being to decouple "where to grasp" from "how to grasp."
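
As a concrete picture of what Generation 1 computes, the sketch below scores per-point actionability, assuming a backbone (PointNet++ in Where2Act) has already produced per-point features; the head is illustrative, not the paper's exact architecture.

import torch
import torch.nn as nn

class PointAffordanceHead(nn.Module):
    """Map each point's feature to the probability that an interaction
    (e.g. pulling) succeeds there, in the spirit of Where2Act."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, point_feats):                # (N, feat_dim)
        return self.mlp(point_feats).squeeze(-1)   # (N,) scores in [0, 1]

feats = torch.randn(2048, 128)    # stand-in for backbone features
scores = PointAffordanceHead()(feats)
print(scores.argmax())            # index of the most actionable point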

Generation 2: Affordance + trajectory. [41] (VAT-Mart) extended static affordance to include trajectory proposals — predicting not just where but how to interact. The interaction-for-perception training loop (RL generates data for perception, perception provides curiosity rewards for RL) became a template for subsequent work. [40] (AdaAfford) added test-time adaptation, showing that even 1–4 exploratory interactions could substantially improve affordance accuracy on instances with hidden kinematic properties (e.g., pushing affordance on a closed door improved from 48% to 80% F-score).

Generation 3: Foundation-model affordance transfer. [3] (VRB) demonstrated that contact points + post-contact trajectories extracted from human video serve as a versatile, embodiment-agnostic affordance interface — achieving 57% average across 8 tasks versus 25% for the next best method. [17] (Robo-ABC) showed that diffusion model features enable zero-shot cross-category affordance transfer (85.7% real-world grasping across 7 unseen categories). [45] (General Flow) pushed further, using 3D flow from human video as a universal affordance achieving 81% across 18 tasks including deformable objects — the strongest human-to-robot transfer result to date.

The latest affordance-guided policies integrate affordance as structured conditioning for diffusion policies. [42] (AffordDP) decomposes affordance into static (contact point via semantic correspondence) and dynamic (trajectory via geometric registration) components, steering diffusion denoising toward transferred affordances. [6] (SAGA) converts tasks into affordance-entity pairs grounded as 3D heatmaps, training on 2,410 demos — two orders of magnitude fewer than VLA approaches. [29] (PALM) predicts four complementary affordance types as latent queries for a diffusion transformer, achieving 91.8% on LIBERO-LONG.
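
To illustrate affordance-as-guidance, here is a schematic toy of one guided denoising step: after the model's update, the action sample is nudged down the gradient of a distance-to-contact-point cost. The guidance rule, the cost, and the denoiser signature are illustrative assumptions, not AffordDP's actual mechanism.

import torch

def guided_denoise_step(x, denoiser, t, contact_target, scale=0.1):
    """One toy guided-diffusion step: apply the model's denoising
    update, then steer the first three action dims (an end-effector
    position, by assumption) toward a transferred contact point."""
    x = denoiser(x, t)                       # model's denoising update
    with torch.enable_grad():
        xg = x.detach().requires_grad_(True)
        cost = ((xg[..., :3] - contact_target) ** 2).sum()
        cost.backward()
    return x - scale * xg.grad               # nudge toward the affordance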

Assessment: The affordance family is the most practically successful for cross-object and cross-embodiment generalization. Its strength is that affordance representations are task-relevant by construction — they encode how to interact rather than just what is there. The weakness is that affordance vocabularies remain relatively fixed; extending to new interaction types requires redesign.

4.3 Scene Graph Representations

Scene graphs have emerged as perhaps the most versatile structured representation, serving roles from high-level planning to low-level policy conditioning.

For planning: [48] pioneered dual-level (geometric + symbolic) scene graphs for hierarchical manipulation planning, achieving 70.6% real-robot success while being ~4 orders of magnitude faster than PDDLStream. The regression planning approach — searching backward from goals using GNN-learned preimage networks — generalized from 18 short demonstrations to much longer horizons. [16] introduced Contact Graph+ (cg+), reducing planning to graph editing operations that directly map to robot actions, achieving 40–100x speedup over PDDLStream.

Open-vocabulary scene graphs: [8] (ConceptGraphs) built the standard pipeline for constructing open-vocabulary 3D scene graphs from foundation models. The key result is that LLM-based retrieval over scene graphs dramatically outperforms CLIP for complex queries — R@1 of 0.80 vs. 0.26 for negation queries, and perfect 1.0 vs. 0.0 for affordance queries on real lab scenes. [11] (MoMa-LLM) proved the complementary point: structured scene graph encoding (hierarchical room-object format with frontier descriptions) outperforms raw encodings for LLM-based planning (87.2 vs. 77.6 AUC-E). [43] (DovSG) addressed the practical challenge of dynamic environments, achieving 27x faster local scene graph updates than full reconstruction.
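
A minimal sketch of the underlying data structure, plus the CLIP-similarity retrieval that ConceptGraphs compares against LLM-based retrieval; the field names here are illustrative.

import numpy as np
from dataclasses import dataclass

@dataclass
class SceneNode:
    """One open-vocabulary node: a 3D object with an LLM-readable
    caption and a unit-norm CLIP-style embedding."""
    caption: str
    centroid: np.ndarray     # (3,) position in the map frame
    clip_feat: np.ndarray    # (D,) unit-norm embedding

def retrieve(nodes, text_feat):
    """Rank nodes by cosine similarity to an embedded text query.
    ConceptGraphs' finding: for negation and affordance queries,
    handing the node captions to an LLM beats this similarity search."""
    sims = [float(n.clip_feat @ text_feat) for n in nodes]
    return nodes[int(np.argmax(sims))]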

As policy inputs — the frontier: [35] (Compose by Focus) is, in our view, the most important recent paper in this family. By feeding task-relevant sub-scene graphs (just 3–4 nodes) directly into a diffusion policy via a Graph Attention Network, they achieve 0.78–0.97 success on skill composition where Diffusion Policy (0.0–0.52), DP3 (0.07–0.48), and \(\pi_0\) (0.02–0.77) fail. The result is stark: the same policy architecture with and without scene graph conditioning differs by 50+ percentage points on composition. [34] (KG-M3PO) showed that training the GNN end-to-end through RL gradients — rather than using the scene graph as a frozen perception module — further improves performance, especially under partial observability (58% vs. 4% for camera-only).
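
A minimal single-head graph-attention layer conveys the mechanism: node features from the sub-scene graph are mixed by attention and pooled into a single vector that conditions the policy. This is a sketch of the idea in torch, far simpler than the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniGAT(nn.Module):
    """One graph-attention layer over a sub-scene graph, pooled into
    a conditioning vector for a downstream (diffusion) policy."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):          # x: (N, in_dim), adj: (N, N)
        h = self.W(x)
        N = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(N, N, -1),
                          h.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pair).squeeze(-1))       # edge scores
        alpha = torch.softmax(e.masked_fill(adj == 0, float('-inf')), -1)
        return (alpha @ h).mean(dim=0)  # pooled graph embedding

# A 4-node sub-scene graph (e.g. gripper, object, target, distractor).
x, adj = torch.randn(4, 32), torch.ones(4, 4)
cond = MiniGAT(32, 64)(x, adj)          # conditioning vector for the policy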

Action-conditioned and executable graphs: [15] (RoboEXP) enriched scene graphs with action nodes and four edge types (object→object, object→action, action→object, action→action), making the graph itself an executable plan — to retrieve any object, execute actions along the topological path. This achieved 70–90% success on tasks where GPT-4V baselines scored 0–30%. [25] (UniManip) extended this to a bi-level agentic operational graph serving simultaneously as world model, task plan, and recovery memory.
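
The "graph as executable plan" property can be shown with a toy action-conditioned graph (built with networkx; node names and edge types are illustrative): retrieving an object amounts to executing its ancestor action nodes in topological order.

import networkx as nx

# Object and action nodes; edges record which action exposes which object.
G = nx.DiGraph()
G.add_edge("open(drawer)", "drawer_interior", type="action->object")
G.add_edge("drawer_interior", "remove(box)", type="object->action")
G.add_edge("remove(box)", "mug", type="action->object")

def plan_to_reach(graph, target):
    """Return the action nodes among the target's ancestors, in
    topological order: the graph itself yields the plan."""
    needed = nx.ancestors(graph, target) | {target}
    order = nx.topological_sort(graph.subgraph(needed))
    return [n for n in order if "(" in n]    # keep action nodes only

print(plan_to_reach(G, "mug"))    # ['open(drawer)', 'remove(box)']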

[12] (GSR) represents the culmination of this direction: treating scene graphs as the primary state space for an LLM-based reasoner, with explicit state-transition reasoning (predicting graph edits from actions, verifying goal satisfaction). The result — outperforming GPT-5 on RLBench with an 8B model — is the strongest evidence that structured representations can substitute for model scale.

Assessment: Scene graphs are the most general structured representation, capable of encoding objects, attributes, spatial relations, and action preconditions in a unified framework. Their weakness is computational overhead (foundation model inference for graph construction) and sensitivity to perception errors. The critical open question is whether graphs should be used for planning, policy conditioning, or both — recent evidence strongly favors both.

4.4 Object-Centric Decompositions

[46] (SORNet) showed that conditioning a Vision Transformer on canonical object views + scene images yields spatial representations that generalize zero-shot to unseen objects. The attention masking design — preventing object tokens from attending to each other — forces each object's embedding to be grounded in scene context rather than in other queries.
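
A sketch of that masking design, assuming torch's attention-mask convention in which True entries block attention; the token layout is illustrative.

import torch

def sornet_mask(n_obj, n_scene):
    """Object query tokens may attend to scene tokens (and themselves)
    but not to each other, forcing each object embedding to be grounded
    in the scene rather than in the other queries."""
    n = n_obj + n_scene
    mask = torch.zeros(n, n, dtype=torch.bool)   # False = allowed
    mask[:n_obj, :n_obj] = True                  # block object-object
    mask[:n_obj, :n_obj].fill_diagonal_(False)   # keep self-attention
    return mask

print(sornet_mask(2, 3).int())   # pass as attn_mask to an attention layer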

[27] (StructFormer) and [28] (StructDiffusion) advanced language-guided multi-object rearrangement. StructFormer's transformer captures holistic structural constraints that pairwise methods cannot (e.g., equal angular spacing in a circle), while StructDiffusion's diffusion formulation generates diverse valid arrangements with a 16% improvement over StructFormer. The progression from autoregressive to diffusion-based generation over structured object-centric spaces reflects a broader trend.

[49] (GROOT) provided the cleanest demonstration that explicit object-centric 3D segmentation is a powerful inductive bias: on hard generalization tests (novel backgrounds, camera shifts, new objects), GROOT achieves 38–83% where baselines collapse to 0–8%. The recipe — segment objects, back-project to 3D point clouds in robot frame, encode via PointNet, process via transformer — is simple but effective.
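
The back-projection step of that recipe is standard pinhole geometry. A minimal numpy version, assuming a depth image, a per-object mask, a 3x3 intrinsic matrix K, and a 4x4 camera-to-robot extrinsic:

import numpy as np

def backproject_to_robot(depth, mask, K, T_robot_cam):
    """Lift masked depth pixels to a 3D point cloud in the robot base
    frame: pixel to camera coordinates via the intrinsics, then camera
    to robot frame via the extrinsic."""
    v, u = np.nonzero(mask)                  # pixel rows/cols of object
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (N, 4)
    return (T_robot_cam @ pts_cam.T).T[:, :3]                 # (N, 3)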

[4] (DreMa) used object-centric Gaussian Splatting as a "learnable digital twin," generating ~800 imagined demonstrations from 1–5 real ones via equivariant transformations. The result — matching 20-demo PerAct performance from just 5 demos — shows that explicit compositional 3D representations enable data augmentation strategies impossible in latent spaces.
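
The augmentation principle is easy to state in code: transform the object model and the demonstration's gripper waypoints by the same rigid motion, and the pair remains a valid demonstration. A minimal sketch, with a point cloud standing in for DreMa's Gaussian-splat object models:

import numpy as np

def rotate_demo(points, gripper_xyz, angle, center):
    """Produce an "imagined" demonstration by rotating object points
    and gripper waypoints by the same rotation about a scene point
    (here about the z axis; DreMa's transformations are more general)."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    def rot(p):
        return (p - center) @ R.T + center

    return rot(points), rot(gripper_xyz)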

Assessment: Object-centric decompositions provide the strongest viewpoint/background invariance and the cleanest path to data augmentation. They are most effective when the scene can be cleanly segmented into distinct objects — which, for most manipulation scenarios, is a reasonable assumption.

4.5 Symbolic & Neuro-Symbolic Representations

[18] established the theoretical foundation: given a set of motor skills (options), the abstract symbolic vocabulary needed for planning can be derived mechanically from the skills' initiation and effect sets. This "skills entail symbols" principle — that the right symbols are determined by the agent's capabilities, not imposed top-down — is the deepest theoretical insight in this survey's scope. The limitation is that it produces propositional, not relational, symbols.
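
A toy rendering of the principle, with states reduced to hashable atoms so that initiation and effect sets are literal Python sets; [18] develops this formally over continuous state spaces.

from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    name: str
    initiation: frozenset    # states where the skill can be executed
    effect: frozenset        # states the skill can terminate in

grasp = Skill("grasp", initiation=frozenset({"hand_empty_near_mug"}),
              effect=frozenset({"holding_mug"}))
lift = Skill("lift", initiation=frozenset({"holding_mug"}),
             effect=frozenset({"mug_raised"}))

# The symbols needed for planning are exactly these sets: chaining
# lift after grasp is feasible iff grasp's effect set is contained
# in lift's initiation set.
print(grasp.effect <= lift.initiation)   # True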

[38] (LOFT) addressed the practical TAMP angle: learning PDDL operators from transition data, achieving 100% success on 31-step planning horizons where all neural baselines scored 0%. Access to classical planning heuristics (hAdd) via PDDL-compatible operators is a major practical advantage. [10] (InterPreT) closed the predicate learning gap by using GPT-4 to ground human language feedback into executable predicate functions, then compiling PDDL domains for classical planning — achieving 73% on combined test sets where Code-as-Policies managed 38%.

[23] (VisualPredicator) introduced neuro-symbolic predicates — Python code combining VLM queries with algorithmic logic — and learned optimistic operators via online exploration. Achieving ~90–95% on out-of-distribution tasks (more objects, longer horizons than training), it approaches oracle performance without manual abstraction engineering. [1] learned both object and relational predicates from unsupervised manipulation experience, using Gumbel-Sigmoid attention to produce explicit binary relations translatable to PDDL — achieving ~85% planning success with 4 objects where baselines reach ~25%.
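
The flavor of a neuro-symbolic predicate is easy to show: Python control flow wrapped around perception queries. The vlm_yes_no helper below is a hypothetical stand-in for a VLM call, not an API from the paper.

def vlm_yes_no(image, question):
    """Hypothetical stand-in for a yes/no query to a VLM; it answers
    False here so the sketch runs without a model."""
    return False

def on_table(obj, image):
    """A predicate as code + perception, in the style of VisualPredicator."""
    return vlm_yes_no(image, f"Is the {obj} resting on the table?")

def clear(obj, image, scene_objects):
    """Algorithmic logic composed with VLM queries: obj is clear iff
    no other object is reported to be on top of it."""
    return not any(
        vlm_yes_no(image, f"Is the {other} on top of the {obj}?")
        for other in scene_objects if other != obj)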

Assessment: Symbolic representations provide the strongest compositional generalization to longer horizons and more objects, because classical planners handle combinatorics that neural approaches cannot. The persistent bottleneck is the grounding problem — connecting symbols to continuous perception and action. Recent neuro-symbolic approaches are closing this gap rapidly.

4.6 Relational Dynamics & Graph-Based Skill Models

[5] (Interaction Networks) and [36] (Graph Networks as Physics Engines) established that factoring dynamics into objects + pairwise relations enables zero-shot generalization across system sizes — a model trained on 5-body systems transfers to 10+ bodies. This is the theoretical backbone for all graph-based dynamics models in manipulation.
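
The factored update takes only a few lines: per-edge effects are computed by one shared function, summed at each receiver, and fed to a shared per-object update, so nothing in the code depends on the number of objects. A minimal numpy sketch with hand-written stand-ins for the learned relation and object MLPs:

import numpy as np

def interaction_step(x, edges, f_rel, f_obj):
    """One Interaction Network update: x is (N, D) object states,
    edges is a list of (sender, receiver) pairs."""
    agg = np.zeros_like(x)
    for s, r in edges:
        agg[r] += f_rel(x[s], x[r])          # pairwise relational effect
    return np.array([f_obj(x[i], agg[i]) for i in range(len(x))])

# Toy "physics": springs pull neighbors together; any object count works.
f_rel = lambda xs, xr: 0.1 * (xs - xr)
f_obj = lambda xi, e: xi + e
x = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
print(interaction_step(x, edges, f_rel, f_obj))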

[39] (3D-OES) brought this to manipulation, achieving 0.86 MPC pushing success with sim-to-real transfer by learning 3D object-factorized dynamics via GNN. [20] connected graph-based relational reasoning to RL for multi-object manipulation, stacking 6 blocks with 75% success using 77x less data than prior work — the GNN's relational inductive bias is what makes curriculum transfer work across object counts.

[13] (Points2Plans) introduced composable relational dynamics for long-horizon TAMP, using delta-dynamics predictions with a hybrid latent-geometric rollout to achieve 85%+ real-world success. The key insight — that relative (delta) state changes compound less error than absolute predictions — is a practical contribution for any graph-based dynamics model.
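
The delta-dynamics idea in miniature: the model predicts state changes and the rollout accumulates them, so the prediction target never carries absolute-position bias. The delta_model below is a stand-in for the learned relational model.

import numpy as np

def rollout_delta(z0, actions, delta_model):
    """Roll out a model that predicts changes in the (N, D) latent
    object state rather than absolute next states."""
    z, traj = z0.copy(), [z0.copy()]
    for a in actions:
        z = z + delta_model(z, a)        # accumulate predicted deltas
        traj.append(z.copy())
    return np.stack(traj)                # (T + 1, N, D)

# Toy: each "push" action shifts every object latent slightly along x.
delta_model = lambda z, a: np.tile([0.05 * a, 0.0], (len(z), 1))
print(rollout_delta(np.zeros((3, 2)), [1, 1, -1], delta_model)[-1])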

[32] (Generative Factor Chaining) represented multi-step manipulation as a spatial-temporal factor graph where nodes are objects/robots and factors are modular diffusion models, composing novel skill sequences via bi-directional message passing at inference.

Assessment: Graph-based dynamics models provide the strongest physics-aware generalization but require clean object segmentation and assume relatively simple interaction physics. They are most powerful when combined with symbolic planning (as in Points2Plans) rather than used as standalone world models.

5. State of the Art

What Works Best Today

The evidence from 2025–2026 converges on a clear picture:

For compositional long-horizon manipulation, scene-graph-conditioned policies with explicit state-transition reasoning dominate. [12] (GSR) achieves 82.4% on LIBERO long-horizon tasks (+14.4% over Gemini-2.5-Pro). [35] (Compose by Focus) achieves 0.78–0.97 on skill composition where baselines score 0.0–0.52. The key factor is that graph structure filters out irrelevant variation, enabling skills trained independently to compose in novel contexts.

For cross-object/cross-category generalization, affordance-based representations lead. [45] (General Flow) achieves 81% across 18 tasks via 3D flow from human video. [17] (Robo-ABC) achieves 85.7% real-world grasping across 7 unseen categories. Foundation model features (DIFT, SD-DINOv2) have made zero-shot affordance transfer practical.

For data-efficient skill learning, object-centric 3D representations with structured action spaces are most effective. [47] (Transporter) achieves 100% from 1 demo. [4] (DreMa) matches 20-demo performance from 5 demos via compositional data augmentation. [6] (SAGA) trains on 2,410 demos — orders of magnitude fewer than VLAs.

For zero-shot deployment without training, agentic systems with structured graph representations show the most promise. [25] (UniManip) achieves 93.75% on zero-shot pick-and-place, outperforming VLA baselines by 22.5% while using a single wrist-mounted camera (vs. dual cameras for baselines).

Consensus

  1. End-to-end monolithic representations are insufficient for compositional generalization. Every paper that directly compares finds structured representations outperform unstructured ones on novel compositions, longer horizons, or more objects.
  2. Foundation models are the right perception backbone for structured representations. The debate about hand-designed vs. learned features is effectively over — SAM, CLIP, DINOv2, and diffusion features provide the perceptual grounding, while the structure imposed on top of these features determines generalization.
  3. 3D representations outperform 2D for manipulation. Across families — keypoints (kPAM), affordances (General Flow), object-centric (GROOT), scene graphs (ConceptGraphs) — lifting to 3D consistently improves performance by decoupling from viewpoint.

Disagreement

  1. Scene graphs vs. affordances as the primary representation. Scene graphs encode what is there and how things relate; affordances encode how to interact. Both work well, and the strongest systems (SAGA, UniManip) combine both. But whether the graph or the affordance should be the primary organizing principle remains an open design choice.
  2. Frozen vs. end-to-end training of structured representations. [35] uses frozen foundation model features with a learned GNN; [34] (KG-M3PO) trains the GNN end-to-end through RL. Both work; the tradeoff is between leveraging pretrained knowledge (frozen) and aligning representations to task performance (end-to-end).
  3. Classical planning vs. learned planning over structured representations. PDDL-based approaches [38, 10] offer formal guarantees and efficient search. LLM-based reasoning over graphs [12, 25] is more flexible but provides no guarantees. The right choice depends on task complexity and tolerance for failure.

6. Open Problems & Future Directions

The Grounding Gap. The most persistent challenge is connecting structured representations to precise low-level control. Scene graphs and PDDL predicates provide excellent task-level reasoning but abstract away the continuous geometry needed for contact-rich manipulation. Affordances capture contact geometry but lack the relational structure needed for multi-step reasoning. Closing this gap — representations that are simultaneously relational, spatial, and action-grounded — is the field's central open problem. Early attempts like RoboEXP's action-conditioned edges and SAGA's affordance-entity pairs point the way, but neither fully solves it.

Deformable and Contact-Rich Manipulation. Nearly every paper in this survey is evaluated on rigid objects. Deformable objects (cloth, rope, food), contact-rich tasks (insertion, assembly), and force-sensitive manipulation remain underserved by current structured representations. [19] showed that multimodal representations (vision + touch) are critical for contact-rich tasks, and [24] (ViTaMIn) demonstrated scalable visuo-tactile data collection. Integrating tactile structure into scene graphs or affordance representations — e.g., tactile affordances as graph node attributes — is a promising but unexplored direction.

Dynamic and Partially Observable Environments. Most structured representations assume a static, fully observable scene. [43] (DovSG) addressed dynamic updates but still achieves only 35% on long-term tasks. [40] (AdaAfford) showed that active exploration can adapt affordances, but only for single-object interactions. How to maintain and update structured representations in truly dynamic, partially observable environments — where objects move, appear, and disappear — remains open.

Scalable Affordance Vocabularies. Current affordance systems use fixed interaction vocabularies (grasp, place, push, pull). SAGA defines 6 affordance types; PALM predicts 4. For manipulation to generalize to the full diversity of human tasks, we need open-vocabulary or learnable affordance representations — affordance analogues of the open-vocabulary revolution in object recognition. This connects to the neuro-symbolic predicate learning direction ([23]), where new predicates (and hence new affordance types) can be discovered through interaction.

Unified Representations Across Abstraction Levels. The field currently uses different representations at different abstraction levels: scene graphs for task planning, affordances for skill selection, keypoints for motion generation. A truly general manipulation system would benefit from a single structured representation that supports reasoning at all levels — from "which room has the mug?" to "which pixel on the handle should I grasp?" Graph-based representations with hierarchical resolution (building → room → object → part → contact point) are the natural candidate, but no system has demonstrated this full hierarchy for manipulation.

Citation

If you find this survey useful, please cite it as

@misc{structured_representation_manipulation_survey_2026,
  author    = {Hu Tianrun},
  title     = {Structured Representation for Generalizable Manipulation Skill Modeling},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://h-tr.github.io/blog/surveys/structured-representation-manipulation.html}
}
          

References

  1. Ahmetoglu, A., et al. (2024). "Symbolic Manipulation Planning with Discovered Object and Relational Predicates."
  2. Armeni, I., et al. (2019). "3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera." ICCV 2019.
  3. Bahl, S., et al. (2023). "Affordances from Human Videos as a Versatile Representation for Robotics." CVPR 2023.
  4. Barcellona, L., et al. (2024). "Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination." arXiv:2412.14957.
  5. Battaglia, P.W., et al. (2016). "Interaction Networks for Learning about Objects, Relations and Physics." NeurIPS 2016.
  6. Fang, K., et al. (2025). "SAGA: Open-World Mobile Manipulation via Structured Affordance Grounding." arXiv:2512.12842.
  7. Florence, P.R., et al. (2018). "Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation." CoRL 2018.
  8. Gu, Q., et al. (2024). "ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning." ICRA 2024.
  9. Guo, M. & Bürger, M. (2021). "Geometric Task Networks: Learning Efficient and Explainable Skill Coordination for Object Manipulation."
  10. Han, M., et al. (2024). "InterPreT: Interactive Predicate Learning from Language Feedback for Generalizable Task Planning." RSS 2024.
  11. Honerkamp, D., et al. (2024). "Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation." RA-L 2024.
  12. Hu, K., et al. (2026). "GSR: Learning Structured Reasoning for Embodied Manipulation." arXiv:2602.01693.
  13. Huang, Y., et al. (2024). "Points2Plans: From Point Clouds to Long-Horizon Plans with Composable Relational Dynamics." ICRA 2025.
  14. Jeong, Y., et al. (2025). "Object-Centric World Model for Language-Guided Manipulation."
  15. Jiang, H., et al. (2024). "RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation."
  16. Jiao, Z., et al. (2022). "Sequential Manipulation Planning on Scene Graph." IROS 2022.
  17. Ju, Y., et al. (2024). "Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation."
  18. Konidaris, G., et al. (2018). "From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning." JAIR 2018.
  19. Lee, M.A., et al. (2018). "Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks." ICRA 2019.
  20. Li, R., et al. (2020). "Towards Practical Multi-Object Manipulation using Relational Reinforcement Learning."
  21. Li, S., et al. (2025). "Graph-Fused Vision-Language-Action for Policy Reasoning in Multi-Arm Robotic Manipulation."
  22. Li, Y., et al. (2017). "Scene Graph Generation from Objects, Phrases and Region Captions." ICCV 2017.
  23. Liang, Y., et al. (2025). "VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning." ICLR 2025.
  24. Liu, F., et al. (2025). "ViTaMIn: Learning Contact-Rich Tasks Through Robot-Free Visuo-Tactile Manipulation Interface."
  25. Liu, H., et al. (2026). "UniManip: General-Purpose Zero-Shot Robotic Manipulation with Agentic Operational Graph."
  26. Liu, J., et al. (2023). "SoftGPT: Learn Goal-oriented Soft Object Manipulation Skills by Generative Pre-trained Heterogeneous Graph Transformer."
  27. Liu, W., et al. (2021). "StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects."
  28. Liu, W., et al. (2022). "StructDiffusion: Language-Guided Creation of Physically-Valid Structures using Unseen Objects." RSS 2023.
  29. Liu, Y., et al. (2026). "PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation."
  30. Mandikal, P. & Grauman, K. (2020). "Learning Dexterous Grasping with Object-Centric Visual Affordances." ICRA 2021.
  31. Manuelli, L., et al. (2019). "kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation." ISRR 2019.
  32. Mishra, U.A., et al. (2024). "Generative Factor Chaining: Coordinated Manipulation with Diffusion-based Factor Graph."
  33. Mo, K., et al. (2021). "Where2Act: From Pixels to Actions for Articulated 3D Objects." ICCV 2021.
  34. Narendra, A., et al. (2026). "Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning." ICRA 2026.
  35. Qi, H., et al. (2025). "Compose by Focus: Scene Graph-based Atomic Skills for Compositional Generalization." ICRA 2026.
  36. Sanchez-Gonzalez, A., et al. (2018). "Graph Networks as Learnable Physics Engines for Inference and Control." ICML 2018.
  37. Shridhar, M., et al. (2021). "CLIPort: What and Where Pathways for Robotic Manipulation." CoRL 2021.
  38. Silver, T., et al. (2021). "Learning Symbolic Operators for Task and Motion Planning." IROS 2021.
  39. Tung, H.-Y., et al. (2020). "3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators." CoRL 2020.
  40. Wang, Y., et al. (2022). "AdaAfford: Learning to Adapt Manipulation Affordance for 3D Articulated Objects via Few-shot Interactions." ECCV 2022.
  41. Wu, R., et al. (2022). "VAT-Mart: Learning Visual Action Trajectory Proposals for Manipulating 3D ARTiculated Objects." ICLR 2022.
  42. Wu, S., et al. (2024). "AffordDP: Generalizable Diffusion Policy with Transferable Affordance."
  43. Yan, Z., et al. (2024). "Dynamic Open-Vocabulary 3D Scene Graphs for Long-term Language-Guided Mobile Manipulation." RA-L 2025.
  44. Yang, J., et al. (2018). "Graph R-CNN for Scene Graph Generation." ECCV 2018.
  45. Yuan, C., et al. (2024). "General Flow as Foundation Affordance for Scalable Robot Learning."
  46. Yuan, W., et al. (2021). "SORNet: Spatial Object-Centric Representations for Sequential Manipulation." CoRL 2021.
  47. Zeng, A., et al. (2020). "Transporter Networks: Rearranging the Visual World for Robotic Manipulation." CoRL 2020.
  48. Zhu, Y., et al. (2020). "Hierarchical Planning for Long-Horizon Manipulation with Geometric and Symbolic Scene Graphs." ICRA 2021.
  49. Zhu, Y., et al. (2023). "GROOT: Learning Generalizable Manipulation Policies with Object-Centric 3D Representations." CoRL 2023.