1. Introduction & Problem Definition
Task and Motion Planning (TAMP) is the problem of jointly determining what to do (a sequence of symbolic actions like "pick," "place," "open") and how to do it (collision-free trajectories, feasible grasps, stable placements) to achieve a goal specified in natural language. It is the computational backbone of any robot that must execute multi-step manipulation in unstructured environments — cooking a meal, tidying a room, assembling a kit.
TAMP is hard for two reasons that compound each other. First, the task planning layer faces combinatorial explosion: with \(k\) action schemas over \(n\) objects, each step offers on the order of \(kn\) grounded actions, so the search space grows exponentially in the plan horizon — a horizon-\(H\) plan has roughly \((kn)^H\) candidate sequences (about \(10^{17}\) for \(k=5\), \(n=10\), \(H=10\)). Second, the motion planning layer faces continuous optimization in high-dimensional configuration spaces, where feasibility depends on geometry the task planner cannot see. Classical TAMP systems [7, 27] handle both layers through formal representations (PDDL, logic-geometric programs) solved by complete algorithms — but they require painstaking manual specification of every domain: predicates, action schemas, samplers, and heuristics. This engineering bottleneck has confined TAMP to laboratory demonstrations for decades.
The emergence of Large Language Models (LLMs) and Vision-Language Models (VLMs) offers a tantalizing shortcut. These models encode vast procedural knowledge ("to make coffee, first fill the kettle..."), commonsense spatial reasoning ("forks go left of plates"), and open-vocabulary perception — precisely the kinds of knowledge that TAMP engineers previously had to hand-code. The central question of this survey is: Can foundation models replace, augment, or bypass the manual engineering bottleneck in TAMP, and if so, at what cost to correctness, robustness, and generality?
The answer, as we will argue, is nuanced. LLMs are powerful specification generators but unreliable reasoners. VLMs add spatial grounding but still lack metric precision. The most successful systems use foundation models not as planners but as translators, heuristics, and constraint proposers within architectures that retain classical guarantees. The field is converging on this insight, but getting the interface right — what to delegate to the foundation model, what to keep in the solver — remains the core open problem.
This survey covers work from 2022 to early 2026, spanning the full arc from the first demonstrations of LLM-based robot planning to the current state of the art.
2. Taxonomy of Approaches
We organize the literature into five families, distinguished by what role the foundation model plays in the planning pipeline.
- A. LLMs as Task Planners
  - A1. Open-loop plan generation
  - A2. Closed-loop planning with feedback
  - A3. Code/program generation
- B. LLMs as Translators to Formal Specifications
  - B1. NL → PDDL (for classical planners)
  - B2. NL → Temporal Logic (STL/LTL)
  - B3. NL → Constraints (for TAMP solvers)
- C. Foundation Models Integrated into TAMP Search
  - C1. LLMs as heuristics / warm-starters
  - C2. VLMs as constraint generators / validators
  - C3. LLMs for failure reasoning and recovery
- D. VLMs/VLAs as End-to-End Action Models
  - D1. VLMs fine-tuned for action prediction
  - D2. VLMs for spatial grounding and affordances
- E. Structured World Representations as Interface
  - E1. 3D scene graphs for LLM grounding
  - E2. Explicit state tracking for LLM planners
  - E3. Spatial VLMs for metric reasoning
Family A: LLMs as Task Planners. The earliest and most popular approach. An LLM receives a natural language instruction and generates a sequence of high-level actions, which are then executed by pre-trained low-level controllers. This is task planning only — the motion planning problem is assumed solved by existing primitives. The key design choices are: how to ground the LLM's outputs in the robot's actual capabilities (affordance scoring in SayCan, semantic translation in [11], code generation in [20]), and whether to close the loop with environment feedback.
Family B: LLMs as Translators. Rather than using the LLM as a planner, use it as a translator from natural language to a formal representation (PDDL, STL, constraint code) that a classical solver can optimize. This separates linguistic competence (the LLM's strength) from logical reasoning (the solver's strength). The key challenge is translation fidelity — LLMs produce syntactically or semantically incorrect formal specifications roughly 10-60% of the time, depending on domain complexity [8, 22].
Family C: Foundation Models in TAMP Search. The most tightly integrated approaches. Here, the LLM or VLM participates directly in the TAMP solver's search loop — proposing candidate plans to warm-start tree search [19], generating continuous constraint parameters for trajectory optimization [26], predicting which geometric refinements will fail [30], or reasoning about motion planning failures to guide replanning [29]. These systems retain classical TAMP's completeness properties while using foundation models to accelerate search.
Family D: End-to-End VLAs. The most radical departure from classical TAMP. Vision-Language-Action models (RT-2, PaLM-E, OpenVLA) fine-tune VLMs to directly output robot actions as tokenized text, bypassing both symbolic task planning and explicit motion planning. These systems demonstrate impressive semantic generalization but operate at the single-action level — they do not perform multi-step planning in any meaningful sense.
Family E: Structured World Representations. A cross-cutting concern: how to represent the world so that foundation models can reason about it effectively. This includes 3D scene graphs for scalable environment grounding, explicit JSON state tracking for temporal reasoning, spatial VLMs for metric distance estimation, and 3D voxel value maps composed via code.
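To make the interface concrete, here is a minimal sketch of the kind of structured state such systems serialize into an LLM prompt; the schema and field names below are illustrative, not taken from any particular paper.

```python
import json

# A tiny scene graph with open-vocabulary node labels and symbolic spatial
# relations as edges (illustrative schema, not any specific paper's format).
scene_graph = {
    "nodes": [
        {"id": "mug_1",   "label": "ceramic mug",  "pose_xyz": [0.42, -0.10, 0.76]},
        {"id": "table_1", "label": "dining table", "pose_xyz": [0.50,  0.00, 0.72]},
        {"id": "shelf_1", "label": "wall shelf",   "pose_xyz": [0.90,  0.30, 1.20]},
    ],
    "edges": [
        {"source": "mug_1", "relation": "on_top_of", "target": "table_1"},
    ],
}

def scene_to_prompt(graph: dict) -> str:
    """Serialize compactly so large scenes still fit in the context window."""
    return json.dumps(graph, separators=(",", ":"))

print(scene_to_prompt(scene_graph))
```

The same structure supports all three E-subfamilies: prune the graph hierarchically (E1), diff it across timesteps for state tracking (E2), or attach metric poses for spatial queries (E3).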
3. Timeline & Evolution
Phase 1: Proof of Concept (Early 2022)
Three near-simultaneous papers established that LLMs could be used for robot task planning at all. [11] showed that GPT-3 could decompose "make breakfast" into step-by-step plans, with a semantic translation layer boosting executability from 8% to 79%. SayCan [1] grounded these plans in real-robot affordances via value-function scoring, achieving 74% execution success on a mobile manipulator. Socratic Models [33] demonstrated that multiple foundation models could be composed via language as an intermodal bus.
These papers were exciting but limited. All operated at the task level only — they assumed a library of pre-trained skills handled motion planning. Plans were generated open-loop or with minimal grounding. The motion planning problem was nowhere in sight.
Phase 2: Closing the Loop (Late 2022 - Mid 2023)
The second wave addressed the brittleness of open-loop LLM planning. Inner Monologue [12] injected success detection, scene descriptions, and human corrections as textual feedback, improving real-robot kitchen task success from 30.8% to 60.4% under adversarial disturbances. Code as Policies [20] shifted the output representation from natural language to executable code, enabling spatial-geometric reasoning (98% accuracy on numerical computation tasks) and hierarchical composition.
Concurrently, the critical perspective emerged. [28] rigorously demonstrated that GPT-4 achieves only ~35% correct plans on Blocksworld — dropping to near-zero when action names are obfuscated — strong evidence that LLM "planning" is pattern matching rather than reasoning. This was an inflection point: it forced the community to take seriously the distinction between proposing plans and guaranteeing them.
Phase 3: Bridging Task and Motion (2023 - 2024)
The third wave began integrating LLMs with actual TAMP systems. LLM+P [22] used the LLM as a translator from natural language to PDDL, then solved the planning problem with Fast Downward — achieving 90% on Blocksworld where LLMs alone scored 20%. Text2Motion [21] interleaved LLM task planning with Q-function-based geometric feasibility checking, reaching 82% on multi-step manipulation versus 13% for SayCan/Inner Monologue.
On the VLM side, VoxPoser [13] used LLM-written code to compose 3D voxel value maps from VLM detections, enabling zero-shot manipulation trajectories (88% real-world success). RT-2 [2] demonstrated that VLMs fine-tuned on robot data could transfer web-scale semantic knowledge to low-level control, doubling generalization performance over RT-1.
[15] formalized the emerging consensus as the LLM-Modulo framework: LLMs as approximate generators within a verified loop. This became the dominant architectural pattern.
Phase 4: Tightening the Integration (2024 - Present)
The current phase is characterized by increasingly precise integration of foundation models into the TAMP search loop. VLM-TAMP [31] uses VLMs to generate subgoals (not actions) that TAMP planners refine with full geometric reasoning — achieving 50-100% on 30-50-step kitchen tasks where direct VLM action generation scores 0%. OWL-TAMP [18] has VLMs generate both discrete plan constraints and continuous constraint functions as Python code, achieving 92% on open-world manipulation tasks. STaLM [19] warm-starts Monte Carlo Tree Search with LLM-generated plans, achieving 84-100% where PDDLStream scores 0-56%.
On the negative side, [23] benchmarked 16 LLM-based TAMP variants across 4,950 problems and found that all underperform engineered planners, with the counterintuitive finding that providing geometric information to the LLM increases task-planning errors.
4. Key Papers & Contributions
4.1 Family A: LLMs as Task Planners
SayCan [1] is the foundational paper of the field. Its core insight — multiply LLM semantic scores by learned value functions to score a fixed menu of robot skills — established the "LLM proposes, affordances dispose" paradigm. With PaLM as the LLM, SayCan achieved 84% planning success and 74% execution success across 101 real-world kitchen tasks.
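The scoring rule itself is one line; below is a toy sketch with stand-in callables (in the real system, the log-probability comes from PaLM and the affordance value from learned value functions).

```python
import math

def select_skill(instruction, history, skills, llm_log_prob, affordance_value):
    """SayCan-style scoring: p_LLM(skill | instruction, history) x value(skill)."""
    def combined(skill):
        return math.exp(llm_log_prob(instruction, history, skill)) * affordance_value(skill)
    return max(skills, key=combined)

# Toy stand-ins so the sketch runs; real systems query an LLM and a value net.
skills = ["pick up the sponge", "go to the counter", "put down the sponge"]
toy_llm = {"pick up the sponge": -0.2, "go to the counter": -1.5,
           "put down the sponge": -3.0}           # hypothetical log-probs
toy_value = {"pick up the sponge": 0.9, "go to the counter": 0.8,
             "put down the sponge": 0.1}          # hypothetical affordances

best = select_skill("bring me a sponge", [], skills,
                    llm_log_prob=lambda i, h, s: toy_llm[s],
                    affordance_value=lambda s: toy_value[s])
print(best)  # -> "pick up the sponge"
```

Note the division of labor: the LLM never sees the scene, and the value function never sees the instruction.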
Inner Monologue [12] addressed SayCan's limitations by feeding success detection, scene descriptions, and human corrections back to the LLM as natural language. Real-robot kitchen success improved from 30.8% to 60.4% under adversarial disturbances. The key insight was that language is a universal interface between perception, planning, and action.
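A minimal sketch of the loop, with every component stubbed out (the real system plugs in trained success detectors, scene captioners, and a PaLM-scale LLM):

```python
def closed_loop_execute(instruction, propose_next_step, execute_skill,
                        describe_scene, max_steps=20):
    """Inner-Monologue-style loop: all feedback re-enters the prompt as text."""
    transcript = [f"Human: {instruction}"]
    for _ in range(max_steps):
        step = propose_next_step("\n".join(transcript))  # LLM call in practice
        if step.strip().lower() == "done":
            break
        succeeded = execute_skill(step)                  # low-level skill
        transcript.append(f"Robot: {step}")
        transcript.append(f"Success: {'yes' if succeeded else 'no'}")
        transcript.append(f"Scene: {describe_scene()}")  # VLM caption in practice
    return transcript
```

Because the interface is plain text, any detector that can emit a sentence can participate without retraining anything.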
Code as Policies [20] shifted the output representation from natural language to executable Python code: 98% accuracy on spatial-geometric reasoning tasks (vs. 58% for chain-of-thought), and 97.2% on seen simulation tasks.
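The flavor of the output is easy to convey with a hypothetical generation; the robot API below (put_first_on_second) is a stub standing in for the perception-and-control primitives the real prompts expose.

```python
import numpy as np

def put_first_on_second(obj_name: str, target_xy: np.ndarray) -> None:
    """Stub for a pick-and-place primitive so this sketch runs standalone."""
    print(f"pick {obj_name}, place at {np.round(target_xy, 2)}")

# The kind of function an LLM might generate for "line the blocks up":
def place_blocks_in_a_line(block_names, start_xy, spacing=0.06):
    for i, name in enumerate(block_names):
        target = np.array(start_xy) + np.array([i * spacing, 0.0])
        put_first_on_second(name, target)

place_blocks_in_a_line(["red block", "blue block", "green block"],
                       start_xy=[0.40, 0.00])
```

The loop and the arithmetic are exactly what natural-language plans struggle to express, which is where the 98% vs. 58% gap comes from.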
KnowNo [25] addressed the question of when the robot should ask for help. Using conformal prediction over LLM softmax scores, KnowNo provides distribution-free coverage guarantees — the only paper in the field with formal statistical guarantees on task success rates.
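A minimal sketch of the mechanism — split conformal prediction over multiple-choice scores — with toy calibration data and option scores:

```python
import numpy as np

def calibrate(true_option_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Nonconformity = 1 - score of the correct option on a calibration set."""
    n = len(true_option_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n  # finite-sample correction
    return float(np.quantile(1.0 - true_option_scores, min(q, 1.0),
                             method="higher"))

def prediction_set(option_scores: dict, qhat: float) -> list:
    """Keep every option whose score clears the calibrated threshold."""
    return [o for o, s in option_scores.items() if s >= 1.0 - qhat]

rng = np.random.default_rng(0)
qhat = calibrate(rng.uniform(0.2, 0.9, size=200))  # toy calibration scores
options = {"pick up the metal bowl": 0.55, "pick up the plastic bowl": 0.40}
pred = prediction_set(options, qhat)
if len(pred) != 1:
    print("Uncertain - ask the human:", pred or list(options))
else:
    print("Execute:", pred[0])
```

With probability at least \(1-\alpha\) over calibration draws, the prediction set contains the correct option; that coverage property is what turns "ask for help" into a statement with guarantees.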
4.2 Family B: LLMs as Translators to Formal Specifications
LLM+P [22] is the cleanest articulation of the translator paradigm. Given a natural language problem description, a human-provided PDDL domain file, and one example, GPT-4 generates a problem-specific PDDL file that Fast Downward solves optimally. On Blocksworld, this achieves 90% (vs. 20% for LLM-as-planner).
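The pipeline is short enough to sketch end to end. query_llm is a stand-in for an API call, and the Fast Downward invocation assumes a local installation with fast-downward.py on the PATH (flags and paths may differ on your setup):

```python
import pathlib
import subprocess

def llm_plus_p(nl_task: str, domain_pddl: str, example: str, query_llm) -> str:
    """LLM translates NL -> problem PDDL; a classical planner does the search."""
    prompt = ("Translate the task into a PDDL problem file for this domain.\n"
              f"Domain:\n{domain_pddl}\n"
              f"Example:\n{example}\n"
              f"Task: {nl_task}\nProblem:")
    pathlib.Path("problem.pddl").write_text(query_llm(prompt))
    pathlib.Path("domain.pddl").write_text(domain_pddl)
    subprocess.run(["fast-downward.py", "domain.pddl", "problem.pddl",
                    "--search", "astar(lmcut())"], check=True)
    return pathlib.Path("sas_plan").read_text()  # Fast Downward's plan file
```

The division of labor is strict: the LLM never searches, and Fast Downward never parses English.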
AutoTAMP [4] extends the translator paradigm to Signal Temporal Logic, enabling continuous-time, multi-agent coordination. Its two-stage autoregressive error correction raised success rates from ~42% to ~88% on the hardest domain.
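To give a flavor of the target representation (this example is illustrative, not drawn from the paper): an instruction like "reach the charging dock within 10 seconds while staying inside the safe zone" could translate to \( F_{[0,10]}\,\mathrm{at\_dock} \;\wedge\; G_{[0,10]}\,\mathrm{in\_safe\_zone} \), where \(F\) (eventually) and \(G\) (always) carry explicit time bounds that PDDL cannot express.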
NL2Plan [8] pushed the frontier by generating both PDDL domain and problem files from minimal text descriptions — no expert-provided domain required. Its six-stage pipeline solved 260% more unseen tasks than direct LLM PDDL generation.
4.3 Family C: Foundation Models in TAMP Search
This is the most technically demanding and most promising family.
Text2Motion [21] was the first to interleave LLM task planning with geometric feasibility checking — 82% overall success versus 13% for SayCan/Inner Monologue.
VLM-TAMP [31] identified the right abstraction level: subgoals, not actions. 50-100% on 30-50-step tasks where direct VLM action generation scored 0%.
OWL-TAMP [18] achieved 92% overall success by having VLMs generate both discrete plan constraints and continuous constraint functions as Python code. 19/19 tasks on real dual-arm hardware with zero task-specific modification.
STaLM [19] used the LLM as a one-shot heuristic to warm-start Monte Carlo Tree Search — 84-100% across all problems vs. PDDLStream 0-56%.
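The mechanism is independent of the particular search algorithm. Here is a compact best-first stand-in (not the paper's MCTS): actions that follow the LLM's suggested plan receive a large priority bonus, but every branch remains expandable, so completeness is preserved.

```python
import heapq
import itertools

def warm_started_search(start, goal_test, successors, llm_plan,
                        max_expansions=10_000):
    """Best-first search biased toward an LLM-suggested action sequence.
    States must be hashable; successors(state) yields (action, next_state)."""
    tie = itertools.count()
    frontier = [(0.0, next(tie), start, [])]
    seen = set()
    while frontier and max_expansions:
        max_expansions -= 1
        _, _, state, actions = heapq.heappop(frontier)
        if goal_test(state):
            return actions
        if state in seen:
            continue
        seen.add(state)
        for action, nxt in successors(state):
            on_plan = (len(actions) < len(llm_plan)
                       and action == llm_plan[len(actions)])
            priority = len(actions) + 1 - (100.0 if on_plan else 0.0)
            heapq.heappush(frontier, (priority, next(tie), nxt,
                                      actions + [action]))
    return None

# Toy domain: states are integers, actions add 1 or 2, goal is 5.
plan = warm_started_search(0, lambda s: s == 5,
                           lambda s: [("+1", s + 1), ("+2", s + 2)],
                           llm_plan=["+2", "+2", "+1"])
print(plan)  # the LLM-suggested prefix is expanded first
```

If the LLM's plan is geometrically infeasible, the bonus simply runs out and ordinary search takes over; the suggestion costs almost nothing when wrong.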
MOPS [26] formulated the interface as nested optimization: the LLM selects constraints, CMA-ES refines their continuous parameters, and gradient-based trajectory optimization solves the resulting nonlinear program. 30-40% improvement over prior code-generation approaches.
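A structural sketch of the three nested levels, with toy objectives throughout and Nelder-Mead standing in for the paper's CMA-ES (the candidate constraint sets would come from the LLM in the real system):

```python
import numpy as np
from scipy.optimize import minimize

GOAL = np.array([0.5, 0.2, 0.3])  # toy task goal the trajectory must reach

def inner_trajectory_cost(theta: np.ndarray) -> float:
    """Innermost level: a stand-in trajectory optimization that must satisfy
    the constraint parameterized by theta and still reach GOAL."""
    res = minimize(lambda q: np.sum((q - theta) ** 2) + np.sum((q - GOAL) ** 2),
                   x0=theta)
    return float(res.fun)

def refine(candidate_constraint_sets):
    """Outer level: discrete choice among LLM-proposed candidates.
    Middle level: continuous refinement of each candidate's parameters."""
    best = (None, None, np.inf)
    for i, theta0 in enumerate(candidate_constraint_sets):
        res = minimize(inner_trajectory_cost, theta0, method="Nelder-Mead")
        if res.fun < best[2]:
            best = (i, res.x, float(res.fun))
    return best

idx, theta, cost = refine([np.array([0.3, 0.0, 0.1]),
                           np.array([0.0, 0.4, 0.2])])
print(idx, np.round(theta, 3), round(cost, 4))
```

The point of the nesting is that each level handles the variable type it is good at: discrete structure, continuous parameters, and smooth trajectories, respectively.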
VIZ-COAST [30] introduced proactive constraint generation: VLM predicts which symbolic plans will fail at the geometric level before planning begins. 100% success where baselines scored 0-7.5%.
4.4 Family D: End-to-End VLAs
RT-2 [2] — the seminal VLA paper. Generalization to unseen objects improved ~2x over RT-1. Over 6,000 real-world trials.
PaLM-E [6] — 562B parameters, more than doubling SayCan's success (82.5% vs. 38.7%) while retaining 96% of base language capability.
OpenVLA [16] — open-source 7B model outperforming 55B RT-2-X by 16.5%. Single-GPU adaptation practical.
4.5 Family E: Structured World Representations
SayPlan [24] — hierarchical 3D scene graphs, token usage dropped 60-82%, 86.6% executability on long-horizon tasks.
ConceptGraphs [9] — open-vocabulary 3D scene graphs from RGB-D, ~70% node label accuracy, ~90% edge accuracy.
VoxPoser [13] — 3D voxel value maps via LLM-written code, 88% real-world success, zero-shot trajectory priors accelerating dynamics learning from >12 hours to <3 minutes.
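A minimal sketch of the value-map idea with toy geometry (grid size, Gaussian widths, and object locations are all illustrative): an attraction bump at the detected target, a repulsion bump at an obstacle, and greedy ascent to extract waypoints.

```python
import numpy as np

def gaussian_bump(shape, center, sigma):
    """Unit-height Gaussian over a voxel grid, centered at a voxel index."""
    idx = np.indices(shape).astype(float)
    d2 = sum((idx[i] - center[i]) ** 2 for i in range(3))
    return np.exp(-d2 / (2.0 * sigma ** 2))

shape = (40, 40, 40)
value = gaussian_bump(shape, (30, 20, 10), sigma=12.0)        # attract: target
value -= 2.0 * gaussian_bump(shape, (18, 24, 10), sigma=3.0)  # avoid: obstacle

# Greedy waypoint extraction: repeatedly step to the best-valued neighbor.
pos = np.array([5, 20, 10])
path = [tuple(pos)]
offsets = np.array([(dx, dy, dz) for dx in (-1, 0, 1)
                    for dy in (-1, 0, 1) for dz in (-1, 0, 1)])
for _ in range(120):
    candidates = np.clip(pos + offsets, 0, np.array(shape) - 1)
    best = candidates[np.argmax(value[tuple(candidates.T)])]
    if np.array_equal(best, pos):  # local maximum: stop
        break
    pos = best
    path.append(tuple(pos))
print(f"{len(path)} waypoints, ending at voxel {path[-1]}")
```

In VoxPoser the maps are composed by LLM-written code from open-vocabulary detections rather than hand-placed Gaussians, but the downstream use — optimizing trajectories over the composed map — is the same.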
5. State of the Art
What Works Best Today
- For pure task planning: Code as Policies and its descendants; KnowNo for calibrated uncertainty.
- For long-horizon TAMP: VLM-as-subgoal-generator plus classical TAMP [31, 18]. OWL-TAMP's 92% is the current high-water mark.
- For scalable environment grounding: SayPlan + ConceptGraphs.
- For low-level visuomotor control: VLAs (OpenVLA) as skill primitives.
Consensus
- LLMs cannot plan autonomously. Established by multiple independent evaluations [28, 23].
- External verification is necessary. The LLM-Modulo pattern is dominant.
- Code is better than natural language as output representation.
- The LLM should not see raw geometry. Geometric details increase task-planning errors.
Disagreement
- Modular vs. end-to-end: whether scaled-up VLAs (Family D) will eventually subsume modular pipelines, or whether explicit solvers remain necessary for long-horizon correctness.
- How much PDDL is needed: LLM+P assumes an expert-written domain file, while NL2Plan generates domains from minimal text; how much formal scaffolding open-world tasks truly require is unresolved.
- VLMs as spatial reasoners: current VLMs handle qualitative relations ("left of") far better than metric ones, and whether specialized training closes that gap is contested.
6. Open Problems & Future Directions
The Interface Problem. The central unsolved problem: designing the right interface between foundation models and TAMP solvers. A theoretical framework for what information the foundation model should provide, at what granularity, to which solver component would be enormously valuable.
Continuous and Contact-Rich Manipulation. Nearly all work evaluates on pick-and-place. Contact-rich manipulation — tool use, deformable objects, dexterous manipulation — remains almost entirely unaddressed.
Partial Observability and Dynamic Environments. Only CoCo-TAMP [17] addresses partial observability. Real-world TAMP requires integrating foundation model priors with probabilistic state estimation.
Real-Time Replanning and Reactivity. Foundation model query latency (0.5-5 s) makes real-time replanning infeasible at manipulation control rates (10-50 Hz): a control cycle lasts 20-100 ms, so a single query overruns it by one to two orders of magnitude. Closing the gap will require either small, fast on-robot models or pre-computed constraint libraries.
Scaling Beyond Tabletop Pick-and-Place. Most papers evaluate on tabletop with 3-10 objects. Benchmarks that stress-test at scale are urgently needed.
Connection to Diffusion-Based Motion Generation. Diffusion Policy and StructDiffusion are natural complements: the language model reasons about what to do, while the diffusion model generates how to move.
Citation
If you find this survey useful, please cite it as:

```bibtex
@misc{llm_vlm_tamp_survey_2026,
  author    = {Hu Tianrun},
  title     = {LLM and VLM for Task and Motion Planning},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://h-tr.github.io/blog/surveys/llm-vlm-tamp.html}
}
```
References
1. Ahn, M., et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." arXiv:2204.01691.
2. Brohan, A., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15818.
3. Chen, B., et al. (2024). "SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities." CVPR 2024.
4. Chen, Y., et al. (2023). "AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers." arXiv:2306.06531.
5. Chi, C., et al. (2023). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." RSS 2023.
6. Driess, D., et al. (2023). "PaLM-E: An Embodied Multimodal Language Model." ICML 2023.
7. Garrett, C. R., et al. (2020). "PDDLStream: Integrating Symbolic Planners and Blackbox Samplers via Optimistic Adaptive Planning." ICAPS 2020.
8. Gestrin, E., et al. (2024). "NL2Plan: Robust LLM-Driven Planning from Minimal Text Descriptions." ICAPS 2024.
9. Gu, Q., et al. (2024). "ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning." ICRA 2024.
10. Guo, W., et al. (2024). "CaStL: Constraints as Specifications through LLM Translation." arXiv:2410.22225.
11. Huang, W., et al. (2022). "Language Models as Zero-Shot Planners." ICML 2022.
12. Huang, W., et al. (2022). "Inner Monologue: Embodied Reasoning through Planning with Language Models." CoRL 2022.
13. Huang, W., et al. (2023). "VoxPoser: Composable 3D Value Maps for Robotic Manipulation." CoRL 2023.
14. Huang, W., et al. (2024). "ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints." RSS 2024.
15. Kambhampati, S., et al. (2024). "LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks." ICML 2024.
16. Kim, M. J., et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv:2406.09246.
17. Kim, Y., et al. (2026). "CoCo-TAMP: LLM-Guided State Estimation for Partially Observable TAMP." arXiv:2603.03704.
18. Kumar, N., et al. (2024). "OWL-TAMP: Open-World TAMP via VLM Generated Constraints." IEEE RA-L 2025.
19. Lee, D., et al. (2025). "Prime the Search: Using LLMs for Guiding Geometric TAMP." IJRR 2025.
20. Liang, J., et al. (2022). "Code as Policies: Language Model Programs for Embodied Control." arXiv:2209.07753.
21. Lin, K., et al. (2023). "Text2Motion: From Natural Language Instructions to Feasible Plans." Autonomous Robots 2023.
22. Liu, B., et al. (2023). "LLM+P: Empowering LLMs with Optimal Planning Proficiency." arXiv:2304.11477.
23. Mendez-Mendez, J. (2025). "A Systematic Study of LLMs for TAMP With PDDLStream." arXiv:2510.00182.
24. Rana, K., et al. (2023). "SayPlan: Grounding LLMs using 3D Scene Graphs." CoRL 2023.
25. Ren, A. Z., et al. (2023). "KnowNo: Robots That Ask For Help." CoRL 2023.
26. Shcherba, D., et al. (2025). "MOPS: Meta-Optimization and Program Search using Language Models for TAMP." CoRL 2025.
27. Toussaint, M. (2015). "Logic-Geometric Programming: An Optimization-Based Approach to Combined Task and Motion Planning." IJCAI 2015.
28. Valmeekam, K., et al. (2023). "On the Planning Abilities of Large Language Models." NeurIPS 2023.
29. Wang, Y., et al. (2024). "LLM3: Large Language Model-based Task and Motion Planning with Motion Failure Reasoning." IEEE RA-L 2024.
30. Yan, M., et al. (2025). "Using VLM Reasoning to Constrain TAMP." arXiv.
31. Yang, C. R., et al. (2024). "Guiding Long-Horizon TAMP with Vision Language Models." ICRA 2025.
32. Yoneda, T., et al. (2023). "Statler: State-Maintaining Language Models for Embodied Reasoning." ICRA 2024.
33. Zeng, A., et al. (2022). "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language." arXiv:2204.00598.