
Fast Whole-Body Planners and Upstream Architecture

1. Introduction

Whole-body motion planning for high-degree-of-freedom systems has long occupied a peculiar position in the robotics stack. It is indispensable, yet until recently it has been prohibitively slow for interactive or reactive use. A bimanual humanoid with 24 to 40 actuated joints has traditionally required seconds to minutes to produce a single collision-free trajectory [56, 39, 77]. That latency propagated upward through the entire system, forcing task planners to commit to coarse action sequences with minimal geometric feedback [27, 28], constraining base-placement selection to heuristic reachability lookups [94, 86], and effectively decoupling planning from real-time perception [42]. The resulting stack was engineered around the assumption that motion planning is expensive. A convergence of GPU parallelism, tighter optimization formulations, vectorized collision checking, and learned heuristics is now dissolving that assumption [84, 12, 3].

The central thesis of this survey is that the reduction of whole-body planning latency from seconds to milliseconds is not merely a quantitative speedup. It is a qualitative architectural shift. When a 24 DoF whole-body query completes in 1 to 50 milliseconds on commodity hardware [84, 3, 91], planning ceases to be a batch preprocessing step and becomes a real-time primitive. It can be invoked as a feasibility oracle across hundreds of candidate action bindings by an integrated task-and-motion planner [27, 28]. It can score dense grids of base poses during mobile-base placement [14, 93, 54, 46]. It can generate millions of demonstrations that train learned samplers and diffusion policies [38, 72, 4, 10]. And it can be rolled into the inner loop of model-predictive controllers that replan at 50 to 500 Hz as perception updates arrive [12, 3, 81, 20, 48, 60]. Each of these capabilities was architecturally infeasible under the old cost regime, and each demands a reorganization of the interfaces between stack components.

This scoping review maps the conceptual landscape of 2018 to 2026 at the intersection of whole-body planning, bimanual manipulation, and task-and-motion planning. Section 2 fixes definitions and draws the boundary of coverage. Section 3 traces the algorithmic and computational advances that have brought whole-body planning into the millisecond regime. Sections 4 through 7 examine the four architectural consequences the speed shift enables, namely finer-grained task decomposition (Section 4), data-driven and neural sampling (Section 5), base-placement optimization and IK pre-configuration (Section 6), and tighter planning-perception-action feedback loops (Section 7). Section 8 identifies cross-cutting patterns and tensions. Section 9 enumerates open problems. Section 10 concludes.

The single most important takeaway is that planning speed is an architectural parameter, not merely a performance metric. When whole-body planning migrates from the periphery (invoked once, executed open-loop) to the center of the stack (a feasibility oracle for task planners, a training-data generator for learned components, a screening primitive for base placement, and the inner loop of reactive controllers), the interfaces between task, geometry, perception, and control must be redrawn. Most of the literature surveyed develops one architectural pattern at a time. A unified bimanual pipeline that exercises all of them concurrently at millisecond rates does not yet exist.

2. Background and Definitions

Whole-body motion planning is the computation of collision-free joint-space trajectories for the full kinematic chain of a robot, as distinct from planning for an isolated manipulator arm. For a bimanual humanoid this typically involves 24 to 40 DoF spanning legs, a multi-link torso, two 7 DoF arms, and a head assembly [29, 17, 16]. The combinatorial explosion of configuration space with dimensionality makes whole-body planning qualitatively harder than 7 DoF arm planning and requires fundamentally different algorithmic strategies [56, 77]. For mobile manipulators the base contributes further structure, in particular a non-holonomic constraint that distinguishes the base subgroup from the Euclidean arm.

Bimanual manipulation denotes tasks that require coordinated motion of two manipulator arms. Such tasks frequently carry constraints on the relative pose between end-effectors, for example when two hands grasp the same object, or sequential constraints such as hand-to-hand handoffs. The kinematic coupling through a shared torso distinguishes bimanual humanoid planning from dual-arm setups on independent bases. Analytic IK along the bimanual grasp constraint reparametrizes the equality manifold from a measure-zero set to a lower-dimensional positive-measure space, which resolves the degeneracy that naive sampling encounters under rigid relative-pose constraints [21, 73, 75].

Task and motion planning (TAMP) is the combined problem of finding a symbolic action sequence (which objects to grasp, in what order, with which arm) and continuous motion trajectories that realize each action [28]. The interface between task and motion levels, specifically how often and how cheaply the task planner can query the motion planner for feasibility, is the primary architectural joint whose redesign this survey documents [27, 24, 87].

We define millisecond-scale planning as achieving median query times below 100 ms for problems with 14 DoF or more (bimanual) and below 200 ms for 24 DoF or more (whole body), including collision checking against non-trivial environments. Real-time whole-body MPC at 50 to 500 Hz on legged manipulators [3, 81, 20, 60, 48] and 1 to 2 ms vectorized CPU planners on bimanual humanoid platforms [91] sit inside this envelope. The threshold is significant because it places planning within the update rates of typical perception systems, enabling their coupling in closed loops [30].

This review includes sampling-based, optimization-based, hybrid, and learning-accelerated whole-body planners, as well as base-placement optimizers, TAMP systems, and perception-driven replanners that use them. It excludes reinforcement-learning policies that replace rather than augment structured planning, single-arm manipulation in isolation, locomotion-only planning without manipulation, and grasp synthesis that does not interface with whole-body planning. Learned components (neural samplers, diffusion planners, reachability surrogates) are included insofar as they interact with or are trained by structured whole-body planners. Following the self-review of the source material, we acknowledge upfront that the evidence base is heaviest on the planner side and thinnest on bimanual-specific integration, and the upstream-reshaping argument is in part prospective.

3. Algorithmic Foundations of Millisecond-Scale Whole-Body Planning

3.1 Trajectory Optimization, GPU Parallelism, and Vectorized Collision Checking

The single most impactful driver of millisecond-scale whole-body planning has been the migration of trajectory and sampling computations to massively parallel hardware, coupled with differentiable or vectorized collision-checking primitives. Classical optimization-based planners established the template. CHOMP [66] introduced covariant gradient optimization with signed-distance collision costs. STOMP [36] introduced derivative-free stochastic rollouts that decouple cost expressiveness from differentiability, allowing hard constraints and torques to enter the objective directly. TrajOpt formalized sequential convex optimization with convex-hull signed-distance penalties, enabling planning from arbitrary and even colliding straight-line initializations across 7 to 34 DoF systems [68, 67]. Gaussian process motion planning [52, 51] reinterpreted the trajectory as a continuous random process whose smoothness prior is encoded by a linear time-varying SDE, compressing the optimization to a small number of support states while evaluating collisions along the full arc.
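To make the template concrete, the following toy sketch (NumPy, with an illustrative circular-obstacle SDF; none of these names come from CHOMP's or TrajOpt's implementations) runs gradient descent on a waypoint trajectory with a smoothness term and a hinge signed-distance collision cost, starting from a straight-line initialization:

```python
import numpy as np

def sdf(points, center=np.array([0.5, 0.5]), radius=0.2):
    """Signed distance to a single circular obstacle (toy SDF)."""
    return np.linalg.norm(points - center, axis=-1) - radius

def obstacle_cost_grad(points, eps=0.1, h=1e-5):
    """Hinge cost max(0, eps - sdf) and its gradient via finite differences."""
    grad = np.zeros_like(points)
    for d in range(points.shape[1]):
        dp = np.zeros(points.shape[1]); dp[d] = h
        c_plus = np.maximum(0.0, eps - sdf(points + dp))
        c_minus = np.maximum(0.0, eps - sdf(points - dp))
        grad[:, d] = (c_plus - c_minus) / (2 * h)
    return grad

def optimize_trajectory(start, goal, n=32, iters=200, lr=0.05, w_obs=1.0):
    # straight-line initialization between fixed endpoints
    traj = np.linspace(start, goal, n)
    for _ in range(iters):
        # smoothness gradient: discrete Laplacian (second finite difference)
        smooth_grad = np.zeros_like(traj)
        smooth_grad[1:-1] = 2 * traj[1:-1] - traj[:-2] - traj[2:]
        grad = smooth_grad + w_obs * obstacle_cost_grad(traj)
        grad[0] = grad[-1] = 0.0          # endpoints stay fixed
        traj -= lr * grad
    return traj

traj = optimize_trajectory(np.array([0.0, 0.0]), np.array([1.0, 1.0]))
```

Real systems replace the finite-difference hinge gradient with precomputed signed-distance fields and batch the whole update on the GPU; the structure of the iteration is unchanged.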

The GPU-parallel generation converts these building blocks into batch operations. cuRobo [84] formulates motion generation as parallel L-BFGS trajectory optimization with collision costs computed against voxelized world representations, achieving median times of 30 to 50 ms for 7 DoF arms and roughly 2x overhead for 14 DoF bimanual configurations. STORM [12] adapts model-predictive path integral control to manipulation by running thousands of stochastic rollouts at each timestep, reaching reactive rates of 50 to 100 Hz while respecting collision and joint-limit constraints. The same principle scales to whole-body legged systems. Alvarez-Padilla et al. [3] use MuJoCo as the constraint-checking backbone for real-time sampling-based MPC on high-DoF legged robots, so that collision avoidance and physical feasibility emerge implicitly from the simulator rather than being enforced analytically.
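The stochastic-rollout principle behind STORM-style MPC can be sketched in a few lines. This is a generic MPPI update on a toy velocity-controlled point; the dynamics, cost, and parameter values are illustrative stand-ins, not STORM's:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_cost(controls, x0, goal, dt=0.05):
    """Roll out a velocity-controlled point and accumulate tracking cost."""
    x = x0.copy()
    cost = 0.0
    for u in controls:
        x = x + dt * u
        cost += np.sum((x - goal) ** 2)
    return cost

def mppi_step(u_nominal, x0, goal, n_samples=256, sigma=0.5, lam=1.0):
    """One MPPI update: perturb, roll out, exponentially reweight."""
    noise = rng.normal(0.0, sigma, size=(n_samples,) + u_nominal.shape)
    costs = np.array([rollout_cost(u_nominal + eps, x0, goal)
                      for eps in noise])
    weights = np.exp(-(costs - costs.min()) / lam)
    weights /= weights.sum()
    # information-theoretic update: noise-weighted correction of the nominal plan
    return u_nominal + np.tensordot(weights, noise, axes=1)

x0, goal = np.zeros(2), np.array([1.0, 1.0])
u = np.zeros((20, 2))                     # horizon of 20 velocity commands
for _ in range(30):                       # outer loop, one mppi_step per control tick
    u = mppi_step(u, x0, goal)
```

On a GPU, the per-sample rollout loop becomes a single batched kernel over thousands of trajectories, which is what pushes the update into the 50 to 100 Hz regime.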

A complementary line of work converts collision checking itself into a vectorized primitive on commodity CPUs, side-stepping GPU deployment constraints. Vectorized sphere-mesh distance computations run across entire trajectories in a single SIMD pass, reducing the dominant cost of sampling-based planners by one to two orders of magnitude [91]. In practice this places RRT-Connect [44] on 24 DoF bimanual humanoid platforms at roughly 1 ms median query time. The consistency between CPU-vectorized and GPU-parallel numbers suggests the millisecond regime does not require exotic hardware. It requires that collision checking be rephrased as a batch operation.
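The batching idea is simple to illustrate: approximate the robot by collision spheres and evaluate every waypoint of a trajectory against every obstacle in one broadcasted pass. This NumPy sketch uses made-up shapes and radii; production planners run SIMD kernels over mesh-derived primitives:

```python
import numpy as np

def batch_sphere_clearance(sphere_centers, sphere_radii, obs_centers, obs_radii):
    """
    Minimum clearance for every configuration in a trajectory at once.
    sphere_centers: (T, S, 3) robot collision spheres for T waypoints
    obs_centers:    (O, 3)    spherical obstacle proxies
    Returns (T,) minimum signed clearance per waypoint; negative means collision.
    """
    # pairwise distances via broadcasting: (T, S, O)
    d = np.linalg.norm(sphere_centers[:, :, None, :] - obs_centers[None, None, :, :],
                       axis=-1)
    clearance = d - sphere_radii[None, :, None] - obs_radii[None, None, :]
    return clearance.min(axis=(1, 2))

T, S = 64, 10
traj_spheres = np.random.default_rng(1).uniform(-1, 1, size=(T, S, 3))
robot_radii = np.full(S, 0.05)
obstacles = np.array([[0.0, 0.0, 0.0]])
obs_radii = np.array([0.3])

free = batch_sphere_clearance(traj_spheres, robot_radii, obstacles, obs_radii) > 0.0
# free[t] is True exactly when waypoint t is collision-free
```

The whole trajectory is validated in one vectorized call rather than one waypoint at a time, which is the rephrasing-as-batch-operation the paragraph above describes.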

Centroidal dynamics [17, 65, 88] provides a physically consistent yet tractable intermediate representation for whole-body trajectory optimization. Planners optimize joint trajectories subject to centroidal dynamics and a consistency constraint that requires centroidal quantities derived from joint trajectories to match those from the centroidal model [17, 49]. This avoids the intractability of full rigid-body dynamics while retaining the expressiveness needed for collision avoidance, reachability, and automatic contact-sequence discovery via complementarity conditions [64, 89, 35]. The biconvex algebraic structure of legged-robot dynamics (centroidal and whole-body kinematics forming two coupled subproblems each convex when the other is fixed) is exploited by BiConMP via alternating minimization to make nonlinear whole-body MPC real-time tractable [49].

Differential dynamic programming and its hybrid variants round out the optimization toolkit. HS-DDP [45] augments DDP with impact-aware reset-map steps, augmented-Lagrangian switching constraints, and switching-time optimization, enabling whole-body trajectory optimization that models contact transitions and optimizes contact timing without centroidal approximations. Crocoddyl [50] provides the infrastructure of constrained DDP with analytical derivatives from the Pinocchio rigid-body library [16], and has become a shared substrate for whole-body MPC work [81, 20, 48, 60]. Warm-start strategies for bipedal whole-body MPC [48] bring the full-body problem to real-time rates by exploiting the structure of repeated predictions across cycles, and Molnar et al. [60] demonstrate whole-body inverse-dynamics MPC that directly optimizes joint torques for loco-manipulation while maintaining stability.

3.2 Graph-Search, Sampling, and Convex Decomposition Approaches

Optimization-based methods dominate the current landscape of fast whole-body planning, but sampling-based and graph-search methods continue to contribute complementary strengths, particularly for global exploration and narrow constraint passages. Three algorithmic families (graph search, sampling-based methods, and trajectory optimization), each with distinct speed, completeness, and optimality trade-offs, structure the design space [70, 16]. Real-time operation in dynamic environments is the primary design pressure that dictates which family to pick for a given high-DoF problem.

Probabilistic roadmaps and rapidly exploring random trees remain the foundational sampling-based techniques [37, 44]. RRT-Connect [44] pairs two trees rooted at start and goal to accelerate convergence and retains near-linear complexity in many practical high-DoF settings. Constrained sampling on task-space manifolds was refined by Stilman [82] and by the articulated-object RRT of Burget et al. [15], where sampled object configurations parametrize required hand poses via IK, restricting the search to end-effector trajectories consistent with the object's constrained motion. Chance-constrained motion planning (p-Chekov) [22] propagates LQG state distributions along the trajectory, quadrature-estimates per-waypoint collision risk, and allocates risk across waypoints inside a roadmap-plus-trajectory-optimization framework, providing formal probabilistic collision guarantees for high-DoF problems without environmental convexification.
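A minimal RRT-Connect sketch shows the bidirectional structure. The toy 2D disc obstacle and step sizes are illustrative, and path extraction is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)

def collision_free(a, b, n_checks=10):
    """Toy validity check: segment must avoid a disc obstacle at the origin."""
    pts = np.linspace(a, b, n_checks)
    return np.all(np.linalg.norm(pts, axis=1) > 0.25)

def extend(tree, q_target, step=0.1):
    """Grow tree toward q_target from its nearest node; return the new node or None."""
    nodes = np.array([q for q, _ in tree])
    i = int(np.argmin(np.linalg.norm(nodes - q_target, axis=1)))
    q_near = nodes[i]
    d = q_target - q_near
    q_new = q_target if np.linalg.norm(d) < step else q_near + step * d / np.linalg.norm(d)
    if collision_free(q_near, q_new):
        tree.append((q_new, i))            # store parent index for path recovery
        return q_new
    return None

def rrt_connect(start, goal, iters=2000):
    ta, tb = [(start, -1)], [(goal, -1)]
    for _ in range(iters):
        q_rand = rng.uniform(-1, 1, size=2)
        q_new = extend(ta, q_rand)
        if q_new is not None and extend(tb, q_new) is not None:
            if np.linalg.norm(tb[-1][0] - q_new) < 1e-9:
                return True                # trees met at q_new: a path exists
        ta, tb = tb, ta                    # swap tree roles each iteration
    return False

found = rrt_connect(np.array([-0.8, -0.8]), np.array([0.8, 0.8]))
```

The vectorized collision checker of Section 3.1 slots in exactly where `collision_free` sits, which is why batching that single primitive accelerates the whole algorithm.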

A mathematically distinct thread recasts motion planning as optimization over graphs of convex sets. By decomposing free configuration space into convex regions and formulating path-finding as a mixed-integer convex program, the GCS family achieves globally optimal paths with certificates of correctness that no sampling-based or local-optimization method can provide. Convex kinematic relaxations extend this to humanoid multi-contact locomotion [31], and analytic IK along bimanual constraint manifolds produces convex inner-approximations compatible with the same solver infrastructure [21]. Precomputed reachability maps [94, 63] amortize cost in another way, and can be dynamically relocated across graph-search transitions so that the same map remains valid across changing object poses [62].

3.3 Hybrid Architectures, Contact Reasoning, and Formal Guarantees

Purely monolithic approaches to whole-body planning face fundamental scalability challenges. The most effective millisecond-scale whole-body planners therefore exploit some form of hierarchical decomposition, either along the kinematic tree of the humanoid or along the temporal axis of contact events [29, 57, 9, 31].

Multi-contact whole-body planning exemplifies this hierarchical style. Reachability-based decompositions first identify feasible contact transitions using precomputed reachability volumes, then solve for whole-body configurations that realize each contact, then interpolate smooth trajectories between configurations [9, 62]. Small-space controllability of dynamic walking supports a decoupled two-phase scheme. A randomized constrained planner first finds collision-free statically balanced paths, and controllability guarantees convert them to dynamically balanced trajectories, avoiding the joint problem of simultaneous collision avoidance and dynamic feasibility [23]. Perceptive locomotion frameworks select optimal convex regions on terrain for footsteps inside a whole-body MPC that jointly handles state estimation error and perception noise [20], and recent loco-manipulation work folds inverse-dynamics MPC into the same formulation so that torque limits, friction cones, and object-contact forces all appear in one nonlinear program [60, 48].

Contact-implicit trajectory optimization [64, 50, 89] encodes contact forces as decision variables alongside joint trajectories, enabling the optimizer to discover contact-exploitation strategies (pushing against a surface for stability, using a wall as fixture) that a pure collision-avoidance planner cannot see. The complementarity constraints that encode contact physics were historically slow to solve, but smoothed contact models and GPU-accelerated optimization have brought moderate-DoF problems into the sub-second regime, and whole-body contact optimization now scales to the full body surface as a manipulation primitive, with Levé et al. extending contact optimization explicitly to the whole-body contact surface [47].

Orthogonal to optimization speed is the question of formal guarantees. Hierarchical deliberative-reactive architectures combine a sampling-based deliberative planner with a navigation-function or vector-field based reactive layer whose reachability guarantees are certified [90, 92]. When contracts between layers are expressed as provable reachability guarantees rather than specific trajectories, the deliberative planner reasons over guarantee composition rather than trajectory evaluation, shortening search depth while the reactive layer handles real-time perception-action coupling. The reactive layer itself can be extended to unexplored semantic environments by coupling online semantic SLAM and deep object detection into the vector-field world model, with formal guarantees preserved under suitably non-adversarial dynamics [92]. Specialized contact modalities, including pivoting-based manipulation, multi-limbed climbing, wheeled-quadruped manipulation, and slithering loco-manipulation, all fit inside the same algorithmic chassis [97, 89, 11, 33].

Several specialized constructs sit alongside the main families. Actuator-aware planning encodes torque, velocity, power, and battery-voltage limits as configuration-dependent reaction-force bounds inside kino-dynamic optimization, bridging abstract centroidal trajectories and hardware-feasible high-power motions [19]. A five-point-mass inverted-pendulum abstraction enables analytic closed-form whole-body pose generation without iterative optimization, making whole-body motion tractable on resource-constrained position-controlled platforms [26]. IK-Geo [32] decomposes IK for any 6 DoF all-revolute manipulator into six canonical geometric subproblems with closed-form solutions, providing the fastest published general analytic IK baseline. Lie-theoretic state planning [76] produces smooth unified mobile-manipulator trajectories under the geometry of SE(3). TOCALib [83] provides a high-level optimal-control library for bimanual manipulation with obstacle avoidance and real-time interpolation. Gilbert-Johnson-Keerthi distance computation [34] remains the workhorse primitive for convex-object collision queries throughout all of the above. The common theme is that algorithmic craft, not hardware alone, is what moves high-DoF planning into the millisecond regime.

4. Task Decomposition and Subgoal Sequencing Under Cheap Planning

4.1 Planning as a Feasibility Oracle in Integrated TAMP

The integration of symbolic task planning with geometric motion planning has been a central challenge in robotics for more than a decade, and the computational cost of motion planning has been the dominant bottleneck constraining the design of its interface [27, 28, 42]. When each motion-planner call represented seconds of computation, task planners minimized geometric feasibility queries and committed to long action sequences before checking feasibility, discovering infeasible actions only late and triggering expensive backtracking. Incremental Task and Motion Planning [24] introduced a principled alternative in which motion-planning failures are encoded as incremental constraints added to the task-level solver and retracted on backtrack, turning discarded computation into reusable pruning information.

Logic-Geometric Programming [87] takes integration further by embedding continuous trajectory optimization directly inside a logic-based task search. Differentiable dynamic primitives (hitting, throwing, impulse exchange) become first-class TAMP subgoal types whose physical feasibility is verified through gradient-based trajectory optimization rather than geometric or learned predicates, extending combinatorial task-level search beyond contact-stable configurations [88]. When motion-planning queries cost tens of milliseconds, the TAMP architecture shifts from plan-then-verify to verify-while-planning. Concurrent feasibility checking becomes the default. The task planner generates candidate bindings, the motion planner evaluates each one immediately, and infeasible branches are pruned before the task-level search expands them further. This online pruning reduces task-level backtracking by orders of magnitude compared to lazy-evaluation approaches that defer geometric checks.
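The verify-while-planning pattern reduces, in skeleton form, to calling the feasibility oracle the moment a binding is proposed. The sketch below uses a hypothetical reachability table in place of a real millisecond-scale planner; every name here is illustrative:

```python
# Hypothetical stand-in: a real system would invoke a millisecond-scale
# whole-body planner here instead of this toy reachability table.
REACH = {"left": {"shelf", "table"}, "right": {"table", "bin"}}

def motion_feasible(arm, location):
    """Fast feasibility oracle: can `arm` reach `location`? (~ms per call)"""
    return location in REACH[arm]

def plan_skeleton(objects, locations):
    """Verify-while-planning: prune each binding the moment it fails."""
    plan, pruned = [], 0
    for obj, loc in zip(objects, locations):
        for arm in ("left", "right"):
            if motion_feasible(arm, loc):      # checked immediately, not deferred
                plan.append((arm, obj, loc))
                break
        else:
            pruned += 1                         # infeasible branch never expanded
            return None, pruned
    return plan, pruned

plan, pruned = plan_skeleton(["cup", "plate"], ["shelf", "bin"])
```

Under the old cost regime, the `motion_feasible` call would sit outside the loop, invoked once per complete skeleton; cheap queries move it inside, so infeasible branches die before the task search expands them.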

4.2 Fine-Grained Decomposition and Bimanual Coordination

Cheap feasibility checking enables task planners to decompose manipulation objectives into finer-grained subgoal sequences than was previously practical. Coarse decompositions ("pick up the box, carry it, set it down") require few checks but forgo opportunities to optimize intermediate configurations or exploit environmental structure. Fine-grained decompositions ("regrasp at this specific pose, pivot the torso, transfer at the table edge") require many more feasibility evaluations but unlock strategies that coarse decompositions miss entirely [87, 30]. With planning at tens of milliseconds per query, evaluating 1,000 candidate skeletons sits inside a minute, turning bimanual decomposition from heuristic commitment into systematic combinatorial search.

Bimanual decomposition has developed several orthogonal decision criteria that fast planning now makes testable. Arm-coordination mode (bimanual versus unimanual) can be derived from demonstrations by partitioning via semantic scene graph monitoring, with TAMP sequencing transitions between coordination modes [18, 53]. Cross-object spatial accessibility can be used as a bimanual-parallelism predicate in scene-agnostic settings, partitioning subgoal pairs into bimanual or unimanual assignment based on explicit geometric reasoning over affordance interaction points [53]. Learned physical stability prediction from simulation can serve as a subgoal-necessity predicate, activating a bimanual support subgoal only when a collapse-risk model judges it necessary during extraction [61]. Grasp-motion geometric coupling in confined spaces can be treated as a constrained co-optimization problem, since feasible grasp sets depend on reachable approach paths [75, 88, 61]. The common thread is that none of these fine-grained predicates would be testable if motion-planning queries cost seconds apiece.

Locomotion-manipulation unification is an extreme case of fine-grained decomposition. Symbolic actions defined as contact-mode changes (foot and hand contacts in a single combinatorial space) collapse the locomotion-manipulation boundary and enable emergent loco-manipulation behaviors without pre-partitioning the plan into separate phases [13, 41]. Dynamic whole-body sequencing for mobile manipulation under similar contact-centric decompositions brings the same approach to wheeled platforms [54]. Analogous decompositions apply to legged mobile manipulation [11, 48, 55] and to humanoid contact sequencing via graph search over reachability maps [62, 63]. Together these works argue that the traditional separation of locomotion and manipulation is a legacy of planning cost, and that the boundary dissolves when per-query cost drops.

4.3 Language-Grounded Task Planning with Geometric Feasibility

The integration of large language models and vision-language models into robotic task planning has introduced a new consumer of fast motion-planning queries. Language-grounded planners use LLMs to generate candidate action sequences from natural language instructions, but the sequences are grounded in linguistic rather than geometric knowledge and frequently propose physically infeasible actions [1, 41, 79]. Closing the reality gap requires a fast geometric feasibility checker that can evaluate LLM-proposed actions against the robot's actual kinematics and environment.

SayCan [1] scored LLM-proposed actions by their geometric affordance through learned affordance models, and subsequent systems have replaced or augmented those learned affordances with direct planning-based feasibility scores [41, 43]. Sensing behaviors are now first-class subgoal types alongside manipulation actions in LLM-grounded task decomposition, and planners interleave sensing with action subgoals so failure detection becomes a planned checkpoint rather than a monitoring afterthought [41, 74]. A unified VLM sequencer-monitor uses a single pretrained model as both task-level skill selector and continuous real-time skill-completion verifier, with subgoal advancement triggered by VLM perception of completion rather than discrete planned sensing subgoals [74]. VLM-grounded failure detection in RePLan [79] serves as a perception-to-planning feedback channel. When a VLM detects a discrepancy between expected and actual world state mid-execution, it triggers online LLM replanning, enabling recovery from unforeseen obstacles without pre-specified failure modes. Socially aware TAMP extends the decomposition axes still further by incorporating norms and signals as a co-evolving state dimension, with task urgency weighting soft-constraint violations [30a].

The granularity at which LLM proposals can be geometrically validated depends directly on planner speed. Early systems checked feasibility at the level of entire action primitives. With millisecond-scale planning, validation can occur at the level of individual waypoints or continuous trajectory segments. This finer-grained validation detects failure modes such as an arm configuration that reaches the grasp target but collides during approach, which coarse primitive-level checks would miss [25, 41, 74].

5. Data-Driven and Neural Sampling Strategies

5.1 Learned Configuration-Space Samplers and Guiding Spaces

Classical sampling-based planners rely on uniform or quasi-random sampling of configuration space, a strategy that becomes catastrophically inefficient as dimensionality grows [37, 44]. For a 30 DoF humanoid the volume of free configuration space relative to the total space is vanishingly small, and uniform sampling wastes its budget on irrelevant configurations. This motivates learned samplers that concentrate samples in regions likely to yield feasible solutions [38, 4]. The guiding-space framework [4] unifies biased sampling heuristics (including neural and learned distributions) under a single abstraction. Any mechanism that defines a non-uniform sampling distribution over configuration space constitutes a guiding space, giving learned samplers clear theoretical grounding within classical planning. An information-theoretic evaluation of sampling quality (how much the biased distribution reduces uncertainty about useful samples) provides a training signal cheaper than full planning rollouts and model-agnostic across architectures.
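The guiding-space idea can be illustrated with a toy narrow-passage C-space: a biased sampler that concentrates probability mass near the passage (here hand-specified rather than learned) raises the fraction of useful samples by an order of magnitude:

```python
import numpy as np

rng = np.random.default_rng(3)

def in_narrow_passage(q):
    """Toy C-space: free samples must fall in a thin slab (a narrow passage)."""
    return abs(q[1]) < 0.05

def uniform_sampler():
    return rng.uniform(-1, 1, size=2)

def guided_sampler(bias=0.8):
    """Guiding space: with probability `bias`, draw near the known passage region.
    A learned sampler would supply this biased distribution from data."""
    if rng.random() < bias:
        return np.array([rng.uniform(-1, 1), rng.normal(0.0, 0.05)])
    return uniform_sampler()

def hit_rate(sampler, n=5000):
    return sum(in_narrow_passage(sampler()) for _ in range(n)) / n

# the biased distribution concentrates samples where feasible configurations live;
# mixing in a uniform component preserves probabilistic completeness
```

The uniform component is the detail that keeps the classical completeness guarantee: the guided distribution never assigns zero mass to any free region.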

Two complementary decompositions of the learning problem have proved particularly productive. Localization versus generation decouples the role of the learner. A neural model identifies critical bottleneck regions in configuration space (a simpler classification task) while classical local sampling methods generate the actual samples within those regions, preserving probabilistic completeness [40]. Constraint-manifold learning from demonstrations uses VAEs or equivariant conditional manifold networks to learn implicit equality-constraint manifolds without explicit analytical equations, turning constraint satisfaction from a hard geometric requirement into a learned generative model that handles constraints known only through examples [73]. Local feature conditioning improves generalization by binding the learner to local obstacle-configuration interactions rather than global problem structure.

Stein Variational Gradient Descent [98] recasts trajectory sampling as variational inference. Particles (candidate trajectories) are iteratively transported to cover low-cost feasible trajectory space, entropy regularization preserves multi-modal diversity, and the output is a distribution over solutions rather than a single plan. This makes the variational framework a principled bridge between motion planning and learning from demonstrations, since the same planner can be used as a data-generation engine for imitation and reinforcement learning. A learned local control predictor for kinodynamic planning, trained once on obstacle-free dynamics data with dispersion-bounded pruned samples, can be reused as a modular expansion component across arbitrary environments, decoupling dynamics-specific learning from environment-specific planning [40a].

5.2 Diffusion Models and 3D Policy Architectures

The application of denoising diffusion models to motion generation has emerged as one of the most active research fronts since 2022. Unlike CVAE-based samplers that produce unimodal or mixture-of-Gaussians trajectory distributions, diffusion models capture the full diversity of valid motions, including topologically distinct homotopy classes, through iterative denoising [35a]. Diffusion Policy [18] shows that diffusion models trained on expert demonstrations produce manipulation trajectories with remarkable precision and multimodality, reaching state-of-the-art performance on a broad range of benchmarks. The architecture is readily adapted to use fast planner outputs as training signal. While demonstrations are expensive to collect and limited in diversity, a GPU-accelerated planner can generate millions of diverse constraint-satisfying trajectories [10].
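The iterative-denoising mechanics can be shown with an analytic stand-in for the learned score network. Annealed Langevin sampling over a toy Gaussian "demonstration" distribution (all quantities below are illustrative) captures the noise-ladder structure that diffusion planners share:

```python
import numpy as np

rng = np.random.default_rng(4)

MU = np.array([1.0, -1.0, 0.5])   # "demonstration" trajectory mean (toy)

def score(x, sigma):
    """Score of the noise-smoothed data distribution N(MU, sigma^2 I).
    A trained denoising network would approximate this from data."""
    return (MU - x) / sigma**2

def annealed_langevin_sample(n_steps_per_level=50, step=0.05):
    """Iterative denoising: descend a ladder of noise levels, Langevin at each."""
    x = rng.normal(size=MU.shape) * 3.0          # start from pure noise
    for sigma in [2.0, 1.0, 0.5, 0.2, 0.1]:
        eps = step * sigma**2
        for _ in range(n_steps_per_level):
            x = x + 0.5 * eps * score(x, sigma) \
                  + np.sqrt(eps) * rng.normal(size=x.shape)
    return x

samples = np.stack([annealed_langevin_sample() for _ in range(64)])
```

With a multi-modal data distribution the same loop produces samples from every mode, which is exactly the property that makes diffusion attractive for trajectory generation over CVAE-style unimodal samplers.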

Diffuser [35a] treats trajectory optimization as conditional sampling from a learned distribution over trajectories, and subsequent work has extended this to manipulation domains with richer constraint sets including collision avoidance and kinematic limits [10, 67]. SE(3)-DiffusionFields and related geometric-diffusion work bring geometric structure into the generator by operating directly in task-space, respecting the Riemannian geometry of the rotation group, and passing task-space trajectories into a fast IK stage [88]. Conditioning on 3D point-cloud observations rather than 2D images provides spatially invariant features that enable cross-scene generalization from single-scene training data, positioning 3D geometric representation as a key inductive bias for sample-efficient diffusion-policy learning [96, 95, 54a]. The kinematics-aware diffusion policy [54a] represents actions as 3D body-point sets spatially aligned with point-cloud observations, offloading the burden of learning nonlinear kinematics onto the action parameterization itself, with explicit kinematic priors injected into the denoising process. SafeBimanual [5] integrates diffusion-based trajectory generation with classical safety constraints for bimanual manipulation, enforcing collision avoidance and joint limits on the diffusion output.

A fundamental tension in diffusion-based motion generation is the inference-time cost. A single trajectory typically requires 20 to 100 denoising steps, imposing a floor of 50 to 200 ms that is competitive with but not dramatically faster than modern optimization-based planners. Consistency models and single-step distillation reduce this floor, and iterative student-teacher finetuning (the policy's own rollouts generate improved training targets) corrects specific failure modes without full retraining [95, 93]. Using a mixture-of-experts RL policy as teacher deliberately induces multi-modal structure in the demonstration data, and a diffusion policy is the appropriate student because its capacity to model multi-modal distributions prevents mode collapse when distilling heterogeneous expert behaviors [93].

5.3 Neural Scene, Dynamics, and Reachability Surrogates

Beyond generating full trajectories, neural networks can accelerate planning by providing fast approximate evaluations of intermediate states or partial plans. Differentiating through a neural implicit scene representation (NeRF augmented to approximate an ESDF) yields obstacle gradients analytically, replacing sampling-based collision queries entirely [71]. The same NeRF used for rendering doubles as a reactive geometry oracle, with planning speed becoming a function of network inference rather than environment complexity. A differentiable reachability map, a scalar field over task space trained on kinematic-model samples, replaces discrete IK feasibility checks with smooth gradient-providing constraints in trajectory optimization, encoding the non-convex reachability geometry of high-DoF humanoids as a learnable surrogate [63]. A learned latent world model, trained on offline demonstration-free data, serves as the predictive simulator inside sampling-based MPC by rolling out candidate trajectories in a compressed latent space, replacing physics simulation entirely [58]. This positions the world model as a neural rollout oracle for contact-rich humanoid planning.
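A small sketch conveys the analytic-gradient point: substitute a radial-basis field for the trained implicit network (the RBF parameters below are random stand-ins for learned weights) and obstacle gradients come from differentiation alone, with no sampled collision queries:

```python
import numpy as np

rng = np.random.default_rng(6)
CENTERS = rng.uniform(-1, 1, size=(16, 3))   # stand-in "network" parameters:
WEIGHTS = rng.uniform(0.5, 1.0, size=16)     # a trained implicit field would
SCALE = 0.3                                  # supply these from data

def density(x):
    """Implicit occupancy field (RBF stand-in for a NeRF/ESDF network)."""
    d2 = np.sum((x - CENTERS) ** 2, axis=1)
    return np.sum(WEIGHTS * np.exp(-d2 / (2 * SCALE**2)))

def density_grad(x):
    """Analytic gradient of the field: no sampling-based collision queries."""
    d2 = np.sum((x - CENTERS) ** 2, axis=1)
    coeff = WEIGHTS * np.exp(-d2 / (2 * SCALE**2)) / SCALE**2
    return (coeff[:, None] * (CENTERS - x)).sum(axis=0)

def push_out(x, steps=100, lr=0.05):
    """Descend the occupancy field: a waypoint slides into low-density free space."""
    for _ in range(steps):
        x = x - lr * density_grad(x)
    return x

x0 = np.zeros(3)
x_free = push_out(x0)
```

Query cost here is a function of network (field) size, not scene complexity, which is the inversion the paragraph above describes.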

Learning to predict motion-planning feasibility represents an extreme case of neural acceleration. Binary classifiers evaluable in microseconds serve as ultra-fast pre-filters within TAMP systems. The task planner consults the classifier before invoking the millisecond-scale motion planner, and the two-level filtering architecture is coherent only when the motion planner is fast enough that the combined pipeline meets real-time requirements, yet still slow enough relative to the classifier that the pre-filter's savings are meaningful [40, 4]. Offline reinforcement learning on sub-optimal whole-body controllers generates demonstrations constrained to the task-relevant state-action space by randomizing a lightweight WBC, then applies Q-chunking extended to action-chunked diffusion policies to identify and stitch improved behaviors, bypassing teleoperation and heavy reward engineering [39]. The interaction between neural components and structured planners is increasingly bidirectional. Learned value functions and cost-to-go estimators guide planners, and planner outputs update the learned functions, creating a bootstrapping loop whose convergence properties remain poorly characterized theoretically but have shown strong empirical results.
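A minimal sketch of the two-level filter, with hypothetical stand-ins for both the microsecond classifier and the millisecond planner (the names, fields, and thresholds are all invented):

```python
# Hypothetical stand-ins: a microsecond-scale learned feasibility
# classifier and a millisecond-scale motion planner. Names, fields,
# and thresholds are invented for illustration.
def learned_feasibility(binding):
    """Cheap pre-filter: reject action bindings far outside the workspace."""
    return abs(binding["reach"]) <= 1.0

def motion_planner(binding):
    """Expensive check standing in for a full whole-body planning query."""
    return abs(binding["reach"]) <= 0.8 and not binding["blocked"]

def screen(bindings):
    """Two-level filter: classifier first, planner only on the survivors."""
    survivors = [b for b in bindings if learned_feasibility(b)]
    return [b for b in survivors if motion_planner(b)], len(survivors)

bindings = [{"reach": r / 10, "blocked": r % 3 == 0} for r in range(-15, 16)]
feasible, planner_calls = screen(bindings)   # 21 planner calls instead of 31
```

The savings scale with the classifier's rejection rate; a conservative classifier (few false negatives) keeps the pipeline sound while the planner remains the final arbiter of feasibility.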

6. Base-Placement Optimization and IK Pre-Configuration for Bimanual Systems

6.1 Reachability-Driven Base Placement for Mobile Bimanual Systems

For a mobile bimanual humanoid, the choice of base position relative to the manipulation workspace is a decision of outsized consequence. It determines which objects are reachable, which grasps are kinematically feasible, and how much of the robot's joint range is consumed by postural compensation rather than task-directed motion. Classical approaches relied on precomputed reachability maps (capability maps) that discretize the end-effector poses achievable from each base position, aggregated over sampled joint configurations [94, 62]. These maps enabled fast lookup but suffered three limitations. They did not account for obstacles, they evaluated reachability independently per arm, and they considered only static reachability rather than the feasibility of transitions between configurations.

Fast whole-body planning transforms base-placement optimization from a static reachability analysis into a dynamic feasibility search. Rather than asking "can the end-effector reach this pose from this base position" (a kinematic query), the planner asks "does a collision-free joint-limit-respecting trajectory exist that brings the whole body to a configuration achieving this end-effector pose" (a planning query). Burget et al. [14] demonstrated that incorporating full-trajectory feasibility into base-placement scoring produces qualitatively different and substantially better placement decisions than reachability maps alone, particularly in cluttered environments where the path to a reachable pose may be obstructed. The computational cost of this richer evaluation was initially prohibitive for dense search, but GPU-accelerated and vectorized planners reduce the per-evaluation cost to the point where hundreds of base candidates are screened within seconds [84, 91].
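The screening pattern can be sketched as follows, with `plan_cost` as a hypothetical stand-in for a millisecond-scale whole-body planning query; the reach band and candidate grid are invented for illustration.

```python
# Dense screening of base candidates with a cheap planning query.
# `plan_cost` is a hypothetical stand-in for a millisecond-scale
# whole-body planner returning a trajectory cost, or None if infeasible;
# the reach band and grid are invented.
def plan_cost(base, target=(1.0, 0.0)):
    dist = ((base[0] - target[0]) ** 2 + (base[1] - target[1]) ** 2) ** 0.5
    if not 0.3 <= dist <= 0.9:   # too close or out of reach: infeasible
        return None
    return dist                  # shorter reach stands in for a cheaper posture

def best_base(candidates):
    """Screen every candidate, keep the feasible ones, return the cheapest."""
    feasible = [(c, b) for b in candidates if (c := plan_cost(b)) is not None]
    return min(feasible) if feasible else None

grid = [(x / 10, y / 10) for x in range(-5, 16) for y in range(-5, 6)]
cost, base = best_base(grid)
```

With a real planner each `plan_cost` call also encodes obstacles and joint limits, which is precisely what a static capability-map lookup cannot see.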

For bimanual tasks, the interaction between base placement and arm coordination creates an especially rich optimization landscape. A base position that maximizes the reachable workspace of one arm may minimize it for the other, or may force both arms into configurations that leave no collision-free coordination space between them. Bimanual reachability (the intersection of the two arms' reachable workspaces) produces significantly different optimal placements than single-arm reachability [93]. Extending this to full planning-based feasibility, where bimanual coordination constraints (inter-arm collision, relative end-effector poses) are checked online, enables the placement optimizer to account for the full complexity of two-arm coordination [21, 75].
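A one-dimensional toy makes the point concrete. With invented per-arm reach intervals, optimizing the bimanual intersection selects a different base than optimizing a single arm's coverage:

```python
# One-dimensional toy in decimeter units: each arm reaches an integer
# interval offset from the base. The offsets are invented; the point is
# that the bimanual intersection of the two arms' reach selects a
# different base than a single-arm analysis does.
ARMS = {"left": (-1, 7), "right": (-7, 1)}   # reach relative to base

def reach_count(base, arm_keys, targets):
    """Number of targets inside every listed arm's reachable interval."""
    return sum(
        all(ARMS[k][0] <= t - base <= ARMS[k][1] for k in arm_keys)
        for t in targets
    )

targets = [0, 1, 5, 6, 7]                    # object positions
bases = range(-10, 11)                       # candidate base positions

best_bi = max(bases, key=lambda b: reach_count(b, ("left", "right"), targets))
best_left = max(bases, key=lambda b: reach_count(b, ("left",), targets))
```

The left arm alone favors a base far to one side of the target cluster, while the intersection of both arms' reach pulls the optimum into the cluster itself.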

6.2 Whole-Body IK as a Screening Primitive

Inverse kinematics for high-DoF systems is itself computationally expensive and benefits directly from the same acceleration techniques as trajectory planning. Fast whole-body IK is a screening primitive that precedes full trajectory optimization. Before committing to a heavier trajectory plan, the system checks whether a kinematically valid configuration exists at the target pose. For bimanual humanoids, whole-body IK must simultaneously satisfy constraints on two end-effectors, torso pose for balance, and possibly leg configurations for stance stability [29]. The multi-constraint IK problem has many solutions under high-DoF redundancy or none under infeasibility, and rapid enumeration of the solution manifold or certification of infeasibility is critical for upstream decision-making.

Neural warm-starts for a two-stage Jacobian-based IK solver reduce collision-free IK to under 10 ms per query on a single CPU core for 19 DoF bimanual humanoids, fast enough to screen many candidate base placements in real time before the robot commits to a position [86]. The analytic IK-Geo decomposition of any 6 DoF all-revolute manipulator into six canonical geometric subproblems [32] serves as the closed-form baseline that neural warm-starts must beat. The concept of IK pre-configuration extends screening by pre-computing IK solutions for each proposed action's target pose before the task planner commits to a sequence, using the existence and quality of the solutions to inform task-level decisions [27, 87, 46a]. For a bimanual handoff, pre-computing IK for both arms at the transfer pose reveals whether the handoff is feasible and how much of the robot's joint range it consumes, feeding back to the task planner, which can choose between alternative transfer locations or restructure the action sequence.
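The handoff screen can be sketched as follows; `ik_reachable` is a hypothetical stand-in for a fast whole-body IK solver, with invented box workspaces per arm.

```python
# Hypothetical IK screen for a bimanual handoff. `ik_reachable` stands in
# for a fast whole-body IK solver; the per-arm box workspaces are invented.
def ik_reachable(arm, pose):
    """Pretend each arm's dexterous workspace is a box around its shoulder."""
    shoulder_y = 0.25 if arm == "left" else -0.25
    x, y, z = pose
    return 0.2 <= x <= 0.8 and abs(y - shoulder_y) <= 0.5 and 0.0 <= z <= 0.6

def screen_handoff(candidates):
    """Keep only transfer poses where both arms admit an IK solution."""
    return [p for p in candidates
            if ik_reachable("left", p) and ik_reachable("right", p)]

candidates = [(0.5, 0.6, 0.3),    # comfortable for the left arm only
              (0.5, 0.0, 0.3),    # centered between the shoulders
              (0.9, 0.0, 0.3),    # beyond both arms' reach
              (0.4, -0.1, 0.2)]   # slightly right of center, still shared
feasible = screen_handoff(candidates)
```

The surviving poses are exactly what the task planner needs to choose among alternative transfer locations before any trajectory is optimized.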

6.3 Co-Optimization of Stance and Manipulation

The most architecturally ambitious application of fast planning to base placement is co-optimization of stance (base position, foot placement, torso posture) and manipulation plan. Rather than fixing the base first and then planning arm motions (a sequential decomposition that can miss globally superior solutions), co-optimization treats the full whole-body configuration, base included, as a single decision variable [9, 81]. Feasibility depends on inner-loop planning speed. With classical planners at seconds per query, the outer loop was limited to a handful of preselected candidates. With GPU-accelerated planners the outer loop can afford dense grid search, gradient-based optimization over continuous base-pose parameters, or evolutionary strategies [14].

Kinodynamics-based pose optimization for humanoid loco-manipulation [46a] combines the object-robot dynamics model, robot kinematic constraints, and ground reaction force constraints to compute stance configurations that are dynamically feasible for pushing heavy objects, coupling object dynamics with whole-body pose before MPC execution. Pre-specifying contact timings as MPC inputs [46] handles discrete contact-mode switches (grasp, release, footstep) within the prediction horizon, avoiding combinatorial contact selection during online optimization. Modeling manipulated-object dynamics as external forces rather than as additional rigid-body states reduces model complexity while preserving dynamics consistency across loco-manipulation modes [46]. Affordance-guided coarse-to-fine exploration [54b] couples VLM-derived affordance priors with geometric feasibility in an iterative loop: semantics first narrow candidates to task-relevant regions, and spatial constraints then refine them, achieving high zero-shot success in open-vocabulary mobile manipulation and avoiding the local-optima failures of purely proximity-based or purely geometric planners. GenerativeMPC [25a] folds VLM-RAG-guided reasoning into a whole-body MPC with virtual impedance, producing a hierarchical framework that bridges semantic context and compliant physical interaction for bimanual mobile manipulation. Together these works exemplify the survey's general pattern. Cheap planning enables new integration patterns between learned and structured components.

7. Tighter Planning-Perception-Action Feedback Loops

7.1 Model-Predictive Whole-Body Control

When trajectory planning completes within one control cycle, the boundary between planning and control dissolves. The system replans a complete trajectory at every timestep, incorporating the latest state estimate and any changes in the environment. This model-predictive planning paradigm extends MPC from its traditional domain of trajectory tracking to the full motion-planning problem, including collision avoidance and task-level constraints. Whole-body MPC for humanoid-class systems has been demonstrated with steadily decreasing computation times. Unified MPC for legged mobile manipulators [81] jointly optimizes locomotion and manipulation through a single nonlinear program. BiConMP [49] and HS-DDP [45] bring whole-body trajectory optimization for humanoid-class systems into the sub-second regime. Warm-start whole-body MPC for bipedal locomotion with a novel kino-dynamic model [48] and inverse-dynamics MPC for loco-manipulation [60] reach the regime where full-body replanning keeps pace with perception updates. Collision-free whole-body MPC [20a] enforces self-collision and environment-collision avoidance as soft constraints inside multi-contact optimal control.

The architectural consequence of MPC-rate planning is the elimination of the traditional planning-execution-monitoring trichotomy. Classical architectures plan a trajectory once, execute it open-loop under a tracking controller, and monitor deviations that trigger replanning. Each component operates at a different rate and communicates through interfaces that introduce latency and information loss. MPC-rate planning collapses these into a single loop that plans, executes one step, observes, and plans again. This tight loop is inherently more robust to disturbances, model errors, and environmental changes than any open-loop architecture, but demands that the planner complete within the control period, a requirement only the fastest whole-body planners can meet [60, 48, 81].
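A minimal receding-horizon sketch of this collapsed loop, with a toy scalar "planner" standing in for a whole-body trajectory optimizer that completes within one control period:

```python
# Minimal receding-horizon sketch: replan a full (here, scalar) trajectory
# every cycle, execute only its first step, then observe again. The
# "planner" is a toy proportional step toward the goal, standing in for
# a whole-body trajectory optimizer that fits inside one control period.
def plan(state, goal, horizon=10):
    """Return a horizon-length trajectory shrinking the error geometrically."""
    traj, x = [], state
    for _ in range(horizon):
        x = x + 0.5 * (goal - x)
        traj.append(x)
    return traj

def mpc_loop(state, goal, cycles=20, tol=1e-3):
    for _ in range(cycles):
        traj = plan(state, goal)     # full replan each control cycle
        state = traj[0]              # execute only the first step
        if abs(goal - state) < tol:  # then observe and repeat
            break
    return state

final = mpc_loop(state=0.0, goal=1.0)
```

Only the first step of every plan is ever executed; disturbances or world-model updates simply change the state fed to the next replan, which is where the robustness of the collapsed loop comes from.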

7.2 Perception-Driven Online Replanning

Below MPC rate but above the traditional planning rate lies a regime of perception-driven replanning at 5 to 30 Hz, matching the update rates of typical vision systems. In this regime the planner incorporates new obstacle observations, object pose updates, and task-state transitions at each perception cycle, producing trajectories that are always consistent with the current world model. STORM [12] demonstrates this architecture for manipulation. At each perception update the GPU-parallel MPPI optimizer generates a new trajectory from the current state, using the latest voxel-grid representation. The manipulation system naturally avoids dynamically appearing obstacles, tracks moving targets, and recovers from perturbations without explicit replanning triggers. The insight is that when replanning is cheap and continuous, separate monitoring or recovery modules are unnecessary. Those behaviors emerge from the planner's interaction with a continuously updated world model. Predicted composite SDFs extend this principle to dynamic environments by accounting for predicted trajectories of moving objects inside the signed-distance representation, letting the planner reason jointly about current and predicted collision surfaces [30].

For bimanual humanoid systems, perception-driven replanning introduces additional complexity. The two arms may execute different phases of a coordinated task, and a perception update that invalidates one arm's plan may require replanning for both arms simultaneously. Whole-body planning handles this naturally, since the planner optimizes all joints jointly and can propagate the effect of a local obstacle change to the entire body posture, but the computational cost of whole-body replanning at perception rates remains at the boundary of current capability [84, 81]. The integration of neural scene representations (NeRFs, 3D Gaussians, neural SDFs) with planning creates new possibilities for perception-planning coupling. These representations provide smooth differentiable distance fields compatible with gradient-based trajectory optimization, enabling the planner to reason about partially observed environments through learned scene completions [71].

7.3 Contact-Aware and Intent-Aware Reactive Planning

Manipulation tasks, especially bimanual tasks such as regrasping, pivoting, and assembly, are fundamentally contact-rich. The robot must make and break contacts in controlled ways, and the dynamics of these contacts are discontinuous, hybrid, and partially observable. Classical motion planners treat contacts as hard constraints to be avoided, but a growing body of work treats contacts as resources to be exploited [64, 89, 58, 47]. Contact-implicit trajectory optimization formulates the planning problem with contact forces as decision variables alongside joint trajectories, enabling the optimizer to discover contact-exploitation strategies invisible to a collision-avoidance planner.

Motion intent inferred at planning time can be routed back to reweight perceptual features, for example via multi-scale attention, creating an intra-step planning-to-perception feedback loop that adapts what the system looks at based on its current manipulation stage rather than relying on static or purely reactive attention [54a, 92a, 79]. Decoupling base and arm action streams while jointly modeling their coordination (for example via flow matching on a shared latent) reduces optimization interference from control coupling, producing less entangled action signals that downstream perception and replanning can condition on more cleanly [6]. Adversarial motion priors extended to perceptual settings serve as a physically coherent feedback channel. The prior regularizes visually reactive behaviors to remain kinematically plausible, while a paired encoder-decoder recovers privileged state estimates from imperfect observations, effectively amplifying the information the policy can act on without ground-truth sensing at deployment [92a].

Contract-based abstraction boundaries between deliberative and reactive layers let the deliberative planner reason over provable reachability guarantees rather than specific trajectories, replacing trajectory evaluation with guarantee composition and tightening the deliberative-reactive loop [90, 92]. The reactive dimension of contact-aware planning (adjusting plans in response to tactile feedback during execution) requires planning speeds that match the bandwidth of tactile sensors. Full trajectory reoptimization at those rates remains out of reach for high-DoF systems, but hybrid architectures that combine a fast local reactive controller with a slower but more capable trajectory replanner achieve effective contact-reactive behavior. Recent humanoid teleoperation and real-time interaction systems (RHINO [19a], humanoid soccer skills [92a], Mobile ALOHA bimanual teleoperation [59]) push this integration onto real hardware by folding perception, intent, and contact reasoning into tightly coupled loops.

8. Cross-Cutting Analysis

Several structural patterns and tensions emerge across the thematic sections of this survey.

The virtuous cycle of speed and data. The most consequential cross-cutting pattern is the mutually reinforcing relationship between fast planners and learned components (Sections 3 and 5). Fast planners generate training data for learned samplers and diffusion models. Learned components accelerate planners by providing better initial samples or heuristics. The accelerated planners generate still more diverse training data. The cycle has been operationalized in guiding-space formulations [4], local control predictors [40a], constraint-manifold learning [73], and teacher-student diffusion distillation [93, 95]. The risk is that errors in learned components produce biased training data, which in turn produces biased samplers, and the conditions under which this cycle converges reliably are not yet theoretically characterized.
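The cycle can be sketched with a toy rejection-sampling "planner" whose solutions train a Gaussian proposal that then seeds later queries; all distributions here are illustrative.

```python
import random
import statistics

random.seed(0)

# Toy virtuous cycle: a "planner" rejection-samples until it lands in a
# narrow feasible interval; its solutions then train a Gaussian proposal
# that seeds later queries. All distributions here are illustrative.
FEASIBLE = (0.60, 0.65)

def solve(proposal, max_tries=10_000):
    """Rejection-sampling planner: count tries until a feasible sample."""
    for tries in range(1, max_tries + 1):
        x = proposal()
        if FEASIBLE[0] <= x <= FEASIBLE[1]:
            return x, tries
    raise RuntimeError("no solution found")

def uniform():
    return random.uniform(0.0, 1.0)

solutions, cold_tries = [], []
for _ in range(50):                 # the cold planner generates training data
    x, n = solve(uniform)
    solutions.append(x)
    cold_tries.append(n)

mu, sigma = statistics.mean(solutions), statistics.stdev(solutions)

def learned():                      # "learned" sampler fit to planner data
    return random.gauss(mu, sigma)

warm_tries = [solve(learned)[1] for _ in range(50)]
```

The warm sampler needs far fewer tries per query than the cold uniform proposal, illustrating the data-to-speed direction of the cycle; the bias risk noted above corresponds to `mu` and `sigma` being fit to whatever the cold planner happened to find.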

The feasibility oracle as a universal interface. Across task decomposition (Section 4), base placement (Section 6), and language-grounded planning (Section 4.3), the motion planner is increasingly used not to produce trajectories for execution but to answer yes/no feasibility queries. This shift from planner as trajectory producer to planner as feasibility oracle changes the planner's design requirements. Low latency and high throughput matter more than trajectory quality, and false negatives (reporting infeasible when a plan exists) are more damaging than suboptimal trajectories. GPU-parallel planners that evaluate many trajectory seeds and return any feasible one are well-suited to this oracle role [3, 84, 86, 91]. Optimization-based planners that converge slowly toward high-quality solutions are less so. This tension suggests future systems may employ distinct planner configurations for each role.
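The oracle role can be sketched as batch seed evaluation, serial here but batched on a GPU in practice: return as soon as any seed is feasible, without polishing it for quality. The obstacle interval and seed distribution are invented.

```python
import random

random.seed(1)

# Oracle-style batch evaluation: draw many random trajectory seeds
# (serially here, batched on a GPU in practice) and return as soon as
# any one is feasible, without polishing it for quality. The obstacle
# interval and seed distribution are invented.
def seed_feasible(traj, obstacle=(0.5, 0.15)):
    """A seed is feasible if every waypoint clears the obstacle interval."""
    lo, hi = obstacle[0] - obstacle[1], obstacle[0] + obstacle[1]
    return all(not lo <= w <= hi for w in traj)

def feasibility_oracle(n_seeds=512, waypoints=8):
    for _ in range(n_seeds):
        traj = [random.uniform(0.0, 1.0) for _ in range(waypoints)]
        if seed_feasible(traj):
            return True             # any single feasible seed answers the query
    return False

answer = feasibility_oracle()
```

A trajectory-producing planner would instead keep the best seed and refine it; the oracle returns on the first hit, which is why batch-parallel planners suit this role and slowly converging optimizers do not.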

Hierarchical decomposition versus end-to-end optimization. A persistent tension throughout the surveyed literature lies between hierarchical architectures that decompose the problem along kinematic, spatial, or temporal dimensions (Sections 3.3, 6.3) and monolithic approaches that optimize all variables jointly (Section 7.1). Hierarchical decomposition is more computationally tractable and more amenable to modular system design, but introduces interface constraints that can exclude globally superior solutions. End-to-end optimization avoids interface losses but scales poorly with system complexity. The current frontier lies in tightly coupled hierarchies, architectures that maintain decomposition for tractability but use fast cross-layer communication (enabled by cheap planning queries) to recover some of the global coherence lost to decomposition [87, 81, 90]. Logic-Geometric Programming and unified MPC approach this from opposite directions.

Locomotion and manipulation as a single contact-mode search. The traditional boundary between locomotion and manipulation is a legacy of planning cost. When symbolic actions are defined as contact-mode changes (shared across foot and hand contacts) the combinatorial space unifies, and emergent loco-manipulation behaviors appear without pre-partitioning [13, 41]. Multi-contact planning work bridging climbing, wheeled-quadruped manipulation, humanoid pivoting, and slithering loco-manipulation [11, 97, 89, 33] illustrates how the same decomposition transfers across embodiments. Progress here is gated more by contact-mode search than by planning speed per se.

The sim-to-real transfer problem for learned components. Learned samplers, diffusion planners, and neural cost-to-go estimators are typically trained in simulation using idealized kinematics and synthetic environments. Their transfer to physical robots, where sensor noise, model inaccuracies, and unmodeled contacts degrade performance, remains a significant concern. The interaction between fast planners and learned components creates a specific transfer challenge. A fast planner in simulation may generate training data that exercises parts of configuration space the physical robot cannot safely reach, producing learned components that are optimistic about feasibility. This concern is less acute for planning algorithms that operate on explicit geometric models, which can be updated from real sensor data more readily.

Bimanual planning as a stress test. Bimanual humanoid planning sits at the intersection of every theme surveyed. It requires fast planning (high DoF), tight task decomposition (coordinated two-arm actions), informed sampling (narrow feasible regions), careful base placement (dual-arm reachability), and reactive control (contact-rich coordination). It is a natural stress test and integration platform for advances in each individual area [21, 75, 5, 83, 18, 53, 61, 59]. Progress on bimanual planning is gated by the slowest component in the pipeline, which has historically been motion planning but is increasingly shifting to task-level reasoning about coordination strategies and to perception of contact states.

Foundation models and planning compute contention. Integrating LLMs and VLMs with fast planners raises a question largely unaddressed in the literature. Both compete for on-robot compute. If VLM inference takes 500 ms per call while planning takes 1 ms per call, the new bottleneck moves from geometry to language [1, 74, 79, 25a]. A realistic end-to-end stack must budget latency across perception, reasoning, and planning jointly, and future architectures will likely distribute inference between a shared accelerator and dedicated compute islands rather than treat each subsystem in isolation.

9. Open Problems and Future Directions

Several specific open problems emerge from this survey.

Certified completeness and optimality under speed constraints. GPU-parallel trajectory optimization achieves speed by sacrificing completeness. Running many random trajectory seeds in parallel and taking the best feasible one provides no guarantee that a feasible trajectory is found even when one exists [3, 84]. For safety-critical applications the absence of completeness certificates is a significant limitation. Extending graph-of-convex-sets style frameworks and convex kinematic relaxations [31, 21] to whole-body humanoid planning, which requires convex decomposition of free space in 24 to 40 dimensions, remains an open algorithmic challenge. Chance-constrained methods [22] and reactive guarantees [90, 92] are partial answers that cover specific subregimes but do not yet compose into a unified framework.

Unified training frameworks for learned planning components. The learned components surveyed in Section 5 (CVAE samplers, diffusion planners, neural cost-to-go estimators, feasibility classifiers) are currently developed and trained independently, each with its own data pipeline, architecture, and training procedure. A unified framework that co-trains these components end-to-end, using a fast planner as both data source and verification oracle, could capture cross-component dependencies and reduce the total data requirement. The challenge is designing loss functions and training curricula that balance the competing objectives.

Scaling to deformable objects and tool use. The fast planners surveyed in Section 3 assume rigid-body kinematics and collision geometries. Bimanual manipulation of deformable objects (cloth, cables, soft packaging) requires planning in the joint configuration-and-object-state space, which is infinite-dimensional in principle and very high-dimensional in any practical discretization. Extending GPU-accelerated planning to deformable-object manipulation, where the object's state depends on the entire history of manipulator contacts, is a major open challenge that will require new representations and algorithms.

Real-time whole-body planning with dynamic environments. Most fast planners assume quasi-static environments where obstacle positions are known and fixed during a planning cycle. In household or industrial settings where humans and other agents move unpredictably, the planner must reason about obstacle dynamics or replan fast enough that static snapshots are approximately valid. The interaction between prediction uncertainty and planning speed defines a largely unexplored design space [30]. A dedicated treatment of the foundation-model-latency versus planning-latency trade-off, and of how the two should share on-robot accelerators, would sharpen the design of the next generation of stacks.

Bimanual integration on real hardware. Bimanual coordination has essentially one dedicated paper in the curated theme [46] despite being central to the stated scope of this survey. A structured analysis of which upstream enablements have been demonstrated on real hardware versus only in simulation would sharpen practical significance claims. Teleoperation interfaces [59, 19a], imitation-learning stacks [18, 53, 10], and WBC-seeded offline RL [39, 25a] are the components currently nearest to a real-hardware bimanual integration, and a focused benchmark combining a bimanual humanoid platform with standardized task distributions would allow fair comparison across approaches.

Benchmarks and standardized metrics. The field lacks standardized benchmarks for whole-body bimanual planning that would enable fair comparison across methods. Existing benchmarks focus on single-arm planning in static environments. Bimanual benchmarks must additionally specify coordination constraints, base mobility assumptions, and task-level success criteria. A quantitative planner-family comparison (wall-clock time, DoF, constraint types, hardware demonstrated) is currently left to individual papers, and community-owned leaderboards would substantiate the millisecond-scale threshold claim with replicable evidence.

10. Conclusion

The reduction of whole-body motion planning from seconds to milliseconds is not merely an incremental performance improvement. It is a phase transition that reorganizes the robotics stack. When planning is expensive, it sits at the periphery of the system, invoked once, pre-computed, and executed open-loop. When planning is cheap, it migrates to the center, serving as a feasibility oracle for task planners, a training-data generator for learned components, a screening primitive for base-placement search, and the inner loop of reactive controllers. The specific architectural patterns this transition enables (fine-grained TAMP with online feasibility checking, neural samplers trained on planner-generated data, co-optimized stance-and-manipulation planning, and MPC-rate whole-body replanning) are individually well-documented in the literature but have not been previously recognized as consequences of a single underlying cause.

The most important open challenge is integration. Each architectural pattern has been demonstrated independently, yet no system has combined all of them into a unified whole-body bimanual pipeline operating at millisecond rates in unstructured environments. Achieving this integration will require not only faster algorithms but new interface standards between components, particularly between learned perception models and structured planners, that preserve the speed advantages of each while enabling the tight coupling that reactive bimanual manipulation demands. The field is converging on this integration, and the next generation of bimanual humanoid systems will be defined by how well they exploit the architectural freedoms that fast planning provides.

Citation

If you find this survey useful, please cite it as

@misc{fast_planner_upstreams_survey_2026,
  author    = {Hu Tianrun},
  title     = {Fast Whole-Body Planners and Upstream Architecture},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://h-tr.github.io/blog/surveys/fast-planner-upstreams.html}
}

References

  1. Ahn, M., Brohan, A., Brown, N., et al. (2022). “Do As I Can, Not As I Say. Grounding Language in Robotic Affordances (SayCan).” arXiv:2204.01691.
  2. Alvarez-Padilla, J., et al. (2024). “Real-Time Whole-Body Control of Legged Robots with Model-Predictive Path Integral Control.” arXiv:2409.10469.
  3. Attali, A., et al. (2022). “Evaluating Guiding Spaces for Motion Planning.” arXiv:2210.08640.
  4. SafeBimanual Authors. (2025). “SafeBimanual. Diffusion-based Trajectory Optimization for Safe Bimanual Manipulation.” arXiv:2508.18268.
  5. Liu, J. (2026). “InCoM. Intent-Driven Perception and Structured Coordination for Whole-Body Mobile Manipulation.” Open MIND.
  6. Tonneau, S., Del Prete, A., Pettré, J., Park, C., Manocha, D., & Mansard, N. (2018). “An Efficient Acyclic Contact Planner for Multiped Robots.” IEEE Transactions on Robotics, 34(3), 586–601.
  7. Carvalho, J., Le, A.T., Baierl, M., Koert, D., & Peters, J. (2023). “Motion Planning Diffusion. Learning and Planning of Robot Motions with Diffusion Models.” IROS 2023.
  8. Bjelonic, M., et al. (2019). “Keep Rollin'. Whole-Body Motion Control and Planning for Wheeled Quadrupedal Robots.” IEEE Robotics and Automation Letters.
  9. Bhardwaj, M., Sundaralingam, B., & Boots, B. (2022). “STORM. An Integrated Framework for Fast Joint-Space Model-Predictive Control for Reactive Manipulation.” CoRL 2022.
  10. Ciebielski, M., et al. (2025). “Task and Motion Planning for Humanoid Loco-manipulation.” arXiv:2508.14099.
  11. Burget, F., Bennewitz, M., & Burgard, W. (2015). “Stance Optimization for Whole-Body Reaching Movements in Humanoid Robots.” IROS 2015.
  12. Burget, F., Hornung, A., & Bennewitz, M. (2013). “Whole-Body Motion Planning for Manipulation of Articulated Objects.” ICRA 2013.
  13. Carpentier, J., et al. (2019). “The Pinocchio C++ Library. A Fast and Flexible Implementation of Rigid Body Dynamics Algorithms and their Analytical Derivatives.” IEEE SII 2019.
  14. Dai, H., Valenzuela, A., & Tedrake, R. (2014). “Whole-Body Motion Planning with Centroidal Dynamics and Full Kinematics.” Humanoids 2014.
  15. Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., & Song, S. (2023). “Diffusion Policy. Visuomotor Policy Learning via Action Diffusion.” RSS 2023.
  16. Chignoli, M., et al. (2021). “The MIT Humanoid Robot. Design, Motion Planning, and Control For Acrobatic Behaviors.” arXiv:2104.09025.
  17. Chen, J. (2025). “RHINO. Learning Real-Time Humanoid-Human-Object Interaction from Human Demonstrations.” arXiv:2502.13134.
  18. Corbères, T., et al. (2023). “Perceptive Locomotion through Whole-Body MPC and Optimal Region Selection.” arXiv:2305.08926.
  19. Chiu, J.-R., et al. (2022). “A Collision-Free MPC for Whole-Body Dynamic Locomotion and Manipulation.” arXiv:2202.12385.
  20. Cohn, T., Shaw, S., Simchowitz, M., & Tedrake, R. (2023). “Constrained Bimanual Planning with Analytic Inverse Kinematics.” arXiv:2309.08770.
  21. Dai, S. (2018). “Chance Constrained Motion Planning for High-Dimensional Robots.” arXiv:1811.03073.
  22. Dalibard, S., El Khoury, A., Lamiraux, F., Nakhaei, A., Taix, M., & Laumond, J.-P. (2013). “Dynamic Walking and Whole-Body Motion Planning for Humanoid Robots. An Integrated Approach.” IJRR, 32(9–10), 1089–1103.
  23. Dantam, N.T., Kingston, Z., Chaudhuri, S., & Kavraki, L.E. (2016). “Incremental Task and Motion Planning. A Constraint-Based Approach.” RSS 2016.
  24. Driess, D., Xia, F., Sajjadi, M.S.M., Lynch, C., et al. (2023). “PaLM-E. An Embodied Multimodal Language Model.” ICML 2023.
  25. Fernando, M.J. (2026). “GenerativeMPC. VLM-RAG-guided Whole-Body MPC with Virtual Impedance for Bimanual Mobile Manipulation.” arXiv:2604.19522.
  26. Ficht, G., & Behnke, S. (2018). “Online Balanced Motion Generation for Humanoid Robots.” arXiv:1810.08388.
  27. Garrett, C.R., Lozano-Pérez, T., & Kaelbling, L.P. (2020). “PDDLStream. Integrating Symbolic Planners and Blackbox Samplers via Optimistic Adaptive Planning.” ICAPS 2020.
  28. Garrett, C.R., Chitnis, R., Holladay, R., Kim, B., Silver, T., Kaelbling, L.P., & Lozano-Pérez, T. (2021). “Integrated Task and Motion Planning.” Annual Review of Control, Robotics, and Autonomous Systems, 4, 265–293.
  29. Escande, A., Mansard, N., & Wieber, P.-B. (2014). “Hierarchical Quadratic Programming. Fast Online Humanoid-Robot Motion Generation.” IJRR, 33(7), 1006–1028.
  30. Finean, M.N., et al. (2020). “Predicted Composite Signed-Distance Fields for Real-Time Motion Planning in Dynamic Environments.” arXiv:2008.00969.
  31. Frank, A., et al. (2020). “Socially Intelligent Task and Motion Planning for Human-Robot Interaction.” arXiv:2001.08398.
  32. Gonzalez, C., et al. (2024). “Guiding Collision-Free Humanoid Multi-Contact Locomotion using Convex Kinematic Relaxations and Dynamic Optimization.” arXiv:2410.08335.
  33. Elias, A.J. (2022). “IK-Geo. Unified Robot Inverse Kinematics Using Subproblem Decomposition.” arXiv:2211.05737.
  34. Gu, L., et al. (2024). “Loco-Manipulation with Nonimpulsive Contact-Implicit Planning in a Slithering Robot.” arXiv:2404.08174.
  35. Gilbert, E., Johnson, D., & Keerthi, S. (1988). “A Fast Procedure for Computing the Distance between Complex Objects in Three-Dimensional Space.” IEEE Journal on Robotics and Automation, 4(2), 193–203.
  36. Janner, M., Du, Y., Tenenbaum, J.B., & Levine, S. (2022). “Planning with Diffusion for Flexible Behavior Synthesis (Diffuser).” ICML 2022.
  37. Kalakrishnan, M., Chitta, S., Theodorou, E., Pastor, P., & Schaal, S. (2011). “STOMP. Stochastic Trajectory Optimization for Motion Planning.” ICRA 2011.
  38. Kavraki, L.E., Svestka, P., Latombe, J.-C., & Overmars, M.H. (1996). “Probabilistic Roadmaps for Path Planning in High-Dimensional Configuration Spaces.” IEEE Transactions on Robotics and Automation, 12(4), 566–580.
  39. Ichter, B., Harrison, J., & Pavone, M. (2018). “Learning Sampling Distributions for Robot Motion Planning.” ICRA 2018.
  40. Jauhri, S. (2026). “Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers.” arXiv:2604.12509.
  41. Jenamani, R.K., et al. (2020). “Robotic Motion Planning using Learned Critical Sources and Local Sampling.” arXiv:2006.04194.
  42. Karten, S., et al. (2022). “Data-Efficient Learning of High-Quality Controls for Kinodynamic Planning used in Vehicular Navigation.” arXiv:2201.02254.
  43. Wang, J. (2024). “Autonomous Behavior Planning for Humanoid Loco-manipulation through Grounded Language Model.” arXiv:2408.08282.
  44. Kaelbling, L.P., & Lozano-Pérez, T. (2013). “Integrated Task and Motion Planning in Belief Space.” IJRR, 32(9–10), 1194–1227.
  45. Lin, K., Agia, C., Migimatsu, T., Pavone, M., & Bohg, J. (2023). “Text2Motion. From Natural Language Instructions to Feasible Plans.” Autonomous Robots, 47, 1345–1365.
  46. Kuffner, J.J., & LaValle, S.M. (2000). “RRT-Connect. An Efficient Approach to Single-Query Path Planning.” ICRA 2000.
  47. Li, H., et al. (2020). “Hybrid Systems Differential Dynamic Programming for Whole-Body Motion Planning of Legged Robots.” IEEE Robotics and Automation Letters.
  48. Li, J. (2022). “Multi-contact MPC for Dynamic Loco-manipulation on Humanoid Robots.” arXiv:2209.08662.
  49. Li, J. (2023). “Kinodynamics-based Pose Optimization for Humanoid Loco-manipulation.” arXiv:2303.04985.
  50. Levé, V. (2025). “Scaling Whole-body Multi-contact Manipulation with Contact Optimization.” arXiv:2508.12980.
  51. Kim, J. (2025). “Real-time Whole-body Model Predictive Control for Bipedal Locomotion with a Novel Kino-dynamic Model and Warm-start Method.” arXiv:2505.19540.
  52. Meduri, A. (2023). “BiConMP. A Nonlinear Model Predictive Control Framework for Whole Body Motion Planning.” IEEE Transactions on Robotics.
  53. Mastalli, C., et al. (2020). “Crocoddyl. An Efficient and Versatile Framework for Multi-Contact Optimal Control.” ICRA 2020.
  54. Mukadam, M. (2016). “Gaussian Process Motion Planning.” ICRA 2016.
  55. Mukadam, M., Dong, J., Yan, X., Dellaert, F., & Boots, B. (2018). “Continuous-Time Gaussian Process Motion Planning via Probabilistic Inference.” IJRR, 37(11), 1319–1340.
  56. Lee, K.B. (2025). “Scene-agnostic Hierarchical Bimanual Task Planning via Visual Affordance Reasoning.” arXiv:2512.09310.
  57. Mastalli, C., et al. (2020). “Motion Planning for Quadrupedal Locomotion. Coupled Planning, Terrain Mapping, and Whole-Body Control.” IEEE Transactions on Robotics.
  58. Lv, K. (2025). “Kinematics-Aware Diffusion Policy with Consistent 3D Observation and Action Space for Whole-Arm Robotic Manipulation.” arXiv:2512.17568.
  59. Lin, T.-J. (2025). “Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation.” arXiv:2511.06240.
  60. Liu, F. (2024). “Opt2Skill. Imitating Dynamically-feasible Whole-Body Trajectories for Versatile Humanoid Loco-Manipulation.” arXiv:2409.20514.
  61. Orthey, A., Chamzas, C., & Kavraki, L.E. (2024). “Sampling-Based Motion Planning. A Comparative Review.” Annual Review of Control, Robotics, and Autonomous Systems, 7, 285–310.
  62. Mirabel, J., et al. (2016). “HPP. A C++ Framework for Humanoid Path Planning.” Humanoids 2016.
  63. Liu, H. (2025). “Ego-Vision World Model for Humanoid Contact Planning.” arXiv:2510.11682.
  64. Fu, Z., et al. (2024). “Mobile ALOHA. Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation.” arXiv:2401.02117.
  65. Molnar, L. (2025). “Whole-Body Inverse Dynamics MPC for Legged Loco-Manipulation.” arXiv:2511.19709.
  66. Motoda, T. (2021). “Bimanual Shelf Picking Planner Based on Collapse Prediction.” arXiv:2105.14764.
  67. Murooka, M. (2025). “Humanoid Loco-manipulation Planning based on Graph Search and Reachability Maps.” arXiv:2505.23505.
  68. Murooka, M. (2025). “Learning Differentiable Reachability Maps for Optimization-based Humanoid Motion Generation.” arXiv:2508.11275.
  69. Posa, M., Cantu, C., & Tedrake, R. (2014). “A Direct Method for Trajectory Optimization of Rigid Bodies through Contact.” IJRR, 33(1), 69–81.
  70. Orin, D.E., Goswami, A., & Lee, S.-H. (2013). “Centroidal Dynamics of a Humanoid Robot.” Autonomous Robots.
  71. Ratliff, N., Zucker, M., Bagnell, J.A., & Srinivasa, S. (2009). “CHOMP. Gradient Optimization Techniques for Efficient Motion Planning.” ICRA 2009.
  72. Schulman, J., et al. (2013). “Finding Locally Optimal, Collision-Free Trajectories with Sequential Convex Optimization.” RSS 2013.
  73. Schulman, J., et al. (2014). “Motion Planning with Sequential Convex Optimization and Convex Collision Checking.” IJRR, 33(9), 1251–1270.
  74. Petrović, L. (2018). “Motion Planning in High-Dimensional Spaces.” arXiv:1806.07457.
  75. Pantic, M., et al. (2022). “Sampling-Free Obstacle Gradients and Reactive Planning in Neural Radiance Fields (NeRF).” arXiv:2205.01389.
  76. Qureshi, A.H., Miao, Y., Simeonov, A., & Yip, M.C. (2020). “Motion Planning Networks. Bridging the Gap between Learning-Based and Classical Motion Planners.” IEEE Transactions on Robotics.
  77. Fernández, I.M.R. (2020). “Learning Manifolds for Sequential Motion Planning.” arXiv:2006.07746.
  78. Schakkal, A. (2025). “Hierarchical Vision-Language Planning for Multi-Step Humanoid Manipulation.” arXiv:2506.22827.
  79. Rudorfer, M. (2025). “A Framework for Joint Grasp and Motion Planning in Confined Spaces.” arXiv:2505.07259.
  80. Smith, W. (2024). “Lie Theory Based Optimization for Unified State Planning of Mobile Manipulators.” arXiv:2410.15443.
  81. LaValle, S.M. (2006). Planning Algorithms. Cambridge University Press.
  82. Skreta, M. (2024). “RePLan. Robotic Replanning with Perception and Language Models.” arXiv:2401.04157.
  83. Sleiman, J.-P., Farshidian, F., Minniti, M.V., & Hutter, M. (2021). “A Unified MPC Framework for Whole-Body Dynamic Locomotion and Mobile Manipulation.” IEEE Robotics and Automation Letters, 6(3), 4688–4695.
  84. Stilman, M. (2010). “Global Manipulation Planning in Robot Joint Space with Task Constraints.” IEEE Transactions on Robotics.
  85. TOCALib Authors. (2025). “TOCALib. Optimal Control Library with Interpolation for Bimanual Manipulation and Obstacles Avoidance.” arXiv:2504.07708.
  86. Sundaralingam, B., et al. (2023). “cuRobo. Parallelized Collision-Free Robot Motion Generation.” ICRA 2023.
  87. Tenhumberg, J. (2023). “Efficient Learning of Fast Inverse Kinematics with Collision Avoidance.” arXiv:2311.05938.
  88. Toussaint, M., Harris, J., Ha, J.-S., Driess, D., & Schmitt, M. (2022). “Sequence-of-Constraints MPC. Reactive Timing-Optimal Control of Sequential Manipulation.” IROS 2022.
  89. Toussaint, M., et al. (2018). “Differentiable Physics and Stable Modes for Tool-Use and Manipulation Planning.” RSS 2018.
  90. Zhang, J. (2021). “Transition Motion Planning for Multi-Limbed Vertical Climbing Robots Using Complementarity Constraints.” arXiv:2106.07127.
  91. Vasilopoulos, V., et al. (2022). “A Hierarchical Deliberative-Reactive System Architecture for Task and Motion Planning in Partially Known Environments.” arXiv:2202.01385.
  92. Thomason, W., Kingston, Z., & Kavraki, L.E. (2023). “VAMP. SIMD-Accelerated Motion Planning for Robot Manipulators.” IROS 2023.
  93. Vasilopoulos, V., et al. (2020). “Reactive Semantic Planning in Unexplored Semantic Environments Using Deep Perceptual Feedback.” arXiv:2002.12349.
  94. Wang, Y. (2025). “Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots.” arXiv:2511.03996.
  95. Wu, T. (2024). “Dexterous Functional Pre-Grasp Manipulation with Diffusion Policy.” arXiv:2403.12421.
  96. Vahrenkamp, N., Asfour, T., & Dillmann, R. (2013). “Robot Placement Based on Reachability Inversion.” ICRA 2013.
  97. Yang, J. (2025). “Deep Reactive Policy. Learning Reactive Manipulator Motion Planning for Dynamic Environments.” arXiv:2509.06953.
  98. Ze, Y. (2024). “Generalizable Humanoid Manipulation with 3D Diffusion Policies.” arXiv:2410.10803.
  99. Yoshida, E., et al. (2008). “Whole-Body Motion Planning for Pivoting Based Manipulation by Humanoids.” ICRA 2008.
  100. Lambert, A. (2021). “Entropy Regularized Motion Planning via Stein Variational Inference.” arXiv:2107.05146.