Why is Mobile Manipulation So Hard?

Mobile manipulation is notoriously difficult. Large-scale environments, long-horizon tasks, complex kinematic constraints, partial observability, dynamic objects, uncertainty everywhere. But here's the question: Are these just engineering problems? Despite tremendous progress in individual components (excellent SLAM systems, sophisticated motion planners, powerful vision models), we haven't yet seen anyone successfully deploy a general-purpose mobile manipulation system. The core difficulty, I believe, comes from forced decomposition.

These challenges are deeply interdependent. Navigation and manipulation depend on each other, while perception depends on both. Yet the combinatorial complexity forces us to decompose, creating artificial boundaries that the problem itself resists. This is why progress has been slow. It's not that individual modules aren't good enough. It's that decomposition pushes the problem's interdependencies onto module boundaries, where they become extraordinarily difficult to manage. This blog explores this paradox and what we can do about it.

The Nature of Mobile Manipulation: An Inherently Coupled System

To understand why decomposition is so problematic, we need to examine the couplings themselves. Mobile manipulation exhibits three fundamental types of interdependence: spatial coupling (how different components of the robot affect each other), temporal coupling (how decisions now depend on future states and vice versa), and knowledge-grounding coupling (how high-level reasoning and low-level execution depend on each other). Let's explore each in turn.

1. Spatial Coupling

Consider a seemingly simple task: a mobile manipulator needs to pick up a cup from a table and place it in a dishwasher across the room. The robot's base position is critical: it determines not just navigation feasibility but which grasps are reachable at all.

Move 10 cm to the left, and a side grasp becomes feasible. Move 10 cm forward, and the robot might collide with the table. Now suppose the cup is full of water: the grasp choice completely changes the navigation problem. The base must move smoothly, with minimal acceleration, and the cup has to stay upright.

The manipulation decision has transformed the navigation constraints. As the robot extends its arm to grasp, the center of mass shifts; on uneven terrain or during acceleration, this affects stability. The arm configuration isn't just about reaching the cup; it's also coupled to the navigation dynamics. And the arm configuration affects what the robot can perceive: it determines sensor viewpoints and can cause self-occlusion, making perception dependent on manipulation choices.

  • Base-Arm: Base position determines arm workspace and collision geometry
  • Grasp-Navigation: Grasp choice constrains navigation dynamics and trajectory
  • Configuration-Stability: Arm extension affects center of mass and stability
  • Perception-Action: Robot position determines what can be perceived

These aren't edge cases. This is the fundamental nature of mobile manipulation: everything affects everything else.
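
To make the coupling concrete, here is a minimal toy sketch in which the base pose alone determines the required arm extension, which in turn determines reachability and stability. All geometry, thresholds, and helper names are invented for illustration; nothing here models a real platform.

import math

# Toy planar model: every number and helper name here is invented for illustration.
CUP = (1.0, 0.6)                  # cup position on the table (m)
TABLE = (0.7, 1.6, 0.3, 0.9)      # table footprint: x_min, x_max, y_min, y_max (m)
ARM_REACH = 0.85                  # arm length at full extension (m)
COM_SHIFT_PER_EXT = 0.15          # how far the center of mass shifts at full extension (m)
COM_LIMIT = 0.12                  # largest CoM offset the base tolerates (m)

def base_collides(base):
    """Crude collision check: is the base center inside the table footprint?"""
    x_min, x_max, y_min, y_max = TABLE
    return x_min <= base[0] <= x_max and y_min <= base[1] <= y_max

def grasp_verdict(base):
    """The base pose fixes the required arm extension, which in turn fixes
    reachability and static stability: three couplings in one function."""
    required_ext = math.dist(base, CUP) / ARM_REACH
    if base_collides(base):
        return "base collides with table"
    if required_ext > 1.0:
        return "cup out of reach"
    if required_ext * COM_SHIFT_PER_EXT > COM_LIMIT:
        return "CoM shifted too far (unstable)"
    return "feasible"

# Four candidate standing spots a navigation planner might happily propose.
for base in [(0.8, 0.6), (-0.2, 0.6), (0.2, 0.6), (0.5, 0.6)]:
    print(base, "->", grasp_verdict(base))

Shifting the base by a few tens of centimeters moves the verdict through three different failure modes before reaching success, and a planner that only sees "get near the table" cannot distinguish any of them.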

2. Temporal Coupling

The temporal dimension creates circular causal dependencies. Mobile manipulation requires reasoning about the future to make decisions now, yet the future depends on the decisions we make now.

Consider planning an approach to grasp an object. We need to know the target configuration, yet we can't observe it until we get closer. We must guess a subgoal based on incomplete information. Each guess commits us to a trajectory that limits future options.

Where we navigate now determines what we can see later. What we need to see depends on what manipulation we'll perform. What manipulation we perform depends on what we observe. The circle closes.

  • Future-to-Present: Current decisions require future observations we don't have yet
  • Present-to-Future: Current actions determine what we can observe later
  • Sequential Dependencies: Each action depends on outcomes of actions not yet executed

This creates a chicken-and-egg problem: planning navigation requires knowing manipulation requirements, which requires knowing our position, which requires knowing what we'll observe, which depends on where we position ourselves.
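
Here is a deliberately tiny sketch of that circle, with a hidden drawer-handle side standing in for the unobserved target configuration. The noise model, the belief update, and every name are hypothetical; the point is only the ordering: we must commit to a viewpoint before the observation that would justify it arrives.

import random

# Tiny sketch of the circle: act -> observe -> update belief -> re-plan.
random.seed(0)

TRUE_HANDLE_SIDE = "right"               # hidden state; unobservable from a distance
belief = {"left": 0.5, "right": 0.5}     # prior over where the drawer handle is

def choose_viewpoint(belief):
    """Present-to-future: where we drive now determines what we can see later."""
    return max(belief, key=belief.get)

def observe(viewpoint):
    """Future-to-present: the observation we needed arrives only after we commit."""
    if viewpoint == TRUE_HANDLE_SIDE:
        return TRUE_HANDLE_SIDE if random.random() < 0.9 else None
    return None                          # wrong side: nothing useful is visible

for step in range(5):
    viewpoint = choose_viewpoint(belief)  # a guess, made on incomplete information
    seen = observe(viewpoint)
    if seen is not None:
        print(f"step {step}: handle confirmed on the {seen}; now the grasp can be planned")
        break
    # Saw nothing: crudely shift belief toward the side we have not yet inspected.
    belief = {side: (0.3 if side == viewpoint else 0.7) for side in belief}
    print(f"step {step}: viewpoint '{viewpoint}' was uninformative, belief is now {belief}")

The viewpoint at step 0 had to be chosen from a belief that the very next observation overturned; under decomposition, that first commitment is exactly where navigation and manipulation would already have gone their separate ways.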

3. Knowledge and Grounding Coupling

There's a third dimension that is often overlooked: the coupling between high-level knowledge and low-level grounding. Modern systems use VLMs, LLMs, or symbolic planners that bring broad general knowledge but aren't grounded in the specific scenario the robot actually faces.

The system lacks meta-knowledge about its own knowledge boundaries. An LLM might suggest "open the drawer to get the cup," unaware that this particular drawer is stuck or that the cup is elsewhere. Determining when enough information has been gathered is itself circular: the criterion depends on the task, which depends on what the system knows, which depends on the information gathered so far.

High-level reasoning needs low-level grounding to know what's true. Low-level execution needs high-level guidance about what information matters. Neither can function without the other, yet they operate in different representational spaces.

  • Uncertainty About Uncertainty: Systems lack meta-knowledge about their knowledge boundaries
  • Circular Grounding: Knowing when we know enough depends on what we're trying to know
  • Bidirectional Dependency: High-level and low-level reasoning depend on each other across abstraction boundaries
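
As a toy rendering of this gap, consider a high-level plan whose steps silently rely on facts the system has never grounded. The plan, the facts, and the requirements table are all invented; the structure, an abstract plan meeting grounded reality only at execution time, is the point.

# Toy rendering of the gap: a high-level plan whose steps silently rely on facts
# the system has never grounded. Everything here is invented for illustration.

prior_plan = ["open the drawer", "take the cup", "place the cup in the dishwasher"]

assumed_facts  = {"drawer is openable": True,  "cup is in drawer": True}   # what the planner believes
grounded_facts = {"drawer is openable": False, "cup is in drawer": False}  # what perception would report

def requirements(step):
    """Which facts each step quietly depends on. This table *is* the missing
    meta-knowledge: in practice, nobody writes it down."""
    return {
        "open the drawer": ["drawer is openable"],
        "take the cup": ["cup is in drawer"],
        "place the cup in the dishwasher": [],
    }[step]

for step in prior_plan:
    ungrounded = [f for f in requirements(step) if assumed_facts[f] and not grounded_facts[f]]
    if ungrounded:
        print(f"'{step}' fails: the plan assumed {ungrounded}, grounding disagrees")
    else:
        print(f"'{step}' is consistent with everything grounded so far")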

The Common Approaches

Faced with this complexity, we have no choice but to make assumptions and decouple the system. We do this in three main ways: spatial decoupling through modularity (separating perception, navigation, and manipulation), temporal decoupling through hierarchy (separating high-level planning from low-level execution), and end-to-end learning under strong assumptions (training a single model to do everything). Each approach makes a critical assumption to keep the complexity manageable, and that assumption is precisely why it doesn't scale.

1. Spatial Decoupling: Assuming Independence

Faced with intractability, we decompose the system into modules: perception processes sensor data, navigation plans base motion, manipulation handles arm control, and task planning sequences actions. Each module has well-defined inputs and outputs, can be developed independently, and has manageable computational complexity. This decomposition is an engineering necessity.

The assumption is that these modules can operate independently. The navigation module positions the base assuming the arm configuration is fixed. The manipulation module plans grasps assuming the base won't move. Perception provides object poses without understanding how they'll be used. Each module optimizes its own objective in isolation.

This assumption breaks down immediately. When navigation and manipulation plan independently, we get locally optimal solutions that are globally suboptimal. The navigation module might position the base in a location that technically allows manipulation, yet makes the grasp unnecessarily difficult. Modules communicate through narrow interfaces—goal positions, object poses—losing rich information about constraints, preferences, and uncertainty at module boundaries. When modules make incompatible assumptions, the system fails in ways that are hard to diagnose.

  • Lost Optimality: Independent planning yields locally optimal but globally suboptimal solutions
  • Interface Brittleness: Narrow interfaces lose critical information at module boundaries
  • Implicit Assumptions: Modules make incompatible assumptions that cause hard-to-diagnose failures
  • Coordination Overhead: Significant engineering effort needed to manage module interactions
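
Here is a sketch of how little survives a narrow interface. The dataclass, the numbers, and the toy reach margin are invented; no particular robot stack is implied. What matters is that only a bare pose crosses the boundary, so the grasp-relevant information never arrives.

from dataclasses import dataclass

@dataclass
class NavGoal:
    """Everything that crosses the navigation -> manipulation boundary."""
    base_x: float
    base_y: float

def navigation_module():
    """Optimizes its own objective (say, path length) and emits only a pose.
    How good that pose is *for the grasp* never crosses the interface."""
    return NavGoal(base_x=0.65, base_y=0.6)

def manipulation_module(goal: NavGoal):
    """Receives a bare pose: no uncertainty, no margins, no way to ask the base
    to shift 10 cm if the grasp would benefit."""
    reach_margin = 0.85 - abs(1.0 - goal.base_x)   # toy check against a cup at x = 1.0 m
    return "marginal grasp" if reach_margin < 0.55 else "comfortable grasp"

print(manipulation_module(navigation_module()))    # 'marginal grasp': feasible, but needlessly hard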

2. Temporal Decoupling: Assuming Lossless Abstraction

To handle long horizons, we introduce hierarchy. This appears in three main forms: Task and Motion Planning (TAMP), Hierarchical Reinforcement Learning (HRL), and LLM-as-planner with learned policies. The assumption is that we can compress low-level state into high-level abstractions without losing critical information. High-level planning reasons about abstract actions, low-level planning handles detailed execution.

In TAMP, symbolic planning reasons about abstract actions like "grasp cup" or "navigate to table." A symbolic predicate like At(robot, table) doesn't capture whether the base position actually enables a feasible grasp. Bilevel planning attempts to address this by planning all the way to low-level details before committing to high-level decisions. This is both inefficient (must explore complete low-level plans for every high-level option) and often impossible (can't plan low-level details for states we haven't observed yet under partial observability).
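
A small illustration of that gap, with invented geometry: a coarse region predicate standing in for At(robot, table), and the geometric reach check it cannot express.

import math

CUP = (1.0, 0.6)            # cup position (m)
TABLE_CENTER = (1.0, 0.0)
ARM_REACH = 0.85            # arm length (m)

def at_table(base):
    """Symbolic level: 'near the table' is true over a whole region."""
    return math.dist(base, TABLE_CENTER) < 1.2

def grasp_reachable(base):
    """Geometric level: the grasp additionally needs the cup inside arm reach."""
    return math.dist(base, CUP) <= ARM_REACH

for base in [(0.3, -0.2), (0.5, 0.5)]:
    print(base, "At(robot, table):", at_table(base), "| grasp reachable:", grasp_reachable(base))
# Both poses satisfy the symbolic precondition; only the second supports the grasp.
# That difference is precisely the information the abstraction throws away.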

In Hierarchical RL, a high-level policy sets subgoals and a low-level policy achieves them. The high-level policy is trained on abstract representations of success. These abstractions create a gap: the high-level model learns to exploit the representation rather than solve the actual task. This is reward hacking at the architectural level. The high-level policy optimizes for abstract subgoals that appear successful in the learned representation yet don't correspond to real low-level success.

A recent variant uses LLMs as high-level planners with BC or RL policies as executors. The LLM generates task plans in natural language or code, while learned policies execute primitive skills. This "large-small brain" architecture faces the same issue: the LLM reasons about abstract task descriptions without understanding whether the low-level policies can actually execute them. The gap between language-level planning and action-level execution creates the same lossy abstraction problem.

All three approaches struggle with the same problem: hierarchical decomposition requires lossy compression of low-level state into high-level abstractions. This loss is not incidental—it's necessary to make high-level reasoning tractable. Yet it means the high level cannot fully understand what the low level can actually achieve.

  • Lossy Abstractions: High-level representations cannot faithfully express low-level success conditions
  • Bilevel Planning Intractability: Planning to full detail is inefficient and impossible under partial observability
  • Reward Hacking: High-level policies exploit abstract representations rather than solving the actual task
  • Necessary Compression: Abstraction is required for tractability but prevents full understanding

3. End-to-End Learning: Assuming Reactivity is Sufficient

Given all these problems with decomposition, a natural alternative is to avoid decomposition entirely and learn everything end-to-end with a single model. This is the promise of Vision-Language-Action (VLA) models: feed in images and language instructions, output robot actions. No modules, no hierarchies, no coordination problems.

The assumption is that a pure reactive system can solve mobile manipulation. Given current observations and task description, directly output actions. No explicit planning, no reasoning about future states, no beliefs about hidden information. The model learns to map observations to actions through pattern matching on large datasets.

This assumption breaks down for mobile manipulation. The problem inherently requires reasoning about invisible aspects: whether a base position enables a feasible grasp, whether a path keeps the object stable, whether the target is reachable. These considerations demand causal reasoning about counterfactuals and future states, not just reactive pattern matching. Building an entire system on the assumption that reactivity is sufficient ignores the circular dependencies, partial observability, and long-horizon planning that define the problem. VLAs don't solve the decomposition problem—they assume it away.

  • Reactive Assumption: Assumes current observations are sufficient to determine actions
  • No Explicit Planning: Cannot reason about future states or hidden information
  • Pattern Matching Limitation: Relies on learned correlations rather than causal reasoning
  • Ignoring Problem Structure: Assumes away the fundamental challenges instead of addressing them
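
A toy contrast makes the reactive assumption visible. The scenario is invented: the cup was glimpsed a few frames ago and is no longer in view when the next action has to be chosen.

# Toy contrast between a reactive mapping and one that keeps state over time.

history = ["cup_on_left_counter", "empty_hallway", "empty_hallway", "empty_hallway"]

def reactive_policy(current_observation):
    """Current frame in, action out. With the cup out of view, nothing in the
    input distinguishes 'go left' from 'go right'."""
    return "approach_cup" if current_observation == "cup_on_left_counter" else "wander"

def memoryful_policy(observation_history):
    """Carries state across time, so an observation made earlier can still
    determine the decision now."""
    return "go_left_to_counter" if "cup_on_left_counter" in observation_history else "explore"

print("reactive: ", reactive_policy(history[-1]))   # 'wander'
print("memoryful:", memoryful_policy(history))      # 'go_left_to_counter'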

Learning Within Structure

In my view, neither pure modular planning nor pure end-to-end learning solves the problem. A better approach lies in combining their strengths thoughtfully.

We start with modular assumptions, decomposing the system into components like perception, navigation, and manipulation. But instead of hand-coding rigid interfaces, we model each module as a probabilistic mapping. Each module learns a distribution: perception learns \(P(\text{object\_pose} \mid \text{observations})\), navigation learns \(P(\text{trajectory} \mid \text{goal}, \text{constraints})\), and manipulation learns \(P(\text{grasp} \mid \text{object\_state}, \text{base\_position})\).

The key insight is this: when we discover dependencies between modules, we don't redesign the module interfaces. Instead, we update the data pipeline. If navigation depends on arm configuration, we include arm state in the navigation module's input distribution. If grasp success depends on approach trajectory, we expand the manipulation module's conditioning variables. When dependencies are strong enough, we can combine modules into a bigger, more unified module that learns the joint distribution directly.
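
Here is a structural sketch of what "expanding the conditioning set" and "merging modules" could look like in code. The class, the variable names, and the stand-in modules are hypothetical; real modules would wrap learned models, but the bookkeeping is the same.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ProbabilisticModule:
    name: str
    conditioning: List[str]                              # variables the module is conditioned on
    sampler: Callable[[Dict], Dict] = lambda ctx: {}     # stand-in for a learned sampler

    def sample(self, context: Dict) -> Dict:
        visible = {k: v for k, v in context.items() if k in self.conditioning}
        return self.sampler(visible)                     # the module only ever sees `visible`

# Initial decomposition: navigation is conditioned on the goal alone.
navigation = ProbabilisticModule(name="navigation", conditioning=["goal"])

# We discover that good base poses depend on the arm configuration and the intended
# grasp. We do not redesign an interface; we widen the conditioning set and retrain
# on data that actually contains those variables.
navigation.conditioning += ["arm_config", "grasp_candidate"]

# If the coupling proves strong enough, fold the modules into one that models the
# joint distribution P(base_pose, grasp | goal, object_pose, scene) directly.
base_pose_and_grasp = ProbabilisticModule(
    name="base_pose_and_grasp",
    conditioning=["goal", "object_pose", "scene_geometry"],
)

print(navigation.conditioning)
print(base_pose_and_grasp.conditioning)

The design choice this sketch emphasizes is that dependencies are absorbed into data and conditioning variables rather than into hand-crafted interface contracts.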

This approach is fundamentally different from traditional modular systems. We're not trying to eliminate dependencies through clever interface design. We're acknowledging dependencies and modeling them explicitly through the data and learning process. The structure guides what we learn, but learning reveals what structure we actually need.

The critical questions become: What are the right modules to start with? What distributions should they model? When do we expand a module's inputs versus combining modules entirely? These aren't just engineering decisions—they're fundamental design choices that determine what the system can learn and how modules can be composed.

As a colleague once said: planning people are overconfident about planning, and learning people are overconfident about learning. The real situation is that it's impossible to formulate everything clearly, and equally impossible to learn everything without carefully thinking about what exactly we're learning.

Pure planning assumes we can specify all the constraints, dynamics, and objectives precisely. But mobile manipulation involves too much uncertainty, too many edge cases, too much variation. Pure learning assumes we can collect enough data to cover the space and that the network will discover the right structure. But the combinatorial explosion makes comprehensive coverage impossible, and learned structure is often brittle.

The path forward requires humility from both camps. We need the structure and interpretability of planning with the flexibility and adaptability of learning. We need to carefully design what we formalize and what we learn, understanding that both are necessary and neither is sufficient.

Conclusion

Mobile manipulation is hard because the problem is fundamentally coupled. Navigation and manipulation depend on each other, perception depends on both, and high-level planning cannot be cleanly separated from low-level execution. Faced with this complexity, we have no choice but to make assumptions and decompose. Spatial decoupling assumes independence, temporal decoupling assumes lossless abstraction, end-to-end learning assumes reactivity is sufficient. Each approach makes critical assumptions to manage complexity. These assumptions are precisely why they don't scale.

The path forward isn't to eliminate decomposition or to learn everything end-to-end. It's to start with modular assumptions but model them as probabilistic functions. When we discover dependencies, we don't redesign interfaces—we update the data pipeline. We expand input distributions, combine modules when couplings are strong, and let learning reveal the structure we actually need. Mobile manipulation will remain hard. But by acknowledging the fundamental couplings rather than assuming them away, we can build systems that are more robust, more capable, and more honest about their limitations.

Citation

If you found this perspective useful, please cite:

@misc{mobile-manipulation-paradox,
  author    = {Hu Tianrun},
  title     = {The Fundamental Paradox of Mobile Manipulation: Why We Can't Escape the Modularity Trap},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://tianrunhu.github.io/blog/posts/mobile-manipulation-complexity.html}
}

Further Reading

  1. Kaelbling, L. P., & Lozano-Pérez, T. (2013). Integrated task and motion planning in belief space. The International Journal of Robotics Research, 32(9-10), 1194-1227.
  2. Garrett, C. R., Lozano-Pérez, T., & Kaelbling, L. P. (2018). FFRob: Leveraging symbolic planning for efficient task and motion planning. The International Journal of Robotics Research, 37(1), 104-136.
  3. Toussaint, M., Allen, K. R., Smith, K. A., & Tenenbaum, J. B. (2018). Differentiable physics and stable modes for tool-use and manipulation planning. In Robotics: Science and Systems.
  4. Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., ... & Toussaint, M. (2023). PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
  5. Vahrenkamp, N., Asfour, T., Metta, G., Sandini, G., & Dillmann, R. (2012). Manipulability analysis. In 2012 12th IEEE-RAS International Conference on Humanoid Robots (pp. 568-573). IEEE.