Back to Blog

Beyond the Papers: Scaling Laws and Data Requirements in End-to-End Robotics

Introduction

After several months implementing and testing various end-to-end neural network approaches for robotics, I've discovered some interesting insights about what papers often don't tell us. I've trained ACT [1], Diffusion Policy [2,3], and 3D Diffusion Policy [4] on both household manipulation tasks (opening fridges, microwaves, cabinets) and tabletop skills (pouring water, cutting, punching, sweeping, brushing), but the results weren't quite what the papers led me to expect. Human intuition turns out to be a poor predictor of how these models actually behave!

While papers paint optimistic pictures of generalization capabilities and sample efficiency, the practical reality tells a different story. In this blog, I'll share my observations on overfitting problems, the challenges of data representation, and perhaps most importantly—what scaling laws might tell us about the future of end-to-end robotics.

Reality Check: What Papers Don't Tell You About End-to-End Policy

1. Generalization: Expectation vs. Reality

What We Expected: At first glance, these learning-based policies seemed magical - train on a few hundred demonstrations and the robot should be able to perform the task in new situations. The promise of diffusion models felt especially compelling: just like they can generate images from noise, they should handle noise in robot trajectories too. We imagined the robot learning the essence of tasks - "grab the cup," "close the door" - rather than specific motions. Show it a few examples with a VR controller, and the robot would extract the underlying principles, not just memorize exact movements.

We also expected that using pretrained vision models would significantly improve performance. Since the image and depth views are embedded as conditions for the diffusion network, leveraging powerful vision backbones like DINOv2 or CLIP with their pretrained weights should help the model extract richer scene information and improve generalization.

What We Observed:

  • Initial Pose Sensitivity: What surprised us was how even small deviations from training positions caused dramatic performance drops. Moving a robot just 2-3cm from its training position often meant complete failure - something we didn't anticipate for a learning-based system.
  • Limited Task Adaptation: Despite collecting 200 demonstrations of door-closing from different starting angles, models struggled to understand the basic task structure. The robot couldn't consistently find the door handle and then close it, despite our expectation that this volume of data would be sufficient.
  • Visual Transfer Failure: Changing a purple cup to a yellow one reduced success rates by 40%. Slight camera angle changes led to complete failure. Objects in new locations weren't recognized. These visual generalization gaps were much wider than we initially expected.
  • Vision Backbone Indifference: Networks without any pretrained weights achieved exactly the same success rates as those with pretrained models! We tested extensively with visual backbones including DINOv2 and CLIP, but found no observable performance difference compared to a basic ResNet18, except that they ran slower. Even more surprising, with DINOv2, using a single global feature performed identically to aggregating local features (a minimal conditioning sketch follows this list).
  • Validation Loss Trick: Validation loss is not a reliable proxy for success rate here. We (and many other papers) found that simply training for more epochs, even as validation loss climbs, produces smoother trajectories and higher success rates.
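
To make the backbone comparison concrete, here is a minimal sketch of the two conditioning paths we swapped between, assuming PyTorch, torchvision's ResNet18, and the public DINOv2 torch.hub entrypoint. The class names, projection head, and feature dimension are illustrative choices, not code from any of the papers; in both cases the image collapses to a single vector that conditions the policy network.

```python
# Minimal sketch of two image-conditioning variants (PyTorch assumed).
# Class names and the projection head are illustrative, not paper code.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ResNet18Encoder(nn.Module):
    """Plain ResNet18 trained from scratch; outputs one global feature vector."""
    def __init__(self, feature_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)                             # no pretrained weights
        self.trunk = nn.Sequential(*list(backbone.children())[:-1])   # drop the fc layer
        self.proj = nn.Linear(512, feature_dim)

    def forward(self, img):                    # img: (B, 3, H, W)
        feat = self.trunk(img).flatten(1)      # (B, 512) after global average pooling
        return self.proj(feat)                 # (B, feature_dim)

class DinoV2GlobalEncoder(nn.Module):
    """Frozen DINOv2 ViT-S/14; uses only the global [CLS] feature."""
    def __init__(self, feature_dim=512):
        super().__init__()
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(384, feature_dim)  # ViT-S/14 CLS feature dim is 384

    def forward(self, img):                    # image sides must be multiples of 14
        with torch.no_grad():
            cls = self.backbone(img)           # (B, 384) global feature
        return self.proj(cls)

# Both encoders hand the policy the same-shaped conditioning vector,
# which is why swapping backbones is a one-line change in our configs.
```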

Our Takeaway: These models don't generalize as broadly as we had hoped. They seem to be performing sophisticated pattern matching rather than learning meaningful task representations. When conditions closely match training data, they succeed impressively. But the boundaries of what constitutes a "similar enough" scenario are much narrower than we initially thought.

2. Data Representation: The Hidden Variable

What We Expected: While papers mention action representation choices, we were surprised by how much the implementation details were left open to interpretation. We knew papers mentioned that positional actions worked better than velocity actions, but there were still several viable representation options: joint states, end-effector poses, or delta end-effector poses. For delta poses, we recognized these could be represented in different frames (robot frame or end-effector frame), each with potential benefits.

What We Discovered:

  • Action Format Is Critical: Our experiments revealed dramatic performance differences based on representation choice. ACT performed best with absolute end-effector poses as actions (without normalization), while Diffusion Policy required normalized delta movements in the robot frame (normalized to [-1, 1]). The impact of these choices far exceeded what we anticipated.
  • Model-Specific Sensitivities: Different algorithms showed completely different sensitivities to representation choices. ACT was relatively forgiving about normalization schemes, but Diffusion Policy needed careful tuning to ensure proper distribution of action values—a nuance that significantly impacted performance.
  • Frame of Reference Matters: For delta end-effector representations, we found the choice of reference frame created meaningful differences. Robot frame representations provided more consistent mapping between action values and view changes, while end-effector frame representations potentially offered better transfer to other robot embodiments.

Our Takeaway: Data representation choices often determine success or failure, regardless of the underlying algorithm. The "implementation details" that aren't thoroughly explored in papers can be more important than the core method itself. Different policies appear to have inherent preferences for specific representations, and finding the right match is essential for good performance.
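
To make the representation choices concrete, the sketch below shows the two action parameterizations that worked best for us, assuming trajectories stored as per-timestep end-effector pose vectors already expressed in the robot base frame. The function names, array shapes, and normalization-statistics handling are illustrative assumptions rather than code from either paper, and rotation components are differenced naively for brevity.

```python
import numpy as np

def absolute_ee_actions(ee_poses):
    """ACT-style actions: the absolute end-effector pose, no normalization.

    ee_poses: (T, D) trajectory of end-effector poses in the robot base frame."""
    return ee_poses[1:]  # the action at step t is simply the pose at step t+1

def delta_ee_actions_normalized(ee_poses, stats=None):
    """Diffusion-Policy-style actions: per-step deltas in the robot frame,
    min-max normalized to [-1, 1]. Rotation components are differenced
    naively here; a real implementation would compose rotations properly."""
    deltas = ee_poses[1:] - ee_poses[:-1]      # (T-1, D) robot-frame deltas
    if stats is None:                          # in practice, compute over the whole dataset
        stats = {"min": deltas.min(axis=0), "max": deltas.max(axis=0)}
    scale = np.maximum(stats["max"] - stats["min"], 1e-8)  # avoid divide-by-zero
    normalized = 2.0 * (deltas - stats["min"]) / scale - 1.0
    return normalized, stats

def unnormalize_deltas(normalized, stats):
    """Invert the normalization at inference time before commanding the robot."""
    scale = stats["max"] - stats["min"]
    return (normalized + 1.0) / 2.0 * scale + stats["min"]
```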

3. Sample Efficiency Claims: Truth or Fiction?

What We Expected: Intuitively, diffusion models should excel at fitting complex, non-linear action distributions compared to simpler approaches like VAEs. Additionally, we expected 3D Diffusion Policy to perform better by leveraging a more straightforward mapping between actions and visual information. The natural progression should have been: 3D Diffusion Policy > Diffusion Policy > VAE approaches like ACT.

What Our Experiments Show:

  • Low-Data Performance: With fewer than 100 episodes, ACT (based on VAE) consistently outperformed Diffusion Policy by about 10% in success rate—completely contrary to our expectations and intuition about these methods.
  • 3D Representations Underwhelm: Despite theoretical advantages, 3D Diffusion Policy using Azure Kinect largely performed simple replay behaviors with minimal adaptation abilities. It actually performed worse than both standard Diffusion Policy and VAE-based approaches.
  • Reversed Performance Hierarchy: We consistently observed VAE methods were more robust than Diffusion Policy, which in turn outperformed 3D Diffusion Policy—exactly the opposite of what theoretical advantages would suggest.

Our Takeaway: The expected sample efficiency narrative doesn't align with our experimental results. Different approaches excel in different scenarios, and the "most efficient" method depends heavily on specific task characteristics and implementation details that aren't always evident from theoretical considerations.

4. The Core Problem: Understanding vs. Mimicry

What We Hoped For: Models that truly understand manipulation concepts: what a door is, how handles work, the physics of pouring liquid, or the functional purpose of actions.

What We Found:

  • Limited Adaptation at Key Points: We didn't observe adaptive capabilities at critical moments like grasping and pouring. When an object was slightly moved from its training position, the robot would often attempt to grasp a "virtual object" at the original location rather than adapt to the new position.
  • No Failure Recovery: Our training data didn't contain recovery from mistakes (failed grasps were discarded during data collection). Consequently, the models showed no ability to recover from errors - they would continue executing the trajectory even when failing to grasp an object.
  • Interpolation Without Adaptation: If we collected data pouring water with a cup at positions A and B, placing the cup between A and B sometimes worked. But adaptation to positions outside this range or to different objects revealed limitations in true understanding.

Our Takeaway: Current models seem to rely more on trajectory playback with some interpolation capabilities rather than developing a deeper understanding of manipulation tasks. Their behavior suggests a form of sophisticated template matching that works well within the bounds of training examples, but lacks the conceptual understanding needed for robust generalization to new situations.

Scaling Laws and the Future of End-to-End Robotics

The Language Model Parallel: Can Scaling Laws Save Robotics?

Looking at the remarkable success of large language models, a natural question arises: Can we apply similar scaling laws to robotics? In language models, we've witnessed extraordinary improvements in generalization as we scale data and parameters. GPT-4, Claude, and others demonstrate how scaling unlocks emergent capabilities that weren't explicitly programmed.

Research from Yang Gao's team on "Data Scaling Laws in Imitation Learning for Robotic Manipulation" [5] suggests a similar phenomenon might be possible in robotics. Their finding that roughly 1,600 episodes collected across diverse environments can achieve 90% success for pouring water is encouraging.

But what would true generalist robotics require in terms of data?

Calculating Generalist Robot Data Requirements

Let's make some reasonable assumptions based on the data scaling paper mentioned above [5]:

  • 32 environment-object pairs per task for good generalization
  • 50 basic manipulation skills to master
  • Non-linear scaling relationship between skills and data (following a power law)

Using a conservative exponent of 0.5 for the scaling law:

$$\text{Data needed} \propto N^{1/\alpha}$$

where $\alpha = 0.5$ (power-law exponent) and $N = 50$ (number of skills).

Substituting these values:

$$\text{Data needed} \propto 50^{1/0.5} = 50^{2} = 2,500$$

This means we need 2,500 times more data than for a single skill.

If a single skill needs 1,600 episodes for 90% success:

$$\text{Total episodes for basic skills} = 1,600 \text{ episodes} \times 2,500 = 4,000,000 \text{ episodes}$$
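
The arithmetic above is easy to sanity-check in a few lines of Python. This snippet just replays the stated assumptions (exponent, skill count, episodes per skill); it is a bookkeeping aid, not a model of the actual scaling behavior.

```python
alpha = 0.5                 # assumed power-law exponent
num_skills = 50             # basic manipulation skills to master
episodes_per_skill = 1_600  # episodes for ~90% success on one skill [5]

scaling_factor = num_skills ** (1 / alpha)               # 50^2 = 2,500
basic_skill_episodes = episodes_per_skill * scaling_factor

print(f"scaling factor: {scaling_factor:,.0f}")            # 2,500
print(f"basic-skill episodes: {basic_skill_episodes:,.0f}")  # 4,000,000
```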

Long-Horizon Task Combinations

For real-world tasks that combine multiple skills (assuming each task combines 5 basic skills on average), we need to consider how much additional data is required for each combination.

First, let's calculate the total number of possible 5-skill combinations:

$$\text{Possible combinations} = \binom{n}{k} = \binom{50}{5} = \frac{50!}{5!\,45!} = \frac{50 \times 49 \times 48 \times 47 \times 46}{5 \times 4 \times 3 \times 2 \times 1} = \frac{254,251,200}{120} = 2,118,760$$

Now, assuming a conservative linear relationship between task length and data requirements, each 5-skill combination would need additional training data beyond what's required to learn the individual skills. Let's assume each combination requires 20% more data than the sum of its individual skills:

$$\text{Data per combination} = 5 \times 1,600 \times 1.2 = 9,600 \text{ episodes}$$

If we needed to learn all possible combinations:

$$\text{Total combination data} = 2,118,760 \times 9,600 = 20,340,096,000 \text{ episodes}$$

This is clearly infeasible. Even if we learn just 0.01% of all possible combinations:

$$\text{Practical combinations} = 2,118,760 \times 0.0001 \approx 212 \text{ combinations}$$

$$\text{Practical combination data} = 212 \times 9,600 = 2,035,200 \text{ episodes}$$

Therefore, the total data requirement becomes:

$$\text{Total data requirement} = 4,000,000 + 2,035,200 = 6,035,200 \text{ episodes}$$
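
The same kind of sanity check works for the combination estimate. Again, the 20% overhead and 0.01% coverage are assumptions carried over from the text, not measured quantities.

```python
from math import comb

num_skills = 50
skills_per_task = 5
episodes_per_skill = 1_600
overhead = 1.2               # assumed 20% extra data per combination
coverage = 0.0001            # learn only 0.01% of all combinations
basic_skill_episodes = 4_000_000  # from the previous estimate

possible = comb(num_skills, skills_per_task)                        # 2,118,760
per_combination = skills_per_task * episodes_per_skill * overhead   # 9,600
practical = round(possible * coverage)                              # 212
combination_episodes = practical * per_combination                  # 2,035,200
total = basic_skill_episodes + combination_episodes                 # 6,035,200

print(f"{possible:,} possible combinations")
print(f"{total:,.0f} total episodes")
```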

Conclusion

Throughout this analysis, I've been grappling with a sobering realization: our back-of-the-envelope calculations only scratch the surface of the true challenge. While collecting millions of episodes for basic skills might be feasible with enough parallelization and time, the deeper complexity of generalist robotics presents hurdles that go far beyond simple data scaling.

Task planning—the ability to sequence actions toward a goal—introduces a level of complexity that likely doesn't scale linearly with data. The decision-making process in long-horizon tasks is inherently sparse; critical decision points might occur only occasionally within a trajectory, making the signal-to-noise ratio extremely unfavorable. To develop genuine reasoning capabilities, we might need billions of trajectories to capture enough meaningful decision points across diverse contexts.

Even more daunting is the challenge of building a unified latent space between language and action. Unlike language models that can scrape the internet for nearly unlimited text, robotic data requires physical interactions that cannot be synthetically generated with the same fidelity. Each trajectory needs meaningful labeling to connect abstract concepts to physical behaviors—a process that's both labor-intensive and conceptually complex.

What's becoming clear is that today's vision-language-action (VLA) models face a fundamental data bottleneck. While language models can scale through self-supervised learning on readily available text, robotics lacks an equivalent data source. The discrete skills and combinations we've discussed represent just the foundation; building a true generalist that can reason about novel tasks requires orders of magnitude more data with rich semantic connections.

This isn't to say the challenge is insurmountable, but rather that we need to be realistic about the path forward. Approaches that can learn from fewer examples, leverage simulation intelligently, or build compositional understanding will be essential. The generalist robot of our imagination isn't impossible—it just might require rethinking our approach to learning itself.

Citation

Please cite this work as:

@misc{robotics-scaling-limitations,
  author    = {Tianrun Hu},
  title     = {Beyond the Papers: Scaling Laws and Data Requirements in End-to-End Robotics},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://tianrunhu.github.io/blog/posts/robot-il-generalization-thinking.html}
}

References

  1. Fu, Z., Zhao, T. Z., & Finn, C. (2024). Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation. arXiv preprint arXiv:2401.02117.
  2. Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., & Song, S. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In Proceedings of Robotics: Science and Systems (RSS).
  3. Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., & Song, S. (2024). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. The International Journal of Robotics Research.
  4. Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., & Xu, H. (2024). 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations. In Proceedings of Robotics: Science and Systems (RSS).
  5. Lin, F., Hu, Y., Sheng, P., Wen, C., You, J., & Gao, Y. (2025). Data Scaling Laws in Imitation Learning for Robotic Manipulation. arXiv preprint arXiv:2410.18647.