
EgoScale

Scaling Human Video to Unlock Dexterous Robot Intelligence

Feb 19, 2026

Human behavior is among the most scalable sources of data for learning physical intelligence, yet how to leverage it effectively for dexterous manipulation remains an open question. While prior work demonstrates human-to-robot transfer in constrained settings, whether large-scale human data can support fine-grained, high-degree-of-freedom dexterous manipulation has not been established. In this work, we show that effective dexterous human-to-robot transfer is fundamentally a scaling phenomenon, and present EgoScale, a human-to-dexterous-manipulation transfer framework built on large-scale egocentric human data. We train a vision–language–action (VLA) model on over 20k hours of action-labeled egocentric human video, more than 20× larger than prior efforts, and uncover a log-linear scaling law between human data scale and validation loss. This loss strongly correlates with downstream real-robot performance, establishing large-scale human data as a predictable supervision source. Beyond scale, we identify a simple transfer recipe: combining large-scale human pretraining with a small amount of aligned human–robot mid-training. This recipe enables strong long-horizon dexterous manipulation and one-shot task adaptation with minimal robot supervision. The resulting policy improves average success rate by 54% over a no-pretraining baseline on a 22-DoF robotic hand, and also transfers effectively to robots with lower-DoF hands, indicating that large-scale human motion provides a reusable, embodiment-agnostic motor prior.


Human-to-Robot Learning Framework

EgoScale human-to-robot learning pipeline

A flow-based Vision-Language-Action (VLA) policy is first pretrained on 20,854 hours of egocentric human videos using wrist motion and retargeted dexterous hand actions. A lightweight mid-training stage with aligned human–robot play data (pairs highlighted with green and gray boundaries) adapts the representation to robot sensing and control. The resulting policy is post-trained on downstream tasks, enabling efficient learning of dexterous manipulation and one-shot generalization to unseen skills.
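The staged recipe above can be summarized in a few lines of code. Below is a minimal, self-contained Python sketch of the three stages; the function names, step counts, and data iterators are illustrative placeholders and not the EgoScale training code.

```python
# Minimal sketch of the three-stage recipe described above.
# Stage names mirror the caption; step counts, the `policy_step` callable, and the
# data iterators are illustrative placeholders, not the EgoScale training code.
from typing import Callable, Iterator

def run_stage(policy_step: Callable, batches: Iterator, num_steps: int, stage: str) -> None:
    """Run a fixed number of gradient updates on one data source."""
    for step in range(num_steps):
        loss = policy_step(next(batches))  # one optimizer update, returns a scalar loss
        if step % 1000 == 0:
            print(f"[{stage}] step={step} loss={loss:.4f}")

def three_stage_recipe(policy_step, human_batches, aligned_play_batches, task_demo_batches):
    # 1) Large-scale pretraining on action-labeled egocentric human video
    #    (wrist motion + retargeted dexterous hand actions).
    run_stage(policy_step, human_batches, num_steps=200_000, stage="human-pretrain")
    # 2) Lightweight mid-training on aligned human-robot play pairs to adapt the
    #    representation to robot sensing and control.
    run_stage(policy_step, aligned_play_batches, num_steps=10_000, stage="midtrain")
    # 3) Post-training on downstream robot tasks (as few as one demo per task).
    run_stage(policy_step, task_demo_batches, num_steps=5_000, stage="posttrain")
```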



Model Architecture

EgoScale model architecture

A flow-based VLA policy with a VLM backbone and DiT action expert. Human and robot data are unified through a common wrist-level action representation, with lightweight embodiment-specific adapters handling proprioceptive inputs and hand actions.
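For readers who prefer code, here is a toy PyTorch sketch of this design: a shared backbone feeding a flow-matching action expert, with small per-embodiment adapters for proprioception. The module sizes, the plain MLP stand-ins for the VLM backbone and DiT expert, and the shared action dimensionality are all illustrative assumptions, not the actual EgoScale implementation.

```python
# Toy sketch of the described design (NOT the actual EgoScale code): shared backbone,
# per-embodiment proprioception adapters, and a flow-matching action expert.
import torch
import torch.nn as nn


class EmbodimentAdapter(nn.Module):
    """Maps embodiment-specific proprioception into the shared feature space."""
    def __init__(self, proprio_dim: int, d_model: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(proprio_dim, d_model), nn.GELU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, proprio: torch.Tensor) -> torch.Tensor:
        return self.net(proprio)


class FlowActionExpert(nn.Module):
    """Predicts the flow (velocity) from noisy actions, context features, and time t."""
    def __init__(self, action_dim: int, d_model: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(action_dim + d_model + 1, 512), nn.GELU(),
                                 nn.Linear(512, action_dim))

    def forward(self, noisy_actions, context, t):
        return self.net(torch.cat([noisy_actions, context, t], dim=-1))


class ToyVLAPolicy(nn.Module):
    """Stand-in for the VLM backbone + DiT action expert.

    Per-embodiment hand-action heads are omitted for brevity; dimensions are made up."""
    def __init__(self, obs_dim: int = 512, action_dim: int = 26, proprio_dims: dict = None):
        super().__init__()
        proprio_dims = proprio_dims or {"human": 6, "robot": 22}
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 256), nn.GELU())
        self.adapters = nn.ModuleDict(
            {name: EmbodimentAdapter(dim) for name, dim in proprio_dims.items()})
        self.expert = FlowActionExpert(action_dim)

    def flow_matching_loss(self, obs, proprio, actions, embodiment: str):
        ctx = self.backbone(obs) + self.adapters[embodiment](proprio)
        t = torch.rand(actions.shape[0], 1)          # flow time in [0, 1)
        noise = torch.randn_like(actions)
        noisy = (1 - t) * noise + t * actions        # linear interpolation path
        target_velocity = actions - noise            # flow-matching regression target
        pred = self.expert(noisy, ctx, t)
        return nn.functional.mse_loss(pred, target_velocity)
```

In this sketch, human and robot batches share the same loss; only the `embodiment` key selects a different adapter, e.g. `policy.flow_matching_loss(obs, proprio, actions, embodiment="human")` versus `embodiment="robot"`.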


EgoScale Focuses on 5 Highly Dexterous Tasks

Autonomous policy rollout at 8× speed

Shirt Rolling. The robot coordinates both hands to alternately fold and roll a T-shirt into a cylindrical shape before placing it into a basket.


Pre-training Boosts Robot Task Performance


Comparison of Human Pretrain + Midtrain, Human Pretrain, Midtrain Only, and No Pretrain across five dexterous manipulation tasks under two evaluation metrics.


Policy Performance Scales with Pretraining Data Size

Pretraining data scale: 1k, 2k, 4k, 10k, and 20k hours.

Scaling behavior of human pretraining. Left: Human validation loss versus training steps for models pretrained with increasing amounts of egocentric human data (1k–20k hours). Larger datasets yield stable, monotonic improvements, while smaller datasets exhibit early overfitting. Center: Optimal validation loss at convergence as a function of human data scale, revealing a near-perfect log-linear scaling law (R²=0.9983). Right: Downstream robot performance after post-training, measured by average task completion score, improves consistently with increased human data scale. Together, these results demonstrate predictable scaling of learned action representations and their direct translation to improved dexterous manipulation performance.
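The log-linear fit itself is simple to reproduce in principle. The sketch below shows how such a fit and its R² could be computed with NumPy; the hour and loss values are illustrative placeholders, not the paper's measurements.

```python
# Fit a log-linear scaling law, loss ≈ a + b * log10(hours).
# The hours/loss arrays below are illustrative placeholders, NOT the paper's numbers.
import numpy as np

hours = np.array([1e3, 2e3, 4e3, 1e4, 2e4])           # pretraining data scale (hours)
val_loss = np.array([0.42, 0.39, 0.36, 0.33, 0.31])    # converged validation loss (made up)

x = np.log10(hours)
b, a = np.polyfit(x, val_loss, deg=1)                  # slope, intercept
pred = a + b * x

# Coefficient of determination of the fit.
ss_res = np.sum((val_loss - pred) ** 2)
ss_tot = np.sum((val_loss - val_loss.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"loss ≈ {a:.3f} + {b:.3f} * log10(hours),  R² = {r2:.4f}")
```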


Aligned Human-Robot Mid-Training Enables One-Shot Transfer to New Tasks and Generalization to New Embodiments

Autonomous policy rollout at 8× speed

Fold Towel (Mid-training) → Fold Shirt (One-shot)

Comparison: Midtrain Only, Human Pretrain, and Human Pretrain + Midtrain.

Aligned mid-training enables emergent one-shot transfer. During post-training, the policy is trained on only a single robot demonstration per task, together with aligned human demonstrations (100 trajectories per object).
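As a rough illustration of this post-training data composition, the snippet below mixes a single robot demonstration per task with the aligned human demonstrations via simple weighted sampling. The sampling ratio and data structures are assumptions for illustration, not the paper's exact settings.

```python
# Sketch of composing the one-shot post-training batches described above:
# one robot demonstration per task mixed with aligned human demonstrations.
# The 50/50 ratio and list-of-trajectory format are illustrative assumptions.
import random

def sample_posttrain_batch(robot_demos, human_demos, batch_size=32, robot_fraction=0.5):
    """Draw a batch that oversamples the (single) robot demo relative to its raw count."""
    batch = []
    for _ in range(batch_size):
        if random.random() < robot_fraction:
            batch.append(random.choice(robot_demos))   # e.g. 1 trajectory per task
        else:
            batch.append(random.choice(human_demos))   # e.g. 100 aligned trajectories per object
    return batch
```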

22-DoF Dexterous Hands Manipulation · Large-Scale Human Data Pre-training · VLA · Scaling Law for Robot Learning
Authors:
Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, Trevor Darrell, Furong Huang, Yuke Zhu, Danfei Xu, Linxi Fan

Acknowledgements:
This work would not have been possible without the dedication and expertise of our robot operators: Ivy Tam, Ashley Kim, Aly Khater, Rhea Alve, Noah Huang, Ty Seligman, Mahnoor Kareem, Christian Soto, Chaitanya Kothapalli, Kendra Shu, Rex Asato, Eley Barba Preciados, and April Zitkovich, who provided invaluable support in large-scale data collection and evaluation. We are also grateful to Johan Bjorck, Scott Reed, Runyu Ding, and Joel Jang for their insightful discussions and feedback throughout the project.

BibTeX
@misc{zheng2026egoscalescalingdexterousmanipulation,
      title={EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data}, 
      author={Ruijie Zheng and Dantong Niu and Yuqi Xie and Jing Wang and Mengda Xu and Yunfan Jiang and Fernando Castañeda and Fengyuan Hu and You Liang Tan and Letian Fu and Trevor Darrell and Furong Huang and Yuke Zhu and Danfei Xu and Linxi Fan},
      year={2026},
      eprint={2602.16710},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.16710}, 
}