Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. However, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a fast and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, and dashcam footage, and supports various camera models such as pinhole, wide-angle, and 360° panoramas.
Below we show some interactive 4D point clouds estimated by ViPE from raw videos. Click on the thumbnails to view the corresponding output. (Note that it might take a while to load the visualization.)
Visualizer code is borrowed and modified from SpatialTrackerV2.
ViPE estimates per-frame camera intrinsics, poses, and dense metric depth maps by solving a dense bundle adjustment problem over keyframes, combining three complementary constraints: (1) a dense flow constraint from the DROID-SLAM network for robust correspondence; (2) a sparse point constraint from the cuvslam library for sub-pixel accuracy; and (3) depth regularization from monocular metric depth networks to resolve scale ambiguity and enforce consistency. A smooth depth alignment step fuses BA-derived depth with video depth estimates to yield temporally consistent, high-resolution metric depth.
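To give a rough sense of what the depth alignment step does, below is a minimal Python sketch. It assumes a simple per-frame scale/shift least-squares fit between the low-resolution BA-derived depth and the high-resolution video depth, followed by exponential smoothing of the alignment parameters over time; the function name, the zero-as-invalid convention, and the smoothing scheme are illustrative assumptions, not the actual ViPE implementation.

```python
import numpy as np

def align_video_depth(ba_depth, video_depth, smooth_weight=0.9):
    """Fuse low-res BA-derived metric depth with high-res video depth (sketch).

    ba_depth:    list of (h, w) arrays from bundle adjustment; 0 marks invalid pixels.
    video_depth: list of (H, W) arrays from a video depth network (relative scale).
    Returns a list of (H, W) metric depth maps with temporally smoothed scale/shift.
    """
    aligned, prev_scale, prev_shift = [], None, None
    for d_ba, d_vid in zip(ba_depth, video_depth):
        # Downsample the video depth to the BA resolution for the fit.
        h, w = d_ba.shape
        ys = np.linspace(0, d_vid.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, d_vid.shape[1] - 1, w).astype(int)
        d_small = d_vid[np.ix_(ys, xs)]

        # Use only pixels where BA produced a valid depth.
        mask = d_ba > 0
        x, y = d_small[mask], d_ba[mask]

        # Per-frame scale/shift via linear least squares: y ~ scale * x + shift.
        A = np.stack([x, np.ones_like(x)], axis=1)
        scale, shift = np.linalg.lstsq(A, y, rcond=None)[0]

        # Temporally smooth the alignment parameters for consistency across frames.
        if prev_scale is not None:
            scale = smooth_weight * prev_scale + (1 - smooth_weight) * scale
            shift = smooth_weight * prev_shift + (1 - smooth_weight) * shift
        prev_scale, prev_shift = scale, shift

        # Apply the smoothed alignment to the full-resolution video depth.
        aligned.append(scale * d_vid + shift)
    return aligned
```

The key idea this sketch tries to convey is that the BA depth anchors the metric scale while the video depth provides high-resolution detail, and smoothing the alignment parameters (rather than the depth values themselves) keeps the output temporally consistent.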
On the TUM-RGBD and KITTI/RDS benchmarks, ViPE surpasses MegaSAM, VGGT, and MASt3R-SLAM in both pose and intrinsics accuracy, while running at 3-5 FPS on a single GPU. It produces more scale-consistent trajectories than prior works, and achieves strong depth accuracy on Sintel and ETH3D. We refer readers to the technical report for more details.
We will release the data download link on Hugging Face soon. Please stay tuned!
We use ViPE to annotate a large-scale collection of videos. In total, the collection contains approximately 96M frames, all annotated with accurate camera poses and dense depth maps. We hope this dataset will help accelerate the development of spatial AI systems. Please click on the following buttons to explore different splits of the dataset.
Wild-SDG-1M Dataset: We sampled 1M videos from video diffusion models using our in-house curated and balanced video prompts, and annotated all the sampled frames using ViPE, resulting in ~78 million frames in total.
@inproceedings{huang2025vipe,
title={ViPE: Video Pose Engine for 3D Geometric Perception},
author={Huang, Jiahui and Zhou, Qunjie and Rabeti, Hesam and Korovko, Aleksandr and Ling, Huan and Ren, Xuanchi and Shen, Tianchang and Gao, Jun and Slepichev, Dmitry and Lin, Chen-Hsuan and Ren, Jiawei and Xie, Kevin and Biswas, Joydeep and Leal-Taixe, Laura and Fidler, Sanja},
booktitle={NVIDIA Research Whitepapers},
year={2025}
}