Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. However, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a fast and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, and dashcam footage, and supports various camera models such as pinhole, wide-angle, and 360° panoramas.
Below we show some interactive 4D point clouds estimated by ViPE from raw videos. Click on the thumbnails to view the corresponding output. (Note that it might take a while to load the visualization.)
Visualizer code is borrowed and modified from SpatialTrackerV2.
ViPE estimates per-frame camera intrinsics, poses, and dense metric depth maps by solving a dense bundle adjustment problem over keyframes, combining three complementary constraints: (1) a dense flow constraint from the DROID-SLAM network for robust correspondence; (2) a sparse point constraint from the cuvslam library for sub-pixel accuracy; and (3) depth regularization from monocular metric depth networks to resolve scale ambiguity and enforce consistency. A smooth depth alignment step fuses BA-derived depth with video depth estimates to yield temporally consistent, high-resolution metric depth.
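To give a rough sense of what the depth alignment step does, below is a minimal Python sketch. It assumes a simple per-frame scale/shift least-squares fit between the low-resolution BA-derived depth and the high-resolution video depth, followed by exponential smoothing of the alignment parameters over time; the function name, the zero-as-invalid convention, and the smoothing scheme are illustrative assumptions, not the actual ViPE implementation.

```python
import numpy as np

def align_video_depth(ba_depth, video_depth, smooth_weight=0.9):
    """Fuse low-res BA-derived metric depth with high-res video depth (sketch).

    ba_depth:    list of (h, w) arrays from bundle adjustment; 0 marks invalid pixels.
    video_depth: list of (H, W) arrays from a video depth network (relative scale).
    Returns a list of (H, W) metric depth maps with temporally smoothed scale/shift.
    """
    aligned, prev_scale, prev_shift = [], None, None
    for d_ba, d_vid in zip(ba_depth, video_depth):
        # Downsample the video depth to the BA resolution for the fit.
        h, w = d_ba.shape
        ys = np.linspace(0, d_vid.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, d_vid.shape[1] - 1, w).astype(int)
        d_small = d_vid[np.ix_(ys, xs)]

        # Use only pixels where BA produced a valid depth.
        mask = d_ba > 0
        x, y = d_small[mask], d_ba[mask]

        # Per-frame scale/shift via linear least squares: y ~ scale * x + shift.
        A = np.stack([x, np.ones_like(x)], axis=1)
        scale, shift = np.linalg.lstsq(A, y, rcond=None)[0]

        # Temporally smooth the alignment parameters for consistency across frames.
        if prev_scale is not None:
            scale = smooth_weight * prev_scale + (1 - smooth_weight) * scale
            shift = smooth_weight * prev_shift + (1 - smooth_weight) * shift
        prev_scale, prev_shift = scale, shift

        # Apply the smoothed alignment to the full-resolution video depth.
        aligned.append(scale * d_vid + shift)
    return aligned
```

The key idea this sketch tries to convey is that the BA depth anchors the metric scale while the video depth provides high-resolution detail, and smoothing the alignment parameters (rather than the depth values themselves) keeps the output temporally consistent.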
On the TUM-RGBD and KITTI/RDS benchmarks, ViPE surpasses MegaSAM, VGGT, and MASt3R-SLAM in both pose and intrinsics accuracy, while running at 3-5 FPS on a single GPU. It produces more scale-consistent trajectories than prior works, and achieves strong depth accuracy on Sintel and ETH3D. We refer readers to the technical report for more details.
We will release the data download link on Hugging Face soon. Please stay tuned!
We use ViPE to annotate a large-scale collection of videos. In total, the collection contains approximately 96M frames, all annotated with accurate camera poses and dense depth maps. We hope this dataset will help accelerate the development of spatial AI systems. Please click on the following buttons to explore different splits of the dataset.
Wild-SDG-1M Dataset: We sampled 1M videos from video diffusion models using our in-house curated and balanced video prompts, and annotated all the sampled frames using ViPE, resulting in ~78 million frames in total.
@inproceedings{huang2025vipe,
title={ViPE: Video Pose Engine for 3D Geometric Perception},
author={Huang, Jiahui and Zhou, Qunjie and Rabeti, Hesam and Korovko, Aleksandr and Ling, Huan and Ren, Xuanchi and Shen, Tianchang and Gao, Jun and Slepichev, Dmitry and Lin, Chen-Hsuan and Ren, Jiawei and Xie, Kevin and Biswas, Joydeep and Leal-Taixe, Laura and Fidler, Sanja},
booktitle={NVIDIA Research Whitepapers},
year={2025}
}