4DP-QA:
Scalable QA for 4D Perception
in Vision Language Models

¹NVIDIA  ·  ²Yale University  ·  ³KAIST AI
Work done during an internship at NVIDIA.
TL;DR A scalable QA pipeline that teaches VLMs to understand motion in 4D. 4DP-QA turns geometric scene data into 400K motion-focused QA pairs (and a 2.2K benchmark), and introduces True-Motion Point Tracking to disentangle camera and object motion.

Equipping VLMs with 4D understanding

The world is 4D (3D + motion), but VLMs only ever see its 2D projection, with camera and object motion tangled together.
Four 4D understanding capabilities: camera understanding, object motion, 3D spatial understanding, and true-motion tracking, comparing NVILA against our model.

Our framework equips VLMs with better 4D understanding for in-the-wild videos. Training a state-of-the-art VLM (NVILA) on 4DP-QA yields large gains across camera, object-motion, and 3D spatial questions. We also introduce true-motion point tracking, which isolates true object motion from camera movement.

Challenge (a)
Motion seen only in 2D

A camera projects the 4D world onto a 2D sensor, discarding depth cues that are essential for reasoning about how objects move through space.

Challenge (b)
Entangled camera & object motion

A moving camera entangles its own motion with object motion, so apparent 2D tracks rarely reflect the true 3D motion of an object.

Our answer
Data + True-Motion Tracking

A scalable QA pipeline built on accurate geometry, plus a fixed-reference tracking task that disentangles motion, together teaching VLMs fine-grained 4D perception.

Abstract

TL;DR: Quality 4D data, with disentangled motion, makes 4D understanding emerge in standard VLMs.

Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion.

To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We pay particular attention to the entanglement of camera and object motion: we cast tracking both in the traditional way and in a novel fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion.

From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.

A scalable QA generation pipeline

Turning accurate geometry from real and synthetic sources into motion-focused QA pairs.
Dataset generation pipeline: standardized 4D input data, question templates, asset sampling, discrete labels, and a QA generator producing QA pairs.

The 4DP-QA generation pipeline. Standardized 4D input data (RGB, depth, instance segmentation, 6D object pose) is sampled and combined with pre-defined question templates and either discrete labels or continuous tracks to produce QA pairs for 13 question types.

QA generator

Instantiates pre-defined templates for each question type and uses an off-the-shelf LLM (Gemini-2.5-Pro) to produce diverse phrasings for questions and answers.
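The template step can be pictured with a minimal sketch. The template strings, the question-type key, and the `instantiate` helper below are hypothetical (the actual 4DP-QA templates are not listed on this page), and the LLM paraphrasing step is left out.

```python
import random

# Hypothetical templates for one question type; not the real 4DP-QA templates.
TEMPLATES = {
    "camera_object_distance_change": [
        "Does the camera move closer to or farther from the {object}?",
        "Over the video, does the distance between the camera and the {object} shrink or grow?",
    ],
}

def instantiate(question_type, answer, rng, **slots):
    """Pick one template for the question type and fill in its slots."""
    template = rng.choice(TEMPLATES[question_type])
    return {
        "type": question_type,
        "question": template.format(**slots),
        "answer": answer,
    }

# Seeded RNG so the same sample always gets the same template.
qa = instantiate(
    "camera_object_distance_change", "closer", random.Random(0), object="red car"
)
print(qa["question"], "->", qa["answer"])
```

In a fuller pipeline, the filled-in question and answer would then be passed to the LLM for rephrasing; that call is omitted here because its prompt and interface are not described.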

Asset sampling

Heuristics and dataset-specific thresholds select video segments and objects that exhibit the motion characteristics required by each question.
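One such heuristic might look like the sketch below, which keeps only objects whose 3D path is long enough to ask a motion question about. The path-length criterion and the 0.5 m default are assumptions for illustration; the actual heuristics and per-dataset thresholds are not given here.

```python
import numpy as np

def moving_enough(track_world, min_path_length=0.5):
    """True if the summed frame-to-frame displacement reaches the threshold.

    track_world: (T, 3) object positions in a shared world frame (metres).
    min_path_length: hypothetical threshold, not a value from the paper.
    """
    steps = np.diff(np.asarray(track_world, dtype=float), axis=0)
    return float(np.linalg.norm(steps, axis=1).sum()) >= min_path_length

def select_objects(tracks_by_id, min_path_length=0.5):
    """Return the ids of objects that pass the motion filter."""
    return [
        obj_id
        for obj_id, track in tracks_by_id.items()
        if moving_enough(track, min_path_length)
    ]

tracks = {
    1: [[0, 0, 0], [1, 0, 0]],    # moves 1 m
    2: [[0, 0, 0], [0.1, 0, 0]],  # moves 10 cm
}
selected = select_objects(tracks)
```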

Discrete-label generation

Maps continuous geometry (translations, rotations, distances) into human-aligned categorical labels (e.g. closer / farther, pan left / right).
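As an illustration of this mapping, the sketch below turns camera and object positions into a closer / farther label. The function name and the 0.5 m ambiguity threshold are assumptions; the thresholds actually used in the pipeline are not stated on this page.

```python
import numpy as np

def distance_change_label(cam_positions, obj_positions, min_change=0.5):
    """Label how the camera-object distance changes from first to last frame.

    cam_positions, obj_positions: (T, 3) arrays in a shared world frame (metres).
    min_change: hypothetical threshold below which the sample is skipped.
    """
    cam = np.asarray(cam_positions, dtype=float)
    obj = np.asarray(obj_positions, dtype=float)
    dist = np.linalg.norm(obj - cam, axis=1)
    delta = dist[-1] - dist[0]
    if abs(delta) < min_change:
        return None  # change too small to label reliably
    return "closer" if delta < 0 else "farther"

camera = [[0, 0, 0], [0, 0, 1], [0, 0, 2]]  # camera advances along +z
target = [[0, 0, 5]] * 3                    # static object
label = distance_change_label(camera, target)
```

Returning `None` for small changes is one way to keep ambiguous samples out of the dataset, in the spirit of labels that match human perception.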

Standardized 4D input

Five data sources are normalized to a common format of coordinates, resolution, camera parameters, depth, segmentation, and 6D object poses.
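A common format of this kind could be represented by a container like the one below. The class name, field names, and array layouts are assumptions made for illustration, not the schema used by the authors.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class Clip4D:
    """Hypothetical container for one standardized clip."""
    rgb: np.ndarray           # (T, H, W, 3) uint8 frames
    depth: np.ndarray         # (T, H, W) metric depth
    instance_seg: np.ndarray  # (T, H, W) integer instance ids
    intrinsics: np.ndarray    # (3, 3) pinhole camera matrix K
    cam_to_world: np.ndarray  # (T, 4, 4) camera poses
    object_to_world: dict     # instance id -> (T, 4, 4) 6D object poses

    def validate(self):
        """Check that all arrays agree on frame count and resolution."""
        t, h, w, _ = self.rgb.shape
        expected = {
            "depth": (self.depth, (t, h, w)),
            "instance_seg": (self.instance_seg, (t, h, w)),
            "intrinsics": (self.intrinsics, (3, 3)),
            "cam_to_world": (self.cam_to_world, (t, 4, 4)),
        }
        for obj_id, poses in self.object_to_world.items():
            expected[f"object {obj_id}"] = (poses, (t, 4, 4))
        for name, (array, shape) in expected.items():
            if array.shape != shape:
                raise ValueError(f"{name}: expected {shape}, got {array.shape}")
        return True

T, H, W = 2, 4, 6
identity_poses = np.tile(np.eye(4), (T, 1, 1))
clip = Clip4D(
    rgb=np.zeros((T, H, W, 3), dtype=np.uint8),
    depth=np.ones((T, H, W)),
    instance_seg=np.zeros((T, H, W), dtype=int),
    intrinsics=np.eye(3),
    cam_to_world=identity_poses,
    object_to_world={1: identity_poses.copy()},
)
```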

3D point-track extraction

3D point tracks are recovered from depth maps, camera poses, and 6D object poses, then projected into both visual and true-motion 2D tracks.
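The recovery step can be sketched as follows, assuming a pinhole camera, z-depth (distance along the optical axis), rigid objects, and 4x4 homogeneous pose matrices. The function names are illustrative and the authors' exact procedure may differ.

```python
import numpy as np

def unproject(uv, depth, K, cam_to_world):
    """Lift pixel (u, v) with metric z-depth to a 3D point in world coordinates."""
    u, v = uv
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray with z = 1
    p_cam = ray * depth                             # point in camera coordinates
    return (cam_to_world @ np.append(p_cam, 1.0))[:3]

def track_rigid_point(p_world0, obj_to_world):
    """Carry a point attached to a rigid object through its per-frame poses.

    obj_to_world: (T, 4, 4) object poses; the point is anchored at frame 0.
    Returns a (T, 3) world-space track.
    """
    p_obj = np.linalg.inv(obj_to_world[0]) @ np.append(p_world0, 1.0)
    return (obj_to_world @ p_obj)[:, :3]

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
p0 = unproject((50, 50), 2.0, K, np.eye(4))  # pixel at the principal point

poses = np.tile(np.eye(4), (2, 1, 1))
poses[1, 0, 3] = 1.0                         # object shifts 1 m along +x
track_3d = track_rigid_point(p0, poses)
```

Because the point is expressed once in the object's own frame, the resulting track stays defined in frames where the point is occluded or out of view.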

Human validation

Multiple rounds of human validation ensure the QA pairs in the benchmark are accurate, unambiguous, and aligned with human perception.

True-Motion Point Tracking

A new perceptual task: track an object as if seen from a fixed reference camera.

A 3D point track on the cat, imaged across the moving camera frames C[t−2], C[t−1], C[t]. True-motion tracking re-images the same track through a single fixed reference camera, recovering motion as seen from a stationary viewpoint.

Visual point tracking images a 3D point track through the moving camera, so the object and camera motion stay entangled. True-motion point tracking instead images the same track through a fixed reference camera, recovering motion as it would appear from a stationary viewpoint.
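The difference between the two projections fits in a few lines. This is a minimal sketch assuming a pinhole camera with shared intrinsics, 4x4 camera-to-world poses, and frame 0 as the reference camera; the function names are illustrative, and the camera translates rather than pans to keep the numbers simple.

```python
import numpy as np

def project(points_world, K, cam_to_world):
    """Project (T, 3) world points through per-frame (T, 4, 4) cameras to (T, 2) pixels."""
    ones = np.ones((len(points_world), 1))
    homo = np.concatenate([points_world, ones], axis=1)
    world_to_cam = np.linalg.inv(cam_to_world)  # batched inverse
    p_cam = np.einsum("tij,tj->ti", world_to_cam, homo)[:, :3]
    uvw = p_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def visual_and_true_motion_tracks(track_world, K, cam_to_world, ref=0):
    """Image one 3D track through the moving camera and through a frozen one."""
    visual = project(track_world, K, cam_to_world)
    fixed = np.repeat(cam_to_world[ref:ref + 1], len(track_world), axis=0)
    true_motion = project(track_world, K, fixed)
    return visual, true_motion

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
cams = np.tile(np.eye(4), (3, 1, 1))
cams[:, 0, 3] = [0.0, 1.0, 2.0]           # camera moves 1 m per frame along +x
track = np.array([[0.0, 0.0, 5.0],
                  [0.5, 0.0, 5.0],
                  [1.0, 0.0, 5.0]])       # object moves 0.5 m per frame along +x
visual, true_motion = visual_and_true_motion_tracks(track, K, cams)
```

In this toy scene the visual track's horizontal coordinate goes 50, 40, 30 (the object seems to drift backward), while the true-motion track goes 50, 60, 70 (the object moves forward), which mirrors the cat example.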

The two are complementary: visual tracking captures dense appearance correspondences, while true-motion tracking teaches the model to reason about object motion in a stable, fixed reference system. We cast both as QA pairs, predicting normalized coordinates and an occlusion flag per target frame, and fold them into 4DP-QA.
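A tracking answer of this kind could be serialized as in the sketch below. The dictionary layout, key names, and rounding are assumptions; the page specifies only that normalized coordinates and an occlusion flag are predicted per target frame.

```python
import numpy as np

def track_to_answer(track_px, visible, width, height, decimals=3):
    """Turn a (T, 2) pixel track into per-frame normalized coordinates plus an occlusion flag."""
    size = np.array([width, height], dtype=float)
    norm = np.asarray(track_px, dtype=float) / size
    return [
        {
            "frame": t,
            "x": round(float(x), decimals),
            "y": round(float(y), decimals),
            "occluded": not bool(v),
        }
        for t, ((x, y), v) in enumerate(zip(norm, visible))
    ]

answer = track_to_answer([[50, 25], [100, 50]], [True, False], width=200, height=100)
```

Note that a true-motion track can leave the image bounds of the reference camera, so normalized values outside [0, 1] are possible in this sketch; how the dataset handles that case is not stated here.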

See it in motion

The same cat scene, animated. The camera pans right faster than the cat moves, so (a) visual tracks make the cat appear to drift backward, while (b) true-motion tracks factor out the camera motion and reveal that the cat is actually moving forward.

Visual vs. true-motion across diverse scenes

Across real and synthetic scenes, visual tracks stay entangled with camera motion, while true-motion tracks isolate each object's own 3D motion in a fixed reference frame.

Dataset taxonomy

13 question types organized into 4 categories, spanning camera, object, spatial, and tracking abilities.
I · Camera Motion
  • 1. Camera Movement
  • 2. Camera–Object Distance Change
II · Object Motion
  • 3. Rotation
  • 4. Direction
  • 5. Agent Motion
  • 6. Moved Distance
  • 7. Object–Object Distance Change
III · 3D Spatial Understanding
  • 8. Depth
  • 9. Object–Object Distance
  • 10. Multi-View Depth
  • 11. Multi-View Object–Object Distance
IV · Point Tracking
  • 12. Visual Point Tracking
  • 13. True-Motion Point Tracking

Data sources

We collect from driving, indoor, and simulation datasets spanning synthetic and real-world scenes, all standardized to a common format consumable by the pipeline.

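| Dataset | Type | Domain | # Videos | # Frames |
|---|---|---|---|---|
| SHIFT | Driving | Synthetic | 3,200 | 1.6M |
| Virtual KITTI 2 | Driving | Synthetic | 50 | 21.3K |
| Aria Digital Twin | Indoor, Egocentric | Real | 273 | 644K |
| HOT3D | Indoor, Egocentric | Real | 424 | 1M |
| Kubric | Simulation | Synthetic | 9.7K | 320K |
| Total | | | 13.6K | 3.6M |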

Results

Training on 4DP-QA lifts standard VLMs past the strongest proprietary baseline.

4DP-QA-Bench: per-category accuracy

Accuracy (%) averaged within each category. Fine-tuning on 4DP-QA (the "+ 4DP-QA" rows) improves every backbone by 34.6 to 42.1 points overall, and all three trained models surpass Gemini-2.5-Pro.

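| Model | Camera Motion | Object Motion | 3D Spatial | Overall | Δ |
|---|---|---|---|---|---|
| Random | 41.5 | 28.5 | 50.0 | 40.8 | |
| GPT-4o | 52.1 | 41.7 | 65.2 | 53.8 | |
| Gemini-2.5-Pro | 63.2 | 50.8 | 82.2 | 66.8 | |
| Qwen2.5-VL-3B | 47.1 | 36.9 | 54.4 | 46.7 | |
| Qwen2.5-VL-3B + 4DP-QA | 81.3 | 73.9 | 86.8 | 81.3 | +34.6 |
| Qwen2.5-VL-7B | 39.5 | 45.1 | 56.1 | 46.6 | |
| Qwen2.5-VL-7B + 4DP-QA | 84.4 | 79.6 | 88.1 | 84.3 | +37.7 |
| NVILA-Lite-8B | 42.4 | 26.0 | 55.4 | 42.3 | |
| NVILA-Lite-8B + 4DP-QA | 83.5 | 81.6 | 88.6 | 84.4 | +42.1 |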

All three fine-tuned models beat the strongest proprietary baseline (Gemini-2.5-Pro, 66.8%) by 14.5 to 17.6 points overall.
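The Δ column and the margin over Gemini-2.5-Pro follow directly from the Overall scores. The short check below recomputes both from the numbers reported in the table; the variable names are illustrative.

```python
# Overall accuracy (%) on 4DP-QA-Bench: (base model, after fine-tuning on 4DP-QA).
overall = {
    "Qwen2.5-VL-3B": (46.7, 81.3),
    "Qwen2.5-VL-7B": (46.6, 84.3),
    "NVILA-Lite-8B": (42.3, 84.4),
}
GEMINI_2_5_PRO = 66.8  # strongest proprietary baseline, overall accuracy (%)

# Gain from fine-tuning (the Δ column) and margin over Gemini-2.5-Pro, in points.
gains = {m: round(tuned - base, 1) for m, (base, tuned) in overall.items()}
margins = {m: round(tuned - GEMINI_2_5_PRO, 1) for m, (_, tuned) in overall.items()}

for model in overall:
    print(f"{model}: +{gains[model]} vs. base, +{margins[model]} vs. Gemini-2.5-Pro")
```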

Generalization to VLM4D

Training on 4DP-QA transfers to the external VLM4D 4D-reasoning benchmark. After fine-tuning, Qwen2.5-VL-7B outperforms every model evaluated, including Gemini-2.5-Pro (63.6 vs. 62.8 overall), and the small 3B model comes within 1.3 points of the Qwen3-VL-32B baseline (55.5 vs. 56.8 overall).

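| Model | Real | Synthetic | Overall | Δ |
|---|---|---|---|---|
| Gemini-2.5-Pro | 62.7 | 62.9 | 62.8 | |
| Qwen3-VL-32B | 57.0 | 56.0 | 56.8 | |
| Qwen2.5-VL-3B | 48.2 | 35.3 | 45.0 | |
| Qwen2.5-VL-3B + 4DP-QA | 55.0 | 56.9 | 55.5 | +10.5 |
| Qwen2.5-VL-7B | 52.9 | 50.6 | 52.3 | |
| Qwen2.5-VL-7B + 4DP-QA | 60.6 | 73.0 | 63.6 | +11.3 |
| NVILA-Lite-8B | 43.2 | 41.4 | 42.8 | |
| NVILA-Lite-8B + 4DP-QA | 56.4 | 73.3 | 60.5 | +17.7 |
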
Key takeaway

A scalable, geometry-grounded QA pipeline, paired with a tracking task that disentangles camera and object motion, is enough to make fine-grained 4D perception emerge in standard VLMs, and it transfers to external benchmarks.

True-motion tracking in the wild

NVILA-Lite-8B trained on 4DP-QA, predicting true-motion tracks on challenging real-world scenes.

Predicted true-motion tracks for dynamic scenes with camera motion. Prompted with “Project the 3D trajectory of {object} onto the image plane of frame 0”, the model summarizes object motion relative to the first frame, an intuitive output even under significant camera movement.

References

VLM backbones & data engine

  • NVILA · Liu et al., CVPR 2025. NVILA: Efficient Frontier Visual Language Models. [arXiv]
  • Qwen2.5-VL · Bai et al., Technical Report, 2025. Qwen2.5-VL Technical Report. [arXiv]
  • Gemini 2.5 · Comanici et al., Technical Report, 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. [arXiv]
  • L4P · Badki et al., 3DV 2026. L4P: Towards Unified Low-Level 4D Vision Perception. [arXiv] · [Project]

Data sources

  • SHIFT · Sun et al., CVPR 2022. SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation. [arXiv]
  • Virtual KITTI 2 · Cabon et al., arXiv 2020. Virtual KITTI 2. [arXiv]
  • Aria Digital Twin · Pan et al., ICCV 2023. Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception. [arXiv]
  • HOT3D · Banerjee et al., CVPR 2025. HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos. [arXiv]
  • Kubric · Greff et al., CVPR 2022. Kubric: A Scalable Dataset Generator. [arXiv]

Tracking & external benchmark

  • TAP-Vid · Doersch et al., NeurIPS 2022. TAP-Vid: A Benchmark for Tracking Any Point in a Video. [arXiv]
  • CoTracker · Karaev et al., ECCV 2024. CoTracker: It is Better to Track Together. [arXiv]
  • VLM4D · Zhou et al., ICCV 2025. VLM4D: Towards Spatiotemporal Awareness in Vision Language Models. [arXiv]

BibTeX

If you find this work useful, please cite our paper.
@article{cho20264dpqa,
  title   = {4DP-QA: Scalable QA for 4D Perception in Vision Language Models},
  author  = {Cho, Seokju and Badki, Abhishek and Su, Hang and Jiang, Jindong and
             Zeng, Ziyao and Kim, Seungryong and Liu, Sifei and Gallo, Orazio},
  journal = {arXiv preprint},
  year    = {2026},
}