NVIDIA Research
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
Shihao Wang, Guo Chen, De-An Huang, Zhiqi Li, Minghan Li, Guilin Liu, Jan Kautz, Jose M. Alvarez, Lei Zhang, Zhiding Yu
June 2026
Type: Conference paper
Publication: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Highlight
De-An Huang
Jan Kautz
Team Leader
Zhiding Yu
Related
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation