Token-Efficient Long Video Understanding for Multimodal LLMs

¹NVIDIA, ²Rutgers University, ³UC Berkeley, ⁴MIT, ⁵Nanjing University, ⁶KAIST. Equal contribution. *Work performed during internship at NVIDIA.

Teaser Example.

Comparison with Existing Video-LLMs

Comparison with Existing Video-LLMs.
Our approach outperforms all existing models while using significantly fewer tokens.

Abstract

Problem: Existing Video-LLMs often rely on frame-level representations without explicit temporal encoding, leading to inefficiencies in handling long video sequences and challenges in capturing temporal dynamics.

Solution: We propose STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), which integrates a Mamba-based temporal projector between the image encoder and the LLM to enrich visual tokens with temporal dynamics. This enriched encoding brings two benefits: (1) it improves the model's video reasoning capabilities by capturing temporal dynamics, and (2) by preserving critical dynamics across tokens, it inherently allows efficient downstream token reduction, such as test-time sampling and training-based temporal and spatial pooling. Extensive experiments demonstrate that our approach enhances long-context reasoning, achieves state-of-the-art performance, and reduces computational costs for visual inputs by up to 8×.

Method

Diagram of the STORM architecture.
STORM introduces a Mamba-based temporal projector to enrich the visual tokens with temporal dynamics. This not only improves the model's video reasoning capabilities but also preserves critical dynamics across tokens, making the tokens inherently suitable for downstream token reduction, such as test-time sampling and training-based temporal and spatial pooling.
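To make the idea concrete, below is a minimal, hypothetical sketch of what such a temporal projector could look like. It is not the released STORM implementation: the module name, the (batch, frames, tokens, dim) tensor layout, and the single forward temporal scan are illustrative assumptions, and the Mamba block is taken from the open-source `mamba-ssm` package (CUDA required).

```python
# Hypothetical sketch of a Mamba-based temporal projector (not the released
# STORM code). Assumed layout: (batch, frames, tokens_per_frame, dim).
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm


class TemporalProjector(nn.Module):
    """Enrich per-frame visual tokens with temporal context before the LLM."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(vision_dim)
        self.temporal_mixer = Mamba(d_model=vision_dim)   # scans along the frame axis
        self.proj = nn.Linear(vision_dim, llm_dim)        # map into the LLM embedding space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) = batch, frames, tokens per frame, vision dim
        B, T, N, D = x.shape
        # Treat each spatial location as a sequence over time so every
        # token can absorb the history of that location.
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, D)    # (B*N, T, D)
        x = x + self.temporal_mixer(self.norm(x))         # residual temporal mixing
        x = x.reshape(B, N, T, D).permute(0, 2, 1, 3)     # back to (B, T, N, D)
        return self.proj(x)                               # (B, T, N, llm_dim)
```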

Architecture

Token Compression Strategies

Diagram of token compression methods.
Temporal average pooling (left), spatial average pooling (middle), and training-free temporal token sampling (right). These methods can be applied individually or in combination.
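The three strategies above can be expressed as small standalone operations. The sketch below is illustrative rather than the released code; the function names and the (B, T, N, D) tensor layout are assumptions.

```python
# Illustrative, standalone versions of the three compression strategies above.
import torch
import torch.nn.functional as F


def temporal_avg_pool(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Average every `factor` consecutive frames (assumes T divisible by factor)."""
    B, T, N, D = x.shape
    return x.reshape(B, T // factor, factor, N, D).mean(dim=2)       # (B, T/f, N, D)


def spatial_avg_pool(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Average factor x factor patches within each frame (assumes a square token grid)."""
    B, T, N, D = x.shape
    side = int(N ** 0.5)
    grid = x.reshape(B * T, side, side, D).permute(0, 3, 1, 2)       # (B*T, D, H, W)
    grid = F.avg_pool2d(grid, kernel_size=factor)                    # (B*T, D, H/f, W/f)
    return grid.permute(0, 2, 3, 1).reshape(B, T, -1, D)             # N shrinks by factor^2


def temporal_token_sampling(x: torch.Tensor, keep_every: int = 2) -> torch.Tensor:
    """Training-free test-time sampling: keep the tokens of every `keep_every`-th frame."""
    return x[:, ::keep_every]
```

Because the temporal projector has already propagated motion information into each remaining token, these reductions can be applied individually or chained, either during training (pooling) or purely at test time (sampling).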

Model Architecture

Diagram of components of the STORM architecture.
Three key components of STORM: (1) a Mamba-Based Temporal Projector to integrate spatiotemporal information into visual tokens, (2) temporal token compression via training-free sampling and/or training-based pooling, and (3) spatial token compression to further optimize token efficiency. All compression methods, training-free and training-based, spatial and temporal, are independently applicable.

Critical Role of the Mamba Module

STORM vs VILA on VideoMME for various video lengths.
STORM (w/ Mamba) vs. the baseline VILA on longer video inputs. Our Mamba module is critical for enabling robust improvements and efficiency as video length increases. In contrast, simply extending the video length and applying token compression to the baseline VILA, without our Mamba module, yields diminished gains.

Mamba Enables Efficient Long Video Understanding

STORM efficiency and effectiveness when token compression methods are applied.
By compressing tokens before processing them in the LLM, STORM substantially lowers computational overhead, leading to faster inference (left and middle), while providing continuous performance gains on extended video inputs (right).
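To see roughly where the speedup comes from, a back-of-envelope count of visual tokens is sketched below. The frame count, per-frame token count, and the particular 2× temporal / 2×2 spatial split are assumptions chosen to illustrate the up-to-8× figure from the abstract, not the exact experimental setting.

```python
# Back-of-envelope token count (illustrative numbers, not the paper's exact
# configuration): 2x temporal pooling combined with 2x2 spatial pooling cuts
# the visual tokens handed to the LLM by 8x.
frames, tokens_per_frame = 128, 256
baseline = frames * tokens_per_frame                   # 32768 visual tokens

compressed = (frames // 2) * (tokens_per_frame // 4)   # 64 frames * 64 tokens = 4096
print(baseline, compressed, baseline / compressed)     # 32768 4096 8.0
```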

Comparison between VILA-Based Models on the Same Token Budget

Comparison between VILA-Based Models on the Same Token Budget.
We compare STORM variants against the baseline VILA model, all trained with identical data and pipelines under the same 8K token budget, i.e., the number of visual tokens provided to the LLM during training.

Examples on Long Video Inputs with Token Compression

Examples of how longer-video inputs plus token compression in STORM improves performance.
Longer-video inputs + token compression in STORM improves performance while maintaining low computational overhead.

Examples on Various Categories of Visual Reasoning Tasks

Information Synopsis
Examples of the Information Synopsis task.
Spatial Perception
Examples of the Spatial Perception task.
OCR Problem
Examples of the OCR Problem.
Temporal Reasoning
Examples of the Temporal Reasoning task.
Attribute Perception
Examples of the Attribute Perception task.

More Samples and Comparison with State-of-the-Art

Comparison with the state of the art on the Open-Ended Problem.