Token-Efficient Long Video Understanding for Multimodal LLMs

¹NVIDIA, ²Rutgers University, ³UC Berkeley, ⁴MIT, ⁵Nanjing University, ⁶KAIST. Equal contribution. *Work performed during internship at NVIDIA.

Teaser Example.

Comparison with Existing Video-LLMs

Comparison with Existing Video-LLMs.
Our approach outperforms all existing models while using significantly fewer tokens.

Abstract

Problem: Existing Video-LLMs often rely on frame-level representations without explicit temporal encoding, leading to inefficiencies in handling long video sequences and challenges in capturing temporal dynamics.

Solution: We propose STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), which integrates a Mamba-based temporal projector between the image encoder and the LLM to enrich visual tokens with temporal dynamics. This enriched encoding brings two benefits: (1) it improves the model's video reasoning capabilities by capturing temporal dynamics, and (2) by preserving critical dynamics across tokens, it inherently allows efficient downstream token reduction, such as test-time sampling and training-based temporal and spatial pooling. Extensive experiments demonstrate that our approach enhances long-context reasoning, achieves state-of-the-art performance, and reduces computational costs for visual inputs by up to 8×.

Method

Diagram of the STORM architecture.
STORM introduces a Mamba-based temporal projector to enrich the visual tokens with temporal dynamics. This not only improves the model's video reasoning capabilities but also preserves critical dynamics across tokens, making the tokens inherently suitable for downstream token reduction, such as test-time sampling and training-based temporal and spatial pooling.
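To make the idea concrete, below is a minimal, hypothetical sketch of what such a temporal projector could look like. It is not the released STORM implementation: the module name, the (batch, frames, tokens, dim) tensor layout, and the single forward temporal scan are illustrative assumptions, and the Mamba block is taken from the open-source `mamba-ssm` package (CUDA required).

```python
# Hypothetical sketch of a Mamba-based temporal projector (not the released
# STORM code). Assumed layout: (batch, frames, tokens_per_frame, dim).
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm


class TemporalProjector(nn.Module):
    """Enrich per-frame visual tokens with temporal context before the LLM."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(vision_dim)
        self.temporal_mixer = Mamba(d_model=vision_dim)   # scans along the frame axis
        self.proj = nn.Linear(vision_dim, llm_dim)        # map into the LLM embedding space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) = batch, frames, tokens per frame, vision dim
        B, T, N, D = x.shape
        # Treat each spatial location as a sequence over time so every
        # token can absorb the history of that location.
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, D)    # (B*N, T, D)
        x = x + self.temporal_mixer(self.norm(x))         # residual temporal mixing
        x = x.reshape(B, N, T, D).permute(0, 2, 1, 3)     # back to (B, T, N, D)
        return self.proj(x)                               # (B, T, N, llm_dim)
```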

Architecture

Token Compression Strategies

Diagram of token compression methods.
Temporal average pooling (left), spatial average pooling (middle), and training-free temporal token sampling (right). These methods can be applied individually or in combination.
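The three strategies above can be expressed as small standalone operations. The sketch below is illustrative rather than the released code; the function names and the (B, T, N, D) tensor layout are assumptions.

```python
# Illustrative, standalone versions of the three compression strategies above.
import torch
import torch.nn.functional as F


def temporal_avg_pool(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Average every `factor` consecutive frames (assumes T divisible by factor)."""
    B, T, N, D = x.shape
    return x.reshape(B, T // factor, factor, N, D).mean(dim=2)       # (B, T/f, N, D)


def spatial_avg_pool(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Average factor x factor patches within each frame (assumes a square token grid)."""
    B, T, N, D = x.shape
    side = int(N ** 0.5)
    grid = x.reshape(B * T, side, side, D).permute(0, 3, 1, 2)       # (B*T, D, H, W)
    grid = F.avg_pool2d(grid, kernel_size=factor)                    # (B*T, D, H/f, W/f)
    return grid.permute(0, 2, 3, 1).reshape(B, T, -1, D)             # N shrinks by factor^2


def temporal_token_sampling(x: torch.Tensor, keep_every: int = 2) -> torch.Tensor:
    """Training-free test-time sampling: keep the tokens of every `keep_every`-th frame."""
    return x[:, ::keep_every]
```

Because the temporal projector has already propagated motion information into each remaining token, these reductions can be applied individually or chained, either during training (pooling) or purely at test time (sampling).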

Model Architecture

Diagram of components of the STORM architecture.
Three key components of STORM: (1) a Mamba-Based Temporal Projector to integrate spatiotemporal information into visual tokens, (2) temporal token compression via training-free sampling and/or training-based pooling, and (3) spatial token compression to further optimize token efficiency. All compression methods, training-free and training-based, spatial and temporal, are independently applicable.

Critical Role of the Mamba Module

STORM vs VILA on VideoMME for various video lengths.
STORM (w/ Mamba) vs. the baseline VILA on longer video inputs. Our Mamba module is critical for enabling robust improvements and efficiency as video length increases. In contrast, simply extending the video length and applying token compression to the baseline VILA, without our Mamba module, yields diminished gains.

Mamba Enables Efficient Long Video Understanding

STORM efficiency and effectiveness when token compression methods are applied.
By compressing tokens before processing them in the LLM, STORM substantially lowers computational overhead, leading to faster inference (left and middle), while providing continuous performance gains on extended video inputs (right).
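To see roughly where the speedup comes from, a back-of-envelope count of visual tokens is sketched below. The frame count, per-frame token count, and the particular 2× temporal / 2×2 spatial split are assumptions chosen to illustrate the up-to-8× figure from the abstract, not the exact experimental setting.

```python
# Back-of-envelope token count (illustrative numbers, not the paper's exact
# configuration): 2x temporal pooling combined with 2x2 spatial pooling cuts
# the visual tokens handed to the LLM by 8x.
frames, tokens_per_frame = 128, 256
baseline = frames * tokens_per_frame                   # 32768 visual tokens

compressed = (frames // 2) * (tokens_per_frame // 4)   # 64 frames * 64 tokens = 4096
print(baseline, compressed, baseline / compressed)     # 32768 4096 8.0
```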

Comparison between VILA-Based Models on the Same Token Budget

Comparison between VILA-Based Models on the Same Token Budget.
We compare STORM variants against the baseline VILA model, all trained with identical data and pipelines under the same 8K token budget, i.e., the number of visual tokens provided to the LLM during training.

Examples on Long Video Inputs with Token Compression

Examples of how longer-video inputs plus token compression in STORM improves performance.
Longer-video inputs + token compression in STORM improves performance while maintaining low computational overhead.

Examples on Various Categories of Visual Reasoning Tasks

Information Synopsis
Examples of the Information Synopsis task.
Spatial Perception
Examples of the Spatial Perception task.
OCR Problem
Examples of the OCR Problem.
Temporal Reasoning
Examples of the Temporal Reasoning task.
Attribute Perception
Examples of the Attribute Perception task.

More Samples and Comparison with State-of-the-Art

Comparison with the state of the art on the Open-Ended Problem.