Scaling Video Training with Parallelism
Long-video training changes the unit of distributed computation. A short video sample can fit on one GPU. A long video sample may already be too large or too structured for one GPU to handle.
The previous post argued that video generation is becoming an infrastructure problem. This post zooms in on one specific infrastructure question: how do we train on a single video sequence that is too long for one GPU, without changing what the model sees or what the loss is supposed to mean?
The answer is not simply “use more GPUs.” Data parallelism gives more samples to more workers. Tensor parallelism splits matrix dimensions. Model parallelism, including pipeline-style layer splits, splits the model itself. For long videos, the painful dimension is often inside one sample: the temporal/context sequence itself, whether represented as visual tokens or latents, plus the attention masks and target tokens that define the training objective.
1. What sequence parallelism actually parallelizes
Sequence parallelism (SP) is the parallelism axis for the inside of one sample. Data parallelism splits samples; tensor and model parallelism split model computation. Sequence/context parallelism splits the time or context dimension itself.
This matters once a single video becomes hundreds or thousands of frames, but the bottleneck is not always just token count. LongVILA is a token-heavy example: 1400 frames become about 274K tokens, with training contexts up to 2M tokens. LongLive-2.0 is different: autoregressive (AR) teacher forcing represents each temporal chunk in two streams, clean history and noisy target. If generic SP shards the concatenated clean/noisy sequence, some ranks can become clean-heavy while others carry the target loss, and VAE encoding can still be replicated on every rank. Balanced SP changes the unit of work: each rank keeps the clean/noisy pair for the same temporal chunk and VAE-encodes only its local chunk plus a small left overlap, or halo [1] [2].
Short video training: many samples -> split the batch
Long video training: one sample -> split inside the sample
The mental model is:
| Parallelism | What it splits | What it solves | Why it is not enough alone for long videos |
|---|---|---|---|
| Data parallelism / FSDP / ZeRO | Batch, parameters, gradients, optimizer state | Parameter and optimizer memory; throughput across samples | A single video sample can still be too long for one rank. |
| Tensor parallelism | Hidden dimensions, attention heads, MLP dimensions | Large matrix operations inside a layer | The sequence activation may still be too large. |
| Model / pipeline parallelism | Transformer layers or other model partitions | Deep models that do not fit on one device | Each stage or partition may still see the full sequence. |
| Sequence / context parallelism | Sequence, context, or time dimension | Long sequences whose activations and attention do not fit on one device | The shard must match token origin, training target, masks, and hardware. |
In this post, SP means the broad form of sequence/context parallelism: partitioning the context, time, or token dimension of one sample across ranks. For long video, that partition must also respect where tokens come from, what the model is trained to predict, the masks, and the GPU/node layout [4] [9].
2. A short map of SP systems
Before looking at video, it helps to place LongVILA and LongLive-2.0 on the SP map.
Sequence Parallelism from a system perspective. Li et al. framed sequence parallelism as a way to break the input sequence length limitation by splitting a long sequence into chunks and distributing those chunks across devices [3].
Megatron sequence parallelism and context parallelism. Megatron-style SP reduces activation memory and interacts naturally with tensor parallelism. Megatron Core’s later context parallelism generalizes the idea by partitioning the sequence dimension for network inputs and activations [4] [9].
DeepSpeed-Ulysses. Ulysses partitions input data along the sequence dimension and uses all-to-all communication during attention, which can be efficient when the number of attention heads supports the required partitioning [5].
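The layout change behind Ulysses-style attention can be sketched without any GPU code. The toy below only tracks which (position, head) pairs each rank owns before and after the all-to-all; the function names are hypothetical, and the divisibility check is the assumption this sketch uses for "the number of attention heads supports the required partitioning".

```python
def sequence_sharded(seq_len, num_heads, world_size):
    """Layout before attention: rank r owns a contiguous slice of positions, all heads."""
    per_rank = seq_len // world_size
    return [
        [(pos, head)
         for pos in range(r * per_rank, (r + 1) * per_rank)
         for head in range(num_heads)]
        for r in range(world_size)
    ]


def all_to_all_seq_to_head(shards, num_heads):
    """Simulate the all-to-all: afterwards rank r owns every position for its head group."""
    world_size = len(shards)
    if num_heads % world_size != 0:
        # Assumption of this sketch: heads must split evenly across SP ranks.
        raise ValueError("num_heads must be divisible by the SP degree")
    heads_per_rank = num_heads // world_size
    received = [[] for _ in range(world_size)]
    for src in range(world_size):
        for pos, head in shards[src]:
            # Each (position, head) entry is sent to the rank that owns that head.
            received[head // heads_per_rank].append((pos, head))
    return received


before = sequence_sharded(seq_len=8, num_heads=4, world_size=2)
after = all_to_all_seq_to_head(before, num_heads=4)

positions_before = sorted({pos for pos, _ in before[0]})
positions_after = sorted({pos for pos, _ in after[0]})
heads_after = sorted({head for _, head in after[0]})
print(positions_before, positions_after, heads_after)
```

After the exchange, rank 0 sees the full sequence but only its own heads, so attention for those heads can run locally. A second all-to-all (not shown) would return the output to the sequence-sharded layout.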
Ring Attention. Ring Attention uses blockwise attention and ring communication of key-value blocks, letting devices stream KV chunks while computing local attention [6]. Because each device holds only its own block of the sequence, the maximum context length grows with the number of devices.
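A minimal single-process sketch of the ring idea follows. It simulates ranks as list entries, rotates KV blocks around the ring, and merges partial results with a running softmax. It assumes unmasked, unscaled dot-product attention on toy vectors, and all function names are hypothetical; it is meant to show why blockwise accumulation gives the same answer as full attention, not how a production kernel is written.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(queries, keys, values):
    """Reference: softmax(q . k) weighted sum over all keys (no mask, no scaling)."""
    outputs = []
    dim = len(values[0])
    for q in queries:
        scores = [dot(q, k) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)
        ])
    return outputs


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Each simulated rank keeps its queries and sees one KV block per ring step."""
    world_size = len(q_blocks)
    dim = len(v_blocks[0][0])
    # Running softmax state per query: [max score, normalizer, weighted value sum].
    state = [
        [[-math.inf, 0.0, [0.0] * dim] for _ in q_blocks[r]]
        for r in range(world_size)
    ]
    held = list(zip(k_blocks, v_blocks))  # held[r] is the KV block currently on rank r
    for _ in range(world_size):
        for r in range(world_size):
            keys, values = held[r]
            for q, st in zip(q_blocks[r], state[r]):
                for k, v in zip(keys, values):
                    score = dot(q, k)
                    new_max = max(st[0], score)
                    rescale = math.exp(st[0] - new_max)  # 0.0 on the first update
                    weight = math.exp(score - new_max)
                    st[1] = st[1] * rescale + weight
                    st[2] = [a * rescale + weight * x for a, x in zip(st[2], v)]
                    st[0] = new_max
        # Pass each KV block to the next rank in the ring.
        held = [held[(r - 1) % world_size] for r in range(world_size)]
    return [
        [[a / st[1] for a in st[2]] for st in state[r]]
        for r in range(world_size)
    ]


tokens = [[math.sin(i), math.cos(i)] for i in range(8)]
blocks = [tokens[i:i + 2] for i in range(0, 8, 2)]  # 4 simulated ranks, 2 tokens each
ring_out = [row for blk in ring_attention(blocks, blocks, blocks) for row in blk]
ref_out = full_attention(tokens, tokens, tokens)
max_err = max(abs(a - b) for r1, r2 in zip(ring_out, ref_out) for a, b in zip(r1, r2))
print(max_err)
```

No rank ever holds more than one KV block at a time, which is the property that lets sequence length grow with the ring size.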
USP and LoongTrain. USP unifies Ulysses-style and Ring-style approaches into a broader sequence-parallel design space [7]. LoongTrain pushes the same direction toward 2D-Attention and head-context parallelism for long-sequence LLM training [8].
3. SP for long-video understanding
LongVILA is a long-context VLM system for long-video understanding. Algorithmically, it extends the VILA training recipe with context extension and long-video supervised fine-tuning. System-wise, its key idea is Multi-Modal Sequence Parallelism, or MM-SP [1].
The headline numbers show the regime: LongVILA extends VILA from 8 video frames to 2048 frames and reports 99.8% accuracy in a 6000-frame needle-in-a-haystack evaluation, where the video can exceed 1M tokens [1].
Why text-only SP is not enough
Ring-style or text-centric SP can shard a token sequence. But in a VLM, the model does not begin with a clean token sequence. It begins with frames/images and text, then uses the vision encoder to produce visual tokens. If the system only shards after this point, the vision tower can remain imbalanced.
LongVILA’s MM-SP therefore uses a two-stage sharding strategy:
- Stage 1: shard by images or frames. Frames are distributed across SP ranks to balance the vision tower workload.
- Stage 2: shard by tokens. After visual embeddings and text are assembled, the resulting sequence is balanced across ranks for the LLM.
This is a small but important shift in perspective. The SP boundary moves earlier in the pipeline. The system does not wait until the LLM sees a long token sequence; it starts balancing from the moment video becomes visual work.
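The two stages can be sketched as plain bookkeeping. The numbers below are illustrative: the frame count comes from the LongVILA example earlier in this post, while the SP degree, tokens per frame (roughly 274K / 1400), and text length are assumptions, and `split_evenly` is a hypothetical helper rather than LongVILA's code.

```python
def split_evenly(total, world_size):
    """Return per-rank (start, stop) ranges whose sizes differ by at most one."""
    base, extra = divmod(total, world_size)
    bounds, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


WORLD_SIZE = 8          # SP degree (assumed)
NUM_FRAMES = 1400       # frame count quoted in this post
TOKENS_PER_FRAME = 196  # assumed, roughly 274K tokens / 1400 frames
TEXT_TOKENS = 300       # assumed prompt length

# Stage 1: each rank runs the vision encoder on its own frames only.
frame_shards = split_evenly(NUM_FRAMES, WORLD_SIZE)
frames_per_rank = [stop - start for start, stop in frame_shards]
visual_tokens_produced = [n * TOKENS_PER_FRAME for n in frames_per_rank]

# Stage 2: the assembled visual + text sequence is re-split by token count,
# so the rank that also holds the text does not end up with extra LLM work.
total_tokens = NUM_FRAMES * TOKENS_PER_FRAME + TEXT_TOKENS
token_shards = split_evenly(total_tokens, WORLD_SIZE)
llm_tokens_owned = [stop - start for start, stop in token_shards]

print(frames_per_rank)
print(llm_tokens_owned)
```

Stage 1 balances encoder work by frame count; stage 2 rebalances by token count because the text tokens and any uneven frame split would otherwise skew the LLM shards.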
Communication should match the hardware topology
LongVILA also shows that SP is not only about slicing tensors. It is about choosing a communication pattern that matches the machine. The paper contrasts Ring-style SP, which relies on point-to-point communication, with MM-SP’s 2D-Attention design: intra-node All-to-All uses fast NVLink bandwidth, while inter-node P2P handles the slower cross-node path [1].
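One way to picture the 2D layout is as two families of rank groups. The sketch below assumes global ranks are numbered node by node (rank = node * gpus_per_node + local index), which is a common launcher convention but an assumption here; the function name is hypothetical, and this is not necessarily the exact grouping LongVILA uses.

```python
def build_2d_attention_groups(world_size, gpus_per_node):
    """Return (intra-node all-to-all groups, inter-node P2P rings) as lists of ranks."""
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")
    num_nodes = world_size // gpus_per_node
    # Ranks on the same node exchange heads/sequence with All-to-All over NVLink.
    all_to_all_groups = [
        [node * gpus_per_node + local for local in range(gpus_per_node)]
        for node in range(num_nodes)
    ]
    # Ranks with the same local index form a ring across nodes for P2P KV passing.
    p2p_rings = [
        [node * gpus_per_node + local for node in range(num_nodes)]
        for local in range(gpus_per_node)
    ]
    return all_to_all_groups, p2p_rings


a2a, rings = build_2d_attention_groups(world_size=16, gpus_per_node=8)
print(a2a[1])    # ranks that share node 1
print(rings[3])  # ranks with local index 3 on every node
```

The point of the split is that the bandwidth-hungry collective stays inside a node, while only the ring traffic crosses the slower inter-node links.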
SP for reinforcement learning. LongVILA-R1 extends the same SP idea to reinforcement learning, where one long video is reused across many rollouts plus policy/reference-model prefilling. Its Multi-modal Reinforcement Sequence Parallelism (MR-SP) first splits video-frame encoding across GPUs during rollout. Then it gathers and caches the video embeddings, so repeated rollouts can reuse them. Finally, it applies SP to the long video prefix used by both the policy model and the reference model. The system now has to split not only tokens, but also rollout-time encoding and cached video embeddings. The paper reports up to 2.1x speedup on 512-frame RL training and scales to 1024 frames without OOM on a single 8xA100 node [10].
4. SP for long-video generation
LongLive-2.0 attacks a different problem: long-video generation infrastructure. The full system includes NVFP4 training and inference, KV-cache compression, parallel dequantization, and asynchronous VAE decoding. For this post, the key training-side idea is Balanced SP [2].
The difference from LongVILA is important. In understanding, the bottleneck is how frames/images become visual tokens and then a long VLM sequence. In LongLive-2.0, the bottleneck comes from AR teacher forcing: the logical sequence has clean-history and noisy-target streams. Ulysses-style SP changes the order in which tokens are laid out for attention. The mask can still be built in that order, but the system must remember which clean and noisy chunks come from the same time segment. Balanced SP is therefore not about manually pairing mask entries; it chooses a work unit where each rank keeps matched clean/noisy chunks and a balanced share of target tokens.
The naive layout: clean-only ranks, target-heavy ranks
The efficient teacher-forcing formulation builds a sequence like:
[ clean history latents ; noisy target latents ]
If ordinary SP slices this concatenated sequence without understanding the AR objective, some ranks may contain mostly clean context while others contain noisy target tokens that carry loss. The sequence is partitioned, but the training work is not balanced.
The Balanced SP answer: paired chunks on each rank
Balanced SP changes the data layout. Each SP rank locally constructs the clean latents and noisy latents from the same temporal chunk. That gives every rank both context tokens and target tokens that carry loss. The teacher-forcing mask is then constructed in the Ulysses attention order from those clean/noisy identities, without materializing a separate global permutation [2].
Traditional SP:
GPU0: clean z0
GPU1: clean z1
GPU2: clean z2
GPU3: noisy z3 + loss
Balanced SP:
GPU0: clean z0 + noisy z0 + local loss
GPU1: clean z1 + noisy z1 + local loss
GPU2: clean z2 + noisy z2 + local loss
GPU3: clean z3 + noisy z3 + local loss
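The imbalance in the diagram can be counted directly. This toy sketch assumes equal-sized temporal chunks and a chunk count divisible by the SP degree; the function names are hypothetical and the layouts are a simplified reading of the description above, not LongLive-2.0's implementation.

```python
def naive_layout(num_chunks, world_size):
    """Contiguously slice the concatenated [clean ; noisy] sequence across ranks."""
    sequence = [("clean", t) for t in range(num_chunks)]
    sequence += [("noisy", t) for t in range(num_chunks)]
    per_rank = len(sequence) // world_size
    return [sequence[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def balanced_layout(num_chunks, world_size):
    """Give each rank the clean and noisy versions of the same temporal chunks."""
    per_rank = num_chunks // world_size
    layout = []
    for r in range(world_size):
        chunks = range(r * per_rank, (r + 1) * per_rank)
        layout.append([("clean", t) for t in chunks] + [("noisy", t) for t in chunks])
    return layout


def target_chunks_per_rank(layout):
    """Only noisy chunks carry loss, so count them per rank."""
    return [sum(1 for kind, _ in shard if kind == "noisy") for shard in layout]


naive_targets = target_chunks_per_rank(naive_layout(num_chunks=8, world_size=4))
balanced_targets = target_chunks_per_rank(balanced_layout(num_chunks=8, world_size=4))
print(naive_targets)     # → [0, 0, 4, 4]
print(balanced_targets)  # → [2, 2, 2, 2]
```

Both layouts hold the same number of tokens per rank; only the balanced one spreads the loss-carrying tokens evenly.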
SP starts before the transformer
The most interesting detail is that Balanced SP reaches before the DiT. Each rank VAE-encodes only its local raw-video chunk plus a left halo covering the VAE temporal receptive field, then discards the halo latent and keeps the local latent chunk [2].
This is a video-specific lesson. If the transformer is sharded but the VAE pipeline is replicated, the system has not actually made long-video training scale. SP must begin where the expensive sequence is created.
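The halo trick can be checked on a toy encoder. The sketch below stands in for the VAE with a causal sliding-window sum whose receptive field is `halo` previous frames; that stand-in, the function names, and the lack of temporal downsampling are all assumptions made for illustration. A real video VAE compresses time, so its chunk boundaries would also need to align with the compression stride.

```python
def toy_causal_encode(frames, receptive):
    """Toy encoder: output[t] depends on frames[t - receptive .. t] only."""
    return [sum(frames[max(0, t - receptive):t + 1]) for t in range(len(frames))]


def encode_local_chunk(frames, rank, world_size, halo):
    """Encode only this rank's frames plus a left halo, then drop the halo outputs."""
    per_rank = len(frames) // world_size
    start, stop = rank * per_rank, (rank + 1) * per_rank
    read_start = max(0, start - halo)           # rank 0 has no left neighbour
    local_out = toy_causal_encode(frames[read_start:stop], halo)
    return local_out[start - read_start:]       # keep only the local latents


frames = list(range(32))
HALO = 3
WORLD_SIZE = 4

stitched = []
for rank in range(WORLD_SIZE):
    stitched += encode_local_chunk(frames, rank, WORLD_SIZE, HALO)

reference = toy_causal_encode(frames, HALO)
print(stitched == reference)
```

Each rank reads at most `per_rank + halo` frames instead of the whole video, and the stitched result matches encoding the full video on one device.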
The performance lesson
LongLive-2.0 reports that NVFP4 plus Balanced SP is the fastest training configuration: for 16s, 32s, and 64s videos, iteration time is 40.1s, 119.3s, and 639.5s, giving 1.3×, 1.4×, and 2.1× speedups over the BF16+SP baseline. The paper also reports up to 2.15× training speedup, 1.84× inference speedup, and 45.7 FPS inference [2].
5. Differences between understanding and generation systems
LongVILA and LongLive-2.0 both split work inside a sample, but they use different work assignments: multimodal tokens for understanding, and clean/noisy latent chunks for generation.
| Design aspect | LongVILA | LongLive-2.0 |
|---|---|---|
| Task | Long-video understanding / VLM | Long-video generation / AR diffusion infrastructure |
| Sequence unit | Visual tokens plus text tokens | Video latent chunks |
| Main bottleneck | Vision encoder workload, LLM context length, attention communication | Clean/noisy layout, target-token imbalance, VAE latent preparation, DiT activation |
| Why naive SP fails | Text-only token sharding ignores where visual tokens come from | Concatenated clean/noisy sharding can leave some ranks target-heavy and others mostly clean-only |
| Core design | MM-SP: shard by frames/images, then by tokens | Balanced SP: each rank locally constructs matched clean/noisy latents from one temporal chunk |
| General lesson | Modality-aware work assignment | Objective-aware temporal work assignment |
The common abstraction is meaningful work assignment. A rank should not merely own a contiguous slice of a tensor. It should own a slice that makes the upstream encoder, the attention communication, the loss, and the hardware topology behave well together.
6. Design principles for long-video training systems
Principle 1: Shard the real bottleneck
Do not split the easiest tensor; split the work that actually limits scale. For LongVILA, that includes frame/image encoding. For LongLive-2.0, it includes VAE preparation and target-token distribution.
Principle 2: Keep the training objective the same
SP should not change temporal order, positions, attention visibility, loss masks, or which tokens are targets. Sharding should not change what the model is trained to predict.
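One way to test this principle is to define attention visibility from token identity rather than from position, then confirm that any sharded ordering yields the same mask up to permutation. The visibility rule below (clean chunks attend causally to clean chunks; a noisy chunk attends to earlier clean chunks and to itself) is an assumed teacher-forcing rule for illustration, not one confirmed from the LongLive-2.0 paper, and the function names are hypothetical.

```python
def can_attend(query, key):
    """Assumed teacher-forcing visibility, decided purely from (kind, chunk index)."""
    q_kind, q_t = query
    k_kind, k_t = key
    if q_kind == "clean":
        return k_kind == "clean" and k_t <= q_t
    if k_kind == "clean":
        return k_t < q_t      # noisy chunk sees strictly earlier clean history
    return k_t == q_t         # noisy chunk sees only itself among noisy chunks


def build_mask(order):
    """Build the boolean mask directly in whatever token order the ranks use."""
    return [[can_attend(q, k) for k in order] for q in order]


NUM_CHUNKS = 4
global_order = [("clean", t) for t in range(NUM_CHUNKS)]
global_order += [("noisy", t) for t in range(NUM_CHUNKS)]

# Rank-major order when each rank holds the clean/noisy pair of one chunk.
sharded_order = []
for t in range(NUM_CHUNKS):
    sharded_order += [("clean", t), ("noisy", t)]

global_mask = build_mask(global_order)
sharded_mask = build_mask(sharded_order)

# Permuting the global mask into the sharded order must give the same result.
perm = [global_order.index(tok) for tok in sharded_order]
permuted = [[global_mask[i][j] for j in perm] for i in perm]
print(sharded_mask == permuted)
```

Because the mask is derived from identities, the sharded layout never needs a materialized global permutation, and the check confirms the model sees the same visibility pattern either way.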
Principle 3: Match the hardware topology
Ring, Ulysses, 2D-Attention, USP, and LoongTrain differ mainly in how they communicate. A good video system chooses All-to-All, P2P, intra-node traffic, and inter-node traffic deliberately.
Principle 4: Start before the transformer
Video sequence construction begins before attention: frame loading, vision encoding, VAE encoding with a small overlap from the previous chunk, latent chunking, and mask construction. If SP starts only inside transformer blocks, imbalance may already be baked in.
Principle 5: Check what each rank handles
Across samples? -> DP / FSDP / ZeRO
Inside one long sample? -> SP / context parallelism
Mostly text and enough attention heads? -> Ulysses-style SP
Need many nodes or beyond head limits? -> Ring / USP / 2D-Attention / LoongTrain-style SP
Heavy multimodal encoder work? -> MM-SP-style two-stage sharding
Clean/noisy AR video streams? -> Balanced-SP-style temporal work assignment
The final diagnostic is simple: after sharding, every rank should have meaningful work. If one rank handles all the loss, every rank re-encodes the same video, or communication ignores the hardware topology, the layout is probably wrong.
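The first two checks in that diagnostic are easy to automate from per-rank counters. The sketch below is a hypothetical helper: the `RankWork` fields and both thresholds are assumptions chosen for illustration, and it does not attempt the topology check, which depends on the cluster.

```python
from dataclasses import dataclass


@dataclass
class RankWork:
    target_tokens: int   # tokens on this rank that contribute to the loss
    frames_encoded: int  # raw frames this rank pushed through the encoder


def diagnose(ranks, total_frames, imbalance_limit=1.5, redundancy_limit=1.5):
    """Return a list of human-readable warnings about the sharded layout."""
    issues = []
    targets = [r.target_tokens for r in ranks]
    mean_targets = sum(targets) / len(targets)
    if mean_targets > 0 and max(targets) > imbalance_limit * mean_targets:
        issues.append("loss-carrying tokens are concentrated on a few ranks")
    encoded = sum(r.frames_encoded for r in ranks)
    if total_frames > 0 and encoded > redundancy_limit * total_frames:
        issues.append("ranks are re-encoding the same frames")
    return issues


TOTAL_FRAMES = 64

# Every rank encodes the full video; only the last two ranks carry loss.
replicated = [RankWork(0, 64), RankWork(0, 64), RankWork(128, 64), RankWork(128, 64)]
# Each rank encodes 16 local frames (plus a 3-frame halo after rank 0).
balanced = [RankWork(64, 16), RankWork(64, 19), RankWork(64, 19), RankWork(64, 19)]

replicated_issues = diagnose(replicated, TOTAL_FRAMES)
balanced_issues = diagnose(balanced, TOTAL_FRAMES)
print(replicated_issues)
print(balanced_issues)
```

In practice the counters would come from logging inside the data pipeline and loss computation; the value of the check is that it fails loudly before a long run starts.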
Closing: The temporal dimension is the new batch dimension
Long-video training breaks a basic assumption: one sample does not necessarily belong to one GPU. Once a sample becomes hundreds or thousands of frames, the system has to distribute work inside the sample itself.
This is where a generic “split the sequence” SP story stops being enough. LongVILA cannot treat the input as only a long token list; the split has to respect how frames/images become visual tokens. LongLive-2.0 cannot simply shard the concatenated clean/noisy sequence; the split has to preserve each temporal chunk's clean history, noisy target, small left overlap for VAE encoding, and target tokens.
In long-video training, the temporal dimension becomes the new batch dimension. Sequence parallelism is how we scale it.
That is why SP belongs in the infrastructure stack for long video. A beautiful long-video demo proves capability. A well-designed SP system makes that capability trainable.
References
- [1] LongVILA: Scaling Long-Context Visual Language Models for Long Videos. Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han. arXiv preprint, 2024. arXiv:2408.10188
- [2] LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation. Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han. arXiv preprint, 2026. arXiv:2605.18739
- [3] Sequence Parallelism: Long Sequence Training from System Perspective. Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You. Annual Meeting of the Association for Computational Linguistics (ACL), 2023 / arXiv preprint, 2021. arXiv:2105.13120
- [4] Reducing Activation Recomputation in Large Transformer Models. Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro. MLSys, 2023 / arXiv preprint, 2022. arXiv:2205.05198
- [5] DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He. arXiv preprint, 2023. arXiv:2309.14509
- [6] Ring Attention with Blockwise Transformers for Near-Infinite Context. Hao Liu, Matei Zaharia, Pieter Abbeel. arXiv preprint, 2023. arXiv:2310.01889
- [7] USP: A Unified Sequence Parallelism Approach for Long Context Generative AI. Jiarui Fang, Shangchun Zhao. arXiv preprint, 2024. arXiv:2405.07719
- [8] LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism. Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu. arXiv preprint, 2024. arXiv:2406.18485
- [9] Context Parallelism. NVIDIA Megatron Core Documentation. NVIDIA Docs
- [10] Scaling RL to Long Videos. Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han. NeurIPS, 2025 / arXiv preprint, 2025. arXiv:2507.07966