Scaling Video Training with Parallelism

📝 Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Weian Mao, Song Han
📅 June 3, 2026 ⏱️ 12 min read

Long-video training changes the unit of distributed computation. A short video sample can fit on one GPU. A long video sample may already be too large or too structured for one GPU to handle.

The previous post argued that video generation is becoming an infrastructure problem. This post zooms in on one specific infrastructure question: how do we train on a single video sequence that is too long for one GPU, without changing what the model sees or what the loss is supposed to mean?

The answer is not simply “use more GPUs.” Data parallelism gives more samples to more workers. Tensor parallelism splits matrix dimensions. Model parallelism, including pipeline-style layer splits, splits the model itself. For long videos, the painful dimension is often inside one sample: the temporal/context sequence itself, whether represented as visual tokens or latents, plus the attention masks and target tokens that define the training objective.

Diagram: Longer videos need sequence parallelism. Short video: few tokens, fits on one GPU. Longer video: more tokens, higher memory and compute. Very long video: too many tokens, does not fit on one GPU. Sequence parallelism (SP): split the sequence across GPUs to balance memory, compute, and workload.

1. What sequence parallelism actually parallelizes

Sequence parallelism (SP) is the parallelism axis for the inside of one sample. Data parallelism splits samples; tensor and model parallelism split model computation. Sequence/context parallelism splits the time or context dimension itself.

This matters once a single video becomes hundreds or thousands of frames, but the bottleneck is not always just token count. LongVILA is a token-heavy example: 1400 frames become about 274K tokens, with training contexts up to 2M tokens [1]. LongLive-2.0 is different: autoregressive (AR) teacher forcing represents each temporal chunk in two streams, clean history and noisy target. If generic SP shards the concatenated clean/noisy sequence, some ranks can become clean-heavy while others carry the target loss, and VAE encoding can still be replicated on every rank. Balanced SP changes the unit of work: each rank keeps the clean/noisy pair for the same temporal chunk and VAE-encodes only its local chunk plus a small left overlap, or halo [2].

Short video training: many samples  -> split the batch
Long video training:  one sample    -> split inside the sample

The mental model is:

Diagram: Parallelism axes: what does each GPU own? Panels: Data Parallelism (split the batch), Tensor Parallelism (split heads or hidden dims), Model Parallelism (pipeline-style layer split; split model depth), Sequence Parallelism (split context / time).
Figure 1. Generated animated diagram. Data parallelism splits samples, tensor parallelism splits hidden dimensions, model/pipeline parallelism splits model depth, and sequence/context parallelism splits inside a single sample.

| Parallelism | What it splits | What it solves | Why it is not enough alone for long videos |
| --- | --- | --- | --- |
| Data parallelism / FSDP / ZeRO | Batch, parameters, gradients, optimizer state | Parameter and optimizer memory; throughput across samples | A single video sample can still be too long for one rank. |
| Tensor parallelism | Hidden dimensions, attention heads, MLP dimensions | Large matrix operations inside a layer | The sequence activation may still be too large. |
| Model / pipeline parallelism | Transformer layers or other model partitions | Deep models that do not fit on one device | Each stage or partition may still see the full sequence. |
| Sequence / context parallelism | Sequence, context, or time dimension | Long sequences whose activations and attention do not fit on one device | The shard must match token origin, training target, masks, and hardware. |

In this post, SP means the broad form of sequence/context parallelism: partitioning the context, time, or token dimension of one sample across ranks. For long video, that partition must also respect where tokens come from, what the model is trained to predict, the masks, and the GPU/node layout [4] [9].
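As a minimal sketch of that broad definition, the helper below assigns each rank a contiguous slice of one sample's token dimension while keeping global position ids. The names `shard_bounds` and `local_positions` and the even-split policy are illustrative assumptions, not code from any of the cited systems.

```python
def shard_bounds(seq_len: int, world_size: int, rank: int) -> tuple[int, int]:
    """Return the [start, end) token range owned by `rank` (hypothetical helper)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Integer arithmetic spreads any remainder so shard sizes differ by at most one.
    start = rank * seq_len // world_size
    end = (rank + 1) * seq_len // world_size
    return start, end


def local_positions(seq_len: int, world_size: int, rank: int) -> list[int]:
    """Global position ids for the local shard, so sharding does not renumber tokens."""
    start, end = shard_bounds(seq_len, world_size, rank)
    return list(range(start, end))


if __name__ == "__main__":
    for r in range(4):
        print(r, shard_bounds(10, 4, r), local_positions(10, 4, r))
```

A contiguous split like this is only the starting point; the rest of the post is about why video workloads often need a different unit of work than a raw token range.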

2. A short map of SP systems

Before looking at video, it helps to place LongVILA and LongLive-2.0 on the SP map.

Sequence Parallelism from a system perspective. Li et al. framed sequence parallelism as a way to break the input sequence length limitation by splitting a long sequence into chunks and distributing those chunks across devices [3].

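Megatron sequence parallelism and context parallelism. Megatron-style SP shards the activations of operations that tensor parallelism leaves replicated, such as LayerNorm and dropout, along the sequence dimension, so it reduces activation memory and composes naturally with tensor parallelism. Megatron Core’s later context parallelism generalizes the idea by partitioning the sequence dimension for network inputs and all activations [4] [9].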

DeepSpeed-Ulysses. Ulysses partitions input data along the sequence dimension and uses all-to-all communication around attention, so that each device attends over the full sequence for a subset of heads. It can be efficient when the number of attention heads is divisible by, and at least as large as, the sequence-parallel degree [5].
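To make the head requirement concrete, here is a toy, single-process simulation of a Ulysses-style all-to-all. It tracks only which (token, head) pairs each rank holds, not real tensors. The function name `seq_to_head_all_to_all` is hypothetical, and the divisibility check reflects the assumption that heads are split evenly across ranks.

```python
def seq_to_head_all_to_all(seq_shards, num_heads):
    """Simulate the all-to-all that turns sequence shards into head shards.

    seq_shards[r] lists the token indices rank r owns before attention.
    Returns, per rank, the sorted (token, head) pairs it owns afterwards.
    """
    world_size = len(seq_shards)
    if num_heads % world_size != 0:
        raise ValueError("this sketch assumes num_heads is divisible by the SP degree")
    heads_per_rank = num_heads // world_size

    received = [[] for _ in range(world_size)]
    for src_tokens in seq_shards:
        for token in src_tokens:
            for head in range(num_heads):
                dst = head // heads_per_rank  # the rank that owns this head
                received[dst].append((token, head))
    return [sorted(pairs) for pairs in received]


if __name__ == "__main__":
    # 8 tokens, 4 heads, 2 ranks: each rank starts with 4 tokens x 4 heads.
    after = seq_to_head_all_to_all([[0, 1, 2, 3], [4, 5, 6, 7]], num_heads=4)
    for rank, pairs in enumerate(after):
        tokens = sorted({t for t, _ in pairs})
        heads = sorted({h for _, h in pairs})
        print(rank, tokens, heads)
```

After the exchange, each rank holds every token for its own subset of heads, so attention for those heads is a local computation. A second all-to-all in the opposite direction would restore the sequence-sharded layout.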

Ring Attention. Ring Attention uses blockwise attention and ring communication of key-value blocks, letting devices stream KV chunks while computing local attention [6]. This makes context length scale naturally with the number of devices.
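The sketch below illustrates the blockwise idea in pure Python: KV blocks rotate around a simulated ring while each rank keeps a running softmax state for its own queries. It is a simplified, non-causal toy with scalar features and a single process standing in for all ranks, so it shows the arithmetic rather than the communication overlap; the function names are illustrative assumptions.

```python
import math


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Non-causal attention with scalar features, computed block by block.

    Rank r owns q_blocks[r], k_blocks[r], v_blocks[r]. At each step the KV
    blocks rotate one hop around the ring, and every rank folds the block it
    currently holds into a running (max, normalizer, weighted sum) state.
    """
    world_size = len(q_blocks)
    state = [[(-math.inf, 0.0, 0.0) for _ in qs] for qs in q_blocks]
    for step in range(world_size):
        for rank in range(world_size):
            src = (rank - step) % world_size  # owner of the KV block now at `rank`
            for i, q in enumerate(q_blocks[rank]):
                m, l, acc = state[rank][i]
                for k, v in zip(k_blocks[src], v_blocks[src]):
                    score = q * k
                    m_new = max(m, score)
                    rescale = math.exp(m - m_new)  # 0.0 on the first update
                    p = math.exp(score - m_new)
                    l = l * rescale + p
                    acc = acc * rescale + p * v
                    m = m_new
                state[rank][i] = (m, l, acc)
    return [[acc / l for _, l, acc in rank_state] for rank_state in state]


def full_attention(qs, ks, vs):
    """Reference: ordinary softmax attention over the whole sequence."""
    out = []
    for q in qs:
        scores = [q * k for k in ks]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        out.append(sum(w * v for w, v in zip(weights, vs)) / sum(weights))
    return out
```

Because the running state is rescaled whenever a larger score arrives, the result matches ordinary attention regardless of the order in which KV blocks reach a rank. No rank ever needs to hold more than one remote KV block at a time.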

USP and LoongTrain. USP unifies Ulysses-style and Ring-style approaches into a broader sequence-parallel design space [7]. LoongTrain pushes the same direction toward 2D-Attention and head-context parallelism for long-sequence LLM training [8].

The map gives us vocabulary, not the final answer. These systems answer how to distribute long transformer sequences. Long videos add another layer: the sequence is produced by a multimodal pipeline or by a structured generation objective.

Diagram: Why video is not just longer text. Video understanding: tokens are produced (video frames -> vision encoder -> visual tokens -> LLM; balance encoding + token sharding). Video generation: layout is the objective (clean history and noisy target chunks -> LLM; loss on the noisy side). Video SP must respect token origin and token meaning.

3. SP for long-video understanding

LongVILA is a long-context VLM system for long-video understanding. Algorithmically, it extends the VILA training recipe with context extension and long-video supervised fine-tuning. System-wise, its key idea is Multi-Modal Sequence Parallelism, or MM-SP [1].

The headline numbers show the regime: LongVILA extends VILA from 8 video frames to 2048 frames and reports 99.8% accuracy in a 6000-frame needle-in-a-haystack evaluation, where the video can exceed 1M tokens [1].
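As a rough consistency check, the frame and token counts quoted in this post imply a per-frame token budget. The sketch below derives it from the 1400-frame, roughly 274K-token figure mentioned earlier and extrapolates to 6000 frames. The per-frame value is an inferred average, not a number taken from the LongVILA paper, and it ignores text tokens.

```python
# Figures quoted in this post for LongVILA.
frames_example = 1400
tokens_example = 274_000  # "about 274K tokens"

# Inferred average; assumes tokens scale linearly with frame count.
tokens_per_frame = tokens_example / frames_example

needle_frames = 6000
estimated_tokens = needle_frames * tokens_per_frame

print(f"~{tokens_per_frame:.0f} tokens per frame")
print(f"~{estimated_tokens / 1e6:.2f}M tokens for {needle_frames} frames")
```

Under that linear assumption the estimate comes out at a little under 200 tokens per frame and a little over 1M tokens for 6000 frames, which is consistent with the statement that the needle-in-a-haystack video can exceed 1M tokens.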

Diagram: LongVILA MM-SP two-stage sharding. Panel 1, two-stage sharding strategy: baseline Ring SP splits tokens but leaves the vision encoder workload unbalanced; MM-SP shards by images first (stage 1), then by tokens (stage 2). In the example, four 100-token images plus 300 text tokens end up as 100 + 100 + 100 + 50 tokens on GPU 0 and 50 + 300 tokens on GPU 1. Panel 2, topology-aware communication: Ring SP uses P2P everywhere; MM-SP's 2D-Attention uses All-to-All inside each node and P2P across nodes.
Figure 2. Generated animated diagram based on LongVILA. MM-SP first balances the image/frame workload and then balances token workload; its 2D-Attention communication uses intra-node All-to-All and inter-node P2P. Source: LongVILA [1].

Why text-only SP is not enough

Ring-style or text-centric SP can shard a token sequence. But in a VLM, the model does not begin with a clean token sequence. It begins with frames/images and text, then uses the vision encoder to produce visual tokens. If the system only shards after this point, the vision tower can remain imbalanced.

LongVILA’s MM-SP therefore uses a two-stage sharding strategy:

  1. Stage 1: shard by images or frames. Frames are distributed across SP ranks to balance the vision tower workload.
  2. Stage 2: shard by tokens. After visual embeddings and text are assembled, the resulting sequence is balanced across ranks for the LLM.
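The two stages can be sketched as two small partitioning functions. This is a toy illustration under stated assumptions: `shard_frames` and `shard_tokens` are hypothetical names, frames are split contiguously, and the example uses four images of 100 tokens each plus 300 text tokens on two GPUs. It is not LongVILA's implementation.

```python
def shard_frames(num_frames, world_size):
    """Stage 1: give each rank a contiguous, near-equal set of frame indices."""
    return [
        list(range(r * num_frames // world_size, (r + 1) * num_frames // world_size))
        for r in range(world_size)
    ]


def shard_tokens(segments, world_size):
    """Stage 2: cut the assembled (name, token_count) sequence into equal token spans."""
    total = sum(count for _, count in segments)
    shards = []
    for rank in range(world_size):
        lo = rank * total // world_size
        hi = (rank + 1) * total // world_size
        pieces, offset = [], 0
        for name, count in segments:
            # Overlap between this segment and the rank's [lo, hi) token span.
            overlap = min(hi, offset + count) - max(lo, offset)
            if overlap > 0:
                pieces.append((name, overlap))
            offset += count
        shards.append(pieces)
    return shards


if __name__ == "__main__":
    print(shard_frames(4, 2))
    segments = [("img0", 100), ("img1", 100), ("img2", 100), ("img3", 100), ("text", 300)]
    for rank, pieces in enumerate(shard_tokens(segments, 2)):
        print(rank, pieces, sum(n for _, n in pieces))
```

Stage 1 balances encoder work by frame count, which leaves the rank that also holds the text with more tokens. Stage 2 then rebalances by token count, so a single image's tokens may be split across two ranks.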

This is a small but important shift in perspective. The SP boundary moves earlier in the pipeline. The system does not wait until the LLM sees a long token sequence; it starts balancing from the moment video becomes visual work.

Communication should match the hardware topology

LongVILA also shows that SP is not only about slicing tensors. It is about choosing a communication pattern that matches the machine. The paper contrasts Ring-style SP, which relies on point-to-point communication, with MM-SP’s 2D-Attention design: intra-node All-to-All uses fast NVLink bandwidth, while inter-node P2P handles the slower cross-node path [1].
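One way to picture the 2D layout is as two families of rank groups: one per node for All-to-All, and one per local GPU index for cross-node P2P. The sketch below only computes the group membership. It assumes ranks are numbered node by node, and `build_2d_groups` is a hypothetical helper rather than an API from LongVILA or any communication library.

```python
def build_2d_groups(world_size, gpus_per_node):
    """Return (intra_node_groups, inter_node_groups) as lists of rank lists."""
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")
    num_nodes = world_size // gpus_per_node

    # Ranks on the same node: candidates for All-to-All over the fast local links.
    intra_node = [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]
    # Ranks with the same local index on different nodes: candidates for P2P rings.
    inter_node = [
        list(range(local, world_size, gpus_per_node))
        for local in range(gpus_per_node)
    ]
    return intra_node, inter_node


if __name__ == "__main__":
    intra, inter = build_2d_groups(world_size=8, gpus_per_node=4)
    print("intra-node (All-to-All):", intra)
    print("inter-node (P2P):", inter)
```

Every rank belongs to exactly one group of each kind, so the heavier All-to-All traffic stays inside a node and only the ring-style exchange crosses the slower inter-node path.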

LongVILA takeaway: for long-video understanding, SP has to become multi-modal SP. The system must know where visual tokens come from, not only where transformer tokens go.

SP for reinforcement learning. LongVILA-R1 extends the same SP idea to reinforcement learning, where one long video is reused across many rollouts plus policy/reference-model prefilling. Its Multi-modal Reinforcement Sequence Parallelism (MR-SP) first splits video-frame encoding across GPUs during rollout. Then it gathers and caches the video embeddings, so repeated rollouts can reuse them. Finally, it applies SP to the long video prefix used by both the policy model and the reference model. The system now has to split not only tokens, but also rollout-time encoding and cached video embeddings. The paper reports up to 2.1x speedup on 512-frame RL training and scales to 1024 frames without OOM on a single 8xA100 node [10].
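The encode-once, reuse-many pattern described above can be sketched with a small cache. Everything here is illustrative: `VideoEmbeddingCache` and `encode_frame` are hypothetical names, the per-rank loop stands in for real parallel encoding, and the list concatenation stands in for an all-gather.

```python
class VideoEmbeddingCache:
    """Encode each video once across simulated ranks, then reuse the result."""

    def __init__(self, encode_frame, world_size):
        self.encode_frame = encode_frame  # hypothetical per-frame encoder
        self.world_size = world_size
        self.encode_calls = 0
        self._cache = {}

    def get(self, video_id, frames):
        if video_id not in self._cache:
            gathered = []
            for rank in range(self.world_size):
                lo = rank * len(frames) // self.world_size
                hi = (rank + 1) * len(frames) // self.world_size
                # In a real system each rank would encode only this slice.
                for frame in frames[lo:hi]:
                    gathered.append(self.encode_frame(frame))
                    self.encode_calls += 1
            self._cache[video_id] = gathered  # stands in for all-gather + cache
        return self._cache[video_id]


if __name__ == "__main__":
    cache = VideoEmbeddingCache(encode_frame=lambda f: f * 2, world_size=4)
    frames = list(range(8))
    for rollout in range(5):
        embeddings = cache.get("video-0", frames)
    print(embeddings, cache.encode_calls)
```

The point of the sketch is the call count: the frames are encoded once in total, no matter how many rollouts request the same video.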

4. SP for long-video generation

LongLive-2.0 attacks a different problem: long-video generation infrastructure. The full system includes NVFP4 training and inference, KV-cache compression, parallel dequantization, and asynchronous VAE decoding. For this post, the key training-side idea is Balanced SP [2].

The difference from LongVILA is important. In understanding, the bottleneck is how frames/images become visual tokens and then a long VLM sequence. In LongLive-2.0, the bottleneck comes from AR teacher forcing: the logical sequence has clean-history and noisy-target streams. Ulysses-style SP changes the order in which tokens are laid out for attention. The mask can still be built in that order, but the system must remember which clean and noisy chunks come from the same time segment. Balanced SP is therefore not about manually pairing mask entries; it chooses a work unit where each rank keeps matched clean/noisy chunks and a balanced share of target tokens.

Animated Traditional SP diagram for LongLive-2.0 AR training
Figure 3a. Generated animated diagram based on LongLive-2.0. In traditional SP, VAE preparation is centralized and sharding over the concatenated clean/noisy sequence can leave target tokens that carry loss on only a few ranks. Source: LongLive-2.0 [2].
Animated Balanced SP diagram for LongLive-2.0 AR training
Figure 3b. Generated animated diagram based on LongLive-2.0. Balanced SP assigns each GPU a temporal clean/noisy pair, so local VAE encoding, teacher-forcing mask construction in Ulysses order, and target tokens are distributed across ranks. Source: LongLive-2.0 [2].

The naive layout: clean-only ranks, target-heavy ranks

The efficient teacher-forcing formulation builds a sequence like:

[ clean history latents ; noisy target latents ]

If ordinary SP slices this concatenated sequence without understanding the AR objective, some ranks may contain mostly clean context while others contain noisy target tokens that carry loss. The sequence is partitioned, but the training work is not balanced.

The Balanced SP answer: paired chunks on each rank

Balanced SP changes the data layout. Each SP rank locally constructs the clean latents and noisy latents from the same temporal chunk. That gives every rank both context tokens and target tokens that carry loss. The teacher-forcing mask is then constructed in the Ulysses attention order from those clean/noisy identities, without materializing a separate global permutation [2].
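To illustrate building the mask from token identities rather than from a fixed global order, the sketch below tags each token with a (kind, chunk) pair and applies one visibility rule. The rule is an assumed, commonly used block-causal teacher-forcing rule (clean tokens see clean history up to their own chunk; noisy tokens see strictly earlier clean chunks and their own noisy chunk). It is not taken from the LongLive-2.0 paper, and each chunk is reduced to a single token for brevity.

```python
def can_attend(query, key):
    """Assumed teacher-forcing visibility rule on (kind, chunk) identities."""
    q_kind, q_chunk = query
    k_kind, k_chunk = key
    if q_kind == "clean":
        # Clean history is causal over clean chunks and never sees noisy tokens.
        return k_kind == "clean" and k_chunk <= q_chunk
    if k_kind == "clean":
        # A noisy target sees only strictly earlier clean chunks.
        return k_chunk < q_chunk
    # Noisy tokens see noisy tokens of their own chunk only.
    return k_chunk == q_chunk


def build_mask(token_ids):
    """Boolean mask in whatever order the tokens are laid out."""
    return [[can_attend(q, k) for k in token_ids] for q in token_ids]


if __name__ == "__main__":
    num_chunks = 3
    # Logical order: all clean chunks, then all noisy chunks.
    logical = [("clean", c) for c in range(num_chunks)] + [("noisy", c) for c in range(num_chunks)]
    # Rank-major order: each rank contributes its clean/noisy pair.
    rank_major = [tok for c in range(num_chunks) for tok in (("clean", c), ("noisy", c))]
    for row in build_mask(rank_major):
        print("".join("x" if allowed else "." for allowed in row))
```

Because the rule depends only on identities, the mask built directly in the rank-major order equals the logical-order mask with rows and columns permuted, so no separate global permutation has to be materialized.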

Traditional SP:
GPU0: clean z0
GPU1: clean z1
GPU2: clean z2
GPU3: noisy z3 + loss

Balanced SP:
GPU0: clean z0 + noisy z0 + local loss
GPU1: clean z1 + noisy z1 + local loss
GPU2: clean z2 + noisy z2 + local loss
GPU3: clean z3 + noisy z3 + local loss
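The imbalance can be counted directly. The sketch below compares how many loss-carrying target tokens each rank receives under a contiguous split of the concatenated sequence and under a paired layout. The function names and the equal-sized clean and noisy halves are assumptions for illustration; with equal halves, the contiguous split puts every target on the second half of the ranks, which is the same effect as in the simplified listing above.

```python
def naive_target_counts(num_chunks, tokens_per_chunk, world_size):
    """Target tokens per rank when [clean ; noisy] is sliced contiguously."""
    half = num_chunks * tokens_per_chunk
    is_target = [False] * half + [True] * half  # only noisy tokens carry loss
    total = len(is_target)
    counts = []
    for rank in range(world_size):
        lo = rank * total // world_size
        hi = (rank + 1) * total // world_size
        counts.append(sum(is_target[lo:hi]))
    return counts


def balanced_target_counts(num_chunks, tokens_per_chunk, world_size):
    """Target tokens per rank when each rank owns whole clean/noisy chunk pairs."""
    if num_chunks % world_size != 0:
        raise ValueError("this sketch assumes chunks divide evenly across ranks")
    return [(num_chunks // world_size) * tokens_per_chunk] * world_size


if __name__ == "__main__":
    print("naive:   ", naive_target_counts(4, 10, 4))
    print("balanced:", balanced_target_counts(4, 10, 4))
```

Both layouts give every rank the same number of tokens overall; only the paired layout also gives every rank the same number of targets.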

SP starts before the transformer

The most interesting detail is that Balanced SP reaches before the DiT. Each rank VAE-encodes only its local raw-video chunk plus a left halo covering the VAE temporal receptive field, then discards the halo latent and keeps the local latent chunk [2].
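The halo idea can be checked with a toy encoder. In the sketch below, `toy_causal_encode` is a stand-in whose output at each step depends on the current frame and a fixed number of previous frames; it is not a real VAE, it does no temporal compression, and all names are hypothetical. Under those assumptions, encoding each local chunk with a left halo equal to the receptive field and then dropping the halo outputs reproduces the full-sequence result.

```python
def toy_causal_encode(frames, receptive_field):
    """Stand-in for a causal temporal encoder: each output sums the current
    frame and up to `receptive_field` previous frames. Not a real VAE."""
    return [
        sum(frames[max(0, t - receptive_field): t + 1])
        for t in range(len(frames))
    ]


def encode_local_chunk(frames, rank, world_size, halo):
    """Encode one rank's chunk plus a left halo, then discard the halo outputs."""
    start = rank * len(frames) // world_size
    end = (rank + 1) * len(frames) // world_size
    halo_start = max(0, start - halo)  # rank 0 has no left neighbour
    local = toy_causal_encode(frames[halo_start:end], receptive_field=halo)
    return local[start - halo_start:]  # keep only outputs for [start, end)


if __name__ == "__main__":
    frames = [(i * i) % 7 for i in range(16)]
    full = toy_causal_encode(frames, receptive_field=3)
    stitched = [
        x for rank in range(4) for x in encode_local_chunk(frames, rank, 4, halo=3)
    ]
    print(stitched == full)
```

Each rank touches only its own frames plus the halo, so the encoding cost per rank stays roughly constant as the video grows and more ranks are added.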

This is a video-specific lesson. If the transformer is sharded but the VAE pipeline is replicated, the system has not actually made long-video training scale. SP must begin where the expensive sequence is created.

The performance lesson

LongLive-2.0 reports that NVFP4 plus Balanced SP is the fastest training configuration: for 16s, 32s, and 64s videos, iteration time is 40.1s, 119.3s, and 639.5s, giving 1.3×, 1.4×, and 2.1× speedups over the BF16+SP baseline. The paper also reports up to 2.15× training speedup, 1.84× inference speedup, and 45.7 FPS inference [2].

LongLive-2.0 takeaway: for long-video generation, the right SP unit is not just tokens. It is a temporal chunk responsible for clean history, noisy target, local VAE encoding with a small left overlap, Ulysses-order teacher-forcing mask construction, and target tokens.

5. Differences between understanding and generation systems

LongVILA and LongLive-2.0 both split work inside a sample, but they use different work assignments: multimodal tokens for understanding, and clean/noisy latent chunks for generation.

| Design aspect | LongVILA | LongLive-2.0 |
| --- | --- | --- |
| Task | Long-video understanding / VLM | Long-video generation / AR diffusion infrastructure |
| Sequence unit | Visual tokens plus text tokens | Video latent chunks |
| Main bottleneck | Vision encoder workload, LLM context length, attention communication | Clean/noisy layout, target-token imbalance, VAE latent preparation, DiT activation |
| Why naive SP fails | Text-only token sharding ignores where visual tokens come from | Concatenated clean/noisy sharding can leave some ranks target-heavy and others mostly clean-only |
| Core design | MM-SP: shard by frames/images, then by tokens | Balanced SP: each rank locally constructs matched clean/noisy latents from one temporal chunk |
| General lesson | Modality-aware work assignment | Objective-aware temporal work assignment |

Animated diagram comparing work assignment in LongVILA and LongLive-2.0
Figure 4. Generated animated diagram. The shared SP principle is to split inside a sample; the video-specific question is which meaningful unit of work each rank should handle.

The common abstraction is meaningful work assignment. A rank should not merely own a contiguous slice of a tensor. It should own a slice that makes the upstream encoder, the attention communication, the loss, and the hardware topology behave well together.

6. Design principles for long-video training systems

Principle 1: Shard the real bottleneck

Do not split the easiest tensor; split the work that actually limits scale. For LongVILA, that includes frame/image encoding. For LongLive-2.0, it includes VAE preparation and target-token distribution.

Principle 2: Keep the training objective the same

SP should not change temporal order, positions, attention visibility, loss masks, or which tokens are targets. Sharding should not change what the model is trained to predict.
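One concrete place where sharding can silently change the objective is loss reduction. The sketch below shows a reduction that matches the unsharded masked mean by combining per-rank sums and target counts, next to a mean-of-means that does not. The function names are hypothetical, and the loop over ranks stands in for an all-reduce.

```python
def sharded_masked_mean(local_losses, local_masks):
    """Global mean over target tokens from per-rank (loss, is_target) lists."""
    loss_sum = 0.0
    target_count = 0
    for losses, mask in zip(local_losses, local_masks):
        # Each rank contributes a partial sum and a count; a real system would
        # all-reduce these two numbers instead of looping over ranks.
        loss_sum += sum(loss for loss, is_target in zip(losses, mask) if is_target)
        target_count += sum(mask)
    return loss_sum / target_count


def mean_of_rank_means(local_losses, local_masks):
    """Pitfall: averaging per-rank means reweights tokens when counts differ."""
    means = []
    for losses, mask in zip(local_losses, local_masks):
        kept = [loss for loss, is_target in zip(losses, mask) if is_target]
        means.append(sum(kept) / len(kept))
    return sum(means) / len(means)


if __name__ == "__main__":
    losses = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    masks = [[False, False, True], [True, True, True]]
    print(sharded_masked_mean(losses, masks), mean_of_rank_means(losses, masks))
```

The two reductions agree only when every rank holds the same number of target tokens, which is one more reason a layout that balances targets across ranks is easier to get right.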

Principle 3: Match the hardware topology

Ring, Ulysses, 2D-Attention, USP, and LoongTrain differ mainly in how they communicate. A good video system decides deliberately where to use All-to-All and where to use P2P, and which traffic stays inside a node versus crossing nodes.

Principle 4: Start before the transformer

Video sequence construction begins before attention: frame loading, vision encoding, VAE encoding with a small overlap from the previous chunk, latent chunking, and mask construction. If SP starts only inside transformer blocks, imbalance may already be baked in.

Principle 5: Check what each rank handles

Across samples?                         -> DP / FSDP / ZeRO
Inside one long sample?                 -> SP / context parallelism
Mostly text and enough attention heads? -> Ulysses-style SP
Need many nodes or beyond head limits?  -> Ring / USP / 2D-Attention / LoongTrain-style SP
Heavy multimodal encoder work?          -> MM-SP-style two-stage sharding
Clean/noisy AR video streams?           -> Balanced-SP-style temporal work assignment

The final diagnostic is simple: after sharding, every rank should have meaningful work. If one rank handles all the loss, every rank re-encodes the same video, or communication ignores the hardware topology, the layout is probably wrong.

Closing: The temporal dimension is the new batch dimension

Long-video training breaks a basic assumption: one sample does not necessarily belong to one GPU. Once a sample becomes hundreds or thousands of frames, the system has to distribute work inside the sample itself.

This is where a generic “split the sequence” SP story stops being enough. LongVILA cannot treat the input as only a long token list; the split has to respect how frames/images become visual tokens. LongLive-2.0 cannot simply shard the concatenated clean/noisy sequence; the split has to preserve each temporal chunk's clean history, noisy target, small left overlap for VAE encoding, and target tokens.

In long-video training, the temporal dimension becomes the new batch dimension. Sequence parallelism is how we scale it.

That is why SP belongs in the infrastructure stack for long video. A beautiful long-video demo proves capability. A well-designed SP system makes that capability trainable.

References

  1. LongVILA: Scaling Long-Context Visual Language Models for Long Videos. Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han. arXiv preprint, 2024. arXiv:2408.10188
  2. LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation. Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han. arXiv preprint, 2026. arXiv:2605.18739
  3. Sequence Parallelism: Long Sequence Training from System Perspective. Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You. Annual Meeting of the Association for Computational Linguistics (ACL), 2023 / arXiv preprint, 2021. arXiv:2105.13120
  4. Reducing Activation Recomputation in Large Transformer Models. Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro. MLSys, 2023 / arXiv preprint, 2022. arXiv:2205.05198
  5. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He. arXiv preprint, 2023. arXiv:2309.14509
  6. Ring Attention with Blockwise Transformers for Near-Infinite Context. Hao Liu, Matei Zaharia, Pieter Abbeel. arXiv preprint, 2023. arXiv:2310.01889
  7. USP: A Unified Sequence Parallelism Approach for Long Context Generative AI. Jiarui Fang, Shangchun Zhao. arXiv preprint, 2024. arXiv:2405.07719
  8. LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism. Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu. arXiv preprint, 2024. arXiv:2406.18485
  9. Context Parallelism. NVIDIA Megatron Core Documentation. NVIDIA Docs
  10. Scaling RL to Long Videos. Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han. NeurIPS, 2025 / arXiv preprint, 2025. arXiv:2507.07966