Pushing Intelligence to 4-bit
Four-bit floating point (FP4) encodes each value as one of just sixteen 4-bit codes. Until recently that was usable only for storage; with NVIDIA’s NVFP4 format and Blackwell hardware, it now supports the full lifecycle of large models (training, inference, and even long-video generation) at close to 16-bit accuracy. This post explains how FP4 reaches that point across LLMs, diffusion, and video, and what remains unsolved.
The core difficulty is representational: those sixteen codes yield only fifteen distinct values (eight magnitudes, zero included), so accuracy depends entirely on how they are scaled. NVFP4 addresses this with fine-grained, two-level scaling that Blackwell executes natively, and the payoff now extends across the whole stack, from LLM weights and activations to the KV cache, attention, and full diffusion-based video generation.
FP4 has moved from a storage-only compression trick to a primitive for both inference and training.
NVFP4 gives Blackwell Tensor Cores a practical 4-bit path for weights, activations, the KV cache, and attention—across language, vision, and video models alike.
1. Why Four Bits Is Hard
Most people meet quantization through integers—INT8, INT4, weights packed into fewer bits with one scale factor. FP4 is different: still four bits, but with a floating-point shape (sign, exponent, mantissa). The E2M1 flavor has 1 sign, 2 exponent, 1 mantissa bit, which yields exactly fifteen distinct values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}.
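The fifteen values can be checked by decoding all sixteen bit patterns. The sketch below assumes the usual E2M1 layout (sign bit first, exponent bias of 1, subnormal when the exponent field is zero); the function name `decode_e2m1` is ours and does not come from any library.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code laid out as sign | 2-bit exponent | 1-bit mantissa."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal range: no implicit leading one, so only 0 and 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal range: implicit leading one, exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude

# +0 and -0 compare equal, so the set keeps only one of them.
values = sorted({decode_e2m1(code) for code in range(16)})
print(len(values), values)
```

Sixteen codes give fifteen values because the two zero encodings collapse into one. Note also how the spacing widens from 0.5 near zero to 2 between 4 and 6.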
Think of FP4 as a tiny ruler with only a handful of marks. A 4-bit code only picks which mark; something else has to decide where the ruler is placed over the real numbers. That “something” is the scale, and it is what the format’s accuracy ultimately depends on—whether the tensor is an LLM’s weights, a diffusion model’s activations, or a KV cache. Fifteen marks is almost nothing: if a tensor mixes tiny values, big outliers, and everything between, one global ruler wastes most of its marks on empty space.
The fix is to stop using one ruler for the whole tensor. MXFP4, the Open Compute Project’s microscaling format, gives every block of 32 values its own power-of-two scale (an E8M0 exponent) [2], so one outlier only distorts its local block. The catch is that a power-of-two dial is coarse: if the ideal scale sits between two powers of two, MXFP4 must round it, and the tiny 4-bit budget pays for the error. NVFP4 targets exactly this weakness.
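To make the cost of a power-of-two dial concrete, here is a minimal simulation of block quantization. It is a sketch under stated assumptions: the scale is rounded up to the next power of two so that nothing saturates (real implementations may pick the exponent differently), the block is shortened to four values for readability, and all helper names are hypothetical.

```python
import math

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def round_to_e2m1(x):
    """Nearest E2M1 value, saturating at +/-6 (ties go to the smaller magnitude)."""
    mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) - m))
    return math.copysign(mag, x)

def fake_quantize(block, scale):
    """Quantize to E2M1 with the given scale, then dequantize."""
    return [round_to_e2m1(v / scale) * scale for v in block]

def power_of_two_scale(block):
    """Assumption: round the ideal scale (amax / 6) up to a power of two."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / 6.0))

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

block = [0.3, -1.1, 2.2, 6.6]
ideal_scale = max(abs(v) for v in block) / 6.0   # about 1.1
coarse_scale = power_of_two_scale(block)         # forced up to 2.0
ideal_mse = mse(block, fake_quantize(block, ideal_scale))
coarse_mse = mse(block, fake_quantize(block, coarse_scale))
print(coarse_scale, ideal_mse, coarse_mse)
```

With the scale forced from about 1.1 up to 2, the largest value no longer lands on the top mark and every entry in the block is rounded more coarsely, which is the error the paragraph above describes.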
2. NVFP4: A Smarter Ruler
NVFP4 keeps the recipe (4-bit E2M1 values plus block scaling) but refines how the scale is chosen. It uses one shared FP8 (E4M3) scale per 16-value block, plus a higher-level FP32 per-tensor scale [1]. Three differences from MXFP4 matter:
- Smaller blocks (16 vs 32): one outlier contaminates half as many values.
- A finer dial (E4M3 with mantissa bits, vs power-of-two E8M0): the ruler lands where the data actually is. NVIDIA reports ~88% lower quantization error than power-of-two scaling [1].
- Two levels: the FP32 per-tensor scale normalizes the tensor so each block’s E4M3 scale fits cleanly into FP8—global range up top, local fit per block.
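The three points above can be put together in a small quantize-then-dequantize simulation. This is a sketch, not NVIDIA’s implementation: it assumes the per-tensor scale maps the tensor’s largest magnitude to 6 × 448 (the E2M1 maximum times the E4M3 maximum) so that every block scale fits in E4M3, and the helper names are hypothetical.

```python
import math

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def round_to_e2m1(x):
    """Nearest E2M1 value, saturating at +/-6 (ties go to the smaller magnitude)."""
    mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) - m))
    return math.copysign(mag, x)

def round_to_e4m3(x):
    """Nearest non-negative FP8 E4M3 value: 3 mantissa bits, maximum 448."""
    if x <= 0.0:
        return 0.0
    x = min(x, 448.0)
    exponent = max(math.floor(math.log2(x)), -6)  # below 2**-6 the format is subnormal
    step = 2.0 ** (exponent - 3)
    return min(round(x / step) * step, 448.0)

def quantize_nvfp4(tensor, block_size=16):
    """Quantize then dequantize a flat list of floats (simulation only)."""
    tensor_amax = max(abs(v) for v in tensor)
    if tensor_amax == 0.0:
        return [0.0] * len(tensor)
    # Assumption: the FP32 per-tensor scale maps the largest value to 6 * 448.
    tensor_scale = tensor_amax / (6.0 * 448.0)
    out = []
    for start in range(0, len(tensor), block_size):
        block = tensor[start:start + block_size]
        block_amax = max(abs(v) for v in block)
        block_scale = round_to_e4m3(block_amax / 6.0 / tensor_scale)
        scale = block_scale * tensor_scale
        out.extend(round_to_e2m1(v / scale) * scale if scale else 0.0 for v in block)
    return out

small_block = [0.01 * i for i in range(16)]   # values up to 0.15
large_block = [1.0 * i for i in range(16)]    # values up to 15
restored = quantize_nvfp4(small_block + large_block)
worst_small_error = max(abs(a - b) for a, b in zip(small_block, restored[:16]))
print(worst_small_error)
```

The small block keeps its own resolution even though the tensor also contains values a hundred times larger; with a single tensor-wide scale, every entry of the small block would round to zero.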
| Format | Mental model | Scaling | Tradeoff |
|---|---|---|---|
| FP4 E2M1 | A 15-mark codebook. | Needs external scaling. | Only 15 values. |
| MXFP4 | A power-of-two dial per 32 values. | E8M0 / 32-value block. | Simple, but coarse. |
| NVFP4 | A finer dial per 16 values + a global trim. | E4M3 / 16-value block + FP32 tensor scale. | Better accuracy; needs Blackwell. ~4.5 bits/value vs MXFP4’s ~4.25. |
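The bits-per-value figures in the table follow from amortizing each block’s 8-bit scale over the block. A short check, ignoring the single FP32 per-tensor scale because it is shared by the whole tensor (the function name is ours):

```python
def bits_per_value(value_bits, scale_bits, block_size):
    """Average storage cost once each block's shared scale is amortized."""
    return value_bits + scale_bits / block_size

mxfp4_bits = bits_per_value(4, 8, 32)   # E8M0 scale per 32 values
nvfp4_bits = bits_per_value(4, 8, 16)   # E4M3 scale per 16 values
print(mxfp4_bits, nvfp4_bits)           # 4.25 4.5
```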
This post follows FP4 up the stack—from LLMs to diffusion to video—and into the two places it is hardest: the KV cache and attention. For the long-video parallelism and KV-cache infrastructure that sits alongside this story, see our companion posts on scaling video training with parallelism and KV cache compression.
3. LLMs with FP4
FP4 was first applied at scale to LLM inference, which is bottlenecked by exactly the operations low precision most helps: moving weights from memory, multiplying weights by activations, and storing a KV cache that grows with context. The most demanding target is W4A4, with both weights and activations in four bits, because quantizing both is what accelerates the arithmetic, not merely the storage. That arithmetic is dominated by matrix multiplication, which in W4A4 runs on 4-bit operands on Blackwell Tensor Cores.
Those four-bit GEMMs are not free to feed, though. Every activation has to be quantized to NVFP4 just before the multiply, and if that cast runs as its own kernel it adds an extra round-trip through HBM: an op-overhead tax that can quietly erode the nominal 4× ceiling, especially on smaller, memory-bound layers. The standard fix is kernel fusion: fold the activation quantization into the GEMM’s prologue, or into the epilogue of the preceding normalization kernel, so the cast happens inline, in registers, with no extra pass over memory. Production NVFP4 stacks lean on exactly these fused kernels: NVIDIA’s TransformerEngine supplies the fused quantize-and-GEMM path, and the TensorRT Model Optimizer supplies the NVFP4 quantization recipes [9][10].
Does the accuracy survive? Largely, yes. Post-training-quantizing DeepSeek-R1 to NVFP4 stays within about 1% of FP8 across reasoning and knowledge benchmarks (MMLU-Pro 85→84, GPQA 81→80, AIME 2024 actually up 89→91) [1]. And NVFP4 consistently beats MXFP4: in a head-to-head pretraining run, MXFP4 needed 36% more tokens to reach the same loss as NVFP4 [12][3]. This accuracy headroom is what lets the format’s efficiency gains be realized without retraining.
This is no longer a lab result. NVFP4 inference ships in TensorRT-LLM, the quantization recipes ship in the TensorRT Model Optimizer, and NVFP4 checkpoints of frontier models (including DeepSeek-R1) are published for Blackwell deployment [1][11]. Newer designs go further and build FP4 into the architecture itself: DeepSeek-V4—a 1.6-trillion-parameter Mixture-of-Experts model—trains its expert weights and its sparse-attention indexer directly in FP4 with quantization-aware training, keeping the remaining components in FP8. The largest part of the model is therefore stored and computed in four bits by design, rather than quantized after the fact [15]. OpenAI’s open-weight GPT-OSS models make the same choice with a different format: their MoE experts—over 90% of the parameters—are trained quantization-aware in MXFP4, which is what lets the 120B model run on a single 80 GB GPU [16]. The contrast is instructive: both bake 4-bit into the model through QAT, but DeepSeek-V4 uses NVFP4’s finer 16-value blocks while GPT-OSS uses MXFP4’s simpler 32-value blocks—the same tradeoff from Section 2, now decided inside frontier models.
4. Video Generation with FP4
Diffusion and video generation are a more recent target, and the same W4A4 approach applies: a diffusion transformer (DiT) is largely the same matrix multiplies, repeated across many denoising steps. Quantization works here too: methods like ViDiT-Q push DiTs to W8A8 and W4A8 with negligible visual loss using custom kernels [14]. The aggressive step is to go all the way to W4A4 NVFP4, end to end.
This direction builds on our earlier Blackwell work. In February 2025, our RTX 5090 setup report documented early consumer-GPU access to native FP4, and our SVDQuant + NVFP4 demo then showed FLUX running 4× smaller and 3× faster than BF16 with near-16-bit quality—and better image quality than INT4 [23][24]. LongLive-2.0 extends that inference-first line into an end-to-end long-video system spanning training, W4A4 execution, the KV cache, and attention.
LongLive-2.0 is the clearest example: an autoregressive (AR) long-video model built on Wan2.2-TI2V-5B that runs NVFP4 in both training and inference, to our knowledge the first end-to-end NVFP4 recipe for long video generation [5][8]. The result is a 5B model that generates minute-long, 720p video in real time.
Training: one NVFP4 pipeline, two stages
Quality survives the drop to four bits because the model is trained NVFP4-aware rather than quantized after the fact, and that training runs in two NVFP4 stages. First, the bidirectional base model is fine-tuned into a chunk-level AR generator with clean-context teacher forcing: each chunk is denoised conditioned on the clean history before it. To fit the long sequences, balanced sequence parallelism (SP) shards the paired clean and noisy chunks across GPUs so the loss-bearing work stays even, while NVFP4 accelerates the GEMM-heavy DiT. Together they cut the iteration time of 64-second AR training by up to 2.1× relative to BF16 with SP; plain BF16 without SP runs out of memory.
Second, to reach real-time speed, the model is distilled to a few denoising steps with distribution-matching distillation (DMD)—run, again, in four bits. DMD co-locates three networks on each GPU: a generator, a real-score model, and a fake-score model. All three share a frozen W4A4 NVFP4 backbone and train only small LoRA adapters—an idea borrowed from QeRL, which showed that pairing an NVFP4 backbone with LoRA makes even reinforcement learning cheap (its quantization noise can even aid exploration) [21]. An adaptive “4-or-6” scale search picks the lower-error magnitude per block, and because the distillation is single-stage—no ODE initialization or progressive long-tuning—the trained LoRA simply plugs into the 4-step AR model and halves it to 2 steps with no further training. Quantizing the three branches one after another walks DMD peak memory from 70.5 GB to 49.0 GB per GPU (0.69×).
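One plausible reading of the “4-or-6” search is sketched below: for each block, try mapping the largest magnitude to either 4 or 6 on the E2M1 grid, quantize both ways, and keep whichever reconstructs the block with lower squared error. This is an illustration of the idea, not LongLive-2.0’s code; it skips the E4M3 rounding of the block scale, and every name in it is hypothetical.

```python
import math

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def round_to_e2m1(x):
    """Nearest E2M1 value, saturating at +/-6 (ties go to the smaller magnitude)."""
    mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) - m))
    return math.copysign(mag, x)

def fake_quantize(block, target_max):
    """Scale so the block's largest magnitude lands on target_max, then round."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / target_max
    return [round_to_e2m1(v / scale) * scale for v in block]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def four_or_six(block):
    """Return (chosen target, dequantized block) with the lower squared error."""
    candidates = {target: fake_quantize(block, target) for target in (4.0, 6.0)}
    best = min(candidates, key=lambda t: squared_error(block, candidates[t]))
    return best, candidates[best]

target_a, _ = four_or_six([2.0, 1.5, 1.0, 0.5])    # evenly spaced values
target_b, _ = four_or_six([3.0, 1.5, 0.75, 0.25])  # values that keep halving
print(target_a, target_b)
```

Mapping the maximum to 6 uses the full range but leaves no mark between two thirds of the maximum and the maximum itself; mapping it to 4 gives up the top code in exchange for a mark at three quarters. Which one wins depends on where the block’s values sit, so the choice is made block by block.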
Inference: W4A4, an NVFP4 KV cache, and overlapped decode
At deployment the generator runs in W4A4 NVFP4, the KV cache is stored in NVFP4, and VAE decoding is overlapped with denoising on a separate GPU so it never extends the critical path. Stacked, these carry the 5B model to 45.7 FPS at 720p. The two sides reinforce each other: the same NVFP4 recipe that makes inference fast is what made the long-video fine-tune affordable to train in the first place.
The NVFP4 KV cache is what lets the model retain a minute of generated history within a fixed memory budget. Because it raises a distinct set of problems, we treat it on its own.
5. KV Cache Quantization with FP4
In any autoregressive model, the keys and values of past tokens are the model’s memory—and that memory grows linearly with length until it dominates everything. For LLMs this is the long-context wall: KVQuant showed that quantizing the cache to ~3 bits preserves accuracy well enough to reach 10-million-token contexts [13]. AR video has the exact same problem, only heavier—each generated chunk becomes history that later chunks attend to, and video tokens are far larger than text tokens. Quant VideoGen, for one, pushes the cache to just 2 bits for autoregressive video diffusion—up to 7× smaller with under 4% latency overhead—via semantic-aware smoothing and progressive residual quantization [22].
K-smoothing is a shared ingredient, not a point of contrast. Keys often contain offsets and outliers that waste the limited resolution of a low-bit codebook. Smoothing or centering K before quantization tightens its effective range, so more of those scarce levels represent useful variation. Quant VideoGen uses semantic-aware smoothing; LongLive-2.0 subtracts each key vector’s channel mean before NVFP4 micro-block quantization; and SageAttention3 explicitly inherits K-smoothing from SageAttention in its FP4 attention kernel. The exact recipes differ, but the principle is broadly effective: smooth first, then quantize [22][5][7].
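Why centering K is safe can be shown in a few lines. The sketch follows the SageAttention-style formulation, in which the mean is taken across tokens for each channel; this is an assumption on our part, and the exact recipes in Quant VideoGen and LongLive-2.0 may differ. Subtracting the same vector from every key shifts all of a query’s scores by one constant, which softmax ignores.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def smooth_keys(keys):
    """Subtract the per-channel mean (taken across tokens) from every key."""
    count = len(keys)
    mean = [sum(channel) / count for channel in zip(*keys)]
    return [[k - m for k, m in zip(row, mean)] for row in keys]

def attention_probs(query, keys):
    return softmax([sum(q * k for q, k in zip(query, row)) for row in keys])

# Three keys with a large shared offset in the first two channels.
keys = [[10.2, -3.9, 0.4], [9.8, -4.1, -0.2], [10.1, -4.0, 0.1]]
query = [0.5, -1.0, 2.0]
smoothed = smooth_keys(keys)

range_before = max(abs(k) for row in keys for k in row)
range_after = max(abs(k) for row in smoothed for k in row)
probs_before = attention_probs(query, keys)
probs_after = attention_probs(query, smoothed)
print(range_before, range_after)
```

The attention weights are unchanged, while the range the quantizer has to cover shrinks from about 10 to about 0.3, so far more of the fifteen levels land on real variation between keys.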
Quantizing the cache to NVFP4 attacks this directly. NVIDIA reports the NVFP4 KV cache cutting footprint up to ~50% versus FP8 with under 1% accuracy loss, and beating MXFP4 KV cache by ~5% thanks to the finer block scaling [4]. The deeper advantage is hardware: NVFP4 dequantizes along Blackwell’s native FP4→FP8 datapath, whereas generic INT4/INT2 KV caches have no such datapath and must dequantize in software [1].
Quant VideoGen and NVFP4 therefore target different points on the quality–memory frontier. This is not an apples-to-apples benchmark, but the design tradeoff is clear: INT2 is capacity-first, using an extremely compact code and progressive residual refinement to maximize compression; NVFP4 is quality-first, using twice the bit width and floating-point dynamic range to preserve more quality headroom when small cache errors compound over a long video. NVFP4 also keeps K and V in the same Blackwell-native format used by SageAttention3, so an NVFP4 KV cache can feed an NVFP4 attention path without introducing an INT2-to-FP4 format boundary [22][7].
Dequantization is the catch. KV is generated autoregressively, so the cache is re-read in full on every single decode step. If it is stored quantized, it must be dequantized again and again in a tight per-step loop: a recurring, bandwidth-bound tax rather than a one-time conversion. A naive implementation does this one cached chunk per kernel launch, and the launch latency stacks up. LongLive-2.0’s answer is a custom parallel dequantization kernel that rebuilds every in-window chunk in a single launch, keeping total dequant overhead under 2% [5][6].
6. FP4 Attention
With weights, activations, and the KV cache in four bits, the remaining component is attention itself—the two matrix multiplies that turn queries and keys into scores, and scores into a weighted sum of values. Attention has followed the same precision curve as the rest of the model, one format at a time.
FlashAttention-2 set the modern baseline by computing exact attention in FP16/BF16 [17]. FlashAttention-3 then added an FP8 path on Hopper—running both matmuls in 8-bit and reaching roughly 1.2 PFLOP/s on an H100 [18]. The SageAttention line went lower still, using integers: the original quantized the query–key score matmul to INT8 (keeping the probability×value matmul in FP16) for about 2.1× over FlashAttention-2 [19], and SageAttention2 took queries and keys to INT4 with the probabilities and values in FP8 for about 3× [20]. SageAttention3 is the most recent step: both matmuls in NVFP4 on Blackwell.
Pushing attention to four bits is delicate, and one tensor is the reason. Q, K, and V are roughly zero-centered with a wide range, so ordinary block-wise FP4 handles them. The softmax map P is the hard one: after softmax its values live in [0, 1], crammed near zero, so a naive 4-bit scale wastes almost all of its range. SageAttention3 solves this with a two-level trick—first stretch each row of P by a per-token FP32 factor so it fills the representable range, then quantize. Crucially, both matmuls run in NVFP4: P’s values end up in 4-bit (only its block scale is FP8) [7].
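The effect of the stretch can be simulated on a single row of P. The sketch below is ours, not SageAttention3’s kernel: it assumes the per-token factor maps the row maximum to 448 × 6 before 16-value microscaling, and it models the FP8 block scale with a simple E4M3 rounding helper. All function names are hypothetical.

```python
import math

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def round_to_e2m1(x):
    """Nearest E2M1 value, saturating at +/-6 (ties go to the smaller magnitude)."""
    mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) - m))
    return math.copysign(mag, x)

def round_to_e4m3(x):
    """Nearest non-negative FP8 E4M3 value: 3 mantissa bits, maximum 448."""
    if x <= 0.0:
        return 0.0
    x = min(x, 448.0)
    exponent = max(math.floor(math.log2(x)), -6)
    step = 2.0 ** (exponent - 3)
    return min(round(x / step) * step, 448.0)

def microscale(values, block_size=16):
    """One E4M3 scale per block, E2M1 values; returns the dequantized result."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = round_to_e4m3(max(block) / 6.0)
        out.extend(round_to_e2m1(v / scale) * scale if scale else 0.0 for v in block)
    return out

def two_level(row, block_size=16):
    """Assumption: stretch the row so its maximum is 448 * 6, then microscale."""
    row_scale = max(row) / (448.0 * 6.0)   # per-token factor, kept in FP32
    stretched = [v / row_scale for v in row]
    return [v * row_scale for v in microscale(stretched, block_size)]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# One row of P: a dominant weight, then a second block of very small weights.
raw = [0.97] + [0.001] * 15 + [0.0002 * (1 + i % 4) for i in range(16)]
row = [v / sum(raw) for v in raw]

direct = microscale(row)
rescaled = two_level(row)
print(squared_error(row, direct), squared_error(row, rescaled))
```

Without the stretch, the second block’s ideal scale is far below what E4M3 can represent, so it rounds to zero and the whole block is lost; the first block’s scale is also rounded enough to clip the dominant weight. After stretching, the block scales fall in the range E4M3 resolves well. In both variants the tiny weights that share a block with the dominant one still round to zero, a limit of block scaling that the stretch does not remove.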
The result is substantial: 1038 TOPS on an RTX 5090, roughly 5× the fastest FlashAttention available there, and about 2.4–3× end-to-end on video diffusion—with the two-level stretch raising P’s cosine similarity from 93.3% to 99.5%, and NVFP4 again ahead of MXFP4 [7]. It is not free: FP4 attention still incurs more accuracy risk than FP4 GEMMs, which is why it remains an active research area rather than a settled one.
Closing: FP4 Is The Future
Historically, 4-bit precision implied a large accuracy penalty. On Blackwell, NVFP4 reduces that penalty to roughly 1% on many tasks while delivering substantial speedups, and the approach now spans the landscape: LLMs (DeepSeek-R1), diffusion and image generation (FLUX), and autoregressive video (LongLive-2.0). The recurring lesson is that FP4 works only when the format, the kernels, the cache, and attention are co-designed—fine-grained scaling, fused kernels, hardware-accelerated dequantization, and the dedicated handling that the softmax map requires.
It is also far from finished. The open problems are where the next round of speed and quality will come from:
- Better scales, lower quantization error. Smarter per-block scale search, rotations / Hadamard transforms, and outlier handling to squeeze more signal out of fifteen marks.
- KV-cache dequantization efficiency. The autoregressive loop-dequant tax is real; faster fused dequant (and storing more of attention in low precision) is wide open.
- FP4 attention quality. The accuracy loss is still higher than for FP4 GEMMs; closing that gap would put the entire model on a 4-bit path.
- QAT vs PTQ. PTQ is cheap but leaves accuracy on the table; quantization-aware (or fully quantized) training recovers it but costs compute. Narrowing that gap—cheap recipes with QAT-level quality—may matter most of all.
Four-bit precision is becoming a practical default for large-model inference, and increasingly for training.
Realizing it fully is as much a systems problem—formats, kernels, and memory—as a modeling one.
References
1. Introducing NVFP4 for Efficient and Accurate Low-Precision Inference. Eduardo Alvarez, Omri Almog, Eric Chung, Simon Layton, Dusan Stosic, Ronny Krashinsky, Kyle Aubrey. NVIDIA Technical Blog, 2025.
2. OCP Microscaling Formats (MX) Specification. Open Compute Project, 2023.
3. NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit. Kirthi Devleker and Farshad Ghodsian. NVIDIA Technical Blog, 2025.
4. Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache. Eduardo Alvarez, Wei-Ming Chen, Huizi Mao. NVIDIA Technical Blog, 2025.
5. LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation. Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han. arXiv, 2026. arXiv:2605.18739 and project page; code at github.com/NVlabs/LongLive
6. LongLive2.0 Documentation. NVlabs, 2026.
7. SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training. Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jianfei Chen, Jun Zhu. arXiv, 2025. arXiv:2505.11594
8. Wan2.2: Open and Advanced Large-Scale Video Generative Models (Wan2.2-TI2V-5B base model). Wan Team, 2025. Wan2.2 repository
9. NVIDIA TransformerEngine. NVIDIA, 2024. Library of fused low-precision (FP8/FP4) quantization-and-GEMM kernels for Hopper and Blackwell GPUs. github.com/NVIDIA/TransformerEngine
10. NVIDIA TensorRT Model Optimizer. NVIDIA, 2024. Quantization toolkit with NVFP4 post-training and quantization-aware recipes for efficient deployment. github.com/NVIDIA/TensorRT-Model-Optimizer
11. 3 Ways NVFP4 Accelerates AI Training and Inference. NVIDIA Technical Blog, 2025.
12. Pretraining LLMs with NVFP4. NVIDIA. arXiv, 2025. arXiv:2509.25149
13. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami. NeurIPS, 2024. arXiv:2401.18079
14. ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation. Tianchen Zhao et al. arXiv, 2024. arXiv:2406.02540
15. DeepSeek-V4. DeepSeek-AI, 2026. A 1.6T-parameter Mixture-of-Experts model that applies FP4 quantization-aware training to its expert weights and sparse-attention indexer (FP8 elsewhere), targeting NVIDIA Blackwell. DeepSeek-V4-Pro model card
16. OpenAI gpt-oss (gpt-oss-120b, gpt-oss-20b). OpenAI, 2025. Open-weight Mixture-of-Experts models whose MoE expert weights (over 90% of parameters) are quantization-aware-trained in MXFP4, enabling the 120B model to run on a single 80 GB GPU. gpt-oss-120b model card
17. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Tri Dao. arXiv, 2023. arXiv:2307.08691
18. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. arXiv, 2024. arXiv:2407.08608
19. SageAttention: Accurate 8-Bit Attention for Plug-and-Play Inference Acceleration. Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, Jianfei Chen. arXiv, 2024. arXiv:2410.02367
20. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-Thread INT4 Quantization. Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen. arXiv, 2024. arXiv:2411.10958
21. QeRL: Beyond Efficiency — Quantization-enhanced Reinforcement Learning for LLMs. Huang et al. arXiv, 2025. Pairs an NVFP4-quantized backbone with LoRA for efficient RL training of LLMs. arXiv:2510.11696
22. Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization. Haocheng Xi et al. ICML, 2026. Training-free 2-bit KV-cache quantization for AR video diffusion using semantic-aware smoothing and progressive residual quantization (up to 7× smaller, <4% latency overhead). arXiv:2602.02958
23. RTX 5090 Workstation Configuration Journey. Qinghao Hu, Jiaming Tang, Yujun Lin, Zhuoyang Zhang, Zhekai Zhang, Shang Yang, Song Han. MIT Han Lab Blog, February 2025.
24. SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs. Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han. MIT Han Lab Blog, February 2025.