Pushing Intelligence to 4-bit

📝 Wei Huang, Yukang Chen, Weian Mao, Luozhou Wang, Shuai Yang, Song Han
📅 June 29, 2026 ⏳ 16 min read

Four-bit floating point (FP4) encodes each value with one of just sixteen bit patterns. Until recently that was usable only for storage; with NVIDIA’s NVFP4 format and Blackwell hardware, it now supports the full lifecycle of large models (training, inference, and even long-video generation) at close to 16-bit accuracy. This post explains how FP4 reaches that point across LLMs, diffusion, and video, and what remains unsolved.

The core difficulty is representational: four bits afford only fifteen distinct values (eight magnitudes, counting zero), so accuracy depends entirely on how those values are scaled. NVFP4 addresses this with fine-grained, two-level scaling that Blackwell executes natively, and the payoff now extends across the whole stack, from LLM weights and activations to the KV cache, attention, and full diffusion-based video generation.

FP4 has moved from a storage-only compression trick to a primitive for both inference and training.
NVFP4 gives Blackwell Tensor Cores a practical 4-bit path for weights, activations, the KV cache, and attention—across language, vision, and video models alike.

In this post
  1. Why four bits is hard
  2. NVFP4: a smarter ruler
  3. LLMs with FP4
  4. Video generation with FP4
  5. KV cache quantization with FP4
  6. FP4 attention

1. Why Four Bits Is Hard

Most people meet quantization through integers—INT8, INT4, weights packed into fewer bits with one scale factor. FP4 is different: still four bits, but with a floating-point shape (sign, exponent, mantissa). The E2M1 flavor has 1 sign, 2 exponent, 1 mantissa bit, which yields exactly fifteen distinct values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}.
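The fifteen values can be checked by decoding all sixteen bit patterns. The sketch below assumes the usual E2M1 layout (one sign bit, two exponent bits with a bias of 1, one mantissa bit); the function name `decode_e2m1` is only illustrative.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the only nonzero value is 0.5.
        return sign * 0.5 * man
    # Normal: implicit leading 1 and an exponent bias of 1.
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# Decode every possible 4-bit pattern and keep the distinct values.
values = sorted({decode_e2m1(code) for code in range(16)})
print(values)
print(len(values))
```

Sixteen codes give fifteen values because one code is negative zero, which compares equal to positive zero.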

Think of FP4 as a tiny ruler with only a handful of marks. A 4-bit code only picks which mark; something else has to decide where the ruler is placed over the real numbers. That “something” is the scale, and it is what the format’s accuracy ultimately depends on—whether the tensor is an LLM’s weights, a diffusion model’s activations, or a KV cache. Fifteen marks is almost nothing: if a tensor mixes tiny values, big outliers, and everything between, one global ruler wastes most of its marks on empty space.

The fix is to stop using one ruler for the whole tensor. MXFP4, the Open Compute Project’s microscaling format, gives every block of 32 values its own power-of-two scale (an E8M0 exponent) [2], so one outlier only distorts its local block. The catch is that a power-of-two dial is coarse: if the ideal scale sits between two powers of two, MXFP4 must round it, and the tiny 4-bit budget pays for the error. NVFP4 sharpens exactly this—and NVIDIA’s own diagram says it best:

NVIDIA diagram comparing MXFP4's coarse 32-value power-of-two block scaling against NVFP4's finer 16-value blocks each with a dynamically computed FP8 scale
Figure 1. MXFP4 puts one coarse power-of-two scale on every 32 values; NVFP4 puts a finer, dynamically computed scale on every 16. Smaller blocks and a sharper dial fit awkward distributions more tightly. Source: NVIDIA, “Introducing NVFP4 for Efficient and Accurate Low-Precision Inference,” NVIDIA Technical Blog, 2025 [1].
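To make the cost of a power-of-two dial concrete, here is a minimal sketch that quantizes one 32-value block twice: once with a scale rounded up to a power of two, and once with the unrounded fractional scale that a finer dial can approximate. The rounding rule (round up so the maximum never clips), the test data, and all function names are assumptions for illustration; real MXFP4 implementations differ in how they round the shared exponent.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes


def to_fp4(x: float) -> float:
    """Round to the nearest E2M1 value, saturating at +/-6."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)


def quantize_block(block, scale):
    """Quantize then dequantize one block with a single shared scale."""
    return [to_fp4(v / scale) * scale for v in block]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# 31 small values in [-0.3, 0.3] plus one outlier: a typical awkward block.
block = [0.02 * i for i in range(-15, 16)] + [1.6]
amax = max(abs(v) for v in block)

ideal_scale = amax / 6.0                                # maps the outlier onto 6
pow2_scale = 2.0 ** math.ceil(math.log2(ideal_scale))   # coarse power-of-two dial

err_ideal = mse(block, quantize_block(block, ideal_scale))
err_pow2 = mse(block, quantize_block(block, pow2_scale))
print(f"fractional scale {ideal_scale:.4f}: mse {err_ideal:.5f}")
print(f"power-of-two scale {pow2_scale:.4f}: mse {err_pow2:.5f}")
```

With this block the power-of-two scale nearly doubles the step size near zero, where most of the values sit, so its error comes out several times larger. Other distributions can behave differently.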

2. NVFP4: A Smarter Ruler

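NVFP4 keeps the recipe (4-bit E2M1 values plus block scaling) but sharpens both knobs. It uses one shared FP8 (E4M3) scale per 16-value block, plus a higher-level FP32 per-tensor scale [1]. Two changes versus MXFP4 are decisive: the block shrinks from 32 values to 16, so an outlier distorts fewer neighbors, and the block scale moves from a power-of-two E8M0 exponent to an FP8 (E4M3) value that can land between powers of two.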

NVIDIA diagram of NVFP4's two-level scaling: a 4-bit E2M1 element, groups of 16 values each sharing an FP8 E4M3 block scale, and a global FP32 per-tensor scale
Figure 2. NVFP4’s two-level structure: 4-bit E2M1 elements, an FP8 (E4M3) scale shared across each 16-value micro-block, and a global FP32 per-tensor scale. Source: NVIDIA, “Introducing NVFP4 for Efficient and Accurate Low-Precision Inference,” NVIDIA Technical Blog, 2025 [1].
Format | Mental model | Scaling | Tradeoff
FP4 E2M1 | A 15-mark codebook. | Needs external scaling. | Only 15 values.
MXFP4 | A power-of-two dial per 32 values. | E8M0 / 32-value block. | Simple, but coarse.
NVFP4 | A finer dial per 16 values + a global trim. | E4M3 / 16-value block + FP32 tensor scale. | Better accuracy; needs Blackwell. ~4.5 bits/value vs MXFP4’s ~4.25.
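The two-level structure and the bits-per-value figures can be sketched in a few lines. This is a simplified model, not NVIDIA's implementation: it assumes the per-tensor scale is chosen so that the largest block scale lands on the E4M3 maximum of 448, it emulates E4M3 rounding in software, and every function name is hypothetical.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
FP4_MAX, E4M3_MAX = 6.0, 448.0


def to_fp4(x):
    """Round to the nearest E2M1 value, saturating at +/-6."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)


def to_e4m3(x):
    """Round a non-negative float to the nearest FP8 E4M3 value (max 448)."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    exp = max(math.floor(math.log2(x)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)                  # 3 mantissa bits
    return min(round(x / step) * step, E4M3_MAX)


def nvfp4_roundtrip(tensor, block=16):
    """Quantize and dequantize with an FP32 tensor scale and FP8 block scales."""
    amax = max(abs(v) for v in tensor)
    if amax == 0.0:
        return [0.0] * len(tensor)
    tensor_scale = amax / (FP4_MAX * E4M3_MAX)  # assumed choice of global scale
    out = []
    for i in range(0, len(tensor), block):
        chunk = tensor[i:i + block]
        block_amax = max(abs(v) for v in chunk)
        block_scale = to_e4m3(block_amax / FP4_MAX / tensor_scale)
        step = block_scale * tensor_scale
        out.extend(to_fp4(v / step) * step if step > 0 else 0.0 for v in chunk)
    return out


def bits_per_value(elem_bits, scale_bits, block):
    """Storage cost per element, ignoring the single per-tensor scale."""
    return elem_bits + scale_bits / block


tensor = [0.01 * i for i in range(-16, 16)]
restored = nvfp4_roundtrip(tensor)
print(bits_per_value(4, 8, 16))  # NVFP4: one FP8 scale per 16 values
print(bits_per_value(4, 8, 32))  # MXFP4: one E8M0 scale per 32 values
```

The extra quarter bit per value is the price of halving the block size; the per-tensor FP32 scale is amortized over the whole tensor and does not change the figure meaningfully.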

This post follows FP4 up the stack—from LLMs to diffusion to video—and into the two places it is hardest: the KV cache and attention. For the long-video parallelism and KV-cache infrastructure that sits alongside this story, see our companion posts on scaling video training with parallelism and KV cache compression.

3. LLMs with FP4

FP4 was first applied at scale to LLM inference, which is bottlenecked by exactly the operations low precision most helps: moving weights from memory, multiplying weights by activations, and storing a KV cache that grows with context. The most demanding target is W4A4—both weights and activations in four bits—because quantizing both is what accelerates the arithmetic, not merely the storage. That arithmetic is dominated by matrix multiplication, which in W4A4 executes entirely in 4-bit on Blackwell Tensor Cores:

Diagram of a W4A4 NVFP4 matrix multiply on a Blackwell Tensor Core: weights W and activations X are both NVFP4 4-bit, the output Y is accumulated, and memory traffic per element is one quarter of BF16
Figure 3. W4A4 NVFP4 inference: both weights and activations are 4-bit, so the dominant matrix multiplies run directly on Blackwell Tensor Cores in low precision — up to a 4× throughput ceiling over BF16, with far less memory traffic [1][5].
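What makes W4A4 fast is that the inner products themselves involve only 4-bit values; the block scales are applied once per block pair. The sketch below illustrates that factoring for a single dot product. It is a software model under simplifying assumptions (unrounded block scales, no per-tensor scale, hypothetical function names), not a description of the Tensor Core datapath.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes


def to_fp4(x: float) -> float:
    """Round to the nearest E2M1 value, saturating at +/-6."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)


def quantize_blocks(vec, block=16):
    """Return one (scale, fp4_values) pair per block; the block max maps to 6."""
    out = []
    for i in range(0, len(vec), block):
        chunk = vec[i:i + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / 6.0 if amax > 0 else 1.0
        out.append((scale, [to_fp4(v / scale) for v in chunk]))
    return out


def w4a4_dot(w, x, block=16):
    """Dot product with both operands quantized to 4-bit blocks."""
    total = 0.0
    for (sw, wq), (sx, xq) in zip(quantize_blocks(w, block), quantize_blocks(x, block)):
        partial = sum(a * b for a, b in zip(wq, xq))  # products of 4-bit values only
        total += partial * sw * sx                     # one scale product per block pair
    return total


w = [math.sin(0.37 * i) for i in range(64)]
x = [math.cos(0.11 * i) for i in range(64)]
exact = sum(a * b for a, b in zip(w, x))
approx = w4a4_dot(w, x)
print(exact, approx)
```

Accumulating the scaled partial sums in higher precision, as done here with Python floats, mirrors the "output accumulated" step in the figure.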

Those four-bit GEMMs are not free to feed, though. Every activation has to be quantized to NVFP4 just before the multiply, and if that cast runs as its own kernel it adds an extra round-trip through HBM: an op-overhead tax that can quietly erode the 4× ceiling, especially on smaller, memory-bound layers. The standard fix is kernel fusion: fold the activation quantization into the GEMM’s prologue, or into the epilogue of the kernel that produces the activation (typically the preceding normalization), so the cast happens inline, in registers, with no extra pass over memory. Production NVFP4 stacks lean on exactly these fused kernels: NVIDIA’s TransformerEngine supplies the fused quantize-and-GEMM path, and the TensorRT Model Optimizer supplies the NVFP4 quantization recipes [9][10].

Does the accuracy survive? Largely, yes. Post-training-quantizing DeepSeek-R1 to NVFP4 stays within about 1% of FP8 across reasoning and knowledge benchmarks (MMLU-Pro 85→84, GPQA 81→80, AIME 2024 actually up 89→91) [1]. And NVFP4 consistently beats MXFP4: in a head-to-head pretraining run, MXFP4 needed 36% more tokens to reach the same loss as NVFP4 [12][3]. This accuracy headroom is what lets the format’s efficiency gains be realized without retraining.

Bar chart of NVFP4 speed relative to FP8 (= 1.0×): 1.9× for Llama-3.1 405B pretraining and 3.0× for same-hardware peak inference throughput on Blackwell
Figure 4. NVFP4 across the LLM lifecycle: about 1.9× faster pretraining than FP8 (Llama-3.1 405B) and up to ~3× the peak inference throughput of FP8 on Blackwell, at FP8-level accuracy and ahead of MXFP4 [11][1].

This is no longer a lab result. NVFP4 inference ships in TensorRT-LLM, the quantization recipes ship in the TensorRT Model Optimizer, and NVFP4 checkpoints of frontier models (including DeepSeek-R1) are published for Blackwell deployment [1][11]. Newer designs go further and build FP4 into the architecture itself: DeepSeek-V4—a 1.6-trillion-parameter Mixture-of-Experts model—trains its expert weights and its sparse-attention indexer directly in FP4 with quantization-aware training, keeping the remaining components in FP8. The largest part of the model is therefore stored and computed in four bits by design, rather than quantized after the fact [15]. OpenAI’s open-weight GPT-OSS models make the same choice with a different format: their MoE experts—over 90% of the parameters—are trained quantization-aware in MXFP4, which is what lets the 120B model run on a single 80 GB GPU [16]. The contrast is instructive: both bake 4-bit into the model through QAT, but DeepSeek-V4 uses NVFP4’s finer 16-value blocks while GPT-OSS uses MXFP4’s simpler 32-value blocks—the same tradeoff from Section 2, now decided inside frontier models.

4. Video Generation with FP4

Diffusion and video generation are a more recent target, and the same W4A4 approach applies: a diffusion transformer (DiT) is largely the matrix multiplies of the figure above, repeated across many denoising steps. Quantization works here too: methods like ViDiT-Q push DiTs to W8A8 and W4A8 with negligible visual loss using custom kernels [14]. The aggressive step is to go all the way to W4A4 NVFP4, end to end.

This direction builds on our earlier Blackwell work. In February 2025, our RTX 5090 setup report documented early consumer-GPU access to native FP4, and our SVDQuant + NVFP4 demo then showed FLUX running 4× smaller and 3× faster than BF16 with near-16-bit quality—and better image quality than INT4 [23][24]. LongLive-2.0 extends that inference-first line into an end-to-end long-video system spanning training, W4A4 execution, the KV cache, and attention.

LongLive-2.0 is the clearest example: an autoregressive (AR) long-video model built on Wan2.2-TI2V-5B that runs NVFP4 in both training and inference—to our knowledge the first end-to-end NVFP4 recipe for long video generation [5][8]. The result is a 5B model that generates minute-long, 720p video in real time:

Five NVFP4-generated frames of a robot planting a seedling, drawn chunk by chunk at 45.7 FPS and 1280×720 (about 2× real time), up from 3.3 FPS for the 50-step base model (about 14× faster at the same 720p); each chunk is generated from the clean history before it, and only the next chunk is computed
Figure 5. Real NVFP4-generated frames from LongLive-2.0 (a robot planting a seedling): a 5B autoregressive model streams 720p video in real time, each chunk built from the clean history before it. Frames from the LongLive-2.0 paper teaser [5].

Training: one NVFP4 pipeline, two stages

Quality survives the drop to four bits because the model is trained NVFP4-aware rather than quantized after the fact, and that training runs in two NVFP4 stages. First, the bidirectional base model is fine-tuned into a chunk-level AR generator with clean-context teacher forcing—each chunk is denoised conditioned on the clean history before it. To fit the long sequences, Balanced sequence parallelism shards the paired clean and noisy chunks across GPUs so the loss-bearing work stays even, while NVFP4 accelerates the GEMM-heavy DiT; together they cut 64-second AR-training iteration time up to 2.1× over BF16+SP (plain BF16 without SP runs out of memory).

Second, to reach real-time speed, the model is distilled to a few denoising steps with distribution-matching distillation (DMD)—run, again, in four bits. DMD co-locates three networks on each GPU: a generator, a real-score model, and a fake-score model. All three share a frozen W4A4 NVFP4 backbone and train only small LoRA adapters—an idea borrowed from QeRL, which showed that pairing an NVFP4 backbone with LoRA makes even reinforcement learning cheap (its quantization noise can even aid exploration) [21]. An adaptive “4-or-6” scale search picks the lower-error magnitude per block, and because the distillation is single-stage—no ODE initialization or progressive long-tuning—the trained LoRA simply plugs into the 4-step AR model and halves it to 2 steps with no further training. Quantizing the three branches one after another walks DMD peak memory from 70.5 GB to 49.0 GB per GPU (0.69×).
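One plausible reading of the “4-or-6” search is sketched below: for each block, try mapping the largest magnitude onto either 6 or 4, the two top E2M1 marks, and keep whichever reconstructs the block with less error. This is an interpretation of the one-sentence description above, not the LongLive-2.0 code; the error metric, the unrounded scales, and the function names are all assumptions.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes


def to_fp4(x: float) -> float:
    """Round to the nearest E2M1 value, saturating at +/-6."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)


def block_error(block, target):
    """Squared error when the block's largest magnitude is mapped onto `target`."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0.0
    scale = amax / target
    return sum((v - to_fp4(v / scale) * scale) ** 2 for v in block)


def pick_target(block):
    """Return 6.0 or 4.0, whichever gives the lower reconstruction error."""
    return min((6.0, 4.0), key=lambda t: block_error(block, t))


dense_block = [1.0] + [0.8] * 15   # many values just below the maximum
peaky_block = [1.0] + [0.1] * 15   # one outlier, the rest small

print(pick_target(dense_block), pick_target(peaky_block))  # prints "4.0 6.0"
```

The dense block prefers 4 because E2M1 has no mark between 4 and 6, so values near five sixths of the maximum round badly when the maximum sits on 6. The peaky block prefers 6 because the larger target keeps the step size near zero small.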

Diagram of NVFP4 DMD distillation with three models on one GPU: the generator and the fake-score model have frozen NVFP4 backbones with trainable LoRA modules, the real-score model is frozen NVFP4 with no LoRA, the generator rolls out video chunks, the DMD loss (real − fake score) updates the generator, and a diffusion loss updates the fake-score model
Figure 6. NVFP4 DMD distillation in LongLive-2.0. The generator, real-score, and fake-score models are co-located in a frozen W4A4 NVFP4 setup; only the small green LoRA modules are trainable. The generator rolls out video chunks (left loop) that both score models evaluate—the DMD loss (real − fake) updates the generator’s LoRA, while a diffusion loss updates the fake-score LoRA—and the trained LoRA later converts the 4-step model to 2 steps. Redrawn from the LongLive-2.0 paper [5].

Inference: W4A4, an NVFP4 KV cache, and overlapped decode

At deployment the generator runs in W4A4 NVFP4, the KV cache is stored in NVFP4, and VAE decoding is overlapped with denoising on a separate GPU so it never extends the critical path. Stacked, these carry the 5B model to 45.7 FPS at 720p. The two sides reinforce each other: the same NVFP4 recipe that makes inference fast is what made the long-video fine-tune affordable to train in the first place.

Two bar charts of LongLive-2.0 results (lower is better). AR training, 64s iteration time in seconds: BF16+SP 1372.9, Balanced SP 1196.5, NVFP4 + Balanced SP 639.5 (2.1× faster). Inference peak memory in GB: BF16 36.4 (24.8 FPS), +NVFP4 29.7 (32.0 FPS), +KV cache 19.4 (29.7 FPS)
Figure 7. LongLive-2.0's measured results. Left: NVFP4 with Balanced sequence parallelism cuts 64s AR-training iteration time up to 2.1× (plain BF16 without SP runs out of memory). Right: W4A4 NVFP4 lowers inference peak memory from 36.4 GB to 29.7 GB and raises throughput from 24.8 to 32.0 FPS; adding the NVFP4 KV cache lowers memory further to 19.4 GB at 29.7 FPS [5].

The NVFP4 KV cache is what lets the model retain a minute of generated history within a fixed memory budget. Because it raises a distinct set of problems, we treat it on its own.

5. KV Cache Quantization with FP4

In any autoregressive model, the keys and values of past tokens are the model’s memory—and that memory grows linearly with length until it dominates everything. For LLMs this is the long-context wall: KVQuant showed that quantizing the cache to ~3 bits preserves accuracy well enough to reach 10-million-token contexts [13]. AR video has the exact same problem, only heavier—each generated chunk becomes history that later chunks attend to, and video tokens are far larger than text tokens. Quant VideoGen, for one, pushes the cache to just 2 bits for autoregressive video diffusion—up to 7× smaller with under 4% latency overhead—via semantic-aware smoothing and progressive residual quantization [22].

K-smoothing is a shared ingredient, not a point of contrast. Keys often contain offsets and outliers that waste the limited resolution of a low-bit codebook. Smoothing or centering K before quantization tightens its effective range, so more of those scarce levels represent useful variation. Quant VideoGen uses semantic-aware smoothing; LongLive-2.0 subtracts each key vector’s channel mean before NVFP4 micro-block quantization; and SageAttention3 explicitly inherits K-smoothing from SageAttention in its FP4 attention kernel. The exact recipes differ, but the principle is broadly effective: smooth first, then quantize [22][5][7].
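A small sketch shows why smoothing K is safe as well as useful. It subtracts the per-channel mean taken across tokens, which is the SageAttention-style variant and an assumption here (the exact LongLive-2.0 and Quant VideoGen recipes may differ). Because the subtracted term shifts every score in a row by the same constant, the softmax output is unchanged while the range that has to be quantized shrinks.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scores(q, keys):
    """Dot product of one query with every key."""
    return [sum(qc * kc for qc, kc in zip(q, k)) for k in keys]


def smooth_keys(keys):
    """Subtract the per-channel mean (taken across tokens) from every key."""
    n = len(keys)
    mean = [sum(k[c] for k in keys) / n for c in range(len(keys[0]))]
    return [[kc - mc for kc, mc in zip(k, mean)] for k in keys], mean


# Hypothetical keys with a large shared offset in each channel.
keys = [[10.2, -3.1], [9.8, -2.9], [10.5, -3.3], [9.9, -3.0]]
q = [0.3, -0.7]

smoothed, mean = smooth_keys(keys)
p_ref = softmax(scores(q, keys))
p_smooth = softmax(scores(q, smoothed))

print("max |K| before:", max(abs(v) for k in keys for v in k))
print("max |K| after: ", max(abs(v) for k in smoothed for v in k))
print("max softmax difference:", max(abs(a - b) for a, b in zip(p_ref, p_smooth)))
```

After smoothing, the low-bit codebook only has to cover the variation around the mean rather than the offset itself.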

Quantizing the cache to NVFP4 attacks this directly. NVIDIA reports the NVFP4 KV cache cutting footprint up to ~50% versus FP8 with under 1% accuracy loss, and beating MXFP4 KV cache by ~5% thanks to the finer block scaling [4]. The deeper advantage is hardware: NVFP4 dequantizes along Blackwell’s native FP4→FP8 datapath, whereas generic INT4/INT2 KV caches have no such datapath and must dequantize in software [1].
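The footprint numbers follow from simple arithmetic on bits per value. The sketch below uses a hypothetical model shape purely to make the sizes concrete, and it counts NVFP4 as 4 bits plus one FP8 scale per 16 values while ignoring per-tensor scales and any other metadata.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Bytes needed to store keys plus values across all layers."""
    values = 2 * tokens * layers * kv_heads * head_dim  # 2 = one K and one V
    return values * bits_per_value / 8


# Hypothetical shape, chosen only for illustration.
shape = dict(tokens=100_000, layers=32, kv_heads=8, head_dim=128)

bf16 = kv_cache_bytes(**shape, bits_per_value=16)
fp8 = kv_cache_bytes(**shape, bits_per_value=8)
nvfp4 = kv_cache_bytes(**shape, bits_per_value=4 + 8 / 16)  # values + block scales

print(f"BF16 {bf16 / 2**30:.2f} GiB, FP8 {fp8 / 2**30:.2f} GiB, NVFP4 {nvfp4 / 2**30:.2f} GiB")
print(f"NVFP4 vs BF16: {bf16 / nvfp4:.2f}x smaller")
print(f"NVFP4 vs FP8: {1 - nvfp4 / fp8:.1%} smaller")
```

The ratios do not depend on the shape: about 3.6× against BF16, and, once the block scales are counted, a little under one half against FP8.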

Quant VideoGen and NVFP4 therefore target different points on the quality–memory frontier. This is not an apples-to-apples benchmark, but the design tradeoff is clear: INT2 is capacity-first, using an extremely compact code and progressive residual refinement to maximize compression; NVFP4 is quality-first, using twice the bit width and floating-point dynamic range to preserve more quality headroom when small cache errors compound over a long video. NVFP4 also keeps K and V in the same Blackwell-native format used by SageAttention3, so an NVFP4 KV cache can feed an NVFP4 attention path without introducing an INT2-to-FP4 format boundary [22][7].

And dequantization is the catch. KV is generated autoregressively, so the cache is re-read in full on every single decode step—if it is stored quantized, it must be dequantized again and again in a tight per-step loop, a recurring, bandwidth-bound tax rather than a one-time conversion. A naive implementation does this one cached chunk per kernel launch, and the launch latency stacks up. LongLive-2.0’s answer is a custom parallel dequantization kernel that rebuilds every in-window chunk in a single launch, keeping total dequant overhead under 2% [5][6]:

Timeline diagram of NVFP4 KV cache reconstruction before attention (cache about 3.6× smaller, dequant overhead under 2%): the serial path issues five launches for chunks c0 to c4 one after another, so latency stacks up, while the parallel path dequantizes all five chunks in one fused launch
Figure 8. Reconstructing the NVFP4 KV cache before attention. A naive path dequantizes one chunk per kernel launch, so latency stacks up across the sliding window (top); LongLive-2.0’s fused kernel rebuilds every in-window chunk in a single launch (bottom), keeping total dequant overhead under 2% [5].

6. FP4 Attention

With weights, activations, and the KV cache in four bits, the remaining component is attention itself—the two matrix multiplies that turn queries and keys into scores, and scores into a weighted sum of values. Attention has followed the same precision curve as the rest of the model, one format at a time.

FlashAttention-2 set the modern baseline by computing exact attention in FP16/BF16 [17]. FlashAttention-3 then added an FP8 path on Hopper—running both matmuls in 8-bit and reaching roughly 1.2 PFLOP/s on an H100 [18]. The SageAttention line went lower still, using integers: the original quantized the query–key score matmul to INT8 (keeping the probability×value matmul in FP16) for about 2.1× over FlashAttention-2 [19], and SageAttention2 took queries and keys to INT4 with the probabilities and values in FP8 for about 3× [20]. SageAttention3 is the most recent step: both matmuls in NVFP4 on Blackwell.

Precision ladder for attention (fewer bits, more speed). FlashAttention-2: BF16/FP16, exact softmax, A100/H100, 1× baseline. FlashAttention-3: FP8 QK and PV with FP32 accumulation, Hopper (H100), ~1.2 PFLOP/s. SageAttention / SageAttention2: INT8/INT4 Q and K with FP8 P and V, RTX 4090 / Hopper, ~2.1× to ~3× vs FlashAttention-2. SageAttention3: NVFP4 for both matmuls, Blackwell, ~5× and 1038 TOPS
Figure 9. Attention has tracked the same precision curve as the rest of the model—FP16 (FlashAttention-2) → FP8 (FlashAttention-3) → INT8/INT4 (SageAttention 1/2) → FP4 (SageAttention3)—each step trading representational range for throughput [7].

Pushing attention to four bits is delicate, and one tensor is the reason. Q, K, and V are roughly zero-centered with a wide range, so ordinary block-wise FP4 handles them. The softmax map P is the hard one: after softmax its values live in [0, 1], crammed near zero, so a naive 4-bit scale wastes almost all of its range. SageAttention3 solves this with a two-level trick—first stretch each row of P by a per-token FP32 factor so it fills the representable range, then quantize. Crucially, both matmuls run in NVFP4: P’s values end up in 4-bit (only its block scale is FP8) [7].
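The effect of the two-level step can be reproduced in a small software model. The sketch below compares direct block-wise quantization of a probability row with the stretch-then-quantize path, where each row is divided by its maximum over 448·6 so that block scales use the full E4M3 range. The E4M3 rounding is emulated, the row is hypothetical, and the function names are illustrative; it shows one plausible failure mode of the direct path (block scales that are too small for FP8 to represent), not the SageAttention3 kernel.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative E2M1 values
FP4_MAX, E4M3_MAX = 6.0, 448.0


def to_fp4(x):
    """Nearest E2M1 value for a non-negative input, saturating at 6."""
    return min(FP4_GRID, key=lambda g: abs(x - g))


def to_e4m3(x):
    """Round a non-negative float to the nearest FP8 E4M3 value (max 448)."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    exp = max(math.floor(math.log2(x)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)                  # 3 mantissa bits
    return min(round(x / step) * step, E4M3_MAX)


def quantize_blocks(row, block=16):
    """Block-wise FP4 round trip with one E4M3 scale per block."""
    out = []
    for i in range(0, len(row), block):
        chunk = row[i:i + block]
        scale = to_e4m3(max(chunk) / FP4_MAX)
        out.extend(to_fp4(v / scale) * scale if scale > 0 else 0.0 for v in chunk)
    return out


def two_level(row):
    """Stretch the row to span [0, 448 * 6] in FP32, then quantize block-wise."""
    row_scale = max(row) / (E4M3_MAX * FP4_MAX)  # per-token FP32 scale
    stretched = [v / row_scale for v in row]
    return [v * row_scale for v in quantize_blocks(stretched)]


def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical row: one dominant entry, then progressively smaller ones.
p_row = [0.9] + [0.001] * 15 + [0.0001] * 16

err_direct = sq_error(p_row, quantize_blocks(p_row))
err_two_level = sq_error(p_row, two_level(p_row))
print(err_direct, err_two_level)
```

In this example the direct path rounds the second block's scale to zero and loses the block entirely, while the stretched path keeps it; the stretch also lets the first block's scale land exactly on a representable FP8 value.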

Step-by-step diagram of FP4 attention in SageAttention3: Q, K, and V are quantized to FP4 E2M1 in 1×16 blocks with FP8 E4M3 scales; S = Q·Kᵀ runs on FP4 Tensor Cores; online softmax produces P in full precision; P is rescaled in FP32 to fill the range up to 448×6 and then quantized to FP4 with FP8 block scales; O = P·V runs on FP4 Tensor Cores. Callouts: 1038 TOPS, ~5× FlashAttention2 on an RTX 5090; NVFP4 99.52% vs MXFP4 98.37%
Figure 10. SageAttention3 runs both attention matmuls in NVFP4 on Blackwell. The hard part is the softmax map P, whose [0,1] values are first rescaled per token in FP32 (each row is divided by its maximum over 448·6, so the row spans [0, 448·6]) before 4-bit quantization. This lifts cosine similarity from 93.3% (direct) to 99.5% and reaches 1038 TOPS, ~5× FlashAttention2 on an RTX 5090 [7].

The result is substantial: 1038 TOPS on an RTX 5090, roughly 5× the fastest FlashAttention available there, and about 2.4–3× end-to-end on video diffusion—with the two-level stretch raising P’s cosine similarity from 93.3% to 99.5%, and NVFP4 again ahead of MXFP4 [7]. It is not free: FP4 attention still incurs more accuracy risk than FP4 GEMMs, which is why it remains an active research area rather than a settled one.

Closing: FP4 Is the Future

Historically, 4-bit precision implied a large accuracy penalty. On Blackwell, NVFP4 reduces that penalty to roughly 1% on many tasks while delivering substantial speedups, and the approach now spans the landscape: LLMs (DeepSeek-R1), diffusion and image generation (FLUX), and autoregressive video (LongLive-2.0). The recurring lesson is that FP4 works only when the format, the kernels, the cache, and attention are co-designed—fine-grained scaling, fused kernels, hardware-accelerated dequantization, and the dedicated handling that the softmax map requires.

It is also far from finished. The open problems are where the next round of speed and quality will come from:

Four-bit precision is becoming a practical default for large-model inference, and increasingly for training.
Realizing it fully is as much a systems problem—formats, kernels, and memory—as a modeling one.

References

  1. Introducing NVFP4 for Efficient and Accurate Low-Precision Inference. Eduardo Alvarez, Omri Almog, Eric Chung, Simon Layton, Dusan Stosic, Ronny Krashinsky, Kyle Aubrey. NVIDIA Technical Blog, 2025.
  2. OCP Microscaling Formats (MX) Specification. Open Compute Project, 2023.
  3. NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit. Kirthi Devleker and Farshad Ghodsian. NVIDIA Technical Blog, 2025.
  4. Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache. Eduardo Alvarez, Wei-Ming Chen, Huizi Mao. NVIDIA Technical Blog, 2025.
  5. LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation. Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han. arXiv, 2026. arXiv:2605.18739 and project page; code at github.com/NVlabs/LongLive
  6. LongLive2.0 Documentation. NVlabs, 2026.
  7. SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training. Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jianfei Chen, Jun Zhu. arXiv, 2025. arXiv:2505.11594
  8. Wan2.2: Open and Advanced Large-Scale Video Generative Models (Wan2.2-TI2V-5B base model). Wan Team, 2025. Wan2.2 repository
  9. NVIDIA TransformerEngine. NVIDIA, 2024. Library of fused low-precision (FP8/FP4) quantization-and-GEMM kernels for Hopper and Blackwell GPUs. github.com/NVIDIA/TransformerEngine
  10. NVIDIA TensorRT Model Optimizer. NVIDIA, 2024. Quantization toolkit with NVFP4 post-training and quantization-aware recipes for efficient deployment. github.com/NVIDIA/TensorRT-Model-Optimizer
  11. 3 Ways NVFP4 Accelerates AI Training and Inference. NVIDIA Technical Blog, 2025.
  12. Pretraining LLMs with NVFP4. NVIDIA, arXiv, 2025. arXiv:2509.25149
  13. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami. NeurIPS, 2024. arXiv:2401.18079
  14. ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation. Tianchen Zhao et al. arXiv, 2024. arXiv:2406.02540
  15. DeepSeek-V4. DeepSeek-AI, 2026. A 1.6T-parameter Mixture-of-Experts model that applies FP4 quantization-aware training to its expert weights and sparse-attention indexer (FP8 elsewhere), targeting NVIDIA Blackwell. DeepSeek-V4-Pro model card
  16. OpenAI gpt-oss (gpt-oss-120b, gpt-oss-20b). OpenAI, 2025. Open-weight Mixture-of-Experts models whose MoE expert weights (over 90% of parameters) are quantization-aware-trained in MXFP4, enabling the 120B model to run on a single 80 GB GPU. gpt-oss-120b model card
  17. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Tri Dao. arXiv, 2023. arXiv:2307.08691
  18. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. arXiv, 2024. arXiv:2407.08608
  19. SageAttention: Accurate 8-Bit Attention for Plug-and-Play Inference Acceleration. Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, Jianfei Chen. arXiv, 2024. arXiv:2410.02367
  20. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-Thread INT4 Quantization. Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen. arXiv, 2024. arXiv:2411.10958
  21. QeRL: Beyond Efficiency — Quantization-enhanced Reinforcement Learning for LLMs. Huang et al. arXiv, 2025. Pairs an NVFP4-quantized backbone with LoRA for efficient RL training of LLMs. arXiv:2510.11696
  22. Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization. Haocheng Xi et al. ICML, 2026. Training-free 2-bit KV-cache quantization for AR video diffusion using semantic-aware smoothing and progressive residual quantization (up to 7× smaller, <4% latency overhead). arXiv:2602.02958
  23. RTX 5090 Workstation Configuration Journey. Qinghao Hu, Jiaming Tang, Yujun Lin, Zhuoyang Zhang, Zhekai Zhang, Shang Yang, Song Han. MIT Han Lab Blog, February 2025.
  24. SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs. Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han. MIT Han Lab Blog, February 2025.