Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding

We introduce Nemotron-Labs-Diffusion, a tri-mode language model (LM) that unifies autoregressive (AR), diffusion, and self-speculation decoding within a single architecture. Trained with a joint AR-diffusion objective, Nemotron-Labs-Diffusion can switch modes to sustain high throughput across deployment settings and concurrency levels. Our study shows that (1) AR and diffusion objectives are complementary: diffusion improves lookahead planning, while AR provides left-to-right linguistic priors; (2) in self-speculation mode, where diffusion drafts and AR verifies, the model outperforms multi-token prediction (MTP) methods in both acceptance rate and real-device efficiency; and (3) a speed-of-light analysis demonstrates diffusion’s long-term potential, with up to 76.5% more tokens per forward pass than self-speculation under an optimal sampler. Scaled to 3B, 8B, and 14B parameters, the Nemotron-Labs-Diffusion family, which includes base, instruct, and vision-language models, consistently outperforms state-of-the-art open-source AR and diffusion LMs in both accuracy and speed. For example, Nemotron-Labs-Diffusion-8B decodes 5.9× more tokens per forward pass than Qwen3-8B with better accuracy, translating to 4× higher throughput on SPEED-Bench with SGLang on a GB200 GPU.
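
To make the draft-then-verify idea behind self-speculation concrete, here is a minimal toy sketch. It is not the model's actual implementation: the abstract does not state the acceptance rule, so the sketch assumes greedy exact-match verification, and every name in it (`verify_draft`, `self_speculative_decode`, `toy_ar_next`, `toy_draft`) is hypothetical. The two callables stand in for one model run in diffusion mode (drafting) and AR mode (verifying).

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interfaces standing in for one model used in two modes.
DraftFn = Callable[[List[Token], int], List[Token]]  # (prefix, k) -> k drafted tokens
NextFn = Callable[[List[Token]], Token]              # prefix -> greedy AR next token


def verify_draft(prefix: List[Token], draft: List[Token], ar_next: NextFn) -> List[Token]:
    """Keep the longest draft prefix that matches greedy AR, then add one AR token.

    A real system would score all draft positions in a single causal forward
    pass; this loop calls ar_next per position only for readability.
    """
    accepted: List[Token] = []
    for tok in draft:
        target = ar_next(prefix + accepted)
        if tok != target:
            accepted.append(target)  # first mismatch: take the AR token and stop
            return accepted
        accepted.append(tok)
    accepted.append(ar_next(prefix + accepted))  # whole draft accepted: bonus token
    return accepted


def self_speculative_decode(
    prompt: List[Token], draft_fn: DraftFn, ar_next: NextFn, k: int, max_new: int
) -> Tuple[List[Token], int]:
    """Return (tokens, number of draft+verify steps) for at most max_new new tokens."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out.extend(verify_draft(out, draft_fn(out, k), ar_next))
        steps += 1
    return out[: len(prompt) + max_new], steps


def toy_ar_next(prefix: List[Token]) -> Token:
    """Toy 'AR model': the next token is the previous token plus one, modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[Token], k: int) -> List[Token]:
    """Toy 'diffusion drafter': right on the first two tokens, wrong on the third."""
    draft, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 10
        draft.append(last)
    if k >= 3:
        draft[2] = (draft[2] + 5) % 10  # inject a deliberate drafting error
    return draft


tokens, steps = self_speculative_decode([0], toy_draft, toy_ar_next, k=4, max_new=9)
print(tokens, steps)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 3
```

Under this assumed greedy rule the output is identical to plain greedy AR decoding; the drafter only changes how many tokens each step commits. In the toy run each step commits three tokens (two accepted draft tokens plus one AR correction), which illustrates why the acceptance rate determines the tokens-per-forward-pass figures quoted above.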

HF collection: https://huggingface.co/collections/nvidia/nemotron-labs-diffusion 

Authors

Lexington Whalen (NVIDIA)
Abhinav Garg (NVIDIA)
Chengyue Wu (NVIDIA)
Maksim Khadkevich (NVIDIA)
Nicolai Oswald (NVIDIA)
Enze Xie (NVIDIA)
Daniel Egert (NVIDIA)
Sharath Turuvekere Sreenivas (NVIDIA)
Shizhe Diao (NVIDIA)
Chenhan Yu (NVIDIA)
Ye Yu (NVIDIA)
Weijia Chen (NVIDIA)
Sajad Norouzi (NVIDIA)
Shiyi Lan (NVIDIA)
Ligeng Zhu (NVIDIA)
Jin Wang (NVIDIA)
Jindong Jiang (NVIDIA)
Morteza Mardani (NVIDIA)
Mehran Maghoumi (NVIDIA)
Song Han (NVIDIA)
Ante Jukić (NVIDIA)
Nima Tajbakhsh (NVIDIA)
Jan Kautz (NVIDIA)
