Accelerating RL Post-Training with Speculative Decoding in NeMo RL
Authors: Hayate Iso, Tiyasa Mitra, Sudipta Mondal, Rasoul Shafipour, Venmugil Elango, Terry Kong, Yuki Huang, Seonjin Na, Izzy Putterman, Benjamin Chislett, Maor Ashkenazi, Joseph Guman, Gerald Shen, Tugrul Konuk, Ashwath Aithal, Ritika Borkar, Ran Zilberstein, Bita Rouhani
Overview
The cost of RL post-training is dominated by a single component: rollout generation.
- In reasoning workloads, the bottleneck is long autoregressive outputs.
- In agentic RL, it’s the sequential model calls across multiple turns.
Different sources, same problem: generation cost dominates.
Speculative decoding cuts through it, without changing training semantics. In speculative decoding, a drafter proposes multiple tokens, a verifier accepts/rejects them, and the final rollout still matches the verifier policy.
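To make the "still matches the verifier policy" claim concrete, here is a minimal sketch of the standard speculative-sampling verification rule (illustrative Python, not NeMo RL's internal code): each drafted token is accepted with probability min(1, p/q), and a rejection triggers a resample from the normalized residual, so the emitted tokens are distributed exactly as the verifier would have produced them.

```python
import torch

def verify_draft(draft_tokens, p_probs, q_probs):
    """Speculative-sampling verification (a minimal sketch, not NeMo RL's
    internal code). draft_tokens: k proposed token ids; p_probs/q_probs:
    [k, vocab] verifier and drafter distributions at each draft position."""
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(tok)/q(tok)).
        if torch.rand(()) < torch.clamp(p[tok] / q[tok], max=1.0):
            out.append(tok)
            continue
        # On rejection: resample from the normalized residual max(0, p - q).
        # This correction makes the output distribution exactly p, which is
        # why rollouts still follow the verifier policy.
        residual = torch.clamp(p - q, min=0.0)
        out.append(torch.multinomial(residual / residual.sum(), 1).item())
        break
    # (If every draft token is accepted, a bonus token can additionally be
    # sampled from the verifier's next-position distribution; omitted here.)
    return out
```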
We’ve integrated speculative decoding into NeMo RL with a vLLM backend, supporting major techniques such as EAGLE-3 and multi-token prediction (MTP). We also surface the key system trade-offs and implementation details required to realize speedups in practice. To our knowledge, this is the first time speculative decoding has been integrated into an open-source, production-grade RL framework.
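For orientation, configuring a vLLM engine for speculative decoding looks roughly like the sketch below. The drafter checkpoint path is hypothetical and the exact speculative_config keys vary across vLLM versions; consult the NeMo RL documentation for the supported configuration.

```python
from vllm import LLM, SamplingParams

# Illustrative engine setup; the drafter path is hypothetical and the exact
# speculative_config keys vary across vLLM versions.
llm = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "method": "eagle3",                 # or "ngram" for the model-free baseline
        "model": "/ckpts/eagle3-drafter",   # hypothetical EAGLE-3 drafter checkpoint
        "num_speculative_tokens": 3,        # k=3, the 8B sweet spot reported below
    },
)
outputs = llm.generate(["Solve: ..."], SamplingParams(temperature=1.0, max_tokens=512))
```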
Results (8B reasoning workloads):
- 1.5x-1.8x faster rollout generation
- Up to 1.4x faster end-to-end RL steps
- No change to the optimization trajectory
At scale, projections show ~2.5x end-to-end speedup at 235B.
Highlights

- We integrate speculative decoding directly into the NeMo RL training pipeline, not just as a standalone inference optimization.
- The verifier-side training path remains unchanged: policy loss, KL, and log-probability recomputation are still computed under the verifier policy (see the sketch after this list).
- On 8B reasoning workloads, EAGLE-3 yields up to 1.8x faster rollout generation and up to 1.4x faster RL steps.
- Simulator projections suggest that the benefit grows at frontier scale, reaching roughly 2.5x end-to-end speedup in favorable 235B deployments.
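As a concrete illustration of that unchanged path, the sketch below recomputes rollout log-probs under the verifier. It assumes a Hugging Face-style causal LM interface; the function name and batching are illustrative, not NeMo RL's actual code.

```python
import torch
import torch.nn.functional as F

def verifier_logprobs(verifier, token_ids):
    """Recompute per-token log-probs of a finished rollout under the
    verifier (i.e. the current policy). A minimal sketch assuming a
    Hugging Face-style causal LM; NeMo RL's actual training path differs
    in batching and parallelism, but the semantics are the same: the loss
    and KL see only verifier quantities, never the drafter's."""
    logits = verifier(token_ids[:, :-1]).logits           # predict next tokens
    logp = F.log_softmax(logits, dim=-1)
    # Gather the log-prob of each token that was actually emitted.
    return torch.gather(logp, -1, token_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
```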
Main Results
We evaluate GRPO-based RL post-training in two reasoning settings:
- RL-Think: continued training from a reasoning-capable Qwen3-8B model
- RL-Zero: training starting from Qwen3-8B-Base
In both settings, rollout generation accounts for 65-72% of step time in the autoregressive baseline. That makes it the obvious target for systems optimization.

Speculative decoding cuts rollout latency in both regimes:
- RL-Zero: rollout generation latency decreases from 100.0s to 56.6s per step, a 1.8x speedup
- RL-Think: rollout generation latency decreases from 133.6s to 87.0s per step, a 1.5x speedup
- Overall RL step speedup reaches 1.4x on RL-Zero and 1.3x on RL-Think

Validation curves stay closely aligned between autoregressive and speculative decoding. The faster rollout path doesn’t shift the optimization trajectory.
We also compare against n-gram drafting, a model-free speculative baseline. It achieves non-trivial acceptance length but is still slower than autoregressive decoding end-to-end. The lesson: acceptance alone is not enough. Verification overhead and draft quality both matter.
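A back-of-the-envelope cost model makes this concrete. All numbers below are assumed for illustration, not measured values: each decoding round, the drafter proposes k tokens, the verifier scores them in one (wider) forward pass, and on average `mean_accepted + 1` tokens are committed.

```python
# Costs are in units of one autoregressive decode step, so the baseline
# emits 1 token per 1.0 cost. Assumed numbers, for illustration only.
def rollout_speedup(mean_accepted, k, verify_cost=1.0, draft_cost_per_token=0.05):
    tokens_per_round = mean_accepted + 1              # accepted prefix + correction
    cost_per_round = verify_cost + k * draft_cost_per_token
    return tokens_per_round / cost_per_round          # relative to the baseline

print(rollout_speedup(1.8, k=3))                      # good drafter: ~2.4x
print(rollout_speedup(1.2, k=5, draft_cost_per_token=0.0,
                      verify_cost=2.5))               # free drafts but costly wide
                                                      # verification: ~0.9x, i.e. slower
```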
Projected Deployment Impact
The 8B gains are clear. The bigger opportunity is at larger model sizes, where rollout generation still dominates training time.

For a simulated Qwen3-235B-A22B deployment on 512 GB200 GPUs under synchronous RL, favorable operating points exceed 2.2x end-to-end speedup. Non-generation stages cap the total gain, so rollout acceleration doesn’t translate one-for-one into training speedup.
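The cap is just Amdahl's law. A quick check using the post's own 8B numbers (generation share around 70%, 1.8x rollout speedup):

```python
# Amdahl-style bound: with generation share g of step time and rollout
# speedup s, end-to-end speedup is 1 / ((1 - g) + g / s).
def end_to_end_speedup(g: float, s: float) -> float:
    return 1.0 / ((1.0 - g) + g / s)

print(end_to_end_speedup(0.70, 1.8))   # ~1.45x, matching the reported ~1.4x
print(end_to_end_speedup(0.70, 1e9))   # ~3.33x: the ceiling even with free rollouts
```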

The simulations also show that speculative decoding composes naturally with asynchronous RL. As policy lag grows, smaller deployments lose more speedup; larger ones stay stable. At favorable 235B operating points, the projected end-to-end gain reaches roughly 2.5x.
Takeaways
Speculative decoding is a practical systems lever when rollout generation dominates wall-clock time and verifier-side semantics matter. Integrated into NeMo RL, it delivers substantial throughput gains at 8B scale, with larger projected benefits at frontier scale.
What we learned along the way:
- Lossless throughput: the rejection procedure guarantees rollouts still follow the verifier policy. Unlike async execution, off-policy replay, or low-precision rollouts, speculative decoding raises throughput without changing the sampling distribution the RL signal depends on.
- Generation share sets the ceiling: by Amdahl’s law, end-to-end speedup is bounded by the generation share and the mean acceptance length. Our 8B workloads sit at 65-72% generation share, and that share grows with model size, which is why the projected gains at 235B are larger.
- Draft-policy alignment is the master variable: in-domain initialization, online adaptation, and the choice of drafter all work by reducing the gap between draft and policy. Switching from generic chat data to in-domain post-training data lifts RL-Zero speedup from 1.5x to 1.8x; online adaptation mostly helps when initialization is weak. Verifier-exact semantics are preserved by reusing cached hidden states from the verifier's forward pass through a gradient-detached pathway (sketched after this list).
- Acceptance length is not speedup: longer drafts raise acceptance, but raise speculative work faster. At 8B, k=3 is the sweet spot; k≥5 on RL-Think is actually slower than autoregressive decoding. The optimal k depends on model size, deployment scale, and per-instance batch size, and should be re-tuned per configuration.
- Async is complementary, not competitive: async execution overlaps generation with training and log-prob recomputation, shrinking the rollout-side margin (~10% step-time gain at lag 1, versus 35-41% in sync at 8B). At frontier scale, generation re-emerges as the bottleneck: ~3.5x rollout and ~2.5x end-to-end speedup at 235B on 2048 GB200 GPUs with lag 2.
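Here is a minimal sketch of that gradient-detached pathway, assuming an EAGLE-style drafter trained online on hidden states cached from the verifier's forward pass; all names are illustrative, not NeMo RL's actual code.

```python
import torch
import torch.nn.functional as F

def drafter_update(drafter, optimizer, cached_hidden, target_ids):
    """Online drafter adaptation (illustrative sketch). cached_hidden:
    hidden states saved from the verifier's forward pass over the rollout;
    target_ids: the tokens the verifier actually committed."""
    features = cached_hidden.detach()   # gradient-detached pathway: drafter
                                        # updates can never flow back into
                                        # (or perturb) the verifier policy
    logits = drafter(features)          # train the drafter to imitate the verifier
    loss = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                    # only drafter parameters are updated
    return loss.item()
```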
For full details, see the arXiv version. For implementation details and training guidance in NeMo RL, see the documentation.
Citation
@misc{iso2026rlacceleration,
      title={Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding},
      author={Hayate Iso and Tiyasa Mitra and Sudipta Mondal and Rasoul Shafipour and Venmugil Elango and Terry Kong and Yuki Huang and Seonjin Na and Izzy Putterman and Benjamin Chislett and Maor Ashkenazi and Joseph Guman and Gerald Shen and Tugrul Konuk and Ashwath Aithal and Ritika Borkar and Ran Zilberstein and Bita Rouhani},
      year={2026},
      eprint={2604.26779},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2604.26779}
}