RLP: Reinforcement as a Pretraining Objective
Published:
[Paper] [Code]
Authors: Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
📌 Summary:
Reinforcement Learning Pretraining (RLP) brings reinforcement learning directly into the pretraining stage, rewarding models for generating useful chains-of-thought (CoT) that actually help predict future tokens. Unlike verifier-based methods, RLP is verifier-free, dense, and scalable, making “thinking before predicting” part of the pretraining recipe itself. RLP enables models to:
- Integrate reasoning into pretraining: CoT is treated as an explicit action; thoughts are rewarded in proportion to their predictive utility.
- Achieve robust gains at scale: On Qwen3-1.7B-Base, RLP boosts benchmark averages by +19% over base and +17% over continuous pretraining.
- Scale with model size: Applied to Nemotron-Nano-12B-V2, RLP raises the overall average from 42.81% → 61.32% and improves science reasoning by an absolute +23%, despite using ~200B fewer tokens.
- Compound with post-training: RLP establishes durable reasoning foundations that persist and strengthen after SFT and RLVR; the gains compound (+8% relative).
With comprehensive ablations and scaling experiments, RLP emerges as a broadly applicable reinforcement pretraining objective—bridging next-token prediction with reasoning and establishing a new foundation for building models that think before they predict.
Overview

Figure 1: Quantitative benchmarks for Qwen3-1.7B-Base, showing the impact of RLP. Shaded columns indicate RLP variants; “Post” indicates SFT + RLVR post-training.
The standard approach to training large language models (LLMs) is to first build a general foundation with next-token prediction and then try to teach complex reasoning skills much later, during a final post-training phase. This treats reasoning as an add-on rather than a core capability. We hypothesize that a model’s foundational reasoning ability can be significantly improved by integrating reinforcement learning directly into the pre-training process itself.
We introduce RLP (Reinforcement Learning Pre-training)—a scalable method that reframes reasoning as an intrinsic part of pre-training. Instead of just passively predicting the next word, RLP encourages the model to actively “think before it predicts” by generating an internal chain-of-thought. This “thought” is then rewarded based on how much it helps the model predict the actual next token in the sequence. The result is a model that learns a foundational, self-supervised motivation to reason from any ordinary text.
Key Characteristics of RLP:
- Verifier-Free, Information-Gain Reward 🧠: RLP rewards internal thoughts (CoT) based on their information gain for next-token prediction, creating a dense, self-supervised, and verifier-free signal from any text.
- Reasoning as an Exploratory Action: It treats chain-of-thought generation as an exploratory action, encouraging the model to proactively reason about how its internal thoughts influence future predictions.
- Dynamic EMA Baseline: Rewards are calculated as the advantage over a slowly updated EMA baseline of the model itself. This dynamic comparison stabilizes training and ensures meaningful credit assignment.
- Seamless Pre-training Integration: The objective directly augments next-token prediction, allowing it to operate on massive text streams and teach reasoning within a single, unified pre-training phase.
How does RLP work?

Figure 2: Visualization of the RLP framework. A chain-of-thought is sampled before next-token prediction. Rewards are computed by contrasting the predictor conditioned on the CoT with a No-think EMA baseline, yielding a verifier-free, dense signal.
As shown in Figure 2 (right), RLP treats Chain-of-Thought (CoT) generation as an explicit action taken before predicting each next token. The model first samples an internal thought (in the figure's example, “The sentence describes how plants, algae, and bacteria make food. Common knowledge says this process relies on energy from the sun. So the next token is most likely ‘sunlight’”) and then predicts the observed token “sunlight” from the same context augmented with the CoT. The reward, as shown in Figure 2 (left), is the increase in log-likelihood of the observed token when the CoT is present compared to a no-think baseline. This yields a verifier-free and dense reward that assigns position-wise credit wherever thinking improves prediction. RLP reframes reinforcement learning for reasoning as reinforcement pretraining on the same streams used for maximum likelihood.
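To make this reward concrete, here is a minimal sketch of the position-wise information-gain signal in PyTorch. The function name and the toy probabilities are ours for illustration; this is not the released RLP code.

```python
import torch

def information_gain_reward(logp_with_cot: torch.Tensor,
                            logp_no_think: torch.Tensor) -> torch.Tensor:
    """Dense, verifier-free RLP-style reward for a single position.

    logp_with_cot : log p_theta(x_t | x_<t, c_t), the policy's log-likelihood
                    of the observed token after conditioning on the sampled CoT.
    logp_no_think : log pbar_phi(x_t | x_<t), the EMA baseline's log-likelihood
                    of the same token without any thought.
    The difference is positive exactly when thinking made the observed token
    more likely than the no-think counterfactual.
    """
    return logp_with_cot - logp_no_think

# Toy numbers (made up): the photosynthesis thought raises the probability
# of "sunlight" from 0.20 to 0.45, giving a positive reward.
logp_no_think = torch.log(torch.tensor(0.20))
logp_with_cot = torch.log(torch.tensor(0.45))
print(information_gain_reward(logp_with_cot, logp_no_think))  # ~ +0.81 nats
```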
Potential of RLP
To isolate the impact of RLP, we compared three models built on the Qwen3-1.7B-Base architecture:
- The original base model (BASE)
- A compute-matched Continuous Pre-training (CPT) baseline
- Our RLP model
For a fair, apples-to-apples comparison, all three models were then put through an identical post-training pipeline consisting of Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verified Rewards (RLVR). The results are detailed in Figure 1.
🔥 Key Takeaways
- RLP Establishes a Decisive Pre-training Advantage. During the pre-training phase alone, RLP demonstrates superior performance, outperforming the original base model by +19% and the compute-matched CPT baseline by +17% on average.
- Gains Compound After Post-Training. The benefits of RLP are not temporary or washed out by alignment. Instead, they compound, with the final RLP-enhanced model maintaining a +7-8% relative advantage over the other post-trained models.
- Broad Generalization Beyond Math. RLP’s gains are not limited to a single domain. We observed particularly strong improvements in science benchmarks, where the RLP model achieved a +3 absolute point gain over the CPT model after post-training, showcasing its versatile, multi-step reasoning capabilities.
Scaling RLP to larger model sizes and different architectures
In this comparison, we take an intermediate checkpoint of NEMOTRON-NANO-12B-V2 trained on 19.8 trillion tokens and apply RLP for only 250 million additional tokens. BASE, in contrast, is trained for the full 20 trillion tokens.

Figure 3: Comparison of BASE and RLP on NEMOTRON-NANO-12B-V2. RLP, trained on ~200B fewer tokens, achieves a 35% average gain, with the largest boost in science reasoning (+23% absolute), showing robust cross-domain benefits at scale.
🔥 Key Takeaways:
- Figure 3 demonstrates that the benefits of RLP persist and even amplify when scaling to larger model sizes, and that they generalize to different model architectures.
- RLP substantially outperforms BASE across all domains: on average it is 35% better in relative terms, despite being trained on approximately 200 billion fewer tokens.
- While math performance improves moderately, the most striking gains emerge in science reasoning, where the science average jumps by an absolute 23%.
RLP provides generalizable improvements across diverse corpora!
Our experiments with the Qwen model across six different corpus families highlight a major strength of RLP—its scalability to large, diverse corpora. Unlike RLVR, which depends on small, curated reasoning datasets and struggles to generalize, RLP can operate directly on ordinary pretraining streams—academic papers, textbooks, web crawl, or even SFT-style data. This makes it practical at pretraining scale without the costly curation required by prior approaches.

Figure 4: RLP trained on six SFT-style and general-purpose datasets yields consistent gains, indicating transferable reasoning from mixed/open-ended data.
🔥 Key Takeaways:
- Consistent Gains Across Domains. On Qwen3-1.7B-Base, RLP improves averages by 7–9%, with the strongest lifts on SFT-style and general-purpose corpora.
- True Cross-Domain Transfer. Unlike prior methods where RL gains were confined to math and weakened under mixed data, RLP achieves simultaneous improvements across all benchmarks, proving genuine cross-domain transfer.
- Finding Reasoning Signals Everywhere. Even on purely non-reasoning corpora like web crawl, RLP leverages data diversity to uncover reasoning signals. This eliminates the need for costly data curation and proves that RLP can enhance a model’s reasoning ability using the same data streams as standard pre-training, making it a truly scalable solution.
Why Relative Advantages Don’t Reward “Bad Thoughts”
➣ Proof of Monotonic Improvement
It may seem paradoxical that, when all thoughts perform poorly \((r(c_t) < 0)\), the group-relative formulation still labels one as “better” and reinforces it. Does this mean the model is being trained to favor bad reasoning?
We demonstrate that, mathematically, this mechanism is sound: the update remains an unbiased gradient step on \(J(\theta)\), ensuring monotonic improvement even in such cases.
1. Objective.
For context \(x_{<t}\) and target token \(x_t\), the objective is the expected information gain of the sampled thought,
\[
J(\theta) \;=\; \mathbb{E}_{c \,\sim\, \pi_\theta(\cdot \mid x_{<t})}\big[\, r(c) \,\big],
\qquad
r(c) \;=\; \log p_\theta(x_t \mid x_{<t}, c) \;-\; \log \bar p_\phi(x_t \mid x_{<t}),
\]
where \(\bar p_\phi\) is the no-think EMA baseline. Maximizing \(J\) reduces cross-entropy versus the no-think baseline. Ignoring stop-gradients, the policy gradient is the score-function estimator
\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{c \,\sim\, \pi_\theta}\big[\, r(c)\, \nabla_\theta \log \pi_\theta(c \mid x_{<t}) \,\big].
\]
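A minimal sketch of the corresponding surrogate loss, assuming the reward/advantage is detached (the stop-gradient above) and credit is spread uniformly over each thought's tokens; the tensor names and shapes are illustrative, not the paper's implementation.

```python
import torch

def rlp_policy_loss(thought_token_logps: torch.Tensor,
                    thought_lengths: torch.Tensor,
                    advantages: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate for a group of G sampled thoughts.

    thought_token_logps : (G, L) log pi_theta(c_k | x_<t, c_<k) for every token
                          of each sampled thought, zero-padded to length L.
    thought_lengths     : (G,) number of real (non-pad) tokens |c^(i)|.
    advantages          : (G,) per-thought weights, e.g. the group-relative
                          advantages A(c^(i)) defined in the next subsection.
    The advantages are detached so they act as fixed weights; each thought
    token then receives a per-token weight of A_i / |c_i|, and minimizing
    the loss performs gradient ascent on J.
    """
    per_token_weight = advantages.detach() / thought_lengths.clamp(min=1)
    weighted = per_token_weight.unsqueeze(-1) * thought_token_logps  # (G, L)
    return -weighted.sum(dim=-1).mean()
```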
2. Group-relative advantages are unbiased.
We draw \(G \geq 2\) thoughts \(c^{(1)},\ldots,c^{(G)} \sim \pi_\theta\) and form the group-relative advantages and gradient estimator
\[
A(c^{(i)}) \;=\; r(c^{(i)}) \;-\; \frac{1}{G-1}\sum_{j \neq i} r(c^{(j)}),
\qquad
\hat g \;=\; \frac{1}{G}\sum_{i=1}^{G} A(c^{(i)})\, \nabla_\theta \log \pi_\theta(c^{(i)} \mid x_{<t}).
\]
Let \(\mu = \mathbb{E}[r(c)]\). Then
\[
\mathbb{E}\big[A(c^{(i)})\big] \;=\; \mathbb{E}\big[r(c^{(i)})\big] \;-\; \frac{1}{G-1}\sum_{j \neq i}\mathbb{E}\big[r(c^{(j)})\big] \;=\; \mu - \mu \;=\; 0,
\]
and, since the leave-one-out baseline is independent of \(c^{(i)}\),
\[
\mathbb{E}\big[\hat g\big] \;=\; \mathbb{E}\big[r(c)\,\nabla_\theta \log \pi_\theta(c \mid x_{<t})\big] \;=\; \nabla_\theta J(\theta).
\]
Hence, the estimator is unbiased. Even if all rewards are negative, the update follows the correct gradient direction.
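This can also be checked numerically on a toy categorical policy over four fixed thoughts (all values below are made up for illustration): the leave-one-out, group-relative estimator recovers the exact gradient of \(J\) with respect to the policy logits.

```python
import numpy as np

rng = np.random.default_rng(0)

r = np.array([-0.9, -0.6, -0.4, -0.2])        # every reward is negative
logits = np.array([0.2, -0.1, 0.3, 0.0])
pi = np.exp(logits) / np.exp(logits).sum()

# Exact gradient of J(z) = sum_k pi_k r_k with respect to the logits z.
J = pi @ r
exact_grad = pi * (r - J)

# Monte Carlo estimate with G = 4 thoughts per group and
# leave-one-out (group-relative) advantages.
G, n_groups = 4, 200_000
est = np.zeros_like(logits)
for _ in range(n_groups):
    idx = rng.choice(len(r), size=G, p=pi)            # sample a group of thoughts
    rewards = r[idx]
    baseline = (rewards.sum() - rewards) / (G - 1)    # mean of the other rewards
    adv = rewards - baseline
    for i, a in zip(idx, adv):
        score = -pi.copy()
        score[i] += 1.0                               # grad of log pi_i w.r.t. logits
        est += (a / G) * score
est /= n_groups

print(np.round(exact_grad, 4))
print(np.round(est, 4))  # matches the exact gradient up to Monte Carlo noise
```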
3. Why a positive advantage for the “least-bad” rollout is correct.
As the model learns, it gradually increases the probability of generating thoughts that help prediction and decreases the probability of those that do not. This process, known as the replicator dynamic, captures how relative advantages drive steady improvement over time:
\[
\dot\pi(c) \;=\; \pi(c)\,\big(r(c) - \mathbb{E}_{\pi}[r]\big),
\]
whose improvement rate is
\[
\frac{d}{dt}\, J \;=\; \frac{d}{dt}\, \mathbb{E}_{\pi}\big[r(c)\big] \;=\; \mathrm{Var}_{\pi}\big[r(c)\big] \;\geq\; 0.
\]
Even if all \(r(c) < 0\), shifting probability mass from more-negative to less-negative thoughts increases \(J\). Thus, a positive advantage for the least-bad thought reflects correct relative improvement, not misaligned reward.
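The improvement-rate identity used here, \(\tfrac{d}{dt} J = \mathrm{Var}_{\pi}[r] \geq 0\), can be verified with a short Euler simulation of the replicator dynamic on toy, all-negative rewards (values chosen for illustration only).

```python
import numpy as np

r = np.array([-0.9, -0.6, -0.4, -0.2])   # every thought is harmful
pi = np.full(4, 0.25)                    # uniform initial policy
dt = 1e-4

for step in range(5):
    J = pi @ r
    var = pi @ (r - J) ** 2
    # One Euler step of the replicator dynamic: pi_dot = pi * (r - E_pi[r]).
    pi_next = pi + dt * pi * (r - J)
    pi_next /= pi_next.sum()             # guard against round-off drift
    J_next = pi_next @ r
    # The finite-difference improvement rate equals Var_pi[r] >= 0,
    # so J rises even though all rewards are negative.
    print(f"dJ/dt ~ {(J_next - J) / dt:.6f}   Var_pi[r] = {var:.6f}")
    pi = pi_next
```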
4. Monotonic expected improvement.
With unbiased gradient estimator \(\hat g\) (so \(\mathbb{E}[\hat g] = \nabla J(\theta)\)) and small step size \(\alpha\), a first-order expansion gives
\[
\mathbb{E}\big[J(\theta + \alpha \hat g)\big] \;\approx\; J(\theta) + \alpha\, \mathbb{E}[\hat g]^{\top} \nabla J(\theta) \;=\; J(\theta) + \alpha\, \big\|\nabla J(\theta)\big\|^2 \;\geq\; J(\theta),
\]
ensuring monotonic improvement in expectation.
5. The gradient does not blindly increase harmful thoughts.
A remaining concern is that a thought with negative reward \(r(c)<0\) might still receive a positive advantage \(A(c)>0\) if it is simply less harmful than its peers, apparently encouraging bad reasoning. However, the gradient update does not blindly amplify such thoughts; it reallocates probability mass among them in a way that improves the expected objective.
First, because the advantages are defined as
\[
A(c^{(i)}) \;=\; r(c^{(i)}) \;-\; \frac{1}{G-1}\sum_{j \neq i} r(c^{(j)}),
\]
the total \(\sum_i A(c^{(i)}) = 0\). Hence, even if every reward is negative, the update is zero-sum: probability increases only for thoughts that are *less negative* than average, while it decreases for those that are worse. This shift raises the expected reward \(J(\theta)\) because the expected improvement rate under the induced replicator dynamic is
\[
\frac{d}{dt}\,\mathbb{E}_{\pi}\big[r(c)\big] \;=\; \mathrm{Var}_{\pi}\big[r(c)\big] \;\geq\; 0.
\]
Thus, the method performs a relative reallocation and guarantees monotonic ascent in expectation.
Second, a positive advantage \(A(c)>0\) does not deterministically increase the corresponding \(r(c)\) on the next update; it increases it in expectation. The policy gradient on thought tokens,
\[
\hat g \;=\; \frac{1}{G}\sum_{i=1}^{G} \frac{A(c^{(i)})}{|c^{(i)}|} \sum_{k=1}^{|c^{(i)}|} \nabla_\theta \log \pi_\theta\big(c^{(i)}_k \mid x_{<t},\, c^{(i)}_{<k}\big),
\]
acts on the relative usefulness of each thought, not its absolute reward value. Over repeated steps, the model raises the log-evidence \(\log p_\theta(x_t\mid x_{<t},c)\) for those thoughts that contribute more to prediction, thereby increasing their expected \(r(c)\) relative to the slowly moving EMA baseline \(\bar p_\phi\).
Third, the EMA baseline prevents artificial reward inflation. Because \(\bar p_\phi\) lags behind \(\theta\) through a slow exponential moving average, any transient or spurious improvement in \(r(c)\) dissipates as the baseline catches up. Sustained positive advantages arise only when the model genuinely improves predictive likelihood relative to the no-think counterfactual.
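As an illustration of this lagging behavior, here is a minimal EMA-update sketch in PyTorch; the decay value and the tiny stand-in model are assumptions for the example, not the configuration used in the paper.

```python
import copy
import torch

def update_ema_baseline(model: torch.nn.Module,
                        ema_model: torch.nn.Module,
                        decay: float = 0.999) -> None:
    """Nudge the frozen no-think baseline toward the current policy.

    The baseline is never optimized directly; it only tracks the policy
    weights through a slow exponential moving average, so a transient jump
    in r(c) fades once the baseline catches up.
    """
    with torch.no_grad():
        for p, p_ema in zip(model.parameters(), ema_model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# Tiny stand-in for the policy, just to show the call pattern.
policy = torch.nn.Linear(4, 4)
baseline = copy.deepcopy(policy)   # frozen copy of the policy at initialization
for _ in range(3):                 # ...inside the pretraining loop...
    # (the policy would be updated on the RLP objective here)
    update_ema_baseline(policy, baseline, decay=0.999)
```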
Finally, while a positive advantage can momentarily reinforce a thought whose raw reward remains negative, this update is not pathological. It simply redirects probability toward the least harmful reasoning pattern available, reducing overall loss. Over time, these relatively better thoughts typically evolve into genuinely helpful ones as their predictive evidence increases, ensuring that the training process remains stable and aligned with maximizing \(J(\theta)\).
➣ Numerical Illustration of Relative Advantage Updates
To make the abstract dynamics more concrete, we present a simple numerical example showing how the group-relative advantage mechanism improves the expected objective \(J(\pi;r)\) even when all rewards are initially negative. Note that in this illustrative example we denote the expected reward as \(J(\pi; r)\) to emphasize its dependence on the discrete policy over thoughts \(\pi\) and fixed rewards \(r_i\). Conceptually, this corresponds to the same information-gain objective \(J(\theta)\) introduced in the main text, expressed here in a simplified form.
We consider four sampled thoughts \(c_1,c_2,c_3,c_4\) with policy \(\pi = [\pi_1,\pi_2,\pi_3,\pi_4]\), initialized uniformly. For each thought, the information-gain reward is
\[
r_i \;=\; \log p_\theta(x_t \mid x_{<t}, c_i) \;-\; \log \bar p_\phi(x_t \mid x_{<t}),
\]
and the group size is \(G=4\) with mean reward \(\bar r = \tfrac{1}{4}\sum_i r_i\). The group-relative advantage is
\[
A_i \;=\; r_i \;-\; \frac{1}{G-1}\sum_{j \neq i} r_j \;=\; \frac{G}{G-1}\,\big(r_i - \bar r\big),
\]
and we assume each thought has length \(|c_i|=4\), so that the per-token weight is \(A_i/4\). The policy is updated by an exponentiated-gradient (replicator) step
\[
\pi_i^{(k+1)} \;=\; \frac{\pi_i^{(k)}\,\exp\!\big(\eta\, A_i^{(k)}\big)}{\sum_j \pi_j^{(k)}\,\exp\!\big(\eta\, A_j^{(k)}\big)},
\]
and the expected objective is \(J(\pi;r) = \sum_i \pi_i r_i\).
Although a positive advantage can momentarily reinforce a thought whose raw reward $r_i$ is still negative, this update is not pathological. Because advantages are computed relative to the group mean, a positive \(A_i\) simply indicates that \(c_i\) is less harmful than its peers. Increasing its probability reallocates mass away from worse alternatives, thereby improving the expected objective \(J\). Over subsequent updates, the model typically adapts to make these less-harmful thoughts genuinely helpful, raising \(r_i\) in expectation.
Iteration 1: all thoughts are harmful (\(r_i<0\)), but one is least bad. The rewards are \(r^{(1)} = [-0.80,\, -0.60,\, -0.50,\, -0.30]\).
Mean and advantages:
\[
\bar r^{(1)} = -0.55,
\qquad
A^{(1)} = [-0.3333,\; -0.0667,\; +0.0667,\; +0.3333].
\]
Per-token weights: \(A^{(1)}/|c| = [-0.0833, -0.0167, +0.0167, +0.0833]\). Note that \(c_4\) has \(r_4=-0.30<0\) yet receives a positive advantage \(A_4=+0.3333\), so every token in \(c_4\) gets a positive gradient. Policy update with \(\eta=0.5\) gives
\[
\pi^{(1)} \approx [0.2101,\; 0.2401,\; 0.2566,\; 0.2932],
\]
yielding \(J(\pi^{(0)};r^{(1)})=-0.5500\) and \(J(\pi^{(1)};r^{(1)})=-0.5284\). This is a small but consistent improvement.
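Iteration 1 can be reproduced in a few lines. The reward vector below is inferred from the reported advantages, per-token weights, and objective values, and the update applies the exponentiated-gradient step with \(\eta A_i\) in the exponent, so treat this as a reconstruction rather than the authors' script.

```python
import numpy as np

r = np.array([-0.80, -0.60, -0.50, -0.30])   # inferred iteration-1 rewards
pi = np.full(4, 0.25)                        # uniform initial policy
eta, G, length = 0.5, 4, 4

# Leave-one-out (group-relative) advantages; note that they sum to zero.
baseline = (r.sum() - r) / (G - 1)
A = r - baseline
print(np.round(A, 4))           # [-0.3333 -0.0667  0.0667  0.3333]
print(np.round(A / length, 4))  # per-token weights [-0.0833 -0.0167  0.0167  0.0833]

print(round(pi @ r, 4))         # J(pi^(0); r^(1)) = -0.55

# Exponentiated-gradient (replicator) step with eta = 0.5.
pi_new = pi * np.exp(eta * A)
pi_new /= pi_new.sum()
print(np.round(pi_new, 4))      # ~[0.2101 0.2401 0.2566 0.2932]
print(round(pi_new @ r, 4))     # J(pi^(1); r^(1)) = -0.5284
```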
Iteration 2: dense updates improve $r$ on \(c_3,c_4\).
Update:
Expected objective: \(J(\pi^{(1)};r^{(2)})=-0.4313\), \(J(\pi^{(2)};r^{(2)})=-0.3856\).
Iteration 3: the least-bad thought becomes genuinely helpful.
Policy update:
and the expected objective improves again: \(J(\pi^{(2)};r^{(3)})=-0.2916, J(\pi^{(3)};r^{(3)})=-0.2244\).
As shown above, in Iteration 1, all rewards are negative, yet \(c_4\) (the least bad) has a positive advantage, showing how the dense loss pushes probability toward less harmful thoughts and increases $J$. Since rewards are tied to log-evidence, these positive gradients directly improve the corresponding \(r(c)\) values, leading to less-negative and eventually positive rewards in later iterations.
Conclusion
We introduce Reinforcement Learning Pretraining (RLP), which reframes how we think about training large language models. Instead of waiting until post-training to add reinforcement learning, RLP weaves reasoning directly into the pretraining stage—rewarding chains of thought by the value they bring to next-token prediction. The result is models that think before they predict, with reasoning skills that persist and compound through alignment.
Citation
@article{hatamizadeh2025rlp,
title={RLP: Reinforcement as a Pretraining Objective},
author={Hatamizadeh, Ali and Akter, Syeda Nahida and Prabhumoye, Shrimai and Kautz, Jan and Patwary, Mostofa and Shoeybi, Mohammad and Catanzaro, Bryan and Choi, Yejin},
journal={arXiv preprint arXiv:2510.01265},
year={2025}
}