Abstract
Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely because it relies heavily on reference answers and cannot reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT), a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of discarding self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit negative policy shares parameters with the positive policy we optimize on positive data, enabling direct policy optimization on all of the LLM's generations. We conduct experiments on 7B and 32B models in math reasoning tasks. Results consistently show that, through the additional leverage of negative feedback, NFT significantly improves over SL baselines like Rejection sampling Fine-Tuning (RFT), matching or even surpassing leading RL algorithms like GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are in fact equivalent under strictly on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.
TL;DR
We propose NFT, a supervised learning method for improving LLMs' math-reasoning abilities with no external teachers.
- As an SL method, NFT outperforms leading RL algorithms like GRPO and DAPO in 7B model experiments and performs similarly to DAPO in 32B settings.
- NFT allows directly optimizing LLMs on negative data, thereby significantly outperforming other SL baselines such as Rejection sampling Fine-Tuning (RFT).
- NFT is equivalent to GRPO when training is strictly on-policy, despite their entirely different theoretical foundations.
Our findings show that self-reflective improvement is not an inherent privilege of RL algorithms. Rather, the current gap between SL and RL methods stems from how effectively they leverage negative data.
A spectrum of online algorithms for LLM fine-tuning. NFT bridges reinforcement learning and supervised learning methods by leveraging negative feedback through supervision.
Comparison of the released NFT-7B with other zero-style math models of Qwen series.
Validation accuracy averaged on 6 math benchmarks for 7B (left) and 32B (right) training. For 7B experiments, we report mean ± std results across 3–4 independent experiments.
Method
NFT Pipeline:
- Data Collection: An LLM generates answers to a set of math questions. Generation results are split into two sub-datasets based on their answer correctness.
- Implicit Negative Policy: We construct an implicit negative policy to model the negative answers. This implicit policy shares parameters with the positive policy we optimize on positive data, enabling direct policy optimization on all of the LLM's generations.
- Policy Optimization: Positive answers and negative answers are respectively used to optimize the LLM policy via supervised learning.
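The data-collection step above can be sketched as follows. This is a minimal illustration, not the released pipeline: `verify` is a hypothetical stand-in for a binary answer verifier (e.g., exact final-answer matching), and the data layout is ours.

```python
def verify(answer: str, reference: str) -> bool:
    # Hypothetical binary verifier: exact match on the final answer string.
    # Real math verifiers typically normalize expressions before comparing.
    return answer.strip() == reference.strip()

def split_generations(samples):
    """Partition (question, answer, reference) triples into positive and
    negative sub-datasets based on answer correctness."""
    positives, negatives = [], []
    for question, answer, reference in samples:
        bucket = positives if verify(answer, reference) else negatives
        bucket.append((question, answer))
    return positives, negatives
```

The two resulting sub-datasets then feed the two branches of the supervised objective: positives train the policy directly, negatives train it through the implicit negative policy.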
The success of NFT is built on two theoretical insights:
- Policy Splitting. The generation policy can be split into a positive policy and a negative policy, and re-expressed as their linear combination.
- Policy Improvement. By iteratively optimizing towards its positive split, an LLM policy can improve continuously.
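As a sketch of the splitting insight (notation ours: $q$ a question, $\pi_{\text{old}}$ the generation policy, $r(q)$ the probability a sampled answer is correct), the linear combination reads:

```latex
\pi_{\text{old}}(y \mid q) = r(q)\,\pi^{+}(y \mid q) + \bigl(1 - r(q)\bigr)\,\pi^{-}(y \mid q)
```

Parameterizing the positive split with the trained model $\pi_\theta$ and solving the identity for the negative split gives an implicit negative policy of the form $\pi^{-}_{\theta}(y \mid q) = \bigl[\pi_{\text{old}}(y \mid q) - r(q)\,\pi_{\theta}(y \mid q)\bigr] / \bigl(1 - r(q)\bigr)$, which is what lets negative answers supervise the same parameters $\theta$.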
Training Objective:
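As a hedged sketch of the objective implied by the pipeline (notation ours; the released objective may include additional weighting or clipping terms not shown here): NFT performs maximum-likelihood training of $\pi_\theta$ on positive answers and of the implicit negative policy $\pi^{-}_{\theta}$ on negative answers, where $\pi_{\text{old}}$ is the generation policy, $r(q)$ the per-question accuracy, and $\pi^{-}_{\theta}(y \mid q) = \bigl[\pi_{\text{old}}(y \mid q) - r(q)\,\pi_{\theta}(y \mid q)\bigr] / \bigl(1 - r(q)\bigr)$:

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{q,\; y \sim \pi_{\text{old}}}\Bigl[
  \mathbb{1}[y \text{ correct}]\,\log \pi_{\theta}(y \mid q)
  + \mathbb{1}[y \text{ incorrect}]\,\log \pi^{-}_{\theta}(y \mid q)
\Bigr]
```

Both terms are plain supervised (cross-entropy) losses; the negative term updates $\theta$ only through the shared parameterization of $\pi^{-}_{\theta}$.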
Results
As an SL method, NFT outperforms leading RL algorithms like GRPO and DAPO in Qwen-7B model experiments and performs similarly to DAPO in 32B settings.
7B training results. We report avg@32 for AIME24, AIME25, and AMC23 and avg@1 for others. Numbers within 1% of the max are bolded.
Negative feedback enhances performance and exploration. NFT consistently outperforms RFT by a clear margin, highlighting the benefit of incorporating negative feedback during training. Meanwhile, NFT maintains higher entropy than RFT, indicating more active exploration.
Entropy curves for 7B and 32B runs.
Training and validation accuracy curves for 7B experiments. We conducted 3–4 random and independent experiments for each algorithm and report their mean ± std results.
Citation
@inproceedings{chen2026bridging,
title={Bridging Supervised Learning and Reinforcement Learning in Math Reasoning},
author={Chen, Huayu and Zheng, Kaiwen and Zhang, Qinsheng and Cui, Ganqu and Cui, Yin and Ye, Haotian and Lin, Tsung-Yi and Liu, Ming-Yu and Zhu, Jun and Wang, Haoxiang},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}