NFT

Implementation of Negative-aware Finetuning (NFT) algorithm for "Bridging Supervised Learning and Reinforcement Learning in Math Reasoning"

<div align='center'> <h1>Negative-aware Fine-Tuning (NFT): Bridging Supervised Learning and Reinforcement Learning in Math Reasoning </h1>

</div> <p align="center"> <img src="./assets/algorithm_spectrum_NFT.jpg" alt="seed logo" style="width:80%;"> </p>

NFT is a pure supervised learning method for improving LLMs' math-reasoning abilities with no external teachers.

As an SL method, NFT outperforms leading RL algorithms like GRPO and DAPO in 7B model experiments and performs similarly to DAPO in 32B settings.
NFT allows LLMs to be optimized directly on negative data, and thereby significantly outperforms other SL baselines such as Rejection sampling Fine-Tuning (RFT).
NFT is equivalent to GRPO when training is strictly on-policy, despite their entirely different theoretical foundations.

NFT shows that self-reflective improvement is not an inherent privilege of RL algorithms. Rather, the current gap between SL and RL methods stems from how effectively each leverages negative data.

Algorithm Overview

NFT bridges reinforcement learning and supervised learning by leveraging negative feedback through supervision.

The NFT pipeline consists of:

Data Collection: The LLM generates answers to math questions, which are split into positive and negative sets based on correctness.
Implicit Negative Policy: NFT constructs a policy that models the negative answers using the same parameters as the positive policy.
Policy Optimization: Both positive and negative answers are used to optimize the LLM via supervised learning.
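The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, likelihood ratios are scalar per answer rather than token-level tensors, the clipping constant `eps` is an assumption, and the implicit negative policy is derived from the mixture identity pi_old = r_q * pi_pos + (1 - r_q) * pi_neg, where r_q is the fraction of correct answers for a question.

```python
import math

def split_by_correctness(answers, is_correct):
    """Step 1: split sampled answers into positive/negative sets.

    Returns (positives, negatives, r_q), where r_q is the correctness rate.
    """
    positives = [a for a, ok in zip(answers, is_correct) if ok]
    negatives = [a for a, ok in zip(answers, is_correct) if not ok]
    r_q = len(positives) / len(answers)
    return positives, negatives, r_q

def implicit_negative_ratio(pos_ratio, r_q, eps=1e-6):
    """Step 2: likelihood ratio pi_neg / pi_old implied by the positive policy.

    pos_ratio is pi_theta(a|q) / pi_old(a|q) for the trained (positive) policy.
    Only meaningful for questions with r_q < 1 (at least one wrong answer).
    The lower clip `eps` is an assumption that keeps the log finite.
    """
    return max((1.0 - r_q * pos_ratio) / (1.0 - r_q), eps)

def nft_loss(pos_ratio, correct, r_q, neg_weight=1.0):
    """Step 3: supervised (negative log-likelihood) loss for one answer.

    Positive answers maximize the positive policy's likelihood; negative
    answers maximize the implicit negative policy's likelihood.
    """
    if correct:
        return -math.log(pos_ratio)
    return -neg_weight * math.log(implicit_negative_ratio(pos_ratio, r_q))

answers = ["a1", "a2", "a3", "a4"]
positives, negatives, r_q = split_by_correctness(answers, [True, False, False, False])
# Raising the likelihood of a wrong answer (pos_ratio > 1) is penalized.
print(r_q, nft_loss(2.0, correct=False, r_q=r_q))
```

In this sketch, setting `neg_weight=0.0` removes the negative-answer term entirely, which mirrors how the hyperparameter section describes recovering RFT.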

Experimental Results

Comparison of NFT-7B with other zero-shot math models in the Qwen series.

NFT performs competitively compared with other algorithms. We report avg@32 for AIME24, AIME25, and AMC23 and avg@1 for others.
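As a reading aid for the metric names, here is a small sketch of how avg@k can be computed, assuming avg@k denotes the accuracy averaged over k sampled answers per question and then over questions. The function name and input layout are hypothetical and are not taken from the evaluation scripts.

```python
def avg_at_k(correct_per_question):
    """Mean accuracy over k samples per question, averaged over questions.

    correct_per_question: list of lists of booleans, one inner list of
    length k per question (True if that sampled answer was correct).
    """
    per_question = [sum(samples) / len(samples) for samples in correct_per_question]
    return sum(per_question) / len(per_question)

# Two questions, k = 2 samples each.
print(avg_at_k([[True, False], [True, True]]))
```

With k = 1 this reduces to ordinary single-sample accuracy.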

Validation accuracy curves showing NFT's ability to leverage negative data for continuous improvement.

Evaluation

Environment setup

We use exactly the same environment configuration as the official DAPO codebase.

pip install git+ssh://git@github.com/volcengine/verl.git@01ef7184821d0d7844796ec0ced17665c1f50673

Benchmarking

Pretrained 7B and 32B models are available on Hugging Face.

We provide an evaluation codebase integrated into the VeRL infrastructure. Please refer to eval_local_7B.sh and eval_local_32B.sh for the evaluation scripts.

Training

Environment setup

Training uses the same environment as evaluation; see the environment setup under Evaluation.

Datasets

We use the public DAPO-Math-17k dataset for training and 6 public math benchmarks for validation. Download the pre-sorted training and validation data by running:

bash download_data.sh

Base Model

bash download_model.sh

Starting Experiments

Please see train_7B.sh and train_32B.sh for single-node launch scripts. Note that we run 7B experiments on 4×8 H100s and 32B experiments on 16×8 H100s. Please refer to the VeRL instructions for launching distributed tasks.

Hyperparameters:

neg_weight: The weight of negative data in NFT's objective. Set to 1.0 for the default NFT config. Set to 0.0 to recover RFT by masking out all negative-data loss. Set to -1.0 to run the DAPO algorithm for comparison.
normalize: Controls the per-prompt weight in NFT's objective. Set to 0 so that all questions are weighted equally. Set to 1 (default) or 2 to prioritize harder questions. In on-policy training, normalize=1 matches the Dr. GRPO algorithm, while normalize=2 matches standard GRPO.
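To make the Dr. GRPO versus GRPO distinction concrete, the sketch below computes both group-relative advantages for binary (correct/incorrect) rewards, where r_q is a question's correctness rate. It assumes the usual definitions: Dr. GRPO uses reward minus group mean, and GRPO additionally divides by the group standard deviation. The function names are hypothetical and this is not the repository's code.

```python
import math

def dr_grpo_advantages(r_q):
    """(A_pos, A_neg) for binary rewards: reward minus group mean."""
    return 1.0 - r_q, -r_q

def grpo_advantages(r_q):
    """(A_pos, A_neg) for binary rewards: additionally divided by group std.

    Requires 0 < r_q < 1 so that the standard deviation is non-zero.
    """
    std = math.sqrt(r_q * (1.0 - r_q))
    return (1.0 - r_q) / std, -r_q / std

for r_q in (0.25, 0.75):
    a_pos, a_neg = dr_grpo_advantages(r_q)
    g_pos, g_neg = grpo_advantages(r_q)
    # Both variants share the same negative/positive ratio, -r_q / (1 - r_q);
    # they differ only by a per-question scale factor.
    print(r_q, a_neg / a_pos, g_neg / g_pos, g_pos / a_pos)
```

Under these assumptions the two variants differ only in a per-question scale, which is the kind of prompt weight that `normalize` is described as controlling, and in both cases a question with a lower correctness rate gives its correct answers a larger advantage.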

Acknowledgement

We thank the verl team for providing the awesome open-source RL infrastructure.

Citation

If you find our project helpful, please consider citing:

@article{chen2025bridging,
      title         = {Bridging Supervised Learning and Reinforcement Learning in Math Reasoning},
      author        = {Huayu Chen and Kaiwen Zheng and Qinsheng Zhang and Ganqu Cui and Yin Cui and Haotian Ye and Tsung-Yi Lin and Ming-Yu Liu and Jun Zhu and Haoxiang Wang},
      journal       = {arXiv preprint arXiv:2505.18116},
      year          = {2025}
}
