
<div align="center">

UFT: Unifying Supervised and Reinforcement Fine-Tuning

Mingyang Liu, Gabriele Farina, Asuman Ozdaglar

Paper Hugging Face Collection

<div align="center" style="font-family: Arial, sans-serif;"> <p> <a href="#results" style="text-decoration: none; font-weight: bold;">📊 Results</a> • <a href="#installation" style="text-decoration: none; font-weight: bold;">🛠️ Installation</a> </p> <p> <a href="#usage" style="text-decoration: none; font-weight: bold;">⚙️ Usage </a> • <a href="#acknowledgement" style="text-decoration: none; font-weight: bold;">🌻 Acknowledgement</a> • <a href="#citation" style="text-decoration: none; font-weight: bold;">📝 Citation</a> </p> </div> </div>

Results

Figure: Accuracy of different algorithms averaged over Qwen2.5-0.5/1.5/3B

Figure: Accuracy of different algorithms on Qwen2.5-0.5B

Figure: Accuracy of different algorithms on Qwen2.5-3B

Installation

conda create -n uft python=3.9
conda activate uft
bash install.sh
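After installation, a quick sanity check that the new environment is active (a minimal sketch; it only verifies the Python version requested above, since the contents of install.sh are not shown here):

```shell
# With the uft environment created above active, this should print 3.9.
python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])'
```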

Usage

Training

python run.py
  --algo              Algorithm to use: {sft, rft, stage, r3, uft}
  --n_gpu             Number of GPUs
  --visible-devices   GPU indices to use, e.g., "0,1,2,3"
  --T                 Total training steps (default: 500)
  --T_hint            Maximum training steps with hint (default: 300)
  --data              Dataset: {countdown, math, kk_logic, others}
  --model             Model name (e.g., Qwen2.5-1.5B)
  --tp_size           Tensor parallel size
  --eval              If set, evaluate the model instead of training
  --idx               Index of the current process (default: 0)
  --sft_loss_coef     Coefficient for the additional log-likelihood term on the hint
  --n_rollout         Number of trajectory rollouts (default: 4)

Example

python run.py --model Qwen/Qwen2.5-1.5B --data countdown
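For reference, the option table above could be mirrored by an argparse parser along these lines. This is an illustrative sketch, not the actual run.py: only the defaults stated in the README are encoded, and everything else is left unset.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the run.py command-line interface, reconstructed from
    # the option table above; the real script may differ.
    p = argparse.ArgumentParser(description="UFT launcher (illustrative sketch)")
    p.add_argument("--algo", choices=["sft", "rft", "stage", "r3", "uft"])
    p.add_argument("--n_gpu", type=int)
    p.add_argument("--visible-devices", dest="visible_devices", type=str)
    p.add_argument("--T", type=int, default=500)        # total training steps
    p.add_argument("--T_hint", type=int, default=300)   # max training steps with hint
    p.add_argument("--data", choices=["countdown", "math", "kk_logic", "others"])
    p.add_argument("--model", type=str)
    p.add_argument("--tp_size", type=int)               # tensor parallel size
    p.add_argument("--eval", action="store_true")       # evaluate instead of train
    p.add_argument("--idx", type=int, default=0)        # index of current process
    p.add_argument("--sft_loss_coef", type=float)       # coef. for hint log-likelihood term
    p.add_argument("--n_rollout", type=int, default=4)  # trajectory rollouts
    return p

args = build_parser().parse_args(["--model", "Qwen/Qwen2.5-1.5B", "--data", "countdown"])
print(args.data, args.T, args.n_rollout)  # → countdown 500 4
```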

Requirements

  • Qwen2.5-0.5/1.5B and Llama-3.2-1B: 2× H100
  • Qwen2.5-3B and Llama-3.2-3B: 4× H100

Qwen2.5-0.5/1.5B and Llama-3.2-1B can be trained on a single H100 by setting --n_rollout 2

Major Modifications from VERL

Evaluate

Set model and dataset to the model name (e.g., Qwen/Qwen2.5-1.5B) and dataset name (e.g., countdown) to evaluate:

python run.py --model {model} --data {dataset} --eval

For example: python run.py --model Qwen/Qwen2.5-1.5B --data countdown --eval

Acknowledgement

Citation

@article{UFT,
  author  = {Liu, Mingyang and Farina, Gabriele and Ozdaglar, Asuman},
  title   = {UFT: Unifying Supervised and Reinforcement Fine-Tuning},
  journal = {arXiv preprint arXiv:2505.16984},
  year    = {2025}
}