# SkyThought

Sky-T1: Train your own O1 preview model within $450
<div align="center" style="font-family: Arial, sans-serif;"> <p> <a href="#news" style="text-decoration: none; font-weight: bold;">News</a> • <a href="#links" style="text-decoration: none; font-weight: bold;">Links</a> • <a href="#getting-started" style="text-decoration: none; font-weight: bold;">Getting Started</a> • <a href="#evaluation" style="text-decoration: none; font-weight: bold;">Evaluation</a> • <a href="#citation" style="text-decoration: none; font-weight: bold;">Citation</a> • <a href="#acknowledgement" style="text-decoration: none; font-weight: bold;">Acknowledgement</a> </p> </div>

## News
- [2025/02/21] 🎉 We released S*: Test-time scaling for code generation (paper, code), a simple and extensible test-time scaling framework for code generation.
- [2025/02/11] 🎉 We released Sky-T1-7B (model) and Sky-T1-mini (model) to demonstrate the potential of RL in further enhancing the model's capabilities beyond distillation.
- [2025/01/23] ⚡️ We released Sky-T1-32B-Flash (model, data) to tackle overthinking and reduce reasoning sequence lengths while maintaining accuracy.
- [2025/01/19] 🎉 The chat demo for Sky-T1-32B-Preview is live! Please check it out!
- [2025/01/10] 🎉 We have released our Sky-T1-32B-Preview model and data through HuggingFace!
## Links
- 📜 Sky-T1-7B and Sky-T1-mini Blog Post
- 📜 Sky-T1-32B-Flash Blog Post
- 📜 Sky-T1-32B-Preview model Blog Post
- 🤗 Sky-T1-32B-Preview model
## Getting Started

We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview. You can find more details in each directory:

- `recipes`: Recipes - data curation steps and training strategies - for building our models: Sky-T1-32B-Flash, Sky-T1-32B-Preview, and the Sky-T1-7B series.
- `skythought/evals`: Our data generation and evaluation library. We provide a convenient CLI for evaluation as well as a `Scorer` API for scoring during data curation and training (example).
- `skythought/train`: Training scripts for Sky-T1. We use Llama-Factory to perform training.
- `skythought/skythought-rl`: RL training code for Sky-T1-7B and Sky-T1-mini.
## Evaluation

### Usage
You can install the latest release from PyPI or from source:

```shell
pip install skythought
```
### Installing from source

```shell
# Clone the repository
git clone https://github.com/NovaSky-AI/SkyThought.git
cd SkyThought

# Create and activate a virtual environment (using uv here)
uv venv --python 3.10
source .venv/bin/activate

# Install the package in editable mode
uv pip install -e .
```
Running evaluation is as simple as:

```shell
skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task aime24
```
We support a wide variety of datasets in mathematics, science and coding:
- AIME'24
- MATH500
- GPQADiamond
- MMLU
- ARC-Challenge
- OlympiadBench
- AMC'23
- TACO
- APPS
- LiveCodeBench
- MMLU Pro
- MinervaMath
- GSM8K
- AIME'25
For more details, please refer to our evaluation guide and the evaluation README.
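The single-task command above extends naturally to a sweep over several benchmarks. A minimal shell sketch, assuming the lowercase task identifiers below match the CLI's naming (they are illustrative guesses based on the task list; check the evaluation README for the exact names your version supports):

```shell
# Sweep the evaluation CLI over several benchmarks in sequence.
# NOTE: the task identifiers here are assumptions, not verified names.
MODEL="NovaSky-AI/Sky-T1-32B-Preview"
for task in aime24 math500 gsm8k; do
  skythought evaluate --model "$MODEL" --task "$task"
done
```

Each invocation is independent, so a failed benchmark run can be retried on its own without repeating the others.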
## Evaluation results

Below, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.
| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ | o1-preview |
|--------------------------|--------------------|-----------------------|-------|------------|
| Math500 | 86.4 | 81.4 | 92.2 | 81.4 |
| AIME2024 | 43.3 | 16.7 | 50.0 | 40.0 |
| LiveCodeBench-Easy | 86.3 | 84.6 | 90.7 | 92.9 |
| LiveCodeBench-Medium | 56.8 | 40.8 | 56.3 | 54.9 |
| LiveCodeBench-Hard | 17.9 | 9.8 | 17.1 | 16.3 |
| GPQA-Diamond | 56.8 | 45.5 | 52.5 | 75.2 |
| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | 59.2 |
## Results on non-reasoning benchmarks

We also evaluate on non-reasoning benchmarks (instruction following, question answering, and other general capabilities) to test whether the model has traded off capability in other domains for better performance on reasoning-related benchmarks.
| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ-32B-Preview | Eval Implementation |
|---------|-------------------|---------------------|-----------------|-------------------|
| MMLU (0 shot; no CoT) | 78.36 | 74.14 | 71.23 | lm_eval |
| MMLU (5 shot; no CoT) | 82.46 | 82.62 | 82.32 | lm_eval |
| ARC-C (0 shot; no CoT) | 49.49 | 49.4 | 49.66 | lm_eval |
| IFEval | 75.79 | 78.74 | 42.51 | lm_eval |
| LLM-as-a-Judge | 9.12 | 9.19 | 8.30 | fastchat |
| MGSM (0 shot; direct) | 33 | 42.3 | 19.07 | lm_eval |
| MGSM (8-shot; direct) | 58.4 | 61.47 | 58.5 | lm_eval |
| BFCL-v3 | 53.18 | 58.92 | 17.41 | BFCL |
| Arena-Hard | 74.79 | 66.51 | 52.6 | Arena-Hard-Auto |
For more details, please refer here.
## Fully Open-source: Driving Progress Together

We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open source all details (i.e., data, code, model weights) to enable the community to easily replicate and improve on our results:
<table> <thead> <tr> <th>Model</th> <th style="background-color: #f2f2f2;"><div align="center">Sky-T1-32B-Preview</div></th> <th><div align="center">STILL-2</div></th> <th><div align="center">Journey</div></th> <th><div align="center">QwQ</div></th> <th><div align="center">o1</div></th> </tr> </thead> <tbody> <tr> <td>Data</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> </tr> <tr> <td>Code</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> </tr> <tr> <td>Report</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> </tr> <tr> <td>Math domain</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> </tr> <tr> <td>Coding domain</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> </tr> <tr> <td>Model Weights</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> </tr> </tbody> </table>

## Citation
The code in this repository is mostly described in the post below. Please consider citing this work if you find the repository helpful.
```bibtex
@misc{sky_t1_2025,
  author       = {NovaSky Team},
  title        = {Sky-T1: Train your own O1 preview model within $450},
  howpublished = {https://novasky-ai.github.io/posts/sky-t1},
  note         = {Accessed: 2025-01-},
}
```
