<div align="center">

SkyThought

Sky-T1: Train your own O1 preview model within $450

Github · Twitter · Hugging Face Collection · Discord

<div align="center" style="font-family: Arial, sans-serif;"> <p> <a href="#news" style="text-decoration: none; font-weight: bold;">News</a> • <a href="#links" style="text-decoration: none; font-weight: bold;">Links</a> • <a href="#getting-started" style="text-decoration: none; font-weight: bold;">Getting Started</a> • <a href="#evaluation" style="text-decoration: none; font-weight: bold;">Evaluation</a> • <a href="#citation" style="text-decoration: none; font-weight: bold;">Citation</a> • <a href="#acknowledgement" style="text-decoration: none; font-weight: bold;">Acknowledgement</a> </p> </div> </div>

News

  • [2025/02/21] 🎉 We released S*: Test time scaling for code generation (paper, code), a simple and extensible test time scaling framework for code generation.
  • [2025/02/11] 🎉 We released Sky-T1-7B (model) and Sky-T1-mini (model) to demonstrate the potential of RL in further enhancing the model's capabilities beyond distillation.
  • [2025/01/23] ⚡️ We released Sky-T1-32B-Flash (model, data) to tackle overthinking and reduce reasoning sequence lengths while maintaining accuracy.
  • [2025/01/19] 🎉 The chat demo for Sky-T1-32B-Preview is live! Please check it out!
  • [2025/01/10] 🎉 We have released our Sky-T1-32B-Preview model and data through HuggingFace!

Links

Getting Started

We open-source the code and scripts we used for data curation, training, and evaluation of Sky-T1-32B-Preview; you can find more details in each directory.

  • recipes: Recipes (data curation steps and training strategies) for building our models: Sky-T1-32B-Flash, Sky-T1-32B-Preview, and the Sky-T1-7B series.
  • skythought/evals: Our data generation and evaluation library. We provide a convenient CLI for evaluation as well as a Scorer API for scoring during data curation and training (example).
  • skythought/train: Training scripts for Sky-T1. We use Llama-Factory to perform training.
  • skythought/skythought-rl: RL training code for Sky-T1-7B and Sky-T1-mini.

Evaluation

Usage

You can install the latest release from PyPI or from source:

pip install skythought

Installing from source

# Clone the repository
git clone https://github.com/NovaSky-AI/SkyThought.git
cd SkyThought

# Create and activate a virtual environment (using uv here)
uv venv --python 3.10
source .venv/bin/activate

# Install the package in editable mode
uv pip install -e .
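After installing, a quick sanity check is to confirm the package is visible to your interpreter. This is a minimal sketch using only the standard library; `skythought` is the package name from the install steps above:

```python
# Minimal post-install check: is the top-level package importable?
import importlib.util

def is_installed(pkg: str) -> bool:
    """Return True if the top-level package `pkg` can be found on sys.path."""
    return importlib.util.find_spec(pkg) is not None

print(is_installed("skythought"))  # True if the install above succeeded
```

If this prints `False`, make sure the virtual environment created above is still activated.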

Running evaluation is as simple as:

skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task aime24

We support a wide variety of datasets in mathematics, science and coding:

  • AIME'24
  • MATH500
  • GPQADiamond
  • MMLU
  • ARC-Challenge
  • OlympiadBench
  • AMC'23
  • TACO
  • APPS
  • LiveCodeBench
  • MMLU Pro
  • MinervaMath
  • GSM8K
  • AIME'25

For more details, please refer to our evaluation guide and the evaluation README.
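To run several benchmarks in one go, the single-task command above can be looped in a small shell script. The task identifiers below (other than `aime24`, taken from the example) are assumptions; check `skythought evaluate --help` or the evaluation README for the exact names:

```shell
# Print one evaluation command per benchmark (drop `echo` to actually run them).
# Task names besides aime24 are assumed; verify against the CLI's task list.
model="NovaSky-AI/Sky-T1-32B-Preview"
for task in aime24 math500 gpqa_diamond; do
  echo "skythought evaluate --model $model --task $task"
done
```

Keeping the `echo` lets you inspect the generated commands before committing GPU time to them.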

Evaluation results

Below, we show evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.

| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ | o1-preview |
|--------------------------|-------|-------|-------|-------|
| Math500 | 86.4 | 81.4 | 92.2 | 81.4 |
| AIME2024 | 43.3 | 16.7 | 50.0 | 40.0 |
| LiveCodeBench-Easy | 86.3 | 84.6 | 90.7 | 92.9 |
| LiveCodeBench-Medium | 56.8 | 40.8 | 56.3 | 54.9 |
| LiveCodeBench-Hard | 17.9 | 9.8 | 17.1 | 16.3 |
| GPQA-Diamond | 56.8 | 45.5 | 52.5 | 75.2 |
| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | 59.2 |
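As a quick illustration of the gains over the base model, the per-benchmark deltas between Sky-T1-32B-Preview and Qwen-2.5-32B-Instruct can be computed directly from the table above (scores copied from the table; the abbreviated `LCB-*` keys are ours):

```python
# Per-benchmark improvement of Sky-T1-32B-Preview over Qwen-2.5-32B-Instruct,
# using the scores reported in the table above.
sky = {"Math500": 86.4, "AIME2024": 43.3, "LCB-Easy": 86.3,
       "LCB-Medium": 56.8, "LCB-Hard": 17.9, "GPQA-Diamond": 56.8}
qwen = {"Math500": 81.4, "AIME2024": 16.7, "LCB-Easy": 84.6,
        "LCB-Medium": 40.8, "LCB-Hard": 9.8, "GPQA-Diamond": 45.5}
deltas = {k: round(sky[k] - qwen[k], 1) for k in sky}
print(deltas["AIME2024"])  # 26.6 -- the largest single-benchmark gain
```

The fine-tuned model improves on the base model on every benchmark in the table, with the biggest jump on AIME2024.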

Results on non-reasoning benchmarks

We also evaluate on non-reasoning benchmarks (instruction following, QA, etc.) to test whether the model has traded off capability in other domains for better performance on reasoning-related benchmarks.

| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ-32B-Preview | Eval Implementation |
|------------------------|-------|-------|-------|-------------------|
| MMLU (0 shot; no CoT) | 78.36 | 74.14 | 71.23 | lm_eval |
| MMLU (5 shot; no CoT) | 82.46 | 82.62 | 82.32 | lm_eval |
| ARC-C (0 shot; no CoT) | 49.49 | 49.4 | 49.66 | lm_eval |
| IFEval | 75.79 | 78.74 | 42.51 | lm_eval |
| LLM-as-a-Judge | 9.12 | 9.19 | 8.30 | fastchat |
| MGSM (0 shot; direct) | 33 | 42.3 | 19.07 | lm_eval |
| MGSM (8-shot; direct) | 58.4 | 61.47 | 58.5 | lm_eval |
| BFCL-v3 | 53.18 | 58.92 | 17.41 | BFCL |
| Arena-Hard | 74.79 | 66.51 | 52.6 | Arena-Hard-Auto |

For more details, refer here.

Fully Open-source: Driving Progress Together

We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, code, model weights) to enable the community to replicate and improve on our results easily:

<table> <thead> <tr> <th>Model</th> <th style="background-color: #f2f2f2;"><div align="center">Sky-T1-32B-Preview</div></th> <th><div align="center">STILL-2</div></th> <th><div align="center">Journey</div></th> <th><div align="center">QwQ</div></th> <th><div align="center">o1</div></th> </tr> </thead> <tbody> <tr> <td>Data</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> </tr> <tr> <td>Code</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> </tr> <tr> <td>Report</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> </tr> <tr> <td>Math domain</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> </tr> <tr> <td>Coding domain</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> </tr> <tr> <td>Model Weights</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> </tr> </tbody> </table>

Citation

The code in this repository is mostly described in the post below. Please consider citing this work if you find the repository helpful.

@misc{sky_t1_2025,
  author       = {NovaSky Team},
  title        = {Sky-T1: Train your own O1 preview model within $450},
  howpublished = {https://novasky-ai.github.io/posts/sky-t1},
  note         = {Accessed: 2025-01-}
}