# SkyThought

Sky-T1: Train your own O1 preview model within $450
<div align="center" style="font-family: Arial, sans-serif;"> <p> <a href="#news" style="text-decoration: none; font-weight: bold;">News</a> • <a href="#links" style="text-decoration: none; font-weight: bold;">Links</a> • <a href="#getting-started" style="text-decoration: none; font-weight: bold;">Getting Started</a> • <a href="#evaluation" style="text-decoration: none; font-weight: bold;">Evaluation</a> • <a href="#citation" style="text-decoration: none; font-weight: bold;">Citation</a> • <a href="#acknowledgement" style="text-decoration: none; font-weight: bold;">Acknowledgement</a> </p> </div>

## News
- [2025/02/21] 🎉 We released S*: Test-time scaling for code generation (paper, code), a simple and extensible test-time scaling framework for code generation.
- [2025/02/11] 🎉 We released Sky-T1-7B (model) and Sky-T1-mini (model) to demonstrate the potential of RL in further enhancing the model's capabilities beyond distillation.
- [2025/01/23] ⚡️ We released Sky-T1-32B-Flash (model, data) to tackle overthinking and reduce reasoning sequence lengths while maintaining accuracy.
- [2025/01/19] 🎉 The chat demo for Sky-T1-32B-Preview is live! Please check it out!
- [2025/01/10] 🎉 We have released our Sky-T1-32B-Preview model and data through HuggingFace!
## Links
- 📜 Sky-T1-7B and Sky-T1-mini Blog Post
- 📜 Sky-T1-32B-Flash Blog Post
- 📜 Sky-T1-32B-Preview model Blog Post
- 🤗 Sky-T1-32B-Preview model
## Getting Started

We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview. You can find more details in each directory:

- `recipes`: Recipes - data curation steps and training strategies - for building our models: Sky-T1-32B-Flash, Sky-T1-32B-Preview, and the Sky-T1-7B series.
- `skythought/evals`: Our data generation and evaluation library. We provide a convenient CLI for evaluation as well as a `Scorer` API for scoring during data curation and training (example).
- `skythought/train`: Training scripts for Sky-T1. We use Llama-Factory to perform training.
- `skythought/skythought-rl`: RL training code for Sky-T1-7B and Sky-T1-mini.
## Evaluation

### Usage
You can install the latest release from PyPI or from source:

```shell
pip install skythought
```
### Installing from source

```shell
# Clone the repository
git clone https://github.com/NovaSky-AI/SkyThought.git
cd SkyThought

# Create and activate a virtual environment (using uv here)
uv venv --python 3.10
source .venv/bin/activate

# Install the package in editable mode
uv pip install -e .
```
Running evaluation is as simple as:

```shell
skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task aime24
```
We support a wide variety of datasets in mathematics, science and coding:
- AIME'24
- MATH500
- GPQADiamond
- MMLU
- ARC-Challenge
- OlympiadBench
- AMC'23
- TACO
- APPS
- LiveCodeBench
- MMLU Pro
- MinervaMath
- GSM8K
- AIME'25
For more details, please refer to our evaluation guide and the evaluation README.
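The single-task command above extends naturally to a sweep over several benchmarks. A minimal shell sketch, assuming the lowercase task identifiers below match the CLI's naming (they are illustrative guesses based on the task list; check the evaluation README for the exact names your version supports):

```shell
# Sweep the evaluation CLI over several benchmarks in sequence.
# NOTE: the task identifiers here are assumptions, not verified names.
MODEL="NovaSky-AI/Sky-T1-32B-Preview"
for task in aime24 math500 gsm8k; do
  skythought evaluate --model "$MODEL" --task "$task"
done
```

Each invocation is independent, so a failed benchmark run can be retried on its own without repeating the others.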
## Evaluation results

Below, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.
| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ | o1-preview |
|--------------------------|--------------------|-----------------------|-------|------------|
| Math500 | 86.4 | 81.4 | 92.2 | 81.4 |
| AIME2024 | 43.3 | 16.7 | 50.0 | 40.0 |
| LiveCodeBench-Easy | 86.3 | 84.6 | 90.7 | 92.9 |
| LiveCodeBench-Medium | 56.8 | 40.8 | 56.3 | 54.9 |
| LiveCodeBench-Hard | 17.9 | 9.8 | 17.1 | 16.3 |
| GPQA-Diamond | 56.8 | 45.5 | 52.5 | 75.2 |
| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | 59.2 |
## Results on non-reasoning benchmarks

We also evaluate on non-reasoning benchmarks (instruction following, question answering, and other general capabilities) to test whether the model has traded off capability in other domains for better performance on reasoning-related benchmarks.
| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ-32B-Preview | Eval Implementation |
|---------|-------------------|---------------------|-----------------|-------------------|
| MMLU (0 shot; no CoT) | 78.36 | 74.14 | 71.23 | lm_eval |
| MMLU (5 shot; no CoT) | 82.46 | 82.62 | 82.32 | lm_eval |
| ARC-C (0 shot; no CoT) | 49.49 | 49.4 | 49.66 | lm_eval |
| IFEval | 75.79 | 78.74 | 42.51 | lm_eval |
| LLM-as-a-Judge | 9.12 | 9.19 | 8.30 | fastchat |
| MGSM (0 shot; direct) | 33 | 42.3 | 19.07 | lm_eval |
| MGSM (8-shot; direct) | 58.4 | 61.47 | 58.5 | lm_eval |
| BFCL-v3 | 53.18 | 58.92 | 17.41 | BFCL |
| Arena-Hard | 74.79 | 66.51 | 52.6 | Arena-Hard-Auto |
For more details, please refer here.
## Fully Open-source: Driving Progress Together

We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open source all details (i.e., data, code, model weights) to enable the community to easily replicate and improve on our results:
<table> <thead> <tr> <th>Model</th> <th style="background-color: #f2f2f2;"><div align="center">Sky-T1-32B-Preview</div></th> <th><div align="center">STILL-2</div></th> <th><div align="center">Journey</div></th> <th><div align="center">QwQ</div></th> <th><div align="center">o1</div></th> </tr> </thead> <tbody> <tr> <td>Data</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> </tr> <tr> <td>Code</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> </tr> <tr> <td>Report</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> </tr> <tr> <td>Math domain</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> </tr> <tr> <td>Coding domain</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">❌</div></td> <td><div align="center">✅</div></td> <td><div align="center">✅</div></td> </tr> <tr> <td>Model Weights</td> <td style="background-color: #f2f2f2;"><div align="center">✅</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> <td><div align="center">✅</div></td> <td><div align="center">❌</div></td> </tr> </tbody> </table>

## Citation
The code in this repository is mostly described in the post below. Please consider citing this work if you find the repository helpful.
```bibtex
@misc{sky_t1_2025,
  author       = {NovaSky Team},
  title        = {Sky-T1: Train your own O1 preview model within $450},
  howpublished = {https://novasky-ai.github.io/posts/sky-t1},
  note         = {Accessed: 2025-01-},
}
```
