Magpie

[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data generation pipeline!

Generate Convert Improve

Install / Use

/learn @magpie-align/Magpie

About this skill

Quality Score

0/100

README

This is the official repository for ICLR 2025 paper "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Magpie generates high-quality alignment data by prompting aligned LLMs with their pre-query templates. Unlike many existing synthetic data generation methods, Magpie doesn't rely on prompt engineering or seed questions for generating synthetic data. Instead, it uses the prompt template of an aligned LLM to generate both the user query and an LLM response.

🤗 Huggingface (Models and Datasets)
🧭 Dataset Navigation
🕸️ Website
📄 Technical Report
🤗 Magpie Demo (Thanks a lot for the implementation from @davanstrien!)
🐦 Chat with Magpie

🐦 News

[2025/01/22] Magpie paper is accepted by ICLR 2025!
[2025/01/09] Magpie Reasoning V2 dataset is out! 250K from Llama, Skywork-o1 and QwQ! This time, we focus on CoT 🤯
[2025/01/01] Magpie Llama-3.3 dataset is out! 1M from Llama-3.3-70B-Instruct! Happy New Year!
[2024/10/20] Magpie Qwen2.5 dataset is out! 1M from Qwen2.5 72B!
[2024/09/17] Ship two new models with SOTA performance: 𝙼𝚊𝚐𝚙𝚒𝚎𝙻𝙼-𝙲𝚑𝚊𝚝 (4B & 8B)! See collection here!
[2024/08/19] Three preference optimization datasets, Magpie-Air-DPO-100K-v0.1, Magpie-Pro-DPO-100K-v0.1, and Magpie-Llama-3.1-Pro-DPO-100K-v0.1 are out!
[2024/07/25] Magpie Llama-3.1 dataset is out! 1M from Llama-3.1-70B-Instruct! More friendly license compared with Llama-3 😃!
[2024/07/21] Magpie Gemma2 dataset is out! 534K from Gemma-2-27b-it!
[2024/07/19] Llama-3-8B-Magpie-Align-v0.3 is out with enhanced Chinese question-answering ability, thanks to our new Chinese instruction dataset!
[2024/07/14] Llama-3-8B-Magpie-Align-v0.2 is out with enhanced reasoning ability, thanks to our new reasoning booster dataset!
[2024/07/04] Magpie Qwen2 dataset is out! 1M from Qwen2 72B and 3M from Qwen2 7B.
[2024/07/03] 🏆 Our open aligned model, Llama-3-8B-Magpie-Align-v0.1 is out! It is 🏆 the best <30B Model in AI2 WildBench Leaderboard! Even better than the official Meta-Llama-3-8B-Instruct model!
[2024/06/24] Magpie Phi 3 dataset is out! 1M from Phi 3 Medium.
[2024/06/12] Magpie Llama-3 dataset is out! 1M from Llama-3 70B and 3M from Llama-3 8B.
[2024/06/12] Magpie technical report is out! Let's make high-quality alignment data open for all!

Magpie Supports

Currently, Magpie has been tested on the Llama-3, Qwen2, Phi 3 and Gemma-2 series. Please submit an issue for more model support.

|Model Family | Magpie | Magpie Scripts | Datasets | Size | |-------------|:------:|:-------|:-------|:-------| | Llama 3.3 | ✅ | 70B | 70B | 1M | | Llama 3.1 | ✅ * | 8B,70B | 70B,405B(Argilla) | 1M | | Llama 3 | ✅ | 8B,70B | 8B,70B | 3M + 1M | | Qwen2.5 | ✅ | 3B,7B,14B,32B,72B | 72B | 1M | | Qwen2 | ✅ | 7B,72B,Math 7B | 7B,72B | 3M + 1M | | Phi 3 | ✅ | mini,small,medium | medium | 1M | | Gemma-2 | ✅ ** | 9B,27B | 27B | 534K | | Gemma-1.1 | ⭕️ | 7B | Llama 2 | ⭕️ | 7B,70B | Vicuna | ⭕️ | 7B | Mistral | ⭕️ | 7B | Yi | ⭕️ | 34B | DeepSeek Coder | ⭕️ | Coder V2 Lite

✅: It works great! (* Apply a logits processor to surpress markdown; ** Apply a filter before generating responses.)
⭕️: It works! We can get something interesting, but we may need to design an additional logit processor and/or a filter.
❌: Not work.
❓: Untested.

The navigation of all available Magpie datasets can be found here.

We hope Magpie can contribute to the democratization of AI with enhanced transparency of model alignment processes!

Abstract

<details><summary>Click Here</summary> High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-qual

Related Skills

node-connect

337.1k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

83.1k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

337.1k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

commit-push-pr

83.1k

Commit, push, and open a PR