torchtune
Overview | Installation | Get Started | Documentation | Community | Citing torchtune | License
📣 Recent updates 📣
- May 2025: torchtune has added support for Qwen3 models! Check out all the configs here
- April 2025: Llama4 is now available in torchtune! Try out our full and LoRA finetuning configs here
- February 2025: Multi-node training is officially open for business in torchtune! Full finetune on multiple nodes to take advantage of larger batch sizes and models.
- December 2024: torchtune now supports Llama 3.3 70B! Try it out by following our installation instructions here, then run any of the configs here.
- November 2024: torchtune has released v0.4.0 which includes stable support for exciting features like activation offloading and multimodal QLoRA
- November 2024: torchtune has added Gemma2 to its models!
- October 2024: torchtune added support for Qwen2.5 models - find the configs here
- September 2024: torchtune has support for Llama 3.2 11B Vision, Llama 3.2 3B, and Llama 3.2 1B models! Try them out by following our installation instructions here, then run any of the text configs here or vision configs here.
Overview 📚
torchtune is a PyTorch library for easily authoring, post-training, and experimenting with LLMs. It provides:
- Hackable training recipes for SFT, knowledge distillation, DPO, PPO, GRPO, and quantization-aware training
- Simple PyTorch implementations of popular LLMs like Llama, Gemma, Mistral, Phi, Qwen, and more
- Best-in-class memory efficiency, performance improvements, and scaling, utilizing the latest PyTorch APIs
- YAML configs for easily configuring training, evaluation, quantization or inference recipes
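The YAML configs mentioned above follow a common component-based shape. Below is an illustrative sketch only; the field values are hypothetical and may not match any shipped config (run `tune ls` to list the real recipes and configs):

```yaml
# Illustrative torchtune-style config sketch (not a shipped config).
# Each _component_ key names a builder that torchtune instantiates.
model:
  _component_: torchtune.models.llama3_2.lora_llama3_2_3b
  lora_rank: 8
  lora_alpha: 16

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Llama-3.2-3B-Instruct/original/tokenizer.model

dataset:
  _component_: torchtune.datasets.alpaca_dataset

optimizer:
  _component_: torch.optim.AdamW
  lr: 3e-4

epochs: 1
batch_size: 2
```

Any field can also be overridden from the command line at launch time without editing the file.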
Post-training recipes
torchtune supports the entire post-training lifecycle. A successfully post-trained model will likely combine several of the methods below.
Supervised Finetuning (SFT)
| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ✅ | ✅ | ✅ |
| LoRA/QLoRA | ✅ | ✅ | ✅ |

Example: `tune run lora_finetune_single_device --config llama3_2/3B_lora_single_device` <br />
You can also run e.g. `tune ls lora_finetune_single_device` for a full list of available configs.
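The idea behind the LoRA recipes above can be sketched in a few lines (illustrative only, not torchtune's implementation): instead of updating a full `d_in x d_out` weight matrix, LoRA freezes it and trains two small factors `A` (`d_in x r`) and `B` (`r x d_out`), adding their low-rank product to the frozen weight. The memory win comes from how few parameters are trainable:

```python
# Illustrative LoRA parameter-count sketch (not torchtune code).
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Trainable parameters for full finetuning vs. a rank-r LoRA adapter."""
    full = d_in * d_out          # every entry of W is trainable
    lora = d_in * r + r * d_out  # only the factors A and B are trainable
    return full, lora

# A 4096 x 4096 projection, typical of an 8B model's attention layers:
full, lora = lora_param_counts(4096, 4096, r=8)
print(full, lora)    # 16777216 65536
print(full // lora)  # 256 -- LoRA trains ~256x fewer parameters here
```

Fewer trainable parameters means smaller optimizer state and gradients, which is why LoRA fits on a single consumer GPU where full finetuning may not.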
Knowledge Distillation (KD)
| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ❌ | ❌ | ❌ |
| LoRA/QLoRA | ✅ | ✅ | ❌ |

Example: `tune run knowledge_distillation_distributed --config qwen2/1.5B_to_0.5B_KD_lora_distributed` <br />
You can also run e.g. `tune ls knowledge_distillation_distributed` for a full list of available configs.
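At the heart of knowledge distillation is a divergence term that pushes the student's token distribution toward the teacher's. A minimal sketch of that loss for one token position, using temperature-softened softmaxes (illustrative only, not torchtune's exact loss):

```python
import math

# Illustrative KD loss sketch (not torchtune code).
def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Forward KL(teacher || student) over one token's vocabulary slice."""
    p = softmax(teacher_logits, temperature)  # teacher (target) distribution
    q = softmax(student_logits, temperature)  # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

identical = kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = kd_loss([2.0, 0.5, -1.0], [0.0, 0.0, 0.0])
print(identical)              # 0.0 -- matching distributions, no loss
print(different > identical)  # True -- mismatch is penalized
```

In practice this term is averaged over all token positions and combined with the ordinary cross-entropy loss on the labels.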
Reinforcement Learning / Reinforcement Learning from Human Feedback (RLHF)
| Method | Type of Weight Update | 1 Device | >1 Device | >1 Node |
|--------|-----------------------|:--------:|:---------:|:-------:|
| DPO | Full | ❌ | ✅ | ❌ |
| | LoRA/QLoRA | ✅ | ✅ | ❌ |
| PPO | Full | ✅ | ❌ | ❌ |
| | LoRA/QLoRA | ❌ | ❌ | ❌ |
| GRPO | Full | 🚧 | ✅ | ✅ |
| | LoRA/QLoRA | ❌ | ❌ | ❌ |

Example: `tune run lora_dpo_single_device --config llama3_1/8B_dpo_single_device` <br />
You can also run e.g. `tune ls full_dpo_distributed` for a full list of available configs.
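The DPO objective used by recipes like the one above can be sketched for a single preference pair (illustrative only, not torchtune's exact implementation). Given log-probabilities of a chosen and a rejected response under the policy and under a frozen reference model, DPO maximizes the margin between the two implicit rewards:

```python
import math

# Illustrative DPO loss sketch (not torchtune code).
def dpo_loss(policy_chosen, policy_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair of log-probs."""
    chosen_reward = policy_chosen - ref_chosen        # implicit reward, chosen
    rejected_reward = policy_rejected - ref_rejected  # implicit reward, rejected
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Zero margin gives the neutral value log(2); a positive margin (policy
# prefers the chosen response more than the reference does) shrinks the loss:
print(dpo_loss(-20.0, -20.0, -20.0, -20.0))  # ~0.693, i.e. log(2)
print(dpo_loss(-10.0, -30.0, -15.0, -25.0) < math.log(2))  # True
```

Minimizing this loss nudges the policy's preference margin upward without any explicit reward model or on-policy sampling, which is why DPO is the cheapest RLHF-style method in the table.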
Quantization-Aware Training (QAT)
| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ✅ | ✅ | ❌ |
| LoRA/QLoRA | ❌ | ✅ | ❌ |

Example: `tune run qat_distributed --config llama3_1/8B_qat_lora` <br />
You can also run e.g. `tune ls qat_distributed` or `tune ls qat_single_device` for a full list of available configs.
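The core trick in QAT is "fake quantization": during training, weights are rounded to a low-bit grid in the forward pass so the model learns to tolerate quantization error before it is quantized for real. A minimal sketch of quantize-then-dequantize for a single value (illustrative only, not the torchao/torchtune implementation):

```python
# Illustrative fake-quantization sketch (not torchtune/torchao code).
def fake_quantize(x: float, scale: float, bits: int = 8) -> float:
    """Quantize x to a symmetric `bits`-bit integer grid, then dequantize."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = round(x / scale)         # snap to the integer grid
    q = max(qmin, min(qmax, q))  # clamp to the representable range
    return q * scale             # back to float, now carrying rounding error

weights = [0.013, -0.407, 0.251]
scale = 0.01
print([round(fake_quantize(w, scale), 4) for w in weights])  # [0.01, -0.41, 0.25]
print(round(fake_quantize(10.0, scale), 4))  # 1.27 -- clamped at qmax = 127
```

Because the rounding is seen at training time, the finetuned model loses much less accuracy when the real low-bit quantization is applied afterwards.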
The above configs are just examples to get you started. The full list of recipes can be found here. If you'd like to work on one of the gaps you see, please submit a PR! If there's an entirely new post-training method you'd like to see implemented in torchtune, feel free to open an Issue.
Models
For the above recipes, torchtune supports many state-of-the-art models available on the Hugging Face Hub or Kaggle Hub. Some of our supported models:
| Model | Sizes |
|-------|-------|
| Llama4 | Scout (17B x 16E) [models, configs] |
| Llama3.3 | 70B [models, configs] |
| Llama3.2-Vision | 11B, 90B [models, configs] |
| Llama3.2 | 1B, 3B [models, configs] |
| Llama3.1 | 8B, 70B, 405B [models, configs] |
| Mistral | 7B [models, configs] |
| Gemma2 | 2B, 9B, 27B [models, configs] |
| Microsoft Phi4 | 14B [models, configs] |
| Microsoft Phi3 | Mini [models, configs] |
| Qwen3 | 0.6B, 1.7B, 4B, 8B, 14B, 32B [models, configs] |
| Qwen2.5 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B [models, configs] |
| Qwen2 | 0.5B, 1.5B, 7B [models, configs] |
We're always adding new models, but feel free to file an issue if there's a new one you would like to see in torchtune.
Memory and training speed
Below is an example of the memory requirements and training speed for different Llama 3.1 models.
> [!NOTE]
> For ease of comparison, all the below numbers are provided for batch size 2 (without gradient accumulation), a dataset packed to sequence length 2048, and torch compile enabled.
If you are interested in running on different hardware or with different models, check out our documentation on memory optimizations here to find the right setup for you.
| Model | Finetuning Method | Runnable On | Peak Memory per GPU | Tokens/sec * |
|:-:|:-:|:-:|:-:|:-:|
| Llama 3.1 8B | Full finetune | 1x 4090 | 18.9 GiB | 1650 |
| Llama 3.1 8B | Full finetune | 1x A6000 | 37.4 GiB | 2579 |
| Llama 3.1 8B | LoRA | 1x 4090 | 16.2 GiB | 3083 |
| Llama 3.1 8B | LoRA | 1x A6000 | 30.3 GiB | 4699 |
| Llama 3.1 8B | QLoRA | 1x 4090 | 7.4 GiB | 2413 |
| Llama 3.1 70B | Full finetune | 8x A100 | 13.9 GiB ** | 1568 |
| Llama 3.1 70B | LoRA | 8x A100 | 27.6 GiB | 3497 |
| Llama 3.1 405B | QLoRA | 8x A100 | 44.8 GiB | 653 |

*= Measured over one full training epoch <br />
**= Uses CPU offload with fused optimizer
Optimization flags
torchtune exposes a number of levers for memory efficiency and performance.