
torchtune


Overview | Installation | Get Started | Documentation | Community | Citing torchtune | License

📣 Recent updates 📣

  • May 2025: torchtune has added support for Qwen3 models! Check out all the configs here
  • April 2025: Llama4 is now available in torchtune! Try out our full and LoRA finetuning configs here
  • February 2025: Multi-node training is officially open for business in torchtune! Full finetune on multiple nodes to take advantage of larger batch sizes and models.
  • December 2024: torchtune now supports Llama 3.3 70B! Try it out by following our installation instructions here, then run any of the configs here.
  • November 2024: torchtune has released v0.4.0 which includes stable support for exciting features like activation offloading and multimodal QLoRA
  • November 2024: torchtune has added Gemma2 to its models!
  • October 2024: torchtune added support for Qwen2.5 models - find the configs here
  • September 2024: torchtune has support for Llama 3.2 11B Vision, Llama 3.2 3B, and Llama 3.2 1B models! Try them out by following our installation instructions here, then run any of the text configs here or vision configs here.

 

Overview 📚

torchtune is a PyTorch library for easily authoring, post-training, and experimenting with LLMs. It provides:

  • Hackable training recipes for SFT, knowledge distillation, DPO, PPO, GRPO, and quantization-aware training
  • Simple PyTorch implementations of popular LLMs like Llama, Gemma, Mistral, Phi, Qwen, and more
  • Best-in-class memory efficiency, performance improvements, and scaling, utilizing the latest PyTorch APIs
  • YAML configs for easily configuring training, evaluation, quantization or inference recipes
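Every recipe is driven by one of these YAML configs. As a rough illustration of their shape (the component paths and keys below are a sketch, not a shipped config — see `tune ls` for the real ones), a single-device finetune config wires together a model, tokenizer, dataset, and optimizer:

```yaml
# Illustrative sketch of a torchtune recipe config.
# Component paths and fields are examples, not an exact shipped config.
model:
  _component_: torchtune.models.llama3_2.llama3_2_3b

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /path/to/tokenizer.model

dataset:
  _component_: torchtune.datasets.alpaca_dataset

optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5

batch_size: 2
epochs: 1
```

Any field can also be overridden from the command line at launch time, e.g. `tune run <recipe> --config <config> batch_size=8`.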

 

Post-training recipes

torchtune supports the entire post-training lifecycle. A successfully post-trained model will likely combine several of the methods below.

Supervised Finetuning (SFT)

| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ✅ | ✅ | ✅ |
| LoRA/QLoRA | ✅ | ✅ | ✅ |

Example: `tune run lora_finetune_single_device --config llama3_2/3B_lora_single_device` <br /> You can also run e.g. `tune ls lora_finetune_single_device` for a full list of available configs.
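Why LoRA/QLoRA fit on a single device while full finetuning strains it comes down to trainable-parameter count: LoRA freezes the base weight and trains only a low-rank update. A back-of-envelope sketch (illustrative layer sizes, not measured torchtune numbers):

```python
# Back-of-envelope: trainable parameters for LoRA vs. full finetuning
# of a single d x k linear layer. Layer size and rank are illustrative.

def lora_params(d: int, k: int, r: int) -> int:
    # LoRA replaces the update of a d x k weight with B @ A,
    # where B is d x r and A is r x k: r * (d + k) trainables.
    return r * (d + k)

d = k = 4096          # e.g. a projection in an 8B-class model
r = 16                # a common LoRA rank

full = d * k
lora = lora_params(d, k, r)
print(full, lora, full // lora)   # 16777216 131072 128
```

A 128x reduction in trainable parameters means correspondingly smaller gradient and optimizer-state buffers, which is most of the single-device memory win.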

Knowledge Distillation (KD)

| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ❌ | ❌ | ❌ |
| LoRA/QLoRA | ✅ | ✅ | ❌ |

Example: `tune run knowledge_distillation_distributed --config qwen2/1.5B_to_0.5B_KD_lora_distributed` <br /> You can also run e.g. `tune ls knowledge_distillation_distributed` for a full list of available configs.
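At its core, knowledge distillation trains the student to match the teacher's softened output distribution. A minimal sketch of the standard forward-KL distillation loss on toy logits (plain Python for illustration — not torchtune's implementation, and the temperature value is just an example):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-softened softmax over a list of logits.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    # Forward KL(teacher || student) on temperature-softened
    # distributions, scaled by T^2 as in the classic KD recipe.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

# Identical logits give zero loss; divergence gives a positive loss.
print(kd_loss([1.0, 2.0], [1.0, 2.0]))   # 0.0
print(kd_loss([2.0, 1.0], [1.0, 2.0]) > 0)   # True
```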

Reinforcement Learning / Reinforcement Learning from Human Feedback (RLHF)

| Method | Type of Weight Update | 1 Device | >1 Device | >1 Node |
|--------|-----------------------|:--------:|:---------:|:-------:|
| DPO | Full | ❌ | ✅ | ❌ |
| | LoRA/QLoRA | ✅ | ✅ | ❌ |
| PPO | Full | ✅ | ❌ | ❌ |
| | LoRA/QLoRA | ❌ | ❌ | ❌ |
| GRPO | Full | 🚧 | ✅ | ✅ |
| | LoRA/QLoRA | ❌ | ❌ | ❌ |

Example: `tune run lora_dpo_single_device --config llama3_1/8B_dpo_single_device` <br /> You can also run e.g. `tune ls full_dpo_distributed` for a full list of available configs.
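To see what DPO optimizes, recall that it applies a logistic loss to the margin between policy and reference log-probabilities of the chosen vs. rejected responses. A scalar sketch (plain Python for illustration; `beta` and the log-prob values are made-up examples, not torchtune internals):

```python
import math

def dpo_loss(policy_chosen, policy_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    # -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))),
    # where inputs are sequence log-probabilities.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss falls below log(2) when the policy prefers the chosen
# response more strongly than the reference does, and rises above
# log(2) when it prefers the rejected one.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # positive margin, small loss
print(dpo_loss(-12.0, -10.0, -11.0, -11.0))  # negative margin, larger loss
```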

Quantization-Aware Training (QAT)

| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ✅ | ✅ | ❌ |
| LoRA/QLoRA | ❌ | ✅ | ❌ |

Example: `tune run qat_distributed --config llama3_1/8B_qat_lora` <br /> You can also run e.g. `tune ls qat_distributed` or `tune ls qat_single_device` for a full list of available configs.
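The idea behind QAT is to insert "fake" quantize-dequantize steps during training so the model adapts to the rounding error it will see at low-precision inference. A minimal symmetric int8 fake-quantization sketch in plain Python (illustrative only — torchtune's QAT recipes use proper quantization kernels, not this):

```python
def fake_quantize(x, num_bits=8):
    # Symmetric per-tensor fake quantization: snap each value to the
    # integer grid, then map back to float so training sees the
    # rounding error while staying in floating point.
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = max(abs(v) for v in x) / qmax
    return [round(v / scale) * scale for v in x]

w = [0.9, -0.42, 0.127, 0.0]
wq = fake_quantize(w)
# Each value lands on a multiple of scale = 0.9 / 127,
# so the per-element error is at most scale / 2.
```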

The above configs are just examples to get you started. The full list of recipes can be found here. If you'd like to work on one of the gaps you see, please submit a PR! If there's an entirely new post-training method you'd like to see implemented in torchtune, feel free to open an Issue.

 

Models

For the above recipes, torchtune supports many state-of-the-art models available on the Hugging Face Hub or Kaggle Hub. Some of our supported models:

| Model | Sizes |
|-------|-------|
| Llama4 | Scout (17B x 16E) [models, configs] |
| Llama3.3 | 70B [models, configs] |
| Llama3.2-Vision | 11B, 90B [models, configs] |
| Llama3.2 | 1B, 3B [models, configs] |
| Llama3.1 | 8B, 70B, 405B [models, configs] |
| Mistral | 7B [models, configs] |
| Gemma2 | 2B, 9B, 27B [models, configs] |
| Microsoft Phi4 | 14B [models, configs] |
| Microsoft Phi3 | Mini [models, configs] |
| Qwen3 | 0.6B, 1.7B, 4B, 8B, 14B, 32B [models, configs] |
| Qwen2.5 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B [models, configs] |
| Qwen2 | 0.5B, 1.5B, 7B [models, configs] |

We're always adding new models, but feel free to file an issue if there's a new one you would like to see in torchtune.

 

Memory and training speed

Below is an example of the memory requirements and training speed for different Llama 3.1 models.

> [!NOTE]
> For ease of comparison, all the below numbers are provided for batch size 2 (without gradient accumulation), a dataset packed to sequence length 2048, and torch compile enabled.

If you are interested in running on different hardware or with different models, check out our documentation on memory optimizations here to find the right setup for you.

| Model | Finetuning Method | Runnable On | Peak Memory per GPU | Tokens/sec * |
|:-:|:-:|:-:|:-:|:-:|
| Llama 3.1 8B | Full finetune | 1x 4090 | 18.9 GiB | 1650 |
| Llama 3.1 8B | Full finetune | 1x A6000 | 37.4 GiB | 2579 |
| Llama 3.1 8B | LoRA | 1x 4090 | 16.2 GiB | 3083 |
| Llama 3.1 8B | LoRA | 1x A6000 | 30.3 GiB | 4699 |
| Llama 3.1 8B | QLoRA | 1x 4090 | 7.4 GiB | 2413 |
| Llama 3.1 70B | Full finetune | 8x A100 | 13.9 GiB ** | 1568 |
| Llama 3.1 70B | LoRA | 8x A100 | 27.6 GiB | 3497 |
| Llama 3.1 405B | QLoRA | 8x A100 | 44.8 GB | 653 |

*= Measured over one full training epoch <br /> **= Uses CPU offload with fused optimizer
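To see why these peak-memory numbers depend on such optimizations, consider a naive full finetune of an 8B model with AdamW. A hedged back-of-envelope (rough arithmetic only; real peak memory also includes activations and fragmentation):

```python
# Rough memory budget for *naively* full-finetuning an 8B-parameter
# model with AdamW, before activations. Illustrative arithmetic only;
# techniques like CPU offload with a fused optimizer and activation
# offloading are what bring the measured peaks above so much lower.

params = 8e9
gib = 1024 ** 3

weights_bf16 = params * 2          # bf16 weights: 2 bytes/param
grads_bf16 = params * 2            # bf16 gradients: 2 bytes/param
adamw_states = params * 4 * 2      # fp32 first + second moments

total = (weights_bf16 + grads_bf16 + adamw_states) / gib
print(f"{total:.1f} GiB before activations")   # ~89.4 GiB
```

That naive ~89 GiB is far beyond a single 24 GB card, which is why the 18.9 GiB full-finetune figure on a 4090 requires the memory optimizations linked above.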

 

Optimization flags

torchtune exposes a number of levers for memory efficiency and performance, including activation offloading, CPU offload with a fused optimizer, and torch compile; see the memory optimization documentation linked above for the full set and how to enable each one.
