
torchtune


Overview | Installation | Get Started | Documentation | Community | Citing torchtune | License

📣 Recent updates 📣

  • May 2025: torchtune has added support for Qwen3 models! Check out all the configs here
  • April 2025: Llama4 is now available in torchtune! Try out our full and LoRA finetuning configs here
  • February 2025: Multi-node training is officially open for business in torchtune! Full finetune on multiple nodes to take advantage of larger batch sizes and models.
  • December 2024: torchtune now supports Llama 3.3 70B! Try it out by following our installation instructions here, then run any of the configs here.
  • November 2024: torchtune has released v0.4.0 which includes stable support for exciting features like activation offloading and multimodal QLoRA
  • November 2024: torchtune has added Gemma2 to its models!
  • October 2024: torchtune added support for Qwen2.5 models - find the configs here
  • September 2024: torchtune has support for Llama 3.2 11B Vision, Llama 3.2 3B, and Llama 3.2 1B models! Try them out by following our installation instructions here, then run any of the text configs here or vision configs here.

 

Overview 📚

torchtune is a PyTorch library for easily authoring, post-training, and experimenting with LLMs. It provides:

  • Hackable training recipes for SFT, knowledge distillation, DPO, PPO, GRPO, and quantization-aware training
  • Simple PyTorch implementations of popular LLMs like Llama, Gemma, Mistral, Phi, Qwen, and more
  • Best-in-class memory efficiency, performance improvements, and scaling, utilizing the latest PyTorch APIs
  • YAML configs for easily configuring training, evaluation, quantization or inference recipes
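Every recipe is driven by one of these YAML configs. As a rough illustration of their shape (the component paths and keys below are a sketch, not a shipped config — see `tune ls` for the real ones), a single-device finetune config wires together a model, tokenizer, dataset, and optimizer:

```yaml
# Illustrative sketch of a torchtune recipe config.
# Component paths and fields are examples, not an exact shipped config.
model:
  _component_: torchtune.models.llama3_2.llama3_2_3b

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /path/to/tokenizer.model

dataset:
  _component_: torchtune.datasets.alpaca_dataset

optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5

batch_size: 2
epochs: 1
```

Any field can also be overridden from the command line at launch time, e.g. `tune run <recipe> --config <config> batch_size=8`.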

 

Post-training recipes

torchtune supports the entire post-training lifecycle. A successfully post-trained model will likely combine several of the methods below.

Supervised Finetuning (SFT)

| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ✅ | ✅ | ✅ |
| LoRA/QLoRA | ✅ | ✅ | ✅ |

Example: `tune run lora_finetune_single_device --config llama3_2/3B_lora_single_device` <br /> You can also run e.g. `tune ls lora_finetune_single_device` for a full list of available configs.
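Why LoRA/QLoRA fit on a single device while full finetuning strains it comes down to trainable-parameter count: LoRA freezes the base weight and trains only a low-rank update. A back-of-envelope sketch (illustrative layer sizes, not measured torchtune numbers):

```python
# Back-of-envelope: trainable parameters for LoRA vs. full finetuning
# of a single d x k linear layer. Layer size and rank are illustrative.

def lora_params(d: int, k: int, r: int) -> int:
    # LoRA replaces the update of a d x k weight with B @ A,
    # where B is d x r and A is r x k: r * (d + k) trainables.
    return r * (d + k)

d = k = 4096          # e.g. a projection in an 8B-class model
r = 16                # a common LoRA rank

full = d * k
lora = lora_params(d, k, r)
print(full, lora, full // lora)   # 16777216 131072 128
```

A 128x reduction in trainable parameters means correspondingly smaller gradient and optimizer-state buffers, which is most of the single-device memory win.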

Knowledge Distillation (KD)

| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ❌ | ❌ | ❌ |
| LoRA/QLoRA | ✅ | ✅ | ❌ |

Example: `tune run knowledge_distillation_distributed --config qwen2/1.5B_to_0.5B_KD_lora_distributed` <br /> You can also run e.g. `tune ls knowledge_distillation_distributed` for a full list of available configs.
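At its core, knowledge distillation trains the student to match the teacher's softened output distribution. A minimal sketch of the standard forward-KL distillation loss on toy logits (plain Python for illustration — not torchtune's implementation, and the temperature value is just an example):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-softened softmax over a list of logits.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    # Forward KL(teacher || student) on temperature-softened
    # distributions, scaled by T^2 as in the classic KD recipe.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

# Identical logits give zero loss; divergence gives a positive loss.
print(kd_loss([1.0, 2.0], [1.0, 2.0]))   # 0.0
print(kd_loss([2.0, 1.0], [1.0, 2.0]) > 0)   # True
```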

Reinforcement Learning / Reinforcement Learning from Human Feedback (RLHF)

| Method | Type of Weight Update | 1 Device | >1 Device | >1 Node |
|--------|-----------------------|:--------:|:---------:|:-------:|
| DPO | Full | ❌ | ✅ | ❌ |
| | LoRA/QLoRA | ✅ | ✅ | ❌ |
| PPO | Full | ✅ | ❌ | ❌ |
| | LoRA/QLoRA | ❌ | ❌ | ❌ |
| GRPO | Full | 🚧 | ✅ | ✅ |
| | LoRA/QLoRA | ❌ | ❌ | ❌ |

Example: `tune run lora_dpo_single_device --config llama3_1/8B_dpo_single_device` <br /> You can also run e.g. `tune ls full_dpo_distributed` for a full list of available configs.
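To see what DPO optimizes, recall that it applies a logistic loss to the margin between policy and reference log-probabilities of the chosen vs. rejected responses. A scalar sketch (plain Python for illustration; `beta` and the log-prob values are made-up examples, not torchtune internals):

```python
import math

def dpo_loss(policy_chosen, policy_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    # -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))),
    # where inputs are sequence log-probabilities.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss falls below log(2) when the policy prefers the chosen
# response more strongly than the reference does, and rises above
# log(2) when it prefers the rejected one.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # positive margin, small loss
print(dpo_loss(-12.0, -10.0, -11.0, -11.0))  # negative margin, larger loss
```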

Quantization-Aware Training (QAT)

| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ✅ | ✅ | ❌ |
| LoRA/QLoRA | ❌ | ✅ | ❌ |

Example: `tune run qat_distributed --config llama3_1/8B_qat_lora` <br /> You can also run e.g. `tune ls qat_distributed` or `tune ls qat_single_device` for a full list of available configs.
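The idea behind QAT is to insert "fake" quantize-dequantize steps during training so the model adapts to the rounding error it will see at low-precision inference. A minimal symmetric int8 fake-quantization sketch in plain Python (illustrative only — torchtune's QAT recipes use proper quantization kernels, not this):

```python
def fake_quantize(x, num_bits=8):
    # Symmetric per-tensor fake quantization: snap each value to the
    # integer grid, then map back to float so training sees the
    # rounding error while staying in floating point.
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = max(abs(v) for v in x) / qmax
    return [round(v / scale) * scale for v in x]

w = [0.9, -0.42, 0.127, 0.0]
wq = fake_quantize(w)
# Each value lands on a multiple of scale = 0.9 / 127,
# so the per-element error is at most scale / 2.
```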

The above configs are just examples to get you started. The full list of recipes can be found here. If you'd like to work on one of the gaps you see, please submit a PR! If there's an entirely new post-training method you'd like to see implemented in torchtune, feel free to open an Issue.

 

Models

For the above recipes, torchtune supports many state-of-the-art models available on the Hugging Face Hub or Kaggle Hub. Some of our supported models:

| Model | Sizes |
|-------|-------|
| Llama4 | Scout (17B x 16E) [models, configs] |
| Llama3.3 | 70B [models, configs] |
| Llama3.2-Vision | 11B, 90B [models, configs] |
| Llama3.2 | 1B, 3B [models, configs] |
| Llama3.1 | 8B, 70B, 405B [models, configs] |
| Mistral | 7B [models, configs] |
| Gemma2 | 2B, 9B, 27B [models, configs] |
| Microsoft Phi4 | 14B [models, configs] |
| Microsoft Phi3 | Mini [models, configs] |
| Qwen3 | 0.6B, 1.7B, 4B, 8B, 14B, 32B [models, configs] |
| Qwen2.5 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B [models, configs] |
| Qwen2 | 0.5B, 1.5B, 7B [models, configs] |

We're always adding new models, but feel free to file an issue if there's a new one you would like to see in torchtune.

 

Memory and training speed

Below is an example of the memory requirements and training speed for different Llama 3.1 models.

> [!NOTE]
> For ease of comparison, all the below numbers are provided for batch size 2 (without gradient accumulation), a dataset packed to sequence length 2048, and torch compile enabled.

If you are interested in running on different hardware or with different models, check out our documentation on memory optimizations here to find the right setup for you.

| Model | Finetuning Method | Runnable On | Peak Memory per GPU | Tokens/sec * |
|:-:|:-:|:-:|:-:|:-:|
| Llama 3.1 8B | Full finetune | 1x 4090 | 18.9 GiB | 1650 |
| Llama 3.1 8B | Full finetune | 1x A6000 | 37.4 GiB | 2579 |
| Llama 3.1 8B | LoRA | 1x 4090 | 16.2 GiB | 3083 |
| Llama 3.1 8B | LoRA | 1x A6000 | 30.3 GiB | 4699 |
| Llama 3.1 8B | QLoRA | 1x 4090 | 7.4 GiB | 2413 |
| Llama 3.1 70B | Full finetune | 8x A100 | 13.9 GiB ** | 1568 |
| Llama 3.1 70B | LoRA | 8x A100 | 27.6 GiB | 3497 |
| Llama 3.1 405B | QLoRA | 8x A100 | 44.8 GB | 653 |

*= Measured over one full training epoch <br /> **= Uses CPU offload with fused optimizer
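To see why these peak-memory numbers depend on such optimizations, consider a naive full finetune of an 8B model with AdamW. A hedged back-of-envelope (rough arithmetic only; real peak memory also includes activations and fragmentation):

```python
# Rough memory budget for *naively* full-finetuning an 8B-parameter
# model with AdamW, before activations. Illustrative arithmetic only;
# techniques like CPU offload with a fused optimizer and activation
# offloading are what bring the measured peaks above so much lower.

params = 8e9
gib = 1024 ** 3

weights_bf16 = params * 2          # bf16 weights: 2 bytes/param
grads_bf16 = params * 2            # bf16 gradients: 2 bytes/param
adamw_states = params * 4 * 2      # fp32 first + second moments

total = (weights_bf16 + grads_bf16 + adamw_states) / gib
print(f"{total:.1f} GiB before activations")   # ~89.4 GiB
```

That naive ~89 GiB is far beyond a single 24 GB card, which is why the 18.9 GiB full-finetune figure on a 4090 requires the memory optimizations linked above.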

 

Optimization flags

torchtune exposes a number of levers for memory efficiency and performance, including activation offloading, CPU offload with a fused optimizer, and torch compile; see the memory optimization documentation linked above for the full set and how to enable each one.
