# OpenSloth 🦥⚡

Scale Unsloth to multiple GPUs with just `torchrun`. No configuration files, no custom frameworks, just pure PyTorch DDP.
- 🚀 2-4x faster than single GPU
- 🎯 Zero configuration - works out of the box
- 💾 Same VRAM per GPU as single GPU Unsloth
- 🔧 Any Unsloth model - Qwen, Llama, Gemma, etc.
## Installation

```bash
# Install dependencies
uv add torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
uv add unsloth datasets transformers trl
uv add git+https://github.com/anhvth/opensloth.git
```
## Quick Start

Replace `python` with `torchrun`:

```bash
# Single GPU
python train_scripts/train_ddp.py

# Multi-GPU
torchrun --nproc_per_node=2 train_scripts/train_ddp.py  # 2 GPUs
torchrun --nproc_per_node=4 train_scripts/train_ddp.py  # 4 GPUs
```
OpenSloth automatically handles GPU distribution, gradient sync, and batch sizing.
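Under the hood, `torchrun` launches one process per GPU and sets environment variables that each process reads to find its place in the group. A minimal sketch of what each worker sees (plain Python, no GPU required; the helper name is ours, not part of OpenSloth):

```python
import os

def describe_worker() -> str:
    """Report this process's position in a torchrun-launched group.

    torchrun sets LOCAL_RANK (GPU index on this machine), RANK (global
    process index), and WORLD_SIZE (total number of processes). A plain
    `python` run has none of these, so we default to a single worker.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return f"rank {rank}/{world_size} on GPU {local_rank}"

if __name__ == "__main__":
    print(describe_worker())
```

Running this file under `torchrun --nproc_per_node=2` prints one line per process, each with a distinct rank.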
## Performance

| Setup  | Time    | Speedup |
|--------|---------|---------|
| 1 GPU  | 19m 34s | 1.0x    |
| 2 GPUs | 8m 28s  | 2.3x    |

Expected scaling: 2 GPUs = ~2.3x, 4 GPUs = ~4.5x, 8 GPUs = ~9x
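Part of why throughput scales like this is that the effective (global) batch size grows with the number of processes while per-GPU memory stays flat: each DDP worker processes its own micro-batches and only gradients are synchronized. A back-of-the-envelope sketch (the batch numbers below are illustrative, not OpenSloth defaults):

```python
def global_batch_size(per_device_batch: int, grad_accum: int, world_size: int) -> int:
    """Effective samples per optimizer step under DDP: each of the
    world_size processes contributes per_device_batch * grad_accum."""
    return per_device_batch * grad_accum * world_size

# Illustrative numbers only: batch 2, gradient accumulation 4.
for gpus in (1, 2, 4):
    print(f"{gpus} GPU(s): global batch = {global_batch_size(2, 4, gpus)}")
```

With these numbers, 4 GPUs step through 32 samples per optimizer update versus 8 on a single GPU, at the same per-GPU VRAM cost.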
## Usage

```python
import os

from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from opensloth.patching.ddp_patch import ddp_patch

ddp_patch()  # Enable DDP compatibility

# Standard Unsloth setup: pin each process to its own GPU
local_rank = int(os.environ.get("LOCAL_RANK", 0))
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-0.6B",
    device_map={"": local_rank},
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16)

trainer = SFTTrainer(model=model, tokenizer=tokenizer, ...)
trainer.train()
```

Run: `torchrun --nproc_per_node=4 your_script.py`
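Keep in mind that under `torchrun` every process executes the whole script, so side effects like printing summaries or writing files are conventionally guarded to rank 0. A minimal sketch of that pattern (the helper is ours, not part of OpenSloth's API):

```python
import os

def is_main_process() -> bool:
    """True only for the rank-0 worker (or in a plain single-process run,
    where torchrun's RANK variable is absent)."""
    return int(os.environ.get("RANK", 0)) == 0

# Example: only rank 0 should log or save, to avoid duplicate output
# and concurrent writes to the same checkpoint path.
if is_main_process():
    print("final metrics / checkpoint saving would go here")
```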
## Migration from the Old Approach

Current (recommended): simple `torchrun` + DDP patch:

```python
from opensloth.patching.ddp_patch import ddp_patch

ddp_patch()
# ... standard Unsloth code
```

Old approach (v0.1.8): if you need the configuration-file workflow, check out the tagged release:

```bash
git clone https://github.com/anhvth/opensloth.git
cd opensloth
git checkout v0.1.8
```
## Links

- Unsloth - 2x faster training library
- TRL - Transformer Reinforcement Learning
- PyTorch DDP - Distributed training

To run the bundled example script:

```bash
git clone https://github.com/anhvth/opensloth.git
cd opensloth
torchrun --nproc_per_node=4 train_scripts/train_ddp.py
```
Happy training! 🦥⚡
