Shimmy
⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.
The Lightweight OpenAI API Server
🔒 Local Inference Without Dependencies 🚀
Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.
💝 Support Shimmy's Growth
🚀 If Shimmy helps you, consider sponsoring — 100% of support goes to keeping it free forever.
- $5/month: Coffee tier ☕ - Eternal gratitude + sponsor badge
- $25/month: Bug prioritizer 🐛 - Priority support + name in SPONSORS.md
- $100/month: Corporate backer 🏢 - Logo placement + monthly office hours
- $500/month: Infrastructure partner 🚀 - Direct support + roadmap input
🎯 Become a Sponsor | See our amazing sponsors 🙏
Drop-in OpenAI API Replacement for Local LLMs
Shimmy is a single binary that provides OpenAI-compatible endpoints for GGUF models. Point your existing AI tools at Shimmy and they just work: locally, privately, and free.
🎉 NEW in v1.9.0: One download, all GPU backends included! No compilation, no backend confusion - just download and run.
Developer Tools
Whether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.
Try it in 30 seconds
# 1) Download pre-built binary (includes all GPU backends)
# Windows:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve &
# Linux:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
./shimmy serve &
# macOS (Apple Silicon):
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
./shimmy serve &
# 2) See models and pick one
./shimmy list
# 3) Smoke test the OpenAI API
curl -s http://127.0.0.1:11435/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"REPLACE_WITH_MODEL_FROM_list",
"messages":[{"role":"user","content":"Say hi in 5 words."}],
"max_tokens":32
}' | jq -r '.choices[0].message.content'
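For machines without curl or jq, the same smoke test can be sketched with only the Python standard library. The helper names `build_chat_request`, `extract_reply`, and `smoke_test` are hypothetical; the address, port, and payload fields are copied from the curl command above, and the model name is still a placeholder to replace with a name printed by `./shimmy list`.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:11435/v1"  # address used in the curl example


def build_chat_request(model, prompt, max_tokens=32):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a chat-completions JSON body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def smoke_test(model, prompt="Say hi in 5 words."):
    """Send one request to a running `shimmy serve` and return the reply."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return extract_reply(resp.read())
```

With the server running, `print(smoke_test("REPLACE_WITH_MODEL_FROM_list"))` plays the role of the curl and jq pipeline. Building the request is kept separate from sending it so the payload can be inspected without a server.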
🚀 Compatible with OpenAI SDKs and Tools
No code changes needed - just change the API endpoint:
- Any OpenAI client: Python, Node.js, curl, etc.
- Development applications: Compatible with standard SDKs
- VSCode Extensions: Point to http://localhost:11435
- Cursor Editor: Built-in OpenAI compatibility
- Continue.dev: Drop-in model provider
Use with OpenAI SDKs
- Node.js (openai v4)
import OpenAI from "openai";
const openai = new OpenAI({
baseURL: "http://127.0.0.1:11435/v1",
apiKey: "sk-local", // placeholder, Shimmy ignores it
});
const resp = await openai.chat.completions.create({
model: "REPLACE_WITH_MODEL",
messages: [{ role: "user", content: "Say hi in 5 words." }],
max_tokens: 32,
});
console.log(resp.choices[0].message?.content);
- Python (openai>=1.0.0)
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")
resp = client.chat.completions.create(
model="REPLACE_WITH_MODEL",
messages=[{"role": "user", "content": "Say hi in 5 words."}],
max_tokens=32,
)
print(resp.choices[0].message.content)
⚡ Zero Configuration Required
- Automatically finds models from Hugging Face cache, Ollama, local dirs
- Auto-allocates ports to avoid conflicts
- Auto-detects LoRA adapters for specialized models
- Just works - no config files, no setup wizards
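To make the ideas of model auto-discovery and port auto-allocation concrete, here is a minimal Python sketch. It is not Shimmy's implementation: Shimmy is written in Rust, and its actual search paths and port logic are not described here. The function names `find_gguf_models` and `pick_free_port` are hypothetical, and the directories to scan are supplied by the caller rather than assumed.

```python
import socket
from pathlib import Path


def find_gguf_models(roots):
    """Return sorted paths of *.gguf files found under the given directories."""
    found = []
    for root in roots:
        root = Path(root).expanduser()
        if root.is_dir():  # silently skip directories that do not exist
            found.extend(p for p in root.rglob("*.gguf") if p.is_file())
    return sorted(found)


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

Binding to port 0 lets the operating system choose a port, which avoids hard-coded conflicts. Note that the port is released when the socket closes, so another process could claim it before a server binds to it; a real server would normally keep the bound socket open instead.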
🧠 Advanced MOE (Mixture of Experts) Support
Run 70B+ models on consumer hardware with intelligent CPU/GPU hybrid processing:
- 🔄 CPU MOE Offloading: Automatically distribute model layers across CPU and GPU
- 🧮 Intelligent Layer Placement: Optimizes which layers run where for maximum performance
- 💾 Memory Efficiency: Fit larger models in limited VRAM by using system RAM strategically
- ⚡ Hybrid Acceleration: Get GPU speed where it matters most, CPU reliability everywhere else
- 🎛️ Configurable: --cpu-moe and --n-cpu-moe flags for fine control
# Enable MOE CPU offloading during installation
cargo install shimmy --features moe
# Run with MOE hybrid processing
shimmy serve --cpu-moe --n-cpu-moe 8
# Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)
Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference
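The following Python sketch illustrates one plausible reading of the two flags. It is an assumption, not Shimmy's documented behavior: it borrows the meaning these flag names have in llama.cpp, where --cpu-moe keeps the expert weights of every layer on the CPU and --n-cpu-moe N keeps only those of the first N layers there. The function name `plan_expert_placement` is hypothetical.

```python
def plan_expert_placement(n_layers, n_cpu_moe=0, cpu_moe=False):
    """Return 'cpu' or 'gpu' for each layer's expert weights.

    Assumption (llama.cpp semantics, not confirmed for Shimmy):
    cpu_moe keeps every layer's experts on the CPU; n_cpu_moe keeps
    only the first N layers' experts there.
    """
    if n_layers < 0 or n_cpu_moe < 0:
        raise ValueError("layer counts must be non-negative")
    on_cpu = n_layers if cpu_moe else min(n_cpu_moe, n_layers)
    return ["cpu"] * on_cpu + ["gpu"] * (n_layers - on_cpu)
```

Under this reading, raising the count moves more expert weights into system RAM, which frees VRAM at the cost of speed for those layers.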
🎯 Perfect for Local Development
- Privacy: Your code never leaves your machine
- Cost: No API keys, no per-token billing
- Speed: Local inference, sub-second responses
- Reliability: No rate limits, no downtime
Quick Start (30 seconds)
Installation
✨ v1.9.0 NEW: Download pre-built binaries with ALL GPU backends included!
📥 Pre-Built Binaries (Recommended - Zero Dependencies)
Pick your platform and download - no compilation needed:
# Windows x64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
# Linux x86_64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
# macOS ARM64 (includes MLX for Apple Silicon)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
# macOS Intel (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel -o shimmy && chmod +x shimmy
# Linux ARM64 (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64 -o shimmy && chmod +x shimmy
That's it! Your GPU will be detected automatically at runtime.
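If you script the download, the choice of asset can be derived from the host platform. The sketch below maps Python's `platform.system()` and `platform.machine()` values to the asset names in the commands above. The helper names `asset_name` and `download_url` are hypothetical, and the mapping covers only the five binaries listed here.

```python
import platform

RELEASE_BASE = (
    "https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download"
)

# (system, machine) -> release asset, taken from the download commands above
ASSETS = {
    ("Windows", "x86_64"): "shimmy-windows-x86_64.exe",
    ("Linux", "x86_64"): "shimmy-linux-x86_64",
    ("Linux", "aarch64"): "shimmy-linux-aarch64",
    ("Darwin", "arm64"): "shimmy-macos-arm64",
    ("Darwin", "x86_64"): "shimmy-macos-intel",
}


def asset_name(system=None, machine=None):
    """Return the release asset for a platform (defaults to this machine)."""
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    if machine == "amd64":  # Windows reports x86_64 as AMD64
        machine = "x86_64"
    try:
        return ASSETS[(system, machine)]
    except KeyError:
        raise ValueError(
            f"no pre-built binary listed for {system}/{machine}"
        ) from None


def download_url(system=None, machine=None):
    """Return the full download URL for the matching release asset."""
    return f"{RELEASE_BASE}/{asset_name(system, machine)}"
```

Calling `download_url()` with no arguments resolves the URL for the current machine; unsupported platforms raise `ValueError` rather than guessing, in which case building from source is the remaining option.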
🛠️ Build from Source (Advanced)
Want to customize or contribute?
# Basic installation (CPU only)
cargo install shimmy --features huggingface
# Kitchen Sink builds (what pre-built binaries use):
# Windows/Linux x64:
cargo install shimmy --features huggingface,llama,llama-cuda,llama-vulkan,llama-opencl,vision
# macOS ARM64:
cargo install shimmy --features huggingface,llama,mlx,vision
# CPU-only (any platform):
cargo install shimmy --features huggingface,llama,vision
⚠️ Build Notes:
- Windows: Install LLVM first for libclang.dll
- Recommended: Use pre-built binaries to avoid dependency issues
- Advanced users only: Building from source requires C++ compiler + CUDA/Vulkan SDKs
GPU Acceleration
✨ NEW in v1.9.0: One binary per platform with automatic GPU detection!
⚠️ IMPORTANT - Vision Feature Performance:
CPU-based vision inference (MiniCPM-V) is 5-10x slower than GPU acceleration.
CPU: 15-45 seconds per image | GPU (CUDA/Vulkan): 2-8 seconds per image
For production vision workloads, GPU acceleration is strongly recommended.
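To see what these latencies mean for a batch job, the per-image figures above can be converted to hourly throughput. This is plain arithmetic on the quoted ranges, assuming images are processed one at a time with no batching or parallelism; the names `LATENCY`, `images_per_hour`, and `throughput_ranges` are illustrative.

```python
# Seconds per image (fastest, slowest), as quoted above
LATENCY = {
    "CPU": (15, 45),
    "GPU (CUDA/Vulkan)": (2, 8),
}


def images_per_hour(seconds_per_image):
    """Convert a per-image latency into sequential hourly throughput."""
    return 3600 / seconds_per_image


def throughput_ranges(latency=LATENCY):
    """Map each backend to (slowest, fastest) images per hour."""
    return {
        name: (images_per_hour(slow), images_per_hour(fast))
        for name, (fast, slow) in latency.items()
    }
```

Under that assumption, the quoted figures work out to 80 to 240 images per hour on CPU and 450 to 1800 images per hour on GPU.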
📥 Download Pre-Built Binaries (Recommended)
No compilation needed! Each binary includes ALL GPU backends for your platform:
| Platform | Download | GPU Support | Auto-Detects |
|----------|----------|-------------|--------------|
| Windows x64 | shimmy-windows-x86_64.exe | CUDA + Vulkan + OpenCL | ✅ |
| Linux x86_64 | shimmy-linux-x86_64 | CUDA + Vulkan + OpenCL | ✅ |
| macOS ARM64 | shimmy-macos-arm64 | MLX (Apple Silicon) | ✅ |
| macOS Intel | shimmy-macos-intel | CPU only | N/A |
| Linux ARM64 | shimmy-linux-aarch64 | CPU only | N/A |
How it works: Download one file, run it. Shimmy automatically detects and uses your GPU!
# Windows example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve --gpu-backend auto # Auto-detects CUDA/Vulkan/OpenCL
# Linux example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy
chmod +x shimmy
./shimmy serve --gpu-backend auto # Auto-detects CUDA/Vulkan/OpenCL
# macOS ARM64 example
curl -L https:/
