# llama.cpp

A fork of llama.cpp for improving Metal support.

LLM inference in C/C++
## Recent API changes

## Hot topics
- guide : using the new WebUI of llama.cpp
- guide : running gpt-oss with llama.cpp
- [FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗
- Support for the gpt-oss model with native MXFP4 format has been added | PR | Collaboration with NVIDIA | Comment
- Multimodal support arrived in llama-server: #12898 | documentation
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
- Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggml-org/llama.cpp/discussions/9669
- Hugging Face GGUF editor: discussion | tool
## Quick start
Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:
- Install llama.cpp using brew, nix or winget
- Run with Docker - see our Docker documentation
- Download pre-built binaries from the releases page
- Build from source by cloning this repository - check out our build guide
Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more.
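As a quick sketch of the build-from-source route (see the build guide for platform-specific options and backend flags):

```sh
# Clone the repository and build with CMake (Release configuration).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Binaries such as llama-cli and llama-server end up in build/bin/.
```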
Example commands:

```sh
# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch an OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```
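Once `llama-server` is running, it exposes an OpenAI-compatible HTTP API, by default on port 8080. A minimal sketch, assuming a server started as above is up locally (the prompt is just an example):

```sh
# Query a locally running llama-server through its OpenAI-compatible
# chat completions endpoint (default listen address: 127.0.0.1:8080).
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ]
  }'
```

Because the API follows the OpenAI schema, existing OpenAI client libraries can usually be pointed at this endpoint by changing only the base URL.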
## Description
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
range of hardware - locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
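As a sketch of hybrid inference: the `-ngl` (`--n-gpu-layers`) flag controls how many model layers are offloaded to the GPU, with the remainder evaluated on the CPU. The model file below is a placeholder:

```sh
# Offload 20 layers of a model to the GPU and keep the rest on the CPU,
# allowing models larger than available VRAM to run.
llama-cli -m my_model.gguf -ngl 20 -p "Hello"

# A large -ngl value offloads all layers that the model has.
llama-cli -m my_model.gguf -ngl 99 -p "Hello"
```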
The llama.cpp project is the main playground for developing new features for the ggml library.
Typically finetunes of the base models below are supported as well.
Instructions for adding support for new models: HOWTO-add-model.md
### Text-only
- [X] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [x] LLaMA 3 🦙🦙🦙
- [X] Mistral 7B
- [x] Mixtral MoE
- [x] DBRX
- [x] Jamba
- [X] Falcon
- [X] Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
- [X] Vigogne (French)
- [X] BERT
- [X] Koala
- [X] Baichuan 1 & 2 + derivations
- [X] Aquila 1 & 2
- [X] Starcoder models
- [X] Refact
- [X] MPT
- [X] Bloom
- [x] Yi models
- [X] StableLM models
- [x] Deepseek models
- [x] Qwen models
- [x] PLaMo-13B
- [x] Phi models
- [x] PhiMoE
- [x] GPT-2
- [x] Orion 14B
- [x] InternLM2
- [x] CodeShell
- [x] Gemma
- [x] Mamba
- [x] Grok-1
- [x] Xverse
- [x] Command-R models
- [x] SEA-LION
- [x] GritLM-7B + GritLM-8x7B
- [x] OLMo
- [x] OLMo 2
- [x] OLMoE
- [x] Granite models
- [x] GPT-NeoX + Pythia
- [x] Snowflake-Arctic MoE
- [x] Smaug
- [x] Poro 34B
- [x] Bitnet b1.58 models
- [x] Flan T5
- [x] OpenELM models
- [x] ChatGLM3-6b + ChatGLM4-9b + GLMEdge-1.5b + GLMEdge-4b
- [x] GLM-4-0414
- [x] SmolLM
- [x] EXAONE-3.0-7.8B-Instruct
- [x] FalconMamba Models
- [x] Jais
- [x] Bielik-11B-v2.3
- [x] RWKV-6
- [x] QRWKV-6
- [x] GigaChat-20B-A3B
- [X] Trillion-7B-preview
- [x] Ling models
- [x] LFM2 models
- [x] Hunyuan models
- [x] BailingMoeV2 (Ring/Ling 2.0) models
### Multimodal
- [x] LLaVA 1.5 models, LLaVA 1.6 models
- [x] BakLLaVA
- [x] Obsidian
- [x] ShareGPT4V
- [x] MobileVLM 1.7B/3B models
- [x] Yi-VL
- [x] Mini CPM
- [x] Moondream
- [x] Bunny
- [x] GLM-EDGE
- [x] Qwen2-VL
- [x] LFM2-VL