gemma.cpp
gemma.cpp is a lightweight, standalone C++ inference engine for the Gemma foundation models from Google.
For additional information about Gemma, see ai.google.dev/gemma. Model weights, including gemma.cpp specific artifacts, are available on kaggle.
Who is this project for?
Modern LLM inference engines are sophisticated systems, often with bespoke capabilities extending beyond traditional neural network runtimes. With this comes opportunities for research and innovation through co-design of high level algorithms and low-level computation. However, there is a gap between deployment-oriented C++ inference runtimes, which are not designed for experimentation, and Python-centric ML research frameworks, which abstract away low-level computation through compilation.
gemma.cpp provides a minimalist implementation of Gemma-2, Gemma-3, and PaliGemma-2 models, focusing on simplicity and directness rather than full generality. This is inspired by vertically-integrated model implementations such as ggml, llama.c, and llama.rs.
gemma.cpp targets experimentation and research use cases. It is intended to be straightforward to embed in other projects with minimal dependencies and also easily modifiable with a small ~2K LoC core implementation (along with ~4K LoC of supporting utilities). We use the Google Highway Library to take advantage of portable SIMD for CPU inference.
For production-oriented edge deployments we recommend standard deployment pathways using Python frameworks like JAX, Keras, PyTorch, and Transformers (all model variations here).
Contributing
Community contributions large and small are welcome. See DEVELOPERS.md for additional notes for contributing developers, and join the Discord by following this invite link. This project follows Google's Open Source Community Guidelines.
[!NOTE] Active development is currently done on the dev branch. Please open pull requests targeting the dev branch instead of main, which is intended to be more stable.
What's inside?
- LLM
  - CPU-only inference for: Gemma 2-3, PaliGemma 2.
  - Sampling with TopK and temperature.
  - Backward pass (VJP) and Adam optimizer for Gemma research.
- Optimizations
  - Mixed-precision (fp8, bf16, fp32, fp64) GEMM:
    - Designed for BF16 instructions; can efficiently emulate them.
    - Automatic runtime autotuning of 7 parameters per matrix shape.
  - Weight compression integrated directly into GEMM:
    - Custom fp8 format with 2..3 mantissa bits; tensor scaling.
    - Also bf16, f32, and non-uniform 4-bit (NUQ); easy to add new formats.
- Infrastructure
  - SIMD: single implementation via Highway. Chooses ISA at runtime.
  - Tensor parallelism: CCX-aware, multi-socket thread pool.
  - Disk I/O: memory map or parallel read (heuristic with user override).
  - Custom format with forward/backward-compatible metadata serialization.
  - Model conversion from Safetensors, not yet open sourced.
  - Portability: Linux, Windows, OS X supported. CMake/Bazel. 'Any' CPU.
- Frontends
  - C++ APIs with streaming for single query and batched inference.
  - Basic interactive command-line app.
  - Basic Python bindings (pybind11).
Quick Start
System requirements
Before starting, you should have installed:
- CMake
- Clang C++ compiler, supporting at least C++17.
- tar for extracting archives from Kaggle.
Building natively on Windows requires the Visual Studio 2022 Build Tools with the
optional Clang/LLVM C++ frontend (clang-cl). This can be installed from the
command line with
winget:
winget install --id Kitware.CMake
winget install --id Microsoft.VisualStudio.2022.BuildTools --force --override "--passive --wait --add Microsoft.VisualStudio.Workload.VCTools;installRecommended --add Microsoft.VisualStudio.Component.VC.Llvm.Clang --add Microsoft.VisualStudio.Component.VC.Llvm.ClangToolset"
Step 1: Obtain model weights and tokenizer from Kaggle or Hugging Face Hub
Visit the
Kaggle page for Gemma-2
and select Model Variations |> Gemma C++.
On this tab, the Variation dropdown includes the options below. Note bfloat16
weights are higher fidelity, while 8-bit switched floating point weights enable
faster inference. In general, we recommend starting with the -sfp checkpoints.
[!NOTE] Important: We strongly recommend starting off with the
gemma2-2b-it-sfp model to get up and running.
Gemma 2 models are named gemma2-2b-it for 2B, and 9b-it or 27b-it for the
larger sizes. See the ModelPrefix function in configs.cc.
Step 2: Extract Files
After filling out the consent form, the download should proceed to retrieve a
tar archive file archive.tar.gz. Extract files from archive.tar.gz (this can
take a few minutes):
tar -xf archive.tar.gz
This should produce a file containing model weights such as 2b-it-sfp.sbs and
a tokenizer file (tokenizer.spm). You may want to move these files to a
convenient directory location (e.g. the build/ directory in this repo).
Step 3: Build
The build system uses CMake. To build the gemma inference
runtime, create a build directory and generate the build files using cmake
from the top-level project directory. Note that if you previously ran cmake and are
re-running with a different setting, be sure to delete all files in the build/
directory with rm -rf build/*.
Unix-like Platforms
cmake -B build
After running cmake, you can enter the build/ directory and run make to
build the ./gemma executable:
# Configure `build` directory
cmake --preset make
# Build project using make
cmake --build --preset make -j [number of parallel threads to use]
Replace [number of parallel threads to use] with a number - the number of
cores available on your system is a reasonable heuristic. For example, make -j4 gemma will build using 4 threads. If the nproc command is available, you can
use make -j$(nproc) gemma as a reasonable default for the number of threads.
If you aren't sure of the right value for the -j flag, you can simply run
make gemma instead and it should still build the ./gemma executable.
[!NOTE] On Windows Subsystem for Linux (WSL) users should set the number of parallel threads to 1. Using a larger number may result in errors.
If the build is successful, you should now have a gemma executable in the
build/ directory.
Windows
# Configure `build` directory
cmake --preset windows
# Build project using Visual Studio Build Tools
cmake --build --preset windows -j [number of parallel threads to use]
If the build is successful, you should now have a gemma.exe executable in the
build/ directory.
Bazel
bazel build -c opt --cxxopt=-std=c++20 :gemma
If the build is successful, you should now have a gemma executable in the
bazel-bin/ directory.
Make
If you prefer Makefiles, @jart has made one available here:
https://github.com/jart/gemma3/blob/main/Makefile
Step 4: Run
You can now run gemma from inside the build/ directory.
gemma has the following required arguments:
Argument | Description | Example value
------------- | ---------------------------- | ---------------
--weights | The compressed weights file. | 2b-it-sfp.sbs
--tokenizer | The tokenizer file. | tokenizer.spm
Example invocation for the following configuration:
- Weights file gemma2-2b-it-sfp.sbs (Gemma 2 2B instruction-tuned model, 8-bit switched floating point).
- Tokenizer file tokenizer.spm (can omit for single-format weights files created after 2025-05-06, or output by migrate_weights.cc).
./gemma \
--tokenizer tokenizer.spm --weights gemma2-2b-it-sfp.sbs
PaliGemma Vision-Language Model
This repository includes a C++ implementation of the PaliGemma 2 VLM (paper).
To use the version of PaliGemma included in this repository, build the gemma binary as noted above in Step 3. Download the compressed weights and tokenizer from Kaggle and run the binary as follows:
./gemma \
--tokenizer paligemma_tokenizer.model \
--weights paligemma2-3b-mix-224-sfp.sbs \
--image_file paligemma/testdata/image.ppm
Note that the image reading code is very basic to avoid depending on an image
processing library for now. We currently only support reading binary PPMs (P6).
So use a tool like convert to first convert your images into that format, e.g.
convert image.jpeg -resize 224x224^ image.ppm
(As the image will be resized for processing anyway, we can already resize at this stage for slightly faster loading.)
The interaction with the image (using the mix-224 checkpoint) may then look something like this:
> Describe the image briefly
A large building with two towers in the middle of a city.
> What type of building is it?
church
> What color is the church?
gray
> caption image
A large building with two towers stands tall on the water's edge. The building
has a brown roof and a window on the side. A tree stands in front of the
building, and a flag waves proudly from its top. The water is calm and blue,
reflecting