BitNet
Official inference framework for 1-bit LLMs
bitnet.cpp
<img src="./assets/header_model_release.png" alt="BitNet Model on Hugging Face" width="800"/>
Try it out via this demo, or build and run it on your own CPU or GPU.
bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support is coming next).
The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models seeing greater performance gains. It also reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x, with energy reductions between 71.9% and 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the technical report for more details.
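For context on what "1.58-bit" means: weights are ternary ({-1, 0, +1}, i.e. log2(3) ≈ 1.58 bits each) with a single full-precision scale per tensor. The snippet below is a minimal NumPy sketch of the absmean ternary quantization described in the BitNet b1.58 paper; the helper name and epsilon are illustrative, not part of the bitnet.cpp API.

```python
import numpy as np

def absmean_ternary_quant(w: np.ndarray, eps: float = 1e-5):
    """Quantize weights to {-1, 0, +1} plus one per-tensor scale,
    following the absmean scheme from the BitNet b1.58 paper.
    (Helper name and eps value are illustrative, not bitnet.cpp API.)"""
    scale = np.abs(w).mean() + eps               # gamma = mean absolute weight
    w_q = np.clip(np.round(w / scale), -1, 1)    # RoundClip to ternary values
    return w_q.astype(np.int8), scale            # dequantize as w_q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
w_q, scale = absmean_ternary_quant(w)
assert set(np.unique(w_q).tolist()) <= {-1, 0, 1}  # ~1.58 bits per weight
```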
The latest optimization introduces parallel kernel implementations with configurable tiling and embedding quantization support, delivering an additional 1.15x to 2.1x speedup over the original implementation across different hardware platforms and workloads. For detailed technical information, see the optimization guide.
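As a rough illustration of the trade-off behind embedding quantization (storing the embedding table in f16 rather than f32), the plain-NumPy sketch below halves the table's memory footprint and measures the worst-case round-trip error; this is a conceptual demo, not the bitnet.cpp implementation.

```python
import numpy as np

# Toy embedding table; real tables are (vocab_size, hidden_dim).
rng = np.random.default_rng(0)
emb = rng.normal(scale=0.02, size=(1000, 256)).astype(np.float32)

emb_f16 = emb.astype(np.float16)           # half the storage of f32
round_trip = emb_f16.astype(np.float32)
max_err = np.abs(emb - round_trip).max()   # worst-case rounding error

assert emb_f16.nbytes == emb.nbytes // 2   # memory is exactly halved
```

With small-magnitude embedding values, the f16 rounding error stays tiny relative to the values themselves, which is why this is typically a cheap win.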
<img src="./assets/performance.png" alt="performance_comparison" width="800"/>

Demo
A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:
https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1
What's New:
- 01/15/2026 BitNet CPU Inference Optimization
- 05/20/2025 BitNet Official GPU inference kernel
- 04/14/2025 BitNet Official 2B Parameter Model on Hugging Face
- 02/18/2025 Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
- 11/08/2024 BitNet a4.8: 4-bit Activations for 1-bit LLMs
- 10/21/2024 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
- 10/17/2024 bitnet.cpp 1.0 released.
- 03/21/2024 The Era of 1-bit LLMs: Training Tips, Code, FAQ
- 02/27/2024 The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- 10/17/2023 BitNet: Scaling 1-bit Transformers for Large Language Models
Acknowledgements
This project is based on the llama.cpp framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in T-MAC. For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC.
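To give a flavor of the lookup-table idea behind these kernels: instead of multiplying each ternary weight by an activation, the partial dot products for every possible weight pattern within a small group are precomputed once per activation group, and each output row then just indexes that table, replacing multiply-adds with lookups. The following is a toy Python sketch of the concept, not the actual T-MAC or bitnet.cpp kernel.

```python
import itertools
import numpy as np

def lut_matvec_ternary(w_q, x, g=2):
    """Toy lookup-table matvec for ternary weights: for each activation
    group of size g, precompute the dot product with every possible
    ternary weight pattern, then each row indexes the table instead of
    doing g multiply-adds. Illustrative only, not the real kernel."""
    n_out, n_in = w_q.shape
    assert n_in % g == 0
    # All 3**g ternary patterns, in lexicographic order.
    patterns = np.array(list(itertools.product((-1, 0, 1), repeat=g)))
    powers = 3 ** np.arange(g - 1, -1, -1)       # base-3 digit weights
    y = np.zeros(n_out)
    for j in range(0, n_in, g):
        table = patterns @ x[j:j + g]            # dot of every pattern with group
        idx = (w_q[:, j:j + g] + 1) @ powers     # map each row's weights to index
        y += table[idx]                          # one lookup per row per group
    return y

rng = np.random.default_rng(1)
w_q = rng.integers(-1, 2, size=(3, 6))           # ternary weight matrix
x = rng.normal(size=6)
assert np.allclose(lut_matvec_ternary(w_q, x), w_q @ x)  # matches plain matvec
```

In a real kernel the tables are built over bit-packed weights and vectorized with SIMD instructions, but the precompute-then-index structure is the same.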
Official Models
<table> <tr> <th rowspan="2">Model</th> <th rowspan="2">Parameters</th> <th rowspan="2">CPU</th> <th colspan="3">Kernel</th> </tr> <tr> <th>I2_S</th> <th>TL1</th> <th>TL2</th> </tr> <tr> <td rowspan="2"><a href="https://huggingface.co/microsoft/BitNet-b1.58-2B-4T">BitNet-b1.58-2B-4T</a></td> <td rowspan="2">2.4B</td> <td>x86</td> <td>✅</td> <td>❌</td> <td>✅</td> </tr> <tr> <td>ARM</td> <td>✅</td> <td>✅</td> <td>❌</td> </tr> </table>

Supported Models
❗️We use existing 1-bit LLMs available on Hugging Face to demonstrate the inference capabilities of bitnet.cpp. We hope the release of bitnet.cpp will inspire the development of 1-bit LLMs in large-scale settings in terms of model size and training tokens.
<table> <tr> <th rowspan="2">Model</th> <th rowspan="2">Parameters</th> <th rowspan="2">CPU</th> <th colspan="3">Kernel</th> </tr> <tr> <th>I2_S</th> <th>TL1</th> <th>TL2</th> </tr> <tr> <td rowspan="2"><a href="https://huggingface.co/1bitLLM/bitnet_b1_58-large">bitnet_b1_58-large</a></td> <td rowspan="2">0.7B</td> <td>x86</td> <td>✅</td> <td>❌</td> <td>✅</td> </tr> <tr> <td>ARM</td> <td>✅</td> <td>✅</td> <td>❌</td> </tr> <tr> <td rowspan="2"><a href="https://huggingface.co/1bitLLM/bitnet_b1_58-3B">bitnet_b1_58-3B</a></td> <td rowspan="2">3.3B</td> <td>x86</td> <td>❌</td> <td>❌</td> <td>✅</td> </tr> <tr> <td>ARM</td> <td>❌</td> <td>✅</td> <td>❌</td> </tr> <tr> <td rowspan="2"><a href="https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens">Llama3-8B-1.58-100B-tokens</a></td> <td rowspan="2">8.0B</td> <td>x86</td> <td>✅</td> <td>❌</td> <td>✅</td> </tr> <tr> <td>ARM</td> <td>✅</td> <td>✅</td> <td>❌</td> </tr> <tr> <td rowspan="2"><a href="https://huggingface.co/collections/tiiuae/falcon3-67605ae03578be86e4e87026">Falcon3 Family</a></td> <td rowspan="2">1B-10B</td> <td>x86</td> <td>✅</td> <td>❌</td> <td>✅</td> </tr> <tr> <td>ARM</td> <td>✅</td> <td>✅</td> <td>❌</td> </tr> <tr> <td rowspan="2"><a href="https://huggingface.co/collections/tiiuae/falcon-edge-series-6804fd13344d6d8a8fa71130">Falcon-E Family</a></td> <td rowspan="2">1B-3B</td> <td>x86</td> <td>✅</td> <td>❌</td> <td>✅</td> </tr> <tr> <td>ARM</td> <td>✅</td> <td>✅</td> <td>❌</td> </tr> </table>

Installation
Requirements
- python>=3.9
- cmake>=3.22
- clang>=18
  - For Windows users, install Visual Studio 2022. In the installer, enable at least the following options (this also automatically installs the required additional tools like CMake):
    - Desktop development with C++
    - C++ CMake Tools for Windows
    - Git for Windows
    - C++ Clang Compiler for Windows
    - MS-Build Support for LLVM Toolset (clang)
  - For Debian/Ubuntu users, you can install Clang with the automatic installation script:
    bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
- conda (highly recommended)
Build from source
[!IMPORTANT] If you are using Windows, always run the following commands in a Developer Command Prompt / PowerShell for VS2022. Please refer to the FAQs below if you encounter any issues.
- Clone the repo
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
- Install the dependencies
# (Recommended) Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt
- Build the project
# Manually download the model and run with local path
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
<pre>
usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd]
[--use-pretuned]
Setup the environment for running inference
optional arguments:
-h, --help show this help message and exit
--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
Model used for inference
--model-dir MODEL_DIR, -md MODEL_DIR
Directory to save/load the model
--log-dir LOG_DIR, -ld LOG_DIR
Directory to save the logging info
--quant-type {i2_s,tl1}, -q {i2_s,tl1}
Quantization type
--quant-embd Quantize the embeddings to f16
--use-pretuned, -p Use the pretuned kernel parameters
</pre>
Usage
Basic usage
# Run inference with the quantized model
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-mod
