🔥 An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

<!-- This repository contains code and experiments for the paper, [An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks](https://arxiv.org/abs/2411.06360).--> <p align="center"> <img alt="RSR logo" src="assets/icon.png" width="170"> </p>

This repository implements Redundant Segment Reduction (RSR), a fast matrix multiplication algorithm designed for the binary and ternary weight matrices used in low-bit neural networks. RSR reduces the cost of the multiplication by a log(n) factor, making it particularly useful for low-bit deep learning and efficient inference.
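
For a rough sense of where a log(n) saving can come from, here is a minimal illustrative sketch of segment-lookup multiplication for a {0,1} weight matrix, in the spirit of the method of four Russians. This is not the repository's RSR implementation: the function name, the restriction to {0,1} entries, and the default choice k = log2(n) are assumptions made for the example.

```python
import numpy as np

def segment_lookup_multiply(A, x, k=None):
    """Illustrative sketch: multiply a {0,1} matrix A by a vector x using
    per-segment lookup tables (not the repository's RSR implementation)."""
    n = A.shape[1]
    if k is None:
        k = max(1, int(np.log2(n)))          # segment length ~ log2(n)
    y = np.zeros(A.shape[0])
    for start in range(0, n, k):
        end = min(start + k, n)
        width = end - start
        xs = x[start:end]
        # Precompute the dot product of every possible 0/1 segment with xs,
        # deriving each entry from a previously computed one (one add each).
        table = np.zeros(1 << width)
        for pattern in range(1, 1 << width):
            low = pattern & -pattern                       # lowest set bit
            table[pattern] = table[pattern ^ low] + xs[low.bit_length() - 1]
        # Each row now contributes a single table lookup for this column block.
        codes = A[:, start:end] @ (1 << np.arange(width))
        y += table[codes.astype(np.int64)]
    return y
```

With k ≈ log2(n), building each block's table costs O(n) additions and there are n / log2(n) blocks, so the whole product takes roughly O(n² / log n) additions instead of O(n²); a ternary {-1, 0, 1} variant follows the same pattern with 3^k table entries per block.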

For a kernel-level implementation of RSR see here: https://github.com/UIC-InDeXLab/RSR-core.

This code implements the following paper, published at ICML'25:

```bibtex
@article{dehghankar2024efficient,
  title={An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks},
  author={Dehghankar, Mohsen and Erfanian, Mahdi and Asudeh, Abolfazl},
  journal={ICML'25: Proceedings of the International Conference on Machine Learning},
  year={2025},
  url={https://arxiv.org/abs/2411.06360}
}
```

This repository includes:

  • A native C++ implementation of the RSR method for performance comparison.
  • NumPy-based implementations for ease of experimentation and integration into Python workflows.
  • PyTorch implementations with both CPU and GPU support, enabling scalable and optimized matrix operations in deep learning environments.
  • BitNet.cpp integration of the RSR algorithm.

This project aims to provide a fast and efficient approach to low-bit matrix multiplication.

<p align="center"> 【🧠 <a href="#-llm-experiments">Large Language Models</a> | 🧮 <a href="#-numpy-implementations">NumPy</a> | 🔥 <a href="#-torch-implementations">PyTorch</a> | 💻 <a href="#-native-c-implementations">C++</a> | 🚀 <a href="#-bitnetcpp">BitNet.cpp</a>】 </p> <p align="center"> <img alt="visualization" src="assets/visual.png" width="100%"> <p align="center">A visualization of the algorithm.</p> </p>

🧮 NumPy Implementations

The NumPy implementations of the matrix multipliers (Naive, RSR, and RSR++) are found in the numpy_impl directory. You can use these multipliers by instantiating a Multiplier object and passing it a weight matrix A (required) and an optional parameter k. Initialization automatically runs any necessary preprocessing, and you can then perform inference on input vectors with the multiply method.
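
A minimal usage sketch following the interface described above (the import path and the class name RSRMultiplier are assumptions; check the numpy_impl directory for the actual names):

```python
import numpy as np
from numpy_impl.multipliers import RSRMultiplier  # hypothetical import path

A = np.random.choice([-1, 0, 1], size=(1024, 1024))  # ternary weight matrix
x = np.random.randn(1024)                            # input vector

multiplier = RSRMultiplier(A, k=8)  # preprocessing runs at initialization
y = multiplier.multiply(x)          # inference on the input vector
```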

⚙️ Requirements

Ensure you have Python >= 3.6 installed, along with all packages listed in requirements.txt.

✅ Testing the Multipliers

To validate the correctness of the RSR and RSR++ multipliers, run rsr_test.py. This script randomly generates a weight matrix and an input vector, then compares the result of the multiplication with the ground truth. To run the tests, use ./run_test.sh inside the numpy_impl directory.
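
A condensed version of that check looks roughly like this (hedged sketch; the class names and import path are the same assumptions as above):

```python
import numpy as np
from numpy_impl.multipliers import NaiveMultiplier, RSRMultiplier, RSRPPMultiplier  # hypothetical

n = 512
A = np.random.choice([-1, 0, 1], size=(n, n))   # random ternary weight matrix
x = np.random.randn(n)                          # random input vector

ground_truth = NaiveMultiplier(A).multiply(x)
for cls in (RSRMultiplier, RSRPPMultiplier):
    assert np.allclose(cls(A).multiply(x), ground_truth), cls.__name__
```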


💻 Native C++ Implementations

Native C++ implementations for the matrix multipliers are available in the native directory.

⚙️ Requirements

To compile and run the C++ code, you’ll need clang++ installed.

⏱️ Run Time Comparison

To compare run times for different values of n across algorithms, use the script ./run_time_compare.sh [algorithm], where [algorithm] can be one of naive, rsr, or rsrpp.

🔧 k Optimization

To test various values of k for runtime optimization, run ./run_k_optimization.sh. This script benchmarks the run times for different k values, with the target n value specified in k_optimization.cpp.
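
The same kind of sweep can also be sketched in Python against the NumPy multipliers (a hedged sketch reusing the hypothetical RSRMultiplier import from above). Intuitively, larger k means fewer lookups per row but larger precomputed tables, so an intermediate value tends to perform best:

```python
import time
import numpy as np
from numpy_impl.multipliers import RSRMultiplier  # hypothetical import path

n = 4096
A = np.random.choice([-1, 0, 1], size=(n, n))
x = np.random.randn(n)

for k in (4, 6, 8, 10, 12):
    m = RSRMultiplier(A, k=k)              # preprocessing cost grows with k
    start = time.perf_counter()
    for _ in range(10):
        m.multiply(x)
    print(f"k={k:2d}: {time.perf_counter() - start:.4f} s")
```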

🧪 Running Tests

Several tests are provided to ensure algorithmic correctness. Run them by executing ./run_test.sh inside the native directory.


🔥 Torch Implementations

Torch implementations are in the torch_impl directory and rely only on torch tensor operations. To run the tests and examples, run ./run_tests.sh inside the torch_impl directory.
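
A minimal usage sketch with torch tensors (the module path and class name are assumptions; the interface is assumed to mirror the NumPy multipliers):

```python
import torch
from torch_impl.multipliers import RSRMultiplier  # hypothetical import path

device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randint(-1, 2, (1024, 1024), device=device)  # ternary weights in {-1, 0, 1}
x = torch.randn(1024, device=device)

multiplier = RSRMultiplier(A, k=8)   # preprocessing runs at initialization
y = multiplier.multiply(x)
torch.testing.assert_close(y, A.to(x.dtype) @ x)       # matches dense matmul
```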


🧠 LLM Experiments

The LLM implementations are inside the llm directory, which contains code for inference on both CPU and GPU in the corresponding subdirectories. The patch_[cpu/gpu].py file contains the patched preprocessing functions and the new forward function for the BitLinear module; a rough usage sketch is given after the list below.

  • The notebook [cpu/gpu]_impl/profile.ipynb contains example code for applying the patch and profiling the running time.
  • Currently, this example works with these 1.58-bit models: Llama3-8B-1.58bit, Falcon3-10B-1.58bit, and Falcon3-3B-1.58bit.
  • The notebook gpu_impl/matrix_mult_compare.ipynb contains the code for comparing pure matrix multiplication in Torch against RSR.
  • The results of the GPU experiments are reported in the Experiments section of the paper.
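
Roughly, the patching flow described above looks like the sketch below (heavily hedged: the helper name patch_bitlinear_modules, its import path, and the model identifier are assumptions rather than the repository's exact API; see [cpu/gpu]_impl/profile.ipynb for the real example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llm.cpu_impl.patch_cpu import patch_bitlinear_modules  # hypothetical helper

model_id = "path-or-id-of-a-1.58bit-model"  # e.g. one of the checkpoints listed above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Swap in the RSR-based forward for every BitLinear layer and run the
# one-time preprocessing over the ternary weights.
patch_bitlinear_modules(model)

inputs = tokenizer("Hello", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```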

🚀 BitNet.cpp

A proof-of-concept comparison of RSR with BitNet.cpp is implemented in the bitnetcpp directory. It integrates the C++ implementation of RSR into the BitNet.cpp environment.

  • Check bitnetcpp/README.md for more details on building and running.

License

This project is licensed under the MIT License - see the LICENSE file for details.
