TinyChatEngine: On-Device LLM/VLM Inference Library

Running large language models (LLMs) and visual language models (VLMs) on the edge enables copilot services (coding, office, smart reply) on laptops, cars, robots, and more. Users get instant responses with better privacy, since the data stays local.

This is enabled by two LLM compression techniques, SmoothQuant and AWQ (Activation-aware Weight Quantization), co-designed with TinyChatEngine, which implements the compressed low-precision models.

Feel free to check out our slides for more details!

Code LLaMA Demo on NVIDIA GeForce RTX 4070 laptop:


VILA Demo on Apple MacBook M1 Pro:


LLaMA Chat Demo on Apple MacBook M1 Pro:


Overview

LLM Compression: SmoothQuant and AWQ

SmoothQuant: Smooth the activation outliers by migrating the quantization difficulty from activations to weights, with a mathematically equivalent transformation (e.g., 100 × 1 = 10 × 10).

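The smoothing idea can be sketched numerically. The snippet below is our illustration, not the library's actual code: per-channel scales divide the activations and multiply the corresponding weight rows, so the layer output is mathematically unchanged while the activation outliers disappear.

```python
# Illustrative sketch of SmoothQuant's equivalent transformation:
# per-channel scales migrate quantization difficulty from activations
# to weights without changing the product X @ W.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 100.0            # inject an activation outlier channel
W = rng.normal(size=(8, 8))

# Per-channel smoothing scales; here simply the per-channel activation max.
s = np.abs(X).max(axis=0)

X_smooth = X / s            # activations become easy to quantize (max is 1)
W_smooth = W * s[:, None]   # the weights absorb the difficulty

# The transformation is mathematically equivalent: outputs match.
assert np.allclose(X @ W, X_smooth @ W_smooth)
print(np.abs(X).max(), np.abs(X_smooth).max())
```

In the real method, the scales balance activation and weight ranges rather than flattening activations completely, but the equivalence shown by the assertion is the core of the trick.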

AWQ (Activation-aware Weight Quantization): Protect the salient weight channels, selected by analyzing activation magnitudes rather than weight magnitudes.
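A toy numerical sketch of this intuition (our illustration, not the repo's implementation): channels whose inputs carry large activation magnitudes are "salient"; scaling their weights up before round-to-nearest quantization, and dividing the activations accordingly, lowers their quantization error while leaving the full-precision output mathematically unchanged.

```python
# Toy sketch of the AWQ intuition with INT4 round-to-nearest quantization.
# The scale factor 4.0 and the salience threshold are illustrative choices.
import numpy as np

def quantize_rtn(w, n_bits=4):
    # symmetric per-tensor round-to-nearest quantization, then dequantize
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 8))
X[:, 0] *= 50.0                 # channel 0 carries very large activations
W = rng.normal(size=(8, 16))
W[0] *= 0.1                     # its weights are small yet important

act_mag = np.abs(X).mean(axis=0)                   # per-input-channel magnitude
s = np.where(act_mag > act_mag.mean(), 4.0, 1.0)   # protect salient channels

ref   = X @ W                                      # full-precision reference
plain = X @ quantize_rtn(W)                        # naive round-to-nearest
awq   = (X / s) @ quantize_rtn(W * s[:, None])     # activation-aware scaling

print(np.abs(ref - plain).mean(), np.abs(ref - awq).mean())
```

The second error printed is noticeably smaller: the salient channel's weights occupy more of the quantization grid after scaling, so their rounding error shrinks exactly where the activations are largest.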

LLM Inference Engine: TinyChatEngine

  • Universal: x86 (Intel/AMD), ARM (Apple M1/M2, Raspberry Pi), CUDA (Nvidia GPU).
  • No library dependency: From-scratch C/C++ implementation.
  • High performance: Real-time inference on MacBook and GeForce laptops.
  • Easy to use: Download and compile, then ready to go!


News

  • (2024/05) 🏆 AWQ and TinyChat received the Best Paper Award at MLSys 2024. 🎉
  • (2024/05) 🔥 We released support for the Llama-3 model family! Check out our example and model zoo.
  • (2024/02) 🔥 AWQ and TinyChat have been accepted to MLSys 2024!
  • (2024/02) 🔥 We extended support to vision language models (VLMs). Feel free to try running VILA on your edge device.
  • (2023/10) We extended support to the coding assistant Code Llama. Feel free to check out our model zoo.
  • (2023/10) ⚡ We released a new CUDA backend that supports Nvidia GPUs with compute capability >= 6.1, for both server and edge GPUs. Its performance is also sped up by ~40% compared to the previous version. Feel free to check it out!

Prerequisites

MacOS

For macOS, install boost and llvm via Homebrew:

brew install boost
brew install llvm

For M1/M2 users, install Xcode from the App Store to enable the Metal compiler for GPU support.

Windows with CPU

For Windows, download and install the GCC compiler with MSYS2. Follow this tutorial: https://code.visualstudio.com/docs/cpp/config-mingw for installation.

  • Install required dependencies with MSYS2
pacman -S --needed base-devel mingw-w64-x86_64-toolchain make unzip git
  • Add binary directories (e.g., C:\msys64\mingw64\bin and C:\msys64\usr\bin) to the environment path

Windows with Nvidia GPU (Experimental)

  • Install the CUDA toolkit for Windows (link). When installing CUDA on your PC, please choose an installation path that does not contain spaces.

  • Install Visual Studio with C and C++ support: Follow the Instruction.

  • Follow the instructions below and use x64 Native Tools Command Prompt from Visual Studio to compile TinyChatEngine.

Step-by-step to Deploy Llama-3-8B-Instruct with TinyChatEngine

Here, we provide step-by-step instructions to deploy Llama-3-8B-Instruct with TinyChatEngine from scratch.

  • Download the repo.

    git clone --recursive https://github.com/mit-han-lab/TinyChatEngine
    cd TinyChatEngine
    
  • Install Python Packages

    • The primary codebase of TinyChatEngine is written in pure C/C++. The Python packages are only used for downloading (and converting) models from our model zoo.
      conda create -n TinyChatEngine python=3.10 pip -y
      conda activate TinyChatEngine
      pip install -r requirements.txt
      
  • Download the quantized Llama model from our model zoo.

    cd llm
    
    • On an x86 device (e.g., Intel/AMD laptop)
      python tools/download_model.py --model LLaMA_3_8B_Instruct_awq_int4 --QM QM_x86
      
    • On an ARM device (e.g., M1/M2 Macbook, Raspberry Pi)
      python tools/download_model.py --model LLaMA_3_8B_Instruct_awq_int4 --QM QM_ARM
      
    • On a CUDA device (e.g., Jetson AGX Orin, PC/Server)
      python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_CUDA
      
    • Check this table for the detailed list of supported models
  • (CUDA only) Based on the platform you are using and the compute capability of your GPU, modify the Makefile accordingly. If using Windows with Nvidia GPU, please modify -arch=sm_xx in Line 54. If using other platforms with Nvidia GPU, please modify -gencode arch=compute_xx,code=sm_xx in Line 60.
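For instance, on a GPU with compute capability 8.6 (e.g., an RTX 30-series laptop GPU), the edited flags would look roughly like the following. The variable name and layout here are illustrative, not the repo's exact Makefile; edit the actual lines referenced above.

```makefile
# Illustrative only: match xx to your GPU's compute capability
# (e.g., 8.6 -> sm_86); see NVIDIA's CUDA GPUs compute capability table.
# Windows with Nvidia GPU (Line 54):
NVCC_FLAGS += -arch=sm_86
# Other platforms with Nvidia GPU (Line 60):
NVCC_FLAGS += -gencode arch=compute_86,code=sm_86
```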

  • Compile and start the chat locally.

    make chat -j
    ./chat
    
    TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
    Using model: LLaMA_3_8B_Instruct
    Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
    Loading model... Finished!
    USER: Write a syllabus for the parallel computing course.
    ASSISTANT: Here is a sample syllabus for a parallel computing course:
    
    **Course Title:** Parallel Computing
    **Instructor:** [Name]
    **Description:** This course covers the fundamental concepts of parallel computing, including parallel algorithms, programming models, and architectures. Students will learn how to design, implement, and optimize parallel programs using various languages and frameworks.
    **Prerequisites:** Basic knowledge of computer science and programming concepts.
    **Course Objectives:**
    * Understand the principles of parallelism and its applications
    * Learn how to write parallel programs using different languages (e.g., OpenMP, MPI)
    ...
    

Deploy vision language model (VLM) chatbot with TinyChatEngine


TinyChatEngine supports not only LLMs but also VLMs. We introduce a sophisticated chatbot for VLMs. Here, we provide easy-to-follow instructions to deploy a vision language model chatbot (VILA-7B) with TinyChatEngine. We recommend using M1/M2 MacBooks for this VLM feature.

  • Download the quantized VILA-7B model from our model zoo.

    • On an x86 device (e.g., Intel/AMD laptop)
      python tools/download_model.py --model VILA_7B_awq_int4_CLIP_ViT-L --QM QM_x86
      
    • On an ARM device (e.g., M1/M2 Macbook, Raspberry Pi)
      python tools/download_model.py --model VILA_7B_awq_int4_CLIP_ViT-L --QM QM_ARM
      