
LightCompress

[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.

Install / Use

/learn @ModelTC/LightCompress

README

<div align="center" style="font-family: charter;"> <h1> LightCompress: Towards Accurate and Efficient AIGC Model Compression </h1> <img src="./imgs/llmc.png" alt="llmc" width="75%" /> <img src="./imgs/llmc+.png" alt="llmc" width="75%" />


[ English | 中文 ]

</div>

📢 Notice: This repository was formerly known as LLMC and has been renamed to LightCompress.

LightCompress is an off-the-shelf toolkit for compressing AIGC models (LLMs, VLMs, diffusion models, etc.), leveraging state-of-the-art compression algorithms to improve efficiency and reduce model size without compromising performance. You can pull a Docker image that runs LightCompress with the following commands. Users in mainland China are recommended to use the Alibaba Cloud registry.

# docker hub: https://hub.docker.com/r/llmcompression/llmc
docker pull llmcompression/llmc:pure-latest

# aliyun docker: registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:[tag]
docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-latest
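Once the image is pulled, a typical way to start an interactive container is sketched below. Note that the mount path, container name, and GPU flags are illustrative assumptions, not commands from the LightCompress docs; adjust them to your environment.

```shell
# Start an interactive container from the pulled image
# (hypothetical mount path and container name; adjust as needed).
docker run -it --rm \
    --gpus all \
    -v /path/to/models:/workspace/models \
    --name lightcompress \
    llmcompression/llmc:pure-latest \
    /bin/bash
```

The `--gpus all` flag requires the NVIDIA Container Toolkit; omit it for a CPU-only session.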

Community: Discord Server, Tencent QQ Group.

Docs: English, Chinese.

Recommended Python Version: We recommend using Python 3.11 for local development and installation. This matches the project's Docker images and CI configuration, and is generally more stable than Python 3.12 for the current dependency set.
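As a small illustration of that recommendation, the helper below (a hypothetical snippet, not part of the repository) checks whether the running interpreter matches the recommended 3.11 line before you install:

```python
import sys

def matches_recommended(version_info=None, recommended=(3, 11)):
    """Return True if the interpreter's major.minor equals the recommended line."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == recommended

if __name__ == "__main__":
    # Warn rather than fail hard: newer interpreters may still work
    # with parts of the dependency set, just less reliably.
    if not matches_recommended():
        print(f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
              "detected; LightCompress is tested against 3.11.")
```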

:fire: Latest News

  • Nov 9, 2025: 🍺🍺🍺 Our work LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit has been accepted by AAAI 2026.

  • August 13, 2025: 🚀 We have open-sourced our compression solution for vision-language models (VLMs), supporting more than 20 algorithms covering both token reduction and quantization. This release enables flexible, plug-and-play compression strategies for a wide range of multimodal tasks. Please refer to the documentation.

  • May 12, 2025: 🔥 We now fully support quantization for the Wan2.1 series of video generation models and provide export of truly quantized INT8/FP8 weights, compatible with the lightx2v inference framework. For details, please refer to the lightx2v documentation.

  • Feb 07, 2025: 🔥 We now fully support quantization of large-scale MoE models such as DeepSeek-V3, DeepSeek-R1, and DeepSeek-R1-Zero with 671B parameters. You can directly load FP8 weights without any extra conversion. AWQ and RTN quantization can run on a single 80GB GPU, and we also support the export of true quantized INT4/INT8 weights.

  • Nov 20, 2024: 🔥 We now fully support the quantization of ✨DeepSeekv2 (2.5) and other MoE models, as well as ✨Qwen2VL, Llama3.2, and other VLM models. Supported quantization methods include ✅integer quantization, ✅floating-point quantization, and advanced algorithms like ✅AWQ, ✅GPTQ, ✅SmoothQuant, and ✅Quarot.

  • Nov 12, 2024: 🔥 We have added support for 💥static per-tensor activation quantization across various models and algorithms, covering ✅integer quantization and ✅floating-point quantization to further optimize performance and efficiency. Additionally, we now support exporting ✨real quantized models and using the VLLM and SGLang backends for inference acceleration. For more details, refer to the VLLM documentation and SGLang documentation.

  • Sep 26, 2024: 🔥 We now support exporting 💥FP8-quantized (E4M3, E5M2) models from 🚀LLMC to advanced inference backends such as VLLM and SGLang. For detailed usage, please refer to the VLLM documentation and SGLang documentation.

<details close> <summary>Previous News</summary>
  • Sep 24, 2024: 🔥 We have officially released ✅INT4 and ✅INT8 models of ✨Llama-3.1-405B, quantized using 🚀LLMC in save_lightllm mode. You can download the model parameters here.

  • Sep 23, 2024: 🔥 We now support exporting ✨real quantized (INT4, INT8) models from 🚀LLMC to advanced inference backends such as VLLM, SGLang, AutoAWQ, and MLC-LLM for quantized inference deployment, enabling ✨reduced memory usage and ✨faster inference speeds. For detailed usage, please refer to the VLLM documentation, SGLang documentation, AutoAWQ documentation, and MLC-LLM documentation.

  • Sep 09, 2024: 🔥 We provide configs reflecting our best practices for superior performance (see Best Practice here).

</details>
