427 skills found · Page 1 of 15
bitsandbytes-foundation / BitsandbytesAccessible large language models via k-bit quantization for PyTorch.
Lightning-AI / Lit LlamaImplementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.
city96 / ComfyUI GGUFGGUF Quantization support for native ComfyUI models
thu-ml / SageAttention[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
intel / Neural CompressorSOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
quic / AimetAIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
Efficient-ML / Awesome Model QuantizationA list of papers, docs, codes about model quantization. This repo is aimed to provide the info for model quantization research, we are continuously improving the project. Welcome to PR the works (papers, repositories) that are missed by the repo.
microsoft / OliveOlive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs.
666DZY666 / Micronetmicronet, a model compression and deploy lib. compression: 1、quantization: quantization-aware-training(QAT), High-Bit(>2b)(DoReFa/Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference)、Low-Bit(≤2b)/Ternary and Binary(TWN/BNN/XNOR-Net); post-training-quantization(PTQ), 8-bit(tensorrt); 2、 pruning: normal、regular and group convolutional channel pruning; 3、 group convolution structure; 4、batch-normalization fuse for quantization. deploy: tensorrt, fp32/fp16/int8(ptq-calibration)、op-adapt(upsample)、dynamic_shape
NVIDIA / Model OptimizerA unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM, TensorRT, vLLM, etc. to optimize inference speed.
horseee / Awesome Efficient LLMA curated list for Efficient Large Language Models
mit-han-lab / Smoothquant[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
tensorflow / Model OptimizationA toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
Vahe1994 / AQLMOfficial Pytorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.pdf and PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression https://arxiv.org/abs/2405.14852
jakerdliu / OpenTrit CHN• OpenTrit, an open-source cross-framework mixed ternary toolkit, supports one-click conversion of mixed ternary models between PyTorch and TensorFlow. It encapsulates heterogeneous computing power scheduling and quantization optimization, addressing the issues of "framework dependency and poor usability" present in existing ternary tools.
ModelCloud / GPTQModelLLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.
HarryR / Z80aiZ80-μLM is a 2-bit quantized language model small enough to run on an 8-bit Z80 processor. Train conversational models in Python, export them as CP/M .COM binaries, and chat with your vintage computer.
Geekgineer / YOLOs CPPCross-Platform Production-ready C++ inference engine for YOLO models (v5-v12, YOLO26). Unified API for detection, segmentation, pose estimation, OBB, and classification. Built on ONNX Runtime and OpenCV. Optimized for CPU/GPU with quantization support.
ModelTC / MQBenchModel Quantization Benchmark
guan-yuan / Awesome AutoML And Lightweight ModelsA list of high-quality (newest) AutoML works and lightweight models including 1.) Neural Architecture Search, 2.) Lightweight Structures, 3.) Model Compression, Quantization and Acceleration, 4.) Hyperparameter Optimization, 5.) Automated Feature Engineering.