artidoro/qlora · QLoRA: Efficient Finetuning of Quantized LLMs
bitsandbytes-foundation/bitsandbytes · Accessible large language models via k-bit quantization for PyTorch.
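The k-bit idea behind libraries like bitsandbytes can be illustrated with symmetric absmax scaling: map floats into the signed int8 range by the tensor's largest magnitude. A minimal NumPy sketch of the arithmetic (illustrative only, not the library's kernels or API):

```python
import numpy as np

def absmax_quantize_int8(w):
    """Symmetric absmax quantization: scale weights into [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, scale = absmax_quantize_int8(w)
w_hat = dequantize(q, scale)  # round-trip error is at most scale / 2
```

Real schemes refine this with per-block scales and outlier handling, but the store-low-bit, rescale-on-use pattern is the same.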
Lightning-AI/lit-llama · Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed.
AutoGPTQ/AutoGPTQ · An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
lucidrains/vector-quantize-pytorch · Vector (and scalar) quantization, in PyTorch.
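At its core, the vector quantization that this family of methods builds on is a nearest-codebook-entry lookup: encode each vector as the index of its closest codebook row, decode by reading that row back. A hypothetical toy sketch in NumPy (not the package's API):

```python
import numpy as np

def vq_encode(x, codebook):
    """Map each row of x to the index of its nearest codebook vector (L2)."""
    # (n, 1, d) - (1, k, d) -> (n, k) squared distances
    d2 = np.sum((x[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    return np.argmin(d2, axis=1)

def vq_decode(indices, codebook):
    """Reconstruct vectors by codebook lookup."""
    return codebook[indices]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]], dtype=np.float32)
x = np.array([[0.9, 1.1], [0.1, -0.2]], dtype=np.float32)
idx = vq_encode(x, codebook)     # each input snaps to its nearest entry
x_hat = vq_decode(idx, codebook)
```

Trainable versions add a straight-through estimator and codebook-update losses on top of this lookup, which is where the PyTorch implementations earn their keep.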
mit-han-lab/llm-awq · [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
city96/ComfyUI-GGUF · GGUF quantization support for native ComfyUI models
thu-ml/SageAttention · [ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
qwopqwop200/GPTQ-for-LLaMa · 4-bit quantization of LLaMA using GPTQ
turboderp/exllama · A more memory-efficient rewrite of the HF Transformers implementation of Llama for use with quantized weights.
pytorch/ao · PyTorch-native quantization and sparsity for training and inference
intel/neural-compressor · SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) and sparsity; leading model compression techniques for PyTorch, TensorFlow, and ONNX Runtime
quic/aimet · AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
NVIDIA/TensorRT-Model-Optimizer · A unified library of SOTA model optimization techniques such as quantization, pruning, distillation, and speculative decoding. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM, TensorRT, and vLLM to optimize inference speed.
Efficient-ML/Awesome-Model-Quantization · A curated list of papers, docs, and code about model quantization. The repo aims to collect resources for model quantization research and is continuously improved; PRs adding works (papers, repositories) the list has missed are welcome.
casper-hansen/AutoAWQ · AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
microsoft/Olive · Olive: simplify ML model fine-tuning, conversion, quantization, and optimization for CPUs, GPUs, and NPUs.
IST-DASLab/gptq · Code for the ICLR 2023 paper "GPTQ: Accurate Post-Training Quantization of Generative Pretrained Transformers".
666DZY666/micronet · micronet, a model compression and deployment library. Compression: (1) quantization: quantization-aware training (QAT) at high bit-widths (>2-bit, DoReFa / "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference") and low bit-widths (≤2-bit ternary/binary: TWN/BNN/XNOR-Net), plus 8-bit post-training quantization (PTQ, TensorRT); (2) pruning: normal, regular, and group-convolutional channel pruning; (3) group-convolution structure; (4) batch-normalization fusion for quantization. Deployment: TensorRT with fp32/fp16/int8 (PTQ calibration), op adaptation (upsample), and dynamic shapes.
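The batch-norm fusion that micronet lists is a standard pre-quantization step: fold BN's per-channel affine transform into the preceding convolution's weights and bias so only one op needs quantizing. A hypothetical NumPy sketch of the algebra (not micronet's code):

```python
import numpy as np

def fuse_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding layer's weights and bias.

    W: (out_ch, ...) weights; b: (out_ch,) bias.
    gamma, beta, mean, var: (out_ch,) BatchNorm parameters.
    """
    s = gamma / np.sqrt(var + eps)                    # per-channel scale
    W_f = W * s.reshape(-1, *([1] * (W.ndim - 1)))    # scale each output channel
    b_f = (b - mean) * s + beta
    return W_f, b_f

# toy check: layer-then-BN equals the fused layer on a per-channel linear map
W = np.array([[2.0], [0.5]]); b = np.array([1.0, -1.0])
gamma = np.array([1.5, 0.8]); beta = np.array([0.1, 0.2])
mean = np.array([0.5, 0.0]); var = np.array([4.0, 1.0])
x = np.array([3.0])
y_ref = gamma * ((W @ x + b - mean) / np.sqrt(var + 1e-5)) + beta
W_f, b_f = fuse_conv_bn(W, b, gamma, beta, mean, var)
y_fused = W_f @ x + b_f   # identical output, one op fewer to quantize
```

The same folding applies to conv kernels, since BN acts per output channel there too.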
TimmyOVO/deepseek-ocr.rs · Rust multi-backend OCR/VLM engine (DeepSeek-OCR-1/2, PaddleOCR-VL, DotsOCR) with DSQ quantization and an OpenAI-compatible server and CLI; runs locally without Python.