# 🦥 Quansloth: TurboQuant Local AI Server
```
  ____                             _         _    _
 / __ \                          | |       | |  | |
| |  | | _   _   __ _  _ __   ___ | |  ___  | |_ | |__
| |  | || | | | / _` || '_ \ / __|| | / _ \ | __|| '_ \
| |__| || |_| || (_| || | | |\__ \| || (_) || |_ | | | |
 \___\_\ \__,_| \__,_||_| |_||___/|_| \___/  \__||_| |_|

        [ POWERED BY TURBOQUANT+ | NVIDIA CUDA ]
```
Breaking the VRAM Wall: Built on an implementation of Google's TurboQuant (ICLR 2026), Quansloth brings state-of-the-art KV-cache compression to local LLM inference.

Quansloth is a fully private, air-gapped AI server that runs massive-context models natively on consumer hardware (like an RTX 3060). By bridging a custom Gradio Python frontend with a highly optimized llama.cpp CUDA backend, Quansloth cuts the KV cache's VRAM footprint by up to 75%.
## 🛑 Why Quansloth? (No More GPU Crashes)

Standard LLM inference often hits a "memory wall" when processing long documents: as the context grows, the GPU runs out of memory (OOM) and the system crashes.

Quansloth prevents these crashes by:

- 75% Cache Shrink: Compressing the "memory" of the AI from 16-bit to 4-bit (TurboQuant).
- Massive Context on Budget GPUs: Run 32k+ token contexts on a 6GB RTX 3060 that would normally require a 24GB RTX 4090.
- Hardware-Level Stability: The interface monitors the CUDA backend to keep the model within your GPU's physical limits, allowing stable, long-form document analysis without fear of a system hang.
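The 75% figure is simple arithmetic on the cache's bit-width. A back-of-the-envelope sketch (the model shape below is illustrative of a Llama-3-8B-class model with GQA, not a measurement from Quansloth):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_elem):
    """KV cache size: 2 tensors (K and V) per layer, each holding
    context_len vectors of n_kv_heads * head_dim elements."""
    elems = 2 * n_layers * context_len * n_kv_heads * head_dim
    return elems * bits_per_elem // 8

# Llama-3-8B-class shape: 32 layers, 8 KV heads of dim 128, 32k context.
fp16 = kv_cache_bytes(32, 8, 128, 32_768, 16)
q4 = kv_cache_bytes(32, 8, 128, 32_768, 4)

print(f"fp16 KV cache: {fp16 / 2**30:.2f} GiB")  # 4.00 GiB
print(f"4-bit KV cache: {q4 / 2**30:.2f} GiB")   # 1.00 GiB
print(f"savings: {1 - q4 / fp16:.0%}")           # 75%
```

Going from 16-bit to 4-bit removes 12 of every 16 bits, which is where the 75% comes from regardless of model size; the absolute GiB numbers scale linearly with layers, KV heads, and context length.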

## 🖥️ OS Compatibility

- Windows 10/11: Fully supported (via WSL2 Ubuntu). Features a one-click `.bat` launcher.
- Linux: Fully supported (native).
- macOS: Not officially supported out of the box (the backend is optimized for NVIDIA CUDA GPUs).
## ✨ Features

- TurboQuant Cache Compression: Run 8,192+ token contexts natively on 6GB GPUs without Out-Of-Memory (OOM) crashes.
- Live Hardware Analytics: The UI intercepts the C++ engine logs to report your exact VRAM allocation and savings in real time.
- Context Injector: Upload long documents (PDF, TXT, CSV, MD) directly into the chat stream to test the AI's memory limits.
- Dual-Routing: Auto-scan your local `models/` folder, or enter a custom absolute path to load any `.gguf` file.
- Cyberpunk UI: A sleek, fully responsive dark-mode dashboard built for power users.
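The Live Hardware Analytics feature boils down to scraping the backend's startup log for buffer-size lines. A minimal sketch of the idea (the log lines below are modeled on typical llama.cpp output, and the parser is illustrative, not Quansloth's actual code):

```python
import re

# Hypothetical llama.cpp-style log excerpt; real output varies by version.
SAMPLE_LOG = """\
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CUDA0 buffer size =  4155.99 MiB
llama_kv_cache_init: CUDA0 KV buffer size =  1024.00 MiB
"""

def vram_report(log_text):
    """Sum the 'buffer size = N MiB' figures per logging subsystem."""
    pattern = re.compile(r"^(\w+):.*buffer size\s*=\s*([\d.]+) MiB", re.M)
    report = {}
    for subsystem, mib in pattern.findall(log_text):
        report[subsystem] = report.get(subsystem, 0.0) + float(mib)
    return report

print(vram_report(SAMPLE_LOG))
# {'llm_load_tensors': 4155.99, 'llama_kv_cache_init': 1024.0}
```

Separating the model-weight buffers from the KV-cache buffers is what lets the dashboard show cache savings independently of the (uncompressed) weight allocation.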
## 🛠️ Prerequisites

- Windows with WSL2 (Ubuntu) OR native Linux
- NVIDIA GPU with updated drivers
- Miniconda or Anaconda installed
## 🚀 Installation

### 1. Prepare the Python Environment

```bash
conda create -n quansloth python=3.10 -y
conda activate quansloth
```

### 2. Clone the Repository and Install Requirements

```bash
git clone https://github.com/PacifAIst/Quansloth.git
cd Quansloth
pip install -r requirements.txt
```

### 3. Run the Installer

```bash
chmod +x install.sh
./install.sh
```
## 🎮 Usage

### Adding Models

Download `.gguf` models (e.g., Llama 3 8B) and place them in the `models/` folder.
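Auto-scanning that folder amounts to globbing it for `.gguf` files. A minimal sketch (the helper name is hypothetical, not Quansloth's API):

```python
from pathlib import Path

def scan_models(models_dir="models"):
    """List available .gguf files by name; empty list if the folder is missing."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob("*.gguf"))
```

Returning an empty list (rather than raising) when the folder is absent keeps a fresh checkout usable: the UI can simply show no local models and fall back to the custom-path route.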
### Start Server (Windows, 1-Click)

- Use `Launch_Quansloth.bat`
- Double-click → auto-launches WSL, Conda, and the server

### Start Server (Linux / WSL)

```bash
conda activate quansloth
python quansloth_gui.py
```

### Connect

Open http://127.0.0.1:7860 in your browser.
## 🎛️ Pro Tips

- Symmetric (Turbo3) → best overall compression
- Asymmetric (Q8/Turbo4) → better for Q4_K_M models (e.g., Qwen)
- Monitor the Hardware Stats panel for real-time VRAM savings
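For orientation, stock llama.cpp selects KV-cache precision through its `--cache-type-k`/`--cache-type-v` flags. A sketch of how mode names like those above could map onto per-cache types (the mapping and the `turbo3`/`turbo4` type names are assumptions standing in for Quansloth's custom kernels, not stock llama.cpp values):

```python
# Illustrative mapping from UI mode names to per-cache precisions. The
# "turbo3"/"turbo4" names are placeholders for Quansloth's custom kernels;
# stock llama.cpp accepts types such as "f16", "q8_0", and "q4_0".
CACHE_MODES = {
    "symmetric_turbo3": {"k": "turbo3", "v": "turbo3"},    # best overall compression
    "asymmetric_q8_turbo4": {"k": "q8_0", "v": "turbo4"},  # gentler on Q4_K_M models
    "baseline_f16": {"k": "f16", "v": "f16"},              # no compression
}

def cache_args(mode):
    """Translate a UI mode into backend --cache-type-k/-v arguments."""
    cfg = CACHE_MODES[mode]
    return ["--cache-type-k", cfg["k"], "--cache-type-v", cfg["v"]]

print(cache_args("symmetric_turbo3"))
# ['--cache-type-k', 'turbo3', '--cache-type-v', 'turbo3']
```

The asymmetric mode illustrates why K and V are configured separately: keys and values tolerate quantization differently, so a higher-precision K cache can be paired with a more aggressive V cache.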
## 📜 License & Credits

- License: This project is licensed under the Apache 2.0 License.
- Core Technology: Built upon the TurboQuant+ implementation developed by TheTom (@TheTom).
- Research & Algorithms: The underlying algorithm is based on research from Google Research (arXiv:2504.19874).
- CUDA Kernels: Special thanks to Gabe Ortiz (signalnine) for porting the CUDA kernels.
## 👤 Author

Dr. Manuel Herrador 📧 mherrador@ujaen.es
University of Jaén (UJA) - Spain
<p align="center">Made with ❤️ for the Local AI Community by PacifAIst</p>
