# 🦥 Quansloth: TurboQuant Local AI Server
```
  ____                             _         _    _
 / __ \                          | |       | |  | |
| |  | | _   _   __ _  _ __   ___ | |  ___  | |_ | |__
| |  | || | | | / _` || '_ \ / __|| | / _ \ | __|| '_ \
| |__| || |_| || (_| || | | |\__ \| || (_) || |_ | | | |
 \___\_\ \__,_| \__,_||_| |_||___/|_| \___/  \__||_| |_|

        [ POWERED BY TURBOQUANT+ | NVIDIA CUDA ]
```
Breaking the VRAM Wall: Built on an implementation of Google's TurboQuant (ICLR 2026), Quansloth brings state-of-the-art KV-cache compression to local LLM inference.

Quansloth is a fully private, air-gapped AI server that runs massive-context models natively on consumer hardware (like an RTX 3060). By bridging a custom Gradio Python frontend with a highly optimized llama.cpp CUDA backend, Quansloth cuts the KV cache's VRAM footprint by up to 75%.
## 🛑 Why Quansloth? (No More GPU Crashes)

Standard LLM inference often hits a "memory wall" when processing long documents: as the context grows, the GPU runs out of memory (OOM) and the system crashes.

Quansloth prevents these crashes by:

- 75% Cache Shrink: Compressing the "memory" of the AI from 16-bit to 4-bit (TurboQuant).
- Massive Context on Budget GPUs: Run 32k+ token contexts on a 6GB RTX 3060 that would normally require a 24GB RTX 4090.
- Hardware-Level Stability: The interface monitors the CUDA backend to keep the model within your GPU's physical limits, allowing stable, long-form document analysis without fear of a system hang.
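The 75% figure is simple arithmetic on the cache's bit-width. A back-of-the-envelope sketch (the model shape below is illustrative of a Llama-3-8B-class model with GQA, not a measurement from Quansloth):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_elem):
    """KV cache size: 2 tensors (K and V) per layer, each holding
    context_len vectors of n_kv_heads * head_dim elements."""
    elems = 2 * n_layers * context_len * n_kv_heads * head_dim
    return elems * bits_per_elem // 8

# Llama-3-8B-class shape: 32 layers, 8 KV heads of dim 128, 32k context.
fp16 = kv_cache_bytes(32, 8, 128, 32_768, 16)
q4 = kv_cache_bytes(32, 8, 128, 32_768, 4)

print(f"fp16 KV cache: {fp16 / 2**30:.2f} GiB")  # 4.00 GiB
print(f"4-bit KV cache: {q4 / 2**30:.2f} GiB")   # 1.00 GiB
print(f"savings: {1 - q4 / fp16:.0%}")           # 75%
```

Going from 16-bit to 4-bit removes 12 of every 16 bits, which is where the 75% comes from regardless of model size; the absolute GiB numbers scale linearly with layers, KV heads, and context length.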

## 🖥️ OS Compatibility

- Windows 10/11: Fully supported (via WSL2 Ubuntu). Features a one-click `.bat` launcher.
- Linux: Fully supported (native).
- macOS: Not officially supported out of the box (the backend is optimized for NVIDIA CUDA GPUs).
## ✨ Features

- TurboQuant Cache Compression: Run 8,192+ token contexts natively on 6GB GPUs without Out-Of-Memory (OOM) crashes.
- Live Hardware Analytics: The UI intercepts the C++ engine logs to report your exact VRAM allocation and savings in real time.
- Context Injector: Upload long documents (PDF, TXT, CSV, MD) directly into the chat stream to test the AI's memory limits.
- Dual-Routing: Auto-scan your local `models/` folder, or enter a custom absolute path to load any `.gguf` file.
- Cyberpunk UI: A sleek, fully responsive dark-mode dashboard built for power users.
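The Live Hardware Analytics feature boils down to scraping the backend's startup log for buffer-size lines. A minimal sketch of the idea (the log lines below are modeled on typical llama.cpp output, and the parser is illustrative, not Quansloth's actual code):

```python
import re

# Hypothetical llama.cpp-style log excerpt; real output varies by version.
SAMPLE_LOG = """\
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CUDA0 buffer size =  4155.99 MiB
llama_kv_cache_init: CUDA0 KV buffer size =  1024.00 MiB
"""

def vram_report(log_text):
    """Sum the 'buffer size = N MiB' figures per logging subsystem."""
    pattern = re.compile(r"^(\w+):.*buffer size\s*=\s*([\d.]+) MiB", re.M)
    report = {}
    for subsystem, mib in pattern.findall(log_text):
        report[subsystem] = report.get(subsystem, 0.0) + float(mib)
    return report

print(vram_report(SAMPLE_LOG))
# {'llm_load_tensors': 4155.99, 'llama_kv_cache_init': 1024.0}
```

Separating the model-weight buffers from the KV-cache buffers is what lets the dashboard show cache savings independently of the (uncompressed) weight allocation.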
## 🛠️ Prerequisites

- Windows with WSL2 (Ubuntu) OR native Linux
- NVIDIA GPU with updated drivers
- Miniconda or Anaconda installed
## 🚀 Installation

### 1. Prepare the Python Environment

```bash
conda create -n quansloth python=3.10 -y
conda activate quansloth
```

### 2. Clone the Repository and Install Requirements

```bash
git clone https://github.com/PacifAIst/Quansloth.git
cd Quansloth
pip install -r requirements.txt
```

### 3. Run the Installer

```bash
chmod +x install.sh
./install.sh
```
## 🎮 Usage

### Adding Models

Download `.gguf` models (e.g., Llama 3 8B) and place them in the `models/` folder.
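Auto-scanning that folder amounts to globbing it for `.gguf` files. A minimal sketch (the helper name is hypothetical, not Quansloth's API):

```python
from pathlib import Path

def scan_models(models_dir="models"):
    """List available .gguf files by name; empty list if the folder is missing."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob("*.gguf"))
```

Returning an empty list (rather than raising) when the folder is absent keeps a fresh checkout usable: the UI can simply show no local models and fall back to the custom-path route.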
### Start Server (Windows, 1-Click)

- Use `Launch_Quansloth.bat`
- Double-click → auto-launches WSL, Conda, and the server

### Start Server (Linux / WSL)

```bash
conda activate quansloth
python quansloth_gui.py
```

### Connect

Open http://127.0.0.1:7860 in your browser.
## 🎛️ Pro Tips

- Symmetric (Turbo3) → best overall compression
- Asymmetric (Q8/Turbo4) → better for Q4_K_M models (e.g., Qwen)
- Monitor the Hardware Stats panel for real-time VRAM savings
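For orientation, stock llama.cpp selects KV-cache precision through its `--cache-type-k`/`--cache-type-v` flags. A sketch of how mode names like those above could map onto per-cache types (the mapping and the `turbo3`/`turbo4` type names are assumptions standing in for Quansloth's custom kernels, not stock llama.cpp values):

```python
# Illustrative mapping from UI mode names to per-cache precisions. The
# "turbo3"/"turbo4" names are placeholders for Quansloth's custom kernels;
# stock llama.cpp accepts types such as "f16", "q8_0", and "q4_0".
CACHE_MODES = {
    "symmetric_turbo3": {"k": "turbo3", "v": "turbo3"},    # best overall compression
    "asymmetric_q8_turbo4": {"k": "q8_0", "v": "turbo4"},  # gentler on Q4_K_M models
    "baseline_f16": {"k": "f16", "v": "f16"},              # no compression
}

def cache_args(mode):
    """Translate a UI mode into backend --cache-type-k/-v arguments."""
    cfg = CACHE_MODES[mode]
    return ["--cache-type-k", cfg["k"], "--cache-type-v", cfg["v"]]

print(cache_args("symmetric_turbo3"))
# ['--cache-type-k', 'turbo3', '--cache-type-v', 'turbo3']
```

The asymmetric mode illustrates why K and V are configured separately: keys and values tolerate quantization differently, so a higher-precision K cache can be paired with a more aggressive V cache.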
## 📜 License & Credits

- License: This project is licensed under the Apache 2.0 License.
- Core Technology: Built upon the TurboQuant+ implementation developed by TheTom (@TheTom).
- Research & Algorithms: The underlying algorithm is based on research from Google Research (arXiv:2504.19874).
- CUDA Kernels: Special thanks to Gabe Ortiz (signalnine) for porting the CUDA kernels.
## 👤 Author

Dr. Manuel Herrador 📧 mherrador@ujaen.es
University of Jaén (UJA) - Spain
<p align="center">Made with ❤️ for the Local AI Community by PacifAIst</p>
