# 🔥 Inferno: Ignite Your Local AI Experience 🔥
<div align="center">
  <img src="https://img.shields.io/badge/Inferno-Local%20LLM%20Server-orange?style=for-the-badge&logo=python&logoColor=white" alt="Inferno Logo">
  <p><strong>Unleash the Blazing Power of Cutting-Edge LLMs on Your Own Hardware</strong></p>
  <p>
    Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other state-of-the-art language models locally with scorching-fast performance. Inferno provides an intuitive CLI and an OpenAI/Ollama-compatible API, putting the inferno of AI innovation directly in your hands.
  </p>
  <!-- Badges -->
  <p>
    <a href="LICENSE"><img src="https://img.shields.io/badge/License-HelpingAI%20Open%20Source-blue?style=flat-square" alt="License"></a>
    <a href="#requirements"><img src="https://img.shields.io/badge/Python-3.9+-blue?style=flat-square&logo=python&logoColor=white" alt="Python Version"></a>
    <a href="#installation"><img src="https://img.shields.io/badge/Platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey?style=flat-square" alt="Platform"></a>
  </p>
  <div>
    <img src="https://img.shields.io/badge/GPU-Accelerated-76B900?style=for-the-badge&logo=nvidia&logoColor=white" alt="GPU Accelerated">
    <img src="https://img.shields.io/badge/API-OpenAI%20Compatible-000000?style=for-the-badge&logo=openai&logoColor=white" alt="OpenAI Compatible">
    <img src="https://img.shields.io/badge/Models-Hugging%20Face-FFD21E?style=for-the-badge&logo=huggingface&logoColor=white" alt="Hugging Face">
  </div>
</div>

## Navigation
- ✨ Overview
- 🚀 Key Features
- ⚙️ Installation
- 🖥️ Command Line Interface (CLI)
- 🔥 Getting Started
- 📋 Usage Guide
- 🔌 API Usage
- 🐍 Native Python Client
- 🧩 Integrations
- 📦 Requirements
- 🔧 Advanced Configuration
- 🤝 Contributing
- 📄 License
- 📚 Full Documentation
## ✨ Overview
Inferno is your personal gateway to the blazing frontier of Artificial Intelligence. Designed for both newcomers and seasoned developers, it provides a powerful yet user-friendly platform to run the latest Large Language Models (LLMs) directly on your local machine. Experience the raw power of models like Llama 3.3 and Phi-4 without relying on cloud services, ensuring full control over your data and costs.
Inferno offers an experience similar to Ollama but turbo-charged with enhanced features, including seamless Hugging Face integration, advanced quantization tools, and flexible model management. Its OpenAI & Ollama-compatible APIs ensure drop-in compatibility with your favorite AI frameworks and tools.
> [!TIP]
> New to local LLMs? Inferno makes it incredibly easy to get started. Pull a model and ignite your first conversation within minutes!
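Because Inferno exposes an OpenAI-compatible API, any generic HTTP client can talk to it using the standard chat-completions wire format. The sketch below just builds such a request; the base URL and model name are placeholders, not documented defaults — substitute whatever host, port, and model your local server actually uses.

```python
import json

def build_chat_request(base_url: str, model: str, prompt: str) -> tuple:
    """Build a standard OpenAI-style chat-completion request.

    base_url and model are placeholders -- point them at wherever
    your local Inferno server is actually listening.
    """
    url = f"{base_url.rstrip('/')}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # set True to receive incremental tokens
    }
    return url, payload

url, payload = build_chat_request("http://localhost:8000", "my-local-model", "Hello!")
print(url)
print(json.dumps(payload, indent=2))
```

Existing OpenAI SDKs can be pointed at the same endpoint simply by overriding their base URL, which is what "drop-in compatibility" means in practice.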
## 🚀 Key Features
- **Bleeding-Edge Model Support**: Run the latest models such as Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and more as soon as GGUF versions are available.
- **Hugging Face Integration**: Download models with interactive file selection, repository browsing, and direct `repo_id:filename` targeting.
- **Dual API Compatibility**: Serve models through both OpenAI and Ollama compatible API endpoints. Use Inferno with almost any AI client or framework.
- **Native Python Client**: Includes a built-in, OpenAI-compatible Python client for seamless integration into your Python projects. Supports streaming, embeddings, multimodal inputs, and tool calling.
- **Interactive CLI**: Command-line interface for downloading, managing, quantizing, and chatting with models.
- **Blazing-Fast Inference**: GPU acceleration (CUDA, Metal, ROCm, Vulkan, SYCL) for faster response times. CPU acceleration via OpenBLAS is also supported.
- **Real-time Streaming**: Get instant feedback with streaming support for both chat and completions APIs.
- **Flexible Context Control**: Adjust the context window size (`n_ctx`) per model or session. The maximum context length is automatically detected from GGUF metadata.
- **Smart Model Management**: List, show details, copy, remove, and see running models (`ps`). Includes RAM requirement estimates.
- **Embeddings Generation**: Create embeddings using your local models via the API.
- **Advanced Quantization**: Convert models between various GGUF quantization levels (including importance-matrix methods like `iq4_nl`) with interactive comparison and RAM estimates.
- **Keep-Alive Management**: Control how long models stay loaded in memory when idle.
- **Fine-Grained Configuration**: Customize inference parameters such as GPU layers, threads, batch size, RoPE settings, and `mlock`.
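The model-management features above map onto an Ollama-style command-line workflow. The subcommand names in this sketch are illustrative (only `ps` is named explicitly above) — run `inferno --help` to see the actual commands your installed version provides.

```shell
# Illustrative session -- subcommand names follow the Ollama-style
# workflow described above; verify against your installed version.
inferno pull <repo_id>        # download a GGUF model from Hugging Face
inferno list                  # list downloaded models (with RAM estimates)
inferno ps                    # show models currently loaded in memory
inferno remove <model-name>   # delete a local model
```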
## ⚙️ Installation
> [!IMPORTANT]
> **Critical Prerequisite: Install `llama-cpp-python` First!** Inferno relies heavily on `llama-cpp-python`. For optimal performance, especially GPU acceleration, you MUST install `llama-cpp-python` with the correct hardware backend flags **before** installing Inferno. Failure to do this may result in suboptimal performance or CPU-only operation.
### 1. Install `llama-cpp-python` with Hardware Acceleration
Choose one of the following commands based on your hardware. See the detailed Hardware Acceleration section below for more options and explanations.
- **NVIDIA GPU (CUDA):**

  ```bash
  CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  # Or use pre-built wheels if available (see details below)
  ```

- **Apple Silicon GPU (Metal):**

  ```bash
  CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  # Or use pre-built wheels if available (see details below)
  ```

- **AMD GPU (ROCm):**

  ```bash
  CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  ```

- **CPU Only (OpenBLAS):**

  ```bash
  CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  ```

- **Other Backends (Vulkan, SYCL, etc.):** See the detailed section below.
> [!TIP]
> Using a virtual environment (like `venv` or `conda`) is highly recommended. Ensure you have Python 3.9+ and the necessary build tools (CMake, a C++ compiler) installed. Adding `--force-reinstall --upgrade --no-cache-dir` helps ensure a clean build against your system's libraries.
### 2. Install Inferno
Once `llama-cpp-python` is installed with your desired backend, you can install Inferno directly from PyPI:
```bash
# Install the latest stable release from PyPI
pip install inferno-llm
```
Or, for development or the latest features, install from source:
```bash
# Clone the Inferno repository
git clone https://github.com/HelpingAI/inferno.git
cd inferno

# Install Inferno in editable mode (recommended for development)
pip install -e .

# Or install with all optional dependencies (like quantization tools)
# pip install -e ".[dev]"
```
### Hardware Acceleration (`llama-cpp-python` Critical Prerequisite)
`llama.cpp` (the engine behind `llama-cpp-python`) supports multiple hardware-acceleration backends. You need to tell `pip` how to build `llama-cpp-python` using `CMAKE_ARGS`.

You can set `CMAKE_ARGS` either as an environment variable before running `pip install`, or directly via the `-C` / `--config-settings` flag.
**Environment Variable Method (Linux/macOS):**

```bash
CMAKE_ARGS="-DOPTION=on" pip install llama-cpp-python ...
```

**Environment Variable Method (Windows PowerShell):**

```powershell
$env:CMAKE_ARGS = "-DOPTION=on"
pip install llama-cpp-python ...
```

**CLI Method (Works Everywhere, Good for `requirements.txt`):**

```bash
# Use semicolons to separate multiple CMake args with -C
pip install llama-cpp-python -C cmake.args="-DOPTION1=on;-DOPTION2=off" ...
```
<details open>
<summary>Supported Backends (Install ONE)</summary>
- **CUDA (NVIDIA):** Requires NVIDIA drivers & CUDA Toolkit.

  ```bash
  CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  ```

  - **Pre-built Wheels (Alternative):** If you have CUDA 12.1-12.5 and Python 3.10-3.12, try:

    ```bash
    # Replace <cuda-version> with cu121, cu122, cu123, cu124, or cu125
    pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir \
      --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
    # Example: pip install ... --extra-index-url .../whl/cu121
    ```
- **Metal (Apple Silicon):** Requires macOS 11.0+ & Xcode Command Line Tools.

  ```bash
  CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  ```
