# 🔥 Inferno: Ignite Your Local AI Experience 🔥
<div align="center">
  <img src="https://img.shields.io/badge/Inferno-Local%20LLM%20Server-orange?style=for-the-badge&logo=python&logoColor=white" alt="Inferno Logo">
  <p><strong>Unleash the Blazing Power of Cutting-Edge LLMs on Your Own Hardware</strong></p>
  <p>
    Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other state-of-the-art language models locally with scorching-fast performance. Inferno provides an intuitive CLI and an OpenAI/Ollama-compatible API, putting the inferno of AI innovation directly in your hands.
  </p>
  <!-- Badges -->
  <p>
    <a href="LICENSE"><img src="https://img.shields.io/badge/License-HelpingAI%20Open%20Source-blue?style=flat-square" alt="License"></a>
    <a href="#requirements"><img src="https://img.shields.io/badge/Python-3.9+-blue?style=flat-square&logo=python&logoColor=white" alt="Python Version"></a>
    <a href="#installation"><img src="https://img.shields.io/badge/Platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey?style=flat-square" alt="Platform"></a>
  </p>
  <div>
    <img src="https://img.shields.io/badge/GPU-Accelerated-76B900?style=for-the-badge&logo=nvidia&logoColor=white" alt="GPU Accelerated">
    <img src="https://img.shields.io/badge/API-OpenAI%20Compatible-000000?style=for-the-badge&logo=openai&logoColor=white" alt="OpenAI Compatible">
    <img src="https://img.shields.io/badge/Models-Hugging%20Face-FFD21E?style=for-the-badge&logo=huggingface&logoColor=white" alt="Hugging Face">
  </div>
</div>

## Navigation
- ✨ Overview
- 🚀 Key Features
- ⚙️ Installation
- 🖥️ Command Line Interface (CLI)
- 🔥 Getting Started
- 📋 Usage Guide
- 🔌 API Usage
- 🐍 Native Python Client
- 🧩 Integrations
- 📦 Requirements
- 🔧 Advanced Configuration
- 🤝 Contributing
- 📄 License
- 📚 Full Documentation
## ✨ Overview
Inferno is your personal gateway to the blazing frontier of Artificial Intelligence. Designed for both newcomers and seasoned developers, it provides a powerful yet user-friendly platform to run the latest Large Language Models (LLMs) directly on your local machine. Experience the raw power of models like Llama 3.3 and Phi-4 without relying on cloud services, ensuring full control over your data and costs.
Inferno offers an experience similar to Ollama but turbo-charged with enhanced features, including seamless Hugging Face integration, advanced quantization tools, and flexible model management. Its OpenAI & Ollama-compatible APIs ensure drop-in compatibility with your favorite AI frameworks and tools.
> [!TIP]
> New to local LLMs? Inferno makes it incredibly easy to get started. Pull a model and ignite your first conversation within minutes!
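Because Inferno exposes an OpenAI-compatible API, any generic HTTP client can talk to it using the standard chat-completions wire format. The sketch below just builds such a request; the base URL and model name are placeholders, not documented defaults — substitute whatever host, port, and model your local server actually uses.

```python
import json

def build_chat_request(base_url: str, model: str, prompt: str) -> tuple:
    """Build a standard OpenAI-style chat-completion request.

    base_url and model are placeholders -- point them at wherever
    your local Inferno server is actually listening.
    """
    url = f"{base_url.rstrip('/')}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # set True to receive incremental tokens
    }
    return url, payload

url, payload = build_chat_request("http://localhost:8000", "my-local-model", "Hello!")
print(url)
print(json.dumps(payload, indent=2))
```

Existing OpenAI SDKs can be pointed at the same endpoint simply by overriding their base URL, which is what "drop-in compatibility" means in practice.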
## 🚀 Key Features
- **Bleeding-Edge Model Support**: Run the latest models such as Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and more as soon as GGUF versions are available.
- **Hugging Face Integration**: Download models with interactive file selection, repository browsing, and direct `repo_id:filename` targeting.
- **Dual API Compatibility**: Serve models through both OpenAI and Ollama compatible API endpoints. Use Inferno with almost any AI client or framework.
- **Native Python Client**: Includes a built-in, OpenAI-compatible Python client for seamless integration into your Python projects. Supports streaming, embeddings, multimodal inputs, and tool calling.
- **Interactive CLI**: Command-line interface for downloading, managing, quantizing, and chatting with models.
- **Blazing-Fast Inference**: GPU acceleration (CUDA, Metal, ROCm, Vulkan, SYCL) for faster response times. CPU acceleration via OpenBLAS is also supported.
- **Real-time Streaming**: Get instant feedback with streaming support for both chat and completions APIs.
- **Flexible Context Control**: Adjust the context window size (`n_ctx`) per model or session. The maximum context length is automatically detected from GGUF metadata.
- **Smart Model Management**: List, show details, copy, remove, and see running models (`ps`). Includes RAM requirement estimates.
- **Embeddings Generation**: Create embeddings using your local models via the API.
- **Advanced Quantization**: Convert models between various GGUF quantization levels (including importance-matrix methods like `iq4_nl`) with interactive comparison and RAM estimates.
- **Keep-Alive Management**: Control how long models stay loaded in memory when idle.
- **Fine-Grained Configuration**: Customize inference parameters such as GPU layers, threads, batch size, RoPE settings, and `mlock`.
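The model-management features above map onto an Ollama-style command-line workflow. The subcommand names in this sketch are illustrative (only `ps` is named explicitly above) — run `inferno --help` to see the actual commands your installed version provides.

```shell
# Illustrative session -- subcommand names follow the Ollama-style
# workflow described above; verify against your installed version.
inferno pull <repo_id>        # download a GGUF model from Hugging Face
inferno list                  # list downloaded models (with RAM estimates)
inferno ps                    # show models currently loaded in memory
inferno remove <model-name>   # delete a local model
```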
## ⚙️ Installation
> [!IMPORTANT]
> **Critical Prerequisite: Install `llama-cpp-python` First!** Inferno relies heavily on `llama-cpp-python`. For optimal performance, especially GPU acceleration, you MUST install `llama-cpp-python` with the correct hardware backend flags **before** installing Inferno. Failure to do this may result in suboptimal performance or CPU-only operation.
### 1. Install `llama-cpp-python` with Hardware Acceleration
Choose one of the following commands based on your hardware. See the detailed Hardware Acceleration section below for more options and explanations.
- **NVIDIA GPU (CUDA):**

  ```bash
  CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  # Or use pre-built wheels if available (see details below)
  ```

- **Apple Silicon GPU (Metal):**

  ```bash
  CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  # Or use pre-built wheels if available (see details below)
  ```

- **AMD GPU (ROCm):**

  ```bash
  CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  ```

- **CPU Only (OpenBLAS):**

  ```bash
  CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  ```

- **Other Backends (Vulkan, SYCL, etc.):** See the detailed section below.
> [!TIP]
> Using a virtual environment (like `venv` or `conda`) is highly recommended. Ensure you have Python 3.9+ and the necessary build tools (CMake, a C++ compiler) installed. Adding `--force-reinstall --upgrade --no-cache-dir` helps ensure a clean build against your system's libraries.
### 2. Install Inferno
Once `llama-cpp-python` is installed with your desired backend, you can install Inferno directly from PyPI:
```bash
# Install the latest stable release from PyPI
pip install inferno-llm
```
Or, for development or the latest features, install from source:
```bash
# Clone the Inferno repository
git clone https://github.com/HelpingAI/inferno.git
cd inferno

# Install Inferno in editable mode (recommended for development)
pip install -e .

# Or install with all optional dependencies (like quantization tools)
# pip install -e ".[dev]"
```
### Hardware Acceleration (`llama-cpp-python` Critical Prerequisite)
`llama.cpp` (the engine behind `llama-cpp-python`) supports multiple hardware-acceleration backends. You need to tell `pip` how to build `llama-cpp-python` using `CMAKE_ARGS`.

You can set `CMAKE_ARGS` either as an environment variable before running `pip install`, or directly via the `-C` / `--config-settings` flag.
**Environment Variable Method (Linux/macOS):**

```bash
CMAKE_ARGS="-DOPTION=on" pip install llama-cpp-python ...
```

**Environment Variable Method (Windows PowerShell):**

```powershell
$env:CMAKE_ARGS = "-DOPTION=on"
pip install llama-cpp-python ...
```

**CLI Method (Works Everywhere, Good for `requirements.txt`):**

```bash
# Use semicolons to separate multiple CMake args with -C
pip install llama-cpp-python -C cmake.args="-DOPTION1=on;-DOPTION2=off" ...
```
<details open>
<summary>Supported Backends (Install ONE)</summary>
- **CUDA (NVIDIA):** Requires NVIDIA drivers & CUDA Toolkit.

  ```bash
  CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  ```

  - **Pre-built Wheels (Alternative):** If you have CUDA 12.1-12.5 and Python 3.10-3.12, try:

    ```bash
    # Replace <cuda-version> with cu121, cu122, cu123, cu124, or cu125
    pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir \
      --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
    # Example: pip install ... --extra-index-url .../whl/cu121
    ```
- **Metal (Apple Silicon):** Requires macOS 11.0+ & Xcode Command Line Tools.

  ```bash
  CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  ```
