
Inferno

Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other state-of-the-art language models locally with scorching-fast performance. Inferno provides an intuitive CLI and an OpenAI/Ollama-compatible API, putting the inferno of AI innovation directly in your hands.


🔥 Inferno: Ignite Your Local AI Experience 🔥

<div align="center"> <img src="https://img.shields.io/badge/Inferno-Local%20LLM%20Server-orange?style=for-the-badge&logo=python&logoColor=white" alt="Inferno Logo"> <p><strong>Unleash the Blazing Power of Cutting-Edge LLMs on Your Own Hardware</strong></p> <p> Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other state-of-the-art language models locally with scorching-fast performance. Inferno provides an intuitive CLI and an OpenAI/Ollama-compatible API, putting the inferno of AI innovation directly in your hands. </p> <!-- Badges --> <p> <a href="LICENSE"><img src="https://img.shields.io/badge/License-HelpingAI%20Open%20Source-blue?style=flat-square" alt="License"></a> <a href="#requirements"><img src="https://img.shields.io/badge/Python-3.9+-blue?style=flat-square&logo=python&logoColor=white" alt="Python Version"></a> <a href="#installation"><img src="https://img.shields.io/badge/Platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey?style=flat-square" alt="Platform"></a> </p> <div> <img src="https://img.shields.io/badge/GPU-Accelerated-76B900?style=for-the-badge&logo=nvidia&logoColor=white" alt="GPU Accelerated"> <img src="https://img.shields.io/badge/API-OpenAI%20Compatible-000000?style=for-the-badge&logo=openai&logoColor=white" alt="OpenAI Compatible"> <img src="https://img.shields.io/badge/Models-Hugging%20Face-FFD21E?style=for-the-badge&logo=huggingface&logoColor=white" alt="Hugging Face"> </div> </div>

✨ Overview

Inferno is your personal gateway to the blazing frontier of Artificial Intelligence. Designed for both newcomers and seasoned developers, it provides a powerful yet user-friendly platform to run the latest Large Language Models (LLMs) directly on your local machine. Experience the raw power of models like Llama 3.3 and Phi-4 without relying on cloud services, ensuring full control over your data and costs.

Inferno offers an experience similar to Ollama but turbo-charged with enhanced features, including seamless Hugging Face integration, advanced quantization tools, and flexible model management. Its OpenAI & Ollama-compatible APIs ensure drop-in compatibility with your favorite AI frameworks and tools.
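
As a rough sketch of what that drop-in compatibility means on the wire, the snippet below builds the same chat request in both API shapes. The paths `/v1/chat/completions` (OpenAI-style) and `/api/chat` (Ollama-style) are the conventional endpoints for those APIs; the host, port, and model name are placeholders for illustration, not values documented by Inferno.

```python
import json

# One conversation, two wire formats. Both APIs take the same
# role/content message list; only the endpoint path differs.
messages = [{"role": "user", "content": "Summarize llama.cpp in one line."}]

# OpenAI-style request (placeholder host, port, and model name):
openai_style = {
    "url": "http://localhost:8000/v1/chat/completions",
    "body": {"model": "llama-3.3", "messages": messages, "stream": False},
}

# Ollama-style request against the same server (same placeholders):
ollama_style = {
    "url": "http://localhost:8000/api/chat",
    "body": {"model": "llama-3.3", "messages": messages, "stream": False},
}

# Any HTTP client can serialize and POST either one:
print(json.dumps(openai_style["body"], indent=2))
```

Because the message list is identical in both shapes, existing OpenAI or Ollama client libraries typically only need their base URL pointed at the local server.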

> [!TIP]
> New to local LLMs? Inferno makes it incredibly easy to get started. Pull a model and ignite your first conversation within minutes!

🚀 Key Features

  • Bleeding-Edge Model Support: Run the latest models such as Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and more as soon as GGUF versions are available.

  • Hugging Face Integration: Download models with interactive file selection, repository browsing, and direct repo_id:filename targeting.

  • Dual API Compatibility: Serve models through both OpenAI- and Ollama-compatible API endpoints, so Inferno works with almost any AI client or framework.

  • Native Python Client: Includes a built-in, OpenAI-compatible Python client for seamless integration into your Python projects. Supports streaming, embeddings, multimodal inputs, and tool calling.

  • Interactive CLI: Command-line interface for downloading, managing, quantizing, and chatting with models.

  • Blazing-Fast Inference: GPU acceleration (CUDA, Metal, ROCm, Vulkan, SYCL) for faster response times. CPU acceleration via OpenBLAS is also supported.

  • Real-time Streaming: Get instant feedback with streaming support for both chat and completions APIs.

  • Flexible Context Control: Adjust the context window size (n_ctx) per model or session. Max context length is automatically detected from GGUF metadata.

  • Smart Model Management: List, show details, copy, remove, and see running models (ps). Includes RAM requirement estimates.

  • Embeddings Generation: Create embeddings using your local models via the API.

  • Advanced Quantization: Convert models between various GGUF quantization levels (including importance matrix methods like iq4_nl) with interactive comparison and RAM estimates.

  • Keep-Alive Management: Control how long models stay loaded in memory when idle.

  • Fine-Grained Configuration: Customize inference parameters such as GPU layers, threads, batch size, RoPE settings, and mlock.
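
To make the embeddings feature concrete, here is a minimal stdlib-only sketch that POSTs to the standard OpenAI-style `/v1/embeddings` path. The base URL and model name are assumptions for illustration; substitute whatever your running Inferno server and your pulled models actually report.

```python
import json
from urllib import request

def embed(texts, model="nomic-embed", base_url="http://localhost:8000"):
    """Request embeddings from an OpenAI-compatible /v1/embeddings endpoint.

    `model` and `base_url` defaults are placeholders, not documented values.
    """
    payload = {"model": model, "input": texts}
    req = request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        data = json.load(resp)
    # OpenAI-style responses carry one vector per input under data[i]["embedding"]
    return [item["embedding"] for item in data["data"]]

# Example call (requires a running Inferno server):
# vectors = embed(["local models", "full data control"])
```

The same request shape works against any OpenAI-compatible embeddings backend, which is what lets vector-store and RAG tooling use a local Inferno server unchanged.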

⚙️ Installation

> [!IMPORTANT]
> **Critical Prerequisite: Install `llama-cpp-python` First!** Inferno relies heavily on `llama-cpp-python`. For optimal performance, especially GPU acceleration, you MUST install `llama-cpp-python` with the correct hardware backend flags before installing Inferno. Skipping this step may result in suboptimal performance or CPU-only operation.

1. Install llama-cpp-python with Hardware Acceleration

Choose one of the following commands based on your hardware. See the detailed Hardware Acceleration section below for more options and explanations.

  • NVIDIA GPU (CUDA):
    CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
    # Or use pre-built wheels if available (see details below)
    
  • Apple Silicon GPU (Metal):
    CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
    # Or use pre-built wheels if available (see details below)
    
  • AMD GPU (ROCm):
    CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
    
  • CPU Only (OpenBLAS):
    CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
    
  • Other Backends (Vulkan, SYCL, etc.): See the detailed section below.

> [!TIP]
> Using a virtual environment (such as venv or conda) is highly recommended. Ensure you have Python 3.9+ and the necessary build tools (CMake and a C++ compiler) installed. Adding `--force-reinstall --upgrade --no-cache-dir` helps ensure a clean build against your system's libraries.

2. Install Inferno

Once llama-cpp-python is installed with your desired backend, you can install Inferno directly from PyPI:

# Install the latest stable release from PyPI
pip install inferno-llm

Or, for development or the latest features, install from source:

# Clone the Inferno repository
git clone https://github.com/HelpingAI/inferno.git
cd inferno

# Install Inferno in editable mode (recommended for development)
pip install -e .

# Or install with all optional dependencies (like quantization tools)
# pip install -e ".[dev]"

Hardware Acceleration (llama-cpp-python Critical Prerequisite)

llama.cpp (the engine behind llama-cpp-python) supports multiple hardware acceleration backends. You need to tell pip how to build llama-cpp-python using CMAKE_ARGS.

<details> <summary>How to Set Build Options (Environment Variables vs. CLI)</summary>

You can set CMAKE_ARGS either as an environment variable before running pip install or directly via the -C / --config-settings flag.

Environment Variable Method (Linux/macOS):

CMAKE_ARGS="-DOPTION=on" pip install llama-cpp-python ...

Environment Variable Method (Windows PowerShell):

$env:CMAKE_ARGS = "-DOPTION=on"
pip install llama-cpp-python ...

CLI Method (Works Everywhere, Good for requirements.txt):

# Use semicolons to separate multiple CMake args with -C
pip install llama-cpp-python -C cmake.args="-DOPTION1=on;-DOPTION2=off" ...
</details> <details open> <summary>Supported Backends (Install ONE)</summary>
  • CUDA (NVIDIA): Requires NVIDIA drivers & CUDA Toolkit.

    CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
    
    • Pre-built Wheels (Alternative): If you have CUDA 12.1-12.5 and Python 3.10-3.12, try:
      # Replace <cuda-version> with cu121, cu122, cu123, cu124, or cu125
      pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir \
        --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
      # Example: pip install ... --extra-index-url .../whl/cu121
      
  • Metal (Apple Silicon): Requires macOS 11.0+ & Xcode Command Line Tools.

    CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

</details>