Infernal
Infernal is a powerful yet lightweight tool for running Large Language Models (LLMs) locally with blazing fast inference, comprehensive benchmarking, and model finetuning capabilities. Built on top of llama.cpp for maximum performance.
Features
- Fast Local Inference: Lightning-fast model execution using llama.cpp
- Simple Downloads: Pull models directly from HuggingFace with zero configuration
- Interactive Chat: Engage in real-time conversations with your models
- Performance Benchmarking: Detailed performance metrics and analysis
- Model Finetuning: Train custom models using LoRA/PEFT with simple Modelfiles
- Clean Management: Simple model organization with automatic configuration
- Zero Setup: No complex configuration files - just download and run
Quick Start
Installation
Method 1: Using pip (Recommended)
git clone https://github.com/chandraprvkvsh/Infernal.git
cd infernal
pip install .
Method 2: Direct installation
git clone https://github.com/chandraprvkvsh/Infernal.git
cd infernal
pip install -r requirements.txt
After Method 1, you can use the infernal command globally. For Method 2, use python infernal.py.
Basic Usage
Download a model:
Using pip installation:
infernal pull --repo-id TheBloke/Llama-2-7B-Chat-GGUF --filename llama-2-7b-chat.Q4_K_M.gguf
Or with direct URL:
infernal pull --url https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
Using direct method:
python infernal.py pull --repo-id TheBloke/Llama-2-7B-Chat-GGUF --filename llama-2-7b-chat.Q4_K_M.gguf
List your models:
infernal list
or
python infernal.py list
Run inference:
Single prompt:
infernal run llama-2-7b-chat.Q4_K_M.gguf --prompt "Explain quantum computing in simple terms"
Interactive chat:
infernal run llama-2-7b-chat.Q4_K_M.gguf --interactive
Using direct method:
python infernal.py run llama-2-7b-chat.Q4_K_M.gguf --prompt "Explain quantum computing"
Benchmark performance:
infernal benchmark llama-2-7b-chat.Q4_K_M.gguf --prompt "Write a short story about AI" --repeat 5
Using direct method:
python infernal.py benchmark llama-2-7b-chat.Q4_K_M.gguf --prompt "Write a story" --repeat 3
Remove models:
infernal remove llama-2-7b-chat.Q4_K_M.gguf
or
python infernal.py remove llama-2-7b-chat.Q4_K_M.gguf
Model Finetuning
One of Infernal's most powerful features is the ability to finetune LLMs using a simple Modelfile, similar in spirit to Docker's Dockerfile but for AI models.
Creating a Modelfile
A Modelfile is a simple text file that describes how to finetune a model. It contains the base model, training parameters, and example conversations.
Here's a complete example (Modelfile):
# Infernal Modelfile for Customer Support Bot
# Your HuggingFace token (required for private models)
HF_TOKEN your_huggingface_token_here
# Base model to finetune
FROM meta-llama/Llama-3.2-1B-Instruct
# Training configuration
PARAMETER device cuda
PARAMETER max_length 512
PARAMETER learning_rate 3e-5
PARAMETER epochs 3
PARAMETER batch_size 2
PARAMETER weight_decay 0.01
PARAMETER warmup_steps 100
# LoRA configuration for efficient training
PARAMETER lora true
PARAMETER lora_r 8
PARAMETER lora_alpha 32
PARAMETER lora_dropout 0.05
PARAMETER lora_target_modules q_proj,v_proj,k_proj,o_proj
# System prompt
SYSTEM You are a helpful customer support assistant for TechCorp. Be friendly, professional, and provide clear solutions.
# Training examples (as many as you need)
MESSAGE user How do I reset my password?
MESSAGE assistant To reset your password: 1) Go to the login page 2) Click "Forgot Password" 3) Enter your email 4) Check your email for reset instructions 5) Follow the link to create a new password. If you need further help, please contact our support team.
MESSAGE user My order hasn't arrived yet. What should I do?
MESSAGE assistant I understand your concern about your delayed order. Let me help you: 1) Check your email for tracking information 2) Use the tracking number on our website 3) If it's been more than 5 business days, please provide your order number and I'll investigate further. We apologize for any inconvenience.
MESSAGE user How do I return a defective product?
MESSAGE assistant For defective product returns: 1) Visit our Returns page 2) Select "Defective Item" as the reason 3) Upload photos of the defect 4) Print the prepaid return label 5) Package securely and ship back. You'll receive a replacement or refund within 5-7 business days after we receive the item.
MESSAGE user What's your refund policy?
MESSAGE assistant Our refund policy allows returns within 30 days of purchase. Items must be unused and in original packaging. Refunds typically process within 3-5 business days after we receive the returned item. Digital products and personalized items are non-refundable. Would you like help starting a return?
Modelfile Parameters Explained
Model Configuration:
- FROM: Base model from HuggingFace (required)
- HF_TOKEN: Your HuggingFace access token
- SYSTEM: System prompt that defines the assistant's role
Training Parameters:
- device: Training device (cuda for GPU, cpu for CPU)
- epochs: Number of training cycles (default: 3)
- batch_size: Training batch size (default: 2)
- learning_rate: Learning rate (default: 2e-5)
- max_length: Maximum sequence length (default: 2048)
- weight_decay: Weight decay for regularization
- warmup_steps: Number of warmup steps
LoRA/PEFT Parameters (for efficient training):
- lora: Enable LoRA training (true/false)
- lora_r: LoRA rank (default: 8)
- lora_alpha: LoRA alpha parameter (default: 32)
- lora_dropout: LoRA dropout rate (default: 0.05)
- lora_target_modules: Target modules for LoRA (e.g., q_proj,v_proj)
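Infernal's Modelfile parser is internal, but the directive format shown above is simple enough to sketch. The following illustrative Python reads a Modelfile into a dict; the function name and output structure are assumptions for illustration, not Infernal's actual API:

```python
def parse_modelfile(text):
    """Parse Infernal Modelfile directives into a config dict (illustrative sketch)."""
    config = {"parameters": {}, "messages": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        directive, _, rest = line.partition(" ")
        if directive == "FROM":
            config["base_model"] = rest
        elif directive == "HF_TOKEN":
            config["hf_token"] = rest
        elif directive == "SYSTEM":
            config["system"] = rest
        elif directive == "PARAMETER":
            key, _, value = rest.partition(" ")
            config["parameters"][key] = value
        elif directive == "MESSAGE":
            role, _, content = rest.partition(" ")
            config["messages"].append({"role": role, "content": content})
    return config

example = """\
FROM meta-llama/Llama-3.2-1B-Instruct
PARAMETER lora true
PARAMETER lora_r 8
SYSTEM You are a helpful assistant.
MESSAGE user How do I reset my password?
MESSAGE assistant Go to the login page and click "Forgot Password".
"""
cfg = parse_modelfile(example)
```

Note that PARAMETER values are kept as strings here; a real implementation would coerce numeric fields like lora_r and learning_rate to the appropriate types.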
Running Finetuning
Once you have your Modelfile ready:
Basic finetuning
infernal finetune --modelfile Modelfile --output my-custom-bot
Override parameters from command line
infernal finetune --modelfile Modelfile --output my-custom-bot --epochs 5 --batch-size 4
Specify a custom name for the model
infernal finetune --modelfile Modelfile --output my-custom-bot --name "customer-support-v1"
After Finetuning
The finetuning process will:
- Download the base model from HuggingFace
- Prepare training data from your MESSAGE examples
- Apply LoRA adapters for efficient training (if enabled)
- Train the model using HuggingFace Trainer
- Merge adapters back into the base model
- Save the result in standard Hugging Face format
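Step 2 above (preparing training data) amounts to folding the SYSTEM prompt and alternating user/assistant MESSAGE pairs into chat-format samples. A minimal sketch follows; the exact sample format Infernal feeds to the HuggingFace Trainer is an assumption:

```python
def build_chat_samples(system, messages):
    """Group consecutive user/assistant messages into chat-format training samples."""
    samples = []
    for i in range(0, len(messages) - 1, 2):
        user, assistant = messages[i], messages[i + 1]
        samples.append([
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ])
    return samples

samples = build_chat_samples(
    "You are a helpful customer support assistant for TechCorp.",
    ["How do I reset my password?",
     'Go to the login page and click "Forgot Password".',
     "What's your refund policy?",
     "Returns are accepted within 30 days of purchase."],
)
```

Each sample is a list of role/content dicts, the shape expected by chat templates in the HuggingFace ecosystem (e.g., tokenizer.apply_chat_template).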
You can optionally convert the merged model to GGUF format using llama.cpp's conversion tools, and push the converted model to Hugging Face if desired.
Then use your custom model:
infernal pull --url https://huggingface.co/my-username/my-custom-bot/resolve/main/my-custom-bot.gguf
infernal run my-custom-bot --interactive
Finetuning Tips
For better results:
- Use 10-50 example conversations in your Modelfile
- Keep examples focused on your specific use case
- Use consistent formatting in your MESSAGE blocks
- Test with small models first (1B-3B parameters)
For efficient training:
- Enable LoRA (PARAMETER lora true) to reduce memory usage
- Use smaller batch sizes if you run out of memory
- Start with fewer epochs (3-5) to avoid overfitting
Hardware recommendations:
- CPU training: Works but slow, use small models and batch_size=1
- GPU training: Much faster, can use larger models and batch sizes
- Memory: 8GB+ RAM minimum, 16GB+ recommended
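As a rough sanity check against these recommendations, you can estimate a model's weight footprint from its parameter count and quantization width. The ~4.85 bits/weight figure for Q4_K_M is an approximation, and real usage adds KV cache and runtime overhead on top:

```python
def approx_weight_gb(n_params_billion, bits_per_weight):
    """Rough weight-only memory estimate in GB (excludes KV cache and overhead)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

q4_7b = approx_weight_gb(7, 4.85)  # ~4.2 GB of weights for a 7B Q4_K_M model
fp16_7b = approx_weight_gb(7, 16)  # ~14 GB unquantized fp16
```

This is why a quantized 7B model fits comfortably in 8 GB of RAM, while the same model in fp16 does not.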
Benchmarking
Infernal provides comprehensive performance analysis with accurate timing measurements:
Key Metrics
- Time to First Token (TTFT): Latency before first response token
- Throughput: Tokens generated per second during generation phase
- Total Token Rate: Combined input/output processing speed
- Memory Usage: Peak RAM consumption during inference
- Generation Time: Pure text generation time (excluding TTFT)
Real-World Timing
Infernal uses actual measured timings without artificial thresholds, providing authentic performance data that reflects real-world usage patterns.
Example Output
Run 1/3 - Prompt 1
Time to first token: 0.245 seconds
Generation time (After-TTFT): 2.156 seconds
Throughput (generated tok/sec): 23.45
Total tokens/sec: 19.87
Peak memory usage: 8547.23 MB
Averages across all runs:
Total runs: 3
Total tokens generated: 150
Total time required: 6.78 seconds
Avg Time to first token: 0.251 seconds
Avg throughput (generated tok/sec): 22.89
Avg total tokens/sec: 19.34
Avg Peak memory usage: 8521.45 MB
Multiple Iterations
Run benchmarks multiple times for statistical accuracy:
infernal benchmark model.gguf --prompt "Test prompt" --repeat 10
Benchmark with Modelfiles
You can also benchmark using prompts from a Modelfile:
infernal benchmark model.gguf --promptfile Modelfile
Installation Methods Explained
Method 1: Pip Installation (pyproject.toml)
This project uses modern Python packaging with pyproject.toml instead of the traditional setup.py. This provides:
- Cleaner dependency management: All project metadata in one file
- Modern build system: Uses hatchling as the build backend
- Automatic script creation: The infernal command is automatically available system-wide
- Better development workflow: Use pip install -e . for editable installs
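A minimal pyproject.toml along these lines might look like the following; the exact metadata in Infernal's repository may differ, and the version, dependency list, and entry point shown here are illustrative assumptions:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "infernal"
version = "0.1.0"  # placeholder version
dependencies = [
    "llama-cpp-python",  # assumed dependency for llama.cpp bindings
]

[project.scripts]
infernal = "infernal:main"  # assumed entry point; this creates the global `infernal` command
```

The [project.scripts] table is what makes the infernal command available on your PATH after pip install.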
Benefits:
- Run the infernal command from anywhere
- Cleaner project structure
- Modern Python packaging standards
- Easy uninstallation with pip uninstall infernal
Method 2: Direct Requirements
Traditional approach using requirements.txt for those who prefer direct control:
- Manual dependency installation: `pip install -r requirements.txt`
- Run with `python infernal.py` instead of a global `infernal` command