Infernal
Infernal is a powerful yet lightweight tool for running Large Language Models (LLMs) locally with blazing fast inference, comprehensive benchmarking, and model finetuning capabilities. Built on top of llama.cpp for maximum performance.
Features
- Fast Local Inference: Lightning-fast model execution using llama.cpp
- Simple Downloads: Pull models directly from HuggingFace with zero configuration
- Interactive Chat: Engage in real-time conversations with your models
- Performance Benchmarking: Detailed performance metrics and analysis
- Model Finetuning: Train custom models using LoRA/PEFT with simple Modelfiles
- Clean Management: Simple model organization with automatic configuration
- Zero Setup: No complex configuration files - just download and run
Quick Start
Installation
Method 1: Using pip (Recommended)
git clone https://github.com/chandraprvkvsh/Infernal.git
cd infernal
pip install .
Method 2: Direct installation
git clone https://github.com/chandraprvkvsh/Infernal.git
cd infernal
pip install -r requirements.txt
After Method 1, you can use the infernal command globally. For Method 2, use python infernal.py.
Basic Usage
Download a model:
Using pip installation:
infernal pull --repo-id TheBloke/Llama-2-7B-Chat-GGUF --filename llama-2-7b-chat.Q4_K_M.gguf
Or with direct URL:
infernal pull --url https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
Using direct method:
python infernal.py pull --repo-id TheBloke/Llama-2-7B-Chat-GGUF --filename llama-2-7b-chat.Q4_K_M.gguf
List your models:
infernal list
or
python infernal.py list
Run inference:
Single prompt:
infernal run llama-2-7b-chat.Q4_K_M.gguf --prompt "Explain quantum computing in simple terms"
Interactive chat:
infernal run llama-2-7b-chat.Q4_K_M.gguf --interactive
Using direct method:
python infernal.py run llama-2-7b-chat.Q4_K_M.gguf --prompt "Explain quantum computing"
Benchmark performance:
infernal benchmark llama-2-7b-chat.Q4_K_M.gguf --prompt "Write a short story about AI" --repeat 5
Using direct method:
python infernal.py benchmark llama-2-7b-chat.Q4_K_M.gguf --prompt "Write a story" --repeat 3
Remove models:
infernal remove llama-2-7b-chat.Q4_K_M.gguf
or
python infernal.py remove llama-2-7b-chat.Q4_K_M.gguf
Model Finetuning
One of Infernal's most powerful features is the ability to finetune LLMs using a simple Modelfile, similar in spirit to Docker's Dockerfile but for AI models.
Creating a Modelfile
A Modelfile is a simple text file that describes how to finetune a model. It contains the base model, training parameters, and example conversations.
Here's a complete example (Modelfile):
# Infernal Modelfile for Customer Support Bot
# Your HuggingFace token (required for private models)
HF_TOKEN your_huggingface_token_here
# Base model to finetune
FROM meta-llama/Llama-3.2-1B-Instruct
# Training configuration
PARAMETER device cuda
PARAMETER max_length 512
PARAMETER learning_rate 3e-5
PARAMETER epochs 3
PARAMETER batch_size 2
PARAMETER weight_decay 0.01
PARAMETER warmup_steps 100
# LoRA configuration for efficient training
PARAMETER lora true
PARAMETER lora_r 8
PARAMETER lora_alpha 32
PARAMETER lora_dropout 0.05
PARAMETER lora_target_modules q_proj,v_proj,k_proj,o_proj
# System prompt
SYSTEM You are a helpful customer support assistant for TechCorp. Be friendly, professional, and provide clear solutions.
# Training examples (as many as you need)
MESSAGE user How do I reset my password?
MESSAGE assistant To reset your password: 1) Go to the login page 2) Click "Forgot Password" 3) Enter your email 4) Check your email for reset instructions 5) Follow the link to create a new password. If you need further help, please contact our support team.
MESSAGE user My order hasn't arrived yet. What should I do?
MESSAGE assistant I understand your concern about your delayed order. Let me help you: 1) Check your email for tracking information 2) Use the tracking number on our website 3) If it's been more than 5 business days, please provide your order number and I'll investigate further. We apologize for any inconvenience.
MESSAGE user How do I return a defective product?
MESSAGE assistant For defective product returns: 1) Visit our Returns page 2) Select "Defective Item" as the reason 3) Upload photos of the defect 4) Print the prepaid return label 5) Package securely and ship back. You'll receive a replacement or refund within 5-7 business days after we receive the item.
MESSAGE user What's your refund policy?
MESSAGE assistant Our refund policy allows returns within 30 days of purchase. Items must be unused and in original packaging. Refunds typically process within 3-5 business days after we receive the returned item. Digital products and personalized items are non-refundable. Would you like help starting a return?
Modelfile Parameters Explained
Model Configuration:
- FROM: Base model from HuggingFace (required)
- HF_TOKEN: Your HuggingFace access token
- SYSTEM: System prompt that defines the assistant's role
Training Parameters:
- device: Training device (cuda for GPU, cpu for CPU)
- epochs: Number of training cycles (default: 3)
- batch_size: Training batch size (default: 2)
- learning_rate: Learning rate (default: 2e-5)
- max_length: Maximum sequence length (default: 2048)
- weight_decay: Weight decay for regularization
- warmup_steps: Number of warmup steps
LoRA/PEFT Parameters (for efficient training):
- lora: Enable LoRA training (true/false)
- lora_r: LoRA rank (default: 8)
- lora_alpha: LoRA alpha parameter (default: 32)
- lora_dropout: LoRA dropout rate (default: 0.05)
- lora_target_modules: Target modules for LoRA (e.g., q_proj,v_proj)
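Infernal's Modelfile parser is internal, but the directive format shown above is simple enough to sketch. The following illustrative Python reads a Modelfile into a dict; the function name and output structure are assumptions for illustration, not Infernal's actual API:

```python
def parse_modelfile(text):
    """Parse Infernal Modelfile directives into a config dict (illustrative sketch)."""
    config = {"parameters": {}, "messages": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        directive, _, rest = line.partition(" ")
        if directive == "FROM":
            config["base_model"] = rest
        elif directive == "HF_TOKEN":
            config["hf_token"] = rest
        elif directive == "SYSTEM":
            config["system"] = rest
        elif directive == "PARAMETER":
            key, _, value = rest.partition(" ")
            config["parameters"][key] = value
        elif directive == "MESSAGE":
            role, _, content = rest.partition(" ")
            config["messages"].append({"role": role, "content": content})
    return config

example = """\
FROM meta-llama/Llama-3.2-1B-Instruct
PARAMETER lora true
PARAMETER lora_r 8
SYSTEM You are a helpful assistant.
MESSAGE user How do I reset my password?
MESSAGE assistant Go to the login page and click "Forgot Password".
"""
cfg = parse_modelfile(example)
```

Note that PARAMETER values are kept as strings here; a real implementation would coerce numeric fields like lora_r and learning_rate to the appropriate types.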
Running Finetuning
Once you have your Modelfile ready:
Basic finetuning
infernal finetune --modelfile Modelfile --output my-custom-bot
Override parameters from command line
infernal finetune --modelfile Modelfile --output my-custom-bot --epochs 5 --batch-size 4
Specify a custom name for the model
infernal finetune --modelfile Modelfile --output my-custom-bot --name "customer-support-v1"
After Finetuning
The finetuning process will:
- Download the base model from HuggingFace
- Prepare training data from your MESSAGE examples
- Apply LoRA adapters for efficient training (if enabled)
- Train the model using HuggingFace Trainer
- Merge adapters back into the base model
- Save the result in standard Hugging Face format
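Step 2 above (preparing training data) amounts to folding the SYSTEM prompt and alternating user/assistant MESSAGE pairs into chat-format samples. A minimal sketch follows; the exact sample format Infernal feeds to the HuggingFace Trainer is an assumption:

```python
def build_chat_samples(system, messages):
    """Group consecutive user/assistant messages into chat-format training samples."""
    samples = []
    for i in range(0, len(messages) - 1, 2):
        user, assistant = messages[i], messages[i + 1]
        samples.append([
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ])
    return samples

samples = build_chat_samples(
    "You are a helpful customer support assistant for TechCorp.",
    ["How do I reset my password?",
     'Go to the login page and click "Forgot Password".',
     "What's your refund policy?",
     "Returns are accepted within 30 days of purchase."],
)
```

Each sample is a list of role/content dicts, the shape expected by chat templates in the HuggingFace ecosystem (e.g., tokenizer.apply_chat_template).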
You can optionally convert the merged model to GGUF format using llama.cpp's conversion tools, and push the converted model to Hugging Face if desired.
Then use your custom model:
infernal pull --url https://huggingface.co/my-username/my-custom-bot/resolve/main/my-custom-bot.gguf
infernal run my-custom-bot --interactive
Finetuning Tips
For better results:
- Use 10-50 example conversations in your Modelfile
- Keep examples focused on your specific use case
- Use consistent formatting in your MESSAGE blocks
- Test with small models first (1B-3B parameters)
For efficient training:
- Enable LoRA (PARAMETER lora true) to reduce memory usage
- Use smaller batch sizes if you run out of memory
- Start with fewer epochs (3-5) to avoid overfitting
Hardware recommendations:
- CPU training: Works but slow, use small models and batch_size=1
- GPU training: Much faster, can use larger models and batch sizes
- Memory: 8GB+ RAM minimum, 16GB+ recommended
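As a rough sanity check against these recommendations, you can estimate a model's weight footprint from its parameter count and quantization width. The ~4.85 bits/weight figure for Q4_K_M is an approximation, and real usage adds KV cache and runtime overhead on top:

```python
def approx_weight_gb(n_params_billion, bits_per_weight):
    """Rough weight-only memory estimate in GB (excludes KV cache and overhead)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

q4_7b = approx_weight_gb(7, 4.85)  # ~4.2 GB of weights for a 7B Q4_K_M model
fp16_7b = approx_weight_gb(7, 16)  # ~14 GB unquantized fp16
```

This is why a quantized 7B model fits comfortably in 8 GB of RAM, while the same model in fp16 does not.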
Benchmarking
Infernal provides comprehensive performance analysis with accurate timing measurements:
Key Metrics
- Time to First Token (TTFT): Latency before first response token
- Throughput: Tokens generated per second during generation phase
- Total Token Rate: Combined input/output processing speed
- Memory Usage: Peak RAM consumption during inference
- Generation Time: Pure text generation time (excluding TTFT)
Real-World Timing
Infernal uses actual measured timings without artificial thresholds, providing authentic performance data that reflects real-world usage patterns.
Example Output
Run 1/3 - Prompt 1
Time to first token: 0.245 seconds
Generation time (After-TTFT): 2.156 seconds
Throughput (generated tok/sec): 23.45
Total tokens/sec: 19.87
Peak memory usage: 8547.23 MB
Averages across all runs:
Total runs: 3
Total tokens generated: 150
Total time required: 6.78 seconds
Avg Time to first token: 0.251 seconds
Avg throughput (generated tok/sec): 22.89
Avg total tokens/sec: 19.34
Avg Peak memory usage: 8521.45 MB
Multiple Iterations
Run benchmarks multiple times for statistical accuracy:
infernal benchmark model.gguf --prompt "Test prompt" --repeat 10
Benchmark with Modelfiles
You can also benchmark using prompts from a Modelfile:
infernal benchmark model.gguf --promptfile Modelfile
Installation Methods Explained
Method 1: Pip Installation (pyproject.toml)
This project uses modern Python packaging with pyproject.toml instead of the traditional setup.py. This provides:
- Cleaner dependency management: All project metadata in one file
- Modern build system: Uses hatchling as the build backend
- Automatic script creation: The infernal command is automatically available system-wide
- Better development workflow: Use pip install -e . for editable installs
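A minimal pyproject.toml along these lines might look like the following; the exact metadata in Infernal's repository may differ, and the version, dependency list, and entry point shown here are illustrative assumptions:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "infernal"
version = "0.1.0"  # placeholder version
dependencies = [
    "llama-cpp-python",  # assumed dependency for llama.cpp bindings
]

[project.scripts]
infernal = "infernal:main"  # assumed entry point; this creates the global `infernal` command
```

The [project.scripts] table is what makes the infernal command available on your PATH after pip install.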
Benefits:
- Run the infernal command from anywhere
- Cleaner project structure
- Modern Python packaging standards
- Easy uninstallation with pip uninstall infernal
Method 2: Direct Requirements
Traditional approach using requirements.txt for those who prefer direct control:
- Manual dependency installation: `pip install -r requirements.txt`
- Run with `python infernal.py` instead of a global `infernal` command