# VulkanIlm 🚀🔥

**GPU-Accelerated Local LLMs for Everyone** (Vulkan + *Ilm* — "knowledge")
VulkanIlm is a Python-first wrapper and CLI around llama.cpp's Vulkan backend that brings fast local LLM inference to AMD, Intel, and NVIDIA GPUs — no CUDA required. Built for developers with legacy or non-NVIDIA hardware.
## TL;DR
- What: Python library + CLI to run LLMs locally using Vulkan GPU acceleration.
- Why: Most acceleration tooling targets CUDA/NVIDIA — VulkanIlm opens up AMD & Intel users.
- Quick result: small models can run ~30× faster on iGPUs; mid-range legacy GPUs see ~4–6× speedups vs CPU (see benchmarks below).
## Key features
- 🚀 Significant speedups vs CPU on legacy GPUs and iGPUs
- 🎮 Broad GPU support: AMD, Intel, NVIDIA (via Vulkan)
- 🐍 Python-first API + easy CLI tools
- ⚡ Auto detection + GPU-specific optimizations
- 📦 Auto build/install of `llama.cpp`'s Vulkan backend
- 🔄 Real-time streaming token generation
- ✅ Reproducible benchmark scripts in `benchmarks/`
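As an illustration of what GPU-specific tuning can look like, here is a hypothetical VRAM-to-offload heuristic. This is a sketch, not VulkanIlm's actual detection code; `pick_gpu_layers` and its per-layer cost are invented for the example:

```python
def pick_gpu_layers(vram_mb: int, model_layers: int = 32, mb_per_layer: int = 180) -> int:
    """Hypothetical heuristic: offload as many layers as fit in ~80% of VRAM.

    mb_per_layer is a rough per-layer footprint for a 4-bit-quantized ~7B model;
    real tooling would derive this from the GGUF metadata instead.
    """
    budget = int(vram_mb * 0.8)          # leave headroom for KV cache & driver
    layers = budget // mb_per_layer
    return max(0, min(model_layers, layers))

# An 8 GB RX 580 can take every layer; a small iGPU carve-out takes only a few.
print(pick_gpu_layers(8192))   # 32
print(pick_gpu_layers(1536))   # 6
```

Real auto-detection also has to account for quantization, context length, and driver overhead, which is why the library exposes `gpu_layers` as a tunable rather than a fixed value.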
## Benchmarks (summary)

Benchmarks were measured with Gemma-3n-E4B-it (6.9B) unless noted. Results depend on model quantization, GPU drivers, OS, and system load.

| Hardware (OS) | Model | CPU time | Vulkan (GPU) time | Speedup |
|---|---|---:|---:|---:|
| Dell E7250 (i7-5600U, integrated GPU) — Fedora 42 Workstation | TinyLLaMA-1.1B-Chat (Q4_K_M) | 121 s | 3 s | 33× |
| AMD RX 580 8GB — Ubuntu 22.04.5 LTS (Jammy) | Gemma-3n-E4B-it (6.9B) | 188.47 s | 44.74 s | 4.21× |
| Intel Arc A770 | Gemma-3n-E4B-it (6.9B) | ~120 s | ~25 s | ~4.8× |
| AMD RX 6600 | Gemma-3n-E4B-it (6.9B) | ~90 s | ~18 s | ~5.0× |
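The speedup column is simply CPU time divided by GPU time; the RX 580 row, for instance:

```python
cpu_s, gpu_s = 188.47, 44.74   # times from the RX 580 row above
speedup = cpu_s / gpu_s
print(f"{speedup:.2f}x")       # 4.21x
```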
### iGPU notes
- The Dell E7250 iGPU result shows older integrated GPUs can be very effective for smaller LLMs when using Vulkan.
- Smaller models and appropriate quantizations are more iGPU-friendly. Driver/version differences significantly affect results.
### Other tested (functional) models

- `DeepSeek-R1-Distill-Qwen-1.5B-unsloth-bnb-4bit` — runs (not benchmarked).
- `LLaMA 3.1 8B` — runs (not benchmarked).
### ROCm / AMD notes

- ROCm is not officially supported for `gfx803` (RX 580).
- Some community members try ROCm 5/6 workarounds on the RX 580, but these are unstable and unsupported.
- VulkanIlm offers a Vulkan-based path that avoids ROCm entirely on legacy AMD cards.
## Install

### Quick start

```bash
git clone https://github.com/Talnz007/VulkanIlm.git
cd VulkanIlm
pip install -e .
```
### Prerequisites
- Python 3.9+
- Vulkan-capable GPU (AMD RX 400+, Intel Arc/Xe, NVIDIA GTX 900+)
- Vulkan drivers installed and working
### Install Vulkan tools (if needed)

Ubuntu / Debian:

```bash
sudo apt update
sudo apt install vulkan-tools libvulkan-dev
```

Fedora / RHEL:

```bash
sudo dnf install vulkan-tools vulkan-devel
```

Verify:

```bash
vulkaninfo
```
## Usage

### CLI examples

```bash
# Auto-install llama.cpp with Vulkan support
vulkanilm install

# Check your GPU setup
vulkanilm vulkan-info

# Search and download models (if supported)
vulkanilm search "llama"
vulkanilm download microsoft/DialoGPT-medium

# Generate text
vulkanilm ask path/to/model.gguf --prompt "Explain quantum computing"

# Stream tokens in real time
vulkanilm stream path/to/model.gguf "Tell me a story about AI"

# Run a benchmark
vulkanilm benchmark path/to/model.gguf --prompt "Benchmark prompt" --repeat 3
```
### Python API (example)

```python
from vulkan_ilm import Llama

# Load the model (GPU layer offload is tunable; auto-optimized per GPU)
llm = Llama("path/to/model.gguf", gpu_layers=16)

# Synchronous generation
response = llm.ask("Explain the term 'ilm' in an AI context.")
print(response)

# Streaming generation
for token in llm.stream_ask_real("Tell me about Vulkan API"):
    print(token, end='', flush=True)
```
## Reproduce benchmarks (quick checklist)

- Use the exact model file and quantization referenced in `benchmarks/` (GGUF + quantization).
- Use the benchmark script in `benchmarks/run_benchmark.sh`.
- Record the driver version, OS version, CPU frequency governor, and system load.
- Run benchmarks multiple times (cold and warm cache) and average the results.
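The last two checklist items can be sketched as a small timing harness. This is a generic sketch, not part of VulkanIlm; `run_once` stands in for whatever command or call you are benchmarking:

```python
import statistics
import time

def time_runs(run_once, repeats: int = 3):
    """Time repeats+1 runs; treat the first as the cold-cache run."""
    timings = []
    for _ in range(repeats + 1):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    cold, warm = timings[0], timings[1:]
    return cold, statistics.mean(warm), statistics.stdev(warm)

# Dummy workload standing in for an actual model-inference call.
cold, mean_s, stdev_s = time_runs(lambda: sum(i * i for i in range(100_000)))
print(f"cold={cold:.3f}s  warm mean={mean_s:.3f}s ± {stdev_s:.3f}s")
```

Reporting the warm mean with its standard deviation (and the cold run separately) makes results far easier to compare across drivers and machines.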
## Troubleshooting (Linux)

### `vulkanilm: command not found`

Activate a virtual environment and reinstall:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

Or run via Poetry:

```bash
poetry run vulkanilm install
```

### `Could NOT find Vulkan (missing: glslc)`

Install `glslc` (shipped with the Vulkan SDK / distro Vulkan tooling):

```bash
# Fedora
sudo dnf install glslc

# Ubuntu/Debian
sudo apt install vulkan-tools
```

Verify:

```bash
glslc --version
```
### `Could NOT find CURL`

Install the libcurl development package:

```bash
# Fedora
sudo dnf install libcurl-devel

# Ubuntu/Debian
sudo apt install libcurl4-openssl-dev
```
## Project structure

```
VulkanIlm/
├── vulkan_ilm/
│   ├── cli.py
│   ├── llama.py
│   ├── vulkan/
│   │   └── detector.py
│   ├── benchmark.py
│   ├── installer.py
│   └── streaming.py
├── benchmarks/        # benchmark scripts & data
├── pyproject.toml
└── README.md
```
## Contributing
We welcome contributions! Useful areas:
- GPU testing across drivers & OSes
- Additional model formats & quant recipes
- Memory & perf optimizations
- Docs, reproducible benchmarks, and examples
See CONTRIBUTING.md for details, and look for `good-first-issue` tags.
## The story behind the name

*Ilm* (علم) means knowledge or wisdom. Combined with Vulkan, it reads as "knowledge on fire": making fast local AI accessible to everyone, regardless of GPU brand or budget. 🔥
## License

MIT. See LICENSE for details.
## Links & support

- Repo: https://github.com/Talnz007/VulkanIlm
- Issues: report bugs or request features on GitHub
- Discussions: community Q&A
- 📘 Full documentation: https://talnz007.github.io/VulkanIlm/#/
Built with passion by @Talnz007 — bringing fast, local AI to legacy GPUs everywhere.
