# FlexLLama
🚀 FlexLLama - Lightweight self-hosted tool for running multiple llama.cpp server instances with OpenAI v1 API compatibility and multi-GPU support
FlexLLama is a lightweight, extensible, and user-friendly self-hosted tool that easily runs multiple llama.cpp server instances with OpenAI v1 API compatibility. It's designed to manage multiple models across different GPUs, making it a powerful solution for local AI development and deployment.
## Key Features of FlexLLama
- 🚀 Multiple llama.cpp instances - Run different models simultaneously
- 🎯 Multi-GPU support - Distribute models across different GPUs
- 🔌 OpenAI v1 API compatible - Drop-in replacement for OpenAI endpoints
- 📊 Real-time dashboard - Monitor model status with a web interface
- 🤖 Chat & Completions - Full chat and text completion support
- 🔍 Embeddings & Reranking - Supports models for embeddings and reranking
- ⚡ Auto-start - Automatically start default runners on launch
- 🔄 Model switching - Dynamically load/unload models as needed
- ⏱️ Auto model unload - Automatically unload models after a configurable idle timeout
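
Because FlexLLama's endpoints are a drop-in replacement for the OpenAI API, any OpenAI-compatible client can talk to it once a model is loaded. A minimal sketch with `curl`, assuming the default port 8080 from the local installation below and a placeholder model name (substitute one of the `model_alias` values from your `config.json`):

```bash
# "my-model" is a placeholder - use a model_alias from your config.json
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello from FlexLLama!"}]
  }'
```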

## Quickstart
🚀 Want to get started in 5 minutes? Check out our QUICKSTART.md for a simple Docker setup with the Qwen3-4B model!
### 📦 Local Installation
1. **Install FlexLLama:**

   From GitHub:

   ```bash
   pip install git+https://github.com/yazon/flexllama.git
   ```

   From local source (after cloning):

   ```bash
   # git clone https://github.com/yazon/flexllama.git
   # cd flexllama
   pip install .
   ```

2. **Create your configuration:** Copy the example configuration file to create your own. If you installed from a local clone, you can run:

   ```bash
   cp config_example.json config.json
   ```

   If you installed from git, you may need to download it from the repository.

3. **Edit `config.json`:** Update `config.json` with the correct paths for your `llama-server` binary and your model files (`.gguf`).

4. **Run FlexLLama:**

   ```bash
   python main.py config.json
   ```

   or

   ```bash
   flexllama config.json
   ```

5. **Open the dashboard:**

   http://localhost:8080
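
Before pointing a client at the server, you can confirm it is up via the health and model-listing endpoints that the test suite (see Testing below) also exercises:

```bash
# Served on the same port as the dashboard
curl http://localhost:8080/health
curl http://localhost:8080/v1/models
```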
### 🐳 Docker
FlexLLama can be run using Docker and Docker Compose. We provide profiles for CPU-only, GPU-accelerated (NVIDIA CUDA), and Vulkan GPU environments.
1. **Clone the repository:**

   ```bash
   git clone https://github.com/yazon/flexllama.git
   cd flexllama
   ```

After cloning, you can proceed with the quick start script or a manual setup.
#### Using the Quick Start Script (`docker-start.sh`) - ONE COMMAND SETUP! ✨
The docker-start.sh script provides a fully automated, plug-and-play setup. Just run ONE command and everything is configured automatically:
- Auto-detects your GPU(s)
- Auto-configures NVIDIA runtime (if needed)
- Builds the correct Docker image
- Starts the container automatically
- NVIDIA + AMD multi-GPU systems work automatically!
1. **Make the script executable (Linux/Unix):**

   ```bash
   chmod +x docker-start.sh
   ```

2. **Run ONE command - that's it!**

   For CPU-only:

   ```bash
   ./docker-start.sh          # Windows: .\docker-start.ps1
   ```

   For NVIDIA CUDA GPUs:

   ```bash
   ./docker-start.sh --gpu=cuda          # Windows: .\docker-start.ps1 -gpu cuda
   ```

   For Vulkan (AMD/Intel - works with multiple GPUs!):

   ```bash
   ./docker-start.sh --gpu=vulkan          # Windows: .\docker-start.ps1 -gpu vulkan
   ```

3. **Done! FlexLLama is running automatically!**
The script automatically:
- Detects your GPUs
- Configures NVIDIA runtime (if needed)
- Builds and starts everything
Just open: http://localhost:8090
#### Manual Docker and Docker Compose Setup
If you prefer to run the steps manually, follow this guide:
1. **Place your models:**

   ```bash
   # Create the models directory if it doesn't exist
   mkdir -p models
   # Copy your .gguf model files into it
   cp /path/to/your/model.gguf models/
   ```

2. **Configure your models:** Edit the Docker configuration to point to your models:

   - CPU-only: keep `"n_gpu_layers": 0`
   - GPU: set `"n_gpu_layers"` to e.g. 99 and specify `"main_gpu": 0`

3. **Build and start FlexLLama with Docker Compose (recommended):** Use the `--profile` flag to select your environment. The service will be available at http://localhost:8090.

   For CPU-only:

   ```bash
   docker compose --profile cpu up --build -d
   ```

   For GPU support (NVIDIA CUDA):

   ```bash
   docker compose --profile gpu up --build -d
   ```

   For Vulkan GPU support (AMD/Intel):

   ```bash
   docker compose --profile vulkan up --build -d
   ```

4. **View logs:** To monitor the output of your services, you can view their logs in real-time.

   For the CPU service:

   ```bash
   docker compose --profile cpu logs -f
   ```

   For the GPU service (CUDA):

   ```bash
   docker compose --profile gpu logs -f
   ```

   For the Vulkan service:

   ```bash
   docker compose --profile vulkan logs -f
   ```

   (Press `Ctrl+C` to stop viewing the logs.)

5. **(Alternative) Using `docker run`:** You can also build and run the containers manually.

   For CPU-only:

   ```bash
   # Build the image
   docker build -t flexllama:latest .
   # Run the container
   docker run -d -p 8090:8090 \
     -v $(pwd)/models:/app/models:ro \
     -v $(pwd)/docker/config.json:/app/config.json:ro \
     flexllama:latest
   ```

   For GPU support (NVIDIA CUDA):

   ```bash
   # Build the image
   docker build -f Dockerfile.cuda -t flexllama-gpu:latest .
   # Run the container
   docker run -d --gpus all -p 8090:8090 \
     -v $(pwd)/models:/app/models:ro \
     -v $(pwd)/docker/config.json:/app/config.json:ro \
     flexllama-gpu:latest
   ```

   For Vulkan GPU support:

   ```bash
   # Build the image
   docker build -f Dockerfile.vulkan -t flexllama-vulkan:latest .
   # Run the container (AMD/Intel GPUs)
   docker run -d --device /dev/dri:/dev/dri -p 8090:8090 \
     -v $(pwd)/models:/app/models:ro \
     -v $(pwd)/docker/config.json:/app/config.json:ro \
     flexllama-vulkan:latest
   ```

6. **Open the dashboard:** Access the FlexLLama dashboard in your browser:

   http://localhost:8090
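
The containerized setup serves the same API on the published port 8090, so you can sanity-check it from the host before opening the dashboard (the endpoints are the same ones exercised by the test suite below):

```bash
# Check that the service is up (use the profile you started with)
docker compose --profile cpu ps
# Probe the API on the published port
curl http://localhost:8090/health
curl http://localhost:8090/v1/models
```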
## Vulkan GPU Support
FlexLLama supports Vulkan-based GPU acceleration for AMD and Intel GPUs on Linux.
Prerequisites:
- Linux host with Vulkan drivers installed
- AMD GPUs: Mesa RADV drivers (`mesa-vulkan-drivers`) - works immediately
- Intel GPUs: Mesa ANV drivers (`mesa-vulkan-drivers`) - works immediately

Note: For NVIDIA GPUs, please use the CUDA backend (`--gpu=cuda`).
Configuration:
Edit your `docker/config.json` to enable Vulkan:

```json
{
  "runner": "runner1",
  "model": "/app/models/your-model.gguf",
  "model_alias": "vulkan-model",
  "n_gpu_layers": 99,
  "args": "--device Vulkan0"
}
```
You can also use the example configuration:
```bash
cp docker/config.vulkan.json docker/config.json
```
Troubleshooting:
Check if Vulkan is working inside the container:
```bash
docker exec -it <container-id> vulkaninfo --summary
```
For AMD ROCm systems, you may need to add a `/dev/kfd` device mapping in `docker-compose.yml`.
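
The Vulkan `docker run` example above already passes `/dev/dri` through; on ROCm systems the equivalent is one more `--device` flag for `/dev/kfd` (for Compose, add the same devices to the Vulkan service in `docker-compose.yml`). A sketch of the `docker run` variant:

```bash
# Same as the Vulkan example above, plus the ROCm compute device /dev/kfd
docker run -d \
  --device /dev/dri:/dev/dri \
  --device /dev/kfd:/dev/kfd \
  -p 8090:8090 \
  -v $(pwd)/models:/app/models:ro \
  -v $(pwd)/docker/config.json:/app/config.json:ro \
  flexllama-vulkan:latest
```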
Note: Vulkan support on Windows with Docker is limited. For best results on Windows, use WSL2 with the Linux instructions or use the CUDA backend instead.
## Configuration
FlexLLama is highly configurable through the config.json file. You can set up multiple runners, distribute models across GPUs, configure auto-unload timeouts, set environment variables, and much more.
📖 For detailed configuration options, examples, and advanced setups, see CONFIGURATION.md
### Quick Configuration Tips
- Edit `config.json` to add your models and runners
- Use `config_example.json` as a reference
- Validate your configuration: `python backend/config.py config.json`
- Set `auto_start_runners: true` to automatically load models on startup
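
Putting the last two tips together, a typical edit-validate-run loop (both commands appear elsewhere in this README) looks like:

```bash
# Validate the configuration first; start FlexLLama only if validation passes
python backend/config.py config.json && flexllama config.json
```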
## Testing
FlexLLama includes a comprehensive test suite to validate your setup and ensure everything is working correctly.
### Running Tests
The tests/ directory contains scripts for different testing purposes. All test scripts generate detailed logs in the tests/logs/{session_id}/ directory.
Prerequisites:
- For `test_basic.py` and `test_all_models.py`, the main application must be running (`flexllama config.json`).
- For `test_model_switching.py`, the main application should not be running.
### Basic API Tests
`test_basic.py` performs basic checks on the API endpoints to ensure they are responsive.
```bash
# Run basic tests against the default URL (http://localhost:8080)
python tests/test_basic.py
```
What it tests:

- `/v1/models` and `/health` endpoints
- `/v1/chat/completions` with both regular and streaming responses
- Concurrent request handling
### All Models Test
`test_all_models.py` runs a comprehensive test suite against every model defined in your `config.json`.
```bash
# Test all configured models
python tests/test_all_models.py config.json
```
What it tests:
- Model loading and health checks
- Chat completions (regular and streaming) for each model
- R
