
vllama

vllama is an open-source hybrid server that combines Ollama's seamless model management with vLLM's lightning-fast GPU inference, delivering a drop-in OpenAI-compatible API with optimized performance.

Install / Use

/learn @erkkimon/Vllama
About this skill

Supported platforms: Zed

README

vllama – Ultra-fast vLLM inference for Ollama

vllama is a hybrid server that brings together the best of two worlds: it combines Ollama's versatile model management with the high-speed GPU inference of vLLM. The result is an OpenAI-compatible API that serves local models with optimized performance. It runs on port 11435 as a fast alternative to Ollama (which uses port 11434), allowing you to run both simultaneously.

The server empowers you to use local large language models for programming tasks like code generation, debugging, and code completion. It is designed for efficient local LLM operations and on-device AI, serving as a powerful, private alternative to cloud-based services like GitHub Copilot.
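
Because the API is OpenAI-compatible, any client that speaks the chat-completions schema can talk to vllama. The sketch below uses only the Python standard library; the endpoint path follows the OpenAI convention described above, and the model name is a placeholder for whatever you have pulled.

```python
import json
import urllib.request

# vllama's OpenAI-compatible endpoint (Ollama itself listens on 11434)
VLLAMA_URL = "http://localhost:11435/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(model: str, prompt: str) -> str:
    """Send the request to vllama and return the assistant's reply text."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        VLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Requires a running vllama instance and a compatible pulled model.
    print(ask("huihui_ai/devstral-abliterated:latest", "Write a haiku about GPUs."))
```

The same base URL can be dropped into any OpenAI-compatible SDK or editor plugin by pointing it at http://localhost:11435/v1.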

<p align="center"> <a href="https://raw.githack.com/erkkimon/vllama/main/assets/player.html" target="_blank"> <img src="assets/ollama-vs-vllama-thumbnail.jpg" alt="Ollama vs vLLaMA Demo"> </a> </p>

Key Features:

  • On-Demand Model Loading & Unloading: Models are loaded on-demand when a request is received and automatically unloaded after 5 minutes of inactivity, freeing up VRAM and making it a true on-demand solution.
  • Automatic Context Length Optimization: vllama automatically calculates and maximizes the context length based on your available VRAM, ensuring peak performance without manual tweaking.
  • Broad Model Support: All Ollama models are automatically discovered. While vLLM's GGUF support is experimental, many models, including top performers like Devstral and DeepSeek, are proven to work.
  • Network-Wide Access: Serve models to your entire local network, enabling agents powered by local LLM and collaborative development.
  • Advanced Model Techniques: Supports models using quantization, distilled models for local programming, and techniques like model pruning to run efficiently on your hardware.


Quick Start

Running vllama inside a Docker container is the recommended method as it provides a consistent and isolated environment.

Prerequisites

  1. Docker: A working Docker installation.
  2. NVIDIA Container Toolkit: Required to run GPU-accelerated Docker containers. Please see the official installation guide for your distribution.

Install and Run

  1. Pull the latest docker image

    docker pull tomhimanen/vllama:latest
    
  2. Pull known-good models: To make sure vllama works on your system, first pull a few Ollama models that are known to be compatible with vllama, then try other models.

    ollama pull tom_himanen/deepseek-r1-roo-cline-tools:14b # proven to be compatible
    ollama pull huihui_ai/devstral-abliterated:latest # proven to be compatible
    
  3. Run the container with a single command: You can download and execute the helper script directly; it automatically detects your Ollama models path and launches the container as a background service. NB! Always review a script's contents before piping it to bash.

    curl -fsSL https://raw.githubusercontent.com/erkkimon/vllama/refs/heads/main/helpers/start_dockerized_vllama.sh | bash
    

    vllama will then be available at http://localhost:11435/v1 and, by default, is exposed to all devices on your local network. You can check that the server is alive with curl http://localhost:11435/v1/models, and follow the logs with docker logs -f vllama-service.
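
If you prefer Python to curl for the liveness check, the models endpoint can be queried the same way. The helper below assumes the response follows the standard OpenAI "list" shape ({"object": "list", "data": [{"id": ...}, ...]}).

```python
import json
import urllib.request

def model_ids(payload: dict) -> list[str]:
    """Extract model names from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]

def list_models(base_url: str = "http://localhost:11435/v1") -> list[str]:
    """Fetch the model ids served by a running vllama instance."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))

if __name__ == "__main__":
    # Requires a running vllama instance.
    print(list_models())
```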

Development

If you want to run vllama directly from the source for development, follow these steps.

  1. Clone the repository:

    git clone https://github.com/erkkimon/vllama.git
    cd vllama
    
  2. Create a Virtual Environment and Install Dependencies: This requires Python 3.12 or newer.

    python3 -m venv venv312
    source venv312/bin/activate
    pip install -r requirements.txt
    
  3. Run the application:

    python vllama.py
    

    The server will start on http://localhost:11435.

Building Docker image for development

  1. Build the Docker image: From within the repository directory, run the build command:

    docker build -t vllama-dev .
    
  2. Run the container using the helper script: Run the helper script as in the quick start instructions, but comment out the default docker run command at the end of the file and uncomment the command below it.

    ./helpers/start_dockerized_vllama.sh
    

Maintenance

Here are the common commands for managing the vllama container service.

Viewing Logs

  • View logs in real-time:

    docker logs -f vllama-service
    
  • View the last 100 lines of logs:

    docker logs --tail 100 vllama-service
    

Starting and Stopping the Service

  • Start the service (if it was previously stopped):

    docker start vllama-service
    
  • Stop the service:

    docker stop vllama-service
    
  • Remove the service permanently (stops it from starting on boot):

    docker stop vllama-service
    docker rm vllama-service
    

Updating the Container

To update vllama to the latest version, follow these steps:

  1. Navigate to your project directory and pull the latest code:

    cd /path/to/vllama
    git pull
    
  2. Stop and remove the old container:

    docker stop vllama-service
    docker rm vllama-service
    
  3. Rebuild the image with the new code:

    docker build -t vllama .
    
  4. Start the new container using the helper script:

    ./helpers/start_dockerized_vllama.sh
    

Supported Models

vllama can run any GGUF model available on Ollama, but compatibility ultimately depends on vLLM's support for the model architecture. The table below lists models that have been tested or are good candidates for local coding tasks. If you have managed to run a GGUF model using vLLM, please open an issue with the command you have used for running the GGUF model on vLLM – it helps a lot in integration work. Let's make local programming happen!

| Model Family | Status | Notes |
|---|---|---|
| Devstral | ✅ Proven to Work | Excellent performance for coding and general tasks. |
| Magistral | ✅ Proven to Work | A powerful Mistral-family model, works great. |
| DeepSeek-R1 | ✅ Proven to Work | Great for complex programming and following instructions. |
| DeepSeek-V2 / V3 | ❔ Untested | Promising for code generation and debugging. |
| Mistral / Mistral-Instruct | ❔ Untested | Lightweight and fast, good for code completion. |
| CodeLlama / CodeLlama-Instruct | ❔ Untested | Specifically fine-tuned for programming tasks. |
| Phi-3 (Mini, Small, Medium) | ❔ Untested | Strong reasoning capabilities in a smaller package. |
| Llama-3-Code | ❔ Untested | A powerful contender for local coding performance. |
| Qwen (2.5, 3, 3-VL, 3-Coder) | ❔ Untested | Strong multilingual and coding abilities. |
| Gemma / Gemma-2 | ❔ Untested | Google's open models, good for general purpose and coding. |
| StarCoder / StarCoder2 | ❔ Untested | Trained on a massive corpus of code. |
| WizardCoder | ❔ Untested | Fine-tuned for coding proficiency. |
| GLM / GLM-4 | ❔ Untested | Bilingual models with strong performance. |
| Codestral | ❔ Untested | Mistral's first code-specific model. |
| Kimi K2 | ❔ Untested | Known for its large context window capabilities. |
| Granite-Code | ❔ Untested | IBM's open-source code models. |
| CodeBERT | ❔ Untested | An early but influential code model. |
| Pythia-Coder | ❔ Untested | A model for studying LLM development. |
| Stable-Code | ❔ Untested | From the creators of Stable Diffusion. |
| Mistral-Nemo | ❔ Untested | A powerful new model from Mistral. |
| Llama-3.1 | ❔ Untested | The latest iteration of the Llama family. |
| TabNine-Local | ❔ Untested | Open variants of the popular code completion tool. |

Additionally, vllama supports loading custom GGUF models. If you create a /opt/vllama/models directory on your host system, it will be automatically mounted as a read-only volume inside the Docker container. This feature allows you to use GGUF models that are not available on Ollama Hub. For example, you can download a smaller, efficient model like Devstral-Small-2505-abliterated.i1-IQ2_M.gguf. Using smaller models is particularly useful for GPUs with lower VRAM, as it can free up resources to allow for a larger context window.
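
A quick way to see which custom GGUF files the container would pick up is to list the host directory before launching it. This is a hypothetical helper for checking your setup; vllama's own discovery logic may differ.

```python
from pathlib import Path

def find_gguf_models(models_dir: str = "/opt/vllama/models") -> list[str]:
    """Return GGUF filenames in the custom models directory, if it exists."""
    root = Path(models_dir)
    if not root.is_dir():
        # Directory absent: the volume mount simply won't apply.
        return []
    return sorted(p.name for p in root.glob("*.gguf"))

if __name__ == "__main__":
    print(find_gguf_models())
```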

Important Note on Custom Model Directory: If you create the /opt/vllama/models directory after the vllama-service container has been initially launched, you must stop and remove the existing container (e.g., docker stop vllama-service && docker rm vllama-service) and then re-run the start_dockerized_vllama.sh helper script. This ensures the new volume mount is correctly applied to the container.

Integrations with Programming Agents

One of the most powerful uses of vllama is to serve as the brain for local programming agents.
