
vllama

vllama is an open-source hybrid server that combines Ollama's seamless model management with vLLM's lightning-fast GPU inference, delivering a drop-in OpenAI-compatible API with optimized performance.

Install / Use

/learn @erkkimon/Vllama
About this skill

Supported platforms: Zed

README

vllama – Ultra-fast vLLM inference for Ollama

vllama is a hybrid server that brings together the best of two worlds: it combines Ollama's versatile model management with the high-speed GPU inference of vLLM. The result is an OpenAI-compatible API that serves local models with optimized performance. It runs on port 11435 as a fast alternative to Ollama (which uses port 11434), allowing you to run both simultaneously.

The server empowers you to use local large language models for programming tasks like code generation, debugging, and code completion. It is designed for efficient local LLM operations and on-device AI, serving as a powerful, private alternative to cloud-based services like GitHub Copilot.
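
Because the API is OpenAI-compatible, any client that speaks the chat-completions schema can talk to vllama. The sketch below uses only the Python standard library; the endpoint path follows the OpenAI convention described above, and the model name is a placeholder for whatever you have pulled.

```python
import json
import urllib.request

# vllama's OpenAI-compatible endpoint (Ollama itself listens on 11434)
VLLAMA_URL = "http://localhost:11435/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(model: str, prompt: str) -> str:
    """Send the request to vllama and return the assistant's reply text."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        VLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Requires a running vllama instance and a compatible pulled model.
    print(ask("huihui_ai/devstral-abliterated:latest", "Write a haiku about GPUs."))
```

The same base URL can be dropped into any OpenAI-compatible SDK or editor plugin by pointing it at http://localhost:11435/v1.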

<p align="center"> <a href="https://raw.githack.com/erkkimon/vllama/main/assets/player.html" target="_blank"> <img src="assets/ollama-vs-vllama-thumbnail.jpg" alt="Ollama vs vLLaMA Demo"> </a> </p>

Key Features:

  • On-Demand Model Loading & Unloading: Models are loaded on-demand when a request is received and automatically unloaded after 5 minutes of inactivity, freeing up VRAM and making it a true on-demand solution.
  • Automatic Context Length Optimization: vllama automatically calculates and maximizes the context length based on your available VRAM, ensuring peak performance without manual tweaking.
  • Broad Model Support: All Ollama models are automatically discovered. While vLLM's GGUF support is experimental, many models, including top performers like Devstral and DeepSeek, are proven to work.
  • Network-Wide Access: Serve models to your entire local network, enabling agents powered by local LLM and collaborative development.
  • Advanced Model Techniques: Supports models using quantization, distilled models for local programming, and techniques like model pruning to run efficiently on your hardware.


Quick Start

Running vllama inside a Docker container is the recommended method as it provides a consistent and isolated environment.

Prerequisites

  1. Docker: A working Docker installation.
  2. NVIDIA Container Toolkit: Required to run GPU-accelerated Docker containers. Please see the official installation guide for your distribution.

Install and Run

  1. Pull the latest docker image

    docker pull tomhimanen/vllama:latest
    
  2. Pull known-good models: To make sure vllama works on your system, first pull a few Ollama models that are known to be compatible with vllama, then try other models.

    ollama pull tom_himanen/deepseek-r1-roo-cline-tools:14b # proven to be compatible
    ollama pull huihui_ai/devstral-abliterated:latest # proven to be compatible
    
  3. Run the container with a single command: You can download and execute the helper script directly; it automatically detects your Ollama models path and launches the container as a background service. NB! Always review a script's contents before piping it to bash.

    curl -fsSL https://raw.githubusercontent.com/erkkimon/vllama/refs/heads/main/helpers/start_dockerized_vllama.sh | bash
    

    vllama will then be available at http://localhost:11435/v1 and, by default, is exposed to all devices on your local network. You can check that the server is alive with curl http://localhost:11435/v1/models, and follow the logs with docker logs -f vllama-service.
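
If you prefer Python to curl for the liveness check, the models endpoint can be queried the same way. The helper below assumes the response follows the standard OpenAI "list" shape ({"object": "list", "data": [{"id": ...}, ...]}).

```python
import json
import urllib.request

def model_ids(payload: dict) -> list[str]:
    """Extract model names from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]

def list_models(base_url: str = "http://localhost:11435/v1") -> list[str]:
    """Fetch the model ids served by a running vllama instance."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))

if __name__ == "__main__":
    # Requires a running vllama instance.
    print(list_models())
```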

Development

If you want to run vllama directly from the source for development, follow these steps.

  1. Clone the repository:

    git clone https://github.com/erkkimon/vllama.git
    cd vllama
    
  2. Create a Virtual Environment and Install Dependencies: This requires Python 3.12 or newer.

    python3 -m venv venv312
    source venv312/bin/activate
    pip install -r requirements.txt
    
  3. Run the application:

    python vllama.py
    

    The server will start on http://localhost:11435.

Building Docker image for development

  1. Build the Docker image: From within the repository directory, run the build command:

    docker build -t vllama-dev .
    
  2. Run the container using the helper script: Run the helper script as in the quick start instructions, but comment out the default docker run command at the end of the file and uncomment the command below it.

    ./helpers/start_dockerized_vllama.sh
    

Maintenance

Here are the common commands for managing the vllama container service.

Viewing Logs

  • View logs in real-time:

    docker logs -f vllama-service
    
  • View the last 100 lines of logs:

    docker logs --tail 100 vllama-service
    

Starting and Stopping the Service

  • Start the service (if it was previously stopped):

    docker start vllama-service
    
  • Stop the service:

    docker stop vllama-service
    
  • Remove the service permanently (stops it from starting on boot):

    docker stop vllama-service
    docker rm vllama-service
    

Updating the Container

To update vllama to the latest version, follow these steps:

  1. Navigate to your project directory and pull the latest code:

    cd /path/to/vllama
    git pull
    
  2. Stop and remove the old container:

    docker stop vllama-service
    docker rm vllama-service
    
  3. Rebuild the image with the new code:

    docker build -t vllama .
    
  4. Start the new container using the helper script:

    ./helpers/start_dockerized_vllama.sh
    

Supported Models

vllama can run any GGUF model available on Ollama, but compatibility ultimately depends on vLLM's support for the model architecture. The table below lists models that have been tested or are good candidates for local coding tasks. If you have managed to run a GGUF model using vLLM, please open an issue with the command you have used for running the GGUF model on vLLM – it helps a lot in integration work. Let's make local programming happen!

| Model Family | Status | Notes |
|---|---|---|
| Devstral | ✅ Proven to Work | Excellent performance for coding and general tasks. |
| Magistral | ✅ Proven to Work | A powerful Mistral-family model, works great. |
| DeepSeek-R1 | ✅ Proven to Work | Great for complex programming and following instructions. |
| DeepSeek-V2 / V3 | ❔ Untested | Promising for code generation and debugging. |
| Mistral / Mistral-Instruct | ❔ Untested | Lightweight and fast, good for code completion. |
| CodeLlama / CodeLlama-Instruct | ❔ Untested | Specifically fine-tuned for programming tasks. |
| Phi-3 (Mini, Small, Medium) | ❔ Untested | Strong reasoning capabilities in a smaller package. |
| Llama-3-Code | ❔ Untested | A powerful contender for local coding performance. |
| Qwen (2.5, 3, 3-VL, 3-Coder) | ❔ Untested | Strong multilingual and coding abilities. |
| Gemma / Gemma-2 | ❔ Untested | Google's open models, good for general purpose and coding. |
| StarCoder / StarCoder2 | ❔ Untested | Trained on a massive corpus of code. |
| WizardCoder | ❔ Untested | Fine-tuned for coding proficiency. |
| GLM / GLM-4 | ❔ Untested | Bilingual models with strong performance. |
| Codestral | ❔ Untested | Mistral's first code-specific model. |
| Kimi K2 | ❔ Untested | Known for its large context window capabilities. |
| Granite-Code | ❔ Untested | IBM's open-source code models. |
| CodeBERT | ❔ Untested | An early but influential code model. |
| Pythia-Coder | ❔ Untested | A model for studying LLM development. |
| Stable-Code | ❔ Untested | From the creators of Stable Diffusion. |
| Mistral-Nemo | ❔ Untested | A powerful new model from Mistral. |
| Llama-3.1 | ❔ Untested | The latest iteration of the Llama family. |
| TabNine-Local | ❔ Untested | Open variants of the popular code completion tool. |

Additionally, vllama supports loading custom GGUF models. If you create a /opt/vllama/models directory on your host system, it will be automatically mounted as a read-only volume inside the Docker container. This feature allows you to use GGUF models that are not available on Ollama Hub. For example, you can download a smaller, efficient model like Devstral-Small-2505-abliterated.i1-IQ2_M.gguf. Using smaller models is particularly useful for GPUs with lower VRAM, as it can free up resources to allow for a larger context window.
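
A quick way to see which custom GGUF files the container would pick up is to list the host directory before launching it. This is a hypothetical helper for checking your setup; vllama's own discovery logic may differ.

```python
from pathlib import Path

def find_gguf_models(models_dir: str = "/opt/vllama/models") -> list[str]:
    """Return GGUF filenames in the custom models directory, if it exists."""
    root = Path(models_dir)
    if not root.is_dir():
        # Directory absent: the volume mount simply won't apply.
        return []
    return sorted(p.name for p in root.glob("*.gguf"))

if __name__ == "__main__":
    print(find_gguf_models())
```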

Important Note on Custom Model Directory: If you create the /opt/vllama/models directory after the vllama-service container has been initially launched, you must stop and remove the existing container (e.g., docker stop vllama-service && docker rm vllama-service) and then re-run the start_dockerized_vllama.sh helper script. This ensures the new volume mount is correctly applied to the container.

Integrations with Programming Agents

One of the most powerful uses of vllama is to serve as the brain for local programming agents.
