
OpenArc


[!NOTE] OpenArc is under active development.

OpenArc is an inference engine for Intel devices. Serve LLMs, VLMs, Whisper, Kokoro-TTS, Embedding and Reranker models over OpenAI compatible endpoints, powered by OpenVINO on your device. Local, private, open source AI.

OpenArc 2.0 arrives with more endpoints, better UX, pipeline parallelism, NPU support and much more!

Drawing on ideas from llama.cpp, vLLM, transformers, OpenVINO Model Server, Ray, Lemonade, and other projects cited below, OpenArc has been a way for me to learn about inference engines by trying to build one myself.

Along the way a Discord community has formed around this project, which was unexpected! If you are interested in using Intel devices for AI and machine learning, feel free to stop by.

Thanks to everyone on Discord for their continued support!


Features

  • NEW! Containerization with Docker #60 by @meatposes
  • NEW! Speculative decoding support for LLMs #57 by @meatposes
  • NEW! Streaming cancellation support for LLMs and VLMs
  • Multi-GPU Pipeline Parallel
  • CPU offload/Hybrid device
  • NPU device support
  • OpenAI compatible endpoints
    • /v1/models
    • /v1/completions: llm only
    • /v1/chat/completions
    • /v1/audio/transcriptions: whisper only
    • /v1/audio/speech: kokoro only
    • /v1/embeddings: qwen3-embedding #33 by @mwrothbe
    • /v1/rerank: qwen3-reranker #39 by @mwrothbe
  • jinja templating with AutoTokenizers
  • OpenAI compatible tool calls with streaming and parallel calls
    • tool call parser currently reads "name", "argument"
  • Fully async multi engine, multi task architecture
  • Model concurrency: load and infer multiple models at once
  • Automatic unload on inference failure
  • llama-bench style benchmarking for llm w/automatic sqlite database
  • metrics on every request
    • ttft
    • prefill_throughput
    • decode_throughput
    • decode_duration
    • tpot
    • load time
    • stream mode
  • More OpenVINO examples
  • OpenVINO implementation of hexgrad/Kokoro-82M
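The OpenAI-compatible endpoints above can be driven by any OpenAI client. A minimal standard-library sketch of a `/v1/chat/completions` call follows; the `localhost:8000` address and the model name are assumptions, so adjust them to your deployment:

```python
import json
import os
import urllib.request


def build_chat_request(model, messages, stream=False):
    """Build an OpenAI-style chat.completions payload (plain dict)."""
    return {"model": model, "messages": messages, "stream": stream}


def chat(base_url, api_key, payload):
    """POST the payload to /v1/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Live call (commented out; requires a running `openarc serve`):
# reply = chat("http://localhost:8000", os.environ["OPENARC_API_KEY"],
#              build_chat_request("my-llm",  # hypothetical --model-name
#                                 [{"role": "user", "content": "Hello!"}]))
# print(reply["choices"][0]["message"]["content"])
```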

[!NOTE] Interested in contributing? Please open an issue before submitting a PR!

<div align="right">

↑ Top

</div>

Quickstart

<details id="linux"> <summary><strong style="font-size: 1.2em;">Linux</strong></summary> <br>
  1. OpenVINO requires device-specific drivers.

  2. Install uv from astral.

  3. After cloning, run:

     uv sync

  4. Activate your environment:

     source .venv/bin/activate

  5. Build the latest optimum-intel:

     uv pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"

  6. Install the latest OpenVINO and OpenVINO GenAI nightly wheels:

     uv pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

  7. Set your API key as an environment variable:

     export OPENARC_API_KEY=<api-key>

  8. To get started, run:

     openarc --help

</details> <details id="windows"> <summary><strong style="font-size: 1.2em;">Windows</strong></summary> <br>
  1. OpenVINO requires device-specific drivers.

  2. Install uv from astral.

  3. Clone OpenArc, enter the directory, and run:

     uv sync

  4. Activate your environment:

     .venv\Scripts\activate

  5. Build the latest optimum-intel:

     uv pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"

  6. Install the latest OpenVINO and OpenVINO GenAI nightly wheels:

     uv pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

  7. Set your API key as an environment variable:

     setx OPENARC_API_KEY openarc-api-key

  8. To get started, run:

     openarc --help
</details> <details id="docker"> <summary><strong style="font-size: 1.2em;">Docker</strong></summary> <br>

Instead of fighting with Intel's own Docker images, we built our own, keeping it as close to boilerplate as possible. For a primer on Docker, check out this video.

Build and run the container:

docker-compose up --build -d

Run the container:

docker run -d -p 8000:8000 openarc:latest

Enter the container:

docker exec -it openarc /bin/bash

Environment Variables

export OPENARC_API_KEY="openarc-api-key" # default, set it to whatever you want
export OPENARC_AUTOLOAD_MODEL="model_name" # model_name to load on startup
export MODEL_PATH="/path/to/your/models" # mount your models to `/models` inside the container
docker-compose up --build -d

Take a look at the Dockerfile and docker-compose for more details.

</details> <br>

[!NOTE] Need help installing drivers? Join our Discord or open an issue.

[!NOTE] uv has a pip interface which is a drop-in replacement for pip, but faster. Pretty cool, and a good place to start learning uv.
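After installing, you can sanity-check the server by listing models over the OpenAI-compatible API. A hedged standard-library sketch (the `localhost:8000` address is an assumption; use whatever host and port your `openarc serve` reports):

```python
import json
import os
import urllib.request


def auth_header(api_key):
    """Bearer-token header for OpenArc's OpenAI-compatible API."""
    return {"Authorization": f"Bearer {api_key}"}


def list_models(base_url, api_key):
    """GET /v1/models and return the list of model ids."""
    req = urllib.request.Request(
        f"{base_url}/v1/models", headers=auth_header(api_key)
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [m["id"] for m in body.get("data", [])]


# Live call (commented out; requires a running `openarc serve`):
# print(list_models("http://localhost:8000", os.environ["OPENARC_API_KEY"]))
```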

OpenArc CLI

This section documents the available CLI commands.

The OpenArc command-line tool helps you manage the server by packaging requests; every operation it performs can also be scripted programmatically, but using the CLI will help you get a feel for what the server does and how you can use it.

Getting Started

After choosing a model, use commands in this order:

  • Add model configurations with openarc add

Here's an example for Gemma 3 VLM on GPU:

openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type vlm --device GPU.0 --vlm-type gemma3

And all LLM on GPU:

openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device GPU.0

Next up:

  • Show added configurations with openarc list
  • Launch the server with openarc serve
  • Load models with openarc load
  • Check a model's status with openarc status
  • Benchmark performance, llama-bench style, with openarc-bench
  • Call utility scripts with openarc tool

Each command has groups of options offering fine-grained control over both server behavior and performance optimizations; they are documented here with examples to get you started. Use this section as a reference.

Use openarc [OPTION] --help to see available arguments at any time as you work through the reference.
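Since every CLI operation can be scripted, a thin Python wrapper is one way to automate a workflow. This sketch only uses commands documented above (`openarc list`, `openarc --help`); any other subcommand flags are left to `openarc [OPTION] --help`:

```python
import shutil
import subprocess


def openarc_cmd(*args):
    """Build an argv list for the openarc CLI."""
    return ["openarc", *args]


def run_openarc(*args):
    """Run an openarc command and return its stdout.

    Requires OpenArc installed in the active environment.
    """
    result = subprocess.run(
        openarc_cmd(*args), capture_output=True, text=True, check=True
    )
    return result.stdout


# Example (commented out; only works with openarc on PATH):
# if shutil.which("openarc"):
#     print(run_openarc("list"))      # show added model configurations
#     print(run_openarc("--help"))    # top-level reference
```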

Reference

<details id="openarc-add"> <summary><code>openarc add</code></summary> <br>

Add a model to openarc_config.json for easy loading with openarc load.

Single device

openarc add --model-name <model-name> --model-path <path/to/model> --engine <engine> --model-type <model-type> --device <target-device>

To see what options you have for --device, use openarc tool device-detect.

VLM

openarc add --model-name <model-name> --model-path <path/to/model> --engine <engine> --model-type <model-type> --device <target-device> --vlm-type <vlm-type>

Getting VLM to work the way I wanted required using VLMPipeline in ways that are not well documented. You can look at the code to see where the magic happens.

vlm-type maps a vision token for a given architecture using strings like qwen25vl, phi4mm and more. Use openarc add --help to see the available options. The server will complain if you get anything wrong, so it should be easy to figure out.
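To send an image to a loaded VLM, a chat message can carry an inline base64 image. This sketch assumes OpenArc follows the OpenAI vision message schema (`image_url` content parts with a data URL); verify against the server's error messages if it differs:

```python
import base64


def image_message(text, image_bytes, mime="image/png"):
    """Build an OpenAI-style multimodal user message with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }


# The resulting dict is sent to /v1/chat/completions for a model added with
# --model-type vlm, e.g. as {"model": "<model-name>", "messages": [msg]}.
```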

Whisper

openarc add --model-name <model-name> --model-path <path/to/whisper> --engine ovgenai --model-type whisper --device <target-device> 

Kokoro (CPU only)

openarc add --model-name <model-name> --model-path <path/to/kokoro> --engine openvino --model-type kokoro --device CPU 
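Once a Kokoro model is loaded, speech can be requested over `/v1/audio/speech`. A hedged sketch, assuming the endpoint follows OpenAI's `audio.speech` request schema; the voice id `af_heart` and the `wav` format are assumptions, not verified defaults:

```python
import json
import os
import urllib.request


def build_speech_request(model, text, voice="af_heart", fmt="wav"):
    """OpenAI-style /v1/audio/speech payload. The voice id and response
    format here are assumptions; check the Kokoro voice list."""
    return {"model": model, "input": text, "voice": voice,
            "response_format": fmt}


def synthesize(base_url, api_key, payload, out_path):
    """POST to /v1/audio/speech and write the returned audio bytes to disk."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())


# Live call (commented out; requires a loaded kokoro model):
# synthesize("http://localhost:8000", os.environ["OPENARC_API_KEY"],
#            build_speech_request("<model-name>", "Hello from OpenArc"),
#            "out.wav")
```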

Advanced Configuration Options

runtime-config accepts many options that modify OpenVINO runtime behavior for different inference scenarios. When these fail, OpenArc surfaces the C++ errors to the server, making experimentation easy.

See the OpenVINO documentation on Inference Optimization.
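To make the kind of options runtime-config accepts concrete, here is a sketch of common OpenVINO runtime properties. The property names come from OpenVINO's documentation; whether OpenArc forwards each one unchanged is an assumption to verify against the server's error reporting:

```python
# Hypothetical runtime-config values built from documented OpenVINO properties.
runtime_config = {
    "PERFORMANCE_HINT": "LATENCY",    # or "THROUGHPUT" for batch-heavy serving
    "INFERENCE_NUM_THREADS": 8,       # CPU thread count for inference
    "CACHE_DIR": "/tmp/ov_cache",     # cache compiled models between runs
}
```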
