
WilmerAI

WilmerAI is one of the oldest LLM semantic routers. It uses multi-layer prompt routing and complex workflows to let you not only create practical chatbots, but also extend any kind of application that connects to an LLM via REST API. Wilmer sits between your app and your many LLM APIs so that you can manipulate prompts as needed.

Install / Use

/learn @SomeOddCodeGuy/WilmerAI

README

WilmerAI

"What If Language Models Expertly Routed All Inference?"

DISCLAIMER:

This project is still under development. The software is provided as-is, without warranty of any kind.

This project and any expressed views, methodologies, etc., found within are the result of contributions by the maintainer and any contributors in their free time and on their personal hardware, and should not reflect upon any of their employers.

The maintainer of this project, SomeOddCodeGuy, is not doing any Contract, Freelance, or Collaboration work.


What is WilmerAI?

WilmerAI is an application designed for advanced semantic prompt routing and complex task orchestration. It originated from the need for a router that could understand the full context of a conversation, rather than just the most recent message.

Unlike simple routers that might categorize a prompt based on a single keyword, WilmerAI's routing system can analyze the entire conversation history. This allows it to understand the true intent behind a query like "What do you think it means?", recognizing it as a historical query if it was preceded by a discussion about the Rosetta Stone, rather than treating it as mere small talk.

This contextual understanding is made possible by its core: a node-based workflow engine. Like the rest of Wilmer, the routing is a workflow, categorizing through a sequence of steps, or "nodes", defined in a JSON file. The route chosen kicks off another specialized workflow, which can call more workflows from there. Each node can orchestrate different LLMs, call external tools, run custom scripts, call other workflows, and many other things.
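As a rough illustration, a routing workflow could be a JSON list of nodes: a classification step followed by a dispatch step. The field names and file names below are invented for the sketch, not WilmerAI's actual schema:

```python
import json

# Hypothetical workflow definition -- field names are illustrative only,
# not WilmerAI's actual JSON schema.
routing_workflow = [
    {
        "title": "Categorize the request",
        "type": "llm",                    # call an LLM to classify intent
        "endpoint": "local-small-model",  # a fast local model is enough here
        "prompt": "Given the full conversation, pick one category: CODING, FACTUAL, CREATIVE.",
    },
    {
        "title": "Dispatch to the chosen workflow",
        "type": "route",                  # hand off to a specialized workflow
        "routes": {
            "CODING": "coding-workflow.json",
            "FACTUAL": "factual-workflow.json",
            "CREATIVE": "creative-workflow.json",
        },
    },
]

print(json.dumps(routing_workflow, indent=2))
```

The point is the shape, not the syntax: each node does one job, and the router node's output decides which workflow file runs next.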

To the client application, this entire multi-step process appears as a standard API call, enabling advanced backend logic without requiring changes to your existing front-end tools.
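For example, a client that already speaks the OpenAI chat-completions format needs no changes; it just points at Wilmer's address. The host, port, and model name below are placeholders, not Wilmer defaults:

```python
import json
import urllib.request

# Placeholder address -- use whatever host/port your Wilmer instance listens on.
WILMER_URL = "http://localhost:5006/v1/chat/completions"

payload = {
    "model": "my-workflow",  # with workflow selection, this can name a workflow
    "messages": [
        {"role": "user", "content": "What do you think it means?"},
    ],
}

request = urllib.request.Request(
    WILMER_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request) would send it; Wilmer runs the whole
# multi-node workflow and returns one standard completion response.
```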


Maintainer's Note - UPDATED 2026-03-29

I've been on a tear with Wilmer lately, and this is probably the biggest batch of changes since the workflow engine refactor. The short version: you don't need to run multiple Wilmer instances anymore.

That was always the thing that bugged me the most about how Wilmer worked. You'd end up with this pile of running instances, each with their own config, and it was a pain to manage. So I finally sat down and fixed it.

Here's what's new:

  • Multi-user support. You can now launch Wilmer with --User alice --User bob (as many as you need), and each user gets their own config, conversation files, memories, and log directory. Wilmer figures out who's making the request and routes everything to the right place.

  • Concurrency controls. The --concurrency and --concurrency-timeout flags let you gate how many requests run at once. By default only one request processes at a time (which is what you want for most local Mac setups), and everything else queues up instead of stepping on it. You can crank it up if your backend can handle it, like NVIDIA setups.

  • Per-user file isolation. Discussion ID files and some other per-session stuff now live in user-specific directories. When you've got multiple users on one instance, this keeps everyone's files from piling into one big folder.

  • API key support. If a request comes in with an Authorization: Bearer key, Wilmer uses that key to bundle files into isolated per-key directories. This is a second layer of bundling on top of the per-user separation.

  • EXPERIMENTAL: Optional encryption. You can enable per-API-key Fernet encryption for stored files. If you turn it on, Wilmer uses your API key to encrypt the loose files it generates. There's also a re-keying script if you ever need to rotate keys. (NOTE: Doesn't yet affect sqlite dbs)

  • More memory and context options. I've added a couple of new tools for managing long conversations: an automatic memory condensation layer for file-based memories, and a ContextCompactor workflow node for token-aware conversation compaction. More on those in the docs, but the memory condenser is a big help on long chats. Short version: it generates N memories, and when it hits that point it takes those N memories and rewrites them as 1, then keeps going. So if N is 3, it writes 3, rewrites them down to 1, then writes 3 more, rewrites those as 1, etc. Instead of 6 memories, you get 2. If you're writing memories every 10,000 tokens, that's 60,000 tokens summarized down to two small 500-1000 token memories.

  • Image handling improvements. Fixed a longstanding design issue in Wilmer's ImageProcessor: images are now tracked per-message from the moment they come in all the way through to LLM dispatch, so they stay tied to the conversation turn that produced them. This also let me add caching in the image processor (when a discussion ID is active), so recurring image calls don't have to reprocess the same data every time.
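The memory condensation described above is easy to sketch in a few lines. This is a toy version of the N-to-1 pattern only; the rewrite step is stubbed out with string joining, where Wilmer would use an LLM call:

```python
def condense(memories, n, rewrite):
    """Every time n memories accumulate, rewrite them into one."""
    condensed, buffer = [], []
    for memory in memories:
        buffer.append(memory)
        if len(buffer) == n:
            condensed.append(rewrite(buffer))  # an LLM call in practice
            buffer = []
    return condensed + buffer  # leftovers stay un-condensed

# Stub rewrite: join the batch into one summary string.
summarize = lambda batch: "summary(" + "; ".join(batch) + ")"

six = [f"memory {i}" for i in range(1, 7)]
print(condense(six, 3, summarize))  # 6 memories become 2 condensed ones
```

With N=3 and memories written every 10,000 tokens, 60,000 tokens of history collapse into two short summaries, matching the arithmetic in the note above.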

In a previous recent release, I also added shared workflow collections and workflow selection via the API model field. The /v1/models and /api/tags endpoints now return your available workflows, which means front-ends like Open WebUI will show them right in the model dropdown; you just pick the workflow the same way you'd pick a model. Shared workflow folders (_shared/) let multiple users point at the same workflow sets without duplicating config all over the place, and also let one user have a bunch of workflows. So instead of having a coding workflow as one user, a general workflow as another, etc., you get one user with multiple available workflows under it.

Shared Workflows in Open WebUI
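Assuming the endpoint follows the standard OpenAI models-list shape, a front-end populates its dropdown roughly like this. The workflow names are invented for illustration:

```python
# A hypothetical /v1/models response from Wilmer with shared workflows
# enabled -- each entry is a workflow, not a physical model.
models_response = {
    "object": "list",
    "data": [
        {"id": "coding-workflow", "object": "model"},
        {"id": "general-workflow", "object": "model"},
        {"id": "wikipedia-rag-workflow", "object": "model"},
    ],
}

# A front-end like Open WebUI lists the ids in its model dropdown; picking
# one sets the "model" field of the next request, which selects the workflow.
available = [entry["id"] for entry in models_response["data"]]
print(available)
```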

The example users and workflows that ship with Wilmer are overdue for an update to reflect all of this. That's next on my list.

-Socg

The Power of Workflows

Semi-Autonomous Workflows Let You Determine Which Tools Run, and When

The demo below shows Open WebUI connected to two instances of Wilmer (recorded before multi-user support was added; a single instance can now serve multiple users). The first instance hits Mistral Small 3 24b directly, while the second makes a call to the Offline Wikipedia API before calling the same model.

No-RAG vs RAG (click the image to play the gif if it doesn't start automatically)

Iterative LLM Calls To Improve Performance

A zero-shot prompt to an LLM may not give great results, but follow-up questions will often improve them. If you regularly ask the same follow-up questions for tasks like software development, building a workflow to automate those steps can pay off significantly.
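The pattern is simple: take the first answer, then feed it back with each of your stock follow-up prompts in turn. A minimal sketch, with the LLM stubbed out so it runs standalone:

```python
def iterative_refine(task, llm, followups):
    """Ask once, then apply the same follow-up prompts you would ask by hand."""
    draft = llm(task)
    for followup in followups:
        # Each pass feeds the previous answer back in with a critique prompt.
        draft = llm(f"{followup}\n\nPrevious answer:\n{draft}")
    return draft

# Stub LLM so the sketch is runnable; a real workflow node would call an API.
stub_llm = lambda prompt: f"answer({len(prompt)} chars)"

result = iterative_refine(
    "Write a function that parses dates.",
    stub_llm,
    ["Check the code for edge cases and fix any you find.",
     "Add docstrings and type hints."],
)
print(result)
```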

Distributed LLMs

With workflows, you can have as many LLMs working together in a single call as you have computers to support them. For example, if you have old machines lying around that can run 3-8b models, you can put them to use as worker LLMs in various nodes. The more LLM APIs you have available, whether on your own home hardware or via proprietary APIs, the more powerful you can make your workflow network. A single prompt to Wilmer could reach out to 5+ computers, including proprietary APIs, depending on how you build your workflow.
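One way to picture the fan-out: the same prompt goes to several worker endpoints in parallel, and the drafts come back for a stronger model to reconcile. The endpoints below are invented and the network call is stubbed; this is the pattern, not Wilmer's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker endpoints -- e.g. small models on spare machines.
WORKERS = {
    "old-laptop": "http://192.168.1.20:8080",
    "mini-pc": "http://192.168.1.21:8080",
    "cloud-api": "https://api.example.com",
}

def ask(worker_url, prompt):
    # Stubbed: a real node would POST the prompt to the worker's API.
    return f"{worker_url} -> draft for: {prompt}"

def fan_out(prompt):
    """Send the same prompt to every worker in parallel and collect drafts."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(ask, url, prompt)
                   for name, url in WORKERS.items()}
        return {name: future.result() for name, future in futures.items()}

drafts = fan_out("Summarize the Rosetta Stone.")
print(len(drafts))  # one draft per worker
```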

Some (Not So Pretty) Pictures to Help People Visualize What It Can Do

Example of A Simple Assistant Workflow Using the Prompt Router

Single Assistant Routing to Multiple LLMs

Example of How Routing Might Be Used

Prompt Routing Example

Group Chat to Different LLMs

Groupchat to Different LLMs

Example of a UX Workflow Where A User Asks for a Website

Oversimplified Example Coding Workflow

Key Features

  • Advanced Contextual Routing The primary function of WilmerAI. It directs user requests using sophisticated, context-aware logic. This is handled by two mechanisms:

    • Prompt Routing: At the start of a conversation, it analyzes the user's prompt to select the most appropriate specialized workflow (e.g., "Coding," "Factual," "Creative").
    • In-Workflow Routing: During a workflow, it provides conditional "if/then" logic, allowing a process to dynamically choose its next step based on the output of a previous node.

    Crucially, these routing decisions can be based on the entire conversation history, not just the user's last messages, allowing for a much deeper understanding of intent.


  • Core: Node-Based Workflow Engine The foundation that powers the routing and all other logic. WilmerAI processes requests using workflows, which are JSON files that define a sequence of steps (nodes). Each node performs a specific task, and its output can be passed as input to the next, enabling complex, chained-thought processes.

  • Multi-LLM & Multi-Tool Orchestration Each node in a workflow can connect to a completely different LLM endpoint or execute a tool. This allows you to orchestrate the best model for each part of a task -- for example, using a small, fast local model for summarization and a large, powerful cloud model for the final reasoning, all within a single workflow.

  • Modular & Reusable Workflows You can build self-contained workflows for common tasks (like searching a database or summarizing content) and reuse them as single nodes within larger workflows.
View on GitHub · 808 stars · 49 forks · Python · updated 1d ago

Security score: 100/100 (audited Mar 29, 2026, no findings)