Artificial Intelligence Controller Interface (AICI)

The LLGuidance library is an actively maintained evolution and specialization of AICI. It is the recommended choice if all you need is constrained decoding.

The Artificial Intelligence Controller Interface (AICI) lets you build Controllers that constrain and direct the output of a Large Language Model (LLM) in real time. Controllers are flexible programs capable of implementing constrained decoding, dynamic editing of prompts and generated text, and coordinating execution across multiple, parallel generations. Controllers incorporate custom logic during token-by-token decoding and maintain state during an LLM request. This allows diverse Controller strategies, from programmatic or query-based decoding to multi-agent conversations, to execute efficiently in tight integration with the LLM itself.
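As a rough illustration of what "constraining token-by-token decoding while keeping state" can mean, here is a minimal plain-Python sketch. It does not use the AICI API: the `DigitsOnlyController` class, the `constrained_argmax` helper, and the toy logits are all hypothetical and exist only to show the idea of restricting each decoding step to an allowed token set.

```python
def constrained_argmax(logits: dict[str, float], allowed: set[str]) -> str:
    """Pick the highest-scoring token among those the controller allows."""
    candidates = {tok: score for tok, score in logits.items() if tok in allowed}
    if not candidates:
        raise ValueError("no allowed token is present in the vocabulary")
    return max(candidates, key=candidates.get)


class DigitsOnlyController:
    """Hypothetical controller: digits only, then force end-of-sequence."""

    def __init__(self, max_len: int) -> None:
        self.max_len = max_len
        self.emitted = 0  # state kept across decoding steps

    def allowed_tokens(self) -> set[str]:
        if self.emitted >= self.max_len:
            return {"<eos>"}
        return set("0123456789")

    def observe(self, token: str) -> None:
        self.emitted += 1


# Toy, fixed logits standing in for a real model's per-step scores.
logits = {"a": 3.0, "7": 1.5, "2": 0.5, "<eos>": 2.0}

ctrl = DigitsOnlyController(max_len=2)
output = []
while True:
    tok = constrained_argmax(logits, ctrl.allowed_tokens())
    if tok == "<eos>":
        break
    output.append(tok)
    ctrl.observe(tok)

print("".join(output))
```

Although the unconstrained best token is "a", the controller's mask means only digits are ever emitted, and its step counter decides when generation must stop.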

The purpose of AICI is to make it easy to build and experiment with both existing and entirely new Controller strategies for improving LLM generations. By abstracting away implementation details of the underlying LLM inference and serving engine, AICI aims to simplify the development of Controllers, make it easier to write fast Controllers, and ease compatibility across LLM inference and serving engines.

AICI is designed for both local and cloud execution, including (eventually) multi-tenant LLM deployments. Controllers are implemented as light-weight WebAssembly (Wasm) modules which run on the same machine as the LLM inference engine, utilizing the CPU while the GPU is busy with token generation. AICI is one layer in the inference stack, and is designed to allow control libraries such as Guidance, LMQL, and others to run on top of it and gain both efficiency and performance improvements, as well as portability across LLM inference and serving engines.

AICI currently integrates with llama.cpp, HuggingFace Transformers, and rLLM (a custom tch-based LLM inference engine), with vLLM in the works.

AICI is:

Flexible: Controllers can be written in any language that can compile to Wasm (Rust, C, C++, ...), or be interpreted inside Wasm (Python, JavaScript, ...)
Secure: Controllers are sandboxed and cannot access the filesystem, network, or any other resources
Fast: Wasm modules are compiled to native code and run in parallel with the LLM inference engine, inducing only a minimal overhead to the generation process

AICI is a prototype, designed and built at Microsoft Research.

Artificial Intelligence Controller Interface (AICI)
QuickStart: Example Walkthrough
Comprehensive Guide: Exploring Further
Architecture
Security
Performance
Flexibility
Acknowledgements
Contributing
Trademarks

QuickStart: Example Walkthrough

In this quickstart, we'll guide you through the following steps:

Set up rLLM Server and AICI Runtime.
Build and deploy a Controller.
Use AICI to control LLM output, so you can customize an LLM to follow specific rules when generating text.

Development Environment Setup

To compile AICI components, you need to set up your development environment for Rust. For this quickstart you also need Python 3.11 or later to create a controller.

Windows WSL / Linux / macOS

Note: Windows users, please use WSL2 or the included devcontainer. Adding native Windows support is tracked here.

macOS users: please make sure you have the Xcode command line tools installed by running xcode-select -p and, if they are not installed, run xcode-select --install.

CUDA: the CUDA build relies on a specific libtorch installation. It's highly recommended that you use the included devcontainer.

If you're using devcontainer, you can skip to the next section.

Using the system package manager, install the necessary tools for building code in the repository, including git, cmake and ccache.

For instance, in WSL / Ubuntu using apt:

sudo apt-get install --assume-yes --no-install-recommends \
    build-essential cmake ccache pkg-config libssl-dev libclang-dev clang llvm-dev git-lfs

or using Homebrew on macOS:

brew install git cmake ccache

Then install Rust, Rustup and Cargo, following the instructions provided here and here:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

After installation, verify that the rustup --version command is accessible by running it from the terminal. If the command isn't recognized, try opening a new terminal session.

Next, install the wasm32-wasi Rust target:

rustup target add wasm32-wasi

If you already had Rust installed, or are getting complaints from Cargo about outdated versions, run:

rustup update

Finally, to work with Python controllers and scripts (like the one in this tutorial), run this command to install the required packages:

pip install pytest pytest-forked ujson posix_ipc numpy requests
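Before building, it can help to confirm that the prerequisites above are actually reachable from your shell. The following is a small sketch, not part of the AICI repository: the tool list and the helper names `missing_tools` and `python_ok` are assumptions chosen for this example, based on the tools named in this section.

```python
import shutil
import sys

# Tools mentioned in the setup steps above; adjust to your platform.
REQUIRED_TOOLS = ["git", "cmake", "ccache", "rustup", "cargo"]


def missing_tools(tools: list[str]) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 11)) -> bool:
    """Check the interpreter against the Python 3.11+ requirement."""
    return tuple(version_info[:2]) >= minimum


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found on PATH.")

if not python_ok():
    print("Python 3.11 or later is required for this quickstart.")
```

If rustup is reported missing right after installation, opening a new terminal session usually refreshes PATH.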

Build and start rLLM server and AICI Runtime

The rLLM server has two backends, one based on libtorch and CUDA (rllm-cuda), and the other based on llama.cpp (rllm-llamacpp).

The rllm-cuda backend only works with NVIDIA GPUs with compute capability 8.0 or later (A100 and later; RTX 30x0 and later) and requires a fiddly setup of libtorch; it's strongly recommended to use the included devcontainer. While this guide focuses on the rllm-llamacpp backend, the build steps are the same for rllm-cuda, apart from the folder name.

After setting up the development environment as described above, clone the AICI repository and proceed with the steps below.

Use the following command to build and run aicirt and rllm-llamacpp:

cd rllm/rllm-llamacpp
./server.sh phi2

You can pass other model names as the argument (run ./server.sh without arguments to see the available models). You can also use a HuggingFace URL to a .gguf file or a local path to a .gguf file. (For rllm-cuda, use a HuggingFace model id or a path to a folder.) For example:

./server.sh orca

You can find more details about rllm-llamacpp here.

The rLLM server provides an HTTP interface, used for configuration tasks and for processing requests. You can also use this interface to quickly check the server's status. For instance, if you open http://127.0.0.1:4242/v1/models, you should see:

{
  "object": "list",
  "data": [
    {
      "object": "model",
      "id": "TheBloke/phi-2-GGUF",
      "created": 946810800,
      "owned_by": "owner"
    }
  ]
}

confirming that the selected model is loaded.
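The same check can be scripted. The sketch below uses only the Python standard library and assumes the server is listening on the address shown above; the helper names `fetch_models` and `model_ids` are made up for this example and are not part of AICI. The parsing step is demonstrated on the sample response from this section, so it runs even when no server is up.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract the model ids from a /v1/models response body."""
    return [
        entry["id"]
        for entry in payload.get("data", [])
        if entry.get("object") == "model"
    ]


def fetch_models(base_url: str = "http://127.0.0.1:4242", timeout: float = 5.0) -> dict:
    """Query the running server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return json.load(resp)


# Sample response copied from the text above.
sample = {
    "object": "list",
    "data": [
        {
            "object": "model",
            "id": "TheBloke/phi-2-GGUF",
            "created": 946810800,
            "owned_by": "owner",
        }
    ],
}

print(model_ids(sample))
```

With the server running, `model_ids(fetch_models())` should list the model you passed to ./server.sh.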

Control AI output using AICI controllers

AICI allows hosting custom logic, called Controllers, that initiates, terminates, and interacts with LLM token generation. Controllers take input arguments, process them, and return a result with logs, LLM tokens, and variables.

The repository includes some examples, in particular:

jsctrl: a controller that accepts JavaScript code as input for execution. This code can interact with the model to generate text and tokens.
pyctrl: a controller that accepts Python code as input for execution. This code can also interact with the model to generate text and tokens.

In this example, we'll use pyctrl to manage token generation with a simple Python script. You can build and upload pyctrl yourself if you want; by default, however, the server automatically downloads the latest release of pyctrl from GitHub.

In general, controllers require building and deployment, while scripts (Python or JavaScript) are sent with each request.

The following illustrates the relationship between the rLLM server, the AICI runtime, and the controller:

erDiagram
    Host    ||--|{ CPU : ""
    Host    ||--|{ GPU : ""
    
    CPU     ||--|| "rLLM Server" : execute
    CPU     ||--|{ "AICI Runtime" : execute

    "AICI Runtime" ||--|| "Controller" : instantiate

    GPU     ||--|{ "LLM token generation" : execute

Controlling the LLM token generation

Suppose we aim for a model to generate a list, adhering to a specific format and containing only five items.

Typically, achieving this involves prompt engineering, crafting the prompt precisely with clear instructions, such as:

What are the five most popular types of vehicles?
Return the result as a numbered list.
Do not add explanations, only the list.

The prompt would also vary depending on the model in use, given that each model tends to add explanations and understands instructions in different ways.

With AICI, we shift control back to code, and we can simplify the prompt to:

What are the most popular types of vehicles?

using code to:

Limit the list to 5 items
Prevent the model from adding some initial explanation
Format to a numbered list
Stop the model from adding some text after the list.
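To make these four steps concrete without depending on any AICI API, here is a plain-Python simulation. Everything in it is hypothetical: `stub_generate` is a stand-in for a model that tends to ramble, and `gen_until` and `list_of_five` are names invented for this sketch. It only illustrates the control flow of forcing the item number, letting the model write one line, and cutting off anything else.

```python
from typing import Callable, Iterator

VEHICLES = ["cars", "trucks", "motorcycles", "buses", "bicycles", "scooters", "vans"]


def stub_generate(context: str) -> Iterator[str]:
    """Hypothetical model: names the next unused vehicle, then rambles."""
    used = sum(1 for vehicle in VEHICLES if vehicle in context)
    yield from f" {VEHICLES[used]}\nHere is some extra explanation...\n"


def gen_until(generate: Callable[[str], Iterator[str]], context: str, stop_at: str) -> str:
    """Consume generated characters until the stop string appears."""
    out = ""
    for ch in generate(context):
        out += ch
        if out.endswith(stop_at):
            break
    return out


def list_of_five(prompt: str, generate: Callable[[str], Iterator[str]]) -> str:
    text = prompt
    start = len(text)  # remember where the answer begins
    for i in range(1, 6):  # limit the list to 5 items
        text += f"{i}."  # force the numbered-list format
        text += gen_until(generate, text, "\n")  # one item per line, no trailing text
    return text[start:]


result = list_of_five("What are the most popular types of vehicles?\n", stub_generate)
print(result)
```

The stub always tries to add an explanation after each item, but the loop stops reading at the first newline, so the explanation never reaches the result.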

Let's create a list-of-five.py Python file with the following content:

import pyaici.server as aici

# Force the model to generate a well formatted list of 5 items, e.g.
#   1. name 1
#   2. name 2
#   3. name 3
#   4. name 4
#   5. name 5
asy