Toolio
GenAI & agent toolkit for Apple Silicon Mac, implementing JSON schema-steered structured output (3SO) and tool-calling in Python. For more on 3SO: https://huggingface.co/blog/ucheog/llm-power-steering
♪ Come along and ride on a fantastic voyage 🎵, with AI riding shotgun seat and a flatbed full of tools.
Toolio is an OpenAI-like HTTP server API implementation which supports structured LLM response generation (e.g. make it conform to a JSON schema). It also implements tool calling by LLMs. Toolio is based on the MLX framework for Apple Silicon (e.g. M1/M2/M3/M4 Macs), so that's the only supported platform at present.
Whether the buzzword you're pursuing is tool-calling, function-calling, agentic workflows, compound AI, guaranteed structured output, schema-driven output, guided generation, or steered response, give Toolio a try, in your own private setting.
Builds on: https://github.com/otriscon/llm-structured-output/
Schema-steered structured output (3SO)
There is sometimes confusion over the various ways to constrain LLM output:
- You can basically beg the model through prompt engineering (detailed instructions, few-shot, etc.), then attempt generation, check the results, and retry if it doesn't conform (perhaps with further LLM begging in the re-prompt). This gives uneven results, is slow and wasteful, and ends up requiring much more powerful LLMs.
- Toolio's approach, which we call schema-steered structured output (3SO), is to convert the input format of the grammar (JSON schema in this case) into a state machine which applies those rules as hard constraints on the output sampler. Rather than begging the LLM, we steer it.
In either case you get better results if you've trained or fine-tuned the model with a lot of examples of the desired output syntax and structure, but the LLM's size, power and training are only part of the picture with 3SO.
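To make the steering idea concrete, here is a toy sketch. It is not Toolio's actual implementation: the vocabulary, the hand-written transition table and the stand-in "model" are all assumptions made for illustration. A real implementation would derive the state machine from the JSON schema and apply it to a real tokenizer's logits.

```python
import math

# Toy vocabulary; a real tokenizer has tens of thousands of entries
VOCAB = ['{', '}', '"guess"', ':', '0', '1', '2', '3', '4', '5', 'Sure', ' ']

# Hand-written state machine for the tiny grammar {"guess":<digit>}
# Maps state -> {allowed token: next state}
TRANSITIONS = {
    'start': {'{': 'key'},
    'key': {'"guess"': 'colon'},
    'colon': {':': 'value'},
    'value': {d: 'close' for d in '012345'},
    'close': {'}': 'done'},
}

def steer(logits, state):
    """Mask every token the state machine disallows, then pick greedily."""
    allowed = TRANSITIONS[state]
    masked = [score if tok in allowed else -math.inf
              for tok, score in zip(VOCAB, logits)]
    best = max(range(len(VOCAB)), key=masked.__getitem__)
    token = VOCAB[best]
    return token, allowed[token]

def generate(model):
    """Run the steered sampling loop until the grammar is complete."""
    state, out = 'start', []
    while state != 'done':
        token, state = steer(model(out), state)
        out.append(token)
    return ''.join(out)

def chatty_model(tokens_so_far):
    """Stand-in model that always prefers the chatty token 'Sure'."""
    return [0.1, 0.1, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 5.0, 1.0]

print(generate(chatty_model))  # {"guess":5}
```

Even though the stand-in model would rather start with "Sure", the mask leaves it no choice but to follow the grammar. The model's preferences still matter wherever the grammar allows several tokens, which is why it picks 5 for the value.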
Specific components and usage modes
- toolio_server (command line): Host MLX-format LLMs for structured output query or function calling via HTTP requests
- toolio_request (command line): Execute HTTP client requests against a server
- toolio.local_model_runner (Python API): Encapsulate an MLX-format LLM for convenient, in-resident query with structured output or function calling
- toolio.client.struct_mlx_chat_api (Python API): Make a toolio server request from code
We'd love your help, though! Click to learn how to make contributions to the project.
The video "Toolio in 10 minutes" (https://www.youtube.com/embed/9DpQYbteakc) is an easy way to learn about the project.
Installation
As simple as:
pip install toolio
If you're not sure, you can check that you're on an Apple Silicon Mac.
python -c "import platform; assert 'arm64' in platform.platform()"
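If you'd rather have a readable answer than an assertion error, the same check can be written as a small script. This is just a sketch; the function name is made up for illustration. Note that a Python build running under Rosetta may report x86_64 even on Apple Silicon hardware.

```python
import platform

def is_apple_silicon(system=None, machine=None):
    """True when running macOS on an ARM64 (Apple Silicon) CPU.

    The arguments default to the current platform; they exist only so the
    check can be exercised with other values.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == 'Darwin' and machine == 'arm64'

if __name__ == '__main__':
    if is_apple_silicon():
        print('Apple Silicon Mac detected')
    else:
        print('Not an Apple Silicon Mac (or Python is running under Rosetta)')
```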
Host a server
Use toolio_server to host MLX-format LLMs for structured output query or function-calling. For example, you can host an MLX version of Meta's Llama 3.2 3B Instruct.
toolio_server --model=mlx-community/Llama-3.2-3B-Instruct-4bit
This will download the model from the HuggingFace path mlx-community/Llama-3.2-3B-Instruct-4bit to your local disk cache. The 4bit at the end means you are downloading a version quantized to 4 bits, so that each parameter in the neural network, which would normally take up 16 bits, only takes up 4, in order to save memory and boost speed. There are about 3 billion parameters, so this version will take up roughly 2GB on your disk, and running it will take up about the same amount of your unified RAM.
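The size arithmetic is easy to reproduce for any model. The sketch below is a rough estimate only: the helper name is made up, and real files tend to come out somewhat larger than the raw figure, since quantized formats also store scaling metadata and may keep some layers at higher precision.

```python
def quantized_size_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# A 3-billion-parameter model at 16 bits vs. 4 bits per parameter
print(quantized_size_gb(3e9, 16))  # 6.0
print(quantized_size_gb(3e9, 4))   # 1.5

# An 8-billion-parameter model at 4 bits per parameter
print(quantized_size_gb(8e9, 4))   # 4.0
```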
To learn more about the MLX framework for ML workloads (including LLMs) on Apple Silicon, see the MLX Notes article series. The "Day One" article provides all the context you need for using local LLMs with Toolio.
There are many hundreds of models you can select. One bit of advice is that Toolio, for now, tends to work better with base or base/chat models, rather than instruct-tuned models.
cURLing the Toolio server
Try out a basic request, not using any of Toolio's special features, but rather using the LLM as is:
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H 'Content-Type: application/json' \
-d '{
"messages": [{"role": "user", "content": "I am thinking of a number between 1 and 10. Guess what it is."}],
"temperature": 0.1
}'
This request does not constrain the output to any structure; it just uses the LLM as is. The result will be in complex-looking JSON, but read on for more straightforward ways to query a Toolio server.
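If you do want to work with the raw HTTP API from code, the same request can be sketched with nothing but the Python standard library. The helper names are made up for illustration, and the response parsing assumes the OpenAI-style shape (choices[0].message.content), which is an assumption based on Toolio being OpenAI-like rather than something verified here.

```python
import json
import urllib.request

def build_chat_request(prompt, apibase='http://localhost:8000', temperature=0.1):
    """Build the same POST request as the cURL example."""
    body = {
        'messages': [{'role': 'user', 'content': prompt}],
        'temperature': temperature,
    }
    return urllib.request.Request(
        f'{apibase}/v1/chat/completions',
        data=json.dumps(body).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )

def first_message_text(response_json):
    """Pull the reply text out of an OpenAI-style response (assumed shape)."""
    return response_json['choices'][0]['message']['content']

def ask(prompt):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return first_message_text(json.load(resp))
```

With a server running, `ask('I am thinking of a number between 1 and 10. Guess what it is.')` should return just the text of the reply.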
Specifying an output JSON schema
Here is a request that does constrain return structure:
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H 'Content-Type: application/json' \
-d '{
"messages": [{"role": "user", "content": "I am thinking of a number between 1 and 10. Guess what it is."}],
"response_format": {
"type": "json_object",
"schema": "{\"type\": \"object\",\"properties\": {\"guess\": {\"type\": \"number\"}}}"
},
"temperature": 0.1
}'
The key here is specification of a JSON schema. The schema is escaped for the command line shell above, so here it is in its regular form:
{"type": "object", "properties": {"guess": {"type": "number"}}}
This describes a response such as:
{"guess": 5}
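The backslash escaping in the cURL request is there because the schema is passed as a JSON string inside the JSON request body. When building the request from Python, you can let json.dumps do that escaping for you. A minimal sketch, reusing the field names from the cURL example:

```python
import json

# The schema in its regular form, as a Python dict
schema = {'type': 'object', 'properties': {'guess': {'type': 'number'}}}

payload = {
    'messages': [{
        'role': 'user',
        'content': 'I am thinking of a number between 1 and 10. Guess what it is.',
    }],
    # First dumps: the schema becomes a string, as in the cURL example
    'response_format': {'type': 'json_object', 'schema': json.dumps(schema)},
    'temperature': 0.1,
}

# Second dumps: serializing the whole body escapes the inner quotes
body = json.dumps(payload)
print(body)
```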
JSON schemas may look a bit intimidating at first if you're not familiar with them, but they're reasonably easy to learn. You can follow the primer.
Or you can just paste an example of your desired output structure and ask ChatGPT, Claude, Gemini, etc.—or of course our favorite local LLM via Toolio. "Please write a JSON schema to represent this data format: [response format example]"
Toolio's JSON schema support is a subset, so you might need to tweak a schema before using it with Toolio. Most of the unsupported features can be just omitted, or expressed in the prompt or schema descriptions instead.
Using the command line client instead
cURL is a pretty raw interface for this, though. For example, you have to parse the resulting response JSON. It's a lot easier to use the more specialized command line client tool toolio_request. Here is the equivalent to the first cURL example, above:
toolio_request --apibase="http://localhost:8000" --prompt="I am thinking of a number between 1 and 10. Guess what it is."
This time you'll just get the straightforward response text, e.g. "Sure, I'll guess 5. Is that your number?"
Here is an example using JSON schema constraint to extract structured data from an unstructured sentence.
export LMPROMPT='Which countries are mentioned in the sentence "Adamma went home to Nigeria for the hols"? Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#'
export LMSCHEMA='{"type": "array", "items": {"type": "object", "properties": {"name": {"type": "string"}, "continent": {"type": "string"}}, "required": ["name", "continent"]}}'
toolio_request --apibase="http://localhost:8000" --prompt="$LMPROMPT" --schema="$LMSCHEMA"
(…and yes, in practice a smaller, specialized entity extraction model might be a better option for a case this simple)
Notice the #!JSON_SCHEMA!# cutout, which Toolio replaces for you with the actual schema you've provided.
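Conceptually, the cutout is simple text substitution. The sketch below shows the idea; the helper name is made up, and Toolio's actual handling may differ in its details.

```python
import json

SCHEMA_CUTOUT = '#!JSON_SCHEMA!#'

def fill_schema_cutout(prompt, schema):
    """Replace the cutout marker with the schema serialized as JSON text."""
    schema_text = schema if isinstance(schema, str) else json.dumps(schema)
    return prompt.replace(SCHEMA_CUTOUT, schema_text)

prompt = 'Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#'
print(fill_schema_cutout(prompt, {'type': 'array', 'items': {'type': 'string'}}))
```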
With any decent LLM you should get the following and no extraneous text cluttering things up!
[{"name": "Nigeria", "continent": "Africa"}]
Or if you have the prompt or schema written to files:
echo 'Which countries are mentioned in the sentence "Adamma went home to Nigeria for the hols"? Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#' > /tmp/llmprompt.txt
echo '{"type": "array", "items": {"type": "object", "properties": {"name": {"type": "string"}, "continent": {"type": "string"}}, "required": ["name", "continent"]}}' > /tmp/countries.schema.json
toolio_request --apibase="http://localhost:8000" --prompt-file=/tmp/llmprompt.txt --schema-file=/tmp/countries.schema.json
Tool calling
You can run tool usage (function-calling) prompts, a key technique in LLM agent frameworks. A schema will automatically be generated from the tool specs, which themselves are based on JSON Schema, according to OpenAI conventions.
echo 'What'\''s the weather like in Boulder today?' > /tmp/llmprompt.txt
echo '{"tools": [{"type": "function","function": {"name": "get_current_weather","description": "Get the current weather in a given location","parameters": {"type": "object","properties": {"location": {"type": "string","description": "City and state, e.g. San Francisco, CA"},"unit": {"type": "string","enum": ["℃","℉"]}},"required": ["location"]}}}], "tool_choice": "auto"}' > /tmp/toolspec.json
toolio_request --apibase="http://localhost:8000" --prompt-file=/tmp/llmprompt.txt -
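Once the model has chosen a tool, something still has to run it. The sketch below shows the general dispatch pattern, assuming an OpenAI-style tool call (a function name plus JSON arguments). The get_current_weather body is a hypothetical stand-in, and this is not a description of how Toolio itself invokes tools.

```python
import json

def get_current_weather(location, unit='℃'):
    """Hypothetical stand-in; a real tool would query a weather service."""
    return f'Sunny in {location}, 22{unit}'

# Registry mapping tool names from the tool spec to Python callables
TOOLS = {'get_current_weather': get_current_weather}

def dispatch(tool_call):
    """Run one tool call shaped like {'function': {'name', 'arguments'}}."""
    fn = tool_call['function']
    args = fn['arguments']
    if isinstance(args, str):  # arguments often arrive as a JSON string
        args = json.loads(args)
    return TOOLS[fn['name']](**args)

call = {'function': {'name': 'get_current_weather',
                     'arguments': '{"location": "Boulder, CO"}'}}
print(dispatch(call))  # Sunny in Boulder, CO, 22℃
```

Because the arguments are generated under the schema derived from the tool spec, they can be passed to the function as keyword arguments without defensive parsing of free-form text.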
