
ZML

Any model. Any hardware. Zero compromise. Built with @ziglang / @openxla / MLIR / @bazelbuild


<div align="center"> <img src="https://raw.githubusercontent.com/zml/zml.github.io/refs/heads/main/docs-assets/zml-banner.png" style="width:100%; height:120px;"> <a href="https://zml.ai">Website</a> | <a href="#getting-started">Getting Started</a> | <a href="./docs/README.md">Documentation</a> | <a href="https://discord.gg/6y72SN2E7H">Discord</a> | <a href="./CONTRIBUTING.md">Contributing</a> </div>

About

ZML is a production inference stack, purpose-built to decouple AI workloads from proprietary hardware.

Any model, any hardware, one codebase, peak performance.

Models are compiled directly for NVIDIA, AMD, TPU, and Trainium hardware, delivering peak performance on any accelerator. No rewriting required.

It is built using the Zig language, MLIR, and Bazel.

Getting Started

Prerequisites

We use Bazel to build ZML and its dependencies. The only prerequisite is Bazel itself, which we recommend installing through Bazelisk.

macOS

brew install bazelisk

Linux

curl -L -o /usr/local/bin/bazel 'https://github.com/bazelbuild/bazelisk/releases/download/v1.28.0/bazelisk-linux-amd64'
chmod +x /usr/local/bin/bazel

30-Second Smoke Test

Run the MNIST example:

bazel run //examples/mnist

This downloads a small pretrained MNIST model, compiles it, loads the weights, and classifies a random handwritten digit.

LLM Quickstart

The main LLM example is //examples/llm. It currently supports:

  • Llama 3.1 / 3.2
  • Qwen 3.5
  • LFM 2.5

Authenticate with Hugging Face if you want to load gated repos such as Meta Llama:

bazel run //tools/hf -- auth login

Alternatively, set the HF_TOKEN environment variable.
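If you prefer the environment-variable route, a minimal sketch looks like this (the token value is a placeholder; substitute your own access token from your Hugging Face account settings):

```shell
# Export a Hugging Face access token so gated repos can be downloaded.
# "hf_your_token_here" is a placeholder, not a real token.
export HF_TOKEN="hf_your_token_here"

# Confirm the variable is visible to child processes such as bazel.
echo "HF_TOKEN is ${HF_TOKEN:+set}"
```

The exported variable is inherited by any `bazel run` invocation launched from the same shell.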

Then run a prompt directly:

bazel run //examples/llm -- --model=hf://meta-llama/Llama-3.2-1B-Instruct --prompt="What is the capital of France?"

Open the interactive chat loop by omitting --prompt:

bazel run //examples/llm -- --model=hf://meta-llama/Llama-3.2-1B-Instruct

You can also load from:

  • a local directory: --model=/var/models/meta-llama/Llama-3.2-1B-Instruct
  • S3: --model=s3://bucket/path/to/model
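All three source kinds go through the same `--model` flag. As an illustration, here is a small hypothetical wrapper (not part of ZML) that prints the `bazel run` command it would issue for a given model source, so the shape of each invocation is easy to compare:

```shell
#!/bin/sh
# Illustrative helper (hypothetical, not shipped with ZML): print the
# `bazel run` command for a given model source and prompt.
run_llm() {
  model="$1"
  prompt="$2"
  # The same --model flag accepts hf://, s3://, or a local filesystem path.
  echo bazel run //examples/llm -- --model="$model" --prompt="$prompt"
}

run_llm "/var/models/meta-llama/Llama-3.2-1B-Instruct" "What is the capital of France?"
run_llm "s3://bucket/path/to/model" "What is the capital of France?"
```

Dropping the `echo` would execute the commands instead of printing them.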

Running Models on GPU / TPU

Append one or more platform flags when compiling or running:

  • NVIDIA CUDA: --@zml//platforms:cuda=true
  • AMD ROCm: --@zml//platforms:rocm=true
  • Google TPU: --@zml//platforms:tpu=true
  • AWS Trainium / Inferentia 2: --@zml//platforms:neuron=true
  • Disable CPU compilation: --@zml//platforms:cpu=false

Example on CUDA:

bazel run //examples/llm --@zml//platforms:cuda=true -- --model=hf://meta-llama/Llama-3.2-1B-Instruct --prompt="Write a haiku about Zig"

Example on ROCm:

bazel run //examples/llm --@zml//platforms:rocm=true -- --model=hf://meta-llama/Llama-3.2-1B-Instruct --prompt="Write a haiku about Zig"

Run Tests

bazel test //zml:test

Examples

A Taste Of ZML

const Mnist = struct {
    fc1: Layer,
    fc2: Layer,

    const Layer = struct {
        weight: zml.Tensor,
        bias: zml.Tensor,

        pub fn init(store: zml.io.TensorStore.View) Layer {
            return .{
                .weight = store.createTensor("weight", .{ .d_out, .d }, null),
                .bias = store.createTensor("bias", .{.d_out}, null),
            };
        }

        pub fn forward(self: Layer, input: zml.Tensor) zml.Tensor {
            return self.weight.dot(input, .d).add(self.bias).relu().withTags(.{.d});
        }
    };

    pub fn init(store: zml.io.TensorStore.View) Mnist {
        return .{
            .fc1 = .init(store.withPrefix("fc1")),
            .fc2 = .init(store.withPrefix("fc2")),
        };
    }

    pub fn load(
        self: *const Mnist,
        allocator: std.mem.Allocator,
        io: std.Io,
        platform: *const zml.Platform,
        store: *const zml.io.TensorStore,
        shardings: []const zml.sharding.Sharding,
    ) !zml.Bufferized(Mnist) {
        return zml.io.load(Mnist, self, allocator, io, platform, store, .{
            .shardings = shardings,
            .parallelism = 1,
            .dma_chunks = 1,
            .dma_chunk_size = 16 * 1024 * 1024,
        });
    }

    pub fn unloadBuffers(self: *zml.Bufferized(Mnist)) void {
        self.fc1.weight.deinit();
        self.fc1.bias.deinit();
        self.fc2.weight.deinit();
        self.fc2.bias.deinit();
    }

    /// just two linear layers + relu activation
    pub fn forward(self: Mnist, input: zml.Tensor) zml.Tensor {
        var x = input.flatten().convert(.f32).withTags(.{.d});
        const layers: []const Layer = &.{ self.fc1, self.fc2 };
        for (layers) |layer| {
            x = layer.forward(x);
        }
        return x.argMax(0).indices.convert(.u8);
    }
};

Where To Go Next

For a full walkthrough, see the documentation in ./docs/README.md.

Contributing

See CONTRIBUTING.md.

License

ZML is licensed under the Apache 2.0 license.

Thanks To Our Contributors

<a href="https://github.com/zml/zml/graphs/contributors"> <img src="https://contrib.rocks/image?repo=zml/zml" /> </a>
