PyFlame
Native Deep Learning Framework for Cerebras WSE
PRE-RELEASE ALPHA 2.0
This software is in early development and is not yet ready for production use. APIs may change without notice. Use at your own risk. The project is moving toward beta, but transforms.py still contains placeholder functions rather than working implementations.
PyFlame is a tensor computation library designed natively for the Cerebras Wafer-Scale Engine (WSE), featuring lazy evaluation, automatic CSL code generation, and a Python-first API.
A Quick Note For Developers
The Nvidia vendor lock-in created by CUDA/PyTorch has become a multi-million-dollar nightmare for many enterprises that now struggle to get GPUs for their own AI projects. Understand what that means.
I released PyFlame and the very next day my phone was ringing almost off the hook. By the end of the next day I had secured THREE consulting agreements at $250k each and started discussions on a contract that, as of this writing (just 6 days after the release of PyFlame), is looking like $2.8M.
Understand what that means. I found a massive pain point, figured out how to solve it, actually GAVE AWAY the solution (sort of), and now I've secured over $3.5 million in revenue.
Want to learn the SYSTEM I use to do that? (This is the 9th time I've done this exact thing, this exact way.)
Just go to: https://oaqlabs.com/vibe.html
About OA Quantum Labs
PyFlame is developed by OA Quantum Labs, a specialized engineering firm focused on high-performance and quantum computing.
What We Do
In the context of the PyFlame project we help organizations unlock the full potential of specialized hardware through custom developer tools, optimized frameworks, and performance engineering:
- Custom Framework Development — Native tooling designed for your specific accelerator architecture
- Performance Optimization — Squeeze maximum throughput from your existing hardware investments
- Migration & Porting — Adapt existing ML workloads to new accelerator platforms
- Training & Enablement — Get your team productive on specialized hardware faster
Why Work With Us
PyFlame demonstrates our approach: rather than forcing general-purpose tools onto specialized hardware, we build native solutions that leverage the unique strengths of each architecture. The result is dramatically better performance and a more intuitive developer experience.
If your organization is working with specialized AI accelerators, FPGAs, or custom silicon, we'd love to discuss how purpose-built tooling could transform your development workflow.
Get In Touch
Danny Wall — CTO, OA Quantum Labs
dwall@oaqlabs.com | oaqlabs.com
Features
- Native WSE Design: Built from the ground up for Cerebras architecture
- Lazy Evaluation: Computation graphs are built lazily and executed on demand
- CSL Code Generation: Automatic generation of optimized CSL kernels
- 2D Mesh Layouts: First-class support for tensor distribution across PEs
- Python + C++ API: Use from Python or C++ with the same abstractions
- NumPy Interoperability: Easy conversion to/from NumPy arrays
- PyTorch-like API: Familiar nn.Module system for building models
- Automatic Differentiation: Full autograd support for training
- Complete Training Stack: Optimizers, loss functions, and LR schedulers
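The lazy-evaluation model listed above can be illustrated with a toy sketch in plain Python. This is purely conceptual and is not PyFlame's internal design; all class and function names here are made up for illustration:

```python
# Toy illustration of lazy evaluation: operations build a graph,
# and no arithmetic runs until the graph is explicitly evaluated.
class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))

def const(v):
    return Node("const", value=v)

def eval_node(n):
    # Recursively evaluate the graph on demand.
    if n.op == "const":
        return n.value
    a, b = (eval_node(i) for i in n.inputs)
    return a + b if n.op == "add" else a * b

x = const(3)
y = const(4)
z = x * y + const(1)   # builds a graph; nothing is computed yet
print(eval_node(z))    # 13
```

PyFlame applies the same idea at tensor granularity: `pf.eval()` triggers execution of the recorded graph, which is what makes whole-graph optimization and CSL code generation possible.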
Project Status
Version: Pre-Release Alpha 2.0
Phase 1 (Core Infrastructure) - Complete
- [x] Core tensor class with lazy evaluation
- [x] Computation graph (IR) system
- [x] Shape inference
- [x] Elementwise operations (add, mul, relu, sigmoid, etc.)
- [x] Reduction operations (sum, mean, max, min)
- [x] Matrix multiplication
- [x] CSL code generation framework
- [x] Python bindings via pybind11
- [x] CPU reference implementation
Phase 2 (ML Primitives) - Complete
- [x] Automatic differentiation (autograd)
- [x] Neural network module system (nn.Module)
- [x] Linear layers (Linear)
- [x] Convolutional layers (Conv1d, Conv2d)
- [x] Normalization layers (BatchNorm, LayerNorm, GroupNorm)
- [x] Pooling layers (MaxPool, AvgPool, AdaptivePool)
- [x] Dropout layers
- [x] Multi-head attention
- [x] Loss functions (MSE, CrossEntropy, BCE, etc.)
- [x] Optimizers (SGD, Adam, AdamW, RMSprop)
- [x] Learning rate schedulers
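As a conceptual illustration of what the autograd item above provides (not PyFlame's actual implementation), reverse-mode differentiation can be sketched in a few lines of Python:

```python
# Minimal reverse-mode autograd sketch: each Var records its parents
# and the local gradient needed to propagate back through the op.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # sequence of (parent, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Accumulate this node's gradient, then apply the chain rule.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x = Var(2.0)
y = Var(3.0)
z = x * y + x          # dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

A real framework additionally topologically sorts the graph and handles tensors rather than scalars, but the gradient-accumulation pattern is the same.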
Requirements
- CMake 3.18+
- C++17 compiler (GCC 9+, Clang 10+, MSVC 2019+)
- Python 3.8+
- pybind11 2.10+
Cerebras SDK (Optional)
PyFlame includes a CPU reference implementation that allows you to develop, test, and validate your models without access to Cerebras hardware. All tensor operations, graph building, and CSL code generation work without the SDK.
To actually execute on Cerebras WSE hardware, you need:
- Cerebras SDK - This is proprietary software available only to Cerebras customers and partners. It is not publicly downloadable.
- Access to Cerebras hardware - Either on-premises CS-2/CS-3 systems or Cerebras Cloud.
Supported deployment options:
| Environment | Runtime Address | Notes |
|-------------|-----------------|-------|
| On-premises CS-2/CS-3 | localhost:9000 or system IP | Direct hardware access |
| Cerebras Cloud | Cloud endpoint URL | Provided by your cloud instance |
If you are interested in running PyFlame on Cerebras hardware, please contact Cerebras Systems to inquire about SDK access and hardware availability.
To build with Cerebras SDK support (once you have access):
cmake .. -DPYFLAME_USE_CEREBRAS_SDK=ON -DCEREBRAS_SDK_PATH=/path/to/sdk
To configure the runtime endpoint (for cloud or remote on-premises):
export CEREBRAS_RUNTIME_ADDRESS="your-endpoint:port"
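A client-side sketch of how the endpoint might be resolved, falling back to the on-premises default from the table above. The helper name is hypothetical; only the `CEREBRAS_RUNTIME_ADDRESS` variable and the `localhost:9000` default come from this README:

```python
import os

# Resolve the runtime endpoint: environment variable first,
# then the on-premises default listed in the table above.
def resolve_runtime_address(default="localhost:9000"):
    return os.environ.get("CEREBRAS_RUNTIME_ADDRESS", default)

os.environ["CEREBRAS_RUNTIME_ADDRESS"] = "my-cloud-endpoint:443"
print(resolve_runtime_address())  # my-cloud-endpoint:443
```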
Building
Linux/macOS
# Clone the repository
git clone https://github.com/CTO92/PyFlame.git
cd pyflame
# Create build directory
mkdir build && cd build
# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release
# Build
cmake --build . -j$(nproc)
# Run tests
ctest --output-on-failure
# Install Python package (development mode)
pip install -e .
Windows
# Clone and build
git clone https://github.com/CTO92/PyFlame.git
cd pyflame
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release
ctest -C Release --output-on-failure
Quick Start
Python
import pyflame as pf
# Create tensors
a = pf.randn([1024, 512])
b = pf.randn([512, 256])
# Build computation graph (lazy)
c = a @ b # Matrix multiply
d = pf.relu(c) # Activation
e = d.sum() # Reduction
# Execute
result = pf.eval(e)
print(result.numpy())
# With explicit mesh layout for WSE
x = pf.zeros([4096, 4096], layout=pf.MeshLayout.grid(16, 16))
y = pf.zeros([4096, 4096], layout=pf.MeshLayout.grid(16, 16))
z = x @ y # Distributed across 256 PEs
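For reference, the lazy graph above computes the same result as this eager NumPy version, and the tile arithmetic shows why a 16×16 grid places one 256×256 tile on each of 256 PEs (illustrative math only, not the PyFlame API):

```python
import numpy as np

# Eager NumPy equivalent of the lazy graph above.
rng = np.random.default_rng(0)
a = rng.standard_normal((1024, 512))
b = rng.standard_normal((512, 256))
c = a @ b                      # matrix multiply
d = np.maximum(c, 0.0)         # relu
e = d.sum()                    # reduction to a scalar

# Tile arithmetic for the 16x16 mesh layout: a 4096x4096 tensor
# split across a 16x16 PE grid gives one 256x256 tile per PE.
tile_rows, tile_cols = 4096 // 16, 4096 // 16
num_pes = 16 * 16
print(e, (tile_rows, tile_cols), num_pes)
```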
Training a Neural Network
import pyflame as pf
from pyflame import nn, optim
# Define a model
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 10)
)
# Setup optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
# Training step
x = pf.randn([32, 784]) # Batch of inputs
y = pf.randint(0, 10, [32]) # Labels
optimizer.zero_grad()
output = model(x)
loss = loss_fn(output, y)
loss.backward()
optimizer.step()
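What a single training step does under the hood can be sketched in NumPy for one bias-free linear layer with MSE loss. This is illustrative only; PyFlame's autograd computes these gradients automatically:

```python
import numpy as np

# One manual SGD step for y_pred = x @ W with MSE loss, mirroring
# the zero_grad / forward / backward / step sequence above.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 784))
W = rng.standard_normal((784, 10)) * 0.01
y = rng.standard_normal((32, 10))
lr = 0.001

y_pred = x @ W                               # forward
loss = ((y_pred - y) ** 2).mean()            # MSE loss
grad_out = 2.0 * (y_pred - y) / y_pred.size  # dLoss/dy_pred
grad_W = x.T @ grad_out                      # backward through matmul
W -= lr * grad_W                             # optimizer step

new_loss = ((x @ W - y) ** 2).mean()
print(loss, new_loss)  # loss decreases after the step
```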
C++
#include <pyflame/pyflame.hpp>
#include <iostream>
using namespace pyflame;
int main() {
auto a = Tensor::randn({1024, 512});
auto b = Tensor::randn({512, 256});
auto c = matmul(a, b);
auto d = relu(c);
auto e = d.sum();
e.eval();
std::cout << "Result: " << e.data<float>()[0] << "\n";
return 0;
}
Architecture
┌─────────────────────────────────────────────────────────────┐
│ PyFlame User API (Python/C++) │
│ - Tensor abstraction with lazy evaluation │
│ - Dataflow-aware operators │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PyFlame Intermediate Representation │
│ - Computation graph with shape inference │
│ - Optimization passes (fusion, layout, etc.) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PyFlame CSL Backend │
│ - Template-based code generation │
│ - PE placement and routing optimization │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CSL Runtime / Cerebras Hardware │
│ - 850,000+ Processing Elements │
│ - 2D mesh fabric with wavelet communication │
└─────────────────────────────────────────────────────────────┘
Project Structure
pyflame/
├── CMakeLists.txt # Build configuration
├── include/pyflame/ # C++ headers
│ ├── core/ # Tensor, DType, Layout
│ ├── ir/ # Graph IR, operations
│ └── backend/ # CSL code generation
├── src/ # C++ implementation
├── python/ # Python bindings
│ ├── pyflame/ # Python package
│ └── bindings.cpp # pybind11 bindings
├── tests/ # Unit tests
│ ├── cpp/ # C++ tests (Google Test)
│ └── python/ # Python tests (pytest)
├── examples/ # Example programs
│ ├── cpp/
│ └── p
