PyFlame
Native Deep Learning Framework for Cerebras WSE
PRE-RELEASE ALPHA 2.0
This software is in early development and is not yet ready for production use. APIs may change without notice. Use at your own risk. The project is moving toward beta, but transforms.py still contains placeholder functions rather than working implementations.
PyFlame is a tensor computation library designed natively for the Cerebras Wafer-Scale Engine (WSE), featuring lazy evaluation, automatic CSL code generation, and a Python-first API.
A Quick Note For Developers
The Nvidia vendor lock-in created by CUDA/PyTorch has become a multi-million-dollar nightmare for many enterprises that now struggle to get GPUs for their own AI projects. Understand what that means.
I released PyFlame and the very next day my phone was ringing almost off the hook. By the end of the next day I had secured THREE consulting agreements at $250k each and started discussions on a contract that, as of this writing (just 6 days after the release of PyFlame), is looking like $2.8M.
Understand what that means. I found a massive pain point, figured out how to solve it, actually GAVE AWAY the solution (sort of), and now I've secured over $3.5 million in revenue.
Want to learn the SYSTEM I use to do that? (This is the 9th time I've done this exact thing, this exact way.)
Just go to: https://oaqlabs.com/vibe.html
About OA Quantum Labs
PyFlame is developed by OA Quantum Labs, a specialized engineering firm focused on high-performance and quantum computing.
What We Do
In the context of the PyFlame project we help organizations unlock the full potential of specialized hardware through custom developer tools, optimized frameworks, and performance engineering:
- Custom Framework Development — Native tooling designed for your specific accelerator architecture
- Performance Optimization — Squeeze maximum throughput from your existing hardware investments
- Migration & Porting — Adapt existing ML workloads to new accelerator platforms
- Training & Enablement — Get your team productive on specialized hardware faster
Why Work With Us
PyFlame demonstrates our approach: rather than forcing general-purpose tools onto specialized hardware, we build native solutions that leverage the unique strengths of each architecture. The result is dramatically better performance and a more intuitive developer experience.
If your organization is working with specialized AI accelerators, FPGAs, or custom silicon, we'd love to discuss how purpose-built tooling could transform your development workflow.
Get In Touch
Danny Wall — CTO, OA Quantum Labs
dwall@oaqlabs.com | oaqlabs.com
Features
- Native WSE Design: Built from the ground up for Cerebras architecture
- Lazy Evaluation: Computation graphs are built lazily and executed on demand
- CSL Code Generation: Automatic generation of optimized CSL kernels
- 2D Mesh Layouts: First-class support for tensor distribution across PEs
- Python + C++ API: Use from Python or C++ with the same abstractions
- NumPy Interoperability: Easy conversion to/from NumPy arrays
- PyTorch-like API: Familiar nn.Module system for building models
- Automatic Differentiation: Full autograd support for training
- Complete Training Stack: Optimizers, loss functions, and LR schedulers
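The lazy-evaluation model listed above can be illustrated with a toy sketch in plain Python. This is purely conceptual and is not PyFlame's internal design; all class and function names here are made up for illustration:

```python
# Toy illustration of lazy evaluation: operations build a graph,
# and no arithmetic runs until the graph is explicitly evaluated.
class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))

def const(v):
    return Node("const", value=v)

def eval_node(n):
    # Recursively evaluate the graph on demand.
    if n.op == "const":
        return n.value
    a, b = (eval_node(i) for i in n.inputs)
    return a + b if n.op == "add" else a * b

x = const(3)
y = const(4)
z = x * y + const(1)   # builds a graph; nothing is computed yet
print(eval_node(z))    # 13
```

PyFlame applies the same idea at tensor granularity: `pf.eval()` triggers execution of the recorded graph, which is what makes whole-graph optimization and CSL code generation possible.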
Project Status
Version: Pre-Release Alpha 2.0
Phase 1 (Core Infrastructure) - Complete
- [x] Core tensor class with lazy evaluation
- [x] Computation graph (IR) system
- [x] Shape inference
- [x] Elementwise operations (add, mul, relu, sigmoid, etc.)
- [x] Reduction operations (sum, mean, max, min)
- [x] Matrix multiplication
- [x] CSL code generation framework
- [x] Python bindings via pybind11
- [x] CPU reference implementation
Phase 2 (ML Primitives) - Complete
- [x] Automatic differentiation (autograd)
- [x] Neural network module system (nn.Module)
- [x] Linear layers (Linear)
- [x] Convolutional layers (Conv1d, Conv2d)
- [x] Normalization layers (BatchNorm, LayerNorm, GroupNorm)
- [x] Pooling layers (MaxPool, AvgPool, AdaptivePool)
- [x] Dropout layers
- [x] Multi-head attention
- [x] Loss functions (MSE, CrossEntropy, BCE, etc.)
- [x] Optimizers (SGD, Adam, AdamW, RMSprop)
- [x] Learning rate schedulers
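As a conceptual illustration of what the autograd item above provides (not PyFlame's actual implementation), reverse-mode differentiation can be sketched in a few lines of Python:

```python
# Minimal reverse-mode autograd sketch: each Var records its parents
# and the local gradient needed to propagate back through the op.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # sequence of (parent, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Accumulate this node's gradient, then apply the chain rule.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x = Var(2.0)
y = Var(3.0)
z = x * y + x          # dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

A real framework additionally topologically sorts the graph and handles tensors rather than scalars, but the gradient-accumulation pattern is the same.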
Requirements
- CMake 3.18+
- C++17 compiler (GCC 9+, Clang 10+, MSVC 2019+)
- Python 3.8+
- pybind11 2.10+
Cerebras SDK (Optional)
PyFlame includes a CPU reference implementation that allows you to develop, test, and validate your models without access to Cerebras hardware. All tensor operations, graph building, and CSL code generation work without the SDK.
To actually execute on Cerebras WSE hardware, you need:
- Cerebras SDK - This is proprietary software available only to Cerebras customers and partners. It is not publicly downloadable.
- Access to Cerebras hardware - Either on-premises CS-2/CS-3 systems or Cerebras Cloud.
Supported deployment options:
| Environment | Runtime Address | Notes |
|-------------|-----------------|-------|
| On-premises CS-2/CS-3 | localhost:9000 or system IP | Direct hardware access |
| Cerebras Cloud | Cloud endpoint URL | Provided by your cloud instance |
If you are interested in running PyFlame on Cerebras hardware, please contact Cerebras Systems to inquire about SDK access and hardware availability.
To build with Cerebras SDK support (once you have access):
cmake .. -DPYFLAME_USE_CEREBRAS_SDK=ON -DCEREBRAS_SDK_PATH=/path/to/sdk
To configure the runtime endpoint (for cloud or remote on-premises):
export CEREBRAS_RUNTIME_ADDRESS="your-endpoint:port"
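A client-side sketch of how the endpoint might be resolved, falling back to the on-premises default from the table above. The helper name is hypothetical; only the `CEREBRAS_RUNTIME_ADDRESS` variable and the `localhost:9000` default come from this README:

```python
import os

# Resolve the runtime endpoint: environment variable first,
# then the on-premises default listed in the table above.
def resolve_runtime_address(default="localhost:9000"):
    return os.environ.get("CEREBRAS_RUNTIME_ADDRESS", default)

os.environ["CEREBRAS_RUNTIME_ADDRESS"] = "my-cloud-endpoint:443"
print(resolve_runtime_address())  # my-cloud-endpoint:443
```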
Building
Linux/macOS
# Clone the repository
git clone https://github.com/CTO92/PyFlame.git
cd pyflame
# Create build directory
mkdir build && cd build
# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release
# Build
cmake --build . -j$(nproc)
# Run tests
ctest --output-on-failure
# Install Python package (development mode)
pip install -e .
Windows
# Clone and build
git clone https://github.com/CTO92/PyFlame.git
cd pyflame
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release
ctest -C Release --output-on-failure
Quick Start
Python
import pyflame as pf
# Create tensors
a = pf.randn([1024, 512])
b = pf.randn([512, 256])
# Build computation graph (lazy)
c = a @ b # Matrix multiply
d = pf.relu(c) # Activation
e = d.sum() # Reduction
# Execute
result = pf.eval(e)
print(result.numpy())
# With explicit mesh layout for WSE
x = pf.zeros([4096, 4096], layout=pf.MeshLayout.grid(16, 16))
y = pf.zeros([4096, 4096], layout=pf.MeshLayout.grid(16, 16))
z = x @ y # Distributed across 256 PEs
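For reference, the lazy graph above computes the same result as this eager NumPy version, and the tile arithmetic shows why a 16×16 grid places one 256×256 tile on each of 256 PEs (illustrative math only, not the PyFlame API):

```python
import numpy as np

# Eager NumPy equivalent of the lazy graph above.
rng = np.random.default_rng(0)
a = rng.standard_normal((1024, 512))
b = rng.standard_normal((512, 256))
c = a @ b                      # matrix multiply
d = np.maximum(c, 0.0)         # relu
e = d.sum()                    # reduction to a scalar

# Tile arithmetic for the 16x16 mesh layout: a 4096x4096 tensor
# split across a 16x16 PE grid gives one 256x256 tile per PE.
tile_rows, tile_cols = 4096 // 16, 4096 // 16
num_pes = 16 * 16
print(e, (tile_rows, tile_cols), num_pes)
```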
Training a Neural Network
import pyflame as pf
from pyflame import nn, optim
# Define a model
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 10)
)
# Setup optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
# Training step
x = pf.randn([32, 784]) # Batch of inputs
y = pf.randint(0, 10, [32]) # Labels
optimizer.zero_grad()
output = model(x)
loss = loss_fn(output, y)
loss.backward()
optimizer.step()
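What a single training step does under the hood can be sketched in NumPy for one bias-free linear layer with MSE loss. This is illustrative only; PyFlame's autograd computes these gradients automatically:

```python
import numpy as np

# One manual SGD step for y_pred = x @ W with MSE loss, mirroring
# the zero_grad / forward / backward / step sequence above.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 784))
W = rng.standard_normal((784, 10)) * 0.01
y = rng.standard_normal((32, 10))
lr = 0.001

y_pred = x @ W                               # forward
loss = ((y_pred - y) ** 2).mean()            # MSE loss
grad_out = 2.0 * (y_pred - y) / y_pred.size  # dLoss/dy_pred
grad_W = x.T @ grad_out                      # backward through matmul
W -= lr * grad_W                             # optimizer step

new_loss = ((x @ W - y) ** 2).mean()
print(loss, new_loss)  # loss decreases after the step
```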
C++
#include <pyflame/pyflame.hpp>
#include <iostream>
using namespace pyflame;
int main() {
auto a = Tensor::randn({1024, 512});
auto b = Tensor::randn({512, 256});
auto c = matmul(a, b);
auto d = relu(c);
auto e = d.sum();
e.eval();
std::cout << "Result: " << e.data<float>()[0] << "\n";
return 0;
}
Architecture
┌─────────────────────────────────────────────────────────────┐
│ PyFlame User API (Python/C++) │
│ - Tensor abstraction with lazy evaluation │
│ - Dataflow-aware operators │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PyFlame Intermediate Representation │
│ - Computation graph with shape inference │
│ - Optimization passes (fusion, layout, etc.) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PyFlame CSL Backend │
│ - Template-based code generation │
│ - PE placement and routing optimization │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CSL Runtime / Cerebras Hardware │
│ - 850,000+ Processing Elements │
│ - 2D mesh fabric with wavelet communication │
└─────────────────────────────────────────────────────────────┘
Project Structure
pyflame/
├── CMakeLists.txt # Build configuration
├── include/pyflame/ # C++ headers
│ ├── core/ # Tensor, DType, Layout
│ ├── ir/ # Graph IR, operations
│ └── backend/ # CSL code generation
├── src/ # C++ implementation
├── python/ # Python bindings
│ ├── pyflame/ # Python package
│ └── bindings.cpp # pybind11 bindings
├── tests/ # Unit tests
│ ├── cpp/ # C++ tests (Google Test)
│ └── python/ # Python tests (pytest)
├── examples/ # Example programs
│ ├── cpp/
│ └── p
