# FloDl

A recursive deep learning framework for Rust.
## Install / Use

```
/learn @fab2s/FloDl README
```
## If You Know PyTorch, You Know floDl
<table>
<tr><th>PyTorch</th><th>floDl</th></tr>
<tr><td><pre>
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.GELU(),
    nn.LayerNorm(16),
    nn.Linear(16, 2),
)
pred = model(x)
loss = F.mse_loss(pred, target)
loss.backward()
optimizer.step()
</pre></td><td><pre>
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .through(Linear::new(16, 2)?)
    .build()?;
let pred = model.forward(&x)?;
let loss = mse_loss(&pred, &target)?;
loss.backward()?;
optimizer.step()?;
</pre></td></tr>
</table>
Same concepts, same names, same GPU kernels underneath. The `?` operator replaces silent failures with explicit error propagation that the compiler forces you to handle. `Drop` replaces the garbage collector. The full migration guide covers every op, module, and pattern.

New to Rust? Read Rust for PyTorch Users: 10 patterns in 15 minutes.
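The error-handling point can be seen in plain Rust, independent of floDl. A minimal sketch with a hypothetical `parse_dims` helper (not part of the library): each `?` either yields the value or returns the error to the caller, and the compiler refuses to let a `Result` go unhandled.

```rust
use std::num::ParseIntError;

// Hypothetical helper: parse "RxC" dimensions. Each `?` propagates the
// parse error to the caller instead of failing silently.
fn parse_dims(s: &str) -> Result<(i64, i64), ParseIntError> {
    let mut parts = s.split('x');
    let rows: i64 = parts.next().unwrap_or("").trim().parse()?; // early return on error
    let cols: i64 = parts.next().unwrap_or("").trim().parse()?;
    Ok((rows, cols))
}

fn main() {
    assert_eq!(parse_dims("2x16"), Ok((2, 16)));
    assert!(parse_dims("oops").is_err()); // error surfaced, not swallowed
}
```

The same ownership rules give deterministic cleanup: values are freed the moment they go out of scope (`Drop`), so there are no garbage-collection pauses mid-training.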
## Getting Started
With Docker (no Rust or libtorch needed):
```sh
curl -sL https://flodl.dev/init.sh | sh -s my-project
cd my-project
make build   # first build (~5 min, downloads libtorch)
make run     # train the template model
```
Without Docker (Rust 1.85+ and libtorch):

```sh
# Auto-detects CPU or CUDA
curl -sL https://raw.githubusercontent.com/fab2s/floDl/main/download-libtorch.sh | sh
cargo add flodl && cargo build
```

For CUDA: `cargo add flodl --features cuda`, plus the CUDA toolkit.
Both paths generate an annotated training template. Edit src/main.rs to
build your model:
```rust
use flodl::*;

let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .also(Linear::new(16, 16)?)   // residual connection
    .through(Linear::new(16, 2)?)
    .build()?;

let params = model.parameters();
let mut optimizer = Adam::new(&params, 0.01);

model.train();
for (input_t, target_t) in &batches {
    let input = Variable::new(input_t.clone(), true);
    let target = Variable::new(target_t.clone(), false);

    let pred = model.forward(&input)?;
    let loss = mse_loss(&pred, &target)?;

    optimizer.zero_grad();
    loss.backward()?;
    clip_grad_norm(&params, 1.0)?;
    optimizer.step()?;
}
```
## The Graph Builder
floDl's fluent graph builder lets you describe complex architectures as
readable data flow — no boilerplate, no nn.Module subclassing.
```rust
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)                 // activation
    .through(LayerNorm::new(16)?)  // normalization
    .also(Linear::new(16, 16)?)    // residual connection
    .through(Linear::new(16, 2)?)  // output projection
    .build()?;
```
`build()` returns a `Graph` that implements `Module`, so you can nest it inside other graphs. Things get interesting when architectures get complex:
```rust
let g = FlowBuilder::from(encoder).tag("encoded")
    .split(modules![head_a, head_b, head_c]).merge(MergeOp::Mean)
    .loop_body(refinement_block).for_n(3).tag("refined")
    .gate(router, modules![expert_a, expert_b]).using(&["encoded"])
    .switch(selector, modules![light_path, heavy_path]).using(&["refined"])
    .through(StateAdd).using(&["memory"]).tag("memory")
    .loop_body(decoder).while_cond(halt_condition, 10)
    .through(output_head)
    .build()?;
```
Every construct (`split`/`merge`, `also`, `loop_body`, `gate`, `switch`, `map`, `tag`/`using`) composes cleanly. Forward references (`using` before `tag`) carry state across calls, enabling recurrent architectures without special-casing.
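As a plain-Rust analogy for how a `tag`/`using` pair carries state between forward calls (a hypothetical sketch, not the flodl API), a minimal stateful cell:

```rust
// Hypothetical stand-in for a tagged state slot: the cell adds last call's
// stored state to the input ("using"), then stores this call's output ("tag").
struct StatefulCell {
    memory: Vec<f32>,
}

impl StatefulCell {
    fn new(dim: usize) -> Self {
        Self { memory: vec![0.0; dim] }
    }

    fn forward(&mut self, input: &[f32]) -> Vec<f32> {
        // read state from the previous call, element-wise add
        let out: Vec<f32> = input
            .iter()
            .zip(&self.memory)
            .map(|(x, m)| x + m)
            .collect();
        // write state for the next call
        self.memory = out.clone();
        out
    }
}

fn main() {
    let mut cell = StatefulCell::new(2);
    assert_eq!(cell.forward(&[1.0, 2.0]), vec![1.0, 2.0]); // memory starts at zero
    assert_eq!(cell.forward(&[1.0, 2.0]), vec![2.0, 4.0]); // previous output folded in
}
```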
| Method | What it does |
|--------|-------------|
| `from(m).through(m)` | Linear chain |
| `also(m)` | Residual: input + m(input) |
| `fork(m)` | Side branch: capture output as tag, stream continues |
| `split(modules![...]).merge(op)` | Parallel branches, merged by Add or Mean |
| `tag(name)` / `using(refs)` | Named references, backward or forward (across calls) |
| `loop_body(body).for_n(n)` | Fixed iteration with BPTT |
| `loop_body(body).while_cond` / `until_cond` | Conditional loops |
| `gate(router, modules![...])` | Soft routing: weighted combination |
| `switch(selector, modules![...])` | Hard routing: only the selected branch runs |
| `map(body).each()` / `.over(tag)` / `.slices(n)` | Element-wise, tagged, or sliced iteration |
| `input(names)` | Auxiliary graph inputs for multi-input architectures |
See the Graph Builder Tutorial and the full showcase.
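Two of the table's rows can be sketched in plain Rust under the table's own definitions (hypothetical helpers, not flodl itself): `also` computes input + m(input), and `gate` a weighted combination of branch outputs.

```rust
// Hypothetical plain-Rust sketches of two table rows (not the flodl API).

// also(m): residual connection, input + m(input)
fn also(input: &[f32], m: impl Fn(&[f32]) -> Vec<f32>) -> Vec<f32> {
    m(input).iter().zip(input).map(|(y, x)| y + x).collect()
}

// gate(router, branches): soft routing, a weighted sum of branch outputs
fn gate(weights: &[f32], outputs: &[Vec<f32>]) -> Vec<f32> {
    let dim = outputs[0].len();
    (0..dim)
        .map(|i| weights.iter().zip(outputs).map(|(w, o)| w * o[i]).sum::<f32>())
        .collect()
}

fn double(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| v * 2.0).collect()
}

fn main() {
    assert_eq!(also(&[1.0, 2.0], double), vec![3.0, 6.0]); // x + 2x
    assert_eq!(gate(&[0.25, 0.75], &[vec![4.0], vec![8.0]]), vec![7.0]); // 0.25*4 + 0.75*8
}
```

A `switch`, by contrast, would evaluate only the branch the selector picks, skipping the others entirely.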
## Graph Tree: Hierarchical Composition
This is where floDl goes beyond PyTorch. Graphs nest inside graphs with label-path addressing — dot-separated paths that let you reach into any subgraph from the root. Train components independently, compose them into larger architectures, and control training phases declaratively.
```rust
// Build components independently
let scan = FlowBuilder::from(scan_net).tag("hidden")
    .label("scan").build()?;
let read = FlowBuilder::from(read_net).tag("confidence")
    .label("read").build()?;
let encoder = FlowBuilder::from(scan)
    .through(read)
    .label("encoder").build()?;

// Compose into the full model
let model = FlowBuilder::from(encoder)
    .through(classifier)
    .build()?;
```
### Dotted paths reach anywhere
Every tag and subgraph is addressable through dotted paths from the root:
```rust
model.validate_path("encoder")?;                  // -> Subgraph
model.validate_path("encoder.scan.hidden")?;      // -> Tag (three levels deep)
model.validate_path("encoder.read.confidence")?;  // -> Tag
```
### Declarative training phases

Freeze and thaw entire subtrees by path, with no manual parameter iteration:
```rust
// Phase 1: train only the classifier; the encoder is frozen
model.freeze("encoder")?;
let fresh_params = model.parameters(); // only unfrozen params
let mut opt = Adam::new(&fresh_params, 1e-3);
// ... train ...

// Phase 2: thaw scan, keep read frozen (it's proven)
model.thaw("encoder.scan")?;
let mut opt = Adam::with_groups()
    .group(&model.parameters_at("encoder.scan")?, 1e-4) // low LR
    .group(&model.parameters_at("classifier")?, 1e-3)
    .build();
```
### Subgraph checkpoints
Train a component standalone, save it, load it into a larger model:
```rust
// Pre-trained encoder saved earlier
encoder.save_checkpoint("encoder_v1.fdl.gz")?;

// Load into the composed model; namespace and hash are validated
model.load_subgraph_checkpoint("encoder", "encoder_v1.fdl.gz")?;
model.freeze("encoder.read")?; // lock what's proven
```
### Cross-boundary observation
Metrics flow up through the tree automatically:
```rust
model.record_at("encoder.scan.loss", scan_loss)?;
model.record_at("encoder.read.accuracy", read_acc)?;
model.record_scalar("total_loss", total)?;
model.flush(&[]); // single call flushes the entire tree

// Trends across boundaries drive training decisions
if model.trend_at("encoder.scan.loss")?.stalled(10, 1e-4) {
    model.thaw("encoder.read")?; // scan stalled, unfreeze read
}

// The monitor sees all metrics with dotted names automatically
monitor.log(epoch, elapsed, &model);
// -> total_loss, encoder.scan.loss, encoder.read.accuracy
```
This is progressive model composition: each component is trained and validated independently before becoming a building block in a larger architecture. Checkpoints, metrics, and training phases compose just like the graphs themselves.
See the full Graph Tree Tutorial.
## The Training Experience

### Training Monitor
Drop-in monitor with adaptive ETA, resource tracking, and a live web dashboard — no external dependencies, no separate process.
```rust
use flodl::monitor::Monitor;

let mut monitor = Monitor::new(num_epochs);
monitor.serve(3000)?; // optional: live dashboard at http://localhost:3000

for epoch in 0..num_epochs {
    let t = std::time::Instant::now();
    // ... training ...
    monitor.log(epoch, t.elapsed(), &model); // sees the entire graph tree
}
monitor.finish();
```
```text
epoch  1/100  loss=1.5264  [49ms ETA 4.8s]
epoch 10/100  loss=0.3817  [25ms ETA 2.2s]  VRAM: 2.1/6.0 GB (82%)
epoch 50/100  loss=0.0023  [24ms ETA 1.2s]  VRAM: 2.1/6.0 GB
```
