FluidLM
<img width="1950" height="674" alt="fluidlm_banner" src="https://github.com/user-attachments/assets/d17ee5df-5c7d-423c-a276-1ff4528b9614" />
A Transformer-free language model replacing O(N^2) self-attention with reaction-diffusion PDEs -- achieving O(N) scaling, adaptive computation, and no KV-cache.
Early-stage proof of concept. The goal is to demonstrate that the mathematical framework is sound and the core mechanisms work -- not to compete with production LLMs.
What is FluidLM?
FluidLM replaces the self-attention mechanism with a system of continuous partial differential equations (PDEs). Inspired by Alan Turing's 1952 work on morphogenesis, it treats tokens as chemical concentrations that diffuse and react in a latent space.
Instead of every token explicitly "looking at" every other token through an N x N matrix, information propagates through:
- Local diffusion -- a multi-scale Laplacian (dilations 1, 4, 16) that spreads information between neighbors, like heat through a medium.
- Selective State Space (Mamba) -- a content-based temporal routing mechanism that replaces the previous fixed CausalLongConv. Each token selectively retains or forgets information based on its content, providing the content-aware long-range mixing that pure diffusion lacks.
- Reaction MLP (SwiGLU) -- the nonlinear component where all semantic capacity is concentrated.
- Global memory pump -- a reservoir h of shape (B, D) with a learned forget gate that accumulates a sequence summary.
This shift from "every token talks to every token" to "information flows like a fluid" eliminates quadratic complexity.
Table of Contents
- Motivation
- The FluidLM Approach
- Mathematical Foundations
- Architecture V4.5.0
- Implemented Features
- Architecture Comparison
- Research Dashboard
- Experimental Status
- Research Roadmap
- Getting Started
- Version History
- References
1. Motivation
The O(N^2) Attention Wall
Self-attention computes pairwise interactions between every token. For a sequence of N tokens, this produces an N x N matrix -- O(N^2) computational and memory complexity. Doubling context quadruples cost.
Static Computation
A Transformer applies exactly L layers to every input, whether it is processing "Hello" or a complex mathematical proof.
The KV-Cache Memory Wall
During inference, Transformers store Key and Value matrices for every past token in every layer. This cache grows with sequence length x layer count -- tens of gigabytes for long-context models.
GPU Dependency
The combination of O(N^2) compute, large KV-caches, and irregular memory access makes Transformers fundamentally GPU-dependent. Their patterns are poorly suited for CPUs, NPUs, and embedded processors.
FluidLM asks: what if we replaced the attention matrix entirely with a different mathematical object?
2. The FluidLM Approach
Two complementary propagation mechanisms replace global attention:
Local diffusion -- the multi-scale Laplacian propagates information from neighbor to neighbor at three spatial scales (dilations 1, 4, 16). Each token influences its immediate neighborhood; patterns emerge globally through repeated application.
Selective State Space (Mamba) -- V4.5.0 replaces the fixed CausalLongConv filter with a Mamba-style selective SSM. This is the key architectural upgrade: while the Laplacian provides position-based spatial mixing, the SSM provides content-based temporal routing. The SSM's input-dependent matrices (A, B, C) allow the model to selectively retain or discard information based on what the token contains -- the missing capability that caused the previous loss plateau at ~6.0.
Memory pump -- a global reservoir h of shape (B, D) that accumulates a sequence summary at each integration step, then broadcasts it uniformly to all positions. The Forget Gate (decay = sigmoid(decay_param)) introduces learned viscosity: the model learns what persists in the reservoir and what dissipates. The gate is h-aware (sigmoid(Wx + Uh)) to avoid accumulating what is already present.
Laplacian smoothness regularizer (V4.4.8, ported from FluidWorld) -- a grad_loss term computed on the final hidden representation penalizes high-frequency noise along the sequence dimension. This acts as an implicit spatial regularizer that improves autoregressive generation stability.
3. Mathematical Foundations
The Standard Transformer (what we replace)
$$\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{Q \cdot K^\top}{\sqrt{d_k}}\right) \cdot V$$
The product $Q \cdot K^\top$ produces the $N \times N$ attention matrix -- the source of $O(N^2)$ complexity.
The FluidLM Governing Equation
$$\frac{\partial u}{\partial t} = \underbrace{\sum_{k} D_k \cdot \nabla^2_{d_k}(u)}_{\text{local diffusion}} + \underbrace{\text{SSM}(u)}_{\text{content-based routing}} + \underbrace{R(u, \theta)}_{\text{reaction (SwiGLU)}} + \underbrace{\alpha \cdot h}_{\text{global memory}}$$
Term 1: Multi-Scale Local Diffusion
The discrete Laplacian $[1, -2, 1]$ applied at three dilation levels:
| Dilation | Reach per step | Role |
|----------|----------------|------|
| 1 | ~1 token | Local syntax, morphology |
| 4 | ~4 tokens | Phrase-level structure |
| 16 | ~16 tokens | Sentence / paragraph |
$O(N)$ per step, sequential memory access -- ideal for CPU SIMD.
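As a sketch of this term, the multi-scale Laplacian can be written as a depthwise `conv1d` with the fixed kernel [1, -2, 1] applied at each dilation (function and coefficient names here are illustrative, not the repository's actual API):

```python
import torch
import torch.nn.functional as F

def multi_scale_laplacian(u, coeffs, dilations=(1, 4, 16)):
    """Sum of discrete Laplacians [1, -2, 1] at several dilation levels.

    u:      (B, L, D) hidden states
    coeffs: one diffusion coefficient D_k per scale
    Returns (B, L, D). Illustrative sketch, not the repo's exact code.
    """
    x = u.transpose(1, 2)                          # (B, D, L) for conv1d
    D = x.shape[1]
    kernel = torch.tensor([1.0, -2.0, 1.0]).view(1, 1, 3).repeat(D, 1, 1)
    out = torch.zeros_like(x)
    for d_k, dil in zip(coeffs, dilations):
        # depthwise conv (groups=D); padding=dil keeps the length at L
        out = out + d_k * F.conv1d(x, kernel, padding=dil, dilation=dil, groups=D)
    return out.transpose(1, 2)
```

On a constant signal the interior response is exactly zero (1 - 2 + 1 = 0), which is what makes the operator a pure smoothing term rather than a content transform.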
Term 2: Selective State Space (Mamba)
$$h_t = \bar{A}_t \cdot h_{t-1} + \bar{B}_t \cdot x_t$$
$$y_t = C_t \cdot h_t + D \cdot x_t$$
Where $A$, $B$, $C$ are input-dependent (computed from the current token). This is mathematically a discretized ODE -- the same family as the PDE diffusion. The SSM selectively chooses what to remember and what to forget based on content, providing the content-based routing that pure diffusion lacks.
Pure PyTorch implementation (no custom CUDA kernels). $O(N \cdot d \cdot s)$ training, $O(d \cdot s)$ constant per-token inference. No KV-cache.
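A minimal pure-PyTorch selective scan can illustrate the recurrence above. This is a didactic sketch (diagonal real A, scalar step size, illustrative layer names), not the repository's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Mamba-style selective state space layer (illustrative sketch).
    B_t, C_t and the step size dt_t are computed from the input, so the
    recurrence is content-dependent:
        h_t = exp(dt_t * A) * h_{t-1} + dt_t * B_t * x_t
        y_t = C_t . h_t + D * x_t
    """
    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.d_state = d_state
        # Fixed negative-real diagonal A for stability.
        self.log_A = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.to_dt = nn.Linear(d_model, 1)
        self.D = nn.Parameter(torch.ones(d_model))

    def forward(self, x):                          # x: (B, L, D)
        Bsz, L, Dm = x.shape
        A = -torch.exp(self.log_A)                 # (S,), strictly negative
        Bmat, Cmat = self.to_B(x), self.to_C(x)    # (B, L, S) each
        dt = F.softplus(self.to_dt(x))             # (B, L, 1), positive
        h = x.new_zeros(Bsz, Dm, self.d_state)     # per-channel state
        ys = []
        for t in range(L):                         # sequential scan: O(L)
            decay = torch.exp(dt[:, t] * A)        # (B, S)
            h = decay.unsqueeze(1) * h \
                + (dt[:, t] * Bmat[:, t]).unsqueeze(1) * x[:, t].unsqueeze(-1)
            ys.append((h * Cmat[:, t].unsqueeze(1)).sum(-1) + self.D * x[:, t])
        return torch.stack(ys, dim=1)              # (B, L, D)
```

At inference only `h` of shape (B, D, S) persists between tokens, which is the constant per-token memory claim above.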
Term 3: Reaction Function (SwiGLU)
$$R(u, \theta) = \left(W_1 \cdot u \odot \sigma(W_g \cdot u)\right) \cdot W_2$$
SwiGLU (V4.5.0) replaces the previous GELU MLP. Proven by LLaMA/PaLM to improve language modeling quality. $\frac{8}{3}$ expansion ratio.
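The reaction term can be sketched as a standard SwiGLU block; following the LLaMA formulation, $\sigma$ is taken here as SiLU/Swish, and the class name is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Reaction MLP R(u) = (W1 u ⊙ SiLU(Wg u)) W2 with 8/3 expansion.
    Illustrative sketch, not the repository's exact module."""
    def __init__(self, d_model):
        super().__init__()
        hidden = int(8 * d_model / 3)              # 8/3 expansion ratio
        self.w1 = nn.Linear(d_model, hidden, bias=False)
        self.wg = nn.Linear(d_model, hidden, bias=False)
        self.w2 = nn.Linear(hidden, d_model, bias=False)

    def forward(self, u):
        # gated nonlinearity: value path modulated by a Swish gate
        return self.w2(self.w1(u) * F.silu(self.wg(u)))
```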
Term 4: Global Memory Pump + Forget Gate
$$s = \text{mean}_{L}(R(u)) \quad \in \mathbb{R}^{B \times D}$$
$$g = \sigma(W_x \cdot \bar{u} + W_h \cdot h) \quad \in \mathbb{R}^{B \times D}$$
$$\delta = \sigma(\theta_{\text{decay}}) \quad \in (0,1)^{D} \quad \text{(learned viscosity)}$$
$$h \leftarrow \delta \odot h + g \cdot \tanh(s)$$
$h$ is of shape $(B, D)$ -- global reservoir, $O(1)$ memory. Initialized at $\text{decay} \approx 0.97$.
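The reservoir update above can be sketched as follows. Module and parameter names are illustrative; the pooled input $\bar{u}$ and summary $s$ are passed in explicitly:

```python
import torch
import torch.nn as nn

class MemoryPump(nn.Module):
    """Global reservoir h of shape (B, D) with an h-aware gate and a
    learned per-channel forget rate. Sketch of the equations above."""
    def __init__(self, d_model):
        super().__init__()
        self.Wx = nn.Linear(d_model, d_model)
        self.Wh = nn.Linear(d_model, d_model)
        # sigmoid(3.476) ≈ 0.97, matching the stated decay init
        self.decay_param = nn.Parameter(torch.full((d_model,), 3.476))

    def forward(self, u, r, h):
        """u: (B, L, D) layer input; r: (B, L, D) reaction output;
        h: (B, D) reservoir. Returns the updated reservoir."""
        s = r.mean(dim=1)                            # sequence summary
        u_bar = u.mean(dim=1)                        # pooled input
        g = torch.sigmoid(self.Wx(u_bar) + self.Wh(h))   # h-aware gate
        delta = torch.sigmoid(self.decay_param)     # learned viscosity
        return delta * h + g * torch.tanh(s)
```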
Term 5: Laplacian Grad Loss (V4.4.8)
$$\mathcal{L}_{\text{grad}} = w_g \cdot \text{mean}\!\left(\left|\nabla^2_{\text{1D}}(z_{\text{final}})\right|\right)$$
Applied to the final hidden representation $z_{\text{final}}$ before the LM head. Penalizes second-order discontinuity along the sequence, encouraging smooth latent representations that degrade gracefully during autoregressive generation.
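The regularizer reduces to a few lines; the weight `w_g` below is an illustrative hyperparameter:

```python
import torch

def laplacian_grad_loss(z, w_g=0.1):
    """Smoothness penalty on final hidden states z of shape (B, L, D):
    mean |z[t-1] - 2 z[t] + z[t+1]| along the sequence dimension.
    Sketch of the grad_loss term described above."""
    lap = z[:, :-2] - 2 * z[:, 1:-1] + z[:, 2:]   # discrete [1, -2, 1]
    return w_g * lap.abs().mean()
```

Note the penalty is zero for any representation that varies linearly along the sequence, so it suppresses high-frequency jitter without flattening genuine trends.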
Positional Encoding: Sinusoidal
Applied once at the FluidNet input (not at each integration step):
$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
Sinusoidal encoding gives a clean additive signal that propagates naturally through the PDE.
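This is the standard sinusoidal table; a sketch (assuming even `d`, function name illustrative):

```python
import torch

def sinusoidal_pe(L, d):
    """Sinusoidal positional encoding of shape (L, d), added once at the
    FluidNet input rather than at each integration step."""
    pos = torch.arange(L).unsqueeze(1).float()         # (L, 1)
    i = torch.arange(0, d, 2).float()                  # even indices
    angles = pos / (10000 ** (i / d))                  # (L, d/2)
    pe = torch.zeros(L, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe
```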
Time Integration (Forward Euler)
$$u_{t+1} = \text{RMSNorm}\!\left( u_t + \Delta t \cdot \left[ \sum_k D_k \nabla^2_{d_k}(u_t) + \text{SSM}(u_t) + R(u_t,\theta) + \alpha \cdot h_t \right] \right)$$
$\Delta t$ is a per-layer learned parameter (dt_gate), initialized from the config dt value.
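The integration step above can be sketched as one function that composes the four terms; the callables stand in for the layer's sub-modules and are illustrative names:

```python
import torch

def euler_step(u, h, dt, diffusion, ssm, reaction, alpha, rmsnorm):
    """One forward-Euler step of the governing equation:
        u <- RMSNorm(u + dt * (diffusion(u) + ssm(u) + reaction(u) + alpha * h))
    u: (B, L, D), h: (B, D) global reservoir broadcast to all positions.
    Sketch only; the real layer wires in its own learned dt_gate and alpha.
    """
    du = diffusion(u) + ssm(u) + reaction(u) + alpha * h.unsqueeze(1)
    return rmsnorm(u + dt * du)
```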
Turing Equilibrium and Adaptive Computation
$$\tau = \text{mean}\left(\frac{|u_{t+1} - u_t|}{|u_t| + \varepsilon}\right) \xrightarrow{< \varepsilon} \text{HALT}$$
During inference, the model halts early when the fluid stabilizes. During training, turbulence contributes to a differentiable regularization loss.
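The halting statistic is a one-liner; the tolerance below is an illustrative value, not the project's configured threshold:

```python
import torch

def turbulence(u_next, u_prev, eps=1e-6):
    """Mean relative change between successive fluid states (tau above)."""
    rel = (u_next - u_prev).norm(dim=-1) / (u_prev.norm(dim=-1) + eps)
    return rel.mean()

def should_halt(u_next, u_prev, tol=1e-3):
    """Inference-time early exit: stop integrating once the fluid settles."""
    return turbulence(u_next, u_prev).item() < tol
```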
4. Architecture V4.5.0
```
Input tokens
      |
      v
[Embedding]        (d_model = 512)
      |
      v
[Sinusoidal PE]    (0 params, applied once)
      |
      v
[FluidLayer 0]  -- Diffusion (Laplacian x3 scales)
                -- Selective SSM (Mamba, content-based)
                -- Reaction SwiGLU (8/3 expansion)
                -- Local Memory (causal avg pool + projection)
                -- Memory Pump h (B,D) + Forget Gate
                (x T steps, T <= 8)
      |
[FluidLayer 1..3]  (same structure, independent params)
      |
      v
[RMSNorm]
      |
      v
[Linear Head]      (weight-tied with Embedding)
      |
      v
Logits (vocab: 50,257)
```
Parameter Count (d_model=512, 4 layers)
| Component | Parameters | Notes |
|-----------|-----------|-------|
| Embedding (weight-tied with head) | ~25.7M | Counts once |
| Sinusoidal PE | 0 | Buffer |
| Reaction SwiGLU 8/3 (x4 layers) | ~10.7M | Replaces GELU MLP |
| Selective SSM / Mamba (x4 layers) | ~3.4M | Replaces CausalLongConv |
| Memory Gate x + h (x4 layers) | ~2.1M | h-aware, (B,D) |
| Multi-Head LongConv (x4 layers) | ~1.2M | K=33/65/129/257 |
| Local Memory proj (x4 layers) | ~1.05M | Causal avg pool |
| Diffusion coefficients (x4 layers) | ~6K | |
| RMSNorm + dt_gate + alpha (x4 layers) | ~8K | |
| Total V4.5.0 | ~44.2M | |
5. Implemented Features
| Feature | Status | Notes |
|---------|--------|-------|
| Multi-Scale Dilated Diffusion | Done | Dilations [1, 4, 16], learnable coefficients |
| Selective SSM (Mamba) | Done | Content-based routing, pure PyTorch, V4.5.0 |
| Multi-Head LongConv (K=33/65/129/257) | Done | 4
