FluidLM
<img width="1950" height="674" alt="fluidlm_banner" src="https://github.com/user-attachments/assets/d17ee5df-5c7d-423c-a276-1ff4528b9614" />
A Transformer-free language model replacing O(N^2) self-attention with reaction-diffusion PDEs -- achieving O(N) scaling, adaptive computation, and no KV-cache.
Early-stage proof of concept. The goal is to demonstrate that the mathematical framework is sound and the core mechanisms work -- not to compete with production LLMs.
What is FluidLM?
FluidLM replaces the self-attention mechanism with a system of continuous partial differential equations (PDEs). Inspired by Alan Turing's 1952 work on morphogenesis, it treats tokens as chemical concentrations that diffuse and react in a latent space.
Instead of every token explicitly "looking at" every other token through an N x N matrix, information propagates through:
- Local diffusion -- a multi-scale Laplacian (dilations 1, 4, 16) that spreads information between neighbors, like heat through a medium.
- Selective State Space (Mamba) -- a content-based temporal routing mechanism that replaces the previous fixed CausalLongConv. Each token selectively retains or forgets information based on its content, providing the content-aware long-range mixing that pure diffusion lacks.
- Reaction MLP (SwiGLU) -- the nonlinear component where all semantic capacity is concentrated.
- Global memory pump -- a reservoir h of shape (B, D) with a learned forget gate that accumulates a sequence summary.
This shift from "every token talks to every token" to "information flows like a fluid" eliminates quadratic complexity.
Table of Contents
- Motivation
- The FluidLM Approach
- Mathematical Foundations
- Architecture V4.5.0
- Implemented Features
- Architecture Comparison
- Research Dashboard
- Experimental Status
- Research Roadmap
- Getting Started
- Version History
- References
1. Motivation
The O(N^2) Attention Wall
Self-attention computes pairwise interactions between every token. For a sequence of N tokens, this produces an N x N matrix -- O(N^2) computational and memory complexity. Doubling context quadruples cost.
Static Computation
A Transformer applies exactly L layers to every input, whether it is processing "Hello" or a complex mathematical proof.
The KV-Cache Memory Wall
During inference, Transformers store Key and Value matrices for every past token in every layer. This cache grows with sequence length x layer count -- tens of gigabytes for long-context models.
GPU Dependency
The combination of O(N^2) compute, large KV-caches, and irregular memory access makes Transformers fundamentally GPU-dependent. Their patterns are poorly suited for CPUs, NPUs, and embedded processors.
FluidLM asks: what if we replaced the attention matrix entirely with a different mathematical object?
2. The FluidLM Approach
Two complementary propagation mechanisms replace global attention:
Local diffusion -- the multi-scale Laplacian propagates information from neighbor to neighbor at three spatial scales (dilations 1, 4, 16). Each token influences its immediate neighborhood; patterns emerge globally through repeated application.
Selective State Space (Mamba) -- V4.5.0 replaces the fixed CausalLongConv filter with a Mamba-style selective SSM. This is the key architectural upgrade: while the Laplacian provides position-based spatial mixing, the SSM provides content-based temporal routing. The SSM's input-dependent matrices (A, B, C) allow the model to selectively retain or discard information based on what the token contains -- the missing capability that caused the previous loss plateau at ~6.0.
Memory pump -- a global reservoir h of shape (B, D) that accumulates a sequence summary at each integration step, then broadcasts it uniformly to all positions. The Forget Gate (decay = sigmoid(decay_param)) introduces learned viscosity: the model learns what persists in the reservoir and what dissipates. The gate is h-aware (sigmoid(Wx + Uh)) to avoid accumulating what is already present.
Laplacian smoothness regularizer (V4.4.8, ported from FluidWorld) -- a grad_loss term computed on the final hidden representation penalizes high-frequency noise along the sequence dimension. This acts as an implicit spatial regularizer that improves autoregressive generation stability.
3. Mathematical Foundations
The Standard Transformer (what we replace)
$$\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{Q \cdot K^\top}{\sqrt{d_k}}\right) \cdot V$$
The product $Q \cdot K^\top$ produces the $N \times N$ attention matrix -- the source of $O(N^2)$ complexity.
The FluidLM Governing Equation
$$\frac{\partial u}{\partial t} = \underbrace{\sum_{k} D_k \cdot \nabla^2_{d_k}(u)}_{\text{local diffusion}} + \underbrace{\text{SSM}(u)}_{\text{content-based routing}} + \underbrace{R(u, \theta)}_{\text{reaction (SwiGLU)}} + \underbrace{\alpha \cdot h}_{\text{global memory}}$$
Term 1: Multi-Scale Local Diffusion
The discrete Laplacian $[1, -2, 1]$ applied at three dilation levels:
| Dilation | Reach per step | Role |
|----------|----------------|------|
| 1 | ~1 token | Local syntax, morphology |
| 4 | ~4 tokens | Phrase-level structure |
| 16 | ~16 tokens | Sentence / paragraph |
$O(N)$ per step, sequential memory access -- ideal for CPU SIMD.
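As a sketch of this term, the multi-scale Laplacian can be written as a depthwise `conv1d` with the fixed kernel [1, -2, 1] applied at each dilation (function and coefficient names here are illustrative, not the repository's actual API):

```python
import torch
import torch.nn.functional as F

def multi_scale_laplacian(u, coeffs, dilations=(1, 4, 16)):
    """Sum of discrete Laplacians [1, -2, 1] at several dilation levels.

    u:      (B, L, D) hidden states
    coeffs: one diffusion coefficient D_k per scale
    Returns (B, L, D). Illustrative sketch, not the repo's exact code.
    """
    x = u.transpose(1, 2)                          # (B, D, L) for conv1d
    D = x.shape[1]
    kernel = torch.tensor([1.0, -2.0, 1.0]).view(1, 1, 3).repeat(D, 1, 1)
    out = torch.zeros_like(x)
    for d_k, dil in zip(coeffs, dilations):
        # depthwise conv (groups=D); padding=dil keeps the length at L
        out = out + d_k * F.conv1d(x, kernel, padding=dil, dilation=dil, groups=D)
    return out.transpose(1, 2)
```

On a constant signal the interior response is exactly zero (1 - 2 + 1 = 0), which is what makes the operator a pure smoothing term rather than a content transform.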
Term 2: Selective State Space (Mamba)
$$h_t = \bar{A}_t \cdot h_{t-1} + \bar{B}_t \cdot x_t$$
$$y_t = C_t \cdot h_t + D \cdot x_t$$
Where $A$, $B$, $C$ are input-dependent (computed from the current token). This is mathematically a discretized ODE -- the same family as the PDE diffusion. The SSM selectively chooses what to remember and what to forget based on content, providing the content-based routing that pure diffusion lacks.
Pure PyTorch implementation (no custom CUDA kernels). $O(N \cdot d \cdot s)$ training, $O(d \cdot s)$ constant per-token inference. No KV-cache.
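A minimal pure-PyTorch selective scan can illustrate the recurrence above. This is a didactic sketch (diagonal real A, scalar step size, illustrative layer names), not the repository's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Mamba-style selective state space layer (illustrative sketch).
    B_t, C_t and the step size dt_t are computed from the input, so the
    recurrence is content-dependent:
        h_t = exp(dt_t * A) * h_{t-1} + dt_t * B_t * x_t
        y_t = C_t . h_t + D * x_t
    """
    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.d_state = d_state
        # Fixed negative-real diagonal A for stability.
        self.log_A = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.to_dt = nn.Linear(d_model, 1)
        self.D = nn.Parameter(torch.ones(d_model))

    def forward(self, x):                          # x: (B, L, D)
        Bsz, L, Dm = x.shape
        A = -torch.exp(self.log_A)                 # (S,), strictly negative
        Bmat, Cmat = self.to_B(x), self.to_C(x)    # (B, L, S) each
        dt = F.softplus(self.to_dt(x))             # (B, L, 1), positive
        h = x.new_zeros(Bsz, Dm, self.d_state)     # per-channel state
        ys = []
        for t in range(L):                         # sequential scan: O(L)
            decay = torch.exp(dt[:, t] * A)        # (B, S)
            h = decay.unsqueeze(1) * h \
                + (dt[:, t] * Bmat[:, t]).unsqueeze(1) * x[:, t].unsqueeze(-1)
            ys.append((h * Cmat[:, t].unsqueeze(1)).sum(-1) + self.D * x[:, t])
        return torch.stack(ys, dim=1)              # (B, L, D)
```

At inference only `h` of shape (B, D, S) persists between tokens, which is the constant per-token memory claim above.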
Term 3: Reaction Function (SwiGLU)
$$R(u, \theta) = \left(W_1 \cdot u \odot \sigma(W_g \cdot u)\right) \cdot W_2$$
SwiGLU (V4.5.0) replaces the previous GELU MLP. Proven by LLaMA/PaLM to improve language modeling quality. $\frac{8}{3}$ expansion ratio.
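The reaction term can be sketched as a standard SwiGLU block; following the LLaMA formulation, $\sigma$ is taken here as SiLU/Swish, and the class name is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Reaction MLP R(u) = (W1 u ⊙ SiLU(Wg u)) W2 with 8/3 expansion.
    Illustrative sketch, not the repository's exact module."""
    def __init__(self, d_model):
        super().__init__()
        hidden = int(8 * d_model / 3)              # 8/3 expansion ratio
        self.w1 = nn.Linear(d_model, hidden, bias=False)
        self.wg = nn.Linear(d_model, hidden, bias=False)
        self.w2 = nn.Linear(hidden, d_model, bias=False)

    def forward(self, u):
        # gated nonlinearity: value path modulated by a Swish gate
        return self.w2(self.w1(u) * F.silu(self.wg(u)))
```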
Term 4: Global Memory Pump + Forget Gate
$$s = \text{mean}_{L}(R(u)) \quad \in \mathbb{R}^{B \times D}$$
$$g = \sigma(W_x \cdot \bar{u} + W_h \cdot h) \quad \in \mathbb{R}^{B \times D}$$
$$\delta = \sigma(\theta_{\text{decay}}) \quad \in (0,1)^{D} \quad \text{(learned viscosity)}$$
$$h \leftarrow \delta \odot h + g \cdot \tanh(s)$$
$h$ is of shape $(B, D)$ -- global reservoir, $O(1)$ memory. Initialized at $\text{decay} \approx 0.97$.
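The reservoir update above can be sketched as follows. Module and parameter names are illustrative; the pooled input $\bar{u}$ and summary $s$ are passed in explicitly:

```python
import torch
import torch.nn as nn

class MemoryPump(nn.Module):
    """Global reservoir h of shape (B, D) with an h-aware gate and a
    learned per-channel forget rate. Sketch of the equations above."""
    def __init__(self, d_model):
        super().__init__()
        self.Wx = nn.Linear(d_model, d_model)
        self.Wh = nn.Linear(d_model, d_model)
        # sigmoid(3.476) ≈ 0.97, matching the stated decay init
        self.decay_param = nn.Parameter(torch.full((d_model,), 3.476))

    def forward(self, u, r, h):
        """u: (B, L, D) layer input; r: (B, L, D) reaction output;
        h: (B, D) reservoir. Returns the updated reservoir."""
        s = r.mean(dim=1)                            # sequence summary
        u_bar = u.mean(dim=1)                        # pooled input
        g = torch.sigmoid(self.Wx(u_bar) + self.Wh(h))   # h-aware gate
        delta = torch.sigmoid(self.decay_param)     # learned viscosity
        return delta * h + g * torch.tanh(s)
```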
Term 5: Laplacian Grad Loss (V4.4.8)
$$\mathcal{L}_{\text{grad}} = w_g \cdot \text{mean}\!\left(\left|\nabla^2_{\text{1D}}(z_{\text{final}})\right|\right)$$
Applied to the final hidden representation $z_{\text{final}}$ before the LM head. Penalizes second-order discontinuity along the sequence, encouraging smooth latent representations that degrade gracefully during autoregressive generation.
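The regularizer reduces to a few lines; the weight `w_g` below is an illustrative hyperparameter:

```python
import torch

def laplacian_grad_loss(z, w_g=0.1):
    """Smoothness penalty on final hidden states z of shape (B, L, D):
    mean |z[t-1] - 2 z[t] + z[t+1]| along the sequence dimension.
    Sketch of the grad_loss term described above."""
    lap = z[:, :-2] - 2 * z[:, 1:-1] + z[:, 2:]   # discrete [1, -2, 1]
    return w_g * lap.abs().mean()
```

Note the penalty is zero for any representation that varies linearly along the sequence, so it suppresses high-frequency jitter without flattening genuine trends.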
Positional Encoding: Sinusoidal
Applied once at the FluidNet input (not at each integration step):
$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
Sinusoidal encoding gives a clean additive signal that propagates naturally through the PDE.
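This is the standard sinusoidal table; a sketch (assuming even `d`, function name illustrative):

```python
import torch

def sinusoidal_pe(L, d):
    """Sinusoidal positional encoding of shape (L, d), added once at the
    FluidNet input rather than at each integration step."""
    pos = torch.arange(L).unsqueeze(1).float()         # (L, 1)
    i = torch.arange(0, d, 2).float()                  # even indices
    angles = pos / (10000 ** (i / d))                  # (L, d/2)
    pe = torch.zeros(L, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe
```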
Time Integration (Forward Euler)
$$u_{t+1} = \text{RMSNorm}\!\left( u_t + \Delta t \cdot \left[ \sum_k D_k \nabla^2_{d_k}(u_t) + \text{SSM}(u_t) + R(u_t,\theta) + \alpha \cdot h_t \right] \right)$$
$\Delta t$ is a per-layer learned parameter (dt_gate), initialized from the config dt value.
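The integration step above can be sketched as one function that composes the four terms; the callables stand in for the layer's sub-modules and are illustrative names:

```python
import torch

def euler_step(u, h, dt, diffusion, ssm, reaction, alpha, rmsnorm):
    """One forward-Euler step of the governing equation:
        u <- RMSNorm(u + dt * (diffusion(u) + ssm(u) + reaction(u) + alpha * h))
    u: (B, L, D), h: (B, D) global reservoir broadcast to all positions.
    Sketch only; the real layer wires in its own learned dt_gate and alpha.
    """
    du = diffusion(u) + ssm(u) + reaction(u) + alpha * h.unsqueeze(1)
    return rmsnorm(u + dt * du)
```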
Turing Equilibrium and Adaptive Computation
$$\tau = \text{mean}\left(\frac{|u_{t+1} - u_t|}{|u_t| + \varepsilon}\right) \xrightarrow{< \varepsilon} \text{HALT}$$
During inference, the model halts early when the fluid stabilizes. During training, turbulence contributes to a differentiable regularization loss.
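The halting statistic is a one-liner; the tolerance below is an illustrative value, not the project's configured threshold:

```python
import torch

def turbulence(u_next, u_prev, eps=1e-6):
    """Mean relative change between successive fluid states (tau above)."""
    rel = (u_next - u_prev).norm(dim=-1) / (u_prev.norm(dim=-1) + eps)
    return rel.mean()

def should_halt(u_next, u_prev, tol=1e-3):
    """Inference-time early exit: stop integrating once the fluid settles."""
    return turbulence(u_next, u_prev).item() < tol
```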
4. Architecture V4.5.0
```
Input tokens
      |
      v
[Embedding]        (d_model = 512)
      |
      v
[Sinusoidal PE]    (0 params, applied once)
      |
      v
[FluidLayer 0]  -- Diffusion (Laplacian x3 scales)
                -- Selective SSM (Mamba, content-based)
                -- Reaction SwiGLU (8/3 expansion)
                -- Local Memory (causal avg pool + projection)
                -- Memory Pump h (B,D) + Forget Gate
                (x T steps, T <= 8)
      |
[FluidLayer 1..3]  (same structure, independent params)
      |
      v
[RMSNorm]
      |
      v
[Linear Head]      (weight-tied with Embedding)
      |
      v
Logits (vocab: 50,257)
```
Parameter Count (d_model=512, 4 layers)
| Component | Parameters | Notes |
|-----------|-----------|-------|
| Embedding (weight-tied with head) | ~25.7M | Counts once |
| Sinusoidal PE | 0 | Buffer |
| Reaction SwiGLU 8/3 (x4 layers) | ~10.7M | Replaces GELU MLP |
| Selective SSM / Mamba (x4 layers) | ~3.4M | Replaces CausalLongConv |
| Memory Gate x + h (x4 layers) | ~2.1M | h-aware, (B,D) |
| Multi-Head LongConv (x4 layers) | ~1.2M | K=33/65/129/257 |
| Local Memory proj (x4 layers) | ~1.05M | Causal avg pool |
| Diffusion coefficients (x4 layers) | ~6K | |
| RMSNorm + dt_gate + alpha (x4 layers) | ~8K | |
| Total V4.5.0 | ~44.2M | |
5. Implemented Features
| Feature | Status | Notes |
|---------|--------|-------|
| Multi-Scale Dilated Diffusion | Done | Dilations [1, 4, 16], learnable coefficients |
| Selective SSM (Mamba) | Done | Content-based routing, pure PyTorch, V4.5.0 |
| Multi-Head LongConv (K=33/65/129/257) | Done | 4
