FluidLM

<img width="1950" height="674" alt="fluidlm_banner" src="https://github.com/user-attachments/assets/d17ee5df-5c7d-423c-a276-1ff4528b9614" />


A Transformer-free language model replacing O(N^2) self-attention with reaction-diffusion PDEs -- achieving O(N) scaling, adaptive computation, and no KV-cache.

Early-stage proof of concept. The goal is to demonstrate that the mathematical framework is sound and the core mechanisms work -- not to compete with production LLMs.


What is FluidLM?

FluidLM replaces the self-attention mechanism with a system of continuous partial differential equations (PDEs). Inspired by Alan Turing's 1952 work on morphogenesis, it treats tokens as chemical concentrations that diffuse and react in a latent space.

Instead of every token explicitly "looking at" every other token through an N x N matrix, information propagates through:

  • Local diffusion -- a multi-scale Laplacian (dilations 1, 4, 16) that spreads information between neighbors, like heat through a medium.
  • Selective State Space (Mamba) -- a content-based temporal routing mechanism that replaces the previous fixed CausalLongConv. Each token selectively retains or forgets information based on its content, providing the content-aware long-range mixing that pure diffusion lacks.
  • Reaction MLP (SwiGLU) -- the nonlinear component where all semantic capacity is concentrated.
  • Global memory pump -- a reservoir h of shape (B, D) with learned forget gate that accumulates a sequence summary.

This shift from "every token talks to every token" to "information flows like a fluid" eliminates quadratic complexity.


Table of Contents

  1. Motivation
  2. The FluidLM Approach
  3. Mathematical Foundations
  4. Architecture V4.5.0
  5. Implemented Features
  6. Architecture Comparison
  7. Research Dashboard
  8. Experimental Status
  9. Research Roadmap
  10. Getting Started
  11. Version History
  12. References

1. Motivation

The O(N^2) Attention Wall

Self-attention computes pairwise interactions between every token. For a sequence of N tokens, this produces an N x N matrix -- O(N^2) computational and memory complexity. Doubling context quadruples cost.
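The quadratic growth can be checked with a one-line count of attention-score entries; the context lengths below are illustrative, not taken from the repository:

```python
def attention_entries(n):
    # One pairwise score per token pair: the N x N matrix behind O(N^2).
    return n * n

small = attention_entries(2048)
large = attention_entries(4096)   # context doubled
# large is exactly 4x small: doubling context quadruples the score count.
```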

Static Computation

A Transformer applies exactly L layers to every input, whether it is processing "Hello" or a complex mathematical proof.

The KV-Cache Memory Wall

During inference, Transformers store Key and Value matrices for every past token in every layer. This cache grows with sequence length x layer count -- tens of gigabytes for long-context models.
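A back-of-the-envelope sketch of that growth, using hypothetical 7B-class dimensions (32 layers, d_model = 4096, fp16) rather than figures from any specific model:

```python
def kv_cache_bytes(n_layers, n_tokens, d_model, bytes_per_elem=2):
    # Keys and Values: 2 tensors per layer, each (n_tokens, d_model), fp16.
    return 2 * n_layers * n_tokens * d_model * bytes_per_elem

# Hypothetical 32-layer model, d_model=4096, 128k-token context, fp16.
gib = kv_cache_bytes(32, 128_000, 4096) / 2**30   # ~62.5 GiB
```

At 128k tokens this single cache already exceeds the memory of most consumer GPUs.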

GPU Dependency

The combination of O(N^2) compute, large KV-caches, and irregular memory access makes Transformers fundamentally GPU-dependent. Their patterns are poorly suited for CPUs, NPUs, and embedded processors.

FluidLM asks: what if we replaced the attention matrix entirely with a different mathematical object?


2. The FluidLM Approach

Two complementary propagation mechanisms replace global attention:

Local diffusion -- the multi-scale Laplacian propagates information from neighbor to neighbor at three spatial scales (dilations 1, 4, 16). Each token influences its immediate neighborhood; patterns emerge globally through repeated application.

Selective State Space (Mamba) -- V4.5.0 replaces the fixed CausalLongConv filter with a Mamba-style selective SSM. This is the key architectural upgrade: while the Laplacian provides position-based spatial mixing, the SSM provides content-based temporal routing. The SSM's input-dependent matrices (A, B, C) allow the model to selectively retain or discard information based on what the token contains -- the missing capability that caused the previous loss plateau at ~6.0.

Memory pump -- a global reservoir h of shape (B, D) that accumulates a sequence summary at each integration step, then broadcasts it uniformly to all positions. The Forget Gate (decay = sigmoid(decay_param)) introduces learned viscosity: the model learns what persists in the reservoir and what dissipates. The gate is h-aware (sigmoid(Wx + Uh)) to avoid accumulating what is already present.

Laplacian smoothness regularizer (V4.4.8, ported from FluidWorld) -- a grad_loss term computed on the final hidden representation penalizes high-frequency noise along the sequence dimension. This acts as an implicit spatial regularizer that improves autoregressive generation stability.


3. Mathematical Foundations

The Standard Transformer (what we replace)

$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{Q \cdot K^\top}{\sqrt{d_k}}\right) \cdot V$$

The product $Q \cdot K^\top$ produces the $N \times N$ attention matrix -- the source of $O(N^2)$ complexity.

The FluidLM Governing Equation

$$\frac{\partial u}{\partial t} = \underbrace{\sum_{k} D_k \cdot \nabla^2_{d_k}(u)}_{\text{local diffusion}} + \underbrace{\text{SSM}(u)}_{\text{content-based routing}} + \underbrace{R(u, \theta)}_{\text{reaction (SwiGLU)}} + \underbrace{\alpha \cdot h}_{\text{global memory}}$$

Term 1: Multi-Scale Local Diffusion

The discrete Laplacian $[1, -2, 1]$ applied at three dilation levels:

| Dilation | Reach per step | Role |
|----------|----------------|------|
| 1 | ~1 token | Local syntax, morphology |
| 4 | ~4 tokens | Phrase-level structure |
| 16 | ~16 tokens | Sentence / paragraph |

$O(N)$ per step, sequential memory access -- ideal for CPU SIMD.
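A minimal pure-Python sketch of the operator (one channel, zero padding at the boundaries); the coefficient values are illustrative, not the repository's learned ones:

```python
def dilated_laplacian(u, dilation):
    """Discrete 1-D Laplacian [1, -2, 1] applied at a given dilation.
    u is a list of floats; out-of-range neighbors are treated as zero."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - dilation] if i - dilation >= 0 else 0.0
        right = u[i + dilation] if i + dilation < n else 0.0
        out.append(left - 2.0 * u[i] + right)
    return out

def multi_scale_diffusion(u, coeffs=None):
    """Sum of dilated Laplacians at scales 1/4/16, weighted by per-scale
    coefficients D_k (hypothetical values here)."""
    if coeffs is None:
        coeffs = {1: 0.1, 4: 0.05, 16: 0.01}
    acc = [0.0] * len(u)
    for d, Dk in coeffs.items():
        lap = dilated_laplacian(u, d)
        acc = [a + Dk * l for a, l in zip(acc, lap)]
    return acc
```

Note that a constant or linear field has zero Laplacian in the interior, so diffusion only acts where the signal actually varies.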

Term 2: Selective State Space (Mamba)

$$h_t = \bar{A}_t \cdot h_{t-1} + \bar{B}_t \cdot x_t$$

$$y_t = C_t \cdot h_t + D \cdot x_t$$

Where $A$, $B$, $C$ are input-dependent (computed from the current token). This is mathematically a discretized ODE -- the same family as the PDE diffusion. The SSM selectively chooses what to remember and what to forget based on content, providing the content-based routing that pure diffusion lacks.

Pure PyTorch implementation (no custom CUDA kernels). $O(N \cdot d \cdot s)$ training, $O(d \cdot s)$ constant per-token inference. No KV-cache.
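To make the recurrence concrete, here is a scalar (single-channel) sketch of one selective-SSM step in plain Python. The weights and the softplus discretization are illustrative simplifications, not the repository's implementation; the point is that $A$, $B$, $C$ depend on the input $x$:

```python
import math

def selective_ssm_step(h, x, w_dt, w_b, w_c, a=-1.0, d=1.0):
    """One per-token step of a scalar selective state space (Mamba-style
    sketch). The step size dt and the B/C projections are computed from
    the input x -- that input dependence is what makes the SSM selective."""
    dt = math.log1p(math.exp(w_dt * x))   # softplus: input-dependent step size
    a_bar = math.exp(dt * a)              # discretized state decay, in (0, 1)
    b_bar = dt * (w_b * x)                # input-dependent write strength
    h = a_bar * h + b_bar * x             # state update: remember or forget
    y = (w_c * x) * h + d * x             # input-dependent readout + skip
    return h, y

# Constant O(1) state per token -- no KV-cache: scan a sequence.
h, ys = 0.0, []
for x in [0.5, -1.0, 2.0, 0.3]:
    h, y = selective_ssm_step(h, x, w_dt=1.0, w_b=1.0, w_c=1.0)
    ys.append(y)
```

Per-token cost is constant regardless of how many tokens came before, which is the inference-time advantage over attention.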

Term 3: Reaction Function (SwiGLU)

$$R(u, \theta) = \left(W_1 \cdot u \odot \text{SiLU}(W_g \cdot u)\right) \cdot W_2, \qquad \text{SiLU}(x) = x \cdot \sigma(x)$$

SwiGLU (V4.5.0) replaces the previous GELU MLP. Proven by LLaMA/PaLM to improve language modeling quality. $\frac{8}{3}$ expansion ratio.
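A dependency-free sketch of the gating structure on plain lists (toy weight matrices, no expansion ratio shown):

```python
import math

def swiglu(u, W1, Wg, W2):
    """SwiGLU reaction: (W1·u ⊙ SiLU(Wg·u)) · W2 on plain Python lists.
    W1/Wg project up, the SiLU-gated product is the nonlinearity,
    W2 projects back down."""
    def matvec(W, v):
        return [sum(w * x for w, x in zip(row, v)) for row in W]
    def silu(x):
        return x / (1.0 + math.exp(-x))   # x * sigmoid(x)
    up = matvec(W1, u)
    gate = [silu(g) for g in matvec(Wg, u)]
    hidden = [a * b for a, b in zip(up, gate)]
    return matvec(W2, hidden)
```

The gate can drive a channel to zero regardless of the up-projection, which is what gives the gated MLP its extra expressivity over a plain GELU MLP.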

Term 4: Global Memory Pump + Forget Gate

$$s = \text{mean}_{L}(R(u)) \quad \in \mathbb{R}^{B \times D}$$

$$g = \sigma(W_x \cdot \bar{u} + U_h \cdot h) \quad \in \mathbb{R}^{B \times D}, \qquad \bar{u} = \text{mean}_{L}(u)$$

$$\delta = \sigma(\theta_{\text{decay}}) \quad \in (0,1)^{D} \quad \text{(learned viscosity)}$$

$$h \leftarrow \delta \odot h + g \odot \tanh(s)$$

$h$ is of shape $(B, D)$ -- global reservoir, $O(1)$ memory. Initialized at $\text{decay} \approx 0.97$.
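A per-feature scalar sketch of the update (weights are placeholders; the real gate and decay are learned per channel):

```python
import math

def memory_pump_step(h, u_mean, r_mean, Wx, Uh, decay_param):
    """Gated global-reservoir update, one feature at a time:
      g     = sigmoid(Wx * u_mean + Uh * h)   # h-aware write gate
      delta = sigmoid(decay_param)            # learned viscosity
      h    <- delta * h + g * tanh(r_mean)
    """
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    g = sigmoid(Wx * u_mean + Uh * h)
    delta = sigmoid(decay_param)
    return delta * h + g * math.tanh(r_mean)

# Initializing decay ~= 0.97 means decay_param = logit(0.97) ~= 3.476.
decay_param_init = math.log(0.97 / 0.03)
```

With nothing new to write (`r_mean = 0`), the reservoir simply decays by the learned viscosity each step.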

Term 5: Laplacian Grad Loss (V4.4.8)

$$\mathcal{L}_{\text{grad}} = w_g \cdot \text{mean}\left(\left|\nabla^2_{\text{1D}}(z_{\text{final}})\right|\right)$$

Applied to the final hidden representation $z_{\text{final}}$ before the LM head. Penalizes second-order discontinuity along the sequence, encouraging smooth latent representations that degrade gracefully during autoregressive generation.
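In one dimension this is just the mean absolute second difference along the sequence (single-channel sketch, interior positions only):

```python
def laplacian_grad_loss(z, weight):
    """Mean |z[i-1] - 2 z[i] + z[i+1]| over interior positions,
    scaled by the loss weight w_g."""
    diffs = [abs(z[i - 1] - 2.0 * z[i] + z[i + 1]) for i in range(1, len(z) - 1)]
    return weight * sum(diffs) / len(diffs)
```

A linear ramp incurs zero penalty; an alternating (high-frequency) signal is penalized heavily, which is exactly the smoothness bias described above.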

Positional Encoding: Sinusoidal

Applied once at the FluidNet input (not at each integration step):

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

Sinusoidal encoding gives a clean additive signal that propagates naturally through the PDE.
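The classic formula, computed for a single position:

```python
import math

def sinusoidal_pe(pos, d):
    """Sinusoidal positional encoding for one position: sin on even
    channels, cos on odd channels, geometrically spaced frequencies."""
    pe = [0.0] * d
    for i in range(0, d, 2):
        angle = pos / (10000 ** (i / d))
        pe[i] = math.sin(angle)
        if i + 1 < d:
            pe[i + 1] = math.cos(angle)
    return pe
```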

Time Integration (Forward Euler)

$$u_{t+1} = \text{RMSNorm}\left( u_t + \Delta t \cdot \left[ \sum_k D_k \nabla^2_{d_k}(u_t) + \text{SSM}(u_t) + R(u_t,\theta) + \alpha \cdot h_t \right] \right)$$

$\Delta t$ is a per-layer learned parameter (dt_gate), initialized from the config dt value.
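The update pattern itself is just forward Euler followed by normalization. A sketch with a generic right-hand side standing in for the bracketed sum (RMS normalization written out by hand, no learned gain):

```python
def euler_step(u, rhs, dt):
    """u_{t+1} = RMSNorm(u_t + dt * rhs(u_t)) for a single 1-D state.
    rhs is any callable returning the PDE right-hand side for u."""
    v = [x + dt * r for x, r in zip(u, rhs(u))]
    ms = sum(x * x for x in v) / len(v)       # mean square
    scale = (ms + 1e-8) ** -0.5               # RMS normalization
    return [x * scale for x in v]
```

Because the state is renormalized every step, the learned $\Delta t$ controls how far the fluid moves per step without risking blow-up.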

Turing Equilibrium and Adaptive Computation

$$\tau = \text{mean}\left(\frac{|u_{t+1} - u_t|}{|u_t| + \varepsilon}\right) \xrightarrow{< \varepsilon} \text{HALT}$$

During inference, the model halts early when the fluid stabilizes. During training, turbulence contributes to a differentiable regularization loss.
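The halting rule reduces to a loop with an early exit; `step_fn` below stands in for the full fluid update, and the tolerance is illustrative:

```python
def turbulence(u_prev, u_next, eps=1e-6):
    """Mean relative change between successive states."""
    rel = [abs(b - a) / (abs(a) + eps) for a, b in zip(u_prev, u_next)]
    return sum(rel) / len(rel)

def integrate_until_calm(u, step_fn, max_steps=8, tol=1e-3):
    """Apply the fluid update until the state stabilizes or the step
    budget is exhausted -- adaptive computation at inference time."""
    for t in range(max_steps):
        u_next = step_fn(u)
        if turbulence(u, u_next) < tol:
            return u_next, t + 1          # halted early: fluid is calm
        u = u_next
    return u, max_steps
```

Easy inputs settle in few steps; turbulent ones use the full budget.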


4. Architecture V4.5.0

Input tokens
    |
    v
[Embedding]  (d_model = 512)
    |
    v
[Sinusoidal PE]  (0 params, applied once)
    |
    v
[FluidLayer 0]  -- Diffusion (Laplacian x3 scales)
                 -- Selective SSM (Mamba, content-based)
                 -- Reaction SwiGLU (8/3 expansion)
                 -- Local Memory (causal avg pool + projection)
                 -- Memory Pump h (B,D) + Forget Gate (x T steps, T<=8)
    |
[FluidLayer 1..3]  (same structure, independent params)
    |
    v
[RMSNorm]
    |
    v
[Linear Head]  (weight-tied with Embedding)
    |
    v
Logits (vocab: 50,257)

Parameter Count (d_model=512, 4 layers)

| Component | Parameters | Notes |
|-----------|-----------|-------|
| Embedding (weight-tied with head) | ~25.7M | Counts once |
| Sinusoidal PE | 0 | Buffer |
| Reaction SwiGLU 8/3 (x4 layers) | ~10.7M | Replaces GELU MLP |
| Selective SSM / Mamba (x4 layers) | ~3.4M | Replaces CausalLongConv |
| Memory Gate x + h (x4 layers) | ~2.1M | h-aware, (B,D) |
| Multi-Head LongConv (x4 layers) | ~1.2M | K=33/65/129/257 |
| Local Memory proj (x4 layers) | ~1.05M | Causal avg pool |
| Diffusion coefficients (x4 layers) | ~6K | |
| RMSNorm + dt_gate + alpha (x4 layers) | ~8K | |
| Total V4.5.0 | ~44.2M | |


5. Implemented Features

| Feature | Status | Notes |
|---------|--------|-------|
| Multi-Scale Dilated Diffusion | Done | Dilations [1, 4, 16], learnable coefficients |
| Selective SSM (Mamba) | Done | Content-based routing, pure PyTorch, V4.5.0 |
| Multi-Head LongConv (K=33/65/129/257) | Done | 4 |

No findings