<h1 align="center">mosaicmem</h1> Hybrid Spatial Memory for Video World Models Rust implementation of <a href="https://arxiv.org/abs/2603.17117">MosaicMem</a> — a synthetic scaffold for the paper-faithful system, with an explicit path toward checkpoint-backed real execution once a real backend is configured. <a href="https://crates.io/crates/mosaicmem"><img src="https://img.shields.io/crates/v/mosaicmem.svg" alt="crates.io"></a> <a href="https://docs.rs/mosaicmem"><img src="https://docs.rs/mosaicmem/badge.svg" alt="docs.rs"></a> <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="MIT"></a>

The Problem

Standard video diffusion models forget. Move the camera forward, turn around, and the scene behind you is gone. The model hallucinates a new one. MosaicMem fixes this by maintaining a 3D patch memory that tracks what has been generated, where it was in the world, and how to retrieve it when the camera returns.

Architecture

Trajectory ──▶ Keyframes ──▶ Depth ──▶ Fusion
[SE3 poses]                               │
                                          ▼
                                    Patch Memory
                                      (KD-tree)
                                          │
   Target Pose ──▶ Frustum Cull ──▶ Retrieve Top-K
                                          │
                 ┌────────────────────────┼────────────────────────┐
                 │                        │                        │
                 ▼                        ▼                        ▼
           Warped RoPE              Warped Latent            Coverage Mask
                 │                        │                        │
                 └────────────────────────┼────────────────────────┘
                                          │
                                          ▼
                  DiT Denoising (cross-attn + PRoPE)
                                          │
                                          ▼
                                     VAE Decode ──▶ Frames

TUI Showcase

<table> <tr> <td width="50%" valign="top"> <img src="docs/assets/img/tui-1.png" alt="Pipeline Demo tab" /> Pipeline Demo: run the end-to-end synthetic rollout and inspect generated artifacts from the terminal. </td> <td width="50%" valign="top"> <img src="docs/assets/img/tui-2.png" alt="Coverage tab" /> Coverage: visualize retrieval footprint and how much of the current view is grounded by memory. </td> </tr> <tr> <td width="50%" valign="top"> <img src="docs/assets/img/tui-3.png" alt="Benchmark tab" /> Benchmark: inspect latency, throughput, and per-iteration pipeline statistics in one screen. </td> <td width="50%" valign="top"> <img src="docs/assets/img/tui-4.png" alt="Memory Ops tab" /> Memory Ops: exercise splice, erase, translate, and other patch-space manipulations interactively. </td> </tr> </table>

Quick Start

cargo add mosaicmem

Or from source:

git clone https://github.com/AbdelStark/mosaicmem.git
cd mosaicmem
cargo test
cargo run -- demo --num-frames 16 --width 64 --height 64 --steps 5

The demo writes synthetic frames, a point cloud, a trajectory file, and a serialized memory store to demo_output/.

Backend Modes

mosaicmem now exposes explicit backend modes in both config and CLI output:

synthetic is the default fast-test path used throughout the current demo, benchmark, inspection, and TUI flows.
real is reserved for checkpoint-backed execution and requires both the real-backend Cargo feature and a checkpoint path in PipelineConfig.

Every CLI command prints its active backend as [synthetic] or [real] so synthetic scaffold runs are not mistaken for full paper reproduction.

Core Concepts

1. Camera Trajectory

Everything starts with a sequence of SE3 camera poses. Each pose encodes where the camera is and which direction it faces, using a rigid-body transformation (rotation + translation) in world coordinates.

use mosaicmem::camera::{CameraPose, CameraTrajectory};
use nalgebra::{UnitQuaternion, Vector3};

// Build a trajectory: camera moves 0.5 units along X per frame
let poses: Vec<CameraPose> = (0..32)
    .map(|i| {
        CameraPose::from_translation_rotation(
            i as f64 * 0.033,  // timestamp (seconds)
            Vector3::new(i as f32 * 0.5, 0.0, 0.0),
            UnitQuaternion::identity(),
        )
    })
    .collect();

let trajectory = CameraTrajectory::new(poses);

Trajectories can be saved/loaded as JSON, enabling offline trajectory design and replay.

2. Depth Estimation & 3D Lifting

Keyframes are lifted into 3D using monocular depth. The DepthEstimator trait abstracts over backends — plug in your own ONNX model or use the included synthetic estimator for testing.

use mosaicmem::geometry::depth::{DepthEstimator, SyntheticDepthEstimator};

let depth = SyntheticDepthEstimator::new(5.0, 1.0); // base_depth, noise_scale

Depth maps are unprojected through camera intrinsics into world-space 3D points, then incrementally fused via StreamingFusion with voxel-based deduplication and a KD-tree for spatial queries.

3. Patch Memory Store

The core data structure. Stores 3D patches — each carrying a latent vector (from VAE encoding), its world-space center, source camera pose, and provenance metadata.

use mosaicmem::memory::store::{MosaicMemoryStore, MemoryConfig};

let config = MemoryConfig {
    max_patches: 4096,
    top_k: 64,
    near_clip: 0.1,
    far_clip: 100.0,
    patch_size: 16,
    latent_patch_size: 2,
    ..Default::default()
};

let store = MosaicMemoryStore::new(config);

Retrieval is the key operation: given a target camera pose, the store performs frustum culling, projects 3D patch centers into the target view, scores by visibility, and returns the top-K patches. Supports:

Temporal decay — exponential half-life downweights stale patches
Diversity filtering — penalizes spatially redundant patches in the target 2D view
Budget enforcement — evicts lowest-scored patches when memory is full

The store is serializable to JSON for inspection and debugging.

4. Warped RoPE

Standard Rotary Position Embeddings (RoPE) assign positions on a fixed grid. Warped RoPE instead:

Takes each memory patch's 3D center
Reprojects it into the target camera's 2D coordinates
Computes a temporal offset |t_source - t_target|
Uses the warped (u, v, t) triple as RoPE positions

This ensures patches from different viewpoints and times appear at geometrically correct positions in the attention computation, rather than arbitrary grid slots.

5. PRoPE (Projective Rotary Position Embedding)

Camera-dependent positional encoding that goes beyond Plucker coordinates. PRoPE derives multiplicative rotation parameters from the camera's full projection geometry (intrinsics + extrinsics), giving the transformer direct access to the projective structure of each frame.

6. Warped Latent Alignment

Retrieved memory patches are geometrically aligned to the target view in feature space. A planar homography is computed from source-to-target camera geometry and applied to warp the patch's latent tensor, so the diffusion backbone sees memory features in the correct spatial position.

7. Memory Cross-Attention

Multi-head cross-attention where:

Queries = current denoising tokens
Keys/Values = retrieved memory patches + rasterized latent canvas

Keys are enhanced with Warped RoPE before the attention dot product. A learnable gate controls how much memory signal reaches the denoising stream, preventing catastrophic overwriting of novel content.

8. Autoregressive Rollout

Long videos are generated in overlapping windows. Each window:

Slices the trajectory into window_size poses
Retrieves the memory mosaic from the store
Runs the full DiT denoising loop (with memory conditioning)
Decodes latents through the VAE
Linearly blends overlap frames with the previous window
Inserts new keyframes back into memory

The stride between windows is window_size - overlap, giving smooth transitions without discontinuities.

Library API

use mosaicmem::camera::{CameraPose, CameraTrajectory};
use mosaicmem::diffusion::backbone::SyntheticBackbone;
use mosaicmem::diffusion::scheduler::DDPMScheduler;
use mosaicmem::diffusion::vae::SyntheticVAE;
use mosaicmem::geometry::depth::SyntheticDepthEstimator;
use mosaicmem::pipeline::autoregressive::AutoregressivePipeline;
use mosaicmem::pipeline::config::PipelineConfig;

let config = PipelineConfig {
    width: 256,
    height: 256,
    window_size: 16,
    window_overlap: 4,
    num_inference_steps: 50,
    retrieval_top_k: 64,
    diversity_radius: 32.0,
    temporal_decay_half_life: 5.0,
    enable_warped_latent: true,
    ..Default::default()
};

let trajectory = CameraTrajectory::new(/* your poses */);
let backbone = SyntheticBackbone::new(0.1);
let scheduler = DDPMScheduler::linear(1000, 1e-4, 0.02);
let vae = SyntheticVAE::new(8, 4, 16);  // downsample, latent_ch, hidden
let depth = SyntheticDepthEstimator::new(5.0, 1.0);
let text_embedding = vec![vec![0.0f32; 64]; 10];

let mut pipeline = AutoregressivePipeline::new(config);
let (frames, shapes) = pipeline.generate(
    &trajectory, &text_embedding,
    &backbone, &scheduler, &vae, &depth,
    None,  // optional per-window callback
)?;
// frames: Vec<f32> in [B, C, T, H, W] layout per window
// shapes: Vec<[usize; 5]> — one shape per window

All model-dependent components are behind traits (DiffusionBackbone, NoiseScheduler, VAE, `Dept

Mosaicmem

Install / Use

README