Milimo Video
<div align="center">
The AI-Native Cinematic Studio
</div>

Milimo Video is a state-of-the-art, open-source AI video production studio designed for filmmakers. It unifies the world's best foundation models into a cohesive, professional workflow — running entirely local-first on your own machine.
Unlike simple "prompt-to-video" interfaces, Milimo is a full Non-Linear Editor (NLE) that combines:
- LTX-2 19B — Dual-stream transformer for cinematic video generation (text-to-video, image-to-video, keyframe interpolation).
- Flux 2 Klein 9B — High-fidelity image synthesis with IP-Adapter reference conditioning and RePaint inpainting.
- SAM 3 — Text-prompted segmentation, click-to-segment, and video object tracking via standalone microservice.
- Gemma 3 — Intelligent prompt enhancement and narrative direction.
✨ Key Features
🎬 Visual Conditioning & Character Consistency
Achieve what standard models can't: persistent identities across shots.
- IP-Adapter Integration: Flux 2's IP-Adapter (CLIP ViT-L → 4-token projection) injects visual style and character identity directly into the generation latent space.
- Reference Conditioning: Native AE encodes reference images with temporal offsets, concatenating them to the denoising input for style-faithful generation.
- Story Elements: Define reusable Characters, Locations, and Objects with trigger words (e.g., `@Hero`). The system auto-detects triggers in prompts and injects the correct IP-Adapter images and enriched text (a sketch of this flow follows the list).
- Projected Latents: Support for projecting reference images into LTX-2's latent space for seamless Image-to-Video transitions.
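To make the trigger flow concrete, here is a minimal sketch of `@Element` detection and injection. All names in it (`StoryElement`, `detect_triggers`, the registry shape) are illustrative assumptions rather than Milimo's actual API; only the behavior it demonstrates comes from the feature description above.

```python
import re
from dataclasses import dataclass, field

@dataclass
class StoryElement:
    """Hypothetical reusable element: a Character, Location, or Object."""
    trigger: str                 # e.g. "@Hero"
    description: str             # enriched text substituted into the prompt
    reference_images: list[str] = field(default_factory=list)  # IP-Adapter refs

def detect_triggers(prompt: str, registry: dict[str, StoryElement]) -> tuple[str, list[str]]:
    """Scan a prompt for @Element triggers, expand them to enriched text,
    and collect the reference images to hand to the IP-Adapter."""
    images: list[str] = []

    def expand(match: re.Match) -> str:
        element = registry.get(match.group(0))
        if element is None:
            return match.group(0)          # unknown trigger: leave untouched
        images.extend(element.reference_images)
        return element.description

    enriched = re.sub(r"@\w+", expand, prompt)
    return enriched, images

registry = {"@Hero": StoryElement("@Hero", "a weathered astronaut in a red suit", ["hero_ref.png"])}
prompt, refs = detect_triggers("@Hero walks across the dunes at dusk", registry)
# prompt -> "a weathered astronaut in a red suit walks across the dunes at dusk"
# refs   -> ["hero_ref.png"], injected as IP-Adapter visual conditioning
```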

📝 Storyboard Engine
Transform screenplays into video productions instantly.
- Script-to-Video: Paste standard screenplay text — Milimo parses it into Scenes and Shots via a regex-based `ScriptParser` (see the sketch after this list).
- Auto-Injection: `ElementManager` scans for `@Element` references and injects visual/textual conditioning plus narrative context (action, dialogue, character).
- Chained Generation: Shots exceeding 121 frames auto-trigger Quantum Alignment — autoregressive chunk-by-chunk generation with latent handoffs aligned to the 8-pixel VAE grid, ensuring seamless visual continuity.
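The screenplay parsing step can be pictured as follows. This is a minimal sketch of the regex-based idea (sluglines open a Scene, action lines become Shots); the function and regex are illustrative, not the actual `ScriptParser`.

```python
import re

# Standard screenplay scene headings ("sluglines"), e.g. "INT. COCKPIT - NIGHT"
SCENE_HEADING = re.compile(r"^(INT\.|EXT\.)\s+(.+)$")

def parse_script(text: str) -> list[dict]:
    """Split screenplay text into scenes; each non-heading line becomes a shot."""
    scenes: list[dict] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = SCENE_HEADING.match(line)
        if heading:
            scenes.append({"heading": heading.group(2), "shots": []})
        elif scenes:
            scenes[-1]["shots"].append({"action": line})
    return scenes

script = """INT. COCKPIT - NIGHT
@Hero grips the controls as alarms flash.

EXT. DESERT - DAWN
The ship rests half-buried in sand."""
for scene in parse_script(script):
    print(scene["heading"], "-", len(scene["shots"]), "shot(s)")
```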
🎞️ Professional Non-Linear Editor (NLE)
A fully functional timeline built for the AI workflow.
- Multi-Track Editing: 3 tracks — V1 (magnetic main), V2 (overlay, free placement), A1 (audio).
- Smart Continue: Autoregressive video chaining via `StoryboardManager`, with last-frame extraction and overlap trimming (see the FFmpeg sketch below).
- Precision Control: Frame-accurate seeking, scrubbing, trimming (`trimIn`/`trimOut`), and `snapEngine` snapping.
- CSS-Based Timeline: GPU-accelerated clip positioning (`translateX`), granular Zustand selectors, and `useShallow` for 60fps UI responsiveness.
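As a concrete picture of the Smart Continue handoff, the sketch below extracts a clip's final frame with FFmpeg so the next chunk can be conditioned on it. The function name and flags illustrate the approach; they are not Milimo's `StoryboardManager` internals.

```python
import subprocess
from pathlib import Path

def extract_last_frame(video: Path, out_png: Path) -> Path:
    """Grab the final frame of a clip; "-sseof -0.05" seeks a fraction of a
    second before the end of the stream, then one frame is written out."""
    subprocess.run(
        ["ffmpeg", "-y", "-sseof", "-0.05", "-i", str(video),
         "-frames:v", "1", str(out_png)],
        check=True, capture_output=True,
    )
    return out_png

# The extracted frame becomes the image conditioning for the next
# image-to-video chunk; overlapping frames are trimmed at concat time.
last = extract_last_frame(Path("shot_001.mp4"), Path("shot_001_last.png"))
```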

✂️ In-Painting & Intelligent Editing
Professional-grade retouching powered by the SAM 3 → Flux 2 pipeline.
- Flux RePaint Inpainting: Select any frame, mask an area, and use natural language to add or remove elements. Uses iterative mask-blended denoising with real-time SSE progress. Inpaint jobs are persisted to the database for reliable status polling via `/status/{job_id}` (a polling sketch follows this list).
- SAM 3 Text-Prompted Segmentation: Describe what to segment ("a person", "the sky") — SAM 3 finds all matching objects with bounding boxes and confidence scores.
- Click-to-Segment: Click on objects in the video frame for instant SAM 3-powered mask generation. No manual rotoscoping.
- Video Object Tracking: Select an object on one frame → SAM 3 tracks it across every frame of the video, bidirectionally. Full UI via `TrackingPanel` with session lifecycle management (start → prompt → propagate → navigate results).
- AE Hot-Swap: Toggle between the native AutoEncoder (supports reference conditioning) and the diffusers AE fallback via the `enable_ae` flag.
- True CFG Mode: Optional double-pass negative prompting for Flux 2 (2× inference time, disabled by default).
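For the job-status flow, a minimal polling client might look like the sketch below. Only the `/status/{job_id}` route comes from the list above; the backend address and the payload keys (`state`) are assumptions.

```python
import time
import requests

BACKEND = "http://localhost:8000"   # assumed FastAPI backend address

def poll_inpaint_job(job_id: str, interval: float = 2.0) -> dict:
    """Poll the persisted inpaint job until it reaches a terminal state."""
    while True:
        resp = requests.get(f"{BACKEND}/status/{job_id}", timeout=10)
        resp.raise_for_status()
        job = resp.json()
        if job.get("state") in ("completed", "failed"):   # assumed key/values
            return job
        time.sleep(interval)

result = poll_inpaint_job("a1b2c3")
print(result)
```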

🧠 Advanced Generation
- Dual-Stage Pipeline: LTX-2 generates at half-res, then spatially upsamples 2× with distilled LoRA-384.
- 3 Pipeline Modes: `ti2vid` (text/image-to-video), `ic_lora` (IC-LoRA conditioning), `keyframe` (keyframe interpolation).
- Single-Frame Shortcut: When `num_frames == 1`, video generation silently delegates to Flux 2 for instant image generation.
- Real-Time Progress: SSE (Server-Sent Events) streams denoising step progress, ETA, and enhanced prompts to the UI in real time (see the consumer sketch below).
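Consuming that SSE stream from a script is straightforward; the sketch below parses `data:` lines into progress events. The endpoint path and event fields are assumptions; only the use of SSE for step progress, ETA, and enhanced prompts is documented.

```python
import json
import requests

def stream_progress(url: str):
    """Yield parsed JSON payloads from an SSE stream's "data:" lines."""
    with requests.get(url, stream=True, timeout=None) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if line and line.startswith("data:"):
                yield json.loads(line[len("data:"):].strip())

# Hypothetical generation-progress endpoint:
for event in stream_progress("http://localhost:8000/generate/stream"):
    print(f"step {event.get('step')}, eta {event.get('eta')}s")
```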
🛠️ Architecture

Milimo Video is built on a modern, robust stack:
| Layer | Technology |
|---|---|
| Frontend | React 18, TypeScript, Vite, Zustand (7-slice store + persist + zundo undo/redo) |
| Backend | FastAPI (Python 3.10+), SQLModel/SQLAlchemy (SQLite), SSE via sse-starlette |
| Video AI | LTX-2 19B Dual-Stream Transformer — 3 pipelines + chained generation |
| Image AI | Flux 2 Klein 9B (FluxInpainter) — IP-Adapter, True CFG, RePaint inpainting |
| Segmentation | SAM 3 Microservice (port 8001) — Sam3Processor (text/box), inst_interactive_predictor (click), Sam3VideoPredictor (tracking, MPS/CUDA/CPU) |
| Prompt AI | Gemma 3 (via LTX-2 text encoder) — cinematic prompt enhancement |
| Processing | FFmpeg — thumbnails, frame extraction, overlap trimming, concat |
MPS-First Optimization
Designed Apple Silicon-first, with CUDA as the other primary target:
- FP8 on CUDA, float32 fallback on MPS
- VAE decode CPU-offloaded on MPS (prevents black output)
- Transformer forced to float32 dtype on MPS
- Memory managed via `gc.collect()` + `torch.mps.empty_cache()`
- SAM 3 Video Predictor: device auto-detection (CUDA → MPS → CPU) with guarded `torch.cuda.*` calls (sketched below)
- `PYTORCH_ENABLE_MPS_FALLBACK=1` for SAM 3 ops not yet on MPS
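The guarded-device pattern above boils down to a few lines; this sketch mirrors the CUDA → MPS → CPU fallback and the `gc.collect()` + cache-clearing discipline (function names are illustrative).

```python
import gc
import torch

def pick_device() -> torch.device:
    """Auto-detect the best backend in CUDA -> MPS -> CPU order."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def release_memory(device: torch.device) -> None:
    """Guarded cache clearing: backend-specific calls only run on the
    backend that is actually active, so torch.cuda.* never fires on MPS."""
    gc.collect()
    if device.type == "cuda":
        torch.cuda.empty_cache()
    elif device.type == "mps":
        torch.mps.empty_cache()

device = pick_device()
release_memory(device)
```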
Documentation
See the docs/ directory for comprehensive technical documentation:
- System Architecture — Component diagrams and subsystem interactions
- Data Models — ER diagrams, Pydantic schemas, TypeScript types
- AI Pipelines — LTX-2, Flux 2, SAM 3 pipeline analysis with flow diagrams
- Frontend State — Zustand store architecture and optimization strategies
- Execution Flow — Sequence diagrams for all generation workflows
- File Dependencies — Import graphs and module analysis
- Flux 2 Bible — Deep-dive into Flux 2 integration
- LTX-2 Bible — Deep-dive into LTX-2 integration
- SAM 3 Bible — Deep-dive into SAM 3 microservice
🚀 Getting Started
Prerequisites
- Python 3.10+
- Node.js 18+
- FFmpeg
- High-End GPU:
- NVIDIA: 16GB+ VRAM recommended.
- Apple Silicon: M1/M2/M3/M4 Max or Ultra recommended (32GB+ RAM).
1. Installation
Clone the repository:
```bash
git clone https://github.com/mainza-ai/milimovideo.git
cd milimovideo
```
2. Backend Setup
Milimo uses a specialized environment for LTX-2 and Flux.
- Create Environment:

  ```bash
  python3 -m venv milimov
  ./milimov/bin/pip install -e ./LTX-2/packages/ltx-core
  ./milimov/bin/pip install -e ./LTX-2/packages/ltx-pipelines
  ./milimov/bin/pip install -e ./flux2
  ./milimov/bin/pip install -r backend/requirements.txt
  ```
- Download LTX-2 Models: Place the following into `LTX-2/models/`:
  - LTX-2 19B Distilled → `checkpoints/`
  - LTX-2 Distilled LoRA-384 → `checkpoints/`
  - LTX-2 Spatial Upscaler → `upscalers/`
  - Gemma 3 Text Encoder → `text_encoders/gemma3/`
- Download Flux 2 Models: Place files in `backend/models/flux2/`:

  ```
  backend/models/flux2/
  ├── flux-2-klein-9b.safetensors   # Flow model (9B params)
  ├── ae.safetensors                # Native AutoEncoder (preferred)
  ├── vae/                          # Diffusers AE fallback (config.json + diffusion_pytorch_model.safetensors)
  ├── text_encoder/                 # Qwen 3 (8B) text encoder
  ├── tokenizer/                    # Qwen tokenizer files
  └── ip-adapter.safetensors        # IP-Adapter weights (CLIP ViT-L projection)
  ```
- SAM 3 Setup (Segmentation & Tracking): The SAM 3 service runs in a separate environment (`sam3_env`).
  - Create environment:

    ```bash
    conda create -n sam3_env python=3.12
    conda activate sam3_env
    pip install -e sam3
    pip install fastapi uvicorn python-multipart psutil pycocotools huggingface_hub
    ```

  - Download Model: Download the SAM 3 checkpoint from HuggingFace (auto-downloads on first start if missing). Place it in: `backend/models/sam3/sam3.pt`
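If you would rather fetch the checkpoint yourself instead of relying on the auto-download, a sketch using `huggingface_hub` follows. The repo ID and filename are placeholders (the README does not name the repository); only the destination path `backend/models/sam3/sam3.pt` is documented.

```python
import shutil
from pathlib import Path
from huggingface_hub import hf_hub_download

REPO_ID = "your-org/sam3"    # placeholder: substitute the actual SAM 3 repo
FILENAME = "sam3.pt"         # placeholder filename
DEST = Path("backend/models/sam3/sam3.pt")

cached = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
DEST.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(cached, DEST)    # place the checkpoint where the service expects it
print(f"SAM 3 checkpoint ready at {DEST}")
```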
3. Running the Studio
**1. Start the Backend**