FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
[Project Page] | [Paper (arXiv)]
We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.
Features
- 🔄 Streaming Generation: Support for continuous motion generation with text condition changes
- 🚀 Latent Diffusion Forcing: Efficient generation using compressed latent space with diffusion
- ⚡ Real-time Capable: Optimized for streaming inference with ~50 FPS model output
Installation
Environment Setup
# Create conda environment
conda create -n motion_gen python=3.10
conda activate motion_gen
# Install PyTorch
pip install torch torchvision torchaudio
# Install dependencies
pip install -r requirements.txt
# Install Flash Attention
conda install -c nvidia cuda-toolkit
export CUDA_HOME=$CONDA_PREFIX
pip install flash-attn --no-build-isolation
Quick Inference (No Data Required)
If you only need to generate motions and don't plan to train or evaluate models, you can use our standalone model on Hugging Face: 🤗 ShandaAI/FloodDiffusion.
This version requires no dataset downloads and works out of the box for inference:
from transformers import AutoModel
# Load model
model = AutoModel.from_pretrained(
"ShandaAI/FloodDiffusion",
trust_remote_code=True
)
# Generate motion from text
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}") # (~240, 263)
# Generate as joint coordinates for visualization
motion_joints = model("a person walking forward", length=60, output_joints=True)
print(f"Generated joints: {motion_joints.shape}") # (~240, 22, 3)
# Multi-text transitions
motion = model(
text=[["walk forward", "turn around", "run back"]],
length=[120],
text_end=[[40, 80, 120]]
)
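The multi-text call above pairs each prompt with an end frame in `text_end`. A small helper like the following (illustrative only, not part of the released API) can sanity-check such a schedule before calling the model:

```python
def check_text_schedule(texts, text_end, length):
    """Validate a multi-text prompt schedule (illustrative helper,
    not part of the FloodDiffusion API).

    texts:    list of prompts, e.g. ["walk forward", "turn around"]
    text_end: end frame for each prompt, strictly increasing
    length:   total number of frames to generate
    """
    if len(texts) != len(text_end):
        raise ValueError("one end frame is needed per prompt")
    if any(b <= a for a, b in zip(text_end, text_end[1:])):
        raise ValueError("text_end must be strictly increasing")
    if text_end[-1] != length:
        raise ValueError("last end frame must equal the total length")
    # Return (prompt, start, end) segments for inspection.
    starts = [0] + list(text_end[:-1])
    return list(zip(texts, starts, text_end))

segments = check_text_schedule(
    ["walk forward", "turn around", "run back"], [40, 80, 120], 120
)
# segments -> [("walk forward", 0, 40), ("turn around", 40, 80), ("run back", 80, 120)]
```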
For detailed API documentation, see the model card.
We also provide a tiny version, 🤗 ShandaAI/FloodDiffusionTiny, which is smaller and faster:
from transformers import AutoModel
# Load model
model = AutoModel.from_pretrained(
"ShandaAI/FloodDiffusionTiny",
trust_remote_code=True
)
Note: For training, evaluation, or using the scripts in this repository, continue with the Data Preparation section below.
Data Preparation
Prepare Data from Original Sources
To reproduce our results from scratch, follow the original data preparation pipelines:
HumanML3D:
- Follow the instructions in the HumanML3D repository
- Extract 263D motion features using their processing pipeline
- Place the processed data in
raw_data/HumanML3D/
BABEL:
- Download from the BABEL website
- Process the motion sequences to extract 263D features
- For streaming generation, segment and process according to the frame-level annotations
- Place the processed data in
raw_data/BABEL_streamed/
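For the streaming split, the frame-level annotations determine where one action ends and the next begins. As a rough sketch of that segmentation step (BABEL's real annotation format is richer, JSON with per-annotator spans; here we assume a single label per frame):

```python
def frames_to_segments(frame_labels):
    """Collapse per-frame action labels into contiguous (label, start, end)
    segments, with `end` exclusive. Illustrative only: the actual BABEL
    processing pipeline differs in detail."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start, i))
            start = i
    return segments

labels = ["walk"] * 3 + ["turn"] * 2 + ["run"] * 4
# frames_to_segments(labels) -> [("walk", 0, 3), ("turn", 3, 5), ("run", 5, 9)]
```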
Dependencies:
- Download T5 encoder weights from Hugging Face
- Download T2M evaluation models from the text-to-motion repository
- Download GloVe embeddings
Quick Start: Download Preprocessed Data (Recommended)
We provide all necessary data (datasets, dependencies, and pretrained models) on Hugging Face: 🤗 ShandaAI/FloodDiffusionDownloads
For inference only (downloads deps/ and outputs/):
pip install huggingface_hub
python download_assets.py
For training/evaluation (also downloads datasets in raw_data/):
pip install huggingface_hub
python download_assets.py --with-dataset
This will automatically download and extract files into the correct directories.
Directory Structure
After downloading or preparing the data, your project should have the following structure:
Dependencies Directory:
deps/
├── t2m/ # Text-to-Motion evaluation models
│ ├── humanml3d/ # HumanML3D evaluator
│ ├── kit/ # KIT-ML evaluator
│ └── meta/ # Statistics (mean.npy, std.npy)
├── glove/ # GloVe word embeddings
│ ├── our_vab_data.npy
│ ├── our_vab_idx.pkl
│ └── our_vab_words.pkl
└── t5_umt5-xxl-enc-bf16/ # T5 text encoder
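The statistics files above (mean.npy / std.npy under deps/t2m/meta, and Mean.npy / Std.npy under raw_data) are used to normalize motion features before evaluation. A sketch of that normalization with numpy, using synthetic data in place of the real files:

```python
import numpy as np

# Synthetic stand-ins: in the real pipeline, `mean` and `std` would be
# loaded with np.load from the statistics files listed above.
rng = np.random.default_rng(0)
motion = rng.normal(size=(240, 263))   # T x 263 motion features
mean = motion.mean(axis=0)             # stands in for Mean.npy
std = motion.std(axis=0)               # stands in for Std.npy

normalized = (motion - mean) / std     # what the evaluators consume
denormalized = normalized * std + mean # invert before rendering/visualizing
```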
Dataset Directory:
raw_data/
├── HumanML3D/
│ ├── new_joint_vecs/ # 263D motion features (required)
│ ├── texts/ # Text annotations
│ ├── train.txt # Training split
│ ├── val.txt # Validation split
│ ├── test.txt # Test split
│ ├── all.txt # All samples
│ ├── Mean.npy # Dataset mean
│ ├── Std.npy # Dataset std
│ ├── TOKENS_*/ # Pretokenized features (auto-generated)
│ └── animations/ # Rendered videos (optional)
│
└── BABEL_streamed/
├── motions/ # 263D motion features (required)
├── texts/ # Text annotations
├── frames/ # Frame-level annotations
├── train_processed.txt # Training split
├── val_processed.txt # Validation split
├── test_processed.txt # Test split
├── TOKENS_*/ # Pretokenized features (auto-generated)
└── animations/ # Rendered videos (optional)
Pretrained Models Directory:
outputs/
├── vae_1d_z4_step=300000.ckpt # VAE model (1D, z_dim=4)
├── 20251106_063218_ldf/
│ └── step_step=50000.ckpt # LDF model checkpoint (HumanML3D)
├── 20251107_021814_ldf_stream/
│ └── step_step=240000.ckpt # LDF streaming model checkpoint (BABEL)
├── 20251217_023720_ldf_tiny/
│ └── step_step=60000.ckpt # LDF tiny model checkpoint
└── 20251219_01492_ldf_tiny_stream/
└── step_step=200000.ckpt # LDF tiny streaming model checkpoint
Note: If you downloaded the models using the script above, the paths are already correctly configured. Otherwise, update test_ckpt and test_vae_ckpt in your config files to point to your checkpoint locations.
Configuration
Create configs/paths.yaml from the example:
cp configs/paths_default.yaml configs/paths.yaml
# Edit paths.yaml to point to your data directories
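The actual keys are defined by configs/paths_default.yaml; the fragment below is a hypothetical illustration of the kind of entries the file contains, not the real schema (all key names here are made up):

```yaml
# Hypothetical example only -- copy configs/paths_default.yaml and keep
# its real keys; these names are illustrative.
data_root: raw_data/HumanML3D
deps_root: deps
output_root: outputs
```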
Available Configs
- vae_wan_1d.yaml - VAE training configuration
- ldf.yaml - LDF training on HumanML3D
- ldf_babel.yaml - LDF training on BABEL
- stream.yaml - Streaming generation config
- ldf_generate.yaml - Generation-only config
Training
1. Train VAE (Motion Encoder)
# Train VAE
python train_vae.py --config configs/vae_wan_1d.yaml --override train=True
# Test VAE
python train_vae.py --config configs/vae_wan_1d.yaml
2. Pretokenize Dataset
Precompute VAE tokens for diffusion training:
python pretokenize_vae.py --config configs/vae_wan_1d.yaml
3. Train Latent Diffusion Forcing (Flood Diffusion)
# Train on HumanML3D
python train_ldf.py --config configs/ldf.yaml --override train=True
# Train on BABEL (streaming)
python train_ldf.py --config configs/ldf_babel.yaml --override train=True
# Test/Evaluate
python train_ldf.py --config configs/ldf.yaml
Generation
Interactive Generation
python generate_ldf.py --config configs/stream.yaml
Visualization
Render motion files to videos:
python visualize_motion.py
This script:
- Reads 263D motion features from disk
- Renders to MP4 videos with skeleton visualization
- Supports batch processing of directories
Web Real-time Demo
For real-time interactive demo with streaming generation, see web_demo/README.md.
Model Architecture
VAE (Variational Autoencoder)
- Input: T × 263 motion features
- Latent: (T/4) × 4 tokens
- Architecture: Causal encoder and decoder based on WAN2.2
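The 4x temporal compression can be illustrated with a shape-only stub; the real encoder is a causal WAN2.2-based network, but the mapping it performs is T x 263 features -> (T/4) x 4 latent tokens:

```python
import numpy as np

def stub_encode(motion, stride=4, z_dim=4):
    """Shape-only stand-in for the VAE encoder (not the actual model):
    pool every `stride` frames, then project 263 channels to z_dim."""
    T = motion.shape[0] - motion.shape[0] % stride     # drop ragged tail
    pooled = motion[:T].reshape(T // stride, stride, -1).mean(axis=1)
    rng = np.random.default_rng(0)
    proj = rng.normal(size=(pooled.shape[1], z_dim))   # fixed random projection
    return pooled @ proj

latent = stub_encode(np.zeros((240, 263)))
# latent.shape -> (60, 4), matching the (T/4) x 4 latent described above
```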
LDF (Latent Diffusion Forcing)
- Backbone: DiT based on WAN2.2
- Text Encoder: T5
- Diffusion Schedule: Triangular noise schedule
- Streaming: Autoregressive latent generation
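The core idea behind diffusion forcing is that each latent token carries its own noise level, so past tokens can be fully denoised while future tokens remain noisy. The sketch below shows one simple linear ramp of this kind; the paper's exact triangular schedule may differ in detail:

```python
import numpy as np

def triangular_noise_levels(num_tokens, num_clean=0):
    """Per-token noise levels in [0, 1]: the first `num_clean` (already
    generated) tokens stay clean, later tokens ramp linearly up to full
    noise. Illustrative of diffusion forcing, not the paper's exact
    schedule."""
    levels = np.zeros(num_tokens)
    ramp = np.linspace(0.0, 1.0, num_tokens - num_clean + 1)[1:]
    levels[num_clean:] = ramp
    return levels

levels = triangular_noise_levels(8, num_clean=3)
# first 3 tokens clean (0.0), remaining 5 ramp linearly up to 1.0
```

During streaming generation, such a schedule lets the model commit to near-past tokens while keeping far-future tokens flexible as new text conditions arrive.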
Project Structure
<project_root>/
├── configs/ # Configuration files
│ ├── vae_wan_1d.yaml # VAE training config
│ ├── ldf.yaml # LDF training (HumanML3D)
│ ├── ldf_babel.yaml # LDF training (BABEL)
│ ├── stream.yaml # Streaming generation
│   └── paths.yaml              # Data paths (create from paths_default.yaml)
│
├── datasets/ # Dataset loaders
│ ├── humanml3d.py # HumanML3D dataset
│ └── babel.py # BABEL dataset
│
├── models/ # Model implementations
│ ├── vae_wan_1d.py # VAE encoder-decoder
│ └── diffusion_forcing_wan.py # LDF diffusion model
│
├── metrics/ # Evaluation metrics
│ ├── t2m.py # Text-to-Motion metrics
│ └── mr.py # Motion reconstruction metrics
│
├── utils/ # Utilities
│ ├── initialize.py # Config & model loading
│ ├── motion_process.py # Motion data processing
│ └── visualize.py # Rendering utilities
│
├── train_vae.py # VAE training script
├── train_ldf.py # LDF training script
├── pretokenize_vae.py # Dataset pretokenization
├── generate_ldf.py # Motion generation
├── visualize_motion.py # Batch visualization
├── requiremen
