FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
[Project Page] | [Paper (arXiv)]
We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.
Features
- 🔄 Streaming Generation: Support for continuous motion generation with text condition changes
- 🚀 Latent Diffusion Forcing: Efficient generation using compressed latent space with diffusion
- ⚡ Real-time Capable: Optimized for streaming inference with ~50 FPS model output
Installation
Environment Setup
# Create conda environment
conda create -n motion_gen python=3.10
conda activate motion_gen
# Install PyTorch
pip install torch torchvision torchaudio
# Install dependencies
pip install -r requirements.txt
# Install Flash Attention
conda install -c nvidia cuda-toolkit
export CUDA_HOME=$CONDA_PREFIX
pip install flash-attn --no-build-isolation
Quick Inference (No Data Required)
If you only need to generate motions and don't plan to train or evaluate models, you can use our standalone model on Hugging Face: 🤗 ShandaAI/FloodDiffusion.
This version requires no dataset downloads and works out of the box for inference:
from transformers import AutoModel
# Load model
model = AutoModel.from_pretrained(
"ShandaAI/FloodDiffusion",
trust_remote_code=True
)
# Generate motion from text
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}") # (~240, 263)
# Generate as joint coordinates for visualization
motion_joints = model("a person walking forward", length=60, output_joints=True)
print(f"Generated joints: {motion_joints.shape}") # (~240, 22, 3)
# Multi-text transitions
motion = model(
text=[["walk forward", "turn around", "run back"]],
length=[120],
text_end=[[40, 80, 120]]
)
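The multi-text call above pairs each prompt with an end frame in `text_end`. A small helper like the following (illustrative only, not part of the released API) can sanity-check such a schedule before calling the model:

```python
def check_text_schedule(texts, text_end, length):
    """Validate a multi-text prompt schedule (illustrative helper,
    not part of the FloodDiffusion API).

    texts:    list of prompts, e.g. ["walk forward", "turn around"]
    text_end: end frame for each prompt, strictly increasing
    length:   total number of frames to generate
    """
    if len(texts) != len(text_end):
        raise ValueError("one end frame is needed per prompt")
    if any(b <= a for a, b in zip(text_end, text_end[1:])):
        raise ValueError("text_end must be strictly increasing")
    if text_end[-1] != length:
        raise ValueError("last end frame must equal the total length")
    # Return (prompt, start, end) segments for inspection.
    starts = [0] + list(text_end[:-1])
    return list(zip(texts, starts, text_end))

segments = check_text_schedule(
    ["walk forward", "turn around", "run back"], [40, 80, 120], 120
)
# segments -> [("walk forward", 0, 40), ("turn around", 40, 80), ("run back", 80, 120)]
```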
For detailed API documentation, see the model card.
We also provide a tiny version, 🤗 ShandaAI/FloodDiffusionTiny, which is smaller and faster:
from transformers import AutoModel
# Load model
model = AutoModel.from_pretrained(
"ShandaAI/FloodDiffusionTiny",
trust_remote_code=True
)
Note: For training, evaluation, or using the scripts in this repository, continue with the Data Preparation section below.
Data Preparation
Prepare Data from Original Sources
To reproduce our results from scratch, follow the original data preparation pipelines:
HumanML3D:
- Follow the instructions in the HumanML3D repository
- Extract 263D motion features using their processing pipeline
- Place the processed data in
raw_data/HumanML3D/
BABEL:
- Download from the BABEL website
- Process the motion sequences to extract 263D features
- For streaming generation, segment and process according to the frame-level annotations
- Place the processed data in
raw_data/BABEL_streamed/
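For the streaming split, the frame-level annotations determine where one action ends and the next begins. As a rough sketch of that segmentation step (BABEL's real annotation format is richer, JSON with per-annotator spans; here we assume a single label per frame):

```python
def frames_to_segments(frame_labels):
    """Collapse per-frame action labels into contiguous (label, start, end)
    segments, with `end` exclusive. Illustrative only: the actual BABEL
    processing pipeline differs in detail."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start, i))
            start = i
    return segments

labels = ["walk"] * 3 + ["turn"] * 2 + ["run"] * 4
# frames_to_segments(labels) -> [("walk", 0, 3), ("turn", 3, 5), ("run", 5, 9)]
```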
Dependencies:
- Download T5 encoder weights from Hugging Face
- Download T2M evaluation models from the text-to-motion repository
- Download GloVe embeddings
Quick Start: Download Preprocessed Data (Recommended)
We provide all necessary data (datasets, dependencies, and pretrained models) on Hugging Face: 🤗 ShandaAI/FloodDiffusionDownloads
For inference only (downloads deps/ and outputs/):
pip install huggingface_hub
python download_assets.py
For training/evaluation (also downloads datasets in raw_data/):
pip install huggingface_hub
python download_assets.py --with-dataset
This will automatically download and extract files into the correct directories.
Directory Structure
After downloading or preparing the data, your project should have the following structure:
Dependencies Directory:
deps/
├── t2m/ # Text-to-Motion evaluation models
│ ├── humanml3d/ # HumanML3D evaluator
│ ├── kit/ # KIT-ML evaluator
│ └── meta/ # Statistics (mean.npy, std.npy)
├── glove/ # GloVe word embeddings
│ ├── our_vab_data.npy
│ ├── our_vab_idx.pkl
│ └── our_vab_words.pkl
└── t5_umt5-xxl-enc-bf16/ # T5 text encoder
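The statistics files above (mean.npy / std.npy under deps/t2m/meta, and Mean.npy / Std.npy under raw_data) are used to normalize motion features before evaluation. A sketch of that normalization with numpy, using synthetic data in place of the real files:

```python
import numpy as np

# Synthetic stand-ins: in the real pipeline, `mean` and `std` would be
# loaded with np.load from the statistics files listed above.
rng = np.random.default_rng(0)
motion = rng.normal(size=(240, 263))   # T x 263 motion features
mean = motion.mean(axis=0)             # stands in for Mean.npy
std = motion.std(axis=0)               # stands in for Std.npy

normalized = (motion - mean) / std     # what the evaluators consume
denormalized = normalized * std + mean # invert before rendering/visualizing
```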
Dataset Directory:
raw_data/
├── HumanML3D/
│ ├── new_joint_vecs/ # 263D motion features (required)
│ ├── texts/ # Text annotations
│ ├── train.txt # Training split
│ ├── val.txt # Validation split
│ ├── test.txt # Test split
│ ├── all.txt # All samples
│ ├── Mean.npy # Dataset mean
│ ├── Std.npy # Dataset std
│ ├── TOKENS_*/ # Pretokenized features (auto-generated)
│ └── animations/ # Rendered videos (optional)
│
└── BABEL_streamed/
├── motions/ # 263D motion features (required)
├── texts/ # Text annotations
├── frames/ # Frame-level annotations
├── train_processed.txt # Training split
├── val_processed.txt # Validation split
├── test_processed.txt # Test split
├── TOKENS_*/ # Pretokenized features (auto-generated)
└── animations/ # Rendered videos (optional)
Pretrained Models Directory:
outputs/
├── vae_1d_z4_step=300000.ckpt # VAE model (1D, z_dim=4)
├── 20251106_063218_ldf/
│ └── step_step=50000.ckpt # LDF model checkpoint (HumanML3D)
├── 20251107_021814_ldf_stream/
│ └── step_step=240000.ckpt # LDF streaming model checkpoint (BABEL)
├── 20251217_023720_ldf_tiny/
│ └── step_step=60000.ckpt # LDF tiny model checkpoint
└── 20251219_01492_ldf_tiny_stream/
└── step_step=200000.ckpt # LDF tiny streaming model checkpoint
Note: If you downloaded the models using the script above, the paths are already correctly configured. Otherwise, update test_ckpt and test_vae_ckpt in your config files to point to your checkpoint locations.
Configuration
Create configs/paths.yaml from the example:
cp configs/paths_default.yaml configs/paths.yaml
# Edit paths.yaml to point to your data directories
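The actual keys are defined by configs/paths_default.yaml; the fragment below is a hypothetical illustration of the kind of entries the file contains, not the real schema (all key names here are made up):

```yaml
# Hypothetical example only -- copy configs/paths_default.yaml and keep
# its real keys; these names are illustrative.
data_root: raw_data/HumanML3D
deps_root: deps
output_root: outputs
```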
Available Configs
- vae_wan_1d.yaml - VAE training configuration
- ldf.yaml - LDF training on HumanML3D
- ldf_babel.yaml - LDF training on BABEL
- stream.yaml - Streaming generation config
- ldf_generate.yaml - Generation-only config
Training
1. Train VAE (Motion Encoder)
# Train VAE
python train_vae.py --config configs/vae_wan_1d.yaml --override train=True
# Test VAE
python train_vae.py --config configs/vae_wan_1d.yaml
2. Pretokenize Dataset
Precompute VAE tokens for diffusion training:
python pretokenize_vae.py --config configs/vae_wan_1d.yaml
3. Train Latent Diffusion Forcing (Flood Diffusion)
# Train on HumanML3D
python train_ldf.py --config configs/ldf.yaml --override train=True
# Train on BABEL (streaming)
python train_ldf.py --config configs/ldf_babel.yaml --override train=True
# Test/Evaluate
python train_ldf.py --config configs/ldf.yaml
Generation
Interactive Generation
python generate_ldf.py --config configs/stream.yaml
Visualization
Render motion files to videos:
python visualize_motion.py
This script:
- Reads 263D motion features from disk
- Renders to MP4 videos with skeleton visualization
- Supports batch processing of directories
Web Real-time Demo
For real-time interactive demo with streaming generation, see web_demo/README.md.
Model Architecture
VAE (Variational Autoencoder)
- Input: T × 263 motion features
- Latent: (T/4) × 4 tokens
- Architecture: Causal encoder and decoder based on WAN2.2
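The 4x temporal compression can be illustrated with a shape-only stub; the real encoder is a causal WAN2.2-based network, but the mapping it performs is T x 263 features -> (T/4) x 4 latent tokens:

```python
import numpy as np

def stub_encode(motion, stride=4, z_dim=4):
    """Shape-only stand-in for the VAE encoder (not the actual model):
    pool every `stride` frames, then project 263 channels to z_dim."""
    T = motion.shape[0] - motion.shape[0] % stride     # drop ragged tail
    pooled = motion[:T].reshape(T // stride, stride, -1).mean(axis=1)
    rng = np.random.default_rng(0)
    proj = rng.normal(size=(pooled.shape[1], z_dim))   # fixed random projection
    return pooled @ proj

latent = stub_encode(np.zeros((240, 263)))
# latent.shape -> (60, 4), matching the (T/4) x 4 latent described above
```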
LDF (Latent Diffusion Forcing)
- Backbone: DiT based on WAN2.2
- Text Encoder: T5
- Diffusion Schedule: Triangular noise schedule
- Streaming: Autoregressive latent generation
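The core idea behind diffusion forcing is that each latent token carries its own noise level, so past tokens can be fully denoised while future tokens remain noisy. The sketch below shows one simple linear ramp of this kind; the paper's exact triangular schedule may differ in detail:

```python
import numpy as np

def triangular_noise_levels(num_tokens, num_clean=0):
    """Per-token noise levels in [0, 1]: the first `num_clean` (already
    generated) tokens stay clean, later tokens ramp linearly up to full
    noise. Illustrative of diffusion forcing, not the paper's exact
    schedule."""
    levels = np.zeros(num_tokens)
    ramp = np.linspace(0.0, 1.0, num_tokens - num_clean + 1)[1:]
    levels[num_clean:] = ramp
    return levels

levels = triangular_noise_levels(8, num_clean=3)
# first 3 tokens clean (0.0), remaining 5 ramp linearly up to 1.0
```

During streaming generation, such a schedule lets the model commit to near-past tokens while keeping far-future tokens flexible as new text conditions arrive.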
Project Structure
<project_root>/
├── configs/ # Configuration files
│ ├── vae_wan_1d.yaml # VAE training config
│ ├── ldf.yaml # LDF training (HumanML3D)
│ ├── ldf_babel.yaml # LDF training (BABEL)
│ ├── stream.yaml # Streaming generation
│   └── paths.yaml              # Data paths (create from paths_default.yaml)
│
├── datasets/ # Dataset loaders
│ ├── humanml3d.py # HumanML3D dataset
│ └── babel.py # BABEL dataset
│
├── models/ # Model implementations
│ ├── vae_wan_1d.py # VAE encoder-decoder
│ └── diffusion_forcing_wan.py # LDF diffusion model
│
├── metrics/ # Evaluation metrics
│ ├── t2m.py # Text-to-Motion metrics
│ └── mr.py # Motion reconstruction metrics
│
├── utils/ # Utilities
│ ├── initialize.py # Config & model loading
│ ├── motion_process.py # Motion data processing
│ └── visualize.py # Rendering utilities
│
├── train_vae.py # VAE training script
├── train_ldf.py # LDF training script
├── pretokenize_vae.py # Dataset pretokenization
├── generate_ldf.py # Motion generation
├── visualize_motion.py # Batch visualization
├── requiremen
