
MotionStreamer

[ICCV 2025] MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space


<h2 align="center"><strong>MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space</strong></h2> <p align="center"> <a href='https://li-xingxiao.github.io/homepage/' target='_blank'>Lixing Xiao</a><sup>1</sup> · <a href='https://shunlinlu.github.io/' target='_blank'>Shunlin Lu</a> <sup>2</sup> · <a href='https://phj128.github.io/' target='_blank'>Huaijin Pi</a><sup>3</sup> · <a href='https://vankouf.github.io/' target='_blank'>Ke Fan</a><sup>4</sup> · <a href='https://liangpan99.github.io/' target='_blank'>Liang Pan</a><sup>3</sup> · <a href='https://yueezhou7@gmail.com' target='_blank'>Yueer Zhou</a><sup>1</sup> · <a href='https://dblp.org/pid/120/4362.html/' target='_blank'>Ziyong Feng</a><sup>5</sup> · <br> <a href='https://www.xzhou.me/' target='_blank'>Xiaowei Zhou</a><sup>1</sup> · <a href='https://pengsida.net/' target='_blank'>Sida Peng</a><sup>1†</sup> · <a href='https://wangjingbo1219.github.io/' target='_blank'>Jingbo Wang</a><sup>6</sup> <br> <br> <sup>1</sup>Zhejiang University <sup>2</sup>The Chinese University of Hong Kong, Shenzhen <sup>3</sup>The University of Hong Kong <br><sup>4</sup>Shanghai Jiao Tong University <sup>5</sup>DeepGlint <sup>6</sup>Shanghai AI Lab <br> <strong>ICCV 2025</strong> </p> <p align="center"> <a href='https://arxiv.org/abs/2503.15451'> <img src='https://img.shields.io/badge/Arxiv-2503.15451-A42C25?style=flat&logo=arXiv&logoColor=A42C25'></a> <a href='https://openaccess.thecvf.com/content/ICCV2025/papers/Xiao_MotionStreamer_Streaming_Motion_Generation_via_Diffusion-based_Autoregressive_Model_in_Causal_ICCV_2025_paper.pdf'> <img src='https://img.shields.io/badge/Paper-PDF-blue?style=flat&logo=arXiv&logoColor=blue'></a> <a href='https://zju3dv.github.io/MotionStreamer/'> <img src='https://img.shields.io/badge/Project-Page-green?style=flat&logo=Google%20chrome&logoColor=green'></a> <a href='https://huggingface.co/datasets/lxxiao/272-dim-HumanML3D'> <img src='https://img.shields.io/badge/Data-Download-yellow?style=flat&logo=huggingface&logoColor=yellow'></a> </p> <img width="1385" alt="image" src="assets/teaser.jpg"/>

🔥 News

  • [2025-06] MotionStreamer has been accepted to ICCV 2025! 🎉

TODO List

  • [x] Release the processing script of 272-dim motion representation.
  • [x] Release the processed 272-dim Motion Representation of HumanML3D dataset. Only for academic usage.
  • [x] Release the training code and checkpoint of our TMR-based motion evaluator trained on the processed 272-dim HumanML3D dataset.
  • [x] Release the training and evaluation code as well as checkpoint of Causal TAE.
  • [x] Release the training code of original motion generation model and streaming generation model (MotionStreamer).
  • [x] Release the checkpoint and demo inference code of original motion generation model.
  • [ ] Release complete code for MotionStreamer.

🏃 Motion Representation

For more details on how to obtain the 272-dim motion representation, as well as other useful tools (e.g., visualization and conversion to BVH format), please refer to our GitHub repo.

Installation

🐍 Python Virtual Environment

conda env create -f environment.yaml
conda activate mgpt

🤗 Hugging Face Mirror

Since all of our models and data are hosted on Hugging Face, if Hugging Face is not directly accessible you can route downloads through the HF-mirror endpoint:

pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com

📥 Data Preparation

To facilitate researchers, we provide the processed 272-dim Motion Representation of:

HumanML3D dataset at this link.

BABEL dataset at this link.

❗️❗️❗️ The processed data is solely for academic purposes. Make sure you read through the AMASS License.

  1. Download the processed 272-dim HumanML3D dataset:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-HumanML3D --local-dir ./humanml3d_272
cd ./humanml3d_272
unzip texts.zip
unzip motion_data.zip

The dataset is organized as:

./humanml3d_272
  ├── mean_std
      ├── Mean.npy
      ├── Std.npy
  ├── split
      ├── train.txt
      ├── val.txt
      ├── test.txt
  ├── texts
      ├── 000000.txt
      ...
  ├── motion_data
      ├── 000000.npy
      ...
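Once the dataset is unpacked, a single motion clip can be loaded and z-normalized with the dataset-level statistics. This is a minimal sketch (not the repo's own loader), assuming `Mean.npy`/`Std.npy` are `(272,)` arrays and each file in `motion_data/` is `(num_frames, 272)`:

```python
import numpy as np

def load_normalized_motion(motion_path, mean_path, std_path, eps=1e-8):
    """Load one 272-dim motion clip and z-normalize it feature-wise.

    Assumes motion is (num_frames, 272) and mean/std are (272,);
    normalization then broadcasts over the frame axis.
    """
    motion = np.load(motion_path)          # (num_frames, 272)
    mean = np.load(mean_path)              # (272,)
    std = np.load(std_path)                # (272,)
    return (motion - mean) / (std + eps)
```

The same statistics would be reused at generation time to de-normalize model outputs (`motion * std + mean`).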
  2. Download the processed 272-dim BABEL dataset:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL --local-dir ./babel_272
cd ./babel_272
unzip texts.zip
unzip motion_data.zip

The dataset is organized as:

./babel_272
  ├── t2m_babel_mean_std
      ├── Mean.npy
      ├── Std.npy
  ├── split
      ├── train.txt
      ├── val.txt
  ├── texts
      ├── 000000.txt
      ...
  ├── motion_data
      ├── 000000.npy
      ...
  3. Download the processed streaming 272-dim BABEL dataset:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL-stream --local-dir ./babel_272_stream
cd ./babel_272_stream
unzip train_stream.zip
unzip train_stream_text.zip
unzip val_stream.zip
unzip val_stream_text.zip

The dataset is organized as:

./babel_272_stream
  ├── train_stream
      ├── seq1.npy
      ...
  ├── train_stream_text
      ├── seq1.txt
      ...
  ├── val_stream
      ├── seq1.npy
      ...
  ├── val_stream_text
      ├── seq1.txt
      ...

NOTE: We process the original BABEL dataset to support the training of streaming motion generation. For example, if a motion sequence A is annotated as subsequences (A1, A2, A3, A4) in the BABEL dataset, the subsequences have text descriptions (A1_t, A2_t, A3_t, A4_t).

Then, our BABEL-stream is constructed as:

seq1: (A1, A2) --- seq1_text: (A1_t*A2_t#A1_length)

seq2: (A2, A3) --- seq2_text: (A2_t*A3_t#A2_length)

seq3: (A3, A4) --- seq3_text: (A3_t*A4_t#A3_length)

Here, * and # are separator symbols, and A1_length is the number of frames in subsequence A1.
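Given that format, a streaming annotation can be split back into its parts with a small helper (a hypothetical illustration, not part of the released code):

```python
def parse_stream_text(annotation):
    """Split a BABEL-stream annotation "A1_t*A2_t#A1_length" into
    (first_text, second_text, first_length_in_frames)."""
    texts, length = annotation.rsplit("#", 1)   # split off the frame count
    first_text, second_text = texts.split("*", 1)
    return first_text, second_text, int(length)

# parse_stream_text("walk forward*turn left#120")
# -> ("walk forward", "turn left", 120)
```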

🚀 Training

  1. Train our TMR-based motion evaluator on the processed 272-dim HumanML3D dataset:

    bash TRAIN_evaluator_272.sh
    

    After training for 100 epochs, the checkpoint will be stored at: Evaluator_272/experiments/temos/EXP1/checkpoints/.

    ⬇️ We provide the evaluator checkpoint on Hugging Face; download it with:

    python humanml3d_272/prepare/download_evaluator_ckpt.py
    

    The downloaded checkpoint will be stored at: Evaluator_272/.

  2. Train the Causal TAE:

    bash TRAIN_causal_TAE.sh ${NUM_GPUS}
    

    e.g., if you have 8 GPUs, run: bash TRAIN_causal_TAE.sh 8

    The checkpoint will be stored at: Experiments/causal_TAE_t2m_272/

    Tensorboard visualization:

    tensorboard --logdir='Experiments/causal_TAE_t2m_272'
    

    ⬇️ We provide the Causal TAE checkpoint on Hugging Face; download it with:

    python humanml3d_272/prepare/download_Causal_TAE_t2m_272_ckpt.py
    
  3. Train the text-to-motion model:

    We provide scripts to train the original text-to-motion generation model with LLaMA blocks, the Two-Forward strategy, and QK-Norm, using the motion latents encoded by the Causal TAE trained in the first stage.

    3.1 Get motion latents:

    python get_latent.py --resume-pth Causal_TAE/net_last.pth --latent_dir humanml3d_272/t2m_latents
    

    3.2 Download the sentence-T5-XXL model from Hugging Face:

    huggingface-cli download --resume-download sentence-transformers/sentence-t5-xxl --local-dir sentencet5-xxl/
    

    3.3 Train the text-to-motion generation model:

    bash TRAIN_t2m.sh ${NUM_GPUS}
    

    e.g., if you have 8 GPUs, run: bash TRAIN_t2m.sh 8

    The checkpoint will be stored at: Experiments/t2m_model/

    Tensorboard visualization:

    tensorboard --logdir='Experiments/t2m_model'
    

    ⬇️ We provide the text-to-motion model checkpoint on Hugging Face; download it with:

    python humanml3d_272/prepare/download_t2m_model_ckpt.py
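The Two-Forward strategy is specific to this paper, but QK-Norm is a general attention-stabilization trick: queries and keys are normalized before the dot product, which bounds the attention logits. A minimal single-head NumPy sketch of that idea (an illustration only, not the authors' implementation; the fixed `scale` in place of a learnable one is an assumption):

```python
import numpy as np

def qk_norm_attention(q, k, v, scale=10.0, eps=1e-6):
    """Single-head attention with QK-Norm: L2-normalize each query and
    key vector before the dot product, so logits lie in [-scale, scale]."""
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = scale * (q @ k.T)                     # (T_q, T_k), bounded
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys
    return weights @ v
```

Because the logits are bounded, softmax saturation is avoided even when query/key magnitudes grow during training.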
    
  4. Train streaming motion generation model (MotionStreamer):

    We provide scripts to train the streaming motion generation model (MotionStreamer) with LLaMA blocks, the Two-Forward strategy, and QK-Norm, using motion latents encoded by the Causal TAE (this stage requires training a new Causal TAE on both HumanML3D-272 and BABEL-272 data).

    4.1 Train a Causal TAE using both HumanML3D-272 and BABEL-272 data:

    bash TRAIN_causal_TAE.sh ${NUM_GPUS} t2m_babel_272
    

    e.g., if you have 8 GPUs, run: bash TRAIN_causal_TAE.sh 8 t2m_babel_272

    The checkpoint will be stored at: Experiments/causal_TAE_t2m_babel_272/

    Tensorboard visualization:

    tensorboard --logdir='Experiments/causal_TAE_t2m_babel_272'
    

    ⬇️ We provide the Causal TAE checkpoint trained on both HumanML3D-272 and BABEL-272 data on Hugging Face; download it with:

    python humanml3
    
