ReactMotion
ReactMotion: Generating Reactive Listener Motions from Speaker Utterance
We introduce Reactive Listener Motion Generation from Speaker Utterance — a new task that generates naturalistic listener body motions appropriately responding to a speaker's utterance. Our unified framework ReactMotion jointly models text, audio, emotion, and motion with preference-based objectives, producing natural, diverse, and appropriate listener responses.
📢 Updates
- [2026.03.17] 🎮 Inference Demo & Gradio UI released
- [2026.03.16] 🎯 Full Training, Evaluation Code released
🚀 TLDR
Modeling nonverbal listener behavior is challenging due to the inherently non-deterministic nature of human reactions — the same speaker utterance can elicit many appropriate listener responses.
🔥 ReactMotion generates naturalistic listener body motions from speaker utterance (text + audio + emotion), trained with preference-based ranking on our ReactMotionNet dataset that captures the one-to-many nature of listener behavior.
We present:
- ReactMotionNet — A large-scale dataset pairing speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness (gold/silver/negative)
- ReactMotion — A unified generative framework built on T5 that jointly models text, audio, emotion, and motion, trained with preference-based ranking objectives
- JudgeNetwork — A multi-modal contrastive scorer for best-of-K selection
- Preference-oriented evaluation protocols tailored to assess reactive appropriateness
🏗️ Architecture
| Component | Backbone | Input | Output |
|---|---|---|---|
| ReactMotion | T5-base | Text + Audio + Emotion | Motion token sequence |
| JudgeNetwork | T5 text enc + Mimi audio enc | Multi-modal conditions + motion | Ranking score |
Conditioning modes — flexibly combine modalities:
| Mode | Description |
|------|-------------|
| t | Text (transcription) only |
| a | Audio only |
| t+e | Text + Emotion |
| a+e | Audio + Emotion |
| t+a | Text + Audio |
| t+a+e | Text + Audio + Emotion (full) |
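The mode strings above can be handled with a small parser. The helper and flag-to-name mapping below are illustrative assumptions, not code from the repo; the actual flag handling in ReactMotion may differ.

```python
# Hypothetical helper: map a cond_mode string such as "t+a+e" to a set of
# modality names. The mapping below mirrors the table above.
VALID_MODALITIES = {"t": "text", "a": "audio", "e": "emotion"}

def parse_cond_mode(cond_mode: str) -> set:
    """Split a mode string like 't+a+e' into a set of modality names."""
    parts = cond_mode.split("+")
    unknown = [p for p in parts if p not in VALID_MODALITIES]
    if unknown:
        raise ValueError(f"Unknown modality flags: {unknown}")
    return {VALID_MODALITIES[p] for p in parts}
```

For example, `parse_cond_mode("t+e")` yields `{"text", "emotion"}`, matching the `--cond_mode t+e` usage in the demo command further below.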
🛠️ Installation
conda create -n reactmotion python=3.11 -y
conda activate reactmotion
pip install -e .
Or install dependencies directly:
pip install -r requirements.txt
We use wandb to log and visualize the training process:
wandb login
📥 Pretrained Models & Evaluators
Download the pretrained Motion VQ-VAE and evaluation models:
bash prepare/download_vqvae.sh
bash prepare/download_evaluators.sh
Once downloaded, your external/ directory should look like:
external/
├── pretrained_vqvae/
│ └── t2m.pth # Motion VQ-VAE checkpoint
└── t2m/
├── Comp_v6_KLD005/ # T2M evaluation model
├── text_mot_match/ # Text-motion matching model
└── VQVAEV3_CB1024_CMT_H1024_NRES3/
└── meta/
├── mean.npy # Per-dim mean for normalization
└── std.npy # Per-dim std for normalization
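A quick way to confirm the download scripts worked is to check for the key files from the tree above. This snippet is a minimal sanity check, not part of the repo; it only covers the three files the later demo command relies on.

```python
from pathlib import Path

# Key files produced by the prepare scripts (paths mirror the tree above).
EXPECTED = [
    "pretrained_vqvae/t2m.pth",
    "t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy",
    "t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy",
]

def missing_files(root="external"):
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```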
📦 Data Preparation
ReactMotionNet
| Resource | Description | Link |
|---|---|---|
| ReactMotionNet | Speaker audio codes, raw audio, train.csv, val.csv, and test.csv | Google Drive |
Dataset structure:
dataset
├── HumanML3D
├── audio_code
├── audio_raw
Place train.csv, val.csv, and test.csv under the repository as follows:
reactmotion
├── reactmotion
├── data
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
├── dataset
├── eval
├── models
└── ...
HumanML3D Dataset
We use the HumanML3D 3D human motion-language dataset. Please follow the HumanML3D instructions to download and prepare the dataset, then place it under the dataset/ directory:
dataset/HumanML3D/
├── new_joint_vecs/ # Joint feature vectors (263-dim)
├── texts/ # Motion captions
├── Mean.npy # Per-dim mean for normalization
├── Std.npy # Per-dim std for normalization
├── train.txt
├── val.txt
├── test.txt
├── train_val.txt
└── all.txt
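The `Mean.npy` / `Std.npy` files above hold per-dimension statistics for the 263-dim joint features. A minimal sketch of the standard HumanML3D-style z-score normalization (the `eps` guard is an assumption for numerical safety, not necessarily what the repo uses):

```python
import numpy as np

def normalize(motion, mean, std, eps=1e-8):
    """Z-score motion features. motion: (T, 263); mean/std: (263,)."""
    return (motion - mean) / (std + eps)

def denormalize(motion, mean, std, eps=1e-8):
    """Invert normalize(), recovering raw joint features."""
    return motion * (std + eps) + mean
```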
Motion VQ-VAE Codes
Pre-encode the HumanML3D motions with the Motion VQ-VAE and place the codes as .npy files:
dataset/HumanML3D/VQVAE/
├── 000000.npy
├── 000001.npy
├── M000000.npy
└── ...
Each .npy file contains a 1D integer array of VQ codebook indices (codebook size = 512).
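A quick inspection of one of these code files can catch encoding mistakes early. This checker is illustrative (the file name in the usage below is just an example from the tree above); it verifies the properties stated in the text: a 1D integer array with indices inside the codebook range.

```python
import numpy as np

def check_codes(path, codebook_size=512):
    """Load a VQ code file and verify it is a 1D integer index array."""
    codes = np.load(path)
    assert codes.ndim == 1, f"expected 1D array, got shape {codes.shape}"
    assert np.issubdtype(codes.dtype, np.integer), f"non-integer dtype {codes.dtype}"
    assert codes.min() >= 0 and codes.max() < codebook_size, "index out of range"
    return codes
```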
Speaker Audio
We provide both pre-extracted Mimi code indices and raw wav files:
| Resource | Description | Link |
|---|---|---|
| Audio Codes | Mimi encoder code indices (.npz) | Google Drive |
| Audio Raw | Raw speaker wav files (.wav) | Google Drive |
Download and place them under your dataset directory:
{DATASET_DIR}/
├── audio_code/ # Mimi code indices (for audio_mode=code)
│ ├── 001193_1_reaction_fearful_4.npz
│ └── ...
└── audio_wav/ # Raw speaker wav (for audio_mode=wav)
├── 001193_1_reaction_fearful_4.wav
└── ...
The audio codes are pre-extracted with Mimi (from the Moshi project). If you want to re-encode from raw wav yourself, the Mimi weights are downloaded automatically from Hugging Face on first use.
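To work with the pre-extracted `.npz` files directly, something like the loader below suffices. Note the array key name (`"codes"`) is an assumption here; inspect a downloaded file with `np.load(path).files` to see the actual keys.

```python
import numpy as np

def load_audio_codes(path, key="codes"):
    """Load a Mimi code array from an .npz file.

    `key` is a hypothetical name -- check `np.load(path).files`
    against the actual downloaded data.
    """
    with np.load(path) as data:
        if key not in data.files:
            raise KeyError(f"{key!r} not found; available keys: {data.files}")
        return data[key]
```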
CSV Splits
Prepare train.csv, val.csv, test.csv with the following columns:
| Column | Description |
|---|---|
| group_id | Unique group identifier |
| item_id | Unique item identifier |
| tier_label | Sample quality tier: gold / silver / neg |
| speaker_transcript | Speaker transcription text |
| speaker_emotion | Speaker emotion label |
| listener_motion_caption | Text description of the listener motion |
| motion_id | Motion file ID (6-digit zero-padded, e.g. 000267) |
| speaker_audio_wav | Audio file stem (maps to audio code/wav files) |
| group_w (optional) | Per-group weight for weighted training |
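Before training, it can help to validate that a split file matches the schema above. This validator is illustrative, not from the repo; it checks the required columns plus the `tier_label` values and the 6-digit `motion_id` convention described in the table.

```python
import csv

# Required columns from the schema table (group_w is optional and not checked).
REQUIRED = ["group_id", "item_id", "tier_label", "speaker_transcript",
            "speaker_emotion", "listener_motion_caption", "motion_id",
            "speaker_audio_wav"]

def validate_split(path):
    """Raise if a CSV split file violates the documented schema."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = [c for c in REQUIRED if c not in reader.fieldnames]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        for i, row in enumerate(reader):
            if row["tier_label"] not in {"gold", "silver", "neg"}:
                raise ValueError(f"row {i}: bad tier_label {row['tier_label']!r}")
            if not (len(row["motion_id"]) == 6 and row["motion_id"].isdigit()):
                raise ValueError(f"row {i}: motion_id must be 6 digits")
```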
🤗 Model Card
Our pretrained models are available on Hugging Face:
| Model | Backbone | Description | Link |
|---|---|---|---|
| ReactMotion 1.0 | T5-base | Generator (Text + Audio + Emotion → Motion) | awakening-ai/ReactMotion1.0 |
| ReactMotion-Judge | T5 text enc + Mimi audio enc | Multi-modal judge network for best-of-K selection | awakening-ai/ReactMotion-Judge |
Download via CLI:
# Install huggingface_hub if needed
pip install huggingface_hub
# Download the generator
huggingface-cli download awakening-ai/ReactMotion1.0 --local-dir models/ReactMotion1.0
# Download the judge network
huggingface-cli download awakening-ai/ReactMotion-Judge --local-dir models/ReactMotion-Judge
Or in Python:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="awakening-ai/ReactMotion1.0", local_dir="models/ReactMotion1.0")
snapshot_download(repo_id="awakening-ai/ReactMotion-Judge", local_dir="models/ReactMotion-Judge")
⚡ Quick Demo
Inference Demo
Download the pretrained model from Hugging Face and make sure you have run the prepare scripts to download the VQ-VAE checkpoint and normalization files. Then run:
python demo_inference.py \
--gen_ckpt models/ReactMotion1.0 \
--vqvae_ckpt external/pretrained_vqvae/t2m.pth \
--mean_path external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy \
--std_path external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy \
--text "It is really nice to meet you!" \
--emotion "excited" \
--cond_mode t+e \
--num_gen 3 \
--out_path output/demo_text_meet.mp4
The generated videos are saved in output/; an example is shown below:
https://github.com/user-attachments/assets/e0096715-f8b9-400c-8dd8-434d1e10c8d4
<!-- TODO: add demo video/gif here -->
<!-- <p align="center"> <img src="images/demo_meet.gif" alt="Demo: It is really nice to meet you!" style="max-height:320px; width:auto;"> </p> -->
<details>
<summary>More demo examples</summary>

Audio + Text input:
python demo_inference.py \
--gen_ckpt models/React

