<p align="center"> <h1 align="center">ReactMotion: Generating Reactive Listener Motions<br>from Speaker Utterance</h1> <p align="center"> <a href="">Cheng Luo</a><sup>1*</sup> &middot; <a href="">Bizhu Wu</a><sup>2,4,5*</sup> &middot; <a href="">Bing Li</a><sup>1&dagger;</sup> &middot; <a href="">Jianfeng Ren</a><sup>4</sup> &middot; <a href="">Ruibin Bai</a><sup>4</sup> &middot; <a href="">Rong Qu</a><sup>5</sup> &middot; <a href="">Linlin Shen</a><sup>2,3&dagger;</sup> &middot; <a href="">Bernard Ghanem</a><sup>1</sup> <br> <sup>1</sup>King Abdullah University of Science and Technology <sup>2</sup>School of Artificial Intelligence, Shenzhen University<br> <sup>3</sup>Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University<br> <sup>4</sup>School of Computer Science, University of Nottingham Ningbo China <sup>5</sup>School of Computer Science, University of Nottingham, UK<br> <sup>*</sup>Equal contribution &nbsp; <sup>&dagger;</sup>Corresponding author </p> <h2 align="center"><a href="https://arxiv.org/pdf/2603.15083">Paper</a> | <a href="https://reactmotion.github.io">Project Page</a> | <a href="https://www.youtube.com/watch?v=48jq_G1uU5s">Video</a> | <a href="https://huggingface.co/awakening-ai/ReactMotion1.0">Hugging Face</a></h2> </p>


We introduce Reactive Listener Motion Generation from Speaker Utterance — a new task that generates naturalistic listener body motions appropriately responding to a speaker's utterance. Our unified framework ReactMotion jointly models text, audio, emotion, and motion with preference-based objectives, producing natural, diverse, and appropriate listener responses.

📢 Updates

  • [2026.03.17] 🎮 Inference Demo & Gradio UI released
  • [2026.03.16] 🎯 Full Training, Evaluation Code released

🚀 TLDR

Modeling nonverbal listener behavior is challenging due to the inherently non-deterministic nature of human reactions — the same speaker utterance can elicit many appropriate listener responses.

🔥 ReactMotion generates naturalistic listener body motions from speaker utterance (text + audio + emotion), trained with preference-based ranking on our ReactMotionNet dataset that captures the one-to-many nature of listener behavior.

We present:

  • ReactMotionNet — A large-scale dataset pairing speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness (gold/silver/negative)
  • ReactMotion — A unified generative framework built on T5 that jointly models text, audio, emotion, and motion, trained with preference-based ranking objectives
  • JudgeNetwork — A multi-modal contrastive scorer for best-of-K selection
  • Preference-oriented evaluation protocols tailored to assess reactive appropriateness

🏗️ Architecture

| Component | Backbone | Input | Output |
|---|---|---|---|
| ReactMotion | T5-base | Text + Audio + Emotion | Motion token sequence |
| JudgeNetwork | T5 text enc + Mimi audio enc | Multi-modal conditions + motion | Ranking score |

Conditioning modes — flexibly combine modalities:

| Mode | Description |
|------|-------------|
| t | Text (transcription) only |
| a | Audio only |
| t+e | Text + Emotion |
| a+e | Audio + Emotion |
| t+a | Text + Audio |
| t+a+e | Text + Audio + Emotion (full) |
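The mode strings above compose modality flags with `+`. As a minimal sketch of how such a string can be parsed into per-modality switches (`parse_cond_mode` is a hypothetical helper, not part of the released code):

```python
# Hypothetical helper (not from the ReactMotion codebase): parse a
# cond_mode string such as "t+a+e" into per-modality boolean flags.
def parse_cond_mode(mode: str) -> dict:
    valid = {"t": "text", "a": "audio", "e": "emotion"}
    parts = mode.split("+")
    unknown = [p for p in parts if p not in valid]
    if unknown:
        raise ValueError(f"Unknown modality flag(s): {unknown}")
    # Map each modality name to whether its flag appears in the string.
    return {name: key in parts for key, name in valid.items()}

flags = parse_cond_mode("t+e")  # text + emotion, no audio
```

Rejecting unknown flags early makes a typo such as `t+s` fail loudly instead of silently dropping a modality.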

🛠️ Installation

conda create -n reactmotion python=3.11 -y
conda activate reactmotion
pip install -e .

Or install dependencies directly:

pip install -r requirements.txt

We use wandb to log and visualize the training process:

wandb login

📥 Pretrained Models & Evaluators

Download the pretrained Motion VQ-VAE and evaluation models:

bash prepare/download_vqvae.sh
bash prepare/download_evaluators.sh

Once downloaded, your external/ directory should look like:

external/
├── pretrained_vqvae/
│   └── t2m.pth                           # Motion VQ-VAE checkpoint
└── t2m/
    ├── Comp_v6_KLD005/                   # T2M evaluation model
    ├── text_mot_match/                   # Text-motion matching model
    └── VQVAEV3_CB1024_CMT_H1024_NRES3/
        └── meta/
            ├── mean.npy                  # Per-dim mean for normalization
            └── std.npy                   # Per-dim std for normalization
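The `mean.npy`/`std.npy` files hold per-dimension statistics for motion-feature normalization. A minimal sketch of the standard normalize/denormalize round trip (synthetic statistics stand in for the downloaded files; shapes are assumed):

```python
import numpy as np

# Sketch: per-dimension z-normalization as used in HumanML3D-style
# pipelines. mean/std here are synthetic stand-ins for mean.npy/std.npy.
def normalize(motion: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    return (motion - mean) / std

def denormalize(motion: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    return motion * std + mean

rng = np.random.default_rng(0)
motion = rng.normal(size=(40, 263))        # (frames, feature_dim)
mean = motion.mean(axis=0)
std = motion.std(axis=0) + 1e-8            # avoid division by zero
norm = normalize(motion, mean, std)
recon = denormalize(norm, mean, std)       # round trip recovers the input
```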

📦 Data Preparation

ReactMotionNet

| Resource | Description | Link |
|---|---|---|
| ReactMotionNet | Speaker audio codes, raw audio, train.csv, val.csv, and test.csv | Google Drive |

Dataset structure:

dataset
├── HumanML3D
├── audio_code
└── audio_raw

Place train.csv, val.csv, and test.csv under reactmotion/reactmotion/data/:

reactmotion
└── reactmotion
    ├── data
    │   ├── train.csv
    │   ├── val.csv
    │   └── test.csv
    ├── dataset
    ├── eval
    ├── models
    └── ...

HumanML3D Dataset

We use the HumanML3D 3D human motion-language dataset. Please follow the HumanML3D instructions to download and prepare the dataset, then place it under the dataset/ directory:

dataset/HumanML3D/
├── new_joint_vecs/      # Joint feature vectors (263-dim)
├── texts/               # Motion captions
├── Mean.npy             # Per-dim mean for normalization
├── Std.npy              # Per-dim std for normalization
├── train.txt
├── val.txt
├── test.txt
├── train_val.txt
└── all.txt

Motion VQ-VAE Codes

Pre-encode the HumanML3D motions with the Motion VQ-VAE and place the codes as .npy files:

dataset/HumanML3D/VQVAE/
├── 000000.npy
├── 000001.npy
├── M000000.npy
└── ...

Each .npy file contains a 1D integer array of VQ codebook indices (codebook size = 512).
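A small sketch of writing and sanity-checking one code file in the format described above (1D integer array of codebook indices, indices below 512); the file name and values are illustrative:

```python
import os
import tempfile
import numpy as np

CODEBOOK_SIZE = 512  # codebook size stated above

# Sketch: write one VQ code file in the expected format, then verify it.
codes = np.array([3, 511, 42, 0], dtype=np.int64)
path = os.path.join(tempfile.mkdtemp(), "000000.npy")
np.save(path, codes)

loaded = np.load(path)
assert loaded.ndim == 1                               # 1D sequence of codes
assert np.issubdtype(loaded.dtype, np.integer)        # integer indices
assert 0 <= loaded.min() and loaded.max() < CODEBOOK_SIZE
```

The same checks are a cheap way to catch a mis-encoded file (wrong dtype, 2D array, or out-of-range index) before training.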

Speaker Audio

We provide both pre-extracted Mimi code indices and raw wav files:

| Resource | Description | Link |
|---|---|---|
| Audio Codes | Mimi encoder code indices (.npz) | Google Drive |
| Audio Raw | Raw speaker wav files (.wav) | Google Drive |

Download and place them under your dataset directory:

{DATASET_DIR}/
├── audio_code/   # Mimi code indices (for audio_mode=code)
│   ├── 001193_1_reaction_fearful_4.npz
│   └── ...
└── audio_wav/    # Raw speaker wav (for audio_mode=wav)
    ├── 001193_1_reaction_fearful_4.wav
    └── ...

The audio codes are pre-extracted with Mimi (from the Moshi project). If you want to re-encode from raw wav, the Mimi weights are downloaded automatically from Hugging Face on first use.

CSV Splits

Prepare train.csv, val.csv, test.csv with the following columns:

| Column | Description |
|---|---|
| group_id | Unique group identifier |
| item_id | Unique item identifier |
| tier_label | Sample quality tier: gold / silver / neg |
| speaker_transcript | Speaker transcription text |
| speaker_emotion | Speaker emotion label |
| listener_motion_caption | Text description of the listener motion |
| motion_id | Motion file ID (6-digit zero-padded, e.g. 000267) |
| speaker_audio_wav | Audio file stem (maps to audio code/wav files) |
| group_w (optional) | Per-group weight for weighted training |
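A minimal sketch of validating a split CSV against this schema before training (the sample row values are made up; only the column names and tier labels come from the table above):

```python
import csv
import io

# Required columns and allowed tier labels, taken from the table above.
REQUIRED = ["group_id", "item_id", "tier_label", "speaker_transcript",
            "speaker_emotion", "listener_motion_caption", "motion_id",
            "speaker_audio_wav"]
TIERS = {"gold", "silver", "neg"}

# Illustrative sample row; a real check would open train.csv instead.
sample = io.StringIO(
    "group_id,item_id,tier_label,speaker_transcript,speaker_emotion,"
    "listener_motion_caption,motion_id,speaker_audio_wav\n"
    "g0,i0,gold,It is really nice to meet you!,excited,"
    "nods and smiles,000267,001193_1_reaction_fearful_4\n"
)

reader = csv.DictReader(sample)
assert set(REQUIRED) <= set(reader.fieldnames)        # all columns present
for row in reader:
    assert row["tier_label"] in TIERS                 # valid quality tier
    assert len(row["motion_id"]) == 6 and row["motion_id"].isdigit()
```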

🤗 Model Card

Our pretrained models are available on Hugging Face:

| Model | Backbone | Description | Link |
|---|---|---|---|
| ReactMotion 1.0 | T5-base | Generator (Text + Audio + Emotion → Motion) | awakening-ai/ReactMotion1.0 |
| ReactMotion-Judge | T5 text enc + Mimi audio enc | Multi-modal judge network for best-of-K selection | awakening-ai/ReactMotion-Judge |

Download via CLI:

# Install huggingface_hub if needed
pip install huggingface_hub

# Download the generator
huggingface-cli download awakening-ai/ReactMotion1.0 --local-dir models/ReactMotion1.0

# Download the judge network
huggingface-cli download awakening-ai/ReactMotion-Judge --local-dir models/ReactMotion-Judge

Or in Python:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="awakening-ai/ReactMotion1.0", local_dir="models/ReactMotion1.0")
snapshot_download(repo_id="awakening-ai/ReactMotion-Judge", local_dir="models/ReactMotion-Judge")

⚡ Quick Demo

Inference Demo

Download the pretrained model from Hugging Face and make sure you have run the prepare scripts to download the VQ-VAE checkpoint and normalization files. Then run:

python demo_inference.py \
  --gen_ckpt   models/ReactMotion1.0 \
  --vqvae_ckpt external/pretrained_vqvae/t2m.pth \
  --mean_path  external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy \
  --std_path   external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy \
  --text "It is really nice to meet you!" \
  --emotion "excited" \
  --cond_mode t+e \
  --num_gen 3 \
  --out_path output/demo_text_meet.mp4

The generated videos will be saved in output/, example shown below:

https://github.com/user-attachments/assets/e0096715-f8b9-400c-8dd8-434d1e10c8d4

<!-- TODO: add demo video/gif here --> <!-- <p align="center"> <img src="images/demo_meet.gif" alt="Demo: It is really nice to meet you!" style="max-height:320px; width:auto;"> </p> --> <details> <summary>More demo examples</summary>

Audio + Text input:

python demo_inference.py \
  --gen_ckpt   models/React
