3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
<div align="center"> <img src='imgs/logo.png' style="height:90px"></img>
<a href='http://fuxiao0719.github.io/projects/3dtrajmaster'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href='https://arxiv.org/pdf/2412.07759'><img src='https://img.shields.io/badge/arXiv-2412.07759-b31b1b.svg'></a>
<a href='https://huggingface.co/datasets/KwaiVGI/360Motion-Dataset'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue'></a>
<a href='https://huggingface.co/KwaiVGI/3DTrajMaster'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
Xiao Fu<sup>1</sup>, Xian Liu<sup>1</sup>, Xintao Wang<sup>2 ✉</sup>, Sida Peng<sup>3</sup>, Menghan Xia<sup>2</sup>, Xiaoyu Shi<sup>2</sup>, Ziyang Yuan<sup>2</sup>, <br> Pengfei Wan<sup>2</sup>, Di Zhang<sup>2</sup>, Dahua Lin<sup>1 ✉</sup> <br> <sup>1</sup>The Chinese University of Hong Kong <sup>2</sup>Kuaishou Technology <sup>3</sup>Zhejiang University <br> ✉: Corresponding Authors
ICLR 2025
</div>

🌟 Introduction
🔥 3DTrajMaster controls one or multiple entity motions in 3D space with entity-specific 3D trajectories for text-to-video (T2V) generation. It has the following features:
- 6 Degrees of Freedom (DoF): control 3D entity location and orientation.
- Diverse Entities: human, animal, robot, car, and even abstract entities such as fire and breeze.
- Diverse Backgrounds: city, forest, desert, gym, sunset beach, glacier, hall, night city, etc.
- Complex 3D Trajectories: 3D occlusion, rotating in place, 180°/continuous 90° turns, etc.
- Fine-Grained Entity Prompts: change human hair, clothing, gender, figure size, accessories, etc.
https://github.com/user-attachments/assets/efe1870f-4168-4aff-98b8-dbd9e3802928
🔥 Release News
- [2025/01/23] 3DTrajMaster is accepted to ICLR 2025.
- [2025/01/22] Release inference and training code based on CogVideoX-5B.
- [2024/12/10] Release paper, project page, dataset, and eval code.
⚙️ Quick Start
(1) Access to Our Internal Video Model
As per company policy, we are unable to release the proprietary trained model at this time. However, if you wish to access our internal model, please submit your request via (1) a shared document or (2) email (lemonaddie0909@gmail.com, recommended); we will respond with the generated video as quickly as possible.
Please ensure your request includes the following:
- Entity prompts (1–3 entities, each with a maximum of 42 tokens, approximately 20 words)
- Location prompt
- Trajectory template (you can refer to the trajectory template in our released 360°-Motion Dataset, or simply describe new ones via text)
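Since entity prompts are capped at roughly 42 tokens (about 20 words each), a quick word-count pre-check can save a round trip. The sketch below is illustrative only: `entity_prompt_ok` and its `max_words` threshold are our own assumptions, and the real limit is measured in T5 tokens, not words.

```python
def entity_prompt_ok(prompt: str, max_words: int = 20) -> bool:
    """Rough pre-check for an entity prompt length.

    Illustrative heuristic only: the actual limit is 42 T5 tokens,
    which only roughly corresponds to ~20 words.
    """
    return 0 < len(prompt.split()) <= max_words

print(entity_prompt_ok("a white SUV with tinted windows driving slowly"))  # True
print(entity_prompt_ok("word " * 30))                                      # False
```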
(2) Access to Publicly Available Codebase
We open-source a model based on CogVideoX-5B. Below is a comparison between CogVideoX and our internal video model as of 2025.01.15.
https://github.com/user-attachments/assets/a49e46d3-92d0-42ec-a89f-a9d43919f620
Inference
- [Environment Setup] Our environment setup is identical to CogVideoX; you can refer to their configuration to complete it.

  ```bash
  conda create -n 3dtrajmaster python=3.10
  conda activate 3dtrajmaster
  pip install -r requirements.txt
  ```

- [Download Weights and Dataset] Download the pretrained checkpoints (CogVideo-5B, LoRA, and injector) from here and place them in the `CogVideo/weights` directory. Then, download the dataset from here. Please note that in both training stages, we use only 11 camera poses and exclude the last camera pose as the novel-pose setting.

- [Inference on Generalizable Prompts] Change the root path to `CogVideo/inference`. Note that a higher LoRA scale and more annealed steps can improve accuracy in prompt generation but may result in lower visual quality. You can modify `test_sets.json` to add novel entity & location prompts. For entity input, you can use GPT to expand the description to an appropriate length, such as "Generate a detailed description of approximately 20 words".

  ```bash
  python 3dtrajmaster_inference.py \
    --model_path ../weights/cogvideox-5b \
    --ckpt_path ../weights/injector \
    --lora_path ../weights/lora \
    --lora_scale 0.6 \
    --annealed_sample_step 20 \
    --seed 24 \
    --output_path output_example
  ```

  | Argument | Description |
  |-------------------------|-------------|
  | `--lora_scale` | LoRA alpha weight. Options: 0–1, float. Default: 0.6. |
  | `--annealed_sample_step` | Annealed sampling steps during inference. Options: 0–50, int. Default: 20. |
  | Generalizable Robustness | prompt entity number: 1 > 2 > 3 |
  | Entity Length | 15–24 words, ~24–40 tokens after T5 embeddings |

The following code snapshot showcases the core component of 3DTrajMaster, namely the plug-and-play 3D-motion grounded object injector.
```python
# 1. norm & modulate
norm_hidden_states, norm_empty_encoder_hidden_states, gate_msa, enc_gate_msa = \
    self.norm1(hidden_states, empty_encoder_hidden_states, temb)
bz, N_visual, dim = norm_hidden_states.shape
max_entity_num = 3
_, entity_num, num_frames, _ = pose_embeds.shape

# 2. pair-wise fusion of trajectory and entity
attn_input = self.attn_null_feature.repeat(bz, max_entity_num, 50, num_frames, 1)
pose_embeds = self.pose_fuse_layer(pose_embeds)
attn_input[:, :entity_num, :, :, :] = pose_embeds.unsqueeze(-3) + prompt_entities_embeds.unsqueeze(-2)
attn_input = torch.cat((
    rearrange(norm_hidden_states, "b (n t) d -> b n t d", n=num_frames),
    rearrange(attn_input, "b n t f d -> b f (n t) d")), dim=2
).flatten(1, 2)

# 3. gated self-attention
attn_hidden_states, attn_encoder_hidden_states = self.attn1_injector(
    hidden_states=attn_input,
    encoder_hidden_states=norm_empty_encoder_hidden_states,
    image_rotary_emb=image_rotary_emb,
)
attn_hidden_states = attn_hidden_states[:, :N_visual, :]
hidden_states = hidden_states + gate_msa * attn_hidden_states
```
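To make the injector's token layout concrete, here is a small numpy stand-in for the two einops rearranges above. All dimension values are made-up toy numbers (the real trajectory-token count is 50, and the real tensors come from the model); this traces only shapes, not the learned modules.

```python
import numpy as np

# Toy dimensions (illustrative; in the real code traj_tokens is 50)
bz, num_frames, tokens_per_frame, dim = 2, 4, 6, 8
max_entity_num, traj_tokens = 3, 5

# Visual tokens, flattened over frames: (b, num_frames * tokens_per_frame, d)
visual = np.zeros((bz, num_frames * tokens_per_frame, dim))
# Fused entity/trajectory tokens: (b, max_entity_num, traj_tokens, num_frames, d)
fused = np.zeros((bz, max_entity_num, traj_tokens, num_frames, dim))

# "b (n t) d -> b n t d" with n = num_frames: split frames out of the sequence
visual_per_frame = visual.reshape(bz, num_frames, tokens_per_frame, dim)
# "b n t f d -> b f (n t) d": bring frames forward, merge entity and traj tokens
fused_per_frame = fused.transpose(0, 3, 1, 2, 4).reshape(
    bz, num_frames, max_entity_num * traj_tokens, dim)

# Concatenate per frame, then flatten frames back into one sequence axis
attn_input = np.concatenate([visual_per_frame, fused_per_frame], axis=2)
attn_input = attn_input.reshape(bz, -1, dim)
print(attn_input.shape)  # (2, 84, 8): 4 frames * (6 visual + 3*5 injected) tokens
```

Because the visual tokens come first within each frame's slice, slicing the attention output back to the first `N_visual` positions per frame recovers the video tokens after the gated self-attention.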
Training
- Change the root path to `CogVideo/finetune`. First, train the LoRA module to fit the synthetic data domain.

  ```bash
  bash finetune_single_rank_lora.sh
  ```

- Then, train the injector module to learn the entity motion controller. Here we set `--block_interval` to 2 to insert the injector every 2 transformer blocks. You can increase this value for a lighter model, but note that it will require a longer training time. For the initial fine-tuning stage, use `--finetune_init`. If resuming from a pre-trained checkpoint, omit `--finetune_init` and specify `--resume_from_checkpoint $TRANSFORMER_PATH` instead. Note that in both training stages, we use only 11 camera poses and exclude the last camera pose as the novel-pose setting.

  ```bash
  bash finetune_single_rank_injector.sh
  ```
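The effect of `--block_interval` can be pictured with a small sketch. This is hypothetical: the actual block count and insertion rule live in the finetune scripts, and the 42-block depth below is an assumed example value.

```python
def injector_block_indices(num_blocks: int, block_interval: int) -> list[int]:
    """Indices of transformer blocks that receive an injector,
    assuming one injector every `block_interval` blocks (illustrative rule)."""
    return [i for i in range(num_blocks) if i % block_interval == 0]

# With a hypothetical 42-block transformer:
print(len(injector_block_indices(42, 2)))  # 21 injectors (default interval)
print(len(injector_block_indices(42, 4)))  # 11 injectors -> lighter model
```

Doubling the interval roughly halves the number of injector modules, trading parameter count (and memory) against how densely the 3D-motion conditioning is applied.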
📦 360°-Motion Dataset (Download 🤗)
```
├── 360Motion-Dataset                    Video Number    Cam-Obj Distance (m)
    ├── 480_720/384_672
        ├── Desert (desert)                18,000          [3.06, 13.39]
            ├── location_data.json
        ├── HDRI
            ├── loc1 (snowy street)         3,600          [3.43, 13.02]
            ├── loc2 (park)                 3,600          [4.16, 12.22]
            ├── loc3 (indoor open space)    3,600          [3.62, 12.79]
            ├── loc11 (gymnastics room)     3,600          [4.06, 12.32]
            ├── loc13 (autumn forest)       3,600          [4.49, 11.92]
            ├── location_data.json
    ├── RefPic
    ├── CharacterInfo.json
    ├── Hemi12_transforms.json
```
(1) Released Dataset Information (V1.0.0)
| Argument | Description | Argument | Description |
|-------------------------|-------------|-------------------------|-------------|
| Video Resolution | (1) 480×720 (2) 384×672 | Frames/Duration/FPS | 99/3.3s/30 |
| UE Scenes | 6 (1 desert + 5 HDRIs) | Video Samples | (1) 36,000 (2) 36,000 |
| Camera Intrinsics (fx, fy) | (1) 1060.606 (2) 989.899 | Sensor Width/Height (mm) | (1) 23.76/15.84 (2) 23.76/13.365 |
| Hemi12_transforms.json | 12 surrounding cameras | CharacterInfo.json | entity prompts |
| RefPic | 50 animals | 1/2/3 Trajectory Templates | 36/60/35 (121 in total) |
| {D/N}_{locX} | {Day/Night}_{LocationX} | {C}_{XX}_{35mm} | {Close-Up Shot}_{Cam. Index (1-12)}_{Focal Length} |
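As a cross-check (ours, not from the paper), the listed pixel intrinsics are consistent with the 35 mm focal length in the filename convention and the 23.76 mm sensor width, via the standard pinhole relation fx = f_mm / sensor_width_mm × image_width:

```python
def fx_pixels(focal_mm: float, sensor_width_mm: float, image_width_px: int) -> float:
    """Pinhole model: convert a physical focal length to focal length in pixels."""
    return focal_mm / sensor_width_mm * image_width_px

print(round(fx_pixels(35.0, 23.76, 720), 3))  # 1060.606 (480x720 setting)
print(round(fx_pixels(35.0, 23.76, 672), 3))  # 989.899  (384x672 setting)
```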
Note that the resolution of 384×672 refers to our internal video diffusion resolution. In fact, we render the video at a resolution of 378×672 (aspect ratio 9:16), with a 3-p