3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
<div align="center"> <img src='imgs/logo.png' style="height:90px"></img>
<a href='http://fuxiao0719.github.io/projects/3dtrajmaster'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href='https://arxiv.org/pdf/2412.07759'><img src='https://img.shields.io/badge/arXiv-2412.07759-b31b1b.svg'></a>
<a href='https://huggingface.co/datasets/KwaiVGI/360Motion-Dataset'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue'></a>
<a href='https://huggingface.co/KwaiVGI/3DTrajMaster'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
Xiao Fu<sup>1</sup>, Xian Liu<sup>1</sup>, Xintao Wang<sup>2 ✉</sup>, Sida Peng<sup>3</sup>, Menghan Xia<sup>2</sup>, Xiaoyu Shi<sup>2</sup>, Ziyang Yuan<sup>2</sup>, <br> Pengfei Wan<sup>2</sup>, Di Zhang<sup>2</sup>, Dahua Lin<sup>1 ✉</sup> <br> <sup>1</sup>The Chinese University of Hong Kong <sup>2</sup>Kuaishou Technology <sup>3</sup>Zhejiang University <br> ✉: Corresponding Authors
ICLR 2025
</div>

🌟 Introduction
🔥 3DTrajMaster controls one or multiple entity motions in 3D space with entity-specific 3D trajectories for text-to-video (T2V) generation. It has the following features:
- 6 Degrees of Freedom (DoF): control 3D entity location and orientation.
- Diverse Entities: human, animal, robot, car, and even abstract entities such as fire and breeze.
- Diverse Backgrounds: city, forest, desert, gym, sunset beach, glacier, hall, night city, etc.
- Complex 3D Trajectories: 3D occlusion, rotating in place, 180°/continuous 90° turns, etc.
- Fine-Grained Entity Prompts: change human hair, clothing, gender, figure size, accessories, etc.
https://github.com/user-attachments/assets/efe1870f-4168-4aff-98b8-dbd9e3802928
🔥 Release News
- [2025/01/23] 3DTrajMaster is accepted to ICLR 2025.
- [2025/01/22] Release inference and training code based on CogVideoX-5B.
- [2024/12/10] Release paper, project page, dataset, and eval code.
⚙️ Quick Start
(1) Access to Our Internal Video Model
As per company policy, we are unable to release the proprietary trained model at this time. However, if you wish to access our internal model, please submit your request via (1) a shared document or (2) email (lemonaddie0909@gmail.com, recommended); we will respond with the generated video as quickly as possible.
Please ensure your request includes the following:
- Entity prompts (1–3 entities, each with a maximum of 42 tokens, approximately 20 words)
- Location prompt
- Trajectory template (you can refer to the trajectory template in our released 360°-Motion Dataset, or simply describe new ones via text)
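Since entity prompts are capped at roughly 42 tokens (about 20 words each), a quick word-count pre-check can save a round trip. The sketch below is illustrative only: `entity_prompt_ok` and its `max_words` threshold are our own assumptions, and the real limit is measured in T5 tokens, not words.

```python
def entity_prompt_ok(prompt: str, max_words: int = 20) -> bool:
    """Rough pre-check for an entity prompt length.

    Illustrative heuristic only: the actual limit is 42 T5 tokens,
    which only roughly corresponds to ~20 words.
    """
    return 0 < len(prompt.split()) <= max_words

print(entity_prompt_ok("a white SUV with tinted windows driving slowly"))  # True
print(entity_prompt_ok("word " * 30))                                      # False
```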
(2) Access to Publicly Available Codebase
We open-source a model based on CogVideoX-5B. Below is a comparison between CogVideoX and our internal video model as of 2025.01.15.
https://github.com/user-attachments/assets/a49e46d3-92d0-42ec-a89f-a9d43919f620
Inference
- [Environment Setup] Our environment setup is identical to CogVideoX; you can refer to their configuration to complete it.

  ```bash
  conda create -n 3dtrajmaster python=3.10
  conda activate 3dtrajmaster
  pip install -r requirements.txt
  ```

- [Download Weights and Dataset] Download the pretrained checkpoints (CogVideo-5B, LoRA, and injector) from here and place them in the `CogVideo/weights` directory. Then, download the dataset from here. Please note that in both training stages, we use only 11 camera poses and exclude the last camera pose as the novel-pose setting.

- [Inference on Generalizable Prompts] Change the root path to `CogVideo/inference`. Note that a higher LoRA scale and more annealed steps can improve accuracy in prompt generation but may result in lower visual quality. You can modify `test_sets.json` to add novel entity & location prompts. For entity input, you can use GPT to expand the description to an appropriate length, such as "Generate a detailed description of approximately 20 words".

  ```bash
  python 3dtrajmaster_inference.py \
    --model_path ../weights/cogvideox-5b \
    --ckpt_path ../weights/injector \
    --lora_path ../weights/lora \
    --lora_scale 0.6 \
    --annealed_sample_step 20 \
    --seed 24 \
    --output_path output_example
  ```

  | Argument | Description |
  |-------------------------|-------------|
  | `--lora_scale` | LoRA alpha weight. Options: 0–1, float. Default: 0.6. |
  | `--annealed_sample_step` | Annealed sampling steps during inference. Options: 0–50, int. Default: 20. |
  | Generalizable Robustness | prompt entity number: 1 > 2 > 3 |
  | Entity Length | 15–24 words, ~24–40 tokens after T5 embeddings |

The following code snapshot showcases the core component of 3DTrajMaster, namely the plug-and-play 3D-motion grounded object injector.
```python
# 1. norm & modulate
norm_hidden_states, norm_empty_encoder_hidden_states, gate_msa, enc_gate_msa = \
    self.norm1(hidden_states, empty_encoder_hidden_states, temb)
bz, N_visual, dim = norm_hidden_states.shape
max_entity_num = 3
_, entity_num, num_frames, _ = pose_embeds.shape

# 2. pair-wise fusion of trajectory and entity
attn_input = self.attn_null_feature.repeat(bz, max_entity_num, 50, num_frames, 1)
pose_embeds = self.pose_fuse_layer(pose_embeds)
attn_input[:, :entity_num, :, :, :] = pose_embeds.unsqueeze(-3) + prompt_entities_embeds.unsqueeze(-2)
attn_input = torch.cat((
    rearrange(norm_hidden_states, "b (n t) d -> b n t d", n=num_frames),
    rearrange(attn_input, "b n t f d -> b f (n t) d")), dim=2
).flatten(1, 2)

# 3. gated self-attention
attn_hidden_states, attn_encoder_hidden_states = self.attn1_injector(
    hidden_states=attn_input,
    encoder_hidden_states=norm_empty_encoder_hidden_states,
    image_rotary_emb=image_rotary_emb,
)
attn_hidden_states = attn_hidden_states[:, :N_visual, :]
hidden_states = hidden_states + gate_msa * attn_hidden_states
```
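To make the injector's token layout concrete, here is a small numpy stand-in for the two einops rearranges above. All dimension values are made-up toy numbers (the real trajectory-token count is 50, and the real tensors come from the model); this traces only shapes, not the learned modules.

```python
import numpy as np

# Toy dimensions (illustrative; in the real code traj_tokens is 50)
bz, num_frames, tokens_per_frame, dim = 2, 4, 6, 8
max_entity_num, traj_tokens = 3, 5

# Visual tokens, flattened over frames: (b, num_frames * tokens_per_frame, d)
visual = np.zeros((bz, num_frames * tokens_per_frame, dim))
# Fused entity/trajectory tokens: (b, max_entity_num, traj_tokens, num_frames, d)
fused = np.zeros((bz, max_entity_num, traj_tokens, num_frames, dim))

# "b (n t) d -> b n t d" with n = num_frames: split frames out of the sequence
visual_per_frame = visual.reshape(bz, num_frames, tokens_per_frame, dim)
# "b n t f d -> b f (n t) d": bring frames forward, merge entity and traj tokens
fused_per_frame = fused.transpose(0, 3, 1, 2, 4).reshape(
    bz, num_frames, max_entity_num * traj_tokens, dim)

# Concatenate per frame, then flatten frames back into one sequence axis
attn_input = np.concatenate([visual_per_frame, fused_per_frame], axis=2)
attn_input = attn_input.reshape(bz, -1, dim)
print(attn_input.shape)  # (2, 84, 8): 4 frames * (6 visual + 3*5 injected) tokens
```

Because the visual tokens come first within each frame's slice, slicing the attention output back to the first `N_visual` positions per frame recovers the video tokens after the gated self-attention.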
Training
- Change the root path to `CogVideo/finetune`. First, train the LoRA module to fit the synthetic data domain.

  ```bash
  bash finetune_single_rank_lora.sh
  ```

- Then, train the injector module to learn the entity motion controller. Here we set `--block_interval` to 2 to insert the injector every 2 transformer blocks. You can increase this value for a lighter model, but note that it will require a longer training time. For the initial fine-tuning stage, use `--finetune_init`. If resuming from a pre-trained checkpoint, omit `--finetune_init` and specify `--resume_from_checkpoint $TRANSFORMER_PATH` instead. Note that in both training stages, we use only 11 camera poses and exclude the last camera pose as the novel-pose setting.

  ```bash
  bash finetune_single_rank_injector.sh
  ```
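The effect of `--block_interval` can be pictured with a small sketch. This is hypothetical: the actual block count and insertion rule live in the finetune scripts, and the 42-block depth below is an assumed example value.

```python
def injector_block_indices(num_blocks: int, block_interval: int) -> list[int]:
    """Indices of transformer blocks that receive an injector,
    assuming one injector every `block_interval` blocks (illustrative rule)."""
    return [i for i in range(num_blocks) if i % block_interval == 0]

# With a hypothetical 42-block transformer:
print(len(injector_block_indices(42, 2)))  # 21 injectors (default interval)
print(len(injector_block_indices(42, 4)))  # 11 injectors -> lighter model
```

Doubling the interval roughly halves the number of injector modules, trading parameter count (and memory) against how densely the 3D-motion conditioning is applied.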
📦 360°-Motion Dataset (Download 🤗)
```
├── 360Motion-Dataset                    Video Number    Cam-Obj Distance (m)
    ├── 480_720/384_672
        ├── Desert (desert)                18,000          [3.06, 13.39]
            ├── location_data.json
        ├── HDRI
            ├── loc1 (snowy street)         3,600          [3.43, 13.02]
            ├── loc2 (park)                 3,600          [4.16, 12.22]
            ├── loc3 (indoor open space)    3,600          [3.62, 12.79]
            ├── loc11 (gymnastics room)     3,600          [4.06, 12.32]
            ├── loc13 (autumn forest)       3,600          [4.49, 11.92]
            ├── location_data.json
    ├── RefPic
    ├── CharacterInfo.json
    ├── Hemi12_transforms.json
```
(1) Released Dataset Information (V1.0.0)
| Argument | Description | Argument | Description |
|-------------------------|-------------|-------------------------|-------------|
| Video Resolution | (1) 480×720 (2) 384×672 | Frames/Duration/FPS | 99/3.3s/30 |
| UE Scenes | 6 (1 desert + 5 HDRIs) | Video Samples | (1) 36,000 (2) 36,000 |
| Camera Intrinsics (fx, fy) | (1) 1060.606 (2) 989.899 | Sensor Width/Height (mm) | (1) 23.76/15.84 (2) 23.76/13.365 |
| Hemi12_transforms.json | 12 surrounding cameras | CharacterInfo.json | entity prompts |
| RefPic | 50 animals | 1/2/3 Trajectory Templates | 36/60/35 (121 in total) |
| {D/N}_{locX} | {Day/Night}_{LocationX} | {C}_{XX}_{35mm} | {Close-Up Shot}_{Cam. Index (1-12)}_{Focal Length} |
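As a cross-check (ours, not from the paper), the listed pixel intrinsics are consistent with the 35 mm focal length in the filename convention and the 23.76 mm sensor width, via the standard pinhole relation fx = f_mm / sensor_width_mm × image_width:

```python
def fx_pixels(focal_mm: float, sensor_width_mm: float, image_width_px: int) -> float:
    """Pinhole model: convert a physical focal length to focal length in pixels."""
    return focal_mm / sensor_width_mm * image_width_px

print(round(fx_pixels(35.0, 23.76, 720), 3))  # 1060.606 (480x720 setting)
print(round(fx_pixels(35.0, 23.76, 672), 3))  # 989.899  (384x672 setting)
```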
Note that the resolution of 384×672 refers to our internal video diffusion resolution. In fact, we render the video at a resolution of 378×672 (aspect ratio 9:16), with a 3-p