# Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment
<div align="center">
<a href="https://huggingface.co/bytedance-research/Phantom"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=orange"></a>
<p align="center"><img src="assets/teaser.png" width=95%></p>
<p>Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment<br>
Lijie Liu<sup>*</sup>, Tianxiang Ma<sup>*</sup>, Bingchuan Li<sup>*†</sup>, Zhuowei Chen<sup>*</sup>, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, Xinglong Wu<br>
<sup>*</sup>Equal contribution, <sup>†</sup>Project lead<br>
Intelligent Creation Team, ByteDance</p>
</div>
## 🔥 Latest News!
- Sep 10, 2025: 🔥 HuMo is here! We release a human-centric video generation model that supports multimodal inputs including text, reference images, and audio. HuMo has strong text prompt following, consistent subject preservation and synchronized audio-driven motion. The model is open-sourced, feel free to use it!
- Jun 26, 2025: 🎉 Congratulations! Phantom has been accepted by ICCV 2025!
- Jun 24, 2025: 📊 We have published Phantom-Data, the first general-purpose large-scale cross-pair dataset, which introduces a core cross-pair data construction method for subject-consistent video generation tasks.
- May 27, 2025: 🎉 We have released the Phantom-Wan-14B model, a more powerful Subject-to-Video generation model.
- Apr 23, 2025: 😊 Thanks to ComfyUI-WanVideoWrapper for adapting ComfyUI to Phantom-Wan-1.3B (May 27: the 14B model has also been adapted.). Everyone is welcome to use it!
- Apr 21, 2025: 👋 Phantom-Wan is coming! We adapted the Phantom framework into the Wan2.1 video generation model. The inference codes and checkpoint have been released.
- Apr 10, 2025: We have updated the full version of the Phantom paper, which now includes more detailed descriptions of the model architecture and dataset pipeline.
- Feb 16, 2025: We proposed a novel subject-consistent video generation model, Phantom, and have released the report publicly. For more video demos, please visit the project page.
## 📑 Todo List
- [x] Inference codes and Checkpoint of Phantom-Wan-1.3B
- [x] Checkpoint of Phantom-Wan-14B
- [ ] Checkpoint of Phantom-Wan-14B Pro
- [ ] Open source Phantom-Data
- [ ] Training codes of Phantom-Wan
## 📖 Overview
Phantom is a unified video generation framework for single- and multi-subject references, built on existing text-to-video and image-to-video architectures. It achieves cross-modal alignment by redesigning the joint text-image injection model and training on text-image-video triplet data. It also emphasizes subject consistency in human generation while enhancing ID-preserving video generation.
## ⚡️ Quickstart
### Installation
Clone the repo:

```shell
git clone https://github.com/Phantom-video/Phantom.git
cd Phantom
```

Install dependencies:

```shell
# Ensure torch >= 2.4.0
pip install -r requirements.txt
```
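Since the requirements assume `torch >= 2.4.0`, it can help to verify the installed version before running inference. A minimal sketch (the helper name is illustrative, not part of the repo):

```python
def meets_min_torch(version: str, minimum=(2, 4, 0)) -> bool:
    """True if a 'major.minor.patch' version string satisfies the minimum.

    Local build suffixes such as '+cu121' are ignored.
    """
    parts = version.split("+")[0].split(".")[:3]
    return tuple(int(p) for p in parts) >= minimum

# Example checks against the torch >= 2.4.0 requirement:
print(meets_min_torch("2.4.1+cu121"))  # True
print(meets_min_torch("2.1.0"))        # False
```

In practice you would feed it `torch.__version__`, e.g. `python -c "import torch; print(torch.__version__)"`.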
### Model Download
| Models | Download Link | Notes |
|--------|---------------|-------|
| Phantom-Wan-1.3B | [🤗 Huggingface](https://huggingface.co/bytedance-research/Phantom) | Supports both 480P and 720P |
| Phantom-Wan-14B | [🤗 Huggingface](https://huggingface.co/bytedance-research/Phantom) | Supports both 480P and 720P |
First, download the original Wan2.1 1.3B model, since Phantom-Wan relies on the Wan2.1 VAE and text encoder. Download Wan2.1-T2V-1.3B using huggingface-cli:

```shell
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
```
Then download the Phantom-Wan-1.3B and Phantom-Wan-14B models:

```shell
huggingface-cli download bytedance-research/Phantom --local-dir ./Phantom-Wan-Models
```
Alternatively, you can manually download the required models and place them in the Phantom-Wan-Models folder.
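After downloading, a quick existence check can save a failed run. This is an illustrative sketch: the 1.3B checkpoint filename is taken from the inference commands below, but any other required files are release-dependent, so adjust the list to what your download actually contains.

```python
from pathlib import Path

def missing_paths(root="./Phantom-Wan-Models", names=("Phantom-Wan-1.3B.pth",)):
    """Return the required entries (files or directories) absent under `root`."""
    base = Path(root)
    return [n for n in names if not (base / n).exists()]

# An empty or wrong directory reports everything as missing:
print(missing_paths("/nonexistent"))  # ['Phantom-Wan-1.3B.pth']
```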
### Run Subject-to-Video Generation
#### Phantom-Wan-1.3B
- Single-GPU inference

```shell
python generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "examples/ref1.png,examples/ref2.png" --prompt "暖阳漫过草地,扎着双马尾、头戴绿色蝴蝶结、身穿浅绿色连衣裙的小女孩蹲在盛开的雏菊旁。她身旁一只棕白相间的狗狗吐着舌头,毛茸茸尾巴欢快摇晃。小女孩笑着举起黄红配色、带有蓝色按钮的玩具相机,将和狗狗的欢乐瞬间定格。" --base_seed 42
```
- Multi-GPU inference using FSDP + xDiT USP

```shell
pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "examples/ref3.png,examples/ref4.png" --dit_fsdp --t5_fsdp --ulysses_size 4 --ring_size 2 --prompt "夕阳下,一位有着小麦色肌肤、留着乌黑长发的女人穿上有着大朵立体花朵装饰、肩袖处带有飘逸纱带的红色纱裙,漫步在金色的海滩上,海风轻拂她的长发,画面唯美动人。" --base_seed 42
```
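When picking `--ulysses_size` and `--ring_size`, note that xDiT USP shards the sequence across `ulysses_size * ring_size` ranks, so (assuming xfuser's usual constraint) their product must equal `--nproc_per_node`. A small sanity-check sketch:

```python
def valid_parallel_layout(nproc_per_node: int, ulysses_size: int, ring_size: int) -> bool:
    """USP splits work across ulysses_size * ring_size ranks, which must
    match the number of processes torchrun launches."""
    return ulysses_size * ring_size == nproc_per_node

print(valid_parallel_layout(8, 4, 2))  # True: the layout used in the command above
print(valid_parallel_layout(8, 4, 4))  # False: 16 ranks would be needed
```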
💡 **Note:**
- Changing `--ref_image` switches between single-reference and multi-reference Subject-to-Video generation. Use at most 4 reference images.
- For the best generation results, describe the visual content of each reference image as accurately as possible in `--prompt`. For example, "examples/ref1.png" can be described as "a toy camera in yellow and red with blue buttons".
- If the generated video is unsatisfactory, the most straightforward fix is to change `--base_seed` and adjust the description in `--prompt`.
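The `--ref_image` flag takes a comma-separated list, as in the commands above. A hypothetical helper (not part of the repo) showing how such a value could be parsed and the 4-image limit enforced:

```python
def parse_ref_images(arg: str, max_refs: int = 4):
    """Split a comma-separated --ref_image value into individual paths."""
    paths = [p.strip() for p in arg.split(",") if p.strip()]
    if not 1 <= len(paths) <= max_refs:
        raise ValueError(f"expected 1-{max_refs} reference images, got {len(paths)}")
    return paths

print(parse_ref_images("examples/ref1.png,examples/ref2.png"))
# ['examples/ref1.png', 'examples/ref2.png']
```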
For more inference examples, please refer to `infer.sh`. You will get the following generated results:
<table style="width: 100%; border-collapse: collapse; text-align: center; border: 1px solid #ccc;">
<tr>
<th style="text-align: center;"><strong>Reference Images</strong></th>
<th style="text-align: center;"><strong>Generated Videos (480P)</strong></th>
</tr>
<tr>
<td style="text-align: center; vertical-align: middle;"><img src="examples/ref1.png" alt="Image 1" style="height: 180px;"> <img src="examples/ref2.png" alt="Image 2" style="height: 180px;"></td>
<td style="text-align: center; vertical-align: middle;"><img src="examples/ref_results/result1.gif" alt="GIF 1" style="width: 400px;"></td>
</tr>
<tr>
<td style="text-align: center; vertical-align: middle;"><img src="examples/ref3.png" alt="Image 3" style="height: 180px;"> <img src="examples/ref4.png" alt="Image 4" style="height: 180px;"></td>
<td style="text-align: center; vertical-align: middle;"><img src="examples/ref_results/result2.gif" alt="GIF 2" style="width: 400px;"></td>
</tr>
<tr>
<td style="text-align: center; vertical-align: middle;"><img src="examples/ref5.png" alt="Image 5" style="height: 180px;"> <img src="examples/ref6.png" alt="Image 6" style="height: 180px;"> <img src="examples/ref7.png" alt="Image 7" style="height: 180px;"></td>
<td style="text-align: center; vertical-align: middle;"><img src="examples/ref_results/result3.gif" alt="GIF 3" style="width: 400px;"></td>
</tr>
<tr>
<td style="text-align: center; vertical-align: middle;"><img src="examples/ref8.png" alt="Image 8" style="height: 100px;"> <img src="examples/ref9.png" alt="Image 9" style="height: 100px;"> <img src="examples/ref10.png" alt="Image 10" style="height: 100px;"> <img src="examples/ref11.png" alt="Image 11" style="height: 100px;"></td>
<td style="text-align: center; vertical-align: middle;"><img src="examples/ref_results/result4.gif" alt="GIF 4" style="width: 400px;"></td>
</tr>
</table>

#### Phantom-Wan-14B
- Single-GPU inference

```shell
python generate.py --task s2v-14B --size 832*480 --frame_num 121 --sample_fps 24 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models --ref_image "examples/ref12.png,examples/ref13.png" --prompt "扎着双丸子头,身着红黑配色并带有火焰纹饰服饰,颈戴金项圈、臂缠金护腕的哪吒,和有着一头淡蓝色头发,额间有蓝色印记,身着一袭白色长袍的敖丙,并肩坐在教室的座位上,他们专注地讨论着书本内容。背景为柔和的灯光和窗外微风拂过的树叶,营造出安静又充满活力的学习氛围。"
```
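The `--frame_num` and `--sample_fps` flags together set the clip length. A quick worked check for the 14B example above (the 4n + 1 frame-count pattern is an assumption carried over from the Wan2.1 base model family, not something this README states):

```python
def clip_seconds(frame_num: int, fps: int) -> float:
    """Duration of a generated clip in seconds."""
    return frame_num / fps

# 121 frames at 24 fps is roughly a 5-second clip:
print(round(clip_seconds(121, 24), 2))  # 5.04
# Wan2.1-style models typically expect frame counts of the form 4n + 1;
# 121 = 4 * 30 + 1 fits that pattern.
print((121 - 1) % 4 == 0)  # True
```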
- Multi-GPU inference using FSDP + xDiT USP

```shell
pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task s2v-14B --size 832*480 --frame_num 121 --sample_fps 24 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models --ref_image "examples/ref14.png,examples/ref15.png,examples/ref16.png" --dit_fsdp --t5_fsdp --ulysses_size 8 --ring_size 1 --prompt "一位戴着黄色帽子、身穿黄色上衣配棕色背带的卡通老爷爷,在装饰有粉色和蓝色桌椅、悬挂着彩色吊灯且摆满彩色圆球装饰的清新卡通风格咖啡馆里,端起一只蓝色且冒着热气的咖啡杯,画面风格卡通、清新。"
```
💡 **Note:**
- The currently released Phantom-Wan-14B model was trained on 480P data. It can also generate videos at 720P and higher resolutions, though the results may be less stable. We plan to release a further-trained version in the future.