HoloCine
[CVPR 2026] Official Implementations for Paper - HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
[📄 Paper] [🌐 Project Page] [🤗 Model Weights]
https://github.com/user-attachments/assets/c4dee993-7c6c-4604-a93d-a8eb09cfd69b
Yihao Meng<sup>1,2</sup>, Hao Ouyang<sup>2</sup>, Yue Yu<sup>1,2</sup>, Qiuyu Wang<sup>2</sup>, Wen Wang<sup>2,3</sup>, Ka Leong Cheng<sup>2</sup>, <br>Hanlin Wang<sup>1,2</sup>, Yixuan Li<sup>2,4</sup>, Cheng Chen<sup>2,5</sup>, Yanhong Zeng<sup>2</sup>, Yujun Shen<sup>2</sup>, Huamin Qu<sup>1</sup> <br> <sup>1</sup>HKUST, <sup>2</sup>Ant Group, <sup>3</sup>ZJU, <sup>4</sup>CUHK, <sup>5</sup>NTU
TLDR
- What it is: A text-to-video model that generates full scenes, not just isolated clips.
- Key Feature: It maintains consistency of characters, objects, and style across all shots in a scene.
- How it works: You provide shot-by-shot text prompts, giving you directorial control over the final video.
We strongly recommend viewing our demo page.
If you enjoyed the videos we created, please consider giving us a star 🌟.
🚀 Open-Source Plan
✅ Released
- Full inference code
- HoloCine-14B-full
- HoloCine-14B-sparse
⏰ To Be Released
- HoloCine-14B-full-l (for videos longer than 1 minute)
- HoloCine-14B-sparse-l (for videos longer than 1 minute)
- HoloCine-5B-full (for limited-memory users)
- HoloCine-5B-sparse (for limited-memory users)
🗺️ In Planning
- Support first frame and key-frame input
- HoloCine-audio
Community Support
ComfyUI
Thanks to Dango233 for implementing a ComfyUI node for HoloCine (kijai/ComfyUI-WanVideoWrapper#1566 and https://github.com/Dango233/ComfyUI-WanVideoWrapper-Multishot/). This integration is still under testing, so feel free to open an issue if you encounter any problems.
Setup
git clone https://github.com/yihao-meng/HoloCine.git
cd HoloCine
Environment
We use an environment similar to DiffSynth's. If you already have a DiffSynth environment, you can probably reuse it.
conda create -n HoloCine python=3.10
conda activate HoloCine
pip install -e .
We use FlashAttention-3 to implement the sparse inter-shot attention, and we highly recommend it for its speed. Brief installation instructions:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
cd hopper
python setup.py install
If you encounter environment problem when installing FlashAttention-3, you can refer to their official github page https://github.com/Dao-AILab/flash-attention.
If you cannot install FlashAttention-3, you can use FlashAttention-2 as an alternative; our code will automatically detect the installed FlashAttention version. FlashAttention-2 is slower than FlashAttention-3 but produces the same results.
To install FlashAttention-2, you can use the following command:
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
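The automatic version detection mentioned above can be sketched roughly as follows. This is a minimal illustration, not the exact logic in the repository; it only assumes the standard import names of the two libraries (`flash_attn_interface` for FlashAttention-3 built from the `hopper` directory, `flash_attn` for FlashAttention-2):

```python
# Hypothetical sketch of FlashAttention backend detection;
# HoloCine's actual inference code may organize this differently.

def detect_flash_attention() -> str:
    """Return which FlashAttention backend is importable."""
    try:
        import flash_attn_interface  # noqa: F401  (FlashAttention-3)
        return "fa3"
    except ImportError:
        pass
    try:
        import flash_attn  # noqa: F401  (FlashAttention-2)
        return "fa2"
    except ImportError:
        # Neither is installed: fall back to PyTorch's built-in attention.
        return "none"


if __name__ == "__main__":
    print(f"Detected FlashAttention backend: {detect_flash_attention()}")
```

Running this before inference is a quick way to confirm which backend your environment will actually use.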
Checkpoint
Step 1: Download Wan 2.2 VAE and T5
If you already have downloaded Wan 2.2 14B T2V before, skip this section.
If not, you need the T5 text encoder and the VAE from the original Wan 2.2 repository: https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B
Based on the repository's file structure, you only need to download models_t5_umt5-xxl-enc-bf16.pth and Wan2.1_VAE.pth.
You do not need to download the google, high_noise_model, or low_noise_model folders, nor any other files.
Recommended Download (CLI)
We recommend using huggingface-cli to download only the necessary files. Make sure you have huggingface_hub installed (pip install huggingface_hub).
This command will download only the required T5 and VAE models into the correct directory:
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B \
  --local-dir checkpoints/Wan2.2-T2V-A14B \
  --include "models_t5_*.pth" "Wan2.1_VAE.pth"
Manual Download
Alternatively, go to the "Files" tab on the Hugging Face repo and manually download the following two files:
- models_t5_umt5-xxl-enc-bf16.pth
- Wan2.1_VAE.pth
Place both files inside a new folder named checkpoints/Wan2.2-T2V-A14B/.
Step 2: Download HoloCine Model (HoloCine_dit)
Download our fine-tuned high-noise and low-noise DiT checkpoints from the following link:
[➡️ Download HoloCine_dit Model Checkpoints Here]
This download contains four fine-tuned model files: two for the full-attention version (full_high_noise.safetensors, full_low_noise.safetensors) and two for the sparse inter-shot attention version (sparse_high_noise.safetensors, sparse_low_noise.safetensors). The sparse version is still uploading.
You can choose one version to download, or try both if you want.
The full attention version performs better, so we suggest starting with it. The sparse inter-shot attention version can be slightly less stable (though still strong in most cases), but it is faster than the full attention version.
For the full attention version:
Create a new folder named checkpoints/HoloCine_dit/full/ and place both the high- and low-noise files inside.
For the sparse attention version:
Create a new folder named checkpoints/HoloCine_dit/sparse/ and place both the high- and low-noise files inside.
Step 3: Final Directory Structure
If you downloaded the full model, your checkpoints directory should look like this:
checkpoints/
├── Wan2.2-T2V-A14B/
│ ├── models_t5_umt5-xxl-enc-bf16.pth
│ └── Wan2.1_VAE.pth
└── HoloCine_dit/
└── full/
├── full_high_noise.safetensors
└── full_low_noise.safetensors
(If you downloaded the sparse model, replace full with sparse.)
Inference
We release two versions of the model: one uses full attention over the multi-shot sequence (our default), the other uses sparse inter-shot attention.
To use the full attention version:
python HoloCine_inference_full_attention.py
To use the sparse inter-shot attention version:
python HoloCine_inference_sparse_attention.py
If you don't have enough VRAM, you can reduce the frame count from 241 to 81 (15 s to 5 s).
Prompt Format
To achieve precise control over the content of each shot, our prompts follow a fixed format. Our inference script is designed to be flexible and supports two ways of providing the text prompt. Note that the text encoder currently truncates any prompt that exceeds its 512-token limit, so keep your prompt concise and under 512 tokens.
Choice 1: Structured Input (Recommended if you want to test on your own sample)
This is the easiest way to create new multi-shot prompts. You provide the components as separate arguments inside the script, and our helper function will format them correctly.
- `global_caption`: A string describing the entire scene, characters, and setting.
- `shot_captions`: A list of strings, where each string describes one shot in sequential order.
- `num_frames`: The total number of frames for the video (default is 241, as we train on this sequence length).
- `shot_cut_frames`: (Optional) A list of frame numbers where you want cuts to happen. By default, the script automatically calculates evenly spaced cuts. If you customize it, make sure the number of cuts in `shot_cut_frames` aligns with `shot_captions`.
Example (inside HoloCine_inference_full_attention.py):
run_inference(
pipe=pipe,
negative_prompt=scene_negative_prompt,
output_path="test_structured_output.mp4",
# Choice 1 inputs
global_caption="The scene is set in a lavish, 1920s Art Deco ballroom during a masquerade party. [character1] is a mysterious woman with a sleek bob, wearing a sequined silver dress and an ornate feather mask. [character2] is a dapper gentleman in a black tuxedo, his face half-hidden by a simple black domino mask. The environment is filled with champagne fountains, a live jazz band, and dancing couples in extravagant costumes. This scene contains 5 shots.",
shot_captions=[
"Medium shot of [character1] standing by a pillar, observing the crowd, a champagne flute in her hand.",
"Close-up of [character2] watching her from across the room, a look of intrigue on his visible features.",
"Medium shot as [character2] navigates the crowd and approaches [character1], offering a polite bow. ",
"Close-up on [character1]'s eyes through her mask, as they crinkle in a subtle, amused smile.",
"A stylish medium two-shot of them standing together, the swirling party out of focus behind them, as they begin to converse."
],
num_frames=241
)
https://github.com/user-attachments/assets/10dba757-27dc-4f65-8fc3-b396cf466063
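When `shot_cut_frames` is omitted, the evenly spaced cut placement described above can be approximated like this. This is an illustrative sketch; the script's exact rounding may differ:

```python
# Hypothetical sketch of the default evenly spaced shot-cut calculation;
# HoloCine's actual implementation may round differently.

def default_shot_cuts(num_frames: int, num_shots: int) -> list[int]:
    """Place num_shots - 1 cuts at evenly spaced frame indices."""
    return [round(i * num_frames / num_shots) for i in range(1, num_shots)]


if __name__ == "__main__":
    # For the 5-shot, 241-frame example above:
    print(default_shot_cuts(241, 5))  # → [48, 96, 145, 193]
```

If you do pass `shot_cut_frames` explicitly, keep the list length at `len(shot_captions) - 1` so each caption maps to exactly one shot.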
Choice 2: Raw String Input
This mode allows you to provide the full, concatenated prompt string, just like in our original script. This is useful if you want to reuse our provided prompts.
The format must be exact:
[global caption] ... [per shot caption] ... [shot cut] ... [shot cut] ...
Example (inside HoloCine_inference_full_attention.py):
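As a hedged sketch of how such a raw string could be assembled, the snippet below concatenates structured components using the bracketed delimiter tokens shown in the format line above. The delimiter semantics (first shot after `[per shot caption]`, later shots after each `[shot cut]`) are an assumption here; check the prompts shipped with the scripts for the authoritative format:

```python
# Hypothetical assembly of a raw prompt string following the
# "[global caption] ... [per shot caption] ... [shot cut] ..." format.
# The real script's delimiters and spacing may differ.

def build_raw_prompt(global_caption: str, shot_captions: list[str]) -> str:
    """Concatenate a global caption and per-shot captions into one raw string."""
    parts = [f"[global caption] {global_caption}"]
    if shot_captions:
        parts.append(f"[per shot caption] {shot_captions[0]}")
    parts.extend(f"[shot cut] {caption}" for caption in shot_captions[1:])
    return " ".join(parts)


if __name__ == "__main__":
    print(build_raw_prompt(
        "A quiet kitchen at dawn. This scene contains 2 shots.",
        ["Wide shot of the empty kitchen.", "Close-up of a kettle starting to steam."],
    ))
```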