Allegro
<p align="center"> <img src="https://github.com/rhymes-ai/Allegro/blob/main/assets/TI2V_banner.gif"/> </p> <p align="center"> <a href="https://rhymes.ai/allegro_gallery" target="_blank"> Gallery</a> · <a href="https://huggingface.co/rhymes-ai/Allegro" target="_blank">Hugging Face</a> · <a href="https://rhymes.ai/blog-details/allegro-advanced-video-generation-model" target="_blank">Blog</a> · <a href="https://arxiv.org/abs/2410.15458" target="_blank">Paper</a> · <a href="https://discord.com/invite/u8HxU23myj" target="_blank">Discord</a> </p>

Allegro is a powerful text-to-video model that generates high-quality videos up to 6 seconds at 15 FPS and 720p resolution from simple text input. Allegro-TI2V, a variant of Allegro, extends this functionality by generating similar high-quality videos using text inputs along with first-frame and optionally last-frame image inputs.

News 🔥

  • [2025/02/07] 🚀 We release the full code for Presto in this repo. Presto is a T2V model adapted from Allegro that generates longer videos with richer content.

  • [2025/01/02] 🚀 We release the training code for further training / fine-tuning on Allegro-TI2V-88x720P! Happy New Year!

  • [2024/12/26] 🚀 We release the low-resolution (<a href="https://huggingface.co/rhymes-ai/Allegro-T2V-40x360P">40x360P</a>) and fewer-frame (<a href="https://huggingface.co/rhymes-ai/Allegro-T2V-40x720P">40x720P</a>) models of Allegro for research purposes!

  • [2024/12/10] 🚀 We release the training code for further training / fine-tuning!

  • [2024/11/25] 🚀 Allegro-TI2V is open sourced!

  • [2024/10/30] 🚀 We release multi-card inference code and PAB in Allegro-VideoSys. With the VideoSys framework, inference time can be further reduced to 3 mins (8xH100) or 2 mins (8xH100 + PAB). We also opened a PR to the original VideoSys repo.

  • [2024/10/29] 🎉 Allegro has been merged into diffusers! It is currently supported in 0.32.0-dev0 and will be included in the next release. For now, install the dev version with pip install git+https://github.com/huggingface/diffusers.git. See Hugging Face for more details.

  • [2024/10/22] 🚀 Allegro is open sourced!
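The diffusers integration noted above offers an alternative to the repo's own scripts. A minimal sketch, assuming diffusers >= 0.32 (which includes AllegroPipeline), a CUDA GPU, and network access to download the weights; the function name is a hypothetical convenience:

```python
def generate_allegro_video(prompt: str, out_path: str = "output.mp4") -> None:
    """Sketch: text-to-video via the diffusers AllegroPipeline (diffusers >= 0.32).

    Assumes a CUDA GPU and network access to download rhymes-ai/Allegro.
    """
    import torch
    from diffusers import AllegroPipeline
    from diffusers.utils import export_to_video

    pipe = AllegroPipeline.from_pretrained(
        "rhymes-ai/Allegro", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # ~9.3G GPU memory instead of ~27.5G
    frames = pipe(
        prompt, guidance_scale=7.5, num_inference_steps=100
    ).frames[0]
    export_to_video(frames, out_path, fps=15)


# Example call (downloads the weights and runs on GPU; not executed here):
# generate_allegro_video("A seaside harbor with bright sunlight and sparkling seawater.")
```

The sampling parameters mirror the single_inference.py example below; the pipeline's defaults may differ.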

Model Info

<table> <tr> <th>Model</th> <td>Allegro</td> <td>Allegro-TI2V</td> </tr> <tr> <th>Description</th> <td>Text-to-Video Generation Model</td> <td>Text-Image-to-Video Generation Model</td> </tr> <tr> <th>Download</th> <td><a href="https://huggingface.co/rhymes-ai/Allegro">Hugging Face (88x720P)</a><br><a href="https://huggingface.co/rhymes-ai/Allegro-T2V-40x720P">Hugging Face (40x720P)</a><br><a href="https://huggingface.co/rhymes-ai/Allegro-T2V-40x360P">Hugging Face (40x360P)</a></td> <td><a href="https://huggingface.co/rhymes-ai/Allegro-TI2V">Hugging Face (88x720P)</a></td> </tr> <tr> <th rowspan="2">Parameter</th> <td colspan="2">VAE: 175M</td> </tr> <tr> <td colspan="2">DiT: 2.8B</td> </tr> <tr> <th rowspan="2">Inference Precision</th> <td colspan="2">VAE: FP32/TF32/BF16/FP16 (best in FP32/TF32)</td> </tr> <tr> <td colspan="2">DiT/T5: BF16/FP32/TF32</td> </tr> <tr> <th>Context Length</th> <td colspan="2">79.2K</td> </tr> <tr> <th>Resolution</th> <td colspan="2">720 x 1280</td> </tr> <tr> <th>Frames</th> <td colspan="2">88</td> </tr> <tr> <th>Video Length</th> <td colspan="2">6 seconds @ 15 FPS</td> </tr> <tr> <th>Single GPU Memory Usage</th> <td colspan="2">9.3G BF16 (with cpu_offload)</td> </tr> <tr> <th>Inference time</th> <td colspan="2">20 mins (single H100) / 3 mins (8xH100)</td> </tr> </table>
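The frame count and frame rate in the table imply the stated clip length; a quick arithmetic check:

```python
frames = 88          # per the Model Info table
fps = 15             # per the Model Info table
duration_s = frames / fps
print(round(duration_s, 2))  # → 5.87, i.e. roughly the advertised 6 seconds
```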

Quick Start

Single Inference

Allegro

  1. Download the Allegro GitHub code.

  2. Install the necessary requirements.

    • Ensure Python >= 3.10, PyTorch >= 2.4, CUDA >= 12.4. For details, see requirements.txt.

    • It is recommended to use Anaconda to create a new environment (Python >= 3.10) to run the following example.

  3. Download the Allegro model weights.

  4. Run inference.

    python single_inference.py \
    --user_prompt 'A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats.' \
    --save_path ./output_videos/test_video.mp4 \
    --vae your/path/to/vae \
    --dit your/path/to/transformer \
    --text_encoder your/path/to/text_encoder \
    --tokenizer your/path/to/tokenizer \
    --guidance_scale 7.5 \
    --num_sampling_steps 100 \
    --seed 42
    

    Use --enable_cpu_offload to offload the model to the CPU, reducing GPU memory cost to about 9.3G (compared to 27.5G without offload), at the cost of significantly longer inference time.

  5. (Optional) Interpolate the video to 30 FPS.

    It is recommended to use EMA-VFI to interpolate the video from 15 FPS to 30 FPS.

    For better visual quality, please use imageio to save the video.
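The interpolation and save steps can be sketched as follows. EMA-VFI synthesizes true intermediate frames and should be preferred; the naive duplicate-frame fallback below only illustrates the 15 → 30 FPS bookkeeping. Both function names are hypothetical, and save_video assumes imageio (with imageio-ffmpeg) is installed:

```python
def naive_interpolate(frames):
    # Naive 15 -> 30 FPS fallback: repeat each frame once.
    # EMA-VFI instead synthesizes true intermediate frames and
    # gives much better visual quality.
    doubled = []
    for f in frames:
        doubled.extend([f, f])
    return doubled


def save_video(frames, path="output_30fps.mp4", fps=30):
    # Save with imageio, as the README recommends for visual quality
    # (assumes 'imageio' and 'imageio-ffmpeg' are installed).
    import imageio
    imageio.mimsave(path, frames, fps=fps)
```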

Allegro TI2V

  1. Download the Allegro GitHub code.

  2. Install the necessary requirements.

    • Ensure Python >= 3.10, PyTorch >= 2.4, CUDA >= 12.4. For details, see requirements.txt.

    • It is recommended to use Anaconda to create a new environment (Python >= 3.10) to run the following example.

  3. Download the Allegro-TI2V model weights.

  4. Run inference.

    python single_inference_ti2v.py \
    --user_prompt 'The car drives along the road' \
    --first_frame your/path/to/first_frame_image.png \
    --vae your/path/to/vae \
    --dit your/path/to/transformer \
    --text_encoder your/path/to/text_encoder \
    --tokenizer your/path/to/tokenizer \
    --guidance_scale 8 \
    --num_sampling_steps 100 \
    --seed 1427329220
    

    The output video resolution is fixed at 720 × 1280. Input images with different resolutions will be automatically cropped and resized to fit.

| Argument | Description |
|----------|-------------|
| `--user_prompt` | [Required] Text input for image-to-video generation. |
| `--first_frame` | [Required] First-frame image input for image-to-video generation. |
| `--last_frame` | [Optional] If provided, the model will generate intermediate video content based on the specified first and last frame images. |
| `--enable_cpu_offload` | [Optional] Offload the model to the CPU for lower GPU memory cost (about 9.3G, compared to 27.5G without offload), at the cost of significantly longer inference time. |

  5. (Optional) Interpolate the video to 30 FPS.

    It is recommended to use EMA-VFI to interpolate the video from 15 FPS to 30 FPS.

    For better visual quality, please use imageio to save the video.
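Step 4 notes that input images are automatically cropped and resized to the fixed 720 × 1280 output resolution. A sketch of what such preprocessing typically looks like, assuming Pillow; the helper name is hypothetical and the repo's own implementation may differ:

```python
def fit_to_resolution(img, target_w=1280, target_h=720):
    """Hypothetical helper: center-crop to the target aspect ratio, then resize.

    Illustrates typical 'crop and resize to fit' preprocessing; the repo's
    actual logic may differ in filter choice or crop placement.
    """
    from PIL import Image
    w, h = img.size
    target_ratio = target_w / target_h
    if w / h > target_ratio:            # too wide: crop width
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:                               # too tall: crop height
        new_h = int(w / target_ratio)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    return img.resize((target_w, target_h), Image.BICUBIC)
```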

Multi-Card Inference

For both Allegro & Allegro TI2V: We release multi-card inference code and PAB in Allegro-VideoSys.

Training / Fine-tuning

Allegro T2V

  1. Download the Allegro GitHub code, Allegro model weights and prepare the environment in requirements.txt.

  2. Our training code loads the dataset from .parquet files. We recommend first constructing a .jsonl file to store all data cases in a list. Each case should be stored as a dict, like this:

    [
        {"path": "foo/bar.mp4", "num_frames": 123, "height": 1080, "width": 1920, "cap": "This is a fake caption."},
        ...
    ]
    

    After that, run dataset_utils.py to convert .jsonl into .parquet.

    The absolute path to each video is constructed by joining args.data_dir in train.py with the path value from the dataset. Therefore, you may define path as a relative path within your dataset and set args.data_dir to the root dir when running training.

  3. Run Training / Fine-tuning:

    export OMP_NUM_THREADS=1
    export MKL_NUM_THREADS=1
    
    export WANDB_API_KEY=YOUR_WANDB_KEY
    
    accelerate launch \
        --num_machines 1 \
        --num_processes 8 \
        --machine_rank 0 \
        --config_file config/accelerate_config.yaml \
        train.py \
        --project_name Allegro_F
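The metadata step above can be sketched with the standard library. write_metadata and its schema check are hypothetical conveniences (the repo's dataset_utils.py performs the actual .parquet conversion); the field names follow the README example, and the data_dir value is a made-up illustration of how train.py joins paths:

```python
import json
import os

REQUIRED_FIELDS = {"path", "num_frames", "height", "width", "cap"}

def write_metadata(cases, path="metadata.jsonl"):
    # Hypothetical helper: validate and dump the metadata list that
    # dataset_utils.py converts to .parquet; schema per the README example.
    for case in cases:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"{case.get('path', '?')} is missing fields: {missing}")
    with open(path, "w") as f:
        json.dump(cases, f, indent=4)

cases = [
    {"path": "foo/bar.mp4", "num_frames": 123,
     "height": 1080, "width": 1920, "cap": "This is a fake caption."},
]
write_metadata(cases)

# train.py resolves each video as os.path.join(args.data_dir, case["path"]):
data_dir = "/data/my_videos"  # hypothetical args.data_dir
print(os.path.join(data_dir, cases[0]["path"]))  # → /data/my_videos/foo/bar.mp4
```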
    