# OmniCustom

Official Implementation of "OmniCustom: Sync Audio-Video Customization via Joint Audio-Video Generation Model"
<div align="center">
<a href="https://huggingface.co/Omni1307/OmniCustom"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=orange"></a>
</div>
## 🔥 Latest News!

- Feb 12, 2026: We propose OmniCustom, a novel framework for sync audio-video customization. For more video demos, please visit the project page.
- Feb 14, 2026: The inference code and the model checkpoint are publicly available.

## 🎥 Video
https://github.com/user-attachments/assets/8ddeaf77-ea5d-45d6-ad1a-440f64fc96a0
<!-- https://github.com/user-attachments/assets/7943515a-691b-417e-99c7-65003a63e258 -->

## 📖 Overview
Given a reference image $I^{r}$ and a reference audio $A^{r}$, our OmniCustom framework synchronously generates a video that preserves the visual identity from $I^{r}$ and an audio track that mimics the timbre of $A^{r}$. Here, the speech content can be freely specified through a textual prompt.
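As the example prompts later in this README show, the spoken content inside a text prompt is delimited with `<S>`/`<E>` markers. A small helper like the following (illustrative only, not part of the released code) shows how such a prompt might be assembled:

```python
def build_prompt(scene_description: str, speech: str) -> str:
    """Combine a visual scene description with spoken content.

    The <S>/<E> markers follow the convention used in the repo's
    example prompts; this helper itself is a hypothetical sketch.
    """
    return f"{scene_description} <S>{speech}<E>"

prompt = build_prompt(
    "A man stands at a podium and announces in a steady tone:",
    "The board wants to sell OpenAI to Zuckerberg, which is unacceptable.",
)
print(prompt)
```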
<p align="center"> <img src="assets/images/teaser.png"> </p>

## ⚡️ Quickstart
### Installation

1. Clone the repo:

```shell
git clone https://github.com/OmniCustom-project/OmniCustom.git
cd OmniCustom
```

2. Create the environment:

```shell
conda create -n omnicustom python=3.10
conda activate omnicustom
pip install -r requirements.txt
```

3. Install Flash Attention:

```shell
pip install flash-attn --no-build-isolation
```
### Model Download

First, download the original models of Ovi, Wan2.2-TI2V-5B, and MMAudio. You can fetch them with `download_weights.py` and place them under `ckpts`:

```shell
python3 download_weights.py --output-dir ./ckpts
```
| Models | Download Link | Notes |
|--------|---------------|-------|
| OmniCustom models | 🤗 Huggingface | 1.9G |
| Naturalspeech 3 | 🤗 Huggingface | timbre embedding extractor |
| InsightFace | 🤗 Huggingface | face embedding extractor |
| LivePortrait | 🤗 Huggingface | crop reference image |
Then, download our OmniCustom model together with Naturalspeech 3, InsightFace, and LivePortrait from Huggingface, and put them into `ckpts`. We provide a unified command to download all four models from Huggingface:

```shell
pip install "huggingface_hub[cli]"
huggingface-cli download Omni1307/OmniCustom \
  --include "ckpts/**" \
  --local-dir ./ \
  --local-dir-use-symlinks False
```
The final structure of the `ckpts` directory should be:

```
# OmniCustom/ckpts
ckpts/
├── InsightFace/
├── LivePortrait/
├── MMAudio/
├── naturalspeech3_facodec/
├── Ovi/
├── step-92000.safetensors
└── Wan2.2-TI2V-5B/
```
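Before launching inference, it can help to verify that every expected checkpoint is actually in place. This small check script is not part of the repo, just a sketch based on the directory tree above:

```python
from pathlib import Path

# Entries expected under ckpts/ per the directory tree above.
EXPECTED = [
    "InsightFace",
    "LivePortrait",
    "MMAudio",
    "naturalspeech3_facodec",
    "Ovi",
    "step-92000.safetensors",
    "Wan2.2-TI2V-5B",
]

def missing_checkpoints(root: str = "./ckpts") -> list[str]:
    """Return the expected entries that are absent from the ckpts dir."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```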
## ⚙️ Configure OmniCustom

The configuration file `OmniCustom/configs/inference/inference_fusion.yaml` can be modified. The following parameters control generation quality, video resolution, and how the text, image, and audio inputs are balanced:
```yaml
ckpt_name: Ovi/model.safetensors # base model
lora_path: ./ckpts/step-92000.safetensors # the checkpoint of our OmniCustom
self_lora: true

# face embedder
face_embedder_ckpt_dir: ./ckpts/InsightFace
face_ip_emb_dim: 512

# audio embedder
audio_embedder_ckpt_dir: ./ckpts/naturalspeech3_facodec
audio_ip_emb_dim: 256

# output
output_dir: ./outputs/

sample_steps: 50 # number of denoising steps; lower (30-40) = faster generation
solver_name: unipc # sampling algorithm for the denoising process
shift: 5.0 # timestep shift factor for the sampling scheduler
sp_size: 1
audio_guidance_scale: 3.0
video_guidance_scale: 4.0
mode: "id2v" # ["id2v", "t2v", "i2v", "t2i2v"]; all modes come with audio
fp8: False # load the fp8 version of the model; degrades quality and may not be faster
cpu_offload: False
seed: 102 # random seed for reproducible results
crop_face: true # crop the face region from the reference image
video_negative_prompt: "jitter, bad hands, blur, distortion, two people, two persons, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border"
audio_negative_prompt: "robotic, muffled, echo, distorted" # avoid artifacts in audio
video_frame_height_width: [576, 992] # only used if mode = t2v or t2i2v; recommended values: [512, 992], [992, 512], [960, 512], [512, 960], [720, 720], [448, 1120]
text_prompt: ./example_prompts/benchmark_example.csv # group generation
slg_layer: 11
each_example_n_times: 1
```
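If you prefer tweaking a few fields programmatically rather than editing the YAML by hand, a small PyYAML-based helper could apply overrides on top of the file. This is an illustrative sketch, not shipped with the repo (`load_config_with_overrides` is a hypothetical name):

```python
import yaml  # PyYAML

def load_config_with_overrides(path: str, **overrides) -> dict:
    """Load an inference YAML and apply keyword overrides on top."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    cfg.update(overrides)
    return cfg

# e.g. fewer denoising steps and a fresh seed for a quick test run:
# cfg = load_config_with_overrides(
#     "./configs/inference/inference_fusion.yaml",
#     sample_steps=30,
#     seed=7,
# )
```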
## 🔑 Inference

### Single GPU

```shell
bash ./inference.sh
```

Or run:

```shell
CUDA_VISIBLE_DEVICES=0 python infer.py --config-file ./configs/inference/inference_fusion.yaml
```
💡 Note:
- `text_prompt` in `configs/inference/inference_fusion.yaml` selects the examples for sync audio-video customization. `text_prompt` supports a CSV file, which contains `text_prompt`, `ip_image_path`, and `ip_audio_path`.
- Results without any customization and results with only identity customization are also saved to the result folder.
- When a generated video is unsatisfactory, the most straightforward solution is to change the `seed` in `configs/inference/inference_fusion.yaml`.
- Peak VRAM required is 80 GB on a single GPU.
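The prompt CSV can also be generated programmatically. Assuming the three column names given in the note above (the exact header and the audio path below are guesses; check `example_prompts/benchmark_example.csv` for the authoritative format), a sketch:

```python
import csv

rows = [
    {
        "text_prompt": "A man at a podium announces: <S>Hello everyone.<E>",
        "ip_image_path": "./assets/images/ref_68.png",
        "ip_audio_path": "./assets/audios/ref_68.wav",  # hypothetical path
    },
]

with open("my_prompts.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["text_prompt", "ip_image_path", "ip_audio_path"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

Then point `text_prompt` in the config at `my_prompts.csv`.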
## More Results
<table width="100%" border="1" cellpadding="20" cellspacing="0" align="center" style="border-collapse: collapse; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;">
<thead>
<tr bgcolor="#f5f5f5">
<th style="width:18%; padding:16px; border:1px solid #ddd; font-size:14px;">Reference Images</th>
<th style="width:18%; padding:16px; border:1px solid #ddd; font-size:14px;">Reference Audios</th>
<th style="width:24%; padding:16px; border:1px solid #ddd; font-size:14px;">Text prompts</th>
<th style="width:40%; padding:16px; border:1px solid #ddd; font-size:14px;">Generated Videos</th>
</tr>
</thead>
<tbody>
<tr> <!-- Row 1 -->
<td align="center" style="vertical-align:middle; padding:16px; border:1px solid #ddd; min-height:300px;">
<img src="https://raw.githubusercontent.com/OmniCustom-project/OmniCustom/main/assets/images/ref_68.png" alt="Reference Image 1" style="max-height:200px; max-width:100%; object-fit: contain;">
</td>
<td align="center" style="vertical-align:middle; padding:16px; border:1px solid #ddd; min-height:300px;">
<!-- Audio: shown as a play button; click to play in a new window -->
<a href="https://github.com/user-attachments/assets/281b285a-80bc-482e-a62b-6f1cce389c6a">
<img src="https://img.shields.io/badge/🔊-Play%20Audio%201-1e88e5?style=for-the-badge" alt="Play Audio 1">
</a>
</td>
<td style="line-height:1.6; vertical-align:middle; padding:16px; border:1px solid #ddd; font-size:13px; min-height:300px;">
<div style="max-height:280px; overflow-y: auto; padding-right: 8px;">
A man stands at the podium in OpenAI's luxurious conference room, behind him a massive electronic screen displays the company's glowing profit data. He grips the microphone firmly, gazes across the audience below, and announces in a steady tone: <S>The board wants to sell OpenAI to Zuckerberg, which is unacceptable.<E>
</div>
</td>
<td align="center" style="vertical-align:middle; padding:16px; border:1px solid #ddd; min-height:300px;">
<!-- Video: shown as a play button; click to play in a new window -->
<a href="https://github.com/user-attachments/assets/5472ec3d-57bc-45b3-a921-7913d0bd8bb7">
<img src="https://img.shields.io/badge/▶️-Watch%20Video%201-764ba2?style=for-the-badge" alt="Watch Video 1">
</a>
</td>
</tr>
<tr> <!-- Row 2 -->
<td align="center" style="vertical-align:middle; padding:16px; border:1px solid #ddd; min-height:300px;">
<img src="https://raw.githubusercontent.com/OmniCustom-project/OmniCustom/main/assets/images/ref_69.png" alt="Reference Image 2" style="max-height:200px; max-width:100%; object-fit: contain;">
</td>
<td align="center" style="vertical-align:middle; padding:16px; border:1px solid #ddd; min-height:300px;">
<a href="https://github.com/user-attachments/assets/a6294f38-ba3a-449c-b8e3-8fce181124fe">
<img src="https://img.shields.io/badge/🔊-Play%20Audio%202-1e88e5?style=for-the-badge" alt="Play Audio 2">
</a>
</td>
<td style="line-height:1.6; vertical-align:middle; padding:16px; border:1px solid #ddd; font-size:13px; min-height:300px;">
<div style="max-height:280px; overflow-y: auto; padding-right: 8px;">
A woman stands before the iconic Rockefeller Center Christmas Tree, its thousands of lights reflecting in her eyes as snow begins to fall gently around her. Wearing a tartan scarf and holding a cup of steaming cocoa, she brings her mittened hands together and speaks softly into the frosty air: <S>May the spirit of Christmas fill your heart throughout the coming year.<E>
</div>
</td>
</tr>
</tbody>
</table>
