OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

<div align="center">

project page  <a href="https://huggingface.co/Omni1307/OmniCustom"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=orange"></a>

</div>

🔥 Latest News!

  • Feb 12, 2026: We propose OmniCustom, a novel framework for synchronized audio-video customization. For more video demos, please visit the project page.
  • Feb 14, 2026: The inference code and the model checkpoint are publicly available.

🎥 Video

https://github.com/user-attachments/assets/8ddeaf77-ea5d-45d6-ad1a-440f64fc96a0

<!-- https://github.com/user-attachments/assets/7943515a-691b-417e-99c7-65003a63e258 -->

📖 Overview

Given a reference image $I^{r}$ and a reference audio $A^{r}$, our OmniCustom framework synchronously generates a video that preserves the visual identity from $I^{r}$ and an audio track that mimics the timbre of $A^{r}$. Here, the speech content can be freely specified through a textual prompt.

<p align="center"> <img src="assets/images/teaser.png"> </p>
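The input/output contract described above can be sketched as a tiny data structure. All names here are illustrative, not the repo's actual API: the point is simply what the generator is asked to preserve ($I^{r}$ for identity, $A^{r}$ for timbre) and what the text prompt controls.

```python
# Hypothetical sketch of the OmniCustom inference contract.
# None of these names come from the repo; they only mirror the
# description above: identity from I^r, timbre from A^r, speech from text.
from dataclasses import dataclass


@dataclass
class CustomizationRequest:
    ref_image_path: str  # I^r: source of the visual identity
    ref_audio_path: str  # A^r: source of the vocal timbre
    text_prompt: str     # speech content, freely specified


def describe(req: CustomizationRequest) -> str:
    """Summarize what the generator preserves and synthesizes."""
    return (f"identity from {req.ref_image_path}, "
            f"timbre from {req.ref_audio_path}, "
            f"speech: {req.text_prompt!r}")
```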

⚡️ Quickstart

Installation

1. Clone the repo:

```shell
git clone https://github.com/OmniCustom-project/OmniCustom.git
cd OmniCustom
```

2. Create the environment:

```shell
conda create -n omnicustom python=3.10
conda activate omnicustom
pip install -r requirements.txt
```

3. Install Flash Attention:

```shell
pip install flash-attn --no-build-isolation
```
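After installation, a quick sanity check that the core dependencies import cleanly can save a failed run later. This is a generic helper, not part of the repo; it assumes you run it inside the `omnicustom` conda environment.

```python
# Verify that the packages the install steps provide are importable.
import importlib


def check_deps(names):
    """Return a dict mapping module name -> True if importable."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # torch comes from requirements.txt, flash_attn from step 3 above
    print(check_deps(["torch", "flash_attn"]))
```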

Model Download

First, download the base models: Ovi, Wan2.2-TI2V-5B, and MMAudio. You can fetch them with download_weights.py, which places them in ckpts:

```shell
python3 download_weights.py --output-dir ./ckpts
```

| Models | Download Link | Notes |
|--------|---------------|-------|
| OmniCustom models | 🤗 Huggingface | 1.9G |
| Naturalspeech 3 | 🤗 Huggingface | timbre embedding extractor |
| InsightFace | 🤗 Huggingface | face embedding extractor |
| LivePortrait | 🤗 Huggingface | crop reference image |

Then, download the OmniCustom, Naturalspeech 3, InsightFace, and LivePortrait models from Huggingface and put them into ckpts. We provide a unified command to download all four:

```shell
pip install "huggingface_hub[cli]"
huggingface-cli download Omni1307/OmniCustom \
  --include "ckpts/**" \
  --local-dir ./ \
  --local-dir-use-symlinks False
```
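The same download can be scripted through `huggingface_hub`'s `snapshot_download`, the API behind the CLI above. The argument builder is split out so it can be inspected without touching the network; repo id and patterns are copied from the command.

```python
# Programmatic alternative to the huggingface-cli call above.
def ckpt_download_args(local_dir: str = "./"):
    """Mirror the CLI flags used above."""
    return dict(
        repo_id="Omni1307/OmniCustom",
        allow_patterns=["ckpts/**"],  # == --include "ckpts/**"
        local_dir=local_dir,          # == --local-dir ./
    )


def download_ckpts(local_dir: str = "./"):
    # deferred import: keeps the arg builder usable without huggingface_hub
    from huggingface_hub import snapshot_download
    return snapshot_download(**ckpt_download_args(local_dir))
```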

The final structure of the ckpts directory should be:

```
# OmniCustom/ckpts
ckpts/
├── InsightFace/
├── LivePortrait/
├── MMAudio/
├── naturalspeech3_facodec/
├── Ovi/
├── step-92000.safetensors
└── Wan2.2-TI2V-5B/
```
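A small checker can confirm the ckpts directory matches the tree above before launching inference. This helper is not part of the repo; it assumes you run it from the repo root.

```python
# Verify that ckpts/ contains every entry shown in the tree above.
from pathlib import Path

EXPECTED = [
    "InsightFace", "LivePortrait", "MMAudio",
    "naturalspeech3_facodec", "Ovi",
    "step-92000.safetensors", "Wan2.2-TI2V-5B",
]


def missing_ckpts(root="ckpts"):
    """Return the expected entries that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]


if __name__ == "__main__":
    gaps = missing_ckpts()
    print("ckpts OK" if not gaps else f"missing: {gaps}")
```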

⚙️ Configure OmniCustom

The configuration file OmniCustom/configs/inference/inference_fusion.yaml can be modified. The following parameters control generation quality, video resolution, and how the text, image, and audio inputs are balanced:

```yaml
ckpt_name: Ovi/model.safetensors           # base model
lora_path: ./ckpts/step-92000.safetensors  # the OmniCustom checkpoint
self_lora: true
# face embedder
face_embedder_ckpt_dir: ./ckpts/InsightFace
face_ip_emb_dim: 512
# audio embedder
audio_embedder_ckpt_dir: ./ckpts/naturalspeech3_facodec
audio_ip_emb_dim: 256
# output
output_dir: ./outputs/
sample_steps: 50      # number of denoising steps; lower (30-40) = faster generation
solver_name: unipc    # sampling algorithm for the denoising process
shift: 5.0            # timestep shift factor for the sampling scheduler
sp_size: 1
audio_guidance_scale: 3.0
video_guidance_scale: 4.0
mode: "id2v"          # one of ["id2v", "t2v", "i2v", "t2i2v"]; every mode also generates audio
fp8: False            # load the fp8 version of the model; expect quality degradation without a speed-up
cpu_offload: False
seed: 102             # random seed for reproducible results
crop_face: true       # crop the face region from the reference image
video_negative_prompt: "jitter, bad hands, blur, distortion, two people, two persons, aerial view, overexposed, low quality, deformation, a poor composition, bad teeth, bad eyes, bad limbs, blurring, text, subtitles, static, picture, black border"
audio_negative_prompt: "robotic, muffled, echo, distorted"  # avoid artifacts in the audio
video_frame_height_width: [576, 992]  # only used if mode = t2v or t2i2v; recommended values: [512, 992], [992, 512], [960, 512], [512, 960], [720, 720], [448, 1120]
text_prompt: ./example_prompts/benchmark_example.csv  # group generation
slg_layer: 11
each_example_n_times: 1
```
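Before a long run it can help to catch obviously invalid config edits early. The sketch below is not part of the repo; it assumes PyYAML (pulled in via requirements.txt) for parsing, and it only checks the fields whose legal values the comments above spell out.

```python
# Minimal validation sketch for inference_fusion.yaml.
VALID_MODES = {"id2v", "t2v", "i2v", "t2i2v"}  # from the `mode` comment


def validate_inference_config(cfg: dict) -> dict:
    """Raise ValueError on clearly invalid fields, else return cfg."""
    if cfg.get("mode") not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    steps = cfg.get("sample_steps", 50)
    if not (isinstance(steps, int) and steps > 0):
        raise ValueError("sample_steps must be a positive integer")
    return cfg


def load_inference_config(path: str) -> dict:
    import yaml  # deferred so validation stays importable without PyYAML
    with open(path) as f:
        return validate_inference_config(yaml.safe_load(f))
```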

🔑 Inference

Single GPU:

```shell
bash ./inference.sh
```

Or run:

```shell
CUDA_VISIBLE_DEVICES=0 python infer.py --config-file ./configs/inference/inference_fusion.yaml
```

💡 Note:

  • text_prompt in configs/inference/inference_fusion.yaml selects the examples for sync audio-video customization. It accepts a CSV file with the columns text_prompt, ip_image_path, and ip_audio_path.
  • Results without any customization and results with only identity customization are also saved to the output folder.
  • If a generated video is unsatisfactory, the most straightforward fix is to change the seed in configs/inference/inference_fusion.yaml.
  • Peak VRAM usage is 80 GB on a single GPU.
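A prompt CSV with the three columns mentioned above can be written with the standard library. This helper is illustrative, not part of the repo, and the file paths in the usage example are placeholders.

```python
# Build a CSV that the `text_prompt` config field accepts
# (columns: text_prompt, ip_image_path, ip_audio_path).
import csv


def write_prompt_csv(path, rows):
    """rows: iterable of (text_prompt, ip_image_path, ip_audio_path)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text_prompt", "ip_image_path", "ip_audio_path"])
        writer.writerows(rows)
```

Usage, with placeholder paths: `write_prompt_csv("my_prompts.csv", [("<S>Hello there.<E>", "assets/images/ref.png", "assets/audio/ref.wav")])`, then point `text_prompt` at my_prompts.csv.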

More Results
<table width="100%" border="1" cellpadding="20" cellspacing="0" align="center" style="border-collapse: collapse; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;">
<thead>
<tr bgcolor="#f5f5f5">
<th style="width:18%; padding:16px; border:1px solid #ddd; font-size:14px;">Reference Images</th>
<th style="width:18%; padding:16px; border:1px solid #ddd; font-size:14px;">Reference Audios</th>
<th style="width:24%; padding:16px; border:1px solid #ddd; font-size:14px;">Text prompts</th>
<th style="width:40%; padding:16px; border:1px solid #ddd; font-size:14px;">Generated Videos</th>
</tr>
</thead>
<tbody>
<tr> <!-- Row 1 -->
<td align="center" style="vertical-align:middle; padding:16px; border:1px solid #ddd; min-height:300px;">
<img src="https://raw.githubusercontent.com/OmniCustom-project/OmniCustom/main/assets/images/ref_68.png" alt="Reference Image 1" style="max-height:200px; max-width:100%; object-fit: contain;">
</td>
<td align="center" style="vertical-align:middle; padding:16px; border:1px solid #ddd; min-height:300px;">
<!-- Audio: shown as a play button; opens in a new window -->
<a href="https://github.com/user-attachments/assets/281b285a-80bc-482e-a62b-6f1cce389c6a">
<img src="https://img.shields.io/badge/🔊-Play%20Audio%201-1e88e5?style=for-the-badge" alt="Play Audio 1">
</a>
</td>
<td style="line-height:1.6; vertical-align:middle; padding:16px; border:1px solid #ddd; font-size:13px; min-height:300px;">
<div style="max-height:280px; overflow-y: auto; padding-right: 8px;"> A man stands at the podium in OpenAI's luxurious conference room; behind him a massive electronic screen displays the company's glowing profit data. He grips the microphone firmly, gazes across the audience below, and announces in a steady tone: &lt;S&gt;The board wants to sell OpenAI to Zuckerberg, which is unacceptable.&lt;E&gt; </div>
</td>
<td align="center" style="vertical-align:middle; padding:16px; border:1px solid #ddd; min-height:300px;">
<!-- Video: shown as a watch button; opens in a new window -->
<a href="https://github.com/user-attachments/assets/5472ec3d-57bc-45b3-a921-7913d0bd8bb7">
<img src="https://img.shields.io/badge/▶️-Watch%20Video%201-764ba2?style=for-the-badge" alt="Watch Video 1">
</a>
</td>
</tr>
<tr> <!-- Row 2 -->
<td align="center" style="vertical-align:middle; padding:16px; border:1px solid #ddd; min-height:300px;">
<img src="https://raw.githubusercontent.com/OmniCustom-project/OmniCustom/main/assets/images/ref_69.png" alt="Reference Image 2" style="max-height:200px; max-width:100%; object-fit: contain;">
</td>
<td align="center" style="vertical-align:middle; padding:16px; border:1px solid #ddd; min-height:300px;">
<a href="https://github.com/user-attachments/assets/a6294f38-ba3a-449c-b8e3-8fce181124fe">
<img src="https://img.shields.io/badge/🔊-Play%20Audio%202-1e88e5?style=for-the-badge" alt="Play Audio 2">
</a>
</td>
<td style="line-height:1.6; vertical-align:middle; padding:16px; border:1px solid #ddd; font-size:13px; min-height:300px;">
<div style="max-height:280px; overflow-y: auto; padding-right: 8px;"> A woman stands before the iconic Rockefeller Center Christmas Tree, its thousands of lights reflecting in her eyes as snow begins to fall gently around her. Wearing a tartan scarf and holding a cup of steaming cocoa, she brings her mittened hands together and speaks softly into the frosty air: &lt;S&gt;May the spirit of Christmas fill your heart throughout the coming year.&lt;E&gt; </div>
</td>
<td align="center" style="vertical-align:middle; padding:16px; border:1px solid #ddd; min-height:300px;"></td>
</tr>
</tbody>
</table>
