MuseV English 中文

<font size=5>MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising <br/> Zhiqiang Xia <sup>*</sup>, Zhaokang Chen<sup>*</sup>, Bin Wu<sup>†</sup>, Chao Li, Kwok-Wai Hung, Chao Zhan, Yingjie He, Wenjiang Zhou (<sup>*</sup>co-first author, <sup>†</sup>corresponding author, benbinwu@tencent.com) </font>

Lyra Lab, Tencent Music Entertainment

github huggingface HuggingfaceSpace project Technical report (coming soon)

We have pursued the world-simulator vision since March 2023, believing that diffusion models can simulate the world. MuseV was a milestone achieved around July 2023. Amazed by the progress of Sora, we decided to open-source MuseV in the hope that it will benefit the community. Next, we will move on to the promising diffusion+transformer scheme.

Update:

We have released <a href="https://github.com/TMElyralab/MuseTalk" style="font-size:24px; color:red;">MuseTalk</a>, a real-time, high-quality lip-sync model that can be combined with MuseV to form a complete virtual human generation solution.
:new: We are thrilled to announce that MusePose has been released. MusePose is an image-to-video generation framework for virtual humans driven by control signals such as pose. Together with MuseV and MuseTalk, we hope the community will join us in working towards the vision of generating a virtual human end to end, with native full-body movement and interaction.

Overview

MuseV is a diffusion-based virtual human video generation framework, which

supports infinite-length generation using a novel Visual Conditioned Parallel Denoising scheme.
provides a checkpoint for virtual human video generation, trained on a human dataset.
supports Image2Video, Text2Image2Video, and Video2Video.
is compatible with the Stable Diffusion ecosystem, including base_model, lora, controlnet, etc.
supports multi-reference-image technology, including IPAdapter, ReferenceOnly, ReferenceNet, and IPAdapterFaceID.
will provide training code (coming very soon).

Important bug fixes

musev_referencenet_pose: the model_name given for unet and ip_adapter in the command was incorrect; please use musev_referencenet_pose instead of musev_referencenet.

News

[03/27/2024] Released the MuseV project and the trained models musev and muse_referencenet.
[03/30/2024] Added a Hugging Face Space with a Gradio GUI for generating videos.

Model

Overview of model structure

model_structure

Parallel denoising

parallel_denoise

Cases

All frames were generated directly by the text2video model, without any post-processing. More cases, including 1-2 minute videos, are available on the project page.

The examples below can be found in configs/tasks/example.yaml.

Text/Image2Video

Human

<table class="center"> <tr style="font-weight: bolder;text-align:center;"> <td width="50%">image</td> <td width="45%">video </td> <td width="5%">prompt</td> </tr> <tr> <td> <img src=./data/images/yongen.jpeg width="400"> </td> <td > <video src="https://github.com/TMElyralab/MuseV/assets/163980830/732cf1fd-25e7-494e-b462-969c9425d277" width="100" controls preload></video> </td> <td>(masterpiece, best quality, highres:1),(1boy, solo:1),(eye blinks:1.8),(head wave:1.3) </td> </tr> <tr> <td> <img src=./data/images/seaside4.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/9b75a46c-f4e6-45ef-ad02-05729f091c8f" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1), peaceful beautiful sea scene </td> </tr> <tr> <td> <img src=./data/images/seaside_girl.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/d0f3b401-09bf-4018-81c3-569ec24a4de9" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1), peaceful beautiful sea scene </td> </tr>  <tr> <td> <img src=./data/images/boy_play_guitar.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/61bf955e-7161-44c8-a498-8811c4f4eb4f" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1), playing guitar </td> </tr> <tr> <td> <img src=./data/images/girl_play_guitar2.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/40982aa7-9f6a-4e44-8ef6-3f185d284e6a" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1), playing guitar </td> </tr>  <tr> <td> <img src=./data/images/dufu.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/28294baa-b996-420f-b1fb-046542adf87d" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1),(1man, solo:1),(eye blinks:1.8),(head 
wave:1.3),Chinese ink painting style </td> </tr> <tr> <td> <img src=./data/images/Mona_Lisa.jpg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/1ce11da6-14c6-4dcd-b7f9-7a5f060d71fb" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face, soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3) </td> </tr> </table >
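The prompts in the table above use the `(text:weight)` grouping that is common in the Stable Diffusion ecosystem, and the last row leaves `{eye_blinks_factor}` as a placeholder. As a minimal sketch, and not MuseV's own code, the snippet below fills that placeholder with Python's `str.format` and splits a prompt into (text, weight) pairs. The helper names `fill_prompt` and `parse_weights` are hypothetical and exist only for illustration.

```python
import re

# Prompt template copied from the Mona Lisa row above.
TEMPLATE = (
    "(masterpiece, best quality, highres:1),(1girl, solo:1),"
    "(beautiful face, soft skin, costume:1),"
    "(eye blinks:{eye_blinks_factor}),(head wave:1.3)"
)

# Matches one "(text:weight)" group; nested parentheses are not handled.
_GROUP = re.compile(r"\(([^():]+):([0-9.]+)\)")


def fill_prompt(template: str, eye_blinks_factor: float) -> str:
    """Hypothetical helper: substitute the eye-blink weight into the template."""
    return template.format(eye_blinks_factor=eye_blinks_factor)


def parse_weights(prompt: str) -> list[tuple[str, float]]:
    """Hypothetical helper: return (text, weight) for each "(text:weight)" group."""
    return [(text.strip(), float(weight)) for text, weight in _GROUP.findall(prompt)]


if __name__ == "__main__":
    prompt = fill_prompt(TEMPLATE, 1.8)
    for text, weight in parse_weights(prompt):
        print(f"{weight:>4} | {text}")
```

Raising or lowering the value passed as `eye_blinks_factor` changes only the weight of the `eye blinks` group; how strongly that weight affects the generated video depends on the model and is not shown here.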

Scene

<table class="center"> <tr style="font-weight: bolder;text-align:center;"> <td width="35%">image</td> <td width="50%">video</td> <td width="15%">prompt</td> </tr> <tr> <td> <img src=./data/images/waterfall4.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/852daeb6-6b58-4931-81f9-0dddfa1b4ea5" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1), peaceful beautiful waterfall, an endless waterfall </td> </tr> <tr> <td> <img src=./data/images/seaside2.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/4a4d527a-6203-411f-afe9-31c992d26816" width="100" controls preload></video> </td> <td>(masterpiece, best quality, highres:1), peaceful beautiful sea scene </td> </tr> </table >

VideoMiddle2Video

pose2video

In duffy mode, the pose of the vision condition frame is not aligned with the first frame of the control video. posealign will solve this problem.

<table class="center"> <tr style="font-weight: bolder;text-align:center;"> <td width="25%">image</td> <td width="65%">video</td> <td width="10%">prompt</td> </tr> <tr> <td> <img src=./data/images/spark_girl.png width="200"> <img src=./data/images/cyber_girl.png width="200"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/484cc69d-c316-4464-a55b-3df929780a8e" width="400" controls preload></video> </td> <td> (masterpiece, best quality, highres:1) , a girl is dancing, animation </td> </tr> <tr> <td> <img src=./data/images/duffy.png width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/c44682e6-aafc-4730-8fc1-72825c1bacf2" width="400" controls preload></video> </td> <td> (masterpiece, best quality, highres:1), is dancing, animation </td> </tr> </table >

MuseTalk

The character in the talk video, Sun Xinying, is a supermodel KOL. You can follow her on Douyin.

<table class="center"> <tr style="font-weight: bolder;"> <td width="35%">name</td> <td width="50%">video</td> </tr> <tr> <td> talk </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/951188d1-4731-4e7f-bf40-03cacba17f2f" width="100" controls preload></video> </td> </tr> <tr> <td> sing </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/50b8ffab-9307-4836-99e5-947e6ce7d112" width="100" controls preload></video> </td> </tr> </table >

TODO:

[ ] technical report (coming soon).
[ ] training code.
[ ] release a pretrained unet model trained with controlnet, referencenet, and IPAdapter, which performs better on pose2video.
[ ] support the diffusion transformer generation framework.
[ ] release the posealign module.

Quickstart

Prepare the Python environment and install extra packages such as diffusers, controlnet_aux, and mmcm.
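As a quick sanity check after setting up the environment, the following sketch reports which of the extra packages named above cannot be imported. It is not part of MuseV; the helper name `missing_packages` is hypothetical, and it assumes the import names match the package names `diffusers`, `controlnet_aux`, and `mmcm`.

```python
import importlib.util

# Package names taken from the sentence above; the import names are assumed
# to be identical to the package names.
REQUIRED = ["diffusers", "controlnet_aux", "mmcm"]


def missing_packages(names: list[str]) -> list[str]:
    """Hypothetical helper: return the top-level names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All extra packages are importable.")
```

This only checks that each package can be located; it does not verify versions or that the package works with the installed CUDA and PyTorch builds.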

Third party integration

Thanks to the third-party integrations, which make installation and use more convenient for everyone. Please note that we have not verified, maintained, or updated these third-party integrations. Please refer to this project for specific results.

ComfyUI

One-click integration package for Windows

netdisk: https://www.123pan.com/s/Pf5Yjv-Bb9W3.html

code: glut

Prepare environment

We recommend using Docker as the primary way to prepare the Python environment.

prepare python env

Attention: we have only tested with Docker; there may be trouble with

MuseV

Install / Use

README