MuseV
MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising
MuseV English 中文
<font size=5>MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising <br> Zhiqiang Xia <sup>*</sup>, Zhaokang Chen<sup>*</sup>, Bin Wu<sup>†</sup>, Chao Li, Kwok-Wai Hung, Chao Zhan, Yingjie He, Wenjiang Zhou (<sup>*</sup>co-first author, <sup>†</sup>Corresponding Author, benbinwu@tencent.com) </font>
Lyra Lab, Tencent Music Entertainment
github huggingface HuggingfaceSpace project Technical report (coming soon)
We have held the world-simulator vision since March 2023, believing that diffusion models can simulate the world. MuseV was a milestone reached around July 2023. Amazed by the progress of Sora, we decided to open-source MuseV in the hope that it will benefit the community. Next, we will move on to the promising diffusion+transformer scheme.
Update:
- We have released <a href="https://github.com/TMElyralab/MuseTalk" style="font-size:24px; color:red;">MuseTalk</a>, a real-time high quality lip sync model, which can be applied with MuseV as a complete virtual human generation solution.
- :new: We are thrilled to announce that MusePose has been released. MusePose is an image-to-video generation framework for virtual humans driven by control signals such as pose. Together with MuseV and MuseTalk, we hope the community will join us in marching toward the vision of generating a virtual human end-to-end, with native full-body movement and interaction.
Overview
MuseV is a diffusion-based virtual human video generation framework, which
- supports infinite-length generation using a novel Visual Conditioned Parallel Denoising scheme.
- provides a checkpoint for virtual human video generation, trained on a human dataset.
- supports Image2Video, Text2Image2Video, and Video2Video.
- is compatible with the Stable Diffusion ecosystem, including `base_model`, `lora`, `controlnet`, etc.
- supports multi-reference-image technology, including `IPAdapter`, `ReferenceOnly`, `ReferenceNet`, and `IPAdapterFaceID`.
- will include training code (coming very soon).
Important bug fixes
`musev_referencenet_pose`: the `model_name` of `unet` and `ip_adapter` in the command is not correct; please use `musev_referencenet_pose` instead of `musev_referencenet`.
News
- [03/27/2024] release the `MuseV` project and trained models `musev`, `muse_referencenet`.
- [03/30/2024] add a Hugging Face Space Gradio demo to generate video in a GUI.
Model
Overview of model structure

Parallel denoising

Cases
All frames were generated directly by the text2video model, without any post-processing. More cases, including 1-2 minute videos, are available in the project.
<!-- # TODO: // use youtu video link? -->
The examples below can be accessed at `configs/tasks/example.yaml`.
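The prompts in the example tables appear to follow the `(text:weight)` emphasis convention widely used in Stable Diffusion tools, and one of them contains a `{eye_blinks_factor}` placeholder. How MuseV itself parses or fills these prompts is not stated here, so the following is only a minimal sketch of that convention. The helper `parse_weighted_prompt` is hypothetical and not part of MuseV, and the value `1.8` is chosen only because other example prompts use it.

```python
import re

# Hypothetical helper, NOT part of MuseV: illustrates the "(text:weight)" convention only.
_GROUP = re.compile(r"\(([^():]+):(\d+(?:\.\d+)?)\)")


def parse_weighted_prompt(prompt, default_weight=1.0):
    """Split a prompt into (text, weight) pairs.

    Text inside "(text:weight)" gets the stated weight; bare text between
    groups gets `default_weight`.
    """
    parts = []
    pos = 0
    for match in _GROUP.finditer(prompt):
        bare = prompt[pos:match.start()].strip(" ,")
        if bare:
            parts.append((bare, default_weight))
        parts.append((match.group(1).strip(), float(match.group(2))))
        pos = match.end()
    tail = prompt[pos:].strip(" ,")
    if tail:
        parts.append((tail, default_weight))
    return parts


# Fill the placeholder seen in one example prompt (1.8 is an assumed value).
template = "(masterpiece, best quality, highres:1),(1girl, solo:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3)"
prompt = template.format(eye_blinks_factor=1.8)

for text, weight in parse_weighted_prompt(prompt):
    print(f"{weight:>4}  {text}")
```

Under this reading, a larger weight such as `(eye blinks:1.8)` asks the model to emphasize that phrase more strongly than phrases at weight 1.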
Text/Image2Video
Human
<table class="center"> <tr style="font-weight: bolder;text-align:center;"> <td width="50%">image</td> <td width="45%">video </td> <td width="5%">prompt</td> </tr> <tr> <td> <img src=./data/images/yongen.jpeg width="400"> </td> <td > <video src="https://github.com/TMElyralab/MuseV/assets/163980830/732cf1fd-25e7-494e-b462-969c9425d277" width="100" controls preload></video> </td> <td>(masterpiece, best quality, highres:1),(1boy, solo:1),(eye blinks:1.8),(head wave:1.3) </td> </tr> <tr> <td> <img src=./data/images/seaside4.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/9b75a46c-f4e6-45ef-ad02-05729f091c8f" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1), peaceful beautiful sea scene </td> </tr> <tr> <td> <img src=./data/images/seaside_girl.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/d0f3b401-09bf-4018-81c3-569ec24a4de9" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1), peaceful beautiful sea scene </td> </tr> <!-- guitar --> <tr> <td> <img src=./data/images/boy_play_guitar.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/61bf955e-7161-44c8-a498-8811c4f4eb4f" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1), playing guitar </td> </tr> <tr> <td> <img src=./data/images/girl_play_guitar2.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/40982aa7-9f6a-4e44-8ef6-3f185d284e6a" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1), playing guitar </td> </tr> <!-- famous people --> <tr> <td> <img src=./data/images/dufu.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/28294baa-b996-420f-b1fb-046542adf87d" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1),(1man, 
solo:1),(eye blinks:1.8),(head wave:1.3),Chinese ink painting style </td> </tr> <tr> <td> <img src=./data/images/Mona_Lisa.jpg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/1ce11da6-14c6-4dcd-b7f9-7a5f060d71fb" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face, soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3) </td> </tr> </table >
Scene
<table class="center"> <tr style="font-weight: bolder;text-align:center;"> <td width="35%">image</td> <td width="50%">video</td> <td width="15%">prompt</td> </tr> <tr> <td> <img src=./data/images/waterfall4.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/852daeb6-6b58-4931-81f9-0dddfa1b4ea5" width="100" controls preload></video> </td> <td> (masterpiece, best quality, highres:1), peaceful beautiful waterfall, an endless waterfall </td> </tr> <tr> <td> <img src=./data/images/seaside2.jpeg width="400"> </td> <td> <video src="https://github.com/TMElyralab/MuseV/assets/163980830/4a4d527a-6203-411f-afe9-31c992d26816" width="100" controls preload></video> </td> <td>(masterpiece, best quality, highres:1), peaceful beautiful sea scene </td> </tr> </table >
VideoMiddle2Video
pose2video
In `duffy` mode, the pose of the vision condition frame is not aligned with the first frame of the control video. `posealign` will solve the problem.
MuseTalk
The talking character, Sun Xinying, is a supermodel KOL. You can follow her on Douyin.
TODO:
- [ ] technical report (coming soon).
- [ ] training code.
- [ ] release a pretrained unet model trained with controlnet, referencenet, and IPAdapter, which performs better on pose2video.
- [ ] support diffusion transformer generation framework.
- [ ] release `posealign` module
Quickstart
Prepare the Python environment and install extra packages such as `diffusers`, `controlnet_aux`, and `mmcm`.
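Before running anything, it can help to confirm that the extra packages named above are importable. The sketch below uses only the standard library. It assumes that each package's import name matches the name given above, which has not been verified here, and the helper `missing_packages` is illustrative rather than part of MuseV.

```python
import importlib.util

# Names taken from the Quickstart text; import names are ASSUMED to match.
REQUIRED = ["diffusers", "controlnet_aux", "mmcm"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages were found.")
```

`importlib.util.find_spec` only locates a top-level package without importing it, so the check is fast and does not trigger heavy framework initialization.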
Third party integration
Thanks to the third-party integrations, installation and use are more convenient for everyone. Please note that we have not verified, maintained, or updated these third-party projects. Refer to each project for specific results.
ComfyUI
One-click integration package for Windows
netdisk: https://www.123pan.com/s/Pf5Yjv-Bb9W3.html
code: glut
Prepare environment
We recommend using Docker as the primary way to prepare the Python environment.
prepare python env
Attention: we have only tested with Docker; there may be trouble with
