Genesis
[NeurIPS 2025] Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency
Xiangyu Guo<sup>1*</sup>, Zhanqian Wu<sup>2*</sup>, Kaixin Xiong<sup>2*</sup>, Ziyang Xu<sup>1</sup>, Lijun Zhou<sup>2</sup>, Gangwei Xu<sup>1</sup>, Shaoqing Xu<sup>2</sup>, Haiyang Sun<sup>2†</sup>, Bing Wang<sup>2</sup>, Guang Chen<sup>2</sup>, Hangjun Ye<sup>2</sup>, Wenyu Liu<sup>1</sup>, Xinggang Wang<sup>1,✉</sup>
<sup>1</sup> Huazhong University of Science and Technology <sup>2</sup> Xiaomi EV
(*) Equal contribution. (†) Project leader. (✉) Corresponding author.
<a href="https://arxiv.org/abs/2506.07497"><img src='https://img.shields.io/badge/arXiv-Genesis-red' alt='Paper PDF'></a> <a href="https://xiaomi-research.github.io/genesis/"><img src='https://img.shields.io/badge/Project_Page-Genesis-green' alt='Project Page'></a>
</div>
<!-- ## Introduction -->
Abstract
We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.
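The abstract reports a Chamfer distance for the generated LiDAR sequences. As a rough illustration of what that metric measures, here is a minimal sketch of one common symmetric convention (mean nearest-neighbour Euclidean distance, summed over both directions). The README does not say which variant or evaluation protocol the paper uses, so the function name `chamfer_distance`, the convention, and the toy point sets are all assumptions made for illustration, not the authors' evaluation code.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumed convention (not confirmed by the README): mean nearest-neighbour
    Euclidean distance, summed over both directions. Other variants use
    squared distances or average the two directions instead.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        # Average distance from each point in src to its nearest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy point sets, purely illustrative (not nuScenes data).
generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(chamfer_distance(generated, reference))  # prints 1.0
```

This brute-force version costs O(N·M) distance evaluations, which is fine for a toy example; real LiDAR sweeps would normally call for a spatial index or a GPU implementation.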
Overview
<div align="center"> <img src="assets/images/framework.png" width="1000"> </div>
News
[2025/09/18] Genesis has been accepted to NeurIPS 2025 🎉🎉🎉!
[2025/06/18] ArXiv paper release. Models/Code are coming soon. Please stay tuned! ☕️
Updates
- [x] Release Paper
- [ ] Release Full Models
- [ ] Release Inference Framework
- [ ] Release Training Framework
Citation
If you find Genesis useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
@article{guo2025genesis,
title={Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency},
author={Guo, Xiangyu and Wu, Zhanqian and Xiong, Kaixin and Xu, Ziyang and Zhou, Lijun and Xu, Gangwei and Xu, Shaoqing and Sun, Haiyang and Wang, Bing and Chen, Guang and others},
journal={arXiv preprint arXiv:2506.07497},
year={2025}
}
