Genesis

[NeurIPS 2025] Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency


<div align="center"> <h3>[🎉 NeurIPS 2025!] Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency</h3>

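Xiangyu Guo<sup>1*</sup>, Zhanqian Wu<sup>2*</sup>, Kaixin Xiong<sup>2*</sup>, Ziyang Xu<sup>1</sup>, Lijun Zhou<sup>2</sup>, Gangwei Xu<sup>1</sup>, Shaoqing Xu<sup>2</sup>, Haiyang Sun<sup>2†</sup>, Bing Wang<sup>2</sup>, Guang Chen<sup>2</sup>, Hangjun Ye<sup>2</sup>, Wenyu Liu<sup>1</sup>, Xinggang Wang<sup>1,✉</sup>

<sup>1</sup> Huazhong University of Science and Technology, <sup>2</sup> Xiaomi EV

(*) Equal contribution. (†) Project leader. (✉) Corresponding author.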

</div>

Abstract

We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.
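The Chamfer figure quoted in the abstract measures how closely a generated LiDAR point cloud matches a reference one. As a minimal sketch, assuming one common definition (the mean nearest-neighbour Euclidean distance in each direction, summed), it could be computed as below. The function name `chamfer_distance` and the brute-force search are illustrative assumptions; the exact variant used in the paper (squared distances, averaging, range cropping) is not stated in this README and may differ.

```python
import math
from typing import Sequence

Point = Sequence[float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two point clouds.

    Hypothetical helper for illustration only: one common definition,
    not necessarily the variant evaluated in the Genesis paper.
    """
    if not a or not b:
        raise ValueError("both point clouds must be non-empty")

    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Mean distance from each source point to its nearest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    # Sum both directions so the measure is symmetric in a and b.
    return one_way(a, b) + one_way(b, a)


if __name__ == "__main__":
    generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0)]
    print(chamfer_distance(generated, reference))
```

The brute-force search is quadratic in the number of points, so a real evaluation on full LiDAR sweeps would likely use a spatial index such as a k-d tree instead.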

Overview

News

[2025/09/18] Genesis has been accepted to NeurIPS 2025 🎉🎉🎉!

[2025/06/18] arXiv paper released. Models and code are coming soon. Please stay tuned! ☕️

Updates

- [x] Release Paper
- [ ] Release Full Models
- [ ] Release Inference Framework
- [ ] Release Training Framework

Citation

If you find Genesis useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

@article{guo2025genesis,
  title={Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency},
  author={Guo, Xiangyu and Wu, Zhanqian and Xiong, Kaixin and Xu, Ziyang and Zhou, Lijun and Xu, Gangwei and Xu, Shaoqing and Sun, Haiyang and Wang, Bing and Chen, Guang and others},
  journal={arXiv preprint arXiv:2506.07497},
  year={2025}
}
