
DriveLM

[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering

[!IMPORTANT] 🌟 Stay up to date at opendrivelab.com!

<div id="top" align="center"> <p align="center"> <img src="assets/images/repo/title_v2.jpg"> </p>

DriveLM: Driving with Graph Visual Question Answering

<!-- Download dataset [**HERE**](docs/data_prep_nus.md) (serves as Official source for `Autonomous Driving Challenge 2024`) -->

Autonomous Driving Challenge 2024 Driving-with-Language Leaderboard.

</div> <div id="top" align="center">

License: Apache2.0 arXiv Hugging Face

<!-- <a href="https://opendrivelab.github.io/DriveLM" target="_blank"> <img alt="Github Page" src="https://img.shields.io/badge/Project%20Page-white?logo=GitHub&color=green" /> </a> --> <!-- [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DriveLM-ffc107?color=ffc107&logoColor=white)](https://huggingface.co/datasets/OpenDrive/DriveLM) --> </div> <!-- > https://github.com/OpenDriveLab/DriveLM/assets/103363891/67495435-4a32-4614-8d83-71b5c8b66443 --> <!-- > above is old demo video. demo scene token: cc8c0bf57f984915a77078b10eb33198 -->

https://github.com/OpenDriveLab/DriveLM/assets/54334254/cddea8d6-9f6e-4e7e-b926-5afb59f8dce2

<!-- > above is new demo video. demo scene token: cc8c0bf57f984915a77078b10eb33198 -->

Highlights <a name="highlight"></a>

🔥 We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving.

<!-- 🔥 **The key insight** is that with our proposed suite, we obtain a suitable proxy task to mimic the human reasoning process during driving. -->

🏁 DriveLM serves as a main track in the CVPR 2024 Autonomous Driving Challenge. Everything you need for the challenge is HERE, including the baseline, test data, submission format, and evaluation pipeline!

<p align="center"> <img src="assets/images/repo/drivelm_teaser.jpg"> </p> <!-- ### Highlights of the DriveLM-Data --> <!-- #### In the view of full-stack autonomous driving - 🛣 Completeness in functionality (covering **Perception**, **Prediction**, and **Planning** QA pairs). <p align="center"> <img src="assets/images/repo/point_1.png"> </p> --> <!-- - 🔜 Reasoning for future events that have not yet happened. - Many **"What If"**-style questions: imagine the future by language. <p align="center"> <img src="assets/images/repo/point_2.png" width=70%> </p> - ♻ Task-driven decomposition. - **One** scene-level description into **many** frame-level trajectories & planning QA pairs. <p align="center"> <img src="assets/images/repo/point_3.png"> </p> --> <!-- ### Highlights of the DriveLM-Agent --> <!-- #### In the view of the general Vision Language Models --> <!-- 🕸️ Multi-modal **Graph Visual Question Answering** (GVQA) benchmark for structured reasoning in the general Vision Language Models. https://github.com/OpenDriveLab/DriveLM-new/assets/75412366/78c32442-73c8-4f1d-ab69-34c15e7060af --> <!-- > above is graph VQA demo video. -->

News <a name="news"></a>

  • [2025/01/08] Drive-Bench released! An in-depth analysis of what DriveLM is really benchmarking. Take a look at arXiv.
  • [2024/07/16] The official DriveLM leaderboard has reopened!
  • [2024/07/01] DriveLM was accepted to ECCV 2024! Congrats to the team!
  • [2024/06/01] The challenge has ended! See the final leaderboard.
  • [2024/03/25] The challenge test server is online and the test questions are released. Check it out!
  • [2024/02/29] Challenge repo released, with the baseline, data and submission format, and evaluation pipeline. Have a look!
  • [2023/12/22] DriveLM-nuScenes full v1.0 and paper released.
  • [2023/08/25] DriveLM-nuScenes demo released.
<!-- > - **`[Early 2024]`** DriveLM-Agent inference code. --> <!-- > - **`Note:`** We plan to release a simple, flexible training code that supports multi-view inputs as a starter kit for the AD challenge (stay tuned for details). -->

Table of Contents

  1. Highlights
  2. Getting Started
  3. Current Endeavors and Future Directions
  4. TODO List
  5. DriveLM-Data
  6. License and Citation
  7. Other Resources
<!-- - [News](#news) - [DriveLM-Data](#drivelm-data) - [Getting Started](#getting-started) - [License and Citation](#license-and-citation) - [Other Resources](#other-resources) -->

Getting Started <a name="gettingstarted"></a>

To get started with DriveLM:

<p align="right">(<a href="#top">back to top</a>)</p>

Current Endeavors and Future Directions <a name="timeline"></a>

  • The advent of GPT-style multimodal models in real-world applications motivates the study of the role of language in driving.
  • The dates below reflect the arXiv submission dates.
  • If there is any missing work, please reach out to us!
<p align="center"> <img src="assets/images/repo/drivelm_timeline_v3.jpg"> </p>

DriveLM attempts to address some of the challenges faced by the community.

  • Lack of data: DriveLM-Data serves as a comprehensive benchmark for driving with language.
  • Embodiment: GVQA provides a potential direction for embodied applications of LLMs / VLMs.
  • Closed-loop: DriveLM-CARLA attempts to explore closed-loop planning with language.
<p align="right">(<a href="#top">back to top</a>)</p>

TODO List <a name="newsandtodolist"></a>

  • [x] DriveLM-Data
    • [x] DriveLM-nuScenes
    • [x] DriveLM-CARLA
  • [x] DriveLM-Metrics
    • [x] GPT-score
  • [ ] DriveLM-Agent
    • [x] Inference code on DriveLM-nuScenes
    • [ ] Inference code on DriveLM-CARLA
<p align="right">(<a href="#top">back to top</a>)</p>

DriveLM-Data <a name="drivelmdata"></a>

We facilitate the Perception, Prediction, Planning, Behavior, and Motion tasks, with human-written reasoning logic as the connection between them. We propose the task of Graph Visual Question Answering (GVQA) on DriveLM-Data.
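To make the graph idea concrete, here is a minimal sketch of QA pairs as nodes and logical dependencies as edges, answered in dependency order. Everything in it is an assumption for illustration: the `QANode` schema, the `answering_order` and `build_context` helpers, and the sample questions are hypothetical and do not reflect the actual DriveLM-Data file format or the DriveLM-Agent implementation.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class QANode:
    """One QA pair tagged with its driving stage (hypothetical schema)."""
    qa_id: str
    stage: str  # e.g. "perception", "prediction", "planning"
    question: str
    answer: str = ""
    parents: list[str] = field(default_factory=list)  # ids of QAs this one depends on


def answering_order(nodes: dict[str, QANode]) -> list[str]:
    """Return QA ids so that every QA comes after the QAs it depends on."""
    sorter = TopologicalSorter({n.qa_id: n.parents for n in nodes.values()})
    return list(sorter.static_order())


def build_context(nodes: dict[str, QANode], qa_id: str) -> list[str]:
    """Collect the parent QA pairs as textual context for one question."""
    return [
        f"Q: {nodes[p].question} A: {nodes[p].answer}"
        for p in nodes[qa_id].parents
    ]


# Made-up sample graph: perception -> prediction -> planning.
nodes = {
    "plan1": QANode("plan1", "planning",
                    "What should the ego vehicle do?",
                    "Slow down and yield.", parents=["pred1"]),
    "perc1": QANode("perc1", "perception",
                    "What objects are in front of the ego vehicle?",
                    "A pedestrian near the crosswalk."),
    "pred1": QANode("pred1", "prediction",
                    "What will the pedestrian do next?",
                    "Likely cross the road.", parents=["perc1"]),
}

order = answering_order(nodes)
for qa_id in order:
    print(nodes[qa_id].stage, "|", build_context(nodes, qa_id))
```

Ordering the questions topologically means each question can be posed with the answers to its parent questions as context, which is one plausible way to exploit the logical dependencies between QA pairs.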

<!-- DriveLM is an autonomous driving (**AD**) dataset incorporating linguistic information. Through DriveLM, we want to connect large language models and autonomous driving systems, and eventually introduce the reasoning ability of Large Language Models in autonomous driving (**AD**) to make decisions and ensure explainable planning. --> <!-- In DriveLM, we study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems to boost generalization and enable interactivity with human users. Specifically, we aim to facilitate `Perception, Prediction, Planning, Behavior, Motion` tasks with human-written reasoning logic as a connection. We propose the task of GVQA to connect the QA pairs in a graph-style structure. To support this novel task, we provide the DriveLM-Data. ### What is GVQA? The most exciting aspect of the dataset is that the questions and answers (`QA`) are connected in a graph-style structure, with QA pairs as every node and potential logical progression as the edges. The reason for doing this in the AD domain is that AD tasks are well-defined per stage, from raw sensor input to final control action through perception, prediction and planning. Its key difference to prior VQA tasks for AD is the availability of logical dependencies between QAs, which can be used to guide the answering process. -->

📊 Comparison and Stats <a name="comparison"></a>

DriveLM-Data is the first language-driving dataset facilitating the full stack of driving tasks with graph-structured logical dependencies.

<!-- <center> | Language Dataset | Base Dataset | Language Form | Perspectives | Scale | Release?| |:---------:|:-------------:|:-------------:|:------:|:--------------------------------------------:|:----------:| | [BDD-X 2018](https://github.com/JinkyuKimUCB/explainable-deep-driving) | [BDD](https://bdd-data.berkeley.edu/) | Description | Perception & Reasoning | 8M frames, 20k text strings |**:heavy_check_mark:**| | [HAD 2019](https://usa.honda-ri.com/had) | [HDD](https://usa.honda-ri.com/hdd) | Advice | Goal-oriented & stimulus-driven advice | 5,675 video clips, 45k text strings |**:heavy_check_mark:**| | [DRAMA 2022](https://usa.honda-ri.com/drama) | - | Description | Perception & Planning results | 18k frames, 100k text strings | **:heavy_check_mark:**| | [Rank2Tell 2023](https://arxiv.org/abs/2309.06597) | - | Perception & Planning results | QA + Captions | 5k frames | :x: | | [nuScenes-QA 2023](https://arxiv.org/abs/2305.14836)
