LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs

Authors: Yunsheng Ma, Can Cui, Xu Cao, Wenqian Ye, Peiran Liu, Juanwu Lu, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Aniket Bera, James M. Rehg, Ziran Wang

Abstract

Autonomous driving (AD) has made significant strides in recent years. However, existing frameworks struggle to interpret and execute spontaneous user instructions, such as "overtake the car ahead." Large Language Models (LLMs) have demonstrated impressive reasoning capabilities showing potential to bridge this gap. In this paper, we present LaMPilot, a novel framework that integrates LLMs into AD systems, enabling them to follow user instructions by generating code that leverages established functional primitives. We also introduce LaMPilot-Bench, the first benchmark dataset specifically designed to quantitatively evaluate the efficacy of language model programs in AD. Adopting the LaMPilot framework, we conduct extensive experiments to assess the performance of off-the-shelf LLMs on LaMPilot-Bench. Our results demonstrate the potential of LLMs in handling diverse driving scenarios and following user instructions in driving.

🚀 Features

  • Natural Language to Code: Convert high-level driving commands into executable Python code
  • Policy Repository: Automatically stores and retrieves successful driving policies for reuse
  • Human-in-the-Loop Feedback: Incorporates human feedback to iteratively improve generated policies
  • Multiple LLM Support: Compatible with GPT-3.5, GPT-4, CodeLlama, Llama-2, and Code-Bison
  • Flexible Evaluation: Supports various driving tasks including lane changes, overtaking, intersection navigation, and more

🔧 Installation

Prerequisites

  • Python 3.8+
  • OpenAI API key (or access to other supported LLM services)

Setup

  1. Clone the repository:
git clone https://github.com/PurdueDigitalTwin/LaMPilot.git
cd LaMPilot
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up your API key:
export OPENAI_API_KEY=your_api_key_here

🚦 Quick Start

Basic Usage

Run a single task with a configuration file:

# Using the helper script (recommended)
./run_demo.sh --config projects/lampilot/configs/DbLv1/go_straight.json

# Or directly with Python
python projects/lampilot/demo.py --config projects/lampilot/configs/DbLv1/go_straight.json

Additional options for demo:

  • --model-name: Specify LLM model (default: gpt-3.5-turbo)
  • --zero-shot: Use zero-shot mode (default: few-shot)
  • --no-window: Disable visualization window
  • --wait-time: Wait time between simulation steps (default: 1e-3)

Human Feedback Agent

Run the human feedback agent for iterative policy improvement:

# Using the helper script (recommended)
./run_test_hf.sh \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-3.5-turbo \
    --ckpt-dir ckpt/human-fdbk \
    --resume

# Or directly with Python
python projects/lampilot/test_hf.py \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-3.5-turbo \
    --ckpt-dir ckpt/human-fdbk \
    --resume

Additional options:

  • --test-size: Number of test cases to evaluate (default: 98)
  • --use-demo: Use demo dataset instead of full dataset
  • --num-process: Number of parallel processes (default: 1)
  • --few-shot: Enable few-shot learning
  • --record-video: Record simulation videos
  • --shuffle: Shuffle the dataset
  • --random_seed: Random seed for reproducibility (default: 42)

Zero-Shot and Few-Shot Code Generation

Test code generation without policy repository:

# Demo: Run few-shot evaluation on 5 random scenarios with GPT-4
./run_test_icl.sh \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-4 \
    --test-size 5 \
    --few-shot \
    --shuffle \
    --random_seed 123

# Using the helper script (recommended)
./run_test_icl.sh \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-4 \
    --few-shot  # Use --few-shot for few-shot, omit for zero-shot

# Or directly with Python
python projects/lampilot/test_icl.py \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-4 \
    --few-shot  # Use --few-shot for few-shot, omit for zero-shot

Running Full Benchmark

The LaMPilot-Bench (DbLv1) contains 4,900 test cases total. Run the complete benchmark evaluation:

# Zero-shot evaluation (full benchmark: 4,900 items)
python projects/lampilot/test_icl.py \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-3.5-turbo \
    --test-size 4900 \
    --num-process 4

# Few-shot evaluation (full benchmark: 4,900 items)
python projects/lampilot/test_icl.py \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-4 \
    --few-shot \
    --test-size 4900 \
    --num-process 4

Note: You can use a smaller --test-size value (e.g., 98, 500, 1000) for faster evaluation or testing purposes. The script will automatically skip already-evaluated items if you use the --resume flag or run with the same checkpoint directory.

Note: The helper scripts (run_demo.sh, run_test_hf.sh, run_test_icl.sh) automatically handle:

  • Virtual environment activation (if present)
  • PYTHONPATH configuration
  • Proper module imports

🏗️ Architecture

LaMPilot consists of several key components:

1. Code Generation Agent (cg_agent.py)

  • Converts natural language commands into Python code
  • Supports multiple LLM backends (OpenAI)
  • Handles zero-shot and few-shot learning modes
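
For illustration, the sketch below shows the general pattern of turning an instruction into Python code with an LLM backend; the prompt wording, helper name, and defaults are assumptions made for this sketch, not the actual cg_agent.py implementation:

# Illustrative sketch only; not the cg_agent.py implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_policy(command: str, context_info: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the LLM to emit Python code that calls the driving primitives."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Write Python code that uses the provided driving primitives to fulfil the user's instruction."},
            {"role": "user", "content": f"Context:\n{context_info}\n\nInstruction: {command}"},
        ],
    )
    return response.choices[0].message.content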

2. Policy Repository (policy_repo.py)

  • Stores successful driving policies with semantic descriptions
  • Uses vector database (ChromaDB) for efficient policy retrieval
  • Automatically indexes policies for reuse in similar scenarios
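
As a rough illustration of the retrieval idea, the snippet below stores a policy keyed by its semantic description in a ChromaDB collection and looks it up with a new command; the collection name, fields, and stored policy are assumptions, not the policy_repo.py API:

# Conceptual sketch of policy storage and retrieval with ChromaDB.
import chromadb

client = chromadb.Client()
policies = client.get_or_create_collection("driving_policies")

# Index a previously successful policy by its semantic description.
policies.add(
    ids=["lane_change_left_0"],
    documents=["change to the left lane when it is safe"],
    metadatas=[{"code": "vehicle.change_lane('left')"}],
)

# Retrieve the most similar stored policy for a new instruction.
result = policies.query(query_texts=["move one lane to the left"], n_results=1)
retrieved_code = result["metadatas"][0][0]["code"]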

3. Human Feedback Agent (hf_agent.py)

  • Extends the code generation agent with feedback mechanisms
  • Incorporates human critiques to refine generated policies
  • Commits successful policies to the repository
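
The feedback loop can be pictured roughly as below; the method names and the way a critique is passed back are assumptions for this sketch rather than the hf_agent.py interface:

# Schematic feedback loop; method names are assumptions, not the hf_agent.py interface.
def refine_with_feedback(agent, vehicle_dt, evaluator, max_rounds=3):
    policy = agent.step()                       # initial code from the LLM
    for _ in range(max_rounds):
        vehicle_dt.execute(policy)
        while not evaluator.ended:
            evaluator.step(vehicle_dt)
        feedback = input("Critique the behaviour (empty to accept): ")
        if not feedback:
            break                               # accepted: keep (and commit) this policy
        policy = agent.step(feedback=feedback)  # regenerate with the critique
    return policy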

4. Vehicle Digital Twin (vehicle_dt.py)

  • Executes generated Python code in the simulation environment
  • Provides a safe execution environment for LLM-generated code
  • Implements control interfaces for vehicle manipulation
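
A common pattern for running model-generated code against a fixed set of control primitives is to exec it in a namespace that exposes only those primitives; the sketch below shows that generic pattern (the primitive names are invented), not how vehicle_dt.py necessarily implements it:

# Generic sandboxed-execution sketch; the primitive names are illustrative only.
def execute_policy(policy_code: str, vehicle):
    allowed = {
        "set_speed": vehicle.set_speed,       # hypothetical control primitives
        "change_lane": vehicle.change_lane,
    }
    exec(policy_code, {"__builtins__": {}}, allowed)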

5. Evaluators (evaluator/)

  • Task-specific evaluators for different driving scenarios
  • Metrics: Time-to-Collision (TTC), speed variance, time efficiency
  • Supports ACC (by speed and by distance), lane change, overtaking, intersection, and pullover tasks
  • Evaluator types: AccEval, ACCEvalbySpeed, ACCEvalbyDistance, LaneChangeEval, OvertakeEval, IntersectionEval, PullOverEval
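
To make the metrics concrete, here is a back-of-the-envelope computation of time-to-collision and speed variance from logged states; the exact formulas and normalization used by the evaluators may differ:

# Rough metric computation; the evaluators' exact formulas may differ.
import numpy as np

def time_to_collision(gap_m, ego_speed, lead_speed):
    """TTC = gap / closing speed; infinite when the ego is not closing in."""
    closing = ego_speed - lead_speed
    return gap_m / closing if closing > 0 else float("inf")

def speed_variance(speed_trace):
    """Lower variance over the episode indicates smoother driving."""
    return float(np.var(speed_trace))

print(time_to_collision(gap_m=30.0, ego_speed=25.0, lead_speed=20.0))  # 6.0 s
print(speed_variance([20.0, 21.0, 22.0, 21.0]))                        # 0.5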

6. Benchmark Dataset (dbl.py)

  • DbLv1Dataset: Loads and manages the LaMPilot-Bench (Drive by Language) dataset
  • DbLv1DemoDataset: Subset of demo cases for quick testing
  • Supports shuffling and random seed configuration
  • Automatically loads configurations from config_list.txt
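
A minimal sketch of how such a loader could read config_list.txt and the referenced JSON scenario files; the file layout and field handling here are assumptions, not the dbl.py implementation:

# Minimal loader sketch; file-layout assumptions, not dbl.py.
import json
import random
from pathlib import Path

def load_configs(config_root, shuffle=False, seed=42):
    root = Path(config_root)
    names = (root / "config_list.txt").read_text().split()
    configs = [json.loads((root / name).read_text()) for name in names]
    if shuffle:
        random.Random(seed).shuffle(configs)
    return configs

# e.g. configs = load_configs("projects/lampilot/configs/DbLv1", shuffle=True)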

📊 LaMPilot-Bench

LaMPilot-Bench (also referred to as DbLv1, where DbL stands for Drive by Language) is the first benchmark dataset specifically designed to quantitatively evaluate the efficacy of language model programs in autonomous driving. The benchmark includes 32 diverse driving scenarios, each with multiple samples and commands, resulting in 4,900 total test cases for comprehensive evaluation.

Task Categories

  1. Speed Control

    • Absolute speed adjustments (increase/decrease to specific speeds)
    • Relative speed adjustments (increase/decrease by specific amounts)
  2. Following Distance

    • Absolute distance adjustments (increase/decrease to specific distances)
    • Relative distance adjustments (increase/decrease by specific amounts)
  3. Lane Changes

    • Left lane change
    • Right lane change
  4. Overtaking

    • Left overtake
    • Right overtake
  5. Intersection Navigation

    • Turn left
    • Turn right
    • Go straight
  6. Maneuvers

    • Pull over
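
Representative instructions for these categories look roughly like the following (paraphrased examples, not verbatim benchmark commands):

# Paraphrased example instructions per category; not verbatim benchmark commands.
example_commands = {
    "speed_control":      ["increase speed to 30 m/s", "slow down by 5 m/s"],
    "following_distance": ["keep a 40 m gap to the car ahead", "shorten the following distance"],
    "lane_change":        ["change to the left lane"],
    "overtaking":         ["overtake the car ahead on the left"],
    "intersection":       ["turn right at the intersection", "go straight"],
    "maneuvers":          ["pull over"],
}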

Evaluation Metrics

  • Safety Score: Based on Time-to-Collision (TTC)
  • Speed Variance Score: Measures driving smoothness
  • Time Efficiency Score: Evaluates task completion time
  • Overall Score: Weighted combination of the above metrics
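
The overall score aggregates the three component scores; a sketch of that weighted combination is below, with placeholder weights rather than the benchmark's actual values:

# Placeholder weights; the benchmark's actual weighting may differ.
def overall_score(safety, speed_variance_score, time_efficiency, weights=(0.5, 0.25, 0.25)):
    w_s, w_v, w_t = weights
    return w_s * safety + w_v * speed_variance_score + w_t * time_efficiency

print(overall_score(safety=90.0, speed_variance_score=80.0, time_efficiency=70.0))  # 82.5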

💡 Usage Examples

Example 1: Simple Command Execution

import json
from projects.lampilot.dt.cg_agent import CodeGenerationAgent
from projects.lampilot.dt.vehicle_dt import CtrlVDT
from projects.lampilot.evaluator import ACCEvalbySpeed

# Initialize agent
agent = CodeGenerationAgent(
    model_name="gpt-3.5-turbo",
    zero_shot=False  # Use few-shot by default
)
vehicle_dt = CtrlVDT()

# Load configuration
with open("projects/lampilot/configs/DbLv1/go_straight.json", 'r') as f:
    config = json.load(f)
sample = config['samples'][0]
command = config['commands'][0]

# Create the evaluator named in the config (the class must be imported above)
evaluator_type = sample.get('eval', {}).get('type', 'AccEval')
evaluator = eval(evaluator_type)(config=sample, show_window=True)

# Generate and execute policy
agent.reset(command=command, context_info=evaluator.get_context_info())
policy = agent.step()

vehicle_dt.reset(ego_vehicle=evaluator.env.unwrapped.vehicle)
vehicle_dt.execute(policy)

# Run simulation
while not evaluator.ended:
    evaluator.step(vehicle_dt)

evaluator.close()
print(f"Score: {evaluator.score:.1f}")

Example 2: Using Policy Repository with Human Feedback

from projects.lampilot.dt.hf_agent import HumanFeedbackCGAgent
from projects.lampilot.dt.vehicle_dt import CtrlVDT
from projects.lampilot.evaluator import Overtake
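# NOTE: the original example is truncated at this point; the continuation below is a
# sketch that mirrors Example 1. The constructor arguments and the feedback/commit
# behaviour are assumptions, not the exact hf_agent.py API.
agent = HumanFeedbackCGAgent(model_name="gpt-3.5-turbo")
vehicle_dt = CtrlVDT()

evaluator = Overtake(config=sample, show_window=True)  # `sample` loaded as in Example 1
agent.reset(command="overtake the car ahead", context_info=evaluator.get_context_info())

policy = agent.step()
vehicle_dt.reset(ego_vehicle=evaluator.env.unwrapped.vehicle)
vehicle_dt.execute(policy)

while not evaluator.ended:
    evaluator.step(vehicle_dt)

# A human critique can then be fed back and the successful policy committed to the
# repository for reuse in similar scenarios (see hf_agent.py for the actual interface).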
