LaMPilot
[CVPR 2024] LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs
Authors: Yunsheng Ma, Can Cui, Xu Cao, Wenqian Ye, Peiran Liu, Juanwu Lu, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Aniket Bera, James M. Rehg, Ziran Wang
Abstract
Autonomous driving (AD) has made significant strides in recent years. However, existing frameworks struggle to interpret and execute spontaneous user instructions, such as "overtake the car ahead." Large Language Models (LLMs) have demonstrated impressive reasoning capabilities showing potential to bridge this gap. In this paper, we present LaMPilot, a novel framework that integrates LLMs into AD systems, enabling them to follow user instructions by generating code that leverages established functional primitives. We also introduce LaMPilot-Bench, the first benchmark dataset specifically designed to quantitatively evaluate the efficacy of language model programs in AD. Adopting the LaMPilot framework, we conduct extensive experiments to assess the performance of off-the-shelf LLMs on LaMPilot-Bench. Our results demonstrate the potential of LLMs in handling diverse driving scenarios and following user instructions in driving.
🚀 Features
- Natural Language to Code: Convert high-level driving commands into executable Python code
- Policy Repository: Automatically stores and retrieves successful driving policies for reuse
- Human-in-the-Loop Feedback: Incorporates human feedback to iteratively improve generated policies
- Multiple LLM Support: Compatible with GPT-3.5, GPT-4, CodeLlama, Llama-2, and Code-Bison
- Flexible Evaluation: Supports various driving tasks including lane changes, overtaking, intersection navigation, and more
📋 Table of Contents
- Abstract
- Features
- Installation
- Quick Start
- Architecture
- LaMPilot-Bench
- Usage Examples
- Configuration
- Citation
- License
🔧 Installation
Prerequisites
- Python 3.8+
- OpenAI API key (or access to other supported LLM services)
Setup
- Clone the repository:
git clone https://github.com/PurdueDigitalTwin/LaMPilot.git
cd LaMPilot
- Install dependencies:
pip install -r requirements.txt
- Set up your API key:
export OPENAI_API_KEY=your_api_key_here
🚦 Quick Start
Basic Usage
Run a single task with a configuration file:
# Using the helper script (recommended)
./run_demo.sh --config projects/lampilot/configs/DbLv1/go_straight.json
# Or directly with Python
python projects/lampilot/demo.py --config projects/lampilot/configs/DbLv1/go_straight.json
Additional options for demo:
- `--model-name`: Specify LLM model (default: `gpt-3.5-turbo`)
- `--zero-shot`: Use zero-shot mode (default: few-shot)
- `--no-window`: Disable visualization window
- `--wait-time`: Wait time between simulation steps (default: 1e-3)
Human Feedback Agent
Run the human feedback agent for iterative policy improvement:
# Using the helper script (recommended)
./run_test_hf.sh \
--config-root projects/lampilot/configs/DbLv1 \
--model-name gpt-3.5-turbo \
--ckpt-dir ckpt/human-fdbk \
--resume
# Or directly with Python
python projects/lampilot/test_hf.py \
--config-root projects/lampilot/configs/DbLv1 \
--model-name gpt-3.5-turbo \
--ckpt-dir ckpt/human-fdbk \
--resume
Additional options:
- `--test-size`: Number of test cases to evaluate (default: 98)
- `--use-demo`: Use demo dataset instead of full dataset
- `--num-process`: Number of parallel processes (default: 1)
- `--few-shot`: Enable few-shot learning
- `--record-video`: Record simulation videos
- `--shuffle`: Shuffle the dataset
- `--random_seed`: Random seed for reproducibility (default: 42)
Zero-Shot and Few-Shot Code Generation
Test code generation without policy repository:
# Demo: Run few-shot evaluation on 5 random scenarios with GPT-4
./run_test_icl.sh \
--config-root projects/lampilot/configs/DbLv1 \
--model-name gpt-4 \
--test-size 5 \
--few-shot \
--shuffle \
--random_seed 123
# Using the helper script (recommended)
./run_test_icl.sh \
--config-root projects/lampilot/configs/DbLv1 \
--model-name gpt-4 \
--few-shot # Use --few-shot for few-shot, omit for zero-shot
# Or directly with Python
python projects/lampilot/test_icl.py \
--config-root projects/lampilot/configs/DbLv1 \
--model-name gpt-4 \
--few-shot # Use --few-shot for few-shot, omit for zero-shot
Running Full Benchmark
The LaMPilot-Bench (DbLv1) contains 4,900 test cases total. Run the complete benchmark evaluation:
# Zero-shot evaluation (full benchmark: 4,900 items)
python projects/lampilot/test_icl.py \
--config-root projects/lampilot/configs/DbLv1 \
--model-name gpt-3.5-turbo \
--test-size 4900 \
--num-process 4
# Few-shot evaluation (full benchmark: 4,900 items)
python projects/lampilot/test_icl.py \
--config-root projects/lampilot/configs/DbLv1 \
--model-name gpt-4 \
--few-shot \
--test-size 4900 \
--num-process 4
Note: You can use a smaller --test-size value (e.g., 98, 500, 1000) for faster evaluation or testing purposes. The script will automatically skip already-evaluated items if you use the --resume flag or run with the same checkpoint directory.
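The skip-already-evaluated behavior can be sketched roughly as follows (a minimal illustration that assumes one JSON result file per config in the checkpoint directory; the actual file layout used by the scripts may differ):

```python
from pathlib import Path

def pending_configs(config_paths, ckpt_dir):
    """Return the configs that have no saved result under ckpt_dir yet."""
    done = {p.stem for p in Path(ckpt_dir).glob("*.json")}
    return [c for c in config_paths if Path(c).stem not in done]
```

With `--resume`, only the configs returned by such a filter would be re-run.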
Note: The helper scripts (run_demo.sh, run_test_hf.sh, run_test_icl.sh) automatically handle:
- Virtual environment activation (if present)
- PYTHONPATH configuration
- Proper module imports
🏗️ Architecture
LaMPilot consists of several key components:
1. Code Generation Agent (cg_agent.py)
- Converts natural language commands into Python code
- Supports multiple LLM backends (OpenAI)
- Handles zero-shot and few-shot learning modes
2. Policy Repository (policy_repo.py)
- Stores successful driving policies with semantic descriptions
- Uses vector database (ChromaDB) for efficient policy retrieval
- Automatically indexes policies for reuse in similar scenarios
3. Human Feedback Agent (hf_agent.py)
- Extends the code generation agent with feedback mechanisms
- Incorporates human critiques to refine generated policies
- Commits successful policies to the repository
4. Vehicle Digital Twin (vehicle_dt.py)
- Executes generated Python code in the simulation environment
- Provides a safe execution environment for LLM-generated code
- Implements control interfaces for vehicle manipulation
5. Evaluators (evaluator/)
- Task-specific evaluators for different driving scenarios
- Metrics: Time-to-Collision (TTC), speed variance, time efficiency
- Supports ACC (by speed and by distance), lane change, overtaking, intersection, and pullover tasks
- Evaluator types:
`AccEval`, `ACCEvalbySpeed`, `ACCEvalbyDistance`, `LaneChangeEval`, `OvertakeEval`, `IntersectionEval`, `PullOverEval`
6. Benchmark Dataset (dbl.py)
- `DbLv1Dataset`: Loads and manages the LaMPilot-Bench (Drive by Language) dataset
- `DbLv1DemoDataset`: Subset of demo cases for quick testing
- Supports shuffling and random seed configuration
- Automatically loads configurations from `config_list.txt`
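The safe-execution idea behind the Vehicle Digital Twin (component 4) can be sketched as follows: generated code runs with only whitelisted control primitives in scope. The `ToyVehicle` class and primitive names below are hypothetical stand-ins for illustration, not LaMPilot's actual `CtrlVDT` API.

```python
# Minimal sketch of executing LLM-generated policy code against a restricted
# set of control primitives. set_speed/change_lane are made-up names.
class ToyVehicle:
    def __init__(self):
        self.speed = 20.0
        self.lane = 0

    def set_speed(self, v):
        self.speed = float(v)

    def change_lane(self, direction):
        self.lane += 1 if direction == "left" else -1

def execute_policy(vehicle, policy_code):
    # Expose only whitelisted primitives and block builtins, so the
    # generated code cannot reach the filesystem, network, etc.
    allowed = {
        "set_speed": vehicle.set_speed,
        "change_lane": vehicle.change_lane,
        "__builtins__": {},
    }
    exec(policy_code, allowed)

vehicle = ToyVehicle()
execute_policy(vehicle, "set_speed(25)\nchange_lane('left')")
```

This is only a sketch of the whitelisting pattern; a production sandbox needs stronger isolation than a restricted `exec` namespace.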
📊 LaMPilot-Bench
LaMPilot-Bench (also referred to as DbLv1, Drive by Language version 1) is the first benchmark dataset specifically designed to quantitatively evaluate the efficacy of language model programs in autonomous driving. The benchmark includes 32 diverse driving scenarios, each with multiple samples and commands, resulting in 4,900 total test cases for comprehensive evaluation.
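The total case count comes from enumerating (scenario, sample, command) combinations. A toy sketch of that enumeration, with invented per-scenario counts (only the overall scheme comes from the benchmark description):

```python
import itertools

# Hypothetical per-scenario (num_samples, num_commands) counts for illustration.
scenarios = {"lane_change_left": (5, 4), "pull_over": (3, 2)}

# One test case per (scenario, sample index, command index) combination.
cases = [(name, s, c)
         for name, (ns, nc) in scenarios.items()
         for s, c in itertools.product(range(ns), range(nc))]
print(len(cases))  # 5*4 + 3*2 = 26
```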
Task Categories
- Speed Control
  - Absolute speed adjustments (increase/decrease to specific speeds)
  - Relative speed adjustments (increase/decrease by specific amounts)
- Following Distance
  - Absolute distance adjustments (increase/decrease to specific distances)
  - Relative distance adjustments (increase/decrease by specific amounts)
- Lane Changes
  - Left lane change
  - Right lane change
- Overtaking
  - Left overtake
  - Right overtake
- Intersection Navigation
  - Turn left
  - Turn right
  - Go straight
- Maneuvers
  - Pull over
Evaluation Metrics
- Safety Score: Based on Time-to-Collision (TTC)
- Speed Variance Score: Measures driving smoothness
- Time Efficiency Score: Evaluates task completion time
- Overall Score: Weighted combination of the above metrics
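As a toy illustration of how such a weighted combination works (the weights below are invented for demonstration; the paper defines the actual weighting):

```python
def overall_score(safety, speed_variance, time_efficiency,
                  weights=(0.5, 0.25, 0.25)):
    """Weighted combination of the three metric scores (illustrative weights)."""
    scores = (safety, speed_variance, time_efficiency)
    return sum(w * s for w, s in zip(weights, scores))

print(overall_score(90.0, 80.0, 70.0))  # 0.5*90 + 0.25*80 + 0.25*70 = 82.5
```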
💡 Usage Examples
Example 1: Simple Command Execution
import json
from projects.lampilot.dt.cg_agent import CodeGenerationAgent
from projects.lampilot.dt.vehicle_dt import CtrlVDT
from projects.lampilot.evaluator import ACCEvalbySpeed
# Initialize agent
agent = CodeGenerationAgent(
model_name="gpt-3.5-turbo",
zero_shot=False # Use few-shot by default
)
vehicle_dt = CtrlVDT()
# Load configuration
with open("projects/lampilot/configs/DbLv1/go_straight.json", 'r') as f:
config = json.load(f)
sample = config['samples'][0]
command = config['commands'][0]
# Create evaluator; eval() resolves the class by name, so the evaluator
# class referenced in the config must already be imported into scope
evaluator_type = sample.get('eval', {}).get('type', 'AccEval')
evaluator = eval(evaluator_type)(config=sample, show_window=True)
# Generate and execute policy
agent.reset(command=command, context_info=evaluator.get_context_info())
policy = agent.step()
vehicle_dt.reset(ego_vehicle=evaluator.env.unwrapped.vehicle)
vehicle_dt.execute(policy)
# Run simulation
while not evaluator.ended:
evaluator.step(vehicle_dt)
evaluator.close()
print(f"Score: {evaluator.score:.1f}")
Example 2: Using Policy Repository with Human Feedback
from projects.lampilot.dt.hf_agent import HumanFeedbackCGAgent
from projects.lampilot.dt.vehicle_dt import CtrlVDT
from projects.lampilot.evaluator import OvertakeEval