LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs

Authors: Yunsheng Ma, Can Cui, Xu Cao, Wenqian Ye, Peiran Liu, Juanwu Lu, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Aniket Bera, James M. Rehg, Ziran Wang

Abstract

Autonomous driving (AD) has made significant strides in recent years. However, existing frameworks struggle to interpret and execute spontaneous user instructions, such as "overtake the car ahead." Large Language Models (LLMs) have demonstrated impressive reasoning capabilities showing potential to bridge this gap. In this paper, we present LaMPilot, a novel framework that integrates LLMs into AD systems, enabling them to follow user instructions by generating code that leverages established functional primitives. We also introduce LaMPilot-Bench, the first benchmark dataset specifically designed to quantitatively evaluate the efficacy of language model programs in AD. Adopting the LaMPilot framework, we conduct extensive experiments to assess the performance of off-the-shelf LLMs on LaMPilot-Bench. Our results demonstrate the potential of LLMs in handling diverse driving scenarios and following user instructions in driving.

🚀 Features

  • Natural Language to Code: Convert high-level driving commands into executable Python code
  • Policy Repository: Automatically stores and retrieves successful driving policies for reuse
  • Human-in-the-Loop Feedback: Incorporates human feedback to iteratively improve generated policies
  • Multiple LLM Support: Compatible with GPT-3.5, GPT-4, CodeLlama, Llama-2, and Code-Bison
  • Flexible Evaluation: Supports various driving tasks including lane changes, overtaking, intersection navigation, and more

🔧 Installation

Prerequisites

  • Python 3.8+
  • OpenAI API key (or access to other supported LLM services)

Setup

  1. Clone the repository:
git clone https://github.com/PurdueDigitalTwin/LaMPilot.git
cd LaMPilot
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up your API key:
export OPENAI_API_KEY=your_api_key_here

🚦 Quick Start

Basic Usage

Run a single task with a configuration file:

# Using the helper script (recommended)
./run_demo.sh --config projects/lampilot/configs/DbLv1/go_straight.json

# Or directly with Python
python projects/lampilot/demo.py --config projects/lampilot/configs/DbLv1/go_straight.json

Additional options for demo:

  • --model-name: Specify LLM model (default: gpt-3.5-turbo)
  • --zero-shot: Use zero-shot mode (default: few-shot)
  • --no-window: Disable visualization window
  • --wait-time: Wait time between simulation steps (default: 1e-3)

Human Feedback Agent

Run the human feedback agent for iterative policy improvement:

# Using the helper script (recommended)
./run_test_hf.sh \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-3.5-turbo \
    --ckpt-dir ckpt/human-fdbk \
    --resume

# Or directly with Python
python projects/lampilot/test_hf.py \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-3.5-turbo \
    --ckpt-dir ckpt/human-fdbk \
    --resume

Additional options:

  • --test-size: Number of test cases to evaluate (default: 98)
  • --use-demo: Use demo dataset instead of full dataset
  • --num-process: Number of parallel processes (default: 1)
  • --few-shot: Enable few-shot learning
  • --record-video: Record simulation videos
  • --shuffle: Shuffle the dataset
  • --random_seed: Random seed for reproducibility (default: 42)

Zero-Shot and Few-Shot Code Generation

Test code generation without policy repository:

# Demo: Run few-shot evaluation on 5 random scenarios with GPT-4
./run_test_icl.sh \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-4 \
    --test-size 5 \
    --few-shot \
    --shuffle \
    --random_seed 123

# Using the helper script (recommended)
./run_test_icl.sh \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-4 \
    --few-shot  # Use --few-shot for few-shot, omit for zero-shot

# Or directly with Python
python projects/lampilot/test_icl.py \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-4 \
    --few-shot  # Use --few-shot for few-shot, omit for zero-shot

Running Full Benchmark

The LaMPilot-Bench (DbLv1) contains 4,900 test cases total. Run the complete benchmark evaluation:

# Zero-shot evaluation (full benchmark: 4,900 items)
python projects/lampilot/test_icl.py \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-3.5-turbo \
    --test-size 4900 \
    --num-process 4

# Few-shot evaluation (full benchmark: 4,900 items)
python projects/lampilot/test_icl.py \
    --config-root projects/lampilot/configs/DbLv1 \
    --model-name gpt-4 \
    --few-shot \
    --test-size 4900 \
    --num-process 4

Note: You can use a smaller --test-size value (e.g., 98, 500, 1000) for faster evaluation or testing purposes. The script will automatically skip already-evaluated items if you use the --resume flag or run with the same checkpoint directory.

Note: The helper scripts (run_demo.sh, run_test_hf.sh, run_test_icl.sh) automatically handle:

  • Virtual environment activation (if present)
  • PYTHONPATH configuration
  • Proper module imports

🏗️ Architecture

LaMPilot consists of several key components:

1. Code Generation Agent (cg_agent.py)

  • Converts natural language commands into Python code
  • Supports multiple LLM backends (OpenAI)
  • Handles zero-shot and few-shot learning modes
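
For illustration, the sketch below shows the general pattern of turning an instruction into Python code with an LLM backend; the prompt wording, helper name, and defaults are assumptions made for this sketch, not the actual cg_agent.py implementation:

# Illustrative sketch only; not the cg_agent.py implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_policy(command: str, context_info: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the LLM to emit Python code that calls the driving primitives."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Write Python code that uses the provided driving primitives to fulfil the user's instruction."},
            {"role": "user", "content": f"Context:\n{context_info}\n\nInstruction: {command}"},
        ],
    )
    return response.choices[0].message.content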

2. Policy Repository (policy_repo.py)

  • Stores successful driving policies with semantic descriptions
  • Uses vector database (ChromaDB) for efficient policy retrieval
  • Automatically indexes policies for reuse in similar scenarios
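
As a rough illustration of the retrieval idea, the snippet below stores a policy keyed by its semantic description in a ChromaDB collection and looks it up with a new command; the collection name, fields, and stored policy are assumptions, not the policy_repo.py API:

# Conceptual sketch of policy storage and retrieval with ChromaDB.
import chromadb

client = chromadb.Client()
policies = client.get_or_create_collection("driving_policies")

# Index a previously successful policy by its semantic description.
policies.add(
    ids=["lane_change_left_0"],
    documents=["change to the left lane when it is safe"],
    metadatas=[{"code": "vehicle.change_lane('left')"}],
)

# Retrieve the most similar stored policy for a new instruction.
result = policies.query(query_texts=["move one lane to the left"], n_results=1)
retrieved_code = result["metadatas"][0][0]["code"]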

3. Human Feedback Agent (hf_agent.py)

  • Extends the code generation agent with feedback mechanisms
  • Incorporates human critiques to refine generated policies
  • Commits successful policies to the repository
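
The feedback loop can be pictured roughly as below; the method names and the way a critique is passed back are assumptions for this sketch rather than the hf_agent.py interface:

# Schematic feedback loop; method names are assumptions, not the hf_agent.py interface.
def refine_with_feedback(agent, vehicle_dt, evaluator, max_rounds=3):
    policy = agent.step()                       # initial code from the LLM
    for _ in range(max_rounds):
        vehicle_dt.execute(policy)
        while not evaluator.ended:
            evaluator.step(vehicle_dt)
        feedback = input("Critique the behaviour (empty to accept): ")
        if not feedback:
            break                               # accepted: keep (and commit) this policy
        policy = agent.step(feedback=feedback)  # regenerate with the critique
    return policy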

4. Vehicle Digital Twin (vehicle_dt.py)

  • Executes generated Python code in the simulation environment
  • Provides a safe execution environment for LLM-generated code
  • Implements control interfaces for vehicle manipulation
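
A common pattern for running model-generated code against a fixed set of control primitives is to exec it in a namespace that exposes only those primitives; the sketch below shows that generic pattern (the primitive names are invented), not how vehicle_dt.py necessarily implements it:

# Generic sandboxed-execution sketch; the primitive names are illustrative only.
def execute_policy(policy_code: str, vehicle):
    allowed = {
        "set_speed": vehicle.set_speed,       # hypothetical control primitives
        "change_lane": vehicle.change_lane,
    }
    exec(policy_code, {"__builtins__": {}}, allowed)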

5. Evaluators (evaluator/)

  • Task-specific evaluators for different driving scenarios
  • Metrics: Time-to-Collision (TTC), speed variance, time efficiency
  • Supports ACC (by speed and by distance), lane change, overtaking, intersection, and pullover tasks
  • Evaluator types: AccEval, ACCEvalbySpeed, ACCEvalbyDistance, LaneChangeEval, OvertakeEval, IntersectionEval, PullOverEval
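
To make the metrics concrete, here is a back-of-the-envelope computation of time-to-collision and speed variance from logged states; the exact formulas and normalization used by the evaluators may differ:

# Rough metric computation; the evaluators' exact formulas may differ.
import numpy as np

def time_to_collision(gap_m, ego_speed, lead_speed):
    """TTC = gap / closing speed; infinite when the ego is not closing in."""
    closing = ego_speed - lead_speed
    return gap_m / closing if closing > 0 else float("inf")

def speed_variance(speed_trace):
    """Lower variance over the episode indicates smoother driving."""
    return float(np.var(speed_trace))

print(time_to_collision(gap_m=30.0, ego_speed=25.0, lead_speed=20.0))  # 6.0 s
print(speed_variance([20.0, 21.0, 22.0, 21.0]))                        # 0.5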

6. Benchmark Dataset (dbl.py)

  • DbLv1Dataset: Loads and manages the LaMPilot-Bench (Drive by Language) dataset
  • DbLv1DemoDataset: Subset of demo cases for quick testing
  • Supports shuffling and random seed configuration
  • Automatically loads configurations from config_list.txt
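
A minimal sketch of how such a loader could read config_list.txt and the referenced JSON scenario files; the file layout and field handling here are assumptions, not the dbl.py implementation:

# Minimal loader sketch; file-layout assumptions, not dbl.py.
import json
import random
from pathlib import Path

def load_configs(config_root, shuffle=False, seed=42):
    root = Path(config_root)
    names = (root / "config_list.txt").read_text().split()
    configs = [json.loads((root / name).read_text()) for name in names]
    if shuffle:
        random.Random(seed).shuffle(configs)
    return configs

# e.g. configs = load_configs("projects/lampilot/configs/DbLv1", shuffle=True)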

📊 LaMPilot-Bench

LaMPilot-Bench (also referred to as DbLv1, where DbL stands for Drive by Language) is the first benchmark dataset specifically designed to quantitatively evaluate the efficacy of language model programs in autonomous driving. The benchmark includes 32 diverse driving scenarios, each with multiple samples and commands, resulting in 4,900 total test cases for comprehensive evaluation.

Task Categories

  1. Speed Control

    • Absolute speed adjustments (increase/decrease to specific speeds)
    • Relative speed adjustments (increase/decrease by specific amounts)
  2. Following Distance

    • Absolute distance adjustments (increase/decrease to specific distances)
    • Relative distance adjustments (increase/decrease by specific amounts)
  3. Lane Changes

    • Left lane change
    • Right lane change
  4. Overtaking

    • Left overtake
    • Right overtake
  5. Intersection Navigation

    • Turn left
    • Turn right
    • Go straight
  6. Maneuvers

    • Pull over
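
Representative instructions for these categories look roughly like the following (paraphrased examples, not verbatim benchmark commands):

# Paraphrased example instructions per category; not verbatim benchmark commands.
example_commands = {
    "speed_control":      ["increase speed to 30 m/s", "slow down by 5 m/s"],
    "following_distance": ["keep a 40 m gap to the car ahead", "shorten the following distance"],
    "lane_change":        ["change to the left lane"],
    "overtaking":         ["overtake the car ahead on the left"],
    "intersection":       ["turn right at the intersection", "go straight"],
    "maneuvers":          ["pull over"],
}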

Evaluation Metrics

  • Safety Score: Based on Time-to-Collision (TTC)
  • Speed Variance Score: Measures driving smoothness
  • Time Efficiency Score: Evaluates task completion time
  • Overall Score: Weighted combination of the above metrics
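
The overall score aggregates the three component scores; a sketch of that weighted combination is below, with placeholder weights rather than the benchmark's actual values:

# Placeholder weights; the benchmark's actual weighting may differ.
def overall_score(safety, speed_variance_score, time_efficiency, weights=(0.5, 0.25, 0.25)):
    w_s, w_v, w_t = weights
    return w_s * safety + w_v * speed_variance_score + w_t * time_efficiency

print(overall_score(safety=90.0, speed_variance_score=80.0, time_efficiency=70.0))  # 82.5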

💡 Usage Examples

Example 1: Simple Command Execution

import json
from projects.lampilot.dt.cg_agent import CodeGenerationAgent
from projects.lampilot.dt.vehicle_dt import CtrlVDT
from projects.lampilot.evaluator import ACCEvalbySpeed

# Initialize agent
agent = CodeGenerationAgent(
    model_name="gpt-3.5-turbo",
    zero_shot=False  # Use few-shot by default
)
vehicle_dt = CtrlVDT()

# Load configuration
with open("projects/lampilot/configs/DbLv1/go_straight.json", 'r') as f:
    config = json.load(f)
sample = config['samples'][0]
command = config['commands'][0]

# Create the evaluator named in the config (the class must be imported above)
evaluator_type = sample.get('eval', {}).get('type', 'AccEval')
evaluator = eval(evaluator_type)(config=sample, show_window=True)

# Generate and execute policy
agent.reset(command=command, context_info=evaluator.get_context_info())
policy = agent.step()

vehicle_dt.reset(ego_vehicle=evaluator.env.unwrapped.vehicle)
vehicle_dt.execute(policy)

# Run simulation
while not evaluator.ended:
    evaluator.step(vehicle_dt)

evaluator.close()
print(f"Score: {evaluator.score:.1f}")

Example 2: Using Policy Repository with Human Feedback

from projects.lampilot.dt.hf_agent import HumanFeedbackCGAgent
from projects.lampilot.dt.vehicle_dt import CtrlVDT
from projects.lampilot.evaluator import Overtake
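# NOTE: the original example is truncated at this point; the continuation below is a
# sketch that mirrors Example 1. The constructor arguments and the feedback/commit
# behaviour are assumptions, not the exact hf_agent.py API.
agent = HumanFeedbackCGAgent(model_name="gpt-3.5-turbo")
vehicle_dt = CtrlVDT()

evaluator = Overtake(config=sample, show_window=True)  # `sample` loaded as in Example 1
agent.reset(command="overtake the car ahead", context_info=evaluator.get_context_info())

policy = agent.step()
vehicle_dt.reset(ego_vehicle=evaluator.env.unwrapped.vehicle)
vehicle_dt.execute(policy)

while not evaluator.ended:
    evaluator.step(vehicle_dt)

# A human critique can then be fed back and the successful policy committed to the
# repository for reuse in similar scenarios (see hf_agent.py for the actual interface).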
