Efficient Virtuoso: A Latent Diffusion Transformer for Trajectory Planning
<p align="center"> This repository contains the official PyTorch implementation of "Efficient Virtuoso," a project developing a conditional Denoising Diffusion Probabilistic Model (DDPM) for multi-modal, long-horizon trajectory planning on the Waymo Open Motion Dataset. </p> <p align="center"> <a href="https://arxiv.org/abs/2509.03658" target="_blank"> <img src="https://img.shields.io/badge/ArXiv-2509.03658-b31b1b.svg?style=flat-square" alt="ArXiv Paper"> </a> <a href="http://arxiv.org/licenses/nonexclusive-distrib/1.0/" target="_blank"> <img src="https://img.shields.io/badge/Paper%20License-arXiv%20Perpetual-b31b1b.svg?style=flat-square" alt="ArXiv Paper License"> </a> <a href="LICENSE"> <img src="https://img.shields.io/badge/Code%20License-MIT-blue.svg?style=flat-square" alt="Code License"> </a> <a href="https://pytorch.org/"> <img src="https://img.shields.io/badge/Made%20with-PyTorch-EE4C2C.svg?style=flat-square&logo=pytorch" alt="Made with PyTorch"> </a> <img src="https://img.shields.io/badge/Python-3.10-3776AB.svg?style=flat-square&logo=python" alt="Python 3.10"> </p> <p align="center"> A project by <strong>Antonio Guillen-Perez</strong> | <a href="https://antonioalgaida.github.io/" target="_blank"><strong>Portfolio</strong></a> | <a href="https://www.linkedin.com/in/antonioguillenperez/" target="_blank"><strong>LinkedIn</strong></a> | <a href="https://scholar.google.com/citations?user=BFS6jXwAAAAJ" target="_blank"><strong>Google Scholar</strong></a> </p>- Efficient Virtuoso: A Latent Diffusion Transformer for Trajectory Planning
1. Key Result
This project trains a generative model that produces diverse, realistic, and contextually aware future trajectories for an autonomous vehicle. Given a single scene context, the model can generate a multi-modal distribution of plausible future plans, a critical capability for robust decision-making.
<p align="center"> <img src="figures/fan_out_1.png" width="320" alt="Multi-modal trajectory prediction example 1"> <img src="figures/fan_out_2.png" width="320" alt="Multi-modal trajectory prediction example 2"> <img src="figures/fan_out_3.png" width="300" alt="Multi-modal trajectory prediction example 3"> </p> <p align="center"> <em><b>Figure 1: Multi-modal Trajectory Generation.</b> For the same initial state (SDC in green, past trajectory in red), our model generates 20 diverse yet plausible future trajectories (purple-red scale fan-out) that correctly adhere to the road geometry. Each panel shows a different scenario, highlighting the model's ability to capture scene context and generate multi-modal predictions.</em> </p>2. Project Mission
The development of safe and intelligent autonomous vehicles hinges on their ability to reason about an uncertain and multi-modal future. Traditional deterministic approaches, which predict a single "best guess" trajectory, often fail to capture the rich distribution of plausible behaviors a human driver might exhibit. This can lead to policies that are overly conservative or dangerously indecisive in complex scenarios.
This project directly confronts this challenge by fundamentally shifting the modeling paradigm from deterministic regression to conditional generative modeling. The mission is to develop a policy that learns to represent and sample from the entire, complex distribution of plausible expert behaviors, enabling the generation of driving behaviors that are not only safe but also contextually appropriate, diverse, and human-like.
3. Technical Approach
The core of this project is a Conditional Latent Diffusion Model. To achieve both high fidelity and computational efficiency, the diffusion process is performed not on the raw trajectory data, but in a compressed, low-dimensional latent space derived via Principal Component Analysis (PCA).
- Data Pipeline: The raw Waymo Open Motion Dataset is processed through a multi-stage pipeline (`src/data_processing/`). This includes parsing the raw data, intelligently filtering out static scenarios, and extracting features to produce `(Context, Target Trajectory)` pairs.
- Latent Space Creation (PCA): We perform PCA on the entire set of expert `Target Trajectories` to find the principal components that capture the most variance. This lets us represent a high-dimensional trajectory (e.g., `80 timesteps * 2 coords = 160 dims`) with a much smaller latent vector (e.g., `32 dims`), which becomes the new target for the diffusion model.
- Context Encoding: The scene `Context` is encoded by a powerful `StateEncoder`. It uses dedicated sub-networks for each entity (ego history, agents, map, goal) and fuses them with a Transformer Encoder into a single, holistic `scene_embedding`.
- Denoising Model (Latent Diffusion Transformer): The primary model is a Conditional Transformer Decoder. It takes a noisy latent vector `z_t` and learns to predict the original noise `ε`, conditioned on the `scene_embedding` from the `StateEncoder` and the noise level `t`. This architecture is more expressive and parameter-efficient for this kind of sequential data than a standard U-Net.
- Sampling: At inference time, we start from pure Gaussian noise `z_T` in the latent space and iteratively apply the trained denoiser to recover a clean latent vector `z_0`, which is then projected back into the high-dimensional trajectory space via the inverse PCA transform. This repository implements both the slow, stochastic DDPM sampler and the fast, deterministic DDIM sampler.
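The PCA compression step can be sketched with plain NumPy (the exact dimensions below follow the example figures above; the data here is random stand-in, not Waymo trajectories):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for N expert trajectories: 80 (x, y) steps flattened to 160 dims.
trajectories = rng.normal(size=(1000, 160))

# PCA via SVD of the mean-centered data: keep the top-32 principal directions.
mean = trajectories.mean(axis=0)
_, _, vt = np.linalg.svd(trajectories - mean, full_matrices=False)
components = vt[:32]                               # (32, 160)

# Project into the 32-dim latent space and back (the inverse PCA transform
# used as the final step of sampling: z_0 -> trajectory).
latents = (trajectories - mean) @ components.T     # (1000, 32)
recon = latents @ components + mean                # (1000, 160)
print(latents.shape, recon.shape)
```

On real driving trajectories, which are far more correlated than random noise, 32 components capture most of the variance, which is what makes the latent target practical.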
To ensure stability, all trajectory data is normalized to a [-1, 1] range before being used in the diffusion process.
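A minimal sketch of that min-max normalization, assuming per-dimension bounds taken from precomputed dataset statistics (e.g., `normalization_stats.pt`); the function names are illustrative:

```python
import numpy as np

def normalize(traj, lo, hi):
    # Map values from [lo, hi] to [-1, 1]; lo/hi come from dataset statistics.
    return 2.0 * (traj - lo) / (hi - lo) - 1.0

def denormalize(traj, lo, hi):
    # Exact inverse, applied after sampling to recover metric coordinates.
    return (traj + 1.0) * (hi - lo) / 2.0 + lo

x = np.array([0.0, 50.0, 100.0])
y = normalize(x, lo=0.0, hi=100.0)      # maps to [-1, 0, 1]
print(y, denormalize(y, 0.0, 100.0))
```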
<p align="center"><b>Figure 2: Model Architecture.</b> A Transformer-based StateEncoder processes the scene context. A separate Transformer Decoder acts as the denoiser in the PCA latent space.</p>
4. Repository Structure
```
diffusion-trajectory-planner/
├── configs/
│   └── main_config.yaml
├── data/
│   ├── (gitignored) processed_npz/
│   └── (gitignored) featurized_v3_diffusion/
├── models/
│   ├── (gitignored) checkpoints/
│   └── (gitignored) normalization_stats.pt
├── notebooks/
│   ├── 1_analyze_source_data.ipynb
│   ├── 2_analyze_featurized_data.ipynb
│   └── 3_analyze_final_results.ipynb
├── src/
│   ├── data_processing/          # Scripts for parsing, featurizing, and PCA
│   │   ├── parser.py
│   │   ├── featurizer_diffusion.py
│   │   └── compute_normalization_stats.py
│   ├── diffusion_policy/         # Core model, dataset, and training logic
│   │   ├── dataset.py
│   │   ├── networks.py
│   │   └── train.py
│   └── evaluation/               # Scripts for evaluation and visualization
│       └── evaluate_prediction.py
└── README.md
```
5. Setup and Installation

- Clone the repository:

  ```bash
  git clone https://github.com/your-username/diffusion-trajectory-planner.git
  cd diffusion-trajectory-planner
  ```

- Create and activate a Conda environment:

  ```bash
  conda create --name virtuoso_env python=3.10
  conda activate virtuoso_env
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
6. Data Preparation Pipeline
This is a multi-step, one-time process. All commands should be run from the root of the repository.
Step 0: Download the Waymo Open Motion Dataset
Download the .tfrecord files for the motion prediction task from the Waymo Open Dataset website. Place the scenario folder containing the training and validation shards into a directory of your choice.
Step 1: Parse Raw Data (.tfrecord -> .npz)
This initial step converts the raw .tfrecord files into a more accessible NumPy format.
Note: This `parser.py` script is a prerequisite and is assumed to be adapted from a previous project.
Update configs/main_config.yaml with the correct path to your raw data, then run the parser.
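For illustration only, the relevant entry might look like the fragment below; the key names are hypothetical, so match them to what `configs/main_config.yaml` actually uses:

```yaml
# Hypothetical structure -- check configs/main_config.yaml for the real keys.
data:
  raw_data_dir: /path/to/waymo/scenario   # folder with training/validation shards
```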
```bash
# Activate the parser-specific environment
conda activate virtuoso_parser
python -m src.data_processing.parser
```
This will create a data/processed_npz/ directory containing the parsed .npz files.
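As a quick sanity check, you can open one of the parsed files with NumPy. The snippet below builds a small stand-in `.npz` so it runs anywhere; the array names are illustrative, not the parser's actual keys:

```python
import numpy as np
from pathlib import Path

# Stand-in for a file under data/processed_npz/; the array names
# ("ego_history", "future_trajectory") are hypothetical examples.
sample = Path("example_scene.npz")
np.savez(sample, ego_history=np.zeros((11, 2)), future_trajectory=np.zeros((80, 2)))

with np.load(sample) as scene:
    print(sorted(scene.files))               # arrays stored in the file
    print(scene["future_trajectory"].shape)  # (80, 2)
```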
