
<div align="center">

VideoCAD: A Dataset and Model for Learning Long‑Horizon 3D CAD UI Interactions from Video

Brandon Man<sup>1*</sup>, Ghadi Nehme<sup>1*</sup>, Md Ferdous Alam<sup>1</sup>, Faez Ahmed<sup>1</sup> <br> <sup>1</sup>Massachusetts Institute of Technology
<sup>*</sup>Equal Contribution

arXiv Project Page Data Link

</div> <div align="center"> <!-- Row 1 --> <img src="assets/00058464.gif" width="270"/> <img src="assets/00058391.gif" width="270"/> <img src="assets/00053009.gif" width="270"/> <br> <!-- Row 2 --> <img src="assets/00054448.gif" width="270"/> <img src="assets/00055575.gif" width="270"/> <img src="assets/00056215.gif" width="270"/> </div>

Introduction

VideoCAD is a large-scale video dataset designed to enable learning of precise user interactions in 3D CAD software. It provides over 41,000 high-resolution videos paired with timestamped UI actions and target CAD images, enabling multimodal learning for behavior cloning, visual understanding, and spatial reasoning.


Key Features

  • 41K+ CAD modeling videos with synchronized UI screen recordings and action logs.
  • Low-level UI actions (mouse, keyboard, typed input) and high-level operations (extrude, sketch).
  • Pixel-based action grounding, requiring agents to reason about geometry and UI layout.
  • Long-horizon tasks, with sequences up to 186 actions long, 20× longer than existing UI datasets.
  • Applications: behavior cloning, CAD generation, autocomplete, video QA, spatial reasoning.

Comparison with Other UI Interaction Datasets

| Environment       | # Samples | Time Horizon | 3D Reasoning | Precise Elements | Avg. # Elements |
|-------------------|-----------|--------------|--------------|------------------|-----------------|
| OSWorld           | 369       | 15*          | ❌           | ✅               | --              |
| Mind2Web          | 2,350     | 7.3          | ❌           | ❌               | 1,135           |
| WebArena          | 812       | --           | ❌           | ❌               | --              |
| VisualWebArena    | 910       | 35*          | ❌           | ❌               | --              |
| TheAgentCompany   | 175       | 40           | ❌           | ❌               | --              |
| WorkArena         | 33        | 15           | ❌           | ✅               | --              |
| WebShop           | 12,000    | 11.3         | ❌           | ❌               | 38              |
| OmniAct           | 9,802     | --           | ❌           | ✅               | --              |
| WebLinx           | 2,337     | 43           | ❌           | ❌               | 1,849           |
| AITW              | 715,142   | 6.5          | ❌           | ❌               | --              |
| MMINA             | 1,050     | 12.9         | ❌           | ✅               | 601             |
| MetaGUI           | 1,125     | --           | ❌           | ❌               | 79              |
| PixelHelp         | 187       | 4.2          | ❌           | ❌               | --              |
| AndroidWorld      | 116       | 18.1         | ❌           | ✅               | --              |
| AgentStudio       | 304       | 30*          | ❌           | ✅               | --              |
| MoTIF             | 4,707     | 4.4          | ❌           | ❌               | 188             |
| AndroidArena      | 116       | 11.4         | ❌           | ❌               | --              |
| WindowsAgentArena | 154       | 8.1          | ❌           | ✅               | --              |
| MiniWoB++         | 125       | 3.6          | ✅           | ❌               | 28              |
| GUI-WORLD         | 12,379    | 10.97        | ✅           | ✅               | --              |
| VideoCAD          | 41,005    | 186          | ✅           | ✅               | 6,740           |

<sup>* Max is used when average is not reported</sup>


Dataset Generation Pipeline

<div align="center"> <img src="assets/pipeline.png" width="1000"> </div>

Dataset Structure on Harvard Dataverse

🔹 0000.zip ... 0249.zip       # UI screen recordings at 60 FPS
🔹 cad_imgs.zip                # Rendered isometric views of CAD models
🔹 action_raw.zip              # Low-level UI action sequences
🔹 mouse_raw.zip               # High-level descriptions (extrusions, primitives, parameters, etc.)
🔹 action_json.zip             # Low-level UI action sequences (JSON format)
🔹 mouse_json.zip              # High-level descriptions (extrusions, primitives, parameters, etc.) (JSON format)

Data Preprocessing

Before training or evaluation, you need to preprocess the raw VideoCAD data from the Harvard Dataverse.

1. Download Raw Data

Download the raw VideoCAD dataset from the Harvard Dataverse and organize it as follows:

data/
├── data_raw/
│   ├── videos/           # Video files (.mp4)
│   ├── mouse/            # Mouse action log files (.log)
│   └── images/           # Target CAD images (.png)
└── data_resized/         # Processed data (created by preprocessing)

You can find the images here: cad_imgs.zip
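If you prefer to script this step, here is a minimal sketch for unpacking the Dataverse archives into the layout above (the `data/downloads/` staging folder and the archive-to-folder mapping are assumptions; adjust them to where you saved the files):

```python
# Hypothetical unpacking helper: extracts downloaded Dataverse archives into
# the data/data_raw/ layout shown above. The staging folder is an assumption.
import zipfile
from pathlib import Path

downloads = Path("data/downloads")
targets = {
    "cad_imgs.zip": Path("data/data_raw/images"),
    "mouse_raw.zip": Path("data/data_raw/mouse"),
}

for name, dest in targets.items():
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(downloads / name) as zf:
        zf.extractall(dest)

# Video archives 0000.zip ... 0249.zip all extract into the same videos/ folder.
videos = Path("data/data_raw/videos")
videos.mkdir(parents=True, exist_ok=True)
for archive in sorted(downloads.glob("0*.zip")):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(videos)
```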

2. Run Data Preprocessing

Process the raw data into the format expected by the training scripts:

python generate_dataset.py

This script will:

  • Extract video frames at action timestamps
  • Convert mouse logs to action vectors
  • Resize images to (224, 224)
  • Create pickle files with synchronized frames, actions, and timestamps
  • Copy target CAD images to the processed directory
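As a rough illustration of the frame-extraction and pickling steps above (the real logic lives in `generate_dataset.py`; the timestamp values and the exact pickle schema here are assumptions):

```python
# Illustrative sketch only; generate_dataset.py is the authoritative version.
import pickle
from pathlib import Path

import cv2  # pip install opencv-python

def extract_frames(video_path, timestamps_s, size=(224, 224)):
    """Grab one resized frame per action timestamp (in seconds)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for t in timestamps_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)  # seek to the action time
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))
    cap.release()
    return frames

timestamps = [0.5, 1.2, 3.0]  # placeholder action timestamps
frames = extract_frames("data/data_raw/videos/00000070.mp4", timestamps)
out = Path("data/data_resized/0000")
out.mkdir(parents=True, exist_ok=True)
with open(out / "00000070_data.pkl", "wb") as f:
    pickle.dump({"frames": frames, "actions": [], "timesteps": timestamps}, f)
```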

The processed data will be saved in data/data_resized/ with the following structure:

data/data_resized/
├── 0000/
│   ├── 00000070_data.pkl    # Contains frames, actions, timesteps
│   ├── 00000070.png         # Target CAD image
│   └── ...
└── dataset_split.json       # Train/val/test split (provided; see below)
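To sanity-check a processed sample, you can open one of the pickle files directly (key names taken from the description above; the exact array types are an assumption):

```python
import pickle

# Inspect one processed sample; keys per the preprocessing step above.
with open("data/data_resized/0000/00000070_data.pkl", "rb") as f:
    sample = pickle.load(f)

print(sorted(sample.keys()))  # e.g. ['actions', 'frames', 'timesteps']
print(len(sample["frames"]), "frames for", len(sample["actions"]), "actions")
```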

3. Dataset Split

We provide a JSON file data/data_resized/dataset_split.json that defines the train/validation/test split:

{
  "00000070": "train",
  "00000073": "val", 
  "00000123": "test",
  ...
}
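For example, to load the split and group sample IDs by partition:

```python
import json
from collections import defaultdict

with open("data/data_resized/dataset_split.json") as f:
    split = json.load(f)  # {"00000070": "train", ...}

by_split = defaultdict(list)
for sample_id, partition in split.items():
    by_split[partition].append(sample_id)

print({k: len(v) for k, v in by_split.items()})  # sizes of train/val/test
```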

VideoCADFormer

We introduce VideoCADFormer, a multimodal transformer benchmark designed for learning and evaluating long-horizon CAD user interface (UI) action prediction from video. Trained on the VideoCAD dataset, it models temporal dependencies between visual states and symbolic actions, enabling both accurate imitation and sequence planning in complex CAD environments.
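As a rough, illustration-only sketch of that modeling idea (not the released architecture; the layer sizes, the 256-token action vocabulary, and the 10-step context window below are assumptions loosely echoing the config name used later):

```python
# Sketch: encode past frames and past actions, fuse them with a causal
# transformer plus a timestep embedding, and predict the next action token.
import torch
import torch.nn as nn

class NextActionSketch(nn.Module):
    def __init__(self, n_actions=256, d_model=512, n_layers=6, context=10):
        super().__init__()
        # Stand-in visual encoder; the real model would use a proper CNN/ViT backbone.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        self.action_embed = nn.Embedding(n_actions, d_model)
        self.time_embed = nn.Embedding(context, d_model)  # "timestep embedding"
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, frames, actions):
        # frames: (B, T, 3, 224, 224) past UI states; actions: (B, T) past action ids
        B, T = actions.shape
        f = self.frame_encoder(frames.flatten(0, 1)).view(B, T, -1)
        x = f + self.action_embed(actions) + self.time_embed(
            torch.arange(T, device=actions.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(actions.device)
        h = self.transformer(x, mask=mask)
        return self.head(h[:, -1])  # logits over the next action token

model = NextActionSketch()
logits = model(torch.randn(2, 10, 3, 224, 224), torch.randint(0, 256, (2, 10)))
print(logits.shape)  # torch.Size([2, 256])
```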

Quick Start

Ensure you have the environment installed:

conda create -n videocadformer python=3.9
conda activate videocadformer
pip install -r requirements.txt

Training

Train the VideoCADFormer model using the default configuration:

python main.py

Or customize the training parameters:

python main.py \
  --dataset_path data/data_resized \
  --config_path data/data_resized/dataset_split.json \
  --model_config model_configs/transformer_experiments.json \
  --model_name cad_past_10_actions_and_states_timestep_embedding \
  --checkpoint_dir checkpoints \
  --epochs 1000 \
  --batch_size 2 \
  --num_workers 2

Evaluation

Test a trained model checkpoint:

python test.py \
  --checkpoint_folder cad_past_10_actions_and_states_timestep_embedding_2025_10_19_18_55_03 \
  --output_root_dir test_results

Or specify a custom checkpoint path:

python test.py \
  --checkpoint_folder my_experiment \
  --checkpoint_path checkpoints/my_experiment/best_model.pt \
  --dataset_path data/data_resized \
  --config_path data/data_resized/dataset_split.json \
  --output_root_dir evaluation_results

The evaluation script will generate:

  • Performance metrics: Command accuracy, parameter accuracy, sequence accuracy
  • Analysis plots: Confusion matrices, error analysis, tolerance curves
  • Sample outputs: Predicted action sequences for test samples
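As a back-of-envelope reading of what those metrics measure (the authoritative definitions are in `test.py`; the `(command_id, params)` action encoding here is an assumption):

```python
# Hypothetical metric definitions over action sequences of (command_id, params) pairs.
def command_accuracy(pred, gt):
    """Fraction of steps whose command id matches the ground truth."""
    return sum(p[0] == g[0] for p, g in zip(pred, gt)) / len(gt)

def parameter_accuracy(pred, gt, tol=0.0):
    """Among correctly-commanded steps, fraction whose params fall within tol."""
    hits = [all(abs(pp - gp) <= tol for pp, gp in zip(p[1], g[1]))
            for p, g in zip(pred, gt) if p[0] == g[0]]
    return sum(hits) / max(len(hits), 1)

def sequence_accuracy(pred, gt):
    """Exact match of the full predicted sequence (command ids and params)."""
    return float(len(pred) == len(gt) and all(p == g for p, g in zip(pred, gt)))

gt = [(3, (100.0, 200.0)), (7, (0.5,))]
pred = [(3, (101.0, 199.0)), (7, (0.5,))]
print(command_accuracy(pred, gt), parameter_accuracy(pred, gt, tol=2.0))
```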

Pre-trained Checkpoint: Download our trained VideoCADFormer model from here.


Citation

If you find our work helpful, please consider citing:

@inproceedings{manvideocad,
  title={VideoCAD: A Dataset and Model for Learning Long-Horizon 3D CAD UI Interactions from Video},
  author={Man, Brandon and Nehme, Ghadi and Alam, Md Ferdous and Ahmed, Faez},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2025}
}

Contact

For questions, issues, or collaboration, please contact: ghadi@mit.edu


No findings