DiffusionVLA

A modular, first-principles walk-through of Vision-Language-Action models

Install / Use

/learn @vaidehibagaria/DiffusionVLA

Creating a Diffusion-Based Vision-Language-Action Model

This repository contains code for training and testing a diffusion policy for Franka robot arm manipulation using image observations and text commands.

Overview

The system enables a Franka robot arm to pick and place objects based on natural language commands. It uses:

  • Image observations: RGB camera views of the workspace
  • Text commands: Natural language descriptions
  • Diffusion policy: Transformer-based action prediction using diffusion models

Files

  • dataset/collect_data.py: Collects demonstration trajectories with images, joint positions, and text labels
  • train/train.py: Trains the diffusion policy model on collected data
  • franka_test_image.py: Tests the trained model in simulation

How It Works

1. Data Collection (dataset/collect_data.py)

Collects demonstration trajectories by:

  • Capturing RGB images from an overhead camera
  • Recording arm joint positions (7D)
  • Saving text labels describing the task
  • Executing scripted pick-and-place motions

Usage:

python dataset/collect_data.py --collect --num_episodes 100

Data is automatically saved to the data/ directory at the repo root.
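Each episode pairs the camera frames with the 7-D joint trajectory and the text label. A minimal NumPy sketch of what saving one episode to data/ might look like; the function name `save_episode` and the .npz layout are illustrative assumptions, not the repository's actual storage format:

```python
import numpy as np
from pathlib import Path

def save_episode(out_dir, episode_idx, images, qpos, text):
    """Save one demonstration episode as a compressed .npz file.

    images: (T, H, W, 3) uint8 RGB frames from the overhead camera
    qpos:   (T, 7) float32 joint positions of the Franka arm
    text:   natural-language task label, e.g. "pick up the red cube"
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    np.savez_compressed(
        out_dir / f"episode_{episode_idx:04d}.npz",
        images=np.asarray(images, dtype=np.uint8),
        qpos=np.asarray(qpos, dtype=np.float32),
        text=np.array(text),
    )

# Example: a 10-step dummy episode
T = 10
save_episode("data", 0,
             images=np.zeros((T, 64, 64, 3), dtype=np.uint8),
             qpos=np.zeros((T, 7), dtype=np.float32),
             text="pick up the red cube")
```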

2. Training (train/train.py)

Trains a diffusion transformer model that:

  • Encodes images using ResNet/ViT
  • Encodes text using CLIP (default) or SigLIP
  • Predicts action sequences using diffusion denoising
  • Learns from demonstration trajectories

Usage:

python train/train.py
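The core of the training step is the standard DDPM denoising objective: sample a random timestep, corrupt the ground-truth action sequence with scheduled Gaussian noise, and train the network to predict that noise. A minimal NumPy sketch of this objective, assuming a DDPM-style linear beta schedule (the actual loop in train/train.py uses PyTorch and the diffusers scheduler; all names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule, as in DDPM
num_steps = 100
betas = np.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = np.cumprod(1.0 - betas)

def add_noise(actions, noise, t):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * noise

# One training example: a horizon of 16 future 7-D actions
actions = rng.standard_normal((16, 7))
t = int(rng.integers(0, num_steps))
noise = rng.standard_normal(actions.shape)
noisy_actions = add_noise(actions, noise, t)

# The transformer (conditioned on the image/text/qpos tokens) would predict
# `noise` from `noisy_actions`; training minimizes the MSE to the true noise.
predicted_noise = noise  # stand-in for a perfect network
loss = np.mean((predicted_noise - noise) ** 2)
```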

3. Testing (franka_test_image.py)

Runs the trained model in closed-loop control:

  • Captures current image and joint positions
  • Processes text command
  • Predicts action sequence using diffusion
  • Executes actions and replans

Usage:

python franka_test_image.py --checkpoint outputs/franka_arm_image_training/checkpoints/final --view
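The closed-loop control described above can be sketched as a receding-horizon loop: predict a full action sequence, execute only the first few steps, then replan from the new observation. `policy` and `env` below are hypothetical stand-ins for the trained diffusion model and the MuJoCo simulation, not the repository's actual interfaces:

```python
def run_closed_loop(policy, env, text_command,
                    horizon=16, execute_steps=4, max_steps=32):
    """Receding-horizon execution: predict, execute a few steps, replan."""
    obs = env.reset()
    executed = 0
    while executed < max_steps:
        # Predict `horizon` future actions from the current image + qpos + text
        action_seq = policy(obs, text_command)
        # Execute only the first `execute_steps` actions, then replan
        for action in action_seq[:execute_steps]:
            obs = env.step(action)
            executed += 1
            if executed >= max_steps:
                break
    return obs

# Dummy environment and policy to demonstrate the control flow
class DummyEnv:
    def reset(self):
        return {"qpos": [0.0] * 7}
    def step(self, action):
        return {"qpos": action}

dummy_policy = lambda obs, text: [[0.1] * 7 for _ in range(16)]
final_obs = run_closed_loop(dummy_policy, DummyEnv(), "pick up the red cube")
```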

Architecture

The model processes observations as follows:

  • Token 1: Concatenated arm joint positions (7D)
  • Token 2: Image embedding from ResNet
  • Token 3: Projected text embedding
  • Each timestep therefore contributes 3 tokens, in the order [qpos, image, text]
  • Transformer processes these tokens to predict action sequences
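Assembling the per-timestep tokens can be sketched as follows; the function name, the model dimension of 512, and zero-padding the 7-D qpos (rather than a learned linear projection, which a real implementation would use) are illustrative assumptions:

```python
import numpy as np

def build_tokens(qpos, image_emb, text_emb, d_model=512):
    """Assemble the 3 per-timestep input tokens for the transformer.

    qpos:      (7,) joint positions, padded here to d_model for illustration
    image_emb: (d_model,) image embedding (e.g. from ResNet)
    text_emb:  (d_model,) projected text embedding (from frozen CLIP/SigLIP)
    Returns a (3, d_model) token sequence in the order [qpos, image, text].
    """
    qpos_token = np.zeros(d_model, dtype=np.float32)
    qpos_token[:7] = qpos  # a real model would use a learned projection
    return np.stack([qpos_token, image_emb, text_emb])

tokens = build_tokens(np.arange(7, dtype=np.float32),
                      np.ones(512, dtype=np.float32),
                      np.ones(512, dtype=np.float32))
```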

Text Encoders

CLIP (Default)

  • Model: ViT-B-32 (default)
  • Pretrained: openai (default)
  • Install: pip install open-clip-torch

SigLIP (Optional)

  • Model: ViT-B-16-SigLIP (example)
  • Pretrained: webli (example)
  • Install: pip install open-clip-torch (same as CLIP)

To use SigLIP, modify the config in train/train.py:

text_encoder_type: str = "siglip"
clip_model_name: str = "ViT-B-16-SigLIP"
clip_pretrained: str = "webli"

Example Results

A test video of the Franka arm picking up an object is available under Demo below.

Demo

The trained model successfully:

  • Interprets natural language commands
  • Uses visual observations to locate objects
  • Executes pick-and-place actions
  • Generalizes to different object positions

Requirements

Install all dependencies using: pip install -r requirements.txt

  • Python 3.8+
  • PyTorch
  • MuJoCo
  • open-clip-torch (for CLIP/SigLIP)
  • diffusers (for diffusion scheduler)

Notes

  • The text encoder (CLIP or SigLIP) is frozen during training
  • A learnable projection layer adapts text embeddings to the task
  • The model uses receding horizon control: predicts full sequence, executes first few steps, then replans
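The first two notes amount to a frozen encoder followed by a trainable linear map. A minimal NumPy sketch, assuming a 512-D CLIP embedding projected to a hypothetical 256-D token dimension; only W and b would receive gradient updates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Output of the frozen text encoder (e.g. a 512-D CLIP embedding).
# Its weights receive no gradient during training.
clip_dim, d_model = 512, 256
text_emb = rng.standard_normal(clip_dim)

# Learnable linear projection adapting the frozen embedding to the
# policy's token dimension; only W and b are updated by the optimizer.
W = rng.standard_normal((d_model, clip_dim)) * 0.02
b = np.zeros(d_model)
projected = W @ text_emb + b
```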
