[ICML 2024] A Touch, Vision, and Language Dataset for Multimodal Alignment

A Touch, Vision, and Language Dataset for Multimodal Alignment

by <a href="https://max-fu.github.io">Max (Letian) Fu</a>, <a href="https://www.linkedin.com/in/gaurav-datta/">Gaurav Datta*</a>, <a href="https://qingh097.github.io/">Huang Huang*</a>, <a href="https://autolab.berkeley.edu/people">William Chung-Ho Panitch*</a>, <a href="https://www.linkedin.com/in/jaimyn-drake/">Jaimyn Drake*</a>, <a href="https://joeaortiz.github.io/">Joseph Ortiz</a>, <a href="https://www.mustafamukadam.com/">Mustafa Mukadam</a>, <a href="https://scholar.google.com/citations?user=p6DCMrQAAAAJ&hl=en">Mike Lambeta</a>, <a href="https://lasr.org/">Roberto Calandra</a>, <a href="https://goldberg.berkeley.edu">Ken Goldberg</a> at UC Berkeley, Meta AI, TU Dresden, and CeTI (*equal contribution).

[Paper] | [Project Page] | [Checkpoints] | [Dataset] | [Citation]

<p align="center"> <img src="img/splash_figure_alt.png" width="800"> </p>

This repo contains the official implementation of A Touch, Vision, and Language Dataset for Multimodal Alignment. The code is based on the MAE, CrossMAE, and ImageBind-LLM repos.

Instructions

Please install the dependencies in requirements.txt:

```bash
# Optionally create a conda environment
conda create -n tvl python=3.10 -y
conda activate tvl
conda install pytorch==2.1.2 cudatoolkit==11.8.0 -c pytorch -y
# Install dependencies
pip install packaging
pip install -r requirements.txt
pip install -e .
```

Dataset

The dataset is hosted on HuggingFace. To use it, first download it through the web interface or clone it with git:

```bash
# install git-lfs
sudo apt install git-lfs
git lfs install
# clone the dataset
git clone git@hf.co:datasets/mlfu7/Touch-Vision-Language-Dataset
# or you can download the zip files manually from here: https://huggingface.co/datasets/mlfu7/Touch-Vision-Language-Dataset/tree/main
cd Touch-Vision-Language-Dataset
# merge the sharded archive into a single zip, then extract it
zip -s0 tvl_dataset_sharded.zip --out tvl_dataset.zip
unzip tvl_dataset.zip
```
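Once extraction finishes, a quick stdlib check can confirm the tree unpacked completely. The helper below is a generic sketch; consult the dataset card for the actual folder layout, since the names and counts it would report are not documented here:

```python
from pathlib import Path

def count_files(root):
    """Map each top-level subfolder of the extracted dataset to the
    number of files (of any type) found anywhere beneath it."""
    root = Path(root)
    return {d.name: sum(1 for p in d.rglob("*") if p.is_file())
            for d in sorted(root.iterdir()) if d.is_dir()}

# e.g. count_files("tvl_dataset") prints a {subfolder: file_count} dict;
# compare it against the dataset card to spot an incomplete extraction.
```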

Improved version available

There is an improved version of this dataset that fixes the swapped tactile/image folders. You can find it here:

```bash
# clone the revised dataset
git clone git@hf.co:datasets/yoorhim/TVL-revise
# or you can download the zip files manually from here: https://huggingface.co/datasets/yoorhim/TVL-revise/tree/main
```

Models

Touch-Vision-Language (TVL) models fall into two groups: 1) tactile encoders aligned to the CLIP latent space, and 2) TVL-LLaMA, a variant of ImageBind-LLM finetuned on the TVL dataset. The tactile encoders come in three sizes: ViT-Tiny, ViT-Small, and ViT-Base, so we provide three corresponding TVL-LLaMA variants. The statistics below differ from those in the paper because the checkpoints were re-trained with this repository.
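Aligning an encoder to the CLIP latent space is typically done with a symmetric contrastive (InfoNCE-style) objective over paired embeddings. The NumPy sketch below illustrates that general recipe with random features; it is not the repo's exact loss or hyperparameters:

```python
import numpy as np

def infonce_loss(tactile, vision, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    tactile, vision: (N, D) arrays of L2-normalized features;
    matching rows are positive pairs, all other rows are negatives.
    """
    logits = tactile @ vision.T / temperature   # (N, N) similarity matrix
    targets = np.arange(len(logits))            # positives sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[targets, targets].mean()

    # average the touch->vision and vision->touch directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
t = rng.normal(size=(8, 32)); t /= np.linalg.norm(t, axis=1, keepdims=True)
v = rng.normal(size=(8, 32)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(infonce_loss(t, v))   # unrelated pairs: loss near ln(8)
print(infonce_loss(t, t))   # perfectly aligned pairs: loss near 0
```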

Tactile Encoders

For zero-shot classification, we use OpenCLIP with the following configuration:

```python
CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"
```
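Zero-shot classification then reduces to scoring a tactile embedding against the text embeddings of candidate labels by cosine similarity. A minimal sketch with placeholder features (in the real pipeline, the label embeddings would come from the OpenCLIP text encoder configured above):

```python
import numpy as np

def zero_shot(query, label_embeds, labels):
    """Return the label whose L2-normalized text embedding is most
    cosine-similar to the query embedding."""
    q = query / np.linalg.norm(query)
    L = label_embeds / np.linalg.norm(label_embeds, axis=1, keepdims=True)
    return labels[int(np.argmax(L @ q))]

rng = np.random.default_rng(1)
labels = ["smooth", "rough", "squishy"]           # hypothetical label set
label_embeds = rng.normal(size=(3, 16))           # placeholder text features
# a query lying close to the "rough" embedding should pick "rough"
query = label_embeds[1] + 0.01 * rng.normal(size=16)
print(zero_shot(query, label_embeds, labels))     # -> rough
```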

The checkpoints for the tactile encoders are provided below:

<table><tbody> <!-- START TABLE --> <!-- TABLE HEADER --> <th valign="bottom"></th> <th valign="bottom">ViT-Tiny</th> <th valign="bottom">ViT-Small</th> <th valign="bottom">ViT-Base</th> <!-- TABLE BODY --> <tr><td align="left">Tactile Encoder</td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_enc/tvl_enc_vittiny.pth?download=true'>download</a></td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_enc/tvl_enc_vits.pth?download=true'>download</a></td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_enc/tvl_enc_vitb.pth?download=true'>download</a></td> </tr> <tr><td align="left">Touch-Language Acc (@0.64)</td> <td align="center">36.19%</td> <td align="center">36.82%</td> <td align="center">30.85%</td> </tr> <tr><td align="left">Touch-Vision Acc</td> <td align="center">78.11%</td> <td align="center">77.49%</td> <td align="center">81.22%</td> </tr> </tbody></table>

TVL-LLaMA

Please request access to the pre-trained LLaMA-2 weights via this form. In particular, we use llama-2-7b as the base model. The weights below contain the trained adapter, the tactile encoder, and the vision encoder for ease of loading. The checkpoints for TVL-LLaMA are provided below:

<table><tbody> <!-- START TABLE --> <!-- TABLE HEADER --> <th valign="bottom"></th> <th valign="bottom">ViT-Tiny</th> <th valign="bottom">ViT-Small</th> <th valign="bottom">ViT-Base</th> <!-- TABLE BODY --> <tr><td align="left">TVL-LLaMA</td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_llama/tvl_llama_vittiny.pth?download=true'>download</a></td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_llama/tvl_llama_vits.pth?download=true'>download</a></td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_llama/tvl_llama_vitb.pth?download=true'>download</a></td> </tr> <tr><td align="left">Reference TVL Benchmark Score (1-10)</td> <td align="center">5.03</td> <td align="center">5.01</td> <td align="center"> 4.87</td> </tr> </tbody></table>

Training And Evaluation

We provide the tactile encoder training script in tvl_enc and the TVL-LLaMA training script in tvl_llama. TVL-Benchmark is described here.

License

This project is under the Apache 2.0 license. See LICENSE for details.

Citation

Please give us a star 🌟 on GitHub to support us!

Please cite our work if you find it inspiring or use our code in your work:

```bibtex
@inproceedings{
    fu2024a,
    title={A Touch, Vision, and Language Dataset for Multimodal Alignment},
    author={Letian Fu and Gaurav Datta and Huang Huang and William Chung-Ho Panitch and Jaimyn Drake and Joseph Ortiz and Mustafa Mukadam and Mike Lambeta and Roberto Calandra and Ken Goldberg},
    booktitle={Forty-first International Conference on Machine Learning},
    year={2024},
    url={https://openreview.net/forum?id=tFEOOH9eH0}
}
```
