[ICML 2024] A Touch, Vision, and Language Dataset for Multimodal Alignment
by <a href="https://max-fu.github.io">Max (Letian) Fu</a>, <a href="https://www.linkedin.com/in/gaurav-datta/">Gaurav Datta*</a>, <a href="https://qingh097.github.io/">Huang Huang*</a>, <a href="https://autolab.berkeley.edu/people">William Chung-Ho Panitch*</a>, <a href="https://www.linkedin.com/in/jaimyn-drake/">Jaimyn Drake*</a>, <a href="https://joeaortiz.github.io/">Joseph Ortiz</a>, <a href="https://www.mustafamukadam.com/">Mustafa Mukadam</a>, <a href="https://scholar.google.com/citations?user=p6DCMrQAAAAJ&hl=en">Mike Lambeta</a>, <a href="https://lasr.org/">Roberto Calandra</a>, <a href="https://goldberg.berkeley.edu">Ken Goldberg</a> at UC Berkeley, Meta AI, TU Dresden, and CeTI (*equal contribution).
[Paper] | [Project Page] | [Checkpoints] | [Dataset] | [Citation]
<p align="center"> <img src="img/splash_figure_alt.png" width="800"> </p>This repo contains the official implementation of A Touch, Vision, and Language Dataset for Multimodal Alignment. The code is based on the MAE, CrossMAE, and ImageBind-LLM repos.
Instructions
Please install the dependencies in requirements.txt:
# Optionally create a conda environment
conda create -n tvl python=3.10 -y
conda activate tvl
conda install pytorch==2.1.2 cudatoolkit==11.8.0 -c pytorch -y
# Install dependencies
pip install packaging
pip install -r requirements.txt
pip install -e .
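After installation, a quick sanity check can confirm the key packages are importable. This is a minimal sketch; the exact package list depends on requirements.txt, and the names below are illustrative:

```python
from importlib.util import find_spec

def check_env(packages):
    """Map each package name to whether it can be imported in this env."""
    return {p: find_spec(p) is not None for p in packages}

# "torch" comes from the conda step above; "torchvision" is a guess
print(check_env(["torch", "torchvision"]))
```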
Dataset
The dataset is hosted on HuggingFace. To use it, first download the files via the web GUI or with git:
# install git-lfs
sudo apt install git-lfs
git lfs install
# clone the dataset
git clone git@hf.co:datasets/mlfu7/Touch-Vision-Language-Dataset
# or you can download the zip files manually from here: https://huggingface.co/datasets/mlfu7/Touch-Vision-Language-Dataset/tree/main
cd Touch-Vision-Language-Dataset
zip -s0 tvl_dataset_sharded.zip --out tvl_dataset.zip
unzip tvl_dataset.zip
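Once extracted, you can get a quick overview of the contents. This sketch only counts files per top-level subfolder and makes no assumptions about the dataset's internal layout; the path in the comment is hypothetical:

```python
from pathlib import Path

def summarize(root):
    """Count files under each top-level subfolder of the extracted dataset."""
    return {
        sub.name: sum(1 for f in sub.rglob("*") if f.is_file())
        for sub in sorted(Path(root).iterdir())
        if sub.is_dir()
    }

# e.g. summarize("tvl_dataset")
```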
Improved version available
There’s an improved version of this dataset, which fixes the tactile/image folder swapping issue. You can find it here:
# clone the revised dataset
git clone git@hf.co:datasets/yoorhim/TVL-revise
# or you can download the zip files manually from here: https://huggingface.co/datasets/yoorhim/TVL-revise/tree/main
Models
Touch-Vision-Language (TVL) models fall into two groups: 1) tactile encoders aligned to the CLIP latent space and 2) TVL-LLaMA, a variant of ImageBind-LLM fine-tuned on the TVL dataset. The tactile encoders come in three sizes: ViT-Tiny, ViT-Small, and ViT-Base, so we provide three corresponding TVL-LLaMA variants. The statistics presented here differ from those in the paper because the checkpoints were re-trained with this repository.
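Aligning a tactile encoder to the CLIP latent space is typically done with a contrastive objective over paired embeddings. Below is a hedged, generic InfoNCE-style sketch in numpy; it is not the repo's actual training code, and the temperature value is illustrative:

```python
import numpy as np

def info_nce(tactile, vision, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    tactile, vision: (N, D) arrays; row i of each is a matched pair.
    """
    t = tactile / np.linalg.norm(tactile, axis=1, keepdims=True)
    v = vision / np.linalg.norm(vision, axis=1, keepdims=True)
    logits = t @ v.T / temperature      # (N, N): positives on the diagonal
    labels = np.arange(len(t))

    def xent(l):
        # row-wise cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average both directions: tactile->vision and vision->tactile
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the loss approaches zero; mismatched pairs drive it up, which is what pulls the tactile tower toward the frozen CLIP embedding space.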
Tactile Encoders
For zero-shot classification, we use OpenCLIP with the following configuration:
CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"
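Zero-shot classification then reduces to ranking text-prompt embeddings by cosine similarity with a query embedding. A minimal numpy sketch of that ranking step (model loading omitted; in the real pipeline the embeddings would come from the OpenCLIP text tower and the tactile encoder):

```python
import numpy as np

def zero_shot_classify(query_emb, prompt_embs):
    """Return the index of the prompt embedding most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    p = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return int(np.argmax(p @ q))
```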
The checkpoints for the tactile encoders are provided below:
<table><tbody>
<!-- START TABLE -->
<!-- TABLE HEADER -->
<th valign="bottom"></th>
<th valign="bottom">ViT-Tiny</th>
<th valign="bottom">ViT-Small</th>
<th valign="bottom">ViT-Base</th>
<!-- TABLE BODY -->
<tr><td align="left">Tactile Encoder</td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_enc/tvl_enc_vittiny.pth?download=true'>download</a></td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_enc/tvl_enc_vits.pth?download=true'>download</a></td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_enc/tvl_enc_vitb.pth?download=true'>download</a></td> </tr>
<tr><td align="left">Touch-Language Acc (@0.64)</td> <td align="center">36.19%</td> <td align="center">36.82%</td> <td align="center">30.85%</td> </tr>
<tr><td align="left">Touch-Vision Acc</td> <td align="center">78.11%</td> <td align="center">77.49%</td> <td align="center">81.22%</td> </tr>
</tbody></table>

TVL-LLaMA
Please request access to the pre-trained LLaMA-2 weights via this form. In particular, we use llama-2-7b as the base model. The weights here contain the trained adapter, the tactile encoder, and the vision encoder for ease of loading.
The checkpoints for TVL-LLaMA are provided below:
Training And Evaluation
We provide the tactile encoder training script in tvl_enc and the TVL-LLaMA training script in tvl_llama. In particular, TVL-Benchmark is described here.
License
This project is under the Apache 2.0 license. See LICENSE for details.
Citation
Please give us a star 🌟 on Github to support us!
Please cite our work if you find it inspiring or use our code in your work:
@inproceedings{
fu2024a,
title={A Touch, Vision, and Language Dataset for Multimodal Alignment},
author={Letian Fu and Gaurav Datta and Huang Huang and William Chung-Ho Panitch and Jaimyn Drake and Joseph Ortiz and Mustafa Mukadam and Mike Lambeta and Roberto Calandra and Ken Goldberg},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=tFEOOH9eH0}
}