[ICML 2024] A Touch, Vision, and Language Dataset for Multimodal Alignment
by <a href="https://max-fu.github.io">Max (Letian) Fu</a>, <a href="https://www.linkedin.com/in/gaurav-datta/">Gaurav Datta*</a>, <a href="https://qingh097.github.io/">Huang Huang*</a>, <a href="https://autolab.berkeley.edu/people">William Chung-Ho Panitch*</a>, <a href="https://www.linkedin.com/in/jaimyn-drake/">Jaimyn Drake*</a>, <a href="https://joeaortiz.github.io/">Joseph Ortiz</a>, <a href="https://www.mustafamukadam.com/">Mustafa Mukadam</a>, <a href="https://scholar.google.com/citations?user=p6DCMrQAAAAJ&hl=en">Mike Lambeta</a>, <a href="https://lasr.org/">Roberto Calandra</a>, <a href="https://goldberg.berkeley.edu">Ken Goldberg</a> at UC Berkeley, Meta AI, TU Dresden, and CeTI (*equal contribution).
[Paper] | [Project Page] | [Checkpoints] | [Dataset] | [Citation]
<p align="center"> <img src="img/splash_figure_alt.png" width="800"> </p>This repo contains the official implementation of A Touch, Vision, and Language Dataset for Multimodal Alignment. The code is based on the MAE, CrossMAE, and ImageBind-LLM repos.
Instructions
Please install the dependencies in requirements.txt:
# Optionally create a conda environment
conda create -n tvl python=3.10 -y
conda activate tvl
conda install pytorch==2.1.2 cudatoolkit==11.8.0 -c pytorch -y
# Install dependencies
pip install packaging
pip install -r requirements.txt
pip install -e .
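After installation, a quick sanity check can confirm the key packages are importable. This is a minimal sketch; the exact package list depends on requirements.txt, and the names below are illustrative:

```python
from importlib.util import find_spec

def check_env(packages):
    """Map each package name to whether it can be imported in this env."""
    return {p: find_spec(p) is not None for p in packages}

# "torch" comes from the conda step above; "torchvision" is a guess
print(check_env(["torch", "torchvision"]))
```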
Dataset
The dataset is hosted on HuggingFace. To use it, first download the files via the web GUI or with git:
# install git-lfs
sudo apt install git-lfs
git lfs install
# clone the dataset
git clone git@hf.co:datasets/mlfu7/Touch-Vision-Language-Dataset
# or you can download the zip files manually from here: https://huggingface.co/datasets/mlfu7/Touch-Vision-Language-Dataset/tree/main
cd Touch-Vision-Language-Dataset
zip -s0 tvl_dataset_sharded.zip --out tvl_dataset.zip
unzip tvl_dataset.zip
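Once extracted, you can get a quick overview of the contents. This sketch only counts files per top-level subfolder and makes no assumptions about the dataset's internal layout; the path in the comment is hypothetical:

```python
from pathlib import Path

def summarize(root):
    """Count files under each top-level subfolder of the extracted dataset."""
    return {
        sub.name: sum(1 for f in sub.rglob("*") if f.is_file())
        for sub in sorted(Path(root).iterdir())
        if sub.is_dir()
    }

# e.g. summarize("tvl_dataset")
```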
Improved version available
There’s an improved version of this dataset, which fixes the tactile/image folder swapping issue. You can find it here:
# clone the revised dataset
git clone git@hf.co:datasets/yoorhim/TVL-revise
# or you can download the zip files manually from here: https://huggingface.co/datasets/yoorhim/TVL-revise/tree/main
Models
Touch-Vision-Language (TVL) models fall into two groups: 1) tactile encoders aligned to the CLIP latent space and 2) TVL-LLaMA, a variant of ImageBind-LLM fine-tuned on the TVL dataset. The tactile encoders come in three sizes: ViT-Tiny, ViT-Small, and ViT-Base, so we provide three corresponding TVL-LLaMA variants. The statistics presented here differ from those in the paper because the checkpoints were re-trained with this repository.
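Aligning a tactile encoder to the CLIP latent space is typically done with a contrastive objective over paired embeddings. Below is a hedged, generic InfoNCE-style sketch in numpy; it is not the repo's actual training code, and the temperature value is illustrative:

```python
import numpy as np

def info_nce(tactile, vision, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    tactile, vision: (N, D) arrays; row i of each is a matched pair.
    """
    t = tactile / np.linalg.norm(tactile, axis=1, keepdims=True)
    v = vision / np.linalg.norm(vision, axis=1, keepdims=True)
    logits = t @ v.T / temperature      # (N, N): positives on the diagonal
    labels = np.arange(len(t))

    def xent(l):
        # row-wise cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average both directions: tactile->vision and vision->tactile
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the loss approaches zero; mismatched pairs drive it up, which is what pulls the tactile tower toward the frozen CLIP embedding space.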
Tactile Encoders
For zero-shot classification, we use OpenCLIP with the following configuration:
CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"
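Zero-shot classification then reduces to ranking text-prompt embeddings by cosine similarity with a query embedding. A minimal numpy sketch of that ranking step (model loading omitted; in the real pipeline the embeddings would come from the OpenCLIP text tower and the tactile encoder):

```python
import numpy as np

def zero_shot_classify(query_emb, prompt_embs):
    """Return the index of the prompt embedding most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    p = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return int(np.argmax(p @ q))
```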
The checkpoints for the tactile encoders are provided below:
<table><tbody>
<!-- START TABLE -->
<!-- TABLE HEADER -->
<th valign="bottom"></th>
<th valign="bottom">ViT-Tiny</th>
<th valign="bottom">ViT-Small</th>
<th valign="bottom">ViT-Base</th>
<!-- TABLE BODY -->
<tr><td align="left">Tactile Encoder</td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_enc/tvl_enc_vittiny.pth?download=true'>download</a></td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_enc/tvl_enc_vits.pth?download=true'>download</a></td> <td align="center"><a href='https://huggingface.co/mlfu7/Touch-Vision-Language-Models/resolve/main/ckpt/tvl_enc/tvl_enc_vitb.pth?download=true'>download</a></td> </tr>
<tr><td align="left">Touch-Language Acc (@0.64)</td> <td align="center">36.19%</td> <td align="center">36.82%</td> <td align="center">30.85%</td> </tr>
<tr><td align="left">Touch-Vision Acc</td> <td align="center">78.11%</td> <td align="center">77.49%</td> <td align="center">81.22%</td> </tr>
</tbody></table>

TVL-LLaMA
Please request access to the pre-trained LLaMA-2 weights via this form. In particular, we use llama-2-7b as the base model. The weights here contain the trained adapter, the tactile encoder, and the vision encoder for ease of loading.
The checkpoints for TVL-LLaMA are provided below:
Training And Evaluation
We provide the tactile encoder training script in tvl_enc and the TVL-LLaMA training script in tvl_llama. In particular, TVL-Benchmark is described here.
License
This project is under the Apache 2.0 license. See LICENSE for details.
Citation
Please give us a star 🌟 on Github to support us!
Please cite our work if you find it inspiring or use our code in your work:
@inproceedings{
fu2024a,
title={A Touch, Vision, and Language Dataset for Multimodal Alignment},
author={Letian Fu and Gaurav Datta and Huang Huang and William Chung-Ho Panitch and Jaimyn Drake and Joseph Ortiz and Mustafa Mukadam and Mike Lambeta and Roberto Calandra and Ken Goldberg},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=tFEOOH9eH0}
}