Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".
Listen, Think, and Understand
- Introduction
- Citation
- OpenAQA (LTU) and OpenASQA (LTU-AS) Dataset
- Set the Virtual Environment
- Inference
- Finetune LTU and LTU-AS
- Reproduce LTU and LTU-AS Training
- Pretrained Models
- Important Code
- Required Computational Resources
- Mirror Links
Introduction
<p align="center"><img src="https://github.com/YuanGongND/ltu/blob/main/ltu.png?raw=true" alt="Illustration of LTU." width="900"/></p>
This repository contains the official implementation (in PyTorch), pretrained checkpoints, and datasets of LTU and LTU-AS. LTU and LTU-AS are the first generation of audio and speech large language models that bridge audio/speech perception with understanding. They not only achieve SOTA on multiple closed-ended audio and speech tasks, but can also answer arbitrary open-ended questions about a given audio clip. Please try the interactive demos to see how well they work!
Citation
LTU (First Generation, Only Supports Audio):
LTU was accepted at ICLR 2024. See you in Vienna!
[Paper] [HuggingFace Space] [ICLR Peer Review]
Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)
@article{gong2023listen,
title={Listen, Think, and Understand},
author={Gong, Yuan and Luo, Hongyin and Liu, Alexander H and Karlinsky, Leonid and Glass, James},
journal={arXiv preprint arXiv:2305.10790},
year={2023}
}
LTU-AS (Second Generation, Supports Speech and Audio):
LTU-AS was accepted at ASRU 2023 (top 3% paper). See you in Taipei!
[Paper] [HuggingFace Space] [ASRU Peer Review]
Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)
@inproceedings{gong_ltuas,
title={Joint Audio and Speech Understanding},
author={Gong, Yuan and Liu, Alexander H and Luo, Hongyin and Karlinsky, Leonid and Glass, James},
year={2023},
booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
}
OpenAQA (LTU) and OpenASQA (LTU-AS) Dataset
We release the training data for LTU (OpenAQA) and LTU-AS (OpenASQA). Specifically, we release the (question, answer, audio_id) tuples.
The actual audio files are from existing public datasets and need to be downloaded by the users.
We provide the full dataset (including all AQAs) as well as breakdowns (closed-ended and open-ended subsets, subsets of each original dataset, etc.). All links are hosted on Dropbox and support wget.
For LTU (OpenAQA)
Toy Set (Contains Raw Audio Files, for Testing Purposes Only):
OpenAQA Training (Only Audio Datasets, 5.6M AQAs in Total):
Full Dataset (2.3GB): [Download]
Breakdown Subsets: [Download]
LTU Evaluation Data: [Download]
For LTU-AS (OpenASQA)
Toy Set (Contains Raw Audio Files, for Testing Purposes Only):
For LTU-AS: [Meta] [Audio and Whisper Feature]
OpenASQA Training (Audio and Speech Datasets, 10.2M AQAs in Total):
Full Dataset (4.6GB): [Download]
Breakdown Subsets: [Download]
LTU-AS Evaluation Data: [Download]
When preparing audio files, please make sure all of them are sampled at 16 kHz.
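As a quick sanity check, you can verify a file's sampling rate with Python's standard-library wave module (a minimal sketch; the helper name is ours, it only covers .wav files, and the resampling itself would be done with a tool such as ffmpeg or torchaudio):

```python
import wave

def is_16khz_wav(path):
    """Return True if the WAV file at `path` is sampled at 16 kHz.

    Uses only the standard library, so it handles .wav files; for other
    formats (e.g. .flac) use a library such as soundfile instead."""
    with wave.open(path, "rb") as f:
        return f.getframerate() == 16000
```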
The dataset is a JSON file containing a list of dicts in the following format (the %-comments are annotations and are not part of the actual JSON):
[
{
"instruction": "What is the significance of the sound of crying in this audio clip?", % the question
"input": "I am so sad...", % the speech content
"audio_id": "/data/sls/audioset/dave_version/audio/LZq4Neh-oWU.flac", % the audio id
"dataset": "as_strong_train", % the original dataset (optional)
"task": "open-ended question", % question type (optional)
"output": "The sound of crying suggests that there is a sad or emotional situation happening in the audio clip." % the answer
},
...
]
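For illustration, a file in this format can be loaded and sliced with a few lines of standard-library Python (a sketch; the function names are ours, and the optional "task" and "dataset" fields may be absent in some entries):

```python
import json
from collections import Counter

def load_aqa(path, task=None):
    """Load a list of AQA dicts; optionally keep only one question type."""
    with open(path) as f:
        data = json.load(f)
    if task is not None:
        data = [d for d in data if d.get("task") == task]
    return data

def task_breakdown(data):
    """Count how many AQA pairs each question type contributes."""
    return Counter(d.get("task", "unknown") for d in data)
```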
Set the Virtual Environment
For almost all usages, you will need to set up a virtual environment.
Note that LTU and LTU-AS require different environments: their customized hf-dev and peft-main packages are different. Please do not mix the virtual environments of LTU and LTU-AS.
Clone or download this repository as ltu-main, then,
For LTU:
cd /ltu-main/src/ltu
conda create --name venv_ltu python=3.10
conda activate venv_ltu
pip install -r requirements.txt
# install the customized Hugging Face Transformers; the stock version won't work
pip install -e hf-dev/transformers-main
# install the customized Hugging Face PEFT; the stock version won't work
pip install -e peft-main
For LTU-AS:
cd /ltu-main/src/ltu_as
conda create --name venv_ltu_as python=3.10
conda activate venv_ltu_as
pip install -r requirements.txt
# install the customized Hugging Face Transformers; the stock version won't work
pip install -e hf-dev/transformers-main
# install the customized Hugging Face PEFT; the stock version won't work
pip install -e peft-main/
# install the customized OpenAI Whisper; the stock version won't work
pip install -e whisper/
Inference
We provide three options for inference.
Option 1. Inference via HuggingFace Space (No Code Needed)
<p align="center"><img src="https://github.com/YuanGongND/ltu/blob/main/usage.gif?raw=true" alt="Demo of LTU inference usage." width="900"/></p>
Option 2. Inference with API (No GPU Needed)
API supports batch inference with a simple for loop.
!pip install gradio_client
For LTU:
from gradio_client import Client
client = Client("https://yuangongfdu-ltu.hf.space/")
result = client.predict(
"path_to_your_wav/audio.wav", # your audio file, sampled at 16 kHz
"What can be inferred from the audio?", # your question
api_name="/predict"
)
print(result)
For LTU-AS:
# For LTU-AS
from gradio_client import Client
client = Client("https://yuangongfdu-ltu-2.hf.space/")
result = client.predict(
"path_to_your_wav/audio.wav", # your audio file, sampled at 16 kHz
"",
"What can be inferred from the audio?", # your question
"7B (Default)", # str in 'LLM size' Radio component
api_name="/predict"
)
print(result)
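Batch inference is then just a loop over (audio, question) pairs. A minimal sketch, where batch_predict is our own helper written against the LTU call signature above (for LTU-AS, add the extra positional arguments accordingly); it works with any object exposing a gradio_client-style predict method:

```python
def batch_predict(client, items):
    """Query the demo API once per (audio_path, question) pair.

    `client` is a gradio_client.Client (or any object with a compatible
    .predict method); results are returned in input order."""
    results = []
    for audio_path, question in items:
        results.append(
            client.predict(audio_path, question, api_name="/predict")
        )
    return results
```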
Option 3. Local Inference
For users interested in training/finetuning, we suggest starting by running inference, as this helps with debugging.
The bash scripts automatically download the default LTU/LTU-AS models; you do not need to download them yourself.
inference_gradio.py can be run on CPU or GPU.
For LTU:
conda activate venv_ltu
cd ltu-main/src/ltu
chmod 777 *
./inference.sh
The script may output some warnings, which can be safely ignored. When it finishes, it prints a Gradio link that can be opened in a browser on any machine. You can also modify the script to run inference directly in a local terminal.
We also provide a batch inference script, inference_batch.py.
For LTU-AS:
conda activate venv_ltu_as
cd ltu-main/src/ltu_as
chmod 777 *
./inference.sh
The script may output some warnings, which can be safely ignored. When it finishes, it prints a Gradio link that can be opened in a browser on any machine.