
Listen, Think, and Understand


Introduction

<p align="center"><img src="https://github.com/YuanGongND/ltu/blob/main/ltu.png?raw=true" alt="Illustration of LTU." width="900"/></p>

This repository contains the official PyTorch implementation, pretrained checkpoints, and datasets of LTU and LTU-AS. LTU and LTU-AS are the first generation of audio and speech large language models that bridge audio/speech perception with understanding. They not only achieve SOTA performance on multiple closed-ended audio and speech tasks, but can also answer open-ended questions about a given audio clip. Please try the interactive demos to see how good they are!

[LTU Interactive Demo]

[LTU-AS Interactive Demo]


Citation

LTU (First Generation, Only Supports Audio):

LTU was accepted at ICLR 2024. See you in Vienna!

[Paper] [HuggingFace Space] [ICLR Peer Review]

Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)

@article{gong2023listen,
  title={Listen, Think, and Understand},
  author={Gong, Yuan and Luo, Hongyin and Liu, Alexander H and Karlinsky, Leonid and Glass, James},
  journal={arXiv preprint arXiv:2305.10790},
  year={2023}
}

LTU-AS (Second Generation, Supports Speech and Audio):

LTU-AS was accepted at ASRU 2023 (top 3% paper). See you in Taipei!

[Paper] [HuggingFace Space] [ASRU Peer Review]

Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)

@inproceedings{gong_ltuas,
  title={Joint Audio and Speech Understanding},
  author={Gong, Yuan and Liu, Alexander H and Luo, Hongyin and Karlinsky, Leonid and Glass, James},
  year={2023},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
}

OpenAQA (LTU) and OpenASQA (LTU-AS) Dataset

We release the training data for LTU (OpenAQA) and LTU-AS (OpenASQA). Specifically, we release the (question, answer, audio_id) tuples. The actual audio files come from existing public datasets and need to be downloaded by the users. We provide the full dataset (including all AQAs) as well as breakdown subsets (closed-ended and open-ended subsets, per-source-dataset subsets, etc.). All links are hosted on Dropbox and support wget.

For LTU (OpenAQA)

Toy Set (Contains Raw Audio Files, for Testing Purpose Only):

For LTU: [Meta] [Audio]

OpenAQA Training (Only Audio Datasets, 5.6M AQAs in Total):

Full Dataset (2.3GB): [Download]

Breakdown Subsets: [Download]

LTU Evaluation Data: [Download]


For LTU-AS (OpenASQA)

Toy Set (Contains Raw Audio Files, for Testing Purpose Only):

For LTU-AS: [Meta] [Audio and Whisper Feature]

OpenASQA Training (Audio and Speech Datasets, 10.2M AQAs in Total):

Full Dataset (4.6GB): [Download]

Breakdown Subsets: [Download]

LTU-AS Evaluation Data: [Download]


When preparing the audio files, please make sure they are all sampled at 16 kHz.
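As a minimal stdlib sketch (the `check_16k` helper is our own, not part of this repo), you can verify a WAV file's sampling rate before training:

```python
import wave

def check_16k(path):
    """Return True if the WAV file at `path` is sampled at 16 kHz."""
    with wave.open(path, "rb") as f:
        return f.getframerate() == 16000
```

Files at other rates should be resampled first, e.g. with ffmpeg or sox.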

The format of the dataset is a JSON file of a list of dicts, in the following format:

[
 {
  "instruction": "What is the significance of the sound of crying in this audio clip?", % the question
  "input": "I am so sad...", % the speech content
  "audio_id": "/data/sls/audioset/dave_version/audio/LZq4Neh-oWU.flac", % the audio id
  "dataset": "as_strong_train", % the original dataset (optional)
  "task": "open-ended question", % question type (optional)
  "output": "The sound of crying suggests that there is a sad or emotional situation happening in the audio clip." % the answer
 },
  ...
]
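Since each release is a plain JSON list in the format above, loading and filtering it is straightforward. A small sketch (the helper names and the optional-key handling are our own, not part of this repo):

```python
import json

def load_openaqa(path):
    """Load a list of (instruction, input, audio_id, output) dicts from a JSON file."""
    with open(path) as f:
        return json.load(f)

def open_ended_only(data):
    """Keep only entries tagged as open-ended questions (the 'task' key is optional)."""
    return [d for d in data if d.get("task") == "open-ended question"]
```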

Set the Virtual Environment

For almost all usages, you will need to set up a virtual environment. Note that LTU and LTU-AS require different environments: their customized hf-dev and peft-main packages differ. Please do not mix the venvs of LTU and LTU-AS.

Clone or download this repository as ltu-main, then:

For LTU:

cd /ltu-main/src/ltu
conda create --name venv_ltu python=3.10
conda activate venv_ltu
pip install -r requirements.txt
# install the customized Hugging Face Transformers (the stock package will not work)
pip install -e hf-dev/transformers-main
# install the customized Hugging Face PEFT (the stock package will not work)
pip install -e peft-main

For LTU-AS:

cd /ltu-main/src/ltu_as
conda create --name venv_ltu_as python=3.10
conda activate venv_ltu_as
pip install -r requirements.txt
# install the customized Hugging Face Transformers (the stock package will not work)
pip install -e hf-dev/transformers-main
# install the customized Hugging Face PEFT (the stock package will not work)
pip install -e peft-main/
# install the customized openai-whisper (the stock package will not work)
pip install -e whisper/
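After installing either environment, a quick sanity check (our own snippet, not part of the repo) confirms that the customized packages resolve in the active venv:

```python
import importlib.util

def has_pkg(name):
    """Return True if `name` is importable in the active environment."""
    return importlib.util.find_spec(name) is not None

# Both venvs should expose the customized transformers and peft;
# venv_ltu_as should additionally expose whisper.
for pkg in ("transformers", "peft"):
    print(f"{pkg}: {'ok' if has_pkg(pkg) else 'MISSING'}")
```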

Inference

We provide three options for inference.

Option 1. Inference via HuggingFace Space (No Code Needed)

<p align="center"><img src="https://github.com/YuanGongND/ltu/blob/main/usage.gif?raw=true" alt="LTU usage demo." width="900"/></p>

[LTU Interactive Demo]

[LTU-AS Interactive Demo]

Option 2. Inference with API (No GPU Needed)

The API supports batch inference with a simple for loop.

!pip install gradio_client

For LTU:

from gradio_client import Client

client = Client("https://yuangongfdu-ltu.hf.space/")
result = client.predict(
      "path_to_your_wav/audio.wav",  # your audio file in 16K
      "What can be inferred from the audio?",    # your question
      api_name="/predict"
)
print(result)
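The batch inference mentioned above is just a for loop over files. A hedged sketch (`batch_predict` is our own helper, not part of the API; `client.predict` and the `/predict` endpoint come from the example above):

```python
def batch_predict(client, wav_paths, question):
    """Ask the same question about many 16 kHz WAV files; returns answers in order.

    `client` is any object with a .predict(...) method,
    e.g. a gradio_client.Client as constructed above.
    """
    results = []
    for path in wav_paths:
        results.append(client.predict(path, question, api_name="/predict"))
    return results

# Usage (assumes gradio_client is installed and the Space is up):
# from gradio_client import Client
# client = Client("https://yuangongfdu-ltu.hf.space/")
# answers = batch_predict(client, ["a.wav", "b.wav"],
#                         "What can be inferred from the audio?")
```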

For LTU-AS:

# For LTU-AS
from gradio_client import Client

client = Client("https://yuangongfdu-ltu-2.hf.space/")
result = client.predict(
            "path_to_your_wav/audio.wav",  # your audio file in 16K
            "",
            "What can be inferred from the audio?",    # your question
            "7B (Default)",    # str in 'LLM size' Radio component
            api_name="/predict"
)
print(result)

Option 3. Local Inference

For users interested in training/finetuning, we suggest starting by running inference, which also helps with debugging. The bash scripts automatically download the default LTU/LTU-AS models; you do not need to download them yourself. inference_gradio.py can be run on CPU or GPU.

For LTU:

conda activate venv_ltu
cd ltu-main/src/ltu
chmod +x *.sh
./inference.sh

The script may print some warnings, which can be ignored. When it finishes, it provides a Gradio link for inference, which can be opened in a browser on any machine. You can also modify the script to run inference in a local terminal.

We also provide a batch inference script, inference_batch.py.

For LTU-AS:

conda activate venv_ltu_as
cd ltu-main/src/ltu_as
chmod +x *.sh
./inference.sh

The script may print some warnings, which can be ignored. When it finishes, it provides a Gradio link for inference, which can be opened in a browser on any machine.
