
CLAP

<p align="center"> <img src="https://raw.githubusercontent.com/LAION-AI/CLAP/main/assets/logo.PNG" alt="The Contrastive Language-Audio Pretraining Model Architecture" width="60%"/> </p> <p align="center"> <a href="https://arxiv.org/abs/2211.06687"><img src="https://img.shields.io/badge/arXiv-2211.06687-brightgreen.svg?style=flat-square"/></a> <a href="https://pypi.org/project/laion-clap"><img src="https://badge.fury.io/py/laion-clap.svg"/></a> <a href="https://huggingface.co/docs/transformers/v4.27.2/en/model_doc/clap"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Transformers-blue"/></a> </p>

This repository provides representations of audio and text via Contrastive Language-Audio Pretraining (CLAP).

With CLAP, you can extract a latent representation of any given audio or text for your own model or for different downstream tasks.

All code is officially released with our paper, accepted at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023); see the arXiv badge above.

New Updates:

<b>1. We release new CLAP checkpoints pretrained on music and speech data collections from our dataset collection repo.</b>

<b>2. The CLAP model is now incorporated into and supported by Hugging Face Transformers. Many thanks to Younes Belkada and Arthur Zucker for contributing the Hugging Face support.</b>

About this project

This is an open-source LAION project that aims at learning better audio understanding and collecting more audio data. We adopt the codebase of open_clip for this project.

Many thanks to <a href="https://github.com/cfoster0/CLAP">@cfoster0</a> for allowing us to use his repo name.

Architecture

Contrastive Language-Audio Pretraining (CLAP) follows the design of CLIP (Contrastive Language-Image Pretraining). The CLAP architecture is shown below.

<p align="center"> <img src="https://raw.githubusercontent.com/LAION-AI/CLAP/main/assets/audioclip-arch.png" alt="The Contrastive Language-Audio Pretraining Model Architecture" width="60%"/> </p>

Quick Start

We provide the PyPI library for our CLAP model:

pip install laion-clap

Then you can follow the usage below or refer to unit_test.py.

For the documentation of the API, please refer to hook.py.

import numpy as np
import librosa
import torch
import laion_clap

# Quantization helpers: the float32 -> int16 -> float32 round trip below
# reproduces the 16-bit quantization applied to audio during training.
def int16_to_float32(x):
    return (x / 32767.0).astype('float32')


def float32_to_int16(x):
    x = np.clip(x, a_min=-1., a_max=1.)
    return (x * 32767.).astype('int16')

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt() # download the default pretrained checkpoint.

# Directly get audio embeddings from audio files
audio_file = [
    '/home/data/test_clap_short.wav',
    '/home/data/test_clap_long.wav'
]
audio_embed = model.get_audio_embedding_from_filelist(x = audio_file, use_tensor=False)
print(audio_embed[:,-20:])
print(audio_embed.shape)

# Get audio embeddings from audio data
audio_data, _ = librosa.load('/home/data/test_clap_short.wav', sr=48000) # sample rate should be 48000
audio_data = audio_data.reshape(1, -1) # Make it (1,T) or (N,T)
audio_embed = model.get_audio_embedding_from_data(x = audio_data, use_tensor=False)
print(audio_embed[:,-20:])
print(audio_embed.shape)

# Directly get audio embeddings from audio files, but return torch tensor
audio_file = [
    '/home/data/test_clap_short.wav',
    '/home/data/test_clap_long.wav'
]
audio_embed = model.get_audio_embedding_from_filelist(x = audio_file, use_tensor=True)
print(audio_embed[:,-20:])
print(audio_embed.shape)

# Get audio embeddings from audio data
audio_data, _ = librosa.load('/home/data/test_clap_short.wav', sr=48000) # sample rate should be 48000
audio_data = audio_data.reshape(1, -1) # Make it (1,T) or (N,T)
audio_data = torch.from_numpy(int16_to_float32(float32_to_int16(audio_data))).float() # quantize before sending it to the model
audio_embed = model.get_audio_embedding_from_data(x = audio_data, use_tensor=True)
print(audio_embed[:,-20:])
print(audio_embed.shape)

# Get text embeddings from texts:
text_data = ["I love the contrastive learning", "I love the pretrain model"] 
text_embed = model.get_text_embedding(text_data)
print(text_embed)
print(text_embed.shape)

# Get text embeddings from texts, but return a torch tensor:
text_data = ["I love the contrastive learning", "I love the pretrain model"] 
text_embed = model.get_text_embedding(text_data, use_tensor=True)
print(text_embed)
print(text_embed.shape)
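
With both embedding types in hand, a common downstream step is ranking candidate texts against an audio clip. Below is a minimal sketch (the file path and candidate captions are illustrative placeholders); it reuses the model loaded above and normalizes explicitly so the dot product is cosine similarity:

import torch.nn.functional as F

# Rank candidate captions for one audio clip by cosine similarity.
audio_embed = model.get_audio_embedding_from_filelist(x=['/home/data/test_clap_short.wav'], use_tensor=True)
text_embed = model.get_text_embedding(["a dog barking", "a person speaking", "piano music"], use_tensor=True)

similarity = F.normalize(audio_embed, dim=-1) @ F.normalize(text_embed, dim=-1).T
print(similarity)  # shape (1, 3): one score per candidate caption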

Pretrained Models

The pretrained checkpoints can be found here. Please refer to the previous section for how to load and run the checkpoints. For the PyPI library, 630k-audioset-best.pt and 630k-audioset-fusion-best.pt are our default models (non-fusion and fusion).

We further provide the pretrained models below for different use cases:

The checkpoint listed here for each model setting is the one with the highest average mAP score during training. The average mAP score is the mean of four scores: A-->T mAP@10 on AudioCaps, T-->A mAP@10 on AudioCaps, A-->T mAP@10 on Clotho, and T-->A mAP@10 on Clotho.
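
As a quick illustration of that selection metric (the values below are placeholders, not reported results):

# Placeholder values, not reported results: the selection metric is
# the plain mean of the four retrieval scores named above.
scores = {
    "audiocaps_a2t_map10": 0.0,
    "audiocaps_t2a_map10": 0.0,
    "clotho_a2t_map10": 0.0,
    "clotho_t2a_map10": 0.0,
}
avg_map = sum(scores.values()) / len(scores)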

To use the above pretrained models, you need to load the checkpoint yourself; a minimal sketch (the checkpoint path is a placeholder):
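
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)  # set enable_fusion=True for fusion checkpoints
model.load_ckpt('/path/to/checkpoint.pt')  # placeholder path to your downloaded checkpoint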

Update 2023.4.7: we have released three larger CLAP models trained on music and speech datasets in addition to LAION-Audio-630k. Here are descriptions of the models and their performance:

  • music_speech_audioset_epoch_15_esc_89.98.pt: trained on music + speech + Audioset + LAION-Audio-630k. The zeroshot ESC50 performance is 89.98%, the GTZAN performance is 51%.
  • music_audioset_epoch_15_esc_90.14.pt: trained on music + Audioset + LAION-Audio-630k. The zeroshot ESC50 performance is 90.14%, the GTZAN performance is 71%.
  • music_speech_epoch_15_esc_89.25.pt: trained on music + speech + LAION-Audio-630k. The zeroshot ESC50 performance is 89.25%, the GTZAN performance is 69%.

These models use a larger audio encoder (HTSAT-base). To load them using the pip API:

import laion_clap
model = laion_clap.CLAP_Module(enable_fusion=False, amodel= 'HTSAT-base')
model.load_ckpt('checkpoint_path/checkpoint_name.pt')

Please note that this is a temporary release for people working on larger-scale downstream tasks. We will release a more comprehensive version of the model with detailed experiments in the future. Use this model at your own risk.

  • None of the new checkpoints were trained with fusion. The training dataset size for music_speech_audioset_epoch_15_esc_89.98.pt is around 4M samples. The zero-shot GTZAN score is evaluated using the prompt This audio is a <genre> song. (see the sketch below).
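
A hedged sketch of that zero-shot GTZAN evaluation, reusing the model loaded above (the genre list is the standard ten GTZAN genres; the audio path is a placeholder):

import torch.nn.functional as F

genres = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]
text_embed = model.get_text_embedding([f"This audio is a {g} song." for g in genres], use_tensor=True)
audio_embed = model.get_audio_embedding_from_filelist(x=['/path/to/track.wav'], use_tensor=True)

# Pick the genre whose prompt embedding is closest in cosine similarity.
scores = F.normalize(audio_embed, dim=-1) @ F.normalize(text_embed, dim=-1).T
print(genres[scores.argmax(dim=-1).item()])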
<!-- We provide the CLAP's performance on audio classification tasks under the zero-shot setting or the supervised setting. More results can be found at our paper. <p align="center"> <img src="https://raw.githubusercontent.com/LAION-AI/CLAP/main/assets/clap-zeroshot.PNG" alt="Zero-shot Performance" width="100%"/> </p> -->

Environment Installation

If you want to inspect and reuse our model in your project instead of directly using the pip library, you need to install the same environment we use; please run the following commands:

conda create -n clap python=3.10
conda activate clap
git clone https://github.com/LAION-AI/CLAP.git
cd CLAP
# you can also install pytorch by following the official instruction (https://pytorch.org/get-started/locally/)
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

Dataset format

We use training data in webdataset format. For details of our dataset, please see https://github.com/LAION-AI/audio-dataset.
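
For a sense of what consuming such shards looks like, here is a minimal sketch using the webdataset library (the shard pattern and the .flac/.json sample keys, including the "text" caption field, are assumptions; adjust them to match how your shards were written):

import io
import json
import soundfile as sf
import webdataset as wds

dataset = wds.WebDataset("train-{000000..000009}.tar").to_tuple("flac", "json")
for flac_bytes, json_bytes in dataset:
    audio, sr = sf.read(io.BytesIO(flac_bytes))  # raw bytes -> waveform + sample rate
    meta = json.loads(json_bytes)
    print(audio.shape, sr, meta.get("text"))
    break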

Due to copyright reasons, we cannot release the dataset we trained this model on. However, we released LAION-Audio-630K, the data source we used to compose the dataset, with a link to each audio clip and its caption. Please refer to LAION-Audio-630K for more details. You can download the dataset, preprocess it on your own, and train locally.
