
SpeechLMM

Multimodal and multilingual foundation models supporting audio, video and text.

Install / Use

/learn @Meetween/Speechlmm

README

[!NOTE] If you're working on a Cyfronet machine, please refer to README_cyfronet.md instead.

📹🎤 SpeechLMM

This repository contains the code for the SpeechLMM foundation model developed as part of the Meetween project. Across the 4-year timeframe of the project (2024-2027), we will release 3 different generations of the model, each time in 4 different sizes (S, M, L, XL).

Below is an illustration of the architecture of SpeechLMM version 1.0 (figure: SpeechLMM architecture).


🛠️ Preliminary setup

  • SpeechLMM builds on existing foundation models for the different modalities it supports. Some of these models are hosted on Hugging Face but are gated by default, so you must request access to them before you can use them within SpeechLMM. At the moment, you are required to request access to the following models:

  • In order for the codebase to work properly, you need to set the following environment variables:

    # Directory where your datasets reside
    export DATA_HOME=...
    # Path to this repository
    export SPEECHLMM_ROOT=...
    # Directory where the pre-trained components (e.g. modality encoders) are stored
    export PRETRAINED_COMPONENTS=...
    # Directory where model checkpoints will be stored
    export CHECKPOINTS_HOME=...
    

    For convenience, you can add the exports above to your ~/.bashrc or ~/.zshrc file, replacing the dots with the actual paths.

  • Download pre-trained building blocks for SpeechLMM. Important: you must download these models into $PRETRAINED_COMPONENTS.

    1. SeamlessM4T v2

      import os
      from transformers import AutoProcessor, AutoModel
      
      model_name = "facebook/seamless-m4t-v2-large"
      processor = AutoProcessor.from_pretrained(model_name)
      model = AutoModel.from_pretrained(model_name)
      
      processor.save_pretrained(os.path.join(os.getenv("PRETRAINED_COMPONENTS"), model_name))
      model.speech_encoder.save_pretrained(os.path.join(os.getenv("PRETRAINED_COMPONENTS"), model_name))
      
    2. Whisper v3

      import os
      from transformers import AutoProcessor, AutoModel
      
      model_name = "openai/whisper-large-v3"
      processor = AutoProcessor.from_pretrained(model_name)
      model = AutoModel.from_pretrained(model_name)
      
      processor.save_pretrained(os.path.join(os.getenv("PRETRAINED_COMPONENTS"), model_name))
      model.encoder.save_pretrained(os.path.join(os.getenv("PRETRAINED_COMPONENTS"), model_name))
      
    3. AutoAVSR

      Download the checkpoint manually from https://drive.google.com/file/d/1shcWXUK2iauRhW9NbwCc25FjU1CoMm8i and put it in $PRETRAINED_COMPONENTS/.
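The environment variables and download steps above can be tied together in a small stdlib-only sketch (the helper names below are ours, not the repository's): it reports which required variables are still unset and computes the directory where each component should be saved, mirroring the save paths in the two snippets above.

```python
import os
from pathlib import Path

REQUIRED_VARS = ("DATA_HOME", "SPEECHLMM_ROOT", "PRETRAINED_COMPONENTS", "CHECKPOINTS_HOME")

def missing_env_vars(env=os.environ):
    """Return the required variables (see above) that are unset or empty."""
    return [v for v in REQUIRED_VARS if not env.get(v)]

def component_dir(model_name, env=os.environ):
    """Directory under $PRETRAINED_COMPONENTS where a component's files
    should live, matching the save_pretrained paths used above."""
    return str(Path(env["PRETRAINED_COMPONENTS"]) / model_name)

# e.g. with PRETRAINED_COMPONENTS=/models:
#   component_dir("openai/whisper-large-v3") -> "/models/openai/whisper-large-v3"
```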

📊 System requirements

  • The codebase has only been tested on Linux
  • We only tested training and inference on NVIDIA Ampere (A100), Hopper (H100) and Grace Hopper (GH200) architectures with ≥40GB of VRAM per GPU

🔧 Installation

  1. Clone this repository and navigate to the speechlmm folder 📁

    git clone https://github.com/Meetween/speechlmm.git
    cd speechlmm
    
  2. Install package using conda 🐍

    conda create -n speechlmm python=3.10 -y
    conda activate speechlmm
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .
    
  3. Install additional packages for training and development 🏋️

    pip install -e ".[train,dev]"
    pip install flash-attn --no-build-isolation
    
  4. [Optional, but strongly encouraged 😇] Install pre-commit hooks for automatic code formatting 🪝

    pre-commit install
    
  5. Install decord (library for decoding videos) 🎥

    pip install decord
    
  6. Apply patches 🩹

    Some of this package's dependencies require patches. To apply them, make sure you have activated the virtual environment you set up <u>in step 2</u>, then run:

    python apply_patches.py
    

Upgrade to latest code base

git pull

⌨️ CLI Inference

To run inference using a trained model, run the following command (on a GPU instance):

python speechlmm/serve/cli.py --model-path /path/to/model_directory

While chatting with the model, there are three special inputs that are not passed directly to the model but are handled differently:

  1. if you send an empty message, the script terminates.
  2. if you write <reset>, the conversation history is cleared, including any audio tokens in the conversation.
  3. if you write audio:/path/to/new/audio_file at the end of your message, the model will clear the conversation history and load the new audio file.
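The dispatch of these three inputs can be sketched as follows (the function name and return values are illustrative, not taken from cli.py):

```python
def classify_user_input(message):
    """Sketch of how the three special CLI inputs above could be dispatched."""
    if message == "":
        return ("exit", None)
    if message == "<reset>":
        return ("reset", None)
    if "audio:" in message:
        # Split off the trailing audio:/path/to/file part of the message
        text, _, audio_path = message.rpartition("audio:")
        return ("new_audio", (text.rstrip(), audio_path.strip()))
    return ("chat", message)
```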

💾 Datasets and dataloaders

Dataset specifications are contained inside conf/datasets. If you wish to contribute a new dataset, follow the instructions in docs/custom_datasets.md.

👩‍💻 Codebase

The script you should use for training models is speechlmm/train/train_hydra.py, whereas speechlmm/train/eval_hydra.py should be used for evaluating trained models. The sections below offer an overview of the different modules of this repository, useful if you wish to contribute changes.

Model

Most of the code that implements SpeechLMM is found in modeling_speechlmm.py. The model class is loosely inspired by Hugging Face transformers (and it also has an associated configuration class in configuration_speechlmm.py).

Multimodal encoders, adapters and decoders

Multimodal encoders, multimodal adapters and multimodal decoders are organized into folders. Right now we support the audio and vision modalities for encoders and adapters, while decoders support only the text modality. If you wish to contribute additional modalities, make sure to follow the same implementation scheme.

🏋🏼‍♀️ Training

Launching a training is as simple as running a command like the following:

python speechlmm/train/train_hydra.py \
    --config-name pretrain \
    model/audio_encoder=seamless \
    model/audio_adapter=mlp \
    model/text_decoder=llama_3_8b \
    training_setting=paper_1a

To specify training configurations, we use Hydra, which in turn is based on OmegaConf. In the example above, we are launching a pre-training job using SeamlessM4T v2 as the audio encoder, a simple MLP as the audio adapter, and Llama 3 8B as the text decoder. The "training setting" we are using is paper_1a, and the details associated with it (such as which datasets to train on, which tasks, ...) are in conf/speechlmm/training_setting/paper_1a.yaml. Note that this file does not contain all the configuration options, but only those that differ from the default ones found in conf/speechlmm/pretrain.yaml (hence --config-name pretrain).

If you wish to tweak any configuration options, you can do so by creating a new YAML file under conf/speechlmm/training_setting/ and passing that as the training_setting parameter. Alternatively, you can override specific parameters in the command line directly, such as training.per_device_train_batch_size=16.
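Hydra's layering can be pictured as a recursive dictionary merge: the training-setting file, and then any command-line overrides, are applied on top of the defaults from conf/speechlmm/pretrain.yaml. A stdlib-only sketch of the merge semantics (not Hydra's actual implementation; the default values below are made up for illustration):

```python
def merge(defaults: dict, overrides: dict) -> dict:
    """Recursively merge `overrides` on top of `defaults` (overrides win)."""
    out = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

# Defaults (illustrative values), overridden from the command line:
defaults = {"training": {"per_device_train_batch_size": 8, "learning_rate": 2e-4}}
cli_override = {"training": {"per_device_train_batch_size": 16}}
merged = merge(defaults, cli_override)
# merged keeps learning_rate from the defaults and takes the batch size
# from the command-line override.
```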

Fine-tuning

The procedure for fine-tuning a pre-trained model is not very different from the one for pre-training (as shown above). The only important difference is that you must specify the training.pretrained_checkpoint parameter, which should point to a directory containing a pre-trained model checkpoint (in particular, a directory containing a config.json and a model.safetensors file).
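A quick stdlib check (our own helper, not part of the codebase) that a directory satisfies the two-file requirement just described:

```python
from pathlib import Path

def is_valid_pretrained_checkpoint(path) -> bool:
    """True if `path` contains the config.json and model.safetensors files
    that training.pretrained_checkpoint is expected to point at."""
    p = Path(path)
    return (p / "config.json").is_file() and (p / "model.safetensors").is_file()
```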

For example, let's imagine you trained a model using the command above and the final checkpoint was saved in /path/to/speechlmm-pretrain-paper_1a. Now you want to fine-tune it on a new dataset, so you create conf/datasets/my_finetuning_dataset.yml containing your dataset configuration and conf/speechlmm/training_setting/my_finetuning.yaml with the following content:

⋮

data:
  ⋮    
  data_config_path: conf/datasets/my_finetuning_dataset.yml
  ⋮

training:
  ⋮
  pretrained_checkpoint: /path/to/speechlmm-pretrain-paper_1a
  ⋮

⋮

At this point, you can launch the fine-tuning job with the following command:

python speechlmm/train/train_hydra.py \
    model/audio_encoder=seamless \
    model/audio_adapter=mlp \
    model/text_decoder=llama_3_8b \
    training_setting=my_finetuning

Parameter-efficient pretraining / fine-tuning using LoRA

If you want to run a parameter-efficient pretraining or fine-tuning using LoRA, you must provide an appropriate value for the training.lora_adapters configuration parameter. For example, here's a possible configuration where we apply two different LoRA adapters, one to the text decoder and one to the audio encoder:

⋮

training:
  ⋮
  lora_adapters:
    - name: text_decoder_peft_adapter
      target_module: text_decoder.model
      task_type: CAUSAL_LM
      r: 128
      lora_alpha: 256
      lora_dropout: 0.05
      bias: none
    - ⋮
      
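The r and lora_alpha values above control the adapter's rank and scaling: LoRA adds a low-rank update to a frozen weight W, giving an effective weight W + (alpha/r)·B·A. A toy stdlib sketch of this computation (illustrative dimensions; not the PEFT implementation used by the codebase):

```python
def matmul(a, b):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_update(w, a, b, r, alpha):
    """Return w + (alpha / r) * (b @ a), the effective LoRA weight."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Toy 2x2 example with rank r=1 and alpha=2 (so scale = 2.0):
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]          # r x d
b = [[0.5], [0.5]]        # d x r
```

With r=128 and lora_alpha=256 as in the config above, the scale alpha/r is likewise 2.0.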
