klaam

Arabic speech recognition, classification, and text-to-speech using models such as wav2vec2 and FastSpeech2. This repository supports training and prediction with pretrained models.

<p align="center"> <img src="https://raw.githubusercontent.com/ARBML/klaam/main/misc/klaam_logo.png" width="250px"/> </p>

1. Usage

1.1 Speech Classification

from klaam import SpeechClassification

wav_file = "file.wav"  # path to the audio file to classify
model = SpeechClassification()
model.classify(wav_file)

1.2 Speech Recognition

from klaam import SpeechRecognition

wav_file = "file.wav"  # path to the audio file to transcribe
model = SpeechRecognition()
model.transcribe(wav_file)
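
To transcribe several recordings in one pass, the call above can be wrapped in a small loop. The sketch below is not part of klaam: `transcribe_folder` is a hypothetical helper, and it takes the transcription function as a parameter so that it makes no assumption about what `transcribe` returns.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def transcribe_folder(folder: str, transcribe: Callable[[str], Any]) -> Dict[str, Any]:
    """Apply `transcribe` to every .wav file in `folder`, keyed by file name.

    `transcribe` is any callable that accepts a path string, for example
    the bound method `SpeechRecognition().transcribe`.
    """
    results = {}
    # Sort so the output order does not depend on the filesystem.
    for wav_path in sorted(Path(folder).glob("*.wav")):
        results[wav_path.name] = transcribe(str(wav_path))
    return results
```

With klaam installed, `transcribe_folder("recordings", SpeechRecognition().transcribe)` would run the model over each file; `recordings` is a placeholder directory name.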

1.3 Text To Speech

from klaam import TextToSpeech
prepare_tts_model_path = "../cfgs/FastSpeech2/config/Arabic/preprocess.yaml"
model_config_path = "../cfgs/FastSpeech2/config/Arabic/model.yaml"
train_config_path = "../cfgs/FastSpeech2/config/Arabic/train.yaml"
vocoder_config_path = "../cfgs/FastSpeech2/model_config/hifigan/config.json"
speaker_pre_trained_path = "../data/model_weights/hifigan/generator_universal.pth.tar"

model = TextToSpeech(prepare_tts_model_path, model_config_path, train_config_path, vocoder_config_path, speaker_pre_trained_path)

# sample_text holds the text to synthesize
model.synthesize(sample_text)
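
The constructor above receives five relative paths, so a script run from a different directory can fail on a missing file. A minimal sketch of a pre-flight check follows; `missing_paths` is a hypothetical helper, not a klaam function.

```python
from pathlib import Path
from typing import Iterable, List


def missing_paths(paths: Iterable[str]) -> List[str]:
    """Return the entries of `paths` that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# The five paths from the snippet above; they are relative, so the result
# depends on the directory the script is run from.
tts_paths = [
    "../cfgs/FastSpeech2/config/Arabic/preprocess.yaml",
    "../cfgs/FastSpeech2/config/Arabic/model.yaml",
    "../cfgs/FastSpeech2/config/Arabic/train.yaml",
    "../cfgs/FastSpeech2/model_config/hifigan/config.json",
    "../data/model_weights/hifigan/generator_universal.pth.tar",
]

for path in missing_paths(tts_paths):
    print(f"missing: {path}")
```

If nothing is printed, all five files were found relative to the current working directory.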

There are two available models for recognition, targeting Modern Standard Arabic (MSA) and the Egyptian dialect (EGY). You can select either one using the lang argument.

from klaam import SpeechRecognition
model = SpeechRecognition(lang='msa')
model.transcribe('file.wav')

2. Datasets

| Dataset | Description | Link |
| --- | --- | --- |
| MGB-3 | Egyptian Arabic speech recognition in the wild. Every sentence was annotated by four annotators. More than 15 hours have been collected from YouTube. | here [Registration required] |
| ADI-5 | More than 50 hours collected from Aljazeera TV. Covers four regional dialects, Egyptian (EGY), Levantine (LAV), Gulf (GLF), and North African (NOR), plus Modern Standard Arabic (MSA). This dataset is a part of the MGB-3 challenge. | here [Registration required] |
| Common Voice | Multilingual dataset available on Hugging Face | here |
| Arabic Speech Corpus | Arabic dataset with alignment and transcriptions | here |

3. Models

Our project currently supports four models, three of which are available through transformers.

| Language | Description | Source |
| --- | --- | --- |
| Egyptian | Speech recognition | wav2vec2-large-xlsr-53-arabic-egyptian |
| Standard Arabic | Speech recognition | wav2vec2-large-xlsr-53-arabic |
| EGY, NOR, LAV, GLF, MSA | Speech classification | wav2vec2-large-xlsr-dialect-classification |
| Standard Arabic | Text-to-Speech | fastspeech2 |
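
The table can be read as a lookup from task and language to checkpoint. The dictionary below restates it as a sketch; the key names (`"recognition"`, `"egy"`, `"all"`, and so on) are illustrative labels chosen here, and only `'msa'` appears in the usage example as a value that klaam accepts.

```python
# Checkpoint names are copied from the table; the keys are hypothetical labels.
MODELS = {
    ("recognition", "egy"): "wav2vec2-large-xlsr-53-arabic-egyptian",
    ("recognition", "msa"): "wav2vec2-large-xlsr-53-arabic",
    ("classification", "all"): "wav2vec2-large-xlsr-dialect-classification",
    ("tts", "msa"): "fastspeech2",
}


def model_for(task: str, lang: str) -> str:
    """Look up the checkpoint name for a task/language pair."""
    try:
        return MODELS[(task, lang)]
    except KeyError:
        raise ValueError(f"no model listed for task={task!r}, lang={lang!r}") from None


print(model_for("recognition", "msa"))  # wav2vec2-large-xlsr-53-arabic
```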

4. Example Notebooks

<table>
  <tr>
    <th><b>Name</b></th>
    <th><b>Description</b></th>
    <th><b>Notebook</b></th>
  </tr>
  <tr>
    <td>Demo</td>
    <td>Classification, recognition, and text-to-speech in a few lines of code.</td>
    <td><a href="https://colab.research.google.com/github/ARBML/klaam/blob/main/notebooks/demo.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg"> </a></td>
  </tr>
  <tr>
    <td>Demo with mic</td>
    <td>Audio recognition and classification with recording.</td>
    <td><a href="https://colab.research.google.com/github/ARBML/klaam/blob/main/notebooks/demo_with_mic.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg"> </a></td>
  </tr>
</table>

5. Training

The scripts are a modification of jqueguiner/wav2vec2-sprint.

5.1. Classification

This script trains the dialect classifier on the five classes (EGY, NOR, LAV, GLF, MSA).

python run_classifier.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --output_dir=/path/to/output \
    --cache_dir=/path/to/cache/ \
    --freeze_feature_extractor \
    --num_train_epochs="50" \
    --per_device_train_batch_size="32" \
    --preprocessing_num_workers="1" \
    --learning_rate="3e-5" \
    --warmup_steps="20" \
    --evaluation_strategy="steps" \
    --save_steps="100" \
    --eval_steps="100" \
    --save_total_limit="1" \
    --logging_steps="100" \
    --do_eval \
    --do_train
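
When the same run is launched repeatedly with different hyperparameters, it can help to assemble the argument list in Python. The sketch below is a convenience under stated assumptions: `build_command` is a hypothetical helper, and all it knows about `run_classifier.py` is the flags shown above.

```python
import shlex
from typing import Dict, List, Union


def build_command(script: str, options: Dict[str, Union[str, bool]]) -> List[str]:
    """Turn an options dict into an argv list.

    A value of True emits a bare flag (e.g. --do_train), False omits the
    option, and any other value is emitted as --name=value.
    """
    argv = ["python", script]
    for name, value in options.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is False:
            continue
        else:
            argv.append(f"--{name}={value}")
    return argv


command = build_command(
    "run_classifier.py",
    {
        "model_name_or_path": "facebook/wav2vec2-large-xlsr-53",
        "output_dir": "/path/to/output",
        "num_train_epochs": "50",
        "learning_rate": "3e-5",
        "freeze_feature_extractor": True,
        "do_train": True,
        "do_eval": True,
    },
)
print(shlex.join(command))  # shell-quoted form, handy for logging
```

The resulting list can be passed to `subprocess.run(command, check=True)` to launch the run.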

5.2. Recognition

This script trains the recognition model on the MGB-3 Egyptian dialect dataset.

python run_mgb3.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --output_dir=/path/to/output \
    --cache_dir=/path/to/cache/ \
    --freeze_feature_extractor \
    --num_train_epochs="50" \
    --per_device_train_batch_size="32" \
    --preprocessing_num_workers="1" \
    --learning_rate="3e-5" \
    --warmup_steps="20" \
    --evaluation_strategy="steps" \
    --save_steps="100" \
    --eval_steps="100" \
    --save_total_limit="1" \
    --logging_steps="100" \
    --do_eval \
    --do_train

This script can be used for training on the Arabic subset of Common Voice.

python run_common_voice.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --dataset_config_name="ar" \
    --output_dir=/path/to/output/ \
    --cache_dir=/path/to/cache \
    --overwrite_output_dir \
    --num_train_epochs="1" \
    --per_device_train_batch_size="32" \
    --per_device_eval_batch_size="32" \
    --evaluation_strategy="steps" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --fp16 \
    --freeze_feature_extractor \
    --save_steps="10" \
    --eval_steps="10" \
    --save_total_limit="1" \
    --logging_steps="10" \
    --group_by_length \
    --feat_proj_dropout="0.0" \
    --layerdrop="0.1" \
    --gradient_checkpointing \
    --do_train --do_eval \
    --max_train_samples 100 --max_val_samples 100
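
The last line caps the run at 100 training and 100 validation samples, which makes this command a quick smoke test. A back-of-the-envelope count, assuming a single device and no gradient accumulation (neither is stated above), shows how few optimizer steps that produces relative to the step-based settings.

```python
import math


def optimizer_steps(num_samples: int, batch_size: int, epochs: int = 1,
                    devices: int = 1) -> int:
    """Estimate total optimizer steps, assuming no gradient accumulation."""
    batches_per_epoch = math.ceil(num_samples / (batch_size * devices))
    return batches_per_epoch * epochs


total = optimizer_steps(num_samples=100, batch_size=32, epochs=1)
print(total)         # 4
print(total >= 10)   # False: fewer steps than save_steps / eval_steps / logging_steps
print(total >= 500)  # False: fewer steps than warmup_steps
```

Under these assumptions the run ends before any step-interval save, evaluation, or log would fire and before warmup completes, so raise the sample caps or lower the step intervals for a real run.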

5.3. Text To Speech

We use the PyTorch implementation of FastSpeech2 by ming024.

The procedure is as follows:

  1. Download the dataset and unzip it.
wget http://en.arabicspeechcorpus.com/arabic-speech-corpus.zip
unzip arabic-speech-corpus.zip
  2. Create the data directories and copy the TextGrid alignments.
mkdir -p raw_data/Arabic/Arabic preprocessed_data/Arabic/TextGrid/Arabic
cp arabic-speech-corpus/textgrid/* preprocessed_data/Arabic/TextGrid/Arabic
  3. Prepare the metadata.
import os

# Adjust base_dir if the corpus was unzipped somewhere else.
base_dir = '/content/arabic-speech-corpus'
lines = []
# Each .lab file holds the transcript for the utterance with the same base name.
for lab_file in os.listdir(f'{base_dir}/lab'):
    with open(f'{base_dir}/lab/{lab_file}', 'r', encoding='utf-8') as f:
        lines.append(lab_file[:-4] + '|' + f.read())

# Write one "name|transcript" entry per line.
with open(f'{base_dir}/metadata.csv', 'w', encoding='utf-8') as f:
    f.write('\n'.join(lines))
  4. Clone the FastSpeech2 repository and install the required dependencies.
git clone --depth 1 https://github.com/zaidalyafeai/FastSpeech2
cd FastSpeech2
pip install -r requirements.txt
  5. Prepare the alignments and the preprocessed data.
python3 prepare_align.py config/Arabic/preprocess.yaml
python3 preprocess.py config/Arabic/preprocess.yaml
  6. Unzip the vocoders.
unzip hifigan/generator_LJSpeech.pth.tar.zip -d hifigan
unzip hifigan/generator_universal.pth.tar.zip -d hifigan
  7. Start the training.
python3 train.py -p config/Arabic/preprocess.yaml -m config/Arabic/model.yaml -t config/Arabic/train.yaml
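
Before running the alignment and preprocessing steps, it can be worth checking that every transcript has a matching alignment file. The sketch below assumes that alignment files share the base name of their `.lab` file and end in `.TextGrid`; that naming is an assumption, not something stated above, and `unaligned_labs` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def unaligned_labs(lab_dir: str, textgrid_dir: str, suffix: str = ".TextGrid") -> List[str]:
    """Return base names of .lab files with no matching alignment file.

    `suffix` is the assumed alignment extension; adjust it to the corpus.
    """
    aligned = {p.stem for p in Path(textgrid_dir).glob(f"*{suffix}")}
    labs = sorted(p.stem for p in Path(lab_dir).glob("*.lab"))
    return [name for name in labs if name not in aligned]
```

For the layout created in the steps above, a call might look like `unaligned_labs("arabic-speech-corpus/lab", "preprocessed_data/Arabic/TextGrid/Arabic")`; an empty list means every transcript has an alignment.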

This repository was created by the ARBML team. If you have suggestions or contributions, feel free to open a pull request.
