

BembaSpeech: a Speech Recognition Corpus for the Bemba Language

1. Introduction

BembaSpeech is an ASR corpus for the Bemba language of Zambia. It contains read speech from diverse publicly available Bemba sources: literature books, radio/TV show transcripts, YouTube video transcripts, and various open online sources. Its purpose is to enable the training and testing of automatic speech recognition (ASR) systems for the Bemba language. The corpus has 14,438 utterances totaling 24.5 hours of speech data.

All signal files are encoded in Waveform Audio File Format (WAVE) as mono recordings with a sample rate of 16 kHz.
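The format described above can be checked for an individual file with Python's standard `wave` module. This is a minimal sketch, not part of the corpus tooling: the function name `check_wav_format` is hypothetical, and it assumes the files are uncompressed PCM WAVE, which is the only kind the `wave` module reads.

```python
import wave

EXPECTED_CHANNELS = 1         # mono, as stated in this README
EXPECTED_SAMPLE_RATE = 16000  # 16 kHz, as stated in this README


def check_wav_format(path):
    """Return (ok, channels, sample_rate, duration_seconds) for one WAV file.

    Hypothetical helper: `ok` is True only when the file is mono at 16 kHz.
    """
    with wave.open(str(path), "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    ok = channels == EXPECTED_CHANNELS and sample_rate == EXPECTED_SAMPLE_RATE
    return ok, channels, sample_rate, duration
```

Running this over every file under `bem/audio/` before training is a cheap way to catch a file that was resampled or converted to stereo by accident.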

The corpus is split into three parts:

training set: approximately 20 hours of speech
development set: approximately 2.5 hours of speech
testing set: approximately 2 hours of speech
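As a quick arithmetic check, the approximate split sizes listed above add up to the 24.5 hours reported for the corpus. The sketch below uses only those rounded figures; the names `SPLIT_HOURS` and `split_shares` are hypothetical, and the exact durations may differ slightly.

```python
# Approximate hours per split, taken from the list above.
SPLIT_HOURS = {"train": 20.0, "dev": 2.5, "test": 2.0}


def split_shares(hours_by_split):
    """Return each split's share of the total duration, in percent (1 decimal)."""
    total = sum(hours_by_split.values())
    return {
        name: round(100 * hours / total, 1)
        for name, hours in hours_by_split.items()
    }


total_hours = sum(SPLIT_HOURS.values())  # 24.5
shares = split_shares(SPLIT_HOURS)       # roughly 82% / 10% / 8%
```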

2. Structure

The repository is structured as follows:

    BembaSpeech
        ├── bem
        │   ├── audio/*
        │   ├── dev.csv
        │   ├── test.csv
        │   └── train.csv
        ├── Data Statement.md
        ├── README.md
        └── speaker_info.txt
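Given the layout above, a split can be read with Python's standard `csv` module. This is a sketch under assumptions: it assumes the CSV files are comma-delimited, UTF-8 encoded, and begin with a header row. The column names are not documented here, so the code does not depend on them, and `load_split` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def load_split(corpus_root, split):
    """Read bem/<split>.csv and return its rows as a list of dicts.

    Assumes a header row; keys in each dict are whatever the header declares.
    `split` is expected to be "train", "dev", or "test".
    """
    csv_path = Path(corpus_root) / "bem" / f"{split}.csv"
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `load_split("BembaSpeech", "dev")` would return one dict per utterance in the development set, and inspecting the keys of the first row shows which columns hold the audio file name and the transcript.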

3. Citation

If you use this speech dataset in your project or research, please consider citing it as follows:

    @InProceedings{sikasote-anastasopoulos:2022:LREC,
      author    = {Sikasote, Claytone  and  Anastasopoulos, Antonios},
      title     = {BembaSpeech: A Speech Recognition Corpus for the Bemba Language},
      booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
      month          = {June},
      year           = {2022},
      address        = {Marseille, France},
      publisher      = {European Language Resources Association},
      pages     = {7277--7283},
      abstract  = {We present a preprocessed, ready-to-use automatic speech recognition corpus, BembaSpeech, consisting over 24 hours of read speech in the Bemba language, a written but low-resourced language spoken by over 30\% of the population in Zambia. To assess its usefulness for training and testing ASR systems for Bemba, we explored different approaches; supervised pre-training (training from scratch), cross-lingual transfer learning from a monolingual English pre-trained model using DeepSpeech on the portion of the dataset and fine-tuning large scale self-supervised Wav2Vec2.0 based multilingual pre-trained models on the complete BembaSpeech corpus. From our experiments, the 1 billion XLS-R parameter model gives the best results. The model achieves a word error rate (WER) of 32.91\%, results demonstrating that model capacity significantly improves performance and that multilingual pre-trained models transfers cross-lingual acoustic representation better than monolingual pre-trained English model on the BembaSpeech for the Bemba ASR. Lastly, results also show that the corpus can be used for building ASR systems for Bemba language.},
      url       = {https://aclanthology.org/2022.lrec-1.790}
    }

4. Contact

Please feel free to email me at claytonsikasote@gmail.com if you would like to discuss anything related to this work. Cheers!
