
PyKaldi

A Python wrapper for Kaldi


<p align="center"><img src="docs/_static/pykaldi-logo-dark.png" width="40%"/></p>

[![Build Status]][Travis]

PyKaldi is a Python scripting layer for the [Kaldi] speech recognition toolkit. It provides easy-to-use, low-overhead, first-class Python wrappers for the C++ code in the Kaldi and [OpenFst] libraries. You can use PyKaldi to write Python code for things that would otherwise require writing C++ code, such as calling low-level Kaldi functions, manipulating Kaldi and OpenFst objects in code, or implementing new Kaldi tools.

You can think of Kaldi as a large box of legos that you can mix and match to build custom speech recognition solutions. The best way to think of PyKaldi is as a supplement, a sidekick if you will, to Kaldi. In fact, PyKaldi is at its best when it is used alongside Kaldi. To that end, replicating the functionality of the myriad command-line tools, utility scripts and shell-level recipes provided by Kaldi is a non-goal for the PyKaldi project.

Getting Started

Like Kaldi, PyKaldi is primarily intended for speech recognition researchers and professionals. It is jam-packed with goodies that one would need to build Python software taking advantage of the vast collection of utilities, algorithms and data structures provided by the Kaldi and OpenFst libraries.

If you are not familiar with FST-based speech recognition or have no interest in having access to the guts of Kaldi and OpenFst in Python, but only want to run a pre-trained Kaldi system as part of your Python application, do not fret. PyKaldi includes a number of high-level application-oriented modules, such as [asr], [alignment] and [segmentation], that should be accessible to most Python programmers.

If you are interested in using PyKaldi for research or building advanced ASR applications, you are in luck. PyKaldi comes with everything you need to read, write, inspect, manipulate or visualize Kaldi and OpenFst objects in Python. It includes Python wrappers for most functions and methods that are part of the public APIs of the Kaldi and OpenFst C++ libraries. Depending on what you want to do, check out the following packages:

  • To read/write files that are produced/consumed by Kaldi tools: the I/O and table utilities in the [util] package.
  • To work with Kaldi matrices and vectors, e.g. convert them to [NumPy] ndarrays and vice versa: the [matrix] package.
  • To use Kaldi for feature extraction and transformation: the [feat], [ivector] and [transform] packages.
  • To work with lattices or other FST structures produced/consumed by Kaldi tools: the [fstext], [lat] and [kws] packages.
  • For low-level access to Gaussian mixture models, hidden Markov models or phonetic decision trees in Kaldi: the [gmm], [sgmm2], [hmm], and [tree] packages.
  • For low-level access to Kaldi neural network models: the [nnet3], [cudamatrix] and [chain] packages.
  • To use the decoders and language modeling utilities in Kaldi: the [decoder], [lm], [rnnlm], [tfrnnlm] and [online2] packages.

Interested readers who would like to learn more about Kaldi and PyKaldi might find the following resources useful:

  • [Kaldi Docs]: Read these to learn more about Kaldi.
  • [PyKaldi Docs]: Consult these to learn more about the PyKaldi API.
  • [PyKaldi Examples]: Check these out to see PyKaldi in action.
  • [PyKaldi Paper]: Read this to learn more about the design of PyKaldi.

Since automatic speech recognition (ASR) in Python is undoubtedly the "killer app" for PyKaldi, we will go over a few ASR scenarios to get a feel for the PyKaldi API. We should note that PyKaldi does not provide any high-level utilities for training ASR models, so you need to train your models using Kaldi recipes or use pre-trained models available online. This is simply because there is no high-level ASR training API in the Kaldi C++ libraries. Kaldi ASR models are trained using complex shell-level [recipes][Kaldi Recipes] that handle everything from data preparation to the orchestration of the myriad Kaldi executables used in training. This is by design and unlikely to change in the future. PyKaldi does provide wrappers for the low-level ASR training utilities in the Kaldi C++ libraries, but those are not really useful unless you want to build an ASR training pipeline in Python from basic building blocks, which is no easy task. Continuing with the lego analogy, this task is akin to building [this][Lego Chiron] given access to a truck full of legos you might need. If you are crazy enough to try though, please don't let this paragraph discourage you. Before we started building PyKaldi, we thought that was a madman's task too.

Automatic Speech Recognition in Python

The PyKaldi [asr] module includes a number of easy-to-use, high-level classes to make it dead simple to put together ASR systems in Python. Ignoring the boilerplate code needed for setting things up, doing ASR with PyKaldi can be as simple as the following snippet of code:

asr = SomeRecognizer.from_files("final.mdl", "HCLG.fst", "words.txt", opts)

with SequentialMatrixReader("ark:feats.ark") as feats_reader:
    for key, feats in feats_reader:
        out = asr.decode(feats)
        print(key, out["text"])

In this simplified example, we first instantiate a hypothetical recognizer SomeRecognizer with the paths for the model final.mdl, the decoding graph HCLG.fst and the symbol table words.txt. The opts object contains the configuration options for the recognizer. Then, we instantiate a [PyKaldi table reader][util.table] SequentialMatrixReader for reading the feature matrices stored in the [Kaldi archive][Kaldi Archive Docs] feats.ark. Finally, we iterate over the feature matrices and decode them one by one. Here we are simply printing the best ASR hypothesis for each utterance so we are only interested in the "text" entry of the output dictionary out. Keep in mind that the output dictionary contains a bunch of other useful entries, such as the frame level alignment of the best hypothesis and a weighted lattice representing the most likely hypotheses. Admittedly, not all ASR pipelines will be as simple as this example, but they will often have the same overall structure. In the following sections, we will see how we can adapt the code given above to implement more complicated ASR pipelines.
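The shared structure described above can be tried out without Kaldi installed. The following is a minimal stand-in, not the PyKaldi API: `Recognizer`, `FrameCountRecognizer` and `transcribe` are hypothetical names introduced here for illustration, and a plain list of (key, feats) pairs plays the role of the table reader. Only the `decode(feats)` call and the "text" entry of its output mirror the snippet above.

```python
from typing import Any, Dict, Iterable, Iterator, List, Protocol, Tuple


class Recognizer(Protocol):
    """Anything with a decode() method returning a dict with a "text" entry."""

    def decode(self, feats: Any) -> Dict[str, Any]: ...


class FrameCountRecognizer:
    """Hypothetical stand-in: reports the frame count instead of doing ASR."""

    def decode(self, feats: List[List[float]]) -> Dict[str, Any]:
        return {"text": f"<{len(feats)} frames>"}


def transcribe(reader: Iterable[Tuple[str, Any]],
               asr: Recognizer) -> Iterator[Tuple[str, str]]:
    """Decode each (key, feats) pair and yield (key, best hypothesis text)."""
    for key, feats in reader:
        out = asr.decode(feats)
        yield key, out["text"]


# A list of (utterance key, feature matrix) pairs stands in for the table reader.
fake_reader = [("utt1", [[0.0] * 13] * 5), ("utt2", [[0.0] * 13] * 8)]
results = list(transcribe(fake_reader, FrameCountRecognizer()))
for key, text in results:
    print(key, text)
```

Because `transcribe` depends only on the iteration protocol and the `decode` method, swapping in a real table reader and a real recognizer should leave the loop itself unchanged.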

Offline ASR using Kaldi Models

This is the most common scenario. We want to do offline ASR using pre-trained Kaldi models, such as [ASpIRE chain models]. Here we are using the term "models" loosely to refer to everything one would need to put together an ASR system. In this specific example, we are going to need:

  • a [neural network acoustic model][Kaldi Neural Network Docs],
  • a [transition model][Kaldi Transition Model Docs],
  • a [decoding graph][Kaldi Decoding Graph Docs],
  • a [word symbol table][Kaldi Symbol Table Docs],
  • and a couple of feature extraction [configs][Kaldi Config Docs].

Note that you can use this example code to decode with [ASpIRE chain models].

from kaldi.asr import NnetLatticeFasterRecognizer
from kaldi.decoder import LatticeFasterDecoderOptions
from kaldi.nnet3 import NnetSimpleComputationOptions
from kaldi.util.table import SequentialMatrixReader, CompactLatticeWriter

# Set the paths and read/write specifiers
model_path = "models/aspire/final.mdl"
graph_path = "models/aspire/graph_pp/HCLG.fst"
symbols_path = "models/aspire/graph_pp/words.txt"
feats_rspec = ("ark:compute-mfcc-feats --config=models/aspire/conf/mfcc.conf "
               "scp:wav.scp ark:- |")
ivectors_rspec = (feats_rspec + "ivector-extract-online2 "
                  "--config=models/aspire/conf/ivector_extractor.conf "
                  "ark:spk2utt ark:- ark:- |")
lat_wspec = "ark:| gzip -c > lat.gz"

# Instantiate the recognizer
decoder_opts = LatticeFasterDecoderOptions()
decoder_opts.beam = 13
decoder_opts.max_active = 7000
decodable_opts = NnetSimpleComputationOptions()
decodable_opts.acoustic_scale = 1.0
decodable_opts.frame_subsampling_factor = 3
asr = NnetLatticeFasterRecognizer.from_files(
    model_path, graph_path, symbols_path,
    decoder_opts=decoder_opts, decodable_opts=decodable_opts)

# Extract the features, decode and write output lattices
with SequentialMatrixReader(feats_rspec) as feats_reader, \
     SequentialMatrixReader(ivectors_rspec) as ivectors_reader, \
     CompactLatticeWriter(lat_wspec) as lat_writer:
    for (fkey, feats), (ikey, ivectors) in zip(feats_reader, ivectors_reader):
        assert fkey == ikey
        out = asr.decode((feats, ivectors))
        print(fkey, out["text"])
        lat_writer[fkey] = out["lattice"]

The fundamental difference between this example and the short snippet from the previous section is that, for each utterance, we are reading the raw audio data from disk and computing two feature matrices on the fly instead of reading a single precomputed feature matrix from disk. The [script file][Kaldi Script File Docs] wav.scp contains a list of WAV files corresponding to the utterances we want to decode. The additional feature matrix we are extracting contains online i-vectors that are used by the neural network acoustic model to perform channel and speaker adaptation. The [speaker-to-utterance map][Kaldi Data Docs] spk2utt is used for accumulating separate statistics for each speaker in online i-vector extraction. It can be a simple identity mapping if the speaker information is not available. We pack the MFCC features and the i-vectors into a tuple and pass this tuple to the recognizer for decoding. The neural network recognizers in PyKaldi know how to h
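When no speaker information is available, the identity mapping mentioned above can be generated from wav.scp. The sketch below assumes the usual Kaldi text layouts: each wav.scp line starts with the utterance id, and each spk2utt line is a speaker id followed by that speaker's utterance ids. The helper name `write_identity_spk2utt` and the sample ids and paths are made up for illustration.

```python
import tempfile
from pathlib import Path


def write_identity_spk2utt(wav_scp: Path, spk2utt: Path) -> int:
    """Write one "<utt-id> <utt-id>" line per utterance listed in wav_scp.

    Returns the number of utterances written.
    """
    count = 0
    with wav_scp.open() as src, spk2utt.open("w") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # The first whitespace-separated field of each line is the key.
            utt_id = line.split(maxsplit=1)[0]
            # Treat every utterance as its own speaker.
            dst.write(f"{utt_id} {utt_id}\n")
            count += 1
    return count


# Example with made-up utterance ids and paths.
workdir = Path(tempfile.mkdtemp())
(workdir / "wav.scp").write_text("utt1 audio/utt1.wav\nutt2 audio/utt2.wav\n")
num_utts = write_identity_spk2utt(workdir / "wav.scp", workdir / "spk2utt")
spk2utt_text = (workdir / "spk2utt").read_text()
print(spk2utt_text, end="")
```

With an identity map, statistics are accumulated per utterance rather than per speaker, so this is a fallback for when real speaker labels are missing rather than a replacement for them.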
