# PyKaldi
A Python wrapper for Kaldi
[![Build Status]][Travis]
PyKaldi is a Python scripting layer for the [Kaldi] speech recognition toolkit. It provides easy-to-use, low-overhead, first-class Python wrappers for the C++ code in the Kaldi and [OpenFst] libraries. You can use PyKaldi to write Python code for things that would otherwise require writing C++ code, such as calling low-level Kaldi functions, manipulating Kaldi and OpenFst objects in code, or implementing new Kaldi tools.
You can think of Kaldi as a large box of legos that you can mix and match to build custom speech recognition solutions. The best way to think of PyKaldi is as a supplement, a sidekick if you will, to Kaldi. In fact, PyKaldi is at its best when it is used alongside Kaldi. To that end, replicating the functionality of myriad command-line tools, utility scripts and shell-level recipes provided by Kaldi is a non-goal for the PyKaldi project.
## Overview

## Getting Started
Like Kaldi, PyKaldi is primarily intended for speech recognition researchers and professionals. It is jam-packed with goodies that one would need to build Python software taking advantage of the vast collection of utilities, algorithms, and data structures provided by the Kaldi and OpenFst libraries.
If you are not familiar with FST-based speech recognition or have no interest in
having access to the guts of Kaldi and OpenFst in Python, but only want to run a
pre-trained Kaldi system as part of your Python application, do not fret.
PyKaldi includes a number of high-level application oriented modules, such as
[asr], [alignment] and [segmentation], that should be accessible to most
Python programmers.
If you are interested in using PyKaldi for research or building advanced ASR
applications, you are in luck. PyKaldi comes with everything you need to read,
write, inspect, manipulate or visualize Kaldi and OpenFst objects in Python. It
includes Python wrappers for most functions and methods that are part of the
public APIs of Kaldi and OpenFst C++ libraries. If you want to read/write files
that are produced/consumed by Kaldi tools, check out I/O and table utilities in
the [util] package. If you want to work with Kaldi matrices and vectors, e.g.
convert them to [NumPy] ndarrays and vice versa, check out the [matrix]
package. If you want to use Kaldi for feature extraction and transformation,
check out the [feat], [ivector] and [transform] packages. If you want to
work with lattices or other FST structures produced/consumed by Kaldi tools,
check out the [fstext], [lat] and [kws] packages. If you want low-level
access to Gaussian mixture models, hidden Markov models or phonetic decision
trees in Kaldi, check out the [gmm], [sgmm2], [hmm], and [tree]
packages. If you want low-level access to Kaldi neural network models, check out
the [nnet3], [cudamatrix] and [chain] packages. If you want to use the
decoders and language modeling utilities in Kaldi, check out the [decoder],
[lm], [rnnlm], [tfrnnlm] and [online2] packages.
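For instance, the table classes in [util] are driven by Kaldi read/write specifier strings. The sketch below only illustrates the string conventions, which are standard Kaldi; the variable names are made up for illustration:

```python
# Kaldi I/O is driven by "specifier" strings: a prefix such as "ark:" or
# "scp:" followed by a file name or a shell pipeline. These are ordinary
# Python strings when passed to the PyKaldi table classes.

feats_rspec = "ark:feats.ark"        # read a binary archive directly
feats_scp   = "scp:feats.scp"        # read via a key -> location script file
piped_rspec = "ark:compute-mfcc-feats scp:wav.scp ark:- |"  # read from a command
lat_wspec   = "ark:| gzip -c > lat.gz"                      # write through a command

for spec in (feats_rspec, feats_scp, piped_rspec, lat_wspec):
    prefix, target = spec.split(":", 1)   # only the first ":" separates the prefix
    print(prefix, "->", target)
```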
Interested readers who would like to learn more about Kaldi and PyKaldi might find the following resources useful:
- [Kaldi Docs]: Read these to learn more about Kaldi.
- [PyKaldi Docs]: Consult these to learn more about the PyKaldi API.
- [PyKaldi Examples]: Check these out to see PyKaldi in action.
- [PyKaldi Paper]: Read this to learn more about the design of PyKaldi.
Since automatic speech recognition (ASR) in Python is undoubtedly the "killer app" for PyKaldi, we will go over a few ASR scenarios to get a feel for the PyKaldi API. We should note that PyKaldi does not provide any high-level utilities for training ASR models, so you need to train your models using Kaldi recipes or use pre-trained models available online. This is simply because there is no high-level ASR training API in the Kaldi C++ libraries. Kaldi ASR models are trained using complex shell-level [recipes][Kaldi Recipes] that handle everything from data preparation to the orchestration of myriad Kaldi executables used in training. This is by design and unlikely to change in the future.

PyKaldi does provide wrappers for the low-level ASR training utilities in the Kaldi C++ libraries, but those are not really useful unless you want to build an ASR training pipeline in Python from basic building blocks, which is no easy task. Continuing with the lego analogy, this task is akin to building [this][Lego Chiron] given access to a truck full of legos you might need. If you are crazy enough to try though, please don't let this paragraph discourage you. Before we started building PyKaldi, we thought that was a madman's task too.
## Automatic Speech Recognition in Python
The PyKaldi [asr] module includes a number of easy-to-use, high-level classes that make it dead simple to put together ASR systems in Python. Ignoring the boilerplate code needed for setting things up, doing ASR with PyKaldi can be as simple as the following snippet:
```python
asr = SomeRecognizer.from_files("final.mdl", "HCLG.fst", "words.txt", opts)
with SequentialMatrixReader("ark:feats.ark") as feats_reader:
    for key, feats in feats_reader:
        out = asr.decode(feats)
        print(key, out["text"])
```
In this simplified example, we first instantiate a hypothetical recognizer
SomeRecognizer with the paths for the model final.mdl, the decoding graph
HCLG.fst and the symbol table words.txt. The opts object contains the
configuration options for the recognizer. Then, we instantiate a [PyKaldi table
reader][util.table] SequentialMatrixReader for reading the feature
matrices stored in the [Kaldi archive][Kaldi Archive Docs] feats.ark. Finally,
we iterate over the feature matrices and decode them one by one. Here we are
simply printing the best ASR hypothesis for each utterance so we are only
interested in the "text" entry of the output dictionary out. Keep in mind
that the output dictionary contains a bunch of other useful entries, such as the
frame level alignment of the best hypothesis and a weighted lattice representing
the most likely hypotheses. Admittedly, not all ASR pipelines will be as simple
as this example, but they will often have the same overall structure. In the
following sections, we will see how we can adapt the code given above to
implement more complicated ASR pipelines.
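To make the table I/O concrete: a Kaldi *text* archive of matrices is just a sequence of keys, each followed by bracketed rows of numbers. The mini-reader below parses that format in plain Python purely to illustrate what a table reader iterates over; it is a hypothetical helper, not the PyKaldi API, and real code should use the [util.table] classes, which also handle binary archives and pipes.

```python
def read_text_matrix_ark(lines):
    """Parse a Kaldi text archive of matrices, e.g. 'utt1  [\\n 1 2\\n 3 4 ]'."""
    key, rows = None, []
    for line in lines:
        tok = line.split()
        if not tok:
            continue
        if key is None:           # start of a new entry: "<key> ["
            key = tok[0]
            continue
        closing = tok[-1] == "]"  # "]" on the last row closes the matrix
        if closing:
            tok = tok[:-1]
        if tok:
            rows.append([float(t) for t in tok])
        if closing:
            yield key, rows
            key, rows = None, []

ark = ["utt1  [", "  1 2", "  3 4 ]", "utt2  [", "  5 6 ]"]
for key, mat in read_text_matrix_ark(ark):
    print(key, mat)
```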
### Offline ASR using Kaldi Models
This is the most common scenario. We want to do offline ASR using pre-trained Kaldi models, such as [ASpIRE chain models]. Here we are using the term "models" loosely to refer to everything one would need to put together an ASR system. In this specific example, we are going to need:
- a [neural network acoustic model][Kaldi Neural Network Docs],
- a [transition model][Kaldi Transition Model Docs],
- a [decoding graph][Kaldi Decoding Graph Docs],
- a [word symbol table][Kaldi Symbol Table Docs],
- and a couple of feature extraction [configs][Kaldi Config Docs].
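For reference, the feature extraction configs are plain Kaldi option files, one `--option=value` per line. The fragment below is only an illustrative sketch of what an `mfcc.conf` for an 8 kHz model like ASpIRE might contain; the values here are guesses, so use the `conf/` files distributed with the models rather than this sketch.

```
# mfcc.conf -- hypothetical values, for illustration only
--use-energy=false
--sample-frequency=8000
--num-mel-bins=40
--num-ceps=40
```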
Note that you can use this example code to decode with [ASpIRE chain models].
```python
from kaldi.asr import NnetLatticeFasterRecognizer
from kaldi.decoder import LatticeFasterDecoderOptions
from kaldi.nnet3 import NnetSimpleComputationOptions
from kaldi.util.table import SequentialMatrixReader, CompactLatticeWriter

# Set the paths and read/write specifiers
model_path = "models/aspire/final.mdl"
graph_path = "models/aspire/graph_pp/HCLG.fst"
symbols_path = "models/aspire/graph_pp/words.txt"
feats_rspec = ("ark:compute-mfcc-feats --config=models/aspire/conf/mfcc.conf "
               "scp:wav.scp ark:- |")
ivectors_rspec = (feats_rspec + "ivector-extract-online2 "
                  "--config=models/aspire/conf/ivector_extractor.conf "
                  "ark:spk2utt ark:- ark:- |")
lat_wspec = "ark:| gzip -c > lat.gz"

# Instantiate the recognizer
decoder_opts = LatticeFasterDecoderOptions()
decoder_opts.beam = 13
decoder_opts.max_active = 7000
decodable_opts = NnetSimpleComputationOptions()
decodable_opts.acoustic_scale = 1.0
decodable_opts.frame_subsampling_factor = 3
asr = NnetLatticeFasterRecognizer.from_files(
    model_path, graph_path, symbols_path,
    decoder_opts=decoder_opts, decodable_opts=decodable_opts)

# Extract the features, decode and write output lattices
with SequentialMatrixReader(feats_rspec) as feats_reader, \
     SequentialMatrixReader(ivectors_rspec) as ivectors_reader, \
     CompactLatticeWriter(lat_wspec) as lat_writer:
    for (fkey, feats), (ikey, ivectors) in zip(feats_reader, ivectors_reader):
        assert fkey == ikey
        out = asr.decode((feats, ivectors))
        print(fkey, out["text"])
        lat_writer[fkey] = out["lattice"]
```
The fundamental difference between this example and the short snippet from last
section is that for each utterance we are reading the raw audio data from disk
and computing two feature matrices on the fly instead of reading a single
precomputed feature matrix from disk. The [script file][Kaldi Script File Docs]
wav.scp contains a list of WAV files corresponding to the utterances we want
to decode. The additional feature matrix we are extracting contains online
i-vectors that are used by the neural network acoustic model to perform channel
and speaker adaptation. The [speaker-to-utterance map][Kaldi Data Docs]
spk2utt is used for accumulating separate statistics for each speaker in
online i-vector extraction. It can be a simple identity mapping if the speaker
information is not available. We pack the MFCC features and the i-vectors into a
tuple and pass this tuple to the recognizer for decoding. The neural network
recognizers in PyKaldi know how to handle this additional i-vector input when it
is provided alongside the features.
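If no speaker labels are available, the identity spk2utt map mentioned above can be generated from `wav.scp` with a few lines of plain Python. The helper below is hypothetical (not part of PyKaldi or Kaldi), but the output format matches what the `ark:spk2utt` specifier expects:

```python
def make_identity_spk2utt(wav_scp_lines):
    """Treat every utterance as its own speaker: one 'utt-id utt-id' line each."""
    return ["{0} {0}".format(line.split()[0])
            for line in wav_scp_lines if line.strip()]

wav_scp = ["utt1 /data/utt1.wav", "utt2 /data/utt2.wav"]
print("\n".join(make_identity_spk2utt(wav_scp)))
```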
