Time-Domain Filterbanks
PyTorch implementation of Learning Filterbanks from Raw Speech for Phone Recognition (ICASSP 2018).
Time-Domain Filterbanks (TD-filterbanks) are neural network layers that operate directly on a raw audio waveform. At initialization, they approximate standard mel-filterbanks by computing first-order scattering coefficients. They can then be fine-tuned jointly with the rest of the architecture. The usual mel-filterbank options can be specified: a pre-emphasis layer, log compression of the coefficients, and mean-variance normalization.
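The three options mentioned above can be sketched as standalone functions. This is a minimal illustration under stated assumptions, not the repository's code: the function names, the eps constants, and the choice to normalize each filter over time are assumptions made for this example.

```python
import torch
import torch.nn.functional as F


def preemphasis(wav: torch.Tensor, alpha: float = 0.97) -> torch.Tensor:
    """Pre-emphasis y[t] = x[t] - alpha * x[t-1] as a 2-tap convolution.

    wav has shape (batch, 1, time). One zero is padded on the left so the
    output keeps the input length and the first sample is unchanged.
    """
    kernel = torch.tensor([[[-alpha, 1.0]]], dtype=wav.dtype, device=wav.device)
    return F.conv1d(F.pad(wav, (1, 0)), kernel)


def log_compress(coeffs: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Log compression; eps is an assumed floor that keeps log(0) finite."""
    return torch.log(coeffs + eps)


def mean_var_norm(coeffs: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Per-utterance normalization of coeffs with shape (batch, nfilters, frames).

    Assumption: statistics are computed per filter over the time axis.
    """
    mean = coeffs.mean(dim=-1, keepdim=True)
    std = coeffs.std(dim=-1, keepdim=True)
    return (coeffs - mean) / (std + eps)
```

Writing pre-emphasis as a convolution keeps it a regular layer that sits below the filterbank, which is how the README describes it ("a pre-emphasis layer").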
Different types of TD-Filterbanks
There are four different modes for TD-filterbanks:
- Fixed: Initialize the layers to match mel-filterbanks and keep their parameters fixed when training the model
- Learn-all: Initialize the layers and let the filterbank and the averaging be learned jointly with the model
- Learn-filterbank: Start from the initialization and learn only the filterbank with the model, keeping the averaging fixed to a squared Hann window
- Randinit: Initialize the layers randomly and learn them with the network
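One way to read the four modes is as a table of three flags: whether the layers start from the mel-filterbank initialization, whether the filterbank is trained, and whether the averaging is trained. The sketch below encodes that table and applies it to two stand-in layers. The names MODES and set_trainable, and the "randinit" key, are hypothetical and chosen for illustration; the repository may handle this differently.

```python
import torch.nn as nn

# mode -> (start from mel init?, train filterbank?, train averaging?)
MODES = {
    "fixed": (True, False, False),
    "learnall": (True, True, True),
    "learnfbanks": (True, True, False),
    "randinit": (False, True, True),  # hypothetical key, for illustration only
}


def set_trainable(filterbank: nn.Module, averaging: nn.Module, mode: str) -> bool:
    """Freeze or unfreeze the two layers according to the mode.

    Returns True when the mode expects the mel-filterbank initialization.
    """
    init_mel, train_filterbank, train_averaging = MODES[mode]
    for param in filterbank.parameters():
        param.requires_grad = train_filterbank
    for param in averaging.parameters():
        param.requires_grad = train_averaging
    return init_mel


# Stand-in layers: any two modules with parameters would do here.
filterbank = nn.Conv1d(1, 80, kernel_size=400, bias=False)
averaging = nn.Conv1d(40, 40, kernel_size=400, groups=40, bias=False)
needs_mel_init = set_trainable(filterbank, averaging, "learnfbanks")
```

Frozen parameters receive no gradient, so an optimizer built on the full model leaves them at their initial values.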
TD-filterbanks
Time-Domain Filterbanks are a neural architecture composed of a complex-valued convolution, a modulus operator and a grouped real-valued convolution. This structure is based on the computation of first-order scattering coefficients. They are created by instantiating the class TDFbanks:
import melfilters
import utils
import model

# Main parameters
layer_params = dict(mode='fixed',       # type of td-fbanks (fixed, learnall, learnfbanks)
                    nfilters=40,        # number of filters
                    samplerate=16000,   # samplerate of the waveform
                    wlen=25,            # length of the window (in milliseconds)
                    wstride=10,         # stride of the window
                    compression='log',  # compression of coefficients (log or None)
                    preemp=True,        # add a pre-emphasis layer below the td-fbanks
                    mvn=True)           # perform mean-variance normalization per utterance on the coefficients
tdfbanks = model.TDFbanks(**layer_params)
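To make the three-stage structure concrete, here is a minimal sketch of a complex-valued convolution, a modulus, and a grouped real-valued convolution. It is a hypothetical stand-in, not the repository's TDFbanks class. Assumptions: the complex convolution is stored as 2 * nfilters real filters, the modulus is squared, wstride is in milliseconds like wlen, and only the first convolution is padded.

```python
import torch
import torch.nn as nn


class TDFbanksSketch(nn.Module):
    """Hypothetical illustration of the TD-filterbank structure."""

    def __init__(self, nfilters=40, samplerate=16000, wlen=25, wstride=10):
        super().__init__()
        window = samplerate * wlen // 1000     # 400 samples for 25 ms at 16 kHz
        stride = samplerate * wstride // 1000  # 160 samples, assuming wstride is in ms
        self.nfilters = nfilters
        # Complex-valued convolution: the first nfilters output channels hold
        # the real parts, the last nfilters hold the imaginary parts.
        self.complex_conv = nn.Conv1d(1, 2 * nfilters, window,
                                      padding=window // 2, bias=False)
        # Grouped real-valued convolution: one averaging filter per channel,
        # applied with the frame stride.
        self.lowpass = nn.Conv1d(nfilters, nfilters, window, stride=stride,
                                 groups=nfilters, bias=False)

    def forward(self, wav):
        # wav: (batch, 1, time)
        z = self.complex_conv(wav)
        real, imag = z[:, :self.nfilters], z[:, self.nfilters:]
        squared_modulus = real ** 2 + imag ** 2
        return self.lowpass(squared_modulus)


# One second of audio at 16 kHz for a batch of two utterances.
features = TDFbanksSketch()(torch.randn(2, 1, 16000))
print(features.shape)  # torch.Size([2, 40, 98])
```

Setting groups=nfilters in the second convolution means each channel is averaged with its own filter and channels are never mixed, which mirrors how each mel-filterbank coefficient is computed independently.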
Initialization
When a TDFbanks instance is created, the weights of its convolutional layers are initialized randomly. With mode="learnall" and no further initialization, this corresponds to the randinit type of TD-filterbanks. To initialize the layers so that they match standard mel-filterbanks, call initialize:
# Initialization parameters
init_params = dict(min_freq=0,              # minimum frequency spanned by the filters
                   max_freq=8000,           # maximum frequency spanned by the filters
                   nfft=512,                # number of frequency bins for the mel-filterbanks to replicate
                   window_type='hamming',   # windowing function
                   normalize_energy=False,  # replicate mel-filterbanks normalized in energy, or that peak at 1
                   alpha=0.97)              # pre-emphasis parameter
tdfbanks.initialize(**init_params)
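As a rough illustration of what min_freq, max_freq and the number of filters determine, the sketch below places filter center frequencies uniformly on the mel scale. It assumes the common formula mel = 2595 * log10(1 + f / 700); the README does not state which mel formula melfilters uses, and the function names here are hypothetical.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Hz to mel, using the assumed 2595 * log10(1 + f / 700) convention."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(nfilters=40, min_freq=0.0, max_freq=8000.0):
    """Center frequencies (Hz) of nfilters filters equally spaced in mel.

    The range is cut into nfilters + 1 equal mel steps; the interior points
    are the centers, so neither min_freq nor max_freq is itself a center.
    """
    low, high = hz_to_mel(min_freq), hz_to_mel(max_freq)
    step = (high - low) / (nfilters + 1)
    return [mel_to_hz(low + i * step) for i in range(1, nfilters + 1)]


centers = mel_center_frequencies()
```

The centers are densely packed at low frequencies and spread out toward max_freq, which is the frequency resolution the initialized filters are meant to reproduce.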
Dependencies
Installation
Simply clone the repository:
git clone https://github.com/facebookresearch/tdfbanks.git
cd tdfbanks
References
If you find this code useful, please consider citing:
Learning Filterbanks from Raw Speech for Phone Recognition - N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, E. Dupoux
@inproceedings{zeghidour2017learning,
  title={Learning Filterbanks from Raw Speech for Phone Recognition},
  author={Zeghidour, Neil and Usunier, Nicolas and Kokkinos, Iasonas and Schatz, Thomas and Synnaeve, Gabriel and Dupoux, Emmanuel},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on},
  year={2018},
  organization={IEEE}
}
Contact: neilz@fb.com
