Time-Domain Filterbanks

PyTorch implementation of Learning Filterbanks from Raw Speech for Phone Recognition (ICASSP 2018).

Time-Domain Filterbanks (TD-filterbanks) are neural network layers that operate directly on the raw audio waveform. At initialization, they approximate standard mel-filterbanks by computing first-order scattering coefficients. They can then be fine-tuned jointly with the rest of the architecture. The usual mel-filterbank options can be specified: a pre-emphasis layer, log compression of the coefficients, and mean-variance normalization.
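As an illustration of the last two options, here is a minimal sketch of log compression followed by per-utterance mean-variance normalization on a batch of filterbank coefficients. The helper names, the offset added before the log, and the (batch, nfilters, frames) layout are assumptions made for this example, not taken from the repository.

```python
import torch

def log_compress(coeffs: torch.Tensor, offset: float = 1.0) -> torch.Tensor:
    """Log compression. The offset (an assumed value) keeps the log finite at zero."""
    return torch.log(coeffs.abs() + offset)

def mean_variance_normalize(coeffs: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize every filter channel over time, separately for each utterance.

    coeffs is assumed to have shape (batch, nfilters, frames).
    """
    mean = coeffs.mean(dim=-1, keepdim=True)
    std = coeffs.std(dim=-1, keepdim=True, unbiased=False)
    return (coeffs - mean) / (std + eps)

torch.manual_seed(0)
coeffs = torch.rand(2, 40, 98)  # stand-in for non-negative filterbank outputs
features = mean_variance_normalize(log_compress(coeffs))
```

After this step each channel of each utterance has approximately zero mean and unit variance over time.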

Different types of TD-Filterbanks

There are four different modes for TD-filterbanks:

Fixed: Initialize the layers to match mel-filterbanks and keep their parameters fixed while training the model.
Learn-all: Initialize the layers and let the filterbank and the averaging be learned jointly with the model.
Learn-filterbank: Start from the initialization and learn only the filterbank with the model, keeping the averaging fixed to a squared Hanning window.
Randinit: Initialize the layers randomly and learn them with the network.
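One way to picture these modes is as a choice of which weights receive gradients. The sketch below uses a hypothetical two-layer stand-in (`TinyFrontEnd`, `apply_mode`, and the attribute names `filterbank` and `averaging` are all invented for illustration). Randinit is not listed separately because it has the same trainable set as learn-all and differs only in skipping the mel initialization.

```python
import torch.nn as nn

class TinyFrontEnd(nn.Module):
    """Hypothetical stand-in: one filterbank layer and one averaging layer."""
    def __init__(self, nfilters=4, window_size=9):
        super().__init__()
        # Two real channels (real and imaginary part) per filter
        self.filterbank = nn.Conv1d(1, 2 * nfilters, window_size, bias=False)
        # One averaging (lowpass) kernel per filter
        self.averaging = nn.Conv1d(nfilters, nfilters, window_size,
                                   groups=nfilters, bias=False)

def apply_mode(module, mode):
    """Enable or disable gradients according to the mode name."""
    trainable = {
        'fixed':       {'filterbank': False, 'averaging': False},
        'learnfbanks': {'filterbank': True,  'averaging': False},
        'learnall':    {'filterbank': True,  'averaging': True},
    }[mode]
    module.filterbank.weight.requires_grad = trainable['filterbank']
    module.averaging.weight.requires_grad = trainable['averaging']
    return module

def count_trainable(module):
    """Number of scalar weights that will be updated during training."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

for mode in ('fixed', 'learnfbanks', 'learnall'):
    print(mode, count_trainable(apply_mode(TinyFrontEnd(), mode)))
```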

TD-filterbanks

Time-Domain Filterbanks are a neural architecture composed of a complex-valued convolution, a modulus operator, and a grouped real-valued convolution. This structure follows the computation of first-order scattering coefficients. A layer is created by instantiating the class TDFbanks:

import melfilters
import utils
import model
# Main parameters
layer_params = dict(mode='fixed',           # type of td-fbanks (fixed, learnall, learnfbanks)
                    nfilters=40,            # number of filters
                    samplerate=16000,       # samplerate of the waveform
                    wlen=25,                # length of the window (in milliseconds)
                    wstride=10,             # stride of the window (in milliseconds)
                    compression='log',      # compression of coefficients (log or None)
                    preemp=True,            # add a pre-emphasis layer below the td-fbanks
                    mvn=True)               # perform mean-variance normalization per utterance on the coefficients

tdfbanks = model.TDFbanks(**layer_params)
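To make the three-stage structure concrete, the following is a rough stand-alone sketch in plain PyTorch, not the repository's TDFbanks class. It assumes that the complex convolution is stored as interleaved real and imaginary channels, that window and stride are converted from milliseconds to samples, and that the window length is made odd so the padding is symmetric. The class name `ScatteringSketch` is hypothetical.

```python
import torch
import torch.nn as nn

class ScatteringSketch(nn.Module):
    """Complex convolution -> modulus -> grouped lowpass convolution."""
    def __init__(self, nfilters=40, samplerate=16000, wlen=25, wstride=10):
        super().__init__()
        window_size = samplerate * wlen // 1000 + 1    # 401 samples (odd length is an assumption)
        window_stride = samplerate * wstride // 1000   # 160 samples
        self.nfilters = nfilters
        # Complex-valued convolution, stored as 2 real channels per filter
        self.complex_conv = nn.Conv1d(1, 2 * nfilters, window_size,
                                      padding=(window_size - 1) // 2, bias=False)
        # Grouped real-valued convolution: one averaging kernel per filter
        self.lowpass = nn.Conv1d(nfilters, nfilters, window_size,
                                 stride=window_stride, groups=nfilters, bias=False)

    def forward(self, waveform):
        # waveform: (batch, 1, samples)
        x = self.complex_conv(waveform)                  # (batch, 2 * nfilters, samples)
        batch, _, samples = x.shape
        x = x.reshape(batch, self.nfilters, 2, samples)  # pair real and imaginary parts
        x = torch.sqrt(x.pow(2).sum(dim=2))              # modulus of each complex output
        return self.lowpass(x)                           # (batch, nfilters, frames)

torch.manual_seed(0)
layer = ScatteringSketch()
out = layer(torch.randn(2, 1, 16000))  # two one-second utterances at 16 kHz
print(out.shape)
```

With these settings, one second of audio yields 98 frames of 40 coefficients, since the averaging convolution is applied without padding.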

Initialization

When a TDFbanks object is created, the weights of its convolutional layers are initialized randomly. With mode="learnall" and no call to initialize, this corresponds to the randinit type of TD-filterbanks. To initialize the layers so that they match standard mel-filterbanks instead:

# Initialization parameters
init_params = dict(min_freq=0,              # minimum frequency spanned by the filters
                   max_freq=8000,           # maximum frequency spanned by the filters
                   nfft=512,                # number of frequency bins for the mel-filterbanks to replicate
                   window_type='hamming',   # windowing function
                   normalize_energy=False,  # replicate mel-filterbanks normalized in energy, or ones that peak at 1
                   alpha=0.97)              # pre-emphasis parameter

tdfbanks.initialize(**init_params)
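The alpha parameter above controls pre-emphasis, which is conventionally the filter y[t] = x[t] - alpha * x[t-1]. A two-tap convolution can express it, as in the sketch below. The helper names and the choice of left zero-padding are assumptions for illustration; the repository may set up its pre-emphasis layer differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_preemphasis(alpha: float = 0.97) -> nn.Conv1d:
    """Two-tap convolution implementing y[t] = x[t] - alpha * x[t-1]."""
    conv = nn.Conv1d(1, 1, kernel_size=2, bias=False)
    with torch.no_grad():
        # Conv1d is a cross-correlation, so the taps are ordered (x[t-1], x[t])
        conv.weight.copy_(torch.tensor([[[-alpha, 1.0]]]))
    return conv

def preemphasize(waveform: torch.Tensor, conv: nn.Conv1d) -> torch.Tensor:
    """Left-pad with one zero so the output has the same length as the input."""
    return conv(F.pad(waveform, (1, 0)))

preemp = make_preemphasis(alpha=0.97)
x = torch.tensor([[[1.0, 2.0, 3.0, 4.0]]])  # (batch, channel, samples)
y = preemphasize(x, preemp)
```

Because the layer is an ordinary convolution, its single kernel can either stay frozen or be learned along with the rest of the front end.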

Dependencies

Python 2/3 with NumPy
PyTorch
CUDA

Installation

Simply clone the repository:

git clone https://github.com/facebookresearch/tdfbanks.git
cd tdfbanks

References

If you find this code useful, please consider citing:

Learning Filterbanks from Raw Speech for Phone Recognition - N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, E. Dupoux

@inproceedings{zeghidour2017learning,
  title={Learning Filterbanks from Raw Speech for Phone Recognition},
  author={Zeghidour, Neil and Usunier, Nicolas and Kokkinos, Iasonas and Schatz, Thomas and Synnaeve, Gabriel and Dupoux, Emmanuel},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on},
  year={2018},
  organization={IEEE}
}

Contact: neilz@fb.com
