EfficientAT
This repository provides efficient CNNs for Audio Tagging, including AudioSet pre-trained models ready for downstream training and for extracting audio embeddings.
Efficient Pre-Trained CNNs for Audio Pattern Recognition
In this repository, we publish the pre-trained models and the code described in the papers:
- Efficient Large-Scale Audio Tagging Via Transformer-To-CNN Knowledge Distillation. The paper was presented at ICASSP 2023 and is published by IEEE (published version).
- Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models. Submitted to IEEE/ACM TASLP. Pre-trained models are included; experiments on downstream tasks will be added.
The models in this repository are especially suited to you if you are looking for pre-trained audio pattern recognition models that are able to:
- achieve competitive audio tagging performance on resource constrained platforms
- reach high performance on downstream tasks with a simple fine-tuning pipeline
- extract high-quality general purpose audio representations
Pre-training Audio Pattern Recognition models by large-scale, general-purpose Audio Tagging is dominated by Transformers (PaSST [1], AST [2], HTS-AT [3], BEATs [16]), which achieve the highest single-model mean average precisions (mAP) on AudioSet [4]. However, Transformers are complex models that scale quadratically with respect to the sequence length, making them slow for inference. CNNs scale linearly with respect to the sequence length and can easily be scaled to meet given resource constraints. However, CNNs (e.g. PANNs [5], ERANN [6], PSLA [7]) have fallen short of Transformers in terms of Audio Tagging performance.
We bring together the best of both worlds by training efficient CNNs of different complexities using Knowledge Distillation from Transformers. The figures below show the performance-complexity trade-off for existing models trained on AudioSet. The proposed MNs are described in this work published at ICASSP 2023, and the DyMNs are introduced in our most recent work submitted to TASLP. The plots below are created using the model profiler included in Microsoft's DeepSpeed framework.
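The quadratic-vs-linear scaling argument above can be made concrete with a rough cost model. The dimensions and kernel size below are illustrative assumptions, not measurements of any model in this repository:

```python
# Rough cost model: self-attention scales quadratically with the sequence
# length, a convolution scales linearly. Constants are illustrative only.

def attention_macs(seq_len: int, dim: int) -> int:
    # QK^T and the attention-weighted sum of V each cost seq_len^2 * dim MACs.
    return 2 * seq_len ** 2 * dim

def conv1d_macs(seq_len: int, channels: int, kernel: int) -> int:
    # A 1-D convolution costs roughly seq_len * kernel * channels^2 MACs.
    return seq_len * kernel * channels ** 2

# Doubling the input length doubles the conv cost but quadruples attention.
for n in (500, 1000, 2000):
    print(n, attention_macs(n, 768), conv1d_macs(n, 768, 3))
```

This is why longer audio clips hurt Transformer inference speed disproportionately.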


In response to a reviewer request on the published ICASSP paper, we also report the inference memory complexity of our pre-trained MNs. We calculate the analytical peak memory (memory requirement of input + output activations) as in [14]. We also take into account memory-efficient inference in MobileNets as described in [15].
The plot below compares the trend in peak memory requirements across different CNNs. We use the script peak_memory.py to determine the peak memory. The memory requirement is calculated assuming a 10-second audio snippet and fp16 representations for all models.
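The analytical peak memory (input plus output activations, as in [14]) can be sketched as follows. The layer shapes below are hypothetical placeholders, not the actual MN layers, and this ignores the memory-efficient MobileNet inference tricks of [15]:

```python
# Analytical peak memory: for each layer, sum the sizes of its input and
# output activation tensors; the peak is the maximum over all layers.
# fp16 uses 2 bytes per element. Layer shapes below are made up.

BYTES_FP16 = 2

def activation_bytes(shape) -> int:
    n = 1
    for d in shape:
        n *= d
    return n * BYTES_FP16

def peak_memory(layers) -> int:
    # layers: list of (input_shape, output_shape) tuples
    return max(activation_bytes(i) + activation_bytes(o) for i, o in layers)

layers = [
    ((1, 128, 1000), (16, 64, 500)),   # hypothetical stem conv
    ((16, 64, 500), (32, 32, 250)),    # hypothetical downsampling block
]
print(peak_memory(layers), "bytes")
```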

The next milestones are:
- Add the fine-tuning pipeline used in the DyMN paper submitted to TASLP
- Wrap this repository in an installable python package
- Use PyTorch Lightning to enable distributed training and training with fp16
The final repository should offer capabilities similar to the PANNs codebase, with two main advantages:
- Pre-trained models of lower computational and parameter complexity due to the efficient CNN architectures
- Higher performance due to Knowledge Distillation from Transformers and optimized models
This codebase is inspired by the PaSST and PANNs repositories, and the pytorch implementation of MobileNetV3.
Environment
The codebase is developed with Python 3.10.8. After creating an environment, install the requirements as follows:

```
pip install -r requirements.txt
```
Also make sure you have FFmpeg (version < 4.4) installed.
Pre-Trained Models
Pre-trained models are available in the GitHub Releases and are automatically downloaded from there. Loading a pre-trained model is as easy as running the following snippets:
Pre-trained MobileNet:
```python
from models.mn.model import get_model as get_mn
model = get_mn(pretrained_name="mn10_as")
```
Pre-trained Dynamic MobileNet:
```python
from models.dymn.model import get_model as get_dymn
model = get_dymn(pretrained_name="dymn10_as")
```
The table below shows a selection of the models contained in this repository. The naming convention for our models is <model><width_mult>_<dataset>. For example, mn10_as denotes a MobileNetV3 with width_mult=1.0, pre-trained on AudioSet; dymn is the prefix for a Dynamic MobileNet.
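This naming scheme can be decoded mechanically. Below is a minimal sketch using a hypothetical helper that is not part of the repository:

```python
import re

# Hypothetical helper (not in the repository) decoding the
# <model><width_mult>_<dataset> convention, e.g. "dymn10_as".

def parse_model_name(name: str) -> dict:
    m = re.match(r"(dymn|mn)(\d+)_(\w+)", name)
    if m is None:
        raise ValueError(f"unrecognized model name: {name}")
    prefix, width, tail = m.groups()
    return {
        "dynamic": prefix == "dymn",
        "width_mult": int(width) / 10,  # "10" -> 1.0, "04" -> 0.4
        "dataset": tail,                # dataset tag plus suffixes, e.g. "as_ext"
    }

print(parse_model_name("mn40_as_ext"))
```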
All available models are pre-trained on ImageNet [9] by default (exceptions are denoted by no_im_pre), followed by training on AudioSet [4]. Some results appear slightly better than those reported in the papers; we provide the best models in this repository, while the papers report averages over multiple runs.
| Model Name | Config | Params (Millions) | MACs (Billions) | Performance (mAP) |
|-------------------|----------------------------------------------------|-------------------|-----------------|-------------------|
| dymn04_as | width_mult=0.4 | 1.97 | 0.12 | 45.0 |
| dymn10_as | width_mult=1.0 | 10.57 | 0.58 | 47.7 |
| dymn20_as | width_mult=2.0 | 40.02 | 2.2 | 49.1 |
| mn04_as | width_mult=0.4 | 0.983 | 0.11 | 43.2 |
| mn05_as | width_mult=0.5 | 1.43 | 0.16 | 44.3 |
| mn10_as | width_mult=1.0 | 4.88 | 0.54 | 47.1 |
| mn20_as | width_mult=2.0 | 17.91 | 2.06 | 47.8 |
| mn30_as | width_mult=3.0 | 39.09 | 4.55 | 48.2 |
| mn40_as | width_mult=4.0 | 68.43 | 8.03 | 48.4 |
| mn40_as_ext | width_mult=4.0,<br/>extended training (300 epochs) | 68.43 | 8.03 | 48.7 |
| mn40_as_no_im_pre | width_mult=4.0, no ImageNet pre-training | 68.43 | 8.03 | 48.3 |
| mn10_as_hop_15 | width_mult=1.0 | 4.88 | 0.36 | 46.3 |
| mn10_as_hop_20 | width_mult=1.0 | 4.88 | 0.27 | 45.6 |
| mn10_as_hop_25 | width_mult=1.0 | 4.88 | 0.22 | 44.7 |
| mn10_as_mels_40 | width_mult=1.0 | 4.88 | 0.21 | 45.3 |
| mn10_as_mels_64 | width_mult=1.0 | 4.88 | 0.27 | 46.1 |
| mn10_as_mels_256 | width_mult=1.0 | 4.88 | 1.08 | 47.4 |
| MN Ensemble | width_mult=4.0, 9 Models | 615.87 | 72.27 | 49.8 |
MN Ensemble denotes an ensemble of 9 different mn40 models (3x mn40_as, 3x mn40_as_ext, 3x mn40_as_no_im_pre).
Note that computational complexity strongly depends on the resolution of the spectrograms. Our default is 128 mel bands and a hop size of 10 ms.
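MACs grow roughly in proportion to the number of time frames, which the hop size determines. The sketch below reproduces the trend for mn10_as in the table above; the strict proportionality is an approximation, not the exact profiler computation:

```python
# Number of spectrogram frames, and thus roughly the MACs, scales
# inversely with the hop size. Baseline: mn10_as at 0.54 GMACs,
# 10 ms hop, 10-second input.

def num_frames(duration_s: float, hop_ms: float) -> int:
    return int(duration_s * 1000 / hop_ms)

def estimated_gmacs(base_gmacs: float, base_hop_ms: float, hop_ms: float) -> float:
    # Approximation: MACs proportional to the number of time frames.
    return base_gmacs * base_hop_ms / hop_ms

for hop in (10, 15, 20, 25):
    print(hop, num_frames(10, hop), round(estimated_gmacs(0.54, 10, hop), 2))
```

The estimates (0.54, 0.36, 0.27, 0.22 GMACs) closely match the mn10_as_hop_* rows in the table.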
Inference
You can use the pre-trained models for inference on an audio file using the inference.py script.
For example, use dymn10_as to detect acoustic events at a metro station in Paris:

```
python inference.py --cuda --model_name=dymn10_as --audio_path="resources/metro_station-paris.wav"
```
This will result in the following output, showing the 10 events detected with the highest probability:

```
************* Acoustic Event Detected: *****************
Train: 0.747
Subway, metro, underground: 0.599
Rail transport: 0.493
Railroad car, train wagon: 0.445
Vehicle: 0.360
Clickety-clack: 0.105
Speech: 0.053
Sliding door: 0.036
Outside, urban or manmade: 0.035
Music: 0.017
********************************************************
```
You can also use an ensemble to perform inference, e.g.:

```
python inference.py --ensemble dymn20_as mn40_as_ext mn40_as --cuda --audio_path=resources/metro_station-paris.wav
```
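Ensembling combines the per-class predictions of the individual models. A minimal sketch assuming a simple average of per-class probabilities (the numbers are made up for illustration, and the repository's exact combination rule may differ):

```python
# Average the per-class probabilities of several models.

def ensemble_probs(per_model_probs):
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [
        sum(p[c] for p in per_model_probs) / n_models
        for c in range(n_classes)
    ]

probs = ensemble_probs([
    [0.8, 0.1, 0.3],   # hypothetical outputs of dymn20_as
    [0.7, 0.2, 0.2],   # hypothetical outputs of mn40_as_ext
    [0.6, 0.3, 0.1],   # hypothetical outputs of mn40_as
])
print([round(p, 2) for p in probs])
```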
Important: All models are trained with half precision (float16). If you run float32 inference on CPU, you might notice a slight performance degradation.
Quality of extracted Audio Embeddings
As shown in the paper Low-Complexity Audio Embeddings Extractors (published at EUSIPCO 2023), MNs are excellent at extracting high-quality audio embeddings. Check out the repository EfficientAT_HEAR for further details and the results on the HEAR Benchmark.
Train and Evaluate on AudioSet
The training and evaluation