# UniverSR

**Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching**

This is the official PyTorch implementation of UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching (ICASSP 2026): paper, demo page.
UniverSR performs audio super-resolution directly in the complex STFT domain using flow matching, without requiring a separate neural vocoder. A single model handles multiple input sample rates (8 / 12 / 16 / 24 kHz → 48 kHz) across speech, music, and sound effects.
<p align="center"> <img src="assets/overview.png" width="720" /> </p>

## ⚙️ Installation

```bash
pip install git+https://github.com/woongzip1/UniverSR.git
```
Or install from source:

```bash
git clone https://github.com/woongzip1/UniverSR.git
cd UniverSR
pip install -e .
```

For training, install the extra dependencies:

```bash
pip install -e ".[train]"
# or: pip install -r requirements.txt
```

Requirements: Python ≥ 3.10, PyTorch ≥ 2.0, CUDA ≥ 11.8
## 🚀 Quick Start

```python
import torchaudio
from universr import UniverSR

model = UniverSR.from_pretrained("woongzip1/universr-audio", device="cuda")

# Enhance a low-resolution audio file to 48 kHz
output = model.enhance("low_res_8k.wav", input_sr=8000)
torchaudio.save("output_48k.wav", output.cpu(), 48000)
```
### Inference Options

```python
output = model.enhance(
    "low_res.wav",
    input_sr=16000,          # 8000 / 12000 / 16000 / 24000
    ode_method="midpoint",   # 'euler', 'midpoint', 'rk4'
    ode_steps=4,             # number of ODE integration steps
    guidance_scale=1.5,      # classifier-free guidance (None = disabled)
)
```
### Handling 48 kHz Input with Limited Bandwidth

If your audio is already at 48 kHz but has limited bandwidth, set `input_sr` to the sample rate whose Nyquist frequency (`input_sr / 2`) matches the actual content bandwidth. For instance, if content exists only up to 8 kHz, use `input_sr=16000`. UniverSR automatically applies low-pass filtering to match the training pipeline before performing super-resolution.

```python
# File is 48 kHz, but content only up to 8 kHz → use input_sr=16000
output = model.enhance("fullband_file.wav", input_sr=16000)
```
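If you are unsure how much bandwidth a 48 kHz file actually contains, a spectral-rolloff estimate can suggest an `input_sr`. The helper below is hypothetical (not shipped with UniverSR) and uses a 99%-energy rolloff as a simple heuristic:

```python
import numpy as np

SUPPORTED = [8000, 12000, 16000, 24000]

def pick_input_sr(audio, sr=48000, energy_frac=0.99):
    """Estimate the spectral rolloff of a mono signal and map it to the
    smallest supported input_sr whose Nyquist covers the content.

    Hypothetical helper for illustration only.
    """
    spec = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    cum = np.cumsum(spec) / np.sum(spec)
    rolloff = freqs[np.searchsorted(cum, energy_frac)]
    for cand in SUPPORTED:
        if cand / 2 >= rolloff:
            return cand
    return SUPPORTED[-1]

# 48 kHz tone at 3.5 kHz → fits under the 4 kHz Nyquist of 8 kHz input
t = np.arange(48000) / 48000
print(pick_input_sr(np.sin(2 * np.pi * 3500 * t)))  # -> 8000
```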
## 🤗 Pretrained Models
| Model | Domain | HuggingFace |
|---|---|---|
| universr-audio | General audio (recommended) | woongzip1/universr-audio |
| universr-speech | Speech only | woongzip1/universr-speech |
To download a pretrained model locally:

```bash
# via CLI (requires: pip install huggingface_hub)
huggingface-cli download woongzip1/universr-audio --local-dir ./ckpts/universr-audio
```

```python
# or in Python
from huggingface_hub import snapshot_download
snapshot_download("woongzip1/universr-audio", local_dir="./ckpts/universr-audio")
```
### Loading from a Local Checkpoint

```python
model = UniverSR.from_local(
    ckpt_path="ckpts/best_model.pth",
    config_path="configs/config.yaml",
    device="cuda",
)
```
## 🔊 Batch Inference

Enhance all .wav files in a folder:

```bash
python scripts/inference.py \
    --input_dir [YOUR INPUT DIR] \
    --output_dir results/enhanced/ \
    --model woongzip1/universr-audio \
    --input-sr 8000 \
    --ode-method midpoint \
    --ode-steps 4
```
From a local checkpoint:

```bash
python scripts/inference.py \
    --input_dir [YOUR INPUT DIR] \
    --output_dir results/enhanced/ \
    --ckpt ckpts/best_model.pth \
    --config configs/config.yaml \
    --input-sr 8000 \
    --ode-method midpoint \
    --ode-steps 4
```
## 📊 Evaluation

### Compute Metrics (Generated vs. Ground Truth)

Compare enhanced outputs against ground-truth references using Log-Spectral Distance (LSD):

```bash
python scripts/compute_metrics.py \
    --reference_dir data/gt_48k/ \
    --output_dir results/enhanced/ \
    --cutoff-sr 8 \
    --output_json results/metrics.json
```
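For reference, LSD is commonly defined as the RMS over frequency, averaged over frames, of the log power-spectrum difference between reference and estimate. A minimal NumPy sketch of that common definition (the repository's `scripts/compute_metrics.py` may use different STFT settings):

```python
import numpy as np

def log_spectral_distance(ref, est, n_fft=2048, hop=512, eps=1e-8):
    """Log-spectral distance between two equal-length mono waveforms."""
    def stft_power(x):
        # Framed, Hann-windowed power spectrogram
        n_frames = 1 + (len(x) - n_fft) // hop
        win = np.hanning(n_fft)
        frames = np.stack([x[i * hop : i * hop + n_fft] * win
                           for i in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=-1)) ** 2

    log_diff = np.log10(stft_power(ref) + eps) - np.log10(stft_power(est) + eps)
    # RMS over frequency bins, then mean over frames
    return float(np.mean(np.sqrt(np.mean(log_diff ** 2, axis=-1))))

x = np.random.randn(48000)
print(log_spectral_distance(x, x))  # identical signals -> 0.0
```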
### Full Pipeline Evaluation

Run evaluation using the training data pipeline with on-the-fly LR generation:

```bash
python evaluate.py \
    -c configs/config.yaml \
    --ckpt ckpts/best_model.pth \
    --sampling-rate 8 \
    --wandb true
```
## 🏋️ Training

### 1. Prepare File Lists

UniverSR expects 48 kHz mono .wav files listed in a text file (one absolute path per line). Use the provided script to generate these lists from your audio directories:

```bash
python scripts/prepare_file_list.py \
    --dirs /path/to/dataset1 /path/to/dataset2 \
    --output data/train.txt
```

Or use `find` directly in your terminal:

```bash
find /path/to/dataset -name "*.wav" > data/train.txt
```
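Because the pipeline assumes 48 kHz mono input, it can be worth validating the list before training rather than failing at data-loading time. A hypothetical stdlib-only check (not part of the repository):

```python
import wave

def check_file_list(list_path, expect_sr=48000, expect_channels=1):
    """Return the files in a file list that are not 48 kHz mono WAV.

    Hypothetical helper: reads each listed path's WAV header and collects
    (path, sample_rate, channels) tuples for any mismatch.
    """
    bad = []
    with open(list_path) as f:
        for line in f:
            path = line.strip()
            if not path:
                continue  # skip blank lines
            with wave.open(path, "rb") as w:
                if (w.getframerate() != expect_sr
                        or w.getnchannels() != expect_channels):
                    bad.append((path, w.getframerate(), w.getnchannels()))
    return bad  # empty list means every file matches
```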
### 2. Configure Training

Point the config to your file lists:

```yaml
# configs/config.yaml
dataset:
  common:
    sr: 48000
    num_samples: 32767
  train:
    file_list: "./data/train.txt"
  val:
    file_list: "./data/val.txt"
```
Low-resolution inputs are generated on-the-fly during training via low-pass filtering and downsampling. The input sample rate distribution is controlled by `collator.sampling_rates_probs`:

```yaml
collator:
  sampling_rates_probs:
    8: 0.7   # 70% of batches at 8 kHz
    12: 0.1  # 10% at 12 kHz
    16: 0.1  # 10% at 16 kHz
    24: 0.1  # 10% at 24 kHz
```
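Conceptually, the collator draws one input rate per batch from this table and band-limits the 48 kHz audio accordingly. A simplified NumPy sketch of that idea, under the assumption of a brick-wall FFT low-pass (the actual pipeline low-pass filters and downsamples, and its filter design may differ):

```python
import numpy as np

# Probability table mirroring collator.sampling_rates_probs (rates in Hz)
RATE_PROBS = {8000: 0.7, 12000: 0.1, 16000: 0.1, 24000: 0.1}

def make_lr_batch(batch_48k, rng=np.random.default_rng(0)):
    """Draw one input rate for the batch and zero all content above its
    Nyquist via an FFT brick-wall low-pass (illustrative only)."""
    rates, probs = zip(*RATE_PROBS.items())
    lr_sr = int(rng.choice(rates, p=probs))
    spec = np.fft.rfft(batch_48k, axis=-1)
    freqs = np.fft.rfftfreq(batch_48k.shape[-1], d=1.0 / 48000)
    spec[..., freqs > lr_sr / 2] = 0.0  # remove everything above Nyquist
    return lr_sr, np.fft.irfft(spec, n=batch_48k.shape[-1], axis=-1)

sr, lr = make_lr_batch(np.random.randn(4, 32768))
```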
### 3. Start Training

```bash
python train.py -c configs/config.yaml --wandb true
```

**Note:** Please refer to the paper for dataset details. The pipeline works with any 48 kHz audio collection.
## ⚠️ Known Limitations & Tips

### Performance across input rates

The default training distribution allocates 70% of batches to 8 kHz and 10% each to 12/16/24 kHz, so 8 kHz → 48 kHz results tend to be strongest. Higher input rates (especially 24 kHz) may show weaker high-frequency reconstruction due to spectral aliasing near the cutoff introduced by the naive downsample-upsample augmentation pipeline.

To improve higher input rates, consider:

- Retraining with a more balanced `sampling_rates_probs` distribution.
- Using slightly lower `sr_to_lr_bins` values for 12/16/24 kHz (e.g. 120/160/240 instead of 128/170/256) to give the model a small overlap margin at the spectral boundary.
### Guidance scale recommendations

The classifier-free guidance (CFG) scale trades off objective fidelity with perceptual richness:

| Domain | Recommended `guidance_scale` |
|---|---|
| Speech | 1.0 – 1.5 |
| Music | 1.5 – 2.0 |
| Sound effects | 1.5 |

Higher values produce denser high-frequency structures but deviate more from the ground-truth reference. Setting `guidance_scale=None` or `0` disables CFG entirely.
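For intuition, CFG in flow matching typically blends conditional and unconditional velocity predictions; whether UniverSR uses exactly this formulation is an assumption, but the standard combination looks like this:

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, scale):
    """Standard classifier-free guidance combination (a sketch, not
    necessarily the repository's exact formulation): extrapolate from the
    unconditional prediction toward the conditional one."""
    if scale is None or scale == 0:
        return v_cond  # CFG disabled -> plain conditional prediction
    return v_uncond + scale * (v_cond - v_uncond)

# scale = 1.0 recovers the conditional velocity; scale > 1 extrapolates past it
vc, vu = np.array([1.0, 2.0]), np.array([0.0, 0.0])
print(guided_velocity(vc, vu, 1.5))  # -> [1.5 3. ]
```

Scales above 1.0 push the prediction further from the unconditional field, which matches the observation above: denser high-frequency detail, at the cost of fidelity to the reference.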
## 📖 Citation and License

We've released our code under the MIT License. If you find UniverSR useful in your research, please consider citing:

```bibtex
@article{choi2025universr,
  title   = {{UniverSR}: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching},
  author  = {Choi, Woongjib and Lee, Sangmin and Lim, Hyungseob and Kang, Hong-Goo},
  journal = {arXiv preprint arXiv:2510.00771},
  year    = {2025}
}

@inproceedings{choi2026universr,
  title     = {{UniverSR}: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching},
  author    = {Choi, Woongjib and Lee, Sangmin and Lim, Hyungseob and Kang, Hong-Goo},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year      = {2026}
}
```
## 🙏 Acknowledgments

This project was developed at DSPAI Lab, Yonsei University. We thank the following open-source projects that inspired parts of our codebase: