UniverSR

Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching

arXiv Demo

This is the official PyTorch implementation of UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching (ICASSP 2026): paper, demo page.

UniverSR performs audio super-resolution directly in the complex STFT domain using flow matching, without requiring a separate neural vocoder. A single model handles multiple input sample rates (8 / 12 / 16 / 24 kHz → 48 kHz) across speech, music, and sound effects.

<p align="center"> <img src="assets/overview.png" width="720" /> </p>

⚙️ Installation

pip install git+https://github.com/woongzip1/UniverSR.git

Or install from source:

git clone https://github.com/woongzip1/UniverSR.git
cd UniverSR
pip install -e .

For training, install extra dependencies:

pip install -e ".[train]"
# or: pip install -r requirements.txt

Requirements: Python ≥ 3.10, PyTorch ≥ 2.0, CUDA ≥ 11.8


🚀 Quick Start

import torchaudio
from universr import UniverSR

model = UniverSR.from_pretrained("woongzip1/universr-audio", device="cuda")

# Enhance a low-resolution audio file to 48 kHz
output = model.enhance("low_res_8k.wav", input_sr=8000)
torchaudio.save("output_48k.wav", output.cpu(), 48000)

Inference Options

output = model.enhance(
    "low_res.wav",
    input_sr=16000,          # 8000 / 12000 / 16000 / 24000
    ode_method="midpoint",   # 'euler', 'midpoint', 'rk4'
    ode_steps=4,             # number of ODE integration steps
    guidance_scale=1.5,      # classifier-free guidance (None = disabled)
)
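
The `ode_method` and `ode_steps` options control how the learned flow-matching velocity field is integrated from noise to the target over t ∈ [0, 1]. As a minimal sketch (not the repo's solver — a hypothetical `integrate` helper on a scalar toy field with a known solution), this shows why `midpoint` is typically more accurate than `euler` at the same step budget:

```python
import math

def integrate(v, x0, steps, method="euler"):
    """Fixed-step ODE solve of dx/dt = v(x, t) from t = 0 to t = 1."""
    x, h = x0, 1.0 / steps
    for n in range(steps):
        t = n * h
        if method == "euler":
            x = x + h * v(x, t)
        elif method == "midpoint":
            k = v(x, t)                                   # slope at interval start
            x = x + h * v(x + 0.5 * h * k, t + 0.5 * h)   # re-evaluate at midpoint
        else:
            raise ValueError(f"unknown method: {method}")
    return x

# Toy velocity field with known solution x(1) = x0 * exp(-1)
v = lambda x, t: -x
exact = math.exp(-1)
err_euler = abs(integrate(v, 1.0, steps=4, method="euler") - exact)
err_mid = abs(integrate(v, 1.0, steps=4, method="midpoint") - exact)
assert err_mid < err_euler   # midpoint wins at the same step count
```

This is why `ode_steps=4` can suffice with `midpoint` while `euler` would need more steps for comparable accuracy.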

Handling 48 kHz Input with Limited Bandwidth

If your audio is already stored at 48 kHz but is band-limited, set input_sr to twice the highest frequency actually present, since content extends up to the Nyquist frequency input_sr / 2. For instance, if the content reaches only 8 kHz, use input_sr=16000. UniverSR then automatically applies low-pass filtering to match the training pipeline before performing super-resolution.

# File is 48 kHz, but content only up to 8 kHz → use input_sr=16000
output = model.enhance("fullband_file.wav", input_sr=16000)

🤗 Pretrained Models

| Model | Domain | HuggingFace |
|---|---|---|
| universr-audio | General audio (recommended) | woongzip1/universr-audio |
| universr-speech | Speech only | woongzip1/universr-speech |

To download a pretrained model locally:

# via CLI (requires: pip install huggingface_hub)
huggingface-cli download woongzip1/universr-audio --local-dir ./ckpts/universr-audio

# or in Python
from huggingface_hub import snapshot_download
snapshot_download("woongzip1/universr-audio", local_dir="./ckpts/universr-audio")

Loading from a Local Checkpoint

model = UniverSR.from_local(
    ckpt_path="ckpts/best_model.pth",
    config_path="configs/config.yaml",
    device="cuda",
)

🔊 Batch Inference

Enhance all .wav files in a folder:

python scripts/inference.py \
    --input_dir [YOUR INPUT DIR] \
    --output_dir results/enhanced/ \
    --model woongzip1/universr-audio \
    --input-sr 8000 \
    --ode-method midpoint \
    --ode-steps 4

From a local checkpoint:

python scripts/inference.py \
    --input_dir [YOUR INPUT DIR] \
    --output_dir results/enhanced/ \
    --ckpt ckpts/best_model.pth \
    --config configs/config.yaml \
    --input-sr 8000 \
    --ode-method midpoint \
    --ode-steps 4

📊 Evaluation

Compute Metrics (Generated vs. Ground Truth)

Compare enhanced outputs against ground-truth references using Log-Spectral Distance (LSD):

python scripts/compute_metrics.py \
    --reference_dir data/gt_48k/ \
    --output_dir results/enhanced/ \
    --cutoff-sr 8 \
    --output_json results/metrics.json
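
For reference, LSD is usually defined as the frame-wise RMS of the log power-spectrum difference, averaged over frames. A minimal NumPy sketch of that common definition (the script's exact STFT settings and variant may differ):

```python
import numpy as np

def lsd(ref, est, n_fft=2048, hop=512, eps=1e-9):
    """Log-spectral distance: per-frame RMS of the log10 power-spectrum
    difference, averaged over frames. Lower is better."""
    def power_spec(x):
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
        return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2

    log_diff = np.log10(power_spec(ref) + eps) - np.log10(power_spec(est) + eps)
    return float(np.mean(np.sqrt(np.mean(log_diff ** 2, axis=1))))

rng = np.random.default_rng(0)
x = rng.standard_normal(48000)
assert lsd(x, x) < 1e-6   # identical signals score ~0
```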

Full Pipeline Evaluation

Run evaluation using the training data pipeline with on-the-fly LR generation:

python evaluate.py \
    -c configs/config.yaml \
    --ckpt ckpts/best_model.pth \
    --sampling-rate 8 \
    --wandb true

🏋️ Training

1. Prepare File Lists

UniverSR expects 48 kHz mono .wav files listed in a text file (one absolute path per line). Use the provided script to generate these lists from your audio directories:

python scripts/prepare_file_list.py \
    --dirs /path/to/dataset1 /path/to/dataset2 \
    --output data/train.txt

Or use find directly in your terminal:

find /path/to/dataset -name "*.wav" > data/train.txt

2. Configure Training

Point the config to your file lists:

# configs/config.yaml
dataset:
  common:
    sr: 48000
    num_samples: 32767
  train:
    file_list: "./data/train.txt"
  val:
    file_list: "./data/val.txt"

Low-resolution inputs are generated on-the-fly during training via low-pass filtering and downsampling. The input sample rate distribution is controlled by collator.sampling_rates_probs:

collator:
  sampling_rates_probs:
    8: 0.7    # 70% of batches at 8 kHz
    12: 0.1   # 10% at 12 kHz
    16: 0.1   # 10% at 16 kHz
    24: 0.1   # 10% at 24 kHz
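
Conceptually, the on-the-fly LR generation reduces a 48 kHz clip to the bandwidth a lower-rate recording would carry. A brick-wall FFT sketch of just the bandwidth-reduction step (the actual pipeline uses proper low-pass filters and resampling, not this idealized filter):

```python
import numpy as np

def simulate_lr(wav_48k, input_sr, full_sr=48000):
    """Zero all content above input_sr / 2 (ideal brick-wall low-pass),
    emulating the bandwidth a true input_sr recording would contain."""
    spec = np.fft.rfft(wav_48k)
    freqs = np.fft.rfftfreq(len(wav_48k), d=1.0 / full_sr)
    spec[freqs > input_sr / 2] = 0.0
    return np.fft.irfft(spec, n=len(wav_48k))

rng = np.random.default_rng(0)
x = rng.standard_normal(48000)
lr = simulate_lr(x, input_sr=8000)
mag = np.abs(np.fft.rfft(lr))
freqs = np.fft.rfftfreq(len(lr), d=1.0 / 48000)
assert mag[freqs > 4000].max() < 1e-6   # no energy above the 4 kHz cutoff
```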

3. Start Training

python train.py -c configs/config.yaml --wandb true

Note: Please refer to the paper for dataset details. The pipeline works with any 48 kHz audio collection.


⚠️ Known Limitations & Tips

Performance across input rates

The default training distribution allocates 70% of batches to 8 kHz and 10% each to 12/16/24 kHz, so 8 kHz→48 kHz results tend to be strongest. Higher input rates (especially 24 kHz) may show weaker high-frequency reconstruction due to spectral aliasing near the cutoff introduced by the naive downsample-upsample augmentation pipeline.
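
The folding mechanism behind that aliasing can be demonstrated with an unfiltered decimation: a 5 kHz tone decimated from 48 kHz to 8 kHz exceeds the new 4 kHz Nyquist limit and reappears at 3 kHz (a toy NumPy illustration, not the repo's augmentation code):

```python
import numpy as np

fs = 48000
t = np.arange(fs) / fs                 # one second at 48 kHz
tone = np.sin(2 * np.pi * 5000 * t)    # 5 kHz tone

lr = tone[::6]                         # naive 48 kHz -> 8 kHz decimation, no filter
mag = np.abs(np.fft.rfft(lr))
freqs = np.fft.rfftfreq(len(lr), d=6 / fs)
peak_hz = freqs[np.argmax(mag)]
assert abs(peak_hz - 3000) < 5         # 5 kHz folds to 8000 - 5000 = 3000 Hz
```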

To improve higher input rates, consider:

  • Retraining with a more balanced sampling_rates_probs distribution.
  • Using slightly lower sr_to_lr_bins values for 12/16/24 kHz (e.g. 120/160/240 instead of 128/170/256) to give the model a small overlap margin at the spectral boundary.

Guidance scale recommendations

The classifier-free guidance (CFG) scale trades off objective fidelity with perceptual richness:

| Domain | Recommended guidance_scale |
|---|---|
| Speech | 1.0 – 1.5 |
| Music | 1.5 – 2.0 |
| Sound effects | 1.5 |

Higher values produce denser high-frequency structures but deviate more from the ground-truth reference. Setting guidance_scale=None or 0 disables CFG entirely.
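
The standard CFG rule extrapolates the conditional prediction away from the unconditional one. A sketch of that formula applied to the flow field (a hypothetical `cfg_velocity` helper; the repo's exact rule, including how `None`/`0` is handled, may differ):

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: push the conditional velocity away from
    the unconditional one by `scale`. None/0 disables guidance entirely
    (assumption: falls back to the plain conditional prediction)."""
    if scale is None or scale == 0:
        return v_cond
    return v_uncond + scale * (v_cond - v_uncond)

v_cond, v_uncond = np.array([1.0, 2.0]), np.zeros(2)
guided = cfg_velocity(v_cond, v_uncond, 1.5)   # amplifies the conditional direction
```

Note that `scale=1` reduces to the conditional prediction, and values above 1 sharpen conditioning at the cost of fidelity, matching the table above.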


📖 Citation and License

We've released our code under the MIT License. If you find UniverSR useful in your research, please consider citing:

@article{choi2025universr,
  title={{UniverSR}: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching},
  author={Choi, Woongjib and Lee, Sangmin and Lim, Hyungseob and Kang, Hong-Goo},
  journal={arXiv preprint arXiv:2510.00771},
  year={2025}
}

@inproceedings{choi2026universr,
  title     = {{UniverSR}: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching},
  author    = {Choi, Woongjib and Lee, Sangmin and Lim, Hyungseob and Kang, Hong-Goo},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year      = {2026}
}

🙏 Acknowledgments

This project was developed at DSPAI Lab, Yonsei University. We also thank the open-source projects that inspired parts of our codebase.
