OmniCodec

OmniCodec: Low Frame Rate Universal Audio Codec with Semantic–Acoustic Disentanglement

<p align="center"> <img src="assets/imgs/omnicodec.png" alt="OmniCodec" width="85%"/> </p>

Overview

This repo contains:

  • Training: train.py (Accelerate + GAN / WavLM-related losses per config)
  • Dataset: dataset.py (multi-domain mixing; loads audio paths from scp)
  • Inference: infer.py (reconstructs audio with a pretrained checkpoint)
  • Config: config/config_omnicodec.yaml

Environment

Requirements

Install the Python dependencies:

pip install -r requirements.txt

Notes:

  • requirements.txt contains an editable install line, -e OmniCodec/transformers-main. Make sure the referenced path exists in your environment, or adjust/remove that line if you already have a compatible transformers installed.

Data preparation (scp)

The training config expects three scp files (one per domain): speech, music, and sound.

Each line of an scp file can be either:

  • utt_id /abs/or/rel/path/to/audio.wav
  • /abs/or/rel/path/to/audio.wav (the utterance ID is inferred from the filename)

Example:

utt0001 /data/speech/utt0001.wav
utt0002 /data/speech/utt0002.wav
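The two line formats above can be parsed with a few lines of Python. This is a hedged sketch of that logic, not the actual implementation in dataset.py (the real parser may differ in details):

```python
import os

def parse_scp_line(line: str):
    """Parse one scp line into (utt_id, path).

    Supports both "utt_id /path/to/audio.wav" and a bare
    "/path/to/audio.wav", in which case the utterance ID is
    inferred from the filename.
    """
    parts = line.strip().split(maxsplit=1)
    if len(parts) == 2:
        utt_id, path = parts
    else:
        path = parts[0]
        # strip directory and extension to get the inferred utt_id
        utt_id = os.path.splitext(os.path.basename(path))[0]
    return utt_id, path
```

For example, parse_scp_line("/data/speech/utt0002.wav") yields ("utt0002", "/data/speech/utt0002.wav").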

What the dataset does

For each item, dataset.py will:

  • load audio with librosa.load(..., sr=sample_rate, mono=True)
  • apply librosa.util.normalize(wav) * 0.95
  • crop/pad/repeat to segment_size (default: 240000 samples @ 24kHz = 10s)
  • return a dict: {"wav": Tensor[T], "utt": str, "text": None}

Samples that fail to load return None and are filtered out by collate_fn in train.py.
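The crop/pad/repeat step and the None-filtering collate can be illustrated with a small sketch. Pure Python lists are used here for clarity; dataset.py operates on librosa/torch arrays, and whether short clips are repeated or zero-padded, and how the crop offset is chosen, are assumptions in this sketch:

```python
import random

def fit_to_segment(wav, segment_size, rng=random):
    """Return exactly segment_size samples from wav.

    - longer clips: take a random crop of segment_size samples
    - shorter clips: repeat the signal until it covers
      segment_size, then trim (one possible padding strategy)
    """
    n = len(wav)
    if n >= segment_size:
        start = rng.randrange(n - segment_size + 1)
        return wav[start:start + segment_size]
    reps = segment_size // n + 1
    return (wav * reps)[:segment_size]

def collate_filter(batch):
    """Drop items whose loading failed (returned None)."""
    return [item for item in batch if item is not None]
```

At 24 kHz, the default segment_size of 240000 samples corresponds to 10 seconds of audio.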

Configure

Edit config/config_omnicodec.yaml:

  • Data
    • data.speech_train_shards_dir: path to speech.scp
    • data.music_train_shards_dir: path to music.scp
    • data.sound_train_shards_dir: path to sound.scp
    • data.sample_rate: default 24000
    • data.segment_size: default 240000
  • Pretrained SSL (WavLM)
    • model.wavlmloss.ckpt_path: default pretrain_model/ssl/wavlm-base-plus
    • wav_lm_model: default pretrain_model/ssl/wavlm_model/wavlm
  • Output
    • train.save_dir: default ./exps/omnicodec
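Concretely, the keys above map onto the YAML roughly like this. The exact nesting is an assumption reconstructed from the dotted names; check config/config_omnicodec.yaml for the authoritative layout:

```yaml
data:
  speech_train_shards_dir: /path/to/speech.scp
  music_train_shards_dir: /path/to/music.scp
  sound_train_shards_dir: /path/to/sound.scp
  sample_rate: 24000
  segment_size: 240000

model:
  wavlmloss:
    ckpt_path: pretrain_model/ssl/wavlm-base-plus

wav_lm_model: pretrain_model/ssl/wavlm_model/wavlm

train:
  save_dir: ./exps/omnicodec
```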

Training

Run training with the provided config:

python train.py -c config/config_omnicodec.yaml

Checkpoints and logs are written to train.save_dir (default: ./exps/omnicodec).

Inference (reconstruction)

Prepare checkpoint

infer.py loads the checkpoint from:

  • pretrained_model/omnicodec.pth

Place your pretrained weights at that path (or edit infer.py to point to your checkpoint).

Run

Put test audio files in:

  • ./testset/speech/

Then run:

python infer.py -c config/config_omnicodec.yaml

Outputs will be written to:

  • ./outputs/

Project structure

.
├─ config/
│  └─ config_omnicodec.yaml
├─ dataset.py
├─ train.py
├─ infer.py
├─ models/
├─ modules/
├─ quantization/
├─ discriminators/
├─ losses/
├─ utils/
└─ requirements.txt

Acknowledgements

Citation

If you use this work, please cite:

@misc{hu2026omnicodeclowframerate,
      title={OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement}, 
      author={Jingbin Hu and Haoyu Zhang and Dake Guo and Qirui Zhan and Wenhao Li and Huakang Chen and Guobin Ma and Hanke Xie and Chengyou Wang and Pengyuan Xie and Chuan Xie and Qiang Zhang and Lei Xie},
      year={2026},
      eprint={2603.20638},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.20638}, 
}

License

See the repository license.
