# DiffSinger-KR
This code is an implementation of DiffSinger for Korean. The algorithm is based on the following papers:
- Liu, J., Li, C., Ren, Y., Chen, F., & Zhao, Z. (2022, June). Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 10, pp. 11020-11028).
- Xiao, Y., Wang, X., He, L., & Soong, F. K. (2022, May). Improving Fastspeech TTS with Efficient Self-Attention and Compact Feed-Forward Network. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7472-7476). IEEE.
## Structure

- The structure is based on DiffSinger, but I made some minor changes:
    - The multi-head attention in the FFT block is changed to linearized attention.
    - Positional encoding is removed.
    - Duration embedding is added. It is based on scaled positional encoding with a very low initial scale.
    - The aux decoder and the diffusion model are trained at the same time, not in two stages.
- I changed several hyper parameters and data types:
    - Either mel or spectrogram can be selected as the feature type.
    - The token type is changed from phoneme to grapheme.
    - Because of the supported vocoder, the sample rate of the model is changed to 22050 Hz.
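The swap from softmax multi-head attention to linearized attention can be sketched as below. This is a minimal single-head sketch assuming the common `elu(x) + 1` feature map; the repository's actual FFT-block implementation (feature map, number of heads, masking, normalization) may differ.

```python
import numpy as np

def linearized_attention(q, k, v, eps=1e-6):
    """Kernelized (linear) attention: O(T * d^2) instead of softmax's O(T^2 * d).

    Sketch only; the feature map phi(x) = elu(x) + 1 is an assumption.
    """
    # Positive feature map: x + 1 for x > 0, exp(x) otherwise (i.e. elu(x) + 1).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    q, k = phi(q), phi(k)                    # (T, d) each, all entries positive
    kv = k.T @ v                             # (d, d) summary of keys and values
    z = q @ k.sum(axis=0)                    # (T,) per-query normalizer
    return (q @ kv) / (z[:, None] + eps)     # (T, d)

T, d = 8, 4
rng = np.random.default_rng(0)
out = linearized_attention(rng.normal(size=(T, d)),
                           rng.normal(size=(T, d)),
                           rng.normal(size=(T, d)))
print(out.shape)  # (8, 4)
```

Because the `(d, d)` key-value summary is built once, the cost grows linearly with sequence length, which is the motivation for using it in place of softmax attention.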
## Supported datasets

| Using | Dataset                                | Dataset Link |
|-------|----------------------------------------|--------------|
| O     | Children's Song Dataset                | Link         |
| X     | AIHub Korean Multi-Singer Song Dataset | Link         |

- I fixed some MIDI scores to match the notes with the F0 of the wav files.
- The CSD dataset is used for training the shared checkpoint.
- Pattern_Generate.py supports the AIHub dataset, but I did not use it for training the shared checkpoint.
## Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in `Hyper_Parameters.yaml` according to your environment.

- `Sound`
    - Sets the basic sound parameters.
- `Tokens`
    - The number of lyric tokens.
- `Notes`
    - The highest note value for embedding.
- `Durations`
    - The highest duration value for embedding.
- `Genres`
    - Sets the number of genres.
- `Singers`
    - Sets the number of singers.
- `Duration`
    - `Min_Duration` is used at pattern generation only.
    - `Max_Duration` decides the maximum time step of the model. The MLP mixer always uses the maximum time step.
    - `Equality` sets the strategy for distributing a syllable's duration over its graphemes:
        - When `True`, the onset, nucleus, and coda have the same length, or a difference of at most ±1.
        - When `False`, the onset and coda have length `Consonant_Duration`, and the nucleus has length `duration - 2 * Consonant_Duration`.
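The two `Equality` strategies can be illustrated with a small sketch. The function name, its defaults, and the rounding used when `Equality` is `True` are assumptions for illustration, not the repository's exact code.

```python
def split_syllable_duration(duration, consonant_duration=3, equality=False):
    """Split one syllable's duration across (onset, nucleus, coda) graphemes.

    Illustrative sketch of the `Equality` hyper parameter described above;
    names and rounding behavior are assumptions.
    """
    if equality:
        # Near-equal shares: the three lengths differ by at most 1.
        base, rest = divmod(duration, 3)
        return tuple(base + (1 if i < rest else 0) for i in range(3))
    # Fixed-length consonants; the nucleus absorbs the remainder.
    return (consonant_duration,
            duration - 2 * consonant_duration,
            consonant_duration)

print(split_syllable_duration(10, equality=True))         # (4, 3, 3)
print(split_syllable_duration(10, consonant_duration=3))  # (3, 4, 3)
```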
- `Feature_Type`
    - Sets the feature type (`Mel` or `Spectrogram`).
- `Encoder`
    - Sets the encoder (embedding).
- `Diffusion`
    - Sets the diffusion denoiser.
- `Train`
    - Sets the training parameters.
- `Inference_Batch_Size`
    - Sets the batch size for inference.
- `Inference_Path`
    - Sets the inference path.
- `Checkpoint_Path`
    - Sets the checkpoint path.
- `Log_Path`
    - Sets the TensorBoard log path.
- `Use_Mixed_Precision`
    - Sets whether mixed precision is used.
- `Use_Multi_GPU`
    - Sets whether multiple GPUs are used.
    - Because of an nvcc problem, only Linux supports this option.
    - If this is `True`, the `Device` parameter must also list multiple GPUs, like `'0,1,2,3'`, and you have to change the training command as well: please check `multi_gpu.sh`.
- `Device`
    - Sets which GPU devices are used in a multi-GPU environment.
    - If using only the CPU, set `'-1'`. (I don't recommend this for training.)
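As a quick orientation, the path-related entries might look like the fragment below. The key names follow the parameter list above, but the exact names and layout in `Hyper_Parameters.yaml` may differ, so treat this as an illustrative sketch, not the shipped config.

```yaml
# Illustrative fragment only; check Hyper_Parameters.yaml for the real keys.
Inference_Path: ./results/Inference
Checkpoint_Path: ./results/Checkpoint
Log_Path: ./results/Log
Use_Mixed_Precision: true
Use_Multi_GPU: false
Device: '0'
```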
## Generate patterns

```
python Pattern_Generate.py [parameters]
```
### Parameters

- `-csd`
    - The path of the Children's Song Dataset.
- `-am`
    - The path of the AIHub Multi-Singer Song Dataset.
- `-step`
    - The note step that is explored when generating patterns.
    - The smaller the step is, the more patterns are created from one song.
- `-hp`
    - The path of the hyper parameter file.
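If `-step` is interpreted as the semitone spacing between explored key shifts, the effect on pattern count can be sketched as follows. The function, the note range bounds, and this reading of the step exploration are assumptions; see `Pattern_Generate.py` for the actual logic.

```python
def note_shift_candidates(notes, step, low=40, high=80):
    """Enumerate key-shifted copies of a note sequence, `step` semitones apart.

    Sketch of the '-step' idea described above: a smaller step yields more
    shifted patterns per song. Range bounds (low, high) are assumptions.
    """
    max_up = high - max(notes)      # largest upward shift that stays in range
    max_down = min(notes) - low     # largest downward shift that stays in range
    offsets = sorted(set(range(0, max_up + 1, step)) |
                     set(range(0, -max_down - 1, -step)))
    return [[n + o for n in notes] for o in offsets]

patterns = note_shift_candidates([60, 62, 64], step=5)
print(len(patterns))  # 8 shifted copies, including the unshifted original
```

With `step=1` the same sequence would produce a shifted copy for every in-range semitone offset, which is why smaller steps create more patterns.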
## Training

### Command

#### Single GPU

```
python Train.py -hp <path> -s <int>
```
- `-hp <path>`
    - The hyper parameter file path.
    - This is required.
- `-s <int>`
    - The resume step parameter.
    - Default is `0`.
    - If the value is `0`, the model tries to search for the latest checkpoint.
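The "search for the latest checkpoint" behavior can be sketched as a highest-step scan over the checkpoint directory. The `S_<step>.pt` filename pattern is an assumption for illustration, not necessarily this repository's actual naming scheme.

```python
import os
import re
import tempfile

def latest_checkpoint_step(checkpoint_dir):
    """Return the highest step among 'S_<step>.pt' checkpoints, or 0 if none.

    Sketch of the '-s 0' resume behavior described above; the filename
    pattern is an assumption.
    """
    steps = [int(m.group(1))
             for name in os.listdir(checkpoint_dir)
             if (m := re.fullmatch(r'S_(\d+)\.pt', name))]
    return max(steps, default=0)  # 0 -> start training from scratch

# Fake checkpoint directory to exercise the search.
with tempfile.TemporaryDirectory() as d:
    for step in (1000, 25000, 5000):
        open(os.path.join(d, f'S_{step}.pt'), 'w').close()
    found = latest_checkpoint_step(d)
print(found)  # 25000
```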
#### Multi GPU

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=32 python -m torch.distributed.launch --nproc_per_node=8 Train.py --hyper_parameters Hyper_Parameters.yaml --port 54322
```

- I recommend checking `multi_gpu.sh`.
## Inference

- Please check `Inference.ipynb`.

## Checkpoint

- Please check the Huggingface Space.
## TODO

- Multi-singer version training with the AIHub Multi-Singer Song Dataset.