# DiffSinger-KR
This code is an implementation of DiffSinger for Korean. The algorithm is based on the following papers:
- Liu, J., Li, C., Ren, Y., Chen, F., & Zhao, Z. (2022, June). Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 10, pp. 11020-11028).
- Xiao, Y., Wang, X., He, L., & Soong, F. K. (2022, May). Improving Fastspeech TTS with Efficient Self-Attention and Compact Feed-Forward Network. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7472-7476). IEEE.
## Structure

- The structure is based on DiffSinger, but I made some minor changes:
    - The multi-head attention in the FFT block is changed to linearized attention.
    - Positional encoding is removed.
    - Duration embedding is added. It is based on scaled positional encoding with a very low initial scale.
    - The aux decoder and the diffusion model are trained at the same time, not in two stages.
- I changed several hyper parameters and data types:
    - Either mel or spectrogram can be selected as the feature type.
    - The token type is changed from phoneme to grapheme.
    - Because of the supported vocoder, the sample rate of the model is changed to 22050 Hz.
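The swap from softmax multi-head attention to linearized attention can be sketched as below. This is a minimal single-head sketch assuming the common `elu(x) + 1` feature map; the repository's actual FFT-block implementation (feature map, number of heads, masking, normalization) may differ.

```python
import numpy as np

def linearized_attention(q, k, v, eps=1e-6):
    """Kernelized (linear) attention: O(T * d^2) instead of softmax's O(T^2 * d).

    Sketch only; the feature map phi(x) = elu(x) + 1 is an assumption.
    """
    # Positive feature map: x + 1 for x > 0, exp(x) otherwise (i.e. elu(x) + 1).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    q, k = phi(q), phi(k)                    # (T, d) each, all entries positive
    kv = k.T @ v                             # (d, d) summary of keys and values
    z = q @ k.sum(axis=0)                    # (T,) per-query normalizer
    return (q @ kv) / (z[:, None] + eps)     # (T, d)

T, d = 8, 4
rng = np.random.default_rng(0)
out = linearized_attention(rng.normal(size=(T, d)),
                           rng.normal(size=(T, d)),
                           rng.normal(size=(T, d)))
print(out.shape)  # (8, 4)
```

Because the `(d, d)` key-value summary is built once, the cost grows linearly with sequence length, which is the motivation for using it in place of softmax attention.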
## Supported datasets

| Using | Dataset                                | Dataset Link |
|-------|----------------------------------------|--------------|
| O     | Children's Song Dataset                | Link         |
| X     | AIHub Korean Multi-Singer Song Dataset | Link         |

- I fixed some MIDI scores to match the notes with the F0 of the wav files.
- The CSD dataset is used for training the shared checkpoint.
- Pattern_Generate.py supports the AIHub dataset, but I did not use it for training the shared checkpoint.
## Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in `Hyper_Parameters.yaml` according to your environment.

- `Sound`
    - Sets the basic sound parameters.
- `Tokens`
    - The number of lyric tokens.
- `Notes`
    - The highest note value for embedding.
- `Durations`
    - The highest duration value for embedding.
- `Genres`
    - Sets the number of genres.
- `Singers`
    - Sets the number of singers.
- `Duration`
    - `Min_Duration` is used at pattern generation only.
    - `Max_Duration` decides the maximum time step of the model. The MLP mixer always uses the maximum time step.
    - `Equality` sets the strategy for distributing a syllable's duration over its graphemes:
        - When `True`, the onset, nucleus, and coda have the same length, or a difference of at most ±1.
        - When `False`, the onset and coda have length `Consonant_Duration`, and the nucleus has length `duration - 2 * Consonant_Duration`.
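The two `Equality` strategies can be illustrated with a small sketch. The function name, its defaults, and the rounding used when `Equality` is `True` are assumptions for illustration, not the repository's exact code.

```python
def split_syllable_duration(duration, consonant_duration=3, equality=False):
    """Split one syllable's duration across (onset, nucleus, coda) graphemes.

    Illustrative sketch of the `Equality` hyper parameter described above;
    names and rounding behavior are assumptions.
    """
    if equality:
        # Near-equal shares: the three lengths differ by at most 1.
        base, rest = divmod(duration, 3)
        return tuple(base + (1 if i < rest else 0) for i in range(3))
    # Fixed-length consonants; the nucleus absorbs the remainder.
    return (consonant_duration,
            duration - 2 * consonant_duration,
            consonant_duration)

print(split_syllable_duration(10, equality=True))         # (4, 3, 3)
print(split_syllable_duration(10, consonant_duration=3))  # (3, 4, 3)
```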
- `Feature_Type`
    - Sets the feature type (`Mel` or `Spectrogram`).
- `Encoder`
    - Sets the encoder (embedding).
- `Diffusion`
    - Sets the diffusion denoiser.
- `Train`
    - Sets the training parameters.
- `Inference_Batch_Size`
    - Sets the batch size for inference.
- `Inference_Path`
    - Sets the inference path.
- `Checkpoint_Path`
    - Sets the checkpoint path.
- `Log_Path`
    - Sets the TensorBoard log path.
- `Use_Mixed_Precision`
    - Sets whether mixed precision is used.
- `Use_Multi_GPU`
    - Sets whether multiple GPUs are used.
    - Because of an nvcc problem, only Linux supports this option.
    - If this is `True`, the `Device` parameter must also list multiple GPUs, like `'0,1,2,3'`, and you have to change the training command as well: please check `multi_gpu.sh`.
- `Device`
    - Sets which GPU devices are used in a multi-GPU environment.
    - If using only the CPU, set `'-1'`. (I don't recommend this for training.)
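As a quick orientation, the path-related entries might look like the fragment below. The key names follow the parameter list above, but the exact names and layout in `Hyper_Parameters.yaml` may differ, so treat this as an illustrative sketch, not the shipped config.

```yaml
# Illustrative fragment only; check Hyper_Parameters.yaml for the real keys.
Inference_Path: ./results/Inference
Checkpoint_Path: ./results/Checkpoint
Log_Path: ./results/Log
Use_Mixed_Precision: true
Use_Multi_GPU: false
Device: '0'
```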
## Generate patterns

```
python Pattern_Generate.py [parameters]
```
### Parameters

- `-csd`
    - The path of the Children's Song Dataset.
- `-am`
    - The path of the AIHub Multi-Singer Song Dataset.
- `-step`
    - The note step that is explored when generating patterns.
    - The smaller the step is, the more patterns are created from one song.
- `-hp`
    - The path of the hyper parameter file.
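If `-step` is interpreted as the semitone spacing between explored key shifts, the effect on pattern count can be sketched as follows. The function, the note range bounds, and this reading of the step exploration are assumptions; see `Pattern_Generate.py` for the actual logic.

```python
def note_shift_candidates(notes, step, low=40, high=80):
    """Enumerate key-shifted copies of a note sequence, `step` semitones apart.

    Sketch of the '-step' idea described above: a smaller step yields more
    shifted patterns per song. Range bounds (low, high) are assumptions.
    """
    max_up = high - max(notes)      # largest upward shift that stays in range
    max_down = min(notes) - low     # largest downward shift that stays in range
    offsets = sorted(set(range(0, max_up + 1, step)) |
                     set(range(0, -max_down - 1, -step)))
    return [[n + o for n in notes] for o in offsets]

patterns = note_shift_candidates([60, 62, 64], step=5)
print(len(patterns))  # 8 shifted copies, including the unshifted original
```

With `step=1` the same sequence would produce a shifted copy for every in-range semitone offset, which is why smaller steps create more patterns.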
## Training

### Command

#### Single GPU

```
python Train.py -hp <path> -s <int>
```
- `-hp <path>`
    - The hyper parameter file path.
    - This is required.
- `-s <int>`
    - The resume step parameter.
    - Default is `0`.
    - If the value is `0`, the model tries to search for the latest checkpoint.
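The "search for the latest checkpoint" behavior can be sketched as a highest-step scan over the checkpoint directory. The `S_<step>.pt` filename pattern is an assumption for illustration, not necessarily this repository's actual naming scheme.

```python
import os
import re
import tempfile

def latest_checkpoint_step(checkpoint_dir):
    """Return the highest step among 'S_<step>.pt' checkpoints, or 0 if none.

    Sketch of the '-s 0' resume behavior described above; the filename
    pattern is an assumption.
    """
    steps = [int(m.group(1))
             for name in os.listdir(checkpoint_dir)
             if (m := re.fullmatch(r'S_(\d+)\.pt', name))]
    return max(steps, default=0)  # 0 -> start training from scratch

# Fake checkpoint directory to exercise the search.
with tempfile.TemporaryDirectory() as d:
    for step in (1000, 25000, 5000):
        open(os.path.join(d, f'S_{step}.pt'), 'w').close()
    found = latest_checkpoint_step(d)
print(found)  # 25000
```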
#### Multi GPU

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=32 python -m torch.distributed.launch --nproc_per_node=8 Train.py --hyper_parameters Hyper_Parameters.yaml --port 54322
```

- I recommend checking `multi_gpu.sh`.
## Inference

- Please check `Inference.ipynb`.

## Checkpoint

- Please check the Huggingface Space.
## TODO

- Multi-singer version training with the AIHub Multi-Singer Song Dataset.