
GAME: Generative Adaptive MIDI Extractor

Overview

GAME is the upgraded successor to SOME, designed for transcribing singing voice into music scores.

Highlights

  1. Generative boundary extraction: trade off quality and speed through D3PM (Structured Denoising Diffusion Models in Discrete State-Spaces).
  2. Adaptive architecture: notes and pitches can align and adapt to known boundaries.
  3. Robust model: works on dirty or separated vocals mixed with noise, reverb, or even accompaniment.
  4. Multilingual support: choose the right language or a similar one to improve the segmentation results.
  5. Thresholds of boundaries and note presence are adjustable.
  6. Produces floating-point pitch values, just as SOME does.

Use cases

  1. Transcribe unlabeled raw singing voice waveforms into music scores, in MIDI format.
  2. Align notes to labeled word boundaries, in dataset processing scenarios.
  3. Estimate note pitches from note boundaries adjusted by the user in interactive tuning tools.

Installation

GAME is tested under Python 3.12, PyTorch 2.8.0, CUDA 12.9, and Lightning 2.6.1, but it should be compatible with other recent versions.

Step 1: It is recommended to start from a clean, isolated uv or Conda environment with a suitable Python version.

Step 2: Install the latest version of PyTorch following its official website.

Step 3: Run:

pip install -r requirements.txt

Step 4: If you want to use pretrained models, download them from releases or discussions.
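
The steps above might look like this in practice (a sketch using Conda; the environment name is illustrative, and the exact PyTorch install command should be taken from the selector on the official PyTorch website):

```shell
# Step 1: a clean, isolated environment with a suitable Python version.
conda create -n game python=3.12
conda activate game

# Step 2: install the latest PyTorch (pick the command matching your
# platform and CUDA version from the official website).
pip install torch

# Step 3: install the remaining dependencies.
pip install -r requirements.txt
```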

Inference

Transcribe raw audio files

The inference script can process single or multiple audio files.

python infer.py extract [path-or-directory] -m [model-path]

By default, MIDI files are saved beside each audio file in the same directory. Text formats (.txt and .csv) are also supported.

For example, transcribing all WAV files in a directory:

python infer.py extract /path/to/audio/dir/ -m /path/to/model.pt --glob *.wav --output-formats mid,txt,csv

For detailed descriptions of more functionalities and options, please run the following command:

python infer.py extract --help

Process singing voice datasets

The inference script is compatible with the DiffSinger dataset format. Each dataset contains a wavs folder including all audio files, and a CSV file with the following columns: name for item names, ph_seq for phoneme names, ph_dur for phoneme durations, and ph_num for word spans. The script can process single or multiple datasets.

python infer.py align [path-or-glob] -m [model-path]

For example, processing a single dataset:

python infer.py align transcriptions.csv -m /path/to/model.pt --save-path transcriptions-midi.csv

Processing all datasets matched by a glob pattern:

python infer.py align *.transcriptions.csv -m /path/to/model.pt --save-name transcriptions-midi.csv

Prediction results are inserted into (or replace existing columns in) the CSV: note_seq for note names, note_dur for note durations, and note_slur for slur flags. The note_glide column, if present, is removed because the model does not support glide types.
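
As an illustration of the column layout, here is a toy row with hypothetical values (the real columns are produced by the align command):

```python
import csv
import io

# A toy DiffSinger-style transcription row: the input columns plus the
# note columns described above. All values are illustrative.
row = {
    "name": "item1-1",
    "ph_seq": "n i h ao SP",
    "ph_dur": "0.05 0.07 0.05 0.16 0.07",
    "ph_num": "1 2 1 1",
    # Columns inserted by the align command:
    "note_seq": "rest C4 D4 E4 rest",
    "note_dur": "0.05 0.12 0.08 0.08 0.07",
    "note_slur": "0 0 0 1 0",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buf.getvalue().splitlines()[0])
# name,ph_seq,ph_dur,ph_num,note_seq,note_dur,note_slur
```

Note that ph_num sums to the number of phonemes, while the three note columns always have equal lengths.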

For detailed descriptions of more functionalities and options, please run the following command:

python infer.py align --help

[!IMPORTANT]

Notice for v/uv flags and word-note alignment

Word boundaries have slightly different definitions between DiffSinger and GAME:

  • In DiffSinger, some special unvoiced tags like AP (breathing) and SP (space) are considered as independent words, with boundaries between them.
  • In GAME, consecutive unvoiced notes are merged into whole unvoiced regions, with no boundaries inside.

To improve the alignment of v/uv flags between words and notes, consecutive unvoiced words are also merged before inference. This is done automatically by the inference API and does not affect the original phoneme sequence. Here is an example of v/uv flags and word-note alignment:

ph_seq       | n  |  i   | h  |      ao       |  SP  |   AP   |  => phoneme names
ph_dur       |0.05| 0.07 |0.05|     0.16      | 0.07 |  0.09  |  => phoneme durations
ph_num       | 1  |     2     |       1       |  1   |   1    |  => word spans
word_dur     |0.05|   0.12    |     0.16      | 0.07 |  0.09  |  => word durations
word_vuv     | 0  |     1     |       1       |  0   |   0    |  => word v/uv
word_dur_m   |0.05|   0.12    |     0.16      |     0.16      |  => word durations (merged)
word_vuv_m   | 0  |    1      |       1       |      0        |  => word v/uv (merged)
note_seq     | C4 |    C4     |  D4   |  E4   |      E4       |  => note names (predicted)
note_vuv     | 0  |    1      |   1   |   1   |       0       |  => note v/uv (predicted)
note_dur     |0.05|    0.12   | 0.08  | 0.08  |     0.16      |  => note durations (predicted)
note_seq_a   |rest|    C4     |  D4   |  E4   | rest |  rest  |  => note names (aligned)
note_dur_a   |0.05|    0.12   | 0.08  | 0.08  | 0.07 |  0.09  |  => note durations (aligned)
note_slur    | 0  |     0     |   0   |   1   |  0   |   0    |  => note slur flags (aligned)
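
The merging shown in the table can be sketched in a few lines (a toy reimplementation under stated assumptions, not the repository's code; merge_unvoiced_words is a hypothetical name):

```python
# A sketch of the merging illustrated above. Assumes `lead` mode: a
# word is unvoiced if its leading phoneme is in the unvoiced set.

def merge_unvoiced_words(ph_seq, ph_dur, ph_num, uv_set):
    # Group phonemes into words according to ph_num.
    words, durs, i = [], [], 0
    for n in ph_num:
        words.append(ph_seq[i:i + n])
        durs.append(round(sum(ph_dur[i:i + n]), 6))
        i += n
    # Judge word v/uv by the leading phoneme (lead mode).
    vuv = [0 if w[0] in uv_set else 1 for w in words]
    # Merge runs of consecutive unvoiced words into single regions.
    merged_dur, merged_vuv = [], []
    for d, v in zip(durs, vuv):
        if v == 0 and merged_vuv and merged_vuv[-1] == 0:
            merged_dur[-1] = round(merged_dur[-1] + d, 6)
        else:
            merged_dur.append(d)
            merged_vuv.append(v)
    return merged_dur, merged_vuv

# The "n i h ao SP AP" example from the table:
dur_m, vuv_m = merge_unvoiced_words(
    ["n", "i", "h", "ao", "SP", "AP"],
    [0.05, 0.07, 0.05, 0.16, 0.07, 0.09],
    [1, 2, 1, 1, 1],
    uv_set={"SP", "AP", "n"},  # "n" included only to reproduce the table
)
print(dur_m)  # [0.05, 0.12, 0.16, 0.16]
print(vuv_m)  # [0, 1, 1, 0]
```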

By default, a word is considered unvoiced if its leading phoneme hits a built-in unvoiced phoneme set, and note v/uv flags are predicted by the model. This logic can be controlled through the following options:

  • --uv-vocab and --uv-vocab-path define the unvoiced phoneme set.
  • --uv-word-cond sets the condition for judging a word as unvoiced.
    • lead (default): If the leading phoneme is unvoiced, the word is unvoiced. This is enough for most cases, because normal words do not begin with unvoiced special tags. In this mode, you only need to define the special tags in the unvoiced phoneme set.
    • all: If all phonemes are unvoiced, the word is unvoiced. This is the most precise way to judge unvoiced words, but you need to define all special tags and consonants in the unvoiced phoneme set.
  • --uv-note-cond sets the condition for judging a note as unvoiced.
    • predict (default): Note v/uv flags are predicted by the model and decoded with a threshold.
    • follow: Note v/uv flags follow word v/uv flags. If you use this mode, you still need to define all special tags and consonants in the unvoiced phoneme set (because sometimes a word contains only a single consonant).
  • --no-wb bypasses all of the logic above: no word-note alignment is performed, and everything is predicted purely by the model. No note_slur column will be written, since word information is unavailable. Not recommended.

Training

Data preparation

  1. Singing voice dataset with labeled music scores. Each subset includes an index.csv. File structure:

    path/to/datasets/
    ├── dataset1/
    │   ├── index.csv
    │   ├── waveforms/
    │   │   ├── item1-1.wav
    │   │   ├── item1-2.wav
    │   │   ├── ...
    ├── dataset2/
    │   ├── index.csv
    │   ├── waveforms/
    │   │   ├── item2-1.wav
    │   │   ├── item2-2.wav
    │   │   ├── ...
    ├── ...
    

    Each index.csv contains the following columns:

    • name: audio file name (without suffix).
    • language (optional): code of the singing language, e.g. zh.
    • notes: note pitch sequence split by spaces, e.g. rest E3-3 G3+17 D3-9 (nearest note name plus signed deviation in cents). Note names in this format can be obtained with librosa.
    • durations: note durations (in seconds) split by spaces, e.g. 1.570 0.878 0.722 0.70.
  2. Natural noise datasets (optional). Collect any type of noise or accompaniment and put it into a directory. Be careful not to include singing voice or clear speech.

  3. Reverb datasets (optional). Put a series of Room Impulse Response (RIR) kernels in a directory, usually in WAV format. MB-RIRs is recommended.
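For the notes column format (nearest note name plus signed cents deviation), a minimal conversion sketch might look like this (hz_to_note_label is a hypothetical helper written from scratch; librosa offers hz_to_note for the note-name part):

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def hz_to_note_label(f0: float) -> str:
    """Convert a frequency in Hz to a label like 'E3-3' or 'G3+17':
    the nearest equal-tempered note plus the deviation in cents.
    Returns 'rest' for non-positive input."""
    if f0 <= 0:
        return "rest"
    # MIDI number relative to A4 = 440 Hz (MIDI 69).
    midi = 69 + 12 * math.log2(f0 / 440.0)
    nearest = round(midi)
    cents = round((midi - nearest) * 100)
    name = NOTE_NAMES[nearest % 12] + str(nearest // 12 - 1)
    return name if cents == 0 else f"{name}{cents:+d}"

print(hz_to_note_label(440.0))  # A4
print(hz_to_note_label(442.0))  # A4+8
```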

Configuration

This repository uses an inheritable configuration system based on the YAML format. Each configuration file can derive from others through the bases key. Also, in the preprocessing, training, and evaluation scripts, configurations can be overridden with dotlist-style CLI options like --override key.path=value.

Most training hyperparameters and framework options are stored in configs/base.yaml, while model hyperparameters and data-related options are stored in configs/midi.yaml. You can also organize your own inheritance structure.
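
The inheritance and override mechanics can be sketched with plain dicts (a toy model, not the repository's YAML loader; function names and config values here are hypothetical):

```python
def merge(base: dict, child: dict) -> dict:
    """Recursively merge child over base; child values win."""
    out = dict(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

def apply_override(config: dict, dotlist: str) -> dict:
    """Apply a dotlist-style CLI override like 'a.b.c=value' in place."""
    path, _, raw = dotlist.partition("=")
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = raw
    return config

base = {"training": {"lr": "1e-4", "max_steps": "100000"}}    # like configs/base.yaml
midi = {"training": {"lr": "5e-5"}, "model": {"dim": "512"}}  # like configs/midi.yaml
config = merge(base, midi)                                    # midi derives from base
config = apply_override(config, "training.max_steps=200000")  # --override on the CLI
print(config["training"])  # {'lr': '5e-5', 'max_steps': '200000'}
```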

Configure your dataset paths in the configuration:

binarizer:
  data_dir: "data/notes"  # <-- singing voice dataset with labeled music scores

training:
  augmentation:
    natural_noise:
      enabled: true  # <-- false if you don't use natural noise
      noise_path_glob: "data/noise/**/*.wav"  # <-- natural noise datasets
    rir_reverb:
      enabled: true  # <-- false if you don't use reverb
      kernel_path_glob: "data/reverb/**/*.wav"  # <-- reverb datasets

The default configuration trains a model with ~50M parameters and consumes ~20 GB of GPU memory. Before proceeding, it is recommended to read the rest of the configuration files and edit them according to your needs and hardware.

Preprocessing

Run the following command to preprocess the raw dataset:

python binarize.py --config [config-path]

Please note that only the singing voice dataset and its labels are processed here. The trainer applies augmentation online, so if you train models on another machine you must bring the complete singing voice, noise, and reverb datasets along with the binarized data.

Training

Run the following command to start a new training or resume from one:

python train.py --config [config-path] --exp-name [experiment-name]

By default, checkpoints and Lightning logs are stored in experiments/[experiment-name]/.
