<p align="center"> <a href="https://github.com/Ashigarg123/ShiftySpeech"> <img src="https://github.com/Ashigarg123/ShiftySpeech/blob/main/Images/shiftyspeech_logo.png" alt="ShiftySpeech Logo" width="500" height="auto"> </a> </p>

🌀 ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts

This is the official repository of ShiftySpeech – a diverse, large-scale dataset of 3000+ hours of synthetic speech generated by a range of TTS systems and vocoders, covering multiple distribution shifts.

🔥 Key Features

  • 3000+ hours of synthetic speech
  • Diverse Distribution Shifts: The dataset spans 7 key distribution shifts, including:
    • 📖 Reading Style
    • 🎙️ Podcast
    • 🎥 YouTube
    • 🗣️ Languages (Three different languages)
    • 🌎 Demographics (including variations in age, accent, and gender)
  • Multiple Speech Generation Systems: Includes data synthesized from various TTS models and vocoders.

💡 Why We Built This Dataset

Driven by advances in self-supervised learning for speech, state-of-the-art synthetic speech detectors have achieved low error rates on popular benchmarks such as ASVspoof. However, prior benchmarks do not address the wide range of real-world variability in speech. Are reported error rates realistic in real-world conditions? To assess detector failure modes and robustness under controlled distribution shifts, we introduce ShiftySpeech, a benchmark with more than 3000 hours of synthetic speech from 7 domains, 6 TTS systems, 12 vocoders, and 3 languages.

Downloading the dataset

The dataset can be downloaded from HuggingFace.

Dataset Structure

The dataset is structured as follows:

/ShiftySpeech
├── Vocoders/
│   ├── vocoder-1/
│   │   ├── vocoder-1_aishell_flac.tar.gz
│   │   ├── vocoder-1_jsut_flac.tar.gz
│   │   ├── vocoder-1_youtube_flac.tar.gz
│   │   ├── vocoder-1_audiobook_flac.tar.gz
│   │   ├── vocoder-1_podcast_flac.tar.gz
│   │   ├── vocoder-1_voxceleb_test_flac.tar.gz
│   │   ├── vocoder-1_commonvoice_flac.tar.gz
│   │   ├── vocoder-1_ljspeech_flac.tar.gz
│   │   └── vocoder-1_train-clean-360_flac.tar.gz
│   └── vocoder-2/
│       └── ...
└── TTS/                        # Speech generated by multiple TTS systems
    ├── TTS_Grad-TTS.tar.gz
    ├── TTS_Glow-TTS.tar.gz
    ├── TTS_FastPitch.tar.gz
    ├── TTS_VITS.tar.gz
    ├── TTS_XTTS.tar.gz
    ├── TTS_YourTTS.tar.gz
    └── TTS_hfg_vocoded.tar.gz  # Synthetic speech generated with a HiFi-GAN vocoder trained on LJSpeech

The source datasets covered by different TTS and Vocoder systems are listed in tts.yaml and vocoders.yaml
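Once an archive from the layout above has been downloaded, it can be unpacked and its audio files enumerated with the Python standard library. This is a minimal sketch: the example path in the comment follows the layout above, and the assumption that members inside the tarballs end in `.flac` comes from the archive names.

```python
import tarfile

def list_flac_members(archive_path):
    """Return the names of .flac files inside a downloaded archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.name.endswith(".flac"))

# Example (illustrative path, following the layout above):
# list_flac_members("ShiftySpeech/Vocoders/vocoder-1/vocoder-1_ljspeech_flac.tar.gz")
```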

The source datasets for the synthetic speech above can be downloaded using the following links:

Wav lists corresponding to the source datasets can be found here:

   dataset/wav_lists

We use the WaveFake dataset for training; it can be downloaded from here. The train, dev, and test splits used for WaveFake can be found in dataset/train_wav_list.

Downloading the pre-trained models

Pre-trained models can be downloaded from here

Individual models can be downloaded following the folder structure below. For example, to download hifigan.pth:

wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/single_vocoder/hifigan.pth
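The checkpoint URLs all follow the same `resolve/main` pattern, so a small helper can compose them. Only the repo id and the `single_vocoder/hifigan.pth` path are taken from the example above; treating every checkpoint in the folder structure as reachable this way is an assumption.

```python
HF_BASE = "https://huggingface.co/ash56/ShiftySpeech/resolve/main"

def checkpoint_url(subdir, filename):
    """Compose the direct-download URL for one pre-trained detector checkpoint."""
    return f"{HF_BASE}/detection-systems/{subdir}/{filename}"

# e.g. checkpoint_url("single_vocoder", "hifigan.pth") reproduces the wget URL above
```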

Folder structure:

detection-systems/
│
├── single_vocoder/          # Detectors trained on synthetic speech from a single vocoder
│   ├── hifigan.pth          # Synthetic speech derived from the HiFi-GAN vocoder
│   ├── melgan.pth           # Synthetic speech derived from the MelGAN vocoder
│   └── ...
│
├── leave_one_out/           # Detectors trained on synthetic speech from multiple vocoders
│   ├── leave_hifigan.pth    # Synthetic speech derived from all vocoders other than HiFi-GAN
│   ├── leave_melgan.pth
│   └── ...
│
├── num_spks/                # Detectors trained on synthetic speech from ``n`` speakers
│   ├── exp-1/               # Round one of randomly selecting speakers to train on
│   │   ├── spk1.pt          # Synthetic speech derived from a single speaker
│   │   └── spks4.pt         # Synthetic speech derived from four speakers
│   ├── ...
│   └── exp-5/
│       ├── spk1.pt
│       └── spks4.pt
│
└── augmentations/           # Detectors trained on synthetic speech from the HiFi-GAN vocoder
    └── hfg_aug_1_2.pt       # Augmentations applied during training:
                             #   (1) linear and non-linear convolutive noise
                             #   (2) impulsive signal-dependent additive noise

Reproducing Experiments

To reproduce the experiments, follow the instructions below.


Environment Setup

conda create -n SSL python=3.10.14
conda activate SSL 
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.1 -c pytorch -c nvidia

Download and install fairseq from here

pip install -r requirements.txt

Download the XLS-R model from here and add its path in /synthetic_speech_detection/SSL_Anti-spoofing/model.py

Prepare the data

  • Create an evaluation file in the following format:

    <real_audio_path> bonafide
    <synthetic_audio_path> spoof

  • bonafide denotes genuine audio.
  • spoof denotes synthetic (fake) audio.

The helper script for creating the evaluation file is prepare_test.py. See below for an example usage:

python dataset/prepare_test.py --bona_dir dataset/real-data/jsut --spoof_dir dataset/Shiftyspeech/Vocoders/bigvgan/jsut_flac --save_path dataset/test_files/bigvgan_jsut_test.txt
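The evaluation-file format is simple enough to generate directly if you prefer not to use the helper script. The sketch below mirrors what `prepare_test.py` presumably does (label real and synthetic .flac files recursively); the actual script's options and behaviour may differ.

```python
from pathlib import Path

def write_eval_file(bona_dir, spoof_dir, save_path):
    """Write '<path> bonafide' and '<path> spoof' lines in the format above."""
    lines = [f"{p} bonafide" for p in sorted(Path(bona_dir).rglob("*.flac"))]
    lines += [f"{p} spoof" for p in sorted(Path(spoof_dir).rglob("*.flac"))]
    Path(save_path).write_text("\n".join(lines) + "\n")
```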

🚀Evaluating In-the-Wild

  • Download the In-the-Wild dataset from here and create the evaluation file in the format above.
  • Evaluate the model trained on HiFiGAN-generated utterances, with and without augmentations. Download the pre-trained models:
wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/single_vocoder/hifigan.pth
wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/augmentations/hfg_aug_1_2.pt

Evaluate using SSL-AASIST model:

python train.py --eval \
    --test_score_dir <path_to_save_scores> \
    --model_name SSL-AASIST \
    --test_list_path <path_to_test_file_with_utterances_and_labels> \
    --model_path <path_to_downloaded_model>
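Synthetic-speech detectors are typically compared by equal error rate (EER). The exact score-file format written by `train.py` is not documented here, so this sketch takes already-loaded score lists (higher score = more bona fide) rather than parsing a file.

```python
def equal_error_rate(bona_scores, spoof_scores):
    """Approximate EER: sweep thresholds over the observed scores and return
    the mean of FRR and FAR at the point where the two rates are closest."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(bona_scores) | set(spoof_scores)):
        frr = sum(s < t for s in bona_scores) / len(bona_scores)     # bona fide rejected
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)  # spoof accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```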

You can also train models on the WaveFake dataset using the following command:

python train.py --trn_list_path <path_to_train_file> \ 
                --dev_list_path <path_to_dev_file> \
                --save_path <path_to_save_checkpoint> 

For the following experiments, create the evaluation files for all distribution shifts as described in the Prepare the data section.

Impact of Synthetic Speech Selection

Here, we study the impact of training on a single vocoder vs. multiple vocoders. In addition, we analyze the effect of training on one vocoder vs. another.

  • Load the pre-trained model saved in models/pre-trained

Example evaluation:

python train.py --eval \
    --test_score_dir dataset/test_scores \
    --model_name SSL-AASIST \
    --test_list_path dataset/test_files/bigvgan_jsut.txt \
    --model_path models/pre-trained/hifigan.pth

🗣️Training on more speakers

We analyze the impact of training a detector on a single speaker vs. multiple speakers; the number of training speakers varies from 1 to 10. We release pre-trained models trained on one speaker and on four speakers. The training data is LibriTTS. Five different models are trained for both the single- and multi-speaker experiments, with speakers selected at random.
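The random per-round speaker selection described above can be sketched as follows; the seeding scheme (one deterministic RNG per experiment round) is an assumption for reproducibility, not the paper's exact procedure.

```python
import random

def pick_speakers(all_speakers, n, round_idx):
    """Pick n training speakers for experiment round round_idx, reproducibly."""
    rng = random.Random(round_idx)  # one seed per round (assumed scheme)
    return sorted(rng.sample(sorted(all_speakers), n))
```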

You can download the pre-trained models as follows:

wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/num_spks/exp-1/spk1.pt

Similarly, the single-speaker model for experiment 2 can be downloaded with:

wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/num_spks/exp-2/spk1.pt

Example evaluation:

python train.py --eval \
    --test_score_dir dataset/test_scores \
    --model_name SSL-AASIST \
    --test_list_path dataset/test_files/bigvgan_jsut.txt \
    --model_path models/pre-trained/spk1.pt

Newly released vocoder

Next, we include new vocoders in training in chronological order of their release. For vocoder systems not included in the WaveFake dataset, we release the generated training samples, which can be downloaded from the corresponding folder:


wget https://huggingface.co/datasets/ash56/ShiftySpeech/resolve/main/Vocoders/<vocoder_of_choice>/ljspeech_flac.tar.gz

Trained models can then be evaluated on the distribution shifts in the same way as above.
