<p align="center"> <a href="https://github.com/Ashigarg123/ShiftySpeech"> <img src="https://github.com/Ashigarg123/ShiftySpeech/blob/main/Images/shiftyspeech_logo.png" alt="ShiftySpeech Logo" width="500" height="auto"> </a> </p>

🌀 ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts

This is the official repository of ShiftySpeech – a diverse, large-scale dataset of 3000+ hours of synthetic speech generated by a range of TTS systems and vocoders, covering multiple distribution shifts.

🔥 Key Features

  • 3000+ hours of synthetic speech
  • Diverse Distribution Shifts: The dataset spans 7 key distribution shifts, including:
    • 📖 Reading Style
    • 🎙️ Podcast
    • 🎥 YouTube
    • 🗣️ Languages (Three different languages)
    • 🌎 Demographics (including variations in age, accent, and gender)
  • Multiple Speech Generation Systems: Includes data synthesized from various TTS models and vocoders.

💡 Why We Built This Dataset

Driven by advances in self-supervised learning for speech, state-of-the-art synthetic speech detectors have achieved low error rates on popular benchmarks such as ASVspoof. However, prior benchmarks do not address the wide range of real-world variability in speech. Are reported error rates realistic in real-world conditions? To assess detector failure modes and robustness under controlled distribution shifts, we introduce ShiftySpeech, a benchmark with more than 3000 hours of synthetic speech from 7 domains, 6 TTS systems, 12 vocoders, and 3 languages.

Downloading the dataset

The dataset can be downloaded from HuggingFace.

Dataset Structure

The dataset is structured as follows:

/ShiftySpeech
├── Vocoders/
│   ├── vocoder-1/
│   │   ├── vocoder-1_aishell_flac.tar.gz
│   │   ├── vocoder-1_jsut_flac.tar.gz
│   │   ├── vocoder-1_youtube_flac.tar.gz
│   │   ├── vocoder-1_audiobook_flac.tar.gz
│   │   ├── vocoder-1_podcast_flac.tar.gz
│   │   ├── vocoder-1_voxceleb_test_flac.tar.gz
│   │   ├── vocoder-1_commonvoice_flac.tar.gz
│   │   ├── vocoder-1_ljspeech_flac.tar.gz
│   │   └── vocoder-1_train-clean-360_flac.tar.gz
│   └── vocoder-2/
│       └── ...
└── TTS/                        # Speech generated by multiple TTS systems
    ├── TTS_Grad-TTS.tar.gz
    ├── TTS_Glow-TTS.tar.gz
    ├── TTS_FastPitch.tar.gz
    ├── TTS_VITS.tar.gz
    ├── TTS_XTTS.tar.gz
    ├── TTS_YourTTS.tar.gz
    └── TTS_hfg_vocoded.tar.gz  # Synthetic speech generated with a HiFi-GAN vocoder trained on LJSpeech

The source datasets covered by different TTS and Vocoder systems are listed in tts.yaml and vocoders.yaml
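Once an archive from the layout above has been downloaded, it can be unpacked and its audio files enumerated with the Python standard library. This is a minimal sketch: the example path in the comment follows the layout above, and the assumption that members inside the tarballs end in `.flac` comes from the archive names.

```python
import tarfile

def list_flac_members(archive_path):
    """Return the names of .flac files inside a downloaded archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.name.endswith(".flac"))

# Example (illustrative path, following the layout above):
# list_flac_members("ShiftySpeech/Vocoders/vocoder-1/vocoder-1_ljspeech_flac.tar.gz")
```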

The source datasets for the synthetic speech above can be downloaded using the following links:

Wav lists corresponding to the source datasets can be found here:

   dataset/wav_lists

We use the WaveFake dataset for training; it can be downloaded from here. The train, dev, and test splits used for WaveFake can be found in dataset/train_wav_list.

Downloading the pre-trained models

Pre-trained models can be downloaded from here

Individual models can be downloaded following the folder structure below. For example, to download hifigan.pth:

wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/single_vocoder/hifigan.pth
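The checkpoint URLs all follow the same `resolve/main` pattern, so a small helper can compose them. Only the repo id and the `single_vocoder/hifigan.pth` path are taken from the example above; treating every checkpoint in the folder structure as reachable this way is an assumption.

```python
HF_BASE = "https://huggingface.co/ash56/ShiftySpeech/resolve/main"

def checkpoint_url(subdir, filename):
    """Compose the direct-download URL for one pre-trained detector checkpoint."""
    return f"{HF_BASE}/detection-systems/{subdir}/{filename}"

# e.g. checkpoint_url("single_vocoder", "hifigan.pth") reproduces the wget URL above
```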

Folder structure:

detection-systems/
│
├── single_vocoder/          # Detectors trained on synthetic speech from a single vocoder
│   ├── hifigan.pth          # Synthetic speech derived from the HiFi-GAN vocoder
│   ├── melgan.pth           # Synthetic speech derived from the MelGAN vocoder
│   └── ...
│
├── leave_one_out/           # Detectors trained on synthetic speech from multiple vocoders
│   ├── leave_hifigan.pth    # Synthetic speech derived from all vocoders other than HiFi-GAN
│   ├── leave_melgan.pth
│   └── ...
│
├── num_spks/                # Detectors trained on synthetic speech from ``n`` speakers
│   ├── exp-1/               # Round one of randomly selecting speakers to train on
│   │   ├── spk1.pt          # Synthetic speech derived from a single speaker
│   │   └── spks4.pt         # Synthetic speech derived from four speakers
│   ├── ...
│   └── exp-5/
│       ├── spk1.pt
│       └── spks4.pt
│
└── augmentations/           # Detectors trained on synthetic speech from the HiFi-GAN vocoder
    └── hfg_aug_1_2.pt       # Augmentations applied during training:
                             #   (1) linear and non-linear convolutive noise
                             #   (2) impulsive signal-dependent additive noise

Reproducing Experiments

To reproduce the experiments, follow the instructions below.


Environment Setup

conda create -n SSL python=3.10.14
conda activate SSL 
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.1 -c pytorch -c nvidia

Download and install fairseq from here

pip install -r requirements.txt

Download the XLS-R model from here and add its path in /synthetic_speech_detection/SSL_Anti-spoofing/model.py

Prepare the data

  • Create an evaluation file in the following format:

    <real_audio_path> bonafide
    <synthetic_audio_path> spoof

  • bonafide denotes genuine audio.
  • spoof denotes synthetic (fake) audio.

The helper script for creating the evaluation file is prepare_test.py. See below for an example usage:

python dataset/prepare_test.py --bona_dir dataset/real-data/jsut --spoof_dir dataset/Shiftyspeech/Vocoders/bigvgan/jsut_flac --save_path dataset/test_files/bigvgan_jsut_test.txt
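The evaluation-file format is simple enough to generate directly if you prefer not to use the helper script. The sketch below mirrors what `prepare_test.py` presumably does (label real and synthetic .flac files recursively); the actual script's options and behaviour may differ.

```python
from pathlib import Path

def write_eval_file(bona_dir, spoof_dir, save_path):
    """Write '<path> bonafide' and '<path> spoof' lines in the format above."""
    lines = [f"{p} bonafide" for p in sorted(Path(bona_dir).rglob("*.flac"))]
    lines += [f"{p} spoof" for p in sorted(Path(spoof_dir).rglob("*.flac"))]
    Path(save_path).write_text("\n".join(lines) + "\n")
```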

🚀Evaluating In-the-Wild

  • Download the In-the-Wild dataset from here and create the evaluation file in the format above.
  • Evaluate the model trained on HiFiGAN-generated utterances, with and without augmentations. Download the pre-trained models:
wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/single_vocoder/hifigan.pth
wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/augmentations/hfg_aug_1_2.pt

Evaluate using SSL-AASIST model:

python train.py --eval \
    --test_score_dir <path_to_save_scores> \
    --model_name SSL-AASIST \
    --test_list_path <path_to_test_file_with_utterances_and_labels> \
    --model_path <path_to_downloaded_model>
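Synthetic-speech detectors are typically compared by equal error rate (EER). The exact score-file format written by `train.py` is not documented here, so this sketch takes already-loaded score lists (higher score = more bona fide) rather than parsing a file.

```python
def equal_error_rate(bona_scores, spoof_scores):
    """Approximate EER: sweep thresholds over the observed scores and return
    the mean of FRR and FAR at the point where the two rates are closest."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(bona_scores) | set(spoof_scores)):
        frr = sum(s < t for s in bona_scores) / len(bona_scores)     # bona fide rejected
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)  # spoof accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```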

You can also train models on the WaveFake dataset using the following command:

python train.py --trn_list_path <path_to_train_file> \ 
                --dev_list_path <path_to_dev_file> \
                --save_path <path_to_save_checkpoint> 

For the following experiments, create the evaluation files for all distribution shifts as described in the Prepare the data section.

Impact of Synthetic Speech Selection

Here, we study the impact of training on a single vocoder vs. multiple vocoders. In addition, we analyze the effect of training on one vocoder vs. another.

  • Load the pre-trained model saved in models/pre-trained

Example evaluation:

python train.py --eval \
    --test_score_dir dataset/test_scores \
    --model_name SSL-AASIST \
    --test_list_path dataset/test_files/bigvgan_jsut.txt \
    --model_path models/pre-trained/hifigan.pth

🗣️Training on more speakers

We analyze the impact of training a detector on a single speaker vs. multiple speakers; the number of training speakers varies from 1 to 10. We release pre-trained models trained on one speaker and on four speakers. The training data is LibriTTS. Five different models are trained for both the single- and multi-speaker experiments, with speakers selected at random.
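The random per-round speaker selection described above can be sketched as follows; the seeding scheme (one deterministic RNG per experiment round) is an assumption for reproducibility, not the paper's exact procedure.

```python
import random

def pick_speakers(all_speakers, n, round_idx):
    """Pick n training speakers for experiment round round_idx, reproducibly."""
    rng = random.Random(round_idx)  # one seed per round (assumed scheme)
    return sorted(rng.sample(sorted(all_speakers), n))
```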

You can download the pre-trained models as follows:

wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/num_spks/exp-1/spk1.pt

Similarly, the single-speaker model for experiment 2 can be downloaded with:

wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/num_spks/exp-2/spk1.pt

Example evaluation:

python train.py --eval \
    --test_score_dir dataset/test_scores \
    --model_name SSL-AASIST \
    --test_list_path dataset/test_files/bigvgan_jsut.txt \
    --model_path models/pre-trained/spk1.pt

Newly released vocoder

Next, we include new vocoders in training in chronological order of their release. For vocoder systems not included in the WaveFake dataset, we release the generated training samples, which can be downloaded from the corresponding folder:


wget https://huggingface.co/datasets/ash56/ShiftySpeech/resolve/main/Vocoders/<vocoder_of_choice>/ljspeech_flac.tar.gz

Trained models can then be evaluated on the distribution shifts in the same way as above.
