🌀 ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts
This is the official repository of ShiftySpeech – a diverse and extensive dataset containing 3000+ hours of synthetic speech generated using various TTS systems and vocoders, while covering multiple distribution shifts.
🔥 Key Features
- 3000+ hours of synthetic speech
- Diverse Distribution Shifts: The dataset spans 7 key distribution shifts, including:
- 📖 Reading Style
- 🎙️ Podcast
- 🎥 YouTube
- 🗣️ Languages (three different languages)
- 🌎 Demographics (including variations in age, accent, and gender)
- Multiple Speech Generation Systems: Includes data synthesized from various TTS models and vocoders.
💡 Why We Built This Dataset
Driven by advances in self-supervised learning for speech, state-of-the-art synthetic speech detectors have achieved low error rates on popular benchmarks such as ASVspoof. However, prior benchmarks do not address the wide range of real-world variability in speech. Are reported error rates realistic in real-world conditions? To assess detector failure modes and robustness under controlled distribution shifts, we introduce ShiftySpeech, a benchmark with more than 3000 hours of synthetic speech from 7 domains, 6 TTS systems, 12 vocoders, and 3 languages.
Downloading the dataset
The dataset can be downloaded from HuggingFace.
Dataset Structure
The dataset is structured as follows:
```
/ShiftySpeech
├── Vocoders/
│   ├── vocoder-1/
│   │   ├── vocoder-1_aishell_flac.tar.gz
│   │   ├── vocoder-1_jsut_flac.tar.gz
│   │   ├── vocoder-1_youtube_flac.tar.gz
│   │   ├── vocoder-1_audiobook_flac.tar.gz
│   │   ├── vocoder-1_podcast_flac.tar.gz
│   │   ├── vocoder-1_voxceleb_test_flac.tar.gz
│   │   ├── vocoder-1_commonvoice_flac.tar.gz
│   │   ├── vocoder-1_ljspeech_flac.tar.gz
│   │   └── vocoder-1_train-clean-360_flac.tar.gz
│   ├── vocoder-2/
│   └── ...
├── TTS/                        # Speech generated by multiple TTS systems
│   ├── TTS_Grad-TTS.tar.gz
│   ├── TTS_Glow-TTS.tar.gz
│   ├── TTS_FastPitch.tar.gz
│   ├── TTS_VITS.tar.gz
│   ├── TTS_XTTS.tar.gz
│   ├── TTS_YourTTS.tar.gz
│   └── TTS_hfg_vocoded.tar.gz  # Synthetic speech generated with a HiFiGAN vocoder trained on LJSpeech
```
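Since every vocoder archive follows the `<vocoder>_<source>_flac.tar.gz` pattern shown above, download URLs can also be constructed programmatically. A minimal sketch in Python, assuming the standard HuggingFace `resolve/main` download path (the helper name is illustrative, not part of this repository):

```python
BASE = "https://huggingface.co/datasets/ash56/ShiftySpeech/resolve/main"

def archive_url(vocoder: str, source: str) -> str:
    """Build the download URL for one vocoder archive, following the
    Vocoders/<vocoder>/<vocoder>_<source>_flac.tar.gz layout above."""
    return f"{BASE}/Vocoders/{vocoder}/{vocoder}_{source}_flac.tar.gz"

# Example: the BigVGAN re-synthesis of JSUT used later in this README.
print(archive_url("bigvgan", "jsut"))
```

The resulting URL can be passed to `wget` or any HTTP client.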
The source datasets covered by the different TTS and vocoder systems are listed in tts.yaml and vocoders.yaml.
The source datasets for the above synthetic speech can be downloaded using the following links:
Wav lists corresponding to the source datasets can be found in dataset/wav_lists.
We use the WaveFake dataset for training. It can be downloaded from here.
The train, dev, and test splits used for WaveFake can be found in dataset/train_wav_list.
Downloading the pre-trained models
Pre-trained models can be downloaded from here.
Individual models follow the folder structure below.
For example, to download hifigan.pth:
```
wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/single_vocoder/hifigan.pth
```
Folder structure:
```
detection-systems/
├── single_vocoder/        # Detectors trained on synthetic speech from a single vocoder
│   ├── hifigan.pth        # Synthetic speech derived from the HiFiGAN vocoder
│   ├── melgan.pth         # Synthetic speech derived from the MelGAN vocoder
│   └── ...
├── leave_one_out/         # Detectors trained on synthetic speech from multiple vocoders
│   ├── leave_hifigan.pth  # Synthetic speech derived from vocoders other than HiFiGAN
│   ├── leave_melgan.pth
│   └── ...
├── num_spks/              # Detectors trained on synthetic speech from n speakers
│   ├── exp-1/             # Round one of selecting a different set of speakers to train on
│   │   ├── spk1.pt        # Synthetic speech derived from a single speaker
│   │   └── spks4.pt       # Synthetic speech derived from four speakers
│   ├── ...
│   └── exp-5/
│       ├── spk1.pt
│       └── spks4.pt
└── augmentations/         # Detectors trained on synthetic speech from the HiFiGAN vocoder
    └── hfg_aug_1_2.pt     # Augmentations applied during training:
                           #   (1) linear and non-linear convolutive noise
                           #   (2) impulsive signal-dependent additive noise
```
Reproducing Experiments
To reproduce the experiments, follow the instructions below.
Environment Setup
```
conda create -n SSL python=3.10.14
conda activate SSL
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.1 -c pytorch -c nvidia
```
Download and install fairseq from here, then install the remaining dependencies:
```
pip install -r requirements.txt
```
Download the XLS-R model from here and add its path in /synthetic_speech_detection/SSL_Anti-spoofing/model.py.
Prepare the data
- Create an evaluation file in the following format:
```
<real_audio_path> bonafide
<synthetic_audio_path> spoof
```
`bonafide` labels genuine audio; `spoof` labels synthetic (fake) audio.
The helper script for creating the evaluation file is prepare_test.py. Example usage:
```
python dataset/prepare_test.py --bona_dir dataset/real-data/jsut --spoof_dir dataset/Shiftyspeech/Vocoders/bigvgan/jsut_flac --save_path dataset/test_files/bigvgan_jsut_test.txt
```
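To make the evaluation-file format concrete, here is a rough stand-in for what such a helper does; this sketch is hypothetical (prepare_test.py's actual flags and file filtering may differ):

```python
from pathlib import Path

def write_eval_file(bona_dir: str, spoof_dir: str, save_path: str) -> int:
    """List every audio file under the two directories and tag each line
    with its label, producing the `<path> <label>` format shown above.
    Returns the number of entries written."""
    lines = []
    for directory, label in ((bona_dir, "bonafide"), (spoof_dir, "spoof")):
        for path in sorted(Path(directory).rglob("*")):
            if path.suffix.lower() in {".wav", ".flac"}:
                lines.append(f"{path} {label}")
    Path(save_path).write_text("\n".join(lines) + "\n")
    return len(lines)
```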
🚀Evaluating In-the-Wild
- Download the In-the-Wild dataset from here and create the evaluation file in the format above.
- Evaluate the model trained on HiFiGAN-generated utterances, with and without augmentations. Download the pre-trained models:
```
wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/single_vocoder/hifigan.pth
wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/augmentations/hfg_aug_1_2.pt
```
Evaluate using the SSL-AASIST model:
```
python train.py --eval \
    --test_score_dir <path_to_save_scores> \
    --model_name SSL-AASIST \
    --test_list_path <path_to_test_file_with_utterances_and_labels> \
    --model_path <path_to_downloaded_model>
```
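Detection scores like those written by the evaluation step are typically summarized as an equal error rate (EER). A self-contained sketch of that computation, assuming the scores have already been split into bonafide and spoof lists (parsing of the score files is omitted, since their exact format is not described here):

```python
def compute_eer(bona_scores, spoof_scores):
    """Equal error rate: the operating point where the false-rejection
    rate on bonafide speech equals the false-acceptance rate on spoofed
    speech. Assumes higher scores mean 'more likely bonafide'."""
    pairs = sorted([(s, 1) for s in bona_scores] + [(s, 0) for s in spoof_scores])
    rej_bona, acc_spoof = 0, len(spoof_scores)
    best_frr, best_far, best_gap = 0.0, 1.0, 1.0
    for _, is_bona in pairs:  # sweep the threshold upward past each score
        if is_bona:
            rej_bona += 1     # this bonafide utterance is now rejected
        else:
            acc_spoof -= 1    # this spoof is no longer accepted
        frr = rej_bona / len(bona_scores)
        far = acc_spoof / len(spoof_scores)
        if abs(frr - far) < best_gap:
            best_frr, best_far, best_gap = frr, far, abs(frr - far)
    return (best_frr + best_far) / 2
```

For perfectly separated scores the EER is 0; with one confusion on each side of three examples it is 1/3.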
You can also train the models on the WaveFake dataset with:
```
python train.py --trn_list_path <path_to_train_file> \
    --dev_list_path <path_to_dev_file> \
    --save_path <path_to_save_checkpoint>
```
For the following experiments, create the evaluation files for all distribution shifts as described in the Prepare the data section.
Impact of Synthetic Speech Selection
Here, we study the impact of training on a single vocoder vs. multiple vocoders, and additionally the effect of training on one vocoder vs. another.
- Load the pre-trained model saved in models/pre-trained.
Example evaluation:
```
python train.py --eval \
    --test_score_dir dataset/test_scores \
    --model_name SSL-AASIST \
    --test_list_path dataset/test_files/bigvgan_jsut.txt \
    --model_path models/pre-trained/hifigan.pth
```
🗣️Training on more speakers
We analyze the impact of training a detector on a single speaker vs. multiple speakers, varying the number of training speakers from 1 to 10. We release pre-trained models trained on one speaker and on four speakers, using LibriTTS as training data. Five different models are trained for both the single- and multi-speaker experiments, with speakers selected at random.
You can download the pre-trained models as follows:
```
wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/num_spks/exp-1/spk1.pt
```
Similarly, the single-speaker model for experiment 2 can be downloaded with:
```
wget https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems/num_spks/exp-2/spk1.pt
```
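Since the checkpoints for all five rounds follow the same num_spks/exp-<i>/ layout shown earlier, the full set of download URLs can be generated in a loop; a small sketch (the helper name is ours, not part of the repository):

```python
BASE = "https://huggingface.co/ash56/ShiftySpeech/resolve/main/detection-systems"

def speaker_checkpoint_urls(experiments=range(1, 6),
                            spk_files=("spk1.pt", "spks4.pt")):
    """One URL per (experiment round, speaker-count checkpoint),
    following the num_spks/exp-<i>/ folder structure above."""
    return [f"{BASE}/num_spks/exp-{i}/{name}"
            for i in experiments
            for name in spk_files]
```

Each URL can then be fetched with `wget` exactly as in the examples above.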
Example evaluation:
```
python train.py --eval \
    --test_score_dir dataset/test_scores \
    --model_name SSL-AASIST \
    --test_list_path dataset/test_files/bigvgan_jsut.txt \
    --model_path models/pre-trained/spk1.pt
```
Newly released vocoders
Next, we include new vocoders in training in chronological order of release. For vocoder systems not included in the WaveFake dataset, we release the generated training samples, which can be downloaded with:
```
wget https://huggingface.co/datasets/ash56/ShiftySpeech/resolve/main/Vocoders/<vocoder_of_choice>/ljspeech_flac.tar.gz
```
Trained models can then be evaluated on the distribution shifts as described above.
