<div align="center"> <img src="scripts/GPA.png" width="80%" alt="GPA Logo"/>

GPA: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion

ArXiv | Demo | Hugging Face | Interactive Demo | ModelScope

</div>

TL;DR: GPA unifies three speech tasks in one single model, and this repo includes code for training, fine-tuning, and efficient deployment of GPA.

If you find GPA useful, a ⭐ helps support the project.

<details open> <summary><strong>📢 Announcements</strong></summary> <div style="max-height: 100px; overflow-y: auto; border: 1px solid #ddd; padding: 10px; margin-top: 8px;">
  • 🆕 2026.03.31: GPA-TTS — Standalone lightweight TTS runtime released! Extracted from GPA with INT8/INT4 quantization for edge deployment. Among the smallest open-source TTS runtimes with voice cloning support! Details →

  • 🔄 2026.03.04: Inference update (TTS transcript conditioning): Added --ref_transcript support to guide generation using reference text, effectively eliminating accent drift in zero-shot tasks. (Inspired by insights from @or965)

  • 🔄 2026.01.29: Updated the roadmap: Our next release will be GPA-v1.5-0.6B! It includes incremental improvements in ASR robustness. The previously planned standalone GPA-0.3B full release is no longer scheduled.

  • 📌 2026.01.17: Initial GPA release.

</div> </details>

📖 Abstract

GPA stands for General Purpose Audio.

In academia, a student’s GPA (Grade Point Average) serves as a unified metric that reflects performance across diverse subjects—ranging from Calculus and Philosophy to Gym class.

Similarly, our GPA model unifies the three major pillars of audio tasks—Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Voice Conversion (VC)—into a single auto-regressive transformer.

  • Our open-source release supports multiple frameworks and provides production-ready code suitable for cloud deployment.
  • We include concise inference examples and training pipelines for research purposes.
  • The released 0.3B model is also well suited for edge devices; edge-deployment support will be released later.

Table of Contents

<div align="center">

| 🗺️ Roadmap | 🚀 Quick Start | 🛠️ Deployment | 📊 Evaluation | ⚡ Performance |
| :---: | :--- | :--- | :---: | :--- |
| | • Environment Setup<br>• Checkpoint Download<br>• Inference<br>• Training | • Start the Service (Docker)<br>• Start the Gradio GUI<br>• Basic Testing | | • Speed<br>• RTF<br>• Concurrency<br>• VRAM Usage |

</div>

🗺️ Roadmap

| Category | Item | Status |
| :--- | :--- | :---: |
| Core Features | Unified LLM-based audio generation & understanding | ✅ |
| | Inference Scripts (STT, TTS, VC) | ✅ |
| | Training Pipeline (DeepSpeed) | ✅ |
| | Interactive Demo | ✅ |
| | Basic Service Deployment (vLLM/FastAPI) | ✅ |
| | Paper (ArXiv) | ✅ |
| Model Releases | GPA-0.3B-preview (Edge-focused) | ✅ |
| | GPA-v1.5-0.6B (Edge-focused) | ⬜ |
| | GPA-TTS — Lightweight TTS runtime (INT8/INT4 ONNX) | ✅ |
| Edge Deployment | Android Platform | ⬜ |
| | RK Series | ⬜ |
| | iOS Platform | ⬜ |
| Frameworks | vllm | ✅ |
| | llama-cpp | ✅ |
| | sglang | ✅ |
| | torch | ✅ |
| | mlx-lm | ✅ |
| | rknn | ⬜ |

🎙️ GPA-TTS: Edge-Ready Voice-Cloning TTS

We noticed that TTS is by far the most popular feature in our online demo. While GPA-v1.5 will ship as a larger unified model, we extracted the TTS component into a standalone, self-contained runtime:

<div align="center">

| | GPA-TTS |
| :--- | :--- |
| Quantization | Qwen INT4 + Detokenizer INT8 (ONNX Runtime) |
| Voice Cloning | Zero-shot, from a short reference audio |
| Footprint | Among the smallest open-source TTS runtimes with cloning support |
| Optimized for | Local CPU inference (Mac / Linux / Edge) |

📖 GPA-TTS README →   |   🤗 Download from HuggingFace

</div>

🔍 Model Overview

<div align="center"> <img src="docs/GPA.png" width="80%" alt="GPA Model Architecture"/> <br> <div style="text-align: justify; width: 100%; margin: 10px auto; text-indent: 2em;"> <strong>Figure 1: Architecture of the proposed GPA framework.</strong> The model utilizes a shared Large Language Model (LLM) backbone to unify three core audio tasks: Understanding (ASR), Generation (TTS), and Editing (Voice Conversion). Depending on the task, the model processes different combinations of inputs (Source Audio, Target Text, or Reference Audio) via Semantic and Acoustic modules to generate the corresponding text or audio output. </div> </div>
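The task-conditional input combinations described in Figure 1 can be sketched as a small lookup table. This is an illustrative sketch only — the task names mirror the CLI's `--task` values, but the input names (`source_audio`, `target_text`, `reference_audio`) are hypothetical labels, not the repo's actual API:

```python
# Hypothetical sketch of Figure 1's task-conditional inputs.
# Input names are illustrative, not the repo's actual API.
TASK_INPUTS = {
    "stt": {"source_audio"},                    # ASR: audio in, text out
    "tts": {"target_text", "reference_audio"},  # TTS: clone voice from reference audio
    "vc":  {"source_audio", "reference_audio"}, # VC: re-speak source in reference voice
}

def validate_inputs(task: str, provided: set[str]) -> bool:
    """Check that the provided inputs cover everything the task requires."""
    return TASK_INPUTS[task] <= provided

# Example: TTS needs both the target text and a reference audio clip.
print(validate_inputs("tts", {"target_text", "reference_audio"}))
```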

🚀 Quick Start

🧹 Environment Setup

🧩 Option A: Reproducible Setup with uv (Recommended)

⚠️ Prerequisites (Important)

The default development environment is configured for:

  • OS: Linux (x86_64)
  • GPU: NVIDIA
  • CUDA: 12.x

The provided uv.lock file was generated under this configuration.

If your system matches the above, you can use the uv-based setup for a fully reproducible environment.

If you are using:

  • CUDA 11.x (e.g. cu116)
  • CPU-only systems
  • macOS or Windows

please follow the pip-based installation described below.

We use uv for fast and reproducible Python environment management.

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

# Or install via pip if you prefer:
# pip install uv

2. Sync the environment (installs all dependencies)

💡Note: If training is not required, or if building flash_attn is difficult/slow on your device, you may comment out this dependency in pyproject.toml. In that case, switch training to eager attention mode.

uv sync

🧩 Option B: Flexible Setup with pip (Any CUDA / CPU)

1. Create and activate a virtual environment

python -m venv .venv
source .venv/bin/activate

2. Install base dependencies

pip install -r requirements.txt
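After either setup path, it can be useful to confirm which heavy dependencies actually made it into the environment (e.g. if you commented out flash_attn as noted above). A minimal sketch — the package list is an assumption based on the dependencies this README mentions, not an exhaustive requirements list:

```python
import importlib.util

def check_env(packages=("torch", "transformers", "deepspeed", "flash_attn")):
    """Report which (assumed) GPA dependencies are importable in this env."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

for pkg, ok in check_env().items():
    print(f"{pkg}: {'OK' if ok else 'missing'}")
```

A missing flash_attn here is fine for inference; per the note above, training would then need eager attention mode.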

📥 Checkpoint Download

Before running inference, please download the model checkpoints from Hugging Face or ModelScope.

| Model | Hugging Face | ModelScope | | :--- | :---: | :---: | | GPA-0.3B-preview | Download | Download | | GPA-v1.5-0.6B | Coming Soon | Coming Soon |

Important: After downloading the checkpoints, please verify that your model directory structure matches the hierarchy below.

${GPA_MODEL_DIR}/
├── BiCodec/
│   └── wav2vec2-large-xlsr-53/
├── glm-4-voice-tokenizer/
├── added_tokens.json
├── chat_template.jinja
├── config.json
├── generation_config.json
├── merges.txt
├── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── vocab.json
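You can automate this verification step with a small script. This is a convenience sketch (not part of the repo): the expected paths are transcribed directly from the layout above.

```python
from pathlib import Path

# Entries transcribed from the checkpoint layout above.
EXPECTED = [
    "BiCodec/wav2vec2-large-xlsr-53",
    "glm-4-voice-tokenizer",
    "added_tokens.json",
    "chat_template.jinja",
    "config.json",
    "generation_config.json",
    "merges.txt",
    "model.safetensors",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "vocab.json",
]

def missing_checkpoint_files(model_dir: str) -> list[str]:
    """Return the expected entries that are absent under model_dir."""
    root = Path(model_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Usage: `missing_checkpoint_files("${GPA_MODEL_DIR}")` returns an empty list when the download is complete.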

💭 Inference

You can perform various tasks like Speech-to-Text, Text-to-Speech, and Voice Conversion using the provided scripts.

💡Note: Please navigate to the inference directory to ensure relative paths for audio files work correctly.

💡Note: Currently, we only support WAV input at a 16 kHz sample rate.
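To catch format problems before inference, you can validate inputs against the 16 kHz WAV requirement with the standard library's `wave` module. A small sketch (not part of the repo):

```python
import wave

def check_wav_16k(path: str) -> None:
    """Raise ValueError unless `path` is a WAV file sampled at 16 kHz."""
    with wave.open(path, "rb") as wf:  # wave only accepts WAV; other formats error out
        rate = wf.getframerate()
        if rate != 16000:
            raise ValueError(f"{path}: sample rate {rate} Hz, expected 16000 Hz")
```

For audio in other formats or sample rates, resample to 16 kHz mono WAV first (e.g. with ffmpeg: `ffmpeg -i in.mp3 -ar 16000 -ac 1 out.wav`).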

cd scripts/inference

💡Note: To use another Python environment, replace "uv run" with the path to your Python interpreter.

💡Update (2026.03.04): TTS now supports transcript-conditioned cloning via --ref_transcript or --auto_ref_transcript; this helps better preserve accent fidelity in zero-shot tasks.

Speech-to-Text (STT/ASR):

# Using uv
uv run gpa_inference.py --task stt \
    --src_audio_path "test_audio/000.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"

# Or using python
python gpa_inference.py --task stt \
    --src_audio_path "test_audio/000.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"
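If you want to transcribe many files, the STT invocation above can be assembled programmatically and run in a loop with `subprocess`. A hypothetical wrapper sketch — the flag values simply mirror the command above:

```python
import subprocess
import sys
from pathlib import Path

def stt_command(audio_path: str, model_dir: str) -> list[str]:
    """Build the argv for the STT invocation shown above."""
    return [
        sys.executable, "gpa_inference.py", "--task", "stt",
        "--src_audio_path", audio_path,
        "--gpa_model_path", model_dir,
        "--tokenizer_path", f"{model_dir}/glm-4-voice-tokenizer",
        "--bicodec_tokenizer_path", f"{model_dir}/BiCodec",
        "--text_tokenizer_path", model_dir,
    ]

def transcribe_folder(folder: str, model_dir: str) -> None:
    """Run STT on every .wav file in a folder, one subprocess per file."""
    for wav in sorted(Path(folder).glob("*.wav")):
        subprocess.run(stt_command(str(wav), model_dir), check=True)
```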

Text-to-Speech (TTS):

# Using uv
uv run gpa_inference.py --task tts-a \
    --text "Hello world, this is Major Tom speaking." \
    --ref_audio_path "test_audio/astro.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"

# Or using python
python gpa_inference.py --task tts-a \
    --text "Hello world, this is Major Tom speaking." \
    --ref_audio_path "test_audio/astro.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"