<div align="center"> <h1> FireRedASR2S <br> A SOTA Industrial-Grade All-in-One ASR System </h1> </div>

[Paper] [Model🤗] [Model🤖] [Demo]

FireRedASR2S is a state-of-the-art (SOTA), industrial-grade, all-in-one ASR system with ASR, VAD, LID, and Punc modules. All modules achieve SOTA performance:

  • FireRedASR2: Automatic Speech Recognition (ASR) supporting speech and singing transcription for Chinese (Mandarin, 20+ dialects/accents), English, and code-switching. 2.89% average CER on 4 public Mandarin benchmarks and 11.55% average CER on 19 Chinese dialect/accent benchmarks, outperforming Doubao-ASR, Qwen3-ASR-1.7B, Fun-ASR, and Fun-ASR-Nano-2512. FireRedASR2-AED also supports word-level timestamps and confidence scores.
  • FireRedVAD: Voice Activity Detection (VAD) supporting speech/singing/music in 100+ languages. 97.57% F1, outperforming Silero-VAD, TEN-VAD, FunASR-VAD and WebRTC-VAD. Supports non-streaming/streaming VAD and Multi-label VAD (mVAD).
  • FireRedLID: Spoken Language Identification (LID) supporting 100+ languages and 20+ Chinese dialects/accents. 97.18% accuracy, outperforming Whisper and SpeechBrain.
  • FireRedPunc: Punctuation Prediction (Punc) for Chinese and English. 78.90% average F1, outperforming FunASR-Punc (62.77%).

2S: 2nd-generation FireRedASR, now expanded to an all-in-one ASR System

🔥 News

  • [2026.03.12] 🔥 We release FireRedASR2S technical report. See arXiv.
  • [2026.03.05] 🚀 vLLM supports FireRedASR2-LLM. See vLLM Usage part.
  • [2026.02.25] 🔥 We release FireRedASR2-LLM model weights. 🤗 🤖
  • [2026.02.13] 🚀 Support TensorRT-LLM inference acceleration for FireRedASR2-AED (contributed by NVIDIA). Benchmark on AISHELL-1 test set shows 12.7x speedup over PyTorch baseline (single H20).
  • [2026.02.12] 🔥 We release FireRedASR2S (FireRedASR2-AED, FireRedVAD, FireRedLID, and FireRedPunc) with model weights and inference code. Download links below. Technical report and finetuning code coming soon.

Available Models and Languages

|Model|Supported Languages & Dialects|Download|
|:-------------:|:---------------------------------:|:----------:|
|FireRedASR2-LLM| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | 🤗 \| 🤖 |
|FireRedASR2-AED| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | 🤗 \| 🤖 |
|FireRedVAD | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | 🤗 \| 🤖 |
|FireRedLID | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | 🤗 \| 🤖 |
|FireRedPunc| Chinese, English | 🤗 \| 🤖 |

<sup>*</sup>Supported Chinese dialects/accents: Cantonese (Hong Kong & Guangdong), Sichuan, Shanghai, Wu, Minnan, Anhui, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Liaoning, Ningxia, Shaanxi, Shanxi, Shandong, Tianjin, Yunnan, etc.

Method

FireRedASR2S: System Overview

Model

FireRedASR2

FireRedASR2 builds upon FireRedASR with improved accuracy, and is designed to meet diverse application requirements for both superior performance and optimal efficiency. It comprises two variants:

  • FireRedASR2-LLM: Designed to achieve state-of-the-art performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
  • FireRedASR2-AED: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.


Other Modules

  • FireRedVAD: DFSMN-based non-streaming/streaming Voice Activity Detection and Multi-label VAD (mVAD). mVAD can be viewed as a lightweight Audio Event Detection (AED) system specialized for a small set of sound categories (speech/singing/music).
  • FireRedLID: Encoder-Decoder-based Spoken Language Identification. See FireRedLID README for language details.
  • FireRedPunc: BERT-based Punctuation Prediction.

Quick Start

Setup

  1. Create a clean Python environment:
$ conda create --name fireredasr2s python=3.10
$ conda activate fireredasr2s
$ git clone https://github.com/FireRedTeam/FireRedASR2S.git
$ cd FireRedASR2S  # or fireredasr2s
  2. Install dependencies and set up PATH and PYTHONPATH:
$ pip install -r requirements.txt
$ export PATH=$PWD/fireredasr2s/:$PATH
$ export PYTHONPATH=$PWD/:$PYTHONPATH
  3. Download models:
# Download via ModelScope (recommended for users in China)
pip install -U modelscope
modelscope download --model xukaituo/FireRedASR2-AED --local_dir ./pretrained_models/FireRedASR2-AED
modelscope download --model xukaituo/FireRedVAD --local_dir ./pretrained_models/FireRedVAD
modelscope download --model xukaituo/FireRedLID --local_dir ./pretrained_models/FireRedLID
modelscope download --model xukaituo/FireRedPunc --local_dir ./pretrained_models/FireRedPunc
modelscope download --model xukaituo/FireRedASR2-LLM --local_dir ./pretrained_models/FireRedASR2-LLM

# Download via Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download FireRedTeam/FireRedASR2-AED --local-dir ./pretrained_models/FireRedASR2-AED
huggingface-cli download FireRedTeam/FireRedVAD --local-dir ./pretrained_models/FireRedVAD
huggingface-cli download FireRedTeam/FireRedLID --local-dir ./pretrained_models/FireRedLID
huggingface-cli download FireRedTeam/FireRedPunc --local-dir ./pretrained_models/FireRedPunc
huggingface-cli download FireRedTeam/FireRedASR2-LLM --local-dir ./pretrained_models/FireRedASR2-LLM
  4. Convert your audio to 16kHz 16-bit mono PCM format if needed:
$ ffmpeg -i <input_audio_path> -ar 16000 -ac 1 -acodec pcm_s16le -f wav <output_wav_path>
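If you need to convert many files, the ffmpeg call above can be wrapped in Python. A minimal sketch; the helper names and the subprocess wrapper are our own, not part of FireRedASR2S, and running the conversion requires ffmpeg on PATH:

```python
import subprocess

def to_asr_wav_cmd(input_path: str, output_path: str) -> list[str]:
    """Build the ffmpeg command that resamples audio to 16kHz 16-bit mono PCM WAV."""
    return [
        "ffmpeg", "-i", input_path,
        "-ar", "16000",           # 16 kHz sample rate
        "-ac", "1",               # mono
        "-acodec", "pcm_s16le",   # 16-bit signed little-endian PCM
        "-f", "wav", output_path,
    ]

def to_asr_wav(input_path: str, output_path: str) -> None:
    """Run the conversion, raising on a non-zero ffmpeg exit code."""
    subprocess.run(to_asr_wav_cmd(input_path, output_path), check=True)
```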

Script Usage

$ cd examples_infer/asr_system
$ bash inference_asr_system.sh

Command-line Usage

$ fireredasr2s-cli --help
$ fireredasr2s-cli --wav_paths "assets/hello_zh.wav" "assets/hello_en.wav" --outdir output
$ cat output/result.jsonl 
# {"uttid": "hello_zh", "text": "你好世界。", "sentences": [{"start_ms": 310, "end_ms": 1840, "text": "你好世界。", "asr_confidence": 0.875, "lang": "zh mandarin", "lang_confidence": 0.999}], "vad_segments_ms": [[310, 1840]], "dur_s": 2.32, "words": [{"start_ms": 490, "end_ms": 690, "text": "你"}, {"start_ms": 690, "end_ms": 1090, "text": "好"}, {"start_ms": 1090, "end_ms": 1330, "text": "世"}, {"start_ms": 1330, "end_ms": 1795, "text": "界"}], "wav_path": "assets/hello_zh.wav"}
# {"uttid": "hello_en", "text": "Hello speech.", "sentences": [{"start_ms": 120, "end_ms": 1840, "text": "Hello speech.", "asr_confidence": 0.833, "lang": "en", "lang_confidence": 0.998}], "vad_segments_ms": [[120, 1840]], "dur_s": 2.24, "words": [{"start_ms": 340, "end_ms": 1020, "text": "hello"}, {"start_ms": 1020, "end_ms": 1666, "text": "speech"}], "wav_path": "assets/hello_en.wav"}
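Each line of result.jsonl is a standalone JSON object, so downstream tooling can consume it with the standard library alone. A minimal sketch (the `summarize` helper is ours, not part of the package; field names follow the output shown above):

```python
import json

def summarize(jsonl_line: str) -> dict:
    """Reduce one result.jsonl record to uttid, text, and sentence spans in ms."""
    rec = json.loads(jsonl_line)
    return {
        "uttid": rec["uttid"],
        "text": rec["text"],
        "spans_ms": [(s["start_ms"], s["end_ms"]) for s in rec["sentences"]],
    }

# Sample record mirroring the fields shown above
sample = ('{"uttid": "hello_zh", "text": "你好世界。", '
          '"sentences": [{"start_ms": 310, "end_ms": 1840, "text": "你好世界。", '
          '"asr_confidence": 0.875, "lang": "zh mandarin", "lang_confidence": 0.999}], '
          '"vad_segments_ms": [[310, 1840]], "dur_s": 2.32}')
print(summarize(sample))
# {'uttid': 'hello_zh', 'text': '你好世界。', 'spans_ms': [(310, 1840)]}
```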

Python API Usage

from fireredasr2s import FireRedAsr2System, FireRedAsr2SystemConfig

asr_system_config = FireRedAsr2SystemConfig()  # Use default config
asr_system = FireRedAsr2System(asr_system_config)

result = asr_system.process("assets/hello_zh.wav")
print(result)
# {'uttid': 'tmpid', 'text': '你好世界。', 'sentences': [{'start_ms': 440, 'end_ms': 1820, 'text': '你好世界。', 'asr_confidence': 0.868, 'lang': 'zh mandarin', 'lang_confidence': 0.999}], 'vad_segments_ms': [(440, 1820)], 'dur_s': 2.32, 'words': [], 'wav_path': 'assets/hello_zh.wav'}

result = asr_system.process("assets/hello_en.wav")
print(result)
# {'uttid': 'tmpid', 'text': 'Hello speech.', 'sentences': [{'start_ms': 260, 'end_ms': 1820, 'text': 'Hello speech.', 'asr_confidence': 0.933, 'lang': 'en', 'lang_confidence': 0.993}], 'vad_segments_ms': [(260, 1820)], 'dur_s': 2.24, 'words': [], 'wav_path': 'assets/hello_en.wav'}
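For batch transcription, the `process()` call above composes naturally into a small loop. A hedged sketch, assuming `process(path)` returns a dict with a `"text"` key as shown; the `transcribe_all` helper is our own, not part of the package:

```python
def transcribe_all(process, wav_paths):
    """Map each wav path to its transcript using any process(path) -> dict callable,
    e.g. asr_system.process from the FireRedAsr2System example above."""
    return {path: process(path)["text"] for path in wav_paths}

# Usage with the system above:
# results = transcribe_all(asr_system.process,
#                          ["assets/hello_zh.wav", "assets/hello_en.wav"])
```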

Usage of Each Module

The four components under fireredasr2s, i.e. fireredasr2, fireredvad, fireredlid, and fireredpunc, are self-contained and designed to work as standalone modules. You can use any of them independently of the others. FireRedVAD and FireRedLID will also be open-sourced as standalone libraries in separate repositories.

Script Usage

# ASR
$ cd examples_infer/asr
$ bash inference_asr_aed.sh
$ bash inference_asr_llm.sh

# VAD & mVAD (mVAD=Audio Event Detection, AED)
$ cd examples_infer/vad
$ bash inference_vad.sh
$ bash inference_streamvad.sh
$ bash inference_aed.sh

# LID
$ cd examples_infer/lid
$ bash inference_lid.sh

# Punc
$ cd examples_infer/punc
$ bash inference_punc.sh

vLLM Usage

# Serving FireRedASR2-LLM with latest vLLM for the highest performance.
# For more details, see https://github.com/vllm-project/vllm/pull/35727.
$ vllm serve allendou/FireRedASR2-LLM-vllm -tp=2 --dtype=float32
$ python3 examples