Cascade
Cascade is a production-ready, high-performance, and low-latency audio stream processing library designed for Voice Activity Detection (VAD). Built upon the excellent Silero VAD model, Cascade significantly reduces VAD processing latency while maintaining high accuracy through its 1:1:1 binding architecture and asynchronous streaming technology.
📊 Performance Benchmarks
Based on our latest streaming VAD performance tests with different chunk sizes:
Streaming Performance by Chunk Size
| Chunk Size (bytes) | Processing Time (ms) | Throughput (chunks/sec) | Total Test Time (s) | Speech Segments |
|--------------------|----------------------|-------------------------|---------------------|-----------------|
| 1024               | 0.66                 | 92.2                    | 3.15                | 2               |
| 4096               | 1.66                 | 82.4                    | 0.89                | 2               |
| 8192               | 2.95                 | 72.7                    | 0.51                | 2               |
Key Performance Metrics
| Metric                | Value           | Description                               |
|-----------------------|-----------------|-------------------------------------------|
| Best Processing Speed | 0.66 ms/chunk   | Optimal performance with 1024-byte chunks |
| Peak Throughput       | 92.2 chunks/sec | Maximum processing throughput             |
| Success Rate          | 100%            | Processing success rate across all tests  |
| Accuracy              | High            | Guaranteed by the Silero VAD model        |
| Architecture          | 1:1:1:1         | Independent model per processor instance  |
Performance Characteristics
- Consistent performance across chunk sizes: Throughput stays between 72.7 and 92.2 chunks/sec for chunks of 1024 to 8192 bytes
- Real-time capability: Per-chunk processing time of 0.66 ms to 2.95 ms leaves headroom for real-time applications
- Scalability: Linear performance scaling with independent processor instances
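To put the per-chunk times in context, the following sketch converts each benchmarked chunk size into the audio duration it covers and computes a real-time factor (processing time divided by audio duration). It assumes 16 kHz, 16-bit mono PCM, which the benchmark does not state; the constants and function names are illustrative and not part of Cascade.

```python
# Assumption: 16 kHz, 16-bit (2-byte) mono PCM. The benchmark table does not
# state the audio format, so adjust these constants to match your input.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2

# (chunk size in bytes, measured processing time in ms) from the benchmark table
BENCHMARK = [(1024, 0.66), (4096, 1.66), (8192, 2.95)]


def chunk_duration_ms(chunk_bytes: int) -> float:
    """Audio duration covered by one chunk, in milliseconds."""
    samples = chunk_bytes / BYTES_PER_SAMPLE
    return samples / SAMPLE_RATE_HZ * 1000.0


def real_time_factor(chunk_bytes: int, processing_ms: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_ms / chunk_duration_ms(chunk_bytes)


for size, ms in BENCHMARK:
    print(f"{size} bytes = {chunk_duration_ms(size):.0f} ms of audio, "
          f"RTF = {real_time_factor(size, ms):.4f}")
```

Under these assumptions a 1024-byte chunk holds 512 samples, which matches the 512-sample frame size mentioned under Core Features.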
✨ Core Features
🚀 High-Performance Engineering
- Lock-Free Design: The 1:1:1 binding architecture eliminates lock contention, boosting performance.
- Frame-Aligned Buffer: A highly efficient buffer optimized for 512-sample frames.
- Asynchronous Streaming: Non-blocking audio stream processing based on asyncio.
- Memory Optimization: Zero-copy design, object pooling, and cache alignment.
- Concurrency Optimization: Dedicated threads, asynchronous queues, and batch processing.
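The idea behind a frame-aligned buffer can be sketched as follows: incoming chunks of arbitrary size are accumulated, and only complete 512-sample frames are handed on. This is a minimal illustration, not Cascade's implementation. The class name, method names, and the 2-bytes-per-sample default are assumptions, and unlike the zero-copy design described above, this version copies bytes for simplicity.

```python
class FrameAlignedBuffer:
    """Hypothetical sketch: accumulate byte chunks, emit fixed-size frames."""

    def __init__(self, frame_samples: int = 512, bytes_per_sample: int = 2) -> None:
        # Assumption: 16-bit samples, so a 512-sample frame is 1024 bytes.
        self.frame_bytes = frame_samples * bytes_per_sample
        self._pending = bytearray()

    def write(self, chunk: bytes) -> None:
        """Append an incoming chunk of any size."""
        self._pending.extend(chunk)

    def read_frames(self):
        """Yield every complete frame; leftover bytes stay buffered."""
        while len(self._pending) >= self.frame_bytes:
            frame = bytes(self._pending[: self.frame_bytes])
            del self._pending[: self.frame_bytes]
            yield frame

    @property
    def pending_bytes(self) -> int:
        """Number of bytes waiting for the next complete frame."""
        return len(self._pending)


buf = FrameAlignedBuffer()
buf.write(b"\x00" * 1500)          # more than one frame, less than two
frames = list(buf.read_frames())   # one complete 1024-byte frame
print(len(frames), buf.pending_bytes)
```

The remainder stays in the buffer until a later `write` completes the next frame, so the model always receives frames of the size it expects.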
🎯 Intelligent Interaction
- Real-time Interruption Detection: VAD-based intelligent interruption detection, allowing users to interrupt system responses at any time
- State Synchronization Guarantee: Two-way guard mechanism ensures strong consistency between physical and logical layers
- Automatic State Management: VAD automatically manages the speech collection state, while external services control the processing state
- Anti-false-trigger Design: Minimum interval checking and state mutex locks effectively prevent false triggers
- Low-latency Response: Interruption detection latency < 50ms for natural conversation experience
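Minimum interval checking can be illustrated with a small gate that fires only when speech starts while the system is busy and enough time has passed since the last interruption. This is a sketch under assumptions: the class name, the `system_busy` flag (standing in for a processing or responding state), and the injectable clock are hypothetical and do not reflect Cascade's internals.

```python
import time
from typing import Callable, Optional


class InterruptionGate:
    """Hypothetical sketch of minimum-interval checking for interruptions."""

    def __init__(
        self,
        min_interval_ms: float = 500.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._min_interval_s = min_interval_ms / 1000.0
        self._clock = clock  # injectable so the gate can be tested deterministically
        self._last_fired: Optional[float] = None

    def should_fire(self, speech_started: bool, system_busy: bool) -> bool:
        """Return True only for a speech start during a busy state, rate-limited."""
        if not (speech_started and system_busy):
            return False
        now = self._clock()
        if self._last_fired is not None and now - self._last_fired < self._min_interval_s:
            return False  # too soon after the previous interruption
        self._last_fired = now
        return True


gate = InterruptionGate(min_interval_ms=500)
print(gate.should_fire(speech_started=True, system_busy=True))   # first event passes
print(gate.should_fire(speech_started=True, system_busy=True))   # immediate repeat is suppressed
```

Using a monotonic clock avoids spurious triggers when the wall clock is adjusted.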
🔧 Robust Software Engineering
- Modular Design: A component architecture with high cohesion and low coupling.
- Interface Abstraction: Dependency inversion through interface-based design.
- Type System: Data validation and type checking using Pydantic.
- Comprehensive Testing: Unit, integration, and performance tests.
- Code Standards: Adherence to PEP 8 style guidelines.
🛡️ Production-Ready Reliability
- Error Handling: Robust error handling and recovery mechanisms.
- Resource Management: Automatic cleanup and graceful shutdown.
- Monitoring Metrics: Real-time performance monitoring and statistics.
- Scalability: Horizontal scaling by increasing the number of instances.
- Stability Assurance: Handles boundary conditions and exceptional cases gracefully.
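Automatic cleanup per instance can be sketched with an async context manager that creates one processor per connection and always closes it, even when processing raises. `DummyProcessor`, its `close()` method, and `processor_for_connection` are hypothetical stand-ins used only for illustration. The usage examples in this README open `cascade.StreamProcessor` with `async with`, which covers the same need in practice.

```python
import asyncio
from contextlib import asynccontextmanager


class DummyProcessor:
    """Hypothetical stand-in for a per-connection processor."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


@asynccontextmanager
async def processor_for_connection(factory):
    """Create an independent processor and guarantee cleanup on exit."""
    proc = factory()
    try:
        yield proc
    finally:
        await proc.close()  # runs on normal exit and on exceptions


async def handle_connection() -> DummyProcessor:
    async with processor_for_connection(DummyProcessor) as proc:
        pass  # process this connection's audio here
    return proc


proc = asyncio.run(handle_connection())
print(proc.closed)
```

Because every connection gets its own instance, adding connections adds instances without shared state between them.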
🏗️ Architecture
Cascade employs a 1:1:1:1 independent architecture to ensure optimal performance and thread safety.
graph TD
    Client --> StreamProcessor
    subgraph "1:1:1:1 Independent Architecture"
        StreamProcessor --> |per connection| IndependentProcessor[Independent Processor Instance]
        IndependentProcessor --> |independent loading| VADModel[Silero VAD Model]
        IndependentProcessor --> |independent management| VADIterator[VAD Iterator]
        IndependentProcessor --> |independent buffering| FrameBuffer[Frame-Aligned Buffer]
        IndependentProcessor --> |independent state| StateMachine[State Machine]
    end
    subgraph "Asynchronous Processing Flow"
        VADModel --> |asyncio.to_thread| VADInference[VAD Inference]
        VADInference --> StateMachine
        StateMachine --> |None| SingleFrame[Single Frame Output]
        StateMachine --> |start| Collecting[Start Collecting]
        StateMachine --> |end| SpeechSegment[Speech Segment Output]
    end
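The asynchronous processing flow in the diagram can be sketched roughly as follows: inference runs in a worker thread via `asyncio.to_thread`, and a small state machine either emits a single frame, starts collecting, or emits a finished speech segment. This is an illustration only. `fake_inference` is a hypothetical stand-in for the Silero VAD model (frames starting with `b"S"` or `b"E"` mark speech start and end), and `run_flow` is not a Cascade function.

```python
import asyncio
from typing import List, Optional, Tuple


def fake_inference(frame: bytes) -> Optional[str]:
    """Hypothetical stand-in for model inference: 'start', 'end', or None."""
    if frame.startswith(b"S"):
        return "start"
    if frame.startswith(b"E"):
        return "end"
    return None


async def run_flow(frames: List[bytes]) -> List[Tuple[str, bytes]]:
    """Run inference off the event loop and apply the start/end state machine."""
    results: List[Tuple[str, bytes]] = []
    collecting = False
    collected = bytearray()
    for frame in frames:
        # Blocking inference is moved to a worker thread (Python 3.9+).
        event = await asyncio.to_thread(fake_inference, frame)
        if event == "start":
            collecting = True
            collected = bytearray(frame)
        elif event == "end" and collecting:
            collected.extend(frame)
            results.append(("segment", bytes(collected)))
            collecting = False
        elif collecting:
            collected.extend(frame)           # mid-speech frame joins the segment
        else:
            results.append(("frame", frame))  # no speech: emit a single frame
    return results


print(asyncio.run(run_flow([b"n1", b"S2", b"m3", b"E4", b"n5"])))
```

Frames between a start and an end event are held back and delivered together as one segment, while frames outside speech are emitted individually.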
🚀 Quick Start
Installation
pip install cascade-vad
OR
# Using uv is recommended
uv venv -p 3.12
source .venv/bin/activate
# Install from PyPI (recommended)
pip install cascade-vad
# Or install from source
git clone https://github.com/xucailiang/cascade.git
cd cascade
pip install -e .
Basic Usage
import asyncio

import cascade


async def basic_example():
    """A basic usage example."""
    # Method 1: Simple file processing
    async for result in cascade.process_audio_file("audio.wav"):
        if result.result_type == "segment":
            segment = result.segment
            print(f"🎤 Speech Segment: {segment.start_timestamp_ms:.0f}ms - {segment.end_timestamp_ms:.0f}ms")
        else:
            frame = result.frame
            print(f"🔇 Single Frame: {frame.timestamp_ms:.0f}ms")

    # Method 2: Stream processing
    # NOTE: `audio_stream` is not defined in this snippet; supply your application's audio source.
    async with cascade.StreamProcessor() as processor:
        async for result in processor.process_stream(audio_stream):
            if result.result_type == "segment":
                segment = result.segment
                print(f"🎤 Speech Segment: {segment.start_timestamp_ms:.0f}ms - {segment.end_timestamp_ms:.0f}ms")
            else:
                frame = result.frame
                print(f"🔇 Single Frame: {frame.timestamp_ms:.0f}ms")


asyncio.run(basic_example())
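The stream-processing example leaves `audio_stream` undefined. One possible shape for it, sketched below, is an async generator that yields fixed-size byte chunks from a raw PCM file. Whether `process_stream` accepts an async iterator of `bytes` is an assumption that should be checked against the library; `pcm_chunks`, the file name, and the 4096-byte default are illustrative.

```python
import asyncio
from typing import AsyncIterator


async def pcm_chunks(path: str, chunk_size: int = 4096) -> AsyncIterator[bytes]:
    """Hypothetical helper: yield fixed-size byte chunks from a raw PCM file."""
    with open(path, "rb") as f:
        while True:
            # File reads are blocking, so run them in a worker thread (Python 3.9+).
            chunk = await asyncio.to_thread(f.read, chunk_size)
            if not chunk:
                break
            yield chunk


async def count_chunks(path: str) -> int:
    """Consume the generator the same way a stream processor would."""
    count = 0
    async for _ in pcm_chunks(path):
        count += 1
    return count
```

If the assumption holds, `pcm_chunks("audio.pcm")` could be passed wherever the examples use `audio_stream`. The final chunk may be shorter than `chunk_size`.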
Advanced Configuration
import asyncio

import cascade


async def advanced_example():
    """An advanced configuration example."""
    # Custom configuration
    config = cascade.Config(
        vad_threshold=0.7,  # Higher detection threshold
        min_silence_duration_ms=100,
        speech_pad_ms=100,
    )

    # Use the custom config
    # NOTE: `audio_stream` is not defined in this snippet; supply your application's audio source.
    async with cascade.StreamProcessor(config) as processor:
        # Process audio stream
        async for result in processor.process_stream(audio_stream):
            # Process results...
            pass

        # Get performance statistics
        stats = processor.get_stats()
        print(f"Throughput: {stats.throughput_chunks_per_second:.1f} chunks/sec")


asyncio.run(advanced_example())
Interruption Detection
import asyncio

import cascade


async def interruption_example():
    """Interruption detection example."""
    # Configure interruption detection
    config = cascade.Config(
        vad_threshold=0.5,
        interruption_config=cascade.InterruptionConfig(
            enable_interruption=True,  # Enable interruption detection
            min_interval_ms=500,  # Minimum interval between interruptions: 500 ms
        ),
    )

    # NOTE: `audio_stream`, `asr_service`, `llm_service`, and `tts_service` are
    # placeholders for your own audio source and services; they are not defined here.
    async with cascade.StreamProcessor(config) as processor:
        async for result in processor.process_stream(audio_stream):
            # Detect interruption events
            if result.is_interruption:
                print(f"🛑 Interruption detected! Interrupted state: {result.interruption.system_state.value}")
                # Stop current TTS playback
                await tts_service.stop()
                # Cancel LLM request
                await llm_service.cancel()

            # Process speech segments
            elif result.is_speech_segment:
                # ASR recognition
                text = await asr_service.recognize(result.segment.audio_data)
                # Set to processing
                processor.set_system_state(cascade.SystemState.PROCESSING)
                # LLM generation
                response = await llm_service.generate(text)
                # Set to responding
                processor.set_system_state(cascade.SystemState.RESPONDING)
                # TTS playback
                await tts_service.play(response)
                # Reset to idle after completion
                processor.set_system_state(cascade.SystemState.IDLE)


asyncio.run(interruption_example())
For detailed documentation, see: Interruption Implementation Summary
🧪 Testing
# Run basic integration tests
python tests/test_simple_vad.py -v
# Run simulated audio strea
