X-Talk

<img width="460" height="249" alt="xtalk-logo-new" src="https://github.com/user-attachments/assets/4e252ce8-7450-4335-b86a-4b9b26200792" />

Live Demo arXiv Python License


⚠️ X-Talk is in active prototyping. Interfaces and functionality are subject to change, though we will try to keep interfaces stable.

X-Talk is an open-source full-duplex cascaded spoken dialogue system framework featuring:

  • Low-Latency, Interruptible, Human-Like Speech Interaction
    • The speech pipeline is optimized for impressively low latency
    • Users can naturally interrupt the system mid-response
    • Paralinguistic information (e.g. environment noise, emotion) is encoded in parallel to support in-depth understanding and empathy
  • 🧪 Researcher Friendly
    • New models and related logic can be added in a single Python script and seamlessly integrated with the default pipeline.
  • 🧩 Super Lightweight
    • The framework backend is pure Python; nothing to build or install beyond pip install.
  • 🏭 Production Ready
    • Concurrency is ensured through an asynchronous backend
    • The WebSocket-based implementation supports deployment from web browsers to edge devices.

📚 Contents

<a id="demo"></a>

🎬 Demo

Online Demo

Demo Link

This demo runs on an RTX 4090 cluster with 8-bit quantized SenseVoice as the speech recognizer, IndexTTS 1.5 as the speech generator, and 4-bit quantized Qwen3-30B-A3B as the language model. Though intelligence is limited by the relatively small language model, the demo illustrates the low latency achievable.

Demo Videos

<table class="center"> <tr> <td width=50% style="border: none"> <video controls autoplay loop src="https://github.com/user-attachments/assets/e7946357-cd83-493c-8967-354cf87b2acb" muted="false"></video> </td> <td width=50% style="border: none"> <video controls autoplay loop src="https://github.com/user-attachments/assets/ca45c463-6738-4b5c-8305-71fce4ab490e" muted="false"></video> </td> </tr> <tr> <td width=50% style="border: none"> <video controls autoplay loop src="https://github.com/user-attachments/assets/8c0f489a-6af6-4711-a28c-7a48740f666c" muted="false"></video> </td> <td width=50% style="border: none"> <video controls autoplay loop src="https://github.com/user-attachments/assets/d8fc4d15-edfb-4476-a9d3-983a1ce9be0e" muted="false"></video> </td> </tr> <tr> <td width=50% style="border: none"> <video controls autoplay loop src="https://github.com/user-attachments/assets/7ea4dc44-d43c-45ca-8788-2032b3a387d8" muted="false"></video> </td> <td width=50% style="border: none"> <video controls autoplay loop src="https://github.com/user-attachments/assets/9f296d5e-a752-435e-91a2-a9f1a71f9fac" muted="false"></video> </td> </tr> <tr> <td width=50% style="border: none"> <video controls autoplay loop src="https://github.com/user-attachments/assets/2b44f2f1-93c4-47b8-99e0-830338cdba02" muted="false"></video> </td> <td width=50% style="border: none"> <video controls autoplay loop src="https://github.com/user-attachments/assets/c4cd4c1b-c4fd-493b-8cb2-347c48ac5809" muted="false"></video> </td> </tr> <tr> <td width=50% style="border: none"> <video controls autoplay loop src="https://github.com/user-attachments/assets/d33ca5ef-c722-45a6-93df-2fdb7ffcc729" muted="false"></video> </td> <td width=50% style="border: none"> <video controls autoplay loop src="https://github.com/user-attachments/assets/09370641-7a26-4f93-9c98-dee887612fda" muted="false"></video> </td> </tr> </table>

The tour-guiding demos use Qwen3-Next-80B-A3B-Instruct as the language model; the other eight demos match the online demo setting. Larger language models are more capable, at the cost of higher latency.

<a id="installation"></a>

🛠️ Installation

pip install git+https://github.com/xcc-zach/xtalk.git@main

<a id="quickstart"></a>

🚀 Quickstart

We will use APIs from AliCloud to demonstrate the basic capability of X-Talk.

First, install dependencies for AliCloud and server script:

pip install "xtalk[ali] @ git+https://github.com/xcc-zach/xtalk.git@main"
pip install jinja2 python-multipart 'uvicorn[standard]'

Then, obtain an API key from the AliCloud Bailian Platform. We will be using AliCloud's free-tier service.

Online services may be unstable and have high latency. We recommend locally deployed models for a better user experience; see the server config tutorial and supported models for details.

After that, create a JSON config specifying the models to use, and fill in <API_KEY> with the key you obtained:

{
    "asr": {
        "type": "Qwen3ASRFlashRealtime",
        "params": {
            "api_key": "<API_KEY>"
        }
    },
    "llm_agent": {
        "type": "DefaultAgent",
        "params": {
            "model": {
                "api_key": "<API_KEY>",
                "model": "qwen-plus-2025-12-01",
                "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1"
            }
        }
    },
    "tts": {
        "type": "CosyVoice",
        "params": {
            "api_key": "<API_KEY>"
        }
    }
}

If Qwen3ASRFlashRealtime is not working properly, you can instead use "asr": "SenseVoiceSmallLocal", which is a ~1 GB local model. You can also try the local speech generation model IndexTTS (setup tutorial):

"tts": {
    "type": "IndexTTS",
    "params": {
        "port": 6006
    }
},

If you want all models deployed locally, see here.
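For reference, here is a combined config sketch that pairs the local recognizer and IndexTTS with the AliCloud language model from earlier. The `"asr"` string shorthand follows the note above; verify field names against your X-Talk version:

```json
{
    "asr": "SenseVoiceSmallLocal",
    "llm_agent": {
        "type": "DefaultAgent",
        "params": {
            "model": {
                "api_key": "<API_KEY>",
                "model": "qwen-plus-2025-12-01",
                "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1"
            }
        }
    },
    "tts": {
        "type": "IndexTTS",
        "params": {
            "port": 6006
        }
    }
}
```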

The next step is to compose the startup script. Since the demo also needs the frontend page and scripts wired in, a ready-made startup script is provided at examples/sample_app/configurable_server.py. Simply start the server with the config file (replace <PATH_TO_CONFIG>.json with the path to the config we just created) and a custom port:

git clone https://github.com/xcc-zach/xtalk.git
cd xtalk
python examples/sample_app/configurable_server.py --port 7635 --config <PATH_TO_CONFIG>.json

Finally, our demo is ready at http://localhost:7635. View it in the browser!

<a id="tutorial"></a>

📖 Tutorial

Start the Server

[!NOTE] See examples/sample_app/configurable_server.py, frontend/src, examples/sample_app/templates and examples/sample_app/static for details.

X-Talk keeps most models and execution on the server side; the client is responsible for interacting with the microphone, transmitting audio and WebSocket messages, and handling lightweight operations like voice activity detection (VAD).
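Since VAD runs client-side, a minimal sketch of what an energy-based detector can look like is shown below. This is illustrative only, not X-Talk's actual implementation:

```javascript
// Illustrative energy-based voice activity detector (NOT X-Talk's actual code).
// Returns true when the RMS energy of one frame of PCM samples
// (floats in [-1, 1], as delivered by the Web Audio API) exceeds a threshold.
function isSpeech(samples, threshold = 0.01) {
    let sumSquares = 0;
    for (const s of samples) sumSquares += s * s;
    const rms = Math.sqrt(sumSquares / samples.length);
    return rms > threshold;
}
```

Real detectors typically add hangover smoothing across frames so that brief pauses do not cut off the user's turn.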

On the client side, you can start with the snippet in examples/sample_app/static/js/index.js and track where convo is used to see how the client API works:

async function loadXtalk() {
    try {
        return await import("../../xtalk/index.js"); // Try local import first, this is dev only
    } catch (e) {
        return await import("https://unpkg.com/xtalk-client@latest/dist/index.js"); // Use unpkg CDN for production
    }
}

const { createConversation } = await loadXtalk();


function getWebSocketURL() {
    // Derive ws:// or wss:// from the page protocol and point at the ./ws endpoint.
    const proto = location.protocol === "https:" ? "wss:" : "ws:";
    const wsPath = new URL("./ws", window.location.href);
    wsPath.protocol = proto;
    wsPath.host = window.location.host;
    return wsPath;
}

const convo = createConversation(getWebSocketURL());

We recently published the client API as a separate package, xtalk-client. You can therefore import it directly from https://unpkg.com/xtalk-client@latest/dist/index.js without hosting the client code yourself, as shown above. We plan to keep improving the client-side API.

On the server side, the core logic is to connect an X-Talk instance to a WebSocket endpoint of a FastAPI instance:

from fastapi import FastAPI, WebSocket

from xtalk import Xtalk

app = FastAPI(title="Xtalk Server")
xtalk_instance = Xtalk.from_config("path/to/config.json")

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    # Hand the accepted socket to X-Talk, which drives the dialogue loop.
    await xtalk_instance.connect(websocket)

Then you can check examples/sample_app/configurable_server.py for how to mount client-side scripts and pages.

Text Embedding

[!NOTE] See examples/sample_app/configurable_server.py and `frontend/src/js/index.j
