🍵 Matcha
An agent-native voice-and-vision framework. Turn any audio/visual device -- earbuds, smart glasses, pendants, phones -- into an always-on AI companion that can perceive, understand, and act on your behalf.
Built by Intentlabs.
Supported platforms: iOS (iPhone) and Android
The Problem
Today's voice AI apps (ChatGPT Voice, Gemini Live, Sesame) are conversational but not agentic. They can talk to you, but they cannot act for you. When they try to do complex tasks (search, multi-step workflows, API calls), they go silent for 10-30 seconds -- broken UX.
Meanwhile, agent frameworks (OpenClaw, Manus, Claude Code) can execute complex tasks but have no real-time voice interface.
No consumer product today combines real-time voice conversation with general-purpose agent execution. Matcha fills this gap.
Core Architecture: Dual-Agent System
Matcha separates real-time voice interaction from asynchronous task execution, allowing both to run simultaneously without blocking each other.
+-----------------------------+
| MATCHA CORE |
| |
User ---- Audio ------> | +---------------------+ |
Device Stream | | VOICE AGENT | |
(glasses, | | (synchronous) | |
earbuds, | | | |
pendant, | | Real-time voice | |
phone) | | conversation. | |
<-- Audio --- | | Always responsive. | |
Response | | Never blocked. | |
| +----------+-----------+ |
| | |
| delegates tasks |
| | |
| +----------v-----------+ |
| | ACTION AGENT | |
| | (asynchronous) | |
User ---- Video ------> | | | |
Device Frames | | Web search, API | |
(camera (~1fps) | | calls, messaging, | |
on | | smart home, etc. | |
glasses, | | | |
phone) | | Reports results | |
| | back to Voice | |
| | Agent when ready. | |
| +----------------------+ |
| |
+-----------------------------+
Voice Agent -- maintains real-time bidirectional audio with the user. Sub-second latency. Never blocked by tasks. Powered by Gemini Live API or OpenAI Realtime API.
Action Agent -- receives task delegations from Voice Agent. Executes complex, multi-step tasks in the background via either E2B cloud sandboxes (Claude Agent SDK) or OpenClaw (56+ skills: web search, messaging, smart home, notes, reminders, etc.). Reports results back to Voice Agent when ready.
Example flow:
- User: "Find me the best ramen places in SF that are open late"
- Voice Agent: "Sure, let me search for late-night ramen spots."
- Action Agent begins web search in background
- User: "Oh also, I want somewhere with vegetarian options"
- Voice Agent: "Got it, I'll filter for vegetarian-friendly places too."
- Action Agent returns results
- Voice Agent speaks the answer conversationally
The user is never left in silence. The agent is never limited to shallow answers.
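The non-blocking delegation above can be sketched in a few lines. This is an illustrative sketch, not Matcha's actual API: the class and method names (VoiceAgent, ActionAgent, delegate, drain) are hypothetical. The point is that delegation returns immediately, so the voice loop keeps speaking while the task runs, and the result is spoken when it arrives.

```typescript
// Hypothetical sketch of the dual-agent pattern. The voice agent
// delegates a task and stays responsive; the action agent reports
// back asynchronously when the task finishes.

type TaskResult = { taskId: string; summary: string };

class ActionAgent {
  // Runs a task in the background and resolves with a result summary.
  async run(taskId: string, task: () => Promise<string>): Promise<TaskResult> {
    const summary = await task();
    return { taskId, summary };
  }
}

class VoiceAgent {
  private pending: Promise<TaskResult>[] = [];
  constructor(
    private action: ActionAgent,
    private speak: (s: string) => void,
  ) {}

  // Delegate WITHOUT awaiting: the voice loop is never blocked.
  delegate(taskId: string, task: () => Promise<string>): void {
    const p = this.action.run(taskId, task).then((r) => {
      this.speak(`Here's what I found: ${r.summary}`);
      return r;
    });
    this.pending.push(p);
    this.speak("Sure, let me look into that.");
  }

  // Flush all delegated tasks (used here only to end the demo cleanly).
  async drain(): Promise<void> {
    await Promise.all(this.pending);
  }
}

async function demo(): Promise<string[]> {
  const spoken: string[] = [];
  const voice = new VoiceAgent(new ActionAgent(), (s) => spoken.push(s));
  voice.delegate("ramen-search", async () => "3 late-night ramen spots in SF");
  // The user keeps talking while the search runs in the background.
  spoken.push("(conversation continues; voice agent still responsive)");
  await voice.drain();
  return spoken;
}
```

Note the ordering: the acknowledgment is spoken first, conversation continues, and the task result is voiced last, once ready.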
Supported Hardware
Matcha is device-agnostic. It connects to any audio I/O device (with optional video input):

| Device | Audio In | Audio Out | Video In | Status |
|--------|----------|-----------|----------|--------|
| Phone (built-in) | Mic | Speaker | Camera | Working |
| AirPods / earbuds | Mic | Speaker | -- | Working |
| Meta Ray-Ban glasses | Mic | Speaker | Camera (via DAT SDK) | Working |
| Any Bluetooth audio | Mic | Speaker | -- | Working |
| Sesame glasses | Mic | Speaker | Camera | Planned |
| Apple glasses | Mic | Speaker | Camera | Planned |
| Pendant devices | Mic | Speaker | Camera | Planned |
Supported Voice Models
Matcha is model-agnostic:
| Provider | Model | Status |
|----------|-------|--------|
| Google | Gemini 2.0 Flash (Live API) | Working |
| OpenAI | GPT-4o Realtime API | Planned |
Quick Start (iOS)
1. Clone and open
git clone https://github.com/Intent-Lab/matcha.git
cd matcha/samples/CameraAccess
open CameraAccess.xcodeproj
2. Add your secrets
cp CameraAccess/Secrets.swift.example CameraAccess/Secrets.swift
Edit Secrets.swift with your Gemini API key (required) and optional E2B/OpenClaw/WebRTC config.
3. Build and run
Select your iPhone as the target device and hit Run (Cmd+R).
4. Try it out
Without glasses (iPhone mode):
- Tap "Start on iPhone" -- uses your iPhone's back camera
- Tap the AI button to start a voice session
- Talk to the AI -- it can see through your iPhone camera and execute tasks
With Meta Ray-Ban glasses:
First, enable Developer Mode in the Meta AI app:
- Open the Meta AI app on your iPhone
- Go to Settings (gear icon, bottom left)
- Tap App Info
- Tap the App version number 5 times -- this unlocks Developer Mode
- Go back to Settings -- you'll now see a Developer Mode toggle. Turn it on.
Then in the app:
- Tap "Start Streaming"
- Tap the AI button for voice + vision conversation
Quick Start (Android)
1. Clone and open
git clone https://github.com/Intent-Lab/matcha.git
Open samples/CameraAccessAndroid/ in Android Studio.
2. Configure GitHub Packages (DAT SDK)
The Meta DAT Android SDK is distributed via GitHub Packages. You need a GitHub Personal Access Token with read:packages scope.
- Go to GitHub > Settings > Developer Settings > Personal Access Tokens and create a classic token with the read:packages scope
- In samples/CameraAccessAndroid/local.properties, add:
github_token=YOUR_GITHUB_TOKEN
3. Add your secrets
cd samples/CameraAccessAndroid/app/src/main/java/com/meta/wearable/dat/externalsampleapps/cameraaccess/
cp Secrets.kt.example Secrets.kt
Edit Secrets.kt with your Gemini API key (required) and optional E2B/OpenClaw/WebRTC config.
4. Build and run
- Let Gradle sync in Android Studio
- Select your Android phone as the target device
- Click Run (Shift+F10)
5. Try it out
Without glasses (Phone mode):
- Tap "Start on Phone" -- uses your phone's back camera
- Tap the AI button to start a voice session
- Talk to the AI -- it can see through your phone camera and execute tasks
With Meta Ray-Ban glasses:
Enable Developer Mode in the Meta AI app (same steps as iOS above), then:
- Tap "Start Streaming" in the app
- Tap the AI button for voice + vision conversation
Agent Backends
Matcha supports two agent backends for task execution. You can switch between them at runtime in the in-app Settings > Agent Backend picker.
| Backend | Description | Best for |
|---------|-------------|----------|
| E2B | Cloud-hosted sandbox (E2B + Claude Agent SDK). Deploy the agent/ directory to Vercel. Supports streaming tool progress. | Production, multi-user |
| OpenClaw | Local Mac gateway with 56+ skills. Runs on your local network. | Development, personal use |
Without either backend configured, the AI is voice + vision only (no task execution).
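One way to picture the runtime-switchable backend is a shared interface with two interchangeable implementations. This is a hypothetical sketch, not Matcha's actual types: the AgentBackend interface and pickBackend function are illustrative only.

```typescript
// Hypothetical sketch: both backends expose the same interface, so the
// voice agent doesn't care which one executes its delegated tasks.

interface AgentBackend {
  name: string;
  execute(task: string): Promise<string>;
}

const e2bBackend: AgentBackend = {
  name: "E2B",
  // In the real app this would call the Vercel-deployed agent/ service.
  execute: async (task) => `e2b:${task}`,
};

const openClawBackend: AgentBackend = {
  name: "OpenClaw",
  // In the real app this would hit the local gateway over HTTP.
  execute: async (task) => `openclaw:${task}`,
};

// The Settings > Agent Backend picker just swaps the active object.
function pickBackend(setting: "E2B" | "OpenClaw" | "None"): AgentBackend | null {
  if (setting === "E2B") return e2bBackend;
  if (setting === "OpenClaw") return openClawBackend;
  return null; // no backend: voice + vision only, no task execution
}
```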
Setup: E2B Agent (Optional)
The E2B backend runs a Claude Agent SDK sandbox in the cloud. It supports real-time streaming of tool execution progress (which tools are running, their results, etc.).
1. Deploy the agent
Deploy the agent/ directory to Vercel:
cd agent
vercel deploy
2. Configure the app
iOS -- In Secrets.swift:
static let agentBaseURL = "https://your-deployment.vercel.app"
static let agentToken = "your-shared-secret-token"
Android -- In Secrets.kt:
const val agentBaseURL = "https://your-deployment.vercel.app"
const val agentToken = "your-shared-secret-token"
3. Select the backend
Open Settings in the app and set Agent Backend to E2B.
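For reference, the two secrets map onto an ordinary authenticated HTTP request. This sketch assumes bearer-token auth with agentToken; the route /api/task is a placeholder, not a documented Matcha endpoint, so check the agent/ source for the real path.

```typescript
// Hypothetical sketch: how agentBaseURL and agentToken would combine
// into a request to the deployed agent. The /api/task path is assumed.

function buildAgentRequest(baseURL: string, token: string, task: string) {
  return {
    url: `${baseURL}/api/task`,
    init: {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ task }),
    },
  };
}

// Usage: const { url, init } = buildAgentRequest(baseURL, token, "find ramen");
//        const res = await fetch(url, init);
```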
Setup: OpenClaw (Optional)
OpenClaw gives Matcha the ability to take real-world actions: send messages, search the web, manage lists, control smart home devices, and more.
1. Install and configure OpenClaw
Follow the OpenClaw setup guide. Make sure the gateway is enabled:
In ~/.openclaw/openclaw.json:
{
  "gateway": {
    "port": 18789,
    "bind": "lan",
    "auth": {
      "mode": "token",
      "token": "your-gateway-token-here"
    },
    "http": {
      "endpoints": {
        "chatCompletions": { "enabled": true }
      }
    }
  }
}
2. Configure the app
iOS -- In Secrets.swift:
static let openClawHost = "http://Your-Mac.local"
static let openClawPort = 18789
static let openClawGatewayToken = "your-gateway-token-here"
Android -- In Secrets.kt:
const val openClawHost = "http://Your-Mac.local"
const val openClawPort = 18789
const val openClawGatewayToken = "your-gateway-token-here"
3. Select the backend
Open Settings in the app and set Agent Backend to OpenClaw. You can use the Test Connection button to verify connectivity.
4. Start the gateway
openclaw gateway restart
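The host, port, and gateway token from Secrets combine the same way for the gateway. This sketch assumes the chatCompletions endpoint lives at the OpenAI-style path /v1/chat/completions; that path is inferred from the config key, not confirmed by this README, so verify it against the OpenClaw docs.

```typescript
// Hypothetical sketch: building a request to the local OpenClaw gateway
// from the three Secrets values. The endpoint path is an assumption.

function buildGatewayRequest(host: string, port: number, token: string) {
  return {
    url: `${host}:${port}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        messages: [{ role: "user", content: "ping" }],
      }),
    },
  };
}

// Usage: const { url, init } = buildGatewayRequest("http://Your-Mac.local", 18789, token);
//        const res = await fetch(url, init); // should return 200 if the gateway is up
```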
Architecture
Project Structure (iOS)
samples/CameraAccess/