
droidclaw

an ai agent that controls your android phone. give it a goal in plain english — it figures out what to tap, type, and swipe.

Download Android APK (v0.5.3) | Dashboard | Discord

i wanted to turn my old android devices into ai agents. after a few hours of reverse engineering accessibility trees and playing with tailscale... it worked.

think of it this way — a few years back, we could automate android with predefined flows. now imagine that automation layer has an llm brain. it can read any screen, understand what's happening, decide what to do, and execute. you don't need apis. you don't need to build integrations. just install your favourite apps and tell the agent what you want done.

one of the coolest things it can do right now is delegate incoming requests to chatgpt, gemini, or google search on the device... and bring the result back. no api keys for those services needed — it just uses the apps like a human would.

$ bun run src/kernel.ts
enter your goal: open youtube and search for "lofi hip hop"

--- step 1/30 ---
think: i'm on the home screen. launching youtube.
action: launch (842ms)

--- step 2/30 ---
think: youtube is open. tapping search icon.
action: tap (623ms)

--- step 3/30 ---
think: search field focused.
action: type "lofi hip hop" (501ms)

--- step 4/30 ---
action: enter (389ms)

--- step 5/30 ---
think: search results showing. done.
action: done (412ms)

how it works

the core idea is dead simple — a perception → reasoning → action loop that repeats until the goal is done (or it runs out of steps).

              ┌─────────────────────────────────────────┐
              │                your goal                │
              │  "send good morning to mom on whatsapp" │
              └────────────────────┬────────────────────┘
                                   │
                                   ▼
          ┌─────────────────────────────────────────────────┐
          │                 ┌──────────────┐                │
          │                 │  1. perceive │                │
          │                 └──────┬───────┘                │
          │                        │                        │
          │   dump accessibility tree via adb               │
          │   parse xml → interactive ui elements           │
          │   diff with previous screen (detect changes)    │
          │   optionally capture screenshot                 │
          │                        │                        │
          │                        ▼                        │
          │                 ┌──────────────┐                │
          │                 │  2. reason   │                │
          │                 └──────┬───────┘                │
          │                        │                        │
          │   send screen state + goal + history to llm     │
          │   llm returns { think, plan, action }           │
          │   "i see the search icon at (890, 156).         │
          │    i should tap it."                            │
          │                        │                        │
          │                        ▼                        │
          │                 ┌──────────────┐                │
          │                 │  3. act      │                │
          │                 └──────┬───────┘                │
          │                        │                        │
          │   execute via adb: tap, type, swipe, etc.       │
          │   feed result back to llm on next step          │
          │   check if goal is done                         │
          │                        │                        │
          │                        ▼                        │
          │                  done? ── yes ──→ exit          │
          │                    │                            │
          │                    no                           │
          │                    │                            │
          │                    └── loop back to perceive    │
          └─────────────────────────────────────────────────┘

what makes it not fall apart

llms controlling uis sounds fragile. and it is, if you don't handle the failure modes. here's what droidclaw does:

  • stuck loop detection — if the screen doesn't change for 3 steps, context-aware recovery hints get injected into the prompt, based on what type of action is failing (tap vs swipe vs wait).
  • repetition tracking — a sliding window of recent actions catches retry loops even across screen changes. if the agent taps the same coordinates 3+ times, it gets told to stop and try something else.
  • drift detection — if the agent spams navigation actions (swipe, back, wait) without interacting with anything, it gets nudged to take direct action.
  • vision fallback — when the accessibility tree is empty (webviews, flutter apps, games), a screenshot gets sent to the llm instead, with coordinate-based tap suggestions.
  • action feedback — every action result (success/failure + message) gets fed back to the llm on the next step. the agent knows whether its last move worked.
  • multi-turn memory — conversation history is maintained across steps so the llm has context about what it already tried.
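the repetition tracker, for example, can be sketched as a sliding window over recent taps. this is an illustration of the idea, not droidclaw's implementation — the window size and threshold here are assumptions:

```typescript
// sliding-window repetition tracker: flags when the agent keeps
// tapping the same coordinates, even across screen changes.

type Tap = { x: number; y: number };

class RepetitionTracker {
  private window: Tap[] = [];
  // size and threshold are illustrative defaults, not droidclaw's values
  constructor(private size = 8, private threshold = 3) {}

  // record a tap; returns true when a retry loop is detected,
  // i.e. the agent should be told to stop and try something else
  record(tap: Tap): boolean {
    this.window.push(tap);
    if (this.window.length > this.size) this.window.shift();
    const repeats = this.window.filter(t => t.x === tap.x && t.y === tap.y).length;
    return repeats >= this.threshold;
  }
}
```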

setup

quick install

curl -fsSL https://droidclaw.ai/install.sh | sh

this installs bun and adb if missing, clones the repo, and sets up .env.

manual install

prerequisites:

  • bun (required — node/npm won't work. droidclaw uses bun-specific apis like Bun.spawnSync and native .env loading)
  • adb (android debug bridge — comes with android sdk platform tools)
  • an android phone with usb debugging enabled
  • an llm provider api key (or ollama for fully local)
# install adb
# macos:
brew install android-platform-tools
# linux:
sudo apt install android-tools-adb
# windows:
# download from https://developer.android.com/tools/releases/platform-tools

# install bun
curl -fsSL https://bun.sh/install | bash

# clone and setup
git clone https://github.com/unitedbyai/droidclaw.git
cd droidclaw
bun install
cp .env.example .env

configure your llm

edit .env and pick a provider. fastest way to start is groq (free tier):

LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key_here

or run fully local with ollama (no api key, no internet needed):

ollama pull llama3.2
# then in .env:
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.2

connect your phone

  1. go to settings → about phone → tap "build number" 7 times to enable developer options
  2. go to settings → developer options → enable "usb debugging"
  3. plug in via usb and tap "allow" on the phone when prompted
adb devices   # should show your device

run it

bun run src/kernel.ts
# type your goal and press enter

three ways to use it

droidclaw has three modes, each for a different use case:

┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   interactive mode          workflows             flows             │
│   ─────────────────    ─────────────────    ─────────────────       │
│                                                                     │
│   type a goal and       chain goals          fixed sequences        │
│   the agent figures     across multiple      of taps and types.     │
│   it out on the fly.    apps with ai.        no llm, instant.       │
│                                                                     │
│   $ bun run              --workflow            --flow               │
│     src/kernel.ts         file.json             file.yaml           │
│                                                                     │
│   best for:             best for:            best for:              │
│   one-off tasks,        multi-app tasks,     things you do          │
│   exploration,          recurring routines,  exactly the same       │
│   quick commands        morning briefings    way every time         │
│                                                                     │
│   uses llm: yes         uses llm: yes        uses llm: no          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

interactive mode

just type what you want:

bun run src/kernel.ts
# enter your goal: open settings and turn on dark mode

workflows (ai-powered, multi-app)

workflows are json files describing a sequence of sub-goals. each step can optionally switch to a different app. the llm decides how to navigate, what to tap, what to type.

bun run src/kernel.ts --workflow examples/workflows/research/weather-to-whatsapp.json

{
  "name": "weather to whatsapp",
  "steps": [
    {
      "app": "com.google.android.googlequicksearchbox",
      "goal": "search for chennai weather today"
    },
    {
      "goal": "share the result to whatsapp contact Sanju"
    }
  ]
}

you can inject specific data into steps using formData.
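the formData schema itself isn't shown in this excerpt, but by analogy with the workflow example above it attaches key/value data to a step. the following is a hypothetical sketch — the field names and values are guesses for illustration, not droidclaw's documented schema:

```json
{
  "name": "daily standup note",
  "steps": [
    {
      "app": "com.whatsapp",
      "goal": "send the standup message to the team group",
      "formData": {
        "message": "on track, no blockers today"
      }
    }
  ]
}
```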
