# ChromePilot v2.0

An AI-powered browser automation agent using a dual-LLM architecture: a reasoning model (qwen3-vl-32k) orchestrates tasks, while an executor model (llama3.1-8b-32k:latest) translates steps into tool calls with full context from previous actions.
## Version History

### v2 (Current): One-shot agent with plan-and-execute workflow
- Orchestrator creates a complete plan upfront based on screenshot
- Executor executes each step sequentially with context from previous steps
- User approves/rejects plans before execution
- Post-execution verification to confirm task completion
### v3 (Planned): True iterative agent with dynamic re-evaluation
- Agent iterates and adapts plan based on execution results
- Re-evaluates after each step and adjusts strategy if needed
- Asks user for clarification when encountering ambiguity
- Similar to GitHub Copilot's conversational debugging approach
- Handles unexpected page states and errors gracefully
## Architecture
ChromePilot uses a dual-LLM system:
- Orchestrator (qwen3-vl-32k): Vision-enabled reasoning model that sees your page and creates plain English step-by-step plans
- Executor (llama3.1-8b-32k:latest): Fast, lightweight model that translates each step into specific tool calls with access to previous step outputs
This architecture enables:

- Steps that reference previous outputs (e.g., "Click the first link from the search results")
- A reasoning model that focuses on high-level planning without tool syntax
- An executor model with full context of the execution history at each step
→ See ARCHITECTURE.md for detailed explanation with examples and flow diagrams
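As a rough illustration of the hand-off, the shapes below sketch what the orchestrator's plan and the executor's per-step prompt could look like. The object layout and the `buildExecutorPrompt` helper are assumptions for illustration, not ChromePilot's actual internal API:

```javascript
// Hypothetical data shapes for the plan-and-execute hand-off.
// The orchestrator emits plain-English steps; the executor later receives
// each step together with the outputs of the steps that ran before it.
const plan = {
  task: "Search the web for 'ollama' and open the first result",
  steps: [
    { id: 1, description: "Type 'ollama' into the search box" },
    { id: 2, description: "Press Enter to submit the search" },
    { id: 3, description: "Click the first link from the search results" },
  ],
};

// Builds the executor's context-aware prompt for one step (illustrative).
function buildExecutorPrompt(step, previousOutputs) {
  return [
    `Step ${step.id}: ${step.description}`,
    `Previous step outputs: ${JSON.stringify(previousOutputs)}`,
    "Respond with exactly one tool call.",
  ].join("\n");
}

console.log(
  buildExecutorPrompt(plan.steps[2], {
    1: "typed 'ollama'",
    2: "search submitted",
  })
);
```

This is why a step like "Click the first link from the search results" can work: the executor's prompt carries the outputs of steps 1 and 2, not just the step text.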
## Features
- 🎯 Visual AI Agent: Sees and understands web pages using vision models
- 🔄 Two-Stage Execution: Orchestrator plans, executor executes with context
- 📸 Screenshot Analysis: Automatically captures and analyzes the current tab
- 🌐 HTML Context: Extracts complete page HTML structure
- 💭 Reasoning Process: View the orchestrator's step-by-step thinking
- 🔄 Streaming Responses: Real-time streaming of AI responses with markdown rendering
- 📊 Execution Tracking: See each step's status, inputs, and outputs
- ↕️ Collapsible Plans: Expand/collapse plan details and execution history
- 💾 Conversation History: Maintains context of last 4 messages
- 🎨 Clean UI: Beautiful sidebar interface with smooth animations
- 🔐 Privacy-Focused: All processing happens locally through Ollama
- 🎛️ Context Controls: Toggle screenshot and HTML context on/off
## Current Capabilities (v2)
- ✅ Multi-step task planning with plain English descriptions
- ✅ Context-aware execution (steps can use previous outputs)
- ✅ 10 comprehensive browser tools (click, type, select, pressKey, scroll, navigate, manageTabs, waitFor, getSchema, getHTML)
- ✅ Accessibility tree extraction with smart element filtering
- ✅ Visual execution feedback with status tracking
- ✅ Approve/reject workflow with plan correction support
- ✅ Post-execution verification with screenshot analysis
- ✅ One-shot planning: complete plan created upfront before execution
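To make the tool list above concrete, here is a sketch of how an executor's tool call might be dispatched. Only three of the ten tools are shown, and the argument names (`selector`, `text`, `url`) and handler bodies are assumptions, not ChromePilot's documented signatures:

```javascript
// Illustrative dispatcher: maps a tool name from the executor's output to a
// handler function. Handlers here just return strings; the real tools act on
// the page via content scripts.
const tools = {
  click: ({ selector }) => `clicked ${selector}`,
  type: ({ selector, text }) => `typed "${text}" into ${selector}`,
  navigate: ({ url }) => `navigated to ${url}`,
};

function dispatch(call) {
  const handler = tools[call.tool];
  if (!handler) throw new Error(`Unknown tool: ${call.tool}`);
  return handler(call.args);
}

console.log(dispatch({ tool: "click", args: { selector: "#search-btn" } }));
```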
## Planned for v3 (Iterative Agent)
- 🔨 Dynamic re-planning based on execution results
- 🔨 Step-by-step evaluation and strategy adjustment
- 🔨 Conversational clarification requests to user
- 🔨 Error recovery with intelligent retry logic
- 🔨 Handling unexpected page states and navigation changes
## Prerequisites

1. **Ollama**: Install Ollama from https://ollama.ai

2. **Orchestrator Model**: Create the qwen3-vl-32k model with extended context.

   First, pull the base model:

   ```
   ollama pull qwen3-vl:8b
   ```

   Create a file named `Modelfile1` with this content:

   ```
   FROM qwen3-vl:8b
   PARAMETER num_ctx 32768
   ```

   Create the extended context model and verify it was created:

   ```
   ollama create qwen3-vl-32k -f Modelfile1
   ollama list
   ```

3. **Executor Model**: Create the llama3.1-8b-32k model with extended context.

   First, pull the base model:

   ```
   ollama pull llama3.1:8b
   ```

   Create a file named `Modelfile2` with this content:

   ```
   FROM llama3.1:8b
   PARAMETER num_ctx 32768
   ```

   Create the extended context model and verify it was created:

   ```
   ollama create llama3.1-8b-32k -f Modelfile2
   ollama list
   ```

4. **Enable CORS**: Ollama must be started with CORS enabled for Chrome extensions.

   Windows:

   ```
   set OLLAMA_ORIGINS=chrome-extension://*
   ollama serve
   ```

   Or simply run the provided batch file: `start-ollama-with-cors.bat`

   macOS/Linux:

   ```
   OLLAMA_ORIGINS=chrome-extension://* ollama serve
   ```
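You can also confirm both models exist programmatically. The `/api/tags` endpoint and its `{ models: [{ name, ... }] }` response shape are Ollama's standard model-listing API; the helper name below is ours:

```javascript
// Extract model names from Ollama's /api/tags response body.
function modelNames(tagsResponse) {
  return tagsResponse.models.map((m) => m.name);
}

// Usage with Ollama running (e.g. from a browser console):
// const body = await (await fetch("http://localhost:11434/api/tags")).json();
// modelNames(body);  // should include both models you created above
```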
## Installation

1. Clone or download this repository
2. Open Chrome and navigate to `chrome://extensions/`
3. Enable "Developer mode" in the top right
4. Click "Load unpacked" and select the ChromePilot folder
5. The ChromePilot icon should appear in your extensions toolbar
## Usage

1. Start Ollama with CORS enabled (see Prerequisites)
2. Click the ChromePilot icon in your Chrome toolbar to open the sidebar
3. The extension will automatically:
   - Capture a screenshot of the current tab
   - Extract the complete HTML structure (not just the visible area)
   - Send both to the AI model
4. Ask questions about the page:
   - "What is this page about?"
   - "Where can I find the filters?"
   - "What options are available on this form?"
   - "Explain what I'm looking at"
5. View reasoning: Click "View Reasoning" to see the AI's step-by-step thinking
6. Follow-up questions: Ask related questions; the AI remembers the last 2 exchanges
7. Toggle context: Use the switches to enable/disable screenshot or HTML context
8. Reset: Click the reset button to start a fresh conversation
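The follow-up memory can be pictured as a simple sliding window over the message list. This is an assumption about the implementation, shown only to clarify the "last 2 exchanges" behavior:

```javascript
// Keep only the most recent messages so follow-ups fit the context window.
// 4 messages = 2 user/assistant exchanges.
function trimHistory(messages, maxMessages = 4) {
  return messages.slice(-maxMessages);
}

const history = ["u1", "a1", "u2", "a2", "u3", "a3"];
console.log(trimHistory(history)); // → ["u2", "a2", "u3", "a3"]
```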
## Technical Details

### Token Management
- Maximum input: 32K tokens (including image)
- Automatic token estimation prevents truncation
- HTML is simplified and truncated to reduce token usage
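The estimation could work roughly like this sketch. The ~4 characters/token heuristic and the fixed image-token allowance are assumptions; the extension's exact accounting may differ:

```javascript
// Rough token budgeting. Assumes ~4 characters per token and a fixed
// allowance for the screenshot; both numbers are illustrative.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

function fitsContextWindow(promptText, imageTokenAllowance = 1500, limit = 32768) {
  return estimateTokens(promptText) + imageTokenAllowance <= limit;
}
```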
### HTML Processing
The extension extracts all displayed elements from the page:
- Captures entire page HTML, not just viewport-visible elements
- Removes styling, scripts, SVGs, and non-interactive elements
- Preserves IDs, classes, semantic attributes, and ARIA labels
- Includes elements below the fold (scrolled out of view)
- Maximum 20K characters of HTML
- Skips CSS-hidden elements (display: none, visibility: hidden)
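A string-level approximation of the simplification is sketched below. The real extension walks the live DOM and can check computed styles (e.g. `display: none`), which plain string processing cannot, so treat this only as an illustration of the rules above:

```javascript
// Approximate HTML simplification: drop script/style/svg subtrees and inline
// style attributes, then cap the result at 20K characters. IDs, classes, and
// aria-* attributes are left untouched.
function simplifyHtml(html, maxChars = 20000) {
  const stripped = html
    .replace(/<(script|style|svg)\b[\s\S]*?<\/\1>/gi, "")
    .replace(/\sstyle="[^"]*"/gi, "");
  return stripped.slice(0, maxChars);
}
```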
### Permissions

The extension requests the following permissions:

- `activeTab`: Capture screenshots and inject scripts
- `tabs`: Access tab information
- `scripting`: Execute content scripts
- `sidePanel`: Display the chat interface
- `storage`: Save conversation history
- `debugger`: Future mouse/keyboard control
- `<all_urls>`: Work on any webpage
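These correspond to entries in the extension's `manifest.json`. A minimal sketch of the relevant fields, assuming the standard Manifest V3 layout (other required fields such as `name` and `version` are omitted):

```json
{
  "manifest_version": 3,
  "permissions": ["activeTab", "tabs", "scripting", "sidePanel", "storage", "debugger"],
  "host_permissions": ["<all_urls>"]
}
```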
## Future Enhancements (v3+)
Planned features for v3 (Iterative Agent):
- 🤖 Dynamic Re-planning: Adjust strategy based on execution outcomes
- 🔄 Iterative Evaluation: Re-evaluate after each step instead of one-shot planning
- 💬 Conversational Clarification: Ask user for input when encountering ambiguity
- 🛡️ Adaptive Error Handling: Recover from failures with alternative approaches
- 🎯 Context-Aware Adaptation: Handle unexpected page states intelligently
v2 provides a one-shot plan-and-execute workflow; v3 will introduce true agentic behavior with iteration and dynamic adaptation.
## Troubleshooting

### "Cannot connect to Ollama" or "Failed to fetch"

- Ensure Ollama is running with CORS enabled:
  - Windows: `set OLLAMA_ORIGINS=chrome-extension://* && ollama serve`
  - Or run the provided `start-ollama-with-cors.bat`
- Check that Ollama is accessible by opening `http://localhost:11434/api/tags` in your browser
- Restart Ollama if you forgot to set CORS initially
"Model not found"
- Make sure you created both models (see Prerequisites)
- Orchestrator:
ollama pull qwen3-vl:8bthenollama create qwen3-vl-32k -f Modelfile1 - Executor:
ollama pull llama3.1-8b-32k:latestthenollama create llama3.1-32k -f Modelfile2 - Verify with:
ollama list(should showqwen3-vl-32k:latestandllama3.1-32k:latest)
"Request too large"
- The page content exceeds 32K tokens
- Try asking a more specific question
- Navigate to a simpler page section
## License
MIT License - Feel free to modify and distribute
## Credits
- Built with Ollama for local AI processing
- Uses qwen3-vl-32k for vision and reasoning capabilities