# ChromePilot v2.0

An AI-powered browser automation agent using a dual-LLM architecture: a reasoning model (qwen3-vl-32k) orchestrates tasks, while an executor model (llama3.1-8b-32k:latest) translates steps into tool calls with full context from previous actions.
## Version History

### v2 (Current): One-shot agent with plan-and-execute workflow
- Orchestrator creates a complete plan upfront based on screenshot
- Executor executes each step sequentially with context from previous steps
- User approves/rejects plans before execution
- Post-execution verification to confirm task completion
### v3 (Planned): True iterative agent with dynamic re-evaluation
- Agent iterates and adapts plan based on execution results
- Re-evaluates after each step and adjusts strategy if needed
- Asks user for clarification when encountering ambiguity
- Similar to GitHub Copilot's conversational debugging approach
- Handles unexpected page states and errors gracefully
## Architecture
ChromePilot uses a dual-LLM system:
- Orchestrator (qwen3-vl-32k): Vision-enabled reasoning model that sees your page and creates plain English step-by-step plans
- Executor (llama3.1-8b-32k:latest): Fast, lightweight model that translates each step into specific tool calls with access to previous step outputs
This architecture enables:

- Steps that reference previous outputs (e.g., "Click the first link from the search results")
- A reasoning model that focuses on high-level planning without tool syntax
- An executor model with full context of the execution history at each step
→ See ARCHITECTURE.md for detailed explanation with examples and flow diagrams
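As a rough illustration of the hand-off, the shapes below sketch what the orchestrator's plan and the executor's per-step prompt could look like. The object layout and the `buildExecutorPrompt` helper are assumptions for illustration, not ChromePilot's actual internal API:

```javascript
// Hypothetical data shapes for the plan-and-execute hand-off.
// The orchestrator emits plain-English steps; the executor later receives
// each step together with the outputs of the steps that ran before it.
const plan = {
  task: "Search the web for 'ollama' and open the first result",
  steps: [
    { id: 1, description: "Type 'ollama' into the search box" },
    { id: 2, description: "Press Enter to submit the search" },
    { id: 3, description: "Click the first link from the search results" },
  ],
};

// Builds the executor's context-aware prompt for one step (illustrative).
function buildExecutorPrompt(step, previousOutputs) {
  return [
    `Step ${step.id}: ${step.description}`,
    `Previous step outputs: ${JSON.stringify(previousOutputs)}`,
    "Respond with exactly one tool call.",
  ].join("\n");
}

console.log(
  buildExecutorPrompt(plan.steps[2], {
    1: "typed 'ollama'",
    2: "search submitted",
  })
);
```

This is why a step like "Click the first link from the search results" can work: the executor's prompt carries the outputs of steps 1 and 2, not just the step text.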
## Features
- 🎯 Visual AI Agent: Sees and understands web pages using vision models
- 🔄 Two-Stage Execution: Orchestrator plans, executor executes with context
- 📸 Screenshot Analysis: Automatically captures and analyzes the current tab
- 🌐 HTML Context: Extracts complete page HTML structure
- 💭 Reasoning Process: View the orchestrator's step-by-step thinking
- 🔄 Streaming Responses: Real-time streaming of AI responses with markdown rendering
- 📊 Execution Tracking: See each step's status, inputs, and outputs
- ↕️ Collapsible Plans: Expand/collapse plan details and execution history
- 💾 Conversation History: Maintains context of last 4 messages
- 🎨 Clean UI: Beautiful sidebar interface with smooth animations
- 🔐 Privacy-Focused: All processing happens locally through Ollama
- 🎛️ Context Controls: Toggle screenshot and HTML context on/off
## Current Capabilities (v2)
- ✅ Multi-step task planning with plain English descriptions
- ✅ Context-aware execution (steps can use previous outputs)
- ✅ 10 comprehensive browser tools (click, type, select, pressKey, scroll, navigate, manageTabs, waitFor, getSchema, getHTML)
- ✅ Accessibility tree extraction with smart element filtering
- ✅ Visual execution feedback with status tracking
- ✅ Approve/reject workflow with plan correction support
- ✅ Post-execution verification with screenshot analysis
- ✅ One-shot planning: complete plan created upfront before execution
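To make the tool list above concrete, here is a sketch of how an executor's tool call might be dispatched. Only three of the ten tools are shown, and the argument names (`selector`, `text`, `url`) and handler bodies are assumptions, not ChromePilot's documented signatures:

```javascript
// Illustrative dispatcher: maps a tool name from the executor's output to a
// handler function. Handlers here just return strings; the real tools act on
// the page via content scripts.
const tools = {
  click: ({ selector }) => `clicked ${selector}`,
  type: ({ selector, text }) => `typed "${text}" into ${selector}`,
  navigate: ({ url }) => `navigated to ${url}`,
};

function dispatch(call) {
  const handler = tools[call.tool];
  if (!handler) throw new Error(`Unknown tool: ${call.tool}`);
  return handler(call.args);
}

console.log(dispatch({ tool: "click", args: { selector: "#search-btn" } }));
```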
## Planned for v3 (Iterative Agent)
- 🔨 Dynamic re-planning based on execution results
- 🔨 Step-by-step evaluation and strategy adjustment
- 🔨 Conversational clarification requests to user
- 🔨 Error recovery with intelligent retry logic
- 🔨 Handling unexpected page states and navigation changes
## Prerequisites

1. **Ollama**: Install Ollama from https://ollama.ai

2. **Orchestrator Model**: Create the qwen3-vl-32k model with extended context.

   First, pull the base model:

   ```
   ollama pull qwen3-vl:8b
   ```

   Create a file named `Modelfile1` with this content:

   ```
   FROM qwen3-vl:8b
   PARAMETER num_ctx 32768
   ```

   Create the extended context model and verify it was created:

   ```
   ollama create qwen3-vl-32k -f Modelfile1
   ollama list
   ```

3. **Executor Model**: Create the llama3.1-8b-32k model with extended context.

   First, pull the base model:

   ```
   ollama pull llama3.1:8b
   ```

   Create a file named `Modelfile2` with this content:

   ```
   FROM llama3.1:8b
   PARAMETER num_ctx 32768
   ```

   Create the extended context model and verify it was created:

   ```
   ollama create llama3.1-8b-32k -f Modelfile2
   ollama list
   ```

4. **Enable CORS**: Ollama must be started with CORS enabled for Chrome extensions.

   Windows:

   ```
   set OLLAMA_ORIGINS=chrome-extension://*
   ollama serve
   ```

   Or simply run the provided batch file: `start-ollama-with-cors.bat`

   macOS/Linux:

   ```
   OLLAMA_ORIGINS=chrome-extension://* ollama serve
   ```
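You can also confirm both models exist programmatically. The `/api/tags` endpoint and its `{ models: [{ name, ... }] }` response shape are Ollama's standard model-listing API; the helper name below is ours:

```javascript
// Extract model names from Ollama's /api/tags response body.
function modelNames(tagsResponse) {
  return tagsResponse.models.map((m) => m.name);
}

// Usage with Ollama running (e.g. from a browser console):
// const body = await (await fetch("http://localhost:11434/api/tags")).json();
// modelNames(body);  // should include both models you created above
```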
## Installation

1. Clone or download this repository
2. Open Chrome and navigate to `chrome://extensions/`
3. Enable "Developer mode" in the top right
4. Click "Load unpacked" and select the ChromePilot folder
5. The ChromePilot icon should appear in your extensions toolbar
## Usage

1. Start Ollama with CORS enabled (see Prerequisites)
2. Click the ChromePilot icon in your Chrome toolbar to open the sidebar
3. The extension will automatically:
   - Capture a screenshot of the current tab
   - Extract the complete HTML structure (not just the visible area)
   - Send both to the AI model
4. Ask questions about the page:
   - "What is this page about?"
   - "Where can I find the filters?"
   - "What options are available on this form?"
   - "Explain what I'm looking at"
5. View reasoning: Click "View Reasoning" to see the AI's step-by-step thinking
6. Follow-up questions: Ask related questions; the AI remembers the last 2 exchanges
7. Toggle context: Use the switches to enable/disable screenshot or HTML context
8. Reset: Click the reset button to start a fresh conversation
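The follow-up memory can be pictured as a simple sliding window over the message list. This is an assumption about the implementation, shown only to clarify the "last 2 exchanges" behavior:

```javascript
// Keep only the most recent messages so follow-ups fit the context window.
// 4 messages = 2 user/assistant exchanges.
function trimHistory(messages, maxMessages = 4) {
  return messages.slice(-maxMessages);
}

const history = ["u1", "a1", "u2", "a2", "u3", "a3"];
console.log(trimHistory(history)); // → ["u2", "a2", "u3", "a3"]
```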
## Technical Details

### Token Management
- Maximum input: 32K tokens (including image)
- Automatic token estimation prevents truncation
- HTML is simplified and truncated to reduce token usage
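The estimation could work roughly like this sketch. The ~4 characters/token heuristic and the fixed image-token allowance are assumptions; the extension's exact accounting may differ:

```javascript
// Rough token budgeting. Assumes ~4 characters per token and a fixed
// allowance for the screenshot; both numbers are illustrative.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

function fitsContextWindow(promptText, imageTokenAllowance = 1500, limit = 32768) {
  return estimateTokens(promptText) + imageTokenAllowance <= limit;
}
```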
### HTML Processing
The extension extracts all displayed elements from the page:
- Captures entire page HTML, not just viewport-visible elements
- Removes styling, scripts, SVGs, and non-interactive elements
- Preserves IDs, classes, semantic attributes, and ARIA labels
- Includes elements below the fold (scrolled out of view)
- Maximum 20K characters of HTML
- Skips CSS-hidden elements (display: none, visibility: hidden)
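A string-level approximation of the simplification is sketched below. The real extension walks the live DOM and can check computed styles (e.g. `display: none`), which plain string processing cannot, so treat this only as an illustration of the rules above:

```javascript
// Approximate HTML simplification: drop script/style/svg subtrees and inline
// style attributes, then cap the result at 20K characters. IDs, classes, and
// aria-* attributes are left untouched.
function simplifyHtml(html, maxChars = 20000) {
  const stripped = html
    .replace(/<(script|style|svg)\b[\s\S]*?<\/\1>/gi, "")
    .replace(/\sstyle="[^"]*"/gi, "");
  return stripped.slice(0, maxChars);
}
```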
### Permissions

The extension requests the following permissions:

- `activeTab`: Capture screenshots and inject scripts
- `tabs`: Access tab information
- `scripting`: Execute content scripts
- `sidePanel`: Display the chat interface
- `storage`: Save conversation history
- `debugger`: Future mouse/keyboard control
- `<all_urls>`: Work on any webpage
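These correspond to entries in the extension's `manifest.json`. A minimal sketch of the relevant fields, assuming the standard Manifest V3 layout (other required fields such as `name` and `version` are omitted):

```json
{
  "manifest_version": 3,
  "permissions": ["activeTab", "tabs", "scripting", "sidePanel", "storage", "debugger"],
  "host_permissions": ["<all_urls>"]
}
```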
## Future Enhancements (v3+)
Planned features for v3 (Iterative Agent):
- 🤖 Dynamic Re-planning: Adjust strategy based on execution outcomes
- 🔄 Iterative Evaluation: Re-evaluate after each step instead of one-shot planning
- 💬 Conversational Clarification: Ask user for input when encountering ambiguity
- 🛡️ Adaptive Error Handling: Recover from failures with alternative approaches
- 🎯 Context-Aware Adaptation: Handle unexpected page states intelligently
v2 provides a one-shot plan-and-execute workflow; v3 will introduce true agentic behavior with iteration and dynamic adaptation.
## Troubleshooting

### "Cannot connect to Ollama" or "Failed to fetch"

- Ensure Ollama is running with CORS enabled:
  - Windows: `set OLLAMA_ORIGINS=chrome-extension://* && ollama serve`
  - Or run the provided `start-ollama-with-cors.bat`
- Check that Ollama is accessible by opening `http://localhost:11434/api/tags` in your browser
- Restart Ollama if you forgot to set CORS initially
"Model not found"
- Make sure you created both models (see Prerequisites)
- Orchestrator:
ollama pull qwen3-vl:8bthenollama create qwen3-vl-32k -f Modelfile1 - Executor:
ollama pull llama3.1-8b-32k:latestthenollama create llama3.1-32k -f Modelfile2 - Verify with:
ollama list(should showqwen3-vl-32k:latestandllama3.1-32k:latest)
"Request too large"
- The page content exceeds 32K tokens
- Try asking a more specific question
- Navigate to a simpler page section
## License
MIT License - Feel free to modify and distribute
## Credits
- Built with Ollama for local AI processing
- Uses qwen3-vl-32k for vision and reasoning capabilities