MCP Webgate
Web search that doesn't wreck your AI's memory. MCP server with anti-context-flooding protections.
mcp-webgate is an MCP server that gives your AI clean, bounded web content — across all major AI clients:
- IDEs: Claude Desktop, Claude Code, Zed, Cursor, Windsurf, VSCode
- CLI Agents: Gemini CLI, Claude CLI, custom agents
🌱 A Gentle Introduction
What is mcp-webgate? When your AI uses a standard "fetch URL" tool, it gets the raw HTML of the page — ads, menus, scripts, cookie banners and all. A single news article can dump 200,000 tokens of garbage into the AI's memory, wiping out your entire conversation.
mcp-webgate is a protective filter that sits between your AI and the web:
- Strips the junk — menus, scripts, ads, footers are removed with surgical HTML parsing; only readable text passes through
- Hard-caps every response — no page can ever blow up your context window, no matter how big the original was
- Optionally summarizes — route results through a secondary local LLM that produces a compact Markdown report with citations; your primary AI gets a polished briefing instead of a wall of text
The result: clean, bounded, useful web content — always.
🔬 Real example: what happens under the hood
Searching for "mcp model context protocol" with LLM features on:
Query → LLM expands to 5 search variants → 20 pages found, 13 fetched in parallel
Raw HTML downloaded: 5.16 MB (~1,290,000 tokens)
After cleaning: 52.1 KB (~13,000 tokens) — 99% noise stripped
After LLM summary: 5.8 KB (~1,450 tokens) — structured report with citations
13 sources distilled into ~1,450 tokens. A single naive fetch of just one of those pages (e.g. a security blog at 563 KB) would dump ~140,000 tokens of raw HTML into your AI's context. webgate processes all 13 and delivers a clean briefing that fits in a footnote.
This is an intensive case (5 queries × 5 results). A typical search with 3–5 results still saves 95%+ of context compared to raw fetching — and your AI gets structured, ranked content instead of a wall of HTML soup.
🚀 Quick Start
1. Make sure you have uvx
pip install uv
Installing uv gives you uvx, which runs Python tools without installing them permanently. You only need to do this once.
2. Set up a search backend
The easiest option is SearXNG — free, no account, runs locally:
docker run -d -p 8080:8080 --name searxng searxng/searxng
No Docker? Use a cloud backend instead (Brave, Tavily, Exa, SerpAPI) — see Backends.
3. Add webgate to your AI client
See the Integrations table for your specific client. As a quick example, for Claude Desktop:
Open the config file:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
Add this:
{
"mcpServers": {
"webgate": {
"command": "uvx",
"args": ["mcp-webgate"],
"env": {
"WEBGATE_DEFAULT_BACKEND": "searxng",
"WEBGATE_SEARXNG_URL": "http://localhost:8080"
}
}
}
}
Restart the client after editing.
4. Ask your AI to search!
Search the web for: latest news on AI regulation
The AI will use webgate_query automatically. You're done.
🔍 How it works
Your question
↓
Search backend (SearXNG / Brave / Tavily / Exa / SerpAPI)
↓ [deduplicate URLs, block binary files, filter domains]
Fetch pages in parallel (streaming — hard size cap per page)
↓ [optional: retry failed pages from reserve pool]
Strip HTML junk (menus, ads, scripts, footers — lxml)
↓
Clean up text (invisible chars, unicode junk, BiDi tricks)
↓
BM25 reranking (best-matching results first — always active)
↓ [optional: LLM reranking]
Cap total output to budget
↓ [optional: LLM summarization → compact Markdown report]
Clean result lands in your AI's context
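The always-active BM25 reranking step can be sketched with a minimal, self-contained scorer (classic Okapi BM25 over whitespace tokens — webgate's actual tokenizer and parameters may differ):

```python
import math
from collections import Counter

def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return document indices, best BM25 match for the query first."""
    tokenized = [d.lower().split() for d in docs]
    q_terms = query.lower().split()
    n = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n
    # how many documents contain each query term
    df = {t: sum(1 for doc in tokenized if t in doc) for t in q_terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * num / den
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = [
    "cookie banner privacy settings",
    "model context protocol servers explained",
    "mcp model context protocol specification",
]
print(bm25_rank("mcp model context protocol", docs))  # → [2, 1, 0]
```

BM25 is purely lexical, so this ranking pass costs nothing beyond a little CPU — which is why it can always be on, with LLM reranking as an optional refinement.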
🛠️ Tools
webgate gives your AI three tools:
webgate_fetch — read a single page
Use this when you already know the URL you want. The AI passes the URL and gets back the cleaned text — up to max_query_budget characters (default 32,000).
Input:
{ "url": "https://example.com/article", "max_chars": 32000 }
Output:
{
"url": "https://example.com/article",
"title": "Article Title",
"text": "cleaned text...",
"truncated": true,
"char_count": 12450
}
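The budgeting behavior can be illustrated with a small sketch (cap_fetch_result is a hypothetical helper, not webgate's actual code): the cleaned text is cut at the effective limit, and the truncated flag records whether anything was dropped.

```python
def cap_fetch_result(cleaned_text: str, max_chars: int,
                     max_query_budget: int = 32_000) -> dict:
    """Cap cleaned page text at min(max_chars, max_query_budget)."""
    # max_chars can request less than the ceiling, never more
    limit = min(max_chars, max_query_budget)
    return {
        "text": cleaned_text[:limit],
        "truncated": len(cleaned_text) > limit,
        "char_count": min(len(cleaned_text), limit),
    }

page = "x" * 50_000  # pretend this is 50,000 chars of cleaned article text
result = cap_fetch_result(page, max_chars=32_000)
print(result["truncated"], result["char_count"])  # → True 32000
```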
webgate_query — search + fetch + clean
Runs a full search cycle. Pass one query (or several) and get back cleaned, ranked results.
{ "queries": "how to set up a VPN on Linux", "num_results_per_query": 5 }
Multiple queries run in parallel and are merged:
{
"queries": ["VPN Linux setup", "best VPN Linux 2024"],
"num_results_per_query": 5
}
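The merge-and-deduplicate step can be sketched like this (merge_results is a hypothetical helper; webgate's real merge logic may differ): results from all queries are pooled, and duplicate URLs collapse to a single entry.

```python
def merge_results(result_lists: list[list[dict]]) -> list[dict]:
    """Merge per-query result lists, dropping duplicate URLs (first hit wins)."""
    seen, merged = set(), []
    for results in result_lists:  # one list per query, already ranked
        for r in results:
            if r["url"] not in seen:
                seen.add(r["url"])
                merged.append(r)
    return merged

a = [{"url": "https://a.example", "title": "A"},
     {"url": "https://b.example", "title": "B"}]
b = [{"url": "https://b.example", "title": "B"},
     {"url": "https://c.example", "title": "C"}]
print([r["url"] for r in merge_results([a, b])])
# → ['https://a.example', 'https://b.example', 'https://c.example']
```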
Output without LLM — returns cleaned page content for each result:
{
"sources": [
{ "id": 1, "title": "...", "url": "...", "content": "cleaned text...", "truncated": false }
],
"snippet_pool": [ { "id": 6, "title": "...", "url": "...", "snippet": "..." } ],
"stats": { "fetched": 5, "total_chars": 18200, "per_page_limit": 6400 }
}
Output with LLM summarization — returns a compact Markdown report:
{
"summary": "## How to set up a VPN on Linux\n\nTo install...[1][2]",
"citations": [{ "id": 1, "title": "...", "url": "..." }],
"stats": { "fetched": 5, "total_chars": 58000 }
}
Output when LLM fails — error reason shown, full sources returned as fallback:
{
"llm_summary_error": "ReadTimeout: LLM did not respond in time",
"sources": [ "..." ],
"stats": { "..." : "..." }
}
snippet_pool contains extra results from the search that were not fetched (search-engine snippet only). The AI can use these to decide if more fetches are worthwhile.
webgate_onboarding — how-to guide
Returns a JSON guide explaining how to use webgate effectively. The AI should call this once at the start of a session if in doubt about which tool to use.
🔧 Using webgate with local or smaller models
Most frontier models follow MCP tool instructions automatically. Smaller or local models sometimes ignore the server-provided guidance and fall back to a built-in fetch tool instead — returning raw HTML that floods the context with noise.
If you notice this happening, add an explicit instruction block to your system prompt:
You have access to webgate tools for web search and page retrieval.
Follow these rules in every session:
- To search the web: use webgate_query — never use a built-in fetch, browser, or HTTP tool
- To retrieve a URL: use webgate_fetch — never fetch URLs directly
- Built-in fetch tools return raw HTML that floods your context; webgate returns clean, bounded text
At the start of each session, call webgate_onboarding to read the full operational guide.
This works because user system prompt instructions take precedence over MCP server-level guidance, making the constraint explicit at the highest-priority layer the model sees.
Tip: if your client supports named system prompts or prompt templates, save the block above as a reusable preset so you don't have to paste it every time.
🎛️ Tuning
This section explains what the key parameters do and when to change them. The defaults work well for most cases — only tweak if you have a specific reason.
What is a "character budget"?
webgate measures text in characters (not tokens). A rough conversion for English text:
4 characters ≈ 1 token
| Characters | Approximate tokens |
|------------|--------------------|
| 8,000 | ~2,000 |
| 32,000 | ~8,000 |
| 96,000 | ~24,000 |
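The rule of thumb in code form (illustrative only — real token counts depend on the tokenizer and the text):

```python
def approx_tokens(chars: int) -> int:
    """Rough English-text conversion: 4 characters ≈ 1 token."""
    return chars // 4

for budget in (8_000, 32_000, 96_000):
    print(budget, "chars ≈", approx_tokens(budget), "tokens")
```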
webgate_fetch budget
When you fetch a single URL, the ceiling is max_query_budget (default 32,000 chars). The tool parameter max_chars can request less, but never more than this ceiling.
Why max_query_budget and not max_result_length? Because you're fetching one page — the "total output" IS that one page, so the right limit is the overall context budget, not the per-page cap designed for multi-source queries.
webgate_query budget — without LLM
With no LLM, the cleaned sources go directly to your AI's context. webgate distributes max_query_budget across all fetched pages so the total never exceeds the budget:
Per-page limit = max_query_budget ÷ number of results (capped at max_result_length)

| Results fetched | Per-page limit | Total output |
|-----------------|----------------|--------------|
| 1 | 8,000 (cap) | ≤ 8,000 |
| 5 | 6,400 | ≤ 32,000 |
| 10 | 3,200 | ≤ 32,000 |
| 20 | 1,600 | ≤ 32,000 |
The total output is always at most max_query_budget, regardless of how many results you request — the per-page share automatically shrinks to compensate.
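The split can be written as a one-liner (a sketch; the parameter names follow this README, and the 8,000-char max_result_length default is inferred from the table above):

```python
def per_page_limit(num_results: int,
                   max_query_budget: int = 32_000,
                   max_result_length: int = 8_000) -> int:
    """Each fetched page gets an equal share of the budget, capped per page."""
    return min(max_result_length, max_query_budget // num_results)

for n in (1, 5, 10, 20):
    print(n, "results →", per_page_limit(n), "chars per page")
```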
webgate_query budget — with LLM summarization
When a secondary LLM is summarizing, it compresses the content before passing the result to your primary AI. This means it's safe — and beneficial — to give it more raw material to work from.
webgate scales up the input using input_budget_factor (default 3):
LLM input budget = max_query_budget × input_budget_factor
Default: 32,000 × 3 = 96,000 chars

| Results fetched | LLM input / page | Total LLM input | Output to your AI |
|-----------------|------------------|-----------------|-------------------|
| 1 | 96,000 | 96,000 | compact report |
| 5 | 19,200 | 96,000 | compact report |
| 10 | 9,600 | 96,000 | compact report |
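The scaled budget can be sketched the same way (helper names are hypothetical; the factor of 3 is the documented default):

```python
def llm_input_budget(max_query_budget: int = 32_000,
                     input_budget_factor: int = 3) -> int:
    """Raw material handed to the secondary LLM before it compresses the summary."""
    return max_query_budget * input_budget_factor

def llm_input_per_page(num_results: int, **kwargs) -> int:
    """Equal share of the LLM input budget per fetched page."""
    return llm_input_budget(**kwargs) // num_results

print(llm_input_budget())      # → 96000
print(llm_input_per_page(5))   # → 19200
```

Because the secondary LLM compresses everything down to a compact report, only that report reaches your primary AI's context, no matter how large the raw input was.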
