OpenDocuments
Open source RAG tool for AI document search - connect GitHub, Notion, Google Drive and ask questions with cited answers. Self-hosted with Ollama/OpenAI/Claude.
The Problem: Scattered Knowledge, No AI Search
Your team's knowledge is trapped in silos:
- Engineering docs live in GitHub READMEs and Wiki pages
- Product specs are scattered across Notion databases
- Budget reports sit in Excel files on Google Drive
- API docs are auto-generated Swagger specs nobody reads
- Meeting notes rot in Confluence spaces
- Onboarding guides are buried in .docx files on S3
When someone asks "How does our auth system work?" or "What was the Q3 budget for the AI team?", they spend 15 minutes hunting through 5 different tools. And they still might not find the answer.
The Solution: Self-Hosted AI Document Search
OpenDocuments connects to all your document sources, indexes everything into a unified search engine, and answers questions in natural language -- with source citations so you know exactly where the answer came from.
npm install -g opendocuments
opendocuments init
opendocuments start
Open http://localhost:3000, and ask away.
OpenDocuments is a free, open source alternative to proprietary enterprise AI search tools. It's a self-hosted RAG (Retrieval-Augmented Generation) platform that runs on your own infrastructure.
Recent Improvements
- One-touch Ollama setup: init auto-detects Ollama and offers to pull missing models
- .env auto-loading: API keys in .env are loaded automatically (no manual export needed)
- Multi-turn conversations: Chat remembers previous context for follow-up questions
- Degraded mode warnings: Clear banners when models aren't configured, with fix instructions
- Enhanced diagnostics: opendocuments doctor checks Ollama connectivity, model availability, and config validity
- Security hardening: FTS5 injection prevention, file upload sanitization, OAuth state limits, workspace isolation
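To make the "FTS5 injection prevention" item concrete, here is a minimal sketch of one common approach: quote every user token so SQLite FTS5 treats it as a string literal rather than query syntax. The helper name `toFts5Query` is hypothetical and this is not taken from the OpenDocuments codebase; the project may sanitize input differently.

```typescript
// Hypothetical helper (an assumption, not OpenDocuments' actual code).
// Turns raw user input into an FTS5 MATCH expression made only of quoted
// string literals, so operators such as OR, NOT, NEAR, *, ^ and ":" are
// treated as plain text instead of query syntax.
function toFts5Query(rawInput: string): string {
  const tokens = rawInput.split(/\s+/).filter((tok) => tok.length > 0);

  // Inside an FTS5 string literal, a double quote is escaped by doubling it.
  const quoted = tokens.map((tok) => '"' + tok.replace(/"/g, '""') + '"');

  // Space-separated phrases are implicitly ANDed by FTS5.
  return quoted.join(" ");
}

console.log(toFts5Query('auth OR "admin" NOT token*'));
// "auth" "OR" """admin""" "NOT" "token*"
```

The result should still be passed to SQLite as a bound parameter, not concatenated into the SQL text. Blank input yields an empty string, so a caller would likely skip the search in that case.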
Real-World Use Cases
For Engineering Teams
"How do I authenticate against our internal API?"
OpenDocuments pulls the answer from your GitHub repo's docs/auth.md, links to the relevant Swagger endpoint, and includes a code example from the codebase -- all in one response.
# Index your repo and API docs
opendocuments index ./docs
opendocuments connector sync github
opendocuments ask "How does JWT token refresh work in our API?"
For Operations & HR Teams
"What's the remote work policy for the Tokyo office?"
OpenDocuments searches across your Confluence HR space, the employee handbook on Google Drive, and the latest policy update email -- even if some documents are in Korean and others in English.
# The query below is Korean for "What is the remote work policy for the Tokyo office?"
opendocuments ask "도쿄 오피스 원격 근무 정책이 뭐야?" --profile precise
# Cross-lingual search finds both Korean and English documents
For Product Managers
"Compare the feature specs of v2.0 vs v3.0"
OpenDocuments decomposes the question, searches both versions' specs, and presents a structured comparison table -- citing each source document.
For AI-Assisted Development (MCP)
Use OpenDocuments as a knowledge base for Claude Code, Cursor, or any MCP-compatible AI tool:
{
  "mcpServers": {
    "opendocuments": {
      "command": "opendocuments",
      "args": ["start", "--mcp-only"]
    }
  }
}
Now your AI coding assistant can search your organization's entire document corpus while writing code.
For Self-Hosted Knowledge Bases
Deploy on your own infrastructure. Your data never leaves your network when using a local LLM via Ollama. No cloud dependency, no vendor lock-in, no subscription fees.
docker compose --profile with-ollama up -d
# Everything runs locally: LLM, embeddings, vector search, web UI
Quick Start
1. Install
npm install -g opendocuments
2. Initialize
opendocuments init
The interactive wizard will:
- Detect your hardware (CPU, RAM) and recommend the optimal LLM
- Let you choose between local (Ollama) or cloud (OpenAI, Claude, Gemini, Grok) models
- Auto-detect Ollama and offer to pull missing models automatically
- Validate cloud API keys before saving
- Select a plugin preset: Developer, Enterprise, All, or Custom
- Generate opendocuments.config.ts and .env (API keys loaded automatically)
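The hardware-detection step in the wizard can be pictured as a simple mapping from machine specs to a model tier. The sketch below is illustrative only: the thresholds, the tier names, and the `recommendModelTier` function are assumptions, not the logic OpenDocuments actually ships.

```typescript
// Illustrative sketch only: the thresholds and tier names below are
// assumptions, not the values used by the OpenDocuments wizard.
interface HardwareInfo {
  totalRamGb: number; // in Node.js: os.totalmem() / 1024 ** 3
  cpuCores: number; // in Node.js: os.cpus().length
}

type ModelTier = "cloud" | "local-small" | "local-medium" | "local-large";

function recommendModelTier(hw: HardwareInfo): ModelTier {
  // Too little memory or too few cores for a responsive local LLM.
  if (hw.totalRamGb < 8 || hw.cpuCores < 4) {
    return "cloud";
  }
  if (hw.totalRamGb < 16) {
    return "local-small";
  }
  if (hw.totalRamGb < 32) {
    return "local-medium";
  }
  return "local-large";
}

const laptopTier = recommendModelTier({ totalRamGb: 16, cpuCores: 8 });
console.log(laptopTier); // "local-medium"
```

Keeping the decision in a pure function like this makes the recommendation easy to test without touching the real machine.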
3. Start
opendocuments start
Open http://localhost:3000 -- you'll see a chat UI, document manager, and admin dashboard.
First time? If Ollama isn't running, you'll see a clear DEGRADED MODE banner with step-by-step fix instructions. Run opendocuments doctor for full diagnostics.
4. Index Your Documents
# Index a local directory (recursively finds all supported files)
opendocuments index ./docs
# Watch mode: auto-reindex when files change
opendocuments index ./docs --watch
# Or drag-and-drop files in the Web UI
5. Ask Questions
opendocuments ask "What's our deployment process?"
How It Works
Your Documents                   OpenDocuments           You
──────────────                   ─────────────           ───
GitHub repos  ──┐
Notion pages  ──┤               ┌──────────────┐
Google Drive  ──┤ ── Ingest ──► │    Parse     │
Confluence    ──┤               │    Chunk     │     "How does
S3 buckets    ──┤               │    Embed     │      auth work?"
Swagger specs ──┤               │    Store     │          │
Local files   ──┤               └──────┬───────┘          │
Web pages     ──┘                      │                  ▼
                                ┌──────┴───────┐   ┌──────────────┐
                                │    SQLite    │   │  RAG Engine  │
                                │  (metadata)  │◄──┤  Search      │
                                │              │   │  Rerank      │
                                │   LanceDB    │   │  Generate    │
                                │  (vectors)   │   │ Cite sources │
                                └──────────────┘   └──────┬───────┘
                                                          │
                                                          ▼
                                                   "Auth uses JWT
                                                    tokens with
                                                    refresh flow.
                                                    [Source: auth.md]"
The RAG Pipeline
- Intent Classification -- Understands whether you're asking about code, concepts, data, or want a comparison
- Query Decomposition -- Breaks complex questions into sub-queries for better retrieval
- Cross-Lingual Search -- Finds documents in both Korean and English regardless of query language
- Hybrid Search -- Combines dense vector search (semantic) with FTS5 sparse search (keyword) via Reciprocal Rank Fusion
- Reranking -- Scores results by keyword overlap and model-based relevance
- Confidence Scoring -- Tells you honestly when it's not sure about an answer
- Hallucination Guard -- Verifies each sentence is grounded in the retrieved sources
- 3-Tier Caching -- L1 query cache (5min), L2 embedding cache (24h), L3 web search cache (1h)
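The Hybrid Search step above merges two ranked lists with Reciprocal Rank Fusion. As a rough sketch of that technique: each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the value commonly used with RRF; whether OpenDocuments uses it is an assumption, as are the function name and the sample file names.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)),
// with ranks starting at 1. k = 60 is a commonly used default; the value
// OpenDocuments uses is not stated in this README (assumption).
// Each input ranking is expected to contain unique document ids.
function reciprocalRankFusion(
  rankings: string[][],
  k: number = 60
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  rankings.forEach((ranking) => {
    ranking.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  });

  const fused: Array<{ id: string; score: number }> = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  fused.sort((a, b) => b.score - a.score);
  return fused;
}

// Hypothetical results from the dense (vector) and sparse (FTS5) searches.
const denseHits = ["auth.md", "jwt.md", "setup.md"];
const keywordHits = ["jwt.md", "faq.md", "auth.md"];
const fusedHits = reciprocalRankFusion([denseHits, keywordHits]);
console.log(fusedHits.map((hit) => hit.id));
// [ 'jwt.md', 'auth.md', 'faq.md', 'setup.md' ]
```

Because RRF only looks at rank positions, it needs no score normalization between the vector and keyword searches, which is why it is a popular choice for hybrid retrieval.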
Supported File Formats
| Format | Extensions | How It's Parsed |
|--------|-----------|-----------------|
| Markdown | .md, .mdx | Heading hierarchy, code block separation |
| Plain Text | .txt | Direct text indexing |
| PDF | .pdf | Page-level extraction, OCR fallback for scanned docs |
| Word | .docx | HTML conversion with heading detection |
| Excel / CSV | .xlsx, .xls, .csv | Sheet-aware table chunking (header + rows) |
| HTML | .html, .htm | Structure-preserving extraction, script/nav stripping |
| Jupyter Notebook | .ipynb | Markdown cells + code cells with language detection |
| Email | .eml | Header parsing (from/to/subject/date) + body extraction |
| Source Code | .js, .ts, .py, .java, .go, .rs, .rb, .php, .swift, .kt + more | Function/class-level chunking with import extraction |
| PowerPoint | .pptx | Slide-level text extraction |
| Structured Data | .json, .yaml, .yml, .toml | Config and schema indexing |
| Archive | .zip | Placeholder (full extraction planned) |
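To illustrate what "Heading hierarchy, code block separation" could mean for Markdown, here is a minimal sketch that splits a document at ATX headings, records each chunk's heading path, and ignores `#` lines inside fenced code. The types and the `chunkMarkdownByHeading` function are hypothetical; the real parser's chunk shape is not shown in this README, and setext headings and size limits are left out.

```typescript
// Hypothetical sketch of heading-aware Markdown chunking; names and chunk
// shape are assumptions, not OpenDocuments' internal types.
const FENCE = "`".repeat(3); // the Markdown code-fence marker

interface HeadingEntry {
  level: number;
  title: string;
}

interface MarkdownChunk {
  headingPath: string[]; // e.g. ["Auth", "Token refresh"]
  text: string;
}

function chunkMarkdownByHeading(markdown: string): MarkdownChunk[] {
  const chunks: MarkdownChunk[] = [];
  const stack: HeadingEntry[] = [];
  let buffer: string[] = [];
  let insideFence = false;

  // Emit the buffered body text under the current heading path.
  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text.length > 0) {
      chunks.push({ headingPath: stack.map((entry) => entry.title), text });
    }
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    // Track fenced code so "# comment" lines are not mistaken for headings.
    if (line.trim().startsWith(FENCE)) {
      insideFence = !insideFence;
    }
    const heading = insideFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading === null) {
      buffer.push(line);
      continue;
    }
    flush();
    const level = (heading[1] ?? "").length;
    const title = (heading[2] ?? "").trim();
    // Pop headings at the same or deeper level before pushing the new one.
    let top = stack[stack.length - 1];
    while (top !== undefined && top.level >= level) {
      stack.pop();
      top = stack[stack.length - 1];
    }
    stack.push({ level, title });
  }
  flush();
  return chunks;
}

const sampleDoc = [
  "# Auth",
  "Overview text.",
  "## Token refresh",
  FENCE + "bash",
  "# not a heading",
  FENCE,
  "## Logout",
  "Clear the session.",
].join("\n");

const sampleChunks = chunkMarkdownByHeading(sampleDoc);
console.log(sampleChunks.map((chunk) => chunk.headingPath.join(" > ")));
// [ 'Auth', 'Auth > Token refresh', 'Auth > Logout' ]
```

Storing the heading path with each chunk gives the retriever context (which section the text came from) and gives the answer generator something precise to cite.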
Fallback Chains: If a parser fails, the next one tries automatically:
parserFallbacks: {
  '.pdf': ['@opendocuments/parser-pdf', '@opendocuments/parser-ocr'],
}
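A fallback chain like the one configured above amounts to "try each parser in order and keep the first success". The sketch below shows that control flow under simplifying assumptions: parsers are synchronous functions from string to string, and `ParserFn`, `parseWithFallbacks`, and the stand-in parser bodies are all hypothetical. Only the two package names come from the config snippet.

```typescript
// Hypothetical types: the real OpenDocuments plugin interface may differ
// (for example, real parsers are likely asynchronous).
type ParserFn = (content: string) => string;

interface FallbackResult {
  parserName: string;
  text: string;
}

function parseWithFallbacks(
  content: string,
  chain: string[],
  registry: Record<string, ParserFn>
): FallbackResult {
  const errors: string[] = [];
  for (const parserName of chain) {
    const parser = registry[parserName];
    if (!parser) {
      errors.push(parserName + ": not registered");
      continue;
    }
    try {
      // The first parser that returns without throwing wins.
      return { parserName, text: parser(content) };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(parserName + ": " + reason);
    }
  }
  throw new Error("All parsers failed -> " + errors.join("; "));
}

// Stand-in implementations for demonstration only.
const demoRegistry: Record<string, ParserFn> = {
  "@opendocuments/parser-pdf": () => {
    throw new Error("no text layer");
  },
  "@opendocuments/parser-ocr": (content) =>
    "OCR text from " + content.length + " bytes",
};

const demoResult = parseWithFallbacks(
  "%PDF-1.7 ...",
  ["@opendocuments/parser-pdf", "@opendocuments/parser-ocr"],
  demoRegistry
);
console.log(demoResult.parserName); // "@opendocuments/parser-ocr"
```

Collecting each parser's error message before giving up makes the final failure easier to diagnose than reporting only the last attempt.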
Data Sources
| Source | What It Indexes | Auth | How It Syncs |
|--------|----------------|------|-------------|
| Local Files | Any suppor
