# wdoc
Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype, advanced RAG, advanced summaries, scriptable, etc
<p align="center"><img src="https://github.com/thiswillbeyourgithub/wdoc/blob/main/images/icon.png?raw=true" width="512" style="background-color: transparent !important"></p>

> I'm wdoc. I solve RAG problems.

*- wdoc, imitating Winston "The Wolf" Wolf*
wdoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summarize, search, and query documents across various file types. It's particularly useful for handling large volumes of diverse document types, making it ideal for researchers, students, and professionals dealing with extensive information sources.
wdoc was created by a psychiatry resident who needed a way to get a definitive answer from multiple sources at the same time (audio recordings, video lectures, Anki flashcards, PDFs, EPUBs, etc.), and was born from frustration with existing RAG solutions for querying and summarizing.
(The online documentation can be found here)
## Goal and project specifications
wdoc's goal is to create perfectly useful summaries and perfectly useful sourced answers to questions on a heterogeneous corpus. It is capable of querying tens of thousands of documents across various file types at the same time. The project also includes an opinionated summary feature to help users efficiently keep up with large amounts of information. It mostly uses LangChain and LiteLLM as backends.
Current status: usable, tested, still under active development, tens of planned features
- I don't plan on stopping to read anytime soon, so if you find it promising, stick around, as I have many improvements planned (see the roadmap section).
- I would greatly benefit from testing by users, as it's the quickest way for me to find the many minor, quick-to-fix bugs.
- The main branch is more stable than the dev branch, which in turn offers more features.
- Open to feature requests and pull requests. All feedback, including reports of typos, is highly appreciated.
- Please open an issue before making a PR, as there may be ongoing improvements in the pipeline.
## Key Features
- Docker Web UI: Easy deployment with a Gradio-based web interface for simplified document processing without CLI interaction.
- High recall and specificity: wdoc was made to find a LOT of documents using a carefully designed embedding search, then to gradually aggregate each intermediate answer via semantic batching into a single answer that cites its sources and points to the exact portion of the source document.
- Uses both an expensive and a cheap LLM to push recall as high as possible, because fetching a lot of documents per query (via embeddings) is affordable.
- Supports virtually any LLM provider, including local ones, even with extra layers of security for super secret stuff.
- Aims to support any filetype and to query all of them at the same time (15+ are already implemented!)
- Actually useful AI powered summary: get the thought process of the author instead of nebulous takeaways.
- Actually useful AI powered queries: get the sourced, indented markdown answer to your questions instead of hallucinated nonsense.
- Extensible: this is both a tool and a library. It was even turned into an Open-WebUI Tool. Also available as a Docker web UI for easy deployment.
- Web Search: Preliminary web search support using DuckDuckGo (via the ddgs library)
## Table of contents
- Comprehensive reference (SKILL.md)
- Explanatory diagrams
- Ultra short guide for people in a hurry
- Features
- Getting started
- Scripts made with wdoc
- FAQ
- Roadmap
## Comprehensive reference
A single-page comprehensive reference covering every CLI argument, environment variable, filetype, and the full Python API can be found in SKILL.md.
## Explanatory diagrams
<p float="left" align="middle"> <img src="https://github.com/thiswillbeyourgithub/wdoc/blob/main/images/diagram_query.png?raw=true" alt="Query task workflow diagram showing the flow from user inputs through Raphael the Rephraser, VectorStore, Eve the Evaluator, Anna the Answerer, and recursive combining to final output" height="400"> <img src="https://github.com/thiswillbeyourgithub/wdoc/blob/main/images/diagram_summary.png?raw=true" alt="Summary task workflow diagram showing the flow from user inputs through loading & chunking, Sam the Summarizer, concatenation to wdocSummary output" height="400"> <img src="https://github.com/thiswillbeyourgithub/wdoc/blob/main/images/diagram_search.png?raw=true" alt="Search task workflow diagram showing the flow from user inputs through Raphael the Rephraser, VectorStore, Eve the Evaluator to search output" height="400"> </p>

## Ultra short guide for people in a hurry
<details> <summary> Give it to me I am in a hurry! </summary>

Note: a list of examples can be found in examples.md
Quick Start with Docker: If you want an experimental web UI, check out the Docker deployment guide.
First, let's see how to query a pdf.
```shell
link="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf"
wdoc --path=$link --task=query --filetype="online_pdf" --query="What does it say about alphago?" --query_retrievers='basic_multiquery' --top_k=auto_200_500
```
This will:
- parse what's in --path as a link to a pdf to download (otherwise the url could simply be a webpage; in most cases you can leave --filetype at the default 'auto', as heuristics are in place to detect the most appropriate parser)
- cut the text into chunks and create embeddings for each
- Take the user query, create embeddings for it ('basic') AND ask the default LLM to generate alternative queries and embed those
- Use those embeddings to search through all chunks of the text and get the 200 most appropriate documents
- Pass each of those documents to the smaller LLM (default: openrouter/google/gemini-2.5-flash) to tell us if the document seems appropriate given the user query
- If more than 90% of the 200 documents are appropriate, do another search with a higher top_k and repeat until documents start to be irrelevant OR we hit 500 documents.
- Then each relevant doc is sent to the strong LLM (by default, openrouter/google/gemini-3.1-pro-preview) to extract relevant info and give one answer per relevant document.
- Then all those "intermediate" answers are 'semantic batched': we create embeddings for them, run hierarchical clustering, build small batches containing several intermediate answers with similar semantics (each batch is also sorted in semantic order), and combine each batch into a single answer (per batch of relevant docs, or later per batch of batches).
- Rinse and repeat steps 7+8 (i.e. gradually aggregate batches) until we have only one answer, which is returned to the user.
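The "semantic batched" aggregation can be illustrated with a small sketch. This is not wdoc's actual implementation: a simple greedy nearest-neighbour grouping stands in for the hierarchical clustering wdoc uses, and the function name and default batch size are made up for the example.

```python
import numpy as np

def semantic_batches(embeddings: np.ndarray, batch_size: int = 5):
    """Greedily group row vectors into batches of mutually similar items.

    A stand-in for wdoc's embedding + hierarchical clustering step:
    each batch collects intermediate answers whose embeddings are close
    in cosine similarity, so they can be combined into one answer per batch.
    """
    # Normalize rows so the dot product equals cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    remaining = set(range(len(embeddings)))
    batches = []
    while remaining:
        seed = min(remaining)  # deterministic choice of batch seed
        remaining.discard(seed)
        batch = [seed]
        # Pull in the most similar remaining items until the batch is full
        while remaining and len(batch) < batch_size:
            best = max(remaining, key=lambda j: sims[seed, j])
            remaining.discard(best)
            batch.append(best)
        batches.append(batch)
    return batches
```

Each batch would then be handed to the LLM to be combined into a single answer, and the process repeated on the results until only one answer remains.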
Now, let's see how to summarize a pdf.
```shell
link="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf"
wdoc --path=$link --task=summarize --filetype="online_pdf"
```
This will:
- Split the text into chunks
- pass each chunk to the strong LLM (by default openrouter/google/gemini-3.1-pro-preview) for a very low level (= with all details) summary. The format is markdown bullet points, one per idea, with logical indentation.
- When summarizing each new chunk, the LLM has access to the previous chunk for context.
- All summaries are then concatenated and returned to the user
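That loop can be sketched as follows. `call_llm` is a hypothetical callable standing in for the strong-LLM call, and the prompt wording is illustrative, not wdoc's actual prompt.

```python
def summarize_chunks(chunks, call_llm):
    """Summarize chunks sequentially, giving each call the previous
    chunk as context, then concatenate the per-chunk summaries."""
    summaries = []
    previous = ""
    for chunk in chunks:
        prompt = (
            "Summarize the following text as markdown bullet points, "
            "one per idea, with logical indentation and all details kept.\n"
            f"Previous chunk (context only):\n{previous}\n"
            f"Text to summarize:\n{chunk}"
        )
        summaries.append(call_llm(prompt))
        previous = chunk  # the next call sees this chunk as context
    return "\n".join(summaries)
```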
For extra large documents, like books, this summary can be recursively fed back to wdoc using the argument --summary_n_recursion=2, for example.
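The recursion amounts to re-chunking and re-summarizing the summary itself. A minimal sketch, where `summarize` and `chunk` are hypothetical callables rather than wdoc's API:

```python
def recursive_summary(text, summarize, chunk, n_recursion=2):
    """Summarize `text`, then re-chunk and re-summarize the resulting
    summary `n_recursion` more times, shrinking it at each pass."""
    for _ in range(n_recursion + 1):  # one base pass + n recursive passes
        pieces = chunk(text)
        text = "\n".join(summarize(piece) for piece in pieces)
    return text
```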
Those two tasks, query and summary, can be combined with --task summarize_then_query, which will summarize the document then give you a prompt at the end to ask questions in case you want to clarify things.
For more, you can read examples.md.
Note that there is an official Open-WebUI Tool that is even simpler to use.
## Features
- 15+ filetypes: also supports combinations, to load recursively or to define complex heterogeneous corpora like a list of files, a list of links, using regex, youtube playlists, etc. See Filetypes and Recursive Filetypes. All filetypes can be seamlessly combined in the same index, meaning you can query your anki collection at the same time as your work PDFs. It supports removing silence from audio files and youtube videos too! There is even a ddg filetype to search the web using DuckDuckGo.
- 100+ LLMs and many embeddings: Supports any LLM by OpenAI, Mistral, Claude, Ollama, Openrouter, etc. thanks to litellm. The list of supported embeddings engines can be found here but includes at least Openai (or any openai-API-compatible models), Cohere, Azure, Bedr
