DocMason
DocMason is a repo-native agent app for analyst-grade answers over complex private files. The repo is the app. Codex is the runtime.
Install / Use
/learn @JetXu-LLM/DocMason
README
Most workspace AI tools flatten your complex office documents into a single, unstructured text blob. They might summarize a file or retrieve a stray quote, but once your research gets complex, the illusion breaks. You lose the tables, the slide layouts, the hidden notes—and it becomes impossible to verify where the AI's answer actually came from.
DocMason is built on a different thesis: answers must be strictly traceable. It compiles your private decks, spreadsheets, PDFs, and emails into a local, file-based knowledge base. Instead of chatting with anonymous text chunks, your AI agent reasons over structured, multimodal evidence bundles. It’s not a cloud service or a lightweight wrapper. It is a local repo running as a deep-research AI app on Codex. No hidden backends, no cloud ingestion. Just your files, and answers you can actually trust.
How It Works: A Production-Grade Runtime
DocMason is designed to enforce strict data contracts and provenance boundaries. The repo holds the truth; the agent does the reasoning.
Why This Exists
Most document AI tools map complex corporate files into flat, unreadable text strings. They strip out critical structural and formatting semantics:
- Slide Decks: Visual layout, presenter notes, and chart-text relationships are discarded.
- Spreadsheets: Multi-sheet references and nested tables break existing parsers.
- Format-as-Semantics: Critical signals (like red text for "Risk" or indentation for hierarchies) are erased.
- Cross-Document Reasoning: Multi-part proposals are disconnected, making global synthesis impossible.
DocMason addresses this by forcing AI to respect original document structure and visual semantics. It produces deterministic file-based evidence, runs strong offline retrieval and trace algorithms, and validates the resulting knowledge base through strict code rules — all locally, with nothing leaving your machine. The repo holds the truth. The agent does the reasoning.
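The "deterministic file-based evidence" described above can be pictured as small, self-describing records on disk. The following is a minimal sketch, not DocMason's actual schema; the field names (`source_file`, `locator`, `modality`) are hypothetical and chosen only to illustrate what a structured, multimodal evidence bundle might carry.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class EvidenceBundle:
    """One retrievable unit of evidence; all field names are illustrative."""
    source_file: str   # original document the evidence came from
    locator: str       # page, slide, or sheet reference inside it
    modality: str      # "text", "table", "slide", "chart", ...
    content: str       # the extracted evidence itself

bundle = EvidenceBundle(
    source_file="original_doc/q3_review.pptx",
    locator="slide 7 (presenter notes)",
    modality="slide",
    content="Rollout risk: vendor contract unsigned as of Sept.",
)

# File-based means each bundle can be persisted as plain JSON on disk,
# so provenance survives outside any single chat session.
record = json.dumps(asdict(bundle))
```

Because every bundle names its source and locator, an answer built from bundles is traceable back to a specific file and page by construction.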
Two Easy Ways to Start
Getting started requires zero developer experience. Just drop your files and let your AI agent handle the rest.
- Path A: Start Small. Drop a handful of work files (.pptx, .docx, .xlsx, PDFs) into the DocMason/original_doc/ folder. Open the DocMason folder in Codex and ask your question naturally. DocMason guides you through environment setup and quietly builds the knowledge base in the background; just approve when prompted. After that, you can keep adding or revising files inside original_doc/; on the native path, DocMason can quietly and incrementally sync the published knowledge base instead of forcing a full restart.
- Path B: Stage Entire Folders. Drop your massive, department-level folders into DocMason/original_doc/. Open the DocMason folder in Codex. Tell Codex: "Please prepare the DocMason environment." Then: "Please build the knowledge base." Once it's done, start asking complex research questions against the entire published corpus.
Inside a valid workspace, you do not need to memorize internal commands. Just speak naturally to your AI agent.
Getting Started on macOS
Five steps from download to your first traceable answer — no developer experience required.
1. Download, unzip, and drop in your files
Download DocMason, unzip it to any folder on your Mac, then drag your .pptx, .docx, .xlsx, .pdf, and other work files into DocMason/original_doc/.
2. Open the DocMason folder in Codex
Launch Codex for macOS (or Claude Code) and open the DocMason folder as your workspace. This is the operating model: the repo is your app, the agent is your runtime.
3. Ask your agent to prepare the environment
"Please prepare the DocMason environment."
DocMason will set up a managed local Python environment, install required dependencies, and guide you through LibreOffice installation if it's not already present. Just grant full access to Codex when prompted.
4. Build the knowledge base (for medium-to-large corpora)
"Please build the knowledge base."
DocMason stages, compiles, validates, and publishes your documents into a searchable evidence layer. For a small handful of files, DocMason may handle this step automatically during your first question.
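The stage, compile, validate, publish flow above can be sketched as a simple gated pipeline. This is an illustrative toy, not DocMason's real build code; the function names and the parse format are hypothetical. The point it demonstrates is the validation gate: a document that fails validation raises an error and never reaches the published corpus.

```python
# Hypothetical four-stage flow: stage -> compile -> validate -> publish.

def stage(paths):
    """Accept only file-like entries into the staging set."""
    return [p for p in paths if "." in p]

def compile_docs(staged):
    """Turn each staged file into a structured record (toy parse)."""
    return [{"source": p, "text": f"parsed:{p}"} for p in staged]

def validate(compiled):
    """The gate: bad data fails the build instead of degrading answers."""
    for doc in compiled:
        if not doc["text"]:
            raise ValueError(f"empty parse for {doc['source']}")
    return compiled

def publish(validated):
    """Only validated records become part of the searchable corpus."""
    return {doc["source"]: doc for doc in validated}

corpus = publish(validate(compile_docs(stage(["plan.docx", "notes.md"]))))
```

Chaining the stages means there is no path into `corpus` that skips validation, which is the property the "validation-gated" claim describes.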
5. Start asking questions
"What are the main rollout risks across these documents, and which sources support them?"
Your answers come with exact source identity and provenance trace — you can verify every claim against the original file and page.
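What "verify every claim against the original file" could look like mechanically: each claim carries its cited source, and auditing reduces to resolving that citation against your local folder. The shape of `answer` and the helper below are hypothetical illustrations, not DocMason's API.

```python
from pathlib import Path

# Hypothetical shape of a traced claim: it names the file and locator
# it came from, so a human can audit it by hand.
answer = {
    "claim": "The rollout depends on an unsigned vendor contract.",
    "source_file": "original_doc/q3_review.pptx",
    "locator": "slide 7",
}

def is_auditable(claim: dict, repo_root: Path) -> bool:
    """A claim is auditable only if its cited source is a real local file."""
    return (repo_root / claim["source_file"]).is_file()
```

A claim whose citation does not resolve to a file you actually dropped in fails this check immediately, which is the practical difference between provenance and a convincing-sounding footnote.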
The Public Proof Case (Demo Bundle)
If you want to see a rigorously traceable answer before using your own files, the fastest public proof uses the ICO + GCS demo corpus compiled from official UK public-sector releases.
Try the ICO + GCS Demo Bundle to test a governed truth environment before transitioning to your own private folders.
Ask this through your AI agent:
"Across the ICO and GCS materials, what are the main rollout risks, and which sources support them?"
What good looks like:
- Cross-Document Reasoning: The answer synthesizes overlapping governance risks instead of echoing documents one by one.
- Strict Provenance: The answer explicitly points to the exact document origin, instead of blurring the corpus into one anonymous narrative.
- Inherently Traceable: It provides the real evidence bundles so you can verify the root context.
Why It Feels Safer
DocMason is built for deep research over your real work files — where every answer must be traceable to its actual source.
- Strict Source Identity. DocMason enforces strict document boundaries. It prevents agents from hallucinating cross-source facts that only vaguely fit together.
- Answers Are Traceable. You don't just get convincing text. You get a verifiable lineage pointing directly to the exact file and page you dropped in.
- 100% Local and Auditable. Your files, staged data, and compiled knowledge base remain physically inside your local folder boundary. See more →
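The "local folder boundary" claim is itself checkable. A minimal sketch of such an audit, assuming nothing about DocMason's internals: resolve each artifact path and confirm it sits under the DocMason folder. The function below is illustrative, not part of DocMason.

```python
from pathlib import Path

def inside_boundary(artifact: Path, repo_root: Path) -> bool:
    """True only if the artifact resolves to a path under the repo folder,
    so symlinks and ../ tricks cannot smuggle data outside it."""
    try:
        artifact.resolve().relative_to(repo_root.resolve())
        return True
    except ValueError:
        return False
```

Running a check like this over `original_doc/` and `knowledge_base/` is a quick way to confirm that staged data and compiled artifacts never left the folder you unzipped.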
What gets installed: DocMason needs LibreOffice to parse Office files (.pptx, .docx, .xlsx) with full fidelity — this is the most important external dependency. It also sets up a local Python environment automatically. All setup is handled through your AI agent — just approve installations when prompted.
Supported Work File Types
- First-Class Office & PDF: pdf, pptx, ppt, docx, doc, xlsx, xls
- First-Class Deep Text: md, markdown, txt, eml (email)
- Lightweight Text: mdx, yaml, yml, tex, csv, tsv
High-fidelity Office file parsing relies on a lightweight local LibreOffice shim. PDF parsing uses the embedded stack (PyMuPDF, pypdfium2, pypdf, pillow). Together they preserve multimodal structure, layout, and sheet/page context for deeper analysis, not just plain-text extraction. Markdown, plain text, .eml, and the lightweight-compatible family do not require LibreOffice.
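The three parsing tiers above amount to routing by file type: Office formats go through the LibreOffice shim, PDFs through the embedded PDF stack, and the text family needs no external dependency at all. A minimal sketch of such a router (the table and names are illustrative, not DocMason's internals):

```python
from pathlib import Path

# Hypothetical routing table mirroring the tiers described above.
ROUTES = {
    "libreoffice_shim": {".pptx", ".ppt", ".docx", ".doc", ".xlsx", ".xls"},
    "pdf_stack":        {".pdf"},
    "plain_text":       {".md", ".markdown", ".txt", ".eml",
                         ".mdx", ".yaml", ".yml", ".tex", ".csv", ".tsv"},
}

def route(path: str) -> str:
    """Pick the parser family for a file by its (case-insensitive) extension."""
    ext = Path(path).suffix.lower()
    for parser, exts in ROUTES.items():
        if ext in exts:
            return parser
    raise ValueError(f"unsupported file type: {ext}")
```

Routing by extension up front is also why the text family works without LibreOffice installed: those files never touch the shim.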
What You Get Today
- Incremental Sync: Add or revise files in original_doc/, and DocMason can quietly rebuild and republish your local knowledge_base/current/ without forcing a full reset.
- Validation-Gated Commits: Bad data fails the build instead of quietly degrading answers.
- Rich Source Parsing: First-class handling for .pdf, .pptx, .xlsx, .md, .eml, and more.
- Deterministic Retrieval: Exact provenance trace over published corpora.
- Review Surface: Conversation-native logging and extraction for real analysis.
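Incremental sync generally hinges on change detection: only files whose content actually differs from the last sync need recompiling. A minimal sketch of one common approach, content hashing; this is an assumption about the technique, not DocMason's actual mechanism.

```python
import hashlib
from pathlib import Path

def fingerprint(path: Path) -> str:
    """Content hash used to detect changed files between syncs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(folder: Path, last_seen: dict) -> list:
    """Return only files whose content differs from the previous sync,
    so a rebuild can touch the delta instead of the whole corpus."""
    changed = []
    for p in sorted(folder.rglob("*")):
        if p.is_file() and last_seen.get(str(p)) != fingerprint(p):
            changed.append(p)
    return changed
```

Hashing content rather than comparing timestamps means a file copied back unchanged is correctly skipped, while an in-place edit with a preserved timestamp is still caught.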
Privacy and Local-First Boundary
DocMason is designed to run entirely over local files. Here's exactly what that means:
DocMason does NOT send any of the following over the network:
- Your document content, file names, or file paths
- Your queries or answer text
- Any corpus data, evidence bundles, or knowledge-base artifacts
All AI inference traffic is handled by your chosen host agent (Codex, Claude Code, etc.) — DocMason itself makes zero model API calls. The network behavior of your AI agent is governed by that agent's own privacy and telemetry policy.
The only network request DocMason may make is fetching the public demo bundle (the ICO + GCS demo corpus) when you explicitly ask to download it.
