AIWhisperer

Based on DPG Media Course. Whisper your documents to AI, with reduced risk of exposing sensitive data.

"4,713 pages. An experienced researcher would need five days to build a timeline. I did it in 20 minutes, during a coffee break."

Why This Tool Exists

Problem 1: Too big to upload

You have a 170 MB investigation file. You try cloud AI:

ChatGPT: "Failed upload"
Claude.ai: "Files larger than 31 MB not supported"
Gemini: "File larger than 100 MB"

Your files are too big to fail, but also too big to upload. AIWhisperer converts PDFs to text (up to 92% smaller) and splits them into chunks that cloud AI can handle.

Problem 2: Too sensitive to upload, too slow to run locally

You have confidential documents. Local AI would be safe, but it's painfully slow—hours for what cloud AI does in minutes. So you upload to cloud AI anyway, unredacted, hoping for the best.

AIWhisperer gives you a middle path: sanitize locally, analyze in the cloud, decode locally. You get cloud AI speed with reduced exposure of sensitive data.

How It Works

| Step | Where | What happens |
|:----:|:-----:|:-------------|
| 1 | Local | Convert - PDF to text (with OCR for scanned pages) |
| 2 | Local | Split - Break into chunks (500 pages each) |
| 3 | Local | Encode - Replace names with placeholders |
| | | John Smith → PERSON_001 |
| | | +31 6 12345678 → PHONE_001 |
| | | Saves mapping.json locally |
| 4 | Cloud | Upload sanitized files to AI (NotebookLM, etc.) |
| 5 | Cloud | AI analyzes - finds patterns, builds timelines |
| 6 | Local | Download AI output |
| 7 | Local | Decode - restore real names using mapping.json |

This reduces—but does not eliminate—the risk of exposing sensitive data. Always review the sanitized output before uploading.
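The Encode step can be pictured with a few lines of standard-library Python. This is only a sketch of the placeholder idea, not AIWhisperer's actual implementation: the `encode_text` helper, the hard-coded name list, and the simple phone regex are assumptions made for this example, whereas the real tool relies on NER and richer patterns.

```python
import re

# Hypothetical, deliberately simple phone pattern: "+" followed by digits and spaces.
PHONE_RE = re.compile(r"\+\d[\d ]{7,}\d")


def encode_text(text, known_names):
    """Replace known names and phone-like strings with numbered placeholders.

    Returns (sanitized_text, mapping) where mapping is placeholder -> original.
    """
    mapping = {}   # placeholder -> original value
    seen = {}      # original value -> placeholder, so repeats reuse the same token
    counters = {}  # per-category running number

    def placeholder_for(kind, value):
        if value not in seen:
            counters[kind] = counters.get(kind, 0) + 1
            token = f"{kind}_{counters[kind]:03d}"
            seen[value] = token
            mapping[token] = value
        return seen[value]

    if known_names:
        # Longest names first so a full name wins over a shorter overlapping one.
        longest_first = sorted(known_names, key=len, reverse=True)
        name_re = re.compile("|".join(re.escape(name) for name in longest_first))
        text = name_re.sub(lambda m: placeholder_for("PERSON", m.group(0)), text)

    text = PHONE_RE.sub(lambda m: placeholder_for("PHONE", m.group(0)), text)
    return text, mapping


original = "John Smith called +31 6 12345678. Later John Smith met Marcus Johnson."
sanitized, mapping = encode_text(original, ["John Smith", "Marcus Johnson"])
print(sanitized)
print(mapping)
```

Every occurrence of the same value gets the same placeholder, which is what lets the AI still see that two sentences are about the same person. The mapping is the only link back to the real values, so it stays on the local machine.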

What Can You Whisper to AI?

Once your documents are sanitized, whisper questions to AI:

Build timelines - "Create a chronological timeline of all events"
Find connections - "Who communicated with whom? Map the relationships"
Identify patterns - "What phone numbers appear together? What locations overlap?"
Summarize - "What are the key findings in this 4,000-page investigation?"
Extract data - "List all financial transactions with dates and amounts"
Cross-reference - "Which people appear in multiple documents?"

The AI works with PERSON_001, PHONE_002, PLACE_003. After analysis, AIWhisperer restores the real names: PERSON_001 → John Smith, PHONE_002 → +32 489 66 70 88, etc.
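Restoring names is a lookup from placeholder back to original value. The sketch below illustrates that idea and also reports placeholders the AI mentions that are not in the mapping, which can happen if a model invents one. It is not AIWhisperer's decoder: `decode_with_report` is a hypothetical helper, and it assumes the mapping is a flat placeholder-to-value dictionary and that placeholders look like `CATEGORY_NNN`, as in the examples here.

```python
import re

# Assumed placeholder shape: uppercase category, underscore, three digits.
PLACEHOLDER_RE = re.compile(r"\b[A-Z]+_\d{3}\b")


def decode_with_report(ai_output, mapping):
    """Return (decoded_text, unknown_placeholders)."""
    unknown = []

    def restore(match):
        token = match.group(0)
        if token in mapping:
            return mapping[token]
        if token not in unknown:
            unknown.append(token)
        return token  # leave unrecognised placeholders visible in the text

    return PLACEHOLDER_RE.sub(restore, ai_output), unknown


mapping = {"PERSON_001": "John Smith", "PHONE_002": "+32 489 66 70 88"}
ai_output = "PERSON_001 used PHONE_002 to reach PERSON_009."
decoded, unknown = decode_with_report(ai_output, mapping)
print(decoded)
print(unknown)
```

Matching whole placeholder tokens, rather than calling `str.replace` once per mapping entry, avoids accidentally rewriting part of a longer token.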

Result: AI-powered analysis with reduced exposure of sensitive data.

Important Warnings

ALWAYS CHECK THE OUTPUT BEFORE UPLOADING TO AI.

This tool is not perfect. Detection can miss things. Before uploading any sanitized document:

Review the sanitized output - Open the file and verify sensitive data is actually replaced
Use --dry-run first - See what gets detected before committing
Check for unique identifiers - Job titles, rare events, or specific descriptions can still identify people:
- BAD: "PERSON_001, the mayor of Springfield" → Still identifiable
- BAD: "PERSON_001 arrested in Europe's largest drug bust" → The event identifies the person
- OK: "PERSON_001 transferred money to PERSON_002" → Safe
Test with sample data first - Before processing real confidential documents
You are responsible - This tool assists, but YOU must verify the output is safe

No detection is 100% accurate. Names with unusual spelling, new patterns, or edge cases may slip through. When in doubt, manually check.
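Part of the manual review can be supported by a small script that flags obvious leftovers. The sketch below is an assumption-laden illustration, not a feature of AIWhisperer: `find_leaks` is a hypothetical helper, the mapping is assumed to be a flat placeholder-to-value dictionary, and the regexes are rough and will miss cases. It does not replace reading the file yourself.

```python
import re

# Rough, hypothetical patterns; expect both false positives and false negatives.
LEAK_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+\d[\d ]{7,}\d"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def find_leaks(sanitized_text, mapping):
    """Return (reason, value) pairs that deserve a manual look."""
    findings = []
    lowered = sanitized_text.lower()

    # 1. Original values that should have been replaced but still appear
    #    (case-insensitive, so "john smith" is caught as well).
    for original in mapping.values():
        if original.lower() in lowered:
            findings.append(("unreplaced value", original))

    # 2. Strings that merely look sensitive.
    for label, pattern in LEAK_PATTERNS.items():
        for match in pattern.finditer(sanitized_text):
            findings.append((label, match.group(0)))

    return findings


mapping = {"PERSON_001": "John Smith", "PHONE_001": "+31 6 12345678"}
sanitized = "PERSON_001 emailed j.smith@example.com; john smith replied from PHONE_001."
findings = find_leaks(sanitized, mapping)
for reason, value in findings:
    print(f"{reason}: {value}")
```

A check like this cannot catch contextual identifiers such as "the mayor of Springfield", so an empty result does not mean the file is safe to upload.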

The Story Behind This Tool

This tool was born from a real investigation: a 170-megabyte cocaine smuggling case file containing court orders, wiretap transcripts, cell tower data, arrest warrants, bank statements, and interrogation protocols.

The problem? You shouldn't upload confidential files to cloud AI. And even if you wanted to:

ChatGPT: "Failed upload"
Gemini: "File larger than 100 MB"
Claude.ai: "You may not upload files larger than 31 MB"

The solution? Encode locally → Analyze in cloud → Decode locally.

Read the full story: Speed reading a massive criminal investigation with AI - How to make sense of 4,713 pages in 20 minutes without leaking data

The Concept

BEFORE:    "On 16/10/2023, officers arrested John Smith at 123 Harbor Road.
            He was hired by Marcus Johnson."

AFTER:     "On 16/10/2023, officers arrested PERSON_001 at ADDRESS_001.
            He was hired by PERSON_002."

AI OUTPUT: "Timeline shows PERSON_001 arrested on 16/10/2023, connected to
            PERSON_002 who runs COMPANY_001, COMPANY_002 and COMPANY_003"

DECODED:   "Timeline shows John Smith arrested on 16/10/2023, connected to
            Marcus Johnson who runs Hideout 1, Hideout 2 and Hideout 3"

What changes: Names, locations, phones, emails, IBANs, vehicles, addresses
What stays: Structure, relationships, patterns, dates, amounts

Quick Start

Installation

Option 1: macOS App (Apple Silicon)

Download the .dmg installer from Releases - no Python needed.

Download AIWhisperer-x.x.x-arm64.dmg
Open the DMG and run install.command
Run aiwhisperer --help in Terminal

Note: The app is not code-signed. On first run, right-click and select "Open" to bypass Gatekeeper.

Option 2: pip install (all platforms)

# Install with spaCy and OCR support (recommended)
# Quotes stop shells such as zsh from treating the brackets as a glob pattern
pip install "aiwhisperer[spacy,ocr]"

# Download Dutch language model
python -m spacy download nl_core_news_sm

# Other languages available: en, de, fr, it, es

# Check what's installed and what's missing
aiwhisperer check

The check command shows exactly what's installed and how to fix missing dependencies:

$ aiwhisperer check

AIWhisperer Dependency Check
===================================
Python: 3.10.5  (OK)

PDF Conversion:
  [x] marker-pdf: Installed (best accuracy)
  [x] pymupdf: Installed
  [x] tesseract: Installed (OCR fallback)

NER Detection:
  [x] spaCy: Installed (v3.8.11)

Language Models:
  [x] nl: nl_core_news_sm
  [ ] en: en_core_web_sm
      -> Fix: python -m spacy download en_core_web_sm

Command Line

Two workflows:

# WORKFLOW 1: Non-confidential files (just convert)
aiwhisperer convert document.pdf

# WORKFLOW 2: Confidential files (convert + sanitize in one step)
aiwhisperer convert document.pdf --sanitize

Full workflow for large confidential files:

# Step 1: Convert and sanitize (with split for large files)
aiwhisperer convert investigation.pdf --split --max-pages 500 --sanitize

# Creates:
#   investigation_part1.txt            ← Plain text
#   investigation_part1_sanitized.txt  ← Send to AI
#   investigation_part1_mapping.json   ← Keep this LOCAL

# IMPORTANT: Check sanitized files before uploading!
# Make sure no sensitive data slipped through.

# Step 2: Upload sanitized files to NotebookLM
#         Ask AI to build timeline, find patterns, etc.

# Step 3: Save AI output, then decode back to real names
aiwhisperer decode ai_analysis.txt -m investigation_part1_mapping.json

# Result: Full analysis with real names restored

For smaller files (under ~50 MB), you can skip the --split flag.
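Conceptually, splitting with a page limit is just grouping pages into fixed-size batches. The sketch below shows that grouping under the assumption that the text is already available as a list of per-page strings; `split_pages` is a hypothetical helper, not the tool's internal function, and the output names simply follow the `investigation_partN.txt` pattern shown above.

```python
def split_pages(pages, max_pages=500):
    """Group a list of per-page strings into chunks of at most max_pages pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [pages[i:i + max_pages] for i in range(0, len(pages), max_pages)]


# Stand-in for a 4,713-page document: one short string per page.
pages = [f"page {n}" for n in range(1, 4714)]
chunks = split_pages(pages, max_pages=500)

for index, chunk in enumerate(chunks, start=1):
    name = f"investigation_part{index}.txt"
    print(name, len(chunk), "pages")
```

With a 500-page limit, a 4,713-page file yields ten parts: nine full chunks and a final one of 213 pages.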

Python API

from aiwhisperer import encode, decode, Mapping
from aiwhisperer.converter import convert_pdf

# Convert PDF to text
text, metadata = convert_pdf("investigation.pdf")

# Encode
sanitized, mapping = encode(text, language='nl')

# Save (the context manager closes the file once writing is done)
with open("sanitized.txt", "w", encoding="utf-8") as f:
    f.write(sanitized)
mapping.save("mapping.json")

# IMPORTANT: Review sanitized.txt before uploading!

# ... send sanitized.txt to AI, get analysis back ...

# Decode
with open("ai_analysis.txt", encoding="utf-8") as f:
    ai_output = f.read()
final = decode(ai_output, mapping)
with open("final_report.txt", "w", encoding="utf-8") as f:
    f.write(final)

PDF Conversion

AIWhisperer includes built-in PDF to text conversion with OCR for scanned documents.

OCR Backends

| Backend | Accuracy | Install |
|---------|----------|---------|
| marker-pdf (recommended) | Excellent | pip install marker-pdf |
| pytesseract (fallback) | Good | pip install pymupdf pytesseract pdf2image + Tesseract |

marker-pdf uses Surya OCR under the hood - currently one of the most accurate OCR solutions available.

Usage

# Convert PDF (auto-selects best available backend)
aiwhisperer convert document.pdf

# Force specific backend
aiwhisperer convert document.pdf --backend marker

# Split large PDFs into multiple text files
aiwhisperer convert large.pdf --split --max-pages 500

# Just show PDF info
aiwhisperer convert document.pdf --info

Installing Tesseract (fallback)

If marker-pdf doesn't work for you:

# macOS
brew install tesseract tesseract-lang

# Ubuntu/Debian
apt install tesseract-ocr tesseract-ocr-nld tesseract-ocr-deu tesseract-ocr-fra

# Windows: download from https://github.com/UB-Mannheim/tesseract/wiki

What Gets Detected

| Category | Examples | What it catches |
|----------|----------|-----------------|
| PERSON | Jan de Vries, El Mansouri Brahim | Names via NER + context patterns |
| PLACE | Antwerpen, te Wuustwezel | Cities, "te X
