# Truth Seeker
This project provides a complete workflow for downloading, staging, ingesting, and researching leaked and declassified archives (e.g., DDoSecrets, National Security Archive, WikiLeaks).
It automates:
- Downloading archives
- Preparing & staging files
- Ingesting into a vector database
- Running recursive research with Truth Seeker
## Requirements

### Python Dependencies

Install all dependencies with:

```bash
pip install -r requirements.txt
```
### External Tools

- `ocrmypdf` (for OCR on scanned PDFs; optional if you use `--skip-ocr`)
- `libpff` / `pypff` (required to parse PST files; a usage sketch follows this list)

  **Linux / macOS:**

  ```bash
  sudo apt-get update
  sudo apt-get install -y build-essential python3-dev git autoconf automake libtool
  git clone https://github.com/libyal/libpff.git
  cd libpff
  ./synclibs.sh
  ./autogen.sh
  ./configure
  make
  sudo make install
  sudo ldconfig  # Linux only
  cd pypff
  python3 setup.py build
  sudo python3 setup.py install
  ```

  **Windows:**

  - Install Visual Studio Build Tools
  - Install Python dev headers (matching your Python version)
  - Clone and build `libpff` with MSVC or MSYS2
  - Build the Python bindings:

    ```bash
    python setup.py build
    python setup.py install
    ```

  - Verify with:

    ```python
    import pypff
    print(pypff.get_version())
    ```

- Torrent client (e.g., qBittorrent, Transmission, aria2) for WikiLeaks archives
- LM Studio
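Once `pypff` builds, you can sanity-check it against a real PST by walking its folder tree. A minimal sketch, assuming an example file at `./data/sample.pst` (the path and traversal are illustrative, not this repo's code):

```python
import pypff

# Open a PST file (path is illustrative).
pst = pypff.file()
pst.open("./data/sample.pst")

def walk(folder, depth=0):
    """Recursively print folder names and message subjects."""
    print("  " * depth + (folder.name or "(unnamed)"))
    for i in range(folder.number_of_sub_messages):
        msg = folder.get_sub_message(i)
        print("  " * (depth + 1) + "- " + (msg.subject or "(no subject)"))
    for i in range(folder.number_of_sub_folders):
        walk(folder.get_sub_folder(i), depth + 1)

walk(pst.get_root_folder())
pst.close()
```

If this prints your mailbox hierarchy, the bindings are working and the ingest step will be able to parse `.pst` files.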
## Quick Start

1. Clone the repo

   ```bash
   git clone https://github.com/RawdodReverend/TruthSeeker.git
   cd TruthSeeker
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```
3. Run downloaders

   - **National Security Archive (NSA EBBs)**

     ```bash
     python natsecarchive.py
     ```

     Downloads and logs Briefing Book PDFs.

   - **DDoSecrets**

     ```bash
     python ddosecrets.py
     ```

     Thread-safe spider for data.ddosecrets.com — downloads docs, archives, images, etc.

   - **WikiLeaks**

     Add torrents (e.g., `WikiLeaksTorrentArchive_archive.torrent`) to your torrent client and wait for completion. After the download completes, move the files into the project's `./data` folder.
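If you'd rather script the torrent step, aria2 can fetch directly into `./data` from the command line (one option among many; the filename matches the example above):

```bash
# Download the torrent contents into ./data and exit without seeding.
aria2c --seed-time=0 -d ./data WikiLeaksTorrentArchive_archive.torrent
```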
## Stage Data

Once downloads/torrents finish, run staging:

```bash
python stage_data.py
```

This will:

- Unzip all `.zip` archives
- Move supported files into `./docs`
- Leave processed zips in `./processed`

Supported extensions include: `.pdf`, `.doc`/`.docx`, `.txt`, `.eml`, `.pst`, `.json`, `.csv`, `.xls`/`.xlsx`, `.xml`, `.htm`/`.html`, `.rtf`, `.md`, code files, configs, logs.
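For orientation, the staging pass boils down to something like the following. This is a simplified sketch, not the actual `stage_data.py`, and the extension set is abbreviated:

```python
import shutil
import zipfile
from pathlib import Path

DATA, DOCS, PROCESSED = Path("data"), Path("docs"), Path("processed")
SUPPORTED = {".pdf", ".docx", ".txt", ".eml", ".pst", ".json", ".csv"}  # abbreviated

for d in (DOCS, PROCESSED):
    d.mkdir(exist_ok=True)

# Extract every zip under ./data, then park the archive in ./processed.
for zp in DATA.rglob("*.zip"):
    with zipfile.ZipFile(zp) as zf:
        zf.extractall(DATA / zp.stem)
    shutil.move(str(zp), PROCESSED / zp.name)

# Move supported files into ./docs for ingestion.
for f in DATA.rglob("*"):
    if f.is_file() and f.suffix.lower() in SUPPORTED:
        shutil.move(str(f), DOCS / f.name)
```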
## Ingest

Convert staged docs into vector embeddings and insert into ChromaDB:

```bash
python ingest.py
```

Options:

- `--skip-ocr` → skip OCR processing for image-only PDFs.
  - Faster and simpler (no `ocrmypdf` needed).
  - Scanned PDFs without text will be skipped.

Processed docs are moved into `./processed_docs` and skipped on re-runs. Failed docs go into `./failed_docs`.
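Under the hood, ingestion amounts to chunking text and writing it into a persistent ChromaDB store. A minimal sketch of that pattern with the `chromadb` client (the collection name, file, and chunking here are illustrative, not the repo's exact scheme):

```python
import chromadb

# Persistent store matching the default ./chroma_db location.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="docs_shard_0")  # illustrative name

def chunk(text: str, size: int = 1000) -> list[str]:
    """Naive fixed-size chunking; the real pipeline may differ."""
    return [text[i:i + size] for i in range(0, len(text), size)]

text = open("docs/example.txt", encoding="utf-8", errors="ignore").read()
chunks = chunk(text)
collection.add(
    ids=[f"example-{i}" for i in range(len(chunks))],
    documents=chunks,
    # Embeddings can also be passed explicitly (e.g., from LM Studio's
    # embedding endpoint); if omitted, Chroma uses its default embedder.
    metadatas=[{"source": "docs/example.txt"}] * len(chunks),
)
```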
## Truth Seeker

The Truth Seeker agent lets you research across all ingested documents using recursive retrieval-augmented generation (RAG).

### Required Configuration

Before running, edit `Truth_Seeker.py` to point to your LM Studio instance and models:

```python
LM_STUDIO_API = "http://<your-host>:<port>/v1"
EMBED_MODEL = "text-embedding-nomic-embed-text-v1.5"
CHAT_MODEL = "lmstudio-community/gemma-3-27b-it"
```

- Replace `<your-host>:<port>` with your LM Studio server.
- Ensure both the embedding model and chat model are downloaded and served.
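LM Studio serves an OpenAI-compatible API at that `/v1` base URL, so a quick way to confirm both models are reachable is a direct call with the `openai` client. A connectivity check, not part of this repo, assuming LM Studio's default local port 1234:

```python
from openai import OpenAI

# LM Studio ignores the API key, but the client requires one.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

emb = client.embeddings.create(
    model="text-embedding-nomic-embed-text-v1.5",
    input="test sentence",
)
print(len(emb.data[0].embedding))  # embedding dimension

chat = client.chat.completions.create(
    model="lmstudio-community/gemma-3-27b-it",
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(chat.choices[0].message.content)
```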
### Run the interactive agent

```bash
python Truth_Seeker.py
```

Available modes:

- `recursive: <query>` → deep recursive research + generates a flowchart
- `research: <query>` → multi-angle research (generates queries, synthesizes evidence)
- `simple: <query>` → single vector search
- `search: <query>` → debug search results across shards
- `stats` → show database statistics
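At the prompt, prefix your question with a mode name, for example (the query is illustrative; results depend on what you have ingested):

```
recursive: what do the declassified cables say about arms shipments?
```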
## Workflow Recap

1. **Download**
   - Run `natsecarchive.py`, `ddosecrets.py`, and download torrents.
   - Move completed WikiLeaks torrents into `./data`.
2. **Stage**
   - Run `stage_data.py` → prepares files in `./docs`.
3. **Ingest**
   - Run `ingest.py` → embed into ChromaDB (`--skip-ocr` optional).
4. **Truth Seek**
   - Configure `Truth_Seeker.py` with LM Studio settings.
   - Run `Truth_Seeker.py` → perform investigations.
## Notes

- **OCR trade-off:**
  - Default: OCR is enabled (requires `ocrmypdf`).
  - With `--skip-ocr`: faster, but loses text from image-only PDFs.
- **Scalability:** Uses sharding across ChromaDB collections to handle large archives (see the sketch after this list).
- **Persistence:** Data is stored in `./chroma_db` by default.
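One common way to shard across ChromaDB collections is to hash each document ID into a fixed number of collections, then query every shard and merge results by distance. A sketch of that idea, where the shard count and naming are assumptions rather than this repo's exact scheme:

```python
import hashlib

import chromadb

NUM_SHARDS = 8  # assumed shard count
client = chromadb.PersistentClient(path="./chroma_db")

def shard_for(doc_id: str):
    """Route a document ID to a stable shard collection."""
    n = int(hashlib.sha1(doc_id.encode()).hexdigest(), 16) % NUM_SHARDS
    return client.get_or_create_collection(name=f"docs_shard_{n}")

def search_all(query: str, k: int = 5):
    """Query every shard and merge hits by ascending distance."""
    hits = []
    for n in range(NUM_SHARDS):
        col = client.get_or_create_collection(name=f"docs_shard_{n}")
        res = col.query(query_texts=[query], n_results=k)
        hits += list(zip(res["documents"][0], res["distances"][0]))
    return sorted(hits, key=lambda h: h[1])[:k]
```

Hash routing keeps each collection small enough to query quickly while the fan-out search preserves recall across the whole archive.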
