
monkeSearch



Read the technical report at: monkesearch.github.io

A prototype system that brings semantic search capabilities to your file system, letting you search for files using natural language queries with temporal awareness, like "documents from last week" or "photos from 3 days ago". Nothing leaves your PC: it runs fully offline with local vector embeddings.

⚠️ Prototype: This is an initial proof-of-concept implementation. Expect rough edges and limited functionality. Multi-platform support: Now available for macOS, Linux, and Windows!

Watch an intro video I made on this project here.


The Idea

The core idea behind monkeSearch is simple: you should be able to search your own file system using natural language. Not exact filenames, not regex, not folder browsing — just describe what you're looking for and when, and the system finds it. Fully offline, nothing leaves your machine.

Any natural language file search query can be broken down into 3 constituents:

  1. File type — what kind of file (pdf, image, code, etc.)
  2. Temporal context — when (3 days ago, last week, etc.)
  3. Misc keywords — any remaining context (project name, topic, etc.)

The original implementation used a local LLM to extract these constituents and convert them directly into macOS Spotlight query arguments. The current main branch achieves the same thing using vector databases instead. Both approaches are fully offline.

The Original Implementation (LLM → Spotlight)

This was the first version of monkeSearch and the original vision behind the project. The idea: use a local LLM to convert a natural language query directly into arguments for macOS's built-in Spotlight search — no vector database, no embeddings index, no metadata dump. Just natural language in, structured OS-level query out, instant results back.

How it works

  1. User writes a natural language query like "python scripts from 3 days ago"

  2. Stop words are stripped (find, search, files, ago, back, etc.) to clean up the query before it hits the LLM

  3. A local LLM (Qwen3-0.6B running via llama.cpp) parses the cleaned query and extracts structured components:

    {
      "file_types": ["py"],
      "time_unit": "days",
      "time_unit_value": "3",
      "is_specific": true,
      "source_text": {
        "file_types": "python scripts",
        "time_unit": "3 days ago"
      }
    }
    

    The LLM understands that "python scripts" means .py, "images" means jpg,png, "yesterday" means days,1, "last week" means weeks,1, etc. It uses constrained JSON output via llama.cpp's response_format to guarantee valid structured output.

  4. The extracted components are converted into NSMetadataQuery predicates — the same API that powers Spotlight and mdfind under the hood:

    • File types → mapped to UTIs via utitools, then used as kMDItemContentTypeTree predicates. For broad categories (is_specific: false), the UTI hierarchy is climbed to match parent types (e.g., "images" matches all image formats, not just jpg)
    • Temporal data → converted to kMDItemFSContentChangeDate date predicates using timedelta
    • Remaining misc keywords → matched against kMDItemTextContent and kMDItemFSName
    • All predicates are combined with NSCompoundPredicate
  5. The compound query runs against Spotlight's existing index — results come back instantly since macOS already maintains the index. No separate database to build or maintain.
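To make step 4 more concrete, here is a simplified, self-contained sketch of turning the extracted components into an mdfind-style predicate string. The real parser.py builds NSMetadataQuery predicates via PyObjC and maps extensions to UTIs with utitools; the helper below and its filename-based matching are illustrative stand-ins only.

```python
from datetime import datetime, timedelta

def build_predicate(components: dict) -> str:
    """Illustrative conversion of the LLM's structured output into an
    mdfind-style Spotlight predicate (simplified vs. the real parser.py,
    which uses kMDItemContentTypeTree UTIs and NSCompoundPredicate)."""
    clauses = []
    # File types: match by extension here; the real code resolves UTIs.
    for ext in components.get("file_types", []):
        clauses.append(f'kMDItemFSName == "*.{ext}"')
    # Temporal: turn ("days", "3") into a modification-date lower bound.
    unit, value = components.get("time_unit"), components.get("time_unit_value")
    if unit and value:
        cutoff = (datetime.now() - timedelta(**{unit: int(value)})).date()
        clauses.append(f"kMDItemFSContentChangeDate >= $time.iso({cutoff})")
    return " && ".join(clauses)

pred = build_predicate({"file_types": ["py"], "time_unit": "days",
                        "time_unit_value": "3"})
# e.g. 'kMDItemFSName == "*.py" && kMDItemFSContentChangeDate >= $time.iso(...)'
```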

The two LLM-based branches

  • LangExtract implementation — uses LangExtract with a local Llama server (llama_cpp.server on localhost:8000) for structured extraction. Defines file_type_indicator and temporal_indicator extraction classes with few-shot examples.
  • llama.cpp direct implementation (legacy-main-llm-implementation branch) — uses llama_cpp.Llama directly with constrained JSON output via Pydantic schema. More examples in the system prompt, tighter structured output.

Both use the same parser.py that converts the LLM's structured output into NSMetadataQuery predicates and runs the query against Spotlight.

Why this matters: There's no index to build, no metadata to dump, no embeddings to generate. The LLM is the only "intelligence" layer — it converts human language to Spotlight's query language, and Spotlight does the actual searching using its pre-existing system index. This makes it safe by design (read-only, scoped through Spotlight's own access controls).

For Agentic Use: These LLM-based implementations are particularly suitable for integration into larger AI pipelines and agentic systems. They provide a direct LLM-to-filesystem bridge through natural language without modifying any files, leveraging OS-level scoped safety through Spotlight. If you're building autonomous agents or LLM orchestration systems that need file discovery capabilities, these branches give you that without the overhead of maintaining a separate index.

Current Implementation (Vector DB-based)

The current main branch achieves the same functionality using vector databases instead of a live LLM at query time. This was built to make monkeSearch cross-platform (the LLM → Spotlight approach is macOS-only) and to make search faster since it doesn't need an LLM running.

The tradeoff: you need to build and maintain an index, but search is sub-second and doesn't require a running LLM.

How it works

  1. Metadata extraction (platform-specific):

    • macOS: Spotlight metadata extraction via Foundation framework
    • Linux/Windows: os.walk-based file system traversal with configurable search folders
  2. Embedding generation: File metadata is converted to a text representation and embedded using sentence transformers (default: facebook/contriever)

  3. Vector database indexing:

    • Mac/Linux → LEANN: Graph-based vector index with 97% storage savings
    • Windows → ChromaDB: Persistent collection-based semantic search
  4. Temporal expression parsing: Regex-based extraction of time expressions ("3 days ago", "last week", "around 2 months ago") → ISO timestamp ranges. Stop words removed during parsing.

  5. Search & filter: Clean query is embedded and matched via semantic similarity against the vector index. Results are filtered by date range if a temporal expression was found.
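A minimal sketch of the regex-based temporal parsing in step 4. The pattern, function name, and the choice of "expression time to now" as the range are all simplifications, not the project's actual code:

```python
import re
from datetime import datetime, timedelta

# Map matched unit words to timedelta keyword arguments.
UNITS = {"day": "days", "days": "days", "week": "weeks", "weeks": "weeks"}

def extract_time_range(query: str):
    """Pull a time expression like '3 days ago' out of the query and
    return (cleaned_query, (start_iso, end_iso)) -- simplified sketch."""
    m = re.search(r"(?:around\s+)?(\d+)\s+(days?|weeks?)\s+ago", query)
    if not m:
        return query, None
    value, unit = int(m.group(1)), UNITS[m.group(2)]
    end = datetime.now()
    start = end - timedelta(**{unit: value})
    cleaned = (query[:m.start()] + query[m.end():]).strip()
    return cleaned, (start.isoformat(), end.isoformat())

cleaned, rng = extract_time_range("pdf documents 2 weeks ago")
# cleaned == "pdf documents"; rng spans the last two weeks
```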

Metadata fields indexed (all platforms)

  • Path: Full file path
  • Name: File name
  • Size: File size in bytes
  • ContentType: MIME type (e.g., text/plain, image/jpeg)
  • Kind: File extension or type
  • CreationDate: File creation timestamp (ISO format)
  • ContentChangeDate: Last modification timestamp (ISO format)
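For example, one plausible way to flatten those fields into the text representation that gets embedded (the field names match the table above, but the layout itself is an assumption, not monkeSearch's exact format):

```python
def metadata_to_text(meta: dict) -> str:
    """Flatten indexed metadata fields into one string for embedding.
    Field names follow the table above; the layout is illustrative."""
    ordered = ["Name", "Path", "Kind", "ContentType", "Size",
               "CreationDate", "ContentChangeDate"]
    return " | ".join(f"{k}: {meta[k]}" for k in ordered if k in meta)

doc = metadata_to_text({
    "Path": "/Users/me/Documents/report.pdf",
    "Name": "report.pdf",
    "Size": 52341,
    "ContentType": "application/pdf",
    "Kind": "pdf",
    "ContentChangeDate": "2025-06-01T10:00:00",
})
# e.g. "Name: report.pdf | Path: /Users/me/Documents/report.pdf | Kind: pdf | ..."
```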

Developer note:

I've been working on this project for a long time, and the idea has gone through many versions. The LLM → Spotlight approach was the original vision: let a language model convert your natural language directly into OS-level search commands, no database needed. The vector DB approach on main right now is faster and cross-platform, but the LLM branches are still very relevant, especially for agentic pipelines where you want zero setup overhead. Rigorous evals and testing will be done before settling on a single approach for the main release. This is under active development, and suggestions + PRs are welcome. My goal for this tool is to be open source, safe, and cross-platform.

please star the repo too, if you've read it till here :P

Example Queries

The system performs semantic search on file metadata with temporal filtering, understanding context without exact keyword matching:

| Natural Language Query | What It Finds |
|------------------------|---------------|
| "photos from wedding" | Image files with name/path semantically matching "wedding" |
| "documents from 3 weeks ago" | Any document-like files from 3 weeks ago |
| "old music files" | Audio files with temporal context |
| "invoices from last month" | Files semantically similar to "invoices" from last month |
| "presentations about 2 months ago" | Files matching "presentations" context from ~2 months ago |
| "downloads from last week" | Files in downloads folder from last week |

Requirements

  • Python 3.12
  • Platform-specific notes:
    • macOS: Spotlight indexing enabled (uses Foundation framework)
    • Linux: Standard file system access via os.walk
    • Windows: Standard file system access via os.walk (future: pywin32 for performance)

Installation

1. Clone the Repository

git clone https://github.com/monkesearch/monkesearch
cd monkeSearch

2. Install Dependencies

Mac/Linux (LEANN-based):

pip install leann
pip install numpy

Windows (ChromaDB-based):

pip install sentence-transformers
pip install chromadb
pip install numpy

Usage by Platform

macOS

cd app/

# 1. Dump Spotlight metadata
python spotlight_index_dump.py 1000  # Dump 1000 files

# 2. Build LEANN index
python leann_index_builder.py spotlight_dump.json

# 3. Search
python leann-plus-temporal-search.py "pdf documents 2 weeks ago"
python leann-plus-temporal-search.py "presentations" 10  # Top 10 results

Linux

cd app/

# 1. Dump file met