Private Web Search with Tiptoe

This repository contains the reference implementation of Tiptoe, accompanying the paper "Private Web Search with Tiptoe" by Alexandra Henzinger, Emma Dauterman, Henry Corrigan-Gibbs, and Nickolai Zeldovich (SOSP 2023).

The implementation of Tiptoe's cryptosystem is available at github.com/ahenzinger/underhood.

The version of Tiptoe that was artifact evaluated is tagged as v0.

Tiptoe is a private web search engine that allows clients to search over hundreds of millions of documents, while revealing no information about their search query to the search engine’s servers. Tiptoe’s privacy guarantee is based on cryptography alone; it does not require hardware enclaves or non-colluding servers. Tiptoe uses semantic embeddings to reduce the problem of private full-text search to private nearest-neighbor search. Then, Tiptoe implements private nearest-neighbor search with a new, high-throughput protocol based on linearly homomorphic encryption. Running on a 45-server cluster, Tiptoe can privately search over 360 million web pages with 145 core-seconds of server compute, 56.9 MiB of client-server communication (74% of which can occur before the client enters its search query), and 2.7 seconds of end-to-end latency. Tiptoe’s search works best on conceptual queries (“knee pain”) and less well on exact string matches (“123 Main Street, New York”). On the standard MS MARCO search-quality benchmark, Tiptoe ranks the best-matching result in position 7.7 on average. This is worse than a state-of-the-art, non-private neural search algorithm (average rank: 2.3), but is close to the classical tf-idf search algorithm (average rank: 6.7). Finally, Tiptoe is extensible: it also supports private text-to-image search and, with minor modifications, it can support private search over audio, code, and more.

To minimize the time and cost to reproduce the paper's results, we additionally provide scripts, data sets, and preprocessed data structures to reproduce the Tiptoe text search results over the Common Crawl data set.

Warning: This code is a research prototype.

Overview
Setup
- Dependencies
- Getting started
Usage
Reproducing results from the paper
Indexing the corpus
Acknowledgements
Citation

Overview<a name="overview"></a>

This repository is organized as follows:

The search/ directory implements the Tiptoe private search protocol. It includes code for the Tiptoe client and the client-facing Tiptoe services (including all cryptographic operations).
The embeddings/, dim-reduce/, and cluster/ directories contain the code for Tiptoe's corpus indexing batch jobs.
The perf-eval/ directory contains scripts to launch Tiptoe servers on a cluster of AWS instances and run performance experiments. The quality-eval/ directory contains scripts to evalute Tiptoe's search quality. The tf-idf/ directory contains code to compare search quality against the tf-idf baseline.
The plots/ directory contains scripts to generate the tables and plots from the paper.

Setup<a name="setup"></a>

Dependencies<a name="dependencies"></a>

To build from source, install the following dependencies:

Go (tested with version 1.20.2, requires at least 1.19)
Python3 and pip3
a C compiler (tested with GCC 11.3.0)
Microsoft SEAL (tested with version 4.1.1). You must compile SEAL with -DSEAL_THROW_ON_TRANSPARENT_CIPHERTEXT=off. We recommend also compiling with Intel HEXL (-DSEAL_USE_INTEL_HEXL=ON) to use hardware acceleration.

Getting started<a name="start"></a>

After building from source, run:

git clone git@github.com:ahenzinger/tiptoe.git
cd tiptoe
pip3 install -r requirements.txt

Finally, run mkdir ~/.aws and enter your AWS credentials in ~/.aws/credentials. (This step is needed to use our scripts to launch AWS instances and to download data from S3.) The credentials file should be of the form

[default]
aws_access_key_id = <your access key id>
aws_secret_access_key = <your secret access key>

Usage<a name="usage"></a>

Unit tests<a name="unit"></a>

[10 min, 3 GiB disk space] To run Tiptoe's correctness tests on a synthetically generated corpus, run:

cd search/protocol
./fake_corpus_test.sh
cd ../..

These correctness tests run each of the client-facing Tiptoe services and the Tiptoe client, and ensure that the search results (and all intermediate computations) are correct. While executing, the tests log performance information to the console. The tests will print PASS if all unit tests succeeded.

End-to-end tests (optional for artifact eval)<a name="e2e"></a>

[3.5h, 230 GiB disk space, 90 GiB RAM] To run Tiptoe's correctness tests on a slice of the Common Crawl text corpus, substitute in the desired path_to_corpus (e.g., /home/ubuntu/) and run:

cd search/protocol
./real_corpus_test.sh $path_to_corpus
cd ../..

This script performs the following operations:

First, it downloads the Common Crawl text corpus from S3 (on which we have already performed the embedding, dimensionality reduction, and clustering steps). This step requires an AWS credential to be in ~/.aws/credential/, takes roughly 115 minutes, and writes two directories in path_to_corpus of total size roughly 230 GiB.
Then, it runs correctness tests on a slice of the Common Crawl corpus. As with the other unit tests, the tests log performance information to the console and will print PASS if all correctness tests succeeded. The correctness tests take 1.5h and require 90 GiB of RAM.

Running the Tiptoe servers and client (optional for artifact eval)<a name="run"></a>

It is possible to run the Tiptoe client-facing services and the Tiptoe client to perform private searches over the Common Crawl data set, outside of the correctness test and the performance test harnesses. To do so, you must (1) download the required pre-processed server and client states from S3, and then (2) launch each of the Tiptoe server and client processes.

We store pre-processed copies of the server and client state in S3. The ranking service's state is split into 80 shards, each of which can be downloaded separately and can then be launched as an individual process. (In our evaluation, we run these 80 processes which constitute the ranking service on 40 physical machines.) The URL service's state is split into 8 shards, each of which can be downloaded separately and can then be launched as an individual process. (In our evaluation, we run these 8 processes which constitute the URL service on 4 physical machines.) The Tiptoe coordinator and the Tiptoe client also each have their own state, and can then be launched as individual processes. These processes communicate with each other over the network.

Downloading the state from S3<a name="download"></a>

This step requires having an AWS credential in ~/.aws/credentials. Downloading the state from S3 to the directory path_to_corpus (e.g. /home/ubuntu/) works as follows:

Ranking service: [<1 min, 5 GiB of disk space] To download each of the 80 shards, run (where idx is between 0 and 79):
```
cd perf-eval/s3
python3 text_download_from_s3.py embedding $idx $path_to_corpus
cd ../..
```
URL service: [<1 min, 2.5 GiB of disk space] To download each of the 8 shards, run (where idx is between 0 and 7):
```
cd perf-eval/s3
python3 text_download_from_s3.py url $idx $path_to_corpus
cd ../..
```

Coordinator: [<1 min, 1 GiB of disk space] To download the coordinator state, run:

cd perf-eval/s3
python3 text_download_from_s3.py coordinator 0 $path_to_corpus
cd ../..

Client: [<1 min, 80 MiB of disk space] To download the client's state (which consists of the cluster centroids and PCA components), run:
```
cd perf-eval/s3
python3 text_download_from_s3.py client 0 $path_to_corpus
cd ../..
```

All state: [<1h, 370 GiB of disk] To download all of these files from S3 at once, run

cd perf-eval/s3
./text_download_all.sh $path_to_corpus
cd ../../

Launching the server and client processes<a name="launch"></a>

Launching the Tiptoe servers and clients given their state stored in the directory path_to_corpus (e.g. /home/ubuntu/) works as follows:

Ranking service: To launch each of the 80 shards, run (where idx is between 0 and 79):
```
cd search
go run . emb-server $idx -preamble $path_to_corpus
cd ..
```
After roughly 1 min, this command will print TCP server listening on ip:port to the console. At this point, the Tiptoe ranking server process is running and ready to answer queries.
URL service: To launch each of the 8 shar

Tiptoe

Install / Use

README