:dollar: What is it?

The Index Compression Methods (INCOME) repository helps you easily train and evaluate different memory-efficient dense retrievers on any custom dataset. Pre-trained dense retrievers typically produce float embeddings with 512 to 1024 dimensions. Storing a large number of such embeddings in an index for fast inference requires a lot of memory and storage.

In this repository, we focus on index compression and provide models that produce binary embeddings, i.e. every value is either 1 or -1. These need far less storage per dimension and help you save both storage and money when hosting such models in a practical setup with a limited budget.
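To make the storage saving concrete, here is a minimal NumPy sketch of sign-based binarization. It is an illustration only, not the repository's code: the function names `binarize` and `pack_bits`, the random input, and the choice to map 0 to +1 are all assumptions made for this example.

```python
import numpy as np


def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Map every float to +1 or -1 by its sign (0 maps to +1, an arbitrary choice)."""
    return np.where(embeddings >= 0, 1, -1).astype(np.int8)


def pack_bits(binary: np.ndarray) -> np.ndarray:
    """Store each +1/-1 value as one bit (+1 -> 1, -1 -> 0), eight values per byte."""
    return np.packbits(binary > 0, axis=-1)


# Hypothetical input: 4 random float32 "embeddings" with 768 dimensions each.
rng = np.random.default_rng(0)
floats = rng.standard_normal((4, 768)).astype(np.float32)

codes = pack_bits(binarize(floats))
print(codes.shape)                     # (4, 96)
print(floats.nbytes // codes.nbytes)   # 32
```

Each float32 value takes 32 bits while a binary value takes one bit, which is where the 32x factor comes from.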

We currently support the following memory efficient dense retriever model architectures:

BPR: Binary Passage Retriever (ACL 2021)
JPQ: Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance (CIKM 2021)

For more information, check out our publication:

Domain Adaptation for Memory-Efficient Dense Retrieval (arXiv preprint)

:dollar: Installation

One can either install income via pip

pip install income

or via source using git clone

$ git clone https://github.com/Nthakur20/income.git
$ cd income
$ pip install -e .

With that, you should be ready to go!

:dollar: Models Supported

We currently support training and inference of these compressed dense retrievers within our repository. We compare the performance and cost of hosting these models below:

| | Backbone | MSMARCO | BEIR | Memory Size | Query Time | GCP Cloud | Cost per Month (in $) |
|:---:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| No Compression | | | | | | | |
| TAS-B (Hofstatter et al., 2021) | TAS-B | 0.408 | 0.415 | 65 GB (1x) | 456.9 ms | n2-highmem-8 | $306.05 |
| TAS-B + HNSW (Hofstatter et al., 2021) | TAS-B | 0.408 | 0.415 | 151 GB (1x) | 1.8 ms | n2-highmem-32 | $1224.19 |
| TAS-B + PQ (Hofstatter et al., 2021) | TAS-B | 0.358 | 0.361 | 2 GB (32x) | 44.0 ms | n1-standard-1 | $24.27 |
| Supervised Compression: BPR | | | | | | | |
| BPR (TAS-B) (Thakur et al., 2022) | TAS-B | 0.397 | 0.357 | 2.2 GB (32x) | 38.1 ms | n1-standard-1 | $24.27 |
| BPR+GenQ (TAS-B) (Thakur et al., 2022) | TAS-B | 0.397 | 0.377 | 2.2 GB (32x) | 38.1 ms | n1-standard-1 | $24.27 |
| BPR+GPL (TAS-B) (Thakur et al., 2022) | TAS-B | 0.397 | 0.398 | 2.2 GB (32x) | 38.1 ms | n1-standard-1 | $24.27 |
| Supervised Compression: JPQ | | | | | | | |
| JPQ (TAS-B) (Thakur et al., 2022) | TAS-B (query) (doc) | 0.400 | 0.402 | 2.2 GB (32x) | 44.0 ms | n1-standard-1 | $24.27 |
| JPQ+GenQ (TAS-B) (Thakur et al., 2022) | TAS-B (query) (doc) | 0.400 | 0.417 | 2.2 GB (32x) | 44.0 ms | n1-standard-1 | $24.27 |
| JPQ+GPL (TAS-B) (Thakur et al., 2022) | TAS-B (query) (doc) | 0.400 | 0.435 | 2.2 GB (32x) | 44.0 ms | n1-standard-1 | $24.27 |

The scores denote the NDCG@10 performance of each model. The index sizes and costs are estimated for a user who wants to build a semantic search engine over the English Wikipedia, which contains about 21 million passages to encode. Using float32 (and no further compression techniques) and 768 dimensions, the resulting embeddings have a size of about 65 GB. The n2-highmem-8 server provides up to 64 GB of memory, whereas the n1-standard-1 server provides up to 3.75 GB of memory.
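The size estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (21 million passages, 768 dimensions, float32) and assumes one bit per dimension for the binary case. It counts raw embedding bytes only, so it ignores any index overhead or metadata that the figures in the table may include.

```python
NUM_PASSAGES = 21_000_000   # English Wikipedia passages, as quoted above
DIM = 768                   # embedding dimensions

float32_bytes = NUM_PASSAGES * DIM * 4    # 4 bytes per float32 value
binary_bytes = NUM_PASSAGES * DIM // 8    # 1 bit per value, 8 values per byte

print(f"float32: {float32_bytes / 1e9:.1f} GB")         # float32: 64.5 GB
print(f"binary : {binary_bytes / 1e9:.1f} GB")          # binary : 2.0 GB
print(f"ratio  : {float32_bytes // binary_bytes}x")     # ratio  : 32x
```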

:dollar: Easily compress your dense retriever

Our technique can easily wrap around any Hugging Face (HF) based dense retriever and convert it into a BPR or JPQ based model. Overall, we find that the better the backbone dense retriever generalizes, the better the resulting BPR and JPQ models. We recently converted the models below and made them publicly available on HF. In case you wish to convert your own model, open a pull request or follow the scripts below for BPR and JPQ separately.

| | Backbone | MSMARCO | BEIR | Memory Size | Query Time | GCP Cloud | Cost per Month (in $) |
|:---:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| Supervised Compression | | | | | | | |
| BPR (Contriever) (Izacard et al., 2021) | Contriever | 0.407 | 0.367 | 2.2 GB (32x) | 38.1 ms | n1-standard-1 | $24.27 |
| BPR (DPR) (Yamada et al., 2021) | NQ (DPR) | 0.130 | 0.201 | 2.2 GB (32x) | 38.1 ms | n1-standard-1 | $24.27 |
| JPQ (STAR) (Zhan et al., 2021) | STAR (query) (doc) | 0.402 | 0.389 | 2.2 GB (32x) | 44.0 ms | n1-standard-1 | $24.27 |

:dollar: Using the INCOME library

The income library can be used to learn various vector compression strategies for information retrieval.

:dollar: BPR Model (Getting Started)

This section introduces a few quick examples to train and evaluate BPR models on any custom data you wish to search.

Training using GPL: Generative Pseudo Labeling

export dataset="nfcorpus"

python -m income.bpr.train \
    --path_to_generated_data "generated/$dataset" \
    --base_ckpt "msmarco-distilbert-base-tas-b" \
    --gpl_score_function "dot" \
    --batch_size_gpl 32 \
    --gpl_steps 10000 \
    --new_size -1 \
    --queries_per_passage -1 \
    --output_dir "output/$dataset" \
    --generator "BeIR/query-gen-msmarco-t5-base-v1" \
    --retrievers "msmarco-distilbert-base-tas-b" "msmarco-distilbert-base-v3" "msmarco-MiniLM-L-6-v3" \
    --retriever_score_functions "dot" "cos_sim" "cos_sim" \
    --cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2" \
    --qgen_prefix "gen-t5-base-2-epoch-default-lr-3-ques" \
    --evaluation_data "./$dataset" \
    --evaluation_output "evaluation/$dataset" \
    --do_evaluation \
    --use_amp   # Use this for efficient training if the machine supports AMP

Inference

import os
import pathlib

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import BinaryFaissSearch
from income.bpr.model import BPR

# Download and unzip the BEIR dataset into a "datasets" folder next to this script
dataset = "nfcorpus"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join(pathlib.Path(__file__).parent.absolute(), "datasets")
data_path = util.download_and_unzip(url, out_dir)

# Load the test split and wrap the BPR model in a binary Faiss search
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
faiss_search = BinaryFaissSearch(BPR("income/bpr-base-msmarco-distilbert-tas-b"), batch_size=128)

# Retrieve, then compute NDCG, MAP, recall and precision at retriever.k_values
retriever = EvaluateRetrieval(faiss_search, score_function="dot")
results = retriever.retrieve(corpus, queries, rerank=True, binary_k=1000)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
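The `rerank=True` and `binary_k=1000` arguments in the `retrieve` call above suggest a two-stage search, as described in the BPR paper: first fetch `binary_k` candidates by Hamming distance between binary codes, then rerank them with the float query embedding. The NumPy sketch below illustrates that idea under these assumptions. It is not the BEIR or income implementation, and the function names `hamming_candidates` and `rerank`, the random data and the 64-dimensional size are hypothetical.

```python
import numpy as np


def hamming_candidates(query_bits: np.ndarray, corpus_bits: np.ndarray, binary_k: int) -> np.ndarray:
    """Return indices of the binary_k passages closest to the query in Hamming distance."""
    diff = np.bitwise_xor(corpus_bits, query_bits)        # differing bits, still packed
    dist = np.unpackbits(diff, axis=1).sum(axis=1)        # count differing bits per passage
    return np.argsort(dist, kind="stable")[:binary_k]


def rerank(query_float: np.ndarray, corpus_bits: np.ndarray, candidates: np.ndarray, top_k: int):
    """Score candidates by the dot product of the float query with their +1/-1 codes."""
    signs = np.unpackbits(corpus_bits[candidates], axis=1).astype(np.float32) * 2 - 1
    scores = signs @ query_float
    order = np.argsort(-scores, kind="stable")[:top_k]
    return candidates[order], scores[order]


# Hypothetical toy data: 1000 passages with 64-dimensional embeddings.
rng = np.random.default_rng(0)
corpus_float = rng.standard_normal((1000, 64)).astype(np.float32)
corpus_bits = np.packbits(corpus_float > 0, axis=1)       # shape (1000, 8)

query_float = corpus_float[3]                             # query identical to passage 3
query_bits = np.packbits(query_float > 0)                 # shape (8,)

candidates = hamming_candidates(query_bits, corpus_bits, binary_k=100)
ids, scores = rerank(query_float, corpus_bits, candidates, top_k=5)
print(ids[0])  # 3
```

Only the packed binary codes of the corpus are kept in memory; the float precision comes from the query side alone, which is why the index stays small.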

:dollar: JPQ Model (Getting Started)

This section introduces a few quick examples to train and evaluate JPQ models on any custom data you wish to search.

Training using GPL: Generative Pseudo Labeling

Training using JPQ and GPL occurs in four steps:

Preprocess Dataset to JPQ-friendly format

export dataset="nfcorpus"
export PREFIX="gen"

python -m income.jpq.beir.transform \
          --dataset ${dataset} \
          --output_dir "./datasets/${dataset}" \
          --prefix ${PREFIX}

Preprocessing Script tokenizes the queries and corpus

CUDA_VISIBLE_DEVICES=0 python -m income.jpq.preprocess \
                                --data_dir "./datasets/${dataset}" \
                                --out_data_dir "./preprocessed/${dataset}"

INIT script trains the IVFPQ corpus faiss index

CUDA_VISIBLE_DEVICES=0 python -m income.jpq.init \
  --preprocess_dir "./preprocessed/${dataset}" \
  --model_dir "sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco" \
  --max_doc_length 350 \
  --output_dir "./init/${dataset}" \
  --subvector_num 96
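As a rough illustration of what `--subvector_num 96` implies, the following NumPy sketch encodes 768-dimensional vectors with product quantization: each vector is split into 96 subvectors of 8 dimensions, and each subvector is replaced by the 8-bit id of its nearest centroid. This is a simplified sketch and not the JPQ training code. The function names are hypothetical, and random centroids stand in for trained ones purely for illustration.

```python
import numpy as np


def pq_encode(vectors: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """vectors: (N, D); codebooks: (M, K, D // M). Returns (N, M) uint8 centroid ids."""
    n, d = vectors.shape
    m, k, sub_dim = codebooks.shape
    assert d == m * sub_dim and k <= 256
    subvectors = vectors.reshape(n, m, sub_dim)
    # Squared distance from each subvector to every centroid of its own codebook: (N, M, K)
    dists = ((subvectors[:, :, None, :] - codebooks[None]) ** 2).sum(-1)
    return dists.argmin(-1).astype(np.uint8)


def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Rebuild approximate vectors by concatenating the selected centroids."""
    m = codebooks.shape[0]
    return np.concatenate([codebooks[j, codes[:, j]] for j in range(m)], axis=1)


rng = np.random.default_rng(0)
M, K, DIM = 96, 256, 768   # 96 subvectors, 2**8 centroids each, 768-dim input
# Random centroids stand in for trained ones (assumption for illustration only).
codebooks = rng.standard_normal((M, K, DIM // M)).astype(np.float32)
vectors = rng.standard_normal((10, DIM)).astype(np.float32)

codes = pq_encode(vectors, codebooks)
print(codes.shape)                       # (10, 96)
print(vectors.nbytes // codes.nbytes)    # 32
```

With 96 one-byte codes per passage instead of 768 float32 values (3072 bytes), the raw storage shrinks by a factor of 32, in line with the `PQ96x8` part of the index name used in the training step.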

TRAIN script trains TAS-B using Generative Pseudo Labeling (GPL)

CUDA_VISIBLE_DEVICES=0 python -m income.jpq.train_gpl \
    --preprocess_dir "./preprocessed/${dataset}" \
    --model_save_dir "./final_models/${dataset}/gpl" \
    --log_dir "./logs/${dataset}/log" \
    --init_index_path "./init/${dataset}/OPQ96,IVF1,PQ96x8.index" \
    --init_model_path "sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco" \
    --data_path "./datasets/${dataset}" \
    --cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2"
