RocketQA
🚀 RocketQA: dense retrieval for information retrieval and question answering, including state-of-the-art Chinese and English models.
In recent years, dense retrievers based on pre-trained language models have achieved remarkable progress. To help more developers apply these cutting-edge technologies, this repository provides an easy-to-use toolkit for running and fine-tuning state-of-the-art dense retrievers, namely 🚀RocketQA. The toolkit has the following advantages:
- State-of-the-art: 🚀RocketQA provides our well-trained models, which achieve SOTA performance on many dense retrieval datasets, and we will continue to release the latest models.
- First-Chinese-model: 🚀RocketQA provides the first open-source Chinese dense retrieval model, trained on millions of manually annotated examples from DuReader.
- Easy-to-use: By integrating this toolkit with JINA, 🚀RocketQA lets developers build an end-to-end retrieval and question answering system with just a few lines of code. <img src="https://github.com/PaddlePaddle/RocketQA/blob/main/RocketQA_flow.png" alt="" align=center />
News
- 🎉 Nov 27, 2022: Our survey on dense retrieval, Dense Text Retrieval based on Pretrained Language Models: A Survey, is now publicly available.
- Oct 8, 2022: DuReader<sub>retrieval</sub> was accepted by EMNLP 2022. [data]; The latest version of DuReader<sub>retrieval</sub> contains cross-lingual retrieval benchmarks. Stay tuned!
- Apr 29, 2022: Training function is added to RocketQA toolkit. And the baseline models of DuReader<sub>retrieval</sub> (both cross encoder and dual encoder) are available in RocketQA models.
- Mar 30, 2022: We released DuReader<sub>retrieval</sub>, a large-scale Chinese benchmark for passage retrieval. The dataset contains over 90K questions and 8M passages from Baidu Search. [paper] [data] ; The baseline of DuReader<sub>retrieval</sub> leaderboard was also released. [code/model]
- Dec 3, 2021: The dense retriever toolkit RocketQA was released, including the first Chinese dense retrieval model, trained on DuReader.
- Aug 26, 2021: RocketQA v2 was accepted by EMNLP 2021. [code/model]
- May 5, 2021: PAIR was accepted by ACL 2021. [code/model]
- Mar 11, 2021: RocketQA v1 was accepted by NAACL 2021. [code/model]
Installation
We provide two installation methods: a Python package and a Docker environment.
Install with Python Package
First, install PaddlePaddle.
# GPU version:
$ pip install paddlepaddle-gpu
# CPU version:
$ pip install paddlepaddle
Second, install the rocketqa package (latest version: 1.1.0):
$ pip install rocketqa
NOTE: this toolkit requires Python 3.6+ and PaddlePaddle 2.0+.
Install with Docker
docker pull rocketqa/rocketqa
docker run -it docker.io/rocketqa/rocketqa bash
Getting Started
Following the examples below, you can build and run your own search engine with a few lines of code. We also provide a Playground with Jupyter Notebook. Try 🚀RocketQA straight away in your browser!
Running with JINA
JINA is a cloud-native neural search framework for building SOTA, scalable deep-learning search applications in minutes. Here is a simple example that builds a search engine based on JINA and RocketQA.
cd examples/jina_example
pip3 install -r requirements.txt
# Generate vector representations and build a library for your documents
# JINA will automatically start a web service for you
python3 app.py index toy_data/test.tsv
# Try some questions related to the indexed Documents
python3 app.py query_cli
Please view the JINA example to learn more.
Running with FAISS
We also provide a simple example built on Faiss.
cd examples/faiss_example/
pip3 install -r requirements.txt
# Generate vector representations and build a library for your documents
python3 index.py zh ../data/dureader.para test_index
# Start a web service on http://localhost:8888/rocketqa
python3 rocketqa_service.py zh ../data/dureader.para test_index
# Try some questions related to the indexed Documents
python3 query.py
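Under the hood, index.py stores the passage vectors as a matrix and query.py looks up the nearest ones by inner product; Faiss accelerates exactly this operation. A minimal, library-free sketch of that lookup using brute-force numpy (the 2-dimensional vectors are made up for illustration):

```python
import numpy as np

def build_index(para_vecs):
    """Stack passage vectors into one matrix -- the vector 'library'."""
    return np.asarray(para_vecs, dtype=np.float32)

def search(index, query_vec, top_k=3):
    """Return (position, score) pairs ranked by inner product with the query."""
    scores = index @ np.asarray(query_vec, dtype=np.float32)
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

index = build_index([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(search(index, [1.0, 0.1], top_k=2))
```

In practice the rows would be the vectors produced by encode_para and the query vector would come from encode_query.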
API
You can also easily integrate 🚀RocketQA into your own task. We provide two types of models: an ERNIE-based dual encoder for answer retrieval and an ERNIE-based cross encoder for answer re-ranking. To run our models, use the following functions.
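The two model types are typically combined in a retrieve-then-rerank pipeline: the cheap dual encoder narrows the corpus, then the expensive cross encoder re-orders the survivors. A minimal sketch of that flow, with the scoring functions passed in as plain callables (in practice the rocketqa model objects would play these roles); the word-overlap scorer is just a toy stand-in:

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    retrieve_score: Callable[[str, str], float],  # dual-encoder-style score
    rerank_score: Callable[[str, str], float],    # cross-encoder-style score
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: keep the top_k passages under the cheap retrieval score.
    candidates = sorted(corpus, key=lambda p: retrieve_score(query, p), reverse=True)[:top_k]
    # Stage 2: re-order the survivors with the re-ranking score.
    return sorted(
        ((p, rerank_score(query, p)) for p in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )

def overlap(q: str, p: str) -> float:
    """Toy scorer: number of shared words."""
    return float(len(set(q.split()) & set(p.split())))

corpus = ["the cat sat", "dogs bark loudly", "a cat and a dog"]
print(retrieve_then_rerank("cat dog", corpus, overlap, overlap, top_k=2))
```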
Load model
rocketqa.available_models()
Returns the names of the available RocketQA models. For details about each model, please see the code comments.
rocketqa.load_model(model, use_cuda=False, device_id=0, batch_size=1)
Returns the model specified by the input parameters. It can initialize both dual encoders and cross encoders. Via the model parameter, you can load either a RocketQA model returned by "available_models()" or your own checkpoint.
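The released model names appear to encode the model type in their suffix: "_de" for dual encoders and "_ce" for cross encoders (e.g. v1_marco_de in the example below). Assuming that naming convention holds, a small helper to split the list returned by available_models():

```python
def split_by_type(model_names):
    """Partition model names into dual ('_de') and cross ('_ce') encoders.
    Assumes the RocketQA suffix convention described above."""
    dual = [m for m in model_names if m.endswith("_de")]
    cross = [m for m in model_names if m.endswith("_ce")]
    return dual, cross

# Hypothetical sample; in practice pass rocketqa.available_models().
dual, cross = split_by_type(["v1_marco_de", "v1_marco_ce", "zh_dureader_de"])
print(dual, cross)
```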
Dual encoder
The dual encoder returned by "load_model()" supports the following functions:
model.encode_query(query: List[str])
Given a list of queries, returns their representation vectors encoded by the model.
model.encode_para(para: List[str], title: List[str])
Given a list of paragraphs and, optionally, their corresponding titles, returns their representation vectors encoded by the model.
model.matching(query: List[str], para: List[str], title: List[str])
Given a list of queries and paragraphs (and, optionally, titles), returns their matching scores (the dot product of the two representation vectors).
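Because the score is just a dot product, matching is equivalent to encoding both sides yourself and multiplying the resulting matrices. A numpy sketch of that equivalence, with made-up 4-dimensional vectors standing in for real encode_query/encode_para output:

```python
import numpy as np

# Stand-ins for encode_query() / encode_para() output: one row per text.
q_embs = np.array([[0.1, 0.3, 0.5, 0.2]])
p_embs = np.array([[0.4, 0.1, 0.2, 0.6],
                   [0.0, 0.9, 0.1, 0.0]])

# One score per (query, para) pair: the dot product of the two vectors.
scores = q_embs @ p_embs.T
print(scores)  # shape (1, 2): one query scored against two paragraphs
```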
model.train(train_set: str, epoch: int, save_model_path: str, args)
Given the hyperparameters train_set, epoch and save_model_path, you can train your own dual encoder or fine-tune our models. Other settings such as save_steps and learning_rate can also be set in args. Please refer to examples/example.py for details.
Cross encoder
The cross encoder returned by "load_model()" supports the following functions:
model.matching(query: List[str], para: List[str], title: List[str])
Given a list of queries and paragraphs (and, optionally, titles), returns their matching scores (the probability that the paragraph correctly answers the query).
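These probabilities are typically used to re-order the candidates produced by the dual encoder. A minimal re-ranking helper, with hard-coded scores standing in for cross-encoder matching output:

```python
def rerank(paras, scores):
    """Sort paragraphs by cross-encoder score, highest first."""
    return [p for p, _ in sorted(zip(paras, scores), key=lambda x: -x[1])]

paras = ["para A", "para B", "para C"]
scores = [0.12, 0.93, 0.40]   # stand-ins for model.matching() output
print(rerank(paras, scores))  # ['para B', 'para C', 'para A']
```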
model.train(train_set: str, epoch: int, save_model_path: str, args)
Given the hyperparameters train_set, epoch and save_model_path, you can train your own cross encoder or fine-tune our models. Other settings such as save_steps and learning_rate can also be set in args. Please refer to examples/example.py for details.
Examples
Following the examples below, you can retrieve the vector representations of your documents and connect 🚀RocketQA to your own tasks.
Run RocketQA Model
To run RocketQA models, set the parameter model in 'load_model()' to a RocketQA model name returned by 'available_models()'.
import rocketqa
query_list = ["trigeminal definition"]
para_list = [
"Definition of TRIGEMINAL. : of or relating to the trigeminal nerve.ADVERTISEMENT. of or relating to the trigeminal nerve. ADVERTISEMENT."]
# init dual encoder
dual_encoder = rocketqa.load_model(model="v1_marco_de", use_cuda=True, device_id=0, batch_size=16)
# encode query & para
q_embs = dual_encoder.encode_query(query=query_list)
p_embs = dual_encoder.encode_para(para=para_list)
# compute the matching score (dot product of query & para representations)
dot_products = dual_encoder.matching(query=query_list, para=para_list)