RocketQA
🚀 RocketQA: dense retrieval for information retrieval and question answering, including state-of-the-art Chinese and English models.
In recent years, dense retrievers based on pre-trained language models have achieved remarkable progress. To help more developers apply these cutting-edge technologies, this repository provides an easy-to-use toolkit for running and fine-tuning state-of-the-art dense retrievers, namely 🚀RocketQA. The toolkit has the following advantages:
- State-of-the-art: 🚀RocketQA provides our well-trained models, which achieve SOTA performance on many dense retrieval datasets, and we will continue to release the latest models.
- First-Chinese-model: 🚀RocketQA provides the first open-source Chinese dense retrieval model, trained on millions of manually annotated examples from DuReader.
- Easy-to-use: By integrating this toolkit with JINA, 🚀RocketQA lets developers build an end-to-end retrieval and question answering system with just a few lines of code. <img src="https://github.com/PaddlePaddle/RocketQA/blob/main/RocketQA_flow.png" alt="" align=center />
News
- 🎉 Nov 27, 2022: Our survey on dense retrieval, Dense Text Retrieval based on Pretrained Language Models: A Survey, is now publicly available.
- Oct 8, 2022: DuReader<sub>retrieval</sub> was accepted by EMNLP 2022. [data]; The latest version of DuReader<sub>retrieval</sub> contains cross-lingual retrieval benchmarks. Stay tuned!
- Apr 29, 2022: Training function is added to RocketQA toolkit. And the baseline models of DuReader<sub>retrieval</sub> (both cross encoder and dual encoder) are available in RocketQA models.
- Mar 30, 2022: We released DuReader<sub>retrieval</sub>, a large-scale Chinese benchmark for passage retrieval. The dataset contains over 90K questions and 8M passages from Baidu Search. [paper] [data] ; The baseline of DuReader<sub>retrieval</sub> leaderboard was also released. [code/model]
- Dec 3, 2021: The dense retriever toolkit RocketQA was released, including the first Chinese dense retrieval model, trained on DuReader.
- Aug 26, 2021: RocketQA v2 was accepted by EMNLP 2021. [code/model]
- May 5, 2021: PAIR was accepted by ACL 2021. [code/model]
- Mar 11, 2021: RocketQA v1 was accepted by NAACL 2021. [code/model]
Installation
We provide two installation methods: a Python package and a Docker environment.
Install with Python Package
First, install PaddlePaddle.
# GPU version:
$ pip install paddlepaddle-gpu
# CPU version:
$ pip install paddlepaddle
Second, install the rocketqa package (latest version: 1.1.0):
$ pip install rocketqa
NOTE: this toolkit requires Python 3.6+ and PaddlePaddle 2.0+.
Install with Docker
docker pull rocketqa/rocketqa
docker run -it docker.io/rocketqa/rocketqa bash
Getting Started
Following the examples below, you can build and run your own search engine with a few lines of code. We also provide a Playground with Jupyter Notebook. Try 🚀RocketQA straight away in your browser!
Running with JINA
JINA is a cloud-native neural search framework for building SOTA, scalable deep-learning search applications in minutes. Here is a simple example that builds a search engine based on JINA and RocketQA.
cd examples/jina_example
pip3 install -r requirements.txt
# Generate vector representations and build a library for your documents
# JINA will automatically start a web service for you
python3 app.py index toy_data/test.tsv
# Try some questions related to the indexed Documents
python3 app.py query_cli
Please view the JINA example to learn more.
Running with FAISS
We also provide a simple example built on Faiss.
cd examples/faiss_example/
pip3 install -r requirements.txt
# Generate vector representations and build a library for your documents
python3 index.py zh ../data/dureader.para test_index
# Start a web service on http://localhost:8888/rocketqa
python3 rocketqa_service.py zh ../data/dureader.para test_index
# Try some questions related to the indexed Documents
python3 query.py
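Under the hood, index.py stores the passage vectors as a matrix and query.py looks up the nearest ones by inner product; Faiss accelerates exactly this operation. A minimal, library-free sketch of that lookup using brute-force numpy (the 2-dimensional vectors are made up for illustration):

```python
import numpy as np

def build_index(para_vecs):
    """Stack passage vectors into one matrix -- the vector 'library'."""
    return np.asarray(para_vecs, dtype=np.float32)

def search(index, query_vec, top_k=3):
    """Return (position, score) pairs ranked by inner product with the query."""
    scores = index @ np.asarray(query_vec, dtype=np.float32)
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

index = build_index([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(search(index, [1.0, 0.1], top_k=2))
```

In practice the rows would be the vectors produced by encode_para and the query vector would come from encode_query.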
API
You can also easily integrate 🚀RocketQA into your own task. We provide two types of models: an ERNIE-based dual encoder for answer retrieval and an ERNIE-based cross encoder for answer re-ranking. To run our models, use the following functions.
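The two model types are typically combined in a retrieve-then-rerank pipeline: the cheap dual encoder narrows the corpus, then the expensive cross encoder re-orders the survivors. A minimal sketch of that flow, with the scoring functions passed in as plain callables (in practice the rocketqa model objects would play these roles); the word-overlap scorer is just a toy stand-in:

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    retrieve_score: Callable[[str, str], float],  # dual-encoder-style score
    rerank_score: Callable[[str, str], float],    # cross-encoder-style score
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: keep the top_k passages under the cheap retrieval score.
    candidates = sorted(corpus, key=lambda p: retrieve_score(query, p), reverse=True)[:top_k]
    # Stage 2: re-order the survivors with the re-ranking score.
    return sorted(
        ((p, rerank_score(query, p)) for p in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )

def overlap(q: str, p: str) -> float:
    """Toy scorer: number of shared words."""
    return float(len(set(q.split()) & set(p.split())))

corpus = ["the cat sat", "dogs bark loudly", "a cat and a dog"]
print(retrieve_then_rerank("cat dog", corpus, overlap, overlap, top_k=2))
```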
Load model
rocketqa.available_models()
Returns the names of the available RocketQA models. For details about each model, please see the code comments.
rocketqa.load_model(model, use_cuda=False, device_id=0, batch_size=1)
Returns the model specified by the input parameters. It can initialize both dual encoders and cross encoders. Via the model parameter, you can load either a RocketQA model returned by "available_models()" or your own checkpoint.
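The released model names appear to encode the model type in their suffix: "_de" for dual encoders and "_ce" for cross encoders (e.g. v1_marco_de in the example below). Assuming that naming convention holds, a small helper to split the list returned by available_models():

```python
def split_by_type(model_names):
    """Partition model names into dual ('_de') and cross ('_ce') encoders.
    Assumes the RocketQA suffix convention described above."""
    dual = [m for m in model_names if m.endswith("_de")]
    cross = [m for m in model_names if m.endswith("_ce")]
    return dual, cross

# Hypothetical sample; in practice pass rocketqa.available_models().
dual, cross = split_by_type(["v1_marco_de", "v1_marco_ce", "zh_dureader_de"])
print(dual, cross)
```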
Dual encoder
The dual encoder returned by "load_model()" supports the following functions:
model.encode_query(query: List[str])
Given a list of queries, returns their representation vectors encoded by the model.
model.encode_para(para: List[str], title: List[str])
Given a list of paragraphs and, optionally, their corresponding titles, returns their representation vectors encoded by the model.
model.matching(query: List[str], para: List[str], title: List[str])
Given a list of queries and paragraphs (and, optionally, titles), returns their matching scores (the dot product of the two representation vectors).
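Because the score is just a dot product, matching is equivalent to encoding both sides yourself and multiplying the resulting matrices. A numpy sketch of that equivalence, with made-up 4-dimensional vectors standing in for real encode_query/encode_para output:

```python
import numpy as np

# Stand-ins for encode_query() / encode_para() output: one row per text.
q_embs = np.array([[0.1, 0.3, 0.5, 0.2]])
p_embs = np.array([[0.4, 0.1, 0.2, 0.6],
                   [0.0, 0.9, 0.1, 0.0]])

# One score per (query, para) pair: the dot product of the two vectors.
scores = q_embs @ p_embs.T
print(scores)  # shape (1, 2): one query scored against two paragraphs
```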
model.train(train_set: str, epoch: int, save_model_path: str, args)
Given the hyperparameters train_set, epoch and save_model_path, you can train your own dual encoder or fine-tune our models. Other settings such as save_steps and learning_rate can also be set in args. Please refer to examples/example.py for details.
Cross encoder
The cross encoder returned by "load_model()" supports the following functions:
model.matching(query: List[str], para: List[str], title: List[str])
Given a list of queries and paragraphs (and, optionally, titles), returns their matching scores (the probability that the paragraph correctly answers the query).
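These probabilities are typically used to re-order the candidates produced by the dual encoder. A minimal re-ranking helper, with hard-coded scores standing in for cross-encoder matching output:

```python
def rerank(paras, scores):
    """Sort paragraphs by cross-encoder score, highest first."""
    return [p for p, _ in sorted(zip(paras, scores), key=lambda x: -x[1])]

paras = ["para A", "para B", "para C"]
scores = [0.12, 0.93, 0.40]   # stand-ins for model.matching() output
print(rerank(paras, scores))  # ['para B', 'para C', 'para A']
```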
model.train(train_set: str, epoch: int, save_model_path: str, args)
Given the hyperparameters train_set, epoch and save_model_path, you can train your own cross encoder or fine-tune our models. Other settings such as save_steps and learning_rate can also be set in args. Please refer to examples/example.py for details.
Examples
Following the examples below, you can retrieve the vector representations of your documents and connect 🚀RocketQA to your own tasks.
Run RocketQA Model
To run RocketQA models, set the parameter model in 'load_model()' to a RocketQA model name returned by 'available_models()'.
import rocketqa
query_list = ["trigeminal definition"]
para_list = [
"Definition of TRIGEMINAL. : of or relating to the trigeminal nerve.ADVERTISEMENT. of or relating to the trigeminal nerve. ADVERTISEMENT."]
# init dual encoder
dual_encoder = rocketqa.load_model(model="v1_marco_de", use_cuda=True, device_id=0, batch_size=16)
# encode query & para
q_embs = dual_encoder.encode_query(query=query_list)
p_embs = dual_encoder.encode_para(para=para_list)
# compute the matching score (dot product of query & para representations)
dot_products = dual_encoder.matching(query=query_list, para=para_list)