Query Performance Prediction for Conversational Search (QPP4CS)

This is the repository for the papers:

Query Performance Prediction: From Ad-hoc to Conversational Search (SIGIR 2023)
Performance Prediction for Conversational Search Using Perplexities of Query Rewrites (QPP++ 2023)

The repository implements a comprehensive collection of pre- and post-retrieval query performance prediction (QPP) methods, all integrated within a unified Python/PyTorch framework. It is intended for anyone conducting research into QPP for ad-hoc or conversational search.

We kindly ask you to cite our papers if you find this repository useful:

@inproceedings{meng2023query,
 author = {Meng, Chuan and Arabzadeh, Negar and Aliannejadi, Mohammad and de Rijke, Maarten},
 title = {Query Performance Prediction: From Ad-hoc to Conversational Search},
 booktitle = {Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 pages = {2583--2593},
 year = {2023},
 url = {https://doi.org/10.1145/3539618.3591919},
 doi = {10.1145/3539618.3591919},
}

@inproceedings{meng2023Performance,
 author = {Meng, Chuan and Aliannejadi, Mohammad and de Rijke, Maarten},
 title = {Performance Prediction for Conversational Search Using Perplexities of Query Rewrites},
 booktitle = {Proceedings of the QPP++ 2023: Query Performance Prediction and Its Evaluation in New Tasks Workshop co-located with The 45th European Conference on Information Retrieval},
 year = {2023},
 pages = {25--28}
}

This repository allows the replication of all results reported in the papers. In particular, it is organized as follows:

Prerequisites
Data Preparation
Replicating Results
Plots

Note that, for ease of use, we have already uploaded the predicted performance files for all QPP methods reported in our paper. See here.

Prerequisites

We recommend running everything in a Linux environment. Create a conda environment with all required packages and activate it with the following commands:

$ conda env create -f environment.yaml
$ conda activate QPP4CS

Data Preparation

Query performance prediction for conversational search needs query rewrites, retrieval run files and actual performance files. To this end, we need to download raw dataset files, conduct preprocessing, build indexes, perform query rewriting, perform retrieval, and generate actual performance files.

For ease of use, you can directly download the dataset folder here, which contains the preprocessed qrels files, query rewrites, retrieval run files and actual performance files for the CAsT-19, CAsT-20 and OR-QuAC datasets; put the unzipped dataset folder in the current folder. The raw files for CAsT-19, CAsT-20 and OR-QuAC are not included and need to be downloaded by the following procedure. The collections and indexes for these datasets are too large to include, so they also need to be produced by the following procedure.

Raw File Download

CAsT-19 & CAsT-20

CAsT-19 and CAsT-20 share the same collection. Use the following commands to download the collection files (the MS MARCO Passage Ranking collection, the TREC CAR paragraph collection v2.0 and the MARCO duplicate file) of CAsT-19 & CAsT-20:

mkdir -p datasets/cast-19-20/raw
wget -P datasets/cast-19-20/raw https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz
wget -P datasets/cast-19-20/raw http://trec-car.cs.unh.edu/datareleases/v2.0/paragraphCorpus.v2.0.tar.xz
wget -P datasets/cast-19-20/raw http://boston.lti.cs.cmu.edu/Services/treccast19/duplicate_list_v1.0.txt
tar zxvf datasets/cast-19-20/raw/collection.tar.gz -C datasets/cast-19-20/raw/
tar xvJf datasets/cast-19-20/raw/paragraphCorpus.v2.0.tar.xz -C datasets/cast-19-20/raw/
mv datasets/cast-19-20/raw/collection.tsv datasets/cast-19-20/raw/msmarco.tsv

These files are stored in ./datasets/cast-19-20/raw.

Use the following commands to download query and qrels files of CAsT-19 & CAsT-20:

wget -P datasets/cast-19-20/raw https://raw.githubusercontent.com/daltonj/treccastweb/master/2019/data/evaluation/evaluation_topics_v1.0.json
wget -P datasets/cast-19-20/raw https://raw.githubusercontent.com/daltonj/treccastweb/master/2019/data/evaluation/evaluation_topics_annotated_resolved_v1.0.tsv
wget -P datasets/cast-19-20/raw https://trec.nist.gov/data/cast/2019qrels.txt --no-check-certificate
wget -P datasets/cast-19-20/raw https://raw.githubusercontent.com/daltonj/treccastweb/master/2020/2020_automatic_evaluation_topics_v1.0.json
wget -P datasets/cast-19-20/raw https://raw.githubusercontent.com/daltonj/treccastweb/master/2020/2020_manual_evaluation_topics_v1.0.json
wget -P datasets/cast-19-20/raw https://trec.nist.gov/data/cast/2020qrels.txt --no-check-certificate

These files are stored in ./datasets/cast-19-20/raw.
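
Because the downloads above are large and some servers can be slow, it may help to confirm that everything arrived before moving on to preprocessing. The following is a minimal sketch, not part of this repository: the function name `missing_files` and the list `CAST_RAW_FILES` are hypothetical, and the file names are taken from the download commands above. The extracted TREC CAR corpus is not checked, because its extracted path is not stated here.

```python
from pathlib import Path

# File names as produced by the wget/tar/mv commands above.
CAST_RAW_FILES = [
    "msmarco.tsv",
    "duplicate_list_v1.0.txt",
    "evaluation_topics_v1.0.json",
    "evaluation_topics_annotated_resolved_v1.0.tsv",
    "2019qrels.txt",
    "2020_automatic_evaluation_topics_v1.0.json",
    "2020_manual_evaluation_topics_v1.0.json",
    "2020qrels.txt",
]


def missing_files(raw_dir, expected=CAST_RAW_FILES):
    """Return the expected file names that are absent or empty in raw_dir."""
    raw_dir = Path(raw_dir)
    missing = []
    for name in expected:
        path = raw_dir / name
        # Treat zero-byte files as missing: they usually mean a failed download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_files("datasets/cast-19-20/raw")` returns an empty list when every listed file is present and non-empty.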

OR-QuAC

Use the following commands to download the collection, query and qrels files of OR-QuAC:

mkdir -p datasets/or-quac/raw
wget -P datasets/or-quac/raw https://ciir.cs.umass.edu/downloads/ORConvQA/all_blocks.txt.gz --no-check-certificate
wget -P datasets/or-quac/raw https://ciir.cs.umass.edu/downloads/ORConvQA/qrels.txt.gz --no-check-certificate
gzip -d datasets/or-quac/raw/*.txt.gz
wget -P datasets/or-quac/raw https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/train.txt --no-check-certificate
wget -P datasets/or-quac/raw https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/dev.txt --no-check-certificate
wget -P datasets/or-quac/raw https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/test.txt --no-check-certificate

These files are stored in ./datasets/or-quac/raw.

Preprocessing

CAsT-19 & CAsT-20

Note that our preprocessing follows Yu et al. Use the following command to preprocess the collection, query and qrels files of CAsT-19 and CAsT-20:

python preprocess.py --dataset cast-19-20

The preprocessed collection file is stored in datasets/cast-19-20/jsonl, and the qrels files of CAsT-19 (cast-19.qrels.txt) and CAsT-20 (cast-20.qrels.txt) are stored in datasets/cast-19-20/qrels. This preprocessing process also produces human-rewritten query files of CAsT-19 (cast-19.queries-manual.tsv) and CAsT-20 (cast-20.queries-manual.tsv), which are stored in datasets/cast-19-20/queries.
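
To inspect the resulting qrels files, a small loader can be useful. The sketch below assumes the preprocessed files keep the standard four-column TREC qrels layout (`qid iteration docid relevance`); that layout is an assumption, not something confirmed here, and `load_qrels` is a hypothetical helper rather than a function from this repository.

```python
from collections import defaultdict


def load_qrels(path):
    """Parse a TREC-style qrels file into {query_id: {doc_id: relevance}}.

    Assumes whitespace-separated lines: qid iteration docid relevance.
    """
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if len(parts) != 4:
                raise ValueError(
                    f"{path}:{line_no}: expected 4 fields, got {len(parts)}"
                )
            qid, _iteration, doc_id, rel = parts
            qrels[qid][doc_id] = int(rel)
    return dict(qrels)
```

For example, `load_qrels("datasets/cast-19-20/qrels/cast-19.qrels.txt")` would give one inner dictionary of judged passages per query, provided the assumed layout holds.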

OR-QuAC

Use the following command to preprocess the collection, query and qrels files of OR-QuAC:

python preprocess.py --dataset or-quac

The preprocessed collection file is stored in ./datasets/or-quac/jsonl, and the qrels file (or-quac.qrels.txt) is stored in ./datasets/or-quac/qrels. This preprocessing process also produces human-rewritten query files of the training set (or-quac-train.queries-manual.tsv), development set (or-quac-dev.queries-manual.tsv) and test set (or-quac-test.queries-manual.tsv), which are stored in ./datasets/or-quac/queries.
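
Before indexing, it can be worth checking that a preprocessed collection file has the shape Pyserini's `JsonCollection` reads, namely one JSON object per line with an `id` and a `contents` field. The sketch below assumes the files in the `jsonl` folders are JSON Lines files; `check_jsonl` is a hypothetical helper and is not part of this repository.

```python
import json


def check_jsonl(path, max_lines=1000):
    """Inspect the first max_lines records of a JSON Lines file.

    Returns a list of (line_number, problem) tuples. An empty list means
    every inspected record is a JSON object with "id" and "contents" keys.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if line_no > max_lines:
                break
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((line_no, f"invalid JSON: {err.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_no, "record is not a JSON object"))
                continue
            for field in ("id", "contents"):
                if field not in record:
                    problems.append((line_no, f"missing field: {field}"))
    return problems
```

Only the first `max_lines` records are read, so the check stays fast even on a large collection; pass a larger value to inspect more of the file.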

Indexing

We use Pyserini for indexing and retrieval, and we follow the default Pyserini setting to index collections. Use the following command to index the collection of CAsT-19 & CAsT-20:

python -m pyserini.index.lucene --collection JsonCollection --generator DefaultLuceneDocumentGenerator --threads 16 -input datasets/cast-19-20/jsonl -index datasets/cast-19-20/index --storePositions --storeDocvectors --storeRaw

The index is stored in ./datasets/cast-19-20/index.

Use the following command to index the collection of OR-QuAC:

python -m pyserini.index.lucene --collection JsonCollection --generator DefaultLuceneDocumentGenerator --threads 16 -input datasets/or-quac/jsonl -index datasets/or-quac/index  --storePositions --storeDocvectors --storeRaw

The index is stored in ./datasets/or-quac/index.
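
Once an index exists, retrieval with Pyserini can be sketched as follows. This is an illustration rather than the retrieval script of this repository: `trec_run_lines` is a hypothetical helper, and the default cutoff `k`, the run tag, the query ID and the query text in the commented usage are all placeholders. The helper only relies on a Pyserini-style `search(query, k)` method whose hits expose `docid` and `score`.

```python
def trec_run_lines(searcher, qid, query, k=1000, tag="bm25"):
    """Retrieve up to k documents for one query and format TREC run lines.

    `searcher` is any object with a Pyserini-style search(query, k) method
    whose hits expose .docid and .score.
    """
    hits = searcher.search(query, k=k)
    return [
        f"{qid} Q0 {hit.docid} {rank} {hit.score:.6f} {tag}"
        for rank, hit in enumerate(hits, start=1)
    ]


# Usage with Pyserini (requires the index built above; values are placeholders):
# from pyserini.search.lucene import LuceneSearcher
# searcher = LuceneSearcher("datasets/cast-19-20/index")
# lines = trec_run_lines(searcher, "31_1", "What is throat cancer?")
# print("\n".join(lines[:5]))
```

Keeping the searcher as a parameter means the same function can be pointed at either index, or at a stub object when testing the formatting without an index.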

Generating Query Rewrites

We consider three kinds of query rewrites: T5-based query rewrites, QuReTeC-based query rewrites and human-rewritten queries.

We already obtained the human-rewritten queries for CAsT-19, CAsT-20 and OR-QuAC during preprocessing.
We use the T5 rewriter released by Lin et al. to generate T5-based query rewrites on CAsT-19, CAsT-20 and OR-QuAC.
We use QuReTeC proposed by [Voskarides et al.](https://dl.acm.org/doi/10.1145/3397
