QPP4CS
Query Performance Prediction for Conversational Search (QPP4CS)
This is the repository for the papers:
- Query Performance Prediction: From Ad-hoc to Conversational Search (SIGIR 2023)
- Performance Prediction for Conversational Search Using Perplexities of Query Rewrites (QPP++ 2023)
The repository implements a comprehensive collection of pre- and post-retrieval query performance prediction (QPP) methods, all integrated within a unified Python/PyTorch framework. It is intended for anyone conducting research into QPP for ad-hoc or conversational search.
We kindly ask you to cite our papers if you find this repository useful:
@inproceedings{meng2023query,
author = {Meng, Chuan and Arabzadeh, Negar and Aliannejadi, Mohammad and de Rijke, Maarten},
title = {Query Performance Prediction: From Ad-hoc to Conversational Search},
booktitle = {Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2583--2593},
year = {2023},
url = {https://doi.org/10.1145/3539618.3591919},
doi = {10.1145/3539618.3591919},
}
@inproceedings{meng2023Performance,
author = {Meng, Chuan and Aliannejadi, Mohammad and de Rijke, Maarten},
title = {Performance Prediction for Conversational Search Using Perplexities of Query Rewrites},
booktitle = {Proceedings of the QPP++ 2023: Query Performance Prediction and Its Evaluation in New Tasks Workshop co-located with The 45th European Conference on Information Retrieval},
year = {2023},
pages = {25--28}
}
This repository allows the replication of all results reported in the papers. In particular, it is organized as follows:
Note that for ease of use, we already uploaded the predicted performance files for all QPP methods reported in our paper. See here.
Prerequisites
We recommend running everything in a Linux environment. Create a conda environment with all required packages and activate it with the following commands:
$ conda env create -f environment.yaml
$ conda activate QPP4CS
Data Preparation
Query performance prediction for conversational search needs query rewrites, retrieval run files and actual performance files. To this end, we need to download raw dataset files, conduct preprocessing, build indexes, perform query rewriting, perform retrieval, and generate actual performance files.
For ease of use, you can directly download the dataset folder here, which contains the preprocessed qrels files, query rewrites, retrieval run files and actual performance files for the CAsT-19 & CAsT-20 and OR-QuAC datasets; please put the unzipped dataset folder in the current folder. Raw files for CAsT-19 & CAsT-20 and OR-QuAC are not included and need to be downloaded by the following procedure. The collections and indexes for CAsT-19 & CAsT-20 and OR-QuAC are too large to share, so they also need to be produced by the following procedure.
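To make the notion of an "actual performance file" concrete, the sketch below computes one retrieval-quality score per query from a run and a set of relevance judgments. It is only an illustration under stated assumptions: the metric (nDCG@k with linear gains and k=3), the in-memory dictionary layout, and the function names `ndcg_at_k` and `actual_performance` are all hypothetical and are not taken from the repository, whose own scripts may compute and store these scores differently.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevance, k=3):
    """nDCG@k for a single query, using linear gains.

    ranked_doc_ids: document ids in rank order (best first).
    relevance: dict mapping document id -> graded relevance label.
    """
    # Gains of the top-k retrieved documents; unjudged documents count as 0.
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    # Ideal ranking: the k highest positive labels among judged documents.
    ideal = sorted((g for g in relevance.values() if g > 0), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))

    return dcg / idcg if idcg > 0 else 0.0


def actual_performance(run, qrels, k=3):
    """Map each query id in the run to its nDCG@k score.

    run: dict query id -> ranked list of document ids.
    qrels: dict query id -> dict of document id -> relevance label.
    """
    return {qid: ndcg_at_k(docs, qrels.get(qid, {}), k) for qid, docs in run.items()}
```

A QPP method is then judged by how well its predicted scores track these per-query values.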
Raw File Download
CAsT-19 & CAsT-20
CAsT-19 and CAsT-20 share the same collection. Use the following commands to download the collection files (the MS MARCO Passage Ranking collection, the TREC CAR paragraph collection v2.0 and the MARCO duplicate file) of CAsT-19 & CAsT-20:
mkdir -p datasets/cast-19-20/raw
wget -P datasets/cast-19-20/raw https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz
wget -P datasets/cast-19-20/raw http://trec-car.cs.unh.edu/datareleases/v2.0/paragraphCorpus.v2.0.tar.xz
wget -P datasets/cast-19-20/raw http://boston.lti.cs.cmu.edu/Services/treccast19/duplicate_list_v1.0.txt
tar zxvf datasets/cast-19-20/raw/collection.tar.gz -C datasets/cast-19-20/raw/
tar xvJf datasets/cast-19-20/raw/paragraphCorpus.v2.0.tar.xz -C datasets/cast-19-20/raw/
mv datasets/cast-19-20/raw/collection.tsv datasets/cast-19-20/raw/msmarco.tsv
These files are stored in ./datasets/cast-19-20/raw.
Use the following commands to download the query and qrels files of CAsT-19 & CAsT-20:
wget -P datasets/cast-19-20/raw https://raw.githubusercontent.com/daltonj/treccastweb/master/2019/data/evaluation/evaluation_topics_v1.0.json
wget -P datasets/cast-19-20/raw https://raw.githubusercontent.com/daltonj/treccastweb/master/2019/data/evaluation/evaluation_topics_annotated_resolved_v1.0.tsv
wget -P datasets/cast-19-20/raw https://trec.nist.gov/data/cast/2019qrels.txt --no-check-certificate
wget -P datasets/cast-19-20/raw https://raw.githubusercontent.com/daltonj/treccastweb/master/2020/2020_automatic_evaluation_topics_v1.0.json
wget -P datasets/cast-19-20/raw https://raw.githubusercontent.com/daltonj/treccastweb/master/2020/2020_manual_evaluation_topics_v1.0.json
wget -P datasets/cast-19-20/raw https://trec.nist.gov/data/cast/2020qrels.txt --no-check-certificate
These files are stored in ./datasets/cast-19-20/raw.
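Before moving on to preprocessing, it can help to confirm that every download finished. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the file list is copied from the download commands above. The file extracted from the TREC CAR archive is left out because its name is not stated here.

```python
from pathlib import Path

# File names taken from the wget/mv commands for CAsT-19 & CAsT-20.
CAST_RAW_FILES = [
    "msmarco.tsv",
    "duplicate_list_v1.0.txt",
    "evaluation_topics_v1.0.json",
    "evaluation_topics_annotated_resolved_v1.0.tsv",
    "2019qrels.txt",
    "2020_automatic_evaluation_topics_v1.0.json",
    "2020_manual_evaluation_topics_v1.0.json",
    "2020qrels.txt",
]


def missing_files(raw_dir, expected=CAST_RAW_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("datasets/cast-19-20/raw")
    if missing:
        print("Missing raw files:", ", ".join(missing))
    else:
        print("All expected raw files are present.")
```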
OR-QuAC
Use the following commands to download the collection, query and qrels files of OR-QuAC:
mkdir -p datasets/or-quac/raw
wget -P datasets/or-quac/raw https://ciir.cs.umass.edu/downloads/ORConvQA/all_blocks.txt.gz --no-check-certificate
wget -P datasets/or-quac/raw https://ciir.cs.umass.edu/downloads/ORConvQA/qrels.txt.gz --no-check-certificate
gzip -d datasets/or-quac/raw/*.txt.gz
wget -P datasets/or-quac/raw https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/train.txt --no-check-certificate
wget -P datasets/or-quac/raw https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/dev.txt --no-check-certificate
wget -P datasets/or-quac/raw https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/test.txt --no-check-certificate
These files are stored in ./datasets/or-quac/raw.
Preprocessing
CAsT-19 & CAsT-20
Note that our preprocessing follows Yu et al. Use the following command to preprocess the collection, query and qrels files of CAsT-19 and CAsT-20:
python preprocess.py --dataset cast-19-20
The preprocessed collection file is stored in datasets/cast-19-20/jsonl, and the qrels files of CAsT-19 (cast-19.qrels.txt) and CAsT-20 (cast-20.qrels.txt) are stored in datasets/cast-19-20/qrels.
Preprocessing also produces the human-rewritten query files of CAsT-19 (cast-19.queries-manual.tsv) and CAsT-20 (cast-20.queries-manual.tsv), which are stored in datasets/cast-19-20/queries.
OR-QuAC
Use the following command to preprocess the collection, query and qrels files of OR-QuAC:
python preprocess.py --dataset or-quac
The preprocessed collection file is stored in ./datasets/or-quac/jsonl, and the qrels file (or-quac.qrels.txt) is stored in ./datasets/or-quac/qrels.
Preprocessing also produces the human-rewritten query files of the training set (or-quac-train.queries-manual.tsv), development set (or-quac-dev.queries-manual.tsv) and test set (or-quac-test.queries-manual.tsv), which are stored in ./datasets/or-quac/queries.
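The jsonl folders hold the collections in the line-delimited JSON layout that Pyserini's JsonCollection reads, where each line is an object with an `id` and a `contents` field. The sketch below shows that layout only; the helper names and the example id `MARCO_0` are hypothetical, and the actual preprocess.py may construct ids and text differently.

```python
import json


def to_jsonl_line(doc_id, text):
    """Serialize one passage as a JSON object with 'id' and 'contents' fields."""
    return json.dumps({"id": doc_id, "contents": text}, ensure_ascii=False)


def write_jsonl(passages, path):
    """Write (doc_id, text) pairs to path, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, text in passages:
            f.write(to_jsonl_line(doc_id, text) + "\n")


if __name__ == "__main__":
    # Hypothetical passage, for illustration only.
    print(to_jsonl_line("MARCO_0", "An example passage."))
```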
Indexing
We use Pyserini to conduct indexing and retrieval. We follow the default Pyserini setting to index collections. Use the following command to index the collection of CAsT-19 & CAsT-20:
python -m pyserini.index.lucene --collection JsonCollection --generator DefaultLuceneDocumentGenerator --threads 16 --input datasets/cast-19-20/jsonl --index datasets/cast-19-20/index --storePositions --storeDocvectors --storeRaw
The index is stored in ./datasets/cast-19-20/index.
Use the following command to index the collection of OR-QuAC:
python -m pyserini.index.lucene --collection JsonCollection --generator DefaultLuceneDocumentGenerator --threads 16 --input datasets/or-quac/jsonl --index datasets/or-quac/index --storePositions --storeDocvectors --storeRaw
The index is stored in ./datasets/or-quac/index.
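The two indexing commands differ only in the dataset name, so they can be generated from one template to avoid copy-paste mistakes. The sketch below is an optional convenience, not part of the repository: `build_index_command` is a hypothetical helper, and it uses the double-dash `--input`/`--index` spelling found in Pyserini's documentation. It only builds and prints the commands; running them still requires Pyserini and the preprocessed jsonl folders.

```python
import shlex
import sys


def build_index_command(dataset, threads=16, root="datasets"):
    """Build the Pyserini Lucene indexing command for one dataset as an argument list."""
    return [
        sys.executable, "-m", "pyserini.index.lucene",
        "--collection", "JsonCollection",
        "--generator", "DefaultLuceneDocumentGenerator",
        "--threads", str(threads),
        "--input", f"{root}/{dataset}/jsonl",
        "--index", f"{root}/{dataset}/index",
        "--storePositions", "--storeDocvectors", "--storeRaw",
    ]


if __name__ == "__main__":
    for dataset in ("cast-19-20", "or-quac"):
        # Print the command; pass the list to subprocess.run(..., check=True) to execute it.
        print(shlex.join(build_index_command(dataset)))
```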
Generating Query Rewrites
We consider three kinds of query rewrites: T5-based query rewrites, QuReTeC-based query rewrites and human-rewritten queries.
- We already obtained the human-rewritten queries for CAsT-19, CAsT-20 and OR-QuAC during preprocessing.
- We use the T5 rewriter released by Lin et al. to generate T5-based query rewrites on CAsT-19, CAsT-20 and OR-QuAC.
- We use QuReTeC proposed by [Voskarides et al.](https://dl.acm.org/doi/10.1145/3397