ClariQ
ClariQ: SCAI Workshop data challenge on conversational search clarification.
Introduction
The main aim of conversational systems is to return an appropriate answer in response to user requests. However, some user requests may be ambiguous. In Information Retrieval (IR) settings, such a situation is mainly handled through diversification of the search results page. It is, however, much more challenging in dialogue settings.
We release the ClariQ dataset [3, 4], aiming to study the following situation for dialogue settings:
- a user asks an ambiguous question (an ambiguous question being one to which more than one possible answer can be returned);
- the system must identify that the question is ambiguous, and, instead of trying to answer it directly, ask a good clarifying question.
The main research questions we aim to answer as part of the challenge are the following:
- RQ1: When to ask clarifying questions during dialogues?
- RQ2: How to generate the clarifying questions?
ConvAI3 Data Challenge
ClariQ was collected as part of the ConvAI3 (http://convai.io) challenge, which was co-organized with the SCAI workshop (https://scai-workshop.github.io/2020/). The challenge ran in two stages. At Stage 1 (described below), participants were provided with a static dataset consisting mainly of initial user requests, clarifying questions, and user answers, suitable for initial training, validation, and testing. At Stage 2, we brought a human in the loop. Namely, the top 3 systems resulting from Stage 1 were invited to develop systems that were exposed to human annotators.
Stage 1: initial dataset
Taking inspiration from the Qulac [1] dataset, we have crowdsourced a new dataset to study clarifying questions that is suitable for conversational settings. Namely, the collected dataset consists of:
- User Request: an initial user request in conversational form, e.g., "What is Fickle Creek Farm?", with a label from 1 to 4 reflecting how much clarification is needed. If an initial user request is self-contained and would not need any clarification, the label is 1. If an initial user request is absolutely ambiguous, making it impossible for a search engine to guess the user's right intent before clarification, the label is 4. Labels 2 and 3 represent intermediate levels of clarification need, where clarification is still needed but not as much as label 4;
- Clarification questions: a set of possible clarifying questions, e.g., "Do you want to know the location of fickle creek farm?";
- User Answers: each question is supplied with a user answer, e.g., "No, I want to find out where I can purchase Fickle Creek Farm products."
For training, the collected dataset is split into training (187 topics) and validation (50 topics) sets. For testing, the participants are supplied with: (1) a set of user requests in conversational form and (2) a set of questions (i.e., a question bank) which contains all the questions that we have collected for the collection. Therefore, to answer our research questions, we suggest the following two tasks:
- To answer RQ1: Given a user request, return a score from 1 to 4 indicating the necessity of asking clarifying questions.
- To answer RQ2: Given a user request that needs clarification, return the most suitable clarifying question. Here participants are able to choose: (1) either select the clarifying question from the provided question bank (all clarifying questions we collected), aiming to maximize precision, (2) or choose not to ask any question (by choosing Q0001 from the question bank).
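To make the RQ2 task concrete, here is a minimal sketch of question selection by plain term overlap between the request and each candidate question. This is a toy baseline for illustration only, not one of the official baselines; the question texts and IDs below (other than the `Q0001` no-question convention from the task description) are made up.

```python
# Toy baseline for RQ2: rank question-bank entries by word overlap with the
# user request, falling back to Q0001 (ask no question) when nothing matches.

def overlap(request: str, question: str) -> int:
    """Number of words the request and question share (case-insensitive)."""
    return len(set(request.lower().split()) & set(question.lower().split()))

def select_question(request: str, question_bank: dict) -> tuple:
    """Return (question_id, question_text) of the best-overlapping question.

    Q0001 is the conventional 'ask no question' entry in the question bank.
    """
    best_id, best_q = max(question_bank.items(),
                          key=lambda kv: overlap(request, kv[1]))
    if overlap(request, best_q) == 0:
        return "Q0001", ""          # nothing matched: ask no question
    return best_id, best_q

# Hypothetical mini question bank (IDs/texts are illustrative only).
bank = {
    "Q0001": "",  # the no-question option
    "Q0297": "do you want to know the location of fickle creek farm",
    "Q1123": "are you interested in the history of dinosaurs",
}
print(select_question("what is fickle creek farm", bank))
```

A real submission would replace `overlap` with a proper retrieval or ranking model; the point here is only the input/output shape of the task.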
Stage 2: human-in-the-loop
The second stage of the ClariQ data challenge enables the top-performing teams of the first stage to evaluate their models with the help of human evaluators. To do so, we ask the teams to generate their responses in a given conversation and pass the results to human evaluators. We instruct the human evaluators to read and understand the context of the conversation and write a response to the system. The evaluator assumes that they are part of the conversation. We evaluate the performance of a system in two respects: (i) How much the conversation can help a user find the information they are looking for and (ii) How natural and realistic does the conversation appear to a human evaluator.
ClariQ Dataset
We have extended the Qulac [1] dataset and base the competition mostly
on the training data that Qulac provides.
In addition, we have added some new topics, questions, and answers in the training set.
The test set is completely unseen and newly collected.
Like Qulac, ClariQ consists of single-turn conversations (initial_request, followed by clarifying question and answer).
In addition, it comes with synthetic multi-turn conversations (up to three turns). ClariQ features approximately 18K single-turn conversations, as well as 1.8 million multi-turn conversations.
Below, we provide a short summary of the data characteristics, for the training set:
ClariQ Train
Feature | Value
--------------------------- | -----
\# train (dev) topics | 187 (50)
\# faceted topics | 141
\# ambiguous topics | 57
\# single topics | 39
\# facets | 891
\# total questions | 3,929
\# single-turn conversations | 11,489
\# multi-turn conversations | ~ 1 million
\# documents | ~ 2 million
Below, we provide a brief overview of the structure of the data.
Files
Below we list the files in the repository:
- `./data/train.tsv` and `./data/dev.tsv` are TSV files consisting of topics (queries), facets, clarifying questions, user answers, and labels for how much clarification is needed (clarification needs).
- `./data/test.tsv` is a TSV file consisting of test topic IDs, as well as queries (text).
- `./data/test_with_labels.tsv` is a TSV file consisting of test topic IDs with the labels. It can be used with the evaluation script.
- `./data/multi_turn_human_generated_data.tsv` is a TSV file containing the human-generated multi-turn conversations resulting from the human-in-the-loop process.
- `./data/question_bank.tsv` is a TSV file containing all the questions in the collection, as well as their IDs. Participants' models should select questions from this file.
- `./data/top10k_docs_dict.pkl.tar.gz` is a `dict` containing the top 10,000 document IDs retrieved from the ClueWeb09 and ClueWeb12 collections for each topic. It may be used by participants who wish to leverage document content in their models.
- `./data/single_turn_train_eval.pkl` is a `dict` containing the performance of each topic after asking a question and getting the answer. The evaluation tool that we provide uses this file to evaluate the selected questions.
- `./data/multi_turn_train_eval.pkl.tar.gz` and `./data/multi_turn_dev_eval.pkl.tar.gz` are `dict`s that contain the performance of each conversation after asking a question from the `question_bank` and getting the answer from the user. The evaluation tool that we provide uses these files to evaluate the selected questions. Notice that these `dict`s are built based on the synthetic multi-turn conversations.
- `./data/dev_synthetic.pkl.tar.gz` and `./data/train_synthetic.pkl.tar.gz` are two compressed `pickle` files that contain `dict`s of synthetic multi-turn conversations. We have generated these conversations following the method explained in [1].
- `./src/clariq_eval_tool.py` is a Python script to evaluate the runs. Participants may use this tool to evaluate their models on the `dev` set. We would use the same tool to evaluate the submitted runs on the `test` set.
- `./sample_runs/` contains some sample runs and baselines. Among them, we have included the two oracle models `BestQuestion` and `WorstQuestion`, as well as `NoQuestion`, the model that chooses no question. Participants may check these files as sample run files, and may also use them to test the evaluation tool.
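As a quick orientation, the sketch below shows how one might load the two main artifact types listed above: the TSV files (e.g., `question_bank.tsv`) and the pickled `dict`s. The column names are assumptions inferred from this README, and an inline sample string stands in for the real file, so verify the headers against the actual data before relying on them.

```python
# Sketch of loading ClariQ artifacts. Column names ("question_id",
# "question") are assumed from the README; check the real file headers.
import csv
import io
import pickle

# In practice: reader = csv.DictReader(open("./data/question_bank.tsv"), delimiter="\t")
# Here, a tiny inline stand-in (entries are illustrative, except Q0001):
sample_tsv = (
    "question_id\tquestion\n"
    "Q0001\t\n"
    "Q0297\tdo you want to know the location of fickle creek farm\n"
)
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
question_bank = {row["question_id"]: row["question"] for row in reader}
print(len(question_bank))

# The *_eval.pkl files are pickled dicts consumed by clariq_eval_tool.py:
# with open("./data/single_turn_train_eval.pkl", "rb") as f:
#     eval_dict = pickle.load(f)
```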
File Format
train.tsv, dev.tsv:
train.tsv and dev.tsv have the same format. They contain the topics, facets, questions, answers, and clarification need labels. These are considered to be the main files, containing the labels of the training set. Note that the clarification need labels are already explicitly included in the files. The question relevance labels for each topic, in contrast, can be extracted indirectly: each row contains only the questions that are considered relevant to a topic, so any other question is deemed irrelevant when computing Recall@k.
In the train.tsv and dev.tsv files, you will find these fields:
- `topic_id`: the ID of the topic (`initial_request`).
- `initial_request`: the query (text) that initiates the conversation.
- `topic_desc`: a full description of the topic as it appears in the TREC Web Track data.
- `clarification_need`: a label from 1 to 4, indicating how much it is needed to clarify a topic. If an `initial_request` is self-contained and would not need any clarification, the label is 1. If an `initial_request` is absolutely ambiguous, making it impossible for a search engine to guess the user's right intent before clarification, the label is 4. Labels 2 and 3 represent other levels of clarification need, where clarification is still needed but not as much as label 4.
- `facet_id`: the ID of the facet.
- `facet_desc`: a full description of the facet (information need) as it appears in the TREC Web Track data.
- `question_id`: the ID of the question as it appears in `question_bank.tsv`.
- `question`: a clarifying question that the system can pose to the user for the current topic and facet.
- `answer`: an answer to the clarifying question, assuming that the user is in the context of the current row (i.e., the user's initial query is `initial_request`, their i
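Since each row of train.tsv pairs one topic with one relevant question, recovering the per-topic view (clarification label plus the set of relevant questions mentioned above) is a simple group-by. The sketch below assumes the column names from the field list; the sample rows and values are invented stand-ins for the real file.

```python
# Group train.tsv-style rows by topic to recover each topic's
# clarification_need label and its set of relevant question IDs.
# Sample rows are illustrative only; read the real ./data/train.tsv in practice.
import csv
import io
from collections import defaultdict

sample_rows = (
    "topic_id\tinitial_request\tclarification_need\tfacet_id\tquestion_id\n"
    "14\twhat is fickle creek farm\t4\tF0141\tQ0297\n"
    "14\twhat is fickle creek farm\t4\tF0142\tQ0298\n"
)

topics = defaultdict(lambda: {"questions": set()})
for row in csv.DictReader(io.StringIO(sample_rows), delimiter="\t"):
    topic = topics[row["topic_id"]]
    topic["request"] = row["initial_request"]
    topic["clarification_need"] = int(row["clarification_need"])
    topic["questions"].add(row["question_id"])   # relevant questions only

print(topics["14"]["clarification_need"], sorted(topics["14"]["questions"]))
```

The resulting `questions` set is exactly the relevant set for Recall@k: any question-bank entry outside it counts as irrelevant for that topic.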