
ClariQ

ClariQ: SCAI Workshop data challenge on conversational search clarification.


Introduction

The main aim of conversational systems is to return an appropriate answer in response to user requests. However, some user requests are ambiguous. In Information Retrieval (IR) settings, such situations are handled mainly by diversifying the search result page; in dialogue settings, this is much more challenging.

We release the ClariQ dataset [3, 4], aiming to study the following situation in dialogue settings:

  • a user asks an ambiguous question (i.e., a question that admits more than one possible answer);
  • the system must identify that the question is ambiguous and, instead of trying to answer it directly, ask a good clarifying question.

The main research questions we aim to answer as part of the challenge are the following:

  • RQ1: When to ask clarifying questions during dialogues?
  • RQ2: How to generate the clarifying questions?

ConvAI3 Data Challenge

ClariQ was collected as part of the ConvAI3 challenge (http://convai.io), which was co-organized with the SCAI workshop (https://scai-workshop.github.io/2020/). The challenge ran in two stages. At Stage 1 (described below), participants were provided with a static dataset consisting mainly of initial user requests, clarifying questions, and user answers, suitable for initial training, validation, and testing. At Stage 2, we brought a human into the loop: the teams behind the top three systems from Stage 1 were invited to develop systems that interacted with human annotators.

Stage 1: initial dataset

Taking inspiration from the Qulac [1] dataset, we have crowdsourced a new dataset to study clarifying questions that is suitable for conversational settings. Namely, the collected dataset consists of:

  • User Request: an initial user request in conversational form, e.g., "What is Fickle Creek Farm?", together with a label from 1 to 4 reflecting how much clarification is needed. If an initial user request is self-contained and would not need any clarification, the label is 1; if it is absolutely ambiguous, making it impossible for a search engine to guess the user's right intent before clarification, the label is 4. Labels 2 and 3 represent intermediate levels, where clarification is still needed but not as much as for label 4;
  • Clarifying Questions: a set of possible clarifying questions, e.g., "Do you want to know the location of fickle creek farm?";
  • User Answers: each question is supplied with a user answer, e.g., "No, I want to find out where can i purchase fickle creek farm products."

For training, the collected dataset is split into training (187 topics) and validation (50 topics) sets. For testing, participants are supplied with: (1) a set of user requests in conversational form and (2) a question bank containing all the questions that we have collected for the collection. Therefore, to answer our research questions, we suggest the following two tasks:

  • To answer RQ1: Given a user request, return a score from 1 to 4 indicating the necessity of asking clarifying questions.
  • To answer RQ2: Given a user request that needs clarification, return the most suitable clarifying question. Here participants can either (1) select the clarifying question from the provided question bank (all clarifying questions we collected), aiming to maximize precision, or (2) choose not to ask any question (by selecting Q0001 from the question bank).
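To make the two tasks concrete, the expected interfaces can be sketched as follows. This is a toy illustration, not part of the challenge: the function names and the trivial heuristics are our own assumptions, and a real submission would use trained models.

```python
def predict_clarification_need(request: str) -> int:
    """Task 1 (RQ1): map a user request to a label in {1, 2, 3, 4}.

    Toy heuristic: shorter requests are assumed to be more ambiguous.
    """
    n_words = len(request.split())
    if n_words <= 2:
        return 4  # absolutely ambiguous
    elif n_words <= 4:
        return 3
    elif n_words <= 6:
        return 2
    return 1      # likely self-contained


def select_question(request: str, question_bank: dict) -> str:
    """Task 2 (RQ2): return the ID of a clarifying question from the
    question bank, or 'Q0001' to ask no question.

    Toy heuristic: pick the question with the largest word overlap
    with the request; fall back to Q0001 (no question).
    """
    req_words = set(request.lower().split())
    best_id, best_overlap = "Q0001", 0
    for qid, question in question_bank.items():
        overlap = len(req_words & set(question.lower().split()))
        if overlap > best_overlap:
            best_id, best_overlap = qid, overlap
    return best_id
```

A word-overlap baseline like this is of course far below the precision the task calls for; it only shows the input/output contract of the two subtasks.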

Stage 2: human-in-the-loop

The second stage of the ClariQ data challenge enables the top-performing teams of the first stage to evaluate their models with the help of human evaluators. To do so, we ask the teams to generate their responses in a given conversation and pass the results to human evaluators. We instruct the human evaluators to read and understand the context of the conversation and write a response to the system, assuming that they are part of the conversation. We evaluate the performance of a system in two respects: (i) how well the conversation helps a user find the information they are looking for, and (ii) how natural and realistic the conversation appears to a human evaluator.

ClariQ Dataset

We have extended the Qulac [1] dataset and base the competition mostly on the training data that Qulac provides. In addition, we have added some new topics, questions, and answers to the training set. The test set is completely unseen and newly collected. Like Qulac, ClariQ consists of single-turn conversations (initial_request followed by a clarifying question and answer). In addition, it comes with synthetic multi-turn conversations (up to three turns). ClariQ features approximately 18K single-turn conversations, as well as 1.8 million multi-turn conversations. Below, we provide a short summary of the data characteristics for the training set:

ClariQ Train

Feature                     | Value
----------------------------|-----------
# train (dev) topics        | 187 (50)
# faceted topics            | 141
# ambiguous topics          | 57
# single topics             | 39
# facets                    | 891
# total questions           | 3,929
# single-turn conversations | 11,489
# multi-turn conversations  | ~1 million
# documents                 | ~2 million

Below, we provide a brief overview of the structure of the data.

Files

Below we list the files in the repository:

  • ./data/train.tsv and ./data/dev.tsv are TSV files consisting of topics (queries), facets, clarifying questions, user's answers, and labels for how much clarification is needed (clarification needs).
  • ./data/test.tsv is a TSV file consisting of test topic ID's, as well as queries (text).
  • ./data/test_with_labels.tsv is a TSV file consisting of test topic ID's with the labels. It can be used with the evaluation script.
  • ./data/multi_turn_human_generated_data.tsv is a TSV file containing the human-generated multi-turn conversations that resulted from the human-in-the-loop process.
  • ./data/question_bank.tsv is a TSV file containing all the questions in the collection, as well as their ID's. Participants' models should select questions from this file.
  • ./data/top10k_docs_dict.pkl.tar.gz is a dict containing the top 10,000 document ID's retrieved from ClueWeb09 and ClueWeb12 collections for each topic. This may be used by the participants who wish to leverage documents content in their models.
  • ./data/single_turn_train_eval.pkl is a dict containing the performance of each topic after asking a question and getting the answer. The evaluation tool that we provide uses this file to evaluate the selected questions.
  • ./data/multi_turn_train_eval.pkl.tar.gz.** and ./data/multi_turn_dev_eval.pkl.tar.gz are dicts that contain the performance of each conversation after asking a question from the question_bank and getting the answer from the user. The evaluation tool that we provide uses these files to evaluate the selected questions. Note that these dicts are built from the synthetic multi-turn conversations.
  • ./data/dev_synthetic.pkl.tar.gz and ./data/train_synthetic.pkl.tar.gz are two compressed pickle files that contain dicts of synthetic multi-turn conversations. We have generated these conversations following the method explained in [1].
  • ./src/clariq_eval_tool.py is a Python script to evaluate the runs. Participants may use this tool to evaluate their models on the dev set; the same tool is used to evaluate the submitted runs on the test set.
  • ./sample_runs/ contains some sample runs and baselines, including the two oracle models BestQuestion and WorstQuestion, as well as NoQuestion, a model that asks no question. Participants may use these files as sample runs and to test the evaluation tool.
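As a starting point for working with the files above, here is a minimal sketch of loading the question bank; it assumes question_bank.tsv is tab-separated with a header row of question_id and question (an assumption based on the field names used in this README), and uses a small inline stand-in instead of the real file:

```python
import csv
import io

# Two-row stand-in for ./data/question_bank.tsv; the real file is assumed
# to be tab-separated with header columns question_id and question.
# Q0001 is the "no question" entry, so its text is assumed empty here.
SAMPLE_QUESTION_BANK = (
    "question_id\tquestion\n"
    "Q0001\t\n"
    "Q0002\tdo you want to know the location of fickle creek farm\n"
)


def load_question_bank(fileobj) -> dict:
    """Parse a question_bank.tsv-style stream into {question_id: question}."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    return {row["question_id"]: row["question"] for row in reader}


bank = load_question_bank(io.StringIO(SAMPLE_QUESTION_BANK))
```

For the real file, replace the StringIO stand-in with `open("./data/question_bank.tsv", newline="")`; the .pkl files are standard pickled dicts and can be loaded with the pickle module after decompression.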

File Format

train.tsv, dev.tsv:

train.tsv and dev.tsv have the same format. They contain the topics, facets, questions, answers, and clarification need labels, and are considered the main files, containing the labels of the training set. Note that the clarification need labels are explicitly included in the files. The question relevance labels for each topic, in contrast, can be extracted indirectly: each row contains only questions that are considered relevant to a topic, so any other question is deemed irrelevant when computing Recall@k. In the train.tsv and dev.tsv files, you will find these fields:

  • topic_id: the ID of the topic (initial_request).
  • initial_request: the query (text) that initiates the conversation.
  • topic_desc: a full description of the topic as it appears in the TREC Web Track data.
  • clarification_need: a label from 1 to 4, indicating how much clarification a topic needs. If an initial_request is self-contained and would not need any clarification, the label is 1; if it is absolutely ambiguous, making it impossible for a search engine to guess the user's right intent before clarification, the label is 4. Labels 2 and 3 represent intermediate levels, where clarification is still needed but not as much as for label 4.
  • facet_id: the ID of the facet.
  • facet_desc: a full description of the facet (information need) as it appears in the TREC Web Track data.
  • question_id: the ID of the question as it appears in question_bank.tsv.
  • question: a clarifying question that the system can pose to the user for the current topic and facet.
  • answer: an answer to the clarifying question, assuming that the user is in the context of the current row (i.e., the user's initial query is initial_request and their information need is described by facet_desc).
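The indirect question-relevance labelling described above can be sketched as follows; the sample rows are invented, and only the (topic_id, question_id) pair of each row is used:

```python
from collections import defaultdict

# Invented sample rows mimicking (topic_id, question_id) pairs from train.tsv.
rows = [
    ("1", "Q0042"),
    ("1", "Q0107"),
    ("2", "Q0042"),
]


def relevant_questions(rows):
    """Collect, per topic, the set of question IDs that appear with it.

    Per the README, a question counts as relevant to a topic iff it occurs
    in that topic's rows; every other question in the bank is irrelevant.
    """
    relevant = defaultdict(set)
    for topic_id, question_id in rows:
        relevant[topic_id].add(question_id)
    return relevant


def recall_at_k(ranked_qids, relevant_set, k):
    """Recall@k: fraction of the relevant questions found in the top k."""
    hits = sum(1 for qid in ranked_qids[:k] if qid in relevant_set)
    return hits / len(relevant_set)


rel = relevant_questions(rows)
```

For example, a run ranking Q0042 first for topic 1 recovers one of that topic's two relevant questions in the top 2, giving Recall@2 of 0.5.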
