# TriviaHG: A Dataset for Automatic Hint Generation from Factoid Questions
<img src="https://github.com/DataScienceUIBK/TriviaHG/blob/main/Framework/Framework.png">

TriviaHG is a dataset designed specifically for hint generation in question answering. Instead of direct answers, TriviaHG provides 10 hints per question, encouraging users to apply critical thinking and reasoning to derive the solution themselves. The dataset covers diverse question types across varying difficulty levels and is partitioned into training, validation, and test sets. These subsets support the fine-tuning and training of large language models for generating high-quality hints.
<img src="https://github.com/DataScienceUIBK/TriviaHG/blob/main/Framework/gif-dan.gif" width="32" height="32"/> **Attention** <img src="https://github.com/DataScienceUIBK/TriviaHG/blob/main/Framework/gif-dan.gif" width="32" height="32"/>
As of February 2025, we recommend using HintEval, the framework for hint generation and evaluation. HintEval includes the TriviaHG dataset and the evaluation metrics introduced in the TriviaHG paper, such as Convergence and Familiarity, making it easier than ever to work with hints.
Check out HintEval here:
- 📖 HintEval Documentation
- 📦 HintEval PyPI Installation
- 💻 HintEval GitHub Repository
- 📜 HintEval Paper (arXiv)
For seamless integration of hint generation and evaluation, we highly recommend migrating to HintEval!
## Dataset
TriviaHG comprises several sub-datasets, each encompassing ⬇️Training, ⬇️Validation, and ⬇️Test sets. You can access and download each subset by clicking on its respective link.
The dataset is structured as JSON files, including training.json, validation.json, and test.json for training, validation, and test phases, respectively:
```json
[
  {
    "Q_ID": "",
    "Question": "",
    "Hints": [ ],
    "Hints_Sources": [ ],
    "Snippet": "",
    "Snippet_Sources": [ ],
    "ExactAnswer": [ ],
    "MajorType": "",
    "MinorType": "",
    "Candidates_Answers": [ ],
    "Q_Popularity": { },
    "Exact_Answer_Popularity": { },
    "H_Popularity": [ ],
    "Scores": [ ],
    "Convergence": [ ],
    "Familiarity": [ ]
  }
]
```
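As a quick sanity check after downloading, the splits can be read with Python's standard `json` module. A minimal sketch, assuming `training.json`, `validation.json`, and `test.json` sit in the current directory:

```python
import json

def load_split(path):
    """Load one TriviaHG split (a JSON list of question records)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Report basic statistics for each split.
for split in ("training.json", "validation.json", "test.json"):
    records = load_split(split)
    num_hints = sum(len(r["Hints"]) for r in records)
    print(f"{split}: {len(records)} questions, {num_hints} hints")

# Inspect one record: each question carries its hints plus metadata
# such as MajorType/MinorType and the Convergence/Familiarity scores.
sample = load_split("validation.json")[0]
print(sample["Question"], sample["Hints"][:2])
```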
### Dataset Statistics

|                   | Training | Validation | Test  |
| ----------------- | -------- | ---------- | ----- |
| Num. of Questions | 14,645   | 1,000      | 1,000 |
| Num. of Hints     | 140,973  | 9,638      | 9,619 |
## Framework and Model Deployment
The Framework directory houses essential files for the hint generation framework. Notably, you will find Framework.ipynb, a Jupyter Notebook tailored for executing and exploring the framework's code. Utilize 🌐Google Colab to seamlessly run this notebook and delve into the hint generation process.
### Finetuned Language Models
We have finetuned several large language models, including LLaMA 7b, LLaMA 13b, and LLaMA 70b, on the TriviaHG dataset. These models are not available for direct download but can be accessed through the API provided by AnyScale.com. The IDs of the finetuned models are:

- LLaMA 7b Finetuned: `meta-llama/Llama-2-7b-chat-hf:Hint_Generator:X6odC0D`
- LLaMA 13b Finetuned: `meta-llama/Llama-2-13b-chat-hf:Hint_Generator:ajid9Dr`
- LLaMA 70b Finetuned: `meta-llama/Llama-2-70b-chat-hf:Hint_Generator:NispySP`
### Querying Finetuned Models

Using cURL:
```bash
export ENDPOINTS_AUTH_TOKEN=YOUR_API_KEY

curl "https://api.endpoints.anyscale.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ENDPOINTS_AUTH_TOKEN" \
  -d '{
    "model": "meta-llama/Llama-2-70b-chat-hf:Hint_Generator:NispySP",
    "messages": [
      {"role": "user", "content": "Generate 10 hints for the following question. Question: Which country has the highest population?"}
    ],
    "temperature": 0.0
  }'
```
Or using Python:
```python
import os

import requests

s = requests.Session()
api_base = "https://api.endpoints.anyscale.com/v1"
# Read the API key from the environment; replace with long-lived
# credentials for production.
token = os.environ["ENDPOINTS_AUTH_TOKEN"]
url = f"{api_base}/chat/completions"
body = {
    "model": "meta-llama/Llama-2-70b-chat-hf:Hint_Generator:NispySP",
    "messages": [
        {"role": "user", "content": "Generate 10 hints for the following question. Question: Which country has the highest population?"}
    ],
    "temperature": 0.0,
}
with s.post(url, headers={"Authorization": f"Bearer {token}"}, json=body) as resp:
    resp.raise_for_status()
    print(resp.json())
```
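The endpoint returns an OpenAI-style chat-completions payload, so the generated hints sit in the first choice's message. A minimal sketch for pulling them out, assuming `resp` is the response object from the request above and that the model returns the 10 hints as plain text:

```python
# Extract the model's reply from the OpenAI-style response payload
# and print each non-empty line (one hint per line).
data = resp.json()
hints_text = data["choices"][0]["message"]["content"]
for line in hints_text.splitlines():
    if line.strip():
        print(line.strip())
```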
## Evaluation

### Human Evaluation - Answering

The Human Evaluation - Answering folder contains the Excel files used to collect responses from six human participants. Each participant received ten Excel files, each with a set of ten questions. The table below summarizes the question types covered by these files, along with the corresponding answering statistics collected from participants. Columns follow the format {Difficulty}-{Model}, where B, F, and V denote Bing, LLaMA 7b Finetuned, and LLaMA 7b Vanilla, respectively.
| Question Type | Hard-B | Hard-F | Hard-V | Medium-B | Medium-F | Medium-V | Easy-B | Easy-F | Easy-V |
|---------------|--------|--------|--------|----------|----------|----------|--------|--------|--------|
| ENTITY        | 5 / 9  | 5 / 9  | 4 / 9  | 8 / 8    | 6 / 8    | 4 / 8    | 8 / 8  | 8 / 8  | 6 / 8  |
| HUMAN         | 2 / 9  | 0 / 9  | 0 / 9  | 5 / 8    | 1 / 8    | 0 / 8    | 6 / 8  | 6 / 8  | 4 / 8  |
| LOCATION      | 0 / 9  | 0 / 9  | 0 / 9  | 7 / 8    | 5 / 8    | 2 / 8    | 7 / 8  | 6 / 8  | 4 / 8  |
| OTHER         | 3 / 9  | 2 / 9  | 0 / 9  | 5 / 8    | 2 / 8    | 0 / 8    | 8 / 8  | 7 / 8  | 7 / 8  |
### Human Evaluation - Quality

The Human Evaluation - Quality folder contains ten Excel files with human annotations for 2,791 hints across quality attributes such as relevance, readability, ambiguity, convergence, and familiarity. These attributes are essential markers for assessing the overall quality and effectiveness of the generated hints. The table below summarizes the average score attained for each quality attribute, offering insights into how human participants perceived the quality of the evaluated hints.
| Method               | Match | Readability | Ambiguity | Convergence | Familiarity |
|----------------------|-------|-------------|-----------|-------------|-------------|
| Copilot              | 4.09  | 4.67        | 1.51      | 2.23        | 2.47        |
| LLaMA 7b - Finetuned | 4.01  | 4.70        | 1.56      | 2.20        | 2.41        |
| LLaMA …              |       |             |           |             |             |
