CODEC
CODEC is a document and entity ranking dataset that focuses on complex essay-style topics.
Colab demo showing indexing, query reformulations, entity links, and evaluation:
CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations including entity links. It includes expert judgments on 6,186 documents (147.3 per topic) and 11,323 entities (269.6 per topic) from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations (9.2 per topic), providing data for query performance prediction and automatic rewriting evaluation.
</p> <p align="center"> <img src="https://github.com/grill-lab/CODEC/blob/main/assets/overview.png" alt="CODEC Diagram" width="700" height="275" > <!-- Paper --> <h3 id="paper">Paper</h3>This work will be presented at SIGIR 2022: https://arxiv.org/abs/2205.04546
Correct citation:
@inproceedings{mackie2022codec,
title={CODEC: Complex Document and Entity Collection},
author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffrey},
booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
year={2022}
}
<!-- Dataset -->
<h3 id="dataset">Dataset</h3>
<p> CODEC provides 42 topics for document and entity retrieval: </p>
<ul>
<li><a href="https://github.com/grill-lab/CODEC/blob/main/topics/topics.json">Topics</a></li>
<li><a href="https://github.com/grill-lab/CODEC/blob/main/topics/query_reformulations.txt">Query reformulations</a></li>
<li><a href="https://github.com/grill-lab/CODEC/blob/main/qrels">Qrels</a></li>
<li><a href="https://github.com/grill-lab/CODEC/blob/main/raw_judgments">Raw judgments</a></li>
<li><a href="https://github.com/grill-lab/CODEC/blob/main/system_runs/folds/folds.json">Standard 4-folds</a></li>
<li><a href="https://github.com/grill-lab/CODEC/blob/main/system_runs/runs">Baseline runs</a></li>
</ul>
The full CODEC document corpus is available for research purposes: <a href="https://huggingface.co/datasets/macavaney/codec">FULL</a>.
CODEC entity KB is <a href="https://ai.facebook.com/tools/kilt/">KILT's</a> snapshot of Wikipedia (~30GB).
The dataset is also available via <a href="https://github.com/allenai/ir_datasets"><i>ir-datasets</i></a>.
<!-- Change Log --> <h3 id="change-log">Change Log</h3>Major dataset changes that existing users should be aware of:
<ul> <li> <b>25th April</b>: CODEC v1 released. </ul> <!-- Tasks --> <h3 id="tasks">Tasks</h3> <p> CODEC is a test collection that provides two tasks: <b>document ranking</b> and <b>entity ranking</b>. This dataset benchmarks a social science researcher who is attempting to find supporting entities and documents that will form the basis of a long-form essay discussing the topic from various perspectives. The researcher would explore the topic to (1) identify relevant sources and (2) understand key concepts. </p> <p>Document ranking systems have to return a relevance-ranked list of documents for a given natural language query. Entity ranking systems have to return a relevance-ranked list of entities for a given natural language query. Document ranking uses CODEC’s new document corpus and entity ranking uses KILT as the entity knowledge base. For the experimental setup, we provide four pre-defined ‘standard’ folds for k-fold cross-validation to allow parameter tuning. Initial retrieval or re-ranking of provided baseline runs can both be evaluated using this dataset.
</p> <!-- Complex Topics --> <h3 id="complex-topics">Complex Topics</h3> CODEC provides 42 complex topics intended to benchmark the role of a researcher. Social science experts from <b>history</b> (history teacher, published history scholar), <b>economics</b> (FX trader, accountant, investment banker) and <b>politics</b> (political scientists, politician) helped to generate interesting and factually grounded topics. The authors developed the following criteria for complex topics: <ul> <li> <i>Open-ended essay-style</i> <li> <i>Natural language question</i> <li> <i>Multiple points of view</i> <li> <i>Concern multiple key entities</i> <li> <i>Complex</i> <li> <i>Requires knowledge</i> </ul>Each topic contains a query and a narrative. The query is the question the researcher seeks to understand by exploring documents and entities, i.e., the text input posed to the search system. The narrative provides an overview of the topic (key concepts, arguments, facts, etc.) and allows non-domain-experts to understand the topic.
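The four standard folds mentioned under Tasks lend themselves to a leave-one-fold-out loop: tune on three folds, then evaluate on the held-out one. The sketch below assumes a hypothetical layout for the folds file (a mapping from fold name to a list of topic ids); the actual structure of `folds.json` may differ, and the topic ids and the `cross_validation_splits` helper are illustrative only.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical fold layout: fold name -> topic ids. The real folds.json
# may be organised differently; these ids are placeholders.
folds: Dict[str, List[str]] = {
    "fold_0": ["topic-a", "topic-b"],
    "fold_1": ["topic-c", "topic-d"],
    "fold_2": ["topic-e", "topic-f"],
    "fold_3": ["topic-g", "topic-h"],
}


def cross_validation_splits(
    folds: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_fold, train_topics, test_topics) for each fold."""
    for held_out in sorted(folds):
        test_topics = list(folds[held_out])
        # Every topic from the remaining folds is used for tuning.
        train_topics = [
            topic
            for name in sorted(folds)
            if name != held_out
            for topic in folds[name]
        ]
        yield held_out, train_topics, test_topics


splits = list(cross_validation_splits(folds))
for held_out, train_topics, test_topics in splits:
    # Tune parameters on train_topics, then report metrics on test_topics.
    print(held_out, len(train_topics), len(test_topics))
```

Each topic appears in exactly one test split, so per-topic scores from the four held-out folds can be pooled to cover all topics.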
<p align="center"> <img src="https://github.com/grill-lab/CODEC/blob/main/assets/econ_topic.png" alt="CODEC Topics" width="400" height="400" > <!-- Document Corpus --> <h3 id="doc-corpus">Document Corpus</h3>We use Common Crawl to curate a 729,824-document corpus with focused content across finance, history, and politics.
The corpus is released in JSON Lines format with the following fields:
<ul>
<li> <b>id</b>: <i>Unique identifier, the MD5 hash of the URL.</i></li>
<li> <b>url</b>: <i>Location of the webpage (URL).</i></li>
<li> <b>title</b>: <i>Title of the webpage, if available.</i></li>
<li> <b>contents</b>: <i>The text content of the webpage after removing any unnecessary advertising or formatting. New lines provide some structure between the extracted sections of the webpage, while remaining easy for neural systems to process.</i></li>
</ul>Document distribution:
<table class="tg">
<thead>
<tr> <th class="tg-7zrl">Domain</th> <th class="tg-1wig">Document Count</th> </tr>
</thead>
<tbody>
<tr> <td class="tg-7zrl">reuters.com</td> <td class="tg-2b7s">172,127</td> </tr>
<tr> <td class="tg-7zrl">forbes.com</td> <td class="tg-2b7s">147,399</td> </tr>
<tr> <td class="tg-7zrl">cnbc.com</td> <td class="tg-2b7s">100,842</td> </tr>
<tr> <td class="tg-7zrl">britannica.com</td> <td class="tg-2b7s">93,484</td> </tr>
<tr> <td class="tg-7zrl">latimes.com</td> <td class="tg-2b7s">88,486</td> </tr>
<tr> <td class="tg-7zrl">usatoday.com</td> <td class="tg-2b7s">31,803</td> </tr>
<tr> <td class="tg-7zrl">investopedia.com</td> <td class="tg-2b7s">21,459</td> </tr>
<tr> <td class="tg-7zrl">bbc.co.uk</td> <td class="tg-2b7s">21,414</td> </tr>
<tr> <td class="tg-7zrl">history.state.gov</td> <td class="tg-2b7s">9,187</td> </tr>
<tr> <td class="tg-7zrl">brookings.edu</td> <td class="tg-2b7s">9,058</td> </tr>
<tr> <td class="tg-7zrl">ehistory.osu.edu</td> <td class="tg-2b7s">8,805</td> </tr>
<tr> <td class="tg-7zrl">history.com</td> <td class="tg-2b7s">6,749</td> </tr>
<tr> <td class="tg-7zrl">spartacus-educational.com</td> <td class="tg-2b7s">3,904</td> </tr>
<tr> <td class="tg-7zrl">historynet.com</td> <td class="tg-2b7s">3,811</td> </tr>
<tr> <td class="tg-7zrl">historyhit.com</td> <td class="tg-2b7s">3,173</td> </tr>
<tr> <td class="tg-7zrl">...</td> <td class="tg-7zrl">...</td> </tr>
<tr> <td class="tg-j6zm"><span style="font-weight:bold">TOTAL</span></td> <td class="tg-kex3"><span style="font-weight:bold">721,701</span></td> </tr>
</tbody>
</table>
<!-- Entity KB -->
<h3 id="ent-corpus">Entity KB</h3>CODEC uses KILT’s Wikipedia KB for the entity ranking task, which is based on the 2019/08/01 Wikipedia snapshot. KILT contains 5.9M preprocessed articles, which are freely available to use: <a href="https://ai.facebook.com/tools/kilt/">link</a>.
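Returning to the document corpus format described above, the following is a minimal sketch of reading one JSON Lines record and checking that its `id` matches the MD5 hash of its `url`. The hex encoding of the hash, the example record, and the helper names `make_doc_id` and `parse_corpus_line` are assumptions for illustration and are not taken from the corpus itself.

```python
import hashlib
import json


def make_doc_id(url: str) -> str:
    """Return the MD5 hash of a URL as a hex string (hex encoding is assumed)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def parse_corpus_line(line: str) -> dict:
    """Parse one JSON Lines record and check the four documented fields."""
    doc = json.loads(line)
    missing = {"id", "url", "title", "contents"} - doc.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return doc


# Hypothetical record for illustration; not taken from the corpus.
example_url = "https://example.com/article"
example_line = json.dumps({
    "id": make_doc_id(example_url),
    "url": example_url,
    "title": "Example title",
    "contents": "First section.\nSecond section.",
})

doc = parse_corpus_line(example_line)
id_matches_url = doc["id"] == make_doc_id(doc["url"])
# New lines in `contents` separate the extracted sections of the page.
sections = doc["contents"].split("\n")
```

Reading the real corpus would apply `parse_corpus_line` to each line of the downloaded file, one record at a time, which avoids loading the whole file into memory.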
<!-- Judgments --> <h3 id="judgments">Judgments</h3>CODEC uses a two-stage assessment approach that balances adequate coverage of current systems with allowing annotators to explore topics using an iterative search system. This produces 6,186 document judgments (147.3 per topic) and 11,323 entity judgments (269.6 per topic).
The raw judgments are released: <a href="https://github.com/grill-lab/CODEC/blob/main/raw_judgments">link</a>.
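The per-topic averages quoted in this README follow directly from dividing each total by the 42 topics. A short check, using only figures stated in the text (the variable names are illustrative):

```python
NUM_TOPICS = 42

# Totals as stated in this README.
counts = {
    "document judgments": 6186,
    "entity judgments": 11323,
    "query reformulations": 387,
}

# Average per topic, rounded to one decimal place as in the text.
per_topic = {name: round(total / NUM_TOPICS, 1) for name, total in counts.items()}
for name, average in per_topic.items():
    print(f"{name}: {average} per topic")
```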
<table class="tg"> <thead> <tr> <th class="tg-j6zm"><span style="font-weight:bold">Judgment</span></th> <th class="tg-j6zm"><span style="font-weight:bold">Document Ranking</span></th> <th class="tg-j6zm"><span style="font-weight:bold">Entity Ranking</span></th> </tr> </thead> </table>
