Codex
CoDEx: A set of knowledge graph Completion Datasets Extracted from Wikidata and Wikipedia
CoDEx is a set of knowledge graph Completion Datasets Extracted from Wikidata and Wikipedia. As introduced and described by our EMNLP 2020 paper <a href="https://arxiv.org/pdf/2009.07810.pdf" target="_blank">CoDEx: A Comprehensive Knowledge Graph Completion Benchmark</a>, CoDEx offers three rich knowledge graph datasets that contain positive and hard negative triples, entity types, entity and relation descriptions, and Wikipedia page extracts for entities. We provide baseline performance results, configuration files, and pretrained models on CoDEx using the LibKGE framework for two knowledge graph completion tasks, link prediction and triple classification.
The statistics for each CoDEx dataset are as follows:
| | Entities | Relations | Train | Valid (+) | Test (+) | Valid (-) | Test (-) | Total triples |
|----------|---------:|----------:|--------:|----------:|---------:|----------:|---------:|--------------:|
| CoDEx-S | 2,034 | 42 | 32,888 | 1,827 | 1,828 | 1,827 | 1,828 | 36,543 |
| CoDEx-M | 17,050 | 51 | 185,584 | 10,310 | 10,311 | 10,310 | 10,311 | 206,205 |
| CoDEx-L | 77,951 | 69 | 551,193 | 30,622 | 30,622 | - | - | 612,437 |
| Raw dump | 380,038 | 75 | - | - | - | - | - | 1,156,222 |
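As a quick consistency check on these statistics, the train, Valid (+), and Test (+) counts of each benchmark should add up to its "Total triples" value, which implies that the hard negatives in Valid (-) and Test (-) are not counted in the total. A minimal Python sketch, with the counts copied from the table (the names `SPLITS` and `positive_total` are ours, for illustration only):

```python
# Counts copied from the statistics table: (train, valid (+), test (+), total triples).
SPLITS = {
    "CoDEx-S": (32_888, 1_827, 1_828, 36_543),
    "CoDEx-M": (185_584, 10_310, 10_311, 206_205),
    "CoDEx-L": (551_193, 30_622, 30_622, 612_437),
}


def positive_total(train, valid, test):
    """Sum the positive triples across the three splits."""
    return train + valid + test


for name, (train, valid, test, total) in SPLITS.items():
    computed = positive_total(train, valid, test)
    print(f"{name}: {computed:,} positive triples, matches table: {computed == total}")
```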
Note: If you are interested in contributing to the CoDEx corpus, feel free to open an issue or a PR!
Table of contents
- <a href="#quick-start">Quick start</a>
- <a href="#explore">Data exploration and analysis</a>
- <a href="#models">Pretrained models and results</a>
- <a href="#kge">LibKGE setup</a>
- <a href="#scripts">Reproducing our results</a>
- <a href="#lp-script">Link prediction</a>
- <a href="#tc-script">Triple classification</a>
- <a href="#baseline-script">Comparison to FB15k-237</a>
- <a href="#pretrained">Downloading pretrained models via the command line</a>
- <a href="#lp">Link prediction results</a>
- <a href="#s-lp">CoDEx-S</a>
- <a href="#m-lp">CoDEx-M</a>
- <a href="#l-lp">CoDEx-L</a>
- <a href="#tc">Triple classification results</a>
- <a href="#s-tc">CoDEx-S</a>
- <a href="#m-tc">CoDEx-M</a>
- <a href="#data">Data directory structure</a>
- <a href="#entities">Entities and entity types</a>
- <a href="#relations">Relations</a>
- <a href="#triples">Triples</a>
- <a href="#cite">How to cite</a>
- <a href="#ref">References and acknowledgements</a>
<a id="quick-start">Quick start</a>
If you'd like to download the CoDEx data, code, and/or pretrained models locally to your machine, run the following commands. If you only want to play with the data in a remote environment, head to the <a href="#explore">next section on data exploration and analysis</a>, and follow the instructions to view the CoDEx data with Colab.
# clone the repository
git clone https://github.com/tsafavi/codex.git
cd codex
# extract English Wikipedia plain-text excerpts for entities
# other language codes available: ar, de, es, ru, zh
./extract.sh en
# set up a virtual environment and install the Python requirements
python3.7 -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
# finally, install the codex data-loading API
pip install -e .
<a id="explore">Data exploration and analysis</a>
To get familiar with the CoDEx datasets and the data-loading API in an easy-to-use interface, we have provided an exploration Jupyter notebook called Explore CoDEx.ipynb.
You have two options for running the notebook:
- Run on Google Colab: Open the <a href="https://colab.research.google.com/github/tsafavi/codex/blob/master/Explore%20CoDEx.ipynb" target="_blank">notebook on Google's Colab platform</a> and follow the instructions in the first cell to install all the requirements and data remotely. Make sure to restart the Colab runtime after installing the requirements before you run any of the following cells.
- Run locally: Run the following commands to register your virtual environment with JupyterLab and launch JupyterLab:
# run from codex/
python -m ipykernel install --user --name=myenv
jupyter lab
Now, navigate to JupyterLab in your browser and open the Explore CoDEx.ipynb notebook.
<a id="models">Pretrained models and results</a>
<a id="kge">LibKGE setup</a>
To use the pretrained models or run any scripts that involve pretrained models, you will need to set up LibKGE. Run the following:
# run from codex/
# this may take a few minutes
./libkge_setup.sh
This script installs the library inside codex/kge/, downloads the FB15K-237 dataset (which we use in our experiments) to kge/data/, copies each CoDEx dataset to kge/data/, and preprocesses each dataset into the format that LibKGE requires.
<a id="scripts">Reproducing our results</a>
We provide evaluation scripts to reproduce results in our paper. You must have set up LibKGE using the <a href="#kge">instructions we provided</a>.
<a id="lp-script">Link prediction</a>
scripts/lp_gpu.sh and scripts/lp_cpu.sh run link prediction on all models and datasets using the LibKGE evaluation API.
To run on GPU:
# run from codex/
# this may take a few minutes
scripts/lp_gpu.sh # change to lp_cpu.sh to run on CPU
Note that this script first downloads all link prediction models for CoDEx-S, CoDEx-M, and CoDEx-L and saves them to models/link-prediction/codex-{s,m,l}/ if they do not already exist.
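The link prediction results tables in this README report MRR and Hits@k. For readers unfamiliar with these metrics, here is a minimal Python sketch of how they are commonly computed from the 1-based rank of the correct entity for each query. This is not LibKGE's evaluation code; the function names and the example ranks are hypothetical.

```python
def mrr(ranks):
    """Mean reciprocal rank over 1-based ranks of the correct entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for five test queries (not real CoDEx results).
ranks = [1, 2, 4, 10, 50]

print(f"MRR: {mrr(ranks):.3f}")
for k in (1, 3, 10):
    print(f"Hits@{k}: {hits_at_k(ranks, k):.2f}")
```

Both metrics lie in [0, 1], and higher is better. MRR rewards placing the correct entity near the top, while Hits@k only checks whether it falls within the first k positions.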
<a id="tc-script">Triple classification</a>
scripts/tc.sh runs triple classification and outputs validation and test accuracy/F1.
To run:
# run from codex/
# this may take a few minutes
scripts/tc.sh # runs on CPU
Note that this script first downloads all triple classification models on CoDEx-S and CoDEx-M and saves them to models/triple-classification/codex-{s,m}/ if they do not already exist.
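To make the reported metrics concrete, here is a minimal Python sketch of accuracy and F1 for binary triple classification, where label 1 marks a positive triple and 0 a negative one. It is an illustration under our own assumptions, not the logic inside scripts/tc.sh; the function name and the example labels are hypothetical.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1 of the positive class) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical gold labels and predictions for six triples.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```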
<a id="baseline-script">Comparison to FB15k-237</a>
scripts/baseline.sh compares a simple frequency baseline to the best model on CoDEx-M and on the FB15K-237 benchmark.
The results are saved to CSV files named codex.csv and fb.csv, respectively.
To run:
# run from codex/
# this may take a few minutes
scripts/baseline.sh # runs on CPU
Note that this script first downloads the best pretrained LibKGE model on FB15K-237 to models/link-prediction/fb15k-237/rescal/ and the best link prediction model on CoDEx-M to models/link-prediction/codex-m/complex/ if they do not already exist.
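For intuition, one possible reading of a "frequency baseline" is to rank candidate tail entities for a relation by how often they appear as that relation's tail in the training triples. The Python sketch below illustrates that idea only; it is an assumption on our part and may differ from what scripts/baseline.sh actually computes. All entity and relation IDs are made up.

```python
from collections import Counter, defaultdict


def fit_tail_frequencies(train_triples):
    """Count how often each tail occurs with each relation in training."""
    counts = defaultdict(Counter)
    for _head, relation, tail in train_triples:
        counts[relation][tail] += 1
    return counts


def predict_tails(counts, relation, k=3):
    """Return the k most frequent tails for a relation, ignoring the head."""
    return [tail for tail, _ in counts.get(relation, Counter()).most_common(k)]


# Hypothetical (head, relation, tail) training triples.
train = [
    ("e1", "r1", "e9"),
    ("e2", "r1", "e9"),
    ("e3", "r1", "e8"),
    ("e1", "r2", "e7"),
]

counts = fit_tail_frequencies(train)
print(predict_tails(counts, "r1", k=2))
```

A baseline like this never looks at the head entity, so a benchmark on which it performs well is one where relation-level frequency alone carries much of the signal.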
<a id="pretrained">Downloading pretrained models via the command line</a>
To download pretrained models via the command line, use our download_pretrained.py Python script.
The arguments are as follows:
usage: download_pretrained.py [-h]
{s,m,l} {triple-classification,link-prediction}
{rescal,transe,complex,conve,tucker}
[{rescal,transe,complex,conve,tucker} ...]
positional arguments:
{s,m,l} CoDEx dataset to download model(s)
{triple-classification,link-prediction}
Task to download model(s) for
{rescal,transe,complex,conve,tucker}
Model(s) to download for this task
For example, if you want to download the pretrained link prediction models for ComplEx and ConvE on CoDEx-M:
# run from codex/
python download_pretrained.py m link-prediction complex conve
This script will place a checkpoint_best.pt LibKGE checkpoint file in models/link-prediction/codex-m/complex/ and models/link-prediction/codex-m/conve/, respectively.
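The directory layout described above can be expressed as a small helper that builds the expected checkpoint locations and reports whether each file is present. This is a minimal sketch: the helper name `checkpoint_paths` and its `root` parameter are ours and are not part of download_pretrained.py.

```python
from pathlib import Path


def checkpoint_paths(size, task, models, root="models"):
    """Build models/<task>/codex-<size>/<model>/checkpoint_best.pt for each model."""
    return [
        Path(root) / task / f"codex-{size}" / model / "checkpoint_best.pt"
        for model in models
    ]


# Run from codex/ after downloading, e.g. with:
#   python download_pretrained.py m link-prediction complex conve
for path in checkpoint_paths("m", "link-prediction", ["complex", "conve"]):
    status = "found" if path.exists() else "missing"
    print(f"{path.as_posix()}: {status}")
```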
Alternatively, you can download the models manually following the links we provide here.
<a id="lp">Link prediction results</a>
<a id="s-lp">CoDEx-S</a>
| | MRR | Hits@1 | Hits@3 | Hits@10 | Config file | Pretrained model |
|---------|----:|----:|-------:|--------:|------------:|-----------------:|
| RESCAL | 0.404 | 0.293 | 0.4494 | 0.623 | config.yaml | 1vsAll-kl |
| TransE | 0.354 | 0.219 | 0.4218 | 0.634 | config.yaml | NegSamp-kl |
| ComplEx | 0.465 | 0.372 | 0.5038 | 0.646 | config.yaml | 1vsAll-kl |
| ConvE | 0.444 | 0.343 | 0.4926 | 0.635 | config.yaml | 1vsAll-kl |
| TuckER | 0.444 | 0.339 | 0.4975 | 0.638 | config.yaml | KvsAll-kl |
<a id="m-lp">CoDEx-M</a>
| | MRR | Hits@1 | Hits@3 | Hits@10 | Config file | Pretrained model |
|---------|----:|----:|-------:|--------:|------------:|-----------------:|
| RESCAL | 0.317 | 0.244 | 0.3477 | 0.456 | [config.yaml](https://github.com/tsafavi/codex/tree/master/models/l
