Research.lpca
This project analyzes the results of various models for Link Prediction on Knowledge Graphs using Knowledge Graph Embeddings. It allows to replicate the results in our work "Knowledge Graph Embeddings for Link Prediction: A Comparative Analysis" (https://arxiv.org/abs/2002.00819). Principal contributor: ANDREA ROSSI (https://github.com/AndRossi)
Models
We include 16 models representative of various families of architectural choices. For each model we used the best-performing implementation available.
- DistMult
- ComplEx-N3
- ANALOGY
- SimplE
- HolE
- TuckER
- TransE
- STransE
- CrossE
- TorusE
- RotatE
- ConvE
- ConvKB
- ConvR (implementation kindly shared by the authors privately)
- CapsE
- RSN
We also employ the rule-based model AnyBURL as a baseline.
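To give a flavour of how these model families differ, here is a minimal sketch of two classic scoring functions, TransE (translational) and DistMult (bilinear), on illustrative random embeddings. The variable names and dimensions below are ours for illustration, not the repository's API:

```python
import numpy as np

# Illustrative embeddings (not the repository's data): head, relation
# and tail vectors of a single candidate fact, dimension d.
rng = np.random.default_rng(0)
d = 50
h, r, t = rng.normal(size=(3, d))

# TransE: a fact is plausible when h + r lands close to t,
# so the score is the negative translation distance.
score_transe = -np.linalg.norm(h + r - t)

# DistMult: a fact is plausible when the trilinear product is high.
score_distmult = np.sum(h * r * t)
```

Both functions map a candidate triple to a real-valued plausibility score; ranking all candidate tails by this score is what produces the link-prediction metrics compared in the paper.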
Project
Language
The project is completely written in Python 3.
Dependencies
- numpy
- matplotlib
- seaborn
Structure
The project is structured as a set of Python scripts, each of which can be run separately from the others:

- folder `efficiency` contains the scripts to visualize our results on the efficiency of LP models:
  - our findings for training times can be replicated by running script `barchart_training_times.py`;
  - our findings for prediction times can be replicated by running script `barchart_prediction_times.py`.
- folder `effectiveness` contains the scripts to obtain our results on effectiveness:
  - folder `performances_by_peers` contains various scripts that show how the predictive performances of LP models vary depending on the number of source and target peers of test facts;
  - folder `performances_by_paths` contains various scripts that show how the predictive performances of LP models vary depending on the Relational Path Support of test facts;
  - folder `performances_by_relation_properties` contains various scripts that show how the predictive performances of LP models vary depending on the properties of the relations of test facts;
  - folder `performances_by_reified_relation_degree` contains various scripts that show how the predictive performances of LP models vary depending on the degree of the original reified relation in FreeBase.
- folder `dataset_analysis` contains various scripts to analyze the structural properties of the original datasets featured in our analysis (e.g. computing the source peers and target peers for each test fact, or its Relational Path Support, etc.). We share the results we obtained using these scripts in ...

In each of these folders, the scripts to run in order to replicate the results of our paper are contained in the folders named `paper`.
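As an illustration of the kind of output the `efficiency` scripts produce, here is a minimal, self-contained bar-chart sketch. The model names and timings below are made up; the real scripts read their data from the `results` folder:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical training times in hours (illustrative values only).
training_times = {"TransE": 1.2, "DistMult": 0.9, "ConvE": 4.5}

fig, ax = plt.subplots()
ax.bar(list(training_times), list(training_times.values()))
ax.set_ylabel("Training time (hours)")
ax.set_title("Training times by model (illustrative data)")
fig.savefig("barchart_training_times.png")
```

The repository's `barchart_training_times.py` and `barchart_prediction_times.py` follow this pattern, with the measured values from our experiments in place of the hard-coded dictionary.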
We note that:

- In WN18RR, as reported by the authors of the dataset, a small percentage of test facts feature entities not included in the training set, so no meaningful predictions can be obtained for these facts. A few implementations (e.g. Ampligraph, ComplEx-N3) would actively skip such facts in their evaluation pipelines. Since the large majority of systems would keep them, we have all models include them in order to provide the fairest possible setting.
- In YAGO3-10 we observe that a few entities appear in two different versions, depending on HTML escaping policies or on capitalisation. In these cases, models would handle each version as a separate, independent entity; to solve this issue we have performed deduplication manually. The duplicate entities we have identified are:
  - Brighton_&_Hove_Albion_F.C. and Brighton_&amp;_Hove_Albion_F.C.
  - College_of_William_&_Mary and College_of_William_&amp;_Mary
  - Maldon_&_Tiptree_F.C. and Maldon_&amp;_Tiptree_F.C.
  - Alaska_Department_of_Transportation_&_Public_Facilities and Alaska_Department_of_Transportation_&amp;_Public_Facilities
  - Turing_award and Turing_Award
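A minimal sketch of this kind of deduplication, mapping every duplicate surface form to one canonical entity name. The helper function and the mapping below are hypothetical illustrations, not the repository's code:

```python
import html

# Hypothetical mapping for capitalisation duplicates; the HTML-escaping
# duplicates are handled uniformly by html.unescape below.
CANONICAL = {
    "Turing_award": "Turing_Award",
}

def canonicalize(entity: str) -> str:
    """Return a single canonical name for an entity string."""
    # Undo HTML escaping, e.g. "&amp;" -> "&".
    entity = html.unescape(entity)
    # Collapse known capitalisation variants onto one spelling.
    return CANONICAL.get(entity, entity)
```

Applying such a function to every entity occurrence before training ensures that, for instance, both escaped and unescaped spellings of the same club name contribute to a single embedding.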
How to run the project (Linux/MacOS)
1. Open a terminal shell.
2. Create a new folder named `comparative_analysis` in your filesystem by running command: `mkdir comparative_analysis`
3. Download the `datasets` folder and the `results` folder from our storage, and move them into the `comparative_analysis` folder. Be aware that the files to download occupy around 100GB overall.
4. Clone this repository into the same `comparative_analysis` folder with command: `git clone https://github.com/merialdo/research.lpca.git analysis`
5. Open the project in folder `comparative_analysis/analysis` (using a Python IDE is suggested).
   - Access file `comparative_analysis/analysis/config.py` and update the `ROOT` variable with the absolute path of your `comparative_analysis` folder.
   - To replicate the plots and experiments performed in our work, just run the corresponding Python scripts in the `paper` folders mentioned above. By default, these experiments will be run on dataset `FB15K`. To change the dataset on which to run an experiment, just change the value of variable `dataset_name` in the script you wish to launch. Acceptable values are `FB15K`, `FB15K_237`, `WN18`, `WN18RR` and `YAGO3_10`.
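The `config.py` edit amounts to something like the following illustrative excerpt, where the path shown is only an example and must be replaced with your own:

```python
# comparative_analysis/analysis/config.py (illustrative excerpt).
# ROOT must hold the absolute path of your comparative_analysis folder;
# the path below is an example, not a real location.
ROOT = "/home/alice/comparative_analysis"
```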
Please note that the data in folders `datasets` and `results` are required in order to launch most scripts in this repository.
These data can also be obtained by running the various scripts in folder `dataset_analysis`, which we include for the sake of completeness.
The global performances of all models under both the min and avg tie policies can be printed on screen by running the script `print_global_performances.py`.
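For reference, here is a minimal sketch of how min and avg tie policies assign a rank to the correct answer when several candidates receive exactly the same score. The helper below is a hypothetical illustration, not the repository's code:

```python
def rank_with_ties(scores, target_idx, policy="avg"):
    """Rank of the target among all candidates (higher score = better)."""
    target = scores[target_idx]
    better = sum(s > target for s in scores)          # strictly better rivals
    ties = sum(s == target for s in scores) - 1       # tied rivals, minus target
    if policy == "min":
        # Optimistic: the target is placed before all tied rivals.
        return better + 1
    # "avg": the target is placed in the middle of its tie group.
    return better + 1 + ties / 2
```

With scores `[3, 2, 2, 1]` and the correct answer at index 2, the min policy reports rank 2 while the avg policy reports rank 2.5; the gap between the two policies is what exposes models that produce many tied scores.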
