PyRDF2Vec
🐍 Python Implementation and Extension of RDF2Vec
.. raw:: html

   <p align="center">
     <img width="100%" src="assets/embeddings.svg">
   </p>
   <p align="center">
     <a href="https://www.ugent.be/ea/idlab/en">
       <img src="assets/imec-idlab.svg" alt="Logo" width=350>
     </a>
   </p>
   <p align="center">
     <a href="https://pypi.org/project/pyrdf2vec/">
       <img src="https://img.shields.io/pypi/pyversions/pyrdf2vec.svg" alt="Python Versions">
     </a>
     <a href="https://pypi.org/project/pyrdf2vec">
       <img src="https://img.shields.io/pypi/v/pyrdf2vec?logo=pypi&color=1082C2" alt="Version">
     </a>
     <a href="https://pypi.org/project/pyrdf2vec">
       <img src="https://img.shields.io/pypi/dm/pyrdf2vec.svg?logo=pypi&color=1082C2" alt="Downloads">
     </a>
     <a href="https://github.com/IBCNServices/pyRDF2Vec/blob/main/LICENSE">
       <img src="https://img.shields.io/github/license/IBCNServices/pyRDF2vec" alt="License">
     </a>
   </p>
   <p align="center">
     <a href="https://github.com/IBCNServices/pyRDF2Vec/actions">
       <img src="https://github.com/IBCNServices/pyRDF2Vec/workflows/CI/badge.svg" alt="Actions Status">
     </a>
     <a href="https://pyrdf2vec.readthedocs.io/en/latest/?badge=latest">
       <img src="https://readthedocs.org/projects/pyrdf2vec/badge/?version=latest" alt="Documentation Status">
     </a>
     <a href="https://codecov.io/gh/IBCNServices/pyRDF2Vec?branch=main">
       <img src="https://codecov.io/gh/IBCNServices/pyRDF2Vec/coverage.svg?branch=main&precision=2" alt="Coverage Status">
     </a>
     <a href="https://github.com/psf/black">
       <img src="https://img.shields.io/badge/code%20style-black-000000.svg" alt="Code style: black">
     </a>
   </p>
   <p align="center">Python implementation and extension of <a href="http://rdf2vec.org/">RDF2Vec</a> <b>to create a 2D feature matrix from a Knowledge Graph</b> for downstream ML tasks.</p>

.. raw:: html

   <p align="center">
     <img width="100%" src="./assets/header.svg" alt="text">
   </p>

.. rdf2vec-begin

What is RDF2Vec?
RDF2Vec is an unsupervised technique that builds on
`Word2Vec <https://en.wikipedia.org/wiki/Word2vec>`__, where an
embedding is learned per word in one of two ways:

- the word based on its context: Continuous Bag-of-Words (CBOW);
- the context based on a word: Skip-Gram (SG).

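
To make the difference between the two objectives concrete, here is a minimal sketch of the training pairs each one would derive from a single token sequence. The helper names ``skipgram_pairs`` and ``cbow_pairs`` and the ``window`` parameter are illustrative assumptions; they are not part of pyRDF2Vec or of any Word2Vec library.

.. code:: python

   def context_of(sentence, index, window):
       """Return the tokens within `window` positions of `index`, excluding it."""
       start = max(0, index - window)
       stop = min(len(sentence), index + window + 1)
       return [sentence[j] for j in range(start, stop) if j != index]


   def skipgram_pairs(sentence, window=1):
       """Skip-Gram: one (word, context_word) pair per neighbour of each word."""
       return [
           (word, neighbour)
           for i, word in enumerate(sentence)
           for neighbour in context_of(sentence, i, window)
       ]


   def cbow_pairs(sentence, window=1):
       """CBOW: one (context, word) pair per word, with the whole context as input."""
       return [
           (tuple(context_of(sentence, i, window)), word)
           for i, word in enumerate(sentence)
       ]


   # Hypothetical toy sequence, used only for illustration.
   sentence = ["Alice", "knows", "Bob"]
   print(skipgram_pairs(sentence))
   print(cbow_pairs(sentence))

In both cases the model sees the same neighbourhoods; only the direction of the prediction changes.
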
To create this embedding, RDF2Vec first creates "sentences" which can be fed to Word2Vec by extracting walks of a certain depth from a Knowledge Graph.
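
As a rough illustration of this walk extraction, the sketch below builds "sentences" from a tiny hand-written graph using only the standard library. It is not pyRDF2Vec's implementation: the triples, the ``build_adjacency`` and ``random_walk`` helpers, and the fixed seed are all assumptions made for the example.

.. code:: python

   import random

   # Hypothetical toy graph as (subject, predicate, object) triples.
   TRIPLES = [
       ("Alice", "knows", "Bob"),
       ("Alice", "knows", "Dean"),
       ("Dean", "loves", "Alice"),
   ]


   def build_adjacency(triples):
       """Map each subject to its outgoing (predicate, object) hops."""
       adjacency = {}
       for subj, pred, obj in triples:
           adjacency.setdefault(subj, []).append((pred, obj))
       return adjacency


   def random_walk(adjacency, root, depth, rng):
       """Follow at most `depth` random hops from `root`; stop early at a dead end."""
       walk = [root]
       current = root
       for _ in range(depth):
           hops = adjacency.get(current)
           if not hops:
               break
           pred, obj = rng.choice(hops)
           walk.extend([pred, obj])
           current = obj
       return walk


   rng = random.Random(42)  # seeded only to keep the example repeatable
   adjacency = build_adjacency(TRIPLES)
   sentences = [random_walk(adjacency, "Alice", depth=2, rng=rng) for _ in range(3)]
   for sentence in sentences:
       print(" -> ".join(sentence))

Each resulting list alternates entities and predicates, which is the kind of token sequence that can then be passed to Word2Vec as a sentence.
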
This repository contains an implementation of the algorithm in "RDF2Vec:
RDF Graph Embeddings and Their Applications" by Petar Ristoski, Jessica
Rosati, Tommaso Di Noia, Renato De Leone, Heiko Paulheim
(`[paper] <http://semantic-web-journal.net/content/rdf2vec-rdf-graph-embeddings-and-their-applications-0>`__
`[original code] <http://data.dws.informatik.uni-mannheim.de/rdf2vec/>`__).

A `book about RDF2Vec <http://rdf2vec.org/book>`__ by Heiko Paulheim, Jan Portisch, and Petar Ristoski is also available. It is a good introduction to what RDF2Vec is and what can be done with it. The examples in the book use pyRDF2Vec, so it is well worth a look.

.. rdf2vec-end

.. getting-started-begin

Getting Started
For most use cases, this is how pyRDF2Vec is used to generate
embeddings and get literals from a given Knowledge Graph (KG) and a list of entities:

.. code:: python

   import pandas as pd

   from pyrdf2vec import RDF2VecTransformer
   from pyrdf2vec.embedders import Word2Vec
   from pyrdf2vec.graphs import KG
   from pyrdf2vec.walkers import RandomWalker

   # Read a CSV file containing the entities we want to classify.
   data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")
   entities = [entity for entity in data["location"]]
   print(entities)
   # [
   #    "http://dbpedia.org/resource/Belgium",
   #    "http://dbpedia.org/resource/France",
   #    "http://dbpedia.org/resource/Germany",
   # ]

   # Define our knowledge graph (here: DBPedia SPARQL endpoint).
   knowledge_graph = KG(
       "https://dbpedia.org/sparql",
       skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
       literals=[
           [
               "http://dbpedia.org/ontology/wikiPageWikiLink",
               "http://www.w3.org/2004/02/skos/core#prefLabel",
           ],
           ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
       ],
   )

   # Create our transformer, setting the embedding & walking strategy.
   transformer = RDF2VecTransformer(
       Word2Vec(epochs=10),
       walkers=[RandomWalker(4, 10, with_reverse=False, n_jobs=2)],
       # verbose=1
   )

   # Get our embeddings.
   embeddings, literals = transformer.fit_transform(knowledge_graph, entities)
   print(embeddings)
   # [
   #     array([ 1.5737595e-04,  1.1333118e-03, -2.9838676e-04, ..., -5.3064007e-04,
   #             4.3192197e-04,  1.4529384e-03], dtype=float32),
   #     array([-5.9027621e-04,  6.1689125e-04, -1.1987977e-03, ...,  1.1066757e-03,
   #            -1.0603866e-05,  6.6087965e-04], dtype=float32),
   #     array([ 7.9996325e-04,  7.2907173e-04, -1.9482171e-04, ...,  5.6251377e-04,
   #             4.1435464e-04,  1.4478950e-04], dtype=float32)
   # ]

   print(literals)
   # [
   #     [('1830 establishments in Belgium', 'States and territories established in 1830',
   #       'Western European countries', ..., 'Member states of the Organisation
   #       internationale de la Francophonie', 'Member states of the Union for the
   #       Mediterranean', 'Member states of the United Nations'), 0.919],
   #     [('Group of Eight nations', 'Southwestern European countries', '1792
   #       establishments in Europe', ..., 'Member states of the Union for the
   #       Mediterranean', 'Member states of the United Nations', 'Transcontinental
   #       countries'), 0.891],
   #     [('Germany', 'Group of Eight nations', 'Articles containing video clips', ...,
   #       'Member states of the European Union', 'Member states of the Union for the
   #       Mediterranean', 'Member states of the United Nations'), 0.939]
   # ]

If you are using a dataset other than MUTAG (where the entities of interest
have no parents in the KG), it is highly recommended to specify
``with_reverse=True`` (defaults to ``False``) in the walking strategy (e.g.,
``RandomWalker``). This parameter gives Word2Vec a better learning window
for an entity, based on both its parents and its children, which tends to
improve prediction accuracy on test data.

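
The effect of walking in both directions can be sketched with a toy graph and the standard library only. This is an illustration under stated assumptions, not pyRDF2Vec's actual algorithm: the triples and the helpers ``forward_walks``, ``reverse_walks`` and ``walks_with_reverse`` are hypothetical, and walks are limited to a single hop in each direction.

.. code:: python

   # Hypothetical toy graph as (subject, predicate, object) triples.
   TRIPLES = [
       ("Alice", "knows", "Bob"),
       ("Alice", "knows", "Dean"),
       ("Dean", "loves", "Alice"),
   ]


   def forward_walks(root):
       """One-hop walks from root to its children (just [root] if it has none)."""
       walks = [[root, pred, obj] for subj, pred, obj in TRIPLES if subj == root]
       return walks or [[root]]


   def reverse_walks(root):
       """One-hop walks from root's parents to root (just [root] if it has none)."""
       walks = [[subj, pred, root] for subj, pred, obj in TRIPLES if obj == root]
       return walks or [[root]]


   def walks_with_reverse(root):
       """Join every parent-side walk with every child-side walk at the root."""
       return [
           backward[:-1] + forward
           for backward in reverse_walks(root)
           for forward in forward_walks(root)
       ]


   # "Bob" has no children, so forward-only walks carry no context at all.
   print(forward_walks("Bob"))
   # With parents included, the same entity now appears with some context.
   print(walks_with_reverse("Bob"))
   # An entity with both parents and children ends up in the middle of the walk.
   print(walks_with_reverse("Dean"))

In this sketch, an entity with parents but no children only gains context once the reverse direction is included, which is the situation the recommendation above is aimed at.
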
For a more hands-on introduction, we provide a blog post with a tutorial on how to use
pyRDF2Vec `here <https://towardsdatascience.com/how-to-create-representations-of-entities-in-a-knowledge-graph-using-pyrdf2vec-82e44dad1a0>`__.

NOTE: this blog post uses an older version of pyRDF2Vec, so some commands need
to be adapted.

If you run the above snippet, you will not necessarily get the same
embeddings, because the run is not deterministic by default. It is still
possible to make the results reproducible (see the `FAQ <#faq>`__).

Installation
``pyRDF2Vec`` can be installed in three ways:

1. from `PyPI <https://pypi.org/project/pyrdf2vec>`__ using ``pip``:

.. code:: bash

   pip install pyRDF2vec

2. from any compatible Python dependency manager (e.g., ``poetry``):

.. code:: bash

   poetry add pyRDF2vec

3. from source:

.. code:: bash

   git clone https://github.com/IBCNServices/pyRDF2Vec.git
   cd pyRDF2Vec
   pip install .

Introduction
To create embeddings for a list of entities, two things are needed beforehand:

- a KG;
- a walking strategy.

For more elaborate examples, check the `examples <https://github.com/IBCNServices/pyRDF2Vec/blob/main/examples>`__ folder.

If no sampling strategy is defined, ``UniformSampler`` is used. Similarly,
``Word2Vec`` is the default embedding technique.

Use a Knowledge Graph
To use a KG, you can initialize it in three ways:
1. **From an endpoint server using SPARQL**:

.. code:: python

   from pyrdf2vec.graphs import KG

   # Define the DBpedia endpoint server, as well as a set of predicates to
   # exclude from this KG and a list of predicate chains to fetch the literals.
   KG(
       "https://dbpedia.org/sparql",
       skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
       literals=[
           [
               "http://dbpedia.org/ontology/wikiPageWikiLink",
               "http://www.w3.org/2004/02/skos/core#prefLabel",
           ],
           ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
       ],
   )

2. **From a file using RDFLib**:

.. code:: python

   from pyrdf2vec.graphs import KG

   # Define the MUTAG KG, as well as a set of predicates to exclude from
   # this KG and a list of predicate chains to get the literals.
   KG(
       "samples/mutag/mutag.owl",
       skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"},
       literals=[
           [
               "http://dl-learner.org/carcinogenesis#hasBond",
               "http://dl-learner.org/carcinogenesis#inBond",
           ],
           [
               "http://dl-learner.org/carcinogenesis#hasAtom",
               "http://dl-learner.org/carcinogenesis#charge",
           ],
       ],
   )

3. **From scratch**:

.. code:: python

   from pyrdf2vec.graphs import KG, Vertex

   # Each row is a (subject, predicate, object) triple.
   GRAPH = [
       ["Alice", "knows", "Bob"],
       ["Alice", "knows", "Dean"],
       ["Dean", "loves", "Alice"],
   ]
   URL = "http://pyRDF2Vec"
   CUSTOM_KG = KG()

   # Turn every triple into vertices and add it to the KG.
   for row in GRAPH:
       subj = Vertex(f"{URL}#{row[0]}")
       obj = Vertex(f"{URL}#{row[2]}")
       pred = Vertex(f"{URL}#{row[1]}", predicate=True, vprev=subj, vnext=obj)
       CUSTOM_KG.add_walk(subj, pred, obj)

Define Walking Strategies With Their Sampling Strategy
