jRDF2Vec

jRDF2Vec is a Java implementation of <a href="http://rdf2vec.org/">RDF2Vec</a>. It supports multi-threaded, in-memory (or disk-access-based) walk generation and training. You can generate embeddings for any NT, NQ, OWL/XML, RDF HDT, TDB 1, or TTL file.

Found a bug? Don't hesitate to <a href="https://github.com/dwslab/jRDF2Vec/issues">open an issue</a>.

How to cite?

Portisch, Jan; Hladik, Michael; Paulheim, Heiko. RDF2Vec Light - A Lightweight Approach for Knowledge Graph Embeddings. Proceedings of the ISWC 2020 Posters & Demonstrations. 2020. [to appear]

An open-access version of the paper is available here.

How to use the jRDF2Vec Command-Line Interface?

Download this project and execute mvn clean install. Alternatively, you can download the packaged JAR of the latest successful commit <a href="https://github.com/dwslab/jRDF2Vec/tree/jars/jars">here</a>.

System Requirements

Java 8 or later.
Python 3.8 or later with the dependencies described in requirements.txt installed. (Conda users can directly use the environment.yml file.)

You can check if you set up the environment (Python 3 + dependencies) correctly by running:

java -jar jrdf2vec-1.1-SNAPSHOT.jar -checkInstallation

The command line output will list missing requirements or print Installation is ok ✔.

Command-Line Interface (jRDF2Vec CLI) for Training and Walk Generation

Use the resulting jar from the target directory.

Minimal Example

java -jar jrdf2vec-1.1-SNAPSHOT.jar -graph ./kg_file.hdt

Required Parameters

-graph <graph_file> The file containing the knowledge graph for which you want to generate embeddings. The <graph_file> can be any triple file, HDT file, a directory which contains NT files, or a TDB1 directory.

Optional Parameters

jRDF2Vec follows the <a href="https://en.wikipedia.org/wiki/Convention_over_configuration">convention over configuration</a> design paradigm to increase usability. You can overwrite the default values by setting one or more optional parameters.

Parameters for the Walk Configuration

-onlyWalks If added to the call, this switch will deactivate the training part so that only walks are generated. If training parameters are specified, they are ignored. The walk generation also works with the -light parameter.
-light <entity_file> If you intend to use RDF2Vec Light, you have to use this switch followed by the path to the file describing the entities for which you require an embedding space. The file should contain one entity (full URI) per line.
-numberOfWalks <number> (default: 100) The number of walks to be performed per entity.
-depth <depth> (default: 4) This parameter controls the depth of each walk. Depth is defined as the number of hops. Hence, you can also set an odd number. A depth of 1 leads to a sentence in the form <s p o>.
-walkGenerationMode <MID_WALKS | MID_WALKS_DUPLICATE_FREE | RANDOM_WALKS | RANDOM_WALKS_DUPLICATE_FREE> (default for light: MID_WALKS, default for classic: RANDOM_WALKS_DUPLICATE_FREE) This parameter determines the mode for the walk generation (multiple walk generation algorithms are available).
-threads <number_of_threads> (default: (# of available processors) / 2) This parameter allows you to set the number of threads that shall be used for the walk generation as well as for the training.
-walkDirectory <directory where walk files shall be generated/reside> The directory into which the walks shall be generated. In case of -onlyTraining, the directory where the walks reside.
-embedText If added to the call, this switch will also generate walks that contain textual fragments of datatype properties.
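
The effect of -numberOfWalks and -depth can be illustrated with a toy walk generator. The following Python code is a minimal sketch and not the jRDF2Vec implementation: the example graph, the function names, and the plain forward random walk strategy are assumptions made for illustration, and none of the listed walk generation modes is reproduced exactly. It shows that each hop appends one predicate and one object, so a depth of 1 yields a sentence of the form s p o.

```python
import random

# Toy knowledge graph as (subject, predicate, object) triples; purely illustrative.
TRIPLES = [
    ("ex:Berlin", "ex:capitalOf", "ex:Germany"),
    ("ex:Germany", "ex:memberOf", "ex:EU"),
    ("ex:Germany", "ex:hasCurrency", "ex:Euro"),
]


def build_index(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    index = {}
    for s, p, o in triples:
        index.setdefault(s, []).append((p, o))
    return index


def random_walks(index, entity, number_of_walks, depth, rng):
    """Generate `number_of_walks` forward walks of at most `depth` hops.

    Each hop appends one predicate and one object to the walk.
    A walk stops early when the current node has no outgoing edge.
    """
    walks = []
    for _ in range(number_of_walks):
        walk = [entity]
        current = entity
        for _ in range(depth):
            edges = index.get(current)
            if not edges:
                break
            p, o = rng.choice(edges)
            walk.extend([p, o])
            current = o
        walks.append(walk)
    return walks


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the sketch is reproducible
    for walk in random_walks(build_index(TRIPLES), "ex:Berlin", 3, 2, rng):
        print(" ".join(walk))
```

Each printed line corresponds to one sentence that would later be fed to the word2vec training step.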

Parameters for the Training Configuration

-onlyTraining If added to the call, this switch will deactivate the walk generation part so that only the training is performed. The parameter -walkDirectory must be set. If walk generation parameters are specified, they are ignored.
-trainingMode <cbow | sg> (default: sg) This parameter controls the mode to be used for the word2vec training. Allowed values are cbow and sg.
-dimension <size_of_vector> (default: 200) This parameter allows you to control the size of the resulting vectors (e.g. 100 for 100-dimensional vectors).
-minCount <number> (default: 1) This parameter controls the minimum word count for the word2vec training. Unlike in the gensim defaults, this parameter is set to 1 by default because for knowledge graph embeddings, a vector for each node/arc is desired.
-noVectorTextFileGeneration | -vectorTextFileGeneration A switch which indicates whether a text file with the vectors shall be persisted on the disk. This is enabled by default. Use -noVectorTextFileGeneration to disable the file generation.
-sample <rate> (default: 0.0) The threshold that determines which higher-frequency words are randomly downsampled. According to the gensim framework, a useful range is (0, 1e-5).
-window <window_size> (default: 5) The size of the window in the training process.
-epochs <number_of_epochs> (default: 5) The number of epochs to use in training.
-port <port_number> (default: 1808) The port that shall be used for the server.
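
Because every optional parameter has a default, a call only needs to name the values that differ. The following Python helper is a hypothetical sketch of that idea: the function build_command and the DEFAULTS dictionary are not part of jRDF2Vec, and the dictionary covers only the numeric and mode defaults listed in this README. It assembles the argument list for the JAR and skips any flag whose value equals the documented default.

```python
# Defaults as listed in this README; keys mirror the CLI flag names.
DEFAULTS = {
    "numberOfWalks": 100,
    "depth": 4,
    "trainingMode": "sg",
    "dimension": 200,
    "minCount": 1,
    "sample": 0.0,
    "window": 5,
    "epochs": 5,
    "port": 1808,
}


def build_command(graph_file, jar="jrdf2vec-1.1-SNAPSHOT.jar", **overrides):
    """Return the argument list for a jRDF2Vec call (hypothetical helper).

    Only flags whose value differs from the documented default are emitted.
    """
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    command = ["java", "-jar", jar, "-graph", graph_file]
    for name, value in overrides.items():
        if value != DEFAULTS[name]:
            command += [f"-{name}", str(value)]
    return command


if __name__ == "__main__":
    # epochs=5 equals the default and is therefore omitted from the call.
    print(" ".join(build_command("./kg_file.hdt", dimension=100, epochs=5)))
```

The resulting list can be passed to subprocess.run(..., check=True) if the JAR is in the working directory.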

Advanced Parameters

-continue <existing_walk_directory> In some cases, old walks need to be re-used (e.g. if the program was interrupted after 48h). With the -continue option, the walk generation can be continued; this means that old walks will be re-used and only missing walks are generated. This does not work for MID_WALKS (and flavors). If you do not need to generate additional walks use -onlyTraining instead.

Command-Line Interface (jRDF2Vec CLI) - Additional Services

Besides generating walks and training embeddings, the CLI offers additional services which are described below.

Generating a Vector Text File

(1) Full Vocabulary

jRDF2Vec is compatible with the <a href="https://github.com/mariaangelapellegrino/Evaluation-Framework">evaluation framework for KG embeddings (GEval)</a>. The latter framework requires the vectors to be present in a text file. If you have a gensim model or vector file, you can use the following command to generate this file:

java -jar jrdf2vec-1.1-SNAPSHOT.jar -generateTextVectorFile ./path-to-your-model-or-vector-file

You can find the file (named vectors.txt) in the directory where the model/vector file is located. If you want to specify the file name/path yourself, you can use option -newFile <file_path>.

(2) Subset of the Vocabulary

If you want to write a vectors.txt file that contains only a subset of the vocabulary, you can additionally specify the entities of interest using the -light <entity_file> option. The <entity_file> should contain one entity (full URI) per line:

java -jar jrdf2vec-1.1-SNAPSHOT.jar -generateTextVectorFile ./path-to-your-model-or-vector-file -light ./path-to-entity-file

You can find the file (named vectors.txt) in the directory where the model/vector file is located. If you want to specify the file name/path yourself, you can use option -newFile <file_path>. If the vector concepts contain surrounding tags that you want to remove in the process, use option -noTags. This command also works if ./path-to-your-model-or-vector-file is an existing vector text file that shall be reduced.
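
The reduction to a subset can be pictured as a simple line filter. The Python sketch below is an illustration only and not the code jRDF2Vec runs: the function name reduce_vectors is hypothetical, and it assumes that each vector line starts with the concept followed by whitespace-separated components. Tag removal (-noTags) is not covered.

```python
def reduce_vectors(vector_lines, entities):
    """Keep only vector lines whose first token is a requested entity.

    `vector_lines` is an iterable of text lines (concept, then components);
    `entities` is an iterable with one full URI per entry, as in the entity file.
    """
    wanted = {entity.strip() for entity in entities if entity.strip()}
    kept = []
    for line in vector_lines:
        parts = line.split(maxsplit=1)
        if parts and parts[0] in wanted:
            kept.append(line.rstrip("\n"))
    return kept


if __name__ == "__main__":
    vectors = [
        "http://example.org/A 0.1 0.2\n",
        "http://example.org/B 0.3 0.4\n",
    ]
    entity_file_lines = ["http://example.org/B\n"]
    for line in reduce_vectors(vectors, entity_file_lines):
        print(line)
```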

Generating a Vocabulary Text File

jRDF2Vec provides functionality to print all concepts for which a vector has been trained. One word of the vocabulary is printed per line to a file named vocabulary.txt. The model or vector file needs to be specified. If you have a gensim model or vector file, you can use the following command to generate this file:

java -jar jrdf2vec-1.1-SNAPSHOT.jar -generateVocabularyFile ./path-to-your-model-or-vector-file

Converting a Text Vector File

jRDF2Vec generates a vectors.txt file in which each line represents one vector. This is the format also used by GloVe, for instance. In some cases, however, other file formats are required. You can use jRDF2Vec to convert text vector files to other common formats. The vector file does not have to be generated by jRDF2Vec.
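
A file in this one-vector-per-line layout can be read with a few lines of Python. The sketch below assumes the GloVe-style layout, in which the first whitespace-separated token is the concept and the remaining tokens are the vector components. The function names are hypothetical and not part of jRDF2Vec.

```python
def parse_vector_line(line):
    """Split one line into (concept, list of float components)."""
    tokens = line.split()
    return tokens[0], [float(value) for value in tokens[1:]]


def load_vectors(lines):
    """Build a concept -> vector dictionary, skipping blank lines."""
    vectors = {}
    for line in lines:
        if not line.strip():
            continue
        concept, vector = parse_vector_line(line)
        vectors[concept] = vector
    return vectors


if __name__ == "__main__":
    sample = ["http://example.org/A 0.5 -1.0\n", "http://example.org/B 2 3\n"]
    for concept, vector in load_vectors(sample).items():
        print(concept, len(vector))
```

To read a real file, pass an open file handle, for example load_vectors(open("vectors.txt", encoding="utf-8")).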

(1) Converting to w2v Format

To create a word2vec formatted file from the text file, you can use the following command:

java -jar jrdf2vec-1.1-SNAPSHOT.jar -convertToW2V <txt_file_path> <new_file.w2v>
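
For orientation, the textual word2vec format differs from the plain text vector layout mainly by a header line that states the vocabulary size and the dimension. The Python sketch below shows that difference under the assumption that the input uses one whitespace-separated vector per line. The function to_word2vec_text is hypothetical and not the converter that jRDF2Vec uses, and the README does not state whether the generated .w2v file is textual or binary.

```python
def to_word2vec_text(lines):
    """Return the lines of a word2vec text file for the given vector lines.

    The first output line is the header "<vocabulary size> <dimension>".
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    dimension = len(rows[0]) - 1
    for row in rows:
        if len(row) - 1 != dimension:
            raise ValueError(f"inconsistent dimension for {row[0]!r}")
    header = f"{len(rows)} {dimension}"
    return [header] + [" ".join(row) for row in rows]


if __name__ == "__main__":
    for out_line in to_word2vec_text(["a 1 2\n", "b 3 4\n"]):
        print(out_line)
```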

(2) Converting to kv Format

The input file (first parameter) can be in either txt or w2v format. Make sure you use the correct file ending (.txt/.w2v). You can run the command as follows:

java -jar jrdf2vec-1.1-SNAPSHOT.jar -convertToKv <txt_file_path> <new_file.kv>

(3) Converting to Tensorflow Projector Format If you want to vis
