sheXer

This library performs automatic extraction of Shape Expressions (ShEx) or Shapes Constraint Language (SHACL) shapes for a target RDF graph. Please feel free to open an issue in this repository if you find a bug in sheXer or have a feature request.

Citation

To cite this software, please use the following work: Automatic extraction of shapes using sheXer.

If you cannot access the full text of the paper, a preprint is available on ResearchGate.

Please be aware, however, that the capabilities of this software have evolved and improved since the publication of that paper.

Installation

sheXer can be installed using pip:

$ pip install shexer

If you want to install sheXer from source, all its external dependencies are listed in the file requirements.txt. You can also install them all using pip:

$ pip install -r requirements.txt

sheXer includes a package to deploy a web service exposing sheXer with a REST API. If you are not interested in deploying this web service, you do not need to install any Flask-related dependency.

Features

Process huge sources. sheXer does not need to load the whole content of the graph in main memory at any time, so big graphs can be processed on average hardware. Currently this is available only for some input formats: n-triples (choose consts.NT as input_format) and turtle (choose consts.TURTLE_ITER).
Several ways to provide input data, consisting of a target graph and some target shapes. The graph can be provided via raw string content, local/remote file(s), or by retrieving some triples on the fly from a SPARQL endpoint. There are defined interfaces in case you want to implement some other way to provide input information.
Several ways to select your target shapes. You may want to generate shapes for each class in the graph or just for some of them. You may want to generate a shape for some custom node groupings. Or maybe you are extracting some shapes from a big graph and you just want to explore the neighborhood of some seed nodes. For custom node groupings, sheXer supports ShEx's shape map syntax, and it provides configuration params to target different classes or graph depths.
Valid ShEx and SHACL. The produced shapes are compliant with the current specification of ShEx2 and SHACL.
UML. You can also generate UML-like views of the extracted schemas.
rdf-config generation. You can generate rdf-config YAML files as well. Check uses of this technology at the rdf-config repository.
Threshold of tolerance. The constraints inferred for each shape may not be compatible with every node associated with the shape. With this threshold you can indicate the minimum ratio of nodes that should conform with a constraint c. If c does not reach the indicated ratio, its associated information will not appear in the final shape.
Informative comments (just for ShEx, for now). Each inferred constraint is associated with one or more comments. Those comments include different types of information, such as the ratio of nodes that actually conform with a given constraint. You can keep these informative comments or exclude them from the results.
Sorted constraints (just for ShEx, for now). For a given constraint, sheXer keeps the ratio of nodes that conform with it. This is used as a score of trustworthiness. The constraints in a shape are sorted w.r.t. this score.
Literals recognition. All kinds of typed literals are recognized and treated separately when inferring the constraints. If a literal is not explicitly associated with a type in the original KG, xsd:string is used, except that, by default, sheXer tries to infer a numeric type when the untyped literal is a number. Support for some other untyped literals, such as geolocated points, may be included in future releases.
Shapes interlinkage. sheXer detects links between shapes when there is a link between two nodes and both nodes are used to extract some shape. When a triple links to a node that does not belong to any shape, sheXer uses the macro IRI instead.
Special treatment of rdf:type (or the specified instantiation property). When the predicate of a triple is rdf:type, sheXer creates a constraint whose object is a value set containing a single element. This is the actual object of the original triple.
Cardinality management. Some of the triples of a given instance may fit an infinite number of triple constraints with the same predicate and object but different cardinality. For example, if a given instance has a single label specified by rdfs:label, it fits infinitely many triple constraints of the form {rdfs:label xsd:string C}, where C can be any cardinality that allows a single occurrence: {1}, +, {1,2}, {1,3}, {1,4}, ... Currently, sheXer recognises exact cardinalities ({2}, {3}, ...), Kleene closure (*), positive closure (+), and optional cardinality (?).
Inverse paths. sheXer can extract constraints related to incoming links. Shapes are usually described using constraints related to outgoing links, i.e., triples in which the node is the subject. However, sheXer can also extract constraints where the node is the object.
Configurable priority of cardinalities. sheXer can be configured to prioritize the least specific cardinality, or the most specific one if its trustworthiness score is high enough.
Example serialization. sheXer is able to produce outputs that include examples of instances among the input data matching each shape and/or examples of node constraints matching each constraint of each shape. Currently, this feature works only with ShEx outputs.
All compliant mode. You can produce shapes that conform with every instance used to extract them. This is done by using the cardinalities * or ? for every extracted constraint that does not conform with EVERY instance. You may prefer to avoid these cardinalities and keep constraints that may not conform with every instance but capture the most frequent features of the instances. Both settings are available in sheXer.
Management of empty shapes. You may get some shapes with no constraints, either because there were no instances to explore or because the extracted features were not as common as requested with the threshold of tolerance. You can configure sheXer to automatically erase those shapes, and every mention of them, from the results.
Adaptation to Wikidata model. sheXer includes configuration params to handle Wikidata's data model regarding qualifiers, so you can automatically extract the schema of qualifier nodes too. You can also produce content where each Wikidata ID is associated with its label in comments, as sheXer is integrated with wLighter.
Extraction of shapes for federation. You can configure sheXer to extract information from several endpoints whose URIs are connected. sheXer will extract shapes combining information from both ends, which can be helpful for writing federated queries.
Machine-readable annotations. Frequencies, examples, instance counts and, in general, any extra information beyond the shapes and constraints themselves can be provided as machine-readable RDF annotations.
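
The threshold of tolerance and the sorted constraints described above can be pictured with a small, self-contained sketch. This is not sheXer's implementation: score_constraints and the (predicate, node kind) pairs are hypothetical names chosen for illustration. It only shows the idea of computing, for each candidate constraint, the ratio of instances that conform with it, discarding the constraints below the threshold, and sorting the rest by that ratio.

```python
from collections import Counter


def score_constraints(instances, threshold):
    """Return (constraint, ratio) pairs whose ratio reaches the threshold.

    instances: one set of candidate constraints per node of the shape.
    Hypothetical helper for illustration only, not part of sheXer.
    """
    total = len(instances)
    counts = Counter()
    for features in instances:
        # Each instance counts at most once per constraint.
        counts.update(set(features))
    scored = [(constraint, n / total) for constraint, n in counts.items()]
    kept = [(c, ratio) for c, ratio in scored if ratio >= threshold]
    # Highest ratio first; ties are broken alphabetically for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


people = [
    {("ex:name", "xsd:string"), ("ex:age", "xsd:int"), ("ex:occupation", "IRI")},
    {("ex:name", "xsd:string"), ("ex:age", "xsd:int"), ("ex:surname", "xsd:string")},
]

result = score_constraints(people, threshold=0.6)
for constraint, ratio in result:
    print(constraint, ratio)
```

With a threshold of 0.6, only the constraints shared by both nodes survive. Lowering it to 0.5 would also keep the constraints that appear in a single node, placed after the ones with a higher score.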
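
As a rough illustration of the cardinalities listed above ({n}, +, ? and *), the following sketch picks one cardinality that is compatible with every instance, in the spirit of the all compliant mode. It is a simplified assumption about how such a choice could be made, not sheXer's actual algorithm, and infer_cardinality is a hypothetical name.

```python
def infer_cardinality(occurrences):
    """Pick a cardinality that every instance satisfies.

    occurrences: how many times each instance matches the constraint
    (0 when the instance does not match it at all).
    Hypothetical helper for illustration only, not part of sheXer.
    """
    if not occurrences or max(occurrences) == 0:
        raise ValueError("at least one instance must match the constraint")
    lowest, highest = min(occurrences), max(occurrences)
    if lowest == highest:
        # Every instance matches exactly the same number of times.
        return "{%d}" % lowest
    if lowest >= 1:
        # Always present, but a varying number of times.
        return "+"
    if highest == 1:
        # Sometimes absent, never repeated.
        return "?"
    # Sometimes absent, sometimes repeated.
    return "*"


print(infer_cardinality([1, 1]))  # same count everywhere
print(infer_cardinality([1, 3]))  # always present, varying count
print(infer_cardinality([0, 1]))  # optional
print(infer_cardinality([0, 2]))  # optional and repeatable
```

A setting that prefers frequent features over full compliance would instead keep the more specific cardinality and report the ratio of instances that conform with it.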

Experimental results

In the folder experiments, you can see some results of applying this tool over different graphs with different configurations.

Example code

The following code takes the graph in raw_graph and extracts shapes for instances of the classes http://example.org/Person and http://example.org/Gender. The input format is n-triples and the results are serialized in ShExC to the file shaper_example.shex.

from shexer.shaper import Shaper
from shexer.consts import NT, SHEXC, SHACL_TURTLE

target_classes = [
    "http://example.org/Person",
    "http://example.org/Gender"
]

namespaces_dict = {"http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
                   "http://example.org/": "ex",
                   "http://weso.es/shapes/": "",
                   "http://www.w3.org/2001/XMLSchema#": "xsd"
                   }

raw_graph = """
<http://example.org/sarah> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .
<http://example.org/sarah> <http://example.org/age> "30"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://example.org/sarah> <http://example.org/name> "Sarah" .
<http://example.org/sarah> <http://example.org/gender> <http://example.org/Female> .
<http://example.org/sarah> <http://example.org/occupation> <http://example.org/Doctor> .
<http://example.org/sarah> <http://example.org/brother> <http://example.org/Jim> .

<http://example.org/jim> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .
<http://example.org/jim> <http://example.org/age> "28"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://example.org/jim> <http://example.org/name> "Jimbo".
<http://example.org/jim> <http://example.org/surname> "Mendes".
<http://example.org/jim> <http://example.org/gender> <http://example.org/Male> .

<http://example.org/Male> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Gender> .
<http://example.org/Male> <http://www.w3.org/2000/01/rdf-schema#label> "Male" .
<http://example.org/Female> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Gender> .
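<http://example.org/Female> <http://www.w3.org/2000/01/rdf-schema#label> "Female" .
"""

# Build the shaper over the raw n-triples content, targeting the classes above.
# Keyword names follow the sheXer documentation; check them against your installed version.
shaper = Shaper(target_classes=target_classes,
                raw_graph=raw_graph,
                input_format=NT,
                namespaces_dict=namespaces_dict)

# Serialize the extracted shapes in ShExC to the output file.
shaper.shex_graph(output_file="shaper_example.shex",
                  output_format=SHEXC)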
