ShExML
A heterogeneous data mapping language based on Shape Expressions
Shape Expressions Mapping Language (ShExML) is a DSL for mapping and merging heterogeneous data sources. Because it is based on ShEx, the shape is the main building block for defining transformations.
Example
PREFIX : <http://example.com/>
SOURCE films_xml_file <https://rawgit.com/herminiogg/ShExML/master/src/test/resources/films.xml>
SOURCE films_json_file <https://rawgit.com/herminiogg/ShExML/master/src/test/resources/films.json>
ITERATOR film_xml <xpath: //film> {
    FIELD id <@id>
    FIELD name <name>
    FIELD year <year>
    FIELD country <country>
    FIELD directors <directors/director>
}
ITERATOR film_json <jsonpath: $.films[*]> {
    FIELD id <id>
    FIELD name <name>
    FIELD year <year>
    FIELD country <country>
    FIELD directors <director>
}
EXPRESSION films <films_xml_file.film_xml UNION films_json_file.film_json>
:Films :[films.id] {
    :name [films.name] ;
    :year [films.year] ;
    :country [films.country] ;
    :director [films.directors] ;
}
This example shows how to map and merge two files (one in XML, one in JSON) that describe different films. The first part, the
declarations, defines the 'variables' that can be used inside the shapes: the prefixes used in the resulting RDF,
the sources pointing to the files, the iterators and fields (queries) to be applied to those files, and the expressions that merge and transform the query results.
The shapes are then defined as in ShEx, but using the previously defined expressions or composing them inside the
square brackets. A more complex example can be found in the films.shexml file.
Features
- XML support (using XPath queries)
- JSON support (using JSONPath queries)
- CSV and TSV support
- Relational databases, with the following drivers included:
  - MySQL
  - SQLite
  - PostgreSQL
  - MariaDB
  - SQLServer
- Matchers
- Joins
- Named graphs
- Autoincrement ids
All the supported features, with examples, are described in the full ShExML specification.
Usage
CLI
The jar library provides a command-line interface with the following options:
Usage: ShExML [-h] [-id] [-nu] [--parallel] [-V] [-d=<drivers>] [-f=<format>]
              -m=<file> [--nThreads=<numberOfThreads>] [-o=<output>]
              [-p=<password>] [--parallelAspects=<parallelAspects>]
              [-u=<username>] [-pc | -r | -rp | -s | -sm | -sh | -shc]
Map and merge heterogeneous data sources with a Shape Expressions based syntax
  -m, --mapping=<file>       Path to the file with the mappings. If '-' is
                               provided as the path the engine will read from
                               the standard input.
  -h, --help                 Show this help message and exit.
  -V, --version              Print version information and exit.
Options for the transformation to RDF
  -id, --inferenceDatatypes  Use the inference system for choosing the best
                               suited datatype for the generated literal.
                               Without this option, and not declaring a
                               datatype in the mapping rules, all the
                               literals will be outputted as strings
  -nu, --normaliseURIs       Activate the URI normalisation system which
                               allows to avoid malformed URIs when using
                               strings for URI creation
  -f, --format=<format>      Output format for RDF graph. Turtle, RDF/XML,
                               N-Triples, ...
Other transformation options
  -pc, --precompile          Create a single version including all the
                               imported files, useful for debugging purposes.
                               Additionally it checks the input for syntactic
                               and grammatical errors
  -r, --rml                  Generate RML output
  -rp, --rmlPrettified       Generate RML output using Blank nodes for better
                               readability
  -s, --shex                 Generate ShEx validation
  -sm, --shapeMap            Generate Shape Map for ShEx validation
  -sh, --shacl               Generate SHACL validation
  -shc, --shaclClosed        Generate SHACL validation with closed shapes as
                               default
General configuration options applying to all the available transformations
  -o, --output=<output>      Path where the output file should be created
  -u, --username=<username>  Username in case of using a database
  -p, --password=<password>  Password in case of using a database
  -d, --drivers=<drivers>    Add more JDBC database drivers in the form of
                               <startJDBCURL>%<driver> and separating them
                               with ";". Example:
                               jdbc:postgresql%org.postgresql.Driver;jdbc:oracle%oracle.jdbc.OracleDriver
      --parallel             EXPERIMENTAL: Enables the execution of the
                               engine in concurrent mode
      --parallelAspects=<parallelAspects>
                             EXPERIMENTAL: Allows to select the aspects that
                               will be parallelised. The possible options
                               are: "queries", "shapes", or "all".
      --nThreads=<numberOfThreads>
                             EXPERIMENTAL: The number of threads to use in
                               the parallelisation. Default to the number of
                               virtual threads of the processor.
For instance, to run the films example: java -jar shexml.jar -m films.shexml
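Several of the options listed in the help text can be combined in one call. The following is a minimal sketch rather than an official script: the shexml wrapper, the other function names, the SHEXML_JAR variable and the file names are all hypothetical, and it assumes shexml.jar sits in the current directory unless SHEXML_JAR says otherwise. Only flags documented in the help text are used.

```shell
# Hypothetical wrapper so the jar location is set in one place.
# Assumption: shexml.jar is in the current directory unless SHEXML_JAR is set.
shexml() {
  java -jar "${SHEXML_JAR:-shexml.jar}" "$@"
}

# Map to RDF: write Turtle to a file, inferring datatypes (-id)
# and normalising the generated URIs (-nu).
# $1 = mapping file, $2 = output file
films_to_turtle() {
  shexml -m "$1" -f Turtle -o "$2" -id -nu
}

# Translate the mapping rules to RML instead of generating RDF.
# $1 = mapping file, $2 = output file
films_to_rml() {
  shexml -m "$1" -r -o "$2"
}

# Read the mapping rules from standard input by passing '-' as the path.
shexml_from_stdin() {
  shexml -m -
}
```

With these definitions, films_to_turtle films.shexml films.ttl would run the films mapping and write the result to films.ttl, while shexml_from_stdin < films.shexml would feed the same rules through standard input.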
JVM compatible API
ShExML is written in Scala, so it can be used from any JVM compatible language. The example below shows how to use the programmatic API.
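// Read the mapping rules into a string, closing the file afterwards
val source = scala.io.Source.fromFile(pathToFile)
val file = try source.mkString finally source.close()
// Run the mapping and serialise the resulting RDF graph as Turtle
val mappingLauncher = new MappingLauncher()
val output = mappingLauncher.launchMapping(file, "TURTLE")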
Parallelisation
From v0.6.0, the ShExML engine includes an experimental parallel implementation that can run the RDF generation algorithm in parallel
over two aspects: shapes and queries. When shapes are run in parallel, the first shape is always run synchronously
and the remaining shapes are executed in parallel. This avoids identical computations being repeated by every shape, which would
make performance worse than the non-parallel execution. When queries are run in parallel,
only the execution of the final queries against the designated files is parallelised. Parallel execution of shapes and queries
can be combined, which nests the parallel executions. Both the CLI and the JVM compatible API provide configuration options for this,
the latter through the ParallelExecutionConfigurator object. By default, the ShExML engine runs all transformations synchronously.
Warning: Running an algorithm in parallel requires a careful selection of the parts that benefit most from parallel execution. Because ShExML relies on an external set of mapping rules, this decision is left to the end user, who must provide the configuration that best suits the targeted transformation. Be aware that a poor configuration may lead to a longer execution time because of the overheads of running a multi-threaded application, which grow very rapidly when parallel aspects are nested.
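On the CLI side, the experimental parallel mode is configured with the three flags shown in the help text. The sketch below is only an illustration: the function name, the SHEXML_JAR variable, the jar and file names, and any thread count passed in are hypothetical. Whether a given combination is actually faster depends on the mapping, so it is worth timing it against the default synchronous run.

```shell
# Hypothetical helper: run a mapping with the EXPERIMENTAL parallel mode.
# $1 = mapping file, $2 = aspects ("queries", "shapes" or "all"), $3 = threads
# Assumption: shexml.jar is in the current directory unless SHEXML_JAR is set.
shexml_parallel() {
  java -jar "${SHEXML_JAR:-shexml.jar}" -m "$1" \
    --parallel --parallelAspects="$2" --nThreads="$3"
}
```

For example, shexml_parallel films.shexml shapes 4 would parallelise only the shapes, using four threads.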
Requirements
The minimal versions for this software to work are:
- JDK 17 or OpenJDK 17. (Builds targeting earlier JDK versions can be generated by following the Build instructions, or provided upon request.)
- Scala 2.12.20
- SBT 1.11.2
Webpage
A live playground is also offered online (http://shexml.herminiogarcia.com). However, due to hardware limitations it is not intended for intensive use.
Citation
This tool is part of a scientific project which has led to several publications. The main and preferred publication for citation is:
García-González, H., Boneva, I., Staworko, S., Labra-Gayo, J. E., & Lovelle, J. M. C. (2020).
ShExML: improving the usability of heterogeneous data mapping languages for first-time users.
PeerJ Computer Science, 6, e318. https://doi.org/10.7717/peerj-cs.318
Other possible publications per topic are:
- Optimisation of the ShExML engine
García-González, H. (2025). Optimising the ShExML engine through code profiling: From turtle’s pace
to state-of-the-art performance. Semantic Web, (Preprint), 1-30. https://doi.org/10.3233/SW-243736
- Translation from ShExML to RML
García-González, H., & Dimou, A. (2022, September). Why to tie to a single data mapping language?
enabling a transformation from shexml to rml. In Proceedings of Poster and Demo Track and Workshop
Track of the 18th International Conference on Semantic Systems co-located with 18th International
Conference on Semantic Systems (SEMANTiCS 2022) (Vol. 3235, pp. paper-11).
https://ceur-ws.org/Vol-3235/paper11.pdf
- Addressing mapping challenges with ShExML
García-González, H. (2021, June). A ShExML perspective on mapping challenges: alr
