Convert with RML
Use the RDF Mapping Language (RML) to map your structured data (CSV, TSV, XLSX, SPSS, SQL, XML, JSON, YAML) to RDF using a declarative mapping language.
#
Create mapping for a dataset
You can run this command at the root of your repository to generate the dataset mapping files in the datasets folder. You will be prompted to enter some metadata about the dataset to create.
The dataset readme, mappings, metadata, and download files are created in the datasets/$dataset_id folder. Check the download script generated in datasets/$dataset_id/scripts/download.sh and edit it if needed.
We use bash for its performance and reliability with large file downloads, but you are free to use a Python script or other documented methods.
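If you opt for a Python download script instead of bash, a minimal sketch could look like the following. It streams the response to disk in chunks so that large files never have to fit in memory. The function name `download`, the chunk size, and the example URL and path in the comment are illustrative assumptions, not part of the generated d2s scripts.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Stream `url` to `dest` in fixed-size chunks instead of loading it whole."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj reads and writes `chunk_size` bytes at a time
        shutil.copyfileobj(response, out, length=chunk_size)
    return dest


# Hypothetical usage (URL and target path are placeholders):
# download("https://example.org/data.csv", Path("datasets/my_dataset/input/data.csv"))
```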
#
Define mappings
We recommend using YARRRML, a mapping language that makes RML mappings easier to define through a simplified YAML syntax, which is then converted to proper RML.
The Matey web UI lets you write and test RML mappings in YAML files using the YARRRML simplified mapping language. The mappings can be tested in the browser on a sample of the file to transform.
Recommended workflow to create and test RML mappings:
- Use the Matey web UI to write YARRRML mappings, and test them against a sample of your data
- Copy the YARRRML mappings to a file with the extension .yarrr.yml
- Copy the RML mappings to a file with the same name, and the extension .rml.ttl
- Optionally, you can automate the execution in a GitHub Actions workflow.
Specifications
- RML Specifications can be found as a W3C unofficial draft.
- See the rml.io website for more documentation about RML and the various tools built and deployed by Ghent University.
YARRRML package
YARRRML can also be parsed locally or automatically using the yarrrml-parser npm package:
Example of a YARRRML mapping file using the split function on the | character:
grel:p_string_sep separators need to be escaped with \
- See the full list of available default functions.
- Additional functions can be added by packaging them in a .jar file; see the documentation.
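To make the effect of the split function concrete, here is a small Python sketch of what splitting a multi-valued cell on | amounts to: one triple per value. It is not how any RML engine is implemented. It assumes the separator is interpreted as a regular expression, which would explain why | has to be escaped; the function name and the example IRIs are hypothetical.

```python
import re

# Assumption: the separator is treated as a regular expression, so a bare "|"
# (regex alternation) has to be written as "\|" to match a literal pipe.
SEPARATOR = r"\|"


def split_to_triples(subject: str, predicate: str, cell: str) -> list[str]:
    """Emit one N-Triples line per non-empty value found in a multi-valued cell."""
    values = [v.strip() for v in re.split(SEPARATOR, cell) if v.strip()]
    return [f'<{subject}> <{predicate}> "{v}" .' for v in values]


for line in split_to_triples(
    "https://example.org/gene/1", "https://example.org/synonym", "BRCA1|RNF53"
):
    print(line)
```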
Generate nquads
You can also generate N-Quads by adding the graph information in the rr:subjectMap in RML mappings (or just g: in YARRRML):
⚠️ Most RML engines do not support YARRRML by default, so you will need to convert it to RML and use the RML mappings for the conversion.
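As a reminder of what the graph information changes in the output, the following Python sketch formats the same statement as an N-Triples line and as an N-Quads line. The only difference is the fourth term naming the graph. The function name and the IRIs are hypothetical, and only IRI terms are handled to keep the sketch short.

```python
from typing import Optional


def to_statement(subject: str, predicate: str, obj: str, graph: Optional[str] = None) -> str:
    """Format IRIs as an N-Triples line, or as an N-Quads line when a graph is given."""
    terms = [subject, predicate, obj] + ([graph] if graph else [])
    return " ".join(f"<{term}>" for term in terms) + " ."


# Three terms: a triple in the default graph
print(to_statement("https://example.org/s", "https://example.org/p", "https://example.org/o"))
# Four terms: the same statement placed in a named graph
print(to_statement("https://example.org/s", "https://example.org/p", "https://example.org/o",
                   graph="https://example.org/graph/my_dataset"))
```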
#
Tools for RML conversion
There are multiple tools available to generate RDF from RML mappings, with varying efficiency, stability, and features.
- Reference implementation, written in Java. Not suited for large files. Supports custom functions (in Java, compiled as separate .jar files).
- Streaming implementation for large files, written in Scala. Works well for really large CSV files. Can be parallelized on Apache Flink clusters. Does not support custom functions yet.
- Written in Python. Can use a separate tool, Dragoman, for executing custom functions.
- Written in Python. Does not support custom functions.
- Written in JavaScript. Provides an easy way to define custom functions.
We have currently only integrated the rmlmapper-java and the RMLStreamer in d2s, but you are encouraged to use the tool that fits your needs.
#
Convert with the RML Mapper
The rmlmapper-java executes RML mappings to generate RDF Knowledge Graphs.
Not for large files
The RML Mapper loads all data in memory, so take care when working with large datasets.
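The difference between the two processing styles can be illustrated with a short Python sketch. It is a generic illustration, not the code of either tool: `load_all` keeps every row in memory at once, while `stream_rows` hands rows over one at a time. The sample data and function names are hypothetical.

```python
import csv
import io
from typing import Iterator

SAMPLE = "id,label\n1,aspirin\n2,ibuprofen\n"  # hypothetical input data


def load_all(source: io.TextIOBase) -> list[dict]:
    """In-memory approach: every row is held at once, so memory grows with file size."""
    return list(csv.DictReader(source))


def stream_rows(source: io.TextIOBase) -> Iterator[dict]:
    """Streaming approach: rows are yielded one at a time and can be discarded after use."""
    yield from csv.DictReader(source)


print(len(load_all(io.StringIO(SAMPLE))))
print(next(stream_rows(io.StringIO(SAMPLE)))["label"])
```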
- Download the rmlmapper .jar file at https://github.com/RMLio/rmlmapper-java/releases
- Run the RML mapper:
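If you prefer to drive the mapper from a Python script, the invocation could be wrapped as in the sketch below. It relies on the `-m` (mapping file) and `-o` (output file) options of rmlmapper-java; check the help output of the release you downloaded, since options can change between versions. The jar name and file paths are placeholders, and the helper functions are hypothetical.

```python
import subprocess


def build_rmlmapper_command(jar: str, mapping: str, output: str) -> list[str]:
    """Build the java command line: -m is the mapping file, -o the output file."""
    return ["java", "-jar", jar, "-m", mapping, "-o", output]


def run_rmlmapper(jar: str, mapping: str, output: str) -> None:
    """Run the mapper and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_rmlmapper_command(jar, mapping, output), check=True)


# Hypothetical usage (file names are placeholders):
# run_rmlmapper("rmlmapper.jar", "mapping.rml.ttl", "output.nq")
```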
Run automatically in workflow
The RMLMapper can easily be run in GitHub Actions workflows. Check out the Run workflows page for more details.
#
Convert with the RML Streamer
The RMLStreamer is a scalable implementation of the RDF Mapping Language Specifications to generate RDF out of structured input data streams.
Work in progress
The RMLStreamer is still in development; some features, such as functions, are yet to be implemented.
To run the RMLStreamer you have two options:
- Start a single node Apache Flink cluster using docker on your machine.
- Use the DSRI Apache Flink cluster (especially for really large files).
Documentation
Check out the documentation to convert COHD using the RMLStreamer on the DSRI.
#
Prepare files
Copy the RMLStreamer.jar file, your mapping files, and your data files to the Flink jobmanager pod before running it.
For example:
#
Run the RMLStreamer
Example command to run the RMLStreamer from the Flink cluster master:
Check the progress
The progress of the job can be checked in the Apache Flink web UI.
#
Merge and compress output
Merge and compress the N-Triples files produced in parallel by the RMLStreamer:
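For reference, the same merge-and-compress step can be sketched in Python. N-Triples is line-based, so part files can simply be concatenated as long as each one ends with a newline. The function name, the `*.nt` file pattern, and the paths in the usage comment are assumptions; adjust the pattern to the file names your job actually produces.

```python
import gzip
import shutil
from pathlib import Path


def merge_and_compress(parts_dir: Path, output: Path, pattern: str = "*.nt") -> int:
    """Concatenate part files in name order into one gzip file; return how many were merged."""
    parts = sorted(p for p in parts_dir.glob(pattern) if p.is_file())
    with gzip.open(output, "wb") as merged:
        for part in parts:
            with open(part, "rb") as source:
                # Stream each part so large outputs never sit fully in memory
                shutil.copyfileobj(source, merged)
    return len(parts)


# Hypothetical usage (paths are placeholders):
# merge_and_compress(Path("output_parts"), Path("merged.nt.gz"))
```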
#
Copy to your server
Connect to your server over SSH. The http_proxy variable might need to be changed temporarily to access the DSRI.
Reactivate the proxy (EXPORT http_proxy)
#
Preload in GraphDB
Check the generated COHD file on the server at:
Replace wrong triples:
Start preload:
The COHD repository will be created in /data/graphdb-preload/data; copy it to the main GraphDB: