Use the RDF Mapping Language (RML), a declarative mapping language, to map your structured data (CSV, TSV, SQL, XML, JSON, YAML) to RDF.
By default we use YARRRML, a YAML mapping language, to make the definition of RML mappings easier. RML mappings defined using Turtle can also be executed.
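As an illustration, a minimal YARRRML mapping could look like the following sketch (the file path, prefix, and properties are hypothetical, not taken from the d2s example datasets):

```yaml
prefixes:
  ex: "https://example.com/ns#"

mappings:
  person:
    # Source file and its format (CSV here); the path is a placeholder
    sources:
      - ['data/person.csv~csv']
    # Subject IRI built from the "id" column
    s: ex:person_$(id)
    # Predicate-object pairs populated from the other columns
    po:
      - [a, ex:Person]
      - [ex:name, $(name)]
```

The YARRRML parser translates this YAML into the equivalent RML Turtle triples before execution.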
The following documentation uses the COHD clinical CSV data and a GeoNames TSV dataset as examples. Download the datasets, if not already done:
Downloaded files go to
The rmlmapper-java executes RML mappings to generate RDF Knowledge Graphs. It loads all the data in memory, so be aware of this when working with large datasets.
d2s rml executes RML mappings defined in the RDF Turtle format, in files with the extension .rml.ttl.
Output goes to
You can execute YARRRML mappings, defined in files with the extension .yarrr.yml, by providing the argument
If you face memory issues, you might need to increase the maximum memory allocated to Java (-Xmx), or try the RMLStreamer documented below.
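For instance, when invoking the mapper manually, the maximum heap size can be raised like this (the JAR and file names are placeholders; the -m and -o flags follow the rmlmapper-java CLI):

```shell
# Allocate up to 8 GB of heap to the JVM before executing the mappings
java -Xmx8g -jar rmlmapper.jar -m mapping.rml.ttl -o output.nt
```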
⚠️ The RMLStreamer is still in development; some features, such as functions, are yet to be implemented.
The RML mappings need to be defined in a file with the extension .rml.ttl, in the mapping folder of the dataset to transform, e.g.
Starting Apache Flink is required to stream the files:
Access http://localhost:8078 to see the running jobs.
Output goes to workspace/import/rmlstreamer-associations-mapping_rml_ttl-cohd.nt and can then be loaded into a triplestore.
You can also provide YARRRML files with the extension .yarrr.yml, to be processed into .rml.ttl files before running RML:
The command runs detached by default; you can keep the terminal attached and watch the execution:
Generate N-Quads by adding the graph information to the rr:subjectMap in the RML mappings:
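For example, a rr:graphMap can be nested in the rr:subjectMap as follows (the template and graph IRIs are illustrative only, not the actual d2s ones):

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .

<#AssociationsMapping>
  rr:subjectMap [
    rr:template "https://w3id.org/d2s/association/{id}" ;
    # Every triple produced by this TriplesMap goes into the named graph below
    rr:graphMap [ rr:constant <https://w3id.org/d2s/graph/cohd> ]
  ] .
```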
Still experimental: the RMLStreamer can also be run on the Data Science Research Infrastructure (DSRI) OpenShift cluster.
See the DSRI documentation to deploy Apache Flink.
Copy the RMLStreamer.jar file, your mapping files, and your data files to the pod. This will be proposed when running d2s rml, but they can also be loaded manually beforehand.
Transferring the files to the Apache Flink storage easily is still a work in progress. Use oc cp if rsync does not work.
- Run the RMLStreamer job on the GeoNames example:
The progress of the job can be checked in the Apache Flink web UI.
Output file in /mnt/rdf_output-associations-mapping.nt in the pod, /apache-flink in the persistent storage.
The Matey Web UI editor 🦜 is available to easily write RML mappings in YAML files using the YARRRML simplified mapping language. The mappings can be conveniently tested in the browser on a sample of the file to transform.
The RML specification is available as a W3C unofficial draft.
See the rml.io website for more documentation about RML and the various tools built and deployed by Ghent University.
YARRRML can also be parsed locally using an npm package:
Similar to Matey, the RMHell web UI editor allows you to write YARRRML or RML and test it against a sample input file in the browser.
Mapeathor converts Excel mappings into R2RML, RML, or YARRRML mappings. Functions are not supported.
Run Mapeathor locally:
Make sure the .xlsx file is in the mapping folder of the dataset, then execute the conversion to YARRRML:
Output format can be
The RML file will be generated in
RML functions are not yet implemented in the RMLStreamer; use the RML Mapper if you want to make use of them. See the full list of available default functions.
Example using the split function:
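A sketch of such a mapping in YARRRML, following the GREL function ontology (the predicate and column names are made up for illustration):

```yaml
po:
  - predicates: ex:category
    objects:
      function: grel:string_split
      parameters:
        # The value to split, and the separator to split on
        - [grel:valueParameter, $(categories)]
        # The pipe separator is escaped with a backslash
        - [grel:p_string_sep, "\|"]
```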
grel:p_string_sep separators need to be escaped with
Additional functions can be added by packaging them in a .jar file; see the documentation.
After the RDF Knowledge Graph has been generated and loaded into a triplestore, HCLS descriptive metadata and statistics can easily be computed and inserted for the different datasets (graphs in the triplestore) by running a CWL workflow: