Run RML transformations
Use the RDF Mapping Language (RML) to map your structured data (CSV, TSV, SQL, XML, JSON, YAML) to RDF using a declarative mapping language.
By default we use YARRRML, a YAML mapping language, to make the definition of RML mappings easier. RML mappings defined using Turtle can also be executed.
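As a sketch of what a YARRRML mapping looks like (the source file name, column names, and predicate IRIs below are illustrative):

```yaml
prefixes:
  ex: "https://example.org/"
mappings:
  concepts:
    sources:
      - ['cohd_sample.csv~csv']                     # illustrative input file
    s: https://example.org/concept/$(concept_id)    # subject built from a CSV column
    po:
      - [a, ex:Concept]
      - [ex:name, $(concept_name)]                  # predicate-object from another column
```

The YARRRML parser expands this YAML shorthand into the equivalent Turtle RML triples maps before execution.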
Download files to convert
The following documentation uses the COHD clinical CSV data and a GeoNames TSV dataset as examples. Download the datasets if you have not already done so:
Downloaded files go to `workspace/input/cohd`.
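As an illustration, the input folder can be prepared and a file fetched with standard tools (the URL below is a placeholder, not the real dataset location):

```shell
# Create the input folder expected by d2s (path from the docs above)
mkdir -p workspace/input/cohd
# Placeholder URL: replace with the actual COHD sample dataset location
wget -P workspace/input/cohd https://example.org/cohd_sample.csv
```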
Run RML Mapper
The rmlmapper-java tool executes RML mappings to generate RDF knowledge graphs. It loads all data in memory, so be careful when working with large datasets.
By default, `d2s rml` executes RML mappings defined in the RDF Turtle format, in files with the extension `.rml.ttl` (e.g. `datasets/<dataset_id>/mapping/associations-mapping.rml.ttl`).
Output goes to `workspace/import/rmlmapper-associations-mapping_rml_ttl-cohd.nt`.
You can execute YARRRML mappings defined in files with the extension `.yarrr.yml` by providing the argument `--yarrrml`:
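A sketch of the two invocations, assuming the `cohd` dataset id used throughout these docs (the exact `d2s rml` argument order is an assumption):

```shell
# Execute the Turtle RML mappings (.rml.ttl) for the cohd dataset
d2s rml cohd
# Execute the YARRRML mappings (.yarrr.yml) instead
d2s rml cohd --yarrrml
```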
If you face memory issues, you may need to increase the maximum memory allocated to Java (`-Xmx`), or try the RMLStreamer documented below.
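For example, when running the RML Mapper directly with Java (the file names here are illustrative), the heap size can be raised like this:

```shell
# Allocate up to 8 GB of heap to the JVM; -m and -o are the rmlmapper
# flags for the mapping file and the output file
java -Xmx8g -jar rmlmapper.jar -m associations-mapping.rml.ttl -o output.nt
```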
Run RML Streamer
The RMLStreamer is a scalable implementation of the RML specification that generates RDF from structured input data streams.
⚠️ The RMLStreamer is still in development; some features, such as functions, are yet to be implemented.
The RML mappings need to be defined in a file with the extension `.rml.ttl`, in the mapping folder of the dataset to transform, e.g. `datasets/<dataset_id>/mapping/associations-mapping.rml.ttl`.
Start Apache Flink
Starting Apache Flink is required to stream the files:
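If you run a standalone Apache Flink distribution (d2s may provide its own way to start Flink, e.g. via Docker), the standard cluster scripts look like this:

```shell
# From the root of an Apache Flink distribution:
./bin/start-cluster.sh
# Stop it later with:
# ./bin/stop-cluster.sh
```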
Access the web UI at http://localhost:8078 to see the running jobs.
Run job
We provide an example converting a sample of COHD (clinical concept co-occurrences from FDA reports) to the BioLink model:
Output goes to `workspace/import/rmlstreamer-associations-mapping_rml_ttl-cohd.nt` and can then be loaded into a triplestore.
You can also provide YARRRML files with the extension `.yarrr.yml`; they will be converted to `.rml.ttl` files before running RML:
The command runs detached by default, but you can keep the terminal attached and watch the execution:
Generate N-Quads by adding the graph information to the rr:subjectMap in the RML mappings:
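In R2RML/RML, the named graph is declared with a rr:graphMap inside the rr:subjectMap. A minimal sketch (the triples map name, template, and graph IRI are illustrative):

```turtle
<#AssociationsMapping>
  rr:subjectMap [
    rr:template "https://w3id.org/d2s/association/{id}" ;
    # rr:graphMap places all triples from this subject map into a named graph,
    # so the output is N-Quads instead of N-Triples
    rr:graphMap [ rr:constant <https://w3id.org/d2s/graph/cohd> ]
  ] .
```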
Run on OpenShift#
Still experimental, the RMLStreamer can be run on the Data Science Research Infrastructure OpenShift cluster.
See the DSRI documentation to deploy Apache Flink.
Copy the RMLStreamer.jar file, your mapping files and data files to the pod. It will be proposed when running
d2s rmlbut they could be loaded manually before.
Transferring the files to the Apache Flink storage easily is still a work in progress. Use
oc cpif rsync does not work.
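For example, with the OpenShift client (the pod name below is a placeholder; replace it with your Flink job manager pod):

```shell
# List the pods to find the Flink job manager
oc get pods
# Copy the RMLStreamer jar and a mapping file into the pod (placeholder pod name)
oc cp RMLStreamer.jar flink-jobmanager-xxxx:/mnt/
oc cp datasets/cohd/mapping/associations-mapping.rml.ttl flink-jobmanager-xxxx:/mnt/
```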
- Run the RMLStreamer job on the GeoNames example:
The progress of the job can be checked in the Apache Flink web UI.
The output file is in `/mnt/rdf_output-associations-mapping.nt` in the pod, or in `/apache-flink` in the persistent storage.
Edit RML mappings
YARRRML Matey web editor
The Matey web UI editor 🦜 makes it easy to write RML mappings in YAML files using the YARRRML simplified mapping language. The mappings can conveniently be tested in the browser on a sample of the file to transform.
The RML specification is available as a W3C unofficial draft.
See the rml.io website for more documentation about RML and the various tools built and deployed by Ghent University.
YARRRML can also be parsed locally using an npm package:
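A sketch using the `@rmlio/yarrrml-parser` package from the rml.io tooling (the file names are illustrative):

```shell
# Install the YARRRML parser globally from npm
npm install -g @rmlio/yarrrml-parser
# Convert a YARRRML file to Turtle RML
yarrrml-parser -i associations-mapping.yarrr.yml -o associations-mapping.rml.ttl
```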
RMHell web editor
Similar to Matey, the RMHell web UI editor allows you to write YARRRML or RML and test it against a sample input file in the browser.
Mapeathor
Mapeathor converts Excel mappings into R2RML, RML, or YARRRML mappings. Functions are not supported.
Run Mapeathor locally:
Make sure the xlsx file is in the mapping folder of the dataset, then execute the conversion to YARRRML:
The output format can be RML, R2RML, or YARRRML.
The RML file will be generated in `workspace/mapeathor`.
Using functions
RML functions are not yet implemented in the RMLStreamer; use the RML Mapper if you want to make use of them. See the full list of available default functions.
Example using the split function:
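A YARRRML sketch of the GREL split function (the predicate and CSV column are illustrative; grel:string_split and its parameters come from the default function set):

```yaml
po:
  - p: ex:keyword
    o:
      function: grel:string_split
      parameters:
        - [grel:valueParameter, $(keywords)]  # illustrative CSV column
        - [grel:p_string_sep, "\\;"]          # separator, escaped with \
```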
The grel:p_string_sep separator needs to be escaped with `\`.
Additional functions can be added by packaging them in a .jar file; see the documentation.
Compute HCLS metadata
After the RDF knowledge graph has been generated and loaded into a triplestore, HCLS descriptive metadata and statistics can easily be computed and inserted for the different datasets (graphs in the triplestore) by running a CWL workflow:
- Insert the dataset metadata defined in the datasets/cohd/metadata folder.
- Compute and insert HCLS descriptive statistics using SPARQL queries.
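As a sketch of the kind of SPARQL Update query used for the statistics (the graph and metadata IRIs are illustrative; the HCLS profile builds on VoID properties such as void:triples):

```sparql
PREFIX void: <http://rdfs.org/ns/void#>
INSERT {
  GRAPH <https://w3id.org/d2s/metadata> {
    <https://w3id.org/d2s/graph/cohd> void:triples ?count .
  }
}
WHERE {
  SELECT (COUNT(*) AS ?count)
  WHERE { GRAPH <https://w3id.org/d2s/graph/cohd> { ?s ?p ?o . } }
}
```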