Run RML transformations

Use the RDF Mapping Language (RML), a declarative mapping language, to map your structured data (CSV, TSV, SQL, XML, JSON, YAML) to RDF.

By default we use YARRRML, a YAML-based mapping language, to make defining RML mappings easier. RML mappings defined in Turtle can also be executed.
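
As an illustration, a minimal YARRRML mapping could look like the following sketch; the source file, column names, and URIs are hypothetical (a complete GeoNames example is shown further below):

prefixes:
  ex: "https://w3id.org/d2s/example/"
mappings:
  persons:
    # hypothetical CSV source with columns: id, name
    sources:
      - ['data/persons.csv~csv']
    s: ex:person/$(id)
    po:
      - [a, ex:Person]
      - [ex:name, $(name)]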

Download files to convert

The following documentation uses the COHD clinical CSV data and a GeoNames TSV dataset as examples. Download the datasets, if not already done:

d2s download cohd geonames

See the download Bash scripts for COHD and GeoNames.

Downloaded files go to workspace/input/cohd

Run RML Mapper

The rmlmapper-java executes RML mappings to generate RDF knowledge graphs. It loads all the data in memory, so be careful when working with large datasets.

By default, d2s rml executes RML mappings defined in the RDF Turtle format, in files with the extension .rml.ttl (e.g. datasets/<dataset_id>/mapping/associations-mapping.rml.ttl):

d2s rml cohd --mapper

Output goes to workspace/import/rmlmapper-associations-mapping_rml_ttl-cohd.nt
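
To give an idea of what an .rml.ttl file contains, here is a minimal hand-written triples map; the source path, column names, and URIs below are illustrative and not taken from the actual COHD mappings:

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Map each row of an illustrative CSV file to a subject with a label
<#ExampleMapping>
  rml:logicalSource [
    rml:source "/mnt/workspace/input/cohd/example.csv" ;
    rml:referenceFormulation ql:CSV
  ] ;
  rr:subjectMap [
    rr:template "https://w3id.org/d2s/concept/{concept_id}"
  ] ;
  rr:predicateObjectMap [
    rr:predicate rdfs:label ;
    rr:objectMap [ rml:reference "concept_name" ]
  ] .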

You can execute YARRRML mappings defined in files with the extension .yarrr.yml by providing the --yarrrml argument:

d2s rml geonames --yarrrml --mapper

If you face memory issues, you might need to increase the maximum memory allocated to Java (-Xmx), or try using the RMLStreamer documented below.
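
For example, if you run the rmlmapper JAR directly rather than through d2s, the maximum heap size can be raised like this (the JAR name and paths are illustrative):

# allocate up to 8 GB of heap to the RML Mapper
java -Xmx8g -jar rmlmapper.jar -m datasets/cohd/mapping/associations-mapping.rml.ttl -o workspace/import/output.nt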

Run RML Streamer

The RMLStreamer is a scalable implementation of the RDF Mapping Language specification that generates RDF from structured input data streams.

⚠️ The RMLStreamer is still in development; some features, such as functions, are yet to be implemented.

The RML mappings need to be defined in a file with the extension .rml.ttl, in the mapping folder of the dataset to transform, e.g. datasets/<dataset_id>/mapping/associations-mapping.rml.ttl

Start Apache Flink

Starting Apache Flink is required to stream the files:

d2s start rmlstreamer rmltask

Access the web UI at http://localhost:8078 to see the running jobs.
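
The running jobs can also be listed from the command line through the Flink REST API, assuming it is exposed on the same port as the web UI:

# list submitted Flink jobs and their status
curl http://localhost:8078/jobs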

Run job

We provide an example converting a sample of COHD (clinical concept co-occurrences from FDA reports) to the BioLink model:

d2s rml cohd

Output goes to workspace/import/rmlstreamer-associations-mapping_rml_ttl-cohd.nt and can then be loaded into a triplestore.
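
As a sketch, assuming a triplestore that exposes the SPARQL 1.1 Graph Store HTTP Protocol (the endpoint URL and graph URI below are hypothetical), the N-Triples file could be loaded with curl:

# POST the generated N-Triples into a named graph of the triplestore
curl -X POST -H "Content-Type: application/n-triples" \
  --data-binary @workspace/import/rmlstreamer-associations-mapping_rml_ttl-cohd.nt \
  "http://localhost:3030/ds/data?graph=https://w3id.org/d2s/graph/cohd"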

You can also provide YARRRML files with the extension .yarrr.yml; they will be converted to .rml.ttl files before running the RML transformation:

d2s rml geonames --yarrrml

The command runs detached by default; you can keep the terminal attached and watch the execution:

d2s rml cohd --watch

Generate N-Quads by adding the graph information to the rr:subjectMap in the RML mappings:

rr:graphMap [ rr:constant <https://w3id.org/trek/graph/drugbank> ];
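
In context, the graph map sits inside the subject map of a triples map; the subject template below is only illustrative:

rr:subjectMap [
    rr:template "https://w3id.org/trek/association/{id}" ;
    rr:graphMap [ rr:constant <https://w3id.org/trek/graph/drugbank> ]
] ;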

Run on OpenShift

Still experimental: the RMLStreamer can be run on the Data Science Research Infrastructure (DSRI) OpenShift cluster.

  • See the DSRI documentation to deploy Apache Flink.

  • Copy the RMLStreamer.jar file, your mapping files, and your data files to the pod. This is proposed when running d2s rml, but the files can also be copied manually beforehand:

# get flink pod id
oc get pod --selector app=flink --selector component=jobmanager --no-headers -o=custom-columns=NAME:.metadata.name
oc exec <flink-jobmanager-id> -- mkdir -p /mnt/workspace/import
oc rsync workspace/input <flink-jobmanager-id>:/mnt/workspace/
oc rsync datasets <flink-jobmanager-id>:/mnt/

Making it easier to transfer the files to the Apache Flink storage is still a work in progress. Use oc cp if oc rsync does not work.

  • Run the RMLStreamer job on the GeoNames example:
d2s rml geonames --openshift

The progress of the job can be checked in the Apache Flink web UI.

The output file is written to /mnt/rdf_output-associations-mapping.nt in the pod, or to /apache-flink in the persistent storage.

Edit RML mappings

YARRRML Matey web editor

The Matey web UI editor 🦜 lets you easily write RML mappings as YAML files using the simplified YARRRML mapping language. The mappings can be conveniently tested in the browser on a sample of the file to transform.

The RML specification is available as an unofficial W3C draft.

See the rml.io website for more documentation about RML and the various tools built and deployed by Ghent University.

YARRRML can also be parsed locally using an npm package:

npm i @rmlio/yarrrml-parser -g
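
Once installed, a typical invocation converts a YARRRML file to RML Turtle (the paths below follow the layout used above):

yarrrml-parser -i datasets/<dataset_id>/mapping/associations-mapping.yarrr.yml -o datasets/<dataset_id>/mapping/associations-mapping.rml.ttl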

RMHell web editor

Similar to Matey, the RMHell web UI editor allows you to write YARRRML or RML mappings and test them against a sample input file in the browser.

Mapeathor

Mapeathor converts Excel mappings into R2RML, RML, or YARRRML mappings. Functions are not supported.

Run Mapeathor locally:

d2s start mapeathor

Make sure the .xlsx file is in the mapping folder of the dataset, then execute the conversion to YARRRML:

docker exec -it mapeathor ./run.sh /Mapeathor/data/<dataset_id>/mapping/<mapping_spreadsheet>.xlsx YARRRML

The output format can be RML, R2RML, or YARRRML.

The generated mapping file will be written to workspace/mapeathor

Using functions

RML functions are not yet implemented in the RMLStreamer; use the RML Mapper if you want to make use of them. See the full list of available default functions.

Example using the split function:

prefixes:
  grel: "http://users.ugent.be/~bjdmeest/function/grel.ttl#"
  rdfs: "http://www.w3.org/2000/01/rdf-schema#"
  gn: "http://www.geonames.org/ontology#"
mappings:
  neighbours:
    sources:
      - ['/mnt/workspace/input/geonames/dataset-geonames-countryInfo.csv~csv']
    s: http://www.geonames.org/ontology#$(ISO)
    po:
      - [a, gn:Country]
      - p: gn:neighbours
        o:
          function: grel:string_split
          parameters:
            - [grel:valueParameter, $(neighbours)]
            - [grel:p_string_sep, "\|"]
          language: en

Separators in grel:p_string_sep need to be escaped with \, as in "\|" above.

Additional functions can be added by integrating them in a .jar file; see the documentation.

Compute HCLS metadata

After the RDF knowledge graph has been generated and loaded into a triplestore, HCLS descriptive metadata and statistics can be computed and inserted for the different datasets (graphs in the triplestore) by running a CWL workflow:

d2s run compute-hcls-metadata.cwl cohd
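
The inserted metadata follows the HCLS dataset description profile, which builds on VoID and Dublin Core terms; the snippet below only illustrates the kind of statistics produced, with a hypothetical graph URI and made-up values:

@prefix void: <http://rdfs.org/ns/void#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# illustrative statistics for one named graph in the triplestore
<https://w3id.org/d2s/graph/cohd> a void:Dataset ;
    dct:title "COHD" ;
    void:triples "5000000"^^xsd:integer ;
    void:distinctSubjects "200000"^^xsd:integer ;
    void:properties "42"^^xsd:integer .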