Define RML mappings

Use the RDF Mapping Language (RML) to declaratively map your structured data (CSV, TSV, XLSX, SPSS, SQL, XML, JSON, YAML) to RDF.

Define mappings#

We recommend using YARRRML, a mapping language that lets you write RML rules in YAML instead of RDF, to make RML mappings easier to define.

The Matey web UI 🦜 lets you write RML mappings as YAML files using the simplified YARRRML mapping language, and test them in the browser against a sample of the file to transform.

Recommended workflow to easily create and test RML mappings:

  1. Use the Matey web UI 🦜 to write YARRRML mappings, and test them against a sample of your data
  2. Copy the YARRRML mappings to a file with the extension .yarrr.yml
  3. Copy the generated RML mappings to a file with the same name, and the extension .rml.ttl
  4. Optionally you can automate the execution in a GitHub Actions workflow.
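
Steps 2 and 3 of the workflow above can also be scripted. The following is a minimal sketch, assuming the yarrrml-parser CLI is installed globally and that its -o option writes the generated RML to a file. The function names rml_name_for and convert_mapping are hypothetical helpers, not part of any tool.

```shell
# Derive the RML file name from a YARRRML file name:
# mappings.yarrr.yml -> mappings.rml.ttl
rml_name_for() {
  local yarrrml_file="$1"
  echo "${yarrrml_file%.yarrr.yml}.rml.ttl"
}

# Hypothetical helper: convert one YARRRML file to an RML file next to it.
# Assumes the yarrrml-parser CLI is available on the PATH.
convert_mapping() {
  local yarrrml_file="$1"
  if ! command -v yarrrml-parser >/dev/null 2>&1; then
    echo "yarrrml-parser not found, install it with npm first" >&2
    return 1
  fi
  yarrrml-parser -i "$yarrrml_file" -o "$(rml_name_for "$yarrrml_file")"
}
```

For example, convert_mapping mappings.yarrr.yml would be expected to produce mappings.rml.ttl in the same folder.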
Specifications
  • RML Specifications can be found as a W3C unofficial draft.
  • See the rml.io website for more documentation about RML and the various tools built and deployed by Ghent University.
YARRRML package

YARRRML can also be parsed locally or automatically using the yarrrml-parser npm package:

npm i @rmlio/yarrrml-parser -g
yarrrml-parser -i mappings.yarrr.yml

Example of a YARRRML mapping file using the split function on the | character:

prefixes:
  grel: "http://users.ugent.be/~bjdmeest/function/grel.ttl#"
  rdfs: "http://www.w3.org/2000/01/rdf-schema#"
  gn: "http://www.geonames.org/ontology#"
mappings:
  neighbours:
    sources:
      - ['countries.csv~csv']
    s: http://www.geonames.org/ontology#$(ISO)
    po:
      - [a, gn:Country]
      - p: gn:neighbours
        o:
          function: grel:string_split
          parameters:
            - [grel:valueParameter, $(neighbours)]
            - [grel:p_string_sep, "\|"]
          language: en
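
To try this mapping on a sample, a small input file is enough. The sketch below assumes a countries.csv with the two columns the mapping references (ISO and neighbours). The rows are illustrative sample data, and preview_split is a hypothetical helper that only previews how a value breaks apart on the | character. It does not run the mapping itself.

```shell
# Write a small sample of the data (illustrative rows only)
cat > countries.csv <<'EOF'
ISO,neighbours
BE,FR|NL|DE|LU
PT,ES
EOF

# Hypothetical helper: preview the items obtained by splitting
# the neighbours column on "|" for each row (header skipped)
preview_split() {
  tail -n +2 "$1" | while IFS=, read -r iso neighbours; do
    echo "$neighbours" | tr '|' '\n' | sed "s/^/$iso -> /"
  done
}

preview_split countries.csv
```

Each printed line corresponds to one value that the split would be expected to yield as a separate gn:neighbours object for that country.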
Generate nquads

You can also generate N-Quads by adding the graph information to the rr:subjectMap in RML mappings (or just g: in YARRRML):

rr:graphMap [ rr:constant <https://w3id.org/d2s/graph> ];

Convert with the RML Mapper#

The rmlmapper-java executes RML mappings to generate RDF Knowledge Graphs.

Not for large files

The RML Mapper loads all data in memory, so it can run out of memory when working with big datasets.

  1. Download the rmlmapper .jar file at https://github.com/RMLio/rmlmapper-java/releases
  2. Run the RML mapper:
java -jar rmlmapper.jar -m mapping.ttl -o rdf-output.nt
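
Because the mapper loads all data in memory, it can help to check the input size before running it and to raise the JVM heap limit. This is a minimal sketch: the 1 GB threshold, the 4g heap size and the function names are assumptions for illustration, not recommendations from the RML Mapper project. -Xmx is the standard JVM option for the maximum heap size.

```shell
# Return success when the file is smaller than the given limit in bytes
is_small_enough() {
  local file="$1" limit_bytes="$2"
  local size
  size=$(( $(wc -c < "$file") ))
  [ "$size" -lt "$limit_bytes" ]
}

# Hypothetical wrapper: refuse to run the mapper on inputs above ~1 GB
run_mapper() {
  local data_file="$1" mapping="$2" output="$3"
  if ! is_small_enough "$data_file" 1000000000; then
    echo "$data_file is large, consider the RMLStreamer instead" >&2
    return 1
  fi
  # -Xmx raises the maximum JVM heap size
  java -Xmx4g -jar rmlmapper.jar -m "$mapping" -o "$output"
}
```

Usage would look like run_mapper data.csv mapping.ttl rdf-output.nt, where data.csv stands for the source file referenced in the mapping.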
Run automatically in workflow

The RMLMapper can easily be run in GitHub Actions workflows. Check out the Run workflows page for more details.

- name: Run RML mapper
  uses: vemonet/rmlmapper-java@v4.9.0
  with:
    mapping: mappings.rml.ttl
    output: rdf-output.nt
  env:
    JAVA_OPTS: "-Xmx6g"

Convert with the RML Streamer#

The RMLStreamer is a scalable implementation of the RDF Mapping Language Specifications to generate RDF out of structured input data streams.

Work in progress

The RMLStreamer is still in development. Some features, such as functions, are yet to be implemented.

To run the RMLStreamer you have 2 options:

Prepare files#

Copy the RMLStreamer.jar file, your mapping files and data files to the Flink jobmanager pod before running it.

For example:

# Get the Flink jobmanager pod ID (both labels must match)
POD_ID=$(oc get pod --selector app=flink,component=jobmanager --no-headers -o=custom-columns=NAME:.metadata.name)
DATASET=my-dataset
# Optional: open an interactive shell in the pod with: oc rsh $POD_ID
# Create the project folder in the pod
oc exec $POD_ID -- mkdir -p /mnt/project
# If the script is run from datasets/dataset1/scripts/ :
oc cp ../../mappings $POD_ID:/mnt/project/
# Make the download script executable in the pod, then run it
oc exec $POD_ID -- chmod +x /mnt/project/datasets/$DATASET/scripts/download.sh
oc exec $POD_ID -- /mnt/project/datasets/$DATASET/scripts/download.sh
# Download the RMLStreamer jar in the pod
oc exec $POD_ID -- wget -O /mnt/RMLStreamer.jar https://github.com/RMLio/RMLStreamer/releases/download/v2.0.0/RMLStreamer-2.0.0.jar

Run the RMLStreamer#

Example command to run the RMLStreamer from the Flink cluster master:

nohup /opt/flink/bin/flink run -p 128 -c io.rml.framework.Main /mnt/RMLStreamer.jar toFile -m /mnt/mappings.rml.ttl -o /mnt/rmlstreamer-mappings-output.nt --job-name "RMLStreamer mappings.rml.ttl" &
Check the progress

The progress of the job can be checked in the Apache Flink web UI.

Merge and compress output#

The RMLStreamer writes its N-Triples output in parallel as multiple files in an output folder. Merge them into a single file, then compress it:

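# Go to the output folder containing the files written in parallel
cd /mnt/cohd/openshift-rmlstreamer-cohd-associations.nt
# Merge them into a single file (run only once, since >> appends to an existing merged file)
nohup cat * >> openshift-rmlstreamer-cohd-associations.nt &
# Check the size of the merged file
ls -alh /mnt/cohd/openshift-rmlstreamer-cohd-associations.nt/openshift-rmlstreamer-cohd-associations.nt
# Once the merge has finished, zip the merged output file:
nohup gzip openshift-rmlstreamer-cohd-associations.nt &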
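
The merge and compression can also be chained so that gzip only starts after the merge has completed. This is a minimal sketch with a hypothetical merge_and_compress helper. It writes the merged file outside the folder of part files, so the merged file is never picked up by the glob.

```shell
# Merge all part files of an output folder into one file, then gzip it.
# $1: folder containing the part files, $2: path of the merged file
merge_and_compress() {
  local parts_dir="$1" merged_file="$2"
  # gzip runs only if cat succeeded; -f overwrites an existing .gz
  cat "$parts_dir"/* > "$merged_file" && gzip -f "$merged_file"
}
```

With the paths used above, this could be called as merge_and_compress /mnt/cohd/openshift-rmlstreamer-cohd-associations.nt /mnt/cohd/merged.nt, where merged.nt is an illustrative file name.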

Copy to Node2#

Connect to node2 over SSH. The http_proxy and https_proxy variables need to be temporarily emptied to access the DSRI:

export http_proxy=""
export https_proxy=""
# Copy with the oc tool (&! is zsh syntax: run in the background and disown):
oc login
oc cp flink-jobmanager-7459cc58f7-cjcqf:/mnt/cohd/openshift-rmlstreamer-cohd-associations.nt/openshift-rmlstreamer-cohd-associations.nt.gz /data/graphdb/import/umids-download &!
# Check (19G total):
ls -alh /data/graphdb/import/umids-download
cp /data/graphdb/import/umids-download/openshift-rmlstreamer-cohd-associations.nt.gz /data/d2s-project-trek/workspace/dumps/rdf/cohd/
# Decompress the copied file:
gzip -d /data/d2s-project-trek/workspace/dumps/rdf/cohd/openshift-rmlstreamer-cohd-associations.nt.gz

Reactivate the proxy afterwards by exporting http_proxy and https_proxy again with their previous values.
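
One way to avoid retyping the proxy address is to save the current values before clearing them. This is a minimal sketch, and the proxy_off and proxy_on function names are hypothetical:

```shell
# Save the current proxy settings, then clear them
proxy_off() {
  saved_http_proxy="${http_proxy:-}"
  saved_https_proxy="${https_proxy:-}"
  export http_proxy="" https_proxy=""
}

# Restore the proxy settings saved by proxy_off
proxy_on() {
  export http_proxy="$saved_http_proxy" https_proxy="$saved_https_proxy"
}
```

Call proxy_off before oc login and oc cp, then proxy_on once the copy has finished.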

Preload in GraphDB#

Check the generated COHD file on node2 at:

cd /data/d2s-project-trek/workspace/dumps/rdf/cohd

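Fix invalid triples. "-inf" is not a valid xsd:double value (XML Schema expects "-INF"), so the datatype is removed from these literals: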

sed -i 's/"-inf"^^<http:\/\/www.w3.org\/2001\/XMLSchema#double>/"-inf"/g' openshift-rmlstreamer-cohd-associations.nt
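
Before editing the full dump in place, the substitution can be tried on a single line. The sketch below uses a made-up sample triple with example.org URIs, and strip_inf_datatype is a hypothetical wrapper around the same sed expression that prints to stdout instead of using -i.

```shell
# Hypothetical sample triple carrying the invalid "-inf" double literal
printf '%s\n' '<http://example.org/s> <http://example.org/p> "-inf"^^<http://www.w3.org/2001/XMLSchema#double> .' > sample.nt

# Count the lines that would be changed (fixed-string match)
grep -cF '"-inf"^^<http://www.w3.org/2001/XMLSchema#double>' sample.nt

# Same substitution as above, without modifying the file
strip_inf_datatype() {
  sed 's/"-inf"^^<http:\/\/www.w3.org\/2001\/XMLSchema#double>/"-inf"/g' "$1"
}

strip_inf_datatype sample.nt
```

The printed triple should end with the plain literal "-inf" and no datatype.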

Start preload:

cd /data/deploy-ids-services/graphdb/preload-cohd
docker-compose up -d

The COHD repository will be created in /data/graphdb-preload/data. Move it to the main GraphDB:

mv /data/graphdb-preload/data/repositories/cohd /data/graphdb/data/repositories
Last updated on by Vincent Emonet