# Run RML transformations
Use the RDF Mapping Language (RML), a declarative mapping language, to map your structured data (CSV, TSV, SQL, XML, JSON, YAML) to RDF.
By default we use YARRRML, a YAML-based mapping language, to make defining RML mappings easier. RML mappings defined in Turtle can also be executed.
## Download files to convert

The following documentation uses the COHD clinical CSV data and a GeoNames TSV dataset as examples. Download the datasets, if not already done:
Downloaded files go to `workspace/input/cohd`.
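If you are fetching the files by hand, this could look like the following sketch (the URL is a hypothetical placeholder; use the actual link given for the dataset):

```shell
# Create the input folder expected by d2s
mkdir -p workspace/input/cohd
# Hypothetical placeholder URL: replace with the actual COHD download link
# curl -L -o workspace/input/cohd/cohd_sample.csv https://example.org/cohd_sample.csv
```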
## Run RML Mapper

The rmlmapper-java tool executes RML mappings to generate RDF knowledge graphs. It loads all the data in memory, so be careful when working with big datasets.
By default, `d2s rml` will execute RML files defined in the RDF Turtle format, in files with the extension `.rml.ttl` (e.g. `datasets/<dataset_id>/mapping/associations-mapping.rml.ttl`).
Output goes to `workspace/import/rmlmapper-associations-mapping_rml_ttl-cohd.nt`.
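If you prefer to bypass `d2s`, the mapper can also be invoked directly with its JAR (the JAR name and paths below are illustrative; `-m`, `-o` and `-s` are the rmlmapper-java options for the mapping file, output file, and serialization format):

```shell
java -jar rmlmapper.jar \
  -m datasets/cohd/mapping/associations-mapping.rml.ttl \
  -o workspace/import/rmlmapper-associations-mapping_rml_ttl-cohd.nt \
  -s ntriples
```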
You can execute YARRRML mappings defined in files with the extension `.yarrr.yml` by providing the argument `--yarrrml`:
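A sketch of the two invocations, assuming `cohd` as the dataset identifier (check `d2s rml --help` for the exact argument order in your version):

```shell
# Execute Turtle RML mappings (.rml.ttl) for the cohd dataset
d2s rml cohd
# Execute YARRRML mappings (.yarrr.yml) instead
d2s rml cohd --yarrrml
```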
If you face memory issues, you might need to increase the maximum memory allocated to Java (`-Xmx`), or try using the RMLStreamer documented below.
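For example, to give the JVM up to 8 GB of heap when running the mapper directly (JAR name and mapping path are illustrative; adjust the value to your machine):

```shell
java -Xmx8g -jar rmlmapper.jar -m associations-mapping.rml.ttl -o output.nt
```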
## Run RML Streamer

The RMLStreamer is a scalable implementation of the RML specification that generates RDF from structured input data streams.
⚠️ The RMLStreamer is still in development; some features, such as functions, are yet to be implemented.
The RML mappings need to be defined in a file with the extension `.rml.ttl`, in the `mapping` folder of the dataset to transform, e.g. `datasets/<dataset_id>/mapping/associations-mapping.rml.ttl`.
### Start Apache Flink

Starting Apache Flink is required to stream the files:
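If your Flink cluster is run with Docker Compose, starting it could look like this sketch (the service names are assumptions based on a typical Flink compose setup; adapt them to your configuration):

```shell
docker-compose up -d flink-jobmanager flink-taskmanager
```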
Access the Apache Flink web UI at http://localhost:8078 to see the running jobs.
### Run job

We provide an example converting a sample of COHD (clinical concept co-occurrences from FDA reports) to the BioLink model:
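Submitting the job to Flink could look like the following sketch (the paths are illustrative; `toFile` with `-m` and `-o` is the RMLStreamer job syntax for writing output to a file, but check the flags against your RMLStreamer version):

```shell
flink run RMLStreamer.jar toFile \
  -m datasets/cohd/mapping/associations-mapping.rml.ttl \
  -o workspace/import/rmlstreamer-associations-mapping_rml_ttl-cohd.nt
```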
Output goes to `workspace/import/rmlstreamer-associations-mapping_rml_ttl-cohd.nt` and can then be loaded into a triplestore.
You can also provide YARRRML files with the extension `.yarrr.yml`, to be processed to `.rml.ttl` files before running RML:
The command runs detached by default; you can keep the terminal attached and watch the execution:
Generate N-Quads by adding the graph information to the `rr:subjectMap` in the RML mappings:
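For example, a subject map producing quads in a named graph via `rr:graphMap` (the template and graph IRIs are illustrative):

```turtle
<#AssociationsMapping>
  rr:subjectMap [
    rr:template "https://w3id.org/d2s/association/{id}" ;
    # rr:graphMap places every triple produced by this mapping in a named graph,
    # so the output becomes N-Quads instead of N-Triples
    rr:graphMap [ rr:constant <https://w3id.org/d2s/graph/cohd> ] ;
  ] .
```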
### Run on OpenShift

Still experimental: the RMLStreamer can be run on the Data Science Research Infrastructure (DSRI) OpenShift cluster.
See the DSRI documentation to deploy Apache Flink.
Copy the RMLStreamer.jar file, your mapping files, and your data files to the pod. This will be proposed when running `d2s rml`, but the files can also be uploaded manually beforehand.
Easily transferring the files to the Apache Flink storage is still a work in progress. Use `oc cp` if rsync does not work.
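A sketch of copying the files with `oc cp` (the pod name and mapping file name are illustrative; retrieve the real pod name with `oc get pods`):

```shell
oc cp RMLStreamer.jar flink-jobmanager-xxxx:/mnt/
oc cp datasets/geonames/mapping/geonames-mapping.rml.ttl flink-jobmanager-xxxx:/mnt/
```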
Run the RMLStreamer job on the GeoNames example:
The progress of the job can be checked in the Apache Flink web UI.
The output file is in `/mnt/rdf_output-associations-mapping.nt` in the pod, or in `/apache-flink` in the persistent storage.
## Edit RML mappings

### YARRRML Matey web editor

The Matey web UI editor 🦜 is available to easily write RML mappings in YAML files using the YARRRML simplified mapping language. The mappings can be conveniently tested in the browser on a sample of the file to transform.
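As a starting point, a minimal YARRRML mapping for a CSV file might look like this sketch (the prefix, file name, and column names are illustrative, not taken from the COHD dataset):

```yaml
prefixes:
  ex: "https://example.org/"
mappings:
  association:
    sources:
      - ['data.csv~csv']            # file path and format
    s: ex:association/$(id)        # subject IRI template using the "id" column
    po:                            # predicate-object pairs
      - [ex:conceptCount, $(count)]
```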
The RML specification is available as a W3C unofficial draft. See the rml.io website for more documentation about RML and the various tools built and deployed by Ghent University.
YARRRML can also be parsed locally using an npm package:
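For example, using the `@rmlio/yarrrml-parser` package from the rml.io project (file names are illustrative; check `yarrrml-parser --help` for all options):

```shell
npm install -g @rmlio/yarrrml-parser
yarrrml-parser -i associations-mapping.yarrr.yml -o associations-mapping.rml.ttl
```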
### RMHell web editor

Similar to Matey, the RMHell web UI editor allows you to write YARRRML or RML and test it against a sample input file in the browser.
### Mapeathor

Mapeathor converts Excel mappings into R2RML, RML, or YARRRML mappings. Functions are not supported.
Run Mapeathor locally:
Make sure the `.xlsx` file is in the `mapping` folder of the dataset, then execute the conversion to YARRRML:
The output format can be `RML`, `R2RML`, or `YARRRML`.
The RML file will be generated in `workspace/mapeathor`.
### Using functions

RML functions are not yet implemented in the RMLStreamer; use the RML Mapper if you want to make use of them. See the full list of available default functions.
Example using the split function:
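A YARRRML sketch using the GREL `string_split` function to split a pipe-separated column (the predicate and column names are illustrative):

```yaml
po:
  - p: ex:synonym
    o:
      function: grel:string_split
      parameters:
        - [grel:valueParameter, $(synonyms)]
        # the separator is treated as a regex, so "|" must be escaped
        - [grel:p_string_sep, "\|"]
```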
`grel:p_string_sep` separators need to be escaped with `\`.
Additional functions can be added by integrating them in a `.jar` file; see the documentation.
## Compute HCLS metadata

After the RDF knowledge graph has been generated and loaded into a triplestore, HCLS descriptive metadata and statistics can easily be computed and inserted for the different datasets (graphs in the triplestore) by running a CWL workflow:
- Insert the dataset metadata defined in the `datasets/cohd/metadata` folder.
- Compute and insert HCLS descriptive statistics using SPARQL queries.
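As an illustration of the kind of statistics the workflow computes, a SPARQL update counting triples per graph could look like this (the metadata graph IRI is illustrative; the actual workflow follows the HCLS dataset description profile):

```sparql
PREFIX void: <http://rdfs.org/ns/void#>
INSERT {
  GRAPH <https://w3id.org/d2s/metadata> {
    ?g void:triples ?count .
  }
} WHERE {
  SELECT ?g (COUNT(*) AS ?count)
  WHERE { GRAPH ?g { ?s ?p ?o } }
  GROUP BY ?g
}
```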