Run CWL workflows

CWL workflows can be run to perform various tasks, such as executing transformation pipelines to build an RDF Knowledge Graph.

Download files to convert#

Files to process (e.g. CSV, XML) need to be downloaded before running a workflow 📥

d2s download <dataset_id>

The download script is defined in datasets/<dataset_id>/download/download.sh.

Downloaded files go to workspace/input/<dataset_id>.
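
As an illustration, a minimal download.sh could simply fetch and unpack the source files. This is a sketch with a hypothetical URL and filename, not an actual dataset distribution:

#!/bin/bash
# Hypothetical download script: fetch the source file and unpack it.
# d2s makes the result available in workspace/input/<dataset_id>.
wget -N https://example.org/data/my_dataset.csv.gz
gunzip my_dataset.csv.gz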

Run CWL workflows#

Run a CWL workflow defined in d2s-core/cwl/workflows on a specific dataset:

d2s run <workflow_filename>.cwl <dataset_id>

Output goes to workspace/output

Convert CSV/TSV to RDF#

Use AutoR2RML and Apache Drill to generate R2RML mappings based on the input data structure.

We provide an example converting a sample of COHD (clinical concept co-occurrences from FDA reports) to the BioLink model:

d2s download cohd
d2s run csv-virtuoso.cwl cohd
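
To give an idea of the intermediate artifact, here is a sketch of an R2RML mapping for a single CSV column, using the standard R2RML vocabulary. The table name, column name, and URIs are illustrative, not the actual AutoR2RML output:

@prefix rr: <http://www.w3.org/ns/r2rml#> .

<#CohdTriplesMap>
    # Each row of the source table becomes one subject URI
    rr:logicalTable [ rr:tableName "cohd_sample" ] ;
    rr:subjectMap [ rr:template "https://w3id.org/d2s/row/{row_id}" ] ;
    # One predicate-object map per column of the input file
    rr:predicateObjectMap [
        rr:predicate <https://w3id.org/d2s/column/concept_id> ;
        rr:objectMap [ rr:column "concept_id" ]
    ] .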

By default the workflow runs detached from your terminal, so you can close the window or leave the SSH session.

You might face issues when processing large CSV or TSV files; see this documentation to deal with big files.

A workflow is also available to convert CSV/TSV and split property objects into multiple statements (e.g. ?s ?p "value1,value2,value3" would be split into 3 statements), as shown below.
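
For instance, with illustrative URIs, the input statement

:drug1 :synonym "aspirin,ASA,acetylsalicylic acid" .

would be split into:

:drug1 :synonym "aspirin" .
:drug1 :synonym "ASA" .
:drug1 :synonym "acetylsalicylic acid" .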

We provide an example converting a sample of the EggNOG dataset to the BioLink model:

d2s download eggnog
d2s run split-csv-virtuoso.cwl eggnog

⚠️ This workflow is not tested at the moment and might need to be fixed.


Convert XML to RDF#

Use xml2rdf to generate RDF based on the XML structure.
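
As a rough illustration of the generic RDF this produces, an XML fragment such as

<drug>
  <name>Aspirin</name>
</drug>

could be represented as nodes mirroring the XML tree. The prefix and properties below are hypothetical; the actual xml2rdf vocabulary may differ:

@prefix x: <https://w3id.org/d2s/xml/> .   # hypothetical vocabulary

x:drug_1 x:hasChild x:drug_1_name .    # one node per XML element
x:drug_1_name x:hasValue "Aspirin" .   # leaf nodes carry the text value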

We provide an example converting a sample of DrugBank 💊️ (drug associations) to the BioLink model:

d2s download drugbank
d2s run xml-virtuoso.cwl drugbank

Output goes to workspace/output


Compute HCLS metadata#

HCLS descriptive metadata and statistics for datasets can easily be computed and inserted into the generated graph by running a CWL workflow:

d2s run compute-hcls-metadata.cwl cohd
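
The HCLS dataset description profile builds on vocabularies such as VoID and Dublin Core. The inserted statistics look roughly like this sketch, with an illustrative graph URI and values:

@prefix void: <http://rdfs.org/ns/void#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://w3id.org/d2s/graph/cohd> a void:Dataset ;
    dct:title "COHD" ;
    void:triples 1000000 ;          # total triples in the graph
    void:distinctSubjects 50000 ;   # unique subjects
    void:properties 12 .            # unique predicates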

Generate mappings#

When you start converting a new dataset, d2s can help you generate mapping files based on the input data structure. You can then edit the generated SPARQL queries to adapt them to your target model.

d2s run csv-virtuoso.cwl cohd --get-mappings

The --get-mappings flag copies the mapping queries that d2s generates from the input data structure to /datasets/$dataset_id/mappings.

You can use those mappings as a starting point to map the input data to your target model.

Note: nested XML files can generate a lot of mapping files.


Access workflow logs#

The workflow logs are stored in workspace/logs.

Watch a running workflow#

You can watch the logs of a running workflow 👀

d2s watch csv-virtuoso.cwl-cohd-20200215-100352.txt

Display workflow logs#

Display the complete logs of any workflow previously run 📋

d2s log csv-virtuoso.cwl-cohd-20200215-091342.txt

Run attached to the terminal#

d2s run csv-virtuoso.cwl cohd --watch

⚠️ The logs will not be stored in workspace/logs.


Further details on SPARQL mappings#

Converting data with Data2Services relies on 3 steps:

  • A generic RDF is automatically generated from the input data structure.
  • SPARQL queries are designed by the user to map the generic RDF to a target model.
  • Extra modules can be added to the workflow to perform operations SPARQL doesn't natively support.
    • E.g. splitting statements, or resolving the preferred URI for an entity.
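
As an illustration of the first step, a CSV row such as 42,Aspirin (with columns concept_id and concept_name) could yield one generic statement per column. The prefix is hypothetical; the actual generic vocabulary may differ:

@prefix d2s: <https://w3id.org/d2s/> .   # hypothetical prefix

d2s:row_1 d2s:concept_id "42" ;          # one predicate per column
          d2s:concept_name "Aspirin" .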

You can find examples of SPARQL mapping queries for the datasets mentioned in this documentation (e.g. COHD, PharmGKB, and DrugBank).

Defining the mappings is the hardest and most cumbersome part of data integration. We are actively working on making it easier through mapping automation and graphical user interfaces.

The mapping definition is straightforward for flat data formats such as CSV, TSV, or relational databases, but nested data representations such as XML or JSON require more complex mappings.

If you are mapping a dataset for the first time, we advise you to run AutoR2RML or xml2rdf on the data to generate bootstrap SPARQL queries.

  • AutoR2RML automatically generates a SPARQL query extracting all column values for each row.
    • You just need to generate proper URIs using BIND, and write the statements corresponding to the target representation (see the sketch below).
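
A minimal sketch of such a bootstrap query, edited to target the BioLink model. The graph URIs, column names, and target class are illustrative, reusing the hypothetical d2s: vocabulary from above:

PREFIX bl: <https://w3id.org/biolink/vocab/>
PREFIX d2s: <https://w3id.org/d2s/>

INSERT {
    GRAPH <https://w3id.org/d2s/graph/cohd> {
        ?conceptUri a bl:Disease ;
            bl:name ?conceptName .
    }
} WHERE {
    GRAPH <https://w3id.org/d2s/input/cohd> {
        # Generic RDF: one predicate per column of the input file
        ?row d2s:concept_id ?conceptId ;
             d2s:concept_name ?conceptName .
    }
    # Generate a proper URI for the entity using BIND
    BIND(IRI(CONCAT("https://w3id.org/d2s/concept/", ?conceptId)) AS ?conceptUri)
}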

PharmGKB is a good example of a complex TSV file.

  • xml2rdf generates a SPARQL mapping file for each array it detects.

    • Mapping generation for XML is still experimental as it is complex to detect which fields should be mapped.

    • Be careful when iterating on multiple different child arrays for a parent node in your SPARQL query. It can blow up the processing time.

      • Always split your queries to never iterate over more than one array for a parent node.
      • E.g. if drug:001 from an XML file has multiple publications and multiple synonyms nodes among its children, then it is preferable to get them in 2 different queries (see the sketch after this list). Retrieving the 2 arrays in a single query would result in the returned row count being the Cartesian product of the 2 arrays, which grows multiplicatively with the size of each array.
      • Final semantic results are the same, but the performance of the transformation is highly impacted.
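
For instance, with the same hypothetical d2s: vocabulary as above:

# Avoid: matching both arrays in one query returns their Cartesian product
SELECT ?drug ?publication ?synonym WHERE {
    ?drug d2s:publications ?publication ;
          d2s:synonyms ?synonym .
}

# Prefer: one query per array
SELECT ?drug ?publication WHERE { ?drug d2s:publications ?publication . }
SELECT ?drug ?synonym WHERE { ?drug d2s:synonyms ?synonym . }

If a drug has 10 publications and 10 synonyms, the single query returns 100 rows for it, while each split query returns only 10.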

DrugBank is a good example of multiple mapping files used to handle arrays.
