Introduction to d2s

This documentation explains how to use the d2s Command Line Interface to deploy services to integrate and access data in a Knowledge Graph:

  • Integrate any structured data using various solutions.
  • Deploy various interfaces to consume the Knowledge Graph data.
  • Deploy user-friendly web UI to access the integrated data.
Not maintained anymore

We are currently migrating from the CLI to using more GitHub Actions workflow, the d2s CLI is not maintained anymore. Prefer using the documentation at Publish data and Deploy services

d2s is a Command Line Interface, written in Python, that helps you define download scripts, metadata, mappings and Jupyter notebooks to integrate multiple datasets into a Knowledge Graph complying with a defined data model. d2s has been developed using mainly RDF Knowledge Graphs, but property graphs, such as Neo4j, can also be used.

This documentation is intended for people who are familiar with Docker and running simple commands in a Terminal. It is also preferable to be familiar with the SPARQL query language.

The d2s tool is following the Data2Services framework detailed in this figure:

Data2Services infrastructure

We provide d2s-project-template, a GitHub template repository with example mappings and workflows to start building your RDF Knowledge Graph from structured data.

Feel free to use directly the d2s-project-template repository or create a new GitHub repository from the template to start your own project.

See this presentation for more details about the Data2Services project.

Build a RDF Knowledge Graph#

  1. Initialize a d2s project
  2. Add a new dataset
  3. Choose a solution to integrate your data

Multiple solutions available to integrate data in a standard Knowledge Graph:

  • RML mappings (RDF Mapping Language)

  • CWL workflows defined to convert structured files to RDF using SPARQL queries

  • BioThings Studio to build BioThings APIs (exposed to the Translator using the ReasonerStd API)

  • DOCKET to integrate omics data

  • Python scripts and notebooks, such as Dipper ETL, rdflib or pandas

  • Define new CWL workflows to build and share your data transformation pipelines

    • It is strongly recommended to use a POSIX system (Linux, MacOS) if you consider running workflows on your laptop, since most workflow orchestration tools do not support Windows.

Deploy services#

Once your data has been integrated in a RDF Knowledge Graph you can deploy interfaces and services to access your data.

Data2Services aims to provide an exhaustive documentation to run and deploy RDF and Knowledge Graph related services using Docker. The d2s CLI uses docker-compose to start and link locally the different services.

Project folder structure#

The d2s client uses the following directory structure, which can be found in the example project d2s-project-template (here with the cohd dataset):

root-directory
โ”œโ”€โ”€ .d2sconfig # The project config file
โ”œโ”€โ”€ .github/workflows
โ”‚ โ”œโ”€โ”€ process-dataset1.yml
โ”‚ โ””โ”€โ”€ process-dataset2.yml
โ”œโ”€โ”€ LICENSE
โ”œโ”€โ”€ README.md
โ”œโ”€โ”€ d2s-core (submodule) # docker-compose, CWL, Argo workflows files
โ”œโ”€โ”€ datasets # Folders of the different dataset integrated in the KG
โ”‚ โ””โ”€โ”€ cohd # Folder for files to convert COHD
โ”‚ โ”œโ”€โ”€ process-cohd.ipynb # Notebook for COHD data processing
โ”‚ โ”œโ”€โ”€ config.yml # The workflow config file
โ”‚ โ”œโ”€โ”€ download
โ”‚ โ”‚ โ””โ”€โ”€ download.sh # Script to download input files
โ”‚ โ”œโ”€โ”€ mapping # SPARQL mapping queries to build the KG
โ”‚ โ”‚ โ”œโ”€โ”€ concepts.rq # SPARQL query mappings
โ”‚ โ”‚ โ”œโ”€โ”€ associations-mapping.yarrr.yml # YARRRML mappings
โ”‚ โ”‚ โ””โ”€โ”€ associations-mapping.rml.ttl # RML mappings
โ”‚ โ””โ”€โ”€ metadata # SPARQL queries to insert metadata about the dataset
โ”‚ โ”œโ”€โ”€ metadata-cohd-summary.rq
โ”‚ โ”œโ”€โ”€ metadata-cohd-1.rq
โ”‚ โ””โ”€โ”€ metadata-cohd-2.rq
โ””โ”€โ”€ workspace # Contains all files required to run the KG and services
โ”œโ”€โ”€ input # Downloaded file to process
โ”œโ”€โ”€ output # Every file generated by the workflow
โ”œโ”€โ”€ logs # Logs of workflows run
โ”œโ”€โ”€ dumps # HDT and RDF file dumps of the different graphs (datasets)
โ”œโ”€โ”€ import # Import folder for GraphDB (or other triplestores)
โ”œโ”€โ”€ graphdb # Folders used by the triplestores
โ”œโ”€โ”€ virtuoso
โ”œโ”€โ”€ tmp-virtuoso
โ””โ”€โ”€ blazegraph

Source code repositories#

The Data2Services project uses multiples Git repositories:

  • d2s-cli: A Command Line Interface to orchestrate the integration of heterogenous data and the deployment of services consuming the integrated data (Python).
    • It will clone and use a d2s-project-template repository to store your project services and workflows settings.
  • d2s-project-template: template to create a Data2Services project folder, with example mappings to a few datasets, it includes d2s-core as submodule.
  • d2s-docs: source code of this documentation.
Last updated on by Vincent Emonet