Add a new dataset

This documentation uses d2s-project-template as an example, but you are encouraged to create your own Git repository from the template.

Generate the new dataset

The files required to transform the dataset are generated in datasets/$dataset_id when you run:

d2s generate dataset

You will be prompted to enter some metadata about the dataset to create.

The dataset mappings, metadata, notebook and download files are created in the datasets/$dataset_id folder.

The dataset folder is generated based on this template folder. Example mapping files are provided for DrugBank XML data and Columbia Open Health clinical Data TSV data.

Let us know if those examples are helpful, or if they need to be more explicit.

Describe the dataset metadata

You are encouraged to improve the metadata description of your dataset by editing the two metadata files generated in datasets/$dataset_id/metadata.

About a dozen metadata properties are defined using SPARQL queries, for the summary of the dataset and for each distribution.

Change the URIs between <> and the strings between "" so that they describe your dataset.

We recommend using the Stardog RDF Grammars extension in Visual Studio Code to edit SPARQL queries (.rq files).

Add files to download

You can define the files to download using:

  • a Bash file
    • In datasets/$dataset_id/download/download.sh
    • Download with d2s download $dataset_id
  • a Jupyter Notebook
    • In datasets/$dataset_id/process-dataset_id.ipynb

The files will be downloaded in workspace/input/$dataset_id.

A template is provided with examples showing how to download files, unzip them, or add column labels.
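To illustrate the kind of steps such a download script can contain, here is a minimal sketch of a helper that fetches an archive and unpacks it into a target folder. It is not the content of the generated template: the function name fetch_and_unpack, the example URL and the choice of curl, gunzip and unzip are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: download one file into a folder and unpack it
# if it is a .gz or .zip archive. Assumes curl is installed.
fetch_and_unpack() {
  url="$1"          # location of the file to download
  target_dir="$2"   # e.g. workspace/input/$dataset_id

  mkdir -p "$target_dir" || return 1
  archive="$target_dir/$(basename "$url")"

  # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
  curl -fsSL -o "$archive" "$url" || return 1

  case "$archive" in
    *.gz)  gunzip -f "$archive" ;;                  # records.tsv.gz -> records.tsv
    *.zip) unzip -o "$archive" -d "$target_dir" ;;  # extract next to the archive
  esac
}

# Example call (placeholder URL, replace with the real location of your data):
# fetch_and_unpack "https://example.org/data/records.tsv.gz" "workspace/input/$dataset_id"
```

Passing the target folder explicitly keeps the sketch independent of the directory from which the script happens to be run.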

d2s extracts data from CSV/TSV files based on their column labels. If your tabular file doesn't have column labels, you can add them at the end of the download.sh file using the sed command.
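As a rough illustration of the sed approach, the sketch below prepends a tab-separated label line to a TSV file that has no header. It assumes GNU sed (the one-line form of the i command and -i for in-place editing). The function name add_column_labels, the file drugs.tsv and the labels drug_id and drug_name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: insert a header line before line 1 of a file.
# Assumes GNU sed. Note that sed inserts nothing if the file is empty,
# because there is no line 1 to match.
add_column_labels() {
  file="$1"     # file to modify in place
  header="$2"   # column labels, already separated by tabs
  sed -i "1i ${header}" "$file"
}

# Demo on a throwaway file so that no real data is touched.
demo_dir="$(mktemp -d)"
printf '1\taspirin\n2\tibuprofen\n' > "$demo_dir/drugs.tsv"

# printf turns \t into real tab characters before sed sees the header.
add_column_labels "$demo_dir/drugs.tsv" "$(printf 'drug_id\tdrug_name')"

# Show the first line, which is now the label line.
head -n 1 "$demo_dir/drugs.tsv"
```

In a real download.sh, the call to add_column_labels would target the downloaded file instead of the demo file.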


Integrate data

Multiple solutions are available to integrate data in a standard Knowledge Graph:

Last updated by Vincent Emonet