# Add a new dataset
In this documentation I will use d2s-project-template as an example, but you are encouraged to create a new Git repository using the template.
## Generate the new dataset

The files required to transform the dataset will be generated in `datasets/$dataset_id`. You will be prompted to enter some metadata about the dataset to create. The dataset mappings, metadata, notebook and download files are created in the `datasets/$dataset_id` folder.
The dataset folder is generated based on this template folder. Example mapping files are provided for DrugBank XML data and Columbia Open Health Data (COHD) clinical TSV data. Let us know if those examples are helpful, or if they need to be more explicit.
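As a quick illustration, here is a hedged sketch of scaffolding a dataset from the project root; the `d2s new dataset` subcommand and the prompted fields are assumptions to verify with `d2s --help`.

```bash
# Hedged sketch: scaffold a new dataset from the root of your d2s project.
# The "d2s new dataset" subcommand is an assumption; check "d2s --help".
cd d2s-project-template
d2s new dataset
# You will be prompted for metadata (e.g. the dataset identifier and a description),
# and the mappings, metadata, notebook and download files are created in
# datasets/$dataset_id.
```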
## Describe the dataset metadata

You are encouraged to improve the metadata description of your dataset by editing the 2 metadata files generated in `datasets/$dataset_id/metadata`. About a dozen metadata properties are defined using SPARQL queries: one for the dataset summary, and one for each distribution.

- SPARQL insert for the dataset summary metadata (once per dataset).
- SPARQL insert for the dataset distribution metadata (for each new version).

Change the URIs between `<>` and the strings between `""`. We recommend using the Stardog RDF Grammars extension in Visual Studio Code to edit SPARQL queries (`.rq` files).
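For orientation, here is a hedged sketch of locating and opening those two queries; the `.rq` file names shown are illustrative placeholders, not necessarily the names generated by the template.

```bash
# Hedged sketch: the two .rq file names below are illustrative placeholders.
ls datasets/$dataset_id/metadata/
#   insert-dataset-summary-metadata.rq        # edit once per dataset
#   insert-dataset-distribution-metadata.rq   # edit for each new version

# Open the queries in VS Code (with the Stardog RDF Grammars extension for
# SPARQL highlighting) and replace the URIs between <> and the strings between "".
code datasets/$dataset_id/metadata/*.rq
```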
## Add files to download

You can define the files to download using:

- A Bash script
  - In `datasets/$dataset_id/download/download.sh`
  - Download with `d2s download $dataset_id`
- A Jupyter Notebook
  - In `datasets/$dataset_id/process-dataset_id.ipynb`

The files will be downloaded in `workspace/input/$dataset_id`. A template is provided with examples to download, unzip, or add column labels. `d2s` extracts data from CSV/TSV files based on their column labels. If your tabular file does not have column labels, you can add them at the end of the `download.sh` file using the `sed` command, as in the sketch below.
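A hedged sketch of what such a `download.sh` could look like; the source URL, file names, and column labels are illustrative placeholders, not files shipped with the template.

```bash
#!/bin/bash
# Hedged sketch of datasets/$dataset_id/download/download.sh.
# The URL, file names, and column labels below are illustrative placeholders.

# Download and unzip the source file (the downloaded files should end up
# in workspace/input/$dataset_id, as described above).
wget -N https://example.org/data/my_source_data.zip
unzip -o my_source_data.zip

# The source TSV has no header row: prepend column labels with sed (GNU sed)
# so that d2s can extract the columns by label.
sed -i '1s/^/patient_id\tdiagnosis\tdate\n/' my_source_data.tsv
```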
## Integrate data

Multiple solutions are available to integrate data into a standard Knowledge Graph:
- RML mappings (RDF Mapping Language)
- CWL workflows defined to convert structured files to RDF using SPARQL queries
- BioThings Studio to build BioThings APIs (exposed to the Translator using the ReasonerStd API)
- DOCKET to integrate omics data
- Python scripts and notebooks
- Define new CWL workflows to build and share your data transformation pipelines
- See the CWL workflows defined for d2s.
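To make the CWL option concrete, here is a hedged sketch of launching such a workflow on the dataset; both the `run` subcommand and the workflow file name are assumptions to check against `d2s --help` and the CWL workflows defined for d2s.

```bash
# Hedged sketch: convert the downloaded files to RDF with a CWL workflow.
# The "run" subcommand and the workflow name are assumptions; check
# "d2s --help" and the CWL workflows defined for d2s for the actual names.
d2s run csv-virtuoso.cwl $dataset_id
```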