Deploy workflows
GitHub Actions is a workflow engine provided by GitHub that lets users define workflows in YAML files in the .github/workflows folder of their git repository.
Workflows are a succession of steps, each step being either a bash command or a GitHub Action. GitHub Actions are reusable steps packaged to be easily run in workflows; you can find all available Actions on the GitHub Marketplace.
Users can also define when the workflow should run: at each push to the main branch, every week, when manually triggered, etc.
Once the workflow file is pushed to the GitHub repository, GitHub schedules the workflow automatically.
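As an illustration of the triggers described above, here is a minimal workflow sketch. The file name, workflow name, job name and cron schedule are assumptions chosen for the example, not values taken from an existing repository.

```yaml
# .github/workflows/process-data.yml (hypothetical file name)
name: Process data

on:
  push:
    branches: [ main ]     # at each push to the main branch
  schedule:
    - cron: '0 3 * * 1'    # every week: Mondays at 03:00 UTC
  workflow_dispatch:       # manual trigger from the Actions tab

jobs:
  example:
    runs-on: ubuntu-latest
    steps:
      # A step can be a GitHub Action...
      - name: Check out the repository
        uses: actions/checkout@v4
      # ...or a bash command
      - name: Show what triggered the workflow
        run: echo "Triggered by ${{ github.event_name }}"
```

Keeping only the triggers you need under `on:` is usually enough; the three shown here can be combined freely.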
GitHub Actions workflows can run on 2 types of environments:
- GitHub-hosted runner: runs in an isolated environment (Ubuntu) on GitHub servers, with resource limitations
- Self-hosted runner: runs the workflow directly on the machine where you deployed the self-hosted runner, with access to all its resources
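The runner type is selected per job with the `runs-on` key. The fragment below is a sketch to place inside a workflow file; the job names are hypothetical.

```yaml
jobs:
  on-github-runner:
    runs-on: ubuntu-latest   # GitHub-hosted runner
    steps:
      - run: echo "Running on a GitHub-hosted runner"

  on-self-hosted-runner:
    runs-on: self-hosted     # a machine where you deployed a self-hosted runner
    steps:
      - run: echo "Running on our own machine"
```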
Secure credentials
Passwords and other sensitive information can be securely stored as GitHub secrets and used in the workflows.
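A minimal sketch of how a step could read secrets, assuming secrets named GRAPHDB_USER and GRAPHDB_PASSWORD (the names used later on this page) have been defined in the repository settings. The step only checks that the values are available; GitHub masks secret values in the logs.

```yaml
steps:
  - name: Use credentials stored as GitHub secrets
    env:
      # Secrets are exposed to this step only, as environment variables
      GRAPHDB_USER: ${{ secrets.GRAPHDB_USER }}
      GRAPHDB_PASSWORD: ${{ secrets.GRAPHDB_PASSWORD }}
    run: |
      test -n "$GRAPHDB_USER" && test -n "$GRAPHDB_PASSWORD" \
        && echo "Credentials are available"
```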
We use GitHub Actions to automatically run the different parts of the workflow in a reproducible way:
- Download the input data files
- Run a Python script (to directly generate RDF, or to preprocess the data for RML)
- Run the RML mapper to generate the RDF data, if applicable
- Upload the generated RDF file to the SPARQL endpoint
- Generate and publish descriptive statistics for the published data
GitHub Actions for RDF
A few GitHub Actions are available on the GitHub Marketplace to easily work with RDF data.
RML Mapper
A GitHub Action for rmlmapper-java.
SPARQL operations
A GitHub Action for d2s-sparql-operations. It performs operations on SPARQL endpoints using RDF4J (SPARQL select, construct, insert and delete queries, uploading RDF files, splitting statements...).
Typical uses are executing insert queries from a local folder and uploading RDF files from a local folder.
Validate RDF
A GitHub Action to validate RDF with Jena.
Convert YARRRML to RML
A GitHub Action for the yarrrml-parser, to convert YARRRML YAML files to RML Turtle files.
Compress RDF to HDT
Convert N-Triples to HDT using the hdt-cpp Docker image.
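The conversion to HDT could be sketched as a single workflow step. This is an assumption-laden example: it supposes the hdt-cpp image is published as `rdfhdt/hdt-cpp` and provides the `rdf2hdt` command, and the input and output file names are hypothetical.

```yaml
- name: Compress RDF to HDT
  run: |
    # Mount the current folder in the container and convert the N-Triples file
    # (image name and file names are assumptions for this sketch)
    docker run --rm -v "$(pwd)":/data rdfhdt/hdt-cpp \
      rdf2hdt /data/rdf-output.nt /data/rdf-output.hdt
```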
Get metadata from SPARQL
Work in progress
Computing HCLS descriptive metadata for a SPARQL endpoint is under development in the d2s CLI.
d2s metadata generates descriptive statistics for knowledge graphs, as defined by the Health Care and Life Science (HCLS) Community Profile, for each graph in the SPARQL endpoint. The computed metadata provide an overview of the SPARQL endpoint content in RDF, with quantitative insights on entity classes and the relations between them.
Requires Python 3.6+. Metadata are generated as Turtle RDF in the metadata.ttl file.
Automate data processing and loading
RDF data can be automatically generated and loaded using GitHub Actions workflows.
See this workflow to generate data using a simple convert_to_rdf.py file and load it in the triplestore:
- Check out the git repository files in your current folder
- Download the input file from Google Docs
- Install the Python dependencies
- Run the Python script to generate RDF
- Optional: clear an existing graph in the triplestore
- Upload the output as an artifact, to be able to download it from the GitHub website or pass it between jobs
- Optional: download the artifact (rdf-output here) back in another job
The files in the artifact can be accessed directly, e.g. here rdf-output/rdf-file.nq
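The steps above can be sketched as the following workflow. It is a minimal illustration, not the project's actual workflow: it assumes a requirements.txt file exists, that convert_to_rdf.py writes a file named rdf-file.nq, and the job names are hypothetical. The download and triplestore steps are left out because they depend on your data source and endpoint.

```yaml
name: Generate RDF

on:
  workflow_dispatch:

jobs:
  generate-rdf:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install Python dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt   # assumes this file exists

      - name: Run the Python script to generate RDF
        run: python convert_to_rdf.py       # assumed to write rdf-file.nq

      - name: Upload the RDF output as an artifact
        uses: actions/upload-artifact@v4
        with:
          name: rdf-output
          path: rdf-file.nq

  use-rdf:
    needs: generate-rdf                     # run after the first job
    runs-on: ubuntu-latest
    steps:
      - name: Download the artifact from the previous job
        uses: actions/download-artifact@v4
        with:
          name: rdf-output
          path: rdf-output                  # files land in rdf-output/

      - name: List the downloaded files
        run: ls -l rdf-output/
```

Splitting generation and loading into two jobs linked by `needs` lets you re-run the loading job without regenerating the data.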
Secrets
You will need to define these 2 secrets in your GitHub repository secrets: GRAPHDB_USER and GRAPHDB_PASSWORD
Download from specific sources
Google Docs
Download an input file from Google Docs.
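One possible way to do this is with the CSV export URL of a Google Sheets document. In this sketch, SPREADSHEET_ID is a placeholder to replace with your own document ID, the output file name is hypothetical, and the document is assumed to be readable by anyone with the link.

```yaml
- name: Download input file from Google Docs
  run: |
    # SPREADSHEET_ID is a placeholder, not a real document
    curl -fL -o input.csv \
      "https://docs.google.com/spreadsheets/d/SPREADSHEET_ID/export?format=csv"
```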
Kaggle
Downloading an input file from a Kaggle competition requires defining 2 secrets in your GitHub repository.
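A sketch of such a step, assuming the 2 secrets are named KAGGLE_USERNAME and KAGGLE_KEY (the secret names are an assumption; the Kaggle command line tool reads its credentials from environment variables with those names). COMPETITION_NAME is a placeholder.

```yaml
- name: Download input file from Kaggle
  env:
    # Secret names are an assumption for this sketch
    KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
    KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
  run: |
    pip install kaggle
    # COMPETITION_NAME is a placeholder for the competition identifier
    kaggle competitions download COMPETITION_NAME
```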
Limitations
The following limitations apply to GitHub-hosted runners (they do not apply to self-hosted runners):
- Each job in a workflow can run for up to 6 hours of execution time.
- Each workflow run is limited to 72 hours.
- Total concurrent jobs for the free plan: 20 jobs (5 macOS)
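To avoid a stuck job consuming execution time up to the job limit, you can set a shorter limit yourself with `timeout-minutes`. The job name and the 60 minute value below are arbitrary choices for this sketch.

```yaml
jobs:
  process-data:
    runs-on: ubuntu-latest
    timeout-minutes: 60    # cancel the job if it runs longer than 1 hour
    steps:
      - run: echo "Long-running data processing goes here"
```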
GitHub-hosted runners run on machines with the following specifications:
- 2-core CPU
- 7 GB of RAM
- 14 GB of SSD disk space
- Environments: ubuntu-latest (a.k.a. ubuntu-18.04), ubuntu-20.04, windows-latest, macos-latest
The GitHub free plan allows running Actions in private repositories, but imposes execution time and storage limitations. By default, GitHub sets your spending limit to 0 €, so you will not be billed by surprise. The free plan provides the following credits per month, attached to the user or organization owning the repository running the workflows:
- 2,000 minutes of execution time (~33h)
- 500 MB of storage (for private artifacts and GitHub Packages)
Test Actions locally
Act allows you to run GitHub Actions workflows directly on your local machine to test them.
You can provide a specific image (with the -P option) and run a specific job (with the -j option).