GitHub Actions is a workflow engine provided by GitHub, allowing users to define workflows in YAML files in the
.github/workflows folder of their git repository.
Workflows are a succession of steps, each step being either a shell command or a GitHub Action. GitHub Actions are reusable steps that have been packaged to be easily run in workflows; you can find all available Actions on the GitHub Marketplace.
Users can also define when the workflow should be run: at each push to the main branch, every week, triggered manually, etc.
Once the workflow is pushed to the GitHub repository, GitHub will schedule it automatically.
GitHub Actions workflows can run on two types of environments:
- GitHub-hosted runner: runs in an isolated environment (Ubuntu) on GitHub servers (with resource limitations)
- Self-hosted runner: runs the workflow directly on the machine where you deployed the self-hosted runner (with access to all its resources)
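A minimal workflow sketch illustrating both options; the workflow name, job names, and trigger are illustrative:

```yaml
name: Example workflow
on:
  push:
    branches: [ main ]
jobs:
  on-github-runner:
    # Runs in an isolated Ubuntu VM hosted by GitHub
    runs-on: ubuntu-latest
    steps:
      - run: echo "Running on a GitHub-hosted runner"
  on-self-hosted:
    # Runs directly on a machine where a self-hosted runner is deployed
    runs-on: self-hosted
    steps:
      - run: echo "Running on a self-hosted runner"
```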
Secure credentials 🔒
Passwords and other sensitive information can be securely stored as GitHub secrets and used in workflows.
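For example, a secret can be passed to a step as an environment variable; the secret name `SPARQL_PASSWORD` and the script name below are illustrative:

```yaml
steps:
  - name: Upload data to the SPARQL endpoint
    env:
      # Secret defined in the repository settings (Settings > Secrets)
      SPARQL_PASSWORD: ${{ secrets.SPARQL_PASSWORD }}
    run: ./upload.sh
```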
We use GitHub Actions to automatically run the different parts of the workflow in a reproducible way:
- Download the input data files
- Run Python scripts (to directly generate RDF, or to preprocess the data for RML)
- Run the RML mapper to generate the RDF data, if applicable
- Upload the generated RDF file to the SPARQL endpoint
- Generate and publish descriptive statistics for the published data
A few GitHub Actions are available on the GitHub Marketplace to easily work with RDF data.
Execute insert queries using local folder:
Upload RDF from local folder:
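A sketch of how such actions could be used in a workflow. The action references and input names below are hypothetical placeholders, not real Marketplace actions; check the Marketplace for the actual action names and inputs:

```yaml
steps:
  # Hypothetical action executing SPARQL INSERT queries from a local folder
  - uses: some-org/sparql-insert-action@v1   # placeholder, not a real action name
    with:
      folder: ./queries
      endpoint: https://example.org/sparql   # placeholder endpoint

  # Hypothetical action uploading RDF files from a local folder
  - uses: some-org/rdf-upload-action@v1      # placeholder, not a real action name
    with:
      folder: ./rdf-output
      endpoint: https://example.org/sparql   # placeholder endpoint
```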
A GitHub Action for the yarrrml-parser, to convert YARRRML YAML files to RML Turtle files.
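The dedicated Action wraps the yarrrml-parser command line tool (published on npm as `@rmlio/yarrrml-parser`); as a sketch, the same conversion can be run as a plain workflow step, with illustrative file names:

```yaml
steps:
  - name: Convert YARRRML to RML Turtle
    # Install the parser from npm and convert the mapping file
    # (file names are illustrative)
    run: |
      npm install -g @rmlio/yarrrml-parser
      yarrrml-parser -i mapping.yarrr.yml -o mapping.rml.ttl
```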
Convert N-Triples to HDT using the hdt-cpp docker image:
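A sketch of this conversion as a workflow step, assuming the `rdfhdt/hdt-cpp` image from DockerHub; file names are illustrative and `rdf2hdt` options may differ between image versions:

```yaml
steps:
  - name: Convert N-Triples to HDT
    # Mount the current folder in the container and run rdf2hdt on it
    run: |
      docker run -v $(pwd):/data rdfhdt/hdt-cpp \
        rdf2hdt /data/output.nt /data/output.hdt
```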
Work in progress
d2s metadata will generate descriptive statistics for knowledge graphs, defined by the Health Care and Life Science Community Profile, for each graph in the SPARQL endpoint. The computed metadata provide an overview of the SPARQL endpoint content in RDF, with quantitative insights on entity classes and the relations between them.
Requires a Python 3.6+ setup. Metadata are generated as Turtle RDF in the
RDF data can be automatically generated and loaded using GitHub Actions workflows.
See this workflow to generate data using a simple convert_to_rdf.py file and load it into the triplestore.
- Checkout the git repository files in your current folder:
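This step uses the official checkout action:

```yaml
steps:
  - name: Checkout repository
    uses: actions/checkout@v2
```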
- Download input file from Google Docs
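A sketch of this step using the Google Sheets CSV export URL; the document ID and output file name are placeholders:

```yaml
  - name: Download input file from Google Docs
    # <DOC_ID> is a placeholder for the spreadsheet identifier
    run: wget -O input.csv "https://docs.google.com/spreadsheets/d/<DOC_ID>/export?format=csv"
```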
- Install Python dependencies
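For example, using the official setup-python action (the Python version and requirements file are illustrative):

```yaml
  - name: Set up Python
    uses: actions/setup-python@v2
    with:
      python-version: 3.8
  - name: Install Python dependencies
    run: pip install -r requirements.txt
```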
- Run the python script to generate RDF
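Using the convert_to_rdf.py script mentioned above:

```yaml
  - name: Run the Python script to generate RDF
    run: python convert_to_rdf.py
```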
- Optional: clear an existing graph in the triplestore
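A sketch of this step using the standard SPARQL 1.1 Update protocol over HTTP; the endpoint URL, graph URI, and secret names are placeholders:

```yaml
  - name: Clear existing graph in the triplestore
    # Sends a SPARQL UPDATE query; endpoint, graph and secrets are placeholders
    run: |
      curl -X POST \
        -u "${{ secrets.SPARQL_USERNAME }}:${{ secrets.SPARQL_PASSWORD }}" \
        -H "Content-Type: application/sparql-update" \
        --data "CLEAR GRAPH <https://example.org/graph>" \
        https://example.org/sparql
```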
- Upload the output as artifact to be able to download them from the GitHub website, or pass them between jobs:
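Using the official upload-artifact action; the output file path is illustrative:

```yaml
  - name: Upload output as artifact
    uses: actions/upload-artifact@v2
    with:
      name: rdf-output
      path: output.ttl
```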
- Optional: download the artifact (rdf-output here) back in another job:
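Using the official download-artifact action:

```yaml
  - name: Download RDF output artifact
    uses: actions/download-artifact@v2
    with:
      name: rdf-output
```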
The files in the artifact can then be accessed directly, e.g. to pass them between jobs or download them from the GitHub website.
You will need to define these two secrets in your GitHub repository's Actions secrets:
Download input file from Google Docs
Downloading an input file from a Kaggle competition requires defining two secrets:
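A sketch of this step using the kaggle CLI, which reads the standard `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables; the competition name is a placeholder:

```yaml
  - name: Download input file from Kaggle
    env:
      # Standard environment variables expected by the kaggle CLI
      KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
      KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
    run: |
      pip install kaggle
      kaggle competitions download -c <competition-name>
```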
The following limitations apply to GitHub-hosted runners (they do not apply to self-hosted runners):
- Each job in a workflow can run for up to 6 hours of execution time.
- Each workflow run is limited to 72 hours.
- Total concurrent jobs for the free plan: 20 jobs (5 macOS)
GitHub-hosted runners run on machines with the following specifications:
- 2-core CPU
- 7 GB of RAM
- 14 GB of SSD disk space
The GitHub free plan allows running Actions in private repositories, but imposes execution time and storage limitations. By default, GitHub sets your spending limit to 0 €, so you will not be billed by surprise. The free plan provides the following credits per month; they are attached to the user or organization owning the repository running the workflows:
- 2,000 minutes of execution time (~33h)
- 500 MB of storage (for private artifacts and GitHub Packages)
Act allows you to run GitHub Actions workflows directly on your local machine for testing.
Provide a specific image and run a specific job:
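For example, using act's `-j` flag to select a job and `-P` to map a runner label to a Docker image; the job name is illustrative:

```shell
# Run the "build" job locally, mapping the ubuntu-latest
# label to a specific Docker image
act -j build -P ubuntu-latest=nektos/act-environments-ubuntu:18.04
```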