Deploy workflows

GitHub Actions is a workflow engine provided by GitHub that lets users define workflows in YAML files in the .github/workflows folder of their git repository.

Workflows are a succession of steps, each step being either a bash command or a GitHub Action. GitHub Actions are reusable steps packaged to be easily run in workflows; all available Actions can be found on the GitHub Marketplace.

Users also define when a workflow should run: on each push to the main branch, on a weekly schedule, when triggered manually, etc.

Once the workflow file is pushed to the GitHub repository, GitHub schedules and runs it automatically according to its triggers.
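
For example, a minimal workflow file running on every push to the main branch, every Monday, and on manual trigger could look like the following sketch (the file name, job name and schedule are illustrative):

# .github/workflows/generate-rdf.yml (illustrative)
name: Generate RDF
on:
  push:
    branches: [ main ]
  schedule:
    - cron: "0 0 * * 1"    # every Monday at 00:00 UTC
  workflow_dispatch:       # allow triggering manually from the GitHub website
jobs:
  generate-rdf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run a bash command
        run: echo "Put your own commands or Actions here"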

GitHub Actions workflows can run on two types of environments:

  • GitHub-hosted runner: the workflow runs in an isolated environment (Ubuntu) on GitHub servers, with resource limitations
  • Self-hosted runner: the workflow runs directly on the machine where you deployed the self-hosted runner, and has access to all its resources
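
The runner is selected with the runs-on field of the job, for example (a sketch with illustrative job names):

jobs:
  on-github-hosted-runner:
    runs-on: ubuntu-latest    # isolated Ubuntu environment on GitHub servers
    steps:
      - run: echo "Running on a GitHub-hosted runner"
  on-self-hosted-runner:
    runs-on: self-hosted      # runner you registered on your own machine
    steps:
      - run: echo "Running on a self-hosted runner"
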
Secure credentials 🔒

Passwords and other sensitive information can be securely stored as GitHub secrets and used in workflows.
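
For example, a secret can be passed to a step as an environment variable. A minimal sketch, where SPARQL_PASSWORD is a hypothetical secret defined in the repository settings and upload_data.sh a hypothetical script using it:

- name: Use a secret in a step
  env:
    SPARQL_PASSWORD: ${{ secrets.SPARQL_PASSWORD }}
  run: |
    # The secret is available as an environment variable and is masked in the logs
    ./upload_data.sh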

We use GitHub Actions to automatically run the different parts of the workflow in a reproducible way:

  • Download the input data files
  • Run Python scripts (to directly generate RDF, or to preprocess the data for RML)
  • Run the RML mapper to generate the RDF data, if applicable
  • Upload the generated RDF file to the SPARQL endpoint
  • Generate and publish descriptive statistics for the published data

GitHub Actions for RDF

A few GitHub Actions are available on the GitHub marketplace to easily work with RDF data.

πŸ—ΊοΈ RML Mapper#

A GitHub Action for the rmlmapper-java

- name: Run RML mapper
  uses: vemonet/rmlmapper-java@v4.9.0
  with:
    mapping: mappings.rml.ttl
    output: rdf-output.nt
  env:
    JAVA_OPTS: "-Xmx6g"

📬 SPARQL operations

A GitHub Action for d2s-sparql-operations. It allows you to perform operations on SPARQL endpoints using RDF4J (SPARQL SELECT, CONSTRUCT, INSERT and DELETE queries, uploading RDF files, splitting statements, etc.)

Upload RDF files from a local folder:

- uses: MaastrichtU-IDS/sparql-operations-action@v1
  with:
    operation: upload
    file: my-folder/*.ttl
    endpoint: https://graphdb.ontotext.com/repositories/test/statements
    user: ${{ secrets.SPARQL_USER }}
    password: ${{ secrets.SPARQL_PASSWORD }}
    inputvar: https://w3id.org/d2s/graph/geonames
    outputvar: https://w3id.org/d2s/metadata
    servicevar: http://localhost:7200/repositories/test-vincent

Execute SPARQL queries (e.g. insert queries) from a local folder of .rq files:

- uses: MaastrichtU-IDS/sparql-operations-action@v1
  with:
    file: folder-with-rq-files/
    endpoint: https://graphdb.ontotext.com/repositories/test/statements
    user: ${{ secrets.SPARQL_USER }}
    password: ${{ secrets.SPARQL_PASSWORD }}

βœ”οΈ Validate RDF#

A GitHub Action to validate RDF with Jena

- uses: vemonet/jena-riot-action@v3.14
  with:
    input: my_file.ttl

πŸ“ Convert YARRRML to RML#

A GitHub Action for the yarrrml-parser, to convert YARRRML YAML files to RML turtle files.

- uses: vemonet/yarrrml-parser@v1.1
  with:
    input: mappings.yarrr.yml
    output: mappings.rml.ttl

💽 Compress RDF to HDT

Convert N-Triples to HDT using the hdt-cpp Docker image:

- name: Compress RDF to HDT
  uses: vemonet/rdfhdt-action@master
  with:
    input: rdf-output.nt
    output: hdt-output.hdt

📈 Get metadata from SPARQL

Work in progress

Computing HCLS descriptive metadata for a SPARQL endpoint is under development in the d2s CLI.

d2s metadata generates descriptive statistics for knowledge graphs, as defined by the Health Care and Life Science (HCLS) Community Profile, for each graph in the SPARQL endpoint. The computed metadata provides an overview of the SPARQL endpoint content in RDF, with quantitative insights on entity classes and the relations between them.

Requires Python 3.6+. The metadata is generated as RDF Turtle in the metadata.ttl file.

- name: Generate HCLS metadata for a SPARQL endpoint
  run: |
    pip install d2s
    d2s metadata analyze $SPARQL_ENDPOINT -o metadata.ttl

Automate data processing and loading

RDF data can be automatically generated and loaded using GitHub Actions workflows.
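
The step snippets shown below go in the steps section of a workflow job. A minimal sketch of such a workflow file (name, trigger, and job name are illustrative):

name: Generate and publish RDF
on:
  workflow_dispatch:    # trigger manually from the GitHub website
jobs:
  generate-rdf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # add the steps described below here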

See this example workflow, which generates data using a simple convert_to_rdf.py script and loads it into the triplestore:

  1. Check out the git repository files in the current folder:

- uses: actions/checkout@v2

  2. Download the input file from Google Docs:

- name: Download CSV files from Google docs
  run: |
    mkdir -p data/output
    wget -O data/food-claims-kg.xlsx "https://docs.google.com/spreadsheets/d/1RWZ6AlGB8m7PO5kjsbbbeI4ETLwvKLOvkrzOpl8zAM8/export?format=xlsx&id=1RWZ6AlGB8m7PO5kjsbbbeI4ETLwvKLOvkrzOpl8zAM8"

  3. Install the Python dependencies:

- name: Install Python dependencies
  run: |
    python -m pip install -r requirements.txt

  4. Run the Python script to generate RDF:

- name: Run Python script to generate RDF
  run: |
    python src/convert_to_rdf.py

  5. Optional: clear an existing graph in the triplestore:

- name: Clear existing graph
  uses: vemonet/sparql-operations-action@v1
  with:
    query: "CLEAR GRAPH <https://w3id.org/foodkg/graph>"
    endpoint: https://graphdb.dumontierlab.com/repositories/FoodHealthClaimsKG/statements
    user: ${{ secrets.GRAPHDB_USER }}
    password: ${{ secrets.GRAPHDB_PASSWORD }}

  6. Upload the output as an artifact, to be able to download it from the GitHub website or pass it between jobs:

- name: Upload RDF output artifact
  id: stepupload
  uses: actions/upload-artifact@v1
  with:
    name: rdf-output
    path: rdf-file.nq

  7. Optional: download the artifact (rdf-output here) back in another job:

- name: Get RDF output artifact
  uses: actions/download-artifact@v1
  with:
    name: rdf-output

The files in the artifact can then be accessed directly, e.g. rdf-output/rdf-file.nq.
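
To pass the artifact between jobs, the downloading job needs to declare that it depends on the uploading job with needs. A minimal sketch (job names are illustrative):

jobs:
  generate-rdf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # ... steps generating rdf-file.nq ...
      - name: Upload RDF output artifact
        uses: actions/upload-artifact@v1
        with:
          name: rdf-output
          path: rdf-file.nq
  load-rdf:
    needs: generate-rdf    # wait for generate-rdf to finish before starting
    runs-on: ubuntu-latest
    steps:
      - name: Get RDF output artifact
        uses: actions/download-artifact@v1
        with:
          name: rdf-output
      - run: ls rdf-output/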

Secrets

You will need to define these two secrets in your GitHub repository's Actions secrets: GRAPHDB_USER and GRAPHDB_PASSWORD
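
Secrets can be defined on the GitHub website under the repository Settings > Secrets, or, as a convenience, with the GitHub CLI (a sketch, assuming gh is installed and authenticated for this repository):

gh secret set GRAPHDB_USER --body "my-username"
gh secret set GRAPHDB_PASSWORD --body "my-password"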

Download from specific sources

Google Docs

Download an input file from Google Docs:

- name: Download CSV files from Google docs
  run: |
    mkdir -p data/output
    wget -O data/food-claims-kg.xlsx "https://docs.google.com/spreadsheets/d/1RWZ6AlGB8m7PO5kjsbbbeI4ETLwvKLOvkrzOpl8zAM8/export?format=xlsx&id=1RWZ6AlGB8m7PO5kjsbbbeI4ETLwvKLOvkrzOpl8zAM8"

Kaggle

Downloading input files from a Kaggle competition requires defining two secrets, KAGGLE_USERNAME and KAGGLE_KEY:

- name: Download data from Kaggle
  env:
    KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
    KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
  run: |
    pip install kaggle
    kaggle competitions download -c whats-cooking-kernels-only
    unzip -o \*.zip

Limitations

The following limitations apply to GitHub-hosted runners (they do not apply to self-hosted runners):

  • Each job in a workflow can run for up to 6 hours of execution time.
  • Each workflow run is limited to 72 hours.
  • Total concurrent jobs on the free plan: 20 jobs (of which at most 5 macOS jobs)

GitHub-hosted runners run on machines with the following specifications:

  • 2-core CPU
  • 7 GB of RAM memory
  • 14 GB of SSD disk space
  • Environments: ubuntu-latest (a.k.a. ubuntu-18.04), ubuntu-20.04, windows-latest, macos-latest

The GitHub free plan allows you to run Actions in private repositories, but imposes execution time and storage limitations. By default, GitHub sets your spending limit to $0, so you will not be billed by surprise. The free plan provides the following credits per month, attached to the user or organization owning the repository that runs the workflows:

  • 2,000 minutes of execution time (~33h)
  • 500 MB of storage (for private artifacts and GitHub Packages)

Test Actions locally

Act allows you to run GitHub Actions workflows directly on your local machine for testing.

Provide a specific image and run a specific job:

act -P ubuntu-latest=nektos/act-environments-ubuntu:18.04 -j generate-rdf
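
A few other useful act commands (a sketch; check the act documentation for the exact options of your version):

# List the jobs defined in the workflows of the current repository
act -l
# Run the workflows triggered by a push event
act push
# Pass a secret to the workflow
act -s GRAPHDB_PASSWORD=mypassword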