Deploy workflows

GitHub Actions is a workflow engine provided by GitHub that lets users define workflows in YAML files in the .github/workflows folder of their git repository.

Workflows are a succession of steps, each step being either a bash command or a GitHub Action. GitHub Actions are reusable steps packaged to be easily run in workflows; all available Actions can be found on the GitHub Marketplace.

Users also define when a workflow should run: on each push to the main branch, on a weekly schedule, when triggered manually, etc.

Once the workflow file is pushed to the GitHub repository, GitHub schedules and runs it automatically according to its triggers.
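
For example, a minimal workflow file running on every push to the main branch, every Monday, and on manual trigger could look like the following sketch (the file name, job name and schedule are illustrative):

# .github/workflows/generate-rdf.yml (illustrative)
name: Generate RDF
on:
  push:
    branches: [ main ]
  schedule:
    - cron: "0 0 * * 1"    # every Monday at 00:00 UTC
  workflow_dispatch:       # allow triggering manually from the GitHub website
jobs:
  generate-rdf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run a bash command
        run: echo "Put your own commands or Actions here"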

GitHub Actions workflows can run on two types of environments:

  • GitHub-hosted runner: the workflow runs in an isolated environment (Ubuntu) on GitHub servers, with resource limitations
  • Self-hosted runner: the workflow runs directly on the machine where you deployed the self-hosted runner, and has access to all its resources
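
The runner is selected with the runs-on field of the job, for example (a sketch with illustrative job names):

jobs:
  on-github-hosted-runner:
    runs-on: ubuntu-latest    # isolated Ubuntu environment on GitHub servers
    steps:
      - run: echo "Running on a GitHub-hosted runner"
  on-self-hosted-runner:
    runs-on: self-hosted      # runner you registered on your own machine
    steps:
      - run: echo "Running on a self-hosted runner"
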
Secure credentials 🔒

Passwords and other sensitive information can be securely stored as GitHub secrets and used in workflows.
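
For example, a secret can be passed to a step as an environment variable. A minimal sketch, where SPARQL_PASSWORD is a hypothetical secret defined in the repository settings and upload_data.sh a hypothetical script using it:

- name: Use a secret in a step
  env:
    SPARQL_PASSWORD: ${{ secrets.SPARQL_PASSWORD }}
  run: |
    # The secret is available as an environment variable and is masked in the logs
    ./upload_data.sh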

We use GitHub Actions to automatically run the different parts of the workflow in a reproducible way:

  • Download the input data files
  • Run Python scripts (to directly generate RDF, or to preprocess the data for RML)
  • Run the RML mapper to generate the RDF data, if applicable
  • Upload the generated RDF file to the SPARQL endpoint
  • Generate and publish descriptive statistics for the published data

GitHub Actions for RDF

A few GitHub Actions are available on the GitHub marketplace to easily work with RDF data.

πŸ—ΊοΈ RML Mapper#

A GitHub Action for the rmlmapper-java

- name: Run RML mapper
  uses: vemonet/rmlmapper-java@v4.9.0
  with:
    mapping: mappings.rml.ttl
    output: rdf-output.nt
  env:
    JAVA_OPTS: "-Xmx6g"

📬 SPARQL operations

A GitHub Action for d2s-sparql-operations. It allows you to perform operations on SPARQL endpoints using RDF4J (SPARQL SELECT, CONSTRUCT, INSERT and DELETE queries, uploading RDF files, splitting statements, etc.)

Upload RDF files from a local folder:

- uses: MaastrichtU-IDS/sparql-operations-action@v1
  with:
    operation: upload
    file: my-folder/*.ttl
    endpoint: https://graphdb.ontotext.com/repositories/test/statements
    user: ${{ secrets.SPARQL_USER }}
    password: ${{ secrets.SPARQL_PASSWORD }}
    inputvar: https://w3id.org/d2s/graph/geonames
    outputvar: https://w3id.org/d2s/metadata
    servicevar: http://localhost:7200/repositories/test-vincent

Execute SPARQL queries (e.g. insert queries) from a local folder of .rq files:

- uses: MaastrichtU-IDS/sparql-operations-action@v1
  with:
    file: folder-with-rq-files/
    endpoint: https://graphdb.ontotext.com/repositories/test/statements
    user: ${{ secrets.SPARQL_USER }}
    password: ${{ secrets.SPARQL_PASSWORD }}

βœ”οΈ Validate RDF#

A GitHub Action to validate RDF with Jena

- uses: vemonet/jena-riot-action@v3.14
  with:
    input: my_file.ttl

πŸ“ Convert YARRRML to RML#

A GitHub Action for the yarrrml-parser, to convert YARRRML YAML files to RML turtle files.

- uses: vemonet/yarrrml-parser@v1.1
  with:
    input: mappings.yarrr.yml
    output: mappings.rml.ttl

💽 Compress RDF to HDT

Convert N-Triples to HDT using the hdt-cpp Docker image:

- name: Compress RDF to HDT
  uses: vemonet/rdfhdt-action@master
  with:
    input: rdf-output.nt
    output: hdt-output.hdt

📈 Get metadata from SPARQL

Work in progress

Computing HCLS descriptive metadata for a SPARQL endpoint is under development in the d2s CLI.

d2s metadata generates descriptive statistics for knowledge graphs, as defined by the Health Care and Life Science (HCLS) Community Profile, for each graph in the SPARQL endpoint. The computed metadata provides an overview of the SPARQL endpoint content in RDF, with quantitative insights on entity classes and the relations between them.

Requires Python 3.6+. The metadata is generated as RDF Turtle in the metadata.ttl file.

- name: Generate HCLS metadata for a SPARQL endpoint
  run: |
    pip install d2s
    d2s metadata analyze $SPARQL_ENDPOINT -o metadata.ttl

Automate data processing and loading

RDF data can be automatically generated and loaded using GitHub Actions workflows.
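
The step snippets shown below go in the steps section of a workflow job. A minimal sketch of such a workflow file (name, trigger, and job name are illustrative):

name: Generate and publish RDF
on:
  workflow_dispatch:    # trigger manually from the GitHub website
jobs:
  generate-rdf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # add the steps described below here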

See this example workflow, which generates data using a simple convert_to_rdf.py script and loads it into the triplestore:

  1. Check out the git repository files in the current folder:

- uses: actions/checkout@v2

  2. Download the input file from Google Docs:

- name: Download CSV files from Google docs
  run: |
    mkdir -p data/output
    wget -O data/food-claims-kg.xlsx "https://docs.google.com/spreadsheets/d/1RWZ6AlGB8m7PO5kjsbbbeI4ETLwvKLOvkrzOpl8zAM8/export?format=xlsx&id=1RWZ6AlGB8m7PO5kjsbbbeI4ETLwvKLOvkrzOpl8zAM8"

  3. Install the Python dependencies:

- name: Install Python dependencies
  run: |
    python -m pip install -r requirements.txt

  4. Run the Python script to generate RDF:

- name: Run Python script to generate RDF
  run: |
    python src/convert_to_rdf.py

  5. Optional: clear an existing graph in the triplestore:

- name: Clear existing graph
  uses: vemonet/sparql-operations-action@v1
  with:
    query: "CLEAR GRAPH <https://w3id.org/foodkg/graph>"
    endpoint: https://graphdb.dumontierlab.com/repositories/FoodHealthClaimsKG/statements
    user: ${{ secrets.GRAPHDB_USER }}
    password: ${{ secrets.GRAPHDB_PASSWORD }}

  6. Upload the output as an artifact, to be able to download it from the GitHub website or pass it between jobs:

- name: Upload RDF output artifact
  id: stepupload
  uses: actions/upload-artifact@v1
  with:
    name: rdf-output
    path: rdf-file.nq

  7. Optional: download the artifact (rdf-output here) back in another job:

- name: Get RDF output artifact
  uses: actions/download-artifact@v1
  with:
    name: rdf-output

The files in the artifact can then be accessed directly, e.g. rdf-output/rdf-file.nq.
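
To pass the artifact between jobs, the downloading job needs to declare that it depends on the uploading job with needs. A minimal sketch (job names are illustrative):

jobs:
  generate-rdf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # ... steps generating rdf-file.nq ...
      - name: Upload RDF output artifact
        uses: actions/upload-artifact@v1
        with:
          name: rdf-output
          path: rdf-file.nq
  load-rdf:
    needs: generate-rdf    # wait for generate-rdf to finish before starting
    runs-on: ubuntu-latest
    steps:
      - name: Get RDF output artifact
        uses: actions/download-artifact@v1
        with:
          name: rdf-output
      - run: ls rdf-output/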

Secrets

You will need to define these two secrets in your GitHub repository's Actions secrets: GRAPHDB_USER and GRAPHDB_PASSWORD
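
Secrets can be defined on the GitHub website under the repository Settings > Secrets, or, as a convenience, with the GitHub CLI (a sketch, assuming gh is installed and authenticated for this repository):

gh secret set GRAPHDB_USER --body "my-username"
gh secret set GRAPHDB_PASSWORD --body "my-password"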

Download from specific sources

Google Docs

Download an input file from Google Docs:

- name: Download CSV files from Google docs
  run: |
    mkdir -p data/output
    wget -O data/food-claims-kg.xlsx "https://docs.google.com/spreadsheets/d/1RWZ6AlGB8m7PO5kjsbbbeI4ETLwvKLOvkrzOpl8zAM8/export?format=xlsx&id=1RWZ6AlGB8m7PO5kjsbbbeI4ETLwvKLOvkrzOpl8zAM8"

Kaggle

Downloading input files from a Kaggle competition requires defining two secrets, KAGGLE_USERNAME and KAGGLE_KEY:

- name: Download data from Kaggle
  env:
    KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
    KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
  run: |
    pip install kaggle
    kaggle competitions download -c whats-cooking-kernels-only
    unzip -o \*.zip

Limitations

The following limitations apply to GitHub-hosted runners (they do not apply to self-hosted runners):

  • Each job in a workflow can run for up to 6 hours of execution time.
  • Each workflow run is limited to 72 hours.
  • Total concurrent jobs on the free plan: 20 jobs (of which at most 5 macOS jobs)

GitHub-hosted runners run on machines with the following specifications:

  • 2-core CPU
  • 7 GB of RAM memory
  • 14 GB of SSD disk space
  • Environments: ubuntu-latest (a.k.a. ubuntu-18.04), ubuntu-20.04, windows-latest, macos-latest

The GitHub free plan allows you to run Actions in private repositories, but imposes execution time and storage limitations. By default, GitHub sets your spending limit to $0, so you will not be billed by surprise. The free plan provides the following credits per month, attached to the user or organization owning the repository that runs the workflows:

  • 2,000 minutes of execution time (~33h)
  • 500 MB of storage (for private artifacts and GitHub Packages)

Test Actions locally

Act allows you to run GitHub Actions workflows directly on your local machine for testing.

Provide a specific image and run a specific job:

act -P ubuntu-latest=nektos/act-environments-ubuntu:18.04 -j generate-rdf
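
A few other useful act commands (a sketch; check the act documentation for the exact options of your version):

# List the jobs defined in the workflows of the current repository
act -l
# Run the workflows triggered by a push event
act push
# Pass a secret to the workflow
act -s GRAPHDB_PASSWORD=mypassword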