
run-gha-on-slurm

This is still a work in progress. The goal is to run GitHub Actions on the Slurm cluster.

Overview

  1. The Allocator polls the GitHub API for queued jobs
  2. Whenever a job is queued, it allocates an ephemeral Actions runner on the Slurm cluster
  3. Once the job is complete, the runner and its Slurm resources are deallocated

Basic diagram of the system

flowchart LR
    GitHubAPI[("GitHub API")]
    ActionsRunners[("Allocator")]
    Slurm[("Slurm Compute Resources")]

    ActionsRunners --> | Poll Queued Jobs | GitHubAPI 
    ActionsRunners -->| Allocate Actions Runner| Slurm 
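
The sketch below shows what this poll-and-allocate loop could look like. It is a minimal illustration, not the repository's actual implementation: the repository name, token, polling interval, and the `allocate-ephemeral-runner.sh` sbatch script are placeholders, while the GitHub REST endpoints (`/actions/runs` and `/actions/runs/{id}/jobs`) are the standard ones for listing queued workflow runs and their jobs.

```python
# Minimal allocator sketch (hypothetical names; the real allocator may differ).
import subprocess
import time

import requests

GITHUB_API = "https://api.github.com"
REPO = "WATonomous/example-repo"  # placeholder owner/repo
HEADERS = {
    "Authorization": "Bearer <token>",  # placeholder token
    "Accept": "application/vnd.github+json",
}


def queued_job_ids():
    """Return the IDs of queued jobs by walking queued workflow runs."""
    runs = requests.get(
        f"{GITHUB_API}/repos/{REPO}/actions/runs",
        headers=HEADERS,
        params={"status": "queued"},
    ).json().get("workflow_runs", [])
    job_ids = []
    for run in runs:
        jobs = requests.get(
            f"{GITHUB_API}/repos/{REPO}/actions/runs/{run['id']}/jobs",
            headers=HEADERS,
        ).json().get("jobs", [])
        job_ids += [j["id"] for j in jobs if j["status"] == "queued"]
    return job_ids


def allocate_runner(job_id):
    """Submit a Slurm job that brings up one ephemeral Actions runner."""
    # allocate-ephemeral-runner.sh stands in for the custom sbatch script.
    subprocess.run(["sbatch", "allocate-ephemeral-runner.sh", str(job_id)], check=True)


if __name__ == "__main__":
    allocated = set()
    while True:
        for job_id in queued_job_ids():
            if job_id not in allocated:
                allocate_runner(job_id)
                allocated.add(job_id)
        time.sleep(30)  # arbitrary polling interval
```

Because each runner registers as ephemeral, it picks up a single job and then exits, so de-allocation happens naturally when the Slurm job finishes.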

Enabling Docker Within CI Jobs

graph TD
    A[Docker Rootless Daemon] -->| Creates | B[Docker Rootless Socket]
    B -->| Creates | C[Custom Actions Runner Image]
    C -.->| Calls | B
    C --->| Mounts | B
    C -->| Creates | E[CI Helper Containers]
    E -.->| Calls | B

Because all CI Docker commands go through the same Docker socket and therefore share the same filesystem, you need to configure each runner's working directory accordingly.
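
As a rough sketch (the image name, socket path, and environment variable below are placeholders, not this repository's actual configuration), each runner can be started with the rootless socket mounted in and a working directory unique to that runner, mounted at the same path inside the container as on the host so that bind-mount paths passed by CI jobs resolve correctly on the shared daemon:

```python
# Hypothetical sketch: start one runner container against the rootless Docker
# socket with its own working directory. Names and paths are placeholders.
import subprocess


def start_runner_container(runner_id: str, uid: int = 1000) -> None:
    socket = f"/run/user/{uid}/docker.sock"       # rootless daemon socket
    workdir = f"/tmp/actions-runner-{runner_id}"  # unique per runner
    subprocess.run(
        [
            "docker", "run", "--rm", "--detach",
            "--name", f"actions-runner-{runner_id}",
            # Mount the rootless socket so CI jobs can issue Docker commands.
            "-v", f"{socket}:/var/run/docker.sock",
            # Mount the working directory at the same path inside and outside
            # the container, since bind-mount paths in CI `docker run` calls
            # are resolved by the shared daemon on the host side.
            "-v", f"{workdir}:{workdir}",
            "-e", f"RUNNER_WORKDIR={workdir}",  # placeholder env var name
            "ghcr.io/watonomous/actions-runner:latest",  # placeholder image
        ],
        check=True,
    )
```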

Speeding up our Actions Runner Image

After we were able to run the actions runner image as a Slurm job using sbatch and a custom script, we ran into the issue of having to pull the Docker image for every job. Roughly 2 minutes elapsed between the script allocating resources and the job actually starting. When you are running 70+ jobs in a workflow, with some jobs depending on others, this time adds up fast.

Unfortunately, caching the image is not an elegant solution because it would require mounting a filesystem directory into the Slurm job. We would then need multiple directories to support multiple concurrent runners, which would require building a system to manage those directories and would introduce the potential for starvation and deadlocks.

This led us to investigate a Docker pull-through cache.
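
A pull-through cache sits between the runners and Docker Hub, so each image layer is downloaded from the internet once and subsequent pulls are served locally. Below is a minimal sketch of such a setup, assuming a local `registry:2` instance in proxy mode and a rootless daemon configured to use it as a mirror; the port and config path are placeholders, not this repository's actual setup.

```python
# Sketch: start a registry:2 pull-through cache and point the rootless Docker
# daemon at it via "registry-mirrors". Port and config path are placeholders;
# the daemon must be restarted for the mirror setting to take effect.
import json
import pathlib
import subprocess

MIRROR_PORT = 5000

# registry:2 in proxy mode caches the images it fetches from Docker Hub.
subprocess.run(
    [
        "docker", "run", "--detach", "--restart=always",
        "--name", "registry-mirror",
        "-p", f"{MIRROR_PORT}:5000",
        "-e", "REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io",
        "registry:2",
    ],
    check=True,
)

# Rootless Docker reads its daemon config from ~/.config/docker/daemon.json.
daemon_config = pathlib.Path.home() / ".config/docker/daemon.json"
daemon_config.parent.mkdir(parents=True, exist_ok=True)
daemon_config.write_text(
    json.dumps({"registry-mirrors": [f"http://localhost:{MIRROR_PORT}"]}, indent=2)
)
```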

Docker References

  1. Docker Rootless
  2. Custom Actions Runner Image

Issues

  • If the script needs to be restarted while runners are still being built, it will allocate new runners for those jobs once it is back up

Potential issue:

  • job1 requires label1, label2
  • job2 requires label1
  • runner1 is allocated with label1, label2
  • runner1 runs job2
  • runner2 is allocated with label1
  • runner2 CANNOT run job1

This won't be an issue if we use exactly one label (small, medium, large) per job.
