Gitlab CI Autoscaling Setup

Adam Novak edited this page Apr 18, 2019 · 20 revisions

I have set up an autoscaling Gitlab runner, like vg uses, to run multiple tests in parallel.

Setup

I mostly followed the tutorial at https://docs.gitlab.com/runner/configuration/runner_autoscale_aws/

The tutorial has you create a "bastion" instance, on which you install the Gitlab Runner, using the "docker+machine" runner type. Then the bastion instance uses Docker Machine to create and destroy other instances to do the actual testing, as needed, but from the Gitlab side it looks like a single "runner" executing multiple tests.

I created a t2.micro instance named gitlab-ci-bastion, in the gitlab-ci-runner security group, with the gitlab-ci-runner IAM role, using the Ubuntu 18.04 image. I gave it a 20 GB root volume. I protected it from termination. It got IP address 54.218.250.217.

ssh ubuntu@54.218.250.217

I made sure to authorize the "ci" SSH key to access it, in ~/.ssh/authorized_keys.

Then I installed Gitlab Runner and Docker. I had to run each command separately; copy-pasting the whole block did not work.

curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | sudo bash

sudo apt-get -y -q install gitlab-runner

sudo apt-get -y -q install docker.io

sudo usermod -a -G docker gitlab-runner 

sudo usermod -a -G docker ubuntu 

Then I installed Docker Machine. Version 0.16.1 was current:

curl -L https://github.com/docker/machine/releases/download/v0.16.1/docker-machine-`uname -s`-`uname -m` >/tmp/docker-machine &&
chmod +x /tmp/docker-machine &&
sudo mv /tmp/docker-machine /usr/local/bin/docker-machine

Then I disconnected and SSHed back in, so that the new docker group membership took effect. At that point I could successfully run docker ps.
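A quick way to confirm this step is to check that the three tools are on the PATH and that the login user is in the docker group. The following is only a sketch; check_tools and in_group are helper names made up for this example.

```shell
#!/bin/sh
# Report whether each named command is on the PATH.
# Returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Succeed if group $1 appears in the space-separated group list $2.
in_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on the bastion:
#   check_tools gitlab-runner docker docker-machine
#   in_group docker "$(id -nG)" && echo "in docker group"
```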

Then I went and got the Gitlab registration token from the Gitlab web UI. I decided to register the runner to the DataBiosphere group, instead of just the Toil project.

Then I registered the Gitlab Runner with the main Gitlab server, substituting the real token for ##CENSORED##.

sudo gitlab-runner register -n \
  --url https://ucsc-ci.com/ \
  --registration-token ##CENSORED## \
  --executor docker+machine \
  --description "docker-machine-runner" \
  --docker-image "quay.io/vgteam/dind" \
  --docker-privileged
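To keep the token out of the shell history, the same registration could be wrapped in a small function that reads it from an environment variable. This is a sketch, not what I actually ran: register_runner, GITLAB_REG_TOKEN and DRY_RUN are names invented for this example, and the long flag --non-interactive is the spelled-out form of -n.

```shell
#!/bin/sh
# Hypothetical helper: register the runner using the token held in
# GITLAB_REG_TOKEN. With DRY_RUN=1 it prints the command it would run,
# with the token masked, instead of running it.
register_runner() {
    if [ -z "${GITLAB_REG_TOKEN:-}" ]; then
        echo "GITLAB_REG_TOKEN is not set" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        token_arg='<masked>'
        runner='echo sudo gitlab-runner'
    else
        token_arg="$GITLAB_REG_TOKEN"
        runner='sudo gitlab-runner'
    fi
    # $runner is left unquoted on purpose so it splits into words.
    $runner register --non-interactive \
        --url https://ucsc-ci.com/ \
        --registration-token "$token_arg" \
        --executor docker+machine \
        --description "docker-machine-runner" \
        --docker-image "quay.io/vgteam/dind" \
        --docker-privileged
}

# Example use:
#   GITLAB_REG_TOKEN=...   (set from the Gitlab web UI value)
#   DRY_RUN=1; register_runner    # inspect the command first
#   DRY_RUN=0; register_runner    # then run it for real
```

The token is still passed as a command-line argument, so it is briefly visible in the process list while the command runs.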

As soon as the runner registered with the Gitlab server, I found it in the web UI and paused it, so it wouldn't start trying to run jobs until I had it configured properly.

I also at some point updated the packages on the bastion machine:

sudo apt update && sudo apt upgrade -y

I edited the /etc/gitlab-runner/config.toml file to actually configure the runner. After a bit of debugging, I got it looking like this.

# Let the runner run 10 jobs in parallel
concurrent = 10
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "docker-machine-runner"
  url = "https://ucsc-ci.com/"
  # Leave the pre-filled value from your own config.toml here. It is the
  # runner token issued at registration, not the registration token itself.
  token = "##CENSORED##"
  executor = "docker+machine"
  # Run no more than 10 machines at a time.
  limit = 10
  [runners.docker]
    tls_verify = false
    # We reuse this image because it is Ubuntu with Docker 
    # available and virtualenv installed.
    image = "quay.io/vgteam/vg_ci_prebake"
    # t2.xlarge has 16 GB
    memory = "15g"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
  [runners.machine]
    IdleCount = 0
    IdleTime = 60
    # Max builds per machine before recreating
    MaxBuilds = 10
    MachineDriver = "amazonec2"
    MachineName = "gitlab-ci-machine-%s"
    MachineOptions = [
      "amazonec2-iam-instance-profile=gitlab-ci-runner",
      "amazonec2-region=us-west-2",
      "amazonec2-zone=a",
      "amazonec2-use-private-address=true",
      # Make sure to fill in your own owner details here!
      "amazonec2-tags=Owner,anovak@soe.ucsc.edu,Name,gitlab-ci-runner-machine",
      "amazonec2-security-group=gitlab-ci-runner",
      "amazonec2-instance-type=t2.xlarge",
      "amazonec2-root-size=80"
    ]
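In this config, concurrent caps the jobs across all runners on the bastion, and limit caps the jobs (and so the machines) for this one runner, so a limit above concurrent has no extra effect. A rough sanity check along those lines is sketched below. toml_int and check_concurrency are hypothetical helpers that match text line by line; they are not a real TOML parser.

```shell
#!/bin/sh
# Print the integer from the first "key = N" line in a file.
# $1 = key, $2 = file. Plain-text match only, not a TOML parser.
toml_int() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$2" | head -n 1
}

# Compare the global "concurrent" value with the runner's "limit".
# $1 = path to config.toml
check_concurrency() {
    concurrent=$(toml_int concurrent "$1")
    limit=$(toml_int limit "$1")
    if [ -z "$concurrent" ] || [ -z "$limit" ]; then
        echo "could not find concurrent and limit in $1" >&2
        return 2
    fi
    if [ "$limit" -gt "$concurrent" ]; then
        echo "warning: limit ($limit) exceeds concurrent ($concurrent)"
        return 1
    fi
    echo "ok: concurrent=$concurrent limit=$limit"
}

# Example use on the bastion (the file is readable only by root):
#   sudo cat /etc/gitlab-runner/config.toml > /tmp/config-check.toml
#   check_concurrency /tmp/config-check.toml
```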

To make this work, I had to add some IAM policies to the gitlab-ci-runner role. It already had the AWS built-in AmazonS3ReadOnlyAccess, to let the tests read test data from S3. I gave it the AWS built-in AmazonEC2FullAccess to allow the bastion to create the machines. I also gave it gitlab-ci-runner-passrole, a custom policy that I had to talk cluster-admin into creating for me. It allows the bastion to pass the gitlab-ci-runner role on to the machines it creates. That policy had the following contents:

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Sid": "VisualEditor0",
           "Effect": "Allow",
           "Action": "iam:PassRole",
           "Resource": "arn:aws:iam::719818754276:role/gitlab-ci-runner"
       }
   ]
}
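The policy above is tied to one account ID and one role name. If you are adapting this setup, a sketch like the following can print the same document for your own values. make_passrole_policy is a hypothetical helper, not part of the AWS CLI, and it leaves out the Sid field, which is optional.

```shell
#!/bin/sh
# Print an iam:PassRole policy document for a single role.
# $1 = AWS account ID, $2 = role name
make_passrole_policy() {
    cat <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::$1:role/$2"
        }
    ]
}
EOF
}

# Example use, assuming you have permission to create IAM policies:
#   make_passrole_policy 719818754276 gitlab-ci-runner > passrole.json
#   aws iam create-policy --policy-name gitlab-ci-runner-passrole \
#       --policy-document file://passrole.json
```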

After getting all the policies attached to the role, I rebooted the bastion machine to get it to actually start up the Gitlab Runner daemon:

sudo shutdown -r now
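After reconnecting, it is worth confirming that the daemon really did come up before unpausing anything. One way to do that is a small retry loop like the sketch below. wait_for is a made-up helper name, and the example assumes the systemd unit is called gitlab-runner, which is what the package installs.

```shell
#!/bin/sh
# Retry a command about once per second until it succeeds or the
# timeout runs out.
# $1 = timeout in seconds; remaining arguments = command to run
wait_for() {
    remaining=$1
    shift
    while ! "$@"; do
        remaining=$((remaining - 1))
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
    done
}

# Example use on the bastion after the reboot:
#   wait_for 60 systemctl is-active --quiet gitlab-runner \
#       && echo "gitlab-runner is up"
```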

When the bastion came back up, I unpaused the runner in the Gitlab web interface, and it started running jobs. A few jobs failed. To fix them, I set the docker image to the vg_ci_prebake image that vg uses (to provide packages like python-virtualenv) and added python3-dev to the packages that the image carries.

Docker Maintenance

To make more changes to the image, commit to https://github.com/vgteam/vg_ci_prebake and Quay will automatically rebuild it. If you don't have rights to do that and don't want to wait around for a PR, clone the repo, edit it, and make a new Quay project to build your own version.

Future Work

One change I have not yet made is to set a high output_limit (the maximum job log size, in kilobytes) in the [[runners]] section, as described in https://stackoverflow.com/a/53541010, in case the CI logs get too long.
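If we do make that change, it amounts to one extra line under [[runners]]. The sketch below prints a modified copy of the config rather than editing it in place. with_output_limit is a hypothetical helper, and the 20480 in the example is an arbitrary value, not a recommendation.

```shell
#!/bin/sh
# Print a config file with "output_limit = N" added under each
# [[runners]] header. The file itself is not modified.
# $1 = limit in kilobytes, $2 = config file
with_output_limit() {
    awk -v n="$1" '
        { print }
        /^\[\[runners\]\]/ { print "  output_limit = " n }
    ' "$2"
}

# Example use (review the output before copying it into place):
#   sudo cat /etc/gitlab-runner/config.toml > /tmp/config.toml
#   with_output_limit 20480 /tmp/config.toml > /tmp/config.new.toml
```

This assumes the file does not already set output_limit. If it does, the result would contain a duplicate key, so edit the existing line instead.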

I also have not yet destroyed the old shell runner. I want to leave it in place until we are confident in the new system.