This is a service for managing Firefox release operations (RelOps) hardware. It is a rewrite of the build-slaveapi based on tecken to help migrate from buildbot to taskcluster.
The service consists of a Django Rest Framework web API, Redis-backed Celery queue, and one or more Celery workers. It should be run behind a VPN.
                 +-----------------------------------------------------------------------------+
                 | VPN                                                                         |
                 |                                                                             |
+------------+   |   +--------------+     +----------------+     +-----------+     +--------+  |
|            |   |   |  Roller      |     |  Roller        |     |  Roller   +----->        |  |
|  TC Dash.  +------->  API         +----->  Queue         +----->  Workers  |     |  HW 1  |  |
|            |   |   |              |     |                |     |           <-----+        |  |
|            <-------+              <-----+                <-----+           |     +--------+  |
|            |   |   |              |     |                |     |           |                 |
+------------+   |   +----+---+-----+     +----------------+     |           |     +--------+  |
                 |                                               |           +----->        |  |
                 |                                               |           |     |  HW 2  |  |
                 |                                               |           <-----+        |  |
                 |                                               |           |     +--------+  |
                 |                                               |           |                 |
                 |                                               |           |     +--------+  |
                 |                                               |           +----->        |  |
                 |                                               |           |     |  HW 3  |  |
                 |                                               |           <-----+        |  |
                 |                                               +-----------+     +--------+  |
                 |                                                                             |
                 |                                                                             |
                 +-----------------------------------------------------------------------------+
After a Roller admin registers an action with taskcluster, a sheriff or RelOps operator on a worker page of the taskcluster dashboard can use the actions dropdown to trigger an action (ping, reboot, reimage, etc.) on a RelOps managed machine.
Under the hood, the Taskcluster dashboard makes a CORS request to the Roller API, which checks the Taskcluster authorization header and scopes, then queues a Celery task for the Roller worker to run. (There is an open issue for sending notifications back to the user.)
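The authorize-then-queue flow can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not Roller's actual code: `handle_action`, its arguments, and the example scope names are hypothetical, a `deque` stands in for the Celery queue, and the Hawk signature itself is not verified. The `scope_satisfies` rule follows the Taskcluster convention that a scope ending in `*` matches any scope with that prefix.

```python
from collections import deque


def scope_satisfies(granted, required):
    """Taskcluster-style match: exact, or prefix match when granted ends in '*'."""
    if granted.endswith("*"):
        return required.startswith(granted[:-1])
    return granted == required


def handle_action(auth_header, client_scopes, required_scope, task_name, queue):
    """Hypothetical handler: authorize first, then enqueue. Returns True if queued."""
    # Only the presence of a Hawk header is checked here; a real service
    # would also validate the signature against the Taskcluster credentials.
    if not auth_header or not auth_header.startswith("Hawk "):
        return False
    # The client must hold at least one scope that satisfies the required one.
    if not any(scope_satisfies(s, required_scope) for s in client_scopes):
        return False
    queue.append(task_name)  # stand-in for handing the task to Celery
    return True


if __name__ == "__main__":
    jobs = deque()
    queued = handle_action("Hawk id=\"example\"", ["example:roller:*"],
                           "example:roller:ping", "ping", jobs)
    print(queued, list(jobs))
```

Returning a plain boolean keeps the sketch neutral about which HTTP status the real API sends for each failure.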
POST /api/v1/workers/$worker_id/group/$worker_group/jobs?task_name=$task_name
The URL for worker-context Taskcluster actions. This is the URL that needs to be registered.
URL params:
- $worker_id: the Taskcluster Worker ID, e.g. ms1-10. 1 to 128 characters long.
- $worker_group: the Taskcluster Worker Group, e.g. mdc1; usually a datacenter for RelOps hardware. 1 to 128 characters long.
Query param:
- $task_name: the Celery task to run. Must be in TASK_NAMES in settings.py.
Taskcluster does not POST data/body params.
Example request from Taskcluster:
POST http://localhost:8000/api/v1/workers/dummy-worker-id/group/dummy-worker-group/jobs?task_name=ping
Authorization: Hawk ...
Example response:
{"task_name":"ping","worker_id":"dummy-worker-id","worker_group":"dummy-worker-group","task_id":"e62c4d06-8101-4074-b3c2-c639005a4430"}
Where task_name, worker_id, and worker_group are as defined in the request and task_id is the task's Celery AsyncResult UUID.
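As a rough client-side illustration of the request and response shapes above, the sketch below builds the job URL and pulls the task_id out of a response body. The helper names and the `TASK_NAMES` value are assumptions made for the example (the real list lives in settings.py), and no request is actually sent, since a real call also needs a Hawk Authorization header.

```python
import json
from urllib.parse import quote, urlencode

# Assumed value for illustration; the real list is TASK_NAMES in settings.py.
TASK_NAMES = ["ping", "reboot"]


def build_job_url(base_url, worker_id, worker_group, task_name):
    """Build the jobs URL, enforcing the documented parameter limits."""
    for label, value in (("worker_id", worker_id), ("worker_group", worker_group)):
        if not 1 <= len(value) <= 128:
            raise ValueError(f"{label} must be 1 to 128 characters long")
    if task_name not in TASK_NAMES:
        raise ValueError(f"unknown task_name: {task_name}")
    path = "/api/v1/workers/{}/group/{}/jobs".format(
        quote(worker_id, safe=""), quote(worker_group, safe="")
    )
    return f"{base_url.rstrip('/')}{path}?{urlencode({'task_name': task_name})}"


def parse_job_response(body):
    """Return the Celery AsyncResult UUID from a JSON response body."""
    return json.loads(body)["task_id"]


if __name__ == "__main__":
    print(build_job_url("http://localhost:8000", "dummy-worker-id",
                        "dummy-worker-group", "ping"))
```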
To run the service, fetch the Roller image and Redis:
docker pull mozilla/relops-hardware-controller
docker pull redis:3.2
The Roller web API and worker run as separate containers from the same Docker image.
Copy the example settings file (if you don't have the repo checked out: wget https://raw.githubusercontent.com/mozilla-services/relops-hardware-controller/master/.env-dist):
cp .env-dist .env
In production, use --env ENV_FOO=bar instead of an env var file.
Then docker run the containers:
docker run --name roller-redis --expose 6379 -d redis:3.2
docker run -d --name roller-web -p 8000:8000 --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller web
docker run -d --name roller-worker --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller worker
Check that it's running:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f45d4bcc5c3a mozilla/relops-hardware-controller "/bin/bash /app/bi..." 3 minutes ago Up 3 minutes 8000/tcp roller-worker
c48a68ad887c mozilla/relops-hardware-controller "/bin/bash /app/bi..." 3 minutes ago Up 3 minutes 0.0.0.0:8000->8000/tcp roller-web
d1750321c4df redis:3.2 "docker-entrypoint..." 9 minutes ago Up 8 minutes 6379/tcp roller-redis
curl -w '\n' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' http://localhost:8000/api/v1/workers/tc-worker-1/group/ndc2/jobs\?task_name\=ping
<h1>Bad Request (400)</h1>
docker logs roller-web
[2018-01-10 08:27:23 +0000] [5] [INFO] Starting gunicorn 19.7.1
[2018-01-10 08:27:23 +0000] [5] [INFO] Listening at: http://0.0.0.0:8000 (5)
[2018-01-10 08:27:23 +0000] [5] [INFO] Using worker: egg:meinheld#gunicorn_worker
[2018-01-10 08:27:23 +0000] [8] [INFO] Booting worker with pid: 8
[2018-01-10 08:27:23 +0000] [10] [INFO] Booting worker with pid: 10
[2018-01-10 08:27:23 +0000] [12] [INFO] Booting worker with pid: 12
[2018-01-10 08:27:23 +0000] [13] [INFO] Booting worker with pid: 13
172.17.0.1 - - [10/Jan/2018:08:31:46 +0000] "POST /api/v1/workers/tc-worker-1/group/ndc2/jobs HTTP/1.1" 400 26 "-" "curl/7.43.0"
172.17.0.1 - - [10/Jan/2018:08:31:46 +0000] "- - HTTP/1.0" 0 0 "-" "-"
Roller uses an environment variable called DJANGO_CONFIGURATION that defaults to Prod to pick which composable configuration to use.
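The selection logic can be pictured with a plain-Python sketch. This is only an illustration of the idea, not the configuration library Roller uses: the `pick_configuration` helper is hypothetical, and the two classes carry just the CORS_ORIGIN and TASK_NAMES defaults documented in the settings list.

```python
import os


class Dev:
    """Development defaults, as documented for CORS_ORIGIN and TASK_NAMES."""
    CORS_ORIGIN = "localhost"
    TASK_NAMES = ["ping"]


class Prod:
    """Production defaults, as documented for CORS_ORIGIN and TASK_NAMES."""
    CORS_ORIGIN = "tools.taskcluster.net"
    TASK_NAMES = ["reboot"]


CONFIGURATIONS = {"Dev": Dev, "Prod": Prod}


def pick_configuration(environ=os.environ):
    """Return the configuration class named by DJANGO_CONFIGURATION (default Prod)."""
    name = environ.get("DJANGO_CONFIGURATION", "Prod")
    try:
        return CONFIGURATIONS[name]
    except KeyError:
        raise ValueError(f"unknown DJANGO_CONFIGURATION: {name}") from None


if __name__ == "__main__":
    config = pick_configuration()
    print(config.__name__, config.TASK_NAMES, config.CORS_ORIGIN)
```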
In addition to the usual Django, Django Rest Framework and Celery settings we have:
- TASKCLUSTER_CLIENT_ID: The Taskcluster client ID to authenticate with.
- TASKCLUSTER_ACCESS_TOKEN: The Taskcluster access token to use.
- CORS_ORIGIN: Which origin to allow CORS requests from (returned in the CORS access-control-allow-origin header). Defaults to localhost in Dev and tools.taskcluster.net in Prod.
- TASK_NAMES: List of management commands that can be run from the API. Defaults to ping in Dev and reboot in Prod.
- BUGZILLA_URL: URL for the Bugzilla REST API, e.g. https://landfill.bugzilla.org/bugzilla-5.0-branch/rest/
- BUGZILLA_API_KEY: API key for using the Bugzilla REST API.
- XEN_URL: URL for the Xen RPC API (see http://xapi-project.github.io/xen-api/usage.html).
- XEN_USERNAME: Username to authenticate with the Xen management server.
- XEN_PASSWORD: Password to authenticate with the Xen management server.
- ILO_USERNAME: Username to authenticate with the HP iLO management interface.
- ILO_PASSWORD: Password to authenticate with the HP iLO management interface.
- FQDN_TO_SSH_FILE: Path to the JSON file mapping FQDNs to SSH usernames and key file paths (example in settings.py). The SSH keys need to be mounted when docker is run, for example with docker run -v host-ssh-keys:.ssh --name roller-worker. The SSH user on the target machine should use ForceCommand to only allow the command reboot or shutdown. Default: ssh.json
- FQDN_TO_IPMI_FILE: Path to the JSON file mapping FQDNs to IPMI usernames and passwords (example in settings.py). Default: ipmi.json
- FQDN_TO_PDU_FILE: Path to the JSON file mapping FQDNs to PDU SNMP sockets (example in settings.py). Default: pdus.json
- FQDN_TO_XEN_FILE: Path to the JSON file mapping FQDNs to Xen VM UUIDs (example in settings.py). Default: xen.json
Note: there is a bug for simplifying the FQDN_TO_* settings
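To make the FQDN_TO_* files a little more concrete, here is a small sketch of loading one and looking up a machine. The actual schema is whatever the examples in settings.py define, which is not reproduced here; the helper names and the `username` / `key_file` keys used in the demo are hypothetical.

```python
import json
import os
import tempfile


def load_fqdn_map(path):
    """Load a JSON file that is assumed to be an object keyed by FQDN."""
    with open(path) as f:
        mapping = json.load(f)
    if not isinstance(mapping, dict):
        raise ValueError("expected a JSON object keyed by FQDN")
    return mapping


def lookup(mapping, fqdn):
    """Return the entry for one machine, failing loudly if it is not configured."""
    try:
        return mapping[fqdn]
    except KeyError:
        raise LookupError(f"no entry for {fqdn}") from None


if __name__ == "__main__":
    # Hypothetical file contents, written to a temp file for the demo.
    example = {"host1.example.com": {"username": "roller", "key_file": "roller.key"}}
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(example, f)
    try:
        print(lookup(load_fqdn_map(f.name), "host1.example.com"))
    finally:
        os.unlink(f.name)
```

Failing on an unknown FQDN, rather than returning a default, keeps a typo in a config file from silently targeting nothing.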
To list available actions/management commands:
docker run --name roller-runner --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py
Type 'manage.py help <subcommand>' for help on a specific subcommand.
Available subcommands:
[api]
file_bugzilla_bug
ilo_reboot
ipmi_reboot
ipmitool
ping
reboot
register_tc_actions
snmp_reboot
ssh_reboot
xenapi_reboot
To show help for one:
docker run --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py ping --help
usage: manage.py ping [-h] [--version] [-v {0,1,2,3}] [--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback] [--no-color]
[-c COUNT] [-w TIMEOUT] [--configuration CONFIGURATION]
host
Tries to ICMP ping the host. Raises for exceptions for a lost packet or
timeout.
positional arguments:
host A host
optional arguments:
-h, --help show this help message and exit
...
-c COUNT stop after sending NUMBER packets
-w TIMEOUT stop after N seconds
...
And test it:
docker run --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py ping -c 4 -w 5 localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.042 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.074 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.086 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=0.074 ms
--- localhost ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3141ms
rtt min/avg/max/mdev = 0.042/0.069/0.086/0.016 ms
In general, we should be able to run tasks as manage.py commands, and a task should do the same thing whether it is run as a command or via the API.
Note: there is a bug for not requiring Redis to run management commands.
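One way to picture "same behaviour from the command line and from the API" is a single function per task that both entry points call. The sketch below is an assumption-laden illustration, not the project's code: the `TASKS` registry and the two `run_from_*` helpers are hypothetical, and `ping` shells out to the system ping with the same -c and -w flags the management command exposes (iputils ping is assumed).

```python
import subprocess


def build_ping_argv(host, count=4, timeout=5):
    """Build the argv for the system ping; -c and -w mirror the command's flags."""
    return ["ping", "-c", str(count), "-w", str(timeout), host]


def ping(host, count=4, timeout=5):
    """Run ping; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_ping_argv(host, count, timeout), check=True)


# Hypothetical registry: both entry points look the task up by name here.
TASKS = {"ping": ping}


def run_from_cli(argv):
    """Entry point a management command could use, e.g. ["ping", "localhost"]."""
    name, *args = argv
    return TASKS[name](*args)


def run_from_queue(task_name, worker_fqdn):
    """Entry point a queue worker could use for a job pulled off the queue."""
    return TASKS[task_name](worker_fqdn)


if __name__ == "__main__":
    print(build_ping_argv("localhost"))
```

Because both helpers dispatch through the same registry, a fix to `ping` applies to the command and to the API task alike.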
- Create an SSH key and user limited to shutdown or reboot with ForceCommand on the target hardware
- Add the SSH key and user to the mounted worker SSH keys directory
- Add the machine's FQDN to any relevant FQDN_TO_* config files
- Check that the TASK_NAMES setting only includes tasks we want to register with Taskcluster
- Check that TASKCLUSTER_CLIENT_ID and TASKCLUSTER_ACCESS_TOKEN are present as env vars or in settings (via taskcluster-cli login). The client will need the Taskcluster scope queue:declare-provisioner:$provisioner_id#actions
- Run:
docker run --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py register_tc_actions https://roller-dev1.srv.releng.mdc1.mozilla.com my-provisioner-id
Note: An arg like --settings relops_hardware_controller.settings or --configuration Dev may be necessary to use the right Taskcluster credentials.
Note: This does not need to be run from the Roller server, since the first argument is the URL that Taskcluster should send the action to.
- Check the action shows up in the Taskcluster dashboard for a worker on the provisioner e.g. https://tools.taskcluster.net/provisioners/my-provisioner-id/worker-types/dummy-worker-type/workers/test-dummy-worker-group/dummy-worker-id (this might require creating a worker)
- Run the action from the worker's Taskcluster dashboard
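Before running register_tc_actions, the credential and scope checks from the list above can be scripted. The following is a small preflight sketch with hypothetical helper names. It only inspects environment variables (not Django settings) and only formats the scope string; it does not ask Taskcluster whether the client actually holds that scope.

```python
import os

REQUIRED_ENV_VARS = ("TASKCLUSTER_CLIENT_ID", "TASKCLUSTER_ACCESS_TOKEN")


def required_scope(provisioner_id):
    """Scope the Taskcluster client needs in order to register actions."""
    return f"queue:declare-provisioner:{provisioner_id}#actions"


def missing_credentials(environ=os.environ):
    """Return the names of required env vars that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("missing env vars:", ", ".join(missing))
    print("client needs scope:", required_scope("my-provisioner-id"))
```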
This is similar to prod deployment, but uses make, docker-compose, and env files to simplify starting and running things.
To build and run the web server in development mode, and have the worker reload and purge the queue on file changes, run:
make start-web start-worker
To run tests and watch for changes:
make current-shell # requires the start-web / the web server to be running
docker-compose exec web bash
app@ca6a901df6b4:~$ ptw .
Running: py.test .
=========================================================== test session starts ============================================================
platform linux -- Python 3.6.3, pytest-3.3.2, py-1.5.2, pluggy-0.6.0
Django settings: relops_hardware_controller.settings (from environment variable)
rootdir: /app, inifile: pytest.ini
plugins: flake8-0.9.1, django-3.1.2, celery-4.1.0
collected 74 items
...
- Create relops_hardware_controller/api/management/commands/<command_name>.py and tests/test_<command_name>_command.py, e.g. ping.py and test_ping_command.py
- Run make shell, then ./manage.py, and check for the command in the api section of the output
- Add the command name to TASK_NAMES in relops_hardware_controller/settings.py to make it accessible via the API
- Add any required shared secrets like SSH keys to settings.py or .env-dist
- Register the action with Taskcluster