
Merge azure-images to main #74

Merged
merged 88 commits into from
Feb 9, 2024

Conversation

henrirosten
Collaborator

@henrirosten henrirosten commented Feb 7, 2024

Introduce changes from the azure-images branch to the main branch:

After this change, it's possible to spin up a ghaf-infra instance with Terraform following the instructions in README.md. We verified that ghaf x86 targets (both native and cross-compiled) can be built with an example dev-instance when manually triggered over ssh on the jenkins-controller VM.

This is used by the nix_build.sh script to build images with
Terraform.

Signed-off-by: Florian Klink <flokli@flokli.de>
This introduces a terraform module that can be used to nix-build and
upload VM images to Azure.

nix-build.sh originates from https://cs.tvl.fyi/depot/-/blob/ops/terraform/deploy-nixos/nixos-eval.sh,
which is why it inherits its copyright from there.

Signed-off-by: Florian Klink <flokli@flokli.de>
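
The commit above introduces a Terraform module for building and uploading VM images. A minimal usage sketch follows; the module path, variable names, and resource references are illustrative assumptions, not the module's actual interface:

```hcl
# Hypothetical invocation of the azurerm-nix-vm-image module; all names
# below are assumptions for illustration.
module "jenkins_controller_image" {
  source = "./tf-modules/azurerm-nix-vm-image"

  # Flake attribute that nix-build.sh evaluates to produce the VHD.
  nix_attrpath = "nixosConfigurations.jenkins-controller.config.system.build.azureImage"

  resource_group_name    = azurerm_resource_group.infra.name
  location               = azurerm_resource_group.infra.location
  storage_account_name   = azurerm_storage_account.vm_images.name
  storage_container_name = azurerm_storage_container.vm_images.name
}
```

The module would then expose the uploaded image's ID as an output, which downstream VM resources can consume.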
This groups together some common resources used to create a VM. We might
introduce more flexibility at a later point.

Signed-off-by: Florian Klink <flokli@flokli.de>
We can just include azure-config.nix from nixpkgs. It pulls in azure-
common.nix, which contains all the necessary kernel config / udev rules.

It also defines a `config.system.build.azureImage` attribute, which builds
a VHD that we can import into Azure using the `azurerm-nix-vm-image`
Terraform module.

These can be referred to from source_image_id in Terraform
(using azurerm-linux-vm, for example), allowing us to boot the desired
machine config out of the box, without having to do a two-staged deploy.

Signed-off-by: Florian Klink <flokli@flokli.de>
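
A sketch of what a per-host configuration importing the upstream Azure profile might look like; the structure is illustrative, assuming standard nixpkgs module paths:

```nix
# Illustrative host configuration; only the azure-config.nix import is
# taken from the commit above, the rest is an assumption.
{ modulesPath, ... }:
{
  imports = [
    # Pulls in azure-common.nix (kernel config, udev rules) and defines
    # config.system.build.azureImage, a VHD importable into Azure.
    "${modulesPath}/virtualisation/azure-config.nix"
  ];
}
```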
This allows injecting custom userdata into the VM at instance creation
time, which we can use to provision some config (like SSH pubkey config)
that's not part of the NixOS image.

Signed-off-by: Florian Klink <flokli@flokli.de>
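
On the Terraform side, userdata injection might look like the following sketch; the resource name, template file, and variables are assumptions for illustration:

```hcl
# Sketch: passing cloud-init userdata at instance creation time.
# Names below are illustrative, not the repository's actual code.
resource "azurerm_linux_virtual_machine" "example" {
  # ...

  # custom_data must be base64-encoded; cloud-init inside the NixOS
  # image consumes it at first boot.
  custom_data = base64encode(templatefile("${path.module}/userdata.tftpl", {
    ssh_keys = var.ssh_keys
  }))
}
```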
Signed-off-by: Florian Klink <flokli@flokli.de>
Signed-off-by: Florian Klink <flokli@flokli.de>
Signed-off-by: Florian Klink <flokli@flokli.de>
azure-common.nix already sets
services.openssh.settings.{PermitRootLogin,ClientAliveInterval},
so we need to decide what wins.

To keep the intended behaviour, we want to mkForce PermitRootLogin to
"no" (azure-common.nix sets "prohibit-password"), and set
ClientAliveInterval with mkDefault - bumping that timeout probably makes
sense for Azure, and we don't want the setting in this file to take
priority.

Signed-off-by: Florian Klink <flokli@flokli.de>
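
The precedence fix described above could be expressed roughly as follows; the ClientAliveInterval value is an illustrative assumption:

```nix
# Sketch of the option-priority fix: azure-common.nix already sets both
# options, so we pick a winner explicitly.
{ lib, ... }:
{
  services.openssh.settings = {
    # Override azure-common.nix's "prohibit-password" unconditionally.
    PermitRootLogin = lib.mkForce "no";
    # Let azure-common.nix's bumped timeout win if it sets one;
    # 300 here is an illustrative fallback value.
    ClientAliveInterval = lib.mkDefault 300;
  };
}
```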
This file contains all ssh public keys used by real humans.
It's parsed from Terraform to inject into instance metadata.

Signed-off-by: Florian Klink <flokli@flokli.de>
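
Parsing such a key file from Terraform might look like this sketch; the file name, format, and structure are assumptions, not the repository's actual layout:

```hcl
# Illustrative only: read a shared ssh public key file and feed it into
# instance userdata. File name and JSON structure are assumptions.
locals {
  ssh_keys = jsondecode(file("${path.module}/ssh-keys.json"))
}
```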
This builds the jenkins-master Nix image, turns it into a bootable Azure
image, and then boots an instance with the image.

Signed-off-by: Florian Klink <flokli@flokli.de>
Signed-off-by: Florian Klink <flokli@flokli.de>
That way, the VM survives reboots - the non-networkd configuration seems
to be quite brittle.

Signed-off-by: Florian Klink <flokli@flokli.de>
Ideally, we'd keep systemd-resolved disabled too, but the way nixpkgs
configures cloud-init prevents it from picking up DNS settings from
elsewhere.

Signed-off-by: Florian Klink <flokli@flokli.de>
Move the Azure-specific config snippet into its own file, so we can
import it from multiple configuration.nix files.

azure-common.nix is already used for the existing machine
configurations, and as we don't want to break these, it's using this
transient name.

Signed-off-by: Florian Klink <flokli@flokli.de>
This gives each VM a system-assigned identity, and exposes the principal
ID as a module output, allowing us to grant access to certain resources.

Signed-off-by: Florian Klink <flokli@flokli.de>
This exposes a read-only HTTP webserver for the contents in the storage
container.
`rclone serve http` takes care of exposing the storage container over
HTTP.

We disallow listing (by only allowing access to certain paths), and
expose it over HTTP(S) with auto-ssl via caddy.

This will work with whatever domain we route to it, so it's not part of
the configuration.

Signed-off-by: Florian Klink <flokli@flokli.de>
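
The serving setup described above could be sketched as a NixOS module along these lines; service options, the rclone remote name, and paths are illustrative assumptions, not the repository's actual module:

```nix
# Rough sketch of the read-only webserver: rclone exposes the storage
# container locally, caddy terminates TLS in front of it.
{ pkgs, ... }:
{
  systemd.services.rclone-http = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig.ExecStart =
      # "azure:binary-cache-v1" is an assumed rclone remote name.
      "${pkgs.rclone}/bin/rclone serve http --addr 127.0.0.1:8080 azure:binary-cache-v1";
  };

  services.caddy = {
    enable = true;
    # Caddy obtains certificates automatically for whatever domain is
    # routed here, so the domain is not baked into the image.
    virtualHosts.":443".extraConfig = ''
      reverse_proxy 127.0.0.1:8080
    '';
  };
}
```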
This works around NixOS/nixpkgs#272532, we
can revert this once NixOS/nixpkgs#272617 has
landed here.

Signed-off-by: Florian Klink <flokli@flokli.de>
We don't want to blindly issue certs for all domains, but make this configurable.

This should be config coming from the environment, via cloud-init.

Signed-off-by: Florian Klink <flokli@flokli.de>
Define this for each machine outside the VM, and describe everything in
a single security group.
Attaching multiple security groups caused confusing duplicate errors;
this might be a Terraform Azure provider bug.

Signed-off-by: Florian Klink <flokli@flokli.de>
This adds filesystem-related tools to the $PATH of cloud-init, so it
can format disks via its disk_setup (and fs_setup) config keys.

This will be used to format data volumes attached to VMs.

Signed-off-by: Florian Klink <flokli@flokli.de>
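
A sketch of how those tools might be added; the exact package set is an assumption:

```nix
# Illustrative: put mkfs and partitioning tools on cloud-init's PATH so
# its disk_setup / fs_setup modules can format data volumes.
{ pkgs, ... }:
{
  systemd.services.cloud-init.path = with pkgs; [ util-linux e2fsprogs ];
}
```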
We need to use cloud-init to format and mount data volumes in azure, we
can't use systemd for it.

Due to
hashicorp/terraform-provider-azurerm#6117,
disks in Azure get attached late at boot, so any dev-disk-by-….device
units created via systemd-fstab-generator might not exist yet at the
time the graph for multi-user.target is created, causing systemd to fail
to start downstream services due to a missing dependency.

Once the volume is attached, the .device unit pops up via udev, and then
a manual restart of services depending on data disks would work, but
it's messy.

Letting cloud-init take care of data disk mounting (and formatting) is
the right choice; that way systemd doesn't need to do any dependency
tracking of it.

Signed-off-by: Florian Klink <flokli@flokli.de>
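
The cloud-init side of this might look like the following cloud-config sketch; the device path and mount point are illustrative assumptions:

```yaml
# Illustrative cloud-config: format and mount a data volume without
# involving systemd's dependency graph. Paths are assumptions.
disk_setup:
  /dev/disk/azure/scsi1/lun10:
    table_type: gpt
    layout: true
    overwrite: false
fs_setup:
  - device: /dev/disk/azure/scsi1/lun10
    partition: auto
    filesystem: ext4
mounts:
  - [/dev/disk/azure/scsi1/lun10-part1, /var/lib/caddy]
```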
This adds the ghafbinarycache storage account, and a binary-cache-v1
storage container inside of it.

It's used to serve artifacts (via the binary-cache VM), and Nix
build artifacts are also uploaded to it.

Signed-off-by: Florian Klink <flokli@flokli.de>
This deploys the VM defined at binary-cache.

Attaching the data disks is still a bit messy (requires one reboot, or
manual reverse proxy restart).
Fixing this requires some more debugging.

Signed-off-by: Florian Klink <flokli@flokli.de>
Signed-off-by: Florian Klink <flokli@flokli.de>
The service-binary-cache module is all the specific hosts need.

Signed-off-by: Florian Klink <flokli@flokli.de>
Otherwise, cloud-init.service might still be running while we start up
services expecting the mount to already be in place.

Signed-off-by: Florian Klink <flokli@flokli.de>
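
Expressed as a NixOS module, that ordering might look like this sketch; the service name is an illustrative assumption:

```nix
# Illustrative: don't start a mount-dependent service until cloud-init
# has finished formatting and mounting the data disk.
{
  systemd.services.rclone-http = {
    after = [ "cloud-init.service" ];
    requires = [ "cloud-init.service" ];
  };
}
```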
Configure the domain and storage account name with cloud-init.

This allows keeping the same NixOS image across multiple deployments of
this image, serving another bucket at another domain.

Also, switch to listening on port 443 only, caddy can use the
TLS-ALPN-01 challenge just fine.

Signed-off-by: Florian Klink <flokli@flokli.de>
This should use tls-alpn-01 on port 443 just fine.

Signed-off-by: Florian Klink <flokli@flokli.de>
Apparently canonical/cloud-init#4673 and more
hacks are not needed, we can simply ramp up the timeout that systemd is
willing to wait for the .device unit to appear.

Signed-off-by: Florian Klink <flokli@flokli.de>
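
Bumping that timeout can be done per-filesystem via an fstab option; mount point, device path, and value below are illustrative assumptions:

```nix
# Illustrative: give the late-attached Azure data disk's .device unit
# more time to appear before systemd gives up on the mount.
{
  fileSystems."/var/lib/jenkins" = {
    device  = "/dev/disk/azure/scsi1/lun10-part1";
    fsType  = "ext4";
    options = [ "x-systemd.device-timeout=5min" ];
  };
}
```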
Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Fix cloud-config startup by adding a dependency on mnt-resource.mount

Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Move the binary cache signing key to its own resource group; this makes
it possible to share the signing key between the private development
environments.

Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Move ssh private key to azure-secrets resource group, similarly to how
binary cache signing key was moved in the previous commit.

Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Remove the functions `delete_keyvault` and `import_sigkey`, which are no
longer needed after the two previous commits that move the builder ssh
private key and the binary cache signing key to their own resource
group. These secrets now persist even after workspace destruction, so
there's no need to generate or delete them separately outside Terraform.

Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Do not automatically switch to default workspace after `destroy`
command.

Improve workspace name matching by not allowing partial matches.

Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Associate each builder VM with the correct network security group. Before
this change, builders were bound to the binary cache's security group.

Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Run the jenkins service only after cloud-config. This is an attempt to
fix the occasional jenkins service startup failures.

Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
@henrirosten henrirosten changed the title Merge azure images to main Merge azure-images to main Feb 7, 2024
@henrirosten henrirosten marked this pull request as ready for review February 7, 2024 06:00
@henrirosten henrirosten requested a review from a team February 7, 2024 06:00
Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
This reverts 3ba044e, moving the nixpkgs
revision back to b0b2c5445c64191fd8d0b31f2b1a34e45a64547d from 23.11,
which is the same nixpkgs version that was already used in the main
branch.

Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Move azure nix host configurations to their own subdirectory to avoid
confusion with the ficolo (e.g. 'binarycache') and azure
('binary-cache') nix configurations.

Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
Move the caddy state disk to persistent storage. The binary-cache VM
stores Let's Encrypt certificates and data on the caddy state disk. This
state disk needs to live in 'persistent' data; otherwise there will be
issues with certificate authority rate limits when development
environments are deployed and subsequently destroyed.

Signed-off-by: Henri Rosten <henri.rosten@unikie.com>

@mnokka mnokka left a comment

Azure_image branch worked on in playground tests, go ahead

@henrirosten henrirosten merged commit 211229f into tiiuae:main Feb 9, 2024
1 check passed
@henrirosten henrirosten deleted the merge-azure-images-to-main branch February 13, 2024 12:44