Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(experimental-ec2-pattern): Pattern to deploy ASG updates w/CFN #2417

Merged
merged 23 commits into from
Sep 17, 2024

Conversation

akash1810
Copy link
Member

@akash1810 akash1810 commented Aug 20, 2024

What does this change?

Reduce the “time spent broken” after failed deploys, making these manual steps redundant, by adding a new experimental pattern GuEc2AppExperimental to provision an EC2 based service with an AutoScalingRollingUpdate update policy. The configuration is based on a number of tests we ran; we've documented some various observations.

This is an alternative to #2395. It improves on the solution proposed in #2395 by providing continuity of ASG metrics and ASG activity history.

AutoScalingRollingUpdate offers an improved DX over the autoscaling Riff-Raff deployment type when a deployment fails. On deployment failure, the autoscaling Riff-Raff deployment type leaves the ASG over-capacity, requiring manual intervention to resolve, whereas AutoScalingRollingUpdate will perform a rollback automatically. This means deployments are more atomic.

Note

Unfortunately testing has shown unpredictable behaviour if a scale-in event occurs during deployment. Therefore we suspend the AlarmNotification process during a rolling update.

This means scaling events cannot occur during deployment and guardian/riff-raff#1342 is not fixed.

With AutoScalingRollingUpdate CloudFormation will deploy application updates to an ASG in a rolling fashion. The rate at which the update rolls out is dependant on how scaled-out an ASG is:

  • If an ASG does not have any scaling policies, deployment will behave similar to Riff-Raff; the DesiredCapacity will be doubled, once new instances are healthy the capacity will be halved removing the old instances.

  • For an ASG that has a scaling policy, the rate will be dynamic:

    • If it is running normally, DesiredCapacity will be doubled, once new instances are healthy the capacity will be halved removing the old instances (testing observations).
    • If the service is partially scaled, deployment will happen in batches relative to the remaining capacity (MinCapacity / (MaxCapacity - DesiredCapacity)) (testing observations).
    • If the service is fully scaled, deployment will happen in a "one out, one in" fashion (testing observations).

    This will be achieved by Riff-Raff setting MinInstancesInService at deploy time via a CloudFormation Parameter.

    This functionality is yet to be added to Riff-Raff. It doesn't block shipping this change, however services that use scaling policies cannot use this experimental pattern until then.

As application updates are performed entirely by CloudFormation, it is not necessary to use the autoscaling RIff-Raff deployment type1. If it is used, it shouldn't break anything. However it does mean deployments will take a long time, as the instances will be rotated twice; once by CloudFormation and once by Riff-Raff's custom logic.

Changes to UserData

With AutoScalingRollingUpdate CloudFormation will remain in the UPDATE_IN_PROGRESS state whilst waiting for a SUCCESS signal to be received from the instance. This signal is sent manually.

The GuEc2AppExperimental pattern will add this signal emission to the UserData. A SUCCESS signal will be sent once the instance has successfully registered with the Target Group.

Why experimental?

There are a few requirements for this approach:

  • The AWS CLI, and cfn-signal binaries need to be available and on the PATH (this is added via the AMIgo role aws-tools).
  • The service's artifact should include a build number, as this is the most reliable way to create a difference with the running CloudFormation template, and thus trigger an update.

We're not (yet) validating these are met, as its tricky.

Proposed rollout

The rollout plan looks something like this:

  1. Dogfood on some DevX services.
  2. Test on a service within the department.
  3. Move the pattern to stable, as a breaking change, with communication on the migration path, etc.

How to test

These changes have been informed by "real-world" testing that has been documented in https://github.com/guardian/testing-asg-rolling-update.

This branch is being successfully used in guardian/amiable, guardian/cdk-playground and guardian/security-hq.

How can we measure success?

A DX improvement when a deployment fails.

Once fully adopted, we will be able to remove the autoscaling deployment type in Riff-Raff, reducing it's complexity.

Have we considered potential risks?

As this new pattern is experimental there risk levels are fairly small. We should monitor usage and gather feedback before promoting to stable.

Checklist

  • [ - ] I have listed any breaking changes, along with a migration path 2
  • I have updated the documentation as required for the described changes 3

Footnotes

  1. We still need the uploadArtifacts action of the autoscaling deployment type. The cloud-formation deployment type should depend on this step too.

  2. Consider whether this is something that will mean changes to projects that have already been migrated, or to the CDK CLI tool. If changes are required, consider adding a checklist here and/or linking to related PRs.

  3. If you are adding a new construct or pattern, has new documentation been added? If you are amending defaults or changing behaviour, are the existing docs still valid?

Copy link

changeset-bot bot commented Aug 20, 2024

🦋 Changeset detected

Latest commit: 7a12f60

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@guardian/cdk Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

akash1810 added a commit to guardian/testing-asg-rolling-update that referenced this pull request Aug 22, 2024
@akash1810 akash1810 force-pushed the aa/ec2-AutoScalingReplacingUpdate branch 3 times, most recently from 2812ba0 to 11e2fb6 Compare September 13, 2024 13:40
@akash1810 akash1810 marked this pull request as ready for review September 13, 2024 15:31
@akash1810 akash1810 requested a review from a team as a code owner September 13, 2024 15:31
@akash1810 akash1810 force-pushed the aa/ec2-AutoScalingReplacingUpdate branch 2 times, most recently from 7a1f2e7 to d6e6f9e Compare September 13, 2024 16:30
akash1810 added a commit to guardian/cdk-playground that referenced this pull request Sep 13, 2024
Comment on lines +94 to +159
new CfnParameter(this.stack, `MinInstancesInServiceFor${autoScalingGroup.app}`, {
type: "Number",
default: parseInt(cfnAutoScalingGroup.minSize),
maxValue: parseInt(cfnAutoScalingGroup.maxSize) - 1,
}),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the app contains hyphens (-), or other illegal characters for a CFN Parameter, they'll get stripped out. This means Riff-Raff cannot reliably just find CFN Parameters that start with MinInstancesInServiceFor. Therefore Riff-Raff's logic would be something like this:

  1. Find all AWS::AutoScaling::ScalingPolicys in the CloudFormation template that has MinInstancesInService set via a CloudFormation Parameter
  2. Find current DesiredCapacity of ASG that relates to the Scaling Policy
  3. Set the CFN Parameter identified in step 1 to the value identified in step 2

Copy link
Contributor

@jacobwinch jacobwinch Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if our Aspect could add Outputs for Riff-Raff to read.

This would mean that step 1 can be simplified, as Riff-Raff would not need to parse the full template or lookup the associated ASG name (which will not be included in the template unless this property is set).

For example:

Outputs:
  MinInstancesInServiceForAppA:
    Description: The AutoScaling group name associated with the MinInstancesInServiceForAppA parameter.    
    Value: !Ref AsgForAppA 
  MinInstancesInServiceForAppB:
    Description: The AutoScaling group name associated with the MinInstancesInServiceForAppB parameter.    
    Value: !Ref AsgForAppB

I think we can obtain these outputs pretty easily via the existing describeStack function and IIUC using !Ref here will mean that we can avoid looking up the ASG name ourselves.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting idea! Tempted to make these changes in a separate PR once we've know more about what the changes in Riff-Raff would look like. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tempted to make these changes in a separate PR once we've know more about what the changes in Riff-Raff would look like. WDYT?

Yes, that sounds good to me 👍

@akash1810 akash1810 changed the title feat(experimental-ec2-pattern): Pattern to deploy ASGs updates via CloudFormation (AutoScalingRollingUpdate) feat(experimental-ec2-pattern): Pattern to deploy ASG updates via CloudFormation (AutoScalingRollingUpdate) Sep 13, 2024
@akash1810 akash1810 changed the title feat(experimental-ec2-pattern): Pattern to deploy ASG updates via CloudFormation (AutoScalingRollingUpdate) feat(experimental-ec2-pattern): Pattern to deploy ASG updates w/CFN Sep 13, 2024
akash1810 added a commit to guardian/amiable that referenced this pull request Sep 14, 2024
akash1810 added a commit to guardian/amiable that referenced this pull request Sep 14, 2024
akash1810 added a commit to guardian/amiable that referenced this pull request Sep 14, 2024
akash1810 added a commit to guardian/amiable that referenced this pull request Sep 14, 2024
@akash1810 akash1810 force-pushed the aa/ec2-AutoScalingReplacingUpdate branch from f6dc526 to b4801a0 Compare September 14, 2024 13:49
akash1810 added a commit to guardian/amiable that referenced this pull request Sep 14, 2024
Copy link
Contributor

@jacobwinch jacobwinch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work 💯

…a CloudFormation (`AutoScalingRollingUpdate`)
…arkers

This should make it easier to parse a user data string if ever one is debugging.
During a deployment, CloudFormation updates the min and desired.

During a rollback (e.g. if the healthcheck failed), CloudFormation only resets the min.
The desired is still elevated, meaning the service is over provisioned.

Explicitly setting the desired property of the ASG ensures CloudFormation rollback puts the ASG back to the initial state,
e.g. correctly provisioned.
A scale-in event fires during a rolling update can cause service disruption.
Suspend scaling events during a rolling update for safety.
…where scaling policy present

Some practical testing of `AutoScalingRollingUpdate` has demonstrated that
when an ASG has a scaling policy, it is safest to dynamically set the `MinInstancesInService` property.
Add an aspect to do that.

See also https://github.com/guardian/testing-asg-rolling-update.
The `ec2metadata` command was failing with a 401 with AMIable CODE in deployTools account:

```console
root@ip-10-248-51-213:/var/lib/cloud/instance# ec2metadata --instance-id
Traceback (most recent call last):
  File "/usr/bin/ec2metadata", line 249, in <module>
    main()
  File "/usr/bin/ec2metadata", line 245, in main
    display(metaopts, burl, prefix)
  File "/usr/bin/ec2metadata", line 192, in display
    value = m.get(metaopt)
  File "/usr/bin/ec2metadata", line 177, in get
    return self._get('meta-data/' + metaopt)
  File "/usr/bin/ec2metadata", line 137, in _get
    resp = urllib_request.urlopen(urllib_request.Request(url))
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unautho
```

This service uses IMDSv2. A 401 response usually happens when a request is made without a token.
However `ec2metadata` does exchange a token.

Switch to a more reliable mechanism.

See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html.
The behaviour being tested is already covered by `should only adjust properties of a horizontally scaling service`.
… period

Matching these properties allows rollbacks to happen as quickly as possible.
@akash1810 akash1810 force-pushed the aa/ec2-AutoScalingReplacingUpdate branch from 6276b0e to 7a12f60 Compare September 17, 2024 13:32
@akash1810 akash1810 merged commit 3588b24 into main Sep 17, 2024
4 checks passed
@akash1810 akash1810 deleted the aa/ec2-AutoScalingReplacingUpdate branch September 17, 2024 13:41
@akash1810
Copy link
Member Author

akash1810 commented Sep 17, 2024

Usage of this experimental pattern can be found via Service Catalogue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants