
Apps can not auto-scale until an autoscaling deploy has successfully completed #1342

Open
rtyley opened this issue May 23, 2024 · 4 comments



rtyley commented May 23, 2024

Since #83 back in April 2013, Riff Raff autoscaling deploys have always disabled ASG scaling alarms at the start of a deploy (SuspendAlarmNotifications), and only re-enabled them at the end of the deploy, once deployment has successfully completed:

```scala
SuspendAlarmNotifications(autoScalingGroup, target.region),
TagCurrentInstancesWithTerminationTag(autoScalingGroup, target.region),
ProtectCurrentInstances(autoScalingGroup, target.region),
DoubleSize(autoScalingGroup, target.region),
HealthcheckGrace(
  autoScalingGroup,
  target.region,
  healthcheckGrace(pkg, target, reporter) * 1000
),
WaitForStabilization(
  autoScalingGroup,
  secondsToWait(pkg, target, reporter) * 1000,
  target.region
),
WarmupGrace(
  autoScalingGroup,
  target.region,
  warmupGrace(pkg, target, reporter) * 1000
),
WaitForStabilization(
  autoScalingGroup,
  secondsToWait(pkg, target, reporter) * 1000,
  target.region
),
CullInstancesWithTerminationTag(autoScalingGroup, target.region),
TerminationGrace(
  autoScalingGroup,
  target.region,
  terminationGrace(pkg, target, reporter) * 1000
),
WaitForStabilization(
  autoScalingGroup,
  secondsToWait(pkg, target, reporter) * 1000,
  target.region
),
ResumeAlarmNotifications(autoScalingGroup, target.region)
```

There are good reasons for this, but it leads to two problems:

  • Even during a successful deploy, there's a window of ~3 minutes when the app cannot scale.
  • More severely, if a deploy fails, the app is left with ASG scaling alarms disabled, requiring developers to manually re-enable them (or run another deploy) - the response time here can be much longer, even hours.

For apps that see sudden, unpredictable bursts of traffic and deploy many times a day, these windows add up, and eventually a deploy will coincide with a traffic spike the app cannot respond to.

Ophan Tracker outage - 22nd May 2024

full incident summary

  • 16:04 - Ophan PR #6109, a minor change to the Ophan Dashboard, is merged. This will trigger a deploy of all Ophan apps, including the Ophan Tracker.

  • 16:11 - App Notification for major news story "Rishi Sunak will call general election for July this afternoon in surprise move, senior sources tell the Guardian" is sent out.

  • 16:12:02 - Riff Raff deploy disables auto-scaling alarms, with the size of the ASG set to 3 instances

  • 16:13:32 - Ophan Tracker's scale-up alarm enters ALARM status. The Tracker ASG would normally scale up after 2 consecutive ALARM states 1 minute apart, but ASG scale-up has been disabled by the deploy.

  • 16:14:26 - Riff Raff deploy culls the 3 old instances, taking the ASG from 6 back down to 3 instances - the cluster is now badly under-scaled for the spike in traffic

  • 16:14:37 - Riff Raff deploy starts the final WaitForStabilization, which is the last step before re-enabling alarms. Due to the servers being so overloaded, they never stabilise. The step has a 15 minute timeout.

  • 16:29:42 - The deploy finally fails as WaitForStabilization times out, and the alarms are left disabled.

  • 17:19:30 – Tracker ASG is manually scaled up to 6 instances by the Ophan team

  • 17:23:12 – Tracker ASG stops terminating unhealthy instances - the outage has lasted just over 1 hour

  • 17:30:41 - Alarms are finally re-enabled by the Ophan team performing a new deploy

In this case, had ResumeAlarmNotifications run immediately before the final WaitForStabilization, the deploy would still have failed, but the outage would probably have ended within a minute or two of 16:14 - a ~2 minute outage rather than a 1 hour one.
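The fix amounts to reordering the tail of the deploy recipe. A minimal sketch in Python (task names are taken from the deploy steps above; the list structure is purely illustrative, not the real Scala task list):

```python
# Illustrative model of the proposed reordering of the final deploy steps.

current_order = [
    "CullInstancesWithTerminationTag",
    "TerminationGrace",
    "WaitForStabilization",      # can never complete if the ASG is overloaded...
    "ResumeAlarmNotifications",  # ...because scaling stays suspended until here
]

proposed_order = [
    "CullInstancesWithTerminationTag",
    "TerminationGrace",
    "ResumeAlarmNotifications",  # re-enable scaling first...
    "WaitForStabilization",      # ...then wait; scale-ups can now happen
]

# In the proposed order, scaling is re-enabled before the final wait,
# so an overloaded ASG can scale up and eventually stabilise.
assert proposed_order.index("ResumeAlarmNotifications") < proposed_order.index(
    "WaitForStabilization"
)
```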


jacobwinch commented May 23, 2024

Thanks for raising this issue.

In this case, had ResumeAlarmNotifications run immediately before the final WaitForStabilization

I think trying to run ResumeAlarmNotifications earlier in the process [1] is a good idea. I agree that it would've helped to mitigate the impact of this particular incident. It would also shorten the window where apps cannot scale during successful deployments [2], which might help with performance more generally.

For this to work, I think it would be desirable to replace the final WaitForStabilization task with a new task (WaitForOldInstancesToTerminate, or similar).

The current task checks for an expected number of instances. That is acceptable today because Riff-Raff effectively holds a lock on the desired capacity setting by blocking scaling operations (i.e. we know that the number won't change).

Once we re-enable scaling, the desired number of instances becomes a moving target, so I think it would be better for Riff-Raff to check that there are 0 instances with the Magenta=Terminate tag still running in the ASG. This would allow us to be sure that all instances are now running the build currently being deployed [3], which means it is safe/correct to mark the deployment as successful.

Footnotes

  1. I think we could probably re-enable scaling after CullInstancesWithTerminationTag if we implemented some of the other changes described in this comment.

  2. It's a shame that we can't deploy and scale up at the same time, but I think making that possible requires a major architecture change in Riff-Raff or in the way that we set up EC2 apps (e.g. we could do this if all apps had 2 separate ASGs and used a blue/green deployment strategy).

  3. Some of these might have been launched by an ASG scaling action, rather than Riff-Raff's actions, but it doesn't really matter at this point. We want all requests to go to instances running the new build, as we are already confident that the new build passes the healthcheck before we start terminating the old instances.
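The tag-based check proposed above can be sketched as follows. This is a hypothetical Python model of the logic only - the instance/tag shapes are invented for illustration and are not the real AWS API or Riff-Raff code:

```python
# Sketch of the proposed WaitForOldInstancesToTerminate check: rather than
# comparing instance counts against a desired size that can now move (scaling
# has been re-enabled), check that no instance in the ASG still carries the
# termination tag.

TERMINATION_TAG = ("Magenta", "Terminate")

def cull_complete(instances):
    """True once no running instance is still tagged for termination."""
    key, value = TERMINATION_TAG
    return not any(inst["tags"].get(key) == value for inst in instances)

old = {"id": "i-old", "tags": {"Magenta": "Terminate"}}
new = {"id": "i-new", "tags": {}}
scaled_up = {"id": "i-extra", "tags": {}}  # launched by a scaling action mid-cull

assert not cull_complete([old, new])    # an old instance is still running
assert cull_complete([new, scaled_up])  # cull done, despite the concurrent scale-up
```

Note that the check stays correct even while the ASG scales, which is exactly why it can run after scaling is re-enabled.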

rtyley added a commit that referenced this issue May 23, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description
indicates this was a necessary choice. We can see the argument for it -
obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by
alarms - but if the ASG instances are already overloaded and recycling, the ASG
will _never_ stabilise, because it needs to scale up to handle the load it's
experiencing.

By simply putting the final `WaitForStabilization` step _after_
`ResumeAlarmNotifications`, the Ophan outage would have been shortened from
1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances
in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not
scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or
twice, and the `WaitForStabilization` condition will still be easily satisfied,
and the deploy will be reported as success.
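The stabilisation condition described in the commit message can be modelled very simply. A minimal sketch (the function and parameter names are illustrative, not Riff Raff's actual code):

```python
# Model of what WaitForStabilization checks, per the commit message: the
# number of in-service instances in the ASG and healthy instances in the ELB
# both match the ASG's *desired* size. If a scale-up moves the desired size
# and the instance counts follow it, the condition is still satisfied.

def is_stable(desired_size, asg_in_service, elb_healthy):
    return asg_in_service == desired_size and elb_healthy == desired_size

assert is_stable(3, 3, 3)        # steady state
assert not is_stable(3, 6, 6)    # mid-deploy: twice as many instances as desired
assert is_stable(5, 5, 5)        # after scaling up from 3 to 5, still stabilises
```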
rtyley added a commit that referenced this issue May 23, 2024

rtyley commented May 23, 2024

I think trying to run ResumeAlarmNotifications earlier in the process is a good idea.

Thanks! I've just opened draft PR #1345 to do this - your excellent feedback in your comment has given me some more stuff to think about!


akash1810 commented May 23, 2024

I think a lot of our manual orchestration of ASGs during a deployment can be swapped for instance refresh today.

Instance refresh is different from our current process:

  • It does gradual instance replacement, with optional checkpointing [1]. We currently replace all at once.
  • It can automatically rollback when an alarm state is entered. We require manual intervention.
  • It allows for horizontal scaling whilst deploying. We don't.
  • It enforces true versioning. We don't really have versioning in a meaningful way.

We'd likely want to keep our current pre-flight check requiring the number of healthy instances in the load balancer to match the ASG.

I think blue/green deployment, as @jacobwinch describes, is the end-goal to seek. However, I wonder if adopting instance refresh solves the issues witnessed, while requiring fewer changes than those needed to support (and migrate to) multiple ASGs/LBs and DNS swapping during a deployment.

Lastly, instance refresh is an AWS native capability, meaning there's less code for us to maintain.

Footnotes

  1. I'd be curious to see how checkpointing impacts deployment times.
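For a concrete sense of the capabilities listed above, here is a sketch of the request an instance refresh might use, shaped for boto3's `start_instance_refresh` API. The ASG and alarm names are hypothetical, and the actual call is left commented out since it needs AWS credentials and a real ASG:

```python
# Sketch of adopting instance refresh (not what Riff Raff does today).
# The Preferences keys shown (MinHealthyPercentage, CheckpointPercentages,
# CheckpointDelay, AutoRollback, AlarmSpecification) are documented options
# of the EC2 Auto Scaling StartInstanceRefresh API.

def build_instance_refresh_request(asg_name, alarm_names):
    return {
        "AutoScalingGroupName": asg_name,
        "Strategy": "Rolling",  # gradual instance replacement
        "Preferences": {
            "MinHealthyPercentage": 100,    # keep full capacity while replacing
            "CheckpointPercentages": [50],  # optional checkpointing (footnote 1)
            "CheckpointDelay": 60,          # seconds to pause at the checkpoint
            "AutoRollback": True,           # roll back automatically on alarm
            "AlarmSpecification": {"Alarms": alarm_names},
        },
    }

# Hypothetical names, for illustration only:
request = build_instance_refresh_request("ophan-tracker-PROD", ["tracker-high-latency"])
# import boto3
# boto3.client("autoscaling").start_instance_refresh(**request)
```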


rtyley commented May 24, 2024

our manual orchestration of ASGs during a deployment can be swapped for instance refresh today.

I'd not heard of Instance Refresh (though apparently it was introduced in 2020!), but having read about it, it does sound good!

It looks like Instance Refresh is a way of rolling out launch template configuration updates, which means it can roll out new EC2 instances using whatever new AMI id and User Data are in the new launch template. This suggests we may want to move away from the current deployment model of downloading whatever artifact happens to be on S3 at a fixed path (e.g. s3://ophan-dist/ophan/PROD/tracker/tracker_0.1.0-SNAPSHOT_all.deb, where the version number 0.1.0 never changes), and instead make the User Data download a specific version of the artifact from S3 (perhaps directly from the 'riffraff-artifact' bucket?). That would make the rollback feature of Instance Refresh effective: rollback can only work if the launch template fully dictates what software runs on the instance.
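To illustrate the idea of pinning a build in User Data, here is a hypothetical sketch - the bucket, paths, and build number are invented, not the real Riff Raff artifact layout:

```python
# Hypothetical: render User Data that installs one specific build, so that a
# launch template version fully dictates what runs on the instance (which is
# what makes instance-refresh rollback meaningful).

def render_user_data(bucket, project, build_number):
    artifact = f"s3://{bucket}/{project}/{build_number}/tracker.deb"
    return "\n".join([
        "#!/bin/bash -ev",
        f"aws s3 cp {artifact} /tmp/tracker.deb",
        "dpkg -i /tmp/tracker.deb",
    ])

user_data = render_user_data("riffraff-artifact", "ophan--tracker", "1234")
assert "/1234/" in user_data  # a new build implies a new launch template version
```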

I think blue/green deployment, as @jacobwinch describes, is the end-goal to seek.

Fair enough - so it sounds like we're talking about something like this?

  1. Short term, do something like #1345 (autoscaling deploy: re-enable ASG scaling before final stabilisation check) to make current autoscaling deploys better
  2. Mid-term, adopt Instance Refresh, with launch templates that fully dictate what software runs on the instance
  3. Long-term, blue/green deployment

rtyley added a commit that referenced this issue May 30, 2024
rtyley added a commit that referenced this issue Jun 4, 2024
rtyley added a commit that referenced this issue Jun 4, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_
`ResumeAlarmNotifications`, rather than _after_, and if the ASG instances are
already overloaded and recycling, the ASG will _never_ stabilise, because it _needs
to scale up_ to handle the load it's experiencing.

In this change, we introduce a new task, `WaitForCullToComplete`, that can establish
whether the cull has completed or not, regardless of whether the ASG is scaling -
it simply checks that there are no remaining instances tagged for termination.
Consequently, once we've executed `CullInstancesWithTerminationTag` to _request_ old
instances terminate, we can immediately allow scaling with `ResumeAlarmNotifications`,
and then `WaitForCullToComplete` _afterwards_.

With this change in place, the Ophan outage would have been shortened from
1 hour to ~2 minutes, a much better outcome!

Common code between `CullInstancesWithTerminationTag` and `WaitForCullToComplete` has
been factored out into a new `CullSummary` class.