Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Set MinInstancesInService via CFN parameters #1383

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

akash1810
Copy link
Member

@akash1810 akash1810 commented Oct 11, 2024

Context

In guardian/cdk#2417 support was added to @guardian/cdk (GuCDK) to deploy application updates to an autoscaling group (ASG) via CloudFormation (CFN) with the AutoScalingRollingUpdate update policy.

During the testing, we observed that for a service that horizontally scales, the MinInstanceInService property should be set relative to the current state of the ASG. Consequently when an autoscaling group has scaling policies, GuCDK adds a parameter to the CFN template in the form MinInstancesInServiceFor<APP>:

To support this dynamic property, services that horizontally scale should expose a CloudFormation parameter for Riff-Raff to set during deployment.

https://github.com/guardian/testing-asg-rolling-update/blob/main/README.md

What does this change?

This change adds a new riff-raff.yaml property to the ami-cloudformation-parameter and cloud-formation deployment types called minInstancesInServiceParameters.

The CFN parameter added by GuCDK does not provide a direct connection the to ASG it relates to. The new minInstancesInServiceParameters property attempts to provide this connection via tag lookups. That is, it is a map of CFN parameter -> ASG tags.

In the below example, we're telling Riff-Raff to expect a CFN parameter named MinInstancesInServiceForcdkplayground in the CdkPlayground.template.json CFN template and to set it's value from the desired capacity of an ASG tagged App=cdk-playground, Stage=PROD, Stack=playground.

allowedStages:
  - PROD
stacks:
  - playground
regions:
  - eu-west-1
deployments:
  cfn-deployment:
    type: cloud-formation
    parameters:
      templateStagePaths:
        PROD: CdkPlayground.template.json
      minInstancesInServiceParameters:
        MinInstancesInServiceForcdkplayground:
          App: cdk-playground

Once the ASG has been found, the MinInstancesInService property to whichever is lower between desired or 75% of max, meaning:

  • When running normally, then we double capacity and half it again once healthy
  • When partially or fully scaled, we rotate instances in batches, which is slightly slower

This contrasts with the recommendation from https://github.com/guardian/testing-asg-rolling-update:

  • If the service is running at "normal" capacity (e.g. desired = min), then MinInstancesInService can be set to match min.
  • If the service is partially scaled, then MinInstancesInService should be set to match the current desired capacity.
  • If the service is fully scaled, then MinInstancesInService should be set to (at least) max -1.

https://github.com/guardian/testing-asg-rolling-update/blob/main/README.md

This is because being percentage based offers faster deployment.

How to test

See added unit tests.

We can also perform some practical tests. Each of the following have been performed on https://github.com/guardian/cdk-playground.

"Normal" capacity (i.e. desired = min)

guardian/cdk-playground#548 tests this scenario. It demonstrates the CFN parameter was correctly set to desired.

Deployment log showing value of CFN parameter

image

Partial scaled (i.e. min < desired < max)

guardian/cdk-playground#548 tests this scenario. It demonstrates the CFN parameter was correctly set to desired.

Deployment log showing value of CFN parameter

image

Fully scaled (i.e. desired = max)

guardian/cdk-playground#548 tests this scenario. It demonstrates the CFN parameter was set to 7 (75% of 10, rounded down).

Deployment log showing value of CFN parameter

image

No scaling policy (i.e. our typical ASG)

This test amounts to omitting the minInstancesInServiceParameters property in the riff-raff.yaml file.

This has been done (guardian/cdk-playground#547), and the deployment behaved as expected: CFN doubled the ASG capacity, waited for SUCCESS signals, then removed old instances by halving the ASG capacity.

This can be seen on the dashboard (note: this link might not work in future).

minInstancesInServiceParameters describes a CFN parameter that does not exist

guardian/cdk-playground#549 tests this scenario.

The deployment should fail when referencing a CFN parameter that does not exist in the CFN template.

Deployment log showing deployment error

image

minInstancesInServiceParameters describes an ASG that does not exist

guardian/cdk-playground#550 tests this scenario.

The deployment should fail when referencing an ASG that does not exist.

Deployment log showing deployment error

image

Initial deployment (i.e. CFN stack does not yet exist)

This code path is the same as the "minInstancesInServiceParameters describes an ASG that does not exist" test.

How can we measure success?

This change means we're able to use the AutoScalingRollingUpdate deployment mechanism on all ASG architectures: those that horizontally scale and those that do not horizontally scale.

akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024
akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024
akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024
akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024
akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024
This to allow testing of guardian/riff-raff#1383
whilst the `riff-raff.yaml` generator from GuCDK hasn't been updated.
akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024
This is to allow testing of guardian/riff-raff#1383
whilst the `riff-raff.yaml` generator from GuCDK hasn't been updated.
@akash1810 akash1810 marked this pull request as ready for review October 11, 2024 17:04
@akash1810 akash1810 requested a review from a team as a code owner October 11, 2024 17:04
@akash1810 akash1810 force-pushed the aa/minInstancesInService branch 5 times, most recently from 1f7d109 to 0f62f2a Compare October 17, 2024 18:23
Copy link
Contributor

@adamnfish adamnfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of comments, sorry! Some tiny Scala style things, but a few other observations in there as well.

We have nice tests for the fiddly min / 75% logic, which is great. The other tricky bit is the way tags get added from multiple places now, it'd be great if we could make sure this is covered?

magenta-lib/src/main/scala/magenta/tasks/AWS.scala Outdated Show resolved Hide resolved
magenta-lib/src/main/scala/magenta/tasks/AWS.scala Outdated Show resolved Hide resolved
magenta-lib/src/main/scala/magenta/tasks/AWS.scala Outdated Show resolved Hide resolved
magenta-lib/src/main/scala/magenta/tasks/AWS.scala Outdated Show resolved Hide resolved
@@ -602,7 +604,8 @@ class CloudFormationTest
Map.empty,
Map.empty,
(_: CfnParam) =>
(_: String) => (_: String) => (_: Map[String, String]) => None
(_: String) => (_: String) => (_: Map[String, String]) => None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realise it isn't from this PR, but this is a bit strange!

The type on the case class is the following:

latestImage: CfnParam => String => String => Map[String, String] => Option[String]

This says latestImage is a function that takes CfnParam, and returns a function that takes a String which in turn returns a function that takes a String, which in turn returns a function that takes a Map and returns Option[String].

This is called currying, and would be normal in an ML-style functional language but is a bit out of place in Scala. Simpler to use the following, which would make these tests much clearer.

latestImage: (CfnParam, String, String, Map[String, String]) => Option[String]

Now latest image is a function that takes 4 parameters and returns Option[String].

Up to you if you think now is a good time to change this for the extra test clarity, do it later, or leave it as-is 🤷 .

(This applies in a few place, whatever you choose)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we leave it as-is, the type annotations aren't required here. Since we're discarding all these arguments anyway I don't think there's any benefit in making the types explicit, we could just write the following:

_ => _ => _ => _ => None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose we make these changes in a separate PR, for the benefit of a single purpose PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep and as I said no dramas if you want to leave it anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽

I'll leave this comment "unresolved" to make it easier to locate/reference in any future PR(s).

magenta-lib/src/test/scala/magenta/tasks/ASGTest.scala Outdated Show resolved Hide resolved
@@ -398,6 +398,150 @@ class ASGTest extends AnyFlatSpec with Matchers with MockitoSugar {
) shouldBe Right(())
}

it should "calculate MinInstancesInService as desired when there is capacity to double" in {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These feel like great candidates for property-based testing! This would let us assert that the described properties hold tue with a range of ASG configurations, rather than just providing a specific example of each. As well as giving us wider coverage, this would help us to cover off the fiddly edge cases without having to manually add lots of extra cases.

With the addition of the following in the build dpendency...

"org.scalatestplus" %% "scalacheck-1-18" % "3.2.19.0" % Test

... we'd have scalacheck available (we're actually pulling in Scalatest's scalachek integration, which in turn pulls in scalacheck). We can then mix the ScalaCheckDrivenPropertyChecks trait into our test to make this functionality available:

import org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks

class ASGTest extends AnyFlatSpec with Matchers with MockitoSugar with ScalaCheckDrivenPropertyChecks

We can then use forAll in our tests to have the library generate random test data for us. Here's a simple hello world example to demonstrate the principle in general:

"the length of the concatenation of two strings should be the sum of the strings' lengths" in {
  forAll { (str1: String, str2: String) =>
    val expected = str1.length + str2.length
    (str1 ++ str2).length shouldBe expected
  }
}

This will provide random strings to our test and check the property holds true for all of them, automatically. If it finds a problem it will "shrink" the example to try and give a nice clean minimal example the demonstrates the problem. Try this out yourself! Break the implementation and see what happens!


However, in our case we don't want purely random data (Ints in our case). We want ASG configurations that make sense for the test case in question. We're looking at the "when there is capacity to double" case here, so let's take that one. Our test data needs to have a max that is at least double the desired.

Instead of asking for completely random test data, let's create our own generator and tell forAll to use that. In general terms:

val generateMyType = ???
forAll(generateMyType) { myType =>
  // your test
} 

Our generator wants to make two ints. A desired, and a max. The latter depends on the value of desired, which is neatly expressed using flatMap (aka a for comprehension). Gen.choose is provided by the library to address the common use case of needing an number from a range.

import org.scalacheck.Gen

val generateAsgSettings = for {
  desired <- Gen.choose(1, 100)
  max <- Gen.choose(desired * 2, desired * 4) // we can think about what values we want here!
} yield (desired, max)

Now we can write the test as before, but instead of hard coding the max and desired to 6 and 3 respectively, we can use the generated values.

import org.scalacheck.Gen

it should "calculate MinInstancesInService as desired when there is capacity to double" in {
  val genASGAllowingDouble = for {
    desired <- Gen.choose(1, 100)
    max <- Gen.choose(desired * 2, desired * 4) // we can think about what values we want here!
  } yield (desired, max)

  forAll(genASGAllowingDouble) { case (desired, max) =>
    // ...
    when(asgDescribeIterableMock.autoScalingGroups()) thenReturn toSdkIterable(
      List(
        AutoScalingGroupWithTags(
          min = 3,
          max = max,
          desired = desired,
          "Stack" -> "playground",
          "Stage" -> "PROD",
          "App" -> "api",
          "aws:cloudformation:stack-name" -> "playground-PROD-api"
        )
      ).asJava
    )
  }
}

Of course, now we start asking ourself other questions, like what should the "min" be for these tests (maybe the same as desired, maybe not)? This is one benefit of property based testing, it helps you to think about the data you're testing!

Sorry for the long comment, be happy to talk about this more with you elsewhere 😄

Copy link
Member Author

@akash1810 akash1810 Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure if the, albeit slight, indirection reduces accessibility/maintainability. Additionally, given this is a new deployment mechanism, might having explicit capacity values in the tests makes it easier to explain teams as they migrate?

Maybe we can implement these changes in a separate PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep just food for thought, we can do it later or not at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽

I'll leave this comment "unresolved" to make it easier to locate/reference in any future PR(s).

@akash1810
Copy link
Member Author

The other tricky bit is the way tags get added from multiple places now, it'd be great if we could make sure this is covered?

I can't think of the best way to do this:

  • getMinInstancesInService take explicit tags, rather than a general list of tags, but that's quite a complex interface
  • Another option could be to cross reference the discovered ASG with a CFN list resources call and cross reference the two. However, this felt overly complex.
  • Tags prefixed aws: can only be created by AWS. Maybe getMinInstancesInService could assert that the list of tag requirements has at least one aws:cloudformation:* tag? This doesn't directly address your concern though.

Tricky!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants