-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add the ability to remove nodes older than a certain age even when the node group has reached the min #240
Conversation
} | ||
|
||
// We don't want to attempt to rotate nodes that have reached the max age if we haven't reached the minimum node | ||
// count, as the scaling down of the node group will remove the oldest first anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still an edge case here which is a node group that somehow has very consistent load such that it always maintains X number of nodes (e.g. always 2) and thus never gets cycled. I think this is very unlikely with the kinds of workloads that run on Escalator, but worth calling out in case others think it's worth considering here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thought of this edge case whilst writing this feature - I guess it's a been an edge case in Escalator since it's existence, and won't be fixed or made worse by this feature being added.
Ideally if there is a consistent base load, the mininum nodes in the ASG should really be configured to a value that meets the base load, then this feature can be used to ensure the nodes are cycled out.
I can check if this edge case is documented in Escalator's documentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can check if this edge case is documented in Escalator's documentation
This isn't documented anywhere - I've raised #241
pkg/controller/controller.go
Outdated
|
||
// We don't want to attempt to rotate nodes that have reached the max age if we haven't reached the minimum node | ||
// count, as the scaling down of the node group will remove the oldest first anyway. | ||
if len(untaintedNodes) != nodeGroup.Opts.MinNodes || len(untaintedNodes) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be <= to catch weird edge cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I think we want to keep len(untaintedNodes) != nodeGroup.Opts.MinNodes
, as we're only wanting this feature to come into play when we're at the minimum node level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant was using <=
is more defensive in case we (or the cloud provider) have a bug elsewhere in the code that would allow the number of nodes present to be less than the number of MinNodes. In that case, we would actually start checking their age and potentially mess with the scale delta here, which complicates getting back to the MinNodes regardless.
All of this said, it's an unlikely edge case and I'm fine with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant was using <= is more defensive in case we (or the cloud provider) have a bug elsewhere in the code that would allow the number of nodes present to be less than the number of MinNodes.
Exceeding the max and being under the min is already caught by Escalator here -
escalator/pkg/controller/controller.go
Lines 237 to 254 in 97246b1
if len(allNodes) < nodeGroup.Opts.MinNodes { | |
err = errors.New("node count less than the minimum") | |
log.WithField("nodegroup", nodegroup).Warningf( | |
"Node count of %v less than minimum of %v", | |
len(allNodes), | |
nodeGroup.Opts.MinNodes, | |
) | |
return 0, err | |
} | |
if len(allNodes) > nodeGroup.Opts.MaxNodes { | |
err = errors.New("node count larger than the maximum") | |
log.WithField("nodegroup", nodegroup).Warningf( | |
"Node count of %v larger than maximum of %v", | |
len(allNodes), | |
nodeGroup.Opts.MaxNodes, | |
) | |
return 0, err | |
} |
}, | ||
listerOptions: ListerOptions{}, | ||
}, | ||
1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in this case, the behaviour is that we scale up one node, it would be replaced and then we would do another scale up to replace the other older node once we made it back to our minimum node count again. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep - that's the desired behaviour
nil, | ||
}, | ||
{ | ||
"max_node_age enabled, some nodes are tainted", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be a test case where we're at the MinNodes but some are tainted? I don't think that should occur in regular operation though, just throwing this out there in case it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a look into adding this test case - it doesn't make sense to add to this feature's tests, as Escalator should get its self out of this scenario prior to the maxNodeAge code running (it will untaint the tainted nodes in an attempt to get the untainted nodes equal to the min nodes).
Again, this should be tested elsewhere already.
@@ -471,6 +476,31 @@ func TestValidateNodeGroup(t *testing.T) { | |||
HardDeleteGracePeriod: "1h10m", | |||
ScaleUpCoolDownPeriod: "55m", | |||
TaintEffect: "", | |||
MaxNodeAge: "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Piggy backing this particular edge case onto the case where there's an empty TaintEffect means that us also checking that we're parsing an empty MaxNodeAge is kinda lost. Either make a new table test case here or add that we're checking this field in the name of the table test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have added a test case for this
args args | ||
expectedNodeDelta int | ||
err error | ||
}{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we add a test for num nodes == maxNodes, and all expired, with scale up enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are you worried about here? Are we expecting delta to be 0 because we're at max, or to be positive but unable to go further due to max? Either way I think that particular case isn't related to this feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah just the very unlikely case of nodes at max for a long time, and this min of +1 node delta to cycle them trying to push above maxASG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually perfectly fine for node delta to be +1 when at the max ASG size - we have logic later on when trying to increase the ASG in the cloud provider above the max that will reject the scale up.
I don't think this test case is really relevant for this specific feature and the case of node delta exceeding the ASG max should be tested elsewhere.
Ideally we're really testing if nodeDelta is either +1 or 0 given the criteria for the feature and assuming whatever Escalator does with that nodeDelta is appropriately tested later on
if c.scaleOnMaxNodeAge(nodeGroup, untaintedNodes) { | ||
log.WithField("nodegroup", nodegroup). | ||
Info("Setting scale to minimum of 1 to rotate out a node older than the max node age") | ||
nodesDelta = int(math.Max(float64(nodesDelta), 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this also protect against nodesDelta exceeding ASGMax? Or is it protected against in normal scale functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is somewhat related to the other conversation, but I don't think this is relevant for the tests of this feature - this should already be covered by the general scaling tests that check that if nodeDelta would breach the ASG max, that Escalator prevents it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
…e node group has reached the min - adds max_node_age field to node group configuration - max_node_age by default is disabled, and needs to be a positive, greater than zero duration to enable - when the node group has reached the minimum and there are nodes older than the max_node_age, the scale delta will be set to +1 - this will then force escalator to replace the oldest node fixes #239 Signed-off-by: Alex Price <aprice@atlassian.com>
…o rotate nodes when at the ASG min, but there is a tainted node Signed-off-by: Alex Price <aprice@atlassian.com>
fixes #239