# Automatically Evict Replicas While Draining

## Summary

Currently, Longhorn allows the choice between a number of behaviors (node drain policies) when a node is cordoned or
drained:

- `Block If Contains Last Replica` ensures the `instance-manager` pod cannot be drained from a node as long as it is the
last node with a healthy replica for some volume.

Benefits:

- Protects data by preventing the drain operation from completing until a healthy replica of each volume is available
  on another node.

Drawbacks:

- If there is only one replica for the volume, or if its other replicas are unhealthy, the user may need to manually
(through the UI) request the eviction of replicas from the disk or node.
- Volumes may be degraded after the drain is complete. If the node is rebooted, redundancy is reduced until it is
running again. If the node is removed, redundancy is reduced until another replica rebuilds.

- `Allow If Last Replica Is Stopped` is similar to the above, but only prevents an `instance-manager` pod from
draining if it has the last RUNNING replica.

Benefits:

- Allows the drain operation to proceed in situations where the node being drained is expected to come back online
(data will not be lost) and the replicas stored on the node's disks are not actively being used.

Drawbacks:

- Similar drawbacks to `Block If Contains Last Replica`.
- If, for some reason, the node never comes back, data is lost.

- `Always Allow` never prevents an `instance-manager` pod from draining.

Benefits:

- The drain operation completes quickly without Longhorn getting in the way.

Drawbacks:

- There is no opportunity for Longhorn to protect data.

This proposal seeks to add fourth and fifth behaviors (node drain policies) with the following properties:

- `Block For Eviction` ensures the `instance-manager` pod cannot be drained from a node as long as it contains any
replicas for any volumes. Replicas are automatically evicted from the node as soon as it is cordoned.

Benefits:

- Protects data by preventing the drain operation from completing until all replicas have been relocated.
- Automatically evicts replicas, so the user does not need to do it manually (through the UI).
- Maintains replica redundancy at all times.

Drawbacks:

- The drain operation is significantly slower than for other behaviors. Every replica must be rebuilt on another node
before it can complete.
- The drain operation is data-intensive, especially when replica auto balance is enabled, as evicted replicas may be
moved back to the drained node when/if it comes back online.
- Like all of these policies, it triggers on cordon, not on drain (it is not possible for Longhorn to distinguish
between a node that is actively being drained and one that is cordoned for some other reason). If a user
regularly cordons nodes without draining them, replicas will be rebuilt pointlessly.

- `Block For Eviction If Contains Last Replica` ensures the `instance-manager` pod cannot be drained from a node as long
as it is the last node with a healthy replica for some volume. Replicas that meet this condition are automatically
evicted from the node as soon as it is cordoned.

Benefits:

- Protects data by preventing the drain operation from completing until a healthy replica of each volume is available
  on another node.
- Automatically evicts replicas, so the user does not need to do it manually (through the UI).
- The drain operation is only as slow and data-intensive as is necessary to protect data.

Drawbacks:

- Volumes may be degraded after the drain is complete. If the node is rebooted, redundancy is reduced until it is
running again. If the node is removed, redundancy is reduced until another replica rebuilds.
- Like all of these policies, it triggers on cordon, not on drain (it is not possible for Longhorn to distinguish
between a node that is actively being drained and one that is cordoned for some other reason). If a user
regularly cordons nodes without draining them, replicas will be rebuilt pointlessly.

Given the drawbacks, `Block For Eviction` should likely not be the default node drain policy moving forward. However,
some users may find it helpful to switch to `Block For Eviction`, especially during cluster upgrade operations. See
[user stories](#user-stories) for additional insight.

`Block For Eviction If Contains Last Replica` is a more efficient behavior that may make sense as the long-term setting
in some clusters. It has largely the same benefits and drawbacks as `Block If Contains Last Replica`, except that it
does not require users to manually evict replicas. Still, when data is properly backed up and the user is planning to
bring a node back online after maintenance, it is costlier than other options.

### Related Issues

https://github.com/longhorn/longhorn/issues/2238

## Motivation

### Goals

- Add a new `Block For Eviction` node drain policy as described in the summary.
- Ensure that replicas automatically evict from a cordoned node when `Block For Eviction` is set.
- Ensure a drain operation cannot complete until all replicas are evicted when `Block For Eviction` is set.
- Document recommendations for when to use `Block For Eviction`.

### Non-goals

- Only trigger automatic eviction when a node is actively being drained. It is not possible to distinguish between a
node that is only cordoned and one that is actively being drained.

## Proposal

### User Stories

#### Story 1

I use Rancher to manage RKE2 and K3s Kubernetes clusters. When I upgrade these clusters, the system upgrade controller
attempts to drain each node before rebooting it. If a node contains the last healthy replica for a volume, the drain
never completes. I know I can manually evict replicas from a node to allow it to continue, but this eliminates the
benefit of the automation.

After this enhancement, I can choose to set the node drain policy to `Block For Eviction` before kicking off a cluster
upgrade. The upgrade may take a long time, but it eventually completes with no additional intervention.

#### Story 2

I am not comfortable with the reduced redundancy `Block If Contains Last Replica` provides while my drained node is
being rebooted. Or, I commonly drain nodes to remove them from the cluster and I am not comfortable with the reduced
redundancy `Block If Contains Last Replica` provides while a new replica is rebuilt. It would be nice if I could drain
nodes without this discomfort.

After this enhancement, I can choose to set the node drain policy to `Block For Eviction` before draining a node or
nodes. It may take a long time, but I know my data is safe when the drain completes.

### User Experience In Detail

### API changes

Add `block-for-eviction` and `block-for-eviction-if-contains-last-replica` options to the `node-drain-policy` setting. The user
chooses these options to opt in to the new behavior.

Add a `status.autoEvicting` field to the `node.longhorn.io/v1beta2` custom resource. This is not a field users can/should
interact with, but they can view it via kubectl.

NOTE: We originally experimented with a new `status.conditions` entry in the `node.longhorn.io/v1beta2` custom resource
with the type `Evicting`. However, this was a bit less natural, because:

- Longhorn node conditions generally describe the state a node is in, not what the node is doing.
- During normal operation, `Evicting` should be `False`. The Longhorn UI displays a condition in this state with a red
symbol, indicating an error state that should be investigated.

Deprecate the `replica.status.evictionRequested` field in favor of a new `replica.spec.evictionRequested` field so that
replica eviction can be requested by the node controller instead of the replica controller.
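
Below is a minimal sketch of how these additions might look in the Go CRD types. The struct shapes, comments, and JSON
tags are illustrative assumptions based on the field names above, not the exact Longhorn `v1beta2` definitions.

```go
package v1beta2

// Illustrative sketch only; the real Longhorn v1beta2 types contain many more
// fields and may differ in detail.

type ReplicaSpec struct {
	// EvictionRequested is set by the node controller when this replica should
	// be rebuilt elsewhere and then removed from its current node/disk. It
	// replaces the deprecated ReplicaStatus.EvictionRequested field.
	EvictionRequested bool `json:"evictionRequested"`
}

type NodeStatus struct {
	// AutoEvicting is true while replicas on this node are being evicted
	// automatically because of a block-for-eviction* node drain policy.
	// It is informational only; users do not set it.
	AutoEvicting bool `json:"autoEvicting"`
}
```

Putting the request on the replica's `spec` keeps the convention that `spec` expresses desired state (here, requested
by the node controller), while `status.autoEvicting` merely reports what is currently happening.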

## Design

### Implementation Overview

The existing eviction logic is well-tested, so there is no reason to significantly refactor it. It works as follows (a
simplified sketch follows this list):

- The user can set `spec.evictionRequested = true` on a node or disk.
- When the replica controller sees `spec.evictionRequested == true` on the node or disk hosting a replica, it sets
`status.evictionRequested = true` on that replica.
- The volume controller uses `replica.status.evictionRequested == true` to influence replica scheduling/deletion
behavior (e.g. rebuild an extra replica to replace the evicting one or delete the evicting one once rebuilding is
complete).
- The user can set `spec.evictionRequested = false` on a node or disk.
- When the replica controller sees `spec.evictionRequested == false` on the node or disk hosting a replica, it sets
`replica.status.evictionRequested = false` on that replica.
- The volume controller uses `replica.status.evictionRequested == false` to influence replica scheduling/deletion
behavior (e.g. don't start a rebuild for a previously evicting replica if one hasn't been started already).
- NOTE: If a new replica already started rebuilding as part of an eviction, it continues to rebuild and remains in the
cluster even after eviction is canceled. It can be cleaned up manually if desired.
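
As a rough illustration of the propagation step above, the existing behavior reduces to mirroring the manual request
into the replica status. This is a simplified sketch: the function name and flat boolean parameters are assumptions,
and the real controller resolves the replica's disk through the node status rather than taking booleans directly.

```go
package controller

// syncReplicaEvictionStatus sketches what the replica controller does today:
// replica.status.evictionRequested simply mirrors whether the user requested
// eviction on the node or on the replica's disk. The volume controller then
// reacts to the flag by rebuilding a replacement replica and deleting the
// evicting one once rebuilding completes.
func syncReplicaEvictionStatus(nodeEvictionRequested, diskEvictionRequested bool) bool {
	// Clearing the request on the node/disk cancels the eviction, but a
	// replacement replica that already started rebuilding is left in place.
	return nodeEvictionRequested || diskEvictionRequested
}
```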

Make changes so that (a sketch of the resulting decision logic follows this list):

- The node controller (not the replica controller) sets `replica.spec.evictionRequested = true` when:
- `spec.evictionRequested == true` on the replica's node (similar to existing behavior moved from the replica
controller), OR
- `spec.evictionRequested == true` on the replica's disk (similar to existing behavior moved from the replica
controller), OR
- `status.Unschedulable == true` on the associated Kubernetes node object and the node drain policy is
`block-for-eviction`, OR
- `status.Unschedulable == true` on the associated Kubernetes node object, the node drain policy is
`block-for-eviction-if-contains-last-replica`, and there are no other PDB-protected replicas for a volume.
- Much of the logic currently used by the instance manager controller to recognize PDB-protected replicas is moved to
utility functions so both the node and instance manager controllers can use it.
- The volume controller uses `replica.spec.evictionRequested == true` in exactly the same way it previously used
`replica.status.evictionRequested` to influence replica scheduling/deletion behavior (e.g. rebuild an extra replica
to replace the evicting one or delete the evicting one once rebuilding is complete).
- The node controller sets `replica.spec.evictionRequested = false` when:
- The user is not requesting eviction with `node.spec.evictionRequested == true` or
`disk.spec.evictionRequested == true`, AND
- The conditions aren't right for auto-eviction based on the node status and drain policy.
- The node controller sets `status.autoEvicting = true` when a node has replicas evicting because of the new drain
  policies and `status.autoEvicting = false` when it does not. This gives the user (and the UI) a clue that automatic
  eviction is ongoing.
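
The decision the node controller would make for each replica can be summarized in a small helper. Everything below is
an illustrative sketch, not the actual implementation: the function, its parameter names, and the flattened
PDB-protection flag (standing in for the logic factored out of the instance manager controller) are all assumptions.

```go
package controller

// replicaShouldBeEvicted sketches the proposed node controller decision for a
// single replica. Parameter names are assumptions, not Longhorn API.
func replicaShouldBeEvicted(
	nodeEvictionRequested bool, // spec.evictionRequested on the Longhorn node
	diskEvictionRequested bool, // spec.evictionRequested on the replica's disk
	kubeNodeUnschedulable bool, // status.Unschedulable on the Kubernetes node
	drainPolicy string, // value of the node-drain-policy setting
	isLastPDBProtectedReplica bool, // no other PDB-protected replica exists for the volume
) bool {
	// Manual eviction requests behave exactly as before; they are simply
	// handled by the node controller instead of the replica controller.
	if nodeEvictionRequested || diskEvictionRequested {
		return true
	}
	// Automatic eviction only ever applies to cordoned (unschedulable) nodes.
	if !kubeNodeUnschedulable {
		return false
	}
	switch drainPolicy {
	case "block-for-eviction":
		// Evict every replica on the cordoned node.
		return true
	case "block-for-eviction-if-contains-last-replica":
		// Evict only replicas that would otherwise block the drain.
		return isLastPDBProtectedReplica
	default:
		// Other policies never trigger automatic eviction.
		return false
	}
}
```

The node controller would write the result to `replica.spec.evictionRequested` and set `status.autoEvicting = true` on
the node whenever at least one of its replicas is evicting for the automatic (policy-driven) reasons rather than
because of a manual request.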

### Test plan

Test normal behavior with `block-for-eviction` (a sketch of one programmatic check follows this list):

- Set `node-drain-policy` to `block-for-eviction`.
- Create a volume.
- Ensure (through soft anti-affinity, low replica count, and/or enough disks) that an evicted replica of the volume can
be scheduled elsewhere.
- Write data to the volume.
- Drain a node that one of the volume's replicas is scheduled to.
- While the drain is ongoing:
- Verify that the volume never becomes degraded.
- Verify that `node.status.autoEvicting == true`.
- Verify that `replica.spec.evictionRequested == true`.
- Verify the drain completes.
- Uncordon the node.
- Verify the replica that was on the drained node has moved to a different node.
- Verify that `node.status.autoEvicting == false`.
- Verify that `replica.spec.evictionRequested == false`.
- Verify the volume's data.
- Verify the output of an appropriate event during the test (with reason `EvictionAutomatic`).
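
The `node.status.autoEvicting` checks in this plan could be scripted against the cluster, for example with the
Kubernetes dynamic client. The sketch below is not part of Longhorn's own test suite (which is written in Python); the
group/version/resource and the field path are assumptions taken from this proposal and should be verified against the
installed CRDs.

```go
package drainpolicytest

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// Assumed GVR for the node.longhorn.io/v1beta2 custom resource.
var longhornNodeGVR = schema.GroupVersionResource{
	Group:    "longhorn.io",
	Version:  "v1beta2",
	Resource: "nodes",
}

// nodeIsAutoEvicting reads status.autoEvicting from a Longhorn node object so
// a test can poll it while the drain is ongoing.
func nodeIsAutoEvicting(ctx context.Context, c dynamic.Interface, namespace, name string) (bool, error) {
	obj, err := c.Resource(longhornNodeGVR).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	autoEvicting, found, err := unstructured.NestedBool(obj.Object, "status", "autoEvicting")
	if err != nil || !found {
		return false, err
	}
	return autoEvicting, nil
}
```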

Test normal behavior with `block-for-eviction-if-contains-last-replica`:

- Set `node-drain-policy` to `block-for-eviction-if-contains-last-replica`.
- Create one volume with a single replica and another volume with three replicas.
- Ensure (through soft anti-affinity, low replica count, and/or enough disks) that evicted replicas of both volumes can
be scheduled elsewhere.
- Write data to the volumes.
- Drain a node that both volumes have a replica scheduled to.
- While the drain is ongoing:
- Verify that the volume with one replica never becomes degraded.
- Verify that the volume with three replicas becomes degraded.
- Verify that `node.status.autoEvicting == true`.
- Verify that `replica.spec.evictionRequested == true` on the replica for the volume that only has one.
- Verify that `replica.spec.evictionRequested == false` on the replica for the volume that has three.
- Verify the drain completes.
- Uncordon the node.
- Verify the replica for the volume with one replica has moved to a different node.
- Verify the replica for the volume with three replicas has not moved.
- Verify that `node.status.autoEvicting == false`.
- Verify that `replica.spec.evictionRequested == false` on all replicas.
- Verify the data in both volumes.
- Verify the output of two appropriate events during the test (with reason `EvictionAutomatic`).
- Verify the output of an appropriate event for the replica that ultimately wasn't evicted during the test (with reason
`EvictionCanceled`).

Test unschedulable behavior with `block-for-eviction`:

- Set `node-drain-policy` to `block-for-eviction`.
- Create a volume.
- Ensure (through soft anti-affinity, high replica count, and/or not enough disks) that an evicted replica of the volume
can not be scheduled elsewhere.
- Write data to the volume.
- Drain a node that one of the volume's replicas is scheduled to.
- While the drain is ongoing:
- Verify that `node.status.autoEvicting == true`.
- Verify that `replica.spec.evictionRequested == true`.
- Verify the drain never completes.
- Uncordon the node.
- Verify that the volume is healthy.
- Verify that `node.status.autoEvicting == false`.
- Verify the volume's data.
- Verify the output of an appropriate event during the test (with reason `EvictionAutomatic`).
- Verify the output of an appropriate event during the test (with reason `EvictionCanceled`).

### Upgrade strategy

- Add `status.autoEvicting = false` to all `node.longhorn.io` objects during the upgrade.
- Add `spec.evictionRequested = status.evictionRequested` to all replica objects during the upgrade (see the sketch
  after this list).
- The default node drain policy remains `Block If Contains Last Replica`, so do not make setting changes.
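
A rough sketch of this migration under the above assumptions is shown below. The stand-in types and the `migrate`
helper are hypothetical; the real change would run in Longhorn's upgrade path against the `v1beta2` objects.

```go
package upgrade

// Minimal stand-ins for the fields this migration touches; illustrative only.
type replica struct {
	SpecEvictionRequested   bool // new replica.spec.evictionRequested
	StatusEvictionRequested bool // deprecated replica.status.evictionRequested
}

type node struct {
	StatusAutoEvicting bool // new node.status.autoEvicting
}

// migrate copies the deprecated status field into the new spec field so that
// any eviction already in progress continues seamlessly, and initializes
// autoEvicting to false; the node controller recomputes it afterward.
func migrate(replicas []*replica, nodes []*node) {
	for _, r := range replicas {
		r.SpecEvictionRequested = r.StatusEvictionRequested
	}
	for _, n := range nodes {
		n.StatusAutoEvicting = false
	}
}
```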

## Note

I have given some thought to whether and how this behavior should be reflected in the UI. In this draft, I have [chosen
not to represent auto-eviction as a node condition](#api-changes), which would have automatically shown it in the UI,
but awkwardly. I considered representing it in the `Status` column on the `Node` tab. Currently, the only statuses are
`Schedulable` (green), `Unschedulable` (yellow), `Down` (grey), and `Disabled` (red). We could add `AutoEvicting`
(yellow), but it would overlap with `Unschedulable`. This might be acceptable, as it could be read as, "This node is
auto-evicting in addition to being unschedulable."
