# Automatically Evict Replicas While Draining

## Summary

Currently, Longhorn offers a choice between several behaviors (node drain policies) when a node is cordoned or
drained:

- `Block If Contains Last Replica` ensures the `instance-manager` pod cannot be drained from a node as long as it is the
last node with a healthy replica for some volume.

Benefits:

- Protects data by preventing the drain operation from completing until each volume has a healthy replica available on
  another node.

Drawbacks:

- If there is only one replica for the volume, or if its other replicas are unhealthy, the user may need to manually
(through the UI) request the eviction of replicas from the disk or node.
- Volumes may be degraded after the drain is complete. If the node is rebooted, redundancy is reduced until it is
running again. If the node is removed, redundancy is reduced until another replica rebuilds.

- `Allow If Last Replica Is Stopped` is similar to the above, but only prevents an `instance-manager` pod from
draining if it has the last RUNNING replica.

Benefits:

- Allows the drain operation to proceed in situations where the node being drained is expected to come back online
(data will not be lost) and the replicas stored on the node's disks are not actively being used.

Drawbacks:

- Similar drawbacks to `Block If Contains Last Replica`.
- If, for some reason, the node never comes back, data is lost.

- `Always Allow` never prevents an `instance-manager` pod from draining.

Benefits:

- The drain operation completes quickly without Longhorn getting in the way.

Drawbacks:

- There is no opportunity for Longhorn to protect data.

This proposal seeks to add a fourth behavior (node drain policy) with the following properties:

- `Block For Eviction` ensures the `instance-manager` pod cannot be drained from a node as long as it contains any
replicas for any volumes. Replicas are automatically evicted from the node as soon as it is cordoned.

Benefits:

- Protects data by preventing the drain operation from completing until all replicas have been relocated.
- Automatically evicts replicas, so the user does not need to do it manually (through the UI).
- Maintains replica redundancy at all times.

Drawbacks:

- The drain operation is significantly slower than for other behaviors. Every replica must be rebuilt on another node
  before the drain can complete.
- Like all of these policies, it triggers on cordon, not on drain (it is not possible for Longhorn to distinguish
  between a node that is actively being drained and one that is cordoned for some other reason). If a user
  regularly cordons nodes without draining them, replicas will be rebuilt pointlessly.

Given the drawbacks, `Block For Eviction` should likely not be the default node drain policy moving forward. However,
some users may find it helpful to switch to `Block For Eviction`, especially during cluster upgrade operations. See
[user stories](#user-stories) for additional insight.

### Related Issues

https://github.com/longhorn/longhorn/issues/2238

## Motivation

### Goals

- Add a new `Block For Eviction` node drain policy as described in the summary.
- Ensure that replicas automatically evict from a cordoned node when `Block For Eviction` is set.
- Ensure a drain operation cannot complete until all replicas are evicted when `Block For Eviction` is set.
- Document recommendations for when to use `Block For Eviction`.

### Non-goals

- Only trigger automatic eviction when a node is actively being drained. It is not possible to distinguish between a
node that is only cordoned and one that is actively being drained.

## Proposal

### User Stories

#### Story 1

I use Rancher to manage RKE2 and K3s Kubernetes clusters. When I upgrade these clusters, the system upgrade controller
attempts to drain each node before rebooting it. If a node contains the last healthy replica for a volume, the drain
never completes. I know I can manually evict replicas from a node to allow it to continue, but this eliminates the
benefit of the automation.

After this enhancement, I can choose to set the node drain policy to `Block For Eviction` before kicking off a cluster
upgrade. The upgrade may take a long time, but it eventually completes with no additional intervention.

#### Story 2

I am not comfortable with the reduced redundancy `Block If Contains Last Replica` provides while my drained node is
being rebooted. Or, I commonly drain nodes to remove them from the cluster and I am not comfortable with the reduced
redundancy `Block If Contains Last Replica` provides while a new replica is rebuilt. It would be nice if I could drain
nodes without this discomfort.

After this enhancement, I can choose to set the node drain policy to `Block For Eviction` before draining a node or
nodes. It may take a long time, but I know my data is safe when the drain completes.

### User Experience In Detail

### API changes

Add a `block-for-eviction` option to the `node-drain-policy` setting. The user chooses this option to opt in to the new
behavior.

Add a `status.autoEvicting` field to the `node.longhorn.io/v1beta2` custom resource. This is not a field users
can/should interact with, but they can view it via kubectl.

NOTE: We originally experimented with a new `status.conditions` entry in the `node.longhorn.io/v1beta2` custom resource
with the type `Evicting`. However, this was a bit less natural, because:

- Longhorn node conditions generally describe the state a node is in, not what the node is doing.
- During normal operation, `Evicting` should be `False`. The Longhorn UI displays a condition in this state with a red
symbol, indicating an error state that should be investigated.
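
For concreteness, here is a minimal Go sketch of roughly what these additions could look like. The struct shape follows
the `status.autoEvicting` field and `block-for-eviction` option described above; the constant names, and the exact
string values of the pre-existing policy options, are illustrative rather than authoritative.

```go
// Illustrative only; the real definitions would live in the longhorn-manager
// types packages and include many more fields.
package v1beta2

// NodeStatus gains a field reporting whether Longhorn is currently
// auto-evicting replicas from this node. The node controller sets it; users
// can read it with kubectl but should not modify it.
type NodeStatus struct {
	// ... existing status fields omitted ...

	// +optional
	AutoEvicting bool `json:"autoEvicting"`
}

// Candidate node-drain-policy values, including the proposed new option.
const (
	NodeDrainPolicyBlockIfContainsLastReplica = "block-if-contains-last-replica"
	NodeDrainPolicyAllowIfReplicaIsStopped    = "allow-if-replica-is-stopped"
	NodeDrainPolicyAlwaysAllow                = "always-allow"
	NodeDrainPolicyBlockForEviction           = "block-for-eviction" // new
)
```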

## Design

### Implementation Overview

The existing eviction logic is well-tested, so there is no reason to refactor it. It works as follows:

- The user can set `spec.evictionRequested = true` on a node or disk.
- When the replica controller sees `spec.evictionRequested == true` on the node or disk hosting a replica, it sets
`status.evictionRequested = true` on that replica.
- The volume controller uses `replica.status.evictionRequested == true` to influence replica scheduling/deletion
behavior (e.g. rebuild an extra replica to replace the evicting one or delete the evicting one once rebuilding is
complete).
- The user can set `spec.evictionRequested = false` on a node or disk.
- When the replica controller sees `spec.evictionRequested == false` on the node or disk hosting a replica, it sets
  `replica.status.evictionRequested = false` on that replica.
- The volume controller sees `replica.status.evictionRequested == false` and stops trying to evict that replica (e.g.
  it no longer rebuilds a replacement for it or deletes it). A simplified sketch of this propagation follows the list.
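
A simplified sketch of the replica controller's part of this existing flow, using stand-in types rather than the real
Longhorn CRD structs (all names here are illustrative):

```go
package eviction

// Stand-in types; the real Longhorn node, disk, and replica structs have many
// more fields and are reconciled through informers and clients.
type DiskSpec struct{ EvictionRequested bool }

type NodeSpec struct {
	EvictionRequested bool
	Disks             map[string]DiskSpec
}

type Node struct{ Spec NodeSpec }

type Replica struct {
	DiskID            string
	EvictionRequested bool // mirrors replica.status.evictionRequested
}

// syncEvictionRequested mirrors the flow above: the replica controller copies
// the eviction request from the replica's node or disk onto the replica, and
// the volume controller later reacts to replica.status.evictionRequested.
func syncEvictionRequested(node *Node, replica *Replica) {
	disk := node.Spec.Disks[replica.DiskID]
	replica.EvictionRequested = node.Spec.EvictionRequested || disk.EvictionRequested
}
```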

Make slight changes to this flow so that (a sketch of the new checks follows this list):

- The node controller checks `spec.unschedulable` on the appropriate Kubernetes node object. If
  `spec.unschedulable == true` and the node drain policy is `block-for-eviction`, it sets `status.autoEvicting = true`
  on the appropriate `node.longhorn.io` object.
- In addition to its pre-existing checks, if the replica controller sees `status.autoEvicting == true` on the node
hosting a replica, it sets `status.evictionRequested = true` on that replica.
- The volume controller still uses `replica.status.evictionRequested == true` to influence replica scheduling/deletion
  behavior (e.g. rebuild an extra replica to replace the evicting one or delete the evicting one once rebuilding is
  complete).
- The node controller checks `spec.unschedulable` on the appropriate Kubernetes node object. If
  `spec.unschedulable == false` or the node drain policy is no longer `block-for-eviction`, it sets
  `status.autoEvicting = false` on the appropriate `node.longhorn.io` object.
- In addition to its pre-existing checks, if the replica controller sees `status.autoEvicting == false` on the node
hosting a replica, it may set `status.evictionRequested = false` on that replica.
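
A sketch of the proposed checks, again with stand-in types rather than the real Longhorn structs (the function and type
names are illustrative):

```go
package eviction

// Stand-in types for the Kubernetes node and the node.longhorn.io object.
type KubeNodeSpec struct{ Unschedulable bool }
type KubeNode struct{ Spec KubeNodeSpec }

type LonghornNodeStatus struct{ AutoEvicting bool }
type LonghornNode struct{ Status LonghornNodeStatus }

// syncAutoEvicting sketches the node controller change: a Longhorn node is
// auto-evicting only while its Kubernetes node is cordoned and the drain
// policy is block-for-eviction.
func syncAutoEvicting(kubeNode *KubeNode, lhNode *LonghornNode, drainPolicy string) {
	lhNode.Status.AutoEvicting = kubeNode.Spec.Unschedulable && drainPolicy == "block-for-eviction"
}

// shouldEvictReplica sketches the replica controller change: a replica is
// evicted if eviction was requested on its node or disk (existing behavior)
// or if its node is auto-evicting (new behavior).
func shouldEvictReplica(manualEvictionRequested bool, lhNode *LonghornNode) bool {
	return manualEvictionRequested || lhNode.Status.AutoEvicting
}
```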

### Test plan

Test normal behavior:

- Create a volume.
- Ensure (through soft anti-affinity, low replica count, and/or enough disks) that an evicted replica of the volume can
be scheduled elsewhere.
- Write data to the volume.
- Drain a node one of the volume's replicas is scheduled to.
- While the drain is ongoing:
- Verify that the volume never becomes degraded.
- Verify that `node.status.autoEvicting == true`.
- Verify the drain completes.
- Uncordon the node.
- Verify that `node.status.autoEvicting == false`.
- Verify the volume's data.

Test unschedulable behavior:

- Create a volume.
- Ensure (through soft anti-affinity, high replica count, and/or not enough disks) that an evicted replica of the volume
  cannot be scheduled elsewhere.
- Write data to the volume.
- Drain a node one of the volume's replicas is scheduled to.
- While the drain is ongoing:
- Verify that the volume becomes degraded (one of its replicas is unschedulable).
- Verify that `node.status.autoEvicting == true`.
- Verify the drain never completes.
- Uncordon the node.
- Verify that the volume is no longer degraded (it no longer needs the unschedulable replica).
- Verify that `node.status.autoEvicting == false`.
- Verify the volume's data.
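
The following Go test skeleton sketches how the normal-behavior plan above could be automated. Every helper function is
a hypothetical placeholder, not part of any existing Longhorn test suite, and the unschedulable-behavior plan would
follow the same shape with the degradation and drain-completion assertions inverted.

```go
package eviction_test

import (
	"testing"
	"time"
)

// Placeholder helpers; each would need an implementation against a real test
// cluster. They are assumptions made for this sketch only.
func createVolumeWithData(t *testing.T) (volumeName string) { t.Skip("placeholder"); return }
func nodeOfFirstReplica(t *testing.T, volume string) string { t.Skip("placeholder"); return "" }
func drainNodeAsync(t *testing.T, node string) <-chan error { t.Skip("placeholder"); return nil }
func volumeIsDegraded(t *testing.T, volume string) bool     { t.Skip("placeholder"); return false }
func autoEvicting(t *testing.T, node string) bool           { t.Skip("placeholder"); return false }
func uncordonNode(t *testing.T, node string)                { t.Skip("placeholder") }
func verifyVolumeData(t *testing.T, volume string)          { t.Skip("placeholder") }

func TestBlockForEvictionNormalBehavior(t *testing.T) {
	volume := createVolumeWithData(t)
	node := nodeOfFirstReplica(t, volume)

	drainDone := drainNodeAsync(t, node)
	for {
		select {
		case err := <-drainDone:
			// The drain must eventually complete without error.
			if err != nil {
				t.Fatalf("drain failed: %v", err)
			}
			uncordonNode(t, node)
			if autoEvicting(t, node) {
				t.Fatal("autoEvicting should be false after the node is uncordoned")
			}
			verifyVolumeData(t, volume)
			return
		case <-time.After(10 * time.Second):
			// While the drain is ongoing, the volume must never degrade and
			// the node must report that it is auto-evicting.
			if volumeIsDegraded(t, volume) {
				t.Fatal("volume became degraded during drain")
			}
			if !autoEvicting(t, node) {
				t.Fatal("autoEvicting should be true while the node is cordoned")
			}
		}
	}
}
```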

### Upgrade strategy

Add `status.autoEvicting = false` to all `node.longhorn.io` objects during the upgrade. The default node drain policy
remains `Block If Contains Last Replica`, so no setting changes are needed.
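
A minimal sketch of that upgrade step, with a stand-in type in place of the real clientset-backed node objects:

```go
package upgrade

// Stand-in for the node.longhorn.io object; the real upgrade code would list
// and update nodes through the Longhorn clientset.
type NodeStatus struct{ AutoEvicting bool }
type Node struct{ Status NodeStatus }

// initializeAutoEvicting applies the upgrade step described above: every
// existing Longhorn node starts with autoEvicting explicitly set to false.
// The node-drain-policy setting itself is left unchanged.
func initializeAutoEvicting(nodes []*Node) {
	for _, node := range nodes {
		node.Status.AutoEvicting = false
	}
}
```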

## Note

I have given some thought to whether and how this behavior should be reflected in the UI. In this draft, I have [chosen
not to represent auto-eviction as a node condition](#api-changes), which would have automatically shown it in the UI,
but awkwardly. I considered representing it in the `Status` column on the `Node` tab. Currently, the only statuses are
`Schedulable` (green), `Unschedulable` (yellow), `Down` (grey), and `Disabled` (red). We could add `AutoEvicting`
(yellow), but it would overlap with `Unschedulable`. This might be acceptable, as it could be read as, "This node is
auto-evicting in addition to being unschedulable."
