# Automatically Evict Replicas While Draining

## Summary

Currently, Longhorn offers a choice between several behaviors (node drain policies) when a node is cordoned or
drained:

- `Block If Contains Last Replica` ensures the `instance-manager` pod cannot be drained from a node as long as it is the
last node with a healthy replica for some volume.

Benefits:

- Protects data by preventing the drain operation from completing until each volume has a healthy replica available on
  another node.

Drawbacks:

- If there is only one replica for the volume, or if its other replicas are unhealthy, the user may need to manually
(through the UI) request the eviction of replicas from the disk or node.
- Volumes may be degraded after the drain is complete. If the node is rebooted, redundancy is reduced until it is
running again. If the node is removed, redundancy is reduced until another replica rebuilds.

- `Allow If Last Replica Is Stopped` is similar to the above, but only prevents an `instance-manager` pod from
draining if it has the last RUNNING replica.

Benefits:

- Allows the drain operation to proceed in situations where the node being drained is expected to come back online
(data will not be lost) and the replicas stored on the node's disks are not actively being used.

Drawbacks:

- Similar drawbacks to `Block If Contains Last Replica`.
- If, for some reason, the node never comes back, data is lost.

- `Always Allow` never prevents an `instance-manager` pod from draining.

Benefits:

- The drain operation completes quickly without Longhorn getting in the way.

Drawbacks:

- There is no opportunity for Longhorn to protect data.

This proposal seeks to add a fourth behavior (node drain policy) with the following properties:

- `Block For Eviction` ensures the `instance-manager` pod cannot be drained from a node as long as it contains any
replicas for any volumes. Replicas are automatically evicted from the node as soon as it is cordoned.

Benefits:

- Protects data by preventing the drain operation from completing until all replicas have been relocated.
- Automatically evicts replicas, so the user does not need to do it manually (through the UI).
- Maintains replica redundancy at all times.

Drawbacks:

- The drain operation is significantly slower than for other behaviors. Every replica must be rebuilt on another node
  before the drain can complete.
- Like all of these policies, it triggers on cordon, not on drain (it is not possible for Longhorn to distinguish
  between a node that is actively being drained and one that is cordoned for some other reason). If a user
  regularly cordons nodes without draining them, replicas will be rebuilt pointlessly.

Given the drawbacks, `Block For Eviction` should likely not be the default node drain policy moving forward. However,
some users may find it helpful to switch to `Block For Eviction`, especially during cluster upgrade operations. See
[user stories](#user-stories) for additional insight.

### Related Issues

https://github.com/longhorn/longhorn/issues/2238

## Motivation

### Goals

- Add a new `Block For Eviction` node drain policy as described in the summary.
- Ensure that replicas automatically evict from a cordoned node when `Block For Eviction` is set.
- Ensure a drain operation cannot complete until all replicas are evicted when `Block For Eviction` is set.
- Document recommendations for when to use `Block For Eviction`.

### Non-goals

- Only trigger automatic eviction when a node is actively being drained. It is not possible to distinguish between a
node that is only cordoned and one that is actively being drained.

## Proposal

### User Stories

#### Story 1

I use Rancher to manage RKE2 and K3s Kubernetes clusters. When I upgrade these clusters, the system upgrade controller
attempts to drain each node before rebooting it. If a node contains the last healthy replica for a volume, the drain
never completes. I know I can manually evict replicas from a node to allow it to continue, but this eliminates the
benefit of the automation.

After this enhancement, I can choose to set the node drain policy to `Block For Eviction` before kicking off a cluster
upgrade. The upgrade may take a long time, but it eventually completes with no additional intervention.

#### Story 2

I am not comfortable with the reduced redundancy `Block If Contains Last Replica` provides while my drained node is
being rebooted. Or, I commonly drain nodes to remove them from the cluster and I am not comfortable with the reduced
redundancy `Block If Contains Last Replica` provides while a new replica is rebuilt. It would be nice if I could drain
nodes without this discomfort.

After this enhancement, I can choose to set the node drain policy to `Block For Eviction` before draining a node or
nodes. It may take a long time, but I know my data is safe when the drain completes.

### User Experience In Detail

### API changes

Add a `block-for-eviction` option to the `node-drain-policy` setting. The user chooses this option to opt in to the new
behavior.

Add a `status.autoEvicting` field to the `node.longhorn.io/v1beta2` custom resource. This is not a field users
can/should interact with, but they can view it via kubectl.

NOTE: We originally experimented with a new `status.conditions` entry in the `node.longhorn.io/v1beta2` custom resource
with the type `Evicting`. However, this was a bit less natural, because:

- Longhorn node conditions generally describe the state a node is in, not what the node is doing.
- During normal operation, `Evicting` should be `False`. The Longhorn UI displays a condition in this state with a red
symbol, indicating an error state that should be investigated.
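
For concreteness, here is a minimal Go sketch of roughly what these additions could look like. The struct shape follows
the `status.autoEvicting` field and `block-for-eviction` option described above; the constant names, and the exact
string values of the pre-existing policy options, are illustrative rather than authoritative.

```go
// Illustrative only; the real definitions would live in the longhorn-manager
// types packages and include many more fields.
package v1beta2

// NodeStatus gains a field reporting whether Longhorn is currently
// auto-evicting replicas from this node. The node controller sets it; users
// can read it with kubectl but should not modify it.
type NodeStatus struct {
	// ... existing status fields omitted ...

	// +optional
	AutoEvicting bool `json:"autoEvicting"`
}

// Candidate node-drain-policy values, including the proposed new option.
const (
	NodeDrainPolicyBlockIfContainsLastReplica = "block-if-contains-last-replica"
	NodeDrainPolicyAllowIfReplicaIsStopped    = "allow-if-replica-is-stopped"
	NodeDrainPolicyAlwaysAllow                = "always-allow"
	NodeDrainPolicyBlockForEviction           = "block-for-eviction" // new
)
```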

## Design

### Implementation Overview

The existing eviction logic is well-tested, so there is no reason to refactor it. It works as follows:

- The user can set `spec.evictionRequested = true` on a node or disk.
- When the replica controller sees `spec.evictionRequested == true` on the node or disk hosting a replica, it sets
`status.evictionRequested = true` on that replica.
- The volume controller uses `replica.status.evictionRequested == true` to influence replica scheduling/deletion
behavior (e.g. rebuild an extra replica to replace the evicting one or delete the evicting one once rebuilding is
complete).
- The user can set `spec.evictionRequested = false` on a node or disk.
- When the replica controller sees `spec.evictionRequested == false` on the node or disk hosting a replica, it sets
  `replica.status.evictionRequested = false` on that replica.
- The volume controller sees `replica.status.evictionRequested == false` and stops trying to evict that replica (e.g.
  it no longer rebuilds a replacement for it or deletes it). A simplified sketch of this propagation follows the list.
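
A simplified sketch of the replica controller's part of this existing flow, using stand-in types rather than the real
Longhorn CRD structs (all names here are illustrative):

```go
package eviction

// Stand-in types; the real Longhorn node, disk, and replica structs have many
// more fields and are reconciled through informers and clients.
type DiskSpec struct{ EvictionRequested bool }

type NodeSpec struct {
	EvictionRequested bool
	Disks             map[string]DiskSpec
}

type Node struct{ Spec NodeSpec }

type Replica struct {
	DiskID            string
	EvictionRequested bool // mirrors replica.status.evictionRequested
}

// syncEvictionRequested mirrors the flow above: the replica controller copies
// the eviction request from the replica's node or disk onto the replica, and
// the volume controller later reacts to replica.status.evictionRequested.
func syncEvictionRequested(node *Node, replica *Replica) {
	disk := node.Spec.Disks[replica.DiskID]
	replica.EvictionRequested = node.Spec.EvictionRequested || disk.EvictionRequested
}
```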

Make slight changes to this flow so that (a sketch of the new checks follows this list):

- The node controller checks `spec.unschedulable` on the appropriate Kubernetes node object. If
  `spec.unschedulable == true` and the node drain policy is `block-for-eviction`, it sets `status.autoEvicting = true`
  on the appropriate `node.longhorn.io` object.
- In addition to its pre-existing checks, if the replica controller sees `status.autoEvicting == true` on the node
hosting a replica, it sets `status.evictionRequested = true` on that replica.
- The volume controller still uses `replica.status.evictionRequested == true` to influence replica scheduling/deletion
  behavior (e.g. rebuild an extra replica to replace the evicting one or delete the evicting one once rebuilding is
  complete).
- The node controller checks `spec.unschedulable` on the appropriate Kubernetes node object. If
  `spec.unschedulable == false` or the node drain policy is no longer `block-for-eviction`, it sets
  `status.autoEvicting = false` on the appropriate `node.longhorn.io` object.
- In addition to its pre-existing checks, if the replica controller sees `status.autoEvicting == false` on the node
hosting a replica, it may set `status.evictionRequested = false` on that replica.
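
A sketch of the proposed checks, again with stand-in types rather than the real Longhorn structs (the function and type
names are illustrative):

```go
package eviction

// Stand-in types for the Kubernetes node and the node.longhorn.io object.
type KubeNodeSpec struct{ Unschedulable bool }
type KubeNode struct{ Spec KubeNodeSpec }

type LonghornNodeStatus struct{ AutoEvicting bool }
type LonghornNode struct{ Status LonghornNodeStatus }

// syncAutoEvicting sketches the node controller change: a Longhorn node is
// auto-evicting only while its Kubernetes node is cordoned and the drain
// policy is block-for-eviction.
func syncAutoEvicting(kubeNode *KubeNode, lhNode *LonghornNode, drainPolicy string) {
	lhNode.Status.AutoEvicting = kubeNode.Spec.Unschedulable && drainPolicy == "block-for-eviction"
}

// shouldEvictReplica sketches the replica controller change: a replica is
// evicted if eviction was requested on its node or disk (existing behavior)
// or if its node is auto-evicting (new behavior).
func shouldEvictReplica(manualEvictionRequested bool, lhNode *LonghornNode) bool {
	return manualEvictionRequested || lhNode.Status.AutoEvicting
}
```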

### Test plan

Test normal behavior:

- Create a volume.
- Ensure (through soft anti-affinity, low replica count, and/or enough disks) that an evicted replica of the volume can
be scheduled elsewhere.
- Write data to the volume.
- Drain a node one of the volume's replicas is scheduled to.
- While the drain is ongoing:
- Verify that the volume never becomes degraded.
- Verify that `node.status.autoEvicting == true`.
- Verify the drain completes.
- Uncordon the node.
- Verify that `node.status.autoEvicting == false`.
- Verify the volume's data.

Test unschedulable behavior:

- Create a volume.
- Ensure (through soft anti-affinity, high replica count, and/or not enough disks) that an evicted replica of the volume
  cannot be scheduled elsewhere.
- Write data to the volume.
- Drain a node one of the volume's replicas is scheduled to.
- While the drain is ongoing:
- Verify that the volume becomes degraded (one of its replicas is unschedulable).
- Verify that `node.status.autoEvicting == true`.
- Verify the drain never completes.
- Uncordon the node.
- Verify that the volume is no longer degraded (it no longer needs the unschedulable replica).
- Verify that `node.status.autoEvicting == false`.
- Verify the volume's data.
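
The following Go test skeleton sketches how the normal-behavior plan above could be automated. Every helper function is
a hypothetical placeholder, not part of any existing Longhorn test suite, and the unschedulable-behavior plan would
follow the same shape with the degradation and drain-completion assertions inverted.

```go
package eviction_test

import (
	"testing"
	"time"
)

// Placeholder helpers; each would need an implementation against a real test
// cluster. They are assumptions made for this sketch only.
func createVolumeWithData(t *testing.T) (volumeName string) { t.Skip("placeholder"); return }
func nodeOfFirstReplica(t *testing.T, volume string) string { t.Skip("placeholder"); return "" }
func drainNodeAsync(t *testing.T, node string) <-chan error { t.Skip("placeholder"); return nil }
func volumeIsDegraded(t *testing.T, volume string) bool     { t.Skip("placeholder"); return false }
func autoEvicting(t *testing.T, node string) bool           { t.Skip("placeholder"); return false }
func uncordonNode(t *testing.T, node string)                { t.Skip("placeholder") }
func verifyVolumeData(t *testing.T, volume string)          { t.Skip("placeholder") }

func TestBlockForEvictionNormalBehavior(t *testing.T) {
	volume := createVolumeWithData(t)
	node := nodeOfFirstReplica(t, volume)

	drainDone := drainNodeAsync(t, node)
	for {
		select {
		case err := <-drainDone:
			// The drain must eventually complete without error.
			if err != nil {
				t.Fatalf("drain failed: %v", err)
			}
			uncordonNode(t, node)
			if autoEvicting(t, node) {
				t.Fatal("autoEvicting should be false after the node is uncordoned")
			}
			verifyVolumeData(t, volume)
			return
		case <-time.After(10 * time.Second):
			// While the drain is ongoing, the volume must never degrade and
			// the node must report that it is auto-evicting.
			if volumeIsDegraded(t, volume) {
				t.Fatal("volume became degraded during drain")
			}
			if !autoEvicting(t, node) {
				t.Fatal("autoEvicting should be true while the node is cordoned")
			}
		}
	}
}
```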

### Upgrade strategy

Add `status.autoEvicting = false` to all `node.longhorn.io` objects during the upgrade. The default node drain policy
remains `Block If Contains Last Replica`, so no setting changes are needed.
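
A minimal sketch of that upgrade step, with a stand-in type in place of the real clientset-backed node objects:

```go
package upgrade

// Stand-in for the node.longhorn.io object; the real upgrade code would list
// and update nodes through the Longhorn clientset.
type NodeStatus struct{ AutoEvicting bool }
type Node struct{ Status NodeStatus }

// initializeAutoEvicting applies the upgrade step described above: every
// existing Longhorn node starts with autoEvicting explicitly set to false.
// The node-drain-policy setting itself is left unchanged.
func initializeAutoEvicting(nodes []*Node) {
	for _, node := range nodes {
		node.Status.AutoEvicting = false
	}
}
```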

## Note

I have given some thought to whether and how this behavior should be reflected in the UI. In this draft, I have [chosen
not to represent auto-eviction as a node condition](#api-changes), which would have automatically shown it in the UI,
but awkwardly. I considered representing it in the `Status` column on the `Node` tab. Currently, the only statuses are
`Schedulable` (green), `Unschedulable` (yellow), `Down` (grey), and `Disabled` (red). We could add `AutoEvicting`
(yellow), but it would overlap with `Unschedulable`. This might be acceptable, as it could be read as, "This node is
auto-evicting in addition to being unschedulable."
