diff --git a/keps/sig-multicluster/3335-statefulset-slice/README.md b/keps/sig-multicluster/3335-statefulset-slice/README.md
index e47b2d89c9f8..5eb503ab5a7b 100644
--- a/keps/sig-multicluster/3335-statefulset-slice/README.md
+++ b/keps/sig-multicluster/3335-statefulset-slice/README.md
@@ -128,14 +128,14 @@ checklist items _must_ be updated for the enhancement to be released.
 
 Items marked with (R) are required *prior to targeting to a milestone / release*.
 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [ ] (R) KEP approvers have approved the KEP status as `implementable`
-- [ ] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [X] (R) Design details are appropriately documented
+- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
   - [ ] e2e Tests for all Beta API Operations (endpoints)
   - [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
   - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
+- [X] (R) Graduation criteria is in place
   - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
@@ -236,11 +236,12 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion
 and make progress.
 -->
 
-* Updating a PDB to safeguard more than one StatefulSet slice 
+* Updating a PDB to safeguard more than one StatefulSet slice
   * As StatefulSet slices are scaled up or down, corresponding PDBs can also be adjusted. For example, a PDB corresponding to a slice of `k` replicas could be adjusted to `MinAvailable: k-1` on scale up or down events. Providing guidance and functionality to adjust these PDBs is outside the scope of this KEP.
-* Orchestrating pod movement from one StatefulSet slice to another
-* Managing network connectivity between pods in different StatefulSet slices
-* Orchestrating storage lifecycle of PVCs and PVs across different StatefulSet slices
+* Orchestrating pod movement from one StatefulSet slice to another
+* Managing network connectivity between pods in different StatefulSet slices
+* Orchestrating storage lifecycle of PVCs and PVs across different StatefulSet slices
+  * Referenced PVs and PVCs will need to be migrated in order for a new StatefulSet to reference data used by an existing StatefulSet. Orchestration complexity will depend on how volumes are used (e.g., RWO volumes with `.spec.volumeClaimTemplates` on a StatefulSet, or RWX volumes with pod `.spec.volumes`).
 
 ## Proposal
@@ -940,9 +941,52 @@ not need to be as detailed as the proposal, but should include enough
 information to express the idea and why it was not acceptable.
 -->
 
-Users can orphan pods from a StatefulSet, migrate pods across a namespace or cluster, and create a new StatefulSet to manage pods upon migration. In the case of pod eviction or failure, pods will need to be manually restarted, requiring manual intervention and constant monitoring.
-
-Users can backup and restore a StatefulSet (and underlying storage) in a new namespace or cluster. Doing so requires the existing StatefulSet to be deleted, for underlying storage to be backed up and restored, resulting in downtime for the stateful application.
+### Alternative API changes
+
+**ReverseOrderedReady**: A new `PodManagementPolicy` value called
+`ReverseOrderedReady` could be added. This would allow a StatefulSet to be
+started and actuated from the highest ordinal (the current default starts from
+the lowest ordinal). For the cross-cluster migration use case, this would allow
+a source StatefulSet to be scaled down while a target StatefulSet is scaled in.
+The downside of this API is that pod management policy is not a mutable field,
+so if an orchestrator uses this behavior to scale in a StatefulSet in a
+destination cluster and then wants to revert the PodManagementPolicy to the
+default, the StatefulSet would need to be deleted and re-created.
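+
+As a rough sketch, assuming this alternative were adopted, the policy might be
+set as follows (`web` and `nginx` are placeholder names; `ReverseOrderedReady`
+is hypothetical, as `.spec.podManagementPolicy` accepts only `OrderedReady` or
+`Parallel` today):
+
+```yaml
+apiVersion: apps/v1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  # Hypothetical value from this alternative: create and actuate pods starting
+  # at the highest ordinal (replicas - 1) instead of ordinal 0. Today this
+  # field is immutable and accepts only OrderedReady or Parallel.
+  podManagementPolicy: ReverseOrderedReady
+  replicas: 3
+  serviceName: web
+  selector:
+    matchLabels:
+      app: nginx
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: registry.k8s.io/nginx-slim:0.8
+```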
+
+**KEP-3521**: [KEP-3521](https://github.com/kubernetes/enhancements/issues/3521)
+proposes a Pod `.spec` level API that enables a pod to be paused at the initial
+scheduling phase of the pod lifecycle. This provides granular control over
+which pods should be started and running (active) and which pods should not
+yet be scheduled (standby). An orchestrator can control the scheduling of
+specific pods without any changes to the StatefulSet controller, since the
+StatefulSet controller remains responsible for creating pods.
+
+If the StatefulSet controller is using `OrderedReady` pod management, pausing
+scheduling can result in a pod being marked as not Ready. This will prevent
+the StatefulSet controller from actuating updates to higher ordinal pods
+(e.g., pod `m` will not be created if pod `n` is unhealthy, where `m` > `n`).
+This may increase orchestrator complexity by requiring the orchestrator to use
+`Parallel` pod management during a migration, and then to re-create the
+StatefulSet (using `--cascade=orphan`) to revert to `OrderedReady` if desired.
+
+Additionally, if modifying the StatefulSet template is undesired, a webhook
+must be introduced to mark Pods as paused when they are created. This adds a
+layer of complexity to an orchestrator, since it needs both an operator
+component capable of making changes through the API server and a webhook that
+reads from a consistent migration state.
+
+### Alternatives without any API changes
+
+**Orphan Pods**: Users can orphan pods from a StatefulSet, migrate pods across
+a namespace or cluster, and create a new StatefulSet to manage the pods upon
+migration. In the case of pod eviction or failure, pods will need to be
+manually recreated, requiring manual intervention and constant monitoring.
+
+**Backup/Restore**: Users can back up and restore a StatefulSet (and its
+underlying storage) in a new namespace or cluster. Doing so requires the
+existing StatefulSet to be deleted and the underlying storage to be backed up
+and restored, resulting in downtime for the stateful application.
 
 ## Infrastructure Needed (Optional)