diff --git a/keps/prod-readiness/sig-apps/4650.yaml b/keps/prod-readiness/sig-apps/4650.yaml
new file mode 100644
index 00000000000..31adc0d5d14
--- /dev/null
+++ b/keps/prod-readiness/sig-apps/4650.yaml
@@ -0,0 +1,3 @@
kep-number: 4650
alpha:
  approver: "@wojtek-t"
diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
new file mode 100644
index 00000000000..3a12a95962c
--- /dev/null
+++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
@@ -0,0 +1,1087 @@
# KEP-4650: StatefulSet Support for Updating Volume Claim Template

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Kubernetes API Changes](#kubernetes-api-changes)
  - [Updated Reconciliation Logic](#updated-reconciliation-logic)
  - [What PVC is compatible](#what-pvc-is-compatible)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1: Batch Expand Volumes](#story-1-batch-expand-volumes)
    - [Story 2: Shrinking the PV by Re-creating PVC](#story-2-shrinking-the-pv-by-re-creating-pvc)
    - [Story 3: Asymmetric Replicas](#story-3-asymmetric-replicas)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
      - [Unit tests](#unit-tests)
      - [Integration tests](#integration-tests)
      - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Extensively validate the updated volumeClaimTemplates](#extensively-validate-the-updated-volumeclaimtemplates)
  - [Support for updating arbitrary fields in volumeClaimTemplates](#support-for-updating-arbitrary-fields-in-volumeclaimtemplates)
  - [Patch PVC size regardless of the immutable fields](#patch-pvc-size-regardless-of-the-immutable-fields)
  - [Support for automatically skipping unmanaged PVCs](#support-for-automatically-skipping-unmanaged-pvcs)
  - [Reconcile all PVCs regardless of Pod revision labels](#reconcile-all-pvcs-regardless-of-pod-revision-labels)
  - [Treat all incompatible PVCs as unavailable replicas](#treat-all-incompatible-pvcs-as-unavailable-replicas)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

Kubernetes currently does not support modifying the `volumeClaimTemplates` of a StatefulSet.
This enhancement proposes to support modifications to the `volumeClaimTemplates`,
automatically patching the associated PersistentVolumeClaim objects where applicable.
Currently, PVC `spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations`
can be patched.
All updates to PersistentVolumeClaim objects can be coordinated with `Pod` updates
to honor any dependencies between them.

## Motivation

Currently, there is very little users can do to update the volumes of
their existing StatefulSet deployments.
They can only expand the volumes, or modify them with a VolumeAttributesClass,
by updating individual PersistentVolumeClaim objects as an ad-hoc operation.
When the StatefulSet scales up, the new PVC(s) are created with the old
configuration, which again requires manual intervention.
This brings many headaches in a continuously evolving environment.

### Goals

* Allow users to update some fields of the `volumeClaimTemplates` of a `StatefulSet`.
* Automatically patch the associated PersistentVolumeClaim objects, without interrupting the running Pods.
* Support updating PersistentVolumeClaim objects with the `OnDelete` strategy.
* Coordinate updates to `Pod` and PersistentVolumeClaim objects.
* Provide accurate status and error messages to users when the update fails.

### Non-Goals

* Support automatic re-creation of PersistentVolumeClaims. We will never delete a PVC automatically.
* Validate the updated `volumeClaimTemplates` the same way a direct PVC patch is validated.
* Update ephemeral volumes.
* Patch PVCs that differ from the template, e.g. pre-existing PVCs adopted by the StatefulSet.

## Proposal

1. Change the API server to allow specific updates to the `volumeClaimTemplates` of a StatefulSet:
   * `labels`
   * `annotations`
   * `resources.requests.storage`
   * `volumeAttributesClassName`

2. Modify the StatefulSet controller to add PVC reconciliation logic.

3. Collect the status of managed PVCs, and show it in the StatefulSet status.

### Kubernetes API Changes

Changes to StatefulSet `spec`:

Introduce a new field in StatefulSet `spec`, `volumeClaimUpdatePolicy`, to
specify how to coordinate the update of PVCs and Pods. Possible values are:
- `OnDelete`: the default value; only update the PVC when the old PVC is deleted.
- `InPlace`: patch the PVC in-place if possible. This also includes the `OnDelete` behavior.

Changes to StatefulSet `status`:

Additionally collect the status of managed PVCs, and show it in the StatefulSet status.

For each PVC template:
- compatible: the number of PVCs that are compatible with the template.
  These replicas will not be blocked on Pod recreation.
- updating: the number of PVCs that are being updated in-place (e.g. expansion in progress).
- overSized: the number of PVCs that are larger than the template.
- totalCapacity: the sum of `status.capacity` of all the PVCs.

Some existing fields in the `status` are also updated to reflect the status of the PVCs:
- readyReplicas: in addition to Pods, also consider the PVC status. A PVC is not ready if:
  - `volumeClaimUpdatePolicy` is `InPlace` and the PVC is updating;
- availableReplicas: the total number of replicas whose Pod and PVCs have all been ready for at least `minReadySeconds`.
- currentRevision, updateRevision, currentReplicas, updatedReplicas
  are updated to reflect the status of the PVCs.

With these changes, users can still use `kubectl rollout status` to monitor the update process,
both for automated patching and for the PVCs that need manual intervention.
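To make the shape of these additions concrete, here is a rough Go sketch of the new field and the per-template counters described above. Only the field names listed above come from this proposal; the Go type names, JSON tags, and exact layout are illustrative assumptions to be settled during API review.

```go
package apps

import "k8s.io/apimachinery/pkg/api/resource"

// PersistentVolumeClaimUpdatePolicyType is a hypothetical name for the type of
// the proposed spec.volumeClaimUpdatePolicy field.
type PersistentVolumeClaimUpdatePolicyType string

const (
	// OnDelete (the default): the PVC is only replaced after the old PVC is deleted.
	OnDeleteVolumeClaimUpdatePolicy PersistentVolumeClaimUpdatePolicyType = "OnDelete"
	// InPlace: additionally patch mutable PVC fields in place when possible.
	InPlaceVolumeClaimUpdatePolicy PersistentVolumeClaimUpdatePolicyType = "InPlace"
)

// StatefulSetSpec additions (existing fields omitted).
type StatefulSetSpec struct {
	// VolumeClaimUpdatePolicy specifies how to coordinate updates of PVCs and Pods.
	// +optional
	VolumeClaimUpdatePolicy PersistentVolumeClaimUpdatePolicyType `json:"volumeClaimUpdatePolicy,omitempty"`
}

// StatefulSetVolumeClaimStatus is a hypothetical per-template status entry that
// aggregates the PVC counters described above across replicas.
type StatefulSetVolumeClaimStatus struct {
	// Compatible is the number of PVCs that are compatible with the template.
	Compatible int32 `json:"compatible"`
	// Updating is the number of PVCs that are being updated in-place.
	Updating int32 `json:"updating"`
	// OverSized is the number of PVCs that are larger than the template.
	OverSized int32 `json:"overSized"`
	// TotalCapacity is the sum of status.capacity of all the PVCs.
	TotalCapacity resource.Quantity `json:"totalCapacity"`
}
```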
### Updated Reconciliation Logic

How to update PVCs:
1. If `volumeClaimUpdatePolicy` is `InPlace`,
   and if the `volumeClaimTemplates` and the actual PVC only differ in mutable fields
   (`spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations` currently),
   patch the PVC to the extent possible (see the sketch after this list):
   - `spec.resources.requests.storage` is patched to max(template spec, PVC status).
     - Never decrease the storage size below its current status.
       Note that decreasing the size in the PVC spec can help recover from a failed expansion if the
       `RecoverVolumeExpansionFailure` feature gate is enabled.
   - `spec.volumeAttributesClassName` is patched to the template value.
   - `metadata.labels` and `metadata.annotations` are patched with Server Side Apply.

2. If it is not possible to make the PVC [compatible](#what-pvc-is-compatible),
   do nothing. But when re-creating a Pod whose corresponding PVC is being deleted,
   wait for the deletion, then create a new PVC together with the new Pod (already implemented).

3. Use either the current or the updated revision of the `volumeClaimTemplates` to create/update the PVC,
   just like the Pod template.
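Below is a minimal sketch, under the assumptions of this proposal, of how the in-place patch in step 1 could be computed. The helper name `desiredClaim` is hypothetical, and the real controller would apply the metadata via Server Side Apply rather than mutating maps directly.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// desiredClaim is a hypothetical helper that computes the desired state of an
// existing PVC from a volume claim template, following the rules in step 1.
func desiredClaim(template, pvc *v1.PersistentVolumeClaim) *v1.PersistentVolumeClaim {
	desired := pvc.DeepCopy()

	// Storage is patched to max(template spec, PVC status): never shrink the
	// request below the capacity the volume already reports.
	want := template.Spec.Resources.Requests[v1.ResourceStorage]
	if cur, ok := pvc.Status.Capacity[v1.ResourceStorage]; ok && cur.Cmp(want) > 0 {
		want = cur
	}
	if desired.Spec.Resources.Requests == nil {
		desired.Spec.Resources.Requests = v1.ResourceList{}
	}
	desired.Spec.Resources.Requests[v1.ResourceStorage] = want

	// The VolumeAttributesClass is taken directly from the template.
	desired.Spec.VolumeAttributesClassName = template.Spec.VolumeAttributesClassName

	// Labels and annotations from the template are layered on top of the
	// existing ones (the real controller would do this with Server Side Apply).
	for k, v := range template.Labels {
		if desired.Labels == nil {
			desired.Labels = map[string]string{}
		}
		desired.Labels[k] = v
	}
	for k, v := range template.Annotations {
		if desired.Annotations == nil {
			desired.Annotations = map[string]string{}
		}
		desired.Annotations[k] = v
	}
	return desired
}
```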
When to update PVCs:
1. Before advancing `status.updatedReplicas` to the next replica,
   check that the PVCs of the next replica are
   [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplates`.
   If not, and we are not going to patch them automatically,
   wait for the user to delete/update the old PVCs manually.

2. When doing a rolling update, a replica is considered ready if the Pod is ready
   and none of its volumes are being updated in-place.
   Wait for a replica to be ready for at least `minReadySeconds` before proceeding to the next replica.

3. Whenever we check for a Pod update, also check for PVC updates.
   For example:
   - If `spec.updateStrategy.type` is `RollingUpdate`,
     update the PVCs in order from the largest ordinal to the smallest.
   - If `spec.updateStrategy.type` is `OnDelete`,
     only update the PVC when the Pod is deleted.

4. When patching the PVC, if we also re-create the Pod,
   update the PVC after the old Pod is deleted, together with creating the new Pod.
   Otherwise, if the Pod is unchanged, update only the PVC.

Failure cases: don't leave too many PVCs being updated in-place at once. We expect to update the PVCs in order.

- If the PVC update fails, we should block the update process.
  If the Pod is also deleted (by the controller or manually), don't block the creation of the new Pod.
  We should retry and report events for this.
  The events and status should look like those emitted when Pod creation fails.

- While waiting for the PVC to reach a compatible state,
  we should update the status, just like what we do when waiting for a Pod to become ready.
  We should block the update process if the PVC never becomes compatible.

- If the `volumeClaimTemplates` is updated again while the previous rollout is blocked,
  similar to [Pods](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback),
  the user may need to manually deal with the blocking PVCs (update or delete them).

### What PVC is compatible

A PVC is compatible with the template if:
- all the immutable fields match exactly; and
- the `metadata.labels` and `metadata.annotations` of the PVC are a superset of the template's; and
- the `status.capacity.storage` of the PVC is greater than or equal to
  the `spec.resources.requests.storage` of the template; and
- the `status.currentVolumeAttributesClassName` of the PVC is equal to
  the `spec.volumeAttributesClassName` of the template.
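The check above could look roughly like the following sketch. The helper name `isCompatible` is hypothetical, and the storage class comparison stands in for the full comparison of all immutable fields.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
)

// isCompatible is a hypothetical helper expressing the rules above.
func isCompatible(template, pvc *v1.PersistentVolumeClaim) bool {
	// All the immutable fields must match exactly (storageClassName is shown
	// here as one example; the real check covers every immutable spec field).
	if ptr.Deref(template.Spec.StorageClassName, "") != ptr.Deref(pvc.Spec.StorageClassName, "") {
		return false
	}

	// The PVC's labels and annotations must be a superset of the template's.
	for k, v := range template.Labels {
		if pvc.Labels[k] != v {
			return false
		}
	}
	for k, v := range template.Annotations {
		if pvc.Annotations[k] != v {
			return false
		}
	}

	// The provisioned capacity must cover the template's request.
	want := template.Spec.Resources.Requests[v1.ResourceStorage]
	got, ok := pvc.Status.Capacity[v1.ResourceStorage]
	if !ok || got.Cmp(want) < 0 {
		return false
	}

	// The applied VolumeAttributesClass must equal the template's.
	return ptr.Deref(pvc.Status.CurrentVolumeAttributesClassName, "") ==
		ptr.Deref(template.Spec.VolumeAttributesClassName, "")
}
```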
### User Stories (Optional)

#### Story 1: Batch Expand Volumes

We're running a CI/CD system and end-to-end automation is desired.
To expand the volumes managed by a StatefulSet,
we can just use the same pipeline that we already use to update the Pods.
All of the test, review, approval, and rollback processes can be reused.

#### Story 2: Shrinking the PV by Re-creating PVC

After running our app for a while, we optimize the data layout and reduce the required storage size.
Now we want to shrink the PVs to save cost.
We cannot afford any downtime, so we don't want to delete and recreate the StatefulSet.
We also don't have the infrastructure to migrate between two StatefulSets.
Our app can automatically rebuild the data in the new storage from other replicas.
So we update the `volumeClaimTemplates` of the StatefulSet,
delete the PVC and Pod of one replica, let the controller re-create them,
then monitor the rebuild process.
Once the rebuild completes successfully, we proceed to the next replica.

#### Story 3: Asymmetric Replicas

The storage requirements of different replicas are not identical,
so we still want to update each PVC manually and separately.
Possibly we also update the `volumeClaimTemplates` for new replicas,
but we don't want the controller to interfere with the existing replicas.

### Notes/Constraints/Caveats (Optional)

When designing the `InPlace` update strategy, we update the PVC in the same way we re-create the Pod,
i.e. we update the PVC whenever we would re-create the Pod,
and we wait for the PVC to become compatible whenever we would wait for the Pod to become available.

The StatefulSet controller should also keep the current and updated revisions of the `volumeClaimTemplates`,
so that a StatefulSet can still re-create Pods and PVCs that are yet to be updated.

### Risks and Mitigations

TODO: Recover from a failed in-place update (insufficient storage, etc.).
What else is needed in addition to reverting the StatefulSet spec?

## Design Details

We can use Server Side Apply to patch the PVCs,
so that we will not interfere with the user's manual changes,
e.g. to `metadata.labels` and `metadata.annotations`.

New invariant established for PVCs:
if a Pod carries the revision A label, all of its PVCs are either not created yet or updated to revision A.
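As a minimal illustration of the Server Side Apply approach described above (not the actual controller code), such a patch could be built with client-go apply configurations as sketched below; the field manager name is an assumption.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1apply "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
)

// applyClaimMetadata sketches a Server Side Apply patch that owns only the
// labels and annotations coming from the volumeClaimTemplates, leaving fields
// set by users or other controllers untouched.
func applyClaimMetadata(ctx context.Context, cs kubernetes.Interface,
	namespace, name string, labels, annotations map[string]string) error {

	claim := corev1apply.PersistentVolumeClaim(name, namespace).
		WithLabels(labels).
		WithAnnotations(annotations)

	_, err := cs.CoreV1().PersistentVolumeClaims(namespace).Apply(ctx, claim,
		metav1.ApplyOptions{FieldManager: "statefulset-volumeclaim", Force: true})
	return err
}
```

Because the apply configuration contains only the fields derived from the template, labels and annotations set manually by users or by other controllers keep their own field managers and are left untouched.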
### Test Plan

[ ] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

- ``: `` - ``

##### Integration tests

- :

##### e2e tests

- :

### Graduation Criteria

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: StatefulSetUpdateVolumeClaimTemplate
  - Components depending on the feature gate:
    - kube-apiserver
    - kube-controller-manager

###### Does enabling the feature change any default behavior?

Updates to the StatefulSet `volumeClaimTemplates` will be accepted by the API server,
whereas they were previously rejected.

Otherwise, no.
If `volumeClaimUpdatePolicy` is `OnDelete` (the default value),
the behavior of the StatefulSet controller is almost the same as before.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Since the `volumeClaimTemplates` can already differ from the actual PVCs today,
disabling this feature gate should not leave any inconsistent state.

If the `volumeClaimTemplates` is updated, then the feature is disabled and the StatefulSet is rolled back,
the `volumeClaimTemplates` will be kept at the latest version, and its history will be lost.

###### What happens if we reenable the feature if it was previously rolled back?

If `volumeClaimUpdatePolicy` is already set to `InPlace`, re-enabling the feature
will kick off the update process immediately.

###### Are there any tests for feature enablement/disablement?

We will add unit tests for the StatefulSet controller with and without the feature gate,
with `volumeClaimUpdatePolicy` set to `InPlace` and `OnDelete` respectively.

### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

###### What specific metrics should inform a rollback?

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

###### How can someone using this feature know that it is working for their instance?

- [ ] Events
  - Event Reason:
- [ ] API .status
  - Condition name:
  - Other field:
- [ ] Other (treat as last resort)
  - Details:

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [ ] Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- [ ] Other (treat as last resort)
  - Details:

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

### Scalability

###### Will enabling / using this feature result in any new API calls?

- PATCH StatefulSet
  - kubectl or other user agents
- PATCH PersistentVolumeClaim
  - 1 per updated PVC in the StatefulSet (number of updated claim templates * replicas)
  - StatefulSet controller (in KCM)
  - triggered by the StatefulSet spec update
- PATCH StatefulSet status
  - 1-2 per updated PVC in the StatefulSet (number of updated claim templates * replicas)
  - StatefulSet controller (in KCM)
  - triggered by the StatefulSet spec update and PVC status updates

###### Will enabling / using this feature result in introducing new API types?

No

###### Will enabling / using this feature result in any new calls to the cloud provider?

Not directly. The cloud provider may be called when the PVCs are updated.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

StatefulSet:
- `spec`: 2 new enum fields, ~10B
- `status`: 4 new integer fields, ~10B

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

The logic of the StatefulSet controller is more complex, so more CPU will be used.
TODO: measure the actual increase.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

###### What are other known failure modes?

###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History

## Drawbacks

## Alternatives

### Extensively validate the updated `volumeClaimTemplates`

[KEP-0661] proposes that we should do extensive validation on the updated `volumeClaimTemplates`,
e.g. prevent decreasing the storage size, or prevent expansion if the storage class does not support it.
However, this has several drawbacks:
* If we disallow decreasing, we make the edit a one-way road.
  If a user edits it and then finds it was a mistake, there is no way back.
  The StatefulSet will be broken forever. If this happens, updates to Pods will also be blocked. This is not acceptable.
* To mitigate the above issue, we will want to prevent the user from going down this one-way road by mistake.
  We are forced to do many more validations in the API server, which is very complex and fragile (please see KEP-0661).
  For example: check the storage class `allowVolumeExpansion`, check each PVC's storage class and size,
  basically duplicating all the validations we already do for PVCs.
  And even if we do all the validations, there are still race conditions and asynchronous failures that are impossible to catch.
  I see this as a major drawback of KEP-0661 that I want to avoid in this KEP.
* Validation means we would have to disable rollback of the storage size. If we enable it later, it can surprise users, if it does not count as a breaking change.
* The validation conflicts with the `RecoverVolumeExpansionFailure` feature, although that feature is still alpha.
* `volumeClaimTemplates` is also used when creating new PVCs, so even if the existing PVCs cannot be updated,
  a user may still want to affect new PVCs.
* It violates the high-level design.
  The template describes a desired final state, rather than an immediate instruction.
  A lot of things can happen externally after we update the template.
  For example, I have an IaaS platform which tries to `kubectl apply` one updated StatefulSet +
  one new StorageClass to the cluster to trigger the expansion of PVs.
  We don't want to reject it just because the StorageClass is applied after the StatefulSet.

### Support for updating arbitrary fields in `volumeClaimTemplates`

There are no technical limitations. We just want to be careful and keep the changes small, so that we can move faster.
This is just an extra validation in the API server. We may remove it later if we find it is not needed.

### Patch PVC size regardless of the immutable fields

We propose to patch the PVC as a whole, so it can only succeed if the immutable fields match.

If only expansion were supported, patching regardless of the immutable fields could be a logical choice.
But this KEP also integrates with VAC, and VAC is closely coupled with the storage class.
Patching the VAC only if the storage class matches is a very logical choice,
and we'd better follow the same operation model for all mutable fields.

### Support for automatically skipping unmanaged PVCs

Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy`.
If it is set to `Async`, we skip patching the PVCs that are not managed by the StatefulSet (e.g. the StorageClass does not match).

The rules to determine which PVCs are managed are a little bit tricky.
We have to check each field, and determine what to do for each field.
This makes us deeply coupled with the PVC implementation.

And still, we want to keep the changes small.

### Reconcile all PVCs regardless of Pod revision labels

Like Pods, we only update the PVCs if the Pod revision label is not the update revision.

We would need to unmarshal all revisions used by Pods to determine the desired PVC spec.
Even if we did so, we don't want to send an apply request for each PVC at each reconcile iteration.
We also don't want to replicate the SSA merging/extraction and validation logic, which can be complex and CPU-intensive.

### Treat all incompatible PVCs as unavailable replicas

Currently, incompatible PVCs only block the rolling update, not scaling up or down.
Only the update revision is used for checking.

We would need to unmarshal all revisions used by Pods to determine compatibility.
Even if we did so, old StatefulSets do not have claim info in their history.
If we just used the latest version, then all replicas might suddenly become unavailable,
and all operations would be blocked.
[KEP-0661]: https://github.com/kubernetes/enhancements/pull/3412

## Infrastructure Needed (Optional)

diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml
new file mode 100644
index 00000000000..89587d8f26f
--- /dev/null
+++ b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml
@@ -0,0 +1,50 @@
title: StatefulSet Support for Updating Volume Claim Template
kep-number: 4650
authors:
  - "@huww98"
  - "@vie-serendipity"
owning-sig: sig-apps
participating-sigs:
  - sig-storage
status: provisional
creation-date: 2024-05-17
reviewers:
  - "@kow3ns"
  - "@gnufied"
  - "@msau42"
  - "@xing-yang"
  - "@soltysh"
approvers:
  - "@kow3ns"
  - "@xing-yang"

see-also:
  - "/keps/sig-storage/1790-recover-resize-failure"
  - "/keps/sig-storage/3751-volume-attributes-class"
replaces:
  - "https://github.com/kubernetes/enhancements/pull/2842" # Previous attempt on 0661
  - "https://github.com/kubernetes/enhancements/pull/3412" # Previous attempt on 0661

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.31"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.32"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: StatefulSetUpdateVolumeClaimTemplate
    components:
      - kube-apiserver
      - kube-controller-manager
disable-supported: true

# The following PRR answers are required at beta release
metrics: []