
KEP-4650: StatefulSet Support for Updating Volume Claim Template #4651

Open
wants to merge 17 commits into master
Conversation

@huww98 commented May 22, 2024

  • One-line PR description: initial proposal of KEP-4650: StatefulSet Support for Updating Volume Claim Template

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 22, 2024
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 22, 2024
@huww98 (Author) commented May 22, 2024

This KEP is inspired by the previous proposal in #3412. However, there are several major differences, so I decided to make a new KEP for this.

Differences:

  • No validation is performed at the StatefulSet level. We accept everything, do what we can, and report what we have done in status. This should resolve the main concern with the previous attempt, namely that it is hard to recover from an expansion failure; a plain kubectl rollout undo should work in this KEP.
  • We take extra care about backward compatibility. No immediate changes are expected when the feature is enabled.
  • We accept changes to all fields, not just the storage size, and explicitly consider what to do if the PVC does not match the template.

Please read more in the "Alternatives" section in the KEP.

This KEP also contains more details about how to coordinate the updates of Pods and PVCs, which was another main concern with the previous attempt.
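
For a concrete picture, here is a minimal sketch of the kind of edit being proposed; all names, images, and sizes are illustrative, and today the API server rejects any update to volumeClaimTemplates:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                        # illustrative
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi            # edited from 10Gi; the proposal is to accept this edit
                                   # and patch the existing PVCs where possible
```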

Comment on lines 5 to 7
owning-sig: sig-storage
participating-sigs:
- sig-app
Member

I think the main issues with #3412 were around StatefulSet controller and its behavior in error cases. IMO it should be owned by sig-apps.

Author

OK. Given that the code changes should mainly happen in the StatefulSet controller, and that the previous attempts all put sig-apps as the owning SIG (it is weird that they are placed in the sig-storage folder), I will change this to sig-apps.

@xing-yang (Contributor):

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label May 23, 2024
@huww98 (Author) commented May 27, 2024

/cc @smarterclayton @gnufied @soltysh
Also cc @areller @maxin93 @jimethn
You joined the conversation in #3412, so you may also be interested in this KEP.

@soltysh (Contributor) left a comment

You're missing the filled-in production readiness questionnaire and a PRR file in https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-apps matching the template.

* Allow users to update the `volumeClaimTemplates` of a `StatefulSet` in place.
* Automatically update the associated PersistentVolumeClaim objects in-place if applicable.
Contributor

What do you mean by in-place here?

For example, the volume resize feature can automatically adjust the disk and filesystem sizes while the pod is running. In addition to this, there will be new features like VolumeAttributesClass that will also support in-place changes to the storage being used.

Author

I use the word in-place to explicitly distinguish this from the delete/re-create style of update used for Pods. We should never delete a PVC automatically.
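
To make "patch PVC" concrete, this is a minimal sketch of the PVC spec fields that can already be changed on a live claim today; the claim name, storage class, and VolumeAttributesClass names are illustrative, and volumeAttributesClassName requires the VolumeAttributesClass feature:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-web-0                  # a PVC created from a volumeClaimTemplate (illustrative)
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast            # immutable after creation
  volumeAttributesClassName: gold   # mutable in place via the VolumeAttributesClass feature
  resources:
    requests:
      storage: 20Gi                 # mutable, but may only be increased (volume expansion)
```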

Production Readiness review, etc.
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huww98
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@huww98 (Author) commented Jun 17, 2024

Updated "Production Readiness Review Questionnaire" as requested.
/assign @wojtek-t
For PRR.

During the last sig-apps meeting,
Q1: @kow3ns asked why, when we do want to modify PVCs (e.g. to migrate between storage providers), we don't use application-level features and the StatefulSet ordinals feature to migrate between two StatefulSets.

First, this KEP still mainly targets the use cases where we can update the PVC in place, without interrupting the running pods. Migration and other use cases are just a by-product of making volumeClaimTemplates editable. Still, migration by editing the StatefulSet is simpler: it only requires a rolling restart, which should already be familiar to most Kubernetes operators. Editing the StatefulSet does not require switching traffic between two StatefulSets, for example, and the name of the StatefulSet won't change after migration.

Yes, we cannot roll back easily with this procedure; the user would have to delete the PVC again to roll back, and the data in the PVC may not be recoverable if the retention policy is Delete. But rolling back a stateful application is inherently complex. Once a replica leaves the cluster, its state in the PV becomes stale, so there is no guarantee that the old data can be used to roll back anyway. Whichever procedure is used, this complexity should be addressed at a higher level. Custom migrators built on the ordinals feature face the same issue.

Q2: @soltysh suggested that we could still go along the lines of KEP-0661 first, then do more after we have gained more experience.

Here is why I don't want to proceed that way:

  1. We have VAC now, which is expected to go to beta soon. VAC is closely related to the storage class and can be patched onto an existing PVC. This KEP also integrates with VAC, and only updating the VAC if the storage class matches is a very logical choice.
  2. By allowing the whole template to be edited, we are forced to fully consider what to do if the PVC and the template are inconsistent, for example when the storage class differs. In KEP-0661, it is logical to just do the expansion anyway. But in this KEP, I think we should not expand it, because the person who writes the StatefulSet spec and the person who applies the change may not be the same, and we should not surprise the operator. This is also consistent with the VAC operation model. This is the divergence: we cannot go through KEP-0661 first and then on to this KEP without breaking changes.
  3. KEP-0661 proposes not reverting the volumeClaimTemplates when rolling back the StatefulSet, which is very confusing. That is another potential breaking change if we go that way first.

Please read more in the Alternatives section in this KEP.

@soltysh said we don't want the StatefulSet to be stuck in a permanently broken state. Of course. With this KEP, since we are not validating the templates, it is actually very hard to get stuck. A larger PVC is compatible with a smaller template, so just rolling back the template unblocks future rollouts, either leaving the PVCs in the expanding state or cancelling the expansion if the RecoverVolumeExpansionFailure feature gate is enabled.

I think the only way we may get stuck is when patching the VAC: if one replica is successfully updated to the new VAC but another replica fails, and rolling the first replica back to the old VAC also fails. Even in this case, the user can just set volumeClaimUpdateStrategy to OnDelete to unblock pod rollouts.
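
For reference, a rough sketch of what that escape hatch could look like, assuming the volumeClaimUpdateStrategy field discussed in this KEP; the exact API shape and value names are still under discussion:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                            # illustrative
spec:
  # Proposed field (subject to change): OnDelete tells the controller not to patch
  # existing PVCs automatically, so pod rollouts are no longer blocked on PVC updates.
  volumeClaimUpdateStrategy: OnDelete
```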

Q3: @kow3ns thinks it is not appropriate to delete and recreate the PVC to alter the performance characteristics of volumes.

VAC is the KEP that actually parameterizes the storage class and allows us to specify and update the performance characteristics of volumes without interrupting the running pod, by patching the existing PVC. So this KEP also integrates with VAC. An update to the VAC in the volumeClaimTemplates does not require re-creating the PVC and is fully automated if everything goes well.

Q4: @kow3ns asked how we should handle each field of volumeClaimTemplates.

This is described in the "How to update PVCs" section under "Updated Reconciliation Logic" in the KEP. Basically: patch what we can, skip the rest.

It seems the wording "update PVC in-place" causes many misunderstandings. I will replace it with "patch PVC".

We didn’t actually decide anything during the last meeting. I think these core questions should be decided to push this KEP forward:

  • Allow editing of all fields, or only the storage size and VAC?
  • What to do if the storage class does not match between the template and the PVC: skip, block, or proceed anyway?
  • Should kubectl rollout undo affect volumeClaimTemplates or not?

@soltysh (Contributor) commented Jun 18, 2024

Here is why I don't want to proceed that way:

The general sentiment from that sig-apps call (see https://docs.google.com/document/d/1LZLBGW2wRDwAfdBNHJjFfk9CFoyZPcIYGWU7R1PQ3ng/edit#heading=h.2utc2e8dj14) was that smaller changes have a greater chance of moving forward. Also, it's worth noting that the smaller changes do not stand in opposition to the changes proposed here; they only take a gradual approach by focusing on a minimal subset of changes.

@mowangdk:

The general sentiment from that sig-apps call was that smaller changes have a greater chance of moving forward. Also, the smaller changes do not stand in opposition to the changes proposed here; they only take a gradual approach [...]

Okay, we agreed to focus on the minimal subset of changes. @huww98 and @vie-serendipity will proceed with only the volumeClaimTemplates.spec.resources.requests.storage and volumeClaimTemplates.spec.volumeAttributesClassName fields. There are still two questions remaining to solve; maybe we can talk about them at the next sig-apps meetings.

  • What to do if the storage class does not match between the template and the PVC: skip, block, or proceed anyway?
  • Should kubectl rollout undo affect volumeClaimTemplates or not?

@huww98 (Author) commented Jul 11, 2024

At the last sig-apps meeting, we decided that we should:

  • Only allow editing of the storage size in the template;
  • If the storage class does not match between the template and the PVC, block the update process and wait for user interaction.

These changes should greatly shrink the scope of this KEP, and I think that is beneficial for moving the KEP forward.

But for the validation of the template, I think we still need more discussion; it could be a major blocking point for this KEP. @soltysh thinks that we should not allow decreasing the size in the template, and that we can remove the validation later if desired. But I think validation has many drawbacks which may block normal usage of this feature and should be resolved in the initial version:

  1. If we disallow decreasing, we make the edit a one-way road. If a user edits it and then finds it was a mistake, there is no way back: the StatefulSet will be broken forever, and updates to pods will also be blocked. This is not acceptable IMO.
  2. To mitigate the above issue, we would want to prevent the user from going down this one-way road by mistake. We would be forced to do far more validation in the API server, which is very complex and fragile (please see KEP-0661). For example: check the storage class's allowVolumeExpansion, check each PVC's storage class and size, and basically duplicate all the validations we already do for PVCs. Even if we do all of that, there are still race conditions and asynchronous failures that are impossible to catch. I see this as a major drawback of KEP-0661 that I want to avoid in this KEP.
  3. Validation means we would have to disable rollback of the storage size. If we enable it later, it could surprise users, if it does not outright count as a breaking change.
  4. The validation conflicts with the RecoverVolumeExpansionFailure feature, although that is still alpha.

In contrast, if we just don't add the validation, we avoid all these issues and lose nothing. The user can already expand a PVC independently today, so the state where the template is smaller than the PVC is already very common and stable. The strategy in this state is simply to not try to shrink the PVC, which I think is well defined and easy to follow. If Kubernetes ever supports shrinking in the future, we will still need to support drivers that can't shrink, so even then we could only support shrinking with a new volumeClaimUpdateStrategy (maybe InPlaceShrinkable).
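
For example (all names and sizes illustrative), this is a state clusters can already reach today, and the proposed behavior is simply to leave the larger PVC alone:

```yaml
# In the StatefulSet, the template still asks for 10Gi:
volumeClaimTemplates:
- metadata:
    name: data
  spec:
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 10Gi
---
# The PVC of replica 0, expanded independently by the user:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-web-0                 # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi                # larger than the template; the controller never tries to shrink it
```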

To take a step back, I think validating the template across resources violates the high-level design: the template describes a desired final state, rather than an immediate instruction. A lot of things can happen externally after we update the template. For example, I have an IaaS platform which kubectl applies one updated StatefulSet plus one new StorageClass to the cluster to trigger the expansion of PVs. We don't want to reject it just because the StorageClass is applied after the StatefulSet, right?
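
A sketch of that scenario with illustrative names: both objects land in one kubectl apply, and admission-time validation of the template against the StorageClass would make the outcome depend on which object happens to be applied first.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-expandable            # illustrative; might only exist after the StatefulSet update
provisioner: csi.example.com       # illustrative CSI driver
allowVolumeExpansion: true
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                        # illustrative
spec:
  # ...pod template and other fields omitted...
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-expandable
      resources:
        requests:
          storage: 20Gi            # expanded size
```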

To conclude, I don't want to add validation that we would only remove again in the future.

@liubog2008:

In contrast, if we just don't add the validation, we avoid all these issues and lose nothing. [...]

Agree. By the way, the request is the minimum resource requirement, so an actual allocation that is larger than the request is perfectly reasonable. What we need is just to show it to users.

@soltysh (Contributor) commented Oct 1, 2024

  1. If we disallow decreasing, we make the edit a one-way road. If a user edits it and then finds it was a mistake, there is no way back: the StatefulSet will be broken forever, and updates to pods will also be blocked. This is not acceptable IMO.

That's one way of looking at it. Also, in those cases where a mistake happens (which I consider a rare occurrence), you can always use #3335 and migrate to a new, smaller StatefulSet.
Additionally, it's always easier to limit the scope of an enhancement and expand it once we confirm there are sufficient use cases supporting it. The other way around, i.e. removing functionality, is either hard or, more often, not possible at all. We've done that in the past, and given the available solutions around "fixing a mistake", I'll stand firm on only allowing increases to the size.
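
For context, the migration route mentioned here builds on the existing StatefulSet start-ordinal feature (#3335); a rough sketch with illustrative names and sizes, leaving out the actual hand-over steps:

```yaml
# A second StatefulSet with a smaller template takes over replicas gradually;
# spec.ordinals.start keeps its pod ordinals from overlapping with the ones still
# owned by the original StatefulSet while replicas are moved across.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-new                    # illustrative
spec:
  serviceName: web
  replicas: 3
  ordinals:
    start: 3                       # start-ordinal support from #3335
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: registry.k8s.io/nginx-slim:0.8   # illustrative
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi            # the smaller size the new StatefulSet starts from
```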

@soltysh (Contributor) commented Oct 1, 2024

@huww98 and @liubog2008 are you planning to push this through for 1.32?

@kfox1111 commented Oct 1, 2024

Users have been waiting for many years to be able to scale up StatefulSet volumes. I agree we shouldn't over-complicate that use case by trying to solve other issues at the same time. Let's focus on the very common use case, and then we can re-evaluate other features after that is completed.

@mowangdk commented Oct 3, 2024

@huww98 and @liubog2008 are you planning to push this through for 1.32?

@soltysh The Enhancement Freeze date is Oct 11th, right? We are on vacation right now and may not make that date. cc @huww98 @vie-serendipity

@soltysh (Contributor) commented Oct 4, 2024

@soltysh The Enhancement Freeze date is Oct 11th, right? We are on vacation right now and may not make that date. cc @huww98 @vie-serendipity

Roger that, let's still try to work on the enhancement within the 1.32 time frame, so that it's ready in time for 1.33, in that case.

@mowangdk commented Oct 5, 2024

Roger that, let's still try to work on the enhancement within the 1.32 time frame, so that it's ready in time for 1.33, in that case.

okay, that sounds great~
