-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new GroupControllerService and VolumeGroupSnapshot CSI RPCs #519
Add new GroupControllerService and VolumeGroupSnapshot CSI RPCs #519
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for proposing these changes! Except for missing secrets in some of the procedures, we should be able to implement VolumeGroups with Ceph-CSI this way.
a5808fb
to
cf06518
Compare
This model first-classes the "group" concept as a thing w/ its own metadata. Is that really needed? Does CSI really care about group metadata? If so, why? What if the model was inverted, e.g. the Volume concept was extended w/ a list of "group names", which would just be opaque identifiers from a CSI perspective. We didn't add "Zone" or "Topology" as first-class objects managed by CSI RPCs - why would we do the same for groups? If we're aiming for the simplest API that still provides value, this current proposal seems a bit heavy. Consistency (for snapshot) groups, replication groups, isolation groups, etc definitely have value for different use cases. I wonder why CSI cares to distinguish among them at this point? If a volume object could self-identify as belonging to groups "a" and "b" isn't that enough to begin implementing support for some the desired, advanced use cases from the perspective of both the CO and SP? If not, why? If the answer is "because the KEP needs these things" then my next question is "why does the KEP propose separate CRDs to manage groups vs extending the Volume k8s resource w/ group attributes" - but that's arguably a question that belongs on the KEP thread :) |
@jdef That’s an interesting idea. If we only need to support Group Snapshot, then I think just adding a group name attribute to a volume object would be sufficient. If we want to support more group types, then just relying on attributes in volume objects could be confusing. A group could be for group snapshot, group replication, placement, etc. So other than the group name, we also need to specify what type of group that is. Although this PR has not talked about replication group, the plan is to make this design generic so that it can be extended to support replication group in the future. For a replication group, there are additional group specific attributes, i.e., whether it is primary or secondary, etc. We would also want to introduce RPCs to support operations such as failover, failback, etc. at the group level. Regarding why we didn’t add “Zone” or “Topology” as first class objects in CSI, my understanding is that those are not storage system’s internal attributes. Support for zone and topology does not consider the internal topology of the storage system. That’s why we are also exploring how to use the group concept to handle placement which is associated with storage system’s internal topology. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update! With the secrets added it should be possible for Ceph-CSI to implement the new procedures.
A few people have reviewed the design and confirmed it works for their storage systems. See details here: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall these changes look very good to me. I did leave a couple of comments but they are only advisory. I believe we can implement this specification for our arrays.
f8764d1
to
9e3910d
Compare
Following this idea a bit more, is it required that the group needs to distinguish its purpose/type? At least in its current proposal, the Should we consider this alternative? Or at least see where we run into difficulties? |
When we support group snapshot, replication group, etc., it is required to distinguish its type as CSI driver may need to call different APIs for different group types. When we take a GroupSnapshot, CSI driver needs to call a storage API to do so. In order to do that, it needs to have a list of volumes as the input. If we don't have a VolumeGroup, then we still need to add new RPCs which take an input of a list of volumes. |
I'm trying to understand the leap, from volume lifecycle mgmt to group
lifecycle mgmt. Why does a CO need to explictly create or modify or destroy
groups? Can't a CO use a group, perhaps created out of band, by name ref?
Dropping VolumeGroup does not preclude you from passing a groupName
parameter to a snapshot rpc. Asking CSI API to enumerate all vendor group
features seems pitted against the desire to maintain a lowest common
denominator, minimal API.
…On Wed, Oct 5, 2022, 10:28 PM Xing Yang ***@***.***> wrote:
@jdef <https://github.com/jdef> That’s an interesting idea. If we only
need to support Group Snapshot, then I think just adding a group name
attribute to a volume object would be sufficient. If we want to support
more group types, then just relying on attributes in volume objects could
be confusing. A group could be for group snapshot, group replication,
placement, etc. So other than the group name, we also need to specify what
type of group that is. Although this PR has not talked about replication
group, the plan is to make this design generic so that it can be extended
to support replication group in the future. For a replication group, there
are additional group specific attributes, i.e., whether it is primary or
secondary, etc. We would also want to introduce RPCs to support operations
such as failover, failback, etc. at the group level.
Following this idea a bit more, is it required that the group needs to
distinguish its purpose/type? At least in its current proposal, the Group
does not indicate if its for snapshots vs other purposes, and there's a
separate GroupSnapshot. I imagine in the future too, you could use the same
group for multiple purposes like replication as well.
Should we consider this alternative? Or at least see where we run into
difficulties?
When we support group snapshot, replication group, etc., we do need to
distinguish its type as CSI driver may need to call different APIs for
different group types.
When we take a GroupSnapshot, CSI driver needs to call a storage API to do
so. In order to do that, it needs to have a list of volumes as the input.
If we don't have a VolumeGroup, then we still need to add new RPCs which
takes an input of a list of volumes. I think that is more ugly than having
a VolumeGroup.
—
Reply to this email directly, view it on GitHub
<#519 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAR5KLADJREJ2G3HSMA5LBLWBY2MFANCNFSM57UQWJNA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@jdef, you replied before I edited my previous response that removed the last sentence from my previous response. Actually even if we only support group snapshot, we still need to have a new CSI CreateGroupSnapshot RPC that allows us to pass a list of volume ids to the CSI driver. CSI aims to support multiple storage systems. For some, simply passing a volume group name is enough, but for others, it is required to pass a list of volume ids as well. |
When you create a volume group, do you need to state what its purpose is during group creation? For my reference, can you point to a storage system that would need this? If this is common, then maybe we should not have a generic group concept. |
@msau42 Not every driver needs to differentiate that but some do. Here's documentation of Generic Volume Group in OpenStack Cinder: https://docs.openstack.org/cinder/latest/contributor/groups.html. It was extended to support both group snapshots and replication group. It was implemented by multiple storage vendors. There was at least a Cinder driver from Dell EMC that calls different APIs to create a group that supports consistent group snapshot vs a group that supports replication group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a spin through the PR and left some notes. Why are secrets everywhere - seems like overkill?
This adds a TON of surface area. Has anyone considered breaking volume group things into a separate service? VolumeGroupController
or something - because, really there's a lot of RPCs here that a plugin has to implement (at least stubs) when they don't support vol groups at all. We've tended to lean away from defining new services, but maybe a new alpha service in this case makes sense?
Volume groups, as a concept, feel pretty opinionated here. I still wonder if there's a simpler representation that we can aim for.
@jdef Regarding the secrets, there are CSI drivers that require a secret to be passed in to every volume operation. For that reason, we have added secrets to most existing RPCs. Now since we are introducing new RPCs in this PR, we are just adding them at the same time. This way we don't have to submit followup PRs to add secrets to these RPCs in the future. I'll get back to your other comments later. Thanks! |
70a7fb1
to
d1c0a08
Compare
Hi @jdef, I addressed all your comments. Please take a look. Thanks. |
I made the following changes:
|
spec.md
Outdated
|
||
In VolumeGroupSnapshot message, both snapshots and group_snapshot_id are required fields. | ||
|
||
If an error occurs before all the individual snapshots are created when creating a group snapshot of multiple volumes and a group_snapshot_id is not yet available for CO to do clean up, SP SHOULD do clean up and make sure no snapshots are leaked. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If create-volume-group-snapshot (CVGS) fails after some subset of snapshots was created w/ some subset failing (because some post-processing op failed), then a subsequent CVGS RPC is expected to report the error... as long as the SP hasn't already cleaned up after the failed CVGS op, right?
Because if the SP is doing cleanup, and the cleanup happens very quickly then subsequent CVGS could kick off a secondary operation (that may or may not fail) like the first. This could happen in such a manner that the CO never actually observes the failure gRPC because the SP is so quick to clean up between CO invocations of (an async-failing) CVGS. It's entirely possible that in such cases, the CO could reissue CVGS calls indefinitely w/ the same parameter set, not knowing that it will never be completed by the SP because the SP is mopping things up so quickly - all the CO might see is that ready_to_use never becomes true.
Are there holes in the spec like this elsewhere, where the SP can perform GC in a manner that leaves the CO blind to underlying errors? In general, we've shied away from async operations in CSI. Designing w/ post-processing steps that complete out-of-band effectively makes the API async. Can we avoid that?
IMO this needs more thought: the CO should be able to observe a failure and report it back to either some controller, or admin. Which likely means that magical auto-cleanup for failed CVGS (and maybe even for single create-snapshot) ops, performed by the SP, should go away.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If create-volume-group-snapshot (CVGS) fails after some subset of snapshots was created w/ some subset failing (because some post-processing op failed), then a subsequent CVGS RPC is expected to report the error... as long as the SP hasn't already cleaned up after the failed CVGS op, right?
create-volume-group-snapshot (CVGS) will not return success with a group_snapshot_id before all the snapshots are cut. Post-processing happens after this point. This sentence "SP SHOULD do clean up and make sure no snapshots are leaked" refers to the situation before all the snapshots are cut successfully. So "If create-volume-group-snapshot (CVGS) fails after some subset of snapshots was created w/ some subset failing (because some post-processing op failed)", a gRPC error code will be returned back to CO at this point. It does not rely on a subsequent call to report error. The create operation is synchronous until those IDs are returned. Yes, post-processing is async, but CO should clean up those resources when post-processing starts because it has got those IDs and CO does not need to rely on SP for the cleanups. I'll clarify this in the spec that "SP clean up" refers to before post-processing starts.
IMO this needs more thought: the CO should be able to observe a failure and report it back to either some controller, or admin. Which likely means that magical auto-cleanup for failed CVGS (and maybe even for single create-snapshot) ops, performed by the SP, should go away.
Cleanup by SP is needed if an error occurs before a CO can get group_snapshot_id and snapshot_ids. Without those IDs, CO does not have a way to clean up even if it observes a failure. @bswartz did suggest to have a new delete API that takes in all the parameters from a create operation during our discussions. This applies to not just CreateSnapshot and CreateVolumeGroupSnapshot, but also CreateVolume.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll put an agenda item on tomorrow's monthly meeting to discuss this issue. It's a wider class of problems that affects failed creates for all CSI objects: volumes and snapshots included.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When Xing and I discussed this, the idea was that the SP would only clean up when reporting a terminal error. Non-terminal errors would not trigger a clean up, and cleanup would never occur until after an error has been reported to the CO.
At the time I was able to convince myself that this guarantee was "good enough" to stay out of trouble, and follows the existing pattern of CSI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the comments to this:
If an error occurs before all the individual snapshots are cut when creating a group snapshot of multiple volumes and a group_snapshot_id is not yet available for CO to do clean up, SP MUST return an error, and SP SHOULD also do clean up and make sure no snapshots are leaked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jdef We discussed this in today's CSI Community Sync and decided not to introduce a name parameter in the Delete operation. Can you please review the updated PR and see if it looks good?
Comments are addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COs may want to be able to require a minimum level of consistency for group snapshots. Users may want to be able to discover the consistency of a group snapshot that was previously created. To address these use cases, I propose adding a new optional field to CreateVolumeGroupSnapshotRequest
and VolumeGroupSnapshot
:
enum VolumeGroupSnapshotConsistency {
CONSISTENCY_UNSPECIFIED = 0;
CONSISTENCY_CRASH_CONSISTENT = 1;
CONSISTENCY_FILESYSTEM_CONSISTENT = 2;
CONSISTENCY_APPLICATION_CONSISTENT = 3;
}
message CreateVolumeGroupSnapshotRequest {
...
// This field is OPTIONAL. Indicates that the request should fail
// if the Plugin cannot guarantee snapshot consistency at the
// the requested level or higher.
VolumeGroupSnapshotConsistency consistency = 5;
}
message VolumeGroupSnapshot {
...
// This field is OPTIONAL. Indicates the consistency of the snapshot
// as ascertained by the Plugin and the underlying storage provider
// at the time that the group snapshot was created. A value of
// CONSISTENCY_UNSPECIFIED indicates that the consistency level is
// not known.
VolumeGroupSnapshotConsistency consistency = 5;
}
It's true that better degrees of consistency exist and that end users may want to know about them. Unfortunately CSI is too low level to provide any guarantees other than "crash consistent". The better forms of consistency must be implemented at a higher level and thus should be tracked at a higher level. |
Note that there is a WIP KEP in Kubernetes that tries to address application consistency: kubernetes/enhancements#1995 |
9bd6f4e
to
7445a3a
Compare
@saad-ali I have addressed your comment. PTAL. Thanks. |
@saad-ali Your comments are addressed. Please take another look. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for all the hard work on this @xing-yang !!
Indeed, thanks @xing-yang for adopting all the review thoughts and getting into the consensus! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty close to the finish line. mostly nits, except for concerns around the newly added delete RPCs (and dissonance w/ the newly added get RPC) - PTAL
if DeleteSnapshot fails because snapshot is part of a group snapshot.
@jdef Your comments are addressed. PTAL. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
KEP in Kubernetes: kubernetes/enhancements#1551