[main] Add flag to control CRD update features #114
Conversation
Force-pushed from bc8034c to fdeee11
Images built for PR #114:
So with this fix in place, if you create a cluster (using k3s/rke2) and then later install the chart, you should see CRDs with two sets of timestamps: one for the cluster creation and one for the Helm chart install time.
Actually, while this seems like an improvement and the correct behavior... I'm not currently able to replicate this at all on 2.9.2. I guess I'll try to set up a 2.8 cluster and see if I can even replicate the original issue.
Oh right, the creation date is what we're seeing and that doesn't change, so I should have compared status fields to check the updated-at times. Currently rebuilding the lab to test on 2.8 anyway, but I'll remember that.
Summary of Helm-Controller Situation:
Note that I've been testing with 1.28.14 which matches up with
More info on CRD differences: https://gist.github.com/mallardduck/120b1d5d833126ae2cb8ff2df20c53c5
Force-pushed from ed9b007 to 2077d30
Force-pushed from 2077d30 to f6babbd
Images built for PR #114:
Force-pushed from d0e648b to 29374cc
Images built for PR #114:
Force-pushed from dd1bfcc to cb3f41e
Images built for PR #114:
Force-pushed from cb3f41e to d5b5c32
Images built for PR #114:
Force-pushed from ee50dc1 to 39a577d
Images built for PR #114:
Force-pushed from 39a577d to 67babd8
Images built for PR #114:
Helm Debug Values:
image:
  pullPolicy: Always
  registry: ghcr.io
  repository: rancher/prometheus-federator
  tag: pr-114-67babd8
Force-pushed from e4c5df2 to 05245f8
For now we indirectly rely on checking the nodes for the k8s runtime type; if we find k3s/rke2, we skip managing the helm-controller CRDs. In the future, we'd like to be able to query the CRD on the cluster directly and see a clear annotation/label citing the CRD's owner.
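For illustration only, that future direction could look roughly like the sketch below: fetch the CRD via the apiextensions client and inspect an ownership annotation. The annotation key and value used here are hypothetical placeholders; no such convention exists in this PR.

package main

import (
	"context"

	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

// crdOwnedByHelmController is an illustrative helper, not part of this PR.
func crdOwnedByHelmController(cfg *rest.Config, crdName string) (bool, error) {
	client, err := apiextensionsclient.NewForConfig(cfg)
	if err != nil {
		return false, err
	}
	crd, err := client.ApiextensionsV1().CustomResourceDefinitions().Get(context.TODO(), crdName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// "example.cattle.io/crd-owner" is a placeholder key; a real implementation would
	// use whatever annotation/label convention eventually identifies the CRD's owner.
	return crd.Annotations["example.cattle.io/crd-owner"] == "helm-controller", nil
}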
Force-pushed from 05245f8 to 54fef57
Images built for PR #114:
Helm Debug Values:
image:
  pullPolicy: Always
  registry: ghcr.io
  repository: rancher/prometheus-federator
  tag: pr-114-94d7ddf
The other thing we need to start doing is adding behavioral tests in Go for every code change, so that expected behavior is established somewhere other than just documentation and CI can catch regressions.
This PR should include those tests as well, and I'm happy to sync up about them.
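As a concrete illustration, a minimal behavioral test for the detection-disabled path of shouldManageHelmControllerCRDs (from the excerpt below) might look like this sketch; it assumes the test lives in the same package as that function and that DETECT_K3S_RKE2 is read on each call.

package main

import "testing"

// Sketch only: exercises the early-return path when k3s/rke2 detection is disabled.
func TestShouldManageHelmControllerCRDs_DetectionDisabled(t *testing.T) {
	// With detection disabled, the function should report that helm-controller
	// CRDs are managed without ever touching the rest.Config, so nil is safe here.
	t.Setenv("DETECT_K3S_RKE2", "false")
	if !shouldManageHelmControllerCRDs(nil) {
		t.Fatal("expected helm-controller CRDs to be managed when detection is disabled")
	}
}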
func shouldManageHelmControllerCRDs(cfg *rest.Config) bool {
	if os.Getenv("DETECT_K3S_RKE2") != "true" {
		logrus.Debug("k3s/rke2 detection feature is disabled; `helm-controller` CRDs will be managed")
		return true
	}

	k8sRuntimeType, err := identifyKubernetesRuntimeType(cfg)
	if err != nil {
		logrus.Error(err)
	}

	onK3sRke2 := k8sRuntimeType == "k3s" || k8sRuntimeType == "rke2"
	if onK3sRke2 {
		logrus.Debug("the cluster is running on k3s (or rke2), `helm-controller` CRDs will not be managed")
	}

	return !onK3sRke2
}

func identifyKubernetesRuntimeType(clientConfig *rest.Config) (string, error) {
	client, err := clients.NewFromConfig(clientConfig, nil)
	if err != nil {
		return "", err
	}

	nodes, err := client.Core.Node().List(metav1.ListOptions{})
	if err != nil {
		logrus.Fatalf("Failed to list nodes: %v", err)
	}
	instanceTypes := make(map[string]int)
	for _, node := range nodes.Items {
		instanceType, exists := node.Labels["node.kubernetes.io/instance-type"]
		if exists {
			instanceTypes[instanceType]++
		} else {
			logrus.Debugf("Cannot find `node.kubernetes.io/instance-type` label on node `%s`", node.Name)
		}
	}

	if len(instanceTypes) == 0 {
		return "", errors.New("cannot identify k8s runtime type; no nodes in cluster have expected label")
	}

	var k8sRuntimeType string
	for instanceType := range instanceTypes {
		k8sRuntimeType = instanceType
		break
	}

	return k8sRuntimeType, nil
}
Going back up to the previous discussion: if we can't detect the owners of CRDs through other methods, then we shouldn't have auto-detect logic at all.
It can introduce many incompatibilities across many versions. Does it work on older rke2 and k3s versions? Will it work going forward? How will we use a test matrix to ensure it always works? It's what we call an out-of-band dependency, and those should generally be avoided if possible, because they introduce unnecessary complexity both in the code and, in this instance especially, in testing.
I'm almost certain that's also why the original controller has a toggle to use an embedded controller or not, rather than detecting specific distributions. In this case, I'd heavily consider a toggle to just filter CRD updates and omit the detection logic altogether.
If we were to introduce distribution-specific behavior, the existing place where Rancher and Rancher Observability sort of handle that is in the charts themselves, with distribution-specific resource definitions, which in my mind is also overkill for this specific instance.
Agreed to leave auto-detection as an experimental feature and to omit tests so we can move it to prod ASAP.
Force-pushed from 7ed8290 to 2cff460
)

// IdentifyKubernetesRuntimeType provides the k8s runtime used on nodes in the cluster.
// Deprecated: This feature is a stop gap not expected to be maintained long-term.
Figured we'd just go for it and deprecate from day 1, even though it feels weird.
Big picture, it feels better since it's a clear signal that this method isn't intended to be used long term, nor is it something that needs to be maintained once a better mechanism to get the same information is in place.
# When false, only missing CRDs will be installed, and existing ones will not be updated.
update: true

# !! EXPERIMENTAL OPTION !! - Use this feature with caution and careful consideration.
Similarly, warn end-users that this value is experimental and should be used with care. Ultimately, we added it in a way that won't cause issues if we eventually remove it, or even if customers use unexpected combinations of the PromFed chart and PromFed images.
}
}
	return
}

// List returns the list of CRDs and dependent CRDs for this operator
func List() ([]crd.CRD, []crd.CRD) {            // old signature
func List() ([]crd.CRD, []crd.CRD, []crd.CRD) { // new signature
	// TODO: The underlying crd.CRD is deprecated and will eventually be removed
Leaving a future TODO to remind us that this crd.CRD is deprecated.
	return true
}

// TODO: In the future, this should not rely on detecting k8s runtime type
Finally, a very clear TODO here to make it even more obvious that this will be removed in the future.
Images built for PR #114:
Helm Debug Values:
image:
  pullPolicy: Always
  registry: ghcr.io
  repository: rancher/prometheus-federator
  tag: pr-114-2cff460
Related Issue: rancher/rancher#46764 / SURE-8872
Per the title, this PR adds features that allow users to opt out of CRD updates and to disable helm-controller CRDs on k3s/rke2.
In general, if a user opts out of the CRD updates (via .crdManagement.update: false), then they will be responsible for either:
a) manually applying the expected CRD updates, or
b) enabling the CRD updates value for a short period of time and then re-disabling it.
For the CRDs that we inherit from helm-controller, users can opt out of PromFed managing them at all. In that case, the CRDs would be managed by helm-controller. However, due to changes in helm-controller, there can be multiple operators interacting with them without causing issues: they will only actually act on the CRD resources whose annotations cite them (the controller in question) as the manager.
The description of the original issue isn't as accurate today as it used to be (i.e. with recent k3s/rke2 versions things work differently). However, this PR seeks to address the root concern of the issue even if the conditions on modern distros are different.
New values added:
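As a rough sketch of the shape these values take, pieced together from the rest of this PR: the crdManagement.update key and the detectK3sRke2 value are referenced elsewhere in this conversation, while the exact nesting and comment text below are assumptions (any separate toggle for the PromFed-shipped helm-controller is not shown).

crdManagement:
  # When false, only missing CRDs will be installed, and existing ones will not be updated.
  update: true
  # !! EXPERIMENTAL OPTION !! - When true, skip managing helm-controller CRDs on k3s/rke2 clusters.
  detectK3sRke2: false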
These defaults will ship to provide the existing functionality out of the box. This preserves the behavior many users expect today, so as not to create issues for them due to unexpected configuration changes.
Users on k3s/RKE2 may want to change detectK3sRke2 to true for the simplest out-of-the-box experience. However, they can go as far as also disabling CRD updates and/or the PromFed-shipped helm-controller. When using these options, they'll need to ensure the proper configuration so that the cluster's built-in helm-controller will manage the Charts for PromFed.
Future plans
will manage the Charts for PromFed.Future plans
Once we have a more complete frameworks level solution for tracking CRD ownership, we can immediately replace the k3s/rke2 specific logic to use that instead. (And remove added helpers that check the nodes) Then we can simultaneously deprecate and remove the chart values as they will have no affect nor will break anything if left with the methods used to implement the feature here.