
[main] Add flag to control CRD update features #114

Merged
merged 20 commits into from
Dec 13, 2024

Conversation

mallardduck
Member

@mallardduck mallardduck commented Oct 24, 2024

Related Issue: rancher/rancher#46764 / SURE-8872

Per the title, this PR adds features that allow users to opt out of CRD updates and to disable the helm-controller CRDs on k3s/rke2.

In general, if a user opts out of CRD updates (via `.crdManagement.update: false`), they become responsible for either:
a) manually applying the expected CRD updates, or
b) enabling the CRD updates value for a short period of time and then re-disabling it.

For the CRDs we inherit from helm-controller, users can opt out of PromFed managing them at all; in that case, the CRDs are managed by helm-controller itself. Thanks to changes in helm-controller, multiple operators can now interact with these CRDs without causing issues: each controller only acts on CRD resources whose annotations cite it as the manager.


The original issue's description is no longer as accurate as it once was (i.e. on recent k3s/rke2 versions things work differently). However, this PR addresses the root concern of the issue even though the conditions on modern distros have changed.


New values added:

# Configure how the operator will manage the CustomResourceDefinitions (CRDs) it needs to function.
crdManagement:
  # Enable or disable automatic updates of CRDs during startup.
  # When true, all CRDs will be updated to the version the operator provides.
  # When false, only missing CRDs will be installed, and existing ones will not be updated.
  update: true
  # Specify whether the operator should detect K3s and RKE2 clusters and exclude `helm-controller` CRDs from management.
  # When true, `helm-controller` CRDs will not be managed by the operator in these environments, as K3s/RKE2 handle them internally.
  # When false, the operator will manage `helm-controller` CRDs regardless of the runtime environment.
  detectK3sRke2: false

These defaults ship with the existing functionality out of the box. This preserves the behavior most users would expect today, so that unexpected configuration changes don't create issues for them.

Users on k3s/RKE2 may want to set detectK3sRke2 to true for the simplest out-of-the-box experience. However, they can go further and also disable CRD updates and/or the helm-controller that PromFed ships. When using these options, they'll need to ensure the cluster's built-in helm-controller is properly configured to manage the Charts for PromFed.
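For example, a k3s/rke2 user who wants the cluster's built-in helm-controller to own its CRDs, and who accepts responsibility for applying CRD updates manually, could override values like this (a sketch based on the options above; only the keys introduced by this PR are used):

```yaml
# values.yaml override for a k3s/rke2 cluster:
# - skip managing the helm-controller CRDs (the cluster handles them)
# - stop updating the remaining CRDs on startup (only install missing ones)
crdManagement:
  update: false
  detectK3sRke2: true
```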


Future plans

Once we have a more complete framework-level solution for tracking CRD ownership, we can immediately replace the k3s/rke2-specific logic with it (and remove the added helpers that check the nodes). We can then simultaneously deprecate and remove the chart values: with the method used to implement the feature here, leaving them set will have no effect and won't break anything.


Images built for PR #114:

  • Helm Locker: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator/helm-locker:pr-114 ghcr.io/rancher/prometheus-federator/helm-locker:pr-114-4bc6a44

  • Helm Project Operator: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114 ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114-4bc6a44

  • Prometheus Federator: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator:pr-114 ghcr.io/rancher/prometheus-federator:pr-114-4bc6a44

@mallardduck mallardduck changed the title Crd install bugfix [main] Fix CRDs being installed even when they already exist and don't need updating Oct 25, 2024
@mallardduck mallardduck marked this pull request as ready for review October 25, 2024 21:26
@mallardduck mallardduck requested a review from a team as a code owner October 25, 2024 21:26

@mallardduck
Member Author

mallardduck commented Oct 28, 2024

So with this fix in place, if you create a cluster (using k3s/rke2) and later install the chart, you should see CRDs with two sets of timestamps: one from cluster creation and one from the helm chart install time. The helmcharts and helmchartconfigs will carry the creation time, like:

helmchartconfigs.helm.cattle.io                                   2024-10-28T00:53:16Z
helmcharts.helm.cattle.io                                         2024-10-28T00:53:16Z
helmreleases.helm.cattle.io                                       2024-10-28T15:23:18Z
projecthelmcharts.helm.cattle.io                                  2024-10-28T15:23:18Z

Actually, while this seems like an improvement and the correct behavior... I'm currently unable to replicate this at all on 2.9.2 with 104.0.0+up0.4.0, which should behave the same as the reported prometheus-federator-103.0.2+up0.4.0 that has the issue.

I guess I'll try to setup 2.8 cluster and see if I can even replicate the original issue.


Oh right - the creation date is what we're seeing, and that doesn't change. I should have compared status fields to check the updated-at times instead. I'm currently rebuilding the lab to test on 2.8 anyway, but I'll remember that.

@mallardduck
Member Author

Summary of Helm-Controller Situation:

| Helm Controller | Release Date | CRD Changes (vs. Current) | RKE2 | K3S | PromFed | PromFed Update Safety |
|---|---|---|---|---|---|---|
| v0.13.1 | Nov 23, 2022 | N/A | v1.26.1+rke2r1, v1.25.6+rke2r1 | v1.26.3+k3s1, v1.25.8+k3s1 | Current | N/A |
| v0.13.3 | Apr 4, 2023 | False | N/A | v1.27.1+k3s1, v1.26.4+k3s1, v1.25.9+k3s1 | N/A | Very Low Risk |
| v0.14.0 | May 10, 2023 | True | (skipped, not helpful info) | (skipped, not helpful info) | N/A | Higher Risk |
| v0.15.4 | Aug 15, 2023 | True | v1.29.4+rke2r1, v1.28.9+rke2r1, v1.27.13+rke2r1, v1.26.12+rke2r1, v1.25.16+rke2r2 | v1.29.0+k3s1, v1.28.5+k3s1, v1.27.9+k3s1, v1.26.12+k3s1, v1.25.16+k3s4 | N/A | Higher Risk |
| v0.16.4 | Sep 5, 2024 | True; even more | v1.30.5+rke2r1 | v1.30.5+k3s1 | N/A | Very High Risk |

Note that I've been testing with 1.28.14, which matches up with v0.15.13 of helm-controller. And as noted in the table, v0.13.1 of helm-controller currently lines up with the one PromFed ships.

More info on CRD differences: https://gist.github.com/mallardduck/120b1d5d833126ae2cb8ff2df20c53c5


github-actions bot commented Dec 9, 2024

Images built for PR #114:

  • Helm Locker: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator/helm-locker:pr-114 ghcr.io/rancher/prometheus-federator/helm-locker:pr-114-0ecd192

  • Helm Project Operator: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114 ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114-0ecd192

  • Prometheus Federator: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator:pr-114 ghcr.io/rancher/prometheus-federator:pr-114-0ecd192

@mallardduck mallardduck force-pushed the crd-install-bugfix branch 2 times, most recently from d0e648b to 29374cc Compare December 9, 2024 21:08
@mallardduck mallardduck changed the title [main] Fix CRDs being installed even when they already exist and don't need updating [main] Add flag to control CRD update features Dec 9, 2024
@mallardduck mallardduck requested review from alexandreLamarre and removed request for joshmeranda December 9, 2024 21:15



Images built for PR #114:

  • Helm Locker: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator/helm-locker:pr-114-cb3f41e ghcr.io/rancher/prometheus-federator/helm-locker:pr-114

  • Helm Project Operator: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114-cb3f41e ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114

  • Prometheus Federator: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator:pr-114-cb3f41e ghcr.io/rancher/prometheus-federator:pr-114


Images built for PR #114:

  • Helm Locker: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator/helm-locker:pr-114-d5b5c32 ghcr.io/rancher/prometheus-federator/helm-locker:pr-114

  • Helm Project Operator: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114-d5b5c32 ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114

  • Prometheus Federator: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator:pr-114-d5b5c32 ghcr.io/rancher/prometheus-federator:pr-114


@mallardduck mallardduck force-pushed the crd-install-bugfix branch 2 times, most recently from ee50dc1 to 39a577d Compare December 10, 2024 21:13

Images built for PR #114:

  • Helm Locker: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator/helm-locker:pr-114-39a577d ghcr.io/rancher/prometheus-federator/helm-locker:pr-114

  • Helm Project Operator: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114-39a577d ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114

  • Prometheus Federator: Link to image:
    Tags: ghcr.io/rancher/prometheus-federator:pr-114-39a577d ghcr.io/rancher/prometheus-federator:pr-114


Images built for PR #114:

  • Helm Locker - Tags:
    • ghcr.io/rancher/prometheus-federator/helm-locker:pr-114-67babd8 ghcr.io/rancher/prometheus-federator/helm-locker:pr-114
  • Helm Project Operator - Tags:
    • ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114-67babd8 ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114
  • Prometheus Federator - Tags:
    • ghcr.io/rancher/prometheus-federator:pr-114-67babd8 ghcr.io/rancher/prometheus-federator:pr-114

Helm Debug Values:

  image:
    pullPolicy: Always
    registry: ghcr.io
    repository: rancher/prometheus-federator
    tag: pr-114-67babd8

For now we indirectly rely on checking nodes for the k8s runtime type; if we find k3s/rke2, we skip those CRDs.
In the future, we'd like to be able to directly query the CRD on the cluster and see a clear annotation/label citing the CRD's owner.

Images built for PR #114:

  • Helm Locker - Tags:
    • ghcr.io/rancher/prometheus-federator/helm-locker:pr-114-94d7ddf ghcr.io/rancher/prometheus-federator/helm-locker:pr-114
  • Helm Project Operator - Tags:
    • ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114-94d7ddf ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114
  • Prometheus Federator - Tags:
    • ghcr.io/rancher/prometheus-federator:pr-114-94d7ddf ghcr.io/rancher/prometheus-federator:pr-114

Helm Debug Values:

  image:
    pullPolicy: Always
    registry: ghcr.io
    repository: rancher/prometheus-federator
    tag: pr-114-94d7ddf

Contributor

@alexandreLamarre alexandreLamarre left a comment


The other thing we need to start doing is adding behavioral tests in Go for every code change: establishing clear expected behavior in something other than just documentation, and having CI catch regressions.

This PR should include those tests as well and I'm happy to sync up about those.
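For instance, if the detection gate were factored into a pure function, a behavioral test could pin the expected truth table down in CI. This is a self-contained sketch: the real function takes a *rest.Config and queries nodes, so the signature and the idea of passing the runtime type in directly are assumptions, not the PR's actual code.

```go
package main

import "fmt"

// shouldManageHelmControllerCRDs is a simplified stand-in for the PR's logic:
// when detection is disabled the operator always manages the CRDs; when it is
// enabled, management is skipped on k3s/rke2 (hypothetical refactor where the
// runtime type is an argument instead of being queried from the cluster).
func shouldManageHelmControllerCRDs(runtimeType string, detectK3sRke2 bool) bool {
	if !detectK3sRke2 {
		return true
	}
	return runtimeType != "k3s" && runtimeType != "rke2"
}

func main() {
	cases := []struct {
		runtime string
		detect  bool
		want    bool
	}{
		{"k3s", true, false},  // detection on: the distro owns these CRDs
		{"rke2", true, false}, // same for rke2
		{"aws", true, true},   // other runtimes: operator keeps managing
		{"k3s", false, true},  // detection off preserves current behavior
	}
	for _, c := range cases {
		got := shouldManageHelmControllerCRDs(c.runtime, c.detect)
		if got != c.want {
			panic(fmt.Sprintf("runtime=%s detect=%v: got %v, want %v", c.runtime, c.detect, got, c.want))
		}
	}
	fmt.Println("all cases pass")
}
```

Keeping the decision in a pure function like this is what makes it testable without a live cluster; the node-listing part can then be covered separately.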

Comment on lines 253 to 302
func shouldManageHelmControllerCRDs(cfg *rest.Config) bool {
	if os.Getenv("DETECT_K3S_RKE2") != "true" {
		logrus.Debug("k3s/rke2 detection feature is disabled; `helm-controller` CRDs will be managed")
		return true
	}

	k8sRuntimeType, err := identifyKubernetesRuntimeType(cfg)
	if err != nil {
		logrus.Error(err)
	}

	onK3sRke2 := k8sRuntimeType == "k3s" || k8sRuntimeType == "rke2"
	if onK3sRke2 {
		logrus.Debug("the cluster is running on k3s (or rke2), `helm-controller` CRDs will not be managed")
	}

	return !onK3sRke2
}

func identifyKubernetesRuntimeType(clientConfig *rest.Config) (string, error) {
	client, err := clients.NewFromConfig(clientConfig, nil)
	if err != nil {
		return "", err
	}

	nodes, err := client.Core.Node().List(metav1.ListOptions{})
	if err != nil {
		logrus.Fatalf("Failed to list nodes: %v", err)
	}
	instanceTypes := make(map[string]int)
	for _, node := range nodes.Items {
		instanceType, exists := node.Labels["node.kubernetes.io/instance-type"]
		if exists {
			instanceTypes[instanceType]++
		} else {
			logrus.Debugf("Cannot find `node.kubernetes.io/instance-type` label on node `%s`", node.Name)
		}
	}

	if len(instanceTypes) == 0 {
		return "", errors.New("cannot identify k8s runtime type; no nodes in cluster have expected label")
	}

	var k8sRuntimeType string
	for instanceType := range instanceTypes {
		k8sRuntimeType = instanceType
		break
	}

	return k8sRuntimeType, nil
}
Contributor


Going back up to the previous discussion: if we can't detect the owners of CRDs by other methods, then we shouldn't have auto-detect logic at all.

It can introduce many incompatibilities across many versions - does it work on older rke2 and k3s versions? Will it work going forward? How will we use a test matrix to ensure it always works? It's what we call an out-of-band dependency, and those should generally be avoided where possible, because they introduce unnecessary complexity both in the code and, in this instance especially, in testing.

I'm almost certain that's also why the original controller has a toggle to use an embedded controller or not, rather than detecting specific distributions. In this case, I'd heavily consider a toggle that just filters CRD updates and omits the detection logic altogether.

Contributor


If we were to introduce distribution-specific behavior, the existing place where Rancher and Rancher Observability sort of handle that is in the charts themselves, with distribution-specific resource definitions - which in my mind is also overkill for this specific instance.

Contributor

@alexandreLamarre alexandreLamarre left a comment


Agreed to leave auto-detection as an experimental feature, and to omit tests so we can move it to prod ASAP.

)

// IdentifyKubernetesRuntimeType provides the k8s runtime used on nodes in the cluster.
// Deprecated: This feature is a stop gap not expected to be maintained long-term.
Member Author


Figured we'd just go for it and deprecate from day 1, even though it feels weird.

Big picture, it feels better: it's a clear signal that this method isn't intended to be used long-term, nor something that needs to be maintained once a better mechanism for getting the same information is in place.

# When false, only missing CRDs will be installed, and existing ones will not be updated.
update: true

# !! EXPERIMENTAL OPTION !! - Use this feature with caution and careful consideration.
Member Author


Similarly, warn end-users that this value is experimental and should be used with care. Ultimately we added it in a way that won't cause issues if we eventually remove it, or even if customers use unexpected combinations of the PromFed chart and PromFed images.

}
}
return
}

// List returns the list of CRDs and dependent CRDs for this operator
-func List() ([]crd.CRD, []crd.CRD) {
+func List() ([]crd.CRD, []crd.CRD, []crd.CRD) {
	// TODO: The underlying crd.CRD is deprecated and will eventually be removed
Member Author


Leave a TODO for the future to remind us that this crd.CRD is deprecated.
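The widened signature could be consumed like this minimal sketch. The types, the CRD names, and the assignment of CRDs to slices are illustrative assumptions from the PR context; the point is that keeping the helm-controller CRDs in their own slice lets callers filter them out independently.

```go
package main

import "fmt"

// CRD is a stand-in for the (deprecated) crd.CRD type used by the operator.
type CRD struct{ Name string }

// List mirrors the new three-slice shape: operator CRDs, dependent CRDs, and
// helm-controller CRDs kept separate so they can be skipped on k3s/rke2.
func List() ([]CRD, []CRD, []CRD) {
	operatorCRDs := []CRD{{"projecthelmcharts.helm.cattle.io"}}
	dependentCRDs := []CRD{{"helmreleases.helm.cattle.io"}}
	helmControllerCRDs := []CRD{
		{"helmcharts.helm.cattle.io"},
		{"helmchartconfigs.helm.cattle.io"},
	}
	return operatorCRDs, dependentCRDs, helmControllerCRDs
}

func main() {
	crds, dependent, helmController := List()
	toApply := append(append([]CRD{}, crds...), dependent...)

	// Only include the helm-controller CRDs when the operator should manage
	// them (e.g. detection found the cluster is NOT k3s/rke2).
	manageHelmController := false
	if manageHelmController {
		toApply = append(toApply, helmController...)
	}
	fmt.Println(len(toApply)) // 2 when helm-controller CRDs are skipped
}
```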

return true
}

// TODO: In the future, this should not rely on detecting k8s runtime type
Member Author


Finally, a very clear TODO here, to make it even more obvious that this will be removed in the future.


Images built for PR #114:

  • Helm Locker - Tags:
    • ghcr.io/rancher/prometheus-federator/helm-locker:pr-114-2cff460 ghcr.io/rancher/prometheus-federator/helm-locker:pr-114
  • Helm Project Operator - Tags:
    • ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114-2cff460 ghcr.io/rancher/prometheus-federator/helm-project-operator:pr-114
  • Prometheus Federator - Tags:
    • ghcr.io/rancher/prometheus-federator:pr-114-2cff460 ghcr.io/rancher/prometheus-federator:pr-114

Helm Debug Values:

  image:
    pullPolicy: Always
    registry: ghcr.io
    repository: rancher/prometheus-federator
    tag: pr-114-2cff460

@mallardduck mallardduck merged commit 9f57328 into rancher:main Dec 13, 2024
17 checks passed
@mallardduck mallardduck deleted the crd-install-bugfix branch December 13, 2024 14:05