
🐛 don't set installed status unless it is successful #1368

Conversation

everettraven
Contributor

Description

It was pointed out in an offline discussion that we still have some areas in our code where we may end up overwriting the Installed status condition fields from True to False. Namely:

In order to fix this, this PR removes setting of the Installed status condition and associated status fields from anywhere in the existing code base except when installation has been successful.

This means that the Installed status condition will:

  • Be missing until the initial install is successful (this reduces duplication of information between the Installed status condition and the Progressing status condition)
  • Be set to True and populated with the most up-to-date information on successful installation and upgrades.
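This contract can be sketched with a small, self-contained Go example. The `Condition` type and helper names below are simplified stand-ins for illustration only; the actual code uses `metav1.Condition` and the `apimeta` condition helpers:

```go
package main

import "fmt"

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// setInstalledSuccess is the only place the Installed condition is written:
// it is set to True with up-to-date bundle info after a successful install
// or upgrade. Because no failure path touches it, a past success is never
// overwritten by a later transient failure.
func setInstalledSuccess(conds []Condition, bundle string) []Condition {
	installed := Condition{
		Type:    "Installed",
		Status:  "True",
		Reason:  "Succeeded",
		Message: fmt.Sprintf("Installed bundle %s successfully", bundle),
	}
	for i, c := range conds {
		if c.Type == "Installed" {
			conds[i] = installed
			return conds
		}
	}
	return append(conds, installed)
}

func findCondition(conds []Condition, t string) *Condition {
	for i := range conds {
		if conds[i].Type == t {
			return &conds[i]
		}
	}
	return nil // absent until the first successful install
}

func main() {
	var conds []Condition
	// Before the first successful install, Installed is simply absent.
	fmt.Println(findCondition(conds, "Installed") == nil)
	conds = setInstalledSuccess(conds, "example-bundle:v1")
	fmt.Println(findCondition(conds, "Installed").Status)
}
```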

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

Signed-off-by: everettraven <everettraven@gmail.com>
@everettraven everettraven requested a review from a team as a code owner October 11, 2024 15:43

netlify bot commented Oct 11, 2024

Deploy Preview for olmv1 ready!

  • 🔨 Latest commit: 21b2777
  • 🔍 Latest deploy log: https://app.netlify.com/sites/olmv1/deploys/67094fd4df38de00080f3236
  • 😎 Deploy Preview: https://deploy-preview-1368--olmv1.netlify.app

Signed-off-by: everettraven <everettraven@gmail.com>

codecov bot commented Oct 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.90%. Comparing base (78b586a) to head (21b2777).
Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1368      +/-   ##
==========================================
+ Coverage   74.83%   74.90%   +0.06%     
==========================================
  Files          42       42              
  Lines        2515     2502      -13     
==========================================
- Hits         1882     1874       -8     
+ Misses        449      445       -4     
+ Partials      184      183       -1     
Flag Coverage Δ
e2e 56.59% <ø> (-0.15%) ⬇️
unit 52.63% <ø> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.


…subsequent reconcile

Signed-off-by: everettraven <everettraven@gmail.com>
Contributor Author

Not sure why the diff on this file is all messed up. Essentially what I did here was:

  • Remove the Installed condition check from TestClusterExtensionResolutionAndUnpackSuccessfulApplierFails since we no longer expect it to have been set as part of applier failure (I can change this to an explicit nil check if desired)
  • Remove the TestClusterExtensionApplierFailsWithBundleInstalled function in favor of a new TestClusterExtensionUpgradeFails test function that asserts that the Installed condition is not modified if there is an error returned from each interface (individually) on a subsequent reconciliation

Contributor

Do you need to rebase?

Contributor

It does seem excessive?

Contributor Author

Maybe? I think it's more related to the removal of a function in the middle of the file and the addition of one toward the end than to a need to rebase.

Contributor Author

I don't see any merge conflicts, so it's probably not related to a need to rebase. I think this strictly has to do with the positioning of where code was removed and added. I can fiddle with it a bit if that seems valuable.

Contributor

Yeah, it looks as though removing a function in the middle of a bunch of similar functions is causing chaos... If I pull the PR down locally, I might be able to do a proper incantation of diff options to make it easier to review.

Member

The code diff looks fine to me. If you change the first line of a code block and also the last few lines, I think git considers the whole block to have changed.

Contributor

The --patience option to diff makes it (more) clear that you deleted and added a function.

cond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1alpha1.TypeInstalled)
if assert.NotNil(ct, cond) {
    assert.Equal(ct, metav1.ConditionFalse, cond.Status)
    assert.Equal(ct, ocv1alpha1.ReasonFailed, cond.Reason)
    assert.Contains(ct, cond.Message, "forbidden")
}
Contributor

Seems odd that this line was kept? It was part of the EventuallyWithT() that was deleted; I'm not sure it should be added to the previous one.

Contributor Author

I think this is correct here. The logic behind moving it to the previous one is that we no longer set the Installed condition failure here, but rather reflect the failure in the Progressing status condition (which is the condition we are checking immediately before this). In this case we expect the Progressing condition to eventually be True, with a reason of Retrying, and a message signaling that there were insufficient permissions attached to the ServiceAccount (asserted by looking for the "forbidden" string in the message).

I could be wrong here, but I believe we want to ensure we aren't progressing further in the test until we've identified that the failure is related to permissions and this should do that.

Member

@LalatenduMohanty LalatenduMohanty left a comment

The commit message is very brief and does not contain the amount of information available in the PR description. I suggest that we add more information to the commit message so that it will be easier for anyone trying to understand the commit in the future.

@everettraven
Contributor Author

@LalatenduMohanty I could be misremembering, but I was under the impression that merges in this repo default to using the PR title and description for the commit message.

If that understanding is incorrect, I am happy to squash the commits here and write a better commit message.

@LalatenduMohanty
Member

LalatenduMohanty commented Oct 11, 2024

@everettraven Here is an example: fba8473. That PR had two commits; while merging, they were squashed into one commit, but the commit messages from all commits were kept. So if you add a commit message for a commit, it won't be lost after the commits are squashed.

@LalatenduMohanty
Member

Overall the PR looks good to me. Once we add an improved commit message (as mentioned in my previous comments), I can approve the PR.

Contributor

@tmshort tmshort left a comment

/lgtm
Looks much better with the --patience diff option.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 11, 2024
Comment on lines -208 to -210
setInstallStatus(ext, nil)
// TODO: use Installed=Unknown
setInstalledStatusConditionFailed(ext, err.Error())

Member

I'm not sure I agree with this set of removals.

If we fail to get the installed bundle, that seems like a pretty direct signal for Installed=Unknown. In this case, something is majorly wrong with our ability to contact the api server, or we got an ephemeral error, and we'll hopefully set this back to Installed=True when the ephemeral issue goes away. For this, I can see an argument for holding off on setting Installed=Unknown until we see the same error N times in a row so that we avoid flapping, but it seems like we should eventually set Installed=Unknown if we truly can't get the installed bundle.

If we get installedBundle==nil, that's a clear signal for Installed=False. In this case, it seems like we could set the Installed condition to False if we get any other subsequent errors during reconcile.

Member

@joelanford joelanford left a comment

I'm not sure I agree with all of these changes. It seems like these move us away from level-based and into edge-based logic.

IMO, we should assess the current state of the world and update our status accordingly on every reconcile. We had an issue in catalogd recently that we had to fix to move from edge-based back to level-based logic.

I think there are some improvements we could make to prevent unnecessary flapping, but it seems counter to the eventually consistent assumptions of Kubernetes for us to make assumptions about what happened in the past and disregard what we're actually seeing in the present.

@everettraven
Contributor Author

I'm not sure I agree with all of these changes. It seems like these move us away from level-based and into edge-based logic.

IMO, we should assess the current state of the world and update our status accordingly on every reconcile. We had an issue in catalogd recently that we had to fix to move from edge-based back to level-based logic.

I think there are some improvements we could make to prevent unnecessary flapping, but it seems counter to the eventually consistent assumptions of Kubernetes for us to make assumptions about what happened in the past and disregard what we're actually seeing in the present.

@joelanford while I understand what you are saying, and don't entirely disagree, where I am coming from with this PR is that we want to achieve the following with the Installed status condition:

  • Before an initial installation is successful, we want a user to interpret Installed as False OR Unknown.
  • After an initial installation is successful, we want a user to interpret Installed as True regardless of subsequent failures to upgrade, even if a failure results in a degradation of the initial installation

When I was looking into this, I was originally looking at it through the same lens of updating our status based on our knowledge of the current state of the world, but this seemed to introduce quite a bit of complexity that I wasn't confident would prevent false-negative flapping (i.e. saying something isn't installed when it is). More specifically, these are the challenges I recall running into:

  • If we fail to fetch an installed bundle, receiving an error from the InstalledBundleGetter:
    • How can we be sure that we don't misinform end users when updating the Installed status?
    • Are we okay with duplicating information on the Installed and Progressing status conditions?
  • If we receive a nil BundleMetadata from the InstalledBundleGetter, which, based on the following, is possible even if we found a release but it has some unknown status:

    switch rel.Info.Status {
    case release.StatusUnknown:
        return nil, fmt.Errorf("installation status is unknown")
    case release.StatusDeployed, release.StatusUninstalled, release.StatusSuperseded, release.StatusFailed:
    case release.StatusUninstalling, release.StatusPendingInstall, release.StatusPendingRollback, release.StatusPendingUpgrade:
        return nil, fmt.Errorf("installation is still pending: %s", rel.Info.Status)
    default:
        return nil, fmt.Errorf("unknown installation status: %s", rel.Info.Status)
    }

    • Are we sure that we haven't actually attempted an install before? This problem in particular could probably be remedied by making it such that the InstalledBundleGetter's expectation is that if it finds a release, regardless of state, the bundle metadata is returned.
  • If we "fix" the above point and receive a populated BundleMetadata from the InstalledBundleGetter alongside an error due to "invalid" status:
    • How do we ensure that the BundleMetadata we received is for an initial installation attempt vs a failed upgrade attempt?
    • If we add the release revision to the BundleMetadata, that gives a bit more insight, but my understanding is that the revision is increased even on "upgrade" operations applied to a failed initial install. How can we be confident that a release revision range corresponds to an "initial installation attempt"?
  • To potentially resolve the complexities I've called out above, we could add some kind of caching layer as an additional mechanism for cross-checking the information we've received from our InstalledBundleGetter (or the InstalledBundleGetter itself uses it), but then:
    • You add yet another caching layer
    • Should the cache be memory based? What happens if the container crashes and we lose that information? Would potentially overwriting the Installed status condition in this case be acceptable?
    • Should the cache be written to disk to preserve it between potential container crashes/restarts (probably)?

Hopefully my response here shows the complexity involved in managing the Installed condition in the way we've outlined in the RFC when taking a level-based approach. I'm by no means saying one way is right or wrong; I'm simply calling out the complexities involved.

Going the route of this PR, you resolve the two requirements I shared above by:

  • Not setting the Installed status condition until the installation has actually been successful (observed by reaching a specific part of our code). This means that failures up until that point will be reflected in the Progressing condition. Pairing a non-existent Installed condition with a present Progressing condition should give an end user enough insight to say "This ClusterExtension has not yet successfully installed, but is progressing towards the installed state". If the Installed condition is present and the Progressing condition is signaling errors, an end user should be able to conclude that "This ClusterExtension has been successfully installed, but is currently failing to progress through an upgrade".
  • Only setting the Installed status condition when we know for sure that an installation or an upgrade has been successful ensures that we don't run the risk of providing a false-negative on the Installed condition based on our observed state being incomplete.
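The interpretation described in these bullets can be sketched as a small decision function. This is a hypothetical client-side helper for illustration, not part of the actual API or controller:

```go
package main

import "fmt"

// summarize sketches how a client could read the two conditions under this
// PR's contract: a missing Installed condition plus a healthy Progressing
// condition means "not yet installed, progressing"; Installed present plus
// a failing Progressing condition means "installed, upgrade failing".
func summarize(installedPresent bool, progressingHealthy bool) string {
	switch {
	case !installedPresent && progressingHealthy:
		return "not yet installed; progressing toward installation"
	case !installedPresent:
		return "not yet installed; initial installation is failing"
	case progressingHealthy:
		return "installed and healthy"
	default:
		return "installed, but failing to progress through an upgrade"
	}
}

func main() {
	fmt.Println(summarize(false, true))
	fmt.Println(summarize(true, false))
}
```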

I'll reiterate: I'm not saying one way is right or wrong. I'm simply calling out that there is significant complexity involved in setting the Installed condition based on what our controller observes (which could be incomplete), and that the approach of this PR has significantly less complexity. There is obviously a trade-off either way, but given the "time crunch" we've been under for shipping v1.0.0, I went with this approach because it:

  • Achieves the agreed upon requirements for the Installed condition
  • Is faster to implement and thus allows us to reach our v1.0.0 milestone faster
  • Doesn't prohibit us from spending more time on this to get back to a more level-based approach post-v1.0.0

@LalatenduMohanty
Member

/hold
Adding a hold as we want to discuss the changes first.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2024
@everettraven
Contributor Author

Superseded by #1379

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged.