
🐛 don't set installed status unless it is successful #1368

Conversation

everettraven
Contributor

Description

It was pointed out in an offline discussion that we still have some areas in our code where we may end up overwriting the Installed status condition fields from True to False. Namely:

In order to fix this, this PR removes setting of the Installed status condition and associated status fields from anywhere in the existing code base except when installation has been successful.

This means that the Installed status condition will:

  • Be missing until the initial install is successful (this reduces duplication of information between the Installed status condition and the Progressing status condition)
  • Be set to True and populated with the most up-to-date information on successful installation and upgrades.
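This contract can be sketched with a small, self-contained Go example. The `Condition` type and helper names below are simplified stand-ins for illustration only; the actual code uses `metav1.Condition` and the `apimeta` condition helpers:

```go
package main

import "fmt"

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// setInstalledSuccess is the only place the Installed condition is written:
// it is set to True with up-to-date bundle info after a successful install
// or upgrade. Because no failure path touches it, a past success is never
// overwritten by a later transient failure.
func setInstalledSuccess(conds []Condition, bundle string) []Condition {
	installed := Condition{
		Type:    "Installed",
		Status:  "True",
		Reason:  "Succeeded",
		Message: fmt.Sprintf("Installed bundle %s successfully", bundle),
	}
	for i, c := range conds {
		if c.Type == "Installed" {
			conds[i] = installed
			return conds
		}
	}
	return append(conds, installed)
}

func findCondition(conds []Condition, t string) *Condition {
	for i := range conds {
		if conds[i].Type == t {
			return &conds[i]
		}
	}
	return nil // absent until the first successful install
}

func main() {
	var conds []Condition
	// Before the first successful install, Installed is simply absent.
	fmt.Println(findCondition(conds, "Installed") == nil)
	conds = setInstalledSuccess(conds, "example-bundle:v1")
	fmt.Println(findCondition(conds, "Installed").Status)
}
```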

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

Signed-off-by: everettraven <everettraven@gmail.com>
@everettraven everettraven requested a review from a team as a code owner October 11, 2024 15:43

netlify bot commented Oct 11, 2024

Deploy Preview for olmv1 ready!

  • 🔨 Latest commit: 21b2777
  • 🔍 Latest deploy log: https://app.netlify.com/sites/olmv1/deploys/67094fd4df38de00080f3236
  • 😎 Deploy Preview: https://deploy-preview-1368--olmv1.netlify.app

Signed-off-by: everettraven <everettraven@gmail.com>

codecov bot commented Oct 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.90%. Comparing base (78b586a) to head (21b2777).
Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1368      +/-   ##
==========================================
+ Coverage   74.83%   74.90%   +0.06%     
==========================================
  Files          42       42              
  Lines        2515     2502      -13     
==========================================
- Hits         1882     1874       -8     
+ Misses        449      445       -4     
+ Partials      184      183       -1     
Flag Coverage Δ
e2e 56.59% <ø> (-0.15%) ⬇️
unit 52.63% <ø> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.


…subsequent reconcile

Signed-off-by: everettraven <everettraven@gmail.com>
Contributor Author

Not sure why the diff on this file is all messed up. Essentially what I did here was:

  • Remove the Installed condition check from TestClusterExtensionResolutionAndUnpackSuccessfulApplierFails since we no longer expect it to have been set as part of applier failure (I can change this to an explicit nil check if desired)
  • Remove the TestClusterExtensionApplierFailsWithBundleInstalled function in favor of a new TestClusterExtensionUpgradeFails test function that asserts that the Installed condition is not modified if there is an error returned from each interface (individually) on a subsequent reconciliation

Contributor

Do you need to rebase?

Contributor

It does seem excessive?

Contributor Author

Maybe? I think it's more related to the removal of a function in the middle of the file and the addition of one toward the end than to a need to rebase.

Contributor Author

I don't see any merge conflicts, so it's probably not related to a need to rebase. I think this strictly has to do with the positioning of where code was removed and added. I can fiddle with it a bit if that seems valuable.

Contributor

Yeah, it looks as though removing a function in the middle of a bunch of similar functions is causing chaos... If I pull the PR down locally, I might be able to do a proper incantation of diff options to make it easier to review.

Member

The code diff looks fine to me. If you change the first line of a code block and also the last few lines, I think git considers the whole block to have changed.

Contributor

The --patience option to diff makes it (more) clear that you deleted and added a function.

cond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1alpha1.TypeInstalled)
if assert.NotNil(ct, cond) {
    assert.Equal(ct, metav1.ConditionFalse, cond.Status)
    assert.Equal(ct, ocv1alpha1.ReasonFailed, cond.Reason)
    assert.Contains(ct, cond.Message, "forbidden")
}
Contributor

Seems odd that this line was kept? It was part of the EventuallyWithT() that was deleted; I'm not sure it should be added to the previous one.

Contributor Author

I think this is correct here. The logic behind moving it to the previous one is that we no longer set the Installed condition failure here, but rather reflect the failure in the Progressing status condition (which is the condition we are checking immediately before this). In this case we expect the Progressing condition to eventually be True, with a reason of Retrying, and a message signaling that there were insufficient permissions attached to the ServiceAccount (asserted by looking for the "forbidden" string in the message).

I could be wrong here, but I believe we want to ensure we aren't progressing further in the test until we've identified that the failure is related to permissions and this should do that.

Member

@LalatenduMohanty LalatenduMohanty left a comment

The commit message is very brief and does not contain the amount of information available in the PR description. I suggest that we add more information to the commit message so that it will be easier for anyone trying to understand the commit in the future.

@everettraven
Contributor Author

@LalatenduMohanty I could be misremembering, but I was under the impression that merges in this repo default to using the PR title and description for the commit message.

If that understanding is incorrect, I am happy to squash the commits here and write a better commit message.

@LalatenduMohanty
Member

LalatenduMohanty commented Oct 11, 2024

@everettraven Here is an example: fba8473. That PR had two commits; while merging, they were squashed into one commit, but the commit messages from all commits were kept. So if you add a commit message for a commit, it won't be lost after the commits are squashed.

@LalatenduMohanty
Member

Overall the PR looks good to me. Once we add an improved commit message (as mentioned in my previous comments), I can approve the PR.

Contributor

@tmshort tmshort left a comment

/lgtm
Looks much better with the --patience diff option.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 11, 2024
Comment on lines -208 to -210
setInstallStatus(ext, nil)
// TODO: use Installed=Unknown
setInstalledStatusConditionFailed(ext, err.Error())

Member

I'm not sure I agree with this set of removals.

If we fail to get the installed bundle, that seems like a pretty direct signal for Installed=Unknown. In this case, something is majorly wrong with our ability to contact the api server, or we got an ephemeral error, and we'll hopefully set this back to Installed=True when the ephemeral issue goes away. For this, I can see an argument for holding off on setting Installed=Unknown until we see the same error N times in a row so that we avoid flapping, but it seems like we should eventually set Installed=Unknown if we truly can't get the installed bundle.

If we get installedBundle==nil, that's a clear signal for Installed=False. In this case, it seems like we could set the Installed condition to False if we get any other subsequent errors during reconcile.

Member

@joelanford joelanford left a comment

I'm not sure I agree with all of these changes. It seems like these move us away from level-based and into edge-based logic.

IMO, we should assess the current state of the world and update our status accordingly on every reconcile. We had an issue in catalogd recently that we had to fix to move from edge-based back to level-based logic.

I think there are some improvements we could make to prevent unnecessary flapping, but it seems counter to the eventually consistent assumptions of Kubernetes for us to make assumptions about what happened in the past and disregard what we're actually seeing in the present.

@everettraven
Contributor Author

I'm not sure I agree with all of these changes. It seems like these move us away from level-based and into edge-based logic.

IMO, we should assess the current state of the world and update our status accordingly on every reconcile. We had an issue in catalogd recently that we had to fix to move from edge-based back to level-based logic.

I think there are some improvements we could make to prevent unnecessary flapping, but it seems counter to the eventually consistent assumptions of Kubernetes for us to make assumptions about what happened in the past and disregard what we're actually seeing in the present.

@joelanford while I understand what you are saying, and don't entirely disagree, where I am coming from with this PR is that we want to achieve the following with the Installed status condition:

  • Before an initial installation is successful, we want a user to interpret Installed as False OR Unknown.
  • After an initial installation is successful, we want a user to interpret Installed as True regardless of subsequent failures to upgrade, even if a failure results in a degradation of the initial installation

When I was looking into this, I was originally looking at it through the same lens of updating our status based on our knowledge of the current state of the world, but this seemed to introduce quite a bit of complexity that I wasn't confident would prevent false-negative flapping (i.e. saying something isn't installed when it is). More specifically, these are the challenges I recall running into:

  • If we fail to fetch an installed bundle, receiving an error from the InstalledBundleGetter:
    • How can we be sure that we don't misinform end users when updating the Installed status?
    • Are we okay with duplicating information on the Installed and Progressing status conditions?
  • If we receive a nil BundleMetadata from the InstalledBundleGetter, which, based on the following, is possible even if we found a release but it has some unknown status:

    switch rel.Info.Status {
    case release.StatusUnknown:
        return nil, fmt.Errorf("installation status is unknown")
    case release.StatusDeployed, release.StatusUninstalled, release.StatusSuperseded, release.StatusFailed:
    case release.StatusUninstalling, release.StatusPendingInstall, release.StatusPendingRollback, release.StatusPendingUpgrade:
        return nil, fmt.Errorf("installation is still pending: %s", rel.Info.Status)
    default:
        return nil, fmt.Errorf("unknown installation status: %s", rel.Info.Status)
    }

    • Are we sure that we haven't actually attempted an install before? This problem in particular could probably be remedied by making it such that the InstalledBundleGetter's expectation is that if it finds a release, regardless of state, the bundle metadata is returned.
  • If we "fix" the above point and receive a populated BundleMetadata from the InstalledBundleGetter alongside an error due to "invalid" status:
    • How do we ensure that the BundleMetadata we received is for an initial installation attempt vs a failed upgrade attempt?
    • If we add the release revision to the BundleMetadata, that gives a bit more insight, but my understanding is that the revision is increased even on "upgrade" operations applied to a failed initial install. How can we be confident that a release revision range corresponds to an "initial installation attempt"?
  • To potentially resolve the complexities I've called out above, we could add some kind of caching layer as an additional mechanism for cross-checking the information we've received from our InstalledBundleGetter (or the InstalledBundleGetter itself uses it), but then:
    • You add yet another caching layer
    • Should the cache be memory based? What happens if the container crashes and we lose that information? Would potentially overwriting the Installed status condition in this case be acceptable?
    • Should the cache be written to disk to preserve it between potential container crashes/restarts (probably)?

Hopefully my response here shows the complexity involved in managing the Installed condition in the way we've outlined in the RFC when taking a level-based approach. I'm by no means saying one way is right or wrong; I'm simply calling out the complexities involved.

Going the route of this PR, you resolve the two requirements I shared above by:

  • Not setting the Installed status condition until the installation has actually been successful (observed by reaching a specific part of our code). This means that failures up until that point will be reflected in the Progressing condition. Pairing a non-existent Installed condition with a present Progressing condition should give an end user enough insight to say "This ClusterExtension has not yet successfully installed, but is progressing towards the installed state". If the Installed condition is present and the Progressing condition is signaling errors, an end user should be able to conclude that "This ClusterExtension has been successfully installed, but is currently failing to progress through an upgrade".
  • Only setting the Installed status condition when we know for sure that an installation or an upgrade has been successful ensures that we don't run the risk of providing a false-negative on the Installed condition based on our observed state being incomplete.
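The interpretation described in these bullets can be sketched as a small decision function. This is a hypothetical client-side helper for illustration, not part of the actual API or controller:

```go
package main

import "fmt"

// summarize sketches how a client could read the two conditions under this
// PR's contract: a missing Installed condition plus a healthy Progressing
// condition means "not yet installed, progressing"; Installed present plus
// a failing Progressing condition means "installed, upgrade failing".
func summarize(installedPresent bool, progressingHealthy bool) string {
	switch {
	case !installedPresent && progressingHealthy:
		return "not yet installed; progressing toward installation"
	case !installedPresent:
		return "not yet installed; initial installation is failing"
	case progressingHealthy:
		return "installed and healthy"
	default:
		return "installed, but failing to progress through an upgrade"
	}
}

func main() {
	fmt.Println(summarize(false, true))
	fmt.Println(summarize(true, false))
}
```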

I'll reiterate: I'm not saying one way is right or wrong. I'm simply calling out that there is significant complexity involved in setting the Installed condition based on what our controller observes (which could be incomplete), and that the approach of this PR has significantly less complexity. There is obviously a trade-off either way, but given the "time crunch" we've been under for shipping v1.0.0, I went with this approach because it:

  • Achieves the agreed upon requirements for the Installed condition
  • Is faster to implement and thus allows us to reach our v1.0.0 milestone faster
  • Doesn't prohibit us from spending more time on this to get back to a more level-based approach post-v1.0.0

@LalatenduMohanty
Member

/hold
Adding a hold as we want to discuss the changes first.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2024
@everettraven
Contributor Author

Superseded by #1379

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged.