-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 don't set installed status unless it is successful #1368
🐛 don't set installed status unless it is successful #1368
Conversation
Signed-off-by: everettraven <everettraven@gmail.com>
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Signed-off-by: everettraven <everettraven@gmail.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1368 +/- ##
==========================================
+ Coverage 74.83% 74.90% +0.06%
==========================================
Files 42 42
Lines 2515 2502 -13
==========================================
- Hits 1882 1874 -8
+ Misses 449 445 -4
+ Partials 184 183 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
…subsequent reconcile Signed-off-by: everettraven <everettraven@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why the diff on this file is all messed up. Essentially what I did here was:
- Remove the
Installed
condition check fromTestClusterExtensionResolutionAndUnpackSuccessfulApplierFails
since we no longer expect it to have been set as part of applier failure (I can change this to an explicitnil
check if desired) - Remove the
TestClusterExtensionApplierFailsWithBundleInstalled
function in favor of a newTestClusterExtensionUpgradeFails
test function that asserts that theInstalled
condition is not modified if there is an error returned from each interface (individually) on a subsequent reconciliation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to rebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem excessive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? I'm thinking it is more so related to the removal of a function in the middle of the file and appending one towards the end than a need to rebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any merge conflicts so probably not related to a need to rebase. I think this strictly has to do with positioning of where code was removed and added. I can fiddle with it a bit if it seems valuable to do so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it looks as though removing a function in the middle of a bunch of similar functions is causing chaos... If I pull the PR down locally, I might be able to do a proper incantation of diff options to make it easier to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code diff looks fine to me. If you change the first line of code block and also last few lines , I think git considers that the whole block has changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The --patience
option to diff make it (more) clear that you deleted and added a function.
cond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1alpha1.TypeInstalled) | ||
if assert.NotNil(ct, cond) { | ||
assert.Equal(ct, metav1.ConditionFalse, cond.Status) | ||
assert.Equal(ct, ocv1alpha1.ReasonFailed, cond.Reason) | ||
assert.Contains(ct, cond.Message, "forbidden") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems odd that this line was kept? It was part of the EventuallyWithT()
that was deleted; not sure it should be added the the previous one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is correct here. The logic behind moving it to the previous one is that we no longer set the Installed
condition failure here, but rather reflect the failure in the Progressing
status condition (which is the condition we are checking immediately before this. In this case we expect the Progressing
condition to eventually be True
, with a reason Retrying
, and a message that signals that there were insufficient permissions attached to the ServiceAccount
(asserted by looking for the "forbidden" string in the message).
I could be wrong here, but I believe we want to ensure we aren't progressing further in the test until we've identified that the failure is related to permissions and this should do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit message is very brief and does not contain the amount of information available in the PR description. I will suggest that we add more information to the commit message so that it would be easier for anyone who is trying to understand the commit in future (i.e. commit)
@LalatenduMohanty I could be misremembering, but I was under the impression that merges in this repo default to using the PR title+description for the commit message. If this is the incorrect understanding, I am happy to squash the commits here and write a better commit message. |
@everettraven Here is an an example fba8473 , the PR had two commits , while merging it squashed it in to one commit but it also kept the commit messages from all commits. So if you add a commit message for a commit it wont be lost after the commits are squashed. |
Overall the PR looks good to me. Once the we add a improved commit message (as mentioned in my previous comments) I can approve the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Looks much better with the --patience
diff option.
setInstallStatus(ext, nil) | ||
// TODO: use Installed=Unknown | ||
setInstalledStatusConditionFailed(ext, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I agree with this set of removals.
If we fail to get the installed bundle, that seems like a pretty direct signal for Installed=Unknown
. In this case, something is majorly wrong with our ability to contact the api server, or we got an ephemeral error, and we'll hopefully set this back to Installed=True
when the ephemeral issue goes away. For this, I can see an argument for holding off on setting Installed=Unknown
until we see the same error N times in a row so that we avoid flapping, but it seems like we should eventually set Installed=Unknown
if we truly can't get the installed bundle.
If we get installedBundle==nil
, that's a clear signal for Installed=False
. In this case, it seems like we could set the Installed condition to False if we get any other subsequent errors during reconcile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I agree with all of these changes. It seems like these move us away from level-based and into edge-based logic.
IMO, we should assess the current state of the world and update our status accordingly on every reconcile. We had an issue in catalogd recently that we had to fix to move from edge-based back to level-based logic.
I think there are some improvements we could make to prevent unnecessary flapping, but it seems counter to the eventually consistent assumptions of Kubernetes for us to make assumptions about what happened in the past and disregard what we're actually seeing in the present.
@joelanford while I understand what you are saying, and don't entirely disagree, where I am coming from with this PR is that we want to achieve the following with the
When I was looking into this, I was originally looking at this in the same lens of updating our status based on our knowledge of the current state of the world, but this seemed to introduce quite a bit of complexity that I wasn't confident would prevent false-negative flapping (i.e saying something isn't installed when it is). More specifically, these are the challenges I recall running into:
Hopefully my response here shows the complexity involved with managing the Going the route of this PR, you resolve the two requirements I shared above by:
I'll re-iterate again, I'm not saying one way is right or wrong. I'm simply calling out that there is significant complexity involved in going the route of setting the
|
/hold adding a hold as we want to discuss the changes first. |
Superseded by #1379 |
Description
It was pointed out in an offline discussion that we still have some areas in our code where we may end up overwriting the
Installed
status condition fields fromTrue
toFalse
. Namely:operator-controller/internal/controllers/clusterextension_controller.go
Lines 208 to 210 in 78b586a
operator-controller/internal/controllers/clusterextension_controller.go
Line 220 in 78b586a
In order to fix this, this PR removes setting of the
Installed
status condition and associated status fields from anywhere in the existing code base except when installation has been successful.This means that the
Installed
status condition will:Installed
status condition and theProgressing
status condition)True
and populated with the most up-to-date information on successful installation and upgrades.Reviewer Checklist