cleanup and streamline status computation #1032

Open
ffromani wants to merge 7 commits into main from cleanup-status-computation

Conversation

ffromani
Member

@ffromani ffromani commented Oct 3, 2024

The way we computed the NUMAResourcesOperator status was messy: it relied on inner functions and helpers to somehow report whether something changed in the object, abused the names of the conditions (leading to awkward function signatures), did and undid checks, and so forth.

Besides making the code messy and unnecessarily hard to read, the outcome was that we sometimes failed to update the status, which has not led to major bugs (yet) but makes for a less-than-ideal experience.

To untangle this mess, the new approach is to mutate the status freely during the reconciliation loop, in the reconciliation sub-steps. Every step is free to mutate the status and only reports whether an error occurred; the top-level reconciliation code then detects semantic differences (e.g. if only timestamps changed and nothing else, the difference is not semantically relevant, so no actual update should be sent).

Detecting changes this way requires nested comparisons of objects, but it is a major simplification. If we implement the comparison code carefully (coming soon), benchmarks show good numbers with respect to the current implementation (there are still some theoretical slowdowns, but the overall state is not terrible and better than expected), so it is a win-win.
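
To make the "semantic difference" idea concrete, here is a minimal sketch of a comparison that ignores timestamps. It is not the code in this PR: `Condition` and `Status` are simplified stand-ins for `metav1.Condition` and the operator status, and `statusSemanticallyEqual` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for metav1.Condition (assumption:
// only the fields relevant to this sketch are modeled).
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// Status is a hypothetical, reduced version of the operator status.
type Status struct {
	Conditions []Condition
	DaemonSets []string
}

// statusSemanticallyEqual reports whether two statuses carry the same
// information once the condition timestamps are ignored. Conditions are
// compared position by position, so the order is assumed to be stable.
func statusSemanticallyEqual(a, b Status) bool {
	if len(a.Conditions) != len(b.Conditions) || len(a.DaemonSets) != len(b.DaemonSets) {
		return false
	}
	for i := range a.Conditions {
		// x and y are copies, so zeroing the timestamps leaves the inputs untouched.
		x, y := a.Conditions[i], b.Conditions[i]
		x.LastTransitionTime, y.LastTransitionTime = time.Time{}, time.Time{}
		if x != y {
			return false
		}
	}
	for i := range a.DaemonSets {
		if a.DaemonSets[i] != b.DaemonSets[i] {
			return false
		}
	}
	return true
}

func main() {
	now := time.Now()
	old := Status{Conditions: []Condition{{Type: "Available", Status: "True", LastTransitionTime: now}}}
	cur := Status{Conditions: []Condition{{Type: "Available", Status: "True", LastTransitionTime: now.Add(time.Minute)}}}

	// Only the timestamp differs: no update should be sent.
	fmt.Println("update needed:", !statusSemanticallyEqual(old, cur))

	// A real change in the condition does warrant an update.
	cur.Conditions[0].Status = "False"
	fmt.Println("update needed:", !statusSemanticallyEqual(old, cur))
}
```

With a helper like this, the top-level reconciler could snapshot the status before running the sub-steps and send an update only when the comparison fails.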

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2024
Contributor

openshift-ci bot commented Oct 3, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 3, 2024
@ffromani
Member Author

ffromani commented Oct 3, 2024

/hold

need to wait for all the HCP work to go in anyway

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 3, 2024
@ffromani ffromani force-pushed the cleanup-status-computation branch 5 times, most recently from 4cb9131 to 75c5334 October 9, 2024 07:39
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2024
@ffromani
Member Author

/cc @shajmakh

@openshift-ci openshift-ci bot requested a review from shajmakh October 28, 2024 15:25
@ffromani ffromani changed the title WIP: cleanup status computation cleanup and streamline status computation Oct 29, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 29, 2024
@ffromani
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 29, 2024
@ffromani
Member Author

@shajmakh hey! let's discuss the cleanups in the last 2 commits to see if they can help you

Member

@shajmakh shajmakh left a comment

Thanks for this!
These review comments relate to the last 2 commits.

Comment on lines 47 to 100
return conditionInfo{
Type: status.ConditionDegraded,
Message: messageFromError(err),
}
Member

should we add reason here?
Reason: reasonFromError(err)

Member Author

we probably should, yes. It should be helpful and harmless.

if ok {
instance.Status.Conditions = conditions
}
}

func (r *NUMAResourcesOperatorReconciler) degradeStatus(ctx context.Context, instance *nropv1.NUMAResourcesOperator, reason string, stErr error) (ctrl.Result, error) {
message := messageFromError(stErr)
info := degradedConditionInfoFromError(stErr)
info.Reason = reason
Member

if reason is empty, it will override info.Reason with an empty string (should we keep it as "InternalError"?) once we set Reason: reasonFromError(err) above

Member Author

indeed, good catch. I think I fixed it in the later commits by rearranging the flow.
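
For reference, one possible shape of the fix discussed in this thread, as a sketch only and not necessarily what the later commits do: derive the reason from the error first, and let the caller override it only with a non-empty value. The bodies of `reasonFromError` and `messageFromError` are hypothetical stand-ins, `withReason` is an invented helper, and the string values other than "InternalError" are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// conditionInfo mirrors the struct discussed in the review; the exact
// field set in the PR may differ.
type conditionInfo struct {
	Type    string
	Reason  string
	Message string
}

// errSample is an example error used only by this sketch.
var errSample = errors.New("daemonset not ready")

// reasonFromError is a hypothetical stand-in for the helper named in the
// review: any non-nil error maps to "InternalError".
func reasonFromError(err error) string {
	if err == nil {
		return "AsExpected" // assumption: reason used when nothing failed
	}
	return "InternalError"
}

// messageFromError is a hypothetical stand-in as well.
func messageFromError(err error) string {
	if err == nil {
		return ""
	}
	return err.Error()
}

func degradedConditionInfoFromError(err error) conditionInfo {
	return conditionInfo{
		Type:    "Degraded", // assumption: string value of status.ConditionDegraded
		Reason:  reasonFromError(err),
		Message: messageFromError(err),
	}
}

// withReason (hypothetical helper) overrides the reason only when the
// caller supplies a non-empty one, so the derived reason is never blanked.
func (ci conditionInfo) withReason(reason string) conditionInfo {
	if reason != "" {
		ci.Reason = reason
	}
	return ci
}

func main() {
	fmt.Println(degradedConditionInfoFromError(errSample).withReason("").Reason)
	fmt.Println(degradedConditionInfoFromError(errSample).withReason("FailedRTESync").Reason)
}
```

The first call keeps the derived reason because the override is empty; the second uses the caller-supplied one ("FailedRTESync" is a made-up value).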

@@ -207,25 +208,27 @@ func (r *NUMAResourcesOperatorReconciler) degradeStatus(ctx context.Context, ins
return ctrl.Result{}, nil
}

-func (r *NUMAResourcesOperatorReconciler) reconcileResourceAPI(ctx context.Context, instance *nropv1.NUMAResourcesOperator, trees []nodegroupv1.Tree) (bool, ctrl.Result, string, error) {
+func (r *NUMAResourcesOperatorReconciler) reconcileResourceAPI(ctx context.Context, instance *nropv1.NUMAResourcesOperator, trees []nodegroupv1.Tree) (bool, ctrl.Result, conditionInfo, error) {
Member

I don't see a reason to keep the returned error, it is already part of the returned conditionInfo

Member

same below in the other reconciliation substeps

Member Author

that's a good point. What should go, though, is not the top-level error, which is used to report to the upper layers, but the inner error in ConditionInfo. Let's see what I can do.

Member Author

scratch that, fixed in the following commit

controllers/numaresourcesoperator_controller.go (outdated, resolved)
controllers/controllers.go (outdated, resolved)
Inline status update in the happy path, if the reconciliation
loop completed all the expected steps.

This is an intermediate step towards the final cleanup,
and should cause no changes in behavior.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
Instead of relying on the chain of helpers to report
if something changed in the status, and thus warrants an update,
run a full diff of our status explicitly.

If something worthy (whose definition depends on
the helper implemented in `pkg/status`) changed,
then we will push a status update.

This is probably slower than relying on the reports from
subfunctions, but simplifies and streamlines the code
significantly.

Signed-off-by: Francesco Romani <fromani@redhat.com>
The only reason why we update the status conditions
outside reconcileResources, while we update everything
else related to status inside, is historical.

We are now able to close this gap and streamline
the code further.

Signed-off-by: Francesco Romani <fromani@redhat.com>
we never use the return value, good riddance.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Instead of passing condition types, possibly a message, maybe an error,
and then deriving the full condition data in many places,
factor all the data into a condition info struct, to be used
as the basis for creating the real metav1.Condition.

This cleans things up and unlocks further cleanups.

Signed-off-by: Francesco Romani <fromani@redhat.com>
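
As an illustration of the commit message above, a minimal sketch of turning a condition info struct into a full condition in a single place. `Condition` is a simplified stand-in for `metav1.Condition`, and `toCondition` is a hypothetical method name; the actual fields and helpers in the PR may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// conditionInfo gathers the data the reconciliation steps know about;
// everything else is derived when the condition is created.
type conditionInfo struct {
	Type    string
	Reason  string
	Message string
}

// toCondition (hypothetical name) is the single place where the full
// condition is built. Assumption: the condition described by the info is
// the one that currently holds, hence Status "True".
func (ci conditionInfo) toCondition(now time.Time) Condition {
	return Condition{
		Type:               ci.Type,
		Status:             "True",
		Reason:             ci.Reason,
		Message:            ci.Message,
		LastTransitionTime: now,
	}
}

func main() {
	info := conditionInfo{Type: "Degraded", Reason: "InternalError", Message: "daemonset not ready"}
	cond := info.toCondition(time.Now())
	fmt.Printf("%s=%s reason=%s message=%q\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```

Passing the timestamp in as a parameter keeps the conversion deterministic, which makes it easier to test.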
@ffromani
Member Author

/hold
need to test

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 30, 2024
the reconciliation steps are returning a common
(and growing) set of values, let's pack them
in a struct, since we always want to return
the same tuple anyway for consistency.

Signed-off-by: Francesco Romani <fromani@redhat.com>
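
A minimal sketch of the idea in this commit, under stated assumptions: `stepResult`, `stepOK`, `stepFailed`, and `runSteps` are hypothetical names, `RequeueAfter` stands in for `ctrl.Result`, and the condition strings are assumed values. It only shows how packing the returned tuple in a struct keeps every step's signature identical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type conditionInfo struct {
	Type    string
	Reason  string
	Message string
}

// stepResult (hypothetical name) packs the values every reconciliation
// step returns, replacing a growing (bool, ctrl.Result, conditionInfo, error) tuple.
type stepResult struct {
	Done          bool          // true if the loop must stop after this step
	RequeueAfter  time.Duration // stand-in for ctrl.Result
	ConditionInfo conditionInfo
	Error         error
}

// errSample is an example error used only by this sketch.
var errSample = errors.New("cannot sync machine config")

func stepOK() stepResult {
	return stepResult{ConditionInfo: conditionInfo{Type: "Available", Reason: "AsExpected"}}
}

func stepFailed(err error) stepResult {
	return stepResult{
		Done:          true,
		ConditionInfo: conditionInfo{Type: "Degraded", Reason: "InternalError", Message: err.Error()},
		Error:         err,
	}
}

// runSteps runs the steps in order and stops at the first one reporting Done.
func runSteps(steps ...func() stepResult) stepResult {
	last := stepOK()
	for _, step := range steps {
		last = step()
		if last.Done {
			break
		}
	}
	return last
}

func main() {
	res := runSteps(
		func() stepResult { return stepOK() },
		func() stepResult { return stepFailed(errSample) },
		func() stepResult { return stepOK() }, // not reached
	)
	fmt.Println(res.ConditionInfo.Type, res.ConditionInfo.Reason, res.Error)
}
```

Adding a new value to what the steps return then means adding a field to the struct, without touching every signature.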
@ffromani
Member Author

known controller test failure. It's legit. I'll have a look and fix ASAP

Contributor

openshift-ci bot commented Oct 31, 2024

@ffromani: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/ci-unit | b4f8b44 | link | true | /test ci-unit |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
3 participants