Ensure rollouts are completed before starting a new upgrade. #1296

Closed
Tracked by #950
thetechnick opened this issue Sep 23, 2024 · 7 comments · Fixed by #1349
Labels
kind/bug Categorizes issue or PR as related to a bug. v1.0 Issues related to the initial stable release of OLMv1

Comments

@thetechnick

Interruptions during bundle reconciliation could lead to operator-controller not completing the currently running rollout and instead picking up the next available release from the catalog.

Not rolling out a release completely may lead to inconsistent results and unexpected behavior.

TL;DR: Binary restarts and transient errors should not trigger re-resolution of bundle images while a rollout is still in flight.

@thetechnick thetechnick added the kind/bug Categorizes issue or PR as related to a bug. label Sep 23, 2024
@joelanford
Member

joelanford commented Sep 24, 2024

Scenario:

  • v1 was previously successfully installed
  • v2 failed to upgrade
  • v3 available in the catalog, upgrades from v1

Discussion in WG meeting:

  • If the upgrade to v2 fails because of an ephemeral error AND the catalog contains no upgrade edges from v1, we should try to complete the rollout of the failed release.
  • If the upgrade fails because of a "blocking" error OR the catalog contains a "better" upgrade edge from v1, we should discard the failed release, re-resolve and proceed with an upgrade using the newly resolved bundle.
  • It may be necessary for extension authors to produce a patch version on a non-latest minor in order to help upgrades progress.

We need to be careful about returning nil errors while we are in this retrying state (in either of the above cases).
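The two WG bullets above can be read as a single decision rule. The following is only a minimal sketch of that rule, not operator-controller code: `RolloutState`, `Action`, and `nextAction` are hypothetical names, and it assumes the two cases are exact complements (anything that is not "ephemeral error and no better edge" leads to re-resolution).

```go
package main

import "fmt"

// Action is a hypothetical label for what the controller does next.
type Action string

const (
	RetryRollout Action = "retry-rollout" // keep trying to complete the failed release
	ReResolve    Action = "re-resolve"    // discard the failed release and resolve again
)

// RolloutState is a hypothetical summary of an interrupted upgrade.
type RolloutState struct {
	BlockingError    bool // the failure is not expected to clear on retry
	BetterEdgeExists bool // the catalog offers a preferred upgrade edge from the installed version
}

// nextAction encodes the rule from the WG discussion:
// re-resolve on a blocking error OR a better upgrade edge,
// otherwise keep retrying the in-flight rollout.
func nextAction(s RolloutState) Action {
	if s.BlockingError || s.BetterEdgeExists {
		return ReResolve
	}
	return RetryRollout
}

func main() {
	// v2 failed with an ephemeral error, no other edge from v1: retry v2.
	fmt.Println(nextAction(RolloutState{}))
	// v3 is available and upgrades from v1: discard v2 and re-resolve.
	fmt.Println(nextAction(RolloutState{BetterEdgeExists: true}))
}
```

How an error gets classified as "blocking" versus ephemeral is left open here, as it is in the discussion above.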

@LalatenduMohanty LalatenduMohanty added the v1.0 Issues related to the initial stable release of OLMv1 label Oct 7, 2024
@tmshort
Contributor

tmshort commented Oct 7, 2024

@thetechnick to confirm, you're referring to pushing out the deployment objects, not necessarily that Deployments/StatefulSets/ReplicaSets/DaemonSets have all their replicas up and running?

@LalatenduMohanty
Member

@thetechnick The PR has merged now, so please test it and reopen the issue if the problem persists.

@everettraven
Contributor

Reopening this issue: we identified that the fix merged in #1349 builds on the Helm library's Get call, and that call is insufficient for deciding which release is the latest successfully deployed one.

We identified in a sync call that we should instead:

  • Add a List method to the helm-operator-plugins ActionInterface that allows us to use the Helm library List functionality. This allows filtering releases by specific criteria (like status == Deployed) and sorting them.
  • Use the new List method in operator-controller's InstalledBundleGetter implementation to only return the most recent release that has status == Deployed.
    • NOTE: Using List would make it such that OLM v1 won't upgrade from a "partial upgrade" UNLESS there is another preferred upgrade path on a future resolution attempt.
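The selection described in the bullets above can be sketched as follows. This is an illustration only: `Release` and `latestDeployed` are hypothetical stand-ins rather than the Helm or helm-operator-plugins types, and the status strings are assumed to mirror Helm's release status values.

```go
package main

import "fmt"

// Release is a hypothetical, simplified stand-in for a Helm release record.
type Release struct {
	Name     string
	Revision int    // increases with every install or upgrade attempt
	Status   string // assumed values such as "deployed" or "failed"
}

// latestDeployed returns the highest-revision release whose status is
// "deployed". The boolean is false when no release qualifies.
func latestDeployed(releases []Release) (Release, bool) {
	var best Release
	found := false
	for _, r := range releases {
		if r.Status != "deployed" {
			continue // skip failed or otherwise incomplete rollouts
		}
		if !found || r.Revision > best.Revision {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	history := []Release{
		{Name: "my-extension", Revision: 1, Status: "deployed"}, // v1 installed successfully
		{Name: "my-extension", Revision: 2, Status: "failed"},   // v2 upgrade did not complete
	}
	if r, ok := latestDeployed(history); ok {
		// The failed revision 2 is ignored, so revision 1 is reported as installed.
		fmt.Printf("installed: %s revision %d\n", r.Name, r.Revision)
	}
}
```

Because the failed revision is never reported as installed, resolution keeps starting from v1, which matches the NOTE above about partial upgrades.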

@LalatenduMohanty
Member

Here is the PR in the helm-operator-plugins repo to get more information about the state of a release: operator-framework/helm-operator-plugins#397

@LalatenduMohanty
Member

#1379 is the PR which uses operator-framework/helm-operator-plugins#397.

@LalatenduMohanty
Member

Marking it done as #1379 has merged.
