-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ACM-14639: Dashboard loader: retry adding folder on errors #1646
ACM-14639: Dashboard loader: retry adding folder on errors #1646
Conversation
Skipping CI for Draft Pull Request. |
/test all |
ba7bfa2
to
5e8347e
Compare
/test test-e2e |
5e8347e
to
fd3f6bd
Compare
/test test-e2e |
fd3f6bd
to
79b6f9b
Compare
/test all |
Grafana sometimes fails to correctly set (default) permissions for folders added via the API. This seems to mostly affect the "Custom" folder. When this happens, while the folder is added, it's not possible for ACM users to view the folder, and its dashboards. This commit tries to detect that case, and retry adding the folder if it fails. Additional minor changes: - Configure grafana to retry queries - Log messages from the component all starts in lower-case. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
79b6f9b
to
036178d
Compare
/test test-e2e |
1 similar comment
/test test-e2e |
Previously, the grafana-dev test would create a new custom dashboard, and test the exporting to configmap using that. However since adding custom dashboards are not so reliable after Grafana 11 update, we change the test to use of the default dashboards. This should improve test reliability, and there are no difference whether we export a custom dashboard or a default one. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
Since these are unreliable, and requires a few retries increase the test timeout, to give a better chance of success. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
.. since we no longer add it. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
1 is slightly better than one, since 1 is usually used as a generic "the program had an error" exit code. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
/retest-required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
@moadz you want to contest the forceful exit? I guess it's a matter of whether we want this to silently fail or not. With this approach there's a chance Grafana availability will be impacted until at least one pod has succeeded (Grafana replicas=2 by default), but at least it's clear something is wrong. Alternatively, we fail more silently and dashboards will just be missing (in this case maybe we want a bit more retries, but it's already pretty high as is). |
So my main question with forcing a crash is there is likely no intervention from a user that will fix this issue. And restarting the pod actually increases the likelihood this will fail again upon creation. So materially it's the same effect (we keep retrying whether through crashloop or otherwise) and if it doesn't work they will have to create a support ticket to review. Restarting at least gives them a culprit when the dashboard doesn't show I suppose. But dashboard loader restarting grafana feels sketchy to me, so i would er on the side of just constantly retrying. If it doesn't work after N number of retries it's a big problem we will need to investigate regardless (i'm hoping the likelihood is remote). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % the restart discussion
@@ -40,24 +41,25 @@ const ( | |||
var ( | |||
grafanaURI = "http://127.0.0.1:3001" | |||
// Retry on errors. | |||
retry = 10 | |||
httpRetry = 10 | |||
dashboardRetry = 25 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
dashboardRetry = 25 | |
maxDashboardRetry = 25 |
@@ -40,24 +41,25 @@ const ( | |||
var ( | |||
grafanaURI = "http://127.0.0.1:3001" | |||
// Retry on errors. | |||
retry = 10 | |||
httpRetry = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
httpRetry = 10 | |
maxHttpRetry = 10 |
It doesn't actually restart the Grafana container (just the loader sidecar) but it will stop the service from directing requests to the failing pod if it's in crashloop backoff. The two Grafana replicas we have by default, don't use the same sqllite database. Not crashing the pod, could lead to some weird behavior where sometimes the dashboard show, sometimes it doesn't depending on which pod your request is routed to - Not a big fan of this this undetermism/hidden failures that this could cause. However, I do agree that the impact of hitting this error, is potentially much larger with failing the pod, since if by bad luck we don't manage to ever actually create the dashboard, Grafana will not be available at all, whereas if we just silently fail, the worst that can happen is a few dashboards are missing. So maybe it is a little "safer" to not crash the pod, so the impact is minimal (albeit potentially more confusing for customers imo). |
- Don't forcefully exit - bump retries - naming nits Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
Quality Gate failedFailed conditions |
@jacobbaungard: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jacobbaungard, moadz, philipgough The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
99d6bb0
into
stolostron:main
Grafana sometimes fails to correctly set (default) permissions for folders added via the API. This seems to mostly affect the "Custom" folder. When this happens, while the folder is added, it's not possible for ACM users to view the folder, and its dashboards. This commit tries to detect that case, and retry adding the folder if it fails.
Additional minor changes: