ACM-14639: Dashboard loader: retry adding folder on errors #1646

jacobbaungard · 2024-10-07T13:05:39Z

Grafana sometimes fails to correctly set (default) permissions for folders added via the API. This seems to mostly affect the "Custom" folder. When this happens, while the folder is added, it's not possible for ACM users to view the folder, and its dashboards. This commit tries to detect that case, and retry adding the folder if it fails.

Additional minor changes:

Configure grafana to retry queries
Log messages from the component all starts in lower-case.

openshift-ci · 2024-10-07T13:05:45Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

jacobbaungard · 2024-10-07T13:05:54Z

/test all

jacobbaungard · 2024-10-07T13:11:18Z

/test test-e2e

jacobbaungard · 2024-10-07T14:17:40Z

/test test-e2e

jacobbaungard · 2024-10-07T15:13:56Z

/test all

Grafana sometimes fails to correctly set (default) permissions for folders added via the API. This seems to mostly affect the "Custom" folder. When this happens, while the folder is added, it's not possible for ACM users to view the folder, and its dashboards. This commit tries to detect that case, and retry adding the folder if it fails. Additional minor changes: - Configure grafana to retry queries - Log messages from the component all starts in lower-case. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>

jacobbaungard · 2024-10-07T17:15:31Z

/test test-e2e

jacobbaungard · 2024-10-08T07:52:42Z

/test test-e2e

Previously, the grafana-dev test would create a new custom dashboard, and test the exporting to configmap using that. However since adding custom dashboards are not so reliable after Grafana 11 update, we change the test to use of the default dashboards. This should improve test reliability, and there are no difference whether we export a custom dashboard or a default one. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>

Since these are unreliable, and requires a few retries increase the test timeout, to give a better chance of success. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>

.. since we no longer add it. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>

1 is slightly better than one, since 1 is usually used as a generic "the program had an error" exit code. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>

jacobbaungard · 2024-10-08T12:27:00Z

/retest-required

philipgough

/lgtm

jacobbaungard · 2024-10-08T14:13:10Z

@moadz you want to contest the forceful exit?

I guess it's a matter of whether we want this to silently fail or not. With this approach there's a chance Grafana availability will be impacted until at least one pod has succeeded (Grafana replicas=2 by default), but at least it's clear something is wrong. Alternatively, we fail more silently and dashboards will just be missing (in this case maybe we want a bit more retries, but it's already pretty high as is).

moadz · 2024-10-09T08:59:37Z

So my main question with forcing a crash is there is likely no intervention from a user that will fix this issue. And restarting the pod actually increases the likelihood this will fail again upon creation. So materially it's the same effect (we keep retrying whether through crashloop or otherwise) and if it doesn't work they will have to create a support ticket to review.

Restarting at least gives them a culprit when the dashboard doesn't show I suppose. But dashboard loader restarting grafana feels sketchy to me, so i would er on the side of just constantly retrying. If it doesn't work after N number of retries it's a big problem we will need to investigate regardless (i'm hoping the likelihood is remote).

moadz

LGTM % the restart discussion

moadz · 2024-10-09T09:00:40Z

loaders/dashboards/pkg/controller/dashboard_controller.go

@@ -40,24 +41,25 @@ const (
 var (
 	grafanaURI = "http://127.0.0.1:3001"
 	// Retry on errors.
-	retry = 10
+	httpRetry      = 10
+	dashboardRetry = 25


nit

Suggested change

dashboardRetry = 25

maxDashboardRetry = 25

moadz · 2024-10-09T09:00:54Z

loaders/dashboards/pkg/controller/dashboard_controller.go

@@ -40,24 +41,25 @@ const (
 var (
 	grafanaURI = "http://127.0.0.1:3001"
 	// Retry on errors.
-	retry = 10
+	httpRetry      = 10


nit:

Suggested change

httpRetry = 10

maxHttpRetry = 10

jacobbaungard · 2024-10-09T09:27:05Z

But dashboard loader restarting grafana

It doesn't actually restart the Grafana container (just the loader sidecar) but it will stop the service from directing requests to the failing pod if it's in crashloop backoff. The two Grafana replicas we have by default, don't use the same sqllite database. Not crashing the pod, could lead to some weird behavior where sometimes the dashboard show, sometimes it doesn't depending on which pod your request is routed to - Not a big fan of this this undetermism/hidden failures that this could cause.

However, I do agree that the impact of hitting this error, is potentially much larger with failing the pod, since if by bad luck we don't manage to ever actually create the dashboard, Grafana will not be available at all, whereas if we just silently fail, the worst that can happen is a few dashboards are missing. So maybe it is a little "safer" to not crash the pod, so the impact is minimal (albeit potentially more confusing for customers imo).

- Don't forcefully exit - bump retries - naming nits Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>

sonarcloud · 2024-10-09T10:52:37Z

Quality Gate failed

Failed conditions
17.2% Coverage on New Code (required ≥ 70%)

See analysis details on SonarCloud

openshift-ci · 2024-10-09T10:52:43Z

@jacobbaungard: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/sonarcloud	`045461c`	link	false	`/test sonarcloud`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

moadz

/lgtm

openshift-ci · 2024-10-09T11:17:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jacobbaungard, moadz, philipgough

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jacobbaungard,moadz,philipgough]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the do-not-merge/work-in-progress label Oct 7, 2024

openshift-ci bot added dco-signoff: yes approved labels Oct 7, 2024

jacobbaungard force-pushed the dashboard-reloader-folder-retry branch from ba7bfa2 to 5e8347e Compare October 7, 2024 13:10

jacobbaungard force-pushed the dashboard-reloader-folder-retry branch from 5e8347e to fd3f6bd Compare October 7, 2024 14:16

jacobbaungard force-pushed the dashboard-reloader-folder-retry branch from fd3f6bd to 79b6f9b Compare October 7, 2024 15:13

jacobbaungard changed the title ~~Dashboard loader: retry adding folder on errors~~ ACM-14639: Dashboard loader: retry adding folder on errors Oct 7, 2024

jacobbaungard force-pushed the dashboard-reloader-folder-retry branch from 79b6f9b to 036178d Compare October 7, 2024 15:57

jacobbaungard marked this pull request as ready for review October 7, 2024 15:57

openshift-ci bot removed the do-not-merge/work-in-progress label Oct 7, 2024

jacobbaungard added 4 commits October 8, 2024 10:27

Tests: increase timeout for custom dashboard tests

3529488

Since these are unreliable, and requires a few retries increase the test timeout, to give a better chance of success. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>

Grafana-dev test: don't delete custom dashboard

8307911

.. since we no longer add it. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>

Dashboard loader: use exit code 1 instead of 3

a3e0b71

1 is slightly better than one, since 1 is usually used as a generic "the program had an error" exit code. Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>

philipgough approved these changes Oct 8, 2024

View reviewed changes

openshift-ci bot assigned philipgough Oct 8, 2024

openshift-ci bot added the lgtm label Oct 8, 2024

moadz approved these changes Oct 9, 2024

View reviewed changes

openshift-ci bot assigned moadz Oct 9, 2024

moadz reviewed Oct 9, 2024

View reviewed changes

Dashboard loader: no forceful exit, increase retry

045461c

- Don't forcefully exit - bump retries - naming nits Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>

openshift-ci bot removed the lgtm label Oct 9, 2024

moadz approved these changes Oct 9, 2024

View reviewed changes

openshift-ci bot added the lgtm label Oct 9, 2024

openshift-merge-bot bot merged commit 99d6bb0 into stolostron:main Oct 9, 2024
20 of 21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ACM-14639: Dashboard loader: retry adding folder on errors #1646

ACM-14639: Dashboard loader: retry adding folder on errors #1646

jacobbaungard commented Oct 7, 2024

openshift-ci bot commented Oct 7, 2024

jacobbaungard commented Oct 7, 2024

jacobbaungard commented Oct 7, 2024

jacobbaungard commented Oct 7, 2024

jacobbaungard commented Oct 7, 2024

jacobbaungard commented Oct 7, 2024

jacobbaungard commented Oct 8, 2024

jacobbaungard commented Oct 8, 2024

philipgough left a comment

jacobbaungard commented Oct 8, 2024

moadz commented Oct 9, 2024

moadz left a comment

moadz Oct 9, 2024

moadz Oct 9, 2024

jacobbaungard commented Oct 9, 2024

sonarcloud bot commented Oct 9, 2024

openshift-ci bot commented Oct 9, 2024

moadz left a comment

openshift-ci bot commented Oct 9, 2024

ACM-14639: Dashboard loader: retry adding folder on errors #1646

ACM-14639: Dashboard loader: retry adding folder on errors #1646

Conversation

jacobbaungard commented Oct 7, 2024

openshift-ci bot commented Oct 7, 2024

jacobbaungard commented Oct 7, 2024

jacobbaungard commented Oct 7, 2024

jacobbaungard commented Oct 7, 2024

jacobbaungard commented Oct 7, 2024

jacobbaungard commented Oct 7, 2024

jacobbaungard commented Oct 8, 2024

jacobbaungard commented Oct 8, 2024

philipgough left a comment

Choose a reason for hiding this comment

jacobbaungard commented Oct 8, 2024

moadz commented Oct 9, 2024

moadz left a comment

Choose a reason for hiding this comment

moadz Oct 9, 2024

Choose a reason for hiding this comment

moadz Oct 9, 2024

Choose a reason for hiding this comment

jacobbaungard commented Oct 9, 2024

sonarcloud bot commented Oct 9, 2024

Quality Gate failed

openshift-ci bot commented Oct 9, 2024

moadz left a comment

Choose a reason for hiding this comment

openshift-ci bot commented Oct 9, 2024