Explore: Use uncached k8s client in the api shim #2960

Closed
danail-branekov opened this issue Oct 27, 2023 · 2 comments

Comments

@danail-branekov
Member

We have seen the following flake in e2e periodic tests:
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/11735

According to the test output, the test first creates an org dorifi and, after that succeeds, executes cf target -o dorifi. Targeting the org results in an org not found error.

We have analysed what the CLI does on targeting: it lists the orgs by name and, if the result is empty, returns the not found error.

The API waits for the org ready condition, so the theory that the API does not wait for it does not hold.

We believe the problem might be that once the org namespace is created and the user role bindings are propagated into it, the API shim's cache has not yet seen the role binding. Listing orgs therefore yields an unauthorised error, which the API masks by returning an empty list.
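
The suspected chain of events can be sketched in Go as follows. This is only a toy model of the hypothesis, not Korifi or cf CLI code: `errForbidden`, `listOrgsFromCluster`, `listOrgs`, `targetOrg` and the `roleBindingVisible` flag are all hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden stands in for an authorization failure (hypothetical name).
var errForbidden = errors.New("forbidden")

// listOrgsFromCluster is a hypothetical stand-in for the shim's org lookup.
// It fails with errForbidden until the user's role binding is visible.
func listOrgsFromCluster(name string, roleBindingVisible bool) ([]string, error) {
	if !roleBindingVisible {
		return nil, errForbidden
	}
	return []string{name}, nil
}

// listOrgs masks an authorization failure by returning an empty list,
// which is the behaviour assumed in the paragraph above.
func listOrgs(name string, roleBindingVisible bool) ([]string, error) {
	orgs, err := listOrgsFromCluster(name, roleBindingVisible)
	if errors.Is(err, errForbidden) {
		return []string{}, nil
	}
	return orgs, err
}

// targetOrg mimics the CLI logic described earlier: list the orgs by name
// and report "not found" when the result is empty.
func targetOrg(name string, roleBindingVisible bool) error {
	orgs, err := listOrgs(name, roleBindingVisible)
	if err != nil {
		return err
	}
	if len(orgs) == 0 {
		return fmt.Errorf("Organization '%s' not found.", name)
	}
	return nil
}

func main() {
	fmt.Println(targetOrg("dorifi", false)) // role binding not yet visible
	fmt.Println(targetOrg("dorifi", true))  // role binding visible
}
```

In this model the caller cannot tell "the org does not exist" apart from "the user is not yet authorised to see it", which is why the masked error would surface as a not found message.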

To address this, could we turn off the k8s client cache completely in the API shim? All API operations would then talk to the k8s database directly, which would probably eliminate caching issues. Furthermore, without the client cache we could experiment with removing the retrying client (although this might cause flakes if there are multiple etcd instances).
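
To make the difference between cached and direct reads concrete, here is a minimal Go sketch. It assumes a cache is actually in play and models it with plain maps; `store`, `cachedReader` and `directReader` are hypothetical types and do not correspond to any real Kubernetes client API.

```go
package main

import "fmt"

// store is a toy stand-in for the state persisted by the Kubernetes API server.
type store struct{ roleBindings map[string]bool }

// cachedReader serves reads from a snapshot that is refreshed only by sync.
type cachedReader struct {
	backing  *store
	snapshot map[string]bool
}

// sync copies the current backing state into the snapshot.
func (c *cachedReader) sync() {
	c.snapshot = make(map[string]bool, len(c.backing.roleBindings))
	for ns, ok := range c.backing.roleBindings {
		c.snapshot[ns] = ok
	}
}

func (c *cachedReader) hasRoleBinding(ns string) bool { return c.snapshot[ns] }

// directReader always reads the backing store, so it never serves stale data.
type directReader struct{ backing *store }

func (d *directReader) hasRoleBinding(ns string) bool { return d.backing.roleBindings[ns] }

func main() {
	s := &store{roleBindings: map[string]bool{}}
	cached := &cachedReader{backing: s}
	cached.sync()
	direct := &directReader{backing: s}

	// The role binding is written after the cache's last sync.
	s.roleBindings["dorifi"] = true

	fmt.Println(cached.hasRoleBinding("dorifi")) // stale until the next sync
	fmt.Println(direct.hasRoleBinding("dorifi"))
}
```

The trade-off in this model is that the direct reader hits the backing store on every call, while the cached reader is cheaper but can lag behind recent writes.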

According to the flake hunter, this flake occurs rarely:

❯ flake-hunter "Organization 'dorifi' not found."
+-------+----------------------------------+-----------------------------------------------------
| Ended | Job                              | Url
+-------+----------------------------------+-----------------------------------------------------
| 4h    | main/run-e2es-periodic           | https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/11735
| 68d   | main/run-e2es-periodic           | https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/10543
+-------+----------------------------------+-----------------------------------------------------

Therefore it might be hard to confirm whether we have fixed the issue.

@github-project-automation github-project-automation bot moved this to 🧊 Icebox in Korifi - Backlog Oct 27, 2023
@georgethebeatle georgethebeatle moved this from 🧊 Icebox to ⚙️ Chores in Korifi - Backlog Oct 27, 2023
@georgethebeatle georgethebeatle moved this from ⚙️ Chores to 🇪🇺 To do in Korifi - Backlog Dec 11, 2023
@georgethebeatle
Member

It turns out that the API client does not have a cache, so this theory is invalidated. We have added the verbose flag to the cf CLI, hoping for more details the next time it flakes.

@georgethebeatle
Member

It turned out that this flake is not related to caching, but to org deletion being slow when pods are in the Initializing state. We have decided to ignore this in the tests. For more info: #3061

@github-project-automation github-project-automation bot moved this from 🇪🇺 To do to ✅ Done in Korifi - Backlog Apr 3, 2024