Explore: Use uncached k8s client in the api shim #2960

danail-branekov · 2023-10-27T08:57:15Z

We have seen the following flake in e2e periodic tests:
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/11735

According to the test output, the test first creates an org dorifi and, once that succeeds, executes cf target -o dorifi. Targeting the org results in an org not found error.

We have analysed what the CLI does when targeting: it lists the orgs by name and, if the result is empty, returns the not found error.

The API waits for the org ready condition, so the theory that the API does not wait for it does not hold.

We believe the problem might be that once the org namespace is created and the user role bindings are propagated into it, the cache of the API shim has not yet seen the role binding. Listing orgs therefore yields an unauthorised error, which the API masks by returning an empty list.
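The suspected sequence can be modelled as a toy with no Kubernetes dependencies. This is a sketch under assumptions, not Korifi code: `store`, `staleCache`, `listFromCache`, `listOrgs`, `targetOrg`, and the user `alice` are hypothetical, and the cache is a plain snapshot that only updates when `sync` is called.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden stands in for the authorization failure the theory describes.
var errForbidden = errors.New("forbidden")

// store is a hypothetical source of truth: org name -> users with a role binding.
type store map[string]map[string]bool

// staleCache is a snapshot of the store that only changes when sync is called.
type staleCache struct {
	snapshot store
}

// sync copies the current state of the store into the cache.
func (c *staleCache) sync(s store) {
	c.snapshot = store{}
	for org, users := range s {
		c.snapshot[org] = map[string]bool{}
		for u := range users {
			c.snapshot[org][u] = true
		}
	}
}

// listFromCache returns the org only if the cached role bindings allow it.
func listFromCache(c *staleCache, user, org string) ([]string, error) {
	if !c.snapshot[org][user] {
		return nil, errForbidden
	}
	return []string{org}, nil
}

// listOrgs models the API masking the authorization error as an empty list.
func listOrgs(c *staleCache, user, org string) []string {
	orgs, err := listFromCache(c, user, org)
	if errors.Is(err, errForbidden) {
		return []string{}
	}
	return orgs
}

// targetOrg models the CLI: list by name, report "not found" on an empty result.
// The message mirrors the text searched for with the flake hunter.
func targetOrg(c *staleCache, user, org string) error {
	if len(listOrgs(c, user, org)) == 0 {
		return fmt.Errorf("Organization '%s' not found.", org)
	}
	return nil
}

func main() {
	s := store{}
	c := &staleCache{}
	c.sync(s)

	// The org and its role binding now exist, but the cache has not caught up.
	s["dorifi"] = map[string]bool{"alice": true}
	fmt.Println(targetOrg(c, "alice", "dorifi"))

	// Once the cache has seen the role binding, targeting succeeds.
	c.sync(s)
	fmt.Println(targetOrg(c, "alice", "dorifi"))
}
```

In this model the first `targetOrg` call fails even though the org already exists in the store, which is the behaviour the theory attributes to the flake.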

To address this, could we turn off the k8s client cache completely in the API shim? All API operations would then talk to the k8s database directly, which would probably eliminate caching issues. Furthermore, without the client cache, we could experiment with removing the retrying client (although this might cause flakes if there are multiple etcd instances).

According to the flake hunter, this flake occurs rarely:

❯ flake-hunter "Organization 'dorifi' not found."
+-------+----------------------------------+-----------------------------------------------------
| Ended | Job                              | Url
+-------+----------------------------------+-----------------------------------------------------
| 4h    | main/run-e2es-periodic           | https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/11735
| 68d   | main/run-e2es-periodic           | https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/10543
+-------+----------------------------------+-----------------------------------------------------

Therefore it might be hard to confirm whether we have fixed the issue.


georgethebeatle · 2023-12-11T16:41:02Z

It turns out that the API client does not have a cache, so this theory is invalidated. We have added the verbose flag to the cf CLI, hoping for more details the next time it flakes.

georgethebeatle · 2024-04-03T14:23:38Z

It turned out this flake is not related to caching, but to org deletion being slow when pods are in the Initializing state. We have decided to ignore this in the tests. For more info: #3061

danail-branekov added the explore label Oct 27, 2023

korifi-bot added this to Korifi - Backlog Oct 27, 2023

georgethebeatle added the chore label Oct 27, 2023

github-project-automation bot moved this to 🧊 Icebox in Korifi - Backlog Oct 27, 2023

georgethebeatle removed this from Korifi - Backlog Oct 27, 2023

georgethebeatle added this to Korifi - Backlog Oct 27, 2023

github-project-automation bot moved this to 🧊 Icebox in Korifi - Backlog Oct 27, 2023

georgethebeatle moved this from 🧊 Icebox to ⚙️ Chores in Korifi - Backlog Oct 27, 2023

georgethebeatle moved this from ⚙️ Chores to 🇪🇺 To do in Korifi - Backlog Dec 11, 2023

georgethebeatle closed this as completed Apr 3, 2024

github-project-automation bot moved this from 🇪🇺 To do to ✅ Done in Korifi - Backlog Apr 3, 2024
