
[DPE-6338] Add timeout for client watch requests. #10

Open
wants to merge 1 commit into base: 3.4-22.04/edge
Conversation

@theoctober19th theoctober19th (Member) commented Jan 21, 2025

There's a bug in the Spark Integration Hub rock that causes the K8s resource watcher created by the client to hang while waiting for the response. Details about the bug here: canonical/spark-k8s-bundle#72.

This PR adds an extra parameter so that the Python process running the resource watcher stops and reruns after timeout seconds have elapsed (30 seconds by default).
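For illustration, here is a minimal sketch of how such a timeout parameter could be plumbed into the watcher script. It assumes the watcher is built on lightkube (whose httpx transport matches the httpx.TimeoutException discussed below); the --timeout flag and the handle_event helper are hypothetical, and only the monitor_sa.py name and the 30-second default come from this PR.

import argparse

import httpx
from lightkube import Client
from lightkube.resources.core_v1 import ServiceAccount


def handle_event(op, sa):
    # Hypothetical handler; the real monitor_sa.py does more than log.
    print(f"{op}: {sa.metadata.namespace}/{sa.metadata.name}")


def main():
    parser = argparse.ArgumentParser()
    # Hypothetical flag: stop waiting on a silent watch response after this long.
    parser.add_argument("--timeout", type=float, default=30.0,
                        help="read timeout (seconds) for the watch request")
    args = parser.parse_args()

    # Assumption: lightkube passes this httpx.Timeout on to its HTTP client,
    # so a stalled watch request raises a timeout instead of hanging forever.
    client = Client(timeout=httpx.Timeout(5.0, read=args.timeout))
    for op, sa in client.watch(ServiceAccount):
        handle_event(op, sa)


if __name__ == "__main__":
    main()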

Comment on lines +28 to +29
on-success: restart
on-failure: restart

@Batalex Batalex Jan 21, 2025


question: Have you considered wrapping the original piece of code in a try..except inside a loop? Something like this, in pseudocode:

while True:
    try:
        for event in client.watch(TIMEOUT):
            ...  # process the event
    except httpx.TimeoutException:
        continue

With this solution, I wonder if we could have some additional benefits:

  • we would probably get less log output from the rock compared to restarting the service every 4 minutes or so
  • we keep the ability to properly crash the service if something goes wrong elsewhere, whereas with the original solution we would retry even if the issue is not related to the AKS timeout

@theoctober19th theoctober19th (Member, Author) Jan 21, 2025


The while True loop is already implemented inside the client.watch function, and we are not restarting the service every 4 minutes. Even if the timeout happens, control is still inside the loop within client.watch, which effectively means that our process (monitor_sa.py) will not exit and restart every timeout seconds.

As for the service being killed or stopped for any reason, we would still want it to be restarted, because the integration-hub pebble service should be "always active" in order to watch the service accounts added to the cluster.

Also, regarding the log dumps: since control is still inside client.watch during a timeout, the library is smart enough to continue with the same resource watcher, which means it will ignore resource change events that have already been seen, and thus the logs from the outer process monitor_sa.py will not be re-triggered.
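(For illustration only, a rough sketch of the pattern being described: a watch loop that survives read timeouts by reconnecting and resuming from the last seen resourceVersion, so already-delivered events are not replayed. This is a generic sketch of the technique, not lightkube's actual implementation; stream_events is a hypothetical callable standing in for the underlying HTTP call.)

import httpx


def watch_forever(stream_events, resource_version=None):
    # Generic pattern: reconnect on a read timeout and resume from the last
    # seen resourceVersion, so events already yielded are not replayed.
    while True:
        try:
            for event in stream_events(resource_version=resource_version):
                resource_version = event["object"]["metadata"]["resourceVersion"]
                yield event
        except httpx.TimeoutException:
            # e.g. AKS dropping an idle long-lived connection: just reconnect.
            continue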

The on-failure: restart is there as the ultimate fallback mechanism if everything else fails, so that the pebble service would restart again.

@Batalex Batalex

Oh ok, I had a look at the lib implementation and I understand it much better now. I thought that the two lines changed in rockcraft.yaml were the only "retry mechanism".
Let me approve then.
