[DPE-6338] Add timeout for client watch requests. #10
base: 3.4-22.04/edge
Conversation
on-success: restart
on-failure: restart
question: Have you considered wrapping the original piece of code in a try..except in a loop? Something like this, in pseudo code:
while True:
    try:
        iterate over client.watch(TIMEOUT)
    except httpx.TimeoutException:
        continue
With this solution, I wonder if we could get some additional benefits:
- we would probably have less log output from the rock compared to restarting the service every 4 minutes or so
- we keep the ability to properly crash the service if something goes wrong elsewhere, whereas in the original solution we would retry even when the issue is not related to the AKS timeout
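For illustration only, a self-contained version of this idea could look roughly like the sketch below. The client, the watch() signature, and the handler are placeholders standing in for whatever the rock actually uses, not its real code:

import httpx

TIMEOUT = 30  # seconds; matches the PR's default


def handle(event):
    # Placeholder handler; the real watcher reacts to service-account changes.
    print(event)


def watch_forever(client, resource):
    # Outer retry loop: re-establish the watch whenever the HTTP request
    # times out, but let any other exception propagate so the process
    # crashes and Pebble's on-failure: restart can kick in.
    while True:
        try:
            # NOTE: the exact watch() signature here is an assumption;
            # it stands in for the client the rock actually uses.
            for event in client.watch(resource, timeout=TIMEOUT):
                handle(event)
        except httpx.TimeoutException:
            # The server-side idle timeout closed the connection;
            # just restart the watch and keep going.
            continue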
The while True loop is already implemented inside the client.watch function, and we are not restarting the service every 4 minutes. Even if the timeout happens, control is still inside the loop inside client.watch, which effectively means that our process (monitor_sa.py) will not exit and restart every timeout seconds.
As for the service being killed or stopped for any reason, we would still want it to be restarted, because the integration-hub pebble service should be "always active" in order to watch the service accounts added to the cluster.
Also, regarding the log dumps: since control is still inside client.watch during a timeout, the library is wise enough to continue with the same resource watcher, which means it will ignore the resource change events that were already watched, and thus the logs from the outer process monitor_sa.py will not be re-triggered.
The on-failure: restart is there as the final fallback mechanism if everything else fails, such that the pebble service would restart again.
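To illustrate why no events are replayed: a watch implementation of this kind typically tracks the last seen resourceVersion and resumes from it after a timeout. A very rough sketch of such an internal retry loop (not the library's actual code; list_events is a made-up helper passed in for the example):

def watch(list_events, timeout):
    # Simplified sketch of the retry loop a K8s client library keeps
    # inside its watch() implementation (illustrative only).
    resource_version = None
    while True:
        try:
            # Re-list from the last seen resourceVersion, so events that
            # were already delivered are not replayed after a timeout.
            for event in list_events(resource_version=resource_version,
                                     timeout=timeout):
                resource_version = event["object"]["metadata"]["resourceVersion"]
                yield event
        except TimeoutError:
            # Timed out waiting for new events: reconnect and resume
            # from the same resourceVersion instead of raising.
            continue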
Oh ok, I had a look at the lib implementation and I understand much better now. I thought that the two lines changed in rockcraft.yaml were the only "retry mechanism".
Let me approve then.
There's a bug in the Spark Integration Hub rock that makes the K8s resource watcher created by the client hang while waiting for the response. Details about the bug here: canonical/spark-k8s-bundle#72.
This PR adds an extra param for the Python process that runs the resource watcher, so that it stops and reruns after timeout seconds have elapsed (30 seconds by default).
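For context, one way such a timeout parameter could be exposed on the watcher process is as a command-line flag with a 30-second default. The flag name and the internals below are illustrative assumptions, not necessarily how monitor_sa.py implements it:

import argparse


def run_watcher(timeout: int) -> None:
    # Placeholder for the actual watch loop; see the retry sketch in the
    # review thread above.
    print(f"watching service accounts, retrying hung requests every {timeout}s")


def main() -> None:
    parser = argparse.ArgumentParser(description="Watch service accounts")
    parser.add_argument("--timeout", type=int, default=30,
                        help="seconds after which a hung watch request is dropped and retried")
    args = parser.parse_args()
    run_watcher(timeout=args.timeout)


if __name__ == "__main__":
    main()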