
[DPE-6338] Add timeout for client watch requests. #10

Open
wants to merge 1 commit into base: 3.4-22.04/edge
Conversation

@theoctober19th theoctober19th (Member) commented Jan 21, 2025

There's a bug in the Spark Integration Hub rock that causes the K8s resource watcher created by the client to hang while waiting for the response. Details about the bug here: canonical/spark-k8s-bundle#72.

This PR adds an extra parameter so that the Python process running the resource watcher stops and reruns after timeout seconds have elapsed (30 seconds by default).
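For illustration, here is a minimal sketch of how such a timeout parameter could be plumbed into the watcher script. It assumes the watcher is built on lightkube (whose httpx transport matches the httpx.TimeoutException discussed below); the --timeout flag and the handle_event helper are hypothetical, and only the monitor_sa.py name and the 30-second default come from this PR.

import argparse

import httpx
from lightkube import Client
from lightkube.resources.core_v1 import ServiceAccount


def handle_event(op, sa):
    # Hypothetical handler; the real monitor_sa.py does more than log.
    print(f"{op}: {sa.metadata.namespace}/{sa.metadata.name}")


def main():
    parser = argparse.ArgumentParser()
    # Hypothetical flag: stop waiting on a silent watch response after this long.
    parser.add_argument("--timeout", type=float, default=30.0,
                        help="read timeout (seconds) for the watch request")
    args = parser.parse_args()

    # Assumption: lightkube passes this httpx.Timeout on to its HTTP client,
    # so a stalled watch request raises a timeout instead of hanging forever.
    client = Client(timeout=httpx.Timeout(5.0, read=args.timeout))
    for op, sa in client.watch(ServiceAccount):
        handle_event(op, sa)


if __name__ == "__main__":
    main()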

Comment on lines +28 to +29
on-success: restart
on-failure: restart

@Batalex Batalex Jan 21, 2025


question: Have you considered wrapping the original piece of code in a try..except inside a loop? Something like this, in pseudocode:

while True:
    try:
        for event in client.watch(TIMEOUT):
            ...  # process the event
    except httpx.TimeoutException:
        continue

With this solution, I wonder if we could have some additional benefits:

  • we would probably get less log output from the rock compared to restarting the service every 4 minutes or so
  • we keep the ability to properly crash the service if something goes wrong elsewhere, whereas with the original solution we would retry even if the issue is not related to the AKS timeout

@theoctober19th theoctober19th (Member, Author) Jan 21, 2025


The while True loop is already implemented inside the client.watch function, and we are not restarting the service every 4 minutes. Even if the timeout happens, control is still inside the loop within client.watch, which effectively means that our process (monitor_sa.py) will not exit and restart every timeout seconds.

As for the service being killed or stopped for any reason, we would still want it to be restarted, because the integration-hub pebble service should be "always active" in order to watch the service accounts added to the cluster.

Also, regarding the log dumps: since control is still inside client.watch during a timeout, the library is smart enough to continue with the same resource watcher, which means it will ignore resource change events that have already been seen, and thus the logs from the outer process monitor_sa.py will not be re-triggered.
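(For illustration only, a rough sketch of the pattern being described: a watch loop that survives read timeouts by reconnecting and resuming from the last seen resourceVersion, so already-delivered events are not replayed. This is a generic sketch of the technique, not lightkube's actual implementation; stream_events is a hypothetical callable standing in for the underlying HTTP call.)

import httpx


def watch_forever(stream_events, resource_version=None):
    # Generic pattern: reconnect on a read timeout and resume from the last
    # seen resourceVersion, so events already yielded are not replayed.
    while True:
        try:
            for event in stream_events(resource_version=resource_version):
                resource_version = event["object"]["metadata"]["resourceVersion"]
                yield event
        except httpx.TimeoutException:
            # e.g. AKS dropping an idle long-lived connection: just reconnect.
            continue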

The on-failure: restart is there as the ultimate fallback mechanism if everything else fails, so that the pebble service would restart again.

@Batalex Batalex

Oh ok, I had a look at the lib implementation and I understand it much better now. I thought that the two lines changed in rockcraft.yaml were the only "retry mechanism".
Let me approve then.
