
fix(backend): handle client side HTTP timeouts to fix crashes of metadata-writer. Fixes #8200 (#11361)

Open · wants to merge 4 commits into master
Conversation


@OutSorcerer commented Nov 6, 2024

Description of your changes:

  • Handles client-side timeouts (urllib3.exceptions.ReadTimeoutError) of k8s_watch.stream to prevent crashes of the metadata-writer pod. Without this, the metadata-writer pod crashes whenever a connection error causes a client timeout. This should fix [backend] Metadata writer pod always restarting #8200.
  • Adjusts the client-side timeout according to the recommendations of the Kubernetes Python client authors:

    It is recommended to set this [server] timeout value to a higher number such as 3600 seconds (1 hour).
    It is recommended to set this [client] timeout value to a lower number (for eg. ~ maybe 60 seconds).

    • The benefit for the metadata-writer use case: currently, with a client-side timeout of 2000 seconds (33.3 minutes), a connection error during that period leaves metadata-writer doing nothing for up to 33.3 minutes. After this change, metadata-writer is disconnected for at most the new client timeout, 60 seconds.
    • Since the client timeout is now smaller than the server timeout, a ReadTimeoutError is thrown and caught whenever no events occur within the client timeout period, even in the absence of errors.
  • Makes the client-side and server-side timeouts configurable via environment variables (see the sketch after this list).
  • Updates the Kubernetes Python client to the latest version.
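
For context, a minimal sketch of how a watch stream can wire up the two timeouts, using the Kubernetes Python client's timeout_seconds parameter (server-side watch timeout) and _request_timeout (client-side urllib3 read timeout). The environment variable names and defaults below are hypothetical illustrations, not necessarily the ones introduced by this PR.

    import os
    from kubernetes import client, config, watch

    # Hypothetical variable names and defaults, for illustration only.
    server_timeout = int(os.environ.get('K8S_WATCH_SERVER_TIMEOUT_SECONDS', '3600'))
    client_timeout = int(os.environ.get('K8S_WATCH_CLIENT_TIMEOUT_SECONDS', '60'))

    config.load_incluster_config()  # or config.load_kube_config() outside a cluster
    v1 = client.CoreV1Api()
    k8s_watch = watch.Watch()

    pod_stream = k8s_watch.stream(
        v1.list_namespaced_pod,
        namespace='kubeflow',
        timeout_seconds=server_timeout,   # server ends the watch after this
        _request_timeout=client_timeout,  # client-side (urllib3) read timeout
    )
    # pod_stream is then consumed in a loop; see the review discussion below
    # for how the two timeout cases are handled.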

Checklist:

… adjust timeouts according to the recommendations of the Kubernetes client authors.

Signed-off-by: Evgeniy Mamchenko <evgeniy@deevio.ai>

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ark-kun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Hi @OutSorcerer. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions bot added the ci-passed label Nov 6, 2024
@thesuperzapper (Member) left a comment


@OutSorcerer just want to clarify a few things.

    except Exception as e:
        import traceback
        print(traceback.format_exc())
    # Server side timeout, continue watching.
Member

What is this line meant to be a comment for?

Author

This comment marks the place in the code that is only reached when the server-side timeout occurs; it was not meant to comment on a particular statement.

Should I remove it?


Maybe for clarity we can explain in the comment that the "continue[d] watching" occurs by getting a new stream on the next iteration of the while loop — that way it is clear the comment is referring to the logical flow rather than a particular statement.

Contributor

That comment indeed seems misplaced.
@OutSorcerer isn't it possible to handle the server-side exception the same way you did for the client-side one and keep the original exception handling (except Exception as e:)?
Something like:

    try:
        ...
    except <Server-side error> as e:
        # Server side timeout, continue watching.
        pass
    except urllib3.exceptions.ReadTimeoutError as e:
        # Client side timeout, continue watching.
        pass
    except Exception as e:
        import traceback
        print(traceback.format_exc())

Author

But the behaviour in case of a server-side timeout is that the for event in pod_stream loop just finishes normally (because the underlying HTTP request to the Kubernetes API also finishes), so it cannot be handled by try-except.
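
To make the distinction concrete, here is a self-contained toy sketch (not the PR's actual code) with the watch stream replaced by a fake generator: a server-side timeout ends the stream normally, so the for loop simply finishes, while a client-side timeout surfaces as a ReadTimeoutError that can be caught.

    import urllib3

    def fake_stream(raise_client_timeout):
        # Stand-in for k8s_watch.stream(...), for illustration only.
        yield {'type': 'ADDED'}
        if raise_client_timeout:
            # This is what urllib3 raises when the client-side read timeout fires.
            raise urllib3.exceptions.ReadTimeoutError(None, None, 'Read timed out.')
        # Otherwise the generator just returns, like a stream closed by the
        # server-side timeout, and the for loop below exits normally.

    for raise_client_timeout in (False, True):
        pod_stream = fake_stream(raise_client_timeout)
        try:
            for event in pod_stream:
                print('got event:', event)
            # Reached only on a normal end of stream, i.e. a server-side timeout.
            print('server-side timeout: continue watching')
        except urllib3.exceptions.ReadTimeoutError:
            print('client-side timeout: continue watching')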

Contributor

I see. Thanks for clarifying @OutSorcerer.
Then how about the following?

Suggested change

    - # Server side timeout, continue watching.
    + # If the for loop ended, a server-side timeout occurred. Continue watching.
    + pass

Author

OK, I see: that way it will be clear which line the comment belongs to. I made a new commit that adds pass under the comment.

Author

Updated the comment text to "If the for loop ended, a server-side timeout occurred. Continue watching." as well.

Comment on lines +167 to +168

    try:
        for event in pod_stream:
Member

I am trying to understand why we need to retry the entire iterator on every error, and thus create a new one?

Or does the iterator returned by pod_stream become "poisoned" when it fails, so calling __next__ on it will never return a new item in the stream?

Author

Or does the iterator returned by pod_stream become "poisoned" when it fails, so calling __next__ on it will never return a new item in the stream?

As I understand, yes, this is what happens here.

In the case of a network error causing a client timeout, this led to an unhandled exception in metadata-writer, so Kubernetes was restarting it and increasing the restart counter.

In the stack trace the error was happening on this line:

Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 163, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 857, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 449, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='34.118.224.1', port=443): Read timed out.
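
This also matches Python generator semantics in general: once an exception propagates out of a generator, the generator is finished, and further next() calls raise StopIteration instead of resuming, so a new stream has to be created. A toy demonstration of just that mechanism (unrelated to the Kubernetes client):

    def gen():
        yield 1
        raise RuntimeError('boom')
        yield 2  # never reached

    g = gen()
    print(next(g))  # 1
    try:
        next(g)     # raises RuntimeError out of the generator
    except RuntimeError:
        pass
    try:
        next(g)     # the generator is now exhausted ("poisoned")
    except StopIteration:
        print('generator is done; a fresh one is needed to keep iterating')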

        # Client side timeout, continue watching.
        pass

    except Exception as e:
Member

I know we were already catching all these errors before, but I am struggling to see why catching Exception won't sometimes get us stuck in a state where we never actually crash.

Especially because you proposed above that we have this try around the for loop, meaning any iteration errors will endlessly retry on a pod_stream that may never work?

Author
@OutSorcerer commented Nov 7, 2024

I originally wanted to keep the changes small.

But in my understanding, removing this catch-all except clause improves the code. Unhandled exceptions will still be printed to the console, and they will no longer be hidden from a user. A user will see that the restart counter increases and will be able to get the logs with a command like kubectl -n kubeflow logs --previous metadata-writer-6d5b8456-78265. Kubernetes will restart the metadata-writer pod, and it will continue handling events if that is possible.

So I made a new commit that removes the catch-all except clause.

@hbelmiro (Contributor) commented Nov 7, 2024

/ok-to-test

… users.

Signed-off-by: Evgeniy Mamchenko <evgeniy@deevio.ai>
@google-oss-prow bot added the size/S label and removed the size/M label Nov 7, 2024
Signed-off-by: Evgeniy Mamchenko <evgeniy@deevio.ai>
Signed-off-by: Evgeniy Mamchenko <evgeniy@deevio.ai>
@hbelmiro (Contributor) left a comment


/lgtm

Labels
ci-passed · lgtm · ok-to-test · size/S

Development

Successfully merging this pull request may close these issues.

[backend] Metadata writer pod always restarting

4 participants