
fix(backend): handle client side HTTP timeouts to fix crashes of metadata-writer. Fixes #8200 (#11361)

Open · wants to merge 4 commits into master
Conversation


@OutSorcerer commented Nov 6, 2024

Description of your changes:

  • Handles client-side timeouts (urllib3.exceptions.ReadTimeoutError) of k8s_watch.stream to prevent crashes of the metadata-writer pod. Without this, the metadata-writer pod crashes whenever a connection error causes a client timeout. This should fix [backend] Metadata writer pod always restarting #8200.
  • Adjusts the client-side timeout according to the recommendations of the Kubernetes Python client authors:

    It is recommended to set this [server] timeout value to a higher number such as 3600 seconds (1 hour).
    It is recommended to set this [client] timeout value to a lower number (for eg. ~ maybe 60 seconds).

    • The benefit for the metadata-writer use case: currently, with a client-side timeout of 2000 seconds (33.3 minutes), a connection error during that period leaves metadata-writer doing nothing for up to 33.3 minutes. After this change, metadata-writer is disconnected for at most the new client timeout, 60 seconds.
    • Since the client timeout is now smaller than the server timeout, a ReadTimeoutError is thrown and caught whenever no events occur within the client timeout period, even in the absence of errors.
  • Makes the client-side and server-side timeouts configurable via environment variables (see the sketch after this list).
  • Updates the Kubernetes Python client to the latest version.
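
For context, a minimal sketch of how a watch stream can wire up the two timeouts, using the Kubernetes Python client's timeout_seconds parameter (server-side watch timeout) and _request_timeout (client-side urllib3 read timeout). The environment variable names and defaults below are hypothetical illustrations, not necessarily the ones introduced by this PR.

    import os
    from kubernetes import client, config, watch

    # Hypothetical variable names and defaults, for illustration only.
    server_timeout = int(os.environ.get('K8S_WATCH_SERVER_TIMEOUT_SECONDS', '3600'))
    client_timeout = int(os.environ.get('K8S_WATCH_CLIENT_TIMEOUT_SECONDS', '60'))

    config.load_incluster_config()  # or config.load_kube_config() outside a cluster
    v1 = client.CoreV1Api()
    k8s_watch = watch.Watch()

    pod_stream = k8s_watch.stream(
        v1.list_namespaced_pod,
        namespace='kubeflow',
        timeout_seconds=server_timeout,   # server ends the watch after this
        _request_timeout=client_timeout,  # client-side (urllib3) read timeout
    )
    # pod_stream is then consumed in a loop; see the review discussion below
    # for how the two timeout cases are handled.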

Checklist:

… adjust timeouts according to the recommendations of the Kubernetes client authors.

Signed-off-by: Evgeniy Mamchenko <evgeniy@deevio.ai>

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ark-kun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Hi @OutSorcerer. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions bot added the ci-passed label Nov 6, 2024
@thesuperzapper (Member) left a comment


@OutSorcerer just want to clarify a few things.

    except Exception as e:
        import traceback
        print(traceback.format_exc())
    # Server side timeout, continue watching.
Member

What is this line meant to be a comment for?

Author

This comment marks the place in the code that is only reached when the server-side timeout occurs; it was not meant to comment on a particular statement.

Should I remove it?


Maybe for clarity we can explain in the comment that the "continue[d] watching" occurs by getting a new stream on the next iteration of the while loop — that way it is clear the comment is referring to the logical flow rather than a particular statement.

Contributor

That comment indeed seems misplaced.
@OutSorcerer isn't it possible to handle the server-side exception the same way you did for the client-side one and keep the original exception handling (except Exception as e:)?
Something like:

    try:
        ...
    except <Server-side error> as e:
        # Server side timeout, continue watching.
        pass
    except urllib3.exceptions.ReadTimeoutError as e:
        # Client side timeout, continue watching.
        pass
    except Exception as e:
        import traceback
        print(traceback.format_exc())

Author

But the behaviour in case of a server-side timeout is that the for event in pod_stream loop just finishes normally (because the underlying HTTP request to the Kubernetes API also finishes), so it cannot be handled by try-except.
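
To make the distinction concrete, here is a self-contained toy sketch (not the PR's actual code) with the watch stream replaced by a fake generator: a server-side timeout ends the stream normally, so the for loop simply finishes, while a client-side timeout surfaces as a ReadTimeoutError that can be caught.

    import urllib3

    def fake_stream(raise_client_timeout):
        # Stand-in for k8s_watch.stream(...), for illustration only.
        yield {'type': 'ADDED'}
        if raise_client_timeout:
            # This is what urllib3 raises when the client-side read timeout fires.
            raise urllib3.exceptions.ReadTimeoutError(None, None, 'Read timed out.')
        # Otherwise the generator just returns, like a stream closed by the
        # server-side timeout, and the for loop below exits normally.

    for raise_client_timeout in (False, True):
        pod_stream = fake_stream(raise_client_timeout)
        try:
            for event in pod_stream:
                print('got event:', event)
            # Reached only on a normal end of stream, i.e. a server-side timeout.
            print('server-side timeout: continue watching')
        except urllib3.exceptions.ReadTimeoutError:
            print('client-side timeout: continue watching')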

Contributor

I see. Thanks for clarifying @OutSorcerer.
Then how about the following?

Suggested change

    - # Server side timeout, continue watching.
    + # If the for loop ended, a server-side timeout occurred. Continue watching.
    + pass

Author

OK, I see: that way it will be clear which line the comment belongs to. I made a new commit that adds pass under the comment.

Author

Updated the comment text to "If the for loop ended, a server-side timeout occurred. Continue watching." as well.

Comment on lines +167 to +168

    try:
        for event in pod_stream:
Member

I am trying to understand why we need to retry the entire iterator on every error, and thus create a new one?

Or does the iterator returned by pod_stream become "poisoned" when it fails, so calling __next__ on it will never return a new item in the stream?

Author

Or does the iterator returned by pod_stream become "poisoned" when it fails, so calling __next__ on it will never return a new item in the stream?

As I understand, yes, this is what happens here.

In the case of a network error causing a client timeout, this led to an unhandled exception in metadata-writer, so Kubernetes was restarting it and increasing the restart counter.

In the stack trace the error was happening on this line:

Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 163, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 857, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 449, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='34.118.224.1', port=443): Read timed out.
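
This also matches Python generator semantics in general: once an exception propagates out of a generator, the generator is finished, and further next() calls raise StopIteration instead of resuming, so a new stream has to be created. A toy demonstration of just that mechanism (unrelated to the Kubernetes client):

    def gen():
        yield 1
        raise RuntimeError('boom')
        yield 2  # never reached

    g = gen()
    print(next(g))  # 1
    try:
        next(g)     # raises RuntimeError out of the generator
    except RuntimeError:
        pass
    try:
        next(g)     # the generator is now exhausted ("poisoned")
    except StopIteration:
        print('generator is done; a fresh one is needed to keep iterating')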

        # Client side timeout, continue watching.
        pass

    except Exception as e:
Member

I know we were already catching all these errors before, but I am struggling to see why catching Exception won't sometimes get us stuck in a state where we never actually crash.

Especially because you proposed above that we have this try around the for loop, meaning any iteration errors will endlessly retry on a pod_stream that may never work?

Author
@OutSorcerer commented Nov 7, 2024

I originally wanted to keep the changes small.

But in my understanding, removing this catch-all except clause improves the code. Unhandled exceptions will still be printed to the console, and they will no longer be hidden from a user. A user will see that the restart counter increases and will be able to get the logs with a command like kubectl -n kubeflow logs --previous metadata-writer-6d5b8456-78265. Kubernetes will restart the metadata-writer pod, and it will continue handling events if that is possible.

So I made a new commit that removes the catch-all except clause.

@hbelmiro (Contributor) commented Nov 7, 2024

/ok-to-test

… users.

Signed-off-by: Evgeniy Mamchenko <evgeniy@deevio.ai>
@google-oss-prow bot added the size/S label and removed the size/M label Nov 7, 2024
Signed-off-by: Evgeniy Mamchenko <evgeniy@deevio.ai>
Signed-off-by: Evgeniy Mamchenko <evgeniy@deevio.ai>
@hbelmiro (Contributor) left a comment


/lgtm

Labels
ci-passed · lgtm · ok-to-test · size/S

Development

Successfully merging this pull request may close these issues.

[backend] Metadata writer pod always restarting

4 participants