
Resiliency for ArmadaOperator services calls #2591

Closed · wants to merge 14 commits

Conversation

@pavan12395 (Contributor) commented Jun 20, 2023

This Pull Request makes the following changes:

Fixes issue #2483

  • Adds a backoff retry algorithm for ArmadaClient & JobServiceClient (a rough sketch of the idea follows this list)
  • Contains changes for deploying JobService as a Deployment rather than a StatefulSet
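For context, a minimal sketch of the kind of backoff retry described above, assuming tenacity's retry decorator wrapped around a gRPC call; the function and parameter values below are illustrative placeholders, not the code in this PR:

```python
import grpc
import tenacity


# Illustrative placeholders only: get_job_status and its arguments are not
# identifiers from this PR.
@tenacity.retry(
    retry=tenacity.retry_if_exception_type(grpc.RpcError),  # retry only on gRPC errors
    wait=tenacity.wait_exponential(multiplier=1, max=30),   # exponential backoff, capped at 30s
    stop=tenacity.stop_after_attempt(5),                    # give up after 5 attempts
    reraise=True,                                           # re-raise the last error on failure
)
def get_job_status(stub, request):
    """tenacity re-invokes this call whenever it raises grpc.RpcError."""
    return stub.GetJobStatus(request)
```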

┆Issue is synchronized with this Jira Task by Unito

@pavan12395 changed the title from "added backoff-retry-algorithm_for_clients" to "Resiliency for ArmadaOperator services calls" on Jun 20, 2023
codecov bot commented Jun 20, 2023

Codecov Report

Patch coverage has no change and project coverage change: -24.41% ⚠️

Comparison is base (f2cdeab) 58.78% compared to head (0e46fe0) 34.38%.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2591       +/-   ##
===========================================
- Coverage   58.78%   34.38%   -24.41%     
===========================================
  Files         238      112      -126     
  Lines       30568     3467    -27101     
  Branches        0      470      +470     
===========================================
- Hits        17970     1192    -16778     
+ Misses      11223     2192     -9031     
+ Partials     1375       83     -1292     
Flag            Coverage Δ
armada-server   ?
unittests       34.38% <ø> (?)

Flags with carried forward coverage won't be shown.

see 350 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@@ -21,6 +21,7 @@
from armada_client.k8s.io.api.core.v1 import generated_pb2 as core_v1
from armada_client.permissions import Permissions
from armada_client.typings import JobState
import tenacity
Contributor

I think you will need to add tenacity to our pyproject.toml file as this is a new dependency.

You will also need to run the Python code formatter (Black, via tox -e format-code) to get your code to pass CI.

Contributor Author

Done!

Contributor

Can you remove tenacity from the client? Not sure I like this change being enforced on all users of the client.

I'd leave the change in jobservice. Can you also experiment with deploying jobservice with multiple replicas?

Contributor Author

I didn't quite follow, but in the comments on the issue it was mentioned that we need to include backoff for all the clients (ArmadaClient & JobServiceClient).

Contributor

Thanks for removing the client side. The main user of this is jobservice, and I'm not sure we want to add retries for all users of the client.

Contributor Author

Hey @kannon92, I added the test for tenacity.

kannon92 previously approved these changes Jun 26, 2023

@kannon92 (Contributor) left a comment

The code looks good. Maybe you can work on having replicas for jobservice in a separate PR?

@ClifHouck (Contributor) left a comment

We need to discuss and decide whether we want the behavior of JobServiceClient and JobServiceAsyncIOClient to diverge significantly, as that's the effect of this PR as it stands. I haven't looked at tenacity yet. Does it support asyncio retries as well? If so, can you add this to JobServiceAsyncIOClient too?

@pavan12395 (Contributor Author)

Hey @ClifHouck, I found out that tenacity works fine with async functions as well, so I added the functionality for JobServiceAsyncIOClient too.
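As a rough illustration of what tenacity's async support looks like (a sketch with placeholder names, not the actual JobServiceAsyncIOClient change): the same retry decorator can be applied to a coroutine, since tenacity detects async functions and awaits them between attempts.

```python
import grpc
import tenacity


# Placeholder coroutine, not the real JobServiceAsyncIOClient method.
@tenacity.retry(
    retry=tenacity.retry_if_exception_type(grpc.RpcError),
    wait=tenacity.wait_exponential(multiplier=1, max=30),
    stop=tenacity.stop_after_attempt(5),
    reraise=True,
)
async def health(aio_stub, request):
    # grpc.aio errors (AioRpcError) subclass grpc.RpcError, so the same
    # retry predicate applies to the asyncio stub.
    return await aio_stub.Health(request)
```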

@@ -18,6 +20,11 @@ class JobServiceClient:
    def __init__(self, channel):
        self.job_stub = jobservice_pb2_grpc.JobServiceStub(channel)

    @tenacity.retry(
Contributor

I'd love some kind of test that can verify that this library works with gRPC.

Can you look into adding a unit test? You could use our mock Python gRPC class for jobservice.

Otherwise this is looking really good!

Contributor Author

The test should check whether tenacity makes retries or not, right?

Contributor

Yep.
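For illustration, a minimal sketch of such a test, using a hand-rolled flaky stub rather than the project's actual mock gRPC class; all names below are hypothetical:

```python
import grpc
import tenacity


class FlakyStub:
    """Hypothetical stand-in for a jobservice stub: fails twice, then succeeds."""

    def __init__(self):
        self.calls = 0

    def Health(self, request):
        self.calls += 1
        if self.calls < 3:
            raise grpc.RpcError("transient failure")
        return "SERVING"


@tenacity.retry(
    retry=tenacity.retry_if_exception_type(grpc.RpcError),
    wait=tenacity.wait_fixed(0),          # no real sleeping inside the test
    stop=tenacity.stop_after_attempt(5),
    reraise=True,
)
def call_health(stub):
    return stub.Health(request=None)


def test_retries_until_success():
    stub = FlakyStub()
    assert call_health(stub) == "SERVING"
    assert stub.calls == 3                # two failures, then one success
```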

@kannon92 (Contributor) commented Jul 7, 2023

Hey @pavan12395, we got a request from our user for this feature to go a different way. We are going to close this PR.

Nice work on it so far! We decided to go with gRPC retries rather than tenacity.
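For reference, gRPC's built-in retries are configured through a channel-level service config rather than a decorator. A minimal sketch, with a hypothetical service name, address, and example retry values (not the configuration the project ultimately adopted):

```python
import json

import grpc

service_config = json.dumps({
    "methodConfig": [{
        # Hypothetical fully-qualified service name; an empty name entry ({})
        # would apply the policy to every method on the channel.
        "name": [{"service": "jobservice.JobService"}],
        "retryPolicy": {
            "maxAttempts": 5,
            "initialBackoff": "0.1s",
            "maxBackoff": "10s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE"],
        },
    }]
})

channel = grpc.insecure_channel(
    "localhost:60003",  # placeholder address
    options=[
        ("grpc.enable_retries", 1),
        ("grpc.service_config", service_config),
    ],
)
```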

@kannon92 closed this on Jul 7, 2023