Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add async observability to QuartzJob #461

Merged
merged 17 commits into from
Feb 1, 2024
Merged

Conversation

ashanhol
Copy link
Contributor

@ashanhol ashanhol commented Jan 18, 2024

Added Observability to QuartzJob.execute denoting running, failure, and succeed states. The DefaultMeterObservationHandler gives us a built in Timer, Counter, and LongTaskTimer.

Here's how it reports in Prometheus.

# HELP wds_job_execute_job_failed_total
# TYPE wds_job_execute_job_failed_total counter
wds_job_execute_job_failed_total{jobType="TdrManifestQuartzJob",wds_version="0.2.113-SNAPSHOT",} 1.0
# HELP wds_job_execute_job_running_total
# TYPE wds_job_execute_job_running_total counter
wds_job_execute_job_running_total{jobType="TdrManifestQuartzJob",wds_version="0.2.113-SNAPSHOT",} 1.0
# HELP wds_job_execute_seconds
# TYPE wds_job_execute_seconds summary
wds_job_execute_seconds_count{error="JobExecutionException",jobType="TdrManifestQuartzJob",wds_version="0.2.113-SNAPSHOT",} 1.0
wds_job_execute_seconds_sum{error="JobExecutionException",jobType="TdrManifestQuartzJob",wds_version="0.2.113-SNAPSHOT",} 0.522719647
# HELP wds_job_execute_seconds_max
# TYPE wds_job_execute_seconds_max gauge
wds_job_execute_seconds_max{error="JobExecutionException",jobType="TdrManifestQuartzJob",wds_version="0.2.113-SNAPSHOT",} 0.522719647
# HELP wds_job_execute_active_seconds
# TYPE wds_job_execute_active_seconds summary
wds_job_execute_active_seconds_active_count{wds_version="0.2.113-SNAPSHOT",} 1.0
wds_job_execute_active_seconds_duration_sum{wds_version="0.2.113-SNAPSHOT",} 3.752484771
# HELP wds_job_execute_active_seconds_max
# TYPE wds_job_execute_active_seconds_max gauge
wds_job_execute_active_seconds_max{wds_version="0.2.113-SNAPSHOT",} 3.752493404

Reminder:

PRs merged into main will not automatically generate a PR in https://github.com/broadinstitute/terra-helmfile to update the WDS image deployed to kubernetes - this action will need to be triggered manually by running the following github action: https://github.com/DataBiosphere/terra-workspace-data-service/actions/workflows/tag.yml. Dont forget to provide a Jira Id when triggering the manual action, if no Jira ID is provided the action will not fully succeed.

After you manually trigger the github action (and it completes with no errors), you must go to the terra-helmfile repo and verify that this generated a PR that merged successfully.

The terra-helmfile PR merge will then generate a PR in leonardo. This will automerge if all tests pass, but if jenkins tests fail it will not; be sure to watch it to ensure it merges. To trigger jenkins retest simply comment on PR with "jenkins retest".

service/build.gradle Outdated Show resolved Hide resolved
Copy link
Contributor

@jladieu jladieu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few comments/suggestions but nothing showstopper. LGTM with pre-emptive approval!

@jladieu jladieu mentioned this pull request Jan 19, 2024
@ashanhol ashanhol marked this pull request as ready for review January 20, 2024 00:58
Copy link
Collaborator

@davidangb davidangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one question inline, otherwise looking good!

I see this is failing unit tests; it needs some updates for that.

Can you write a test to ensure that QuartzJob is properly recording metrics? If you pass a @MockBean ObservationRegistry observationRegistry (or a @SpyBean), can you verify in the mock/spy that it's doing the right thing?

This intends to make the default `ObservationRegistry` automatically
wired up to the same `MeterRegistry` provided by springboot.
@jladieu
Copy link
Contributor

jladieu commented Jan 22, 2024

@ashanhol I saw your question about whether we might need to customize DefaultMeterObservationHandler and I left some suggestions.

Regarding your question in Slack about whether/how to test this, I think if you merge the PR mentioned here you should be able to adjust QuartzJobTest to use an @Autowired MeterRegistry and then check meters in a test as described in this StackOverflow: https://stackoverflow.com/a/74182551

Copy link
Collaborator

@davidangb davidangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments inline, mostly about tests. Good stuff in this PR!

Observation.start("wds.job.execute", observationRegistry)
.contextualName("job-execution")
.lowCardinalityKeyValue("jobType", getClass().getSimpleName())
.highCardinalityKeyValue("jobId", jobId.toString());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How feasible is it to start an Observation in a separate class - and a separate Java thread - and update the Observation here?

We queue the job over in ImportService.java:

// schedule the job. after successfully scheduling, mark the job as queued

and if we wanted to track timing between QUEUED and RUNNING status, we'd need to start the Observation over there.

I am totally fine answering this question sometime after this PR merges. What's in this PR as-is is useful and adds value, so it's worth it on its own.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call @davidangb this is actually the spirit behind the measurement, which is to know how long the customer has to wait, so I'm in favor of this. Ok to defer to another PR though if you just want to land this one @ashanhol

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually had this exact thought when doing this, but couldn't quite figure out how to "pass" an observation from one class to another. I'm happy to make a ticket to think about it further.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.hasBeenStarted()
.hasBeenStopped();
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a test for a failing job that looks for .hasLowCardinalityKeyValue("outcome", "failure")? You probably need to modify TestableQuartzJob or create something new that extends QuartzJob but throws an error in executeInternal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I'm losing my mind here. In this current test implementation I have the execute function wrapped in a assertThrows, and it fails because it says it doesn't throw anything. But if I get rid of assertThrows it totally does. I feel like I'm missing something obvious

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK @jladieu and I figured this out, and it's confusing. We have two JobExecutionExceptions, one from Quartz and one we define. The one we define extends RuntimeException and therefore doesn't need to be caught/handled, but the quartz one does. QuartzJob.execute throws a org.quartz.JobExecutionException (and therefore needs the throws org.quartz.JobExecutionException in the method signature), but all our custom jobs throw various RuntimeException ones. So I had the right idea having the TestQuartzJob throw our custom one but it made writing this test VERY confusing

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this explanation helps to explain …

The typical data flow is:

  • Quartz looks for an execute() method and runs that method when it kicks off a job
  • in the WDS codebase, we define the execute() method in QuartzJob.
  • QuartzJob.execute() wraps and calls the executeInternal() method defined in subclasses such as TdrManifestQuartzJob or PfbQuartzJob
  • As part of its wrapping, QuartzJob.execute() !!catches any exceptions thrown by executeInternal!!. On catch, it marks the job as failed and swallows the exception so Quartz can mark its job as complete.

In this test:

  • the test defines a TestableQuartzJob subclass of QuartzJob
  • the test throws an exception !!from executeInternal()!!
  • the QuartzJob.execute() wrapper properly catches the error thrown from executeInternal(), marks the job as failed, registers metrics, and swallows the error.

So, it's expected that when this test calls TestableQuartzJob(...).execute, it will not throw an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I figured it out eventually. I personally didn't know that RuntimeException had different behavior than regular Exception and that required some compiler fighting and extra confusion with the namespaces.

Copy link
Collaborator

@davidangb davidangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see #475 for my requested changes to prevent double-counting of metrics!

davidangb and others added 2 commits January 29, 2024 12:54
observability tweaks to remove observationRegistry bean and add testobservationRegistry
Copy link
Collaborator

@davidangb davidangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one nit/suggestion inline, giving this a proactive 👍

ashanhol and others added 2 commits February 1, 2024 09:44
Co-authored-by: David An <davidan@broadinstitute.org>
Copy link

sonarcloud bot commented Feb 1, 2024

@ashanhol ashanhol merged commit a2554ee into main Feb 1, 2024
14 checks passed
@ashanhol ashanhol deleted the adinas/AsyncObservability branch February 1, 2024 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants