
Ingest Terra metrics #478

Merged — 20 commits merged into develop on Jan 19, 2024
Conversation

@kathy-t (Contributor) commented Dec 12, 2023

Companion PRs:

PR Description

This PR ingests the metrics that Terra provided.

Approach:

  • It processes 100,000 rows at a time, processing each batch of rows in parallel. This reduced the processing time significantly.
    • For the entire data set that Terra gave us, it took 6m37s to ingest! This is a big decrease from my initial draft PR, due to grouping executions by source_url and sending executions with the same TRS ID and version in one request instead of one request per execution (see the batching sketch after the note below).
      Done processing 4979681 executions from Terra. Submitted 4323813 executions. Skipped 691263 executions.
      submit-terra-metrics took PT6M36.517129093S
      
    • It took 6m32s to aggregate all the Terra metrics ingested!
    • Unfortunately, from my testing, the webservice changes in Add update execution metrics endpoint dockstore#5778 that create a file for each execution greatly increase the time for this PR. Processing 200,000 executions without the changes in the linked PR takes 1 minute, but with those changes, it takes 32 minutes. Optimization efforts will have to be in that PR. EDIT: I will be reverting the changes in that PR that resulted in the performance decrease in metric ingestion. See the comment in that PR for more details.
  • There is a flag that records the executions that the program skipped and outputs them to a CSV file. This may be helpful for examining why they were skipped and possibly resubmitting them in the future if they were skipped due to an error on our part.
  • The metrics Terra provided do not have a TRS ID. Instead, they have a source_url pointing to the primary descriptor, like https://raw.githubusercontent.com/broadinstitute/gatk/ah_var_store/scripts/variantstore/wdl/GvsAoUReblockGvcf.wdl. We need to determine the TRS ID of the workflow that has this primary descriptor. The steps to determine this are (see the resolution sketch just after this list):
    • Get a list of published workflows whose workflow path begins with the same prefix. Using the source_url above, we would look for workflows with the path prefix github.com/broadinstitute/gatk.
    • For each published workflow, we get the primary descriptor for the version that is in the source_url. In the example, we're looking for primary descriptors for the version ah_var_store. We then check whether any of these workflows has a primary descriptor that matches the descriptor in the source_url.
    • Important: there is an ambiguous case that forces us to ignore the execution. If more than one workflow in the repository has the same primary descriptor path, the program skips the execution because it can't accurately determine which workflow the source_url is referring to. I've encountered this case in testing the data that was given to us.
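A rough, illustrative sketch of this resolution logic follows. The WorkflowLookup interface is a stand-in for the actual Dockstore API calls made by TerraMetricsSubmitter (only the trsId()/version() accessors of SourceUrlTrsInfo appear in this PR), and the naive URL parsing assumes the version name contains no slashes:

import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

record SourceUrlTrsInfo(String trsId, String version) {}

interface WorkflowLookup {
    // Published workflows whose workflow path starts with, e.g., "github.com/broadinstitute/gatk"
    List<String> publishedWorkflowsWithPathPrefix(String pathPrefix);

    // Primary descriptor path for a given workflow TRS ID and version, if that version exists
    Optional<String> primaryDescriptorPath(String trsId, String version);
}

class SourceUrlResolver {
    private final WorkflowLookup lookup;

    SourceUrlResolver(WorkflowLookup lookup) {
        this.lookup = lookup;
    }

    Optional<SourceUrlTrsInfo> resolve(String sourceUrl) {
        // e.g. https://raw.githubusercontent.com/broadinstitute/gatk/ah_var_store/scripts/variantstore/wdl/GvsAoUReblockGvcf.wdl
        String path = URI.create(sourceUrl).getPath().substring(1);   // broadinstitute/gatk/ah_var_store/scripts/...
        String[] parts = path.split("/", 3);                          // [org, repo, version + descriptor path]
        String pathPrefix = "github.com/" + parts[0] + "/" + parts[1];
        String[] versionAndDescriptor = parts[2].split("/", 2);
        String version = versionAndDescriptor[0];                     // ah_var_store
        String descriptorPath = "/" + versionAndDescriptor[1];        // /scripts/variantstore/wdl/GvsAoUReblockGvcf.wdl

        List<SourceUrlTrsInfo> matches = new ArrayList<>();
        for (String trsId : lookup.publishedWorkflowsWithPathPrefix(pathPrefix)) {
            Optional<String> primary = lookup.primaryDescriptorPath(trsId, version);
            if (primary.isPresent() && primary.get().equals(descriptorPath)) {
                matches.add(new SourceUrlTrsInfo(trsId, version));
            }
        }
        // Ambiguous case: more than one workflow in the repository has the same primary
        // descriptor path, so the execution is skipped (empty result).
        return matches.size() == 1 ? Optional.of(matches.get(0)) : Optional.empty();
    }
}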

Note that builds will fail until I update the Dockstore webservice version to a tag containing the corresponding webservice changes.
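For reference, here is a minimal sketch of the batching and grouping strategy described in the approach above: read 100,000 rows per batch, convert rows to executions in parallel, group the executions by source_url, and send one request per resolved TRS ID/version instead of one per execution. TerraRow, WorkflowExecution, TrsResolver, TrsCoordinates, and postExecutions are illustrative stand-ins, not the actual client code:

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

record TerraRow(String sourceUrl, String workflowStart, String status) {}
record WorkflowExecution(String sourceUrl) {}   // the real execution carries the metric fields too
record TrsCoordinates(String trsId, String version) {}

interface TrsResolver {
    Optional<TrsCoordinates> resolve(String sourceUrl);
}

class BatchSubmitter {
    static final int BATCH_SIZE = 100_000;   // rows read from the CSV per batch

    void submitBatch(List<TerraRow> batch, TrsResolver resolver) {
        // Convert rows to executions in parallel; invalid rows are dropped here
        // (and, with the flag enabled, recorded in the skipped-executions CSV).
        List<WorkflowExecution> executions = batch.parallelStream()
                .map(BatchSubmitter::toExecution)
                .flatMap(Optional::stream)
                .collect(Collectors.toList());

        // Group executions that share a source_url so each TRS ID/version gets a single request.
        Map<String, List<WorkflowExecution>> bySourceUrl = executions.stream()
                .collect(Collectors.groupingBy(WorkflowExecution::sourceUrl));

        bySourceUrl.forEach((sourceUrl, group) -> resolver.resolve(sourceUrl)
                .ifPresent(trs -> postExecutions(trs, group)));   // one POST per group
    }

    static Optional<WorkflowExecution> toExecution(TerraRow row) {
        // Validation of workflow_start, status, etc. would go here; return empty to skip the row.
        return Optional.of(new WorkflowExecution(row.sourceUrl()));
    }

    void postExecutions(TrsCoordinates trs, List<WorkflowExecution> group) {
        // Stand-in for posting the grouped executions (e.g. via extendedGa4GhApi.executionMetricsPost).
    }
}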

Draft PR Description

This is a draft PR for ingesting the Terra metrics that the Broad team provided. I'm looking for feedback on the approach - I still need to add a few tests, clean it up, add documentation, etc.

Approach:

  • It processes 100,000 rows at a time, processing each batch of rows in parallel. This helped reduce the processing time significantly.
    • For about 40,000 rows, processing sequentially took 17m48s. Processing in parallel in batches of 100,000 took 6m15s.
    • Time to process 1,000,000 rows: 1h14m29s
      14:10:28.436 [main] INFO  i.d.m.c.cli.MetricsAggregatorClient - Done processing 1000000 executions from Terra. Submitted 392737 executions. Skipped 607263 executions. Cache hits: 391638. Cache misses: 1939
      14:10:28.436 [main] INFO  i.d.m.c.cli.MetricsAggregatorClient - View skipped executions in file terra-metrics-1000000.csv_skipped_executions_20231212T175400Z.csv
      14:10:28.436 [main] INFO  i.d.m.c.cli.MetricsAggregatorClient - submit-terra-metrics took PT1H16M28.653129692S
      
      • Not sure why there are more skipped than submitted; will take a look. EDIT: there are skipped executions where the program says it can't get a valid dateExecuted from the workflow_start, but the workflow_start is valid 🤔. It works okay on a smaller dataset, so something is up. EDIT 2: It was a bug on my end; the program was skipping executions that it shouldn't have. The run time ends up being longer due to this fix.
  • There is a flag that records the executions that the program skipped and outputs them to a CSV file. This may be helpful for examining why they were skipped and possibly resubmitting them in the future if they were skipped due to an error on our part.
  • The metrics Terra provided do not have a TRS ID. Instead, they have a source_url pointing to the primary descriptor, like https://raw.githubusercontent.com/broadinstitute/gatk/ah_var_store/scripts/variantstore/wdl/GvsAoUReblockGvcf.wdl. We need to determine the TRS ID of the workflow that has this primary descriptor. The steps to determine this are:
    • Get a list of published workflows whose workflow path begins with the same prefix. Using the source_url above, we would look for workflows with the path prefix github.com/broadinstitute/gatk.
    • For each published workflow, we get the primary descriptor for the version that is in the source_url. In the example, we're looking for primary descriptors for the version ah_var_store. We then check whether any of these workflows has a primary descriptor that matches the descriptor in the source_url.
    • Important: there is an ambiguous case that forces us to ignore the execution. If more than one workflow in the repository has the same primary descriptor path, the program skips the execution because it can't accurately determine which workflow the source_url is referring to. I've encountered this case in testing the data that was given to us.

Review Instructions
Ingest Terra metrics in QA or staging, then aggregate the metrics.

Issue
SEAB-5963

Security
If there are any concerns that require extra attention from the security team, highlight them here.

Please make sure that you've checked the following before submitting your pull request. Thanks!

  • Check that you pass the basic style checks and unit tests by running mvn clean install in the project that you have modified (until https://ucsc-cgl.atlassian.net/browse/SEAB-5300 adds multi-module support properly)
  • Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
  • If you are changing dependencies, check with dependabot to ensure you are not introducing new high/critical vulnerabilities
  • If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

@kathy-t self-assigned this on Dec 12, 2023

codecov bot commented Dec 12, 2023

Codecov Report

Attention: 88 lines in your changes are missing coverage. Please review.

Comparison is base (81eb362) 54.31% compared to head (125f9d7) 56.75%.
Report is 1 commit behind head on develop.

Files Patch % Lines
...csaggregator/client/cli/TerraMetricsSubmitter.java 72.33% 40 Missing and 17 partials ⚠️
...e/metricsaggregator/MetricsAggregatorS3Client.java 66.00% 15 Missing and 2 partials ⚠️
...aggregator/client/cli/MetricsAggregatorClient.java 63.63% 9 Missing and 3 partials ⚠️
.../metricsaggregator/client/cli/CommandLineArgs.java 80.00% 2 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             develop     #478      +/-   ##
=============================================
+ Coverage      54.31%   56.75%   +2.43%     
- Complexity       267      300      +33     
=============================================
  Files             31       31              
  Lines           1725     1947     +222     
  Branches         143      165      +22     
=============================================
+ Hits             937     1105     +168     
- Misses           725      762      +37     
- Partials          63       80      +17     
Flag Coverage Δ
metricsaggregator 48.74% <70.56%> (+3.46%) ⬆️
toolbackup 37.69% <9.36%> (-3.58%) ⬇️
tooltester 29.68% <9.36%> (-2.55%) ⬇️

Flags with carried forward coverage won't be shown.



sonarcloud bot commented Dec 12, 2023

Quality Gate failed

Failed conditions

0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

@coverbeck (Contributor) left a comment

Looks good... do you have a sense on whether it's slow with all the API calls fetching the primary descriptor paths, or posting? Or maybe it's just both, or neither, and the size of the data is the reason.

It's probably not worth the optimization effort, but depending on the answer above and how many different versions of a workflow are run, I wonder if it wouldn't be more efficient to fetch all versions for a workflow and then calculate the primary descriptors? There are a lot of variables there, e.g., how many versions of the same workflow are run, so it's probably not worth further effort unless it really goes out of control.

There are some workflows that do not have absolute primary paths, see SEAB-5898. Will those work? If not, can we assume they're absolute by prepending a forward slash?

final SourceUrlTrsInfo sourceUrlTrsInfo = sourceUrlToSourceUrlTrsInfo.get(sourceUrl);
final ExecutionsRequestBody executionsRequestBody = new ExecutionsRequestBody().runExecutions(List.of(workflowExecution.get()));
try {
extendedGa4GhApi.executionMetricsPost(executionsRequestBody, Partner.TERRA.toString(), sourceUrlTrsInfo.trsId(), sourceUrlTrsInfo.version(),
Contributor

It could be worth having a dry-run type argument, where you can see what would get posted, as a sort of sanity check before actually creating S3 data.
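For illustration, a minimal sketch of what such a dry-run flag could look like, assuming a JCommander-style command class similar to the existing CommandLineArgs; the field and the postOrLog helper are hypothetical, not the implemented behavior:

import com.beust.jcommander.Parameter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class SubmitTerraMetricsCommand {
    private static final Logger LOG = LoggerFactory.getLogger(SubmitTerraMetricsCommand.class);

    @Parameter(names = "--dry-run", description = "Resolve and validate executions and log what would be posted, without writing any S3 data")
    private boolean dryRun = false;

    // Called where the grouped executions would normally be posted.
    void postOrLog(String trsId, String version, int executionCount, Runnable post) {
        if (dryRun) {
            LOG.info("Dry run: would post {} executions for {} version {}", executionCount, trsId, version);
        } else {
            post.run();   // e.g. the existing extendedGa4GhApi.executionMetricsPost(...) call
        }
    }
}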

@kathy-t (Contributor, Author) commented Dec 13, 2023

Looks good... do you have a sense on whether it's slow with all the API calls fetching the primary descriptor paths, or posting? Or maybe it's just both, or neither, and the size of the data is the reason.

Hmm, based on my simple evaluation of measuring the time elapsed before and after the call, the call to fetch descriptors takes anywhere from 0.01 to 0.09 seconds. Posting the metric takes around the same time, in the 0.0x-second range, but I'll see if grouping executions before posting helps.

It's probably not worth the optimization effort, but depending on the answer above and how many different versions of a workflow are run, I wonder if it wouldn't be more efficient to fetch all versions for a workflow and then calculate the primary descriptors? There are a lot of variables there, e.g., how many versions of the same workflow are run, so it's probably not worth further effort unless it really goes out of control.

Hmm, just tested this and it actually takes about 2 minutes longer for 40,000 rows, likely because we're fetching versions that may never have been executed.

I have an optimization that reduces the 1,000,000 processing time from 1h14m29s to 56m22s, which I'll push soon. So with 4,000,000 rows, we're looking at about 4 hours, hopefully less.

There are some workflows that do not have absolute primary paths, see SEAB-5898. Will those work? If not, can we assume they're absolute by prepending a forward slash?

Good point, I will double check

@coverbeck (Contributor)

I have an optimization that reduces the 1,000,000 processing time from 1h14m29s to 56m22s, which I'll push soon. So with 4,000,000 rows, we're looking at about 4 hours, hopefully less.

That's probably fine; we're going to run this very rarely.

@denis-yuen (Member)

There is a flag that records the executions that the program skipped and outputs them to a CSV file. This may be helpful for examining why they were skipped and possibly resubmitting them in the future if they were skipped due to an error on our part.

👍

@denis-yuen (Member)

  • I've encountered this case in testing the data that was given to us.

Still thinking this is odd; can you output a few examples and track this in a new ticket for investigation?
I wonder if this happens when someone renames a workflow (i.e., adds an entryname or changes it).

@denis-yuen (Member)

That's probably fine; we're going to run this very rarely.

Refresh my memory, what happens when we run this twice with the same data?
Will it skip the old aggregated data? Or does it just run again and overwrite the old data?
Either way though, nightly on weekends, for example, this doesn't seem so bad

@denis-yuen (Member) left a comment

comment above

@kathy-t (Contributor, Author) commented Dec 13, 2023

Refresh my memory, what happens when we run this twice with the same data?
Will it skip the old aggregated data? Or does it just run again and overwrite the old data?
Either way though, nightly on weekends, for example, this doesn't seem so bad

In https://ucsc-cgl.atlassian.net/browse/SEAB-5943, we're changing the S3 file key from the time submitted to the execution ID, so if this is run twice, it will error for the execution IDs that already exist in the same directory. I think we don't want users to accidentally overwrite the old data using the POST metrics endpoint, hence the erroring.

@denis-yuen (Member)

ok, thinking we definitely don't have to worry then. Any idea what the aggregation runtimes will be like vs ingestion?

@kathy-t (Contributor, Author) commented Dec 14, 2023

ok, thinking we definitely don't have to worry then. Any idea what the aggregation runtimes will be like vs ingestion?

For 200,000 rows, ingestion took about 30 minutes and submitted 170,480 executions. Aggregation of these executions took 1h19m54s, submitting 841 metrics to the database. The aggregation can definitely take advantage of parallel processing, so I'll add that.

@kathy-t (Contributor, Author) commented Dec 14, 2023

ok, thinking we definitely don't have to worry then. Any idea what the aggregation runtimes will be like vs ingestion?

For 200,000 rows, ingestion took about 30 minutes and submitted 170,480 executions. Aggregation of these executions took 1h19m54s, submitting 841 metrics to the database. The aggregation can definitely take advantage of parallel processing, so I'll add that.

Adding parallel processing to the aggregator reduced the time to 26m39s.
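For illustration, a sketch of the kind of change this refers to: aggregating each metrics directory in parallel rather than sequentially. The directory listing and per-directory aggregation are hypothetical stand-ins for the MetricsAggregatorS3Client logic, not the actual code:

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelAggregationSketch {
    // Each "directory" here stands for the S3 prefix holding one tool/workflow version's executions.
    void aggregate(List<String> metricsDirectories) {
        AtomicInteger succeeded = new AtomicInteger();
        AtomicInteger skipped = new AtomicInteger();
        metricsDirectories.parallelStream().forEach(directory -> {
            try {
                aggregateDirectory(directory);   // read executions, compute aggregate metrics, post the result
                succeeded.incrementAndGet();
            } catch (RuntimeException e) {
                skipped.incrementAndGet();       // a failing directory is skipped rather than failing the whole run
            }
        });
        System.out.printf("Processed %d directories, skipped %d%n", succeeded.get(), skipped.get());
    }

    void aggregateDirectory(String directory) {
        // Stand-in for the real per-directory aggregation.
    }
}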

@kathy-t (Contributor, Author) commented Jan 3, 2024

Noting the following stats for a CSV file of 1,000,000 executions:

12:32:50.611 [main] INFO  i.d.m.c.cli.MetricsAggregatorClient - Done processing 1000000 executions from Terra. Submitted 844397 executions. Skipped 155603 executions.
12:32:50.611 [main] INFO  i.d.m.c.cli.MetricsAggregatorClient - submit-terra-metrics took PT2H23M20.072273889S

Submitted 844,397 executions and skipped 155,603, which makes a lot more sense than the stats for 1,000,000 executions in the draft PR description. There was a bug that was causing it to skip executions it shouldn't have (now fixed). This bug fix caused the time to increase from 1h14m29s to 2h23m20s. So we're looking at a runtime of about 2h23m x 5 (hopefully faster if later executions share a source_url with already processed executions) to process the approximately 5,000,000 rows of data that the Broad gave us.

@kathy-t marked this pull request as ready for review on January 8, 2024 at 16:23
@denis-yuen (Member)

A thought, no need for action now... The performance issues, and our workarounds, make me wonder if maybe it'd be better to warehouse the metrics in a database. That way, we could efficiently retrieve many results, easily find results by different criteria, do atomic bulk updates, gracefully change the schema, etc.

We could also leave it in BigQuery or AWS's equivalent (RedShift?)
Worth a thought

@denis-yuen (Member) left a comment

minor comments/questions

@@ -23,7 +23,7 @@
</encoder>
</appender>

<root level="error">
<root level="info">
Member

intentional?

Contributor Author

Yeah, I wanted to see the INFO logs; otherwise it's eerily quiet on the CLI 😁

}

RunExecution workflowExecution = new RunExecution();
// TODO: uncomment below when the update executions endpoint PR is merged
Member

Note

@kathy-t (Contributor, Author) commented Jan 9, 2024

A thought, no need for action now... The performance issues, and our workarounds, make me wonder if maybe it'd be better to warehouse the metrics in a database. That way, we could efficiently retrieve many results, easily find results by different criteria, do atomic bulk updates, gracefully change the schema, etc.

We could also leave it in BigQuery or AWS's equivalent (RedShift?) Worth a thought

Another alternative is to use AWS' Athena to query the metrics in S3, which feels like a more future-proof solution 💭

@denis-yuen (Member) left a comment

Note that builds will fail until I update the Dockstore webservice version to a tag containing the corresponding webservice changes.

Ah ok. Will swing back again after that tag.

final ExecutionsRequestBody executionsRequestBody = new ExecutionsRequestBody().runExecutions(workflowExecutionsToSubmit);
try {
extendedGa4GhApi.executionMetricsPost(executionsRequestBody, Partner.TERRA.toString(), sourceUrlTrsInfo.trsId(), sourceUrlTrsInfo.version(),
"Terra metrics from Q4 2023");
Contributor

Hard-coded and arguably not correct, unless it's the date they gave it to us (metrics are for 2022, but they were given to us in Q4 2023). If we expect them to switch to invoking the API in the future, then maybe we don't need to worry about it; just flagging.

Contributor Author

Good point, I think I'll make it an optional argument instead

Contributor Author

Made the description more generic and also added the description as a command argument.

@kathy-t (Contributor, Author) commented Jan 16, 2024

FYI, #479 needs to be merged before updating the webservice version in this PR to 1.15.0-rc.0 because the other PR has changes required for the metrics aggregator to build successfully.

@denis-yuen (Member)

  • Important: there is an ambiguous case that forces us to ignore the execution. If more than one workflow in the repository has the same primary descriptor path, the program skips the execution because it can't accurately determine which workflow the source_url is referring to. I've encountered this case in testing the data that was given to us.

I'm guessing, workflow that was renamed?

@kathy-t (Contributor, Author) commented Jan 18, 2024

  • Important: there is an ambiguous case that forces us to ignore the execution. If more than one workflow in the repository has the same primary descriptor path, the program skips the execution because it can't accurately determine which workflow the source_url is referring to. I've encountered this case in testing the data that was given to us.

I'm guessing, workflow that was renamed?

That's one possible scenario. Other ones are:

  • User has a .dockstore.yml with multiple workflows referencing the same descriptor
  • User registered multiple FULL workflows with the same descriptor

@kathy-t (Contributor, Author) commented Jan 18, 2024

Updated the release checklist with instructions on how to ingest the Terra metrics. Per discussion in #479 (comment), we will be wiping the metrics bucket and re-ingesting the tooltester AGC executions and DNAstack miniwdl metrics.

Local stats of running through the checklist

15:38:13.708 [main] INFO  i.d.m.MetricsAggregatorS3Client - Completed aggregating metrics. Processed 5311 directories, submitted 10795 platform metrics, and skipped 67 platform metrics
15:38:13.708 [main] INFO  i.d.m.c.cli.MetricsAggregatorClient - aggregate-metrics took PT17M23.735754669S

EDIT: investigating why it skipped 67 metrics.

@kathy-t (Contributor, Author) commented Jan 19, 2024

Updated the release checklist with instructions on how to ingest the Terra metrics. Per discussion in #479 (comment), we will be wiping the metrics bucket and re-ingesting the tooltester AGC executions and DNAstack miniwdl metrics.

Local stats of running through the checklist

15:38:13.708 [main] INFO  i.d.m.MetricsAggregatorS3Client - Completed aggregating metrics. Processed 5311 directories, submitted 10795 platform metrics, and skipped 67 platform metrics
15:38:13.708 [main] INFO  i.d.m.c.cli.MetricsAggregatorClient - aggregate-metrics took PT17M23.735754669S

EDIT: investigating why it skipped 67 metrics.

I created a webservice PR that fixes the aggregator so that it doesn't skip the 67 metrics. See details in dockstore/dockstore#5788.

I will still merge this PR, but will skip ingesting the DNAstack metrics in staging until the next RC release because the metrics that were skipped were some of the DNAstack metrics.


sonarcloud bot commented Jan 19, 2024

Quality Gate failed

Failed conditions

1 Security Hotspot
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

@kathy-t merged commit 38b05b4 into develop on Jan 19, 2024
9 of 10 checks passed
@kathy-t deleted the feature/seab-5963/ingest-terra-metrics branch on January 19, 2024 at 17:43