Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix output_bam_basename calculation #5

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

multimeric
Copy link
Contributor

@multimeric multimeric commented Feb 21, 2019

The current implementation has a few issues:

  • Code duplication (output_bam_basename = sub(sub(unmapped_bam, sub_strip_path, ""), sub_strip_unmapped, "") + ".aligned.unsorted" is called 3 times)
  • Not portable: by stripping out only the gs:// prefix, this workflow struggles with every other cloud URL I've tried to use it with (AWS, and DNAnexus).
  • Involves casting a file to a string, with sub(unmapped_bam, "gs://.*/", ""). This means the file URL is converted directly to a string, and is not properly converted to the location at which this file will actually be downloaded.

I've solved these issues by using basename(), which is a much more portable and stable way of converting from a File to a String, and doesn't require prefix stripping. In addition I've removed the code duplication, which is no longer needed because of Cromwell fixes.

As a demonstration that this works, I've run the whole GATK3 Best Practice pipeline with my fixed version of the workflow on AWS (with Cromwell 37), and downloaded the Cromwell metadata (including the submitted workflow, and all results). Unfortunately I had to include some other AWS workarounds, but you can see from the logs that we start with this as an input:

    "flowcell_unmapped_bams": [
      "s3://cromwell-results/cromwell-execution/best_practise/4d3cfb17-1052-4c40-ae7c-6fa161eaf6ae/call-CPFTUB/unmap.CPFTUB/f0a79b5d-c936-40da-a427-bd657a79642e/call-PFTUB/shard
-0/NA12878_A.unmapped.bam",
      "s3://cromwell-results/cromwell-execution/best_practise/4d3cfb17-1052-4c40-ae7c-6fa161eaf6ae/call-CPFTUB/unmap.CPFTUB/f0a79b5d-c936-40da-a427-bd657a79642e/call-PFTUB/shard
-1/NA12878_B.unmapped.bam",
      "s3://cromwell-results/cromwell-execution/best_practise/4d3cfb17-1052-4c40-ae7c-6fa161eaf6ae/call-CPFTUB/unmap.CPFTUB/f0a79b5d-c936-40da-a427-bd657a79642e/call-PFTUB/shard
-2/NA12878_C.unmapped.bam"
    ]

And we then up with:

"output_bam_basename": "NA12878_A.unmapped.unmerged"

This is exactly what you want.

Full metadata logs are here. The workflow as a whole doesn't work (it fails for an unrelated reason), but the GenericPreProcessingWorkflow does complete successfully, which is what is of interest here.

GenericPreProcessingWorkflow.log
BestPractice.log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant