Scripts to create downsampled files #662

VJalili · 2024-04-02T17:12:00Z

This PR adds utility scripts that create test data for the pipeline, either by downsampling a larger file to the records overlapping given test regions or by converting file types, whichever suits the input best. The resulting files are smaller than the input files, which makes workflow runs for testing faster and cheaper.

The following changes are implemented:

Create a tests folder containing all the tests, and move the Carrot-related files to this folder.
Implement methods to downsample different types of files; .cram, .vcf, .interval_list, and primary contigs files are the currently supported file types.
Implement a script that iterates through the given workflow JSON files, downsamples the files set in them, pushes the results to the cloud if needed, and creates an output JSON file that references the downsampled files and carries over the untouched inputs. For instance:

Input JSON:

{
    "GatherSampleEvidence.bam_or_cram_file": "gs://.../HG00096.final.cram",
    "GatherSampleEvidence.sample_id": "HG00096"
}

Output JSON:

{
    "GatherSampleEvidence.bam_or_cram_file": "gs://my-test-bucket/downsampled_HG00096.final.cram",
    "GatherSampleEvidence.sample_id": "HG00096"
}

Add a BED file with default target regions so that the resulting downsampled files run successfully through the GatherSampleEvidence workflow.
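
The JSON rewriting step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes the scripts are written in Python, and `rewrite_inputs`, `Transformer`, and `fake_downsample` are hypothetical names. The stand-in transformer only renames the path to mirror the example above; it does not read, downsample, or upload anything.

```python
import json
from typing import Callable, Dict, Iterable

# Hypothetical type: maps an input path/URI to the path/URI of its transformed copy.
Transformer = Callable[[str], str]


def rewrite_inputs(
    inputs: Dict[str, object],
    transformers: Dict[str, Transformer],
    exclude_keys: Iterable[str] = (),
) -> Dict[str, object]:
    """Return a copy of `inputs` where each key with a registered transformer
    (and not excluded) points at the transformed file."""
    excluded = set(exclude_keys)
    outputs: Dict[str, object] = {}
    for key, value in inputs.items():
        transform = transformers.get(key)
        if transform is not None and key not in excluded and isinstance(value, str):
            outputs[key] = transform(value)
        else:
            # Untouched inputs are carried over unchanged.
            outputs[key] = value
    return outputs


def fake_downsample(uri: str) -> str:
    """Stand-in for a real downsampler (assumption): only builds the new name."""
    filename = uri.rsplit("/", 1)[-1]
    return f"gs://my-test-bucket/downsampled_{filename}"


if __name__ == "__main__":
    workflow_inputs = {
        "GatherSampleEvidence.bam_or_cram_file": "gs://.../HG00096.final.cram",
        "GatherSampleEvidence.sample_id": "HG00096",
    }
    result = rewrite_inputs(
        workflow_inputs,
        {"GatherSampleEvidence.bam_or_cram_file": fake_downsample},
    )
    print(json.dumps(result, indent=4))
```

Keying the transformers by input name keeps the per-file-type logic out of the loop, and an `exclude_keys` argument is one simple way to support skipping selected keys.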

The UUID of a test run on GCP: dd8bc64f-7ac0-46b6-ae39-2af86f0fe573

VJalili added 2 commits January 8, 2024 10:59

Move carrot-based tests to the tests directory.

297a7ce

Scripts to create downsampled files.

896f69d

VJalili requested a review from mwalker174 April 2, 2024 17:12

VJalili added 19 commits April 4, 2024 12:31

Fix lint errors.

3eae196

Add melt_metrics_intervals, since otherwise coverage, needed for running MELT, is calculated incorrectly as it will calculate across the entire genome.

6f6efc1

Add a downsampler for BED format.

4642826

Downsample wham_include_list_bed_file.

6138b44

Add default downsampling regions file.

94d7541

Implement both downsampler and converter & implement a base type that both derive from.

501681b

refactor downsamplers to transformers.

36b64dd

Reimplement downsampling cram files to correctly fetch pairs.

0d3cefb

Rework the interface so that all the methods implement transform method.

da08f68

Simplify the interface by removing BaseDownsampler & BaseConverter.

6384cd6

Convert serialized regions to .intervallist instead of using the existing file given in the json.

7a0dd80

Replace hardcoded config with cli args, & ensure reference dict file is localized.

0ca6a91

fix lint

2ed9f83

Update downsampling cram files to perform a second pass on the cram to linearly search for distant pairs.

61149da

Add an option for excluding some keys.

26e0333

Add an option to disable searching for discordant pair reads.

23ad622

Update default downsampling regions.

068fe85

Remove some sites from the default downsampling regions.

26cae8f

Fix lint errors.

d9bd795

VJalili marked this pull request as ready for review April 25, 2024 23:23
