BCF entry point with intensity and contamination checks using BCF for data_catalog usage. #314

Merged
merged 36 commits into from
Sep 25, 2024

rajwanir

  1. Creates BCF entry point for data catalog.
  2. Makes BPM manifest optional (relevant for aggregated file inputs e.g. BCF, VCF or BED)
  3. Adds scripts to compute median intensity and contamination score from BCF.
  4. Separates the intensity checks from contamination checks.
  5. No observable difference with standard GTC input.
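A minimal sketch of the median-intensity idea in item 3, not the PR's actual script. It assumes each probe contributes two channel values for a sample (for example, X and Y intensity fields read from the BCF) and that per-probe intensity is their sum; the function name `median_intensity` and the input shape are hypothetical, and reading the values out of the BCF is left out.

```python
from statistics import median
from typing import Iterable, Optional, Tuple

ChannelPair = Tuple[Optional[float], Optional[float]]


def median_intensity(channel_pairs: Iterable[ChannelPair]) -> Optional[float]:
    """Return the median of per-probe summed intensities for one sample.

    Assumption: intensity per probe is the sum of its two channel values.
    Probes with a missing value in either channel are skipped.
    """
    totals = [x + y for x, y in channel_pairs if x is not None and y is not None]
    # No usable probes means there is no median to report.
    return median(totals) if totals else None


# Hypothetical per-probe (X, Y) values for a single sample.
example = [(100.0, 200.0), (None, 50.0), (300.0, 100.0), (10.0, 20.0)]
print(median_intensity(example))
```

Keeping the computation separate from file parsing means the same function could be applied per sample or across a grouped run.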

jaamarks

This comment was marked as resolved.

rajwanir2 added 11 commits September 25, 2024 16:21
Removes the dependency on the BPM for naming the allele B frequencies (abf) file.
to separate median idat intensity retrieval from verifyIDintensity bundled into contamination.smk
Modifies entry_points.smk to create a BCF entry point by simply converting BCF to PLINK BED. Testing and validation are yet to be done.
A previous commit puts them in a separate idat_intensity.smk
Modifies contamination.smk and grouped_contamination.py to enable contamination check in cluster mode.
Avoids the dependency on IDAT files for calculating median intensity with VCF/BCF input.
Adds scripts for both per-sample and grouped/cluster mode.
Modifies the intensity workflow to execute appropriately if VCF/BCF input.
…kefile and sample_qc subworkflow

Besides the existing 'use_contamination' checks, also adds 'intensity_retreived' and 'contamination_checked'
tests, which simply check whether the output CSV files were created, regardless of config or entry point, so they can be fed
to sample_qc.
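A rough sketch of what such file-existence tests could look like, assuming they reduce to checking whether each output CSV is present on disk. The flag names come from the commit message above; the function name `output_flags` and the file paths are hypothetical, and the workflow's real checks may be written differently.

```python
from pathlib import Path
from typing import Dict


def output_flags(intensity_csv: str, contamination_csv: str) -> Dict[str, bool]:
    """Report which optional QC outputs exist, independent of config or entry point."""
    return {
        # True only if the median intensity table was actually written.
        "intensity_retreived": Path(intensity_csv).is_file(),
        # True only if the contamination table was actually written.
        "contamination_checked": Path(contamination_csv).is_file(),
    }


# Hypothetical paths; both flags are False when neither file exists.
print(output_flags("sample_level/median_intensity.csv", "sample_level/contamination.csv"))
```

Deciding from the files themselves, rather than from the config, lets a downstream step accept whichever tables were produced.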
rajwanir2 added 23 commits September 25, 2024 16:21
Removes idat_intensity.smk and keeps intensity_check.smk
Renames vcf_file to bcf_file to explicitly indicate that the input is BCF.
Snakemake params were previously imported by name. Changed to import all params through a loop, without naming them. Allows seamless compatibility whether the GTC or BCF entry point is used.
The starting few lines were duplicated in the entry_points in copy/paste. This removes the duplicated lines.
…r contamination checks.

Previously GC_SCORE was added to the adpc.bin, which required that a cluster EGT file be used when preparing the VCF/BCF. The IGC score is encoded in the GTC, so it doesn't depend on a cluster EGT file. The contamination scores should also be more similar to those from GTC input.
Consistent with vcf2adpc.py. Now both should use IGC and work with VCF/BCF prepared with the gtc2vcf workflow.
Earlier, the GenTrain score was used to mark AF as NA if the score was negative. This change stops using it to ensure compatibility with BCF prepared by the gtc2vcf workflow.
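The looped params import mentioned in the commits above can be illustrated roughly as follows. This is a sketch under assumptions, not the PR's code: a plain dict stands in for the params object a Snakemake script receives, and `gather_params` and `describe_input` are hypothetical names.

```python
from typing import Any, Dict, Mapping, Optional


def gather_params(params: Mapping[str, Any]) -> Dict[str, Any]:
    """Copy every supplied parameter without referring to any of them by name."""
    collected: Dict[str, Any] = {}
    for name, value in params.items():
        collected[name] = value
    return collected


def describe_input(
    gtc_file: Optional[str] = None, bcf_file: Optional[str] = None, **_other: Any
) -> str:
    """Pick the input type from whichever parameter the entry point provided."""
    if bcf_file is not None:
        return f"bcf:{bcf_file}"
    if gtc_file is not None:
        return f"gtc:{gtc_file}"
    raise ValueError("expected either gtc_file or bcf_file")


# Hypothetical params for the two entry points.
gtc_params = {"gtc_file": "sample1.gtc", "snps": 1000}
bcf_params = {"bcf_file": "cohort.bcf", "snps": 1000}
print(describe_input(**gather_params(gtc_params)))
print(describe_input(**gather_params(bcf_params)))
```

Because the loop never names a parameter, the same script body works whether the rule passes `gtc_file` or `bcf_file`.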

@jaamarks jaamarks left a comment

LGTM

@jaamarks jaamarks merged commit f2ff2c3 into default Sep 25, 2024
2 checks passed