Merge pull request #314 from NCI-CGR/data_catalog
BCF entry point with intensity and contamination checks using BCF for data_catalog usage.
jaamarks authored Sep 25, 2024
2 parents fe64e33 + 58ed006 commit f2ff2c3
Showing 25 changed files with 1,385 additions and 245 deletions.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -29,6 +29,7 @@ This lets us take advantage of snakemake_'s amazing workflow management system,
:maxdepth: 1

sub_workflows/entry_points
sub_workflows/intensity_check
sub_workflows/contamination
sub_workflows/sample_qc
sub_workflows/subject_qc
80 changes: 80 additions & 0 deletions docs/static/bcf_contamination.svg
44 changes: 44 additions & 0 deletions docs/static/bcf_intensity.svg
80 changes: 80 additions & 0 deletions docs/static/gtc_contamination.svg
56 changes: 56 additions & 0 deletions docs/static/idat_intensity.svg
31 changes: 16 additions & 15 deletions docs/sub_workflows/contamination.rst
@@ -6,27 +6,28 @@ Contamination Sub-workflow
**Workflow File**:
https://github.com/NCI-CGR/GwasQcPipeline/blob/default/src/cgr_gwas_qc/workflow/sub_workflows/contamination.smk

**Major Outputs**:

- ``sample_level/<BPM Prefix>.<software_params.contam_population>.abf.txt`` B allele frequencies from the 1000 genomes.
- ``sample_level/contamination/verifyIDintensity.csv`` aggregated table of contamination scores.

**Config Options**: see :ref:`config-yaml` for more details

- ``reference_files.thousand_genome_vcf``
- ``reference_files.thousand_genome_tbi``
- ``user_files.gtc_pattern``
- ``user_files.idat_pattern``
- ``user_files.bcf`` or ( ``reference_files.illumina_manifest_file`` and ``user_files.gtc_pattern`` )
- ``software_params.contam_population``

**Major Outputs**:

- ``sample_level/<BPM Prefix>.<software_params.contam_population>.abf.txt`` B allele frequencies from the 1000 genomes.
- ``sample_level/contamination/median_idat_intensity.csv`` aggregated table of median IDAT intensities.
- ``sample_level/contamination/verifyIDintensity.csv`` aggregated table of contamination scores.
|bcf_input_contamination| |gtc_input_contamination|

.. |gtc_input_contamination| image:: ../static/gtc_contamination.svg
   :width: 45%

.. figure:: ../static/contamination.png
   :name: fig-contamination-workflow
.. |bcf_input_contamination| image:: ../static/bcf_contamination.svg
   :width: 45%

The contamination sub-workflow.
This workflow will estimate contamination using verifyIDintensity on each sample individually.
It requires that you have GTC/IDAT files.
It first pulls B-allele frequencies from the 1000 Genomes VCF file.
It then estimate contamination for each sample and aggregates these results.
Finally, it also estimates the per sample median IDAT intensity, which is used to filter contamination results in the :ref:`sample-qc`
The contamination sub-workflow.
This workflow will estimate contamination using verifyIDintensity on each sample individually.
It requires that you have aggregated BCF or GTC files.
It first pulls B-allele frequencies from the 1000 Genomes VCF file.
It then estimates contamination for each sample and aggregates these results.
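
The input requirement described above can be sketched as a small check. This is a minimal illustration, not the pipeline's actual code: the function name ``contamination_input_mode`` and the nested-dict config layout are hypothetical, and preferring the BCF when both input types are configured is an assumption. Only the config keys are taken from the options listed above.

```python
from typing import Any, Mapping


def contamination_input_mode(config: Mapping[str, Any]) -> str:
    """Return "bcf" or "gtc" depending on which inputs are configured.

    Hypothetical helper mirroring the documented rule: the workflow needs
    user_files.bcf, or both reference_files.illumina_manifest_file and
    user_files.gtc_pattern.
    """
    user_files = config.get("user_files") or {}
    reference_files = config.get("reference_files") or {}

    # Assumption: an aggregated BCF takes precedence when both are present.
    if user_files.get("bcf"):
        return "bcf"

    # The GTC route needs the Illumina manifest in addition to the GTC pattern.
    if reference_files.get("illumina_manifest_file") and user_files.get("gtc_pattern"):
        return "gtc"

    raise ValueError(
        "contamination needs user_files.bcf, or both "
        "reference_files.illumina_manifest_file and user_files.gtc_pattern"
    )
```

A config that sets only ``user_files.gtc_pattern`` without the manifest file raises ``ValueError`` in this sketch, which matches the parenthesized "and" in the config options.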
4 changes: 3 additions & 1 deletion docs/sub_workflows/entry_points.rst
@@ -15,14 +15,15 @@ Entry Points Sub-workflow
- ``user_files.bed``
- ``user_files.bim``
- ``user_files.fam``
- ``user_files.bcf``

**Major Outputs**:

- ``sample_level/samples.bed``
- ``sample_level/samples.bim``
- ``sample_level/samples.fam``

There are three paths we can take to create these files:
There are four paths we can take to create these files:

1. If GTC files are provided using ``user_files.gtc_pattern`` then we will

@@ -34,3 +35,4 @@ There are three paths we can take to create these files:

2. If an aggregated PED/MAP is provided using ``user_files.ped`` and ``user_files.map`` then we will convert the PED/MAP to BED/BIM/FAM.
3. If an aggregated BED/BIM/FAM is provided using ``user_files.bed``, ``user_files.bim``, ``user_files.fam`` then we will create a symbolic link.
4. If an aggregated BCF file is provided using ``user_files.bcf`` then we will convert the BCF to BED/BIM/FAM.
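
The four paths above can be sketched as a simple dispatch on which ``user_files`` keys are set. This is only an illustration under stated assumptions: ``select_entry_point`` is a hypothetical name, the returned labels are invented for the example, and checking the paths in the order they are listed is an assumption rather than documented precedence.

```python
from typing import Any, Mapping


def select_entry_point(user_files: Mapping[str, Any]) -> str:
    """Pick which documented entry-point path applies (hypothetical helper)."""

    def has(*keys: str) -> bool:
        # True only when every listed key is present and non-empty.
        return all(user_files.get(key) for key in keys)

    # Assumption: paths are tried in the order the documentation lists them.
    if has("gtc_pattern"):
        return "gtc"          # path 1: per-sample GTC files
    if has("ped", "map"):
        return "ped_map"      # path 2: convert PED/MAP to BED/BIM/FAM
    if has("bed", "bim", "fam"):
        return "bed_bim_fam"  # path 3: symbolic link to the provided files
    if has("bcf"):
        return "bcf"          # path 4: convert BCF to BED/BIM/FAM
    raise ValueError("no usable entry point found in user_files")
```

In this sketch an incomplete file set, such as ``bed`` and ``bim`` without ``fam``, does not match path 3 and falls through to the later checks.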