Sfitz add mosdepth quantize #88

Open: wants to merge 47 commits into base: main. Showing changes from 44 commits.

Commits (47)
1e093a2
add mosdepth and index file and change path name
sorelfitzgibbon Jun 19, 2024
5375b19
update resources
sorelfitzgibbon Jun 19, 2024
7ff81eb
update version in metadata
sorelfitzgibbon Jun 20, 2024
9a971e5
fix path to bam in tuple
sorelfitzgibbon Jun 27, 2024
fcd824b
update changelog
sorelfitzgibbon Jun 27, 2024
91136c9
resource adjustments
sorelfitzgibbon Jul 22, 2024
5e345c4
adjust log info
sorelfitzgibbon Jul 22, 2024
8beaf43
change process names
sorelfitzgibbon Jun 27, 2024
acbb0fd
rename bamqc_outformat to bamqc_output_format
sorelfitzgibbon Jun 27, 2024
668d83c
rename bamqc_outformat to bamqc_output_format
sorelfitzgibbon Jun 27, 2024
b178e75
remove fastqc as default
sorelfitzgibbon Jun 27, 2024
b59503d
fix CollectWgsMetrics params bug
sorelfitzgibbon Jul 22, 2024
c3d525f
add mosdepth and index file and change path name
sorelfitzgibbon Jun 19, 2024
b97d512
update changelog
sorelfitzgibbon Jun 27, 2024
7d536cd
Merge branch 'main' of github.com:uclahs-cds/pipeline-generate-SQC-BA…
sorelfitzgibbon Jul 23, 2024
f90c11f
fix left over merge lines M64.config
sorelfitzgibbon Jul 23, 2024
35f5aac
change template mosdepth fast to false
sorelfitzgibbon Jul 25, 2024
4975b4c
add quantize to resource configs
sorelfitzgibbon Jul 26, 2024
cad71c7
add quantize to resource configs
sorelfitzgibbon Jul 26, 2024
2f89689
update schema
sorelfitzgibbon Jul 26, 2024
80b352e
add quantize
sorelfitzgibbon Jul 26, 2024
76fcb40
add quantize
sorelfitzgibbon Jul 26, 2024
8a254cc
Merge branch 'sfitz-add-mosdepth' into sfitz-add-mosdepth-quantize
sorelfitzgibbon Jul 26, 2024
1aca7bc
add mosdepth to nftest
sorelfitzgibbon Jul 26, 2024
8ea75a1
update nftest mosdepth slow
sorelfitzgibbon Jul 26, 2024
81ee369
change algorithm option coverage to windows
sorelfitzgibbon Jul 27, 2024
7c0812b
update README
sorelfitzgibbon Jul 27, 2024
82bdfe9
merge in coverage to windows
sorelfitzgibbon Jul 29, 2024
1715272
require quantize cutoffs
sorelfitzgibbon Jul 29, 2024
51fdb1e
output filename dash and add to test config
sorelfitzgibbon Jul 29, 2024
3f11738
update changelog
sorelfitzgibbon Jul 29, 2024
313b079
add mosdepth per-base output
sorelfitzgibbon Jul 31, 2024
ce597f6
update submodules
sorelfitzgibbon Oct 29, 2024
55679d0
merge main
sorelfitzgibbon Oct 29, 2024
8c976ab
move gitignore lines to local only
sorelfitzgibbon Oct 29, 2024
8aa3634
add nftest config files
sorelfitzgibbon Oct 29, 2024
7462538
finish main merge
sorelfitzgibbon Oct 29, 2024
cab90a8
alorithm name
sorelfitzgibbon Oct 29, 2024
4fe2727
add quantize to nftest and update config names
sorelfitzgibbon Oct 29, 2024
027bad9
typo
sorelfitzgibbon Oct 29, 2024
7e48841
update readme
sorelfitzgibbon Oct 30, 2024
9527f7f
fix duplicate keys
sorelfitzgibbon Oct 30, 2024
563775d
fix nftest
sorelfitzgibbon Oct 30, 2024
5fb7a72
reorganize nftest.yml
sorelfitzgibbon Oct 30, 2024
b5be444
fix minor issues
sorelfitzgibbon Nov 1, 2024
4514bd4
fix output filenames, mosdepth windows too
sorelfitzgibbon Nov 1, 2024
20a6d52
Merge remote-tracking branch 'origin' into sfitz-add-mosdepth-quantize
sorelfitzgibbon Nov 1, 2024
4 changes: 0 additions & 4 deletions .gitignore
@@ -81,7 +81,3 @@ work/
*.tar
*.zip

# Other
test/*
test/*/*
slurm-*.out
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
## [Unreleased]

### Added
- Add `mosdepth` quantize workflow
- Add `mosdepth` coverage windows workflow
- Add `FastQC` workflow
- Add per readgroup and per library functionality
64 changes: 43 additions & 21 deletions README.md
@@ -13,7 +13,7 @@
9. [References](#references)
## Overview

This pipeline takes BAMs and runs selected Quality Control (QC) steps. Available algorithms are currently `SAMtools stats`, `Picard CollectWgsMetrics` and `Qualimap bamqc`. Generally either `Qualimap bamqc` or `SAMtools stats and Picard CollectWgsMetrics` should be run, not both. `Qualimap bamqc` uses a lot of memory and should not be run within `uclahs-cds/metapipeline-DNA`. Input can include any combination of tumor and normal BAMs from a single donor. Each will be processed independently. RNA specific QC is not yet implemented but is expected soon.
This pipeline takes BAMs and runs selected Quality Control (QC) steps. Available algorithms are currently `SAMtools stats`, `Picard CollectWgsMetrics`, `FastQC`, `Qualimap bamqc`, `mosdepth coverage` and `mosdepth quantize`. Generally, either `Qualimap bamqc` or the pair `SAMtools stats` and `Picard CollectWgsMetrics` should be run, not both. `Qualimap bamqc` uses a lot of memory and should not be run within `uclahs-cds/metapipeline-DNA`. Input can include any combination of tumor and normal BAMs from a single donor; each is processed independently. RNA-specific QC is not yet implemented but is expected soon.

---

@@ -51,9 +51,12 @@ Each of the below algorithms, if selected, will run in parallel subject to avail
### 4. FastQC
[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/) aims to provide a QC report which can spot problems which originate either in the sequencer or in the starting library material.

### 5. mosdepth windows
### 5. mosdepth coverage
[mosdepth](https://github.com/brentp/mosdepth) by windows provides fast BAM/CRAM depth calculation.

### 6. mosdepth quantize
[mosdepth](https://github.com/brentp/mosdepth) quantize creates a BED file that labels regions according to which of the specified coverage thresholds they fall between, similar to GATK's CallableLoci tool.
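
The labeling idea can be sketched in a few lines of Python. This is only an illustration, not the pipeline's or mosdepth's implementation. It assumes half-open bins `[lo, hi)` between consecutive cutoffs and that a trailing colon leaves the last bin open-ended. The function names `parse_cutoffs` and `label_depth` are hypothetical, the labels are the ones set in `config/template.config`, and whether the pipeline appends a trailing colon to its default cutoffs is not shown in this diff.

```python
from bisect import bisect_right
from typing import List, Optional

# Labels taken from config/template.config in this PR.
LABELS = ["NO_COVERAGE", "LOW_COVERAGE", "CALLABLE", "HIGH_COVERAGE"]


def parse_cutoffs(spec: str) -> List[float]:
    """Turn a colon-separated cutoff string into ascending bin edges.

    Assumption: a trailing colon leaves the last bin open-ended.
    """
    edges = [float(x) for x in spec.rstrip(":").split(":")]
    if any(a >= b for a, b in zip(edges, edges[1:])):
        raise ValueError(f"cutoffs must be strictly increasing: {spec!r}")
    if spec.endswith(":"):
        edges.append(float("inf"))
    return edges


def label_depth(depth: int, edges: List[float], labels: List[str]) -> Optional[str]:
    """Return the label of the half-open bin [lo, hi) that contains depth."""
    i = bisect_right(edges, depth) - 1
    if i < 0 or i >= len(edges) - 1:
        return None  # depth falls outside every bin
    return labels[i]


# Hypothetical cutoff string with a trailing colon (open-ended last bin).
edges = parse_cutoffs("0:1:5:150:")
for depth in (0, 3, 5, 149, 150):
    print(depth, label_depth(depth, edges, LABELS))
```

Under the same assumptions, a cutoff string without the trailing colon defines one bin fewer, so in this sketch depths at or above the last cutoff get no label.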

---

## Inputs
@@ -78,7 +81,7 @@ input:

| Field | Type | Required | Description |
| ----- | ---- | ------------ | ------------------------ |
| `algorithm` | list | no | List of tools to be run: ['fastqc', 'samtools_stats', 'collectwgsmetrics', 'mosdepth_coverage', 'qualimap_bamqc'], default = ['stats', 'collectwgsmetrics'] |
| `algorithm` | list | no | List of tools to be run: ['fastqc', 'samtools_stats', 'collectwgsmetrics', 'mosdepth_coverage', 'mosdepth_quantize', 'qualimap_bamqc'], default = ['samtools_stats', 'collectwgsmetrics'] |
| `reference` | path | yes/no | Reference fasta is required only for `CollectWgsMetrics` |
| `output_dir` | path | yes | Not required if `blcds_registered_dataset` = `true` |
| `blcds_registered_dataset` | boolean | no | Default is `false`. Only `uclahs_cds` users should change this. When `true`, BLCDS folder structure is used |
@@ -92,12 +95,6 @@ input:
| stats_remove_duplicates | boolean | no | Ignore reads marked as duplicate. Default = `false` |
| stats_additional_options | string | no | Any additional options recognized by `samtools stats` |

#### FastQC specific configuration
| Field | Type | Required | Description |
| ----- | ---- | ------------ | ------------------------ |
| fastqc_level | string | yes | 'readgroup', 'library' or 'sample' |
| fastqc_additional_options | string | no | Any additional options recognized by `FastQC` |

#### Picard specific configuration
| Field | Type | Required | Description |
| ----- | ---- | ------------ | ------------------------ |
@@ -107,19 +104,37 @@ input:
| cwm_use_fast_algorithm | boolean | no | If `true`, fast algorithm is used |
| cwm_additional_options | string | no | Any additional options recognized by `CollectWgsMetrics` |

#### mosdepth windows specific configuration
#### FastQC specific configuration
| Field | Type | Required | Description |
| ----- | ---- | ------------ | ------------------------ |
| mosdepth_use_fast_algorithm | boolean | no | `fast` algorithm ignores read pair overlaps and CIGARs. It should not be used on libraries with small insert sizes. Default = `false` |
| mosdepth_window_size | integer | no | Size for `mosdepth windows` coverage calculations |
| mosdepth_additional_options | string | no | Any additional options recognized by `mosdepth` |
| fastqc_level | string | yes | 'readgroup', 'library' or 'sample' |
| fastqc_additional_options | string | no | Any additional options recognized by `FastQC` |

#### Qualimap specific configuration
| Field | Type | Required | Description |
| ----- | ---- | ------------ | ------------------------ |
| bamqc_output_format | string | no | Choice of 'pdf' or 'html', default = 'pdf' |
| bamqc_additional_options | string | no | Any additional options recognized by `bamqc` |

#### mosdepth coverage specific configuration
| Field | Type | Required | Description |
| ----- | ---- | ------------ | ------------------------ |
| mosdepth_use_fast_algorithm | boolean | no | `fast` algorithm ignores read pair overlaps and CIGARs. It should not be used on libraries with small insert sizes. Default = `false` |
| mosdepth_per_base_output | boolean | no | Output coverage for every base. Default = `true` |
| mosdepth_window_size | integer | no | Size for `mosdepth windows` coverage calculations |
| mosdepth_additional_options | string | no | Any additional options recognized by `mosdepth`; `--mapq 20` recommended |

#### mosdepth quantize specific configuration
| Field | Type | Required | Description |
| ----- | ---- | ------------ | ------------------------ |
| mosdepth_quantize_cutoffs | string | no | Cutoffs for coverage regions. Default = `0:1:5:150` |
| mosdepth_quantize_use_fast_algorithm | boolean | no | `fast` algorithm ignores read pair overlaps and CIGARs. It should not be used on libraries with small insert sizes. Default = `false` |
| mosdepth_q0_label | string | no | Label for the lowest coverage regions. Default = `Q0` |
| mosdepth_q1_label | string | no | Label for the second-lowest coverage regions. Default = `Q1` |
| mosdepth_q2_label | string | no | Label for the third-lowest coverage regions. Default = `Q2` |
| mosdepth_q3_label | string | no | Label for the highest coverage regions. Default = `Q3` |
| mosdepth_quantize_additional_options | string | no | Any additional options recognized by `mosdepth`; `--mapq 20` recommended |
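
As a rough illustration of how these parameters could map onto a `mosdepth` call, the Python sketch below builds an argument list and an environment. It is not the pipeline's process script, which is not shown in this diff. The function `build_quantize_command` is hypothetical, and passing `--no-per-base` is an assumption. The `--quantize`, `--fast-mode` and `--mapq` options and the `MOSDEPTH_Q0` to `MOSDEPTH_Q3` environment variables are taken from mosdepth's documented interface.

```python
import shlex
from typing import Dict, List, Tuple


def build_quantize_command(params: Dict, prefix: str, bam: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a hypothetical mosdepth quantize invocation."""
    # mosdepth reads bin labels from MOSDEPTH_Q0, MOSDEPTH_Q1, ... environment variables.
    env = {f"MOSDEPTH_Q{i}": params[f"mosdepth_q{i}_label"] for i in range(4)}

    # Assumption: per-base output is skipped for the quantize run.
    argv = ["mosdepth", "--no-per-base", "--quantize", params["mosdepth_quantize_cutoffs"]]
    if params.get("mosdepth_quantize_use_fast_algorithm", False):
        argv.append("--fast-mode")
    # Free-form extra options are split the way a shell would split them.
    argv += shlex.split(params.get("mosdepth_quantize_additional_options", ""))
    argv += [prefix, bam]
    return argv, env


# Values copied from config/template.config; prefix and BAM name are placeholders.
params = {
    "mosdepth_quantize_cutoffs": "0:1:5:150",
    "mosdepth_quantize_use_fast_algorithm": False,
    "mosdepth_q0_label": "NO_COVERAGE",
    "mosdepth_q1_label": "LOW_COVERAGE",
    "mosdepth_q2_label": "CALLABLE",
    "mosdepth_q3_label": "HIGH_COVERAGE",
    "mosdepth_quantize_additional_options": "--mapq 20",
}
argv, env = build_quantize_command(params, "out_prefix", "sample.bam")
print(" ".join(argv))
```

Keeping the labels in the environment rather than on the command line mirrors how mosdepth expects them to be supplied.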

#### Base resource allocation updaters
To update the base resource (cpus or memory) allocations for processes, use the following structure. The default allocations can be found in the [node-specific config files](./config/)
```Nextflow
@@ -172,14 +187,21 @@ base_resource_update {

| Output | Description |
| ------------ | ------------------------ |
| `{SAMtools-version}_{dataset_id}_{sample_id}_stats.txt` | SAMtools stats results |
| `{Picard-version}_{dataset_id}_{sample_id}_wgs-metrics.txt` | Picard CollectWgsMetrics results |
| `{Qualimap-version}_{dataset_id}_{sample_id}_stats` | Directory of Qualimap results, including, `genome_results.txt` and either `.pdf` or `.html and supporting directories`|
| `{FastQC-version}_{dataset_id}_{sample_id}_fastqc` | Directory of FastQC results |
| `{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.mosdepth.summary.txt` | Coverage by region with a final line for `total` |
| `{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.mosdepth.global.dist.txt` | a cumulative distribution indicating the proportion of total bases that were covered for at least a given coverage value |
| `{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.mosdepth.region.dist.txt` | a cumulative distribution indicating the proportion of the windows that were covered for at least a given coverage value |
| `{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.regions.bed.gz` | Bedfile giving coverage for each window |
| `{SAMtools-version}_{dataset_id}_{sample_id}_stats.txt` | `SAMtools stats` sample level results |
| `{SAMtools-version}_{dataset_id}_{sample_id}-{library_id}_stats.txt` | `SAMtools stats` library level results |
| `{SAMtools-version}_{dataset_id}_{sample_id}-{library_id}-{rg_id}_stats.txt` | `SAMtools stats` readgroup level results |
| `{Picard-version}_{dataset_id}_{sample_id}_wgs-metrics.txt` | `Picard CollectWgsMetrics` results |
| `{Qualimap-version}_{dataset_id}_{sample_id}_stats` | Directory of `Qualimap` results, including `genome_results.txt` and either a `.pdf` report or an `.html` report with supporting directories |
| `{FastQC-version}_{dataset_id}_{sample_id}_fastqc` | Directory of sample level `FastQC` results |
| `{FastQC-version}_{dataset_id}_{sample_id}-{library_id}_fastqc` | Directory of library level `FastQC` results |
| `{FastQC-version}_{dataset_id}_{sample_id}-{library_id}-{rg_id}_fastqc` | Directory of readgroup level `FastQC` results |
| `{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.mosdepth.summary.txt` | `mosdepth` coverage results by region with a final line for `total` |
| `{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.mosdepth.global.dist.txt` | `mosdepth` coverage cumulative distribution indicating the proportion of total bases that were covered for at least a given coverage value |
| `{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.mosdepth.region.dist.txt` | `mosdepth` coverage cumulative distribution indicating the proportion of the windows that were covered for at least a given coverage value |
| `{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.regions.bed.gz` | `mosdepth` coverage bedfile giving coverage for each window |
| `{mosdepth-version}_{dataset_id}_{sample_id}-quantize-{q0}-{q1}-{q2}-{q3}.mosdepth.summary.txt` | `mosdepth` quantize coverage results by region with a final line for `total` |
| `{mosdepth-version}_{dataset_id}_{sample_id}-quantize-{q0}-{q1}-{q2}-{q3}.mosdepth.global.dist.txt` | `mosdepth` quantize cumulative distribution indicating the proportion of total bases that were covered for at least a given coverage value |
| `{mosdepth-version}_{dataset_id}_{sample_id}-quantize-{q0}-{q1}-{q2}-{q3}.quantized.bed.gz` | `mosdepth` quantize bed file |
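
The quantize filename pattern can be made concrete with a small Python sketch. It assumes that `{q0}` through `{q3}` are the colon-separated values of `mosdepth_quantize_cutoffs`, which this diff does not confirm. The function name and the example version, dataset and sample identifiers are all hypothetical.

```python
from typing import List


def quantize_output_names(tool_version: str, dataset_id: str, sample_id: str, cutoffs: str) -> List[str]:
    """Build the three quantize output names from the pattern in the table above."""
    # Assumption: {q0}-{q3} are the individual cutoff values, joined with dashes.
    bins = "-".join(c for c in cutoffs.split(":") if c)
    prefix = f"{tool_version}_{dataset_id}_{sample_id}-quantize-{bins}"
    suffixes = (".mosdepth.summary.txt", ".mosdepth.global.dist.txt", ".quantized.bed.gz")
    return [prefix + suffix for suffix in suffixes]


# Hypothetical identifiers, for illustration only.
for name in quantize_output_names("mosdepth-X.Y.Z", "DATASET01", "SAMPLE01", "0:1:5:150"):
    print(name)
```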

---

10 changes: 10 additions & 0 deletions config/F16.config
@@ -43,6 +43,16 @@ process {
}
}
}
withName: quantize_coverage_mosdepth {
cpus = 1
memory = 8.GB
retry_strategy {
memory {
strategy = 'add'
operand = 4.GB
}
}
}
withName: run_statsReadgroups_SAMtools {
cpus = 1
memory = 1.GB
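
The `retry_strategy` blocks in these node configs pair a starting allocation with an `add` operand. A minimal Python sketch of the resulting escalation follows, assuming each retry adds the operand to the previous attempt's memory. The retry logic itself lives outside this diff, so this describes the assumed behavior rather than the actual implementation, and `memory_per_attempt` is a hypothetical helper.

```python
from typing import List


def memory_per_attempt(base: float, operand: float, attempts: int, strategy: str = "add") -> List[float]:
    """List the memory requested on each attempt, in the units of `base`."""
    if strategy != "add":
        # Only the strategy that appears in this diff is modeled here.
        raise ValueError(f"unsupported strategy in this sketch: {strategy!r}")
    # Assumption: attempt k (1-based) requests base + (k - 1) * operand.
    return [base + k * operand for k in range(attempts)]


# quantize_coverage_mosdepth on F16: 8 GB start, 4 GB added per retry.
print(memory_per_attempt(8, 4, 3))
# quantize_coverage_mosdepth on F2: 1500 MB start, 2000 MB added per retry.
print(memory_per_attempt(1500, 2000, 3))
```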
20 changes: 20 additions & 0 deletions config/F2.config
@@ -43,6 +43,26 @@ process {
}
}
}
withName: assess_coverage_mosdepth {
cpus = 1
memory = 1500.MB
retry_strategy {
memory {
strategy = 'add'
operand = 2000.MB
}
}
}
withName: quantize_coverage_mosdepth {
cpus = 1
memory = 1500.MB
retry_strategy {
memory {
strategy = 'add'
operand = 2000.MB
}
}
}
withName: run_statsReadgroups_SAMtools {
cpus = 1
memory = 1.5.GB
10 changes: 10 additions & 0 deletions config/F32.config
@@ -43,6 +43,16 @@ process {
}
}
}
withName: quantize_coverage_mosdepth {
cpus = 1
memory = 8.GB
retry_strategy {
memory {
strategy = 'add'
operand = 4.GB
}
}
}
withName: run_statsReadgroups_SAMtools {
cpus = 1
memory = 1.GB
10 changes: 10 additions & 0 deletions config/F4.config
@@ -43,6 +43,16 @@ process {
}
}
}
withName: quantize_coverage_mosdepth {
cpus = 1
memory = 8.GB
retry_strategy {
memory {
strategy = 'add'
operand = 2.GB
}
}
}
withName: run_statsReadgroups_SAMtools {
cpus = 1
memory = 1.GB
10 changes: 10 additions & 0 deletions config/F72.config
@@ -43,6 +43,16 @@ process {
}
}
}
withName: quantize_coverage_mosdepth {
cpus = 1
memory = 8.GB
retry_strategy {
memory {
strategy = 'add'
operand = 8.GB
}
}
}
withName: run_statsReadgroups_SAMtools {
cpus = 1
memory = 1.GB
10 changes: 10 additions & 0 deletions config/F8.config
@@ -43,6 +43,16 @@ process {
}
}
}
withName: quantize_coverage_mosdepth {
cpus = 1
memory = 8.GB
retry_strategy {
memory {
strategy = 'add'
operand = 2.GB
}
}
}
withName: run_statsReadgroups_SAMtools {
cpus = 1
memory = 1.GB
10 changes: 10 additions & 0 deletions config/M64.config
@@ -43,6 +43,16 @@ process {
}
}
}
withName: quantize_coverage_mosdepth {
cpus = 1
memory = 8.GB
retry_strategy {
memory {
strategy = 'add'
operand = 8.GB
}
}
}
withName: run_statsReadgroups_SAMtools {
cpus = 1
memory = 1.GB
49 changes: 43 additions & 6 deletions config/schema.yaml
@@ -9,11 +9,12 @@ dataset_id:
help: 'Dataset identifier'
algorithm:
type: 'List'
required: false
required: true
help: 'List of QC algorithms'
choices:
- fastqc
- mosdepth_coverage
- mosdepth_quantize
- samtools_stats
- collectwgsmetrics
- qualimap_bamqc
@@ -82,22 +83,58 @@ mosdepth_use_fast_algorithm:
required: false
default: true
help: 'fast algorithm ignores read pair overlaps and CIGAR information'
mosdepth_window_size:
type: 'Integer'
required: false
default: 500
help: 'Window size for mosdepth coverage calculation'
mosdepth_per_base_output:
type: 'Bool'
required: false
default: true
help: 'Output per-base coverage'
mosdepth_window_size:
type: 'Integer'
required: false
default: 500
help: 'Window size for mosdepth coverage calculation'
mosdepth_additional_options:
type: 'String'
required: false
allow_empty: true
default: ''
help: 'Additional arguments for mosdepth command'
mosdepth_quantize_cutoffs:
type: 'String'
required: false
default: '0:1:5:150'
help: 'Quantize coverage values into these bins'
mosdepth_quantize_use_fast_algorithm:
type: 'Bool'
required: false
default: false
help: 'Use fast algorithm for quantizing coverage values'
Comment on lines +107 to +111
Contributor

question (non-blocking): Is there a downside to enabling the use of the fast algorithm by default?

Collaborator Author @sorelfitzgibbon, Nov 1, 2024

Yeah, this is a difficult decision. Not using fast mode gives more "correct" results. Fast mode ignores paired read overlaps and CIGAR strings (thus indels wrt ref). Ignoring the paired read overlap is the bigger issue, especially for samples with small insert sizes (wrt read length). This is noted in the README. The time difference isn't clear as I haven't benchmarked more than a few samples and not directly on scratch. It's enough that the mosdepth author recommends fast mode for most use-cases (I assume non-small insert cases). We currently have fast mode true by default for the regular mosdepth coverage calculation, so we should probably change one or the other to make them consistent.
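
To make the overlap issue concrete, here is a toy Python sketch of depth at a single position, with and without mate-overlap correction. It is a simplification under stated assumptions: reads are plain half-open intervals, CIGAR handling is ignored entirely, and `depth_at` is a hypothetical helper rather than anything from mosdepth.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def depth_at(pos: int, pairs: List[Tuple[Interval, Interval]], fast_mode: bool) -> int:
    """Count coverage at `pos` from read pairs given as (mate1, mate2) intervals."""
    depth = 0
    for (s1, e1), (s2, e2) in pairs:
        covers1 = s1 <= pos < e1
        covers2 = s2 <= pos < e2
        if fast_mode:
            # Each mate is counted on its own, so overlapping mates count twice.
            depth += int(covers1) + int(covers2)
        else:
            # Overlap-corrected: one fragment contributes at most 1 per position.
            depth += int(covers1 or covers2)
    return depth


# One pair of 150 bp mates whose fragment is short enough that they overlap by 50 bp.
pairs = [((100, 250), (200, 350))]
print(depth_at(220, pairs, fast_mode=True))   # inside the overlap
print(depth_at(220, pairs, fast_mode=False))  # inside the overlap
print(depth_at(120, pairs, fast_mode=True))   # outside the overlap
```

The shorter the insert relative to the read length, the larger the overlapping stretch, which is why the double counting matters most for small-insert libraries.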

mosdepth_q0_label:
type: 'String'
required: false
default: 'Q0'
help: 'Label for lowest coverage bin'
mosdepth_q1_label:
type: 'String'
required: false
default: 'Q1'
help: 'Label for second lowest coverage bin'
mosdepth_q2_label:
type: 'String'
required: false
default: 'Q2'
help: 'Label for third lowest coverage bin'
mosdepth_q3_label:
type: 'String'
required: false
default: 'Q3'
help: 'Label for highest coverage bin'
mosdepth_quantize_additional_options:
type: 'String'
required: false
allow_empty: true
default: ''
help: 'Additional arguments for mosdepth-quantize command'
cwm_coverage_cap:
type: 'Integer'
required: false
15 changes: 12 additions & 3 deletions config/template.config
@@ -8,7 +8,7 @@ includeConfig "${projectDir}/nextflow.config"

// Inputs/parameters of the pipeline
params {
algorithm = ['samtools_stats', 'collectwgsmetrics'] // 'fastqc', 'samtools_stats', 'collectwgsmetrics', 'mosdepth_coverage', 'qualimap_bamqc'
algorithm = ['samtools_stats', 'collectwgsmetrics'] // 'fastqc', 'samtools_stats', 'collectwgsmetrics', 'mosdepth_quantize', 'mosdepth_coverage', 'qualimap_bamqc'
reference = '/hot/resource/reference-genome/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta'
output_dir = '/path/to/output/directory'
blcds_registered_dataset = false // if you want the output to be registered
@@ -22,9 +22,18 @@ params {
// mosdepth window-base coverage options
// fast algorithm ignores read pair overlaps and should not be used on libraries with small insert sizes
mosdepth_use_fast_algorithm = false
mosdepth_window_size = 500
mosdepth_per_base_output = true
mosdepth_additional_options = ''
mosdepth_window_size = 500
mosdepth_additional_options = '--mapq 20'

// mosdepth quantized coverage (like GATK's Callable Regions)
mosdepth_quantize_cutoffs = '0:1:5:150'
mosdepth_quantize_use_fast_algorithm = false
mosdepth_q0_label = 'NO_COVERAGE'
mosdepth_q1_label = 'LOW_COVERAGE'
mosdepth_q2_label = 'CALLABLE'
mosdepth_q3_label = 'HIGH_COVERAGE'
mosdepth_quantize_additional_options = '--mapq 20'

// Picard CollectWgsMetrics options
cwm_coverage_cap = 1000