Update Snakemake 8 and Gather/Scatter Indel Calling (#13)

* added pysam
* current changes to changelog
* implemented scatter and gather for first round htc
* removed optional quantification - should be required
* caught error when unknown bases occur in wildtype
* remove unwanted print
* added routine to select mhc class type
* splitting in mutect2 analysis for speedup
* rename rule name
* combine single-end and paired-end reads to prepare input for mhc-II genotyping
* added instructions for Snakemake 8
* updated minimum version of Snakemake to 8.x.x
* gather scatter for indel calling
* added instructions to Snakemake 8 and apptainer replaces singularity
* added routine to ease the use of custom variants
* refactor hlatyping to combine read retrieval for MHC-I and MHC-II
* outsource rules for custom variants to improve readability
* added reference sets for hla alleles (to compare against)
* added separate rules for MHC-II prediction tools download
* accept wildcard <group> as parameter to improve usability
* Remove check for valid alleles - this is now done later to include also user-provided ones
* change to single file input
* add routine for MHC-I and MHC-II into same script
* add safety routine if no counts can be found (when no seqdata present)
* added custom rules
* added parameters for alignment to config
* changed order when adding INFO tags
* added sorting routine
* safety routines added
* outsource merging of predicted MHC-II alleles
* added few parameters
* added to feature list
* changed path to provided hlahd path
* hlahd call as non-file parameter
* added changes to path also to testconfig
ylab-hi · Mar 1, 2024 · commit 7145dc4 · 1 parent 2928171

Showing 30 changed files with 4,927 additions and 456 deletions.
diff --git a/.tests/integration/config_basic/config.yaml b/.tests/integration/config_basic/config.yaml
@@ -18,7 +18,6 @@ data:
     hlatyping:
       MHC-I:
       MHC-II:
-    readgroups: 
 
 ### pre-processing (only applied on fastq reads)
 preproc: 
@@ -80,14 +79,15 @@ indel:
   sliprate: 0.1  # frequency of slippage when it is supsected
 
 quantification:
-  activate: true
   mode: BOTH # RNA, RNA or BOTH
 
 hlatyping:
   class: I # I, II or BOTH
   # specific path for class II hlatyping (only required when class: II, or BOTH)
   MHC-I_mode: DNA, RNA # DNA, RNA, or BOTH (if empty alleles have to be specified in custom)
   MHC-II_mode: BOTH # DNA, RNA, or BOTH (if empty alleles have to be specified in custom)
+
+  hlahd_path: ./hlahd.1.7.0/
   freqdata: ./hlahd_files/freq_data/ 
   split: ./hlahd_files/HLA_gene.split.txt
   dict: ./hlahd_files/dictionary/

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,30 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.2.0] - 2024-02-25
+
+### Features
+
+- ScanNeo2 supports Snakemake>=8
+    - --use-conda replaced by --software-deployment-method conda
+    - --use-singularity replaced by --software-deployment-method apptainer
+- Gather/scatter of the indel calling speeds up ScanNeo2 on multiple cores
+    - added script to split bamfiles by chromosome (scripts/split_bam_by_chr.py)
+    - haplotypecaller first/final round is done per chromosome and later merged
+    - mutect2 is done per chromosome and later merged
+- Genotyping MHC-II now works on both single-end and paired-end reads
+- User-defined HLA alleles are matched against the hla refset
+- Added multiple routines to catch errors when only custom variants are provided
+- Added additional parameters in config file
+
+### Fix 
+
+- When using BAMfiles the HLA typing wrongly expected single-end reads and performed preprocessing
+- Each environment is now thoroughly versioned to ensure interoperability
+- Missing immunogenicity calculation on certain values of MHC-I fixed
+- Fixed prediction of binding affinity in MHC-II (as the columns are different from MHC-I)
+
+
 ## [0.1.6] - 2024-02-13
 
 ### Fix 
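
The gather/scatter entries in the changelog above (per-chromosome haplotypecaller and mutect2 runs that are merged afterwards) follow a general pattern that can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from `scripts/split_bam_by_chr.py`: the record tuples, the `call_variants` stand-in, and all function names are hypothetical, and no BAM files or GATK tools are involved.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def scatter_by_chromosome(records):
    """Scatter step: group (chrom, pos, payload) records by chromosome."""
    buckets = defaultdict(list)
    for chrom, pos, payload in records:
        buckets[chrom].append((pos, payload))
    return dict(buckets)


def call_variants(chrom, reads):
    """Hypothetical stand-in for a per-chromosome caller.

    It only sorts by position and upper-cases the payload so that the
    effect of the per-chromosome step is visible in the result.
    """
    return [(chrom, pos, payload.upper()) for pos, payload in sorted(reads)]


def gather(per_chrom_calls, chrom_order):
    """Gather step: concatenate per-chromosome results in a fixed order."""
    merged = []
    for chrom in chrom_order:
        merged.extend(per_chrom_calls.get(chrom, []))
    return merged


def scatter_gather(records, chrom_order, workers=4):
    """Run the caller once per chromosome in parallel, then merge."""
    buckets = scatter_by_chromosome(records)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            chrom: pool.submit(call_variants, chrom, reads)
            for chrom, reads in buckets.items()
        }
        results = {chrom: fut.result() for chrom, fut in futures.items()}
    return gather(results, chrom_order)


records = [("chr2", 50, "a"), ("chr1", 200, "b"), ("chr1", 100, "c")]
merged = scatter_gather(records, ["chr1", "chr2"])
```

Because the gather step walks a fixed chromosome order, the merged output does not depend on which per-chromosome job finishes first.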

diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 <div align="left">
     <h1>ScanNeo2</h1>
-    <img src="https://img.shields.io/badge/snakemake-≥6.4.1-brightgreen.svg">
+    <img src="https://img.shields.io/badge/snakemake-≥8.0.0-brightgreen.svg">
     <img src="https://github.com/ylab-hi/ScanNeo2/actions/workflows/linting.yml/badge.svg" alt="Workflow status badge">
 </div>
 
@@ -29,9 +29,10 @@ To get started with ScanNeo2, follow the steps below:
     mamba activate scanneo2
     ```
 
-    Note: This installs Snakemake v7.32.x. In its current form, ScanNeo2 is not comptabile with Snakemake >= 8.x.x. 
-    If ScanNeo2 is configured to use the exitron module, singularity needs to be installed. For that, the 
-    `environment_singularity.yml' can be used. However, most HPC servers provide their own module installation.
+    Note: ScanNeo2 requires Snakemake >= 8.x.x and is not compatible with Snakemake < 8.x.x. If ScanNeo2 
+    is configured to use the exitron module, apptainer (formerly singularity) needs to be installed.
+    For that, the `environment_apptainer.yml` can be used. However, most HPC servers provide their own 
+    module installation (which should be preferred).
 
 2. Deploy ScanNeo2:
 
@@ -66,13 +67,13 @@ To run the workflow, use the following command:
 
 ```bash
 cd /path/to/your/working/directory/
-snakemake --cores all --use-conda
+snakemake --cores all --software-deployment-method conda 
 ```
 
-As mentioned above, when exitron detection is activated the singularity option `--use-singularity` has to be used as well.
+As mentioned above, when exitron detection is activated the apptainer option `--software-deployment-method apptainer` has to be used as well.
 
 ```bash
-snakemake --cores all --use-conda --use-singularity
+snakemake --cores all --software-deployment-method conda apptainer 
 ```
 
 In addition, custom configfiles can be configured using `--configfile <path/to/configfile>`. In principle, this merely 
@@ -101,7 +102,19 @@ ScanNeo2 provides an accessible, efficient method for predicting neoantigens. It
 
 ## Citation
 
-If ScanNeo2 has proven useful in your work please cite it using the linked publication.
+@article{Schafer2023Nov,
+	author = {Sch{\ifmmode\ddot{a}\else\"{a}\fi}fer, Richard A. and Guo, Qingxiang and Yang, Rendong},
+	title = {{ScanNeo2: a comprehensive workflow for neoantigen detection and immunogenicity prediction from diverse genomic and transcriptomic alterations}},
+	journal = {Bioinformatics},
+	volume = {39},
+	number = {11},
+	pages = {btad659},
+	year = {2023},
+	month = nov,
+	issn = {1367-4811},
+	publisher = {Oxford Academic},
+	doi = {10.1093/bioinformatics/btad659}
+}
 
 ## License
 

diff --git a/config/config.yaml b/config/config.yaml
@@ -16,7 +16,6 @@ data:
     hlatyping:
       MHC-I:
       MHC-II:
-    readgroups: 
 
 ### pre-processing (only applied on fastq reads)
 preproc: 
@@ -29,11 +28,11 @@ preproc:
 
 ### alingment
 align:
-  minovlps: 10
-  chimsegmin: 20
-  chimoverhang: 10
-  chimmax: 50
-  chimmaxdrop: 30
+  chimSegmentMin: 20
+  chimScoreMin: 10
+  chimJunctionOverhangMin: 10
+  chimScoreDropMax: 30
+  chimScoreSeparation: 10
 
 ### variant calling
 # alternative splicing
@@ -77,7 +76,6 @@ indel:
   sliprate: 0.1  # frequency of slippage when it is supsected
 
 quantification:
-  activate: true
   mode: BOTH # RNA, RNA or BOTH
 
 hlatyping:
@@ -86,6 +84,9 @@ hlatyping:
   # specific path for class II hlatyping (only required when class: II, or BOTH)
   MHC-I_mode: BOTH # DNA, RNA, or BOTH (if empty alleles have to be specified in custom)
   MHC-II_mode: BOTH # DNA, RNA, or BOTH (if empty alleles have to be specified in custom)
+
+  # specific path for class II hlatyping (only required when class: II, or BOTH)
+  hlahd_path: ./hlahd.1.7.0/
   freqdata: ./hlahd_files/freq_data/ 
   split: ./hlahd_files/HLA_gene.split.txt
   dict: ./hlahd_files/dictionary/

diff --git a/environment.yml b/environment.yml
@@ -4,5 +4,5 @@ channels:
  - conda-forge
  - anaconda
 dependencies:
- - snakemake=7.32.3
+ - snakemake=8.4.11
  - snakemake-wrapper-utils
diff --git a/environment_singularity.yml → environment_apptainer.yml b/environment_singularity.yml → environment_apptainer.yml
@@ -4,5 +4,7 @@ channels:
  - conda-forge
  - anaconda
 dependencies:
- - snakemake=7.32.3
+ - snakemake=8.4.11
  - snakemake-wrapper-utils
+ - apptainer
+
diff --git a/workflow/Snakefile b/workflow/Snakefile
@@ -1,7 +1,7 @@
 from snakemake.utils import min_version
 
 ##### set minimum snakemake version #####
-min_version("6.4.1")
+min_version("8.0.0")
 
 #### setup #######
 configfile: "config/config.yaml"
@@ -23,6 +23,7 @@ include: "rules/genefusion.smk"
 include: "rules/altsplicing.smk"
 include: "rules/exitron.smk"
 include: "rules/indel.smk"
+include: "rules/custom.smk"
 include: "rules/germline.smk"
 include: "rules/annotation.smk"
 include: "rules/prioritization.smk"
diff --git a/workflow/envs/basic.yml b/workflow/envs/basic.yml
@@ -9,3 +9,4 @@ dependencies:
   - pyfaidx
   - biopython=1.78
   - gffutils
+  - pysam
diff --git a/workflow/rules/align.smk b/workflow/rules/align.smk
@@ -1,6 +1,6 @@
 ### align reads to genome using STAR (when reads are in FASTQ)
 if config['data']['rnaseq_filetype'] == '.fastq' or config['data']['rnaseq_filetype'] == '.fq':
-  rule star_fq_paired_end:
+  rule star_align_fastq:
     input:
       unpack(get_star_input),
       faidx = "resources/refs/genome.fasta.fai",
@@ -19,22 +19,23 @@ if config['data']['rnaseq_filetype'] == '.fastq' or config['data']['rnaseq_filet
           --outSAMattributes RG HI \
           --outSAMattrRGline ID:{wildcards.group} \
           --outFilterMultimapNmax 50 \
-          --peOverlapNbasesMin 20 \
+          --peOverlapNbasesMin 15 \
           --alignSplicedMateMapLminOverLmate 0.5 \
           --alignSJstitchMismatchNmax 5 -1 5 5 \
           --chimOutType WithinBAM HardClip \
-          --chimSegmentMin 20 \
-          --chimJunctionOverhangMin 10 \
-          --chimScoreDropMax 30 \
+          --chimSegmentMin {config["align"]["chimSegmentMin"]} \
+          --chimJunctionOverhangMin {config["align"]["chimJunctionOverhangMin"]} \
+          --chimScoreDropMax {config["align"]["chimScoreDropMax"]} \
+          --chimScoreMin {config["align"]["chimScoreMin"]} \
           --chimScoreJunctionNonGTAG 0 \
-          --chimScoreSeparation 1 \
+          --chimScoreSeparation {config["align"]["chimScoreSeparation"]} \
           --chimSegmentReadGapMax 3 \
           --chimMultimapNmax 50 \
           --outSAMstrandField intronMotif"""
     threads: config['threads']
     wrapper:
         "v2.2.1/bio/star/align"
-
+          
 ### align reads to genome using STAR (when reads are in BAM - no preprocessing performed)
 if config['data']['rnaseq_filetype'] == '.bam':
   checkpoint split_bamfile_RG:
@@ -88,12 +89,17 @@ if config['data']['rnaseq_filetype'] == '.bam':
       extra=lambda wildcards: f"""--outSAMtype BAM Unsorted --genomeSAindexNbases 10 \
         --readFilesCommand zcat \
         --outSAMattributes RG HI --outSAMattrRGline ID:{wildcards.rg} \
-        --outFilterMultimapNmax 50 --peOverlapNbasesMin 20 \
+        --outFilterMultimapNmax 50 \
+        --peOverlapNbasesMin 15 \
         --alignSplicedMateMapLminOverLmate 0.5 \
         --alignSJstitchMismatchNmax 5 -1 5 5 \
-        --chimOutType WithinBAM HardClip --chimSegmentMin 20 \
-        --chimJunctionOverhangMin 10 --chimScoreDropMax 30 \
-        --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 \
+        --chimOutType WithinBAM HardClip \
+        --chimSegmentMin {config["align"]["chimSegmentMin"]} \
+        --chimJunctionOverhangMin {config["align"]["chimJunctionOverhangMin"]} \
+        --chimScoreDropMax {config["align"]["chimScoreDropMax"]} \
+        --chimScoreMin {config["align"]["chimScoreMin"]} \
+        --chimScoreJunctionNonGTAG 0 \
+        --chimScoreSeparation {config["align"]["chimScoreSeparation"]} \
         --chimSegmentReadGapMax 3 --chimMultimapNmax 50 \
         --outSAMstrandField intronMotif"""
     threads: config['threads']
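
The `align.smk` hunks above replace hard-coded STAR chimeric options with lookups into the `align` section of the config (note that `chimScoreSeparation` was previously fixed at 1, while the new config default shown in `config/config.yaml` is 10). A minimal sketch of that interpolation outside Snakemake is given below. The helper name `build_chim_args` is hypothetical and not part of the workflow; the option names and default values are taken from the diff.

```python
# Option names as they appear in the `align` section of config/config.yaml.
CHIM_KEYS = [
    "chimSegmentMin",
    "chimJunctionOverhangMin",
    "chimScoreDropMax",
    "chimScoreMin",
    "chimScoreSeparation",
]


def build_chim_args(config):
    """Render the configurable chimeric options as a STAR argument string.

    Raises KeyError if a required key is missing from config["align"],
    which mirrors what a direct config[...] lookup in a rule would do.
    """
    align = config["align"]
    return " ".join(f"--{key} {align[key]}" for key in CHIM_KEYS)


# Values copied from the defaults added to config/config.yaml in this commit.
config = {
    "align": {
        "chimSegmentMin": 20,
        "chimScoreMin": 10,
        "chimJunctionOverhangMin": 10,
        "chimScoreDropMax": 30,
        "chimScoreSeparation": 10,
    }
}
args = build_chim_args(config)
```

A missing key fails loudly here instead of silently falling back to a STAR default, which seems the safer choice for parameters that affect chimeric read detection.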