From 3241ff39eb0356ed7ba7a688b0a81084526dde63 Mon Sep 17 00:00:00 2001
From: "Philip R. Kensche"
Date: Fri, 10 Dec 2021 15:07:28 +0100
Subject: [PATCH 01/13] Envisaged changes for version 4 in README. Typo fix in
config.
---
README.md | 26 +++++++++++++------
.../analysisRNAseqGRCh38.xml | 2 +-
2 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/README.md b/README.md
index 04aa285..916feb5 100755
--- a/README.md
+++ b/README.md
@@ -213,6 +213,18 @@ G Set your star index (GENOME_STAR_INDEX) and gene models (GENE_MODELS) paramete
## Change Log
+* 4.0.0
+ - major: Arriba 2
+ - major: `resources/configurationFiles/analysisRNAseq.xml` is now just a default configuration with many configuration options left blank. For your `` tags in your project XMLs the analysis names can be changed to use the following defaults:
+ * GRCh37-specific: [RNAseqAnalysisGRCh37](resources/configurationFiles/analysisRNAseqGRCh37.xml)
+ * GRCh38-specific: [RNAseqAnalysisGRCh38](resources/configurationFiles/analysisRNAseqGRCh38.xml)
+ - Update default software versions
+ * STAR 2.7.6a (was 2.5.3a for GRCh37/hg19)
+ * Kallisto 0.46.0 (was 0.42.0 for GRCh37/hg19)
+ * Samtools 1.9 (was 1.6 for GRCh37/hg19)
+ * HTSlib 1.9 (was 1.6 for GRCh37/hg19)
+ * Gene model gencode version 31 was used to create the STAR and Kallisto indexes
+
* 3.1.0
- patch: Updated default from subread 1.5.1 to 1.6.5. The previous version produces occasional segmentation faults (related to extreme optimization option `-O6`) but otherwise produces the same results. Both versions produced exactly the same `featureCounts` in multiple tests.
@@ -224,12 +236,12 @@ G Set your star index (GENOME_STAR_INDEX) and gene models (GENE_MODELS) paramete
- "TPM_legacy{,_fw,_rv}" -> "TPM_standard{,_fw,_rv}"
* 2.1.0 (2.0.3-deprecated) [26th Mar 2021]
- - minor: Added GRCh38 genome support
+ - minor: Added GRCh38 genome support; version changes only affect hg38
- Reference genome (core_ref_GRCh38_hla_decoy_ebv.tar.gz) was downloaded from ftp://ftp.sanger.ac.uk/pub/cancer/dockstore/human/GRCh38_hla_decoy_ebv/ and Illumina PhiX genome was added.
- Added GRCh38 specific configs to a `resources/configurationFiles/analysisRNAseqGRCh38.xml`
- - Updated STAR to 2.7.6a
- - Updated Kallisto to 0.46.0
- - Gene model gencode version 31 was used to create the STAR and Kallisto indexes
+ - minor: STAR to 2.7.6a
+ - minor: Updated Kallisto to 0.46.0
+ - minor: Gene model gencode version 31 was used to create the STAR and Kallisto indexes
- minor: Update Samtools to 1.9
- minor: Update HTSlib to 1.9
- 2.0.3 was deprecated to correct the incorrect version number.
@@ -243,13 +255,11 @@ G Set your star index (GENOME_STAR_INDEX) and gene models (GENE_MODELS) paramete
* 2.0.0 [13th Mar 2020]
- major: Update Arriba to version 1.2.0
- - major: Update STAR to 2.5.3a
- - minor: Update Samtools to 1.6
- patch: Update Subread (featureCounts) to 1.5.3
- patch: Added draft Conda environment (incomplete)
* 1.3.0-2 [26th Nov 2019]
- - Removed the single-quotes around `${ADAPTER_SEQ}` in `--clip3pAdapterSeq` again. STAR uses non-standard way of parsing parameters and manages to get all adapters. With quotes the adapters get also quoted and it is unclear what STAR does with them, except that it does not complain about a configuration error and that it also does not complain with even more severe misconfigurations, such as other non-DNA sequences as adapter sequences. The manual also does not use quoted parameter arguments, so no-quotes is conform to this STAR-specific CLI parameter handling pattern.
+ - Removed the single-quotes around `${ADAPTER_SEQ}` in `--clip3pAdapterSeq` again. STAR uses non-standard way of parsing parameters and manages to get all adapters. With quotes the adapters get also, and it is unclear what STAR does with them, except that it does not complain about a configuration error and that it also does not complain with even more severe misconfigurations, such as other non-DNA sequences as adapter sequences. The manual also does not use quoted parameter arguments, so no-quotes is conform to this STAR-specific CLI parameter handling pattern.
* 1.3.0-1 [7th Nov 7 2018]
- patch: Added single quotes around `$ADAPTER_SEQ` parameter in `--clip3pAdapterSeq` to allow for separate first and second read adapters.
@@ -259,7 +269,7 @@ G Set your star index (GENOME_STAR_INDEX) and gene models (GENE_MODELS) paramete
- minor: works with Roddy 3
* 1.2.23-2 [15th Feb 2018]
- - Modified software defaults to samtools 1.6, star 2.5.3a, arriba 0.12.
+ - Modified software defaults to samtools 1.6, star 2.5.2b -> 2.5.3a, arriba 0.8 -> 0.12.
- Added exception for loading htslib if samtools version is too low.
- Modified output file check of BAM file
diff --git a/resources/configurationFiles/analysisRNAseqGRCh38.xml b/resources/configurationFiles/analysisRNAseqGRCh38.xml
index 6c403a2..992d436 100644
--- a/resources/configurationFiles/analysisRNAseqGRCh38.xml
+++ b/resources/configurationFiles/analysisRNAseqGRCh38.xml
@@ -1,4 +1,4 @@
-
Date: Thu, 3 Feb 2022 11:37:02 +0100
Subject: [PATCH 02/13] Updated versions also for hg37. Updated changelog.
---
README.md | 26 +++++++++----------
.../configurationFiles/analysisRNAseq.xml | 17 +++++++-----
.../analysisRNAseqGRCh38.xml | 19 +++++++-------
3 files changed, 33 insertions(+), 29 deletions(-)
diff --git a/README.md b/README.md
index 916feb5..0d4ee45 100755
--- a/README.md
+++ b/README.md
@@ -205,28 +205,26 @@ The following is merely on overview over the most important parameters.
## Example Call
-* Change the basic stuff like inputbase and outputbase directories
-set RUN_FEATURE_COUNTS_DEXSEQ, RUN_RNASEQC, RUN_KALLISTO, RUN_ARRIBA, runFingerprinting to FALSE.
-* Check if your sample type is in possibleTumorSampleNamePrefixes
-G Set your star index (GENOME_STAR_INDEX) and gene models (GENE_MODELS) parameters to what your created for (1) and (2).
-* Set GENOME_GATK_INDEX to point to the new FASTA file (1), for which you have created a dict file. The dict file should be in the same directory as the FASTA.
+* Change the basic stuff like `inputbase` and `outputbase` directories
+set `RUN_FEATURE_COUNTS_DEXSEQ`, `RUN_RNASEQC`, `RUN_KALLISTO`, `RUN_ARRIBA`, `runFingerprinting` to `FALSE`.
+* Check if your sample type is in `possibleTumorSampleNamePrefixes`
+* Set your star index (`GENOME_STAR_INDEX`) and gene models (`GENE_MODELS`) parameters to what your created for (1) and (2).
+* Set `GENOME_GATK_INDEX` to point to the new FASTA file (1), for which you have created a dict file. The dict file should be in the same directory as the FASTA.
## Change Log
* 4.0.0
- - major: Arriba 2
+ - major: Arriba 2.2.1
- major: `resources/configurationFiles/analysisRNAseq.xml` is now just a default configuration with many configuration options left blank. For your `` tags in your project XMLs the analysis names can be changed to use the following defaults:
* GRCh37-specific: [RNAseqAnalysisGRCh37](resources/configurationFiles/analysisRNAseqGRCh37.xml)
* GRCh38-specific: [RNAseqAnalysisGRCh38](resources/configurationFiles/analysisRNAseqGRCh38.xml)
- Update default software versions
- * STAR 2.7.6a (was 2.5.3a for GRCh37/hg19)
- * Kallisto 0.46.0 (was 0.42.0 for GRCh37/hg19)
- * Samtools 1.9 (was 1.6 for GRCh37/hg19)
- * HTSlib 1.9 (was 1.6 for GRCh37/hg19)
- * Gene model gencode version 31 was used to create the STAR and Kallisto indexes
-
-* 3.1.0
- - patch: Updated default from subread 1.5.1 to 1.6.5. The previous version produces occasional segmentation faults (related to extreme optimization option `-O6`) but otherwise produces the same results. Both versions produced exactly the same `featureCounts` in multiple tests.
+ * STAR 2.7.6a (from 2.5.3a)
+ * Kallisto 0.46.0 (from 0.42.0)
+ * Samtools 1.9 (from 1.6)
+ * HTSlib 1.9 (from 1.6)
+ * subread 1.6.5 (from 1.5.1): The previous version produces occasional segmentation faults (related to extreme optimization option `-O6`) but otherwise produces the same results. Both versions produced exactly the same `featureCounts` in multiple tests.
+ - Gene model gencode version 31 was used to create the STAR and Kallisto indexes
* 3.0.0 [19th Oct 2021]
- major: Column rename in feature counts table:
diff --git a/resources/configurationFiles/analysisRNAseq.xml b/resources/configurationFiles/analysisRNAseq.xml
index f55f0e0..b625631 100755
--- a/resources/configurationFiles/analysisRNAseq.xml
+++ b/resources/configurationFiles/analysisRNAseq.xml
@@ -63,13 +63,14 @@
-
+
+
-
+
@@ -102,11 +103,13 @@
-
+
-
+
@@ -164,8 +167,10 @@
-
-
+
diff --git a/resources/configurationFiles/analysisRNAseqGRCh38.xml b/resources/configurationFiles/analysisRNAseqGRCh38.xml
index 992d436..bb77629 100644
--- a/resources/configurationFiles/analysisRNAseqGRCh38.xml
+++ b/resources/configurationFiles/analysisRNAseqGRCh38.xml
@@ -8,14 +8,11 @@
cleanupScript="cleanupScript">
-
-
-
-
-
-
+
@@ -37,7 +34,9 @@
-
+
@@ -45,8 +44,10 @@
-
-
+
From 20313e9353af2f0f3b2524a283d56a056b129c45 Mon Sep 17 00:00:00 2001
From: "Philip R. Kensche"
Date: Thu, 3 Feb 2022 11:42:35 +0100
Subject: [PATCH 03/13] More smaller changes.
---
README.md | 7 ++++---
resources/configurationFiles/analysisRNAseq.xml | 1 -
resources/configurationFiles/analysisRNAseqGRCh38.xml | 5 ++---
3 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/README.md b/README.md
index 0d4ee45..b381854 100755
--- a/README.md
+++ b/README.md
@@ -216,15 +216,16 @@ set `RUN_FEATURE_COUNTS_DEXSEQ`, `RUN_RNASEQC`, `RUN_KALLISTO`, `RUN_ARRIBA`, `r
* 4.0.0
- major: Arriba 2.2.1
- major: `resources/configurationFiles/analysisRNAseq.xml` is now just a default configuration with many configuration options left blank. For your `` tags in your project XMLs the analysis names can be changed to use the following defaults:
- * GRCh37-specific: [RNAseqAnalysisGRCh37](resources/configurationFiles/analysisRNAseqGRCh37.xml)
+ * GRCh37, mm10: [RNAseqAnalysis](resources/configurationFiles/analysisRNAseq.xml)
* GRCh38-specific: [RNAseqAnalysisGRCh38](resources/configurationFiles/analysisRNAseqGRCh38.xml)
- - Update default software versions
+ - major: Update default software versions
* STAR 2.7.6a (from 2.5.3a)
* Kallisto 0.46.0 (from 0.42.0)
* Samtools 1.9 (from 1.6)
* HTSlib 1.9 (from 1.6)
* subread 1.6.5 (from 1.5.1): The previous version produces occasional segmentation faults (related to extreme optimization option `-O6`) but otherwise produces the same results. Both versions produced exactly the same `featureCounts` in multiple tests.
- - Gene model gencode version 31 was used to create the STAR and Kallisto indexes
+ - major: Gene model gencode version 31 was used to create the STAR and Kallisto indexes
+ - major: Update default paths to new ngs_share at ODCF.
* 3.0.0 [19th Oct 2021]
- major: Column rename in feature counts table:
diff --git a/resources/configurationFiles/analysisRNAseq.xml b/resources/configurationFiles/analysisRNAseq.xml
index b625631..8d38bfb 100755
--- a/resources/configurationFiles/analysisRNAseq.xml
+++ b/resources/configurationFiles/analysisRNAseq.xml
@@ -74,7 +74,6 @@
-
diff --git a/resources/configurationFiles/analysisRNAseqGRCh38.xml b/resources/configurationFiles/analysisRNAseqGRCh38.xml
index bb77629..614fc0b 100644
--- a/resources/configurationFiles/analysisRNAseqGRCh38.xml
+++ b/resources/configurationFiles/analysisRNAseqGRCh38.xml
@@ -7,7 +7,7 @@
imports="RNAseqAnalysis"
cleanupScript="cleanupScript">
-
+
-
-
+
From 989c8eee07e6bcec5584392f0741e21dcb07df17 Mon Sep 17 00:00:00 2001
From: "Philip R. Kensche"
Date: Thu, 3 Feb 2022 14:22:10 +0100
Subject: [PATCH 04/13] Updated Conda environment.yaml.
---
README.md | 6 +-
environment.yaml | 63 ++++++++++---------
.../configurationFiles/analysisRNAseq.xml | 2 +-
3 files changed, 39 insertions(+), 32 deletions(-)
diff --git a/README.md b/README.md
index b381854..6f5455f 100755
--- a/README.md
+++ b/README.md
@@ -218,14 +218,16 @@ set `RUN_FEATURE_COUNTS_DEXSEQ`, `RUN_RNASEQC`, `RUN_KALLISTO`, `RUN_ARRIBA`, `r
- major: `resources/configurationFiles/analysisRNAseq.xml` is now just a default configuration with many configuration options left blank. For your `` tags in your project XMLs the analysis names can be changed to use the following defaults:
* GRCh37, mm10: [RNAseqAnalysis](resources/configurationFiles/analysisRNAseq.xml)
* GRCh38-specific: [RNAseqAnalysisGRCh38](resources/configurationFiles/analysisRNAseqGRCh38.xml)
- - major: Update default software versions
- * STAR 2.7.6a (from 2.5.3a)
+ - major: Update default software versions for all assemblies
+ * STAR 2.7.10a (from 2.5.3a for hg37/mm10 and 2.7.6a for hg38)
* Kallisto 0.46.0 (from 0.42.0)
* Samtools 1.9 (from 1.6)
* HTSlib 1.9 (from 1.6)
* subread 1.6.5 (from 1.5.1): The previous version produces occasional segmentation faults (related to extreme optimization option `-O6`) but otherwise produces the same results. Both versions produced exactly the same `featureCounts` in multiple tests.
- major: Gene model gencode version 31 was used to create the STAR and Kallisto indexes
- major: Update default paths to new ngs_share at ODCF.
+ - major: Updated the Conda `environment.yaml`.
+ * Note for users at the DKFZ cluster: The software versions used in Conda environment do not exactly match the versions in the default configuration used for the cluster's module system.
* 3.0.0 [19th Oct 2021]
- major: Column rename in feature counts table:
diff --git a/environment.yaml b/environment.yaml
index eeceb55..c8aa459 100644
--- a/environment.yaml
+++ b/environment.yaml
@@ -1,15 +1,15 @@
-name: RNAseqWorkflow
+name: RNAseqWorkflow_4
channels:
- - conda-forge
- bioconda
- - bioconda-legacy
+ - conda-forge
- defaults
+ - bioconda-legacy
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=1_llvm
- _r-mutex=1.0.1=anacondar_1
- - arriba=1.2.0=hc088bd4_0
- - bcftools=1.10.2=hd2cd319_0
+ - arriba=2.2.1=h3198e80_0
+ - bcftools=1.12=h45bccc9_1
- bioconductor-affy=1.56.0=r3.4.1_0
- bioconductor-affyio=1.50.0=r341h470a237_0
- bioconductor-annotate=1.58.0=r341_0
@@ -43,39 +43,43 @@ dependencies:
- bioconductor-xvector=0.20.0=r341h470a237_0
- bioconductor-zlibbioc=1.26.0=r341h470a237_0
- bzip2=1.0.8=h516909a_2
- - ca-certificates=2019.11.28=hecc5488_0
+ - c-ares=1.18.1=h7f98852_0
+ - ca-certificates=2021.10.8=ha878542_0
- cairo=1.14.12=he6fea26_5
- - certifi=2019.11.28=py27_0
- - curl=7.68.0=hf8cf82a_0
+ - certifi=2019.11.28=py27h8c360ce_1
+ - curl=7.76.1=h979ede3_1
- fontconfig=2.13.1=h2176d3f_1000
- freetype=2.9.1=h3cfcefd_1004
- gettext=0.19.8.1=hc5be6a0_1002
- glib=2.55.0=h464dc38_2
- graphite2=1.3.13=hf484d3e_1000
- - gsl=2.5=h294904e_1
+ - gsl=2.6=he838d99_2
- harfbuzz=1.9.0=h08d66d9_0
- - hdf5=1.8.17=11
- - htslib=1.10.2=h78d89cc_0
+ - hdf5=1.10.5=nompi_h5b725eb_1114
+ - htslib=1.12=h9093b5e_1
- icu=58.2=hf484d3e_1000
- jpeg=9c=h14c3975_1001
- - kallisto=0.43.0=hdf51.8.17_2
- - krb5=1.16.4=h2fd8d38_0
- - libblas=3.8.0=15_openblas
- - libcblas=3.8.0=15_openblas
- - libcurl=7.68.0=hda55be3_0
- - libdeflate=1.3=h516909a_0
+ - kallisto=0.46.0=h4f7b962_1
+ - krb5=1.17.1=h2fd8d38_0
+ - libblas=3.9.0=13_linux64_openblas
+ - libcblas=3.9.0=13_linux64_openblas
+ - libcurl=7.76.1=hc4aaa36_1
+ - libdeflate=1.7=h7f98852_5
- libedit=3.1.20170329=0
+ - libev=4.33=h516909a_1
- libffi=3.2.1=he1b5a44_1006
- libgcc=7.2.0=h69d50b8_2
- - libgcc-ng=9.2.0=h24d8f2e_2
+ - libgcc-ng=11.2.0=h1d223b6_12
- libgfortran=3.0.0=1
- - libgfortran-ng=7.3.0=hdf63c60_5
+ - libgfortran-ng=11.2.0=h69a702a_12
+ - libgfortran5=11.2.0=h5c6108e_12
- libiconv=1.15=h516909a_1005
- libidn2=2.3.0=h516909a_0
- - libopenblas=0.3.8=h5ec1e0e_0
+ - libnghttp2=1.43.0=h812cca2_1
+ - libopenblas=0.3.18=pthreads_h8fe5266_0
- libpng=1.6.34=ha92aebf_2
- - libssh2=1.8.2=h22169c7_2
- - libstdcxx-ng=9.2.0=hdf63c60_2
+ - libssh2=1.10.0=ha56f1ee_2
+ - libstdcxx-ng=11.2.0=he4da1e4_12
- libtiff=4.0.9=h648cc4a_1002
- libunistring=0.9.10=h14c3975_0
- libuuid=2.32.1=h14c3975_1000
@@ -83,8 +87,8 @@ dependencies:
- libxml2=2.9.9=h13577e0_2
- llvm-openmp=9.0.1=hc9558a2_2
- ncurses=5.9=10
- - openjdk=11.0.1=h516909a_1016
- - openssl=1.1.1d=h516909a_0
+ - openjdk=7.0.161=zulu7.21.0.3_0
+ - openssl=1.1.1l=h7f98852_0
- pango=1.40.14=h53a7087_1002
- pcre=8.39=0
- perl=5.26.2=h516909a_1006
@@ -92,6 +96,7 @@ dependencies:
- pixman=0.34.0=h14c3975_1003
- pthread-stubs=0.4=h14c3975_1001
- python=2.7.15=h721da81_1008
+ - python_abi=2.7=1_cp27mu
- qualimap=2.2.2a=3
- r=3.4.1=r3.4.1_0
- r-aroma.affymetrix=3.1.1=r341h6115d3f_0
@@ -184,12 +189,13 @@ dependencies:
- r-xml=3.98_1.16=r341hc070d10_0
- r-xtable=1.8_3=r341_1000
- readline=7.0=0
+ - rna-seqc=1.1.8=1
- sambamba=0.6.5=0
- - samtools=1.6=h244ad75_5
+ - samtools=1.9=h46bd0b3_0
- setuptools=44.0.0=py27_0
- sqlite=3.26.0=h7b6447c_0
- - star=2.5.3a=0
- - subread=1.5.3=0
+ - star=2.7.10a=h9ee0642_0
+ - subread=1.6.4=h84994c4_1
- tk=8.6.10=hed695b0_0
- wget=1.20.1=h22169c7_0
- wheel=0.34.2=py_1
@@ -204,6 +210,5 @@ dependencies:
- xorg-renderproto=0.11.1=h14c3975_1002
- xorg-xextproto=7.3.0=h14c3975_1002
- xorg-xproto=7.0.31=h14c3975_1007
- - xz=5.2.4=h14c3975_1001
+ - xz=5.2.5=h516909a_1
- zlib=1.2.11=h516909a_1006
-
diff --git a/resources/configurationFiles/analysisRNAseq.xml b/resources/configurationFiles/analysisRNAseq.xml
index 8d38bfb..e81c460 100755
--- a/resources/configurationFiles/analysisRNAseq.xml
+++ b/resources/configurationFiles/analysisRNAseq.xml
@@ -63,7 +63,7 @@
-
+
From 29e8684865278b691302810eb9796c7b38e00071 Mon Sep 17 00:00:00 2001
From: "Philip R. Kensche"
Date: Fri, 25 Feb 2022 10:59:14 +0100
Subject: [PATCH 05/13] File documentation
---
README.md | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 59 insertions(+), 4 deletions(-)
diff --git a/README.md b/README.md
index 6f5455f..8da5944 100755
--- a/README.md
+++ b/README.md
@@ -7,10 +7,65 @@ This workflow does the primary data processing for RNAseq data: alignment, QC, r
### Description
-The following is kind of a template protocol for a methods section. You will probably need to adapt it to your specific settings.
+
+#### Output
+
+| File | Description |
+|----------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `*_merged.mdup.bam` & `.bai` | Standard alignment. |
+| `*_chimeric_merged.mdup.bam` & `.bai` | Chimeric alignments. |
+| `_chimeric_merged.junction` | Correspond to the file mentioned in section “5.4 Chimeric alignments in Chimeric.out.junction” in the [STAR manual](https://github.com/alexdobin/STAR/blob/2.5.3a/doc/STARmanual.pdf). The columns and their content is described in the STAR manual. |
+| `_merged.mdup.bam.flagstat` | flag statistics of the “standard” bam file |
+| `_star_logs_and_files` | |
+| `featureCounts/*.fpkm_tpm.featureCounts.tsv` | Gene-based expression values derived from `featureCounts`. See below. |
+| `featureCounts_dexseq/*.fpkm_tpm.featureCounts.dexseq.tsv` | Exon-based expression values derived from `featureCounts`. See below. |
+| `fusions_arriba/` | Output of the [Arriba](https://github.com/suhrig/arriba/) tool |
+
+
+#### Extended Feature Counts Table
+
+The following concerns the tables in the `featureCounts/*.fpkm_tpm.featureCounts.tsv` (gene-based) and `featureCounts_dexseq/*.fpkm_tpm.featureCounts.dexseq.tsv` (exon-based) files. The [FPKM and TPM values](https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/) are calculated in [featureCounts_2_FpkmTpm](https://github.com/DKFZ-ODCF/RNAseqWorkflow/blob/master/resources/analysisTools/rnaseqworkflow/featureCounts_2_FpkmTpm) (gene-based) and [featureCountsDexseq_2_FpkmTpm](https://github.com/DKFZ-ODCF/RNAseqWorkflow/blob/master/resources/analysisTools/rnaseqworkflow/featureCountsDexseq_2_FpkmTpm) (exon-based) from the `.s0` (unstranded), `.s1` (forward/sense), and `.s2` (reverse/antisense) output files of featureCounts.
+
+* RPKM: Reads per kilobase gene model, per million read
+
+* FPKM: Like RPKM, but for fragments with paired-end data (a pair of aligning reads is counted as one fragment)
+
+* TPM: Like RPKM, but the length-weighted sum of read counts is used in the denominator
+
+
+
+| Header | Description |
+|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| #chrom | Number of the chromosome |
+| chromStart | Start position of the gene |
+| chromEnd | End position of the gene |
+| gene_id | Ensembl gene ID. For the exon-based featureCounts file this column states all Ensembl gene IDs for which the respective exon is part of (separated by “+”).
+|
+| score | - |
+| strand | Strand of the gene |
+| name | Gene name |
+| exonic_length | Length of the coding region of the gene |
+| num_reads | The number of reads falling into the gene region, in a non-strand specific counting mode |
+| num_reads_fw | The number(s) of reads falling into the gene region, in a strand specific counting modes. Applying the featureCounts option for standedness either 1/forward (_fw) or 2/reverse (_rv). Which option is the appropriate one for strandedly sequenced data is dependent on the library preparation step. For example, the Illumina TruSeq stranded RNA kit usually produces data that fits the option 2/reverse (_rv). |
+| num_reads_rv | |
+| FPKM_no_mt_rrna_trna_chrxy | Unstranded FPKM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. (For example for FPKM these counts are subtracted from the sum of all counts). |
+| FPKM_no_mt_rrna_trna_chrxy_fw | FPKM like before but just forward/sense strand. |
+| FPKM_no_mt_rrna_trna_chrxy_rv | FPKM like before but just reverse/antisense stand |
+| TPM_no_mt_rrna_trna_chrxy | TPM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. |
+| TPM_no_mt_rrna_trna_chrxy_fw | TPM like before but just forward/sense strand |
+| TPM_no_mt_rrna_trna_chrxy_rv | TPM like before but just reverse/antisense strand |
+| FPKM_standard | FPKM calculated in the standard way, that is counting all reads in the normalization irrespective of whether they come from expected high-abundance transcripts (e.g. rRNA, etc.), possible haploid regions (X, Y), or variable copy number regions (mtDNA). |
+| FPKM_standard_fw | Standard FPKM, forward/sense strand |
+| FPKM_standard_rv | Standard FKPM, reverse/antisense strand |
+| TPM_standard | TPM calculated in the standard way |
+| TPM_standard_fw | Standard TPM, forward/sense strand |
+| TPM_standard_rv | Standard TPM, reverse/antisense strand |
+
#### Example Protocol
+The following is kind of a template protocol for a methods section. You will probably need to adapt it to your specific settings.
+
The RNAseq data were analysed with the DKFZ/ODCF RNAseq workflow (https://github.com/DKFZ-ODCF/RNAseqWorkflow, **version**; https://github.com/DKFZ-ODCF/AlignmentAndQCWorkflows, **version**; https://github.com/TheRoddyWMS/Roddy-Default-Plugin, **version**; https://github.com/TheRoddyWMS/Roddy-Base-Plugin, **version**; https://github.com/TheRoddyWMS/Roddy, **version**). The workflow performs the following analysis steps.
NOTE: You should also list the other plugins in the dependency chain, and the Roddy version you used to ensure reproducibility. Use the short commit hash if the version you used was not tagged.
@@ -101,14 +156,14 @@ Note that the workflow directory can be suffixed by the version tag to allow for
The following is a list of citations for software used by the RNAseq workflow. Note that dependent on your configuration some tools may be unused. The actual versions used may deviate from those given in this list.
* Python 2.7.9
-* [star 2.5.3a](https://www.ncbi.nlm.nih.gov/pubmed/23104886)
+* [star 2.7.10a](https://www.ncbi.nlm.nih.gov/pubmed/23104886)
* [samtools 1.6](https://www.ncbi.nlm.nih.gov/pubmed/19505943)
-* [arriba 1.2.0](https://github.com/suhrig/arriba/)
+* [arriba 2.2.1](https://github.com/suhrig/arriba/)
* [subread 1.6.5](http://subread.sourceforge.net/) providing [featurecounts](https://www.ncbi.nlm.nih.gov/pubmed/24227677)
* [rnaseqc 1.1.8](https://www.ncbi.nlm.nih.gov/pubmed/22539670)
* [sambamba 0.6.5](https://www.ncbi.nlm.nih.gov/pubmed/25697820)
* [qualimap 2.2.1](http://qualimap.bioinfo.cipf.es/)
-* [kallisto 0.43.0](https://pachterlab.github.io/kallisto/about)
+* [kallisto 0.46.0](https://pachterlab.github.io/kallisto/about)
* [Jemultiplexer 1.0.6](https://gbcs.embl.de/portal/tiki-index.php?page=Jemultiplexer)
##### Conda Environment
From f8717d7f07d7818ea28a3220e5848d0aaa8a6d39 Mon Sep 17 00:00:00 2001
From: "Philip R. Kensche"
Date: Tue, 5 Apr 2022 15:29:51 +0200
Subject: [PATCH 06/13] Fixes in README.md.
---
README.md | 71 +++++++++++++++++++++++++++----------------------------
1 file changed, 35 insertions(+), 36 deletions(-)
diff --git a/README.md b/README.md
index 8da5944..fac213a 100755
--- a/README.md
+++ b/README.md
@@ -10,16 +10,16 @@ This workflow does the primary data processing for RNAseq data: alignment, QC, r
#### Output
-| File | Description |
-|----------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `*_merged.mdup.bam` & `.bai` | Standard alignment. |
-| `*_chimeric_merged.mdup.bam` & `.bai` | Chimeric alignments. |
-| `_chimeric_merged.junction` | Correspond to the file mentioned in section “5.4 Chimeric alignments in Chimeric.out.junction” in the [STAR manual](https://github.com/alexdobin/STAR/blob/2.5.3a/doc/STARmanual.pdf). The columns and their content is described in the STAR manual. |
-| `_merged.mdup.bam.flagstat` | flag statistics of the “standard” bam file |
-| `_star_logs_and_files` | |
-| `featureCounts/*.fpkm_tpm.featureCounts.tsv` | Gene-based expression values derived from `featureCounts`. See below. |
-| `featureCounts_dexseq/*.fpkm_tpm.featureCounts.dexseq.tsv` | Exon-based expression values derived from `featureCounts`. See below. |
-| `fusions_arriba/` | Output of the [Arriba](https://github.com/suhrig/arriba/) tool |
+| File | Description |
+|----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `*_merged.mdup.bam` & `.bai` | Standard alignment. |
+| `*_chimeric_merged.mdup.bam` & `.bai` | Chimeric alignments. |
+| `_chimeric_merged.junction` | Correspond to the file mentioned in section “6.4 Chimeric alignments in Chimeric.out.junction” in the [STAR manual](https://github.com/alexdobin/STAR/blob/2.7.10a/doc/STARmanual.pdf). The columns and their content is described in the STAR manual. |
+| `_merged.mdup.bam.flagstat` | flag statistics of the “standard” bam file |
+| `_star_logs_and_files` | STAR logs |
+| `featureCounts/*.fpkm_tpm.featureCounts.tsv` | Gene-based expression values derived from `featureCounts`. See below. |
+| `featureCounts_dexseq/*.fpkm_tpm.featureCounts.dexseq.tsv` | Exon-based expression values derived from `featureCounts`. See below. |
+| `fusions_arriba/` | Output of the [Arriba](https://github.com/suhrig/arriba/) tool |
#### Extended Feature Counts Table
@@ -34,32 +34,31 @@ The following concerns the tables in the `featureCounts/*.fpkm_tpm.featureCounts
-| Header | Description |
-|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| #chrom | Number of the chromosome |
-| chromStart | Start position of the gene |
-| chromEnd | End position of the gene |
-| gene_id | Ensembl gene ID. For the exon-based featureCounts file this column states all Ensembl gene IDs for which the respective exon is part of (separated by “+”).
-|
-| score | - |
-| strand | Strand of the gene |
-| name | Gene name |
-| exonic_length | Length of the coding region of the gene |
-| num_reads | The number of reads falling into the gene region, in a non-strand specific counting mode |
-| num_reads_fw | The number(s) of reads falling into the gene region, in a strand specific counting modes. Applying the featureCounts option for standedness either 1/forward (_fw) or 2/reverse (_rv). Which option is the appropriate one for strandedly sequenced data is dependent on the library preparation step. For example, the Illumina TruSeq stranded RNA kit usually produces data that fits the option 2/reverse (_rv). |
-| num_reads_rv | |
-| FPKM_no_mt_rrna_trna_chrxy | Unstranded FPKM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. (For example for FPKM these counts are subtracted from the sum of all counts). |
-| FPKM_no_mt_rrna_trna_chrxy_fw | FPKM like before but just forward/sense strand. |
-| FPKM_no_mt_rrna_trna_chrxy_rv | FPKM like before but just reverse/antisense stand |
-| TPM_no_mt_rrna_trna_chrxy | TPM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. |
-| TPM_no_mt_rrna_trna_chrxy_fw | TPM like before but just forward/sense strand |
-| TPM_no_mt_rrna_trna_chrxy_rv | TPM like before but just reverse/antisense strand |
-| FPKM_standard | FPKM calculated in the standard way, that is counting all reads in the normalization irrespective of whether they come from expected high-abundance transcripts (e.g. rRNA, etc.), possible haploid regions (X, Y), or variable copy number regions (mtDNA). |
-| FPKM_standard_fw | Standard FPKM, forward/sense strand |
-| FPKM_standard_rv | Standard FKPM, reverse/antisense strand |
-| TPM_standard | TPM calculated in the standard way |
-| TPM_standard_fw | Standard TPM, forward/sense strand |
-| TPM_standard_rv | Standard TPM, reverse/antisense strand |
+| Header | Description |
+|------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| #chrom | Number of the chromosome |
+| chromStart | Start position of the gene |
+| chromEnd | End position of the gene |
+| gene_id | Ensembl gene ID. For the exon-based featureCounts file this column states all Ensembl gene IDs for which the respective exon is part of (separated by “+”). |
+| score | - |
+| strand | Strand of the gene |
+| name | Gene name |
+| exonic_length | Length of the coding region of the gene |
+| num_reads | The number of reads falling into the gene region, in a non-strand specific counting mode |
+| num_reads_fw | The number(s) of reads falling into the gene region, in a strand specific counting modes. Applying the featureCounts option for standedness either 1/forward (_fw) or 2/reverse (_rv). Which option is the appropriate one for strandedly sequenced data is dependent on the library preparation step. For example, the Illumina TruSeq stranded RNA kit usually produces data that fits the option 2/reverse (_rv). |
+| num_reads_rv | |
+| FPKM_no_mt_rrna_trna_chrxy | Unstranded FPKM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. (For example for FPKM these counts are subtracted from the sum of all counts). |
+| FPKM_no_mt_rrna_trna_chrxy_fw | FPKM like before but just forward/sense strand. |
+| FPKM_no_mt_rrna_trna_chrxy_rv | FPKM like before but just reverse/antisense stand |
+| TPM_no_mt_rrna_trna_chrxy | TPM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. |
+| TPM_no_mt_rrna_trna_chrxy_fw | TPM like before but just forward/sense strand |
+| TPM_no_mt_rrna_trna_chrxy_rv | TPM like before but just reverse/antisense strand |
+| FPKM_standard | FPKM calculated in the standard way, that is counting all reads in the normalization irrespective of whether they come from expected high-abundance transcripts (e.g. rRNA, etc.), possible haploid regions (X, Y), or variable copy number regions (mtDNA). |
+| FPKM_standard_fw | Standard FPKM, forward/sense strand |
+| FPKM_standard_rv | Standard FKPM, reverse/antisense strand |
+| TPM_standard | TPM calculated in the standard way |
+| TPM_standard_fw | Standard TPM, forward/sense strand |
+| TPM_standard_rv | Standard TPM, reverse/antisense strand |
#### Example Protocol
From b0cb8db7bdd8895da86512e5ba801a1943593dcf Mon Sep 17 00:00:00 2001
From: "Philip R. Kensche"
Date: Wed, 14 Sep 2022 14:04:18 +0200
Subject: [PATCH 07/13] Typo fixes README
---
README.md | 58 +++++++++----------
.../tbi-cluster-composite-module.sh | 14 -----
2 files changed, 29 insertions(+), 43 deletions(-)
delete mode 100755 resources/analysisTools/rnaseqworkflow/environments/tbi-cluster-composite-module.sh
diff --git a/README.md b/README.md
index fac213a..08541ae 100755
--- a/README.md
+++ b/README.md
@@ -34,31 +34,31 @@ The following concerns the tables in the `featureCounts/*.fpkm_tpm.featureCounts
-| Header | Description |
-|------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| #chrom | Number of the chromosome |
-| chromStart | Start position of the gene |
-| chromEnd | End position of the gene |
-| gene_id | Ensembl gene ID. For the exon-based featureCounts file this column states all Ensembl gene IDs for which the respective exon is part of (separated by “+”). |
-| score | - |
-| strand | Strand of the gene |
-| name | Gene name |
-| exonic_length | Length of the coding region of the gene |
-| num_reads | The number of reads falling into the gene region, in a non-strand specific counting mode |
-| num_reads_fw | The number(s) of reads falling into the gene region, in a strand specific counting modes. Applying the featureCounts option for standedness either 1/forward (_fw) or 2/reverse (_rv). Which option is the appropriate one for strandedly sequenced data is dependent on the library preparation step. For example, the Illumina TruSeq stranded RNA kit usually produces data that fits the option 2/reverse (_rv). |
-| num_reads_rv | |
-| FPKM_no_mt_rrna_trna_chrxy | Unstranded FPKM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. (For example for FPKM these counts are subtracted from the sum of all counts). |
-| FPKM_no_mt_rrna_trna_chrxy_fw | FPKM like before but just forward/sense strand. |
-| FPKM_no_mt_rrna_trna_chrxy_rv | FPKM like before but just reverse/antisense stand |
-| TPM_no_mt_rrna_trna_chrxy | TPM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. |
-| TPM_no_mt_rrna_trna_chrxy_fw | TPM like before but just forward/sense strand |
-| TPM_no_mt_rrna_trna_chrxy_rv | TPM like before but just reverse/antisense strand |
-| FPKM_standard | FPKM calculated in the standard way, that is counting all reads in the normalization irrespective of whether they come from expected high-abundance transcripts (e.g. rRNA, etc.), possible haploid regions (X, Y), or variable copy number regions (mtDNA). |
-| FPKM_standard_fw | Standard FPKM, forward/sense strand |
-| FPKM_standard_rv | Standard FKPM, reverse/antisense strand |
-| TPM_standard | TPM calculated in the standard way |
-| TPM_standard_fw | Standard TPM, forward/sense strand |
-| TPM_standard_rv | Standard TPM, reverse/antisense strand |
+| Header | Description |
+|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| #chrom | Number of the chromosome |
+| chromStart | Start position of the gene |
+| chromEnd | End position of the gene |
+| gene_id | Ensembl gene ID. For the exon-based featureCounts file this column states all Ensembl gene IDs for which the respective exon is part of (separated by “+”). |
+| score | - |
+| strand | Strand of the gene |
+| name | Gene name |
+| exonic_length | Length of the coding region of the gene |
+| num_reads | The number of reads falling into the gene region, in a non-strand specific counting mode |
+| num_reads_fw | The number(s) of reads falling into the gene region, in a strand specific counting modes. Applying the featureCounts option for standedness either 1/forward (_fw) or 2/reverse (_rv). Which option is the appropriate one for strandedly sequenced data is dependent on the library preparation step. For example, the Illumina TruSeq stranded RNA kit usually produces data that fits the option 2/reverse (_rv). |
+| num_reads_rv | |
+| FPKM_no_mt_rrna_trna_chrxy | Unstranded FPKM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. (For example for FPKM these counts are subtracted from the sum of all counts). |
+| FPKM_no_mt_rrna_trna_chrxy_fw | FPKM like before but just forward/sense strand. |
+| FPKM_no_mt_rrna_trna_chrxy_rv | FPKM like before but just reverse/antisense strand |
+| TPM_no_mt_rrna_trna_chrxy | TPM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. |
+| TPM_no_mt_rrna_trna_chrxy_fw | TPM like before but just forward/sense strand |
+| TPM_no_mt_rrna_trna_chrxy_rv | TPM like before but just reverse/antisense strand |
+| FPKM_standard | FPKM calculated in the standard way, that is counting all reads in the normalization irrespective of whether they come from expected high-abundance transcripts (e.g. rRNA, etc.), possible haploid regions (X, Y), or variable copy number regions (mtDNA). |
+| FPKM_standard_fw | Standard FPKM, forward/sense strand |
+| FPKM_standard_rv | Standard FKPM, reverse/antisense strand |
+| TPM_standard | TPM calculated in the standard way |
+| TPM_standard_fw | Standard TPM, forward/sense strand |
+| TPM_standard_rv | Standard TPM, reverse/antisense strand |
#### Example Protocol
@@ -157,7 +157,7 @@ The following is a list of citations for software used by the RNAseq workflow. N
* Python 2.7.9
* [star 2.7.10a](https://www.ncbi.nlm.nih.gov/pubmed/23104886)
* [samtools 1.6](https://www.ncbi.nlm.nih.gov/pubmed/19505943)
-* [arriba 2.2.1](https://github.com/suhrig/arriba/)
+* [arriba 2.3.0](https://github.com/suhrig/arriba/)
* [subread 1.6.5](http://subread.sourceforge.net/) providing [featurecounts](https://www.ncbi.nlm.nih.gov/pubmed/24227677)
* [rnaseqc 1.1.8](https://www.ncbi.nlm.nih.gov/pubmed/22539670)
* [sambamba 0.6.5](https://www.ncbi.nlm.nih.gov/pubmed/25697820)
@@ -262,17 +262,17 @@ The following is merely on overview over the most important parameters.
* Change the basic stuff like `inputbase` and `outputbase` directories
set `RUN_FEATURE_COUNTS_DEXSEQ`, `RUN_RNASEQC`, `RUN_KALLISTO`, `RUN_ARRIBA`, `runFingerprinting` to `FALSE`.
* Check if your sample type is in `possibleTumorSampleNamePrefixes`
-* Set your star index (`GENOME_STAR_INDEX`) and gene models (`GENE_MODELS`) parameters to what your created for (1) and (2).
+* Set your star index (`GENOME_STAR_INDEX`) and gene models (`GENE_MODELS`) parameters to what you created for (1) and (2).
* Set `GENOME_GATK_INDEX` to point to the new FASTA file (1), for which you have created a dict file. The dict file should be in the same directory as the FASTA.
## Change Log
* 4.0.0
- - major: Arriba 2.2.1
- major: `resources/configurationFiles/analysisRNAseq.xml` is now just a default configuration with many configuration options left blank. For your `` tags in your project XMLs the analysis names can be changed to use the following defaults:
* GRCh37, mm10: [RNAseqAnalysis](resources/configurationFiles/analysisRNAseq.xml)
* GRCh38-specific: [RNAseqAnalysisGRCh38](resources/configurationFiles/analysisRNAseqGRCh38.xml)
- major: Update default software versions for all assemblies
+ * Arriba 2.3.0
* STAR 2.7.10a (from 2.5.3a for hg37/mm10 and 2.7.6a for hg38)
* Kallisto 0.46.0 (from 0.42.0)
* Samtools 1.9 (from 1.6)
@@ -338,7 +338,7 @@ set `RUN_FEATURE_COUNTS_DEXSEQ`, `RUN_RNASEQC`, `RUN_KALLISTO`, `RUN_ARRIBA`, `r
failed -- for at least Bash < 4.2 -- to export both adapters correctly to the called job script.
* 1.2.22-8 [26th Nov 2019]
- - Removed the single-quotes around `${ADAPTER_SEQ}` in `--clip3pAdapterSeq` again. STAR uses non-standard way of parsing parameters and manages to get all adapters. With quotes the adapters get also quoted and it is unclear what STAR does with them, except that it does not complain about a configuration error and that it also does not complain with even more severe misconfigurations, such as other non-DNA sequences as adapter sequences. The manual also does not use quoted parameter arguments, so no-quotes is conform to this STAR-specific CLI parameter handling pattern.
+ - Removed the single-quotes around `${ADAPTER_SEQ}` in `--clip3pAdapterSeq` again. STAR uses non-standard way of parsing parameters and manages to get all adapters. With quotes the adapters get also quoted, and it is unclear what STAR does with them, except that it does not complain about a configuration error and that it also does not complain with even more severe misconfigurations, such as other non-DNA sequences as adapter sequences. The manual also does not use quoted parameter arguments, so no-quotes is conform to this STAR-specific CLI parameter handling pattern.
* 1.2.22-7 [7th Nov 2018]
- Fixed a number of variable references that were without braces (now everywhere `${...}` is used to allow Roddy to resolve the references and order the parameter file correctly)
diff --git a/resources/analysisTools/rnaseqworkflow/environments/tbi-cluster-composite-module.sh b/resources/analysisTools/rnaseqworkflow/environments/tbi-cluster-composite-module.sh
deleted file mode 100755
index 424484e..0000000
--- a/resources/analysisTools/rnaseqworkflow/environments/tbi-cluster-composite-module.sh
+++ /dev/null
@@ -1,14 +0,0 @@
-#!/usr/bin/env bash
-
-if [ "$LOAD_MODULE" == true ]; then
- module load "${MODULE_ENV:?MODULE_ENV variable is undefined}"
- export STAR_BINARY=STAR
- export FEATURECOUNTS_BINARY=featureCounts
- export SAMBAMBA_BINARY=sambamba
- export SAMTOOLS_BINARY=samtools # samtools_bin defaults to v0.0.19 despite the RNAseq.XML!!!!!!!!!!!!!!!!!!!!!!!!!!
- export RNASEQC_BINARY=rnaseqc
- export KALLISTO_BINARY=kallisto
- export QUALIMAP_BINARY=qualimap
- export ARRIBA_BINARY=arriba
- export ARRIBA_DRAW_FUSIONS=draw_fusions.R
-fi
From 75ae3843ea6cddea5299cc6255e5b07c240faca7 Mon Sep 17 00:00:00 2001
From: paramasi
Date: Tue, 27 Jun 2023 15:27:39 +0200
Subject: [PATCH 08/13] Removing tbiClusterCompositeModule from xml
---
resources/configurationFiles/analysisRNAseq.xml | 1 -
1 file changed, 1 deletion(-)
diff --git a/resources/configurationFiles/analysisRNAseq.xml b/resources/configurationFiles/analysisRNAseq.xml
index e81c460..9e26ad4 100755
--- a/resources/configurationFiles/analysisRNAseq.xml
+++ b/resources/configurationFiles/analysisRNAseq.xml
@@ -182,7 +182,6 @@
-
From ed5112d58a4cc23d0496c729846d1ed15b3984b7 Mon Sep 17 00:00:00 2001
From: paramasi
Date: Tue, 27 Jun 2023 15:28:52 +0200
Subject: [PATCH 09/13] Two clip3pAdapterMMp values are needed for the pairs
---
resources/configurationFiles/analysisRNAseq.xml | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/resources/configurationFiles/analysisRNAseq.xml b/resources/configurationFiles/analysisRNAseq.xml
index 9e26ad4..b35edc9 100755
--- a/resources/configurationFiles/analysisRNAseq.xml
+++ b/resources/configurationFiles/analysisRNAseq.xml
@@ -147,7 +147,7 @@
-
Date: Tue, 27 Jun 2023 15:36:24 +0200
Subject: [PATCH 10/13] Refactoring GRCh38 XML to work with refmake reference
and legacy annotation
---
.../analysisRNAseqGRCh38.xml | 61 ++++++++++---------
1 file changed, 32 insertions(+), 29 deletions(-)
diff --git a/resources/configurationFiles/analysisRNAseqGRCh38.xml b/resources/configurationFiles/analysisRNAseqGRCh38.xml
index 614fc0b..4d58044 100644
--- a/resources/configurationFiles/analysisRNAseqGRCh38.xml
+++ b/resources/configurationFiles/analysisRNAseqGRCh38.xml
@@ -8,44 +8,47 @@
cleanupScript="cleanupScript">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
-
-
+
+
+
+
-
-
-
-
+
-
+
+
-
-
-
+
+
+
From bf29ee995fb5f144f35714f8b8516bdc077e4ebb Mon Sep 17 00:00:00 2001
From: paramasi
Date: Fri, 30 Jun 2023 14:37:46 +0200
Subject: [PATCH 11/13] Update featureCounts column names
---
README.md | 58 ++++++++++---------
.../featureCountsDexseq_2_FpkmTpm | 18 +++---
.../rnaseqworkflow/featureCounts_2_FpkmTpm | 18 +++---
3 files changed, 50 insertions(+), 44 deletions(-)
diff --git a/README.md b/README.md
index 08541ae..fe20606 100755
--- a/README.md
+++ b/README.md
@@ -34,31 +34,31 @@ The following concerns the tables in the `featureCounts/*.fpkm_tpm.featureCounts
-| Header | Description |
-|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| #chrom | Number of the chromosome |
-| chromStart | Start position of the gene |
-| chromEnd | End position of the gene |
-| gene_id | Ensembl gene ID. For the exon-based featureCounts file this column states all Ensembl gene IDs for which the respective exon is part of (separated by “+”). |
-| score | - |
-| strand | Strand of the gene |
-| name | Gene name |
-| exonic_length | Length of the coding region of the gene |
-| num_reads | The number of reads falling into the gene region, in a non-strand specific counting mode |
-| num_reads_fw | The number(s) of reads falling into the gene region, in a strand specific counting modes. Applying the featureCounts option for standedness either 1/forward (_fw) or 2/reverse (_rv). Which option is the appropriate one for strandedly sequenced data is dependent on the library preparation step. For example, the Illumina TruSeq stranded RNA kit usually produces data that fits the option 2/reverse (_rv). |
-| num_reads_rv | |
-| FPKM_no_mt_rrna_trna_chrxy | Unstranded FPKM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. (For example for FPKM these counts are subtracted from the sum of all counts). |
-| FPKM_no_mt_rrna_trna_chrxy_fw | FPKM like before but just forward/sense strand. |
-| FPKM_no_mt_rrna_trna_chrxy_rv | FPKM like before but just reverse/antisense strand |
-| TPM_no_mt_rrna_trna_chrxy | TPM for which the count values for rRNA, tRNA, chrX, chrY and Mt were omitted during the normalization process. |
-| TPM_no_mt_rrna_trna_chrxy_fw | TPM like before but just forward/sense strand |
-| TPM_no_mt_rrna_trna_chrxy_rv | TPM like before but just reverse/antisense strand |
-| FPKM_standard | FPKM calculated in the standard way, that is counting all reads in the normalization irrespective of whether they come from expected high-abundance transcripts (e.g. rRNA, etc.), possible haploid regions (X, Y), or variable copy number regions (mtDNA). |
-| FPKM_standard_fw | Standard FPKM, forward/sense strand |
-| FPKM_standard_rv | Standard FKPM, reverse/antisense strand |
-| TPM_standard | TPM calculated in the standard way |
-| TPM_standard_fw | Standard TPM, forward/sense strand |
-| TPM_standard_rv | Standard TPM, reverse/antisense strand |
+| Header | Description |
+|------------|-------------------|
+| #chrom | Chromosome contigs |
+| chromStart | Start position of the gene |
+| chromEnd | End position of the gene |
+| gene_id | Ensembl gene ID. For the exon-based featureCounts file this column states all Ensembl gene IDs for which the respective exon is part of (separated by "+"). |
+| score | - |
+| strand | Strand of the gene |
+| name | Gene name |
+| exonic_length | Length of the coding region of the gene |
+| num_reads_unstranded | The number of reads falling into the gene region, in a non-strand specific counting mode (unstranded - `s 0` option in featureCounts).|
+| num_reads_stranded | The number(s) of reads falling into the gene region, in a strand specific counting modes. Applying the featureCounts option for standedness of `s 1`. When using featureCounts, you can choose the `1/_stranded` or `2/_reverse_stranded` option for standedness based on the library preparation step of your stranded sequenced data. For example, the Illumina TruSeq stranded RNA kit usually produces data that fits the option `2/_reverse_stranded`. |
+| num_reads_reverse_stranded | The number of reads falling into the gene region. The counts are based on the `s 2` option in featureCounts. To obtain the counts for the Illumina TruSeq stranded RNA kit library protocol, refer to this column as mentioned above. And also use `_reverse_stranded` columns for FPKM and TPM values from this protocol. |
+| FPKM_customLibSize_unstranded | Unstranded FPKM calculation using the counts from the `s 0` option in feature counts. Counts for genes from rRNA, tRNA, chrX, chrY, and Mt were excluded during library size estimation.|
+| FPKM_customLibSize_stranded | Stranded FPKM calculation based on the counts from `-s 1` option in featureCount with custom library size estimation as described above.|
+| FPKM_customLibSize_reverse_stranded | Reverse stranded FPKM calculation based on the counts from `-s 2` option in featureCount with custom library size estimation as described above.|
+| TPM_customLibSize_unstranded | Unstranded TPM calculation using the counts from the `s 0` option in feature counts. Counts for genes from rRNA, tRNA, chrX, chrY, and Mt were excluded during library size estimation.|
+| TPM_customLibSize_stranded | Stranded TPM calculation was performed using the counts from the `s 1` option in feature counts with custom library size estimation as above.|
+| TPM_customLibSize_reverse_stranded | Reverse stranded TPM calculation was performed using the counts from the `s 2` option in feature counts with custom library size estimation as above.|
+| FPKM_unstranded | Unstranded FPKM calculation based on the counts from `-s 0` option in featureCount|
+| FPKM_stranded | Stranded FPKM calculation based on the counts from `-s 1` option in featureCount|
+| FPKM_reverse_stranded | Reverse stranded FKPM calculation based on the counts from `-s 2` option in featureCount|
+| TPM_unstranded |Unstranded TPM calculation based on the counts from `-s 0` option in featureCount |
+| TPM_stranded | Stranded TPM calculation based on the counts from `-s 1` option in featureCount|
+| TPM_reverse_stranded | Reverse stranded TPM calculation based on the counts from `-s 2` option in featureCount|
#### Example Protocol
@@ -278,10 +278,16 @@ set `RUN_FEATURE_COUNTS_DEXSEQ`, `RUN_RNASEQC`, `RUN_KALLISTO`, `RUN_ARRIBA`, `r
* Samtools 1.9 (from 1.6)
* HTSlib 1.9 (from 1.6)
* subread 1.6.5 (from 1.5.1): The previous version produces occasional segmentation faults (related to extreme optimization option `-O6`) but otherwise produces the same results. Both versions produced exactly the same `featureCounts` in multiple tests.
- - major: Gene model gencode version 31 was used to create the STAR and Kallisto indexes
+ - major: Gene model gencode version 39/43 was used to create the STAR and Kallisto indexes
- major: Update default paths to new ngs_share at ODCF.
+ - major: GRCh38 STAR and Kallisto indexes are based on `refmake` workflow.
- major: Updated the Conda `environment.yaml`.
* Note for users at the DKFZ cluster: The software versions used in Conda environment do not exactly match the versions in the default configuration used for the cluster's module system.
+ - major: Column rename in featureCounts table:
+ - "FPKM_no_mt_rrna_trna_chrxy{,_fw,_rv}" -> "FPKM_customLibSize{_unstranded, _stranded, _reverse_stranded}"
+ - "TPM_no_mt_rrna_trna_chrxy{,_fw,_rv}" -> "TPM_customLibSize{_unstranded, _stranded, _reverse_stranded}"
+ - "FPKM_standard{,_fw,_rv}" -> "FPKM{_unstranded, _stranded, _reverse_stranded}"
+ - "TPM_standard{,_fw,_rv}" -> "TPM{_unstranded, _stranded, _reverse_stranded}"
* 3.0.0 [19th Oct 2021]
- major: Column rename in feature counts table:
diff --git a/resources/analysisTools/rnaseqworkflow/featureCountsDexseq_2_FpkmTpm b/resources/analysisTools/rnaseqworkflow/featureCountsDexseq_2_FpkmTpm
index b4aa05b..27e992a 100755
--- a/resources/analysisTools/rnaseqworkflow/featureCountsDexseq_2_FpkmTpm
+++ b/resources/analysisTools/rnaseqworkflow/featureCountsDexseq_2_FpkmTpm
@@ -238,11 +238,11 @@ open (my $fh, "tail -n $counter $fc0_file |") or die;
# print header
print "\#chrom\tchromStart\tchromEnd\tgene_id\tscore\tstrand\tname\texonic_length\t";
-print "num_reads\tnum_reads_fw\tnum_reads_rv\t";
-print "FPKM_no_mt_rrna_trna_chrxy\tFPKM_no_mt_rrna_trna_chrxy_fw\tFPKM_no_mt_rrna_trna_chrxy_rv\t";
-print "TPM_no_mt_rrna_trna_chrxy\tTPM_no_mt_rrna_trna_chrxy_fw\tTPM_no_mt_rrna_trna_chrxy_rv\t";
-print "FPKM_standard\tFPKM_standard_fw\tFPKM_standard_rv\t";
-print "TPM_standard\tTPM_standard_fw\tTPM_standard_rv\n";
+print "num_reads_unstranded\tnum_reads_stranded\tnum_reads_reverse_unstranded\t";
+print "FPKM_customLibSize_unstranded\tFPKM_customLibSize_stranded\tFPKM_customLibSize_reverse_stranded\t";
+print "TPM_customLibSize_unstranded\tTPM_customLibSize_stranded\tTPM_customLibSize_reverse_stranded\t";
+print "FPKM_unstranded\tFPKM_stranded\tFPKM_reverse_stranded\t";
+print "TPM_unstranded\tTPM_stranded\tTPM_reverse_stranded\n";
while (!eof($fh)){
@@ -278,22 +278,22 @@ while (!eof($fh)){
print "$read_counts_all{1}{$hash_id}\t";
print "$read_counts_all{2}{$hash_id}\t";
- # print RPKMS no_mt_rrna_trna_chrxy
+ # print FPKM with custom library size estimation
print "".($read_counts_all_per_length{0}{$hash_id})*($BILLION/$ignore_read_counts{0})."\t";
print "".($read_counts_all_per_length{1}{$hash_id})*($BILLION/$ignore_read_counts{1})."\t";
print "".($read_counts_all_per_length{2}{$hash_id})*($BILLION/$ignore_read_counts{2})."\t";
- # print TPMS no_mt_rrna_trna_chrxy
+ # print TPM with custom library size estimation
print "".($read_counts_all_per_length{0}{$hash_id})*($BILLION/$ignore_read_counts{0})/($ignore_fpkm_counts{0}/$MILLION)."\t";
print "".($read_counts_all_per_length{1}{$hash_id})*($BILLION/$ignore_read_counts{1})/($ignore_fpkm_counts{1}/$MILLION)."\t";
print "".($read_counts_all_per_length{2}{$hash_id})*($BILLION/$ignore_read_counts{2})/($ignore_fpkm_counts{2}/$MILLION)."\t";
- # print RPKMS standard
+ # print FPKM
print "".($read_counts_all_per_length{0}{$hash_id})*($BILLION/$all_read_counts{0})."\t";
print "".($read_counts_all_per_length{1}{$hash_id})*($BILLION/$all_read_counts{1})."\t";
print "".($read_counts_all_per_length{2}{$hash_id})*($BILLION/$all_read_counts{2})."\t";
- # print TPMS standard
+ # print TPM
print "".($read_counts_all_per_length{0}{$hash_id})*($BILLION/$all_read_counts{0})/($all_fpkm_counts{0}/$MILLION)."\t";
print "".($read_counts_all_per_length{1}{$hash_id})*($BILLION/$all_read_counts{1})/($all_fpkm_counts{1}/$MILLION)."\t";
print "".($read_counts_all_per_length{2}{$hash_id})*($BILLION/$all_read_counts{2})/($all_fpkm_counts{2}/$MILLION)."";
diff --git a/resources/analysisTools/rnaseqworkflow/featureCounts_2_FpkmTpm b/resources/analysisTools/rnaseqworkflow/featureCounts_2_FpkmTpm
index 9947bcc..240cb7a 100755
--- a/resources/analysisTools/rnaseqworkflow/featureCounts_2_FpkmTpm
+++ b/resources/analysisTools/rnaseqworkflow/featureCounts_2_FpkmTpm
@@ -235,11 +235,11 @@ open (my $fh, "cut -f 1 $fc0_file| tail -n $counter |") or die;
# print header
print "\#chrom\tchromStart\tchromEnd\tgene_id\tscore\tstrand\tname\texonic_length\t";
-print "num_reads\tnum_reads_fw\tnum_reads_rv\t";
-print "FPKM_no_mt_rrna_trna_chrxy\tFPKM_no_mt_rrna_trna_chrxy_fw\tFPKM_no_mt_rrna_trna_chrxy_rv\t";
-print "TPM_no_mt_rrna_trna_chrxy\tTPM_no_mt_rrna_trna_chrxy_fw\tTPM_no_mt_rrna_trna_chrxy_rv\t";
-print "FPKM_standard\tFPKM_standard_fw\tFPKM_standard_rv\t";
-print "TPM_standard\tTPM_standard_fw\tTPM_standard_rv\n";
+print "num_reads_unstranded\tnum_reads_stranded\tnum_reads_reverse_unstranded\t";
+print "FPKM_customLibSize_unstranded\tFPKM_customLibSize_stranded\tFPKM_customLibSize_reverse_stranded\t";
+print "TPM_customLibSize_unstranded\tTPM_customLibSize_stranded\tTPM_customLibSize_reverse_stranded\t";
+print "FPKM_unstranded\tFPKM_stranded\tFPKM_reverse_stranded\t";
+print "TPM_unstranded\tTPM_stranded\tTPM_reverse_stranded\n";
while (!eof($fh)){
@@ -256,22 +256,22 @@ while (!eof($fh)){
print "$read_counts_all{1}{$fc0}\t";
print "$read_counts_all{2}{$fc0}\t";
- # print RPKMS no_mt_rrna_trna_chrxy
+ # print FPKM with custom library size estimation
print "".($read_counts_all_per_length{0}{$fc0})*($BILLION/$ignore_read_counts{0})."\t";
print "".($read_counts_all_per_length{1}{$fc0})*($BILLION/$ignore_read_counts{1})."\t";
print "".($read_counts_all_per_length{2}{$fc0})*($BILLION/$ignore_read_counts{2})."\t";
- # print TPMS no_mt_rrna_trna_chrxy
+ # print TPM with custom library size estimation
print "".($read_counts_all_per_length{0}{$fc0})*($BILLION/$ignore_read_counts{0})/($ignore_fpkm_counts{0}/$MILLION)."\t";
print "".($read_counts_all_per_length{1}{$fc0})*($BILLION/$ignore_read_counts{1})/($ignore_fpkm_counts{1}/$MILLION)."\t";
print "".($read_counts_all_per_length{2}{$fc0})*($BILLION/$ignore_read_counts{2})/($ignore_fpkm_counts{2}/$MILLION)."\t";
- # print RPKMS standard
+ # print FPKM
print "".($read_counts_all_per_length{0}{$fc0})*($BILLION/$all_read_counts{0})."\t";
print "".($read_counts_all_per_length{1}{$fc0})*($BILLION/$all_read_counts{1})."\t";
print "".($read_counts_all_per_length{2}{$fc0})*($BILLION/$all_read_counts{2})."\t";
- # print TPMS standard
+ # print TPM
print "".($read_counts_all_per_length{0}{$fc0})*($BILLION/$all_read_counts{0})/($all_fpkm_counts{0}/$MILLION)."\t";
print "".($read_counts_all_per_length{1}{$fc0})*($BILLION/$all_read_counts{1})/($all_fpkm_counts{1}/$MILLION)."\t";
print "".($read_counts_all_per_length{2}{$fc0})*($BILLION/$all_read_counts{2})/($all_fpkm_counts{2}/$MILLION)."";
From d739e457d9ad906c297f9c80db83c2d8e0976b9e Mon Sep 17 00:00:00 2001
From: paramasi
Date: Mon, 3 Jul 2023 10:10:11 +0200
Subject: [PATCH 12/13] Header typo fix
---
.../analysisTools/rnaseqworkflow/featureCountsDexseq_2_FpkmTpm | 2 +-
resources/analysisTools/rnaseqworkflow/featureCounts_2_FpkmTpm | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/resources/analysisTools/rnaseqworkflow/featureCountsDexseq_2_FpkmTpm b/resources/analysisTools/rnaseqworkflow/featureCountsDexseq_2_FpkmTpm
index 27e992a..4fa8800 100755
--- a/resources/analysisTools/rnaseqworkflow/featureCountsDexseq_2_FpkmTpm
+++ b/resources/analysisTools/rnaseqworkflow/featureCountsDexseq_2_FpkmTpm
@@ -238,7 +238,7 @@ open (my $fh, "tail -n $counter $fc0_file |") or die;
# print header
print "\#chrom\tchromStart\tchromEnd\tgene_id\tscore\tstrand\tname\texonic_length\t";
-print "num_reads_unstranded\tnum_reads_stranded\tnum_reads_reverse_unstranded\t";
+print "num_reads_unstranded\tnum_reads_stranded\tnum_reads_reverse_stranded\t";
print "FPKM_customLibSize_unstranded\tFPKM_customLibSize_stranded\tFPKM_customLibSize_reverse_stranded\t";
print "TPM_customLibSize_unstranded\tTPM_customLibSize_stranded\tTPM_customLibSize_reverse_stranded\t";
print "FPKM_unstranded\tFPKM_stranded\tFPKM_reverse_stranded\t";
diff --git a/resources/analysisTools/rnaseqworkflow/featureCounts_2_FpkmTpm b/resources/analysisTools/rnaseqworkflow/featureCounts_2_FpkmTpm
index 240cb7a..4d7123c 100755
--- a/resources/analysisTools/rnaseqworkflow/featureCounts_2_FpkmTpm
+++ b/resources/analysisTools/rnaseqworkflow/featureCounts_2_FpkmTpm
@@ -235,7 +235,7 @@ open (my $fh, "cut -f 1 $fc0_file| tail -n $counter |") or die;
# print header
print "\#chrom\tchromStart\tchromEnd\tgene_id\tscore\tstrand\tname\texonic_length\t";
-print "num_reads_unstranded\tnum_reads_stranded\tnum_reads_reverse_unstranded\t";
+print "num_reads_unstranded\tnum_reads_stranded\tnum_reads_reverse_stranded\t";
print "FPKM_customLibSize_unstranded\tFPKM_customLibSize_stranded\tFPKM_customLibSize_reverse_stranded\t";
print "TPM_customLibSize_unstranded\tTPM_customLibSize_stranded\tTPM_customLibSize_reverse_stranded\t";
print "FPKM_unstranded\tFPKM_stranded\tFPKM_reverse_stranded\t";
From 7f1953f75878dbe7a782bc4b5d9b42aceb370d0f Mon Sep 17 00:00:00 2001
From: paramasi
Date: Mon, 3 Jul 2023 10:10:55 +0200
Subject: [PATCH 13/13] Add scripts to prepare gencode annotation
---
.../prepare_reference_data/README.md | 20 ++
.../dexseq_prepare_annotation2.py | 177 ++++++++++++++++++
.../prepare_gencode_annotation.sh | 23 +++
3 files changed, 220 insertions(+)
create mode 100644 resources/analysisTools/rnaseqworkflow/prepare_reference_data/README.md
create mode 100644 resources/analysisTools/rnaseqworkflow/prepare_reference_data/dexseq_prepare_annotation2.py
create mode 100644 resources/analysisTools/rnaseqworkflow/prepare_reference_data/prepare_gencode_annotation.sh
diff --git a/resources/analysisTools/rnaseqworkflow/prepare_reference_data/README.md b/resources/analysisTools/rnaseqworkflow/prepare_reference_data/README.md
new file mode 100644
index 0000000..d74a509
--- /dev/null
+++ b/resources/analysisTools/rnaseqworkflow/prepare_reference_data/README.md
@@ -0,0 +1,20 @@
+## **Script to prepare gencode annotation data for RNA-seq analysis**
+
+The reference data for GRCh38 and GRCm39 are prepared using the [refmake workflow](https://odcf-gitlab.dkfz.de/ODCF/refmake). This includes the
+1. STAR index
+2. Kallisto index
+3. Gencode annotation GTF file
+
+The downstream annotation files that are listed below are generated using the `prepare_gencode_annotation.sh` script.
+1. annotation.bed
+2. annotation.nogene.gtf
+3. annotation.chrXYMT.rRNA.gtf
+4. annotation.dexseq.gff
+
+The script can be run as follows:
+
+```bash
+sh prepare_gencode_annotation.sh /omics/odcf/reference_data/legacy/ngs_share/assemblies/hg_GRCh38/databases/gencode/GRCh38_decoy_ebv_alt_hla_phiX/gencode_v39_chr_patch_hapl_scaff/annotation.gtf
+```
+
+The Python script `dexseq_prepare_annotation2.py` was downloaded from [here](https://raw.githubusercontent.com/vivekbhr/Subread_to_DEXSeq/master/dexseq_prepare_annotation2.py) and edited to not print transcript IDs in the output files. This was to avoid the memory issue caused by the long concatenation of the transcript IDs.
\ No newline at end of file
diff --git a/resources/analysisTools/rnaseqworkflow/prepare_reference_data/dexseq_prepare_annotation2.py b/resources/analysisTools/rnaseqworkflow/prepare_reference_data/dexseq_prepare_annotation2.py
new file mode 100644
index 0000000..409f97d
--- /dev/null
+++ b/resources/analysisTools/rnaseqworkflow/prepare_reference_data/dexseq_prepare_annotation2.py
@@ -0,0 +1,177 @@
+#!/usr/bin/env python
+
+import sys, collections, itertools, os.path, optparse, os # changed here to add sed
+
+optParser = optparse.OptionParser(
+
+ usage = "python %prog [options] ",
+
+ description=
+ "Script to prepare annotation for DEXSeq." +
+ "This script takes an annotation file in Ensembl GTF format" +
+ "and outputs a 'flattened' annotation file suitable for use " +
+ "with the count_in_exons.py script ",
+
+ epilog =
+ "Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology " +
+ "Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General " +
+ "Public License v3. Part of the 'DEXSeq' package. " +
+ "Modified by Vivek Bhardwaj (just a bit!) to write featurecounts gtf as an option.")
+
+optParser.add_option( "-r", "--aggregate", type="choice", dest="aggregate",
+ choices = ( "no", "yes" ), default = "yes",
+ help = "'yes' or 'no'. Indicates whether two or more genes sharing an exon should be merged into an 'aggregate gene'. If 'no', the exons that can not be assiged to a single gene are ignored." )
+
+# add option for featurecounts output
+optParser.add_option( "-f", "--featurecountsgtf", type="string", dest="fcgtf", action = "store",
+ help = "gtf file to write for featurecounts." )
+
+##
+
+(opts, args) = optParser.parse_args()
+
+if len( args ) != 2:
+ sys.stderr.write( "Script to prepare annotation for DEXSeq.\n\n" )
+ sys.stderr.write( "Usage: python %s \n\n" % os.path.basename(sys.argv[0]) )
+ sys.stderr.write( "This script takes an annotation file in Ensembl GTF format\n" )
+ sys.stderr.write( "and outputs a 'flattened' annotation file suitable for use\n" )
+ sys.stderr.write( "with the count_in_exons.py script.\n" )
+ sys.exit(1)
+
+try:
+ import HTSeq
+except ImportError:
+ sys.stderr.write( "Could not import HTSeq. Please install the HTSeq Python framework\n" )
+ sys.stderr.write( "available from http://www-huber.embl.de/users/anders/HTSeq\n" )
+ sys.exit(1)
+
+
+
+
+gtf_file = args[0]
+out_file = args[1]
+
+aggregateGenes = opts.aggregate == "yes"
+
+# Step 1: Store all exons with their gene and transcript ID
+# in a GenomicArrayOfSets
+
+exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True )
+for f in HTSeq.GFF_Reader( gtf_file ):
+ if f.type != "exon":
+ continue
+ f.attr['gene_id'] = f.attr['gene_id'].replace( ":", "_" )
+ exons[f.iv] += ( f.attr['gene_id'], f.attr['transcript_id'] )
+
+
+# Step 2: Form sets of overlapping genes
+
+# We produce the dict 'gene_sets', whose values are sets of gene IDs. Each set
+# contains IDs of genes that overlap, i.e., share bases (on the same strand).
+# The keys of 'gene_sets' are the IDs of all genes, and each key refers to
+# the set that contains the gene.
+# Each gene set forms an 'aggregate gene'.
+
+if aggregateGenes == True:
+ gene_sets = collections.defaultdict( lambda: set() )
+ for iv, s in exons.steps():
+ # For each step, make a set, 'full_set' of all the gene IDs occuring
+ # in the present step, and also add all those gene IDs, whch have been
+ # seen earlier to co-occur with each of the currently present gene IDs.
+ full_set = set()
+ for gene_id, transcript_id in s:
+ full_set.add( gene_id )
+ full_set |= gene_sets[ gene_id ]
+ # Make sure that all genes that are now in full_set get associated
+ # with full_set, i.e., get to know about their new partners
+ for gene_id in full_set:
+ assert gene_sets[ gene_id ] <= full_set
+ gene_sets[ gene_id ] = full_set
+
+
+# Step 3: Go through the steps again to get the exonic sections. Each step
+# becomes an 'exonic part'. The exonic part is associated with an
+# aggregate gene, i.e., a gene set as determined in the previous step,
+# and a transcript set, containing all transcripts that occur in the step.
+# The results are stored in the dict 'aggregates', which contains, for each
+# aggregate ID, a list of all its exonic_part features.
+
+aggregates = collections.defaultdict( lambda: list() )
+for iv, s in exons.steps( ):
+ # Skip empty steps
+ if len(s) == 0:
+ continue
+ gene_id = list(s)[0][0]
+ ## if aggregateGenes=FALSE, ignore the exons associated to more than one gene ID
+ if aggregateGenes == False:
+ check_set = set()
+ for geneID, transcript_id in s:
+ check_set.add( geneID )
+ if( len( check_set ) > 1 ):
+ continue
+ else:
+ aggregate_id = gene_id
+ # Take one of the gene IDs, find the others via gene sets, and
+ # form the aggregate ID from all of them
+ else:
+ assert set( gene_id for gene_id, transcript_id in s ) <= gene_sets[ gene_id ]
+ aggregate_id = '+'.join( gene_sets[ gene_id ] )
+ # Make the feature and store it in 'aggregates'
+ f = HTSeq.GenomicFeature( aggregate_id, "exonic_part", iv )
+ f.source = os.path.basename( sys.argv[0] )
+# f.source = "camara"
+ f.attr = {}
+ f.attr[ 'gene_id' ] = aggregate_id
+ transcript_set = set( ( transcript_id for gene_id, transcript_id in s ) )
+ # Comment out the following line if you do not want to have the transcript IDs in the GFF output
+ # Concatenate the transcript IDs, separated by '+', makes a long transcript ID which leads to out of memory
+ #f.attr[ 'transcripts' ] = '+'.join( transcript_set )
+ aggregates[ aggregate_id ].append( f )
+
+
+# Step 4: For each aggregate, number the exonic parts
+
+aggregate_features = []
+for l in aggregates.values():
+ for i in range( len(l)-1 ):
+ assert l[i].name == l[i+1].name, str(l[i+1]) + " has wrong name"
+ assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early"
+ if l[i].iv.chrom != l[i+1].iv.chrom:
+ raise ValueError(
+ "Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) )
+ )
+ if l[i].iv.strand != l[i+1].iv.strand:
+ raise ValueError(
+ "Same name found on two strands: %s, %s" % ( str(l[i]), str(l[i+1]) )
+ )
+ aggr_feat = HTSeq.GenomicFeature( l[0].name, "aggregate_gene",
+ HTSeq.GenomicInterval( l[0].iv.chrom, l[0].iv.start,
+ l[-1].iv.end, l[0].iv.strand ) )
+ aggr_feat.source = os.path.basename( sys.argv[0] )
+ aggr_feat.attr = { 'gene_id': aggr_feat.name }
+ for i in range( len(l) ):
+ l[i].attr['exonic_part_number'] = "%03d" % ( i+1 )
+ aggregate_features.append( aggr_feat )
+
+
+# Step 5: Sort the aggregates, then write everything out
+
+aggregate_features.sort( key = lambda f: ( f.iv.chrom, f.iv.start ) )
+
+fout = open( out_file, "w" )
+for aggr_feat in aggregate_features:
+ fout.write( aggr_feat.get_gff_line() )
+ for f in aggregates[ aggr_feat.name ]:
+ fout.write( f.get_gff_line() )
+
+fout.close()
+
+## modify file to print gtf if featurecounts gtf requested
+fcountgtf = opts.fcgtf
+
+if fcountgtf :
+ os.system('sed s/aggregate_gene/gene/g ' + out_file + ' > ' + fcountgtf)
+ os.system('sed -i s/exonic_part/exon/g ' + fcountgtf)
+ print("Done!")
+else :
+ print("Done!")
diff --git a/resources/analysisTools/rnaseqworkflow/prepare_reference_data/prepare_gencode_annotation.sh b/resources/analysisTools/rnaseqworkflow/prepare_reference_data/prepare_gencode_annotation.sh
new file mode 100644
index 0000000..60242af
--- /dev/null
+++ b/resources/analysisTools/rnaseqworkflow/prepare_reference_data/prepare_gencode_annotation.sh
@@ -0,0 +1,23 @@
+#! /bin/env bash
+# Prepare gencode annotation for RNAseq workflow based on the GTF file from the refmake workflow
+# Example run sh prepare_gencode_annotation.sh /omics/odcf/reference_data/legacy/ngs_share/assemblies/hg_GRCh38/databases/gencode/GRCh38_decoy_ebv_alt_hla_phiX/gencode_v39_chr_patch_hapl_scaff/annotation.gtf
+module load bedops/2.4.14
+GTF_FILE=$1
+GTF_FILE_basename=${GTF_FILE%.gtf}
+
+# GTF to bed
+convert2bed -i gtf < $GTF_FILE > ${GTF_FILE_basename}.bed
+
+## For RNAseq workflow
+cat $GTF_FILE | grep -P "^chr[X|Y|MT]|\brRNA" | awk -F '\t' '$3=="gene"' > ${GTF_FILE_basename}.chrXYMT.rRNA.gtf
+
+# Creating dexseq gff files using vivekbhr/Subread_to_DEXSeq
+# wget https://raw.githubusercontent.com/vivekbhr/Subread_to_DEXSeq/master/dexseq_prepare_annotation2.py
+# conda create --name htseq -c bioconda htseq
+source activate htseq
+
+python dexseq_prepare_annotation2.py -f ${GTF_FILE_basename}.dexseq.gtf $GTF_FILE ${GTF_FILE_basename}.dexseq.gff
+
+## Creating nogene GTF file for RNAseqQC analysis
+# The gene annotations doesn't contain 'transcript_id', so has to be excluded for the RNAseqQC analysis
+cat $GTF_FILE | awk '$3 != "gene"' > ${GTF_FILE_basename}.nogene.gtf