Commit

Merge pull request #28 from DKFZ-ODCF/pre-release-4
Pre release 4
vinjana authored Jul 3, 2023
2 parents 17976f3 + 7f1953f commit 0950781
Showing 10 changed files with 410 additions and 122 deletions.
111 changes: 91 additions & 20 deletions README.md

Large diffs are not rendered by default.

63 changes: 34 additions & 29 deletions environment.yaml
@@ -1,15 +1,15 @@
- name: RNAseqWorkflow
+ name: RNAseqWorkflow_4
channels:
- conda-forge
- bioconda
- bioconda-legacy
- conda-forge
- defaults
- bioconda-legacy
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=1_llvm
- _r-mutex=1.0.1=anacondar_1
- - arriba=1.2.0=hc088bd4_0
- - bcftools=1.10.2=hd2cd319_0
+ - arriba=2.2.1=h3198e80_0
+ - bcftools=1.12=h45bccc9_1
- bioconductor-affy=1.56.0=r3.4.1_0
- bioconductor-affyio=1.50.0=r341h470a237_0
- bioconductor-annotate=1.58.0=r341_0
@@ -43,55 +43,60 @@ dependencies:
- bioconductor-xvector=0.20.0=r341h470a237_0
- bioconductor-zlibbioc=1.26.0=r341h470a237_0
- bzip2=1.0.8=h516909a_2
- - ca-certificates=2019.11.28=hecc5488_0
+ - c-ares=1.18.1=h7f98852_0
+ - ca-certificates=2021.10.8=ha878542_0
- cairo=1.14.12=he6fea26_5
- - certifi=2019.11.28=py27_0
- - curl=7.68.0=hf8cf82a_0
+ - certifi=2019.11.28=py27h8c360ce_1
+ - curl=7.76.1=h979ede3_1
- fontconfig=2.13.1=h2176d3f_1000
- freetype=2.9.1=h3cfcefd_1004
- gettext=0.19.8.1=hc5be6a0_1002
- glib=2.55.0=h464dc38_2
- graphite2=1.3.13=hf484d3e_1000
- - gsl=2.5=h294904e_1
+ - gsl=2.6=he838d99_2
- harfbuzz=1.9.0=h08d66d9_0
- - hdf5=1.8.17=11
- - htslib=1.10.2=h78d89cc_0
+ - hdf5=1.10.5=nompi_h5b725eb_1114
+ - htslib=1.12=h9093b5e_1
- icu=58.2=hf484d3e_1000
- jpeg=9c=h14c3975_1001
- - kallisto=0.43.0=hdf51.8.17_2
- - krb5=1.16.4=h2fd8d38_0
- - libblas=3.8.0=15_openblas
- - libcblas=3.8.0=15_openblas
- - libcurl=7.68.0=hda55be3_0
- - libdeflate=1.3=h516909a_0
+ - kallisto=0.46.0=h4f7b962_1
+ - krb5=1.17.1=h2fd8d38_0
+ - libblas=3.9.0=13_linux64_openblas
+ - libcblas=3.9.0=13_linux64_openblas
+ - libcurl=7.76.1=hc4aaa36_1
+ - libdeflate=1.7=h7f98852_5
- libedit=3.1.20170329=0
+ - libev=4.33=h516909a_1
- libffi=3.2.1=he1b5a44_1006
- libgcc=7.2.0=h69d50b8_2
- - libgcc-ng=9.2.0=h24d8f2e_2
+ - libgcc-ng=11.2.0=h1d223b6_12
- libgfortran=3.0.0=1
- - libgfortran-ng=7.3.0=hdf63c60_5
+ - libgfortran-ng=11.2.0=h69a702a_12
+ - libgfortran5=11.2.0=h5c6108e_12
- libiconv=1.15=h516909a_1005
- libidn2=2.3.0=h516909a_0
- - libopenblas=0.3.8=h5ec1e0e_0
+ - libnghttp2=1.43.0=h812cca2_1
+ - libopenblas=0.3.18=pthreads_h8fe5266_0
- libpng=1.6.34=ha92aebf_2
- - libssh2=1.8.2=h22169c7_2
- - libstdcxx-ng=9.2.0=hdf63c60_2
+ - libssh2=1.10.0=ha56f1ee_2
+ - libstdcxx-ng=11.2.0=he4da1e4_12
- libtiff=4.0.9=h648cc4a_1002
- libunistring=0.9.10=h14c3975_0
- libuuid=2.32.1=h14c3975_1000
- libxcb=1.13=h14c3975_1002
- libxml2=2.9.9=h13577e0_2
- llvm-openmp=9.0.1=hc9558a2_2
- ncurses=5.9=10
- - openjdk=11.0.1=h516909a_1016
- - openssl=1.1.1d=h516909a_0
+ - openjdk=7.0.161=zulu7.21.0.3_0
+ - openssl=1.1.1l=h7f98852_0
- pango=1.40.14=h53a7087_1002
- pcre=8.39=0
- perl=5.26.2=h516909a_1006
- pip=20.0.2=py_2
- pixman=0.34.0=h14c3975_1003
- pthread-stubs=0.4=h14c3975_1001
- python=2.7.15=h721da81_1008
+ - python_abi=2.7=1_cp27mu
- qualimap=2.2.2a=3
- r=3.4.1=r3.4.1_0
- r-aroma.affymetrix=3.1.1=r341h6115d3f_0
@@ -184,12 +189,13 @@ dependencies:
- r-xml=3.98_1.16=r341hc070d10_0
- r-xtable=1.8_3=r341_1000
- readline=7.0=0
+ - rna-seqc=1.1.8=1
- sambamba=0.6.5=0
- - samtools=1.6=h244ad75_5
+ - samtools=1.9=h46bd0b3_0
- setuptools=44.0.0=py27_0
- sqlite=3.26.0=h7b6447c_0
- - star=2.5.3a=0
- - subread=1.5.3=0
+ - star=2.7.10a=h9ee0642_0
+ - subread=1.6.4=h84994c4_1
- tk=8.6.10=hed695b0_0
- wget=1.20.1=h22169c7_0
- wheel=0.34.2=py_1
@@ -204,6 +210,5 @@ dependencies:
- xorg-renderproto=0.11.1=h14c3975_1002
- xorg-xextproto=7.3.0=h14c3975_1002
- xorg-xproto=7.0.31=h14c3975_1007
- - xz=5.2.4=h14c3975_1001
+ - xz=5.2.5=h516909a_1
- zlib=1.2.11=h516909a_1006
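
Each dependency in `environment.yaml` is pinned as `name=version=build`. The following is a minimal sketch, not part of this commit, of how two sets of such pins could be compared to list version changes. The helper names `parse_pin` and `changed_versions` are hypothetical, and the sample pins are copied from the diff above.

```python
def parse_pin(line):
    """Split a conda pin such as '- star=2.7.10a=h9ee0642_0' into (name, version, build)."""
    # Drop the YAML list dash and surrounding whitespace.
    spec = line.strip().lstrip("-").strip()
    parts = spec.split("=")
    name = parts[0]
    version = parts[1] if len(parts) > 1 else None
    build = parts[2] if len(parts) > 2 else None
    return name, version, build


def changed_versions(old_lines, new_lines):
    """Return {name: (old_version, new_version)} for every package whose version differs.

    None marks a package that is absent on one side.
    """
    old = {name: version for name, version, _ in map(parse_pin, old_lines)}
    new = {name: version for name, version, _ in map(parse_pin, new_lines)}
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Sample pins taken from the diff above (illustrative subset only).
old_pins = ["- star=2.5.3a=0", "- samtools=1.6=h244ad75_5", "- zlib=1.2.11=h516909a_1006"]
new_pins = ["- star=2.7.10a=h9ee0642_0", "- samtools=1.9=h46bd0b3_0", "- zlib=1.2.11=h516909a_1006"]

for name, (before, after) in changed_versions(old_pins, new_pins).items():
    print(f"{name}: {before} -> {after}")
```

Build strings are ignored here on purpose, so a rebuild of the same version is not reported as a change.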

This file was deleted.

@@ -238,11 +238,11 @@ open (my $fh, "tail -n $counter $fc0_file |") or die;

# print header
print "\#chrom\tchromStart\tchromEnd\tgene_id\tscore\tstrand\tname\texonic_length\t";
- print "num_reads\tnum_reads_fw\tnum_reads_rv\t";
- print "FPKM_no_mt_rrna_trna_chrxy\tFPKM_no_mt_rrna_trna_chrxy_fw\tFPKM_no_mt_rrna_trna_chrxy_rv\t";
- print "TPM_no_mt_rrna_trna_chrxy\tTPM_no_mt_rrna_trna_chrxy_fw\tTPM_no_mt_rrna_trna_chrxy_rv\t";
- print "FPKM_standard\tFPKM_standard_fw\tFPKM_standard_rv\t";
- print "TPM_standard\tTPM_standard_fw\tTPM_standard_rv\n";
+ print "num_reads_unstranded\tnum_reads_stranded\tnum_reads_reverse_stranded\t";
+ print "FPKM_customLibSize_unstranded\tFPKM_customLibSize_stranded\tFPKM_customLibSize_reverse_stranded\t";
+ print "TPM_customLibSize_unstranded\tTPM_customLibSize_stranded\tTPM_customLibSize_reverse_stranded\t";
+ print "FPKM_unstranded\tFPKM_stranded\tFPKM_reverse_stranded\t";
+ print "TPM_unstranded\tTPM_stranded\tTPM_reverse_stranded\n";

while (!eof($fh)){

@@ -278,22 +278,22 @@ while (!eof($fh)){
print "$read_counts_all{1}{$hash_id}\t";
print "$read_counts_all{2}{$hash_id}\t";

- # print RPKMS no_mt_rrna_trna_chrxy
+ # print FPKM with custom library size estimation
print "".($read_counts_all_per_length{0}{$hash_id})*($BILLION/$ignore_read_counts{0})."\t";
print "".($read_counts_all_per_length{1}{$hash_id})*($BILLION/$ignore_read_counts{1})."\t";
print "".($read_counts_all_per_length{2}{$hash_id})*($BILLION/$ignore_read_counts{2})."\t";

- # print TPMS no_mt_rrna_trna_chrxy
+ # print TPM with custom library size estimation
print "".($read_counts_all_per_length{0}{$hash_id})*($BILLION/$ignore_read_counts{0})/($ignore_fpkm_counts{0}/$MILLION)."\t";
print "".($read_counts_all_per_length{1}{$hash_id})*($BILLION/$ignore_read_counts{1})/($ignore_fpkm_counts{1}/$MILLION)."\t";
print "".($read_counts_all_per_length{2}{$hash_id})*($BILLION/$ignore_read_counts{2})/($ignore_fpkm_counts{2}/$MILLION)."\t";

- # print RPKMS standard
+ # print FPKM
print "".($read_counts_all_per_length{0}{$hash_id})*($BILLION/$all_read_counts{0})."\t";
print "".($read_counts_all_per_length{1}{$hash_id})*($BILLION/$all_read_counts{1})."\t";
print "".($read_counts_all_per_length{2}{$hash_id})*($BILLION/$all_read_counts{2})."\t";

- # print TPMS standard
+ # print TPM
print "".($read_counts_all_per_length{0}{$hash_id})*($BILLION/$all_read_counts{0})/($all_fpkm_counts{0}/$MILLION)."\t";
print "".($read_counts_all_per_length{1}{$hash_id})*($BILLION/$all_read_counts{1})/($all_fpkm_counts{1}/$MILLION)."\t";
print "".($read_counts_all_per_length{2}{$hash_id})*($BILLION/$all_read_counts{2})/($all_fpkm_counts{2}/$MILLION)."";
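
The arithmetic in the print statements above can be restated in a short Python sketch. This is an illustration under assumptions, not the workflow's code: it assumes that `%read_counts_all_per_length` holds read count divided by exonic length, that `%all_fpkm_counts` and `%ignore_fpkm_counts` hold sums of FPKM values, and that the "custom library size" variant computes the library size and FPKM sum over a subset of genes (the old column name suggests mitochondrial, rRNA, tRNA and chrX/Y genes are excluded). The function name `fpkm_and_tpm` and the parameter `norm_genes` are hypothetical.

```python
import math

BILLION = 1e9
MILLION = 1e6


def fpkm_and_tpm(counts, lengths, norm_genes=None):
    """Compute FPKM and TPM per gene.

    counts     -- {gene: read count}
    lengths    -- {gene: exonic length in bases}
    norm_genes -- genes that contribute to the library size and to the FPKM sum
                  (assumption: all genes for the standard columns, a filtered
                  subset for the customLibSize columns)
    """
    if norm_genes is None:
        norm_genes = set(counts)
    library_size = sum(counts[g] for g in norm_genes)

    # FPKM = (count / length) * (1e9 / library size)
    fpkm = {g: (counts[g] / lengths[g]) * (BILLION / library_size) for g in counts}

    # TPM = FPKM / (sum of FPKM / 1e6)
    fpkm_sum = math.fsum(fpkm[g] for g in norm_genes)
    tpm = {g: fpkm[g] / (fpkm_sum / MILLION) for g in counts}
    return fpkm, tpm


# Toy input, invented for illustration.
counts = {"geneA": 100, "geneB": 300}
lengths = {"geneA": 1000, "geneB": 2000}

fpkm_all, tpm_all = fpkm_and_tpm(counts, lengths)
fpkm_custom, tpm_custom = fpkm_and_tpm(counts, lengths, norm_genes={"geneA"})
print(fpkm_all, tpm_all)
print(fpkm_custom, tpm_custom)
```

With the standard normalisation the TPM values sum to one million over all genes. With a restricted `norm_genes` set they sum to one million only over that set, so genes outside it still get a value but are not part of the normalisation. The sketch does not guard against an empty library (division by zero).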
18 changes: 9 additions & 9 deletions resources/analysisTools/rnaseqworkflow/featureCounts_2_FpkmTpm
@@ -235,11 +235,11 @@ open (my $fh, "cut -f 1 $fc0_file| tail -n $counter |") or die;

# print header
print "\#chrom\tchromStart\tchromEnd\tgene_id\tscore\tstrand\tname\texonic_length\t";
- print "num_reads\tnum_reads_fw\tnum_reads_rv\t";
- print "FPKM_no_mt_rrna_trna_chrxy\tFPKM_no_mt_rrna_trna_chrxy_fw\tFPKM_no_mt_rrna_trna_chrxy_rv\t";
- print "TPM_no_mt_rrna_trna_chrxy\tTPM_no_mt_rrna_trna_chrxy_fw\tTPM_no_mt_rrna_trna_chrxy_rv\t";
- print "FPKM_standard\tFPKM_standard_fw\tFPKM_standard_rv\t";
- print "TPM_standard\tTPM_standard_fw\tTPM_standard_rv\n";
+ print "num_reads_unstranded\tnum_reads_stranded\tnum_reads_reverse_stranded\t";
+ print "FPKM_customLibSize_unstranded\tFPKM_customLibSize_stranded\tFPKM_customLibSize_reverse_stranded\t";
+ print "TPM_customLibSize_unstranded\tTPM_customLibSize_stranded\tTPM_customLibSize_reverse_stranded\t";
+ print "FPKM_unstranded\tFPKM_stranded\tFPKM_reverse_stranded\t";
+ print "TPM_unstranded\tTPM_stranded\tTPM_reverse_stranded\n";

while (!eof($fh)){

@@ -256,22 +256,22 @@ while (!eof($fh)){
print "$read_counts_all{1}{$fc0}\t";
print "$read_counts_all{2}{$fc0}\t";

- # print RPKMS no_mt_rrna_trna_chrxy
+ # print FPKM with custom library size estimation
print "".($read_counts_all_per_length{0}{$fc0})*($BILLION/$ignore_read_counts{0})."\t";
print "".($read_counts_all_per_length{1}{$fc0})*($BILLION/$ignore_read_counts{1})."\t";
print "".($read_counts_all_per_length{2}{$fc0})*($BILLION/$ignore_read_counts{2})."\t";

- # print TPMS no_mt_rrna_trna_chrxy
+ # print TPM with custom library size estimation
print "".($read_counts_all_per_length{0}{$fc0})*($BILLION/$ignore_read_counts{0})/($ignore_fpkm_counts{0}/$MILLION)."\t";
print "".($read_counts_all_per_length{1}{$fc0})*($BILLION/$ignore_read_counts{1})/($ignore_fpkm_counts{1}/$MILLION)."\t";
print "".($read_counts_all_per_length{2}{$fc0})*($BILLION/$ignore_read_counts{2})/($ignore_fpkm_counts{2}/$MILLION)."\t";

- # print RPKMS standard
+ # print FPKM
print "".($read_counts_all_per_length{0}{$fc0})*($BILLION/$all_read_counts{0})."\t";
print "".($read_counts_all_per_length{1}{$fc0})*($BILLION/$all_read_counts{1})."\t";
print "".($read_counts_all_per_length{2}{$fc0})*($BILLION/$all_read_counts{2})."\t";

- # print TPMS standard
+ # print TPM
print "".($read_counts_all_per_length{0}{$fc0})*($BILLION/$all_read_counts{0})/($all_fpkm_counts{0}/$MILLION)."\t";
print "".($read_counts_all_per_length{1}{$fc0})*($BILLION/$all_read_counts{1})/($all_fpkm_counts{1}/$MILLION)."\t";
print "".($read_counts_all_per_length{2}{$fc0})*($BILLION/$all_read_counts{2})/($all_fpkm_counts{2}/$MILLION)."";
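
Both scripts rename the same fifteen output columns, so anything that reads the old table by column name needs updating. The sketch below, which is not part of this commit, builds the old-to-new mapping from the two header versions. It assumes the columns correspond positionally (no suffix to `_unstranded`, `_fw` to `_stranded`, `_rv` to `_reverse_stranded`), which is what the unchanged print order suggests. The names `RENAMES` and `rename_header` are hypothetical.

```python
# Column-name stems and suffixes, in the order the header prints them.
OLD_STEMS = ["num_reads", "FPKM_no_mt_rrna_trna_chrxy", "TPM_no_mt_rrna_trna_chrxy",
             "FPKM_standard", "TPM_standard"]
NEW_STEMS = ["num_reads", "FPKM_customLibSize", "TPM_customLibSize", "FPKM", "TPM"]
OLD_SUFFIXES = ["", "_fw", "_rv"]
NEW_SUFFIXES = ["_unstranded", "_stranded", "_reverse_stranded"]

# Assumption: old and new columns pair up by position in the header.
RENAMES = {
    old_stem + old_suffix: new_stem + new_suffix
    for old_stem, new_stem in zip(OLD_STEMS, NEW_STEMS)
    for old_suffix, new_suffix in zip(OLD_SUFFIXES, NEW_SUFFIXES)
}


def rename_header(header_line):
    """Translate an old tab-separated header line to the new column names.

    Columns that were not renamed (for example gene_id) pass through unchanged.
    """
    columns = header_line.rstrip("\n").split("\t")
    return "\t".join(RENAMES.get(column, column) for column in columns)


print(rename_header("gene_id\tnum_reads\tnum_reads_fw\tTPM_standard_rv\n"))
```

An explicit mapping table is used instead of string substitution because `num_reads` is a prefix of `num_reads_fw`, and plain replacement would rewrite the wrong part of the name.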
@@ -0,0 +1,20 @@
## **Script to prepare gencode annotation data for RNA-seq analysis**

The reference data for GRCh38 and GRCm39 are prepared using the [refmake workflow](https://odcf-gitlab.dkfz.de/ODCF/refmake). This includes:
1. STAR index
2. Kallisto index
3. Gencode annotation GTF file

The downstream annotation files listed below are generated with the `prepare_gencode_annotation.sh` script:
1. annotation.bed
2. annotation.nogene.gtf
3. annotation.chrXYMT.rRNA.gtf
4. annotation.dexseq.gff

The script takes the path to the Gencode annotation GTF file as its argument and can be run as follows:

```bash
sh prepare_gencode_annotation.sh /omics/odcf/reference_data/legacy/ngs_share/assemblies/hg_GRCh38/databases/gencode/GRCh38_decoy_ebv_alt_hla_phiX/gencode_v39_chr_patch_hapl_scaff/annotation.gtf
```

The Python script `dexseq_prepare_annotation2.py` was downloaded from the [Subread_to_DEXSeq repository](https://raw.githubusercontent.com/vivekbhr/Subread_to_DEXSeq/master/dexseq_prepare_annotation2.py) and edited so that it no longer prints transcript IDs in the output files. This avoids the memory issue caused by concatenating long lists of transcript IDs.