diff --git a/README.md b/README.md index d113964..69dc7b1 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@ # What the Phage (WtP) * by Christian Brandt & Mike Marquet * **this tool is under heavy development, feel free to report issues and add suggestions** -* use a release candidates for a stable experience via `-r release.number` e.g. `-r v0.6` +* use a release candidate for a stable experience via `-r release.number` e.g. `-r v0.6` * these are extensively tested release versions of WtP * [releases of WtP are here](https://github.com/replikation/What_the_Phage/releases) @@ -61,7 +61,7 @@ sudo mv nextflow /usr/bin/ sudo apt-get install -y docker-ce docker-ce-cli containerd.io sudo usermod -a -G docker $USER ``` -* Restart your computer and [go](##Quick-execution) +* Restart your computer and [go](#Quick-execution) ## Default @@ -81,9 +81,9 @@ sudo usermod -a -G docker $USER * Try out the installation by entering the following ```shell # for docker (local use) -nextflow run replikation/What_the_Phage -r v0.6 --cores 8 --fasta ~/.nextflow/assets/replikation/What_the_Phage/test-data/all_pos_phage.fasta -profile local,docker +nextflow run replikation/What_the_Phage -r v0.7 --cores 8 --fasta ~/.nextflow/assets/replikation/What_the_Phage/test-data/all_pos_phage.fasta -profile local,docker # for singularity (local use) -nextflow run replikation/What_the_Phage -r v0.6 --cores 8 --fasta ~/.nextflow/assets/replikation/What_the_Phage/test-data/all_pos_phage.fasta -profile local,singularity +nextflow run replikation/What_the_Phage -r v0.7 --cores 8 --fasta ~/.nextflow/assets/replikation/What_the_Phage/test-data/all_pos_phage.fasta -profile local,singularity ``` # Execution / Examples / Help @@ -103,7 +103,7 @@ nextflow run \ # calling the workflow --fasta /path/to/file.fa \ # provide a fasta-file as input --cores 4 \ # number of cores you want to use -profile local,docker # choose the environment:local and docker - -r v0.6 # WtP release version + -r v0.7 # WtP release version ``` @@ -120,7 +120,7 @@ nextflow run replikation/What_the_Phage \ --fasta '/path/to/*.fasta' \ -profile local,docker \ --cores 4 \ - -r v0.6 \ + -r v0.7 \ --anno \ --dv \ --vf \ @@ -173,11 +173,11 @@ nextflow run replikation/What_the_Phage \ ### Release candidate * A release candidate is a [released version of WtP](https://github.com/replikation/What_the_Phage/releases) which ensures proper functionality -* version control ensures reproducability as each tools version is also "locked" within the release candidate +* version control ensures reproducibility as each tools version is also "locked" within the release candidate * databases have no automatic version control (they are downloaded from the source) * if you need version control for databases, just make a copy of the database dir after download * you can specify the database dir via the `--database` flag (see below) - * WtP only downloads a database if its missing, it is not "auto updating them" + * WtP only downloads a database if it's missing, it is not "auto-updating" them * add this flag to your command and a specific release is used instead ```bash -r v0.6 @@ -211,22 +211,37 @@ nextflow run replikation/What_the_Phage --setup # Example results #### 1. Identification Tool and contig overview (UpSetR) -![plot](figures/plot.png) -*Figure 1:* This chart (UpSetR plot) quantifies the result-intersections of the phage identification tools, similar to a venn diagram. The amount of positive phage-sequences identified by each tool is represented on the left barplot in blue. The dot plot shows via line connection(s) which of the tools identified the exact same positive phage sequences. The amount of these shared matches is quantified as a barplot above each corresponding dot pattern. +![plot](figures/plot.svg) + +*Figure 1:* This chart (UpSetR plot) quantifies the result-intersections of the phage identification tools, similar to a Venn diagram. The amount of positive phage-sequences identified by each tool is represented on the left barplot in blue. The dot plot shows via line connection(s) which of the tools identified the exact same positive phage sequences. The amount of these shared matches is quantified as a barplot above each corresponding dot pattern. #### 2. Annotation Visualization (Chromomap) * [chromomap results](https://replikation.github.io/What_the_Phage/index.html) -*See Link:* The graphical output of the annotation shows an overview of the individual loci of the predicted ORFs and the corresponding genes in the fasta sequences identified as phages. For better visibility, we have chosen 4 categories tail, capsid, baseplate, and other. This output can be used to verify the identified sequences (if the predicted sequences make sense or not). The annotation results are additionally plotted in an interactive HTML-file and are available as a file for further analysis. +*See Link:* The graphical output of the annotation shows an overview of the individual loci of the predicted ORFs and the corresponding genes in the fasta sequences identified as phages. For a better visibility, we have chosen 4 categories tail, capsid, baseplate, and other. This output can be used to verify the identified sequences (if the predicted sequences make sense or not). The annotation results are additionally plotted in an interactive HTML-file and are available as a file for further analysis. #### 3. Summary Table (checkV + Results) -* Featured for release `-r v0.7` (wip) +* check [CheckV](https://bitbucket.org/berkeleylab/checkv/src/master/) for a detailed explanation + +contig_id| contig_length| genome_copies| gene_count| viral_genes| host_genes| checkv_quality| miuvig_quality| completeness| completeness_method| contamination| provirus| +|-|-|-|-|-|-|-|-|-|-|-|-| +pos_phage_0| 146647| 1| 243| 141| 1| High-quality| High-quality| 97.03| AAI-based| 0| No| +pos_phage_1| 58871| 1| 97| 21| 0| High-quality| High-quality| 100| AAI-based| 0| No| +pos_phage_2| 58560| 1| 95| 20| 0| High-quality| High-quality| 99.47| AAI-based| 0| No| +pos_phage_3| 59443| 1| 90| 52| 0| High-quality| High-quality| 100| AAI-based| 0| No| +pos_phage_4| 51290| 1| 74| 44| 0| High-quality| High-quality| 100| AAI-based| 0| No| +pos_phage_5| 43293| 1| 69| 55| 0| High-quality| High-quality| 100| AAI-based| 0| No| +pos_phage_6| 43851| 1| 53| 30| 0| High-quality| High-quality| 98.71| AAI-based| 0| No| +pos_phage_7| 44262| 1| 54| 31| 0| High-quality| High-quality| 99.64| AAI-based| 0| No| +pos_phage_8| 41865| 1| 60| 57| 0| High-quality| High-quality| 97.29| AAI-based| 0| No| +pos_phage_9| 221908| 1| 310| 48| 9| High-quality| High-quality| 100| AAI-based| 0| No| + # Under the hood ![plot](figures/wtp-flowchart-simple.png) -*Figure 3:* This plot shows a simplified dagchart of WtP for better understanding what's going on behind the curtain. +*Figure 3:* This plot shows a simplified dag-chart of WtP for better understanding of what's going on behind the curtain. @@ -255,7 +270,7 @@ Toolname/Git | Reference [prodigal](https://github.com/hyattpd/Prodigal)|[Prodigal: prokaryotic gene recognition and translation initiation site identification](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119) [hmmer](http://hmmer.org/)|[nhmmer: DNA homology search with profile HMMs](https://academic.oup.com/bioinformatics/article/29/19/2487/186765) [chromomap](https://cran.r-project.org/web/packages/chromoMap/vignettes/chromoMap.html)| -[CheckV](https://bitbucket.org/berkeleylab/checkv/src/master/)| +[CheckV](https://bitbucket.org/berkeleylab/checkv/src/master/)|[CheckV: assessing the quality of metagenome-assembled viral genomes](https://www.biorxiv.org/content/10.1101/2020.05.06.081778v1) ### Other tools Toolname/Git | Reference |-|-| diff --git a/configs/container.config b/configs/container.config index 78ee54b..b112d20 100644 --- a/configs/container.config +++ b/configs/container.config @@ -1,6 +1,6 @@ process { withLabel: chromomap { container = 'nanozoo/r_fungi:0.1--097b1bb' } - withLabel: checkV { container = 'nanozoo/checkv:0.4.0--f3ed06e' } + withLabel: checkV { container = 'nanozoo/checkv:0.6.0--e97f45e' } withLabel: deepvirfinder { container = 'multifractal/deepvirfinder:0.1' } withLabel: emboss { container = 'quay.io/biocontainers/emboss:6.5.7--4' } withLabel: ggplot2 { container = 'michelsteuwer/ggplot2:latest' } diff --git a/figures/plot.png b/figures/plot.png deleted file mode 100644 index ff0ba5b..0000000 Binary files a/figures/plot.png and /dev/null differ diff --git a/figures/plot.svg b/figures/plot.svg new file mode 100644 index 0000000..128b8a7 --- /dev/null +++ b/figures/plot.svg @@ -0,0 +1,678 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/modules/checkV.nf b/modules/checkV.nf index ed36dcb..5237dc4 100644 --- a/modules/checkV.nf +++ b/modules/checkV.nf @@ -1,17 +1,18 @@ process checkV { publishDir "${params.output}/${name}/" + errorStrategy 'ignore' label 'checkV' input: tuple val(name), path(fasta) file(database) output: - tuple val(name), file("${name}_quality_summary.tsv") + tuple val(name), file("${name}_quality_summary.tsv"), file("negative_result_${name}.txt") optional true script: - """ - checkv completeness ${fasta} -d ${database} -t ${task.cpus} results - checkv repeats ${fasta} results - checkv contamination ${fasta} -d ${database} -t ${task.cpus} results - checkv quality_summary ${fasta} results - cp results/quality_summary.tsv ${name}_quality_summary.tsv + """ + checkv completeness ${fasta} -d ${database} -t ${task.cpus} results #2> /dev/null + checkv repeats ${fasta} results #2> /dev/null + checkv contamination ${fasta} -d ${database} -t ${task.cpus} results #2> /dev/null + checkv quality_summary ${fasta} results #2> /dev/null + cp results/quality_summary.tsv ${name}_quality_summary.tsv #2> /dev/null """ } \ No newline at end of file diff --git a/modules/sourmash_for_tax.nf b/modules/sourmash_for_tax.nf new file mode 100644 index 0000000..b4311e6 --- /dev/null +++ b/modules/sourmash_for_tax.nf @@ -0,0 +1,51 @@ +process sourmash_for_tax { + publishDir "${params.output}/${name}/taxonomic-classification", mode: 'copy', pattern: "${name}_*.tax-class.csv" + label 'sourmash' + // errorStrategy 'ignore' + input: + tuple val(name), file(fasta_dir) + file(database) + output: + tuple val(name), file("${name}_*.tax-class.csv") + shell: + """ + tar xzf ${database} + + for fastafile in ${fasta_dir}/*.fa; do + sourmash compute -p ${task.cpus} --scaled 100 -k 21 \${fastafile} + done + + for signature in *.sig; do + sourmash search -k 21 \${signature} phages.sbt.json -o \${signature}.temporary + done + + touch ${name}_\${PWD##*/}.tax-class.csv + sed -i 1i"contig, prediction_value, predicted-organism-name" ${name}_\${PWD##*/}.tax-class.csv + + for classfile in *.temporary; do + phagename=\$(grep -v "similarity,name,filename,md5" \$classfile \ + | sort -nrk1,1 | head -1 \ + | tr -d '"' \ + | tr "|" "," \ + | tr -s _ \ + | awk -F "\\"*,\\"*" '{print \$6 \$7}' ) + + similarity=\$(grep -v "similarity,name,filename,md5" \$classfile \ + | sort -nrk1,1 | head -1 \ + | tr -d '"' \ + | tr "|" "," \ + | tr -s _ \ + | awk -F "\\"*,\\"*" '{print \$1}' \ + | awk '{printf "%.2f\\n",\$1}' ) + + filename=\$(basename \${classfile} .fa.sig.temporary) + + + echo "\$filename,\$similarity,\${phagename:1} " >> ${name}_\${PWD##*/}.tax-class.csv + done + """ +} + +/* +filtering criteria is at line 24 (awk part) with a current similiarity of 0.5 or higher to known phages +*/ \ No newline at end of file diff --git a/modules/tools/metaphinder.nf b/modules/tools/metaphinder.nf index 1cce6cb..1df7033 100644 --- a/modules/tools/metaphinder.nf +++ b/modules/tools/metaphinder.nf @@ -1,7 +1,6 @@ process metaphinder { label 'metaphinder' errorStrategy 'ignore' - input: tuple val(name), file(fasta) output: diff --git a/phage.nf b/phage.nf index 6bb06b7..debe242 100755 --- a/phage.nf +++ b/phage.nf @@ -169,6 +169,7 @@ if (!params.setup) { include virsorter_collect_data from './modules/raw_data_collection/virsorter_collect_data' include virsorter_download_DB from './modules/databases/virsorter_download_DB' include vog_DB from './modules/databases/download_vog_DB' + include sourmash_for_tax from './modules/sourmash_for_tax' /************* * DATABASES for Phage Identification @@ -610,16 +611,22 @@ workflow phage_annotation_MSF { pvog_DB vog_DB rvdb_DB - main : - prodigal(fasta) + //annotation-process + prodigal(fasta) hmmscan(prodigal.out, pvog_DB.map { it -> [it[0]] }) chromomap( chromomap_parser( fasta.join(hmmscan.out), pvog_DB.map { it -> [it[1]] })) +} +workflow phage_tax_classification { + take: fasta + sourmash_database + main: + sourmash_for_tax(split_multi_fasta(fasta), sourmash_database).groupTuple(remainder: true) } /************* @@ -634,7 +641,6 @@ workflow { phage_references() if (params.mp || params.annotate) { ref_phages_DB = Channel.from( [ 'deactivated', 'deactivated'] ) } else { ref_phages_DB = phage_blast_DB (phage_references.out) } if (params.pp || params.annotate) { ppr_deps = Channel.from( [ 'deactivated', 'deactivated'] ) } else { ppr_deps = ppr_dependecies() } - if (params.sm || params.annotate) { sourmash_DB = Channel.from( [ 'deactivated', 'deactivated'] ) } else { sourmash_DB = sourmash_database (phage_references.out) } if (params.vb || params.annotate) { vibrant_DB = Channel.from( [ 'deactivated', 'deactivated'] ) } else { vibrant_DB = vibrant_download_DB() } if (params.vs || params.annotate) { virsorter_DB = Channel.from( [ 'deactivated', 'deactivated'] ) } else { virsorter_DB = virsorter_database() } @@ -643,7 +649,8 @@ workflow { if (params.identify) { vog_DB = Channel.from( [ 'deactivated', 'deactivated'] ) } else { vog_DB = vog_database() } if (params.identify) { rvdb_DB = Channel.from( [ 'deactivated', 'deactivated'] ) } else { rvdb_DB = rvdb_database() } if (params.identify) { checkV_DB = Channel.from( [ 'deactivated', 'deactivated'] ) } else { checkV_DB = checkV_database() } - + // database sourmash + if (params.identify && params.sm) { sourmash_DB = Channel.from( [ 'deactivated', 'deactivated'] ) } else { sourmash_DB = sourmash_database (phage_references.out) } // Phage identification if (params.fasta && !params.annotate) { identify_fasta_MSF(fasta_input_ch, ref_phages_DB, ppr_deps,sourmash_DB, vibrant_DB, virsorter_DB) } @@ -651,16 +658,17 @@ workflow { // annotation based on fasta and fastq input combinations // channel handling - if (params.fasta && params.fastq && params.annotate) { annotation_ch = identify_fastq_MSF.out.mix(fasta_input_ch) } + if (params.fasta && params.fastq && params.annotate) { annotation_ch = identify_fastq_MSF.out.mix(fasta_validation_wf(fasta_input_ch)) } else if (params.fasta && params.fastq && !params.annotate) { annotation_ch = identify_fastq_MSF.out.mix(identify_fasta_MSF.out) } - else if (params.fasta && params.annotate) { annotation_ch = fasta_input_ch } + else if (params.fasta && params.annotate) { annotation_ch = fasta_validation_wf(fasta_input_ch)} else if (params.fasta && !params.annotate) { annotation_ch = identify_fasta_MSF.out } else if (params.fastq ) { annotation_ch = identify_fastq_MSF.out } - // actual annotation -> annotation_ch = tuple val(name), path(fasta) + // actual annotation & classification -> annotation_ch = tuple val(name), path(fasta) if (!params.identify) { phage_annotation_MSF(annotation_ch, pvog_DB, vog_DB, rvdb_DB) checkV_wf(annotation_ch, checkV_DB) + phage_tax_classification(annotation_ch, sourmash_DB ) } } diff --git a/test-data/all_neg_phage.fa b/test-data/all_neg_phage.fa index 8586508..37a32d4 100644 --- a/test-data/all_neg_phage.fa +++ b/test-data/all_neg_phage.fa @@ -21670,7 +21670,7 @@ TTCACCTCATTCCATTCTACAGAGAATCTCCTTCACATTCCTAGGGTGGATTAAAGTGTTATATTTAACT GCCAATTGTGAAATTGGCGGAAGGCATTAGACTAAAGGAAGAAACTGACCTCTCGTTGGGCCATAATTCT AGAAAGCCGTTGTGATCCTGAGTCTGAGTGTGACCTTCCAAGTTTTCCCGACCCTCCTGCAAGTGAGATA CTGGTCTGTG ->neg_phage_7 +>neg_phage_6 TCTTGGCTCATTGCAACCTCCGCCTCCTGGGTTCAAGCAATTCTCCTGCCTCAGCCTCCTGAGTAGCTGG GATTACAGGCGCTCGCCACCACGCCCAGCTAATTTTTGTATGTTTTTTTTTAGTAGAGACGGGGTTTCAC CATGTTGGCCAGGCTGGTCTTGAACTCTTGACCTTGTGATCGGCCTCCCAAAGTGCTGGGATTACAGGCG