Merge pull request #47 from jannikseidelQBiC/corrections_release1.1.0_I
Corrections release 1.1.0 I
jannikseidelQBiC authored Nov 4, 2024
2 parents 7f58e94 + 267c390 commit 51ba34e
Showing 10 changed files with 127 additions and 112 deletions.
4 changes: 2 additions & 2 deletions .nf-core.yml
@@ -9,13 +9,13 @@ repository_type: pipeline
template:
author: Jannik Seidel
description: A pipeline to identify (and remove) certain sequences from raw genomic
data. Default taxa to identify (and remove) are Homo and Homo sapiens. Removal
data. Default taxon to identify (and remove) is Homo sapiens. Removal
is optional.
force: false
is_nfcore: true
name: detaxizer
org: nf-core
outdir: .
skip_features: null
version: 1.1.0dev
version: 1.1.0
update: null
89 changes: 45 additions & 44 deletions CHANGELOG.md
@@ -3,76 +3,77 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v1.1.0 - Kombjuudr - [2024-10-23]
## v1.1.0 - Kombjuudr - [2024-11-04]

### `Added`

- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Added bbduk to the classification step (kraken2 as default, both can be run together)
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Added `fasta_bbduk` parameter to provide a fasta file with contaminants
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Rewrote summary step of classification to be usable with bbduk and/or kraken2
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Made preprocessing with fastp optional and added a parameter to turn on duplication removal (off as default, was on/not changeable in v1.0.0)
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Optionally the removed reads can now be written to the output folder
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Added optional classification of filtered and removed reads via kraken2
- [PR #39](https://github.com/nf-core/detaxizer/pull/39) - Added generation of samplesheet for MAG, Taxprofiler
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Added bbduk to the classification step (kraken2 as default, both can be run together) (by @jannikseidelQBiC)
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Added `--fasta_bbduk` parameter to provide a fasta file with contaminants (by @jannikseidelQBiC)
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Rewrote summary step of classification to be usable with bbduk and/or kraken2 (by @jannikseidelQBiC)
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Made preprocessing with fastp optional and added the parameter `--fastp_eval_duplication` to turn on duplication removal (off as default, was on/not changeable in v1.0.0) (by @jannikseidelQBiC)
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Optionally the removed reads can now be written to the output folder (by @jannikseidelQBiC)
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Added optional classification of filtered and removed reads via kraken2 (by @jannikseidelQBiC)
- [PR #39](https://github.com/nf-core/detaxizer/pull/39) - Added generation of input samplesheet for nf-core/mag, nf-core/taxprofiler (by @Joon-Klaps)

#### Parameters

Added parameters:

| Parameter |
| --------------------------------------- |
| `fasta_bbduk` |
| `preprocessing` |
| `output_removed_reads` |
| `classification_kraken2` |
| `classification_bbduk` |
| `kraken2confidence_filtered` |
| `kraken2confidence_removed` |
| `classification_kraken2_post_filtering` |
| `fastp_eval_duplication` |
| `bbduk_kmers` |
| Parameter |
| ----------------------------------------- |
| `--fasta_bbduk` |
| `--preprocessing` |
| `--output_removed_reads` |
| `--classification_kraken2` |
| `--classification_bbduk` |
| `--kraken2confidence_filtered` |
| `--kraken2confidence_removed` |
| `--classification_kraken2_post_filtering` |
| `--fastp_eval_duplication` |
| `--bbduk_kmers` |
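
For illustration, enabling both classifiers and keeping the removed reads could look roughly like this (a sketch only; the samplesheet, profile and fasta paths are placeholders):

```bash
# Hedged example: run kraken2 and bbduk together, write removed reads to the
# output folder and re-classify filtered/removed reads with kraken2 afterwards.
nextflow run nf-core/detaxizer \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --classification_kraken2 \
    --classification_bbduk \
    --fasta_bbduk contaminants.fasta \
    --output_removed_reads \
    --classification_kraken2_post_filtering
```

Whether reads are actually removed, rather than only flagged, still depends on the pipeline's filtering options.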

Changed default values of parameters:

| Parameter | Old default value | New default value |
| ------------------------ | ----------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| `fastp_cut_mean_quality` | 15 | 1 |
| `kraken2db` | 'https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20231009.tar.gz' | 'https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz' |
| `kraken2confidence` | 0.05 | 0.00 |
| `tax2filter` | 'Homo' | 'Homo sapiens' |
| `cutoff_tax2filter` | 2 | 0 |
| `cutoff_tax2keep` | 0.5 | 0.0 |
| Parameter | Old default value | New default value |
| -------------------------- | ----------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| `--fastp_cut_mean_quality` | 15 | 1 |
| `--kraken2db` | 'https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20231009.tar.gz' | 'https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz' |
| `--kraken2confidence` | 0.05 | 0.00 |
| `--tax2filter` | 'Homo' | 'Homo sapiens' |
| `--cutoff_tax2filter` | 2 | 0 |
| `--cutoff_tax2keep` | 0.5 | 0.0 |
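
Anyone who wants the previous behaviour back can override the new defaults on the command line, for example (a sketch; input and output paths are placeholders):

```bash
# Hedged example: restore a few v1.0.0 defaults that changed in this release
nextflow run nf-core/detaxizer \
    --input samplesheet.csv \
    --outdir results \
    --tax2filter 'Homo' \
    --kraken2confidence 0.05 \
    --fastp_cut_mean_quality 15
```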

### `Changed`

- [PR #42](https://github.com/nf-core/detaxizer/pull/42) - Template update for nf-core/tools 3.0.2, for details read [this blog post](https://nf-co.re/blog/2024/tools-3_0_0#important-template-updates)

### `Fixed`

- [PR #33](https://github.com/nf-core/detaxizer/pull/33) - Addition of quotation marks in `parse_kraken2report.nf` prevents failure of the pipeline when using a taxon with space (e.g. Homo sapiens) with the `tax2filter` parameter
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Made validation via blastn optional by default
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Changed parameter `fasta` to `fasta_blastn`
- [PR #33](https://github.com/nf-core/detaxizer/pull/33) - Addition of quotation marks in `parse_kraken2report.nf` prevents failure of the pipeline when using a taxon with space (e.g. Homo sapiens) with the `--tax2filter` parameter (by @jannikseidelQBiC)
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Made validation via blastn optional by default (by @jannikseidelQBiC)
- [PR #34](https://github.com/nf-core/detaxizer/pull/34) - Changed parameter `--fasta` to `--fasta_blastn` (by @jannikseidelQBiC)

### `Dependencies`

Updated and added dependencies

| Tool    | Previous version | Current version |
| ------- | ---------------- | --------------- |
| bbmap   | -                | 39.10           |
| blastn  | 2.14.1           | 2.15.0          |
| multiQC | 1.21             | 1.25.1          |
| kraken2 | 2.1.2            | 2.1.3           |
| seqkit  | 2.8.0            | 2.8.2           |

### `Deprecated`

| Parameter | New parameter | Reason |
| ------------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `fasta` | `fasta_blastn` | Introduction of fasta_bbduk; necessary to further distinguish the two parameters |
| `skip_blastn` | `validation_blastn` | blastn is now to be enabled on purpose; too resource intensive for a default setting |
| `max_cpus` | - | New behavior of [nextflow](https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits), `resourceLimits` can now be set via a config |
| `max_memory` | - | New behavior of [nextflow](https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits), `resourceLimits` can now be set via a config |
| `max_time` | - | New behavior of [nextflow](https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits), `resourceLimits` can now be set via a config |
| Parameter | New parameter | Reason |
| --------------- | --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--fasta` | `--fasta_blastn` | Introduction of fasta_bbduk; necessary to further distinguish the two parameters |
| `--skip_blastn` | `--validation_blastn` | blastn is now to be enabled on purpose; too resource intensive for a default setting |
| `--max_cpus` | - | New behavior of [nextflow](https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits), `resourceLimits` can now be set via a config |
| `--max_memory` | - | New behavior of [nextflow](https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits), `resourceLimits` can now be set via a config |
| `--max_time` | - | New behavior of [nextflow](https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits), `resourceLimits` can now be set via a config |
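
The removed `--max_cpus`, `--max_memory` and `--max_time` parameters are replaced by Nextflow's `resourceLimits` directive, set in a custom config roughly like this (a sketch; the limit values are arbitrary examples):

```groovy
// custom.config -- caps the resources any process may request,
// replacing the removed --max_cpus/--max_memory/--max_time parameters
process {
    resourceLimits = [
        cpus: 8,
        memory: 72.GB,
        time: 24.h
    ]
}
```

Such a file would be passed to the pipeline with `-c custom.config`.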

## v1.0.0 - Kobbfarbad - [2024-03-26]

5 changes: 1 addition & 4 deletions README.md
@@ -19,7 +19,7 @@

## Introduction

**nf-core/detaxizer** is a bioinformatics pipeline that checks for the presence of a specific taxon in (meta)genomic fastq files and offers the option to filter out this taxon or taxonomic subtree. The process begins with quality assessment via FastQC and optional preprocessing (adapter trimming, quality cutting and optional length and quality filtering) using fastp, followed by taxon classification with kraken2 and/or bbduk, and optionally employs blastn for validation of the reads associated with the identified taxa. Users must provide a samplesheet to indicate the fastq files and, if utilizing bbduk in the classification and/or the validation step, fasta files for usage of bbduk and creating the blastn database to verify the targeted taxon.
**nf-core/detaxizer** is a bioinformatics pipeline that checks for the presence of a specific taxon in (meta)genomic fastq files and offers the option to filter out this taxon or taxonomic subtree. The process begins with quality assessment via FastQC and optional preprocessing (adapter trimming, quality cutting and optional length and quality filtering) using fastp, followed by taxonomic classification with kraken2 and/or bbduk, and optionally employs blastn for validation of the reads associated with the identified taxa. Users must provide a samplesheet pointing to the fastq files and, if bbduk is used in the classification step and/or blastn in the validation step, fasta files for bbduk and for building the blastn database used to verify the targeted taxon.

![detaxizer metro workflow](docs/images/Detaxizer_metro_workflow.png)

@@ -45,9 +45,6 @@ CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,A

Each row represents a fastq file (single-end) or a pair of fastq files (paired end). A third fastq file can be provided if long reads are present in your project. For more detailed information about the samplesheet, see the [usage documentation](docs/usage.md).

> [!NOTE]
> Be aware that the `tax2filter` (default _Homo sapiens_) has to be in the provided kraken2 database (if kraken2 is used) and that the reference for bbduk (provided by the `fasta_bbduk` parameter) should contain the taxa to filter/assess if it is wanted to assess/remove the same taxa as in `tax2filter`. This overlap in the databases is not checked by the pipeline. To filter out/assess taxa with bbduk only, the `tax2filter` parameter is not needed but a fasta file with references of these taxa has to be provided.
Now, you can run the pipeline using:

```bash
Binary file modified docs/images/Detaxizer_metro_workflow.png
2 changes: 1 addition & 1 deletion docs/images/Detaxizer_metro_workflow.svg
22 changes: 13 additions & 9 deletions docs/output.md
@@ -56,25 +56,28 @@ kraken2 classifies the reads. The important files are `*.classifiedreads.txt`, `
- `kraken2/`: Contains the output from the kraken2 classification steps.
- `filtered/`: Contains the classification of the filtered reads (post-filtering).
- `<sample>.classifiedreads.txt`: The whole kraken2 output for filtered reads.
- `<sample>.kraken2.report.txt`: Statistics on how many reads where assigned to which taxon/taxonomic group in the filtered reads.
- `<sample>.kraken2.report.txt`: Statistics on how many reads were assigned to which taxon/taxonomic group in the filtered reads.
- `isolated/`: Contains the isolated lines and ids for the taxon/taxa mentioned in the `tax2filter` parameter.
- `<sample>.classified.txt`: The whole kraken2 output for the taxon/taxa mentioned in the `tax2filter` parameter.
- `<sample>.ids.txt`: The ids from the whole kraken2 output assigned to the taxon/taxa mentioned in the `tax2filter` parameter.
- `removed/`: Contains the classification of the removed reads (post-filtering).
- `<sample>.classifiedreads.txt`: The whole kraken2 output for removed reads.
- `<sample>.kraken2.report.txt`: Statistics on how many reads where assigned to which taxon/taxonomic group in the removed reads.
- `<sample>.kraken2.report.txt`: Statistics on how many reads were assigned to which taxon/taxonomic group in the removed reads.
- `summary/`: Summary of the kraken2 process.
- `<sample>.kraken2_summary.tsv`: Contains three columns: column 1 is the sample name, column 2 the number of lines in the untouched kraken2 output and column 3 the number of lines in the isolated output.
- `taxonomy/`: Contains the list of taxa to filter/to assess for.
- `taxa_to_filter.txt`: Contains the taxon ids of all taxa to assess the data for or to filter out.
- `<sample>.classifiedreads.txt`: The whole kraken2 output for all reads.
- `<sample>.kraken2.report.txt`: Statistics on how many reads where assigned to which taxon/taxonomic group.
- `<sample>.kraken2.report.txt`: Statistics on how many reads were assigned to which taxon/taxonomic group.

</details>
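
For a quick sanity check of these outputs, the per-read assignments can be summarised from the shell (a sketch; the file name is a placeholder, column 3 is the taxID column of the standard kraken2 per-read output):

```bash
# Count reads per taxID in a kraken2 per-read classification file
# (columns: 1 = C/U flag, 2 = read ID, 3 = taxID, 4 = read length, 5 = LCA mapping)
cut -f3 sample.classifiedreads.txt | sort | uniq -c | sort -rn | head
```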

### bbduk

bbduk classifies the reads. The important files are `*.bbduk.log` and `ids/*.bbduk.txt`. `<sample>` can be replaced by `<sample>_longReads`, `<sample>_R1` or left as `<sample>` depending on the cases mentioned in [fastp](#fastp).
bbduk classifies the reads by kmer matching to a reference.
As soon as one of a read's k-mers is also present in the reference, the read is classified (see the sketch below).
The important files are `*.bbduk.log` and `ids/*.bbduk.txt`.
`<sample>` can be replaced by `<sample>_longReads`, `<sample>_R1` or left as `<sample>` depending on the cases mentioned in [fastp](#fastp).
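
Conceptually, this single-k-mer rule can be pictured with a small sketch (illustrative only, not bbduk's actual implementation; the k-mer length in the pipeline is set via `--bbduk_kmers`):

```groovy
// Toy illustration: a read is classified as soon as any one of its k-mers
// also occurs in the reference k-mer set.
def kmers = { String seq, int k ->
    (0..seq.length() - k).collect { seq.substring(it, it + k) } as Set
}

int k = 5 // deliberately tiny for the example; bbduk uses much longer k-mers
def referenceKmers = kmers('ACGTACGTACGTACGT', k)
boolean hit = kmers('TTTTACGTATTTT', k).any { it in referenceKmers }
println hit ? 'read classified (k-mer match)' : 'read not classified'
```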

<details markdown="1">
<summary>Output files</summary>
@@ -88,7 +91,7 @@ bbduk classifies the reads. The important files are `*.bbduk.log` and `ids/*.bbd

### classification

Either the merged IDs from [bbduk](#bbduk) and [kraken2](#kraken2) or the ones produced by one of the tools are shown in this folder. Also, the summary files of the classification step are shown.
Either the merged IDs from [bbduk](#bbduk) and [kraken2](#kraken2) or the ones produced by one of the tools are shown in this folder. Also, the summary files of the classification step are included.

<details markdown="1">
<summary>Output files</summary>
@@ -169,14 +172,15 @@ The pipeline can also generate input files for the following downstream
pipelines:

- [nf-core/taxprofiler](https://nf-co.re/taxprofiler)
- [nf-core/mag](https://nf-co.re/mag)

<details markdown="1">
<summary>Output files</summary>

- `downstream_samplesheets/`
- `taxprofiler.csv`: Filled out nf-core/taxprofiler `--input` csv with paths to reads relative to the results directory
- `mag-pe.csv`: Filled out nf-core/mag `--input` csv for paired-end reads with paths to reads relative to the results directory
- `mag-se.csv`: Filled out nf-core/mag `--input` csv for single-end reads with paths to reads relative to the results directory
- `taxprofiler.csv`: Filled out nf-core/taxprofiler `--input` csv with paths to reads saved in the results directory
- `mag-pe.csv`: Filled out nf-core/mag `--input` csv for paired-end reads with paths to reads saved in the results directory
- `mag-se.csv`: Filled out nf-core/mag `--input` csv for single-end reads with paths to reads saved in the results directory

</details>

@@ -186,7 +190,7 @@ They may not be complete (e.g. some columns may need to be manually filled in).
:::

:::warning
Detaxizer can process long-reads independent from short reads. `nf-core/mag@v3.1.0` can only take short, or short+long but not standalone long-reads as an input (this is being worked on). If you want to use the output of Detaxizer for Mag, you'll have to remove the standalone long reads from the `mag.csv` file.
Detaxizer can process long reads independently of short reads. nf-core/mag (as of 3.1.0) can only take short reads, or short + long reads, but not standalone long reads as input (this is being worked on). Standalone long reads will therefore not be included in the nf-core/mag samplesheets.
:::
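
For example, a generated samplesheet can be fed straight into the downstream pipeline (a sketch; the database sheet, profile and output directory are placeholders):

```bash
# Hedged example: use the detaxizer-generated samplesheet as taxprofiler input
nextflow run nf-core/taxprofiler \
    -profile docker \
    --input results/downstream_samplesheets/taxprofiler.csv \
    --databases databases.csv \
    --outdir taxprofiler_results
```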

### Pipeline information