applying all comments from Marius, changing all Readme's accordingly, and changing the Allele based workflow to be general not specific to Nanopore
EngyNasr committed Jun 19, 2024
1 parent 1c90f5d commit f099304
Showing 13 changed files with 38 additions and 102 deletions.
85 changes: 13 additions & 72 deletions workflows/microbiome/README.md
Original file line number Diff line number Diff line change
@@ -1,89 +1,30 @@
# Microbiome Workflows

The following workflows can be used directly for microbiome data analysis, pathogen detection, and tracking purposes. The workflows can also be adapted to any other sequencing technique.
In this directory, you will find a collection of workflows designed for microbiome data analysis, pathogen detection, and tracking. These workflows are ready to use and can be adapted for various sequencing techniques using Galaxy's customizable and automatable API.

To learn more about the following workflows and try them with real datasets, please check out our Microbiome tutorials on the Galaxy Training Network [GTN](https://training.galaxyproject.org/training-material/topics/microbiome/)
## Available Workflows

## Nanopore _Preprocessing_
- **Nanopore Preprocessing**

Before starting any analysis, it is always a good idea to assess the quality of your input data and to discard poor-quality base content by trimming and filtering reads.
- **Taxonomy Profiling and Visualisation with Krona**

Generally, we are not interested in the host sequences, but rather only those originating from the pathogen itself. It is important to get rid of all host sequences and to only retain sequences that might include a pathogen, both in order to speed up further steps and to avoid host sequences compromising the analysis.
- **Gene-based Pathogen Identification**

### Input Datasets
- **Allele-based Pathogen Identification**

- Collection of sequenced Nanopore reads of all samples to be analysed in a `fastqsanger` or `fastqsanger.gz` format.
- **Pathogen Detection: PathoGFAIR Samples Aggregation and Visualisation**

### Output Datasets
## Getting Started

- Collection of Pre-Processed Sequenced reads of all samples, ready for further analysis with the other workflows, in a `fastqsanger` or `fastqsanger.gz` format.
To learn more about these workflows and to try them with real datasets, please visit our Microbiome tutorials on the Galaxy Training Network (GTN):

- Tables indicating total number of reads before and after host sequences trimming, and the host sequences percentages found in each sample.
[Microbiome Tutorials on GTN](https://training.galaxyproject.org/training-material/topics/microbiome/)

This workflow is available for trial through our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)

## _Taxonomy Profiling_ and Visualisation with Krona
## Dedicated Training Material

In this workflow, we identify the different organisms found in our samples by assigning taxonomy levels to the reads starting from the kingdom level down to the species level and visualise the result.
The workflows for **Nanopore Preprocessing**, **Taxonomy Profiling and Visualisation with Krona**, **Gene-based Pathogen Identification**, **Allele-based Pathogen Identification**, and **Pathogen Detection: PathoGFAIR Samples Aggregation and Visualisation** can all be tried out in dedicated training material on GTN for foodborne pathogen detection and tracking:

It is important to identify the likely species of a possible pathogen: this narrows the investigation and can reveal multiple pathogenic infections, if any exist.
[GTN Tutorial for Foodborne Pathogen Detection and Tracking](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)

For taxonomy profiling, the Kraken2 tool is used along with one of its standard databases available on Galaxy; you can freely choose among the Kraken2 databases based on your input datasets. For visualisation, multiple tools can be used: Krona pie charts (the default in this workflow), the Phinch interactive tool, Pavian, etc.

### Input Datasets
- Collection of Pre-Processed Sequenced reads of all samples, ready for further analysis with the other workflows, in a `fastqsanger` or `fastqsanger.gz` format, the output of **Nanopore Preprocessing** workflow.

### Output Datasets
- Taxonomy profiling Tabular file, visualisation figures and interactive pie charts.

This workflow is available for trial through our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)

## Gene-based Pathogen Identification

In this workflow, we determine whether the samples are pathogenic or not by looking for genes known to be linked to pathogenicity or to the pathogenic character of the organism.

- Virulence Factor (VF): gene products, usually proteins, involved in pathogenicity. By identifying them, we can call a pathogen and its severity level

- Antimicrobial Resistance genes (AMR).

These types of genes confer resistance through three fundamental mechanisms: enzymatic degradation of antibacterial drugs, alteration of bacterial proteins that are antimicrobial targets, and changes in membrane permeability to antibiotics, which will lead to not altering the target site and spread throughout the pathogenic bacteria, decreasing the overall fitness of the host.

In this workflow we:

1. Perform genome assembly to get contigs, i.e. longer sequences, using metaflye (Kolmogorov et al. 2020), then polish the assembly using the medaka consensus pipeline and visualise the assembly graph using Bandage Image (Wick et al. 2015)
2. Generate reports with AMR genes and VF using ABRicate

### Input Datasets
- Collection of Pre-Processed Sequenced reads of all samples, ready for further analysis with the other workflows, in a `fastqsanger` or `fastqsanger.gz` format, the output of **Nanopore Preprocessing** workflow.

### Output Datasets
- FASTA and Tabular files to track genes and visualise our pathogenic identification throughout all samples.

This workflow is available for trial through our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)

## Nanopore Allele-based Pathogen Identification

This workflow identifies pathogens using an allelic approach, detecting Single Nucleotide Polymorphisms (SNPs) to track emerging variants, i.e. markers showing evolutionary histories of homogeneous strains. This process includes SNP calling, aimed at identifying novel pathogen strains and elucidating discrepancies compared to reference sequences, thereby facilitating the tracking of emerging variants. Within this workflow, both variants and SNPs are discerned, serving as crucial elements for subsequent pathogen identification and variant tracking purposes.

To perform the mapping step before variant identification, we used the Minimap2 tool, specifically designed for Nanopore reads. If you're working with Illumina data, you can substitute Minimap2 with Bowtie2.

### Input Datasets
- Collection of Pre-Processed Sequenced reads of all samples, ready for further analysis with the other workflows, in a `fastq or fastq.gz` format, the output of **Nanopore Preprocessing** workflow.
- A reference genome of the tested pathogen.

### Output Datasets
- VCF files indicating identified variants and SNPs, BAM files with mapping results, and Tabular files with mapping depth and coverage calculations.

This workflow is available for trial through our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)

## Pathogen Detection: PathoGFAIR Samples Aggregation and Visualisation

In this workflow, we will aggregate and use the results from 3 workflows (**Nanopore Preprocessing**, **Gene-based Pathogen Identification** and **Nanopore Allele-based Pathogen Identification**) to help track pathogens among samples and visualise all performed analyses by:

1. Drawing a presence-absence heatmap of the identified VF genes within all samples to visualise in which samples these genes can be found.
2. Drawing a phylogenetic tree for each pathogenic gene detected, relating the contigs of the samples in which this gene is found.
3. Plotting QC reads, host reads, mapping coverage and depth, and SNP analysis.

With these types of visualisations, we get an overview of all samples and genes, as well as how samples are related to each other and which pathogenic genes they share. Given the time and location of the sampling, one can use these graphs to identify where and when the contamination occurred among the different samples.

This workflow is available for trial through our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ workflows:
- name: Engy Nasr
orcid: 0000-0001-9047-4215
url: https://orcid.org/0000-0001-9047-4215
- name: "B\xE9r\xE9nice Batut"
- name: "Bérénice Batut"
orcid: 0000-0001-9852-1987
url: https://orcid.org/0000-0001-9852-1987
- name: Paul Zierep
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# _Gene-based Pathogen Identification_
# Gene-based Pathogen Identification

In this workflow, we determine whether the samples are pathogenic or not by looking for genes known to be linked to pathogenicity or to the pathogenic character of the organism.

@@ -14,10 +14,9 @@ In this workflow we:
2. Generate reports with AMR genes and VF using ABRicate

## Input Datasets
- Collection of Pre-Processed Sequenced reads of all samples, ready for further analysis with the other workflows, in a `fastq or fastq.gz` format, the output of **Nanopore _Preprocessing_** workflow.
- Collection of Pre-Processed Sequenced reads of all samples, ready for further analysis with the other workflows, in a `fastqsanger` or `fastqsanger.gz` format, the output of **Nanopore Preprocessing** workflow.

## Output Datasets
- FASTA and Tabular files to track genes and visualise our pathogenic identification throughout all samples.

This workflow is available for trial through our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)

This workflow is available to try via our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)
Original file line number Diff line number Diff line change
@@ -3,9 +3,9 @@ workflows:
- name: main
subclass: Galaxy
publish: true
primaryDescriptorPath: /Nanopore-Allele-based-Pathogen-Identification.ga
primaryDescriptorPath: /Allele-based-Pathogen-Identification.ga
testParameterFiles:
- /Nanopore-Allele-based-Pathogen-Identification-tests.yml
- /Allele-based-Pathogen-Identification-tests.yml
authors:
- name: Engy Nasr
orcid: 0000-0001-9047-4215
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
- doc: Test outline for Nanopore-Allele-based-Pathogen-Identification
- doc: Test outline for Allele-based-Pathogen-Identification
job:
reference_genome_of_tested_strain:
class: File
Original file line number Diff line number Diff line change
@@ -101,9 +101,9 @@
"format-version": "0.1",
"license": "MIT",
"release": "0.1",
"name": "Nanopore Allele-based Pathogen Identification",
"name": "Allele-based Pathogen Identification",
"report": {
"markdown": "# Nanopore - Allele based Pathogen Identification Workflow Report\nBelow are the results for the Allele based Pathogenic Identification Workflow\n\nThis workflow was run on:\n\n```galaxy\ngenerate_time()\n```\n\nWith Galaxy version:\n\n```galaxy\ngenerate_galaxy_version()\n```\n\n## Workflow Inputs\nThe Perprocessing workflow main output (Collection of all samples reads after quality retaining and hosts filtering), and a FASTA file of the reference genome of the main Pathogen identified in the Gene based Pathogen Identification workflow, or per-known to the user.\n\n## Workflow Output: \n\n### All variants found per sample against the reference genome\n\n```galaxy\nhistory_dataset_display(output=\"extracted_fields_from_the_vcf_output\")\n```\n\n### Number of variants per sample\n\n```galaxy\nhistory_dataset_display(output=\"number_of_variants_per_sample\")\n```\n\n### Mapping mean depth per sample\n\n```galaxy\nhistory_dataset_display(output=\"mapping_mean_depth_per_sample\")\n```\n\n### Mapping coverage per sample\n\n```galaxy\nhistory_dataset_display(output=\"mapping_coverage_percentage_per_sample\")\n```\n"
"markdown": "# Allele based Pathogen Identification Workflow Report\nBelow are the results for the Allele based Pathogenic Identification Workflow\n\nThis workflow was run on:\n\n```galaxy\ngenerate_time()\n```\n\nWith Galaxy version:\n\n```galaxy\ngenerate_galaxy_version()\n```\n\n## Workflow Inputs\nThe Perprocessing workflow main output (Collection of all samples reads after quality retaining and hosts filtering), and a FASTA file of the reference genome of the main Pathogen identified in the Gene based Pathogen Identification workflow, or per-known to the user.\n\n## Workflow Output: \n\n### All variants found per sample against the reference genome\n\n```galaxy\nhistory_dataset_display(output=\"extracted_fields_from_the_vcf_output\")\n```\n\n### Number of variants per sample\n\n```galaxy\nhistory_dataset_display(output=\"number_of_variants_per_sample\")\n```\n\n### Mapping mean depth per sample\n\n```galaxy\nhistory_dataset_display(output=\"mapping_mean_depth_per_sample\")\n```\n\n### Mapping coverage per sample\n\n```galaxy\nhistory_dataset_display(output=\"mapping_coverage_percentage_per_sample\")\n```\n"
},
"steps": {
"0": {
@@ -1320,7 +1320,6 @@
"name:Collection",
"name:microGalaxy",
"name:PathoGFAIR",
"name:Nanopore",
"name:IWC"
],
"uuid": "deb94861-ed4d-41fe-881a-8565c6b8fa82",
Original file line number Diff line number Diff line change
@@ -1,14 +1,12 @@
# Nanopore _Allele-based Pathogen Identification_
# Allele-based Pathogen Identification

This workflow identifies pathogens using an allelic approach, detecting Single Nucleotide Polymorphisms (SNPs) to track emerging variants, i.e. markers showing evolutionary histories of homogeneous strains. This process includes SNP calling, aimed at identifying novel pathogen strains and elucidating discrepancies compared to reference sequences, thereby facilitating the tracking of emerging variants. Within this workflow, both variants and SNPs are discerned, serving as crucial elements for subsequent pathogen identification and variant tracking purposes.

To perform the mapping step before variant identification, we used the Minimap2 tool, specifically designed for Nanopore reads. If you're working with Illumina data, you can substitute Minimap2 with Bowtie2.
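
As a rough illustration of the mapper swap described above, the following sketch builds the command line for either tool outside Galaxy. It is not taken from the workflow itself: the function name `build_mapping_command`, the platform labels, and the file names are assumptions made for this example, and the workflow's actual tool parameters may differ. Only widely documented options are used (`minimap2 -a -x map-ont -o`, `bowtie2 -x -U -S`).

```python
from typing import List


def build_mapping_command(platform: str, reference: str, reads: str, out_sam: str) -> List[str]:
    """Return an argv list for the read-mapping step (hypothetical helper).

    For "nanopore", `reference` is a FASTA file. For "illumina", `reference`
    must be the prefix of an index built beforehand with bowtie2-build.
    """
    if platform == "nanopore":
        # -a: emit SAM, -x map-ont: Oxford Nanopore preset, -o: output file
        return ["minimap2", "-a", "-x", "map-ont", "-o", out_sam, reference, reads]
    if platform == "illumina":
        # -x: index prefix, -U: unpaired reads, -S: SAM output file
        return ["bowtie2", "-x", reference, "-U", reads, "-S", out_sam]
    raise ValueError(f"unsupported platform: {platform!r}")


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without the tools installed.
    print(" ".join(build_mapping_command("nanopore", "ref.fasta", "reads.fastq.gz", "mapped.sam")))
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where the chosen mapper is installed.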

## Input Datasets
- Collection of Pre-Processed Sequenced reads of all samples, ready for further analysis with the other workflows, in a `fastq or fastq.gz` format, the output of **Nanopore _Preprocessing_** workflow.
- Collection of Pre-Processed Sequenced reads of all samples, ready for further analysis with the other workflows, in a `fastqsanger` or `fastqsanger.gz` format, the output of **Nanopore Preprocessing** workflow.
- A reference genome of the tested pathogen.

## Output Datasets
- VCF files indicating identified variants and SNPs, BAM files with mapping results, and Tabular files with mapping depth and coverage calculations.

This workflow is available for trial through our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)
This workflow is available to try via our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)
Original file line number Diff line number Diff line change
@@ -7,7 +7,7 @@ workflows:
testParameterFiles:
- /Nanopore-Pre-Processing-tests.yml
authors:
- name: "B\xE9r\xE9nice Batut"
- name: "Bérénice Batut"
orcid: 0000-0001-9852-1987
url: https://orcid.org/0000-0001-9852-1987
- name: Engy Nasr
8 changes: 4 additions & 4 deletions workflows/microbiome/nanopore-pre-processing/README.md
Original file line number Diff line number Diff line change
@@ -1,17 +1,17 @@
# Nanopore _Preprocessing_
# Nanopore Preprocessing

Before starting any analysis, it is always a good idea to assess the quality of your input data and to discard poor-quality base content by trimming and filtering reads.

Generally, we are not interested in the host sequences, but rather only those originating from the pathogen itself. It is important to get rid of all host sequences and to only retain sequences that might include a pathogen, both in order to speed up further steps and to avoid host sequences compromising the analysis.

## Input Datasets

- Collection of sequenced Nanopore reads of all samples to be analysed in a `fastq or fastq.gz` format.
- Collection of sequenced Nanopore reads of all samples to be analysed in a `fastqsanger` or `fastqsanger.gz` format.
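
For readers unfamiliar with the datatype name: Galaxy's `fastqsanger` refers to FASTQ files whose quality line uses the Sanger-style Phred+33 encoding. The sketch below shows how such a quality string decodes to numeric scores. The helper names `phred33_scores` and `mean_quality` are made up for this illustration and are not part of the workflow.

```python
from typing import List


def phred33_scores(quality: str) -> List[int]:
    """Decode a Phred+33 quality string into integer Phred scores."""
    # Each character encodes one base quality as its ASCII code minus 33.
    return [ord(char) - 33 for char in quality]


def mean_quality(quality: str) -> float:
    """Arithmetic mean of the decoded Phred scores of one read."""
    scores = phred33_scores(quality)
    if not scores:
        raise ValueError("empty quality string")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # '!' is ASCII 33 (score 0) and 'I' is ASCII 73 (score 40).
    print(phred33_scores("!I"), mean_quality("!I"))
```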

## Output Datasets

- Collection of Pre-Processed Sequenced reads of all samples, ready for further analysis with the other workflows, in a `fastq or fastq.gz` format.
- Collection of Pre-Processed Sequenced reads of all samples, ready for further analysis with the other workflows, in a `fastqsanger` or `fastqsanger.gz` format.

- Tables indicating total number of reads before and after host sequences trimming, and the host sequences percentages found in each sample.
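
The host percentage reported in these tables can be thought of as the share of reads removed by host filtering. The sketch below assumes that definition (removed reads divided by reads before filtering); the workflow may compute it differently, and the function name `host_read_percentage` is hypothetical.

```python
def host_read_percentage(reads_before: int, reads_after: int) -> float:
    """Percentage of reads removed as host sequences (assumed definition)."""
    if reads_before <= 0:
        raise ValueError("reads_before must be positive")
    if not 0 <= reads_after <= reads_before:
        raise ValueError("reads_after must be between 0 and reads_before")
    # Reads that disappeared during host filtering, relative to the input.
    return 100.0 * (reads_before - reads_after) / reads_before


if __name__ == "__main__":
    # Illustrative numbers only, not results from any real sample.
    print(host_read_percentage(1000, 750))
```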

This workflow is available for trial through our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)
This workflow is available to try via our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ workflows:
- name: Engy Nasr
orcid: 0000-0001-9047-4215
url: https://orcid.org/0000-0001-9047-4215
- name: "B\xE9r\xE9nice Batut"
- name: "Bérénice Batut"
orcid: 0000-0001-9852-1987
url: https://orcid.org/0000-0001-9852-1987
- name: Paul Zierep
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
# Pathogen Detection: _PathoGFAIR Samples Aggregation and Visualisation_
# Pathogen Detection: PathoGFAIR Samples Aggregation and Visualisation

In this workflow, we will aggregate results and use the results from 3 workflows (**Nanopore _Preprocessing_**, **_Gene-based Pathogen Identification_** and **Nanopore _Allele-based Pathogen Identification_**) to help track pathogens among samples and visualise all performed analysis by:
In this workflow, we will aggregate and use the results from 3 workflows (**Nanopore Preprocessing**, **Gene-based Pathogen Identification** and **Nanopore Allele-based Pathogen Identification**) to help track pathogens among samples and visualise all performed analyses by:

1. Drawing a presence-absence heatmap of the identified VF genes within all samples to visualise in which samples these genes can be found.
2. Drawing a phylogenetic tree for each pathogenic gene detected, relating the contigs of the samples in which this gene is found.
3. Plotting QC reads, host reads, mapping coverage and depth, and SNP analysis.

With these types of visualisations, we get an overview of all samples and genes, as well as how samples are related to each other and which pathogenic genes they share. Given the time and location of the sampling, one can use these graphs to identify where and when the contamination occurred among the different samples.
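
To make the presence-absence idea concrete, here is a minimal sketch of the matrix that underlies such a heatmap. It assumes the gene hits are available as `(sample, gene)` pairs; that input shape, the function name `presence_absence`, and the sample and gene labels are all placeholders for this example rather than the workflow's actual data model.

```python
from typing import Dict, Iterable, List, Set, Tuple


def presence_absence(
    hits: Iterable[Tuple[str, str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Build a gene-by-sample 0/1 matrix from (sample, gene) hits."""
    genes_by_sample: Dict[str, Set[str]] = {}
    for sample, gene in hits:
        genes_by_sample.setdefault(sample, set()).add(gene)

    samples = sorted(genes_by_sample)
    genes = sorted({gene for found in genes_by_sample.values() for gene in found})
    # One row per gene, one column per sample; 1 means the gene was found.
    matrix = [
        [1 if gene in genes_by_sample[sample] else 0 for sample in samples]
        for gene in genes
    ]
    return samples, genes, matrix


if __name__ == "__main__":
    # Placeholder labels, not real results.
    example_hits = [("sample1", "geneB"), ("sample2", "geneB"), ("sample2", "geneA")]
    samples, genes, matrix = presence_absence(example_hits)
    print(samples, genes, matrix)
```

Samples that share a row of ones carry the same gene, which is the pattern the heatmap makes visible at a glance.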

This workflow is available for trial through our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)
This workflow is available to try via our [GTN tutorial](https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html)
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ workflows:
- name: Engy Nasr
orcid: 0000-0001-9047-4215
url: https://orcid.org/0000-0001-9047-4215
- name: "B\xE9r\xE9nice Batut"
- name: "Bérénice Batut"
orcid: 0000-0001-9852-1987
url: https://orcid.org/0000-0001-9852-1987
- name: Paul Zierep