SamnSero was developed to streamline the analysis of Nanopore sequencing data from bacterial isolates and metagenomic samples. The workflow serves three primary purposes:
- Long read (meta-)genome assembly
- Data quality assessment and control
- Post-assembly analysis, e.g. typing and annotation
Originally designed as a post-run analysis tool, it has since evolved to leverage the live basecalling feature of Oxford Nanopore Technologies (ONT) to deliver real-time data processing. Motivated by the need to shorten the time to genomic results and bring greater transparency to sequencing experiments, the pipeline can now be executed in parallel with a sequencing run to monitor read quality, genome assembly quality, taxonomic abundance and typing results as a function of time (Note: for typing, only Salmonella in silico serotyping is supported at the moment!).
The pipeline can be run either directly from the command line or through EPI2ME. For command-line execution, use Nextflow's built-in workflow management feature to install SamnSero:
# Install pre-requisites
- Nextflow >= 23.0
- Docker or Singularity
- Git
# Get the latest release tag of the pipeline
ver=$(git ls-remote -t https://github.com/jimmyliu1326/SamnSero_Nextflow.git | cut -f3 -d'/' | sort -r | head -n 1)
# Install the latest version of SamnSero
nextflow pull -hub github jimmyliu1326/SamnSero_Nextflow -r $ver
# Print pipeline help to validate installation
nextflow run jimmyliu1326/SamnSero_Nextflow -r $ver --help
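If you want to confirm the pre-requisites before installing, a quick version check of each tool is usually enough (substitute singularity --version if you use Singularity instead of Docker):

```bash
# Confirm the pre-requisites are installed and on PATH
nextflow -version   # should report version >= 23.0
docker --version    # or: singularity --version
git --version
```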
For a more user-friendly graphical interface, we recommend installing the pipeline in EPI2ME. After clicking on Import Workflow in the application, enter the full URL of this GitHub repository when prompted.
Required Parameters:
- --input : comma-delimited sample sheet
- -profile : docker/singularity/slurm, or a combination of multiple profiles with the values delimited by comma
To initiate a SamnSero run, you must first prepare a headerless samples.csv with two columns that indicate sample IDs and paths to DIRECTORIES containing .FASTQ or .FASTQ.GZ files.
Example samples.csv
Sample_1,/path/to/data/Sample_1/
Sample_2,/path/to/data/Sample_2/
Sample_3,/path/to/data/Sample_3/
Given the samples.csv above, your data directory should be set up like the following:
/path/to/data/
├── Sample_1
│   └── Sample_1.fastq.gz
├── Sample_2
│   └── Sample_2.fastq.gz
└── Sample_3
    ├── Sample_3a.fastq
    └── Sample_3b.fastq
Note:
- The sequencing data for each sample must be placed within a unique subdirectory
- The names of the sample subdirectories do not have to match the sample IDs listed in the samples.csv
- You can have multiple .FASTQ files associated with a single sample. The pipeline will aggregate all .FASTQ files within the same directory before proceeding (for Nanopore data only)
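If your per-sample subdirectories already follow a layout like the one above, the sample sheet can be generated with a short shell loop instead of being written by hand. This is a minimal sketch that assumes each subdirectory name doubles as the sample ID:

```bash
# Build a headerless samples.csv from per-sample subdirectories
# under /path/to/data/ (subdirectory name == sample ID)
for d in /path/to/data/*/; do
  printf '%s,%s\n' "$(basename "$d")" "$d"
done > samples.csv
```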
Once you have set up the data directory as described and created the samples.csv, you are ready to run the pipeline!
Required Parameters (real-time analysis):
- --watchdir : directory where FASTQ files are deposited in real-time
- -profile : docker/singularity/slurm, or a combination of multiple profiles with the values delimited by comma
The pipeline assumes watchdir follows the nested output directory structure of ONT basecallers: [run_id]/**/[qscore_pass_fail]/[barcode_arrangement]/. The FASTQ data can be nested in any number of subdirectories under the [run_id] parent directory; by default, the pipeline performs a recursive search for FASTQ data in watchdir. FASTQ data can be uncompressed or gzip-compressed.
An example watchdir directory structure should resemble the following:
/path/to/watchdir/
├── ANY_NUMBER_OF_SUBDIRECTORIES/
│   ├── pod5/
│   ├── fastq_fail/
│   └── fastq_pass/
│       ├── barcode01/
│       │   ├── batch01.fastq.gz
│       │   └── batch02.fastq.gz
│       ├── barcode02/
│       │   └── batch03.fastq.gz
│       └── barcode03/
│           └── batch04.fastq.gz
By default, the pipeline monitors for the creation or modification of FASTQ data passing the quality filter, i.e. data written to the pass or fastq_pass folder. The file monitoring behavior can be amended using --watch_mode, which supports three file event types: create, modify, delete. The default value for watch_mode is create,modify.
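To preview which files the default watch settings would act on in a layout like the one above, a recursive listing restricted to the pass folders can serve as a rough sanity check (a sketch only; the pipeline's internal file matching may differ):

```bash
# List quality-passing FASTQ files (uncompressed or gzip-compressed)
# under the pass/fastq_pass folders of the watch directory
find /path/to/watchdir -type f \
  \( -path '*/fastq_pass/*' -o -path '*/pass/*' \) \
  \( -name '*.fastq' -o -name '*.fastq.gz' \)
```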
For real-time analysis, the pipeline internally attempts to optimize resource allocation by limiting the resource consumption of compute-heavy processes. This optimization helps guarantee access to real-time feedback in the form of technical reports by ensuring that there are always resources to spare for data summary and results reporting. The resource optimization depends on the value of watch_cpus, which specifies the maximum number of CPU cores available for analysis. The recommended watch_cpus for real-time analysis is 64 or greater.
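Putting the real-time options together, an invocation could look like the following (illustrative only; the values shown are the defaults and recommendations described above, and -r/--out_dir follow the post-run examples below):

```bash
nextflow run jimmyliu1326/SamnSero_Nextflow -r [vers] \
  --watchdir /path/to/watchdir \
  --watch_mode create,modify \
  --watch_cpus 64 \
  --out_dir results
```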
The pipeline executes processes in Docker containers by default. Singularity containers and process management by Slurm are also supported. See below for how to run the pipeline using different containerization technologies.
Docker (Default)
Without specifying a profile, Docker containers are used by default.
nextflow run jimmyliu1326/SamnSero_Nextflow -r [vers] --input samples.csv --out_dir results
Singularity
Add -profile singularity to use Singularity containers.
nextflow run jimmyliu1326/SamnSero_Nextflow -r [vers] --input samples.csv --out_dir results -profile singularity
Slurm + Singularity
For environments with Slurm support, append -profile slurm,singularity and specify your Slurm account using the --account parameter.
nextflow run jimmyliu1326/SamnSero_Nextflow -r [vers] --input samples.csv --out_dir results --account my-slurm-account -profile slurm,singularity
Both ONT long-read and Illumina paired-end short-read genome assemblies are supported; the mode can be toggled using --seq_platform nanopore/illumina.
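For example, to assemble Illumina paired-end data instead of Nanopore reads (illustrative only; note that the sample sheet layout for paired-end reads may differ from the directory-based Nanopore sheet shown earlier):

```bash
nextflow run jimmyliu1326/SamnSero_Nextflow -r [vers] \
  --input samples.csv --out_dir results \
  --seq_platform illumina
```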
For metagenomic data, use --meta on to generate metagenome-assembled genomes (MAGs). Post-assembly contig binning is currently under active development. We are actively working towards resolving strain-level haplotypes and reconstructing the genomes of individual strains, which would enable the decomposition of mixed populations of the same species.
There are currently no plans to support hybrid genome assembly.
Sequencing data quality assessment and control can be invoked by the --qc option.
The sequencing data QA/QC involves the following modules:
- nanoq: Read quality/length filtering
- centrifuge: Taxonomic classification
- krona: Visualization of microbial composition
- QUAST: Genome assembly statistics summary
- CheckM: Single copy marker gene analysis
- nanocomp: Raw read summary statistics report
To use the --qc option, a pre-downloaded Centrifuge database is required, which can be downloaded from here. After downloading the tar file, decompress it into a directory and supply the path to that directory via the --centrifuge option along with --qc.
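As an illustration of wiring up the QC options (the archive name below is a placeholder for whichever Centrifuge index you downloaded, and --qc is assumed to behave as a simple switch):

```bash
# Decompress the pre-downloaded Centrifuge database into a directory
mkdir -p centrifuge_db
tar -xf centrifuge_db.tar.gz -C centrifuge_db   # placeholder archive name

# Run the pipeline with QC enabled, pointing --centrifuge at that directory
nextflow run jimmyliu1326/SamnSero_Nextflow -r [vers] \
  --input samples.csv --out_dir results \
  --qc --centrifuge centrifuge_db
```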