Nextflow pipeline to call variants from Nanopore FASTQ files from bacterial clones relative to a wildtype control.
The pipeline broadly recapitualtes, where possible, the GATK best practices for germline short variant calling.
For each sample:
- Quality Trim reads using
cutadapt
. - Map to genome FASTA using
minimap2
. - Mark duplicates with
picard MarkDuplicates
. - Re-align in "active" regions and calculate variant likelihood with GATK
HaplotypeCaller
.
Then merge resulting GVCFs using GATK CombineGVCFs
. With the combined variant calls:
- Joint genoptype with GATK
GenotypeGVCFs
. - Filter variants using GATK
VariantFiltration
. - Annotate variant effects using
snpEff
. - Filter out variants where all samples are identical to the wildtype control, which is assumed to be the
sample_id
which is alphabetically last. - Write to output TSV.
- Get FASTQ quality metrics with
fastqc
. - Generate alignment statistics with
samtools stats
. - Map to genome FASTA using
bowtie2
becauseminimap2
logs are not compatible withmultiqc
. This way, some kind of alignment metrics are possible. - Compile the logs of processing steps into an HTML report with
multiqc
.
- Nextflow
- At Crick, activate using
module load Nextflow
- Otherwise, see below
- At Crick, activate using
conda
ormamba
- If possible, use
mamba
because it will be faster. - At Crick, activate using
module load Anaconda3
- If possible, use
- GATK
- Picard
- snpEff
You also need the genome FASTA and GFF annotations for the bacteria you are sequencing. These can be obtained from NCBI Nucleotide:
- Search for your strain of interest, and open its main page
- On the right-hand side, click
Customize view
, thenCustomize
and checkShow sequence
. Finally, clickUpdate view
. You may have to wait a few minute while the sequence downloads. - Click
Send to: > Complete record > File > FASTA > Create file
- Save the files to a path which you provide as
--genome_fasta
below.
If it's your first time using Nextflow on your system, you can install it using conda
:
conda install -c bioconda nextflow
You may need to set the NXF_HOME
environment variable. For example,
mkdir -p ~/.nextflow
export NXF_HOME=~/.nextflow
To make this a permanent change, you can do something like the following:
mkdir -p ~/.nextflow
echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile
source ~/.bash_profile
Make sure you have GATK, Picard, and snpEff on your system, and provide their paths as parameters on the command line or in your nextflow.config
file.
Make a sample sheet (see below) and, optionally, a nextflow.config
file in the directory where you want the pipeline to run. Then run Nextflow.
nextflow run scbirlab/nf-ont-call-variants
Each time you run the pipeline after the first time, Nextflow will use a locally-cached version which will not be automatically updated. If you want to ensure that you're using the very latest version of the pipeline, use the -latest
flag.
nextflow run scbirlab/nf-ont-call-variants -latest
If you want to run a particular tagged version of the pipeline, such as v0.0.1
, you can do so using
nextflow run scbirlab/nf-ont-call-variants -r v0.0.1
For help, use nextflow run scbirlab/nf-ont-call-variants --help
.
The first time you run the pipeline on your system, the software dependencies in environment.yml
will be installed. This may take several minutes.
The following parameters are required:
sample_sheet
: path to a CSV with information about the samples and FASTQ files to be processedgatk_path
: path to GATK executablepicard_path
: path to Picard executablesnpeff_path
: path to snpEff executablegenome_fasta
: path to reference genome FASTAsnpeff_database
: name of snpEff database to use for annotation. This should be derived from the same assembly asgenome_fasta
. You can get a list of databases usingjava -jar snpEff database
. Database names often end in the assembly name, such asgca_000015005
, which you can check matches yourgenome_fasta
The following parameters have default values which can be overridden if necessary.
trim_qual = 10
: Forcutadapt
, the minimum Phred score for trimming 3' callsmin_length = 10
: Forcutadapt
, the minimum trimmed length of a read. Shorter reads will be discarded
The parameters can be provided either in the nextflow.config
file or on the nextflow run
command.
Here is an example of the nextflow.config
file:
params {
gatk_path = "/path/to/gatk"
picard_path = "/path/to/picard.jar"
snpeff_path = "/path/to/snpEff.jar"
sample_sheet = "/path/to/sample-sheet.csv"
genome_fasta = "/path/to/MsmMC2155-CP000480.1.fasta"
snpeff_database = "Mycolicibacterium_smegmatis_mc2_155_gca_000015005"
}
Alternatively, you can provide the parameters on the command line:
nextflow run scbirlab/nf-ont-call-variants \
--sample_sheet /path/to/sample-sheet.csv \
--gatk_path /path/to/gatk \
--picard_path /path/to/picard.jar \
--snpeff_path /path/to/snpEff.jar \
--genome_fasta /path/to/MsmMC2155-CP000480.1.fasta \
--snpeff_database Mycolicibacterium_smegmatis_mc2_155_gca_000015005
The sample sheet is a CSV file providing information about which FASTQ files belong to which sample.
The file must have a header with the column names below, and one line per sample to be processed.
sample_id
: the unique name of the sample. The wildtype must be named so that it is alphabetically lastreads
: path to compressed FASTQ files derived from Nanopore sequencing
Here is an example of the sample sheet:
sample_id | reads |
---|---|
wt | /path/to/reads/WT/raw_reads.fastq.gz |
mut1 | /path/to/reads/mut1/raw_reads.fastq.gz |
Outputs are saved in the same directory as sample_sheet
. They are organised under three directories:
processed
: FASTQ files and logs resulting from alignmentstables
: tables and VCF files corresponding to variant callsmultiqc
: HTML report on processing steps
Add to the issue tracker.
Here are the help pages of the software used by this pipeline.