This is a compilation of the scripts I used for genome assembly with long-read sequencing data (ONT).

Workflow:
- Basecalling (Guppy)
- Checking the quality of our data (pycoQC)
- Filtering reads (Nanolyse)
- Removing adapters (Porechop)
- Quality check (nanoQC)
- Genome assemblers
  - Flye
  - Canu
  - Redbean
- Assembly polishing with Racon
- Polishing with Medaka
- Polishing with nanopolish
- Scaffolding with long reads using LRScaf
- Gap closing with LR_Gapcloser
- Polishing the assembly with RNA-seq data
- Purge Haplotigs
We sequenced the genome on a MinION without live basecalling, so we are going to run Guppy (v5.0.7) on GPU (on NeSI) for basecalling. We will use `--disable_qscore_filtering` because we don't want the reads separated into pass and fail folders based on their quality scores; instead we get all the FASTQ files in one folder and decide on quality filtering afterwards.
Here is the script to run:

```
sbatch guppy.sl
```
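For reference, `guppy.sl` might look something like the minimal sketch below. The SLURM resource lines, module name, paths, and the flow cell/kit identifiers are all placeholders you would adjust for your own run; only `--disable_qscore_filtering` is the flag discussed above.

```shell
#!/bin/bash -e
#SBATCH --job-name=guppy
#SBATCH --gpus-per-node=1
#SBATCH --time=24:00:00

# Module name/version is an assumption; check with `module spider guppy`
module load Guppy/5.0.7

# Basecall raw fast5 files. --disable_qscore_filtering keeps all reads
# in one folder instead of splitting them into pass/fail by quality.
# Flow cell and kit below are placeholders: use the ones from your run.
guppy_basecaller \
    --input_path fast5/ \
    --save_path guppy_out/ \
    --flowcell FLO-MIN106 \
    --kit SQK-LSK109 \
    --device auto \
    --disable_qscore_filtering
```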
We can check the quality of our data using pycoQC. We can install it with Miniconda; let's activate our Miniconda with `conda activate` from the miniconda2/bin directory, then:

```
conda create -n pycoQC python=3.6
# or just
conda install -c bioconda pycoqc  # pycoQC is built on Python 3, so it needs a Python 3 environment
```
pycoQC takes the `sequencing_summary.txt` file generated by Guppy (or another basecaller) as input.
Script to run:

```
sbatch pycoqc.sl
```
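A minimal sketch of what `pycoqc.sl` might contain, assuming the conda environment created above and the Guppy output path used earlier (both are placeholders):

```shell
#!/bin/bash -e
#SBATCH --job-name=pycoqc
#SBATCH --time=00:30:00

# Activate the env where pycoQC was installed (path/name is an assumption)
source activate pycoQC

# Build the interactive HTML report from Guppy's sequencing summary
pycoQC -f guppy_out/sequencing_summary.txt -o pycoQC_report.html
```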
pycoQC produces an interactive HTML report, like the one here: PycoQC Report
We used DNA CS (lambda phage DNA) during nanopore sequencing library preparation, so we will use NanoLyse to remove the lambda DNA from our FASTQ files. We can install NanoLyse in a Miniconda environment; see the NanoLyse page for how.
Let's run this script to process our data with NanoLyse:

```
sbatch nanolyse.sl
```
This will remove all the lambda DNA from our data.

I have had a problem running NanoLyse: it couldn't find the lambda DNA reference file `dna_cs.fasta` to filter against. In that case, here is the lambda DNA sequence; copy it into your working directory and slightly change the `nanolyse.sl` script as:

```
cat Nem.ont.merged.fastq | NanoLyse --reference ./dna_cs.fasta | gzip > m.neg_nanopore_filtered.fastq.gz
```
We can use Porechop to remove adapters. Adapters at the ends of reads are trimmed off; if an adapter is found in the middle of a read, the read is treated as chimeric and chopped into separate reads (Porechop). We used the `--trim_barcodes` flag with Guppy during basecalling, so the adapters at the ends of the reads are already removed; however, Porechop helped remove adapters from the middle of the reads.
Porechop is available as a module on NeSI, so there is no need to install it. We can run it with this script:

```
sbatch porechop.sl
```
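A minimal sketch of what `porechop.sl` might contain; the module name and file paths are assumptions:

```shell
#!/bin/bash -e
#SBATCH --job-name=porechop
#SBATCH --cpus-per-task=8

# Exact module name on NeSI is an assumption
module load Porechop

# End adapters were already trimmed by Guppy's --trim_barcodes;
# Porechop still finds internal adapters and splits chimeric reads.
porechop -i nanolyse_out/filtered.fastq.gz \
         -o porechop_out/trimmed.fastq.gz \
         --threads 8
```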
We can install nanoQC using conda; see nanoQC. Let's check the quality of our processed data with this script:

```
sbatch nanoqc.sl
```
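For reference, `nanoqc.sl` might be as simple as the sketch below; the environment name and input path are assumptions:

```shell
#!/bin/bash -e
#SBATCH --job-name=nanoqc

# Activate the env where nanoQC was installed (name is an assumption)
source activate nanoQC

# Writes an HTML report with per-position quality and base-content plots
nanoQC -o nanoqc_out porechop_out/trimmed.fastq.gz
```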
We will use the Flye assembler for our data. I have tried several assemblers on this data with both trimmed (based on quality scores) and untrimmed input reads. The Flye assembly had the highest BUSCO scores; however, it produces a bigger assembly (in length and number of contigs). Below are the scripts to run each assembler.
Flye is available as a module on NeSI, so there is no need to install it. We can run it with the script below; it will also run 3 iterations of genome polishing (`-i 3`):

```
sbatch flye.sl
```
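A minimal sketch of what `flye.sl` might contain; module name, resources, and paths are assumptions:

```shell
#!/bin/bash -e
#SBATCH --job-name=flye
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

# Exact module name on NeSI is an assumption
module load Flye

# --nano-raw: uncorrected ONT reads; -i 3 runs three polishing iterations
flye --nano-raw porechop_out/trimmed.fastq.gz \
     --out-dir flye_out \
     --threads 16 \
     -i 3
```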
Canu is also available on NeSI as a module, so no need to install it. It performs read trimming, correction, and assembly. We can either run these three steps separately or run them all together; check out Canu for more details.
With the script below we can run all three steps of Canu together and get an assembly for our input data. The options `minReadLength=500`, `minOverlapLength=300`, and `correctedErrorRate=0.146` are tuned for our low-coverage ONT data.

```
sbatch canu.sl
```
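A sketch of what `canu.sl` might look like with the options mentioned above; the module name, genome size, and paths are placeholders:

```shell
#!/bin/bash -e
#SBATCH --job-name=canu
#SBATCH --cpus-per-task=16

# Exact module name on NeSI is an assumption
module load Canu

# Runs correction, trimming, and assembly in one go.
# genomeSize is a placeholder: set it to your organism's estimate.
# (Use -nanopore instead of -nanopore-raw with Canu >= 2.0.)
canu -p asm -d canu_out \
     genomeSize=500m \
     minReadLength=500 \
     minOverlapLength=300 \
     correctedErrorRate=0.146 \
     -nanopore-raw porechop_out/trimmed.fastq.gz
```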
Redbean (previously known as wtdbg2) is also available as a module on NeSI. We will run this assembler twice with two different k-mer settings and merge the two assemblies to get the best out of both.
Script for running the first Redbean assembly:

```
sbatch redbean_1.sl
```

Script for running the second Redbean assembly:

```
sbatch redbean_2.sl
```
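The two scripts might differ only in their k-mer settings, along the lines of the sketch below. The genome size, paths, and the particular `-k`/`-p` values are illustrative placeholders; Redbean assembles with `wtdbg2` and then derives the consensus with `wtpoa-cns`:

```shell
# First run: ONT preset with default k-mer settings
# (-g genome size is required with -x; 500m is a placeholder)
wtdbg2 -x ont -g 500m -t 16 -i trimmed.fastq.gz -fo redbean_1
wtpoa-cns -t 16 -i redbean_1.ctg.lay.gz -fo redbean_1.ctg.fa

# Second run: different k-mer settings (-k exact k-mers instead of
# the default homopolymer-compressed -p k-mers; values illustrative)
wtdbg2 -x ont -g 500m -t 16 -k 17 -p 0 -i trimmed.fastq.gz -fo redbean_2
wtpoa-cns -t 16 -i redbean_2.ctg.lay.gz -fo redbean_2.ctg.fa
```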
After getting these two Redbean assemblies, we will run QUAST to quickly check their quality and BUSCO scores, use the assembly with the better BUSCO score as the primary, and merge the other into it.
We will run 5 iterations of Racon; it uses minimap2 to map the reads to the draft assembly and then polishes it. Script:

```
sbatch racon.5ite.sl
```
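The core of `racon.5ite.sl` is probably a loop like the sketch below (paths are placeholders): each round maps the reads to the current draft with minimap2 and feeds the overlaps to Racon, whose output becomes the next round's draft.

```shell
READS=trimmed.fastq.gz
DRAFT=flye_out/assembly.fasta

# Five rounds of Racon polishing
for i in 1 2 3 4 5; do
    # Map reads to the current draft (PAF output)
    minimap2 -t 16 -x map-ont "$DRAFT" "$READS" > racon_${i}.paf
    # Racon arguments: reads, overlaps, target sequences
    racon -t 16 "$READS" racon_${i}.paf "$DRAFT" > racon_${i}.fasta
    DRAFT=racon_${i}.fasta
done
```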
Medaka installation with conda is easier; we can just follow the steps in this link, and then run Medaka on each of the 5 Racon iteration outputs. Script:

```
sbatch medaka.sl
```
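Running Medaka on each Racon output might look like the sketch below. The model name passed to `-m` must match your basecaller version and flow cell; `r941_min_hac_g507` and the paths are placeholders:

```shell
# One medaka run per Racon iteration output
for i in 1 2 3 4 5; do
    medaka_consensus -i trimmed.fastq.gz \
                     -d racon_${i}.fasta \
                     -o medaka_racon_${i} \
                     -t 16 \
                     -m r941_min_hac_g507
done
```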
We can also use nanopolish to polish the assembly (Script). We will compare the outputs of all the polishing steps using QUAST, and whichever has the best assembly stats is the assembly we will take forward to scaffolding.
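A rough outline of a nanopolish run, for orientation; paths and the example region are placeholders, and in practice the `variants` step is parallelised over windows with `nanopolish_makerange.py`:

```shell
# 1. Link the basecalled reads back to their raw fast5 signal data
nanopolish index -d fast5/ trimmed.fastq.gz

# 2. Align reads to the draft and sort/index the BAM
minimap2 -ax map-ont -t 16 draft.fasta trimmed.fastq.gz \
    | samtools sort -@ 4 -o reads.sorted.bam
samtools index reads.sorted.bam

# 3. Compute an improved consensus (window shown is a placeholder)
nanopolish variants --consensus -o polished.vcf \
    -w "contig_1:0-200000" \
    -r trimmed.fastq.gz -b reads.sorted.bam -g draft.fasta

# 4. Apply the called variants to produce the polished fasta
nanopolish vcf2fasta -g draft.fasta polished.vcf > polished.fasta
```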
We will use LRScaf to scaffold the genome using long reads (Script). We will run scaffolding for 5 iterations, compare the output of each, and select the one with the best stats for gap closing.
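One LRScaf iteration might look like the sketch below: LRScaf takes the draft contigs plus a minimap2 alignment of the long reads. The jar path, memory setting, and file names are assumptions:

```shell
# Align long reads to the current draft (PAF/minimap format for LRScaf)
minimap2 -t 16 draft.fasta trimmed.fastq.gz > aln.mm

# -c contigs, -a alignment file, -t alignment type (mm = minimap)
java -Xmx32g -jar LRScaf.jar \
    -c draft.fasta \
    -a aln.mm \
    -t mm \
    -o lrscaf_out
```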
This Script will run LR_Gapcloser for 15 iterations using long reads. We will again run QUAST to compare the output of each iteration and select the one with the best stats.
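The LR_Gapcloser invocation might look like the sketch below; the script path, file names, and flag meanings (`-s n` for nanopore reads, `-r` for the number of iterations, per its README) should be checked against your installed version:

```shell
# LR_Gapcloser ships as a shell wrapper; inputs are the scaffolds
# and the long reads in FASTA format (names are placeholders)
sh LR_Gapcloser.sh \
    -i lrscaf_out/scaffolds.fasta \
    -l trimmed.fasta \
    -s n \
    -r 15 \
    -o gapcloser_out
```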
We will use Pilon to polish our assembly with the RNA-seq data we have. If you have short-read data for your genome, you would want to use that instead of RNA-seq data; but as I don't have short-read data for my genome, I am using RNA-seq. We will run it for 3 iterations and check which one is best. Script
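One Pilon iteration with RNA-seq input might look like the sketch below. RNA-seq reads are spliced, so a splice-aware aligner (HISAT2 here, as an assumption) is used for the mapping; all file names and the jar path are placeholders, and each iteration's output is fed back in as the next genome:

```shell
# Align paired RNA-seq reads to the current assembly
hisat2-build pilon_input.fasta asm_idx
hisat2 -x asm_idx -1 rnaseq_R1.fastq.gz -2 rnaseq_R2.fastq.gz -p 16 \
    | samtools sort -@ 4 -o rnaseq.sorted.bam
samtools index rnaseq.sorted.bam

# One Pilon iteration; --changes logs every correction made
java -Xmx64G -jar pilon.jar \
    --genome pilon_input.fasta \
    --bam rnaseq.sorted.bam \
    --output pilon_iter1 \
    --changes
```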
We will use Purge Haplotigs to remove redundant contigs. Script
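The Purge Haplotigs workflow is three steps: build a read-depth histogram, assign coverage cutoffs, then purge. The `-l`/`-m`/`-h` cutoffs below are illustrative only; you read them off the histogram, and they will differ for every genome (paths are placeholders too):

```shell
# Map long reads back to the assembly
minimap2 -ax map-ont -t 16 assembly.fasta trimmed.fastq.gz \
    | samtools sort -@ 4 -o aligned.bam
samtools index aligned.bam

# 1. Read-depth histogram (produces aligned.bam.gencov and a PNG)
purge_haplotigs hist -b aligned.bam -g assembly.fasta -t 16

# 2. Flag contigs by coverage; cutoffs are examples, not recommendations
purge_haplotigs cov -i aligned.bam.gencov -l 5 -m 30 -h 190

# 3. Purge redundant haplotigs from the assembly
purge_haplotigs purge -g assembly.fasta -c coverage_stats.csv
```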