We will first continue with the E. coli data from yesterday. The raw FASTQ data should be located at `input-data/eco.nanopore.fastq.gz`. Remember that you already length-filtered the data, producing the file `eco-filtered.fastq`. Use this file as input for the de novo assembly. Remember to activate your Conda environment, or install the necessary tools if they are not available.
- use your quality-checked and filtered reads as input
- note that we're using `--nano-raw` in this example because the E. coli data is a bit older
  - with newer ONT chemistry (e.g. R10 flow cells or newer) you should use `--nano-hq`, see also the `--help` of `Flye`
- the output folder is called `flye_output`
- we use `--meta` to activate the "expect metagenome/uneven coverage" mode, which can help to recover full plasmid sequences
- we tell the tool that the expected `--genome-size` is 5 Mbp
# run the assembly, this will take a bit of time
conda activate envs/workshop
flye --nano-raw eco-filtered.fastq -o flye_output -t 4 --meta --genome-size 5M
# the final output genome assembly will be in flye_output/assembly.fasta
While this is running, check out the original Flye publication and the GitHub repository of the tool.
# open the GUI
Bandage &
# load graph file generated by flye:
# -> flye_output/assembly_graph.gfa
# click "draw graph"
Tools that have a graphical user interface can cause problems on a cluster machine.
Alternatively, if you can't get Bandage running with the above commands:
- go to https://rrwick.github.io/Bandage
- download the correct version for your operating system, e.g.
  - download the Windows version
    - or do a `wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Windows_v0_8_1.zip`
  - unzip
  - start `Bandage.exe`
  - load the graph file produced by flye and click "Draw graph"
How many contigs did the `flye` assembly produce for the E. coli sample?
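To answer this without opening the GUI, the contigs can also be counted directly from the FASTA file. The following is a minimal sketch using only `grep`, `awk`, `sort` and `cut`; the helper names `contig_count`, `contig_lengths` and `n50` are made up for this example and are not part of Flye.

```shell
# Count the FASTA header lines, i.e. the number of contigs.
contig_count() { grep -c '^>' "$1"; }

# Print "name<TAB>length" for every contig; handles multi-line FASTA records.
contig_lengths() {
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name "\t" len }' "$1"
}

# N50: length of the contig at which the running sum of the
# length-sorted contigs reaches half of the total assembly size.
n50() {
    contig_lengths "$1" | cut -f2 | sort -rn |
        awk '{ len[NR] = $1; total += $1 }
             END { for (i = 1; i <= NR; i++) {
                       run += len[i]
                       if (run * 2 >= total) { print len[i]; exit }
                   } }'
}

# Only run on the real assembly if it already exists.
if [ -f flye_output/assembly.fasta ]; then
    contig_count flye_output/assembly.fasta
    contig_lengths flye_output/assembly.fasta
    n50 flye_output/assembly.fasta
fi
```

Flye also writes a per-contig table to `flye_output/assembly_info.txt`, which lists length, coverage and whether a contig was resolved as circular.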
Now we want to map the long reads back to the assembly you calculated, so that we can visualize how they support it.
minimap2 -ax map-ont flye_output/assembly.fasta eco-filtered.fastq > eco-mapping.sam
Inspect the resulting SAM file. Check the SAM format specification.
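One way to get a first overview of the SAM file is to classify the alignment records by their FLAG field (column 2). The sketch below assumes a standard SAM file as written by `minimap2 -a`; `sam_summary` is a hypothetical helper name. It checks the bits 4 (unmapped), 256 (secondary) and 2048 (supplementary) defined in the SAM specification, using plain integer arithmetic so that it works with any `awk`.

```shell
# Summarize SAM records by alignment type, based on the FLAG column.
sam_summary() {
    awk -F'\t' '
        /^@/ { next }    # skip header lines
        {
            flag = $2
            if (int(flag / 4) % 2)              unmapped++
            else if (int(flag / 256) % 2)       secondary++
            else if (int(flag / 2048) % 2)      supplementary++
            else                                primary++
        }
        END {
            printf "primary\t%d\nsecondary\t%d\nsupplementary\t%d\nunmapped\t%d\n",
                   primary, secondary, supplementary, unmapped
        }' "$1"
}

# Only run on the real mapping if it already exists.
if [ -f eco-mapping.sam ]; then
    sam_summary eco-mapping.sam
fi
```

Many supplementary alignments mean that reads were split into several pieces, which is worth keeping in mind when you later look at the mapping in a genome browser.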
# first, we need to convert the SAM file into a sorted BAM file to load it subsequently in IGV
samtools view -bS eco-mapping.sam | samtools sort -@ 4 > eco-mapping.sorted.bam
samtools index eco-mapping.sorted.bam
# start IGV browser and load the assembly (FASTA) and BAM file, inspect the output
igv &
# open the GUI
tablet &
# load mapping file as 'primary assembly'
# -> eco-mapping.sam
# load assembly file as 'Reference/consensus file'
# -> flye_output/assembly.fasta
Alternative ways to visualize such a mapping are offered by commercial software such as Geneious or CLC Genomics Workbench.
For the following tasks, you will again use the Nanopore FASTQ data of Salmonella from ENA. Remember, the Nanopore data corresponds to the Illumina samples you already worked on:
Sample ID | Nanopore read ID | Illumina read ID |
---|---|---|
8640 | SRR21833890 | SRR21833889 |
9866-12 | SRR21833871 | SRR21833888 |
8640-41 | SRR21833878 | SRR21833877 |
There are three Nanopore samples; you can work on all of them or pick one! The data is a bit older (from 2019) and was sequenced on a MinION flow cell (FLO-MIN106). Basecalling was done with the `FAST` basecalling model.
De novo assemble the genome(s). Remember that you already QC'ed the data, and you might want to use the length-filtered reads. If you have not done so yet, check out the flye paper (maybe start the assembly first, then read the paper while it is running). Install `flye` if it is not available and run it on the filtered reads. Investigate the results via `Bandage`. How good is your assembly? Did you also calculate a de novo assembly based on the short Illumina reads using `SPAdes`? If so, also load the `*.gfa` graph file from your previous `SPAdes` results for the corresponding Salmonella sample and compare the two graphs.
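In addition to the visual comparison in Bandage, two assembly graphs can be compared by a few simple numbers. The sketch below assumes GFA version 1 files, where `S` lines are segments and `L` lines are links; `gfa_summary` is a hypothetical helper name, and the SPAdes path in the usage note is only a placeholder for wherever your earlier results are stored.

```shell
# Count segments and links in a GFA1 file and sum up the segment lengths.
gfa_summary() {
    awk -F'\t' '
        $1 == "S" {
            segments++
            if ($3 != "*") {
                bases += length($3)    # sequence is stored in the file
            } else {
                # no sequence stored: fall back to the optional LN:i: tag
                for (i = 4; i <= NF; i++)
                    if ($i ~ /^LN:i:/) bases += substr($i, 6)
            }
        }
        $1 == "L" { links++ }
        END { printf "segments\t%d\nlinks\t%d\ntotal_bp\t%d\n", segments, links, bases }' "$1"
}

# Only run on the real graph if it already exists.
if [ -f flye_output/assembly_graph.gfa ]; then
    gfa_summary flye_output/assembly_graph.gfa
fi
```

Running `gfa_summary` on both the Flye graph and your SPAdes graph (e.g. `gfa_summary path/to/spades/assembly_graph.gfa`, path assumed) gives a quick impression of how fragmented each assembly is: a short-read graph typically has many more segments and links for the same total size.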
Now, annotate genes in your assembly like you learned for Illumina data before (e.g. `Prokka`, `Bakta`, `Abricate`, ...). How many genes do you find (CDS, hypothetical genes)? Can you compare that to Illumina? Is it better? Worse?
Try a different assembly tool. Other long-read assemblers are listed and compared here: https://www.frontiersin.org/articles/10.3389/fmicb.2022.796465/full
Remember the R10.4.1 data you downloaded on day 1 for the Bonus tasks? That one:
wget --no-check-certificate https://osf.io/7f8jz/download -O 2023-08-nanopore-workshop-example-bacteria.zip
Also assemble the FASTQ data that is in this archive. Inspect the assembly graph. Can you find out which bacterial species is in the sample?
Download a reference genome FASTA for E. coli from NCBI. Also download an annotation file (e.g. in GFF format). Now, use `minimap2` to map your E. coli reads to the reference genome. Visualize the mapping in IGV and also load the annotation file as an additional track. Now, specifically investigate the two genome regions:
- position 1,325,600 bp
- position 1,362,500 bp
What do you see in both regions? Can you tell if a gene located in one or the other region is somehow affected by the event you can observe?
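To check which annotated genes lie at these positions, the GFF file can also be queried on the command line. The sketch below assumes a GFF3 annotation with 1-based start and end coordinates in columns 4 and 5; `features_at` is a hypothetical helper name and `reference.gff` in the usage note stands for whatever annotation file you downloaded.

```shell
# List features of a given type (default: gene) that overlap a position.
# usage: features_at annotation.gff position [feature_type]
features_at() {
    awk -F'\t' -v pos="$2" -v type="${3:-gene}" '
        BEGIN { pos += 0 }     # force a numeric comparison
        /^#/  { next }         # skip comment and directive lines
        $3 == type && $4 <= pos && pos <= $5 {
            # sequence ID, start, end, strand, attributes
            print $1 "\t" $4 "\t" $5 "\t" $7 "\t" $9
        }' "$1"
}
```

For example, `features_at reference.gff 1325600` and `features_at reference.gff 1362500` print the genes overlapping the two positions (write the position without thousands separators). An empty result means the position is intergenic in that annotation.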