Workshop: De novo assembly and mapping

Hands-on

We first continue with the E. coli data from yesterday. The raw FASTQ data should be located at input-data/eco.nanopore.fastq.gz. Remember that you already length-filtered these reads, producing the file eco-filtered.fastq. Use this file as input for the de novo assembly. Remember to activate your Conda environment, or install the necessary tools if they are not available.

De novo assembly (Flye)

  • use your quality-checked and filtered reads for input
  • note that we're using --nano-raw in this example because the E. coli data is a bit older
    • with newer ONT chemistry (e.g. R10 flow cells or newer) you should use --nano-hq; see also the --help of Flye
  • the output folder is called flye_output
  • we use --meta to activate the "expect metagenome/uneven coverage" mode which can help to recover full plasmid sequences
  • we tell the tool that the expected --genome-size is 5 Mbp
# run the assembly; this will take a bit of time
conda activate envs/workshop
flye --nano-raw eco-filtered.fastq -o flye_output -t 4 --meta --genome-size 5M
# the final output genome assembly will be in flye_output/assembly.fasta

While this is running, check the original publication and the GitHub repository of the tool:

Publication | Code

Visualization of the assembly (Bandage)

# open the GUI
Bandage &

# load graph file generated by flye:
# ->  flye_output/assembly_graph.gfa
# click "draw graph"

Publication | Code

Tools that have a graphical user interface can cause problems on a cluster machine.

Alternatively, if you can't get Bandage running with the above commands:

  • go to https://rrwick.github.io/Bandage
  • download the correct version for your operating system, e.g.
    • download Windows version
      • or do a wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Windows_v0_8_1.zip
    • unzip
    • start Bandage.exe
    • load graph file produced by flye and "Draw graph"

How many contigs did the flye assembly produce for the E. coli sample?
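
You can also answer this on the command line. As a minimal sketch: every contig in a FASTA file starts with a `>` header line, so counting those lines gives the number of contigs. The helper name `count_contigs` is made up for this example, and the path assumes the flye_output folder used above. Flye also writes a per-contig table (assembly_info.txt) into its output folder, which is worth a look.

```shell
# Hypothetical helper (not part of Flye): count FASTA records,
# i.e. the lines that start with '>'.
count_contigs() {
    grep -c '^>' "$1"
}

# Only run on the real assembly once it exists.
if [ -f flye_output/assembly.fasta ]; then
    echo "contigs: $(count_contigs flye_output/assembly.fasta)"
fi
```

Compare the number with the nodes you see in Bandage; they should tell a consistent story.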

Mapping (minimap2)

Now we map the long reads back to the assembly you just calculated, so that we can visualize the alignments.

minimap2 -ax map-ont flye_output/assembly.fasta eco-filtered.fastq > eco-mapping.sam

Publication | Code

Inspect the resulting SAM file. Check the SAM format specification.
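
To get a first overview while reading the specification, the sketch below tallies the alignment records by their FLAG field (column 2). It only uses awk, so it works without samtools. The helper name `sam_summary` is an assumption of this example. The bit values (0x4 unmapped, 0x100 secondary, 0x800 supplementary) are taken from the SAM specification.

```shell
# Hypothetical helper: summarize a SAM file by the FLAG column.
# Bits used: 4 = unmapped, 256 = secondary, 2048 = supplementary.
sam_summary() {
    awk -F'\t' '
        /^@/ { header++; next }                  # header lines start with "@"
        {
            f = $2
            if (int(f / 4) % 2)         unmapped++
            else if (int(f / 2048) % 2) supplementary++
            else if (int(f / 256) % 2)  secondary++
            else                        primary++
        }
        END {
            printf "header\t%d\nprimary\t%d\nsecondary\t%d\nsupplementary\t%d\nunmapped\t%d\n",
                   header, primary, secondary, supplementary, unmapped
        }' "$1"
}

# Only run on the real mapping once it exists.
if [ -f eco-mapping.sam ]; then
    sam_summary eco-mapping.sam
fi
```

Long reads often produce supplementary alignments, so the number of records is usually larger than the number of reads.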

Visualization of the mapping (IGV)

# first, we need to convert the SAM file into a sorted BAM file to load it subsequently in IGV
samtools view -bS eco-mapping.sam | samtools sort -@ 4 > eco-mapping.sorted.bam  
samtools index eco-mapping.sorted.bam

# start IGV browser and load the assembly (FASTA) and BAM file, inspect the output
igv &

Alternative: Visualization of mapping (Tablet)

# open the GUI
tablet &

# load mapping file as 'primary assembly'
# ->  eco-mapping.sam

# load assembly file as 'Reference/consensus file'
# ->  flye_output/assembly.fasta

Publication | Code

Alternatively, such a mapping can be visualized with commercial software such as Geneious or CLC Genomics Workbench.

Exercise

For the following tasks, you will again use the Nanopore FASTQ data of Salmonella from ENA. Remember, the Nanopore data corresponds to the Illumina samples you already worked on:

Sample ID   Nanopore read ID   Illumina read ID
8640        SRR21833890        SRR21833889
9866-12     SRR21833871        SRR21833888
8640-41     SRR21833878        SRR21833877

There are three Nanopore samples; you can work on all of them or pick one! The data is a bit older (from 2019) and was sequenced on a MinION flow cell (FLO-MIN106). Basecalling was done with the FAST basecalling model.

De novo assemble the genome(s). Remember that you already QC'ed the data, and you might want to use the length-filtered reads. If you have not done so yet, check out the Flye paper (maybe start the assembly first, then read the paper while it is running). Install Flye if it is not available and run it on the filtered reads. Investigate the results with Bandage. How good is your assembly? Remember that you also calculated a de novo assembly from the short Illumina reads using SPAdes. If you still have it, also load the *.gfa graph file from your previous SPAdes results for the corresponding Salmonella sample and compare the two graphs.
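
Besides looking at the graphs, a few numbers help to judge "how good" an assembly is. The following is a minimal sketch, not an official tool: it reports the number of contigs, the total length, and the N50 (the contig length at which half of all assembled bases lie in contigs of this length or longer). The helper name `assembly_stats` and the folder name `spades_output` are assumptions of this example; adjust the paths to your own results.

```shell
# Hypothetical helper: contig count, total length and N50 of a FASTA file.
assembly_stats() {
    awk '
        /^>/ { if (len) print len; len = 0; next }   # new record: emit previous length
             { len += length($0) }                   # sequence line: add its length
        END  { if (len) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { n50 = lens[i]; break }
            }
            printf "contigs\t%d\ntotal_bp\t%d\nN50\t%d\n", NR, total, n50
        }'
}

# Compare the long-read and the short-read assembly, if present.
# "spades_output" is a placeholder for wherever your SPAdes run wrote its results.
for f in flye_output/assembly.fasta spades_output/contigs.fasta; do
    if [ -f "$f" ]; then
        echo "== $f"
        assembly_stats "$f"
    fi
done
```

A long-read assembly of a bacterial genome typically shows far fewer contigs and a much larger N50 than a short-read assembly, but check whether that holds for your sample.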

Now, annotate genes in your assembly as you learned for the Illumina data before (e.g. with Prokka, Bakta, Abricate, ...). How many genes do you find (CDS, hypothetical genes)? How does that compare to the Illumina-based assembly? Is it better or worse?
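
If your annotation tool writes a GFF3 file, the counts can be pulled out with awk. This is a sketch under assumptions: the feature type is read from column 3, and hypothetical genes are assumed to be labelled `product=hypothetical protein` in column 9, which is what Prokka and Bakta usually write. The helper name `count_cds` and the example path are made up.

```shell
# Hypothetical helper: count CDS features and "hypothetical protein" CDS in a GFF3 file.
count_cds() {
    awk -F'\t' '
        /^##FASTA/ { exit }     # stop at an embedded sequence section, if any
        /^#/       { next }     # skip comment and directive lines
        $3 == "CDS" {
            cds++
            if ($9 ~ /product=hypothetical protein/) hypo++
        }
        END { printf "CDS\t%d\nhypothetical\t%d\n", cds, hypo }
    ' "$1"
}

# Usage (placeholder path, adjust to your annotation output):
# count_cds annotation/nanopore-sample.gff
```

Run it on both the Nanopore-based and the Illumina-based annotation to compare the numbers side by side.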

Bonus 1

Try a different assembly tool. Other long-read assemblers are listed and compared here: https://www.frontiersin.org/articles/10.3389/fmicb.2022.796465/full

Bonus 2

Remember the R10.4.1 data you downloaded on day 1 for the Bonus tasks? That one:

wget --no-check-certificate https://osf.io/7f8jz/download -O 2023-08-nanopore-workshop-example-bacteria.zip

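Also assemble the FASTQ data that is in this archive. Because this is R10.4.1 data, consider Flye's --nano-hq mode instead of --nano-raw, as noted in the hands-on part. Inspect the assembly graph. Can you find out which bacterial species is in the sample?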

Bonus 3

Download a reference genome FASTA for E. coli from NCBI. Also download an annotation file (e.g. in GFF format). Now, use minimap2 to map your E. coli reads to the reference genome. Visualize the mapping in IGV and also load the annotation file as an additional track. Now, specifically investigate the two genome regions:

  • position 1,325,600 bp
  • position 1,362,500 bp

What do you see in both regions? Can you tell if a gene located in one or the other region is somehow affected by the event you can observe?
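
To jump to these positions quickly, it helps to turn each one into a region string of the form `name:start-end`, which can be pasted into the IGV locus box or passed to `samtools view` on an indexed BAM file. The sketch below does only the arithmetic. The helper name `region_around`, the flank of 2,000 bp, and the sequence name `REFERENCE_NAME` are assumptions; use the name from the header line of the reference FASTA you downloaded.

```shell
# Hypothetical helper: print a "name:start-end" region centred on a position.
# Arguments: sequence name, position, flank in bp (default 2000).
region_around() {
    contig=$1
    pos=$2
    flank=${3:-2000}
    start=$(( pos - flank ))
    if [ "$start" -lt 1 ]; then
        start=1                     # coordinates are 1-based
    fi
    echo "${contig}:${start}-$(( pos + flank ))"
}

# Replace REFERENCE_NAME with the sequence name of your reference.
for pos in 1325600 1362500; do
    region_around REFERENCE_NAME "$pos"
done
```

Start with a wider window and zoom in once you have spotted the event in the coverage track.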