Workshop: Metagenome assembly, binning, QC, and annotation

Now we want to reconstruct genomes from the ZymoBIOMICS Microbial Community Standard II, Log Distribution (CSII) sample, downsampled to 10% of the total number of reads, that we started working on yesterday. Continue using the length-filtered data set that you should have stored in a file such as reads/zymo-2022-barcode01-perc10.filtered.fastq.

De novo assembly (Flye)

use your quality-checked and filtered reads for input (e.g. reads/zymo-2022-barcode01-perc10.filtered.fastq)
note that we're using --nano-raw in this example because the nanopore example data is a bit older (R9)
- with newer ONT chemistry (e.g. R10 flow cells and newer) you should use --nano-hq, see also the --help of Flye
the output folder is called flye-output
we use --meta to activate the "expect metagenome/uneven coverage" mode which is important for metagenomics data

First, we need to create a new environment and install the tools:

# change to your project dir
mamba create -y -p envs/assembly -c conda-forge -c bioconda flye bandage minimap2 samtools metabat2 checkm-genome
conda activate envs/assembly
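Before starting the long-running steps, it can help to confirm that the environment is active and that the tools are on your PATH. The following is a minimal sketch; `check_tools` is a hypothetical helper name, and the executable names are the ones used in the commands of this hands-on.

```shell
# Report which of the given executables can be found on the PATH.
# check_tools is a hypothetical helper, not part of any of the installed tools.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names as they are called later in this hands-on
check_tools flye Bandage minimap2 samtools metabat2 jgi_summarize_bam_contig_depths checkm \
    || echo "Some tools are missing: is the environment activated?"
```

If a tool is reported as missing, re-run the `conda activate` line above from your project directory first.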

Now, let's run the metagenome assembly:

# run the assembly, this will take some time (increase the number of threads if possible)
flye --nano-raw reads/zymo-2022-barcode01-perc10.filtered.fastq -o flye-output -t 8 --meta
# the final output genome assembly will be in flye-output/assembly.fasta
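Once the run has finished, Flye also writes a per-contig table to flye-output/assembly_info.txt. The sketch below summarizes it and assumes the usual tab-separated layout (contig name, length, coverage, circular Y/N, followed by further columns); check the header of your own file before relying on it. `summarize_flye_info` is a hypothetical helper name.

```shell
# Summarize Flye's per-contig table: number of contigs, total length,
# number of contigs flagged as circular, and the coverage range.
# Assumed columns: 1 = name, 2 = length, 3 = coverage, 4 = circular (Y/N).
summarize_flye_info() {
    awk -F'\t' '
        /^#/ { next }                          # skip the header line
        {
            n++
            total += $2
            if ($4 == "Y") circ++
            if (n == 1 || $3 < min) min = $3
            if (n == 1 || $3 > max) max = $3
        }
        END {
            printf "contigs=%d total_bp=%d circular=%d min_cov=%d max_cov=%d\n",
                   n, total, circ, min, max
        }' "$1"
}

info=flye-output/assembly_info.txt
if [ -f "$info" ]; then
    summarize_flye_info "$info"
fi
```

A wide gap between the minimum and maximum coverage is expected for a log-distributed community and is the reason for running Flye with --meta.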

While this is running, check the original publication and the GitHub repository of the tool:

Publication | Code

Polishing

Nanopore data, especially from earlier chemistries, had high error rates. With the latest flow cell chemistry (R10.4.1) and improvements in basecalling software and bioinformatics tools, accuracy has improved. However, especially with older Nanopore data, one usually has to polish the assembly to reduce the error rate.

In the context of this little hands-on workshop, we will skip the polishing of the initial Flye assembly. But keep in mind that when you want to produce high-quality metagenome-assembled genomes (MAGs), polishing can be an important step before you go on to binning your contigs and annotation.

You can find some hands-on information about polishing here. Note that it focuses on the assembly and polishing of single bacterial isolates, but it can also be transferred to metagenomics data.

Visualization of the assembly (Bandage)

# open the GUI
Bandage &

# load graph file generated by flye:
# ->  flye-output/assembly_graph.gfa
# click "draw graph"

Publication | Code

Tools that have a graphical user interface can cause problems on a cluster machine.

Alternatively, if you can't get Bandage running with the above commands:

go to https://rrwick.github.io/Bandage
download the correct version for your operating system, e.g.
- download Windows version
  - or do a wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Windows_v0_8_1.zip
- unzip
- start Bandage.exe
- load the graph file produced by Flye and click "Draw graph"

How many contigs did the Flye assembly produce for the Zymo sample?
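If you prefer the command line over the GUI, the contigs can also be counted directly from the FASTA file. This is a minimal sketch; `fasta_stats` is a hypothetical helper name, and it only assumes that every contig starts with a `>` header line.

```shell
# Count sequences in a FASTA file and report total and longest length.
# Works for both single-line and multi-line (wrapped) FASTA records.
fasta_stats() {
    awk '
        /^>/ { n++; len = 0; next }            # new record: reset current length
        {
            len   += length($0)
            total += length($0)
            if (len > max) max = len
        }
        END { printf "contigs=%d total_bp=%d longest_bp=%d\n", n, total, max }
    ' "$1"
}

assembly=flye-output/assembly.fasta
if [ -f "$assembly" ]; then
    fasta_stats "$assembly"
fi
```

Note that the number of contigs in assembly.fasta does not have to match the number of segments drawn by Bandage, because the graph file describes graph edges rather than final contigs.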

Binning

Now we want to figure out which of the assembled sequences (contigs) belong to the same species. This is called binning.

Mapping (minimap2)

First, we have to map the long reads to the assembly because we will utilize coverage information to figure out what belongs together:

# change to your working directory
mkdir -p mapping
# map
minimap2 -ax map-ont flye-output/assembly.fasta reads/zymo-2022-barcode01-perc10.filtered.fastq > mapping/zymo-2022.sam
# convert to a sorted BAM file
samtools sort -@ 4 -o mapping/zymo-2022.sorted.bam mapping/zymo-2022.sam
samtools index mapping/zymo-2022.sorted.bam

Publication | Code | SAM format specification.
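Because binning relies on coverage, it is worth checking how many reads actually mapped back to the assembly. `samtools flagstat` on the BAM file is the usual way to do this; the sketch below instead reads the FLAG column (column 2) of mapping/zymo-2022.sam with awk, to show what is being counted. `sam_mapping_rate` is a hypothetical helper name.

```shell
# Count primary alignments and the fraction of reads that mapped,
# based on the SAM FLAG field (column 2).
sam_mapping_rate() {
    awk -F'\t' '
        /^@/ { next }                                    # skip SAM header lines
        {
            flag = $2
            # ignore secondary (0x100) and supplementary (0x800) records,
            # so that every read is counted only once
            if (int(flag / 256) % 2 == 1 || int(flag / 2048) % 2 == 1) next
            reads++
            if (int(flag / 4) % 2 == 0) mapped++         # 0x4 not set: read is mapped
        }
        END {
            if (reads > 0)
                printf "reads=%d mapped=%d rate=%.1f%%\n", reads, mapped, 100 * mapped / reads
        }' "$1"
}

sam=mapping/zymo-2022.sam
if [ -f "$sam" ]; then
    sam_mapping_rate "$sam"
fi
```

Since the reads are mapped against an assembly built from the same reads, a low mapping rate would be a hint that something went wrong upstream.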

MetaBat2

Now we do the actual binning:

mkdir -p binning

# First, generate a depth file from BAM file
jgi_summarize_bam_contig_depths --outputDepth binning/zymo-2022-depth.txt mapping/zymo-2022.sorted.bam

# Run metabat
metabat2 -i flye-output/assembly.fasta -a binning/zymo-2022-depth.txt -o binning/metabat-bins/zymo-2022 -t 4

Inspect the output. How many "bins" do you have?
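One way to inspect the bins is to list, for each FASTA file in the output folder, the number of contigs and the total length. The following sketch assumes the bins were written as `.fa` files into binning/metabat-bins, as in the command above; `summarize_bins` is a hypothetical helper name.

```shell
# Print one line per bin: file name, number of contigs, total length in bp.
summarize_bins() {
    for bin in "$1"/*.fa; do
        [ -e "$bin" ] || continue              # no bins found: the glob stayed literal
        awk -v name="$(basename "$bin")" '
            /^>/ { n++; next }                 # header line: one more contig
            { bp += length($0) }               # sequence line: add its length
            END { printf "%s\tcontigs=%d\ttotal_bp=%d\n", name, n, bp }
        ' "$bin"
    done
}

summarize_bins binning/metabat-bins
```

Comparing the total length of each bin with the expected genome sizes of the organisms in the Zymo standard gives a first impression of how plausible the bins are.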

Quality control of the bins aka MAGs

Now one important question is: how good and complete are your MAGs? This can be assessed, for example, with CheckM.

Further reading: A typical CheckM workflow.

# we need a folder for temporary files
mkdir -p tmp
# we use --reduced_tree to save RAM
checkm lineage_wf --tmpdir tmp -t 4 --reduced_tree -x fa binning/metabat-bins checkm-result
checkm tree_qa checkm-result
# unfortunately, the following command was not working for me - deprecated in latest CheckM version?
# checkm bin_qa_plot --image_type png -x fa checkm checkm/bins/ checkm_plot
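To pick out the better bins, the CheckM summary can be filtered by completeness and contamination. The sketch below assumes that you have saved the summary as a tab-separated table with the header columns "Bin Id", "Completeness" and "Contamination" (for example via the -f and --tab_table options, see `checkm lineage_wf -h`). The file name checkm-result/summary.tsv and the helper `filter_checkm` are hypothetical, and the thresholds of 90% completeness and 5% contamination are only a common rule of thumb.

```shell
# Print bins that pass a minimum completeness and a maximum contamination.
# $1: tab-separated CheckM table, $2: min completeness (%), $3: max contamination (%)
filter_checkm() {
    awk -F'\t' -v min_comp="$2" -v max_cont="$3" '
        NR == 1 {
            # locate the columns by their header names (assumed names, see above)
            for (i = 1; i <= NF; i++) {
                if ($i == "Bin Id")        id   = i
                if ($i == "Completeness")  comp = i
                if ($i == "Contamination") cont = i
            }
            next
        }
        id && comp && cont && $comp + 0 >= min_comp + 0 && $cont + 0 <= max_cont + 0 {
            printf "%s\t%s\t%s\n", $id, $comp, $cont
        }' "$1"
}

table=checkm-result/summary.tsv                # hypothetical file name
if [ -f "$table" ]; then
    filter_checkm "$table" 90 5
fi
```

With the downsampled data set, only the most abundant organisms are likely to pass strict thresholds, so try relaxing them to see which bins are partially recovered.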
