Now we want to reconstruct genomes from the ZymoBIOMICS Microbial Community Standard II, Log Distribution (CSII) sample, downsampled to 10% of the total number of reads, that we started working on yesterday. Continue using the length-filtered data set that you should have stored in a file such as reads/zymo-2022-barcode01-perc10.filtered.fastq.
- use your quality-checked and filtered reads as input (e.g. reads/zymo-2022-barcode01-perc10.filtered.fastq)
- note that we're using --nano-raw in this example because the nanopore example data is a bit older (R9)
  - with newer ONT chemistry (e.g. R10 flow cells or newer) you should use --nano-hq, see also the --help of Flye
- the output folder is called flye-output
- we use --meta to activate the "expect metagenome/uneven coverage" mode, which is important for metagenomics data
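Before starting a long assembly run, it can help to confirm that the input file is where you expect it and roughly how many reads it holds. The following is a minimal sketch, not part of the original workflow: count_fastq_reads is a hypothetical helper name, and it assumes a standard FASTQ layout with exactly four lines per read (no wrapped sequence lines).

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file.
# Assumes exactly 4 lines per record (header, sequence, '+', qualities).
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Only report if the workshop file exists in the current project dir.
reads=reads/zymo-2022-barcode01-perc10.filtered.fastq
if [ -f "$reads" ]; then
    echo "reads in $reads: $(count_fastq_reads "$reads")"
fi
```

If the count is zero or the file is missing, check the filtering step from yesterday before running Flye.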
First, we need to create a new environment and install the tools:
# change to your project dir
mamba create -y -p envs/assembly -c conda-forge -c bioconda flye bandage minimap2 samtools metabat2 checkm-genome
conda activate envs/assembly
Now, let's run the metagenome assembly:
# run the assembly, this will take a bit of time (if possible, increase threads)
flye --nano-raw reads/zymo-2022-barcode01-perc10.filtered.fastq -o flye-output -t 8 --meta
# the final output genome assembly will be in flye-output/assembly.fasta
While this is running, check the original publication and the GitHub repository of the tool:
Nanopore data, especially in the early days, had high error rates. With the latest flow cell chemistry (R10.4.1) and improvements in basecalling software and bioinformatics tools, accuracy has improved. However, especially with older Nanopore data, one usually has to polish the assembly to reduce the error rate.
In the context of this short hands-on workshop, we will skip polishing of the initial Flye assembly. But keep in mind that when you want to produce high-quality metagenome-assembled genomes (MAGs), polishing might be an important step before you go into binning of your contigs and annotation.
You can find some hands-on information about polishing here. Note that it focuses on assembly and polishing of single bacterial isolates, but it can also be transferred to metagenomics data.
# open the GUI
Bandage &
# load graph file generated by flye:
# -> flye-output/assembly_graph.gfa
# click "draw graph"
Tools that have a graphical user interface can cause problems on a cluster machine.
Alternatively, if you can't get Bandage running with the above commands:
- go to https://rrwick.github.io/Bandage
- download the correct version for your operating system, e.g.
  - download the Windows version
  - or do a wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Windows_v0_8_1.zip
- unzip
- start Bandage.exe
- load the graph file produced by flye and "Draw graph"
How many contigs did the flye assembly produce for the zymo sample?
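One way to answer this from the command line is to summarize the assembly FASTA directly. The sketch below uses only awk and sort; contig_lengths and assembly_stats are hypothetical helper names introduced here for illustration. Flye also writes a per-contig table to flye-output/assembly_info.txt, which you can inspect as a cross-check.

```shell
# Hypothetical helper: print one sequence length per FASTA record
# (handles sequences wrapped over several lines).
contig_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Hypothetical helper: number of contigs, total length and N50.
# N50 = length of the contig at which the running sum of lengths,
# sorted from longest to shortest, reaches half of the total length.
assembly_stats() {
    contig_lengths "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total_bp=%d n50=%d\n", NR, total, n50
        }'
}

if [ -f flye-output/assembly.fasta ]; then
    assembly_stats flye-output/assembly.fasta
fi
```

For a mock community with a log-distributed abundance, expect a few long contigs from the abundant species and many short fragments from the rare ones.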
Now we want to figure out which of the assembled sequences (contigs) belong to the same species. This is called binning.
First, we have to map the long reads to the assembly because we will utilize coverage information to figure out what belongs together:
# change to your working directory
mkdir -p mapping
# map
minimap2 -ax map-ont flye-output/assembly.fasta reads/zymo-2022-barcode01-perc10.filtered.fastq > mapping/zymo-2022.sam
# convert to a sorted BAM file
samtools view -bS mapping/zymo-2022.sam | samtools sort -@ 4 > mapping/zymo-2022.sorted.bam
samtools index mapping/zymo-2022.sorted.bam
Publication | Code | SAM format specification.
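Since binning relies on coverage, it is worth checking that most reads actually mapped back to the assembly. The usual tool for this is samtools flagstat on the sorted BAM; the sketch below does a similar count on the plain-text SAM file using only awk, so you can see how the FLAG column (second field) encodes the mapping state. sam_mapping_summary is a hypothetical helper name.

```shell
# Hypothetical helper: count primary alignments and how many are mapped.
# FLAG bits used: 4 = read unmapped, 256 = secondary, 2048 = supplementary.
sam_mapping_summary() {
    awk -F '\t' '
        /^@/ { next }    # skip SAM header lines
        {
            flag = $2
            # skip secondary and supplementary records so each read counts once
            if (int(flag / 256) % 2 || int(flag / 2048) % 2) next
            total++
            if (int(flag / 4) % 2 == 0) mapped++
        }
        END { printf "primary=%d mapped=%d unmapped=%d\n", total, mapped, total - mapped }' "$1"
}

if [ -f mapping/zymo-2022.sam ]; then
    sam_mapping_summary mapping/zymo-2022.sam
fi
```

A large share of unmapped reads would suggest that parts of the community are missing from the assembly.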
Now we do the actual binning:
mkdir -p binning
# First, generate a depth file from BAM file
jgi_summarize_bam_contig_depths --outputDepth binning/zymo-2022-depth.txt mapping/zymo-2022.sorted.bam
# Run metabat
metabat2 -i flye-output/assembly.fasta -a binning/zymo-2022-depth.txt -o binning/metabat-bins/zymo-2022 -t 4
Inspect the output. How many "bins" do you have?
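To inspect the bins without opening each file, you could list them together with their contig count and total size. This is a minimal sketch under the assumption that the bins are written as .fa files into binning/metabat-bins (matching the -x fa option used for CheckM further down); summarize_bins is a hypothetical helper name.

```shell
# Hypothetical helper: print "bin file <TAB> contigs <TAB> total bp"
# for every *.fa file in the given directory.
summarize_bins() {
    bin_dir=$1
    for bin in "$bin_dir"/*.fa; do
        [ -e "$bin" ] || continue    # no bins found: glob stayed unexpanded
        awk -v name="$(basename "$bin")" '
            /^>/ { contigs++; next }
            { bp += length($0) }
            END { printf "%s\t%d\t%d\n", name, contigs, bp }' "$bin"
    done
}

if [ -d binning/metabat-bins ]; then
    summarize_bins binning/metabat-bins
    echo "number of bins: $(summarize_bins binning/metabat-bins | wc -l)"
fi
```

Bins whose total size is far below a typical bacterial genome size are likely to be incomplete, which the quality check in the next step should confirm.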
Now one important question is: how good and complete are your MAGs? This can be assessed, for example, with CheckM.
Further reading: A typical CheckM workflow.
# we need a folder for temporary files
mkdir -p tmp
# we use --reduced_tree to save RAM
checkm lineage_wf --tmpdir tmp -t 4 --reduced_tree -x fa binning/metabat-bins checkm-result
checkm tree_qa checkm-result
# unfortunately, the following command was not working for me - deprecated in latest CheckM version?
# checkm bin_qa_plot --image_type png -x fa checkm checkm/bins/ checkm_plot
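Once you have the CheckM results as a tab-separated table, you may want to keep only the bins that pass a quality cut-off. The sketch below is an illustration, not part of the original workflow. It assumes a table with a header row containing columns named "Completeness" and "Contamination" and the bin name in the first column; such a table can, to my knowledge, be written with checkm qa and its --tab_table and -f options, but check checkm qa --help for your version. The file name checkm-result/qa.tsv, the helper name filter_mags, and the default thresholds (at least 90% completeness, at most 5% contamination) are assumptions chosen for this example.

```shell
# Hypothetical helper: filter a tab-separated CheckM table.
# Usage: filter_mags TABLE [MIN_COMPLETENESS] [MAX_CONTAMINATION]
# Columns are located by header name, so their position does not matter.
filter_mags() {
    awk -F '\t' -v min_comp="${2:-90}" -v max_cont="${3:-5}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness") comp = i
                if ($i == "Contamination") cont = i
            }
            if (!comp || !cont) exit 1    # expected columns not found
            next
        }
        $comp + 0 >= min_comp + 0 && $cont + 0 <= max_cont + 0 {
            print $1 "\t" $comp "\t" $cont
        }' "$1"
}

# Assumed file name; adjust to wherever you wrote the table.
if [ -f checkm-result/qa.tsv ]; then
    filter_mags checkm-result/qa.tsv 90 5
fi
```

With the downsampled data set, it would not be surprising if only the most abundant species pass such a filter.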