See the instructions here.
First things first.
When you connect to Puhti, you will be in your home folder. You have all your course data in your own folder under the course projects /scratch
folder. So by using commands like csc-workspaces
, cd
, ls
and pwd
make sure you are in the right folder before you start working.
In Visual Studio Code you can use the Explorer
tab to open the right folder. Again remember to add your actual username in place of $USER
.
/scratch/project_2006616/$USER/MMB-114_Genomics
When your ready and can see your own course folder, you can move on.
First we will annotate our genome using a program called Bakta. Among other things, Bakta uses a program called Prodigal to find genes and then annotates them using several tools.
Let's start by connecting to the interactive partition. Now we will need a little bit more memory than what we get as default, so we need to specify that (and let's also ask for some more CPUs):
sinteractive -A project_2006616 -m 20000 -c 4
Bakta is not found from Puhti, but it has been installed into the project applications folder (as many other tools used so far).
It also needs its own database files and that they can be found from the Databases
folder.
We can also add some additional data about the strain to guide the annotation. So before running the annotation, we need to fill in some data or leave some options out.
/projappl/project_2006616/bakta/bin/bakta \
flye_out/assembly.fasta \
--db /projappl/project_2006616/Databases/bakta/db/ \
--prefix # our strain name \
--genus # type here \
--locus # your strain name \
--threads 4 \
--output bakta
Take a look inside the output folder using ls
. Or the Explorer
. To understand what are these files that Bakta has created, take a look here.
Now take a look at the file ending in .txt
file using less
. How many protein-coding genes (a.k.a coding sequences, CDSs) were found? And how many rRNA genes/fragments?
Then open the file ending in .tsv
. Look for the 16S rRNA gene from the annotations. Can you locate it from the assembly? We will need this information later today.
Bakta also gives the KEGG IDs for different metabolic enzymes in our genome. To reconstruct metabolic pathways based on these annotations, we need to extract them from one the annotation files.
Navigate to the Bakta output folder and run the following command:
grep -o "KEGG:K....." *.gff3 | tr ":" "\t" > kegg_ids.txt
Investigate the file kegg_ids.txt
using less
.
These are known as KO identifiers and is how we link genes to metabolic pathways in the KEGG database.
This will allow us to put the gene annotations in the context of metabolic pathways.
We will use IGV
to visualise the annotations and to extract genes for more exact taxonomic annotation.
For this you need to download the .fna
and .gff3
files from bakta annotations folder to your own computer.
We will go thru the steps in IGV
together.
In case you don't have IGV
on your own computer, you can use Proksee. For that you will need the Genbank file (.gbff
from Bakta).
The completeness of a genome can be estimated using CheckM2. CheckM2 uses a machine learning model o predict the compeleteness and contamination of a genome. The tools was developed for metagenome-assembled genomes, so it also predicts contamination. Contamination is not such a big problem for us as we work with isolate genomes.
So first go to the right folder and allocate resources from a computing node.
cd /scratch/project_2006616/$USER/MMB-114_Genomics
sinteractive -A project_2006616 -m 75000 --tmp 200 -c 4
Then run CheckM2 on your own genome.
/projappl/project_2006616/tax_tools/bin/checkm2 predict \
--output-directory CheckM2_out \
--lowmem \
--extension .fasta \
--tmpdir $TMPDIR \
--input # your assembly output folder
Open the output folder of CheckM2 and find a file called quality_report.tsv
.
The taxonomic annotation of genomes can be done with GTDB-Tk against the Genome Taxonomy Database (GTDB).
GTDB-Tk has its own database that has been downloaded to our course scratch folder (/scratch/project_2006616/GTDB/
) due to its large size. We need to set an environmental variable pointing to the database.
export GTDBTK_DATA_PATH="/scratch/project_2006616/GTDB/release220"
Then we can run the taxonomic annotation with GTDB-Tk.
/projappl/project_2006616/tax_tools/bin/gtdbtk classify_wf \
--out_dir GTDBTK_out \
--extension .fasta \
--scratch_dir $TMPDIR \
--tmpdir $TMPDIR \
--skip_ani_screen \
--min_perc_aa 0 \
--pplacer_cpus 1 \
--cpus $SLURM_CPUS_PER_TASK \
--genome_dir # the assembly output folder
Open the output folder of GTDB-Tk and find a file called gtdbtk.bac120.summary.tsv
.
Two most important questions from the above steps:
- How complete was the genome you obtained from the assembly based on CheckM2?
- What was the taxonomic annotation of your genome based on GTDB-Tk?