KEGG_sketching_annotation

Scripts to sketch KEGG and explore using FracMinHash as a way to functionally annotate a metagenome.

Requirements

The requirements.txt file contains the necessary packages required to run the code in this repo. You can install it via:

conda create -y --name KEGG_env python=3.8
conda install -y --name KEGG_env -c conda-forge -c bioconda --file requirements.txt
conda activate KEGG_env

Data

The data this repo uses is currently on the lab GPU server, under /data/shared_data/. This was extracted with this repo.

Main idea

Sketch the AA and NT sequences, compare the following approaches:

AA (vary the k-mer size and sketch size) with sourmash gather
NT (vary the k-mer size and sketch size) with sourmash gather
AA/NT prefetch, then alignment
Cmash/mash screen approach: sketch the database, stream the query over it looking for all matches

Using the following query data (in increasing level of realism):

Simulated subsequences of the AA/NT database
Simulated metagenomes that have AA/NT database sequences spiked in
Full simulated metagenomes (AA+prodigal/AA+6-frame-translation/NT)
Mock metagenomes
Real metagenomes that have been extensively analyzed previously

Structure of repo

src contains the module with various helper functions, wrappers to sourmash, etc. scripts contains the scripts that perform various tasks (sketching, simulation, etc.) utils contains third party tools (in this case, bbtools) for things like making simulations. test_data contains a small portion of real KEGG data for rapid testing and development.

Order of scripts to run

make_sketches.py This will take the FAA and FNA files and create sketches/signatures for each one.
simulate_metagenome.py simulates a metagenome when provided with a reference database. This will need some work since currently bbtools uses the whole database to simulate a metagenome (i.e. coverage is very low), so we'll need to do some down-sampling. This will also create a ground-truth functional profile (though that needs to be checked for accuracy).
classify_and_report.py is intended to run sourmash gather, parse the results, and then compare this to the ground truth.

Classification results

Running the following will print out binary classification results:

./classify_and_report.py -r ../test_data/input/kegg_genes_KO.fna -m ../test_data/output/test_simulation.fq -o ../test_data/output/ -k 21 -t 100 -n 1000 --ref_scale_size 100 --query_scale_size 100

which will result in something like:

=======
{'TP': 97, 'FP': 0, 'FN': 3, 'precision': 1.0, 'recall': 0.97, 'F1': 0.9847715736040609}

Comparison to DIAMOND

DIAMOND is an alignment-based functional annotation software package that is widely used. Let's compare the sourmash/FracMinHash approach to DIAMOND.

Install instructions

Updated from here:

# downloading the tool
wget http://github.com/bbuchfink/diamond/releases/download/v2.0.15/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
# creating a diamond-formatted database file
./diamond makedb --in reference.fasta -d reference
# running a search in blastp mode
./diamond blastp -d reference -q queries.fasta -o matches.tsv
# running a search in blastx mode
./diamond blastx -d reference -q reads.fasta -o matches.tsv
# downloading and using a BLAST database
update_blastdb.pl --decompress --blastdb_version 5 swissprot
./diamond prepdb -d swissprot
./diamond blastp -d swissprot -q queries.fasta -o matches.tsv

Running DIAMOND and parsing the results

After making the simulation, run via:

 ./classify_and_report_diamond.py -r ../test_data/input/kegg_genes_KO.faa -m ../test_data/output/test_simulation.fq -o ../test_data/output/

And you should get a result like

{'TP': 98, 'FP': 110, 'FN': 2, 'precision': 0.47115384615384615, 'recall': 0.98, 'F1': 0.6363636363636364, 'Percent correct alignments': 0.8630737190242755, 'Total number of alignments': 1156228, 'Total number of sequences': 1000000.0}

CAMISIM

Following the guide here (thanks Shaopeng!) to set things up.

Other methods to compare against

https://github.com/biobakery/shortbred

c3db5216001d872ba169b89ab0d16cccf774dd78

Name		Name	Last commit message	Last commit date
Latest commit History 299 Commits
KEGG_scripts		KEGG_scripts
experiments		experiments
scripts		scripts
src		src
test_data/input		test_data/input
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KEGG_sketching_annotation

Requirements

Data

Main idea

Structure of repo

Order of scripts to run

Classification results

Comparison to DIAMOND

Install instructions

Running DIAMOND and parsing the results

CAMISIM

Other methods to compare against

About

Releases

Packages

Languages

License

mahmudhera/KEGG_sketching_annotation

Folders and files

Latest commit

History

Repository files navigation

KEGG_sketching_annotation

Requirements

Data

Main idea

Structure of repo

Order of scripts to run

Classification results

Comparison to DIAMOND

Install instructions

Running DIAMOND and parsing the results

CAMISIM

Other methods to compare against

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages