Releases: mtisza1/Cenote-Taker2
Cenote-Taker 2 Version 2.1.5
NOTE: Downloading the binaries will not help you to set up Cenote-Taker 2. If you haven't already installed Cenote-Taker 2, please follow the installation/update instructions in the README, including the database updates.
Update notes:
- Major changes make the installation faster and easier, with a smaller data footprint (previously ~130 GB, now ~8 GB to ~75 GB depending on your database choices). Details:
- The following tools (either tricky to install or out of date) were removed from the dependencies: krona, emboss suite, circlator, mummer.
- The following tools were added to the dependencies: seqkit
- The following tools were changed from stand-alone git clones to packages in the conda environment: lastal/lastdb, hhblits/hhsearch, phanotate.
- The protein BLAST database of RefSeq etc sequences was updated to include ~3000 new RefSeq virus entries
- The hhsuite databases (PDB, PFAM, CDD) are now optional.
- The tool now checks that your run_title is appropriately formatted
- For contigs with DTRs (direct terminal repeats), the --wrap option lets users choose either to clip the repeat region and rotate the contig to an appropriate position, or to forgo rotating and clipping and have the DTRs reported in the genome map. #29
- Certain rm commands were fixed. #21
- The taxonomy calling framework has been updated. NCBI Taxdump files are used for TaxIDs instead of the krona database. "tax_guide.blastx.out" files now show the taxid of the best hit, and have tab-separated hierarchical taxonomy info for that reference. Example:
example_ct1_1 gi|849254117|ref|YP_009150201.1| terminase [Propionibacterium phage PHL085N00] 45.575 9.81e-119 452
taxid: 1500812
10239 Viruses superkingdom
2731341 Duplodnaviria clade
2731360 Heunggongvirae kingdom
2731618 Uroviricota phylum
2731619 Caudoviricetes class
28883 Caudovirales order
10699 Siphoviridae family
1982251 Pahexavirus genus
1982275 Pahexavirus PHL037M02 species
- Protein-sequence-based taxonomy is now more flexible, with thresholds for genome taxon assignment:
Hallmark AAI to Reference | Taxonomic granularity from CT2 |
---|---|
>90% | Genus, e.g. "Ilzatvirus" |
>40% | Family, e.g. "Siphoviridae" |
>25% | Order, e.g. "Caudovirales" |
<=25% | Generic name, e.g. "phage" |
- The --hallmark_taxonomy option allows users to get hierarchical taxonomy information for all identified hallmark genes. This could be useful for more sophisticated downstream taxonomy assignments.
- The -db virion setting is now the default. I think most people are inputting contigs assembled from WGS data, and this is the correct option for this data type.
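The AAI thresholds in the table above can be read as a simple cascade of comparisons. The following is a minimal sketch of that reading, not the code Cenote-Taker 2 actually uses; the function name `taxonomic_granularity` and the returned labels are hypothetical and chosen only for illustration.

```python
def taxonomic_granularity(hallmark_aai: float) -> str:
    """Return the rank suggested by the threshold table for a hallmark AAI.

    hallmark_aai is the percent amino acid identity (0-100) between a
    hallmark gene and its best reference hit. The labels returned here
    are illustrative names, not Cenote-Taker 2 output strings.
    """
    if not 0.0 <= hallmark_aai <= 100.0:
        raise ValueError(f"AAI must be a percentage in [0, 100], got {hallmark_aai}")
    if hallmark_aai > 90.0:
        return "genus"    # e.g. "Ilzatvirus"
    if hallmark_aai > 40.0:
        return "family"   # e.g. "Siphoviridae"
    if hallmark_aai > 25.0:
        return "order"    # e.g. "Caudovirales"
    return "generic"      # e.g. "phage"
```

Under this reading, the 45.575% identity in the terminase example above would fall in the family tier, and a value of exactly 90% would also be family rather than genus, because the table uses strict inequalities.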
Good luck with all of your Cenotes 💖
Mike
Cenote-Taker 2 Version 2.1.3
Downloading the binaries will not help you to set up Cenote-Taker 2. If you haven't already installed Cenote-Taker 2, please follow installation instructions in README. If you have already installed it and you are updating from v2.1.1 or earlier, please do:
conda activate cenote-taker2_env
conda install -c bioconda biopython bedtools
cd Cenote-Taker2
git pull
If you are updating from v2.1.2:
conda activate cenote-taker2_env
git pull
Anyone doing the update should also update the HMM database!
Thank you.
Update notes:
- ITR (inverted terminal repeat) sequences are now annotated correctly.
- Problems with very large (many contigs) datasets should be resolved. There were previously some issues with find commands and argument list length.
- New HMMs for RNA-dependent RNA polymerase genes (7 new HMMs) have been added to the hallmark database. Thanks to Darren Obbard.
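For context on the argument-list issue mentioned above: it typically appears when thousands of file names are expanded into a single command line. One common way to sidestep it is to handle files in fixed-size batches. The sketch below illustrates that general idea in Python; it is not the fix that was applied in Cenote-Taker 2, and the helper name `batched_paths` and the batch size are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_paths(paths: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size paths.

    Passing each batch to a separate command keeps every invocation
    well under the operating system's argument-length limit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With 2,500 contig files and a batch size of 1,000, this yields three batches, so a downstream command would be called three times instead of once with an oversized argument list.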
Best,
Mike
Cenote-Taker 2 Version 2.1.2
If you haven't already installed Cenote-Taker 2, please follow installation instructions in README. If you have already installed it, please do:
conda activate cenote-taker2_env
conda install -c bioconda biopython bedtools
cd Cenote-Taker2
git pull
Then update the HMM database.
Thank you.
This release improves a number of things regarding the annotation and outputs of Cenote-Taker 2. Here is a fairly comprehensive list:
- BLASTN can be used to determine if your sequence belongs to an extant virus species based on 95% Average Nucleotide Identity (ANI) and 85% Alignment Fraction (AF), per community standards. This module requires GenBank nt database, GenBank virus nucleotide database, or some subset thereof. If a sequence has at least 95% ANI and 85% AF to a virus, the taxonomy/organism name will be changed to match the GenBank entry. This module uses anicalc.py from CheckV, see license and copyright in anicalc directory.
- ORFs that overlap tRNAs are now removed to comply with GenBank guidelines. ORFs that are cut off by the end of a contig are now properly formatted per GenBank guidelines.
- "Messy" gene names are largely improved to comply with GenBank guidelines.
- Organism/Taxonomy and BLASTN info are now included in the summary .tsv file
- Cenote-Taker 2 uses more refined gene content searches to identify putative conjugative transposons. Also, genes that Cenote-Taker 2 flags as conjugative machinery are output as a .gtf file in the sequin_and_genome_maps directory.
- Cenote-Taker 2 will now take a CRISPR spacer hit table as an optional input, and will put CRISPR spacer hit info in the note of the genome output files. The format required is a tab-separated table:
CONTIG_NAME HOST_NAME NUMBER_OF_HITS
e.g.
my_contig_1 bacteroides 9
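Two items in the list above are concrete enough to sketch: the 95% ANI / 85% AF species rule and the three-column CRISPR spacer hit table. The Python below is an illustration only, not code from Cenote-Taker 2 or anicalc.py. The names `is_same_species` and `parse_spacer_table` are hypothetical, and the parser assumes the table has no header row and exactly three tab-separated columns per line.

```python
import csv
import io
from typing import Dict, List, Tuple

ANI_THRESHOLD = 95.0  # percent average nucleotide identity
AF_THRESHOLD = 85.0   # percent alignment fraction


def is_same_species(ani: float, af: float) -> bool:
    """Apply the 'at least 95% ANI and 85% AF' rule from the release notes."""
    return ani >= ANI_THRESHOLD and af >= AF_THRESHOLD


def parse_spacer_table(text: str) -> Dict[str, List[Tuple[str, int]]]:
    """Parse CONTIG_NAME<TAB>HOST_NAME<TAB>NUMBER_OF_HITS lines.

    Returns a mapping from contig name to a list of (host, hit count)
    pairs. Assumes no header row; blank lines are skipped.
    """
    hits: Dict[str, List[Tuple[str, int]]] = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or all(not cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 tab-separated columns, got {len(row)}: {row}")
        contig, host, count = (cell.strip() for cell in row)
        hits.setdefault(contig, []).append((host, int(count)))
    return hits
```

A check like `parse_spacer_table` can be useful before a run to confirm that a hit table is really tab-separated, since a space-separated file would be rejected here with a column-count error.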
Best,
Mike
Cenote-Taker 2 Version 2.1.1
If you haven't already installed Cenote-Taker 2, please follow installation instructions in README. If you have already installed it, please do: cd Cenote-Taker2 then git pull. Then update the HMM database.
Thank you.
The code was largely re-written to increase parallelization, making runs with any number of contigs of any length much faster. I've improved the output file structure to make a more sensible summary (.tsv) file, and I've put all the genome maps (.gbf) and gene tables (.gtf) in a single directory (sequin_and_genome_maps/). I've added some additional options to make the user experience more intuitive, especially the -am True option, which makes Cenote-Taker 2 assume that all input sequences are viral and simply annotate them. On the other hand, I've written code for Cenote Unlimited Breadsticks (unlimited_breadsticks.py) and included it in the Cenote-Taker 2 repo. The Unlimited Breadsticks tool consists ONLY of the discovery and pruning modules of Cenote-Taker 2. It runs dramatically faster than Cenote-Taker 2, as it skips all annotation steps. I still urge users to generate and examine genome maps for important virus sequences in order to manually inspect putative viruses.
Finally, I've removed about 100 HMMs from the hallmark gene database, and added about 50 new HMMs. The removed HMMs, while generated from virus sequences, were also found in non-virus regions of bacterial chromosomes, making them unsuitable for virus discovery.
Best of luck :)
Mike