Arche: a functional-optimized annotator for microbial meta(genomes)

gundizalv/Arche

License: GPL v3 Don't judge me DOI:10.1101/2022.11.28.518280/bioRxiv

Arche: a flexible tool for annotation of microbial contigs

Installing dependencies

Before you download Arche (13 GB), make sure GeneMarkS-2 (GMS2) is working properly on your computer. Because GMS2 requires a (free) licence, you must download it manually:

  Download GeneMarkS-2 and key from http://exon.gatech.edu/GeneMark/license_download.cgi
  tar xvfz gms2_linux_[version].tar.gz

Move the directory to the desired location and add it to your PATH so the binaries can be found (e.g. add export PATH=$PATH:</path/to/gms2_linux_[version]> to your ~/.bashrc file).

Configure the key you've downloaded

  gunzip gm_key.gz
  cp gm_key ~/.gmhmmp2_key
  # or: cp gm_key ~/.gm_key

Test the software

  gms2.pl --seq YOUR_GENOME

To install the other dependencies, you will need the Anaconda distribution. Download and install it from https://www.anaconda.com/download/success

  conda create -n arche_annotator diamond=2.0.14 bedtools=2.27.0 p7zip=16.02 barrnap=0.9 hmmer=3.3.2 prodigal=2.6.3 blast=2.12.0 fasta3=36.3.8i ucsc-fasomerecords=455 trnascan-se=2.0.9 gdown -c bioconda -c conda-forge

This command creates a conda environment named arche_annotator for all future Arche runs, installing the pinned package versions from the bioconda and conda-forge channels.
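Before going further, it can help to confirm that the tools are actually reachable from the activated environment. The following is a minimal sketch, not part of Arche: the function name missing_commands is made up for illustration, and the executable names in the list are assumptions about what the packages above install (check them against your own environment).

```shell
# Print every command name passed as an argument that is NOT found on PATH.
# (Uses the global variable "cmd" to stay portable across POSIX shells.)
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Assumed executable names for GeneMarkS-2 and the conda packages listed above.
missing=$(missing_commands gms2.pl diamond bedtools barrnap hmmscan prodigal \
                           blastp ssearch36 tRNAscan-SE gdown)

if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing"
else
    echo "All checked tools are available."
fi
```

Run it after conda activate arche_annotator; anything it lists is either not installed or not on your PATH.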

Installing Arche

The program, together with the pre-formatted databases and mapping files, can be downloaded (13 GB) from the command line using gdown:

  conda activate arche_annotator
  gdown --fuzzy "https://drive.google.com/file/d/1x9caXGPpYXCHUoodOdnuJI0tCDe9qtGG/view?usp=sharing"

Once the download is finished:

  tar -xvf arche_[version].tar   # then move the output directory to the desired place
  cd arche_[version]/bin/
  chmod +x arche.sh
  ./arche.sh --install

Add the bin directory to your PATH (e.g. add export PATH=$PATH:</path/to/arche_[version]/bin> to your ~/.bashrc file).
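Both GeneMarkS-2 and Arche need their directories on PATH. As a sketch of one way to do this in ~/.bashrc without adding duplicate entries each time the file is sourced, the helper below appends a directory only when it is missing. The function name path_append and the two example directories are hypothetical; substitute your real install locations.

```shell
# Append a directory to PATH only if it is not already listed.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present: do nothing
        *) PATH="$PATH:$1" ;;     # otherwise add it at the end
    esac
}

# Hypothetical locations; replace them with the paths on your machine.
path_append "$HOME/gms2_linux_64"
path_append "$HOME/arche_1.0.1/bin"
export PATH
```

Open a new terminal (or run source ~/.bashrc) for the change to take effect.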

Troubleshooting

If the installation or a run fails:

  1. Check you are working within the conda environment you've created ("conda activate arche_annotator")
  2. Check you have properly installed GeneMarkS-2
  3. If you have already run the command ./arche.sh --install, open the arche.sh script using a text editor and in the section "Main directory" (first lines) replace the string after DIR= with the full path of the working directory, e.g. /home/YOUR_USER/arche_1.0.1
  4. Delete arche's directory, uncompress from tar file, and install again
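Step 3 can also be done from the command line instead of a text editor. The sketch below assumes the assignment sits on a line that starts with DIR= (the exact layout of arche.sh is not shown here, so inspect the file first) and that the path contains no | or & characters. The function name set_main_dir is made up for illustration; it keeps a backup copy next to the script.

```shell
# Rewrite the line starting with "DIR=" in a script so it points at a directory.
# Usage: set_main_dir <script> <full path>; the original is saved as <script>.bak.
set_main_dir() {
    script=$1
    dir=$2
    cp "$script" "$script.bak" &&
        sed "s|^DIR=.*|DIR=\"$dir\"|" "$script.bak" > "$script"
}

# Example (paths are placeholders):
# set_main_dir /home/YOUR_USER/arche_1.0.1/bin/arche.sh /home/YOUR_USER/arche_1.0.1
```

If the result looks wrong, restore the script from the .bak copy.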

Running Arche

BlastP annotation of a bacterial genome, using 20 threads and 40 GB of memory:

arche.sh -n ecoli -t 20 -r 40 e_coli.fna

SSEARCH annotation of an archaeal genome, using 1 thread and 2 GB of memory (the defaults):

arche.sh -n halorubrum -a ssearch -k arch halorubrum_sp_DM2.fa

DIAMOND annotation of a metagenome:

arche.sh -n seawater_metagenome -k meta seawater_metagenome.fna
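To annotate several genomes with the same settings, the commands above can be wrapped in a loop. This is a sketch under assumptions: the genomes/ directory and the helper run_name are hypothetical, and the run name is simply the file name without its extension. Only the -n, -t and -r options shown above are used.

```shell
# Derive a run name from a FASTA path: strip the directories and the extension.
run_name() {
    base=${1##*/}                  # drop leading directories
    printf '%s\n' "${base%.*}"     # drop the last extension
}

# Annotate every .fna file in a (hypothetical) genomes/ directory.
for fasta in genomes/*.fna; do
    [ -e "$fasta" ] || continue    # skip if the glob matched nothing
    arche.sh -n "$(run_name "$fasta")" -t 20 -r 40 "$fasta"
done
```

Each genome gets its own output directory, since -n also names the directory.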

Annotation of Escherichia coli K12

Here you can download a sample which includes the annotation of Escherichia coli K12 with several tools including Arche:

https://docs.google.com/spreadsheets/d/17Nd_y7w2axfxsjFJYAvb_NI3AW9HjNx4/edit?usp=sharing&ouid=115908476093915484477&rtpof=true&sd=true

Output Files

File(s)                       Description

rRNA.tsv                      GFF v3 file containing rRNA annotations.
rRNA.fna                      FASTA file of all rRNA features.
tRNA.tsv                      Table with tRNA details (coordinates, isotype, anticodon, scores, etc.).
[...]_struc_annot.fna         FASTA file of all genomic features (nucleotide).
[...]_struc_annot.faa         FASTA file of translated coding genes (amino acid).
heuristic[...]_out            Output matches of the search instance(s) performed with BLASTp,
                              DIAMOND or SSEARCH36.
heuristic[...]_non_match.faa  FASTA file with the sequences that remain unmatched after the
                              search instance(s) performed with BLASTp, DIAMOND or SSEARCH36.
hmmscan_[...]_out             HMMER3 output table of the search instance(s) performed against
                              a specific HMMDB.
[HMMDB]_non_match.faa         FASTA file with the sequences that remain unmatched after the
                              search instance performed against a specific HMMDB.
[...]_omic_table.tbl          Feature table with fields separated by vertical bars.
[...]_omic_table.tsv          Feature table with tab-separated fields.
arche_report                  File with the parameters of the run and the results.
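A quick way to see how much of a proteome was left unannotated is to count FASTA headers in the output files. The sketch below is not part of Arche; the function names are made up, and you pass the paths yourself (for instance a _struc_annot.faa file and one of the _non_match.faa files from your output directory).

```shell
# Count sequences in a FASTA file by counting header lines (those starting with ">").
count_fasta() {
    grep -c '^>' "$1" || true      # grep exits non-zero on a count of 0; ignore that
}

# Report how many of the predicted proteins remain in a non-match file.
# Usage: report_unmatched <all proteins .faa> <non-matched .faa>
report_unmatched() {
    total=$(count_fasta "$1")
    unmatched=$(count_fasta "$2")
    printf '%s of %s proteins left unmatched\n' "$unmatched" "$total"
}
```

Which non-match file represents the final leftover set depends on the search instances that were run, so check arche_report when in doubt.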

Command line options

-h, --help           This help.

-i, --install        Set up the executable location, and install databases.

-n, --name-files     Name of the files to be created in the output directory,
                     including the directory itself (default 'arche').

-o, --output         Full path to the directory where the output directory
                     will be created, e.g. /home/user/ (default: current directory).

-k, --kingdom        Source of the contigs. Use 'arch' for archaeal genomes or
                     'meta' for metagenomes (default is for bacterial genomes).

-m, --mode           Gives priority to Orthology (KO, eggNOG) or Enzyme Commission
                     designed databases during the annotation. Use 'kegg' for
                     KO-->eggNOG-->E.C., 'eggnog' for eggNOG-->KO-->E.C., or 'ec'
                     for E.C.-->KO-->eggNOG (default will use a shorter Swiss-Prot
                     KO·eggNOG·E.C. designed database with no priority).

-a, --alignment      Select the algorithm to use during the protein alignment step:
                     'diamond' (accelerated blastp) or 'ssearch' (Smith-Waterman)
                     (default 'blastp').

-t, --threads        Number of threads to use (default '1').

-r, --memory         Amount of RAM to use in GB (default '2').

-e, --evalue         Similarity e-value cut-off (default '1e-08').

-q, --query-cov      Minimum coverage on query protein (default '70').

-b, --bypass         Use 'yes' to bypass the RNA gene prediction.

-v, --verbose        Use 'yes' to turn on the verbose mode.

Licence

GPL v3

Author
