A2GB is a eukaryote genome annotation pipeline that will transform the annotation files exported from Apollo (fomerly known as WebApollo) in preparation for sequence submission to NCBI’s GenBank. While its primary goal is to submit annotations to GenBank for the generation of accession numbers, the conversion of file formats will permit the use of different tools in downstream analyses.
In a stepwise approach, the A2GB pipeline converts annotation files from GFF3 -> EMBL -> TBL -> ASN. Each format will be useful for diagnostic quality checks of the annotations or become the springboard for other analyses, such as protein function prediction.
Furthermore, A2GB acts as a guide to prepare sequence submissions according to NCBI’s guidelines, including project registration with BioSample and BioProject for the generation of locus_tag prefixes. Upon acceptance from NCBI, these essential steps will ensure that sequence data will be made publicly available through GenBank and other member databases within the International Nucleotide Sequence Database Collaboration.
- Introduction
- Requirements
- A2GB workflow
- Exporting annotations from Apollo
- Converting GFF3 files to EMBL format
- Protein function prediction
- Converting EMBL files to ASN format
- Submitting ASN files to GenBank
- Funding and acknowledgments
- References
Various analyses contribute to assigning biological interpretation to a DNA sequence. The goal of genome annotation is to identify the location and function of a genome's encoded features, particularly of protein-coding and RNA-coding genes. Thus, generating the best descriptions of encoded features (i.e. annotations) is paramount for any genomic sequencing project that intends to identify these encoded features and attribute biological meaning to them.
Apollo provides sequencing projects with the tools for gene prediction via evidence-based annotation. This is accomplished with the automatic generation of sequence features which can be refined through expert user curation. Exporting and utilizing these annotations is the preliminary step to assigning user-curated biological function to predictions.
To extend the value of these efforts to the broader scientific community, annotations should be made available to the public in widely accessible biological databases. Upon successful submission, GenBank will assign accession numbers to submitted data to act as unique identifiers. To achieve this, careful adherence to NCBI guidelines for genome submission is essential. This pipeline is equipped with a series of checks to assess the quality of annotations and minimize errors.
The A2GB pipeline will:
- Reformat the GFF3 annotations from Apollo for deposition into the NCBI databases.
- Run sequence searches against UniProt's SwissProt/TrEMBL and InterPro databases for protein function prediction.
- Run intermittent checks to assess the quality of annotations.
- A UNIX-like environment (UNIX/Linux, MacOS X/11, or Miscrosoft's WSL2)
- Perl 5
- BioPerl ## Bio::SeqIO
- Apollo (2.5.0+)
- RNAmmer (1.2+)
- tRNAscan-SE (2.0+)
- Artemis (18.0.0+)
- aria2
- InterProScan 5 (latest version)
- DIAMOND (2.0+) or NCBI BLAST+ (2.10+)
- table2asn
- ChimeraX
After genomic annotations are completed in Apollo, export the curated annotations. Begin by selecting the 'Ref Sequence' tab.
Then, select Export -> select GFF3; select ALL; select GFF3 with FASTA; click Export. The file created will be called Annotations.gff3.gz.
We recommend exporting only protein features (CDS) from Apollo. Altough rRNAs and tRNAs inferences (e.g. from RNAmmer and tRNAscan-SE, respectively) can be added to Apollo, the process of exporting those back is finicky and prone to errors (some rRNAs/tRNAs appear to be missing when exporting features from Apollo 2.5.0).
First, let's create a directory to store annotations:
export ANNOT=/media/FatCat/user/raw_data ## Replace /media/FatCat/user/raw_data by desired annotation directory
mkdir $ANNOT; ## Create directory $ANNOT
mv Annotations.gff3.gz $ANNOT/ ## move Annotations.gff3.gz into $ANNOT
cd $ANNOT/;
gunzip Annotations.gff3.gz ## Decompress the GZIP file
Second, let's predict ribosomal RNAs with RNAmmer using run_RNAmmer.pl; then convert the annotations to GFF3 format with RNAmmer_to_GFF3.pl:
mkdir $ANNOT/RNAmmer/
run_RNAmmer.pl \
-f *.fasta \
-d $ANNOT/RNAmmer/
RNAmmer_to_GFF3.pl \
-g $ANNOT/RNAmmer/*.gff2 \
-d $ANNOT/RNAmmer/
Third , let's predict transfer RNAs with tRNAscan-SE using run_tRNAscan.pl; then convert the annotations to GFF3 format with tRNAscan_to_GFF3.pl:
mkdir $ANNOT/tRNAscan/
run_tRNAscan.pl \
-f *.fasta \
-t E \
-d $ANNOT/tRNAscan/
tRNAscan_to_GFF3.pl \
-t $ANNOT/tRNAscan/*.tRNAs \
-d $ANNOT/tRNAscan/
Fourth, let's concatenate the tRNA, rRNA and CDS GFF3 annotations from RNAmmer, tRNAscan-SE, and Apollo:
cat $ANNOT/RNAmmer/*.gff3 \
$ANNOT/tRNAscan/*.gff3 \
$ANNOT/Annotations.gff3 \
> $ANNOT/all_annotations.gff3
## We concatenate Apollo's GFF3 file last as it includes sequence data
Because debugging issues with annotations is easier when working with single files in Artemis, let's split the concatenated Apollo GFF3 file into distinct GFF3 (.gff3) and FASTA (.fsa) files, one per contig/chromosome with splitGFF3.pl.
splitGFF3.pl \
-g $ANNOT/all_annotations.gff3 \
-d $ANNOT/splitGFF3
This step requires a locus_tag prefix. If a locus_tag prefix has not been created, visit the BioSample and BioProject databases to submit all relevant sample metadata and project details. Once the sample has been accepted, the submitter will receive a BioSample accession number and a unique locus_tag prefix to be referenced during submission of corresponding experimental data to the NCBI, EMBL-EBI and DDBJ databases. Alternatively, to proceed without the locus_tag prefix, simply use a temporary prefix to be replaced later.
Let's convert the GFF3 files to EMBL format with ApolloGFF3toEMBL.pl. This script will generate locus tags automatically based on the provided prefix from NCBI.
ApolloGFF3toEMBL.pl \
-p LOCUS_TAG_PREFIX \
-g $ANNOT/splitGFF3/*.gff3 \
-f $ANNOT/features.list \
-c 1
Options for ApolloGFF3toEMBL.pl are:
-p (--prefix) locus_tag prefix
-g (--gff) GFF3 files generated by Apollo
-o (--outdir) Output directory [Default: ./]
-f (--features) Generate a tab-delimited list of features [Default: features.list]
-z (--zeroes) Number of padding zeroes for locus tags [Default: 5]
-x (--exon) Create exon features for genes with introns
-i (--intron) Create intron features
-r (--rgb) Change default colors of EMBL features for Artemis
-l (--lcolors) Display a list of possible RGB colors
-c (--gcode) NCBI genetic code [Default: 1]
1 - The Standard Code
2 - The Vertebrate Mitochondrial Code
3 - The Yeast Mitochondrial Code
4 - The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
11 - The Bacterial, Archaeal and Plant Plastid Code
# For complete list; see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
After converting files to EMBL format, the content of the directory should look like this:
ls -la $ANNOT/splitGFF3/*
-rw-rw-r--. 1 jpombert jpombert 2701562 Oct 10 14:02 /media/FatCat/user/raw_data/splitGFF3/chromosome_01.embl
-rw-rw-r--. 1 jpombert jpombert 1886580 Oct 10 13:40 /media/FatCat/user/raw_data/splitGFF3/chromosome_01.fsa
-rw-rw-r--. 1 jpombert jpombert 1073070 Oct 10 13:40 /media/FatCat/user/raw_data/splitGFF3/chromosome_01.gff3
-rw-rw-r--. 1 jpombert jpombert 506424 Oct 10 14:02 /media/FatCat/user/raw_data/splitGFF3/chromosome_01.prot
-rw-rw-r--. 1 jpombert jpombert 1536924 Oct 10 14:02 /media/FatCat/user/raw_data/splitGFF3/chromosome_01.RNA
-rw-rw-r--. 1 jpombert jpombert 2573724 Oct 10 14:02 /media/FatCat/user/raw_data/splitGFF3/chromosome_02.embl
-rw-rw-r--. 1 jpombert jpombert 1798505 Oct 10 13:40 /media/FatCat/user/raw_data/splitGFF3/chromosome_02.fsa
-rw-rw-r--. 1 jpombert jpombert 1021787 Oct 10 13:40 /media/FatCat/user/raw_data/splitGFF3/chromosome_02.gff3
-rw-rw-r--. 1 jpombert jpombert 471868 Oct 10 14:02 /media/FatCat/user/raw_data/splitGFF3/chromosome_02.prot
-rw-rw-r--. 1 jpombert jpombert 1427737 Oct 10 14:02 /media/FatCat/user/raw_data/splitGFF3/chromosome_02.RNA
-rw-rw-r--. 1 jpombert jpombert 2180866 Oct 10 14:02 /media/FatCat/user/raw_data/splitGFF3/chromosome_03.embl
-rw-rw-r--. 1 jpombert jpombert 1528854 Oct 10 13:40 /media/FatCat/user/raw_data/splitGFF3/chromosome_03.fsa
-rw-rw-r--. 1 jpombert jpombert 808054 Oct 10 13:40 /media/FatCat/user/raw_data/splitGFF3/chromosome_03.gff3
-rw-rw-r--. 1 jpombert jpombert 408288 Oct 10 14:02 /media/FatCat/user/raw_data/splitGFF3/chromosome_03.prot
-rw-rw-r--. 1 jpombert jpombert 1214029 Oct 10 14:02 /media/FatCat/user/raw_data/splitGFF3/chromosome_03.RNA
...
The EMBL files should resemble this:
head -n 16 $ANNOT/splitGFF3/chromosome_01.embl
ID chromosome_01; genomic DNA; 1855637 bp
XX
DE Generated on Sat Mar 27 13:17:03 2021 by ApolloGFF3toEMBL.pl version 4.0
XX
FT gene 2..148
FT /locus_tag="H0P50_01g00010"
FT /colour=255 255 255
FT CDS 2..148
FT /locus_tag="H0P50_01g00010"
FT /colour=0 255 255
FT gene complement(239..3043)
FT /locus_tag="H0P50_01g00020"
FT /colour=255 255 255
FT CDS complement(239..3043)
FT /locus_tag="H0P50_01g00020"
FT /colour=0 255 255
If created properly, the files should be easy to open with Artemis:
art $ANNOT/splitGFF3/chromosome_01.embl
Note that ApolloGFF3toEMBL.pl will also create FASTA files of proteins and RNAs with the .prot and .RNA extensions, respectively, and which can be used for debugging issues with the corresponding annotations.
For example, we can use check_problems.pl to check for missing start methionines and for internal stop codons in proteins:
check_problems.pl \
-p $ANNOT/splitGFF3/*.prot \
-v
If present, we should see error messages like this:
Checking for errors in chromosome_01.prot located in /media/FatCat/user/raw_data/splitGFF3/
Invalid Start Codon Internal Stop Codon
HOP50_01g00010 K .
HOP50_01g00140 V .
HOP50_01g07580 V X
Checking for errors in chromosome_02.prot located in /media/FatCat/user/raw_data/splitGFF3/
Invalid Start Codon Internal Stop Codon
HOP50_02g14980 . X
HOP50_02g18670 V .
Checking for errors in chromosome_03.prot located in /media/FatCat/user/raw_data/splitGFF3/
Invalid Start Codon Internal Stop Codon
HOP50_03g26990 . X
If present, internal stop codons and missing start methionines (wrong amino acids) can be corrected in Apollo, the GFF exported again, and the subsequent steps performed anew. Alternatively, the errors can be corrected directly on the EMBL files with Artemis:
art $ANNOT/splitGFF3/chromosome_01.embl
We can check if the issues have been fixed by regenerating the .prot files with check_problems.pl again using the -u flag:
check_problems.pl \
-p $ANNOT/splitGFF3/*.prot \
-u \
-v
If fixed, the error message(s) should be gone:
Checking for errors in chromosome_01.prot located in /media/FatCat/user/raw_data/splitGFF3/
OK: No error found in chromosome_01.prot
Options for check_problems.pl are:
-p (--prot) FASTA files (.prot) to be checked for abnormalities
-o (--out) Print the output to a log file
-u (--update) Update .prot files using EMBLtoFeatures.pl
-v (--verb) Add verbosity
Additionally, features can be extracted from the EMBL files with EMBLtoFeatures.pl. Options for EMBLtoFeatures.pl are:
-e (--embl) EMBL files
-o (--outdir) Output directory [Default: ./]
-x (--exon) Create fasta files of exons (.exons) for genes with introns
-i (--intron) Create fasta files of exons (.introns) for genes with introns
-m (--mRNA) Export messenger RNAs of CDS features to .mRNAs
-v (--verbose) Add Verbosity
-c (--gcode) NCBI genetic code [Default: 1]
1 - The Standard Code
2 - The Vertebrate Mitochondrial Code
3 - The Yeast Mitochondrial Code
4 - The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
11 - The Bacterial, Archaeal and Plant Plastid Code
NOTE - For complete list; see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
When running ApolloGFF3toEMBL.pl, locus tags are generated automatically from the provided prefix. We can generate tab-delimited lists of tRNAs/rRNAs and their products from the features.list generated by ApolloGFF3toEMBL.pl and from the files located in $ANNOT/tRNAscan/ and $ANNOT/RNAmmer/ (see section above). Protein function prediction will be performed separately in the next section.
To generate tab-delimited lists of tRNAs/rRNAs, we can use get_RNA_locus_tags.pl. This script will convert automatically the odd 8S_rRNA naming scheme produced by RNAmmer to '5S ribosomal RNA', which is recognized by NCBI. To use get_RNA_locus_tags.pl, type:
get_RNA_locus_tags.pl \
-f $ANNOT/features.list \
-ti $ANNOT/tRNAscan/*.tRNAs \
-ri $ANNOT/RNAmmer/*.gff3 \
-to $ANNOT/tRNA.annotations \
-ro $ANNOT/rRNA.annotations
The lists created should look like this:
head -n 3 $ANNOT/tRNA.annotations $ANNOT/rRNA.annotations
==> /media/FatCat/user/raw_data/tRNA.annotations <==
HOP50_01g07820 tRNA-Leu(CAG)
HOP50_01g08050 tRNA-Ile(UAU)
HOP50_01g07840 tRNA-Leu(AAG)
==> /media/FatCat/user/raw_data/rRNA.annotations <==
HOP50_06g41910 28s ribosomal RNA
HOP50_06g42050 28s ribosomal RNA
HOP50_17g79280 28s ribosomal RNA
In this step, individual protein sequences will be characterized using InterProScan 5 searches, BLASTP/DIAMOND searches against UniProt's SwissProt/TrEMBL databases, and BLASTP/DIAMOND searches against reference genome(s), if available. These annotators will help assign putative functions to predicted proteins.
First, let's generate a single multifasta file containing all of the predicted protein sequences. Ideally, internal stop codons and missing methionines should have been corrected prior to this point:
cat $ANNOT/splitGFF3/*.prot > proteins.fasta
InterPro is a free, widely used database which functionally characterizes unknown protein sequences by classifying them into families and predicts the presence of domains, repeats, and various functional sites. Unknown sequences are queried against predictive models built from identified domains and families. These models, or diagnostic signatures, are provided by InterPro’s diverse set of member databases. The result of pooling distinct signatures from member databases into a single searchable database makes InterPro a robust tool for protein functional prediction.
InterProScan 5 can be run using the interproscan.sh script provided with its distribution or with the run_InterProScan.pl Perl wrapper. To run InterProScan 5 using run_InterProScan.pl:
run_InterProScan.pl \
-c 10 \
-ip \
-go \
-pa \
-f $ANNOT/proteins.fasta \
-d $ANNOT/Interproscan/ \
-l interproscan.log
Options for run_InterProScan.pl are:
-c (--cpu) Number of CPU cores to use [Default: 10]
-f (--fasta) FASTA file(s) to query
-ip (--iprlookup) Use InterPro's pre-calculated match lookup service
-go (--goterms) Gene ontology search (requires --iprlookup)
-pa (--pathways) KEGG pathways (requires --iprlookup)
-d (--dir) Output directory (Optional)
-l (--log) Log name [Default: interproscan.log]
Important, if any stop codon is present in the queries, InterProScan will throw an error message looking like this:
ERROR: uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener - 2. The exception is :java.lang.IllegalArgumentException: You have submitted a protein sequence which contains an asterix (*). This may be from an ORF prediction program. '*' is not a valid IUPAC amino acid character and amino acid sequences which go through our pipeline should not contain it. Please strip out all asterix characters from your sequence and resubmit your search.
If this happens, you can fix the issues with the stop codons by following the steps described in this section.
The UniProt Knowledgebase (UniProtKB) is a wide-ranging database of extensively curated information of protein sequence and functional information. UniProtKB is comprised of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each of these offer a varying level of reliability and quality. The Swiss-Prot database contains proteins that have been tested experimentally and are manually annotated and reviewed. The TrEMBL database utilizes semi-automatic annotation, which is computationally analyzed and typically not reviewed. Together, these databases provide a substantial collection of functional information on proteins.
Homology searches against the SwissProt and TrEMBL databases can be performed with NCBI BLAST+ or DIAMOND. We recommend using DIAMOND due to its significantly decreased computation time.
We can use get_UniProt.pl to download the SwissProt and/or TrEMBL databases from UniProt by leveraging aria2:
get_UniProt.pl \
-s \
-t \
-f $ANNOT/UNIPROT/ \
-n 20 \
-l download.log
Options for get_UniProt.pl are:
-s (--swiss) Download Swiss-Prot
-t (--trembl) Download trEMBL
-f (--folder) Download folder [Default: ./]
-l (--log) Print download information to log file
-dt (--dtool) Specify download tool: aria2c, wget or curl ## Tries to autodetect otherwise
-x (--connex) Number of aria connections [Default: 10]
-d (--decompress) Decompress downloaded files with gunzip ## trEMBL is huge, off by default
-v (--version) Show script version
Homology searches against the UniProt databases will return positive matches against the corresponding accession numbers. However, these matches will not include product names. To facilitate downstream analyses, we can create tab-delimited lists of accession numbers and their products with get_uniprot_products.pl:
get_uniprot_products.pl \
-f $ANNOT/UNIPROT/uniprot_*.fasta.gz \
-o $ANNOT/UNIPROT
These lists should be regenerated everytime the local UniProt databases are updated. Note that creating a product list from the TrEMBL database will take time due to its size. The tab-delimited lists should look like this:
head -n 4 $ANNOT/UNIPROT/*.list
==> /media/FatCat/user/raw_data/UNIPROT/uniprot_sprot.list <==
sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R
sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L
sp|Q197F8|002R_IIV3 Uncharacterized protein 002R
sp|Q197F7|003L_IIV3 Uncharacterized protein 003L
==> /media/FatCat/user/raw_data/UNIPROT/uniprot_trembl.list <==
tr|Q51723|Q51723_9EURY Beta-galactosidase
tr|A2TI13|A2TI13_9EURY Methyl-coenzyme M reductase (Fragment)
tr|A8USH6|A8USH6_9EURY Methyl-coenzyme M reductase alpha (Fragment)
tr|C0LL04|C0LL04_9EURY Methyl-coenzyme M reductase I alpha subunit (Fragment)
We can use DIAMOND to perform homology searches against the UniProt databases. Documentation on how to use DIAMOND can be found here.
First, let's create DIAMOND-formatted databases:
mkdir $ANNOT/DIAMOND/; mkdir $ANNOT/DIAMOND/DB/;
diamond makedb \
--in $ANNOT/UNIPROT/uniprot_sprot.fasta.gz \
-d $ANNOT/DIAMOND/DB/sprot
diamond makedb \
--in $ANNOT/UNIPROT/uniprot_trembl.fasta.gz \
-d $ANNOT/DIAMOND/DB/trembl
Second, let's perform protein-protein homology searches against the UniProt databases with a tabular output format (same as NCBI BLAST+'s -outfmt 6 format). Note that these searches will likely take a while depending the total number of proteins queried and the size of these databases:
diamond blastp \
-d $ANNOT/DIAMOND/DB/sprot \
-q $ANNOT/proteins.fasta \
-o $ANNOT/DIAMOND/sprot.diamond.6 \
-f 6
diamond blastp \
-d $ANNOT/DIAMOND/DB/trembl \
-q $ANNOT/proteins.fasta \
-o $ANNOT/DIAMOND/trembl.diamond.6 \
-f 6
The result of the DIAMOND homology searches should look like this:
head -n 4 $ANNOT/DIAMOND/*.diamond.6
==> /media/FatCat/user/raw_data/DIAMOND/sprot.diamond.6 <==
HOP50_01g00020 sp|Q54YZ9|DHKJ_DICDI 29.8 514 224 6 543 924 1340 1848 2.5e-50 202.2
HOP50_01g00020 sp|Q8D5Z6|LUXQ_VIBVU 33.3 381 234 6 542 916 476 842 6.3e-49 197.6
HOP50_01g00020 sp|Q7MD16|LUXQ_VIBVY 33.3 381 234 6 542 916 476 842 8.2e-49 197.2
HOP50_01g00020 sp|Q5A599|NIK1_CANAL 29.2 520 224 9 543 925 494 1006 2.4e-48 195.7
==> /media/FatCat/user/raw_data/DIAMOND/trembl.diamond.6 <==
HOP50_01g00010 tr|A0A5B8MBL8|A0A5B8MBL8_9CHLO 100.0 48 0 0 1 48 244 291 3.2e-19 102.8
HOP50_01g00010 tr|A0A5B8MD09|A0A5B8MD09_9CHLO 93.3 45 3 0 1 45 231 275 3.3e-16 92.8
HOP50_01g00020 tr|A0A5B8MDW7|A0A5B8MDW7_9CHLO 100.0 890 0 0 45 934 1 890 0.0e+00 1426.8
HOP50_01g00020 tr|A0A5B8MTR1|A0A5B8MTR1_9CHLO 65.7 895 299 3 42 932 26 916 2.5e-277 964.5
If desired, reference datasets (custom or downloaded from NCBI) can also be used as databases in homology searches to help with annotations. NCBI datasets can be accessed from the NCBI genome database or directly from their new dataset repository.
For example, using two datasets downloaded from NCBI:
## Downloading data from NCBI
mkdir $ANNOT/REFERENCES/;
wget -O $ANNOT/REFERENCES/ref1.faa.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/007/859/695/GCA_007859695.1_ASM785969v1/GCA_007859695.1_ASM785969v1_protein.faa.gz
wget -O $ANNOT/REFERENCES/ref2.faa.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/214/015/GCF_000214015.3_version_140606/GCF_000214015.3_version_140606_protein.faa.gz
## Creating DIAMOND database; using zcat to concatenate the output to STDOUT, then feed it to diamond as input
zcat $ANNOT/REFERENCES/*.gz | diamond makedb -d $ANNOT/DIAMOND/DB/reference
## Running DIAMOND
diamond blastp \
-d $ANNOT/DIAMOND/DB/reference \
-q $ANNOT/proteins.fasta \
-o $ANNOT/REFERENCES/reference.diamond.6 \
-f 6
To create a tab-delimited list of accession numbers and their associated proteins from the downloaded NCBI .faa.gz files, we can use get_reference_products.pl.
get_reference_products.pl \
-f $ANNOT/REFERENCES/*.gz \
-l $ANNOT/REFERENCES/reference.list
The list created should look like this:
head -n 5 $ANNOT/REFERENCES/reference.list
QDZ17483.1 hypothetical protein A3770_01p00010
QDZ17484.1 cytochrome P450
QDZ17485.1 hypothetical protein A3770_01p00030
QDZ17486.1 coiled-coil domain-containing protein
QDZ17487.1 putative transmembrane protein
The Kyoto Encyclopedia of Genes and Genomes (KEGG) database is a useful resource to identify which metabolic pathways are present in an organism. The KEGG databases can be queried for orthologs; proteins with matches against KEGG proteins will be assigned KO numbers (for KEGG orthologs). These can be useful during the annotation process. KEGG orthologs can be identified with BlastKOALA, GhostKOALA and/or KofamKOALA using the KEGG web portal.
First, let's start by creating a simple list of all proteins queries, even those that returned no homology in InterProScan5, DIAMOND and/or KEGG searches. We will use get_queries.pl for this:
get_queries.pl $ANNOT/proteins.fasta
head -n 4 $ANNOT/proteins.queries ## Looking at the list produced by get_queries.pl; a simple list with one entry per line
HOP50_01g00010
HOP50_01g00020
HOP50_01g00030
HOP50_01g00040
Then, let's parse the output of the InterProScan 5 and DIAMOND searches using the list of queries produced by get_queries.pl and the lists of accession numbers/products created with get_uniprot_products.pl. We will use parse_annotators.pl to do this:
parse_annotators.pl \
-q $ANNOT/proteins.queries \
-o $ANNOT/proteins.annotations \
-ip $ANNOT/Interproscan/proteins.fasta.interpro.tsv \
-sl $ANNOT/UNIPROT/uniprot_sprot.list \
-tl $ANNOT/UNIPROT/uniprot_trembl.list \
-sb $ANNOT/DIAMOND/sprot.diamond.6 \
-tb $ANNOT/DIAMOND/trembl.diamond.6
The parsed output should look like this:
head -n 4 $ANNOT/proteins.annotations
#Locus_tag Evalue SwissProt Evalue trEMBL Evalue PFAM Evalue TIGR Score HAMAP Evalue CDD
HOP50_01g00010 NA hypothetical protein NA hypothetical protein NA hypothetical protein NA hypothetical protein NA hypothetical protein NA no motif found
HOP50_01g00020 2.5e-50 Hybrid signal transduction histidine kinase J 0.0e+00 Signal transduction histidine kinase 5.5E-30 Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase NA hypothetical protein NA hypothetical protein 1.8907E-11 HisKA
HOP50_01g00030 NA hypothetical protein 1.0e-07 Insulin-like growth factor binding, N-terminal 1.0E-6 Putative ephrin-receptor like NA hypothetical protein NA hypothetical protein 6.31891E-8 TNFRSF
To use parse_annotators.pl with a reference dataset, type:
parse_annotators.pl \
-q $ANNOT/proteins.queries \
-o $ANNOT/proteins.annotations \
-ip $ANNOT/Interproscan/proteins.fasta.interpro.tsv \
-sl $ANNOT/UNIPROT/uniprot_sprot.list \
-tl $ANNOT/UNIPROT/uniprot_trembl.list \
-sb $ANNOT/DIAMOND/sprot.diamond.6 \
-tb $ANNOT/DIAMOND/trembl.diamond.6 \
-rl $ANNOT/REFERENCES/reference.list \
-rb $ANNOT/REFERENCES/reference.diamond.6
We can also use multiple reference datasets if desired (file prefixes should match between the corresponding .list and .diamond.6 files). To use parse_annotators.pl with multiple reference datasets, type:
parse_annotators.pl \
-q $ANNOT/proteins.queries \
-o $ANNOT/proteins.annotations \
-ip $ANNOT/Interproscan/proteins.fasta.interpro.tsv \
-sl $ANNOT/UNIPROT/uniprot_sprot.list \
-tl $ANNOT/UNIPROT/uniprot_trembl.list \
-sb $ANNOT/DIAMOND/sprot.diamond.6 \
-tb $ANNOT/DIAMOND/trembl.diamond.6 \
-rl $ANNOT/REFERENCES/*.list \
-rb $ANNOT/REFERENCES/*.diamond.6
If desired, results from KEGG and dbCAN2 searches can also be parsed accordingly by invoking the corresponding command line options. Current options for parse_annotators.pl are:
-q List of proteins queried against annotators
-o Output file
-v Verbose
## InterProScan5
-ip TSV output from InterProScan
## BLAST/DIAMOND searches
-sl SwissProt list of proteins and their products
-sb SwissProt BLAST/DIAMOND outfmt 6 results
-tl TREMBL list of proteins and their products
-tb TREMBL BLAST/DIAMOND outfmt 6 results
-rl Reference genome(s) list(s) of proteins and their products
-rb Reference(s) BLAST/DIAMOND outfmt 6 results
## KEGG searches
-ko KofamKOALA output file
-gk GhostKOALA output file
-bk BlastKOALA output file
## dbCAN2 CAZy searches: http://bcb.unl.edu/dbCAN2/
-ca dbCAN2 output file
-cl CAZy families list ## http://bcb.unl.edu/dbCAN2/download/Databases/CAZyDB.07302020.fam-activities.txt
The script curate_annotations.pl was designed to faciliate comparisons between function prediction tools. It requires as input the file generated with parse_annotators.pl. At minimum, this file should include the results from DIAMOND (or NCBI BLAST+) BLASTP homology searches against the SwissProt/TrEMBL databases and the InterProScan5 results. curate_annotations.pl will generate a user-curated tab-delimited list of locus_tags and their predicted functions. This list will be stored in a file with the .curated file extension.
To start curating annotations with curate_annotations.pl, simply type:
curate_annotations.pl -sq $ANNOT/proteins.annotations
## Putative annotation(s) found for protein HOP50_01g00020:
1. SwissProt 2.5e-50 Hybrid signal transduction histidine kinase J
2. trEMBL 0.0e+00 Signal transduction histidine kinase
3. PFAM 5.5E-30 Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase
4. TIGR NA hypothetical protein
5. HAMAP NA hypothetical protein
6. CDD 1.8907E-11 HisKA
Please enter:
[1-6] to assign annotation
[0] to annotate the locus as a 'hypothetical protein'
[m] to manually annotate the locus, e.g. DUFxxx domain-containing protein
[n] to manually annotate the locus with annotation notes, e.g. structural homolog
[?] to mark this annotation for review and add annotation notes
[x] to exit
To speed up the manual annotation process, proteins without any homology/significant hit in any of the predictors used will be annotated automatically as 'hypothetical protein'. Proteins with one or more matches identified by the predictors will show a menu like the one above. Users can enter the desired selection from the menu to annotate the proteins accordingly. The option [m] for manual annotation will likely be useful to fix typos and/or lower/uppercase character issues in the corresponding matches. The option [n] allows the user to enter a note in addition to the manual annotation. The option [?] allows the user to flag the entry for later review. The option [x] will enable the user to quit and resume at a later stage. If the option entered is not recognized, the script will exit automatically to prevent potential problems.
To resume annotations from the last annotated proteins, simply add -r (resume) to the command line:
curate_annotations.pl -r -sq $ANNOT/proteins.annotations
[....................................................................................................] 7/8631
## Putative annotation(s) found for protein HOP50_01g00070:
1. SwissProt 4.3e-186 Eukaryotic translation initiation factor 3 subunit A
2. trEMBL 0.0e+00 Eukaryotic translation initiation factor 3 subunit A
3. PFAM 4.4E-9 PCI domain
4. TIGR NA hypothetical protein
5. HAMAP 24.645 Eukaryotic translation initiation factor 3 subunit A [EIF3A].
6. CDD NA no motif found
Please enter:
[1-6] to assign annotation
[0] to annotate the locus as a 'hypothetical protein'
[m] to manually annotate the locus, e.g. DUFxxx domain-containing protein
[n] to manually annotate the locus with annotation notes, e.g. structural homolog
[?] to mark this annotation for review and add annotation notes
[x] to exit.
Selection:
The tab-delimited list of locus_tags and their predicted functions generated by curate_annotations.pl should look like this:
head -n 7 $ANNOT/proteins.annotations.curated
HOP50_01g00010 hypothetical protein
HOP50_01g00010 hypothetical protein
HOP50_01g00010 hypothetical protein
HOP50_01g00020 signal transduction histidine kinase
HOP50_01g00030 hypothetical protein
HOP50_01g00040 hypothetical protein
HOP50_01g00050 glutamine-dependent NAD(+) synthetase
If a reference dataset was used, the menu from curate_annotations.pl should look like this:
curate_annotations.pl -sq $ANNOT/proteins.annotations
[....................................................................................................] 0002/8631
## Putative annotation(s) found for protein HOP50_01g00020:
1. SwissProt 2.5e-50 Hybrid signal transduction histidine kinase J
2. trEMBL 0.0e+00 Signal transduction histidine kinase
3. PFAM 5.5E-30 Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase
4. TIGR NA hypothetical protein
5. HAMAP NA hypothetical protein
6. CDD 1.8907E-11 HisKA
7. Ref_organism 0.0e+00 signal transduction histidine kinase
Please enter:
[1-7] to assign annotation
[0] to annotate the locus as a 'hypothetical protein'
[m] to manually annotate the locus, e.g. DUFxxx domain-containing protein
[n] to manually annotate the locus with annotation notes, e.g. structural homolog
[?] to mark this annotation for review and add annotation notes (optional)
[x] to exit.
Selection: m
Enter desired annotation: signal transduction histidine kinase
Options for curate_annotations.pl are:
-sq (--seq_hom) Sequence homology based annotations (generated from parse_annotators.pl)
-rd (--rcsb_3d) 3D structural homology based annotations (Generated with descriptive_GESAMT_matches.pl)
-pd (--pfam_3d) 3D structural homology annotations based on predicted stuctures (Generated with descriptive_GESAMT_matches.pl)
-cx (--chimerax) Path to ChimeraX pdb sessions
-r (--resume) Resume annotation from last curated locus_tag
-c (--check) Check loci marked with '?'
-v (--verify) Check loci marked for 3D verification
The conversion of EMBL files to TBL format in A2GB is a two step process. EMBL files are first converted to TBL format with EMBLtoTBL.pl, then NCBI's table2asn converts the later format to ASN.
EMBLtoTBL.pl converts EMBL files to TBL format. This script requires a single tab-limited list of the locus tags and their predicted annotations. We can create this list by concatenating the tRNAs.annotations and rRNAs.annotations files generated previously together with the curated list of proteins annotations (see above). Alternatively, any tab-delimited list of locus_tags and their products can be used.
Concatenating the annotations can be quickly performed with:
cat \
$ANNOT/tRNA.annotations \
$ANNOT/rRNA.annotations \
$ANNOT/proteins.annotations.curated \
> $ANNOT/verified_annotations.tsv
The conversion from EMBL to TBL can then be performed with EMBLtoTBL.pl:
EMBLtoTBL.pl \
-id ITTBIO \
-p $ANNOT/verified_annotations.tsv \
-embl $ANNOT/splitGFF3/*.embl \
1> $ANNOT/STD.log \
2> $ANNOT/ERROR.log
Options for EMBLtoTBL.pl are:
-id Desired institute ID [default: IITBIO]
-p (--prod) Tab-delimited list of locus_tags and their products
-e (--embl) EMBL files to convert
-c (--gcode) NCBI genetic code [Default: 1]
1 - The Standard Code
2 - The Vertebrate Mitochondrial Code
3 - The Yeast Mitochondrial Code
4 - The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
11 - The Bacterial, Archaeal and Plant Plastid Code
# For complete list; see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
-o (--organelle) Organelle mode; turns off protein_id / transcript_id
-t (--partial) Add partials flags (< and >) to all gene/mRNA features [Default: off]
Locus tag entries missing from the tab-delimited list of annotations will be reported in the $ANNOT/ERROR.log file. Missing entries will be annotated automatically as 'hypothetical protein', 'hypothetical tRNA' or as 'hypothetical RNA' for CDS, tRNA and rRNA features, respectively. If any entry is missing, the $ANNOT/ERROR.log file will look like this:
head -n 6 $ANNOT/ERROR.log
Cannot find database entry for locus_tag: HOP50_01g00020
Cannot find database entry for locus_tag: HOP50_01g00020
Cannot find database entry for locus_tag: HOP50_01g00030
Cannot find database entry for locus_tag: HOP50_01g00030
Cannot find database entry for locus_tag: HOP50_01g00040
Cannot find database entry for locus_tag: HOP50_01g00040
The TBL files created should look like below. In this example, the first gene is incomplete in 5'; this is indicated in the table by the presence of smaller than signs (<). The codon start entry is also added to ensure proper translation of the open reading frame.
head -n 25 `ls $ANNOT/splitGFF3/*.tbl | head -n 1`
>Feature chromosome_01
<1 148 gene
locus_tag HOP50_01g00010
<1 148 mRNA
locus_tag HOP50_01g00010
product hypothetical protein
protein_id gnl|IITBIO|HOP50_01g00010
transcript_id gnl|IITBIO|HOP50_01g00010_mRNA
<1 148 CDS
locus_tag HOP50_01g00010
product hypothetical protein
protein_id gnl|IITBIO|HOP50_01g00010
transcript_id gnl|IITBIO|HOP50_01g00010_mRNA
codon_start 2
3043 239 gene
locus_tag HOP50_01g00020
3043 239 mRNA
locus_tag HOP50_01g00020
product signal transduction histidine kinase
protein_id gnl|IITBIO|HOP50_01g00020
transcript_id gnl|IITBIO|HOP50_01g00020_mRNA
3043 239 CDS
locus_tag HOP50_01g00020
product signal transduction histidine kinase
protein_id gnl|IITBIO|HOP50_01g00020
Metadata must be included together with genome sequences during the submission process to NCBI. Although some of this metadata can be entered from the online submission form(s), it is often easier to add it beforehand while generating the ASN files. Metadata for genome submission includes taxonomic information about the source of the data being submitted, details about the sequencing experiments/computational analyses performed, and general information about the author(s) and institution(s) submitting the genomes.
Taxonomic metadata can be added directly to the FASTA files. The list of modifiers that can be added directly to the FASTA definition lines can be found here. Mandatory modifiers include the organism name [organism=XXX].
To add metadata with add_metadata_to_fasta.pl using single metadata keys, type:
add_metadata_to_fasta.pl \
-f $ANNOT/splitGFF3/*.fsa \
-o 'Chloropicon primus RCC138' \
-s RCC138 \
-g 1
Alternatively, to add metadata with add_metadata_to_fasta.pl and tab-delimited metadata files, type:
add_metadata_to_fasta.pl \
-f $ANNOT/splitGFF3/*.fsa \
-k metakeys_NCBI.tsv \
-c chromosomes.tsv
Options for add_metadata_to_fasta.pl are:
-f (--fasta) Specifies which FASTA files to add metadata to
## Single metadata keys
-o (--organism) Full organism name; e.g. 'Chloropicon primus RCC138'
-s (--strain) Strain definition; e.g. RCC138
-i (--isolate) Isolate name; e.g. 'Pacific Isolate'
-g (--gcode) NCBI genetic code ## https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
-m (--moltype) NCBI moltype descriptor (e.g. genomic)
## Metadata files
-k (--keys) Tab-delimited NCBI metadata key -> value file
-c (--chromosome) Tab-delimited contig -> chromosome assignment file
The tab-delimited NCBI metadata key -> value file (e.g. metakeys_NCBI.tsv) should look like this:
### metadata key metadata value
organism Chloropicon primus
strain RCC138
moltype genomic
gcode 1
The tab-delimited contig -> chromosome assignment file (e.g. chromosomes.tsv) should look like this:
#contig_name chromosome
chromosome_01 I
chromosome_02 II
chromosome_03 III
chromosome_04 IV
Once modified, the FASTA definition lines should look like this:
head -n 1 $ANNOT/splitGFF3/*.fsa
==> /media/FatCat/user/raw_data/splitGFF3/chromosome_01.fsa <==
>chromosome_01 [organism=Chloropicon primus][moltype=genomic][gcode=1][strain=RCC138][location=chromosome][chromosome=I]
==> /media/FatCat/user/raw_data/splitGFF3/chromosome_02.fsa <==
>chromosome_02 [organism=Chloropicon primus][moltype=genomic][gcode=1][strain=RCC138][location=chromosome][chromosome=II]
==> /media/FatCat/user/raw_data/splitGFF3/chromosome_03.fsa <==
>chromosome_03 [organism=Chloropicon primus][moltype=genomic][gcode=1][strain=RCC138][location=chromosome][chromosome=III]
NCBI provides a simple web-based tool to generate the GenBank submission template file (template.sbt) required by table2asn. To generate a template.sbt file, visit: https://submit.ncbi.nlm.nih.gov/genbank/template/submission/.
More information about NCBI's stuctured comments can be found here. Files containing stuctured comments are tab-delimited; assembly-data structured comments for genomes usually looks like this:
StructuredCommentPrefix ##Genome-Assembly-Data-START##
Assembly Method SPAdes v. 3.13.0; Canu v. 1.8
Assembly Name Version 1
Long Assembly Name RCC138 version 1
Genome Coverage 345x
Sequencing Technology Illumina MiSeq; Oxford Nanopore
table2asn is a command-line program created by NCBI to automate the creation of sequence records for submission to GenBank. The table2asn tool will generate the .sqn file to be used for the submission. More information about the structure and content of the TBL files generated by EMBLtoTBL.pl and required as input for table2asn can be found in NCBI's Eukaryotic Genome Annotation Guide.
To prepare a genome for submission with table2asn, we must use:
- a single multifasta file containing all sequences to be deposited
- a single concatenated TBL file containing all corresponding annotations
While we can run table2asn on separate files to generate a single SQN output, this SQN output file would result in X separate submissions rather than a single one, e.g.:
## A single SQN file with 20 sequences considered separate submissions
cat test_1.sqn | grep -c 'Seq-submit ::='
20
## A single SQN file with the 20 same sequences, but this time considered a single submission
cat test_2.sqn | grep -c 'Seq-submit ::='
1
To generate a single submission from a set of sequences with table2asn, we can use the following:
## Concatenating the FASTA and TBL files:
ANNOT_DIR=$ANNOT/splitGFF3/
cat $ANNOT_DIR/*.tbl > genome.tbl
cat $ANNOT_DIR/*.fsa > genome.fasta
## Generating a single submission with table2asn
table2asn \
-t template.sbt \
-w genome.cmt \
-i genome.fasta \
-f genome.tbl \
-o output_file.sqn \
-euk \
-J \
-M n \
-V vb \
-Z \
-H 12/31/2024
## Checking for the presence of a single submission in the output file
cat output_file.sqn | grep -c 'Seq-submit ::='
1 ## Expected number for a single submission
Options for table2asn used above are:
-t Template file (.sbt).
-w File (.cmt) containing Genome Assembly structured comments.
-i Creates a single submission from indicated .fasta file
-f Concatenated TBL file
-o Desired output .sqn file. Sets the basename for all output files.
-euk Asserts eukaryotic lineage for the discrepancy report tests.
-J Delayed Genomic Product Set
-M n Master Genome Flags: n: Normal. Combines flags for genomes submissions (invokes FATAL calls when -Z discrep is included).
-V vb Verification: v Validates the data records; b Generates GenBank flatfiles with a .gbf suffix.
-Z Runs the Discrepancy Report.
-H Desired date for data release.
table2asn generates two distinct types of error reports. The first consists of validation files with the file extension .val (generated if potential errors are found). The second report (.dr => discrepancy report) contains a detailed summary including metrics, warnings and fatal errors.
Ideally, the .val files should be absent, indicating that no error has been found. We can check if errors were potentially found with:
ls -lh *.val
-rw-rw-r--. 1 jpombert jpombert 5.6K Dec 8 14:12 /media/FatCat/user/raw_data/splitGFF3/chromosome_01.val
-rw-rw-r--. 1 jpombert jpombert 4.7K Dec 8 14:12 /media/FatCat/user/raw_data/splitGFF3/chromosome_02.val
-rw-rw-r--. 1 jpombert jpombert 1.8K Dec 8 14:12 /media/FatCat/user/raw_data/splitGFF3/chromosome_03.val
...
In the above example, a few .val files exist (and their sizes are not zero), which means that errors have been detected. Errors will vary per file, obviously, but the content of a .val file should look like:
cat $ANNOT/splitGFF3/chromosome_01.val
WARNING: valid [SEQ_FEAT.PartialProblem] PartialLocation: 5' partial is not at start AND is not at consensus splice site FEATURE: CDS: hypothetical protein [lcl|chromosome_01:<2-148] [lcl|chromosome_01: raw, dna len= 1855637] -> [gnl|ITTBIO|HOP50_01g00010]
WARNING: valid [SEQ_FEAT.PartialProblem] PartialLocation: 5' partial is not at start AND is not at consensus splice site FEATURE: CDS: hypothetical protein [lcl|chromosome_01:c>27956-27786] [lcl|chromosome_01: raw, dna len= 1855637] -> [gnl|ITTBIO|HOP50_01g00140]
WARNING: valid [SEQ_FEAT.PartialProblem] PartialLocation: 3' partial is not at stop AND is not at consensus splice site FEATURE: CDS: hypothetical protein [(lcl|chromosome_01:194078-194080, 194546->194695)] [lcl|chromosome_01: raw, dna len= 1855637] -> [gnl|ITTBIO|HOP50_01g01170]
WARNING: valid [SEQ_FEAT.NotSpliceConsensusDonor] Splice donor consensus (GT) not found after exon ending at position 194695 of lcl|chromosome_01 FEATURE: CDS: hypothetical protein [(lcl|chromosome_01:194078-194080, 194546->194695)] [lcl|chromosome_01: raw, dna len= 1855637] -> [gnl|ITTBIO|HOP50_01g01170]
ERROR: valid [SEQ_FEAT.SeqLocOrder] Location: Intervals out of order in SeqLoc [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-1360062)] FEATURE: CDS: hypothetical protein [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-1360062)] [lcl|chromosome_01: raw, dna len= 1855637] -> [gnl|ITTBIO|HOP50_01g07580]
WARNING: valid [SEQ_FEAT.PartialProblem] PartialLocation: 5' partial is not at start AND is not at consensus splice site FEATURE: CDS: hypothetical protein [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-1360062)] [lcl|chromosome_01: raw, dna len= 1855637] -> [gnl|ITTBIO|HOP50_01g07580]
WARNING: valid [SEQ_FEAT.ShortExon] Internal coding region exon is too short FEATURE: CDS: hypothetical protein [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-1360062)] [lcl|chromosome_01: raw, dna len= 1855637] -> [gnl|ITTBIO|HOP50_01g07580]
ERROR: valid [SEQ_FEAT.InternalStop] 1 internal stops. Genetic code [1] FEATURE: CDS: hypothetical protein [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-1360062)] [lcl|chromosome_01: raw, dna len= 1855637] -> [gnl|ITTBIO|HOP50_01g07580]
WARNING: valid [SEQ_FEAT.NotSpliceConsensusDonor] Splice donor consensus (GT) not found after exon ending at position 1360062 of lcl|chromosome_01 FEATURE: CDS: hypothetical protein [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-1360062)] [lcl|chromosome_01: raw, dna len= 1855637] -> [gnl|ITTBIO|HOP50_01g07580]
WARNING: valid [SEQ_FEAT.NotSpliceConsensusAcceptor] Splice acceptor consensus (AG) not found before exon starting at position 1361096 of lcl|chromosome_01 FEATURE: CDS: hypothetical protein [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-1360062)] [lcl|chromosome_01: raw, dna len= 1855637] -> [gnl|ITTBIO|HOP50_01g07580]
WARNING: valid [SEQ_FEAT.NotSpliceConsensusAcceptor] Splice acceptor consensus (AG) not found before exon starting at position 1361878 of lcl|chromosome_01 FEATURE: CDS: hypothetical protein [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-1360062)] [lcl|chromosome_01: raw, dna len= 1855637] -> [gnl|ITTBIO|HOP50_01g07580]
WARNING: valid [SEQ_FEAT.NotSpliceConsensusDonor] Splice donor consensus (GT) not found after exon ending at position 194695 of lcl|chromosome_01 FEATURE: mRNA: hypothetical protein [(lcl|chromosome_01:<194078-194080, 194546->194695)] [lcl|chromosome_01: raw, dna len= 1855637]
WARNING: valid [SEQ_FEAT.PartialProblem] PartialLocation: Start does not include first/last residue of sequence FEATURE: Gene: HOP50_01g07580 [lcl|chromosome_01:c>1361096-<1360062] [lcl|chromosome_01: raw, dna len= 1855637]
WARNING: valid [SEQ_FEAT.PartialProblem] PartialLocation: Stop does not include first/last residue of sequence FEATURE: Gene: HOP50_01g07580 [lcl|chromosome_01:c>1361096-<1360062] [lcl|chromosome_01: raw, dna len= 1855637]
ERROR: valid [SEQ_FEAT.SeqLocOrder] Location: Intervals out of order in SeqLoc [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-<1360062)] FEATURE: mRNA: hypothetical protein [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-<1360062)] [lcl|chromosome_01: raw, dna len= 1855637]
WARNING: valid [SEQ_FEAT.NotSpliceConsensusDonor] Splice donor consensus (GT) not found after exon ending at position 1360062 of lcl|chromosome_01 FEATURE: mRNA: hypothetical protein [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-<1360062)] [lcl|chromosome_01: raw, dna len= 1855637]
WARNING: valid [SEQ_FEAT.NotSpliceConsensusAcceptor] Splice acceptor consensus (AG) not found before exon starting at position 1361096 of lcl|chromosome_01 FEATURE: mRNA: hypothetical protein [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-<1360062)] [lcl|chromosome_01: raw, dna len= 1855637]
WARNING: valid [SEQ_FEAT.NotSpliceConsensusAcceptor] Splice acceptor consensus (AG) not found before exon starting at position 1361878 of lcl|chromosome_01 FEATURE: mRNA: hypothetical protein [(lcl|chromosome_01:c>1361096-1360062, c1361878-1361864, c1361711-1361185, c1361097-<1360062)] [lcl|chromosome_01: raw, dna len= 1855637]
ERROR: valid [SEQ_INST.StopInProtein] [1] termination symbols in protein sequence (HOP50_01g07580 - hypothetical protein) BIOSEQ: gnl|ITTBIO|HOP50_01g07580: raw, aa len= 870
The discrepancy report generated by table2asn's -Z command line switch is highly verbose, and includes many checks, warnings and error messages. It is composed of two sections; a summary and a detailed section. FATAL errors will prevent submissions to NCBI and should be corrected. Howewer, FATAL: BACTERIAL_ errors, like in the example below, can be safely ignored for eukaryote genomes.
head -n 40 $OUTDIR/*.dr
Discrepancy Report Results
Summary
COUNT_NUCLEOTIDES: 20 nucleotide Bioseqs are present
SOURCE_QUALS: taxname (all present, all same)
SOURCE_QUALS: 20 sources have taxname = Chloropicon primus
SOURCE_QUALS: location (all present, all same)
SOURCE_QUALS: 20 sources have location = chromosome
SOURCE_QUALS: chromosome (all present, all unique)
SOURCE_QUALS: strain (all present, all same)
SOURCE_QUALS: 20 sources have strain = RCC138
FEATURE_COUNT: CDS: 8627 present
FEATURE_COUNT: gene: 8683 present
FEATURE_COUNT: mRNA: 8627 present
FEATURE_COUNT: rRNA: 10 present
FEATURE_COUNT: tRNA: 46 present
SUSPECT_PRODUCT_NAMES: 15 product_names contain suspect phrases or characters
Putative Typo
1 feature May contain plural
May contain database identifier more appropriate in note; remove from product name
14 features contains three or more numbers together that may be identifiers more appropriate in note
FATAL: BACTERIAL_PARTIAL_NONEXTENDABLE_PROBLEMS: 2 features have partial ends that do not abut the end of the sequence or a gap, and cannot be extended by 3 or fewer nucleotides to do so
FATAL: BACTERIAL_JOINED_FEATURES_NO_EXCEPTION: 3219 coding regions with joined locations have no exceptions
JOINED_FEATURES: 6470 features have joined locations.
CHROMOSOME_PRESENT: one or more chromosomes are present
FEATURE_LIST: Feature List
INCONSISTENT_BIOSOURCE: 20 inconsistent contig sources (subsource qualifiers differ)
SUSPICIOUS_SEQUENCE_ID: 20 sequences have suspicious identifiers
Detailed Report
COUNT_NUCLEOTIDES: 20 nucleotide Bioseqs are present
output_file.sqn:chromosome_01 (length 1855637)
output_file.sqn:chromosome_02 (length 1769006)
output_file.sqn:chromosome_03 (length 1503776)
output_file.sqn:chromosome_04 (length 1460551)
output_file.sqn:chromosome_05 (length 1118318)
output_file.sqn:chromosome_06 (length 1154373)
output_file.sqn:chromosome_07 (length 832941)
output_file.sqn:chromosome_08 (length 860321)
A common issue, especially with fragmented assemblies, is the presence of partial genes that abut the edges of contigs or chromosomes. To fix this, we must extend the feature to the edge of the contig and then, for protein-coding genes, add a tag codon_start with the proper frame (e.g. /codon_start=2). This can be done easily with Artemis. EMBLtoTBL.pl will recognize these tags automatically, and adjust the TBL files accordingly.
Another issue with gene predictors is that they sometimes do not include proper stop codons for predicted protein-coding genes. This can be fixed easily with Artemis by selecting the features to modify (gene + CDS), then extending them by dragging the mouse to the proper stop codon. Note that this error is often mislabelled as a GT-AG rule issue by table2asn. Real issues with improper GT-AG intron/exon junctions can also be adjusted easily by drag and drop.
Errors fixed with Artemis can be saved easily from the graphical interface by selecting the 'File > Save All Entries' option. Then, we can simply rerun EMBLtoTBL.pl followed by table2asn. As the errors are getting fixed, the .val files will gradually become empty.
## Running EMBLtoTBL.pl and table2asn again on the updated EMBL files
EMBLtoTBL.pl \
-id ITTBIO \
-p $ANNOT/verified_annotations.tsv \
-embl $ANNOT/splitGFF3/*.embl \
1> $ANNOT/STD.log \
2> $ANNOT/ERROR.log
## Concatenating the FASTA and TBL files:
ANNOT_DIR=$ANNOT/splitGFF3/
cat $ANNOT_DIR/*.tbl > genome.tbl
cat $ANNOT_DIR/*.fsa > genome.fasta
## Generating a single submission with table2asn
table2asn \
-t template.sbt \
-w genome.cmt \
-i genome.fasta \
-f genome.tbl \
-o output_file.sqn \
-euk \
-J \
-M n \
-V vb \
-Z \
-H 12/31/2024
## Looking at the .val files again; no such file should remain.
ls *.val
ls: cannot access '*.val': No such file or directory
Once the errors have been corrected, the file(s) should be ready to deposit in the Submit assembled eukaryotic and prokaryotic genomes (WGS or Complete) segment of the NCBI's genome submission portal. This step will require the corresponding BioProject and BioSample information as part of the submission process.
NCBI now requests the submission of haplotigs generated as part of diploid/polyploid genome assemblies in addition to the consensus genome sequence. Instructions on how to do so are available at Submitting Multiple Haplotype Assemblies. These pseudohaplotypes can be submitted via the "Pseudohaplotypes of a diploid/polyploid assembly" option in the genome submission portal (each require their own BioProject however). Each file should also be named distinctively (e.g. genome.principal_haplotype.sqn / genome.alternate_haplotype.sqn) based on their relationships for greater clarity.
This work was supported in part by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (award number R15AI128627) to Jean-Francois Pombert. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015 Jan;12(1):59-60. Epub 2014 Nov 17. PMID: 25402007 DOI: 10.1038/nmeth.3176.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. PMID: 2231712 DOI: 10.1016/S0022-2836(05)80360-2.
Barrett T, Clark K, Gevorgyan R, Gorelenkov V, Gribov E, Karsch-Mizrachi I, Kimelman M, Pruitt KD, Resenchuk S, Tatusova T, Yaschenko E, Ostell J. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res. 2012 Jan;40(Database issue):D57-63. Epub 2011 Dec 1. PMID: 22139929 PMCID: PMC3245069 DOI: 10.1093/nar/gkr1163.
Dunn NA, Unni DR, Diesh C, Munoz-Torres M, Harris NL, Yao E, Rasche H, Holmes IH, Elsik CG, Lewis SE. Apollo: Democratizing genome annotation. PLoS Comput Biol. 2019 Feb 6;15(2):e1006790. PMID: 30726205 PMCID: PMC6380598 DOI: 10.1371/journal.pcbi.1006790.
Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA. Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics. 2012 Feb 15;28(4):464-9. Epub 2011 Dec 22. PMID: 22199388; PMCID: PMC3278759. DOI: 10.1093/bioinformatics/btr703.
Chan PP, Lowe TM. tRNAscan-SE: Searching for tRNA Genes in Genomic Sequences. Methods Mol Biol. 2019;1962:1-14. PMID: 31020551 PMCID: PMC6768409 DOI: 10.1007/978-1-4939-9173-0_1.
Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8. Epub 2007 Apr 22. PMID: 17452365 PMCID: PMC1888812 DOI: 10.1093/nar/gkm160.
Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong SY, Lopez R, Hunter S. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014 May 1;30(9):1236-40. Epub 2014 Jan 21. PMID: 24451626 PMCID: PMC3998142 DOI: 10.1093/bioinformatics/btu031.
Mitchell AL, Attwood TK, Babbitt PC, Blum M, Bork P, Bridge A, Brown SD, Chang HY, El-Gebali S, Fraser MI, Gough J, Haft DR, Huang H, Letunic I, Lopez R, Luciani A, Madeira F, Marchler-Bauer A, Mi H, Natale DA, Necci M, Nuka G, Orengo C, Pandurangan AP, Paysan-Lafosse T, Pesseat S, Potter SC, Qureshi MA, Rawlings ND, Redaschi N, Richardson LJ, Rivoire C, Salazar GA, Sangrador-Vegas A, Sigrist CJA, Sillitoe I, Sutton GG, Thanki N, Thomas PD, Tosatto SCE, Yong SY, Finn RD. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 2019 Jan 8;47(D1):D351-D360. PMID: 30398656 PMCID: PMC6323941 DOI: 10.1093/nar/gky1100.
UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019 Jan 8;47(D1):D506-D515. PMID: 30395287; PMCID: PMC6323992. DOI: 10.1093/nar/gky1049.
Sayers EW, Cavanaugh M, Clark K, Ostell J, Pruitt KD, Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2020 Jan 8;48(D1):D84-D86. PMID: 31665464 PMCID: PMC7145611 DOI: 10.1093/nar/gkz956
Stein L. Genome annotation: from sequence to biology. Nat Rev Genet. 2001 Jul;2(7):493-503. DOI: 10.1038/35080529. PMID: 11433356. DOI: 10.1038/35080529
Kanehisa M, Sato Y, Morishima K. BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J Mol Biol. 2016;428:726–731. PMID: 31423653 PMCID: PMC6933857 DOI: 10.1002/pro.3711
Suzuki S, Kakuta M, Ishida T, Akiyama Y. GHOSTX: An improved sequence homology search algorithm using a query suffix array and a database suffix array. PLoS One. 2014;9(8):e103833 PMID: 25099887 PMCID: PMC4123905 DOI: 10.1371/journal.pone.0103833
Aramaki T, Blanc‐Mathieu R, Endo H, et al. KofamKOALA: KEGG ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics. 2020 Apr 1;36(7):2251-2252. PMID: 31742321 PMCID: PMC7141845 DOI: 10.1093/bioinformatics/btz859
Zhang H, Yohe T, Huang L, Entwistle S, Wu P, Yang Z, Busk PK, Xu Y, Yin Y. dbCAN2: a meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 2018 Jul 2;46(W1):W95-W101. PMID: 29771380 PMCID: PMC6031026 DOI: 10.1093/nar/gky418