-
Notifications
You must be signed in to change notification settings - Fork 1
/
main.tex
244 lines (150 loc) · 38.8 KB
/
main.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
\documentclass[fleqn,10pt]{wlscirep}
\usepackage{graphicx}
\usepackage{pdfpages}
\usepackage{longtable}
\usepackage{listings}
\usepackage{multicol}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{breakcites}
\usepackage{indentfirst}
\usepackage[hyphens]{url}
\lstset{
basicstyle=\small\ttfamily,
columns=flexible,
breaklines=true
}
\newcommand{\beginsupplement}{%
\setcounter{table}{0}
\renewcommand{\thetable}{S\arabic{table}}%
\setcounter{figure}{0}
\renewcommand{\thefigure}{S\arabic{figure}}%
}
\title{Comparative Annotation Toolkit (CAT) - simultaneous clade and personal genome annotation}
\author[1,7]{Ian T. Fiddes}
\author[1,*]{Joel Armstrong}
\author[1,*]{Mark Diekhans}
\author[2,*]{Stefanie Nachtweide}
\author[3]{Zev N. Kronenberg}
\author[3,5]{Jason G. Underwood}
\author[3,4]{David Gordon}
\author[1]{Dent Earl}
\author[6]{Thomas Keane}
\author[3,4]{Evan E. Eichler}
\author[1]{David Haussler}
\author[2]{Mario Stanke}
\author[1,+]{Benedict Paten}
\affil[1]{Genomics Institute, University of California Santa Cruz and Howard Hughes Medical Institute, Santa Cruz, CA 95064, USA}
\affil[2]{Institute of Mathematics and Computer Science, University of Greifswald, Domstraße 11, Germany}
\affil[3]{Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA}
\affil[4]{Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA}
\affil[5]{Pacific Biosciences of California, Inc., Menlo Park, CA 94025, USA}
\affil[6]{European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, UK}
\affil[7]{10x Genomics, Pleasanton, CA 94566, USA}
\affil[+]{Corresponding author. Email: bpaten@ucsc.edu}
\affil[*]{These authors contributed equally to this work}
\keywords{Comparative Annotation, Whole-Genome Alignment, Genome Annotation, Comparative Genomics}
\begin{abstract}
The recent introductions of low-cost, long-read, and read-cloud sequencing technologies coupled with intense efforts to develop efficient algorithms have made affordable, high-quality \textit{de novo} sequence assembly a realistic proposition. The result is an explosion of new, ultra-contiguous genome assemblies. To compare these genomes we need robust methods for genome annotation. We describe the fully open source Comparative Annotation Toolkit (CAT), which provides a flexible way to simultaneously annotate entire clades and identify orthology relationships. We show that CAT can be used to improve annotations on the rat genome, annotate the great apes, annotate a diverse set of mammals, and annotate personal, diploid human genomes. We demonstrate the resulting discovery of novel genes, isoforms and structural variants, even in genomes as well studied as rat and the great apes, and how these annotations improve cross-species RNA expression experiments.
\end{abstract}
\begin{document}
\flushbottom
\maketitle
\thispagestyle{empty}
\section*{Introduction}
Short-read sequencing prices continue to drop and new technologies are being combined to produce assemblies of quality comparable to those previously created through intensive manual curation (\citealt{putnam2016chromosome,weisenfeld2017direct,jain2018nanopore,chaisson2015genetic,gordon2016long}, Kronenberg et al., submitted). These advances have allowed researchers to perform clade genomics, producing assemblies for multiple members of a species or clade \citep{thybert2018repeat,lilue2018multiple,jarvis2014whole}, and are required for the ambitious goals of projects such as Genome 10K \citep{haussler2009genome}, which aim to produce thousands of assemblies of diverse organisms. In addition, efforts are growing to produce \textit{de novo} assemblies of individual humans to evaluate the human health implications of structural variation and variation within regions not currently accessible with reference assisted approaches \citep{schneider2017evaluation,steinberg2014single,pendleton2015assembly}.
These advances in genome assembly require subsequent advances in genome comparison. Central to this comparison is annotation. The challenge of finding functional elements in genome assemblies has been considered for at least the past 20 years \citep{haussler1996generalized}. This problem is traditionally approached by \textit{ab initio} prediction (using statistical models of sequence composition) \citep{stanke2004augustus} and sequence alignment of known mRNAs or proteins \citep{Aken01012016}. The former has limited accuracy while the latter is limited by the existence of useful sequence information. Annotation pipelines such as MAKER \citep{cantarel2008maker}, RefSeq \citep{pruitt2006ncbi} and AUGUSTUS \citep{stanke2006gene} make use of both approaches. See \citep{hoff2015current} for a review of genome annotation methods.
A huge amount of effort has gone into the annotation of model organisms, in particular human and mouse. For the past five years, the GENCODE Consortium \citep{harrow2012gencode} has used a wide range of sequencing and phylogenetic information to manually build and curate comprehensive annotation sets, with over 43,281 and 60,297 open-reading frames in mouse and human, respectively. The GENCODE databases give a glimpse into the diversity of alternative isoforms and noncoding transcripts present in vertebrate genomes. Similarly, efforts in other model organisms, such as zebrafish \citep{westerfield1998zebrafish}, \textit{C. elegans} \citep{stein2001wormbase}, \textit{A. thaliana} \citep{swarbreck2008arabidopsis} and many others, have produced high-quality annotation sets for their respective assemblies.
As we enter a third era of genome assembly, consideration should be given to scaling annotation. Here, we present a method and toolkit to make use of multiple genome alignments produced by Progressive Cactus \citep{paten2011cactus} and existing high-quality annotation sets
to simultaneously project well-curated annotations onto lesser studied genomes.
In contrast to most earlier alignment methods \citep{blanchette2004aligning,earl2014alignathon,miller200728}, Progressive Cactus alignments are not reference based, include duplications, and are thus suitable for the annotation of many-to-many orthology relationships. We show how the output of this projected annotation set can be cleaned up and filtered through special application of AUGUSTUS \citep{stanke2008using} and how novel information can be introduced by combining the projected annotation set with predictions produced by Comparative Augustus \citep{konig2015simultaneous}. These predictions can be further supplemented and validated by incorporating long-range RNA-sequencing (RNA-seq) data, such as those generated by the Iso-Seq protocol \citep{gordon2015widespread}. We provide a fully featured annotation pipeline, the Comparative Annotation Toolkit (CAT), that can perform this annotation process reproducibly on any combination of a local computer, a compute cluster, or on the cloud. We show that this pipeline can be applied to a wide range of genetic distances, from distant members of the same clade to individualized assemblies of the same species.
\section*{Results}
\subsection*{Comparative Annotation Toolkit}
CAT provides a software toolkit designed to perform end-to-end annotation; Figure \ref{fig:pipeline} gives an overview. The only required inputs are a hierarchical alignment format (HAL) \citep{hickey2013hal} multiple genome alignment as produced by Progressive Cactus and a GFF3 format annotation file for the previously annotated genome(s). CAT can take as optional input a set of aligned RNA-seq or Iso-Seq BAM format files, as well as protein FASTA files, which are used to construct hints for AUGUSTUS.
TransMap \citep{stanke2008using,zhu2007comparative} is used to project existing annotations between genomes using the Progressive Cactus alignment. TransMap projections are filtered based on a user-tunable flag for minimum coverage, and then the single highest scoring alignment is chosen. If this results in transcripts for a given gene mapping to multiple loci, these are resolved to one locus based on the highest average score of a locus, rescuing lower scoring alignments.
Based on input parameters, CAT will run AUGUSTUS in up to four distinct parameterizations, two of which rely on transMap projections (AugustusTMR) and two that perform \textit{ab initio} predictions (AugustusCGP and AugustusPB) using extrinsic information to guide prediction. AugustusCGP performs simultaneous comparative prediction \citep{konig2015simultaneous} on all aligned genomes, while AugustusPB uses long read RNA-seq to discover novel isoforms. The output of these modes of AUGUSTUS are evaluated alongside the original transMap projections using a combination of classifiers as well as the output from homGeneMapping \citep{stanke2004augustus}, which uses the Cactus alignments to project features such as annotations and RNA-seq support between the input genomes. AugustusCGP and AugustusPB transcript predictions are assigned to transMap genes based on genomic and exonic overlap. If they overlap projections that were filtered out in the paralog resolution process, then they are flagged as putatively paralogous, while if they do not overlap any transMap projections they are flagged as putatively novel. A consensus-finding algorithm combines these gene sets.
The consensus finding algorithm combines all sources of transcript evidence into an annotation set. On a per-gene basis, it evaluates the transMap transcripts for passing user tunable flags for RNA-seq and annotation support. It then considers the inclusion of \textit{ab initio} transcripts based on their assignment to this locus and their contribution of novel splice junctions supported by RNA-seq or Iso-Seq. Finally, it evaluates \textit{ab inito} transcripts not assigned to a gene as novel loci if they are supported by RNA-seq or Iso-Seq as defined by user-tunable flags. For a more detailed description of CAT see Supplementary Methods.
\subsection*{Annotation of great apes}
The previous generation of great ape assemblies (panTro4, ponAbe2 and gorGor4) as well as the new SMRT (PacBio) great ape assemblies (\citealt{gordon2016long}, Kronenberg et al., submitted) were annotated by CAT, using GRCh38 and GENCODE V27 as the reference. On average, CAT identified 141,477 more transcripts and 25,090 more genes in the new SMRT assemblies of the great apes compared to the Ensembl V91 annotation of the previous generation of great ape assemblies. Relative to the existing human annotation, the CAT annotations represent an average of 95.0\% of GENCODE gene models and 94.3\% of GENCODE isoforms in the SMRT great ape assemblies. This increase in isoform representation is mostly due to the large number of isoforms annotated by GENCODE and reproduced in these genomes, while the increase in gene content is due to the mapping over of non-coding genes poorly represented in the Ensembl annotation. Comparing the CAT annotations of SMRT great apes and older assemblies, we see an average increase of 610 genes (1.9\%) and 3,743 isoforms (1.0\%) (Supplemental Figure \ref{supp_fig:primate_completeness}) in the SMRT assemblies; given this relatively small increase, most of the observed increase in genes and isoforms in the CAT annotations relative to the Ensembl annotations are therefore a result of the CAT annotation process rather than the updated assemblies.
Conversely to the overall increases in genes and isoforms, CAT identifies on average 3,553 fewer protein-coding genes than Ensembl. However, this brings the total number of coding genes more closely in line with the GENCODE annotation of human, as Ensembl has an average of 2,081 more protein-coding genes in great apes than GENCODE has for human (Supplemental Figure \ref{supp_fig:primate_completeness}).
To evaluate these annotations in a non-species-biased fashion, consensus isoform sequences created from Iso-Seq reads for each species were compared to their respective species annotations. As a baseline comparison, equivalent human data was compared to the high-quality human GENCODE V27 annotation. The CAT annotation of both the SMRT and older great ape assemblies (which used the raw Iso-Seq reads during the annotation process) and the Ensembl annotation of the older assemblies were compared. We calculated the rate of isoform concordance, that is the fraction of consensus Iso-Seq sequences that match either exactly or fuzzily an annotated isoform (Figure \ref{fig:primate}A; methods). Fuzzy matching allows for the intron boundaries to shift slightly in a isoform. For the SMRT chimpanzee (74.0\%/82.1\% exact/fuzzy matching) and orangutan (71.4\%/80.4\%) genome assemblies the isoform concordance rates were comparable to the rate for human (74.6\%/82.1\%). The gorilla GSMRT3.2 assembly showed lower concordance (67.6\%/76.9\%), likely due to the higher indel error rate in that assembly (Supplemental Figure \ref{supp_fig:primate_indels}). In contrast, the isoform concordance rate for the older assemblies was lower (on average 60.0\%/69.6\%), mostly reflecting exons in gaps and mis-joins, and was lower still for the existing Ensembl annotations (on average 47.9\%/57.6\%).
To assess the utility of CAT annotations for short-read analysis of RNA expression, species-specific induced pluripontent stem cell (iPSC) Illumina RNA-seq data were quantified (Figure \ref{fig:primate}B). Comparing the annotations of the older assemblies, CAT identified an average of 9,518 more genes and 54,107 more transcripts with measurable expression compared to Ensembl.
We might expect the per-gene abundance estimates of the majority of genes in matched cell types to agree between species, particularly for closely related species. It is reasonable to therefore prefer \textit{a priori} an annotation of the great apes that produces expression estimates that agree with expression estimates from the matched human data using the GENCODE annotation. Doing these comparisons, we find better correlations on average using the CAT annotation of the older assemblies (avg. Pearson r=0.63; Figure \ref{fig:primate}D, Supplemental Figure \ref{supp_fig:primate_expression}) than the Ensembl annotations of the older assemblies (avg. Pearson r=0.44). However, we find by far the highest correlation when CAT annotates the SMRT primate assemblies (avg. Pearson r=0.90). This reflects the increased representation in the updated assemblies of transcript sequence, especially 3` UTRs that are important for quantifying poly(A) primed RNA-seq (Kronenberg et al., submitted). Notably, we find that the correlations between the CAT annotations of the SMRT assemblies and the matched human data are higher than when mapping the species-specific data back to the human GENCODE annotations and comparing to the human data (Figure \ref{fig:primate}D, Supplemental Figure \ref{supp_fig:primate_expression}), demonstrating the benefit of having species-specific annotations within closely related species that have clear cross-species orthology relationships. Analysis at the isoform level showed the same patterns (Supplemental Figure \ref{supp_fig:primate_transcript_expression}), albeit with slightly weaker correlations.
Predictions performed by AugustusCGP and AugustusPB were incorporated into the gene sets based on the presence of splice junctions supported by RNA-seq or Iso-Seq and not present in the transMap/AugustusTMR-derived annotations (Figure \ref{fig:primate}C). An average of 1,677 novel isoforms and 64 novel loci were found across the assemblies with at least one Iso-Seq read supporting the prediction.
CAT provides new metrics for diagnosing assembly quality. In the process of annotating the great ape genomes, we noticed that assemblies that had undergone Quiver and Pilon \citep{walker2014pilon} correction still exhibited a systematic bias towards coding deletions. These were identified to be related to heterozygosity in the input dataset and a variant-calling-based correction method (Kronenberg et al., submitted) was developed to resolve these issues, dramatically lowering the coding indel rate and reducing systematic bias (Supplemental Figure \ref{supp_fig:primate_indels}). CAT can also measure gene assembly contiguity by reporting the number of genes whose transcripts end up split across multiple contigs, or on disjoint intervals in the same contig. Comparison of split gene metrics between the old and new primate assemblies shows 504 fewer split genes in chimpanzee, 560 fewer in gorilla and 1,858 fewer in orangutan (Supplemental Figure \ref{supp_fig:primate_split_genes}).
\subsection*{Annotation of personal human diploid assemblies}
High-quality \textit{de novo} assembly of a human genome is increasingly feasible; both Pacific Biosciences \citep{chin2016phased,huddleston2016discovery,korlach2017novo} (Falcon) and 10x Genomics \citep{weisenfeld2017direct} (Supernova) provide tools to construct phased, diploid assemblies. Annotating diploid assemblies provides a window into haplotype-specific structural variation that may affect gene expression. To evaluate the ability of CAT to provide this analysis, Progressive Cactus alignments were generated between hg38 and the two haploid cell line assemblies, CHM1 (GCA\_001297185.1) and CHM13 (GCA\_000983455.2) as well as the 10x Genomics diploid assemblies of four individuals (NA12878, NA24385, HG00512 and NA19240).
An average of 98.5\% of genes present in GENCODE V27 were identified in CHM1/CHM13, while an average of 97.3\% of genes were identified in the 10x Supernova assemblies. (Figure \ref{fig:human_metrics}A). After filtering, an average of 552 genes in the PacBio assemblies and 461 genes in the 10x assemblies had frame-shifting indels (Figure \ref{fig:human_metrics}B). Compared to ExAC, which found between 75 and 125 putative truncating events per individual \citep{karczewski2016exac}, this result suggests indel errors in the assemblies are producing false positives. All assemblies exhibit systematic overrepresentation of deletions, including the Pacbio assemblies despite coming from haploid cell lines (Figure \ref{fig:human_metrics}B). Split gene analysis found the CHM1 assembly to be the most gene contiguous, with only 39 genes split across multiple contigs, and the PacBio assemblies overall more contiguous (Figure \ref{fig:human_metrics}C). Gene contiguity is measured by looking at genes with multiple alignments post paralog resolution that start and end nearby in transcript coordinates.
Manual analysis of genes with different transMap coverage in CHM13 relative to CHM1 led to the discovery of the example region in Figure \ref{fig:human_example}A. This deletion removes most of the exons of \textit{TRIB3}, a pseudokinase associated with type 2 diabetes \citep{shi2009association} (Supplemental Figure \ref{supp_fig:trib3}). Similar analysis in the diploid assembly of NA12878 led to the discovery of a tandem duplication involving an exon of TAOK3 in one haplotype (Figure \ref{fig:human_example}B).
\subsection*{Reannotating the rat genome}
We tested CAT's ability to reannotate the rat genome using information from the mouse genome. These genomes differ by approximately 0.18 substitutions/site, much more, for example, than the 0.04 substitutions/site separating the human and orangutan genomes \citep{karolchik2003ucsc}.
CAT was run on a Cactus alignment between mouse (mm10) and rat (rn6) using rabbit (oryCun2), Egyptian jerboa (jacJac1) and human (hg38) as outgroups. To provide hints to AUGUSTUS, RNA-seq data were obtained from SRA \citep{fushan2015gene,cortez2014origins,liu2016identification} (Supplemental Table \ref{supp_table:rnaseq_sra_table}). For comparison we used existing Ensembl and RefSeq rat annotations and ran the MAKER2 pipeline \citep{cantarel2008maker} to generate an annotation set. MAKER was provided both a Trinity \citep{haas2013novo} \textit{de novo} assembly of the input RNA-seq data provided to CAT (MAKER does not process raw RNA-seq) as well as the mouse protein sequences from GENCODE VM11, together providing a comparable input set to what CAT had.
CAT comparatively annotated 78.1\% of genes and 91.9\% of protein-coding genes present in GENCODE VM11 on rn6 (Supplemental Figure \ref{supp_fig:rat_completeness}), representing an increase of 14,675 genes and 74,308 transcripts over Ensembl V90, 5,104 genes and 32,157 transcripts over RefSeq and 14,541 genes and 81,022 transcripts over Maker. 13,651 loci were identified with no overlap to any other annotation set (Supplemental Figure \ref{supp_fig:rat_locus_venn}).
We compared CDS exon and CDS intron predictions between annotation sets (Table \ref{table:rat_similarity}A, Supplemental Figure \ref{supp_fig:rat_exon_intron}). We measured precision and recall of coding intron and exon intervals based on comparing the CAT annotation to EnsemblV90, where precision is the proportion of CAT exons/introns that exactly match Ensembl, and recall is the proportion of CAT exons/introns that exactly match Ensembl over the number of exons/introns CAT annotated. Ensembl, RefSeq and CAT CDS exon annotations were all comparably similar (between 0.648 and 0.659 Jaccard similarity), while for CDS introns CAT and RefSeq displayed the highest Jaccard similarity (0.841). In all comparisons MAKER was the outlier (Table \ref{table:rat_similarity}B) with the lowest similarity to the other sets.
The input RNA-seq dataset was used for isoform quantification against the CAT, MAKER, Ensembl and RefSeq transcriptomes (Figure \ref{fig:rat_validation}A). CAT identified 1,881 protein-coding genes and 1,011 lncRNAs with measurable expression not present in either Ensembl or RefSeq. CAT also identified 27,712 expressed coding splice junctions not in the union of RefSeq and Ensembl, for a total of 21,267 novel expressed isoforms. 5,526 of the 13,651 loci unique to CAT had measurable expression.
AugustusTMR, which uses transMap and RNA-seq, provides CAT with a way to improve transcript predictions projected between species. Comparing the 9,532 multi-exon protein-coding transcripts in which the AugustusTMR prediction differed from the input transMap projection, we see considerable overall improvement in resulting RNA-seq support of predicted splice boundaries in the AugustusTMR transcripts (Figure \ref{fig:rat_validation}B).
\subsection*{Annotation of a diverse set of mammals}
Finally, to test CAT's ability to annotate across a substantial and diverse range of genomes, 13 mammalian genomes were comparatively annotated using the mouse (mm10) GENCODE VM15 as the reference transcript set (Figure \ref{fig:13way}A). Species-specific RNA-seq was used for every genome (Supplemental Table \ref{supp_table:rnaseq_sra_table}). To assess the completeness of these annotation sets, 4,104 benchmarking universal single-copy orthologs (BUSCO) were used \citep{simao2015busco}, which by design should be nearly uniformly present in each of these genomes. On average, only 108 BUSCO genes (2.63\%) were not annotated by CAT in each genome (Supplementary Table \ref{supp_table:busco}).
To estimate the usefulness of these annotation sets, the input RNA-seq datasets were used to quantify expression of the annotation sets (Figure \ref{fig:13way}B). The main factor in measurable expression is the variety of the input RNA-seq datasets, as exemplified by the ability to measure expression of 88.9\% of genes annotated in the sheep genome.
To assess the CAT translation of annotations over large phylogenetic distances, as well as provide a baseline validation of CAT performance, the annotation of human hg19 (GRCh37) produced in the representative mammalian genome annotation was compared to the current human GENCODE annotation set for that assembly (GENCODE V27lift37). Of the 19,233 ICE isoforms detected when running ToFU \citep{gordon2015widespread} against hg19, 12,911 (67.2\%) fuzzy matched a CAT isoform compared to 15,920 (82.8\%) of the human GENCODE annotations. Precision and recall analysis shows results similar to the rat annotation, with better matches in introns. 91.2\% of CAT introns and 75.2\% of CAT protein coding isoforms match GENCODE (Table \ref{table:validation}).
\section*{Discussion}
Gene annotation is a longstanding and critical task in genome informatics that must now be scaled to handle the rapidly increasing number of available genomes. At the time of writing there were 570 vertebrate genomes available from NCBI, but only 100 (17.5\%) and 237 (41.6\%) had Ensembl and RefSeq annotations, respectively.
We introduce CAT to help meet this need, building around a number of key innovations. Firstly, CAT utilizes the reference-free, duplication-aware multiple genome alignments we have developed. This allows CAT to annotate multiple genomes symmetrically and simultaneously, breaking from the traditional pattern of annotating each new genome individually, as is currently the practice for the RefSeq, Ensembl and MAKER gene-building pipelines. Not only does this solve a key scalability issue
by annotating multiple genomes simultaneously and consistently, CAT is able to produce orthology mappings, naming each equivalence class of orthologs based upon an initial reference annotation, and add to this sets of newly discovered genes. This can provide valuable evolutionary insights. For example, the analysis of the rat genome shows that many of the alternative isoforms and projected transcription start sites identified by GENCODE in mouse genes are supported by expression analysis in rat (Supplemental Figure \ref{supp_fig:unsupported_junctions}).
A second key innovation made by CAT is its leveraging of existing reference annotations. A huge amount of effort has been placed into the annotation of key species, such as human and mouse, employing a myriad of technologies and extensive, labor-intensive manual curation. It is very unlikely that this effort will be replicated across a significant fraction of other genomes, so instead we propose the `project and augment' strategy employed by CAT to annotate related genomes. Here we show that this strategy is very clearly able to improve the annotation of great ape genomes, using the human GENCODE set as the reference, and we make the case that we can even improve the annotation of a genome as well studied as the rat.
To circumvent the reference bias of existing annotations and to discover new genes and isoforms, CAT is able to integrate multiple forms of extrinsic information, using multiple, novel parameterizations of the AUGUSTUS algorithms. This includes use of new long-read RNA data, in particular Iso-Seq data, and shortly will integrate nanopore-based long-read data \citep{byrne2017nanopore}. Using this expression data not only allowed us to confirm expression of a substantial fraction of isoforms but allowed us to discover thousands of novel isoforms and dozens of novel genes in the great ape genomes.
With the advent of more affordable \textit{de novo} genome assembly, there is renewed interest in the generation of \textit{de novo} human genomes, and in general the creation of multiple \textit{de novo} genomes for a species. This has the advantage of providing fully independent reconstruction and is particularly appropriate for sequences that are highly divergent from the reference, e.g. structural variations. However, such assemblies do not negate the need for genome comparison; Cactus can be parameterized to rapidly create sensitive whole-genome alignments of human genomes, and here we have demonstrated that CAT can be used to build upon this to produce a high-quality diploid gene annotation and ortholog mapping.
CAT works best when provided RNA-seq data, but for many species this may not be possible. From our experience, a reasonable amount (on the order of 50 million reads) of RNA-seq from tissues like brain and liver is fairly informative. Using poly(A)-selected libraries is recommended, as it greatly reduces false positive predictions in AugustusCGP. Iso-Seq data allowed for the discovery of thousands of novel isoforms in the great apes but may be too expensive for many projects. In clade genomics projects, we would suggest generating RNA-seq for a few of the species and then reliance on the coordinate mapping that AugustusCGP and homGeneMapping provide to evaluate support in other members of the clade.
A key barrier to the use of bioinformatics tools is their ease of use; we have focused on providing cloud agnostic distributions of the CAT software so that, despite its complexity, it can be run within a uniform computational environment by external groups.
CAT is not without limitations. In the future it would be good to use the genome alignments to not only project transcripts, but to use the evolutionary conservation signatures to predict the potential likelihood of projected annotations being coded \citep{lin2011phylocsf}. CAT also does not yet provide means to detect new processed, unprocessed and unitary pseudogene predictions other than via projection of existing annotations. CAT’s current implementation also does not attempt to put weights on the features used for constructing a consensus gene set. Instead, it simply scores transcripts based on the sum of all features evaluated. In the future deep learning methods could be added to CAT to construct feature weights and improve consensus finding, better mimicking the labor intensive efforts of manual annotators who currently weigh such evidence.
An earlier version of CAT was used to annotate the PacBio-based assembly of the gorilla genome \citep{gordon2016long} as well as produce the current Ensembl annotations for 16 laboratory mouse strains as part of the Mouse Genomes Project \citep{lilue2018multiple} (\url{http://www.sanger.ac.uk/science/data/mouse-genomes-project}). In addition, CAT has been proposed for the Vertebrate Genomes Project (VGP), which aims as a pilot project to assemble and annotate one member of every order of vertebrate species. CAT also will be used on the 200 Mammals Project, which aims to add approximately 140 new mammalian genome assemblies to the existing set (\url{https://karlssonlab.org/2017/08/03/the-200-mammals-project/}). These projects will provide a new understanding of gene evolution.
\section*{Materials and Methods}
CAT produces as output a series of diagnostic plots, an annotation set for each target genome, and a UCSC comparative assembly hub \citep{nguyen2014comparative}. Both the pipeline and associated documentation can be found at \url{https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit}. CAT is constructed using the Luigi (\url{https://github.com/spotify/luigi}) workflow manager, with Toil \citep{vivian2017toil} used for computationally intensive steps that work best when submitted to a compute cluster.
\subsection*{RNA-seq}
CAT annotation is improved when species-specific RNA-seq data are provided. These data are used as hints for AugustusTMR and AugustusCGP. In AugustusTMR, RNA-seq helps fill in missing information in the alignment, as well as resolve evolutionary changes. In AugustusCGP, RNA-seq additionally helps prevent false positives inherent in \textit{ab initio} gene finding. For these reasons, RNA-seq was obtained from SRA for all species annotated in this paper. All RNA-seq were aligned to their respective genomes with STAR \citep{dobin2013star} and the resulting BAM files passed to CAT to construct the extrinsic hints database. See Table \ref{supp_table:rnaseq_sra_table} for accessions and tissue types of RNA-seq data used for annotation. In addition, for the PacBio great ape annotation, RNA-seq data were generated using iPSC lines for human, chimpanzee, gorilla and orangutan derived from cells for the same individuals as the assemblies (Kronenberg et al., submitted). For all expression analyses, Kallisto \citep{bray2015near} was used.
\subsection*{Annotation set similarity analysis}
Jaccard similarity analysis was performed with BEDTools \citep{quinlan2010bedtools}. The rat locus overlap analysis was performed with the Kent tool clusterGenes, which requires exonic overlap on the same strand.
\subsection*{Iso-Seq}
Iso-Seq full-length non-chimeric reads (FLNC) were also generated from the great ape iPSC lines and aligned with GMAP \citep{wu2005gmap}.To perform isoform-level validation in the primates, the Iso-Seq data used as input to CAT were also clustered with isoform-level clustering (ICE) and then collapsed into isoforms using ToFU \citep{gordon2015widespread}. Ensembl provided a pre-release of their new V91 annotations for panTro4 and gorGor4, but did not yet run their updated pipeline on ponAbe2.
\subsection*{ICE validation}
The output transcripts from ICE were compared to various annotation sets in both an exact and fuzzy matching scheme. In the exact scheme, the genomic order and positions of all of the introns (an intron chain) of a transcript are compared to any ICE isoforms which overlap it. In the fuzzy matching scheme, each annotated intron chain is allowed to move up to 8 bases in either direction and still be called a match.
\subsection*{BUSCO}
The mammalian BUSCO \citep{simao2015busco} analysis was performed using the mammalia\_odb9 set of 4,104 genes. BUSCO was ran against the complete protein-coding sequence repertoire produced by CAT in that species in the `protein' mode.
\subsection*{Progressive Cactus}
All Cactus alignments except the 14-way mammal alignment were generated using Progressive Cactus (\url{https://github.com/glennhickey/Progressive Cactus}) commit 91d6344. For the mouse-rat alignment, the guide tree was
\begin{lstlisting}
(((Lesser_Egyptian_jerboa:0.1,(Mouse:0.084509,Rat:0.091589)mouse_rat:0.107923)rodent:0.148738,Rabbit:0.21569)glires:0.015313,Human:0.143908);
\end{lstlisting}
For the primate alignments, the guide tree was
\begin{lstlisting}
(((((((Susie_Gorilla:0.008964,(Human:0.00655,Clint_Chimp:0.00684)human_chimp:0.00122)gorilla_chimp_human:0.009693,Susie_Orangutan:0.01894)great_ape:0.003471,Gibbon:0.02227)great_ape_gibbon:0.01204,Rhesus:0.004991)old_world_monkey:0.02183,Squirrel_monkey:0.01035)monkey:0.05209,Bushbaby:0.1194)primate_anc:0.013494,Mouse:0.084509);
\end{lstlisting}
An identical tree (with different assembly names) was used for the alignment of current reference great apes.
For the diploid human alignments, the two haploid cell lines (PacBio) or all human haplotypes (10x) were placed under the same node with a very short branch length, with chimpanzee as outgroup. The guide trees were
\begin{lstlisting}
((hg38:0.001,chm1:0.001,chm13:0.001)human:0.01,chimp:0.01);
\end{lstlisting}
and
\begin{lstlisting}
((hg38:0.001,HG00512-H1:.001,HG00512-H2:.001,NA12878-H1:.001,NA12878-H2:.001,NA19240-H1:.001,NA19240-H2:.001,NA24385-H1:.001,NA24385-H2:.001)human:0.01,chimp:0.01);
\end{lstlisting}
representing a star phylogeny of the three human assemblies. For the 14-way mammal alignment, the Progressive Cactus commit used was e3c6055 and the guide tree was
\begin{lstlisting}
((((oryCun2:0.21,((Pahari_EiJ:0.03,mm10:0.025107)1:0.02,rn6:0.013)1:0.252)1:0.01,((((hg19:0.00642915,panTro4:0.00638042)1:0.00217637,gorGor3:0.00882142)1:0.00935116,ponAbe2:0.0185056)1:0.00440069,rheMac3:0.007)1:0.1)1:0.02,((oviAri3:0.019,bosTau8:0.0506)1:0.17,(canFam3:0.11,felCat8:0.08)1:0.06)1:0.02)1:0.02,loxAfr3:0.15);
\end{lstlisting}
Slightly out-of-date versions of some assemblies (hg19 and rheMac3) were used because a collaborator had data on those assemblies that they wished to use the alignment to analyze. The rodent and primate subtrees were first aligned separately (the rodent subtree originally included additional mouse strains) \citep{thybert2018repeat,lilue2018multiple}. The two subtrees were then “stitched” together into a single alignment by aligning together their roots along with several Laurasiatheria genomes. This was done to save alignment time by reusing existing alignments.
\subsection*{CAT}
CAT was run on the UCSC Genome Browser compute cluster for all annotation efforts in this publication. CAT commit f89a814 was used. For a detailed description of how CAT works, see both the supplementary text as well as the README.md on GitHub (\url{https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit}).
\subsection*{Pipeline runtime}
CAT is relatively efficient, taking on the order of thousands of core hours to run. The largest considerations for runtime are running the various parameterizations of AUGUSTUS as well as generating the required cactus alignment. AugustusCGP may run significantly faster on alignments with many genomes by reducing the chunk size from the default, but at the cost of lower quality predictions. AugustusTMR runtime scales linearly with the number of protein-coding transcripts in the input annotation set, but scales non-linearly with the number of extrinsic hints provided, particularly if the hints are contradictory.
All of the analyses in this paper were run on the UCSC cluster, which uses the cluster management tool Parasol and has 1024 cores with 8GB of RAM per core. CAT was optimized for this, and shouldn't need more memory per core in any case except the AugustusCGP step when the number of aligned genomes exceeds \~10. This can be adjusted by reducing the alignment chunk size that AugustusCGP is given to work with. For example, for the 14-way mammalian analysis, the flags \texttt{--maf-chunksize 1000000 --maf-overlap 200000} were set which kept memory usage under 8GB.
Cactus alignments take on the order of 120 CPU days (2,880 core hours) per internal node on the guide tree, assuming a binary tree. This number can fluctuate by a factor of 2-4 depending on how similar the two genomes being aligned at that node are. Cactus alignments are a mix of high CPU low memory steps with a few high memory steps, with some jobs requiring \~240GB of RAM.
Running CAT on the PacBio primate genomes took a total of 7,030 core hours, with 3,437 of those dedicated to running AugustusTMR, 1,191 dedicated to running AugustusPB, and 2,190 dedicated to running AugustusCGP. Running CAT on the 14-way mammalian alignment took a total of 24,122 core hours, with 14,045 of those dedicated to running AugustusTMR, and 8,225 dedicated to running AugustusCGP.
\section*{Software Availability}
CAT is available on GitHub (\url{https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit}). The exact commit used for these analyses is also in the Supplementary Materials section.
\section*{Acknowledgements}
We would like to thank Brian Raney, Hiram Clawson and the rest of the UCSC Genome Browser team for help creating browser tracks. We would also like to thank James Kent for allowing UCSC Genome Browser compute resources to be used for this project. Finally, we would like to thank Fergal Martin and Paul Flicek for revising the paper and providing a pre-release of the Ensembl V91 annotations on chimpanzee and gorilla, as well as the whole GENCODE consortium for their support and advice. This work was supported, in part, by US National Institutes of Health (NIH) grants U24HG009081 and U41HG007635 to E.E.E., HG007990 to D.H and B.P, and HG007234 to B.P. E.E.E. and D.H. are investigators of the Howard Hughes Medical Institute. (COI: E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc. and I.T.F. is an employee of 10x Genomics.)
\bibliography{references}
\include{figures}
\include{supplement}
\end{document}