The EnsEMBL Genome Annotation System
We use Linuxbrew to install all our software you will need to tap cask from https://github.com/Ensembl/homebrew-cask before being able to install ensembl/cask/genebuild-annotation
.
You will need to ask for a licence for some of the software we use:
- crossmatch, part of phred/phrap: used by RepeatMasker, you could use NCBI blast instead
- trf
- repbase: you could use RepeatModeler to generate a repeat library. However the generated library could mask some genes coding for some protein families
We recommend that you clone all the repositories into one directory
Repository name | branch | URL |
---|---|---|
ensembl | default | https://github.com/Ensembl/ensembl.git |
ensembl-hive | default | https://github.com/Ensembl/ensembl-hive.git |
ensembl-compara | release/98 | https://github.com/Ensembl/ensembl-compara.git |
ensembl-variation | default | https://github.com/Ensembl/ensembl-variation.git |
ensembl-production | default | https://github.com/Ensembl/ensembl-production.git |
ensembl-taxonomy | default | https://github.com/Ensembl/ensembl-taxonomy.git |
ensembl-orm | default | https://github.com/Ensembl/ensembl-orm.git |
ensembl-killlist | default | https://github.com/Ensembl/ensembl-killlist.git |
ensembl-datacheck | default | https://github.com/Ensembl/ensembl-datacheck.git |
ensembl-metadata | default | https://github.com/Ensembl/ensembl-metadata.git |
ensembl-io | default | https://github.com/Ensembl/ensembl-io.git |
For each of these repository, you will need to install their dependencies using the cpanfile provided in their Git repositories
You can use the Ensembl git commands and run the following command to clone the repositories
git ensembl --clone genebuild
Repository name | branch | URL |
---|---|---|
ensembl-genes | default | https://github.com/Ensembl/ensembl-genes.git |
You will need to create two virtual environment:
genebuild
using the requirements.txt file; it needs to be activated for the pipeline to rungenebuild-mirna
using the requirements_p36_ncrna.txt file; it will be used directly from the analysis, no need to activate it
If you are not part of the Ensembl Genebuild team, you will need to set some shell environment variables to avoid having to provide the information to the configuration files. We will assume you are using your home directory
Variable | Value | Hive configuration parameter | Description |
---|---|---|---|
ENSCODE | $HOME | -enscode_root_dir | Directory path where you cloned all the Perl repositories |
ENSEMBL_SOFTWARE_HOME | $HOME | -software_base_path | Directory where pyenv, plenv and linuxbrew are installed |
LINUXBREW_HOME | $HOME/.linuxbrew | -linuxbrew_home_path | Base directory for your Linuxbrew installation |
PYTHONPATH | $HOME/ensembl-genes/ensembl_genes:$HOME/ensembl-hive/wrappers/python3/ | It needs to be set until the package can be installed properly | |
BLASTDB_DIR | $HOME | It will be used to find the path to the entry_loc file which is the list of accession from SwissProt which would be located at $HOME/uniprot/2021_03/entry_loc |
We currently use MySQL databases to store our data. To avoid having to do many changes to the configuration files we recommend having one read-only user and one read-write user. It is also better to use different servers for keeping the eHive pipeline database, the DNA database and the "data" databases.
There is a main configuration file, Bio::EnsEMBL::Analysis::Hive::Config::Genome_annotation_conf
, which will generate a set of pipelines to:
- load the DNA into the
dna_db
database - mask repeats
- align multiple sets of data to the genome to produce gene models depending on the data available
- select the best transcripts for each loci
- produce the finalised databases for an Ensembl release The whole system is explained in more details below
You will need to activate the genebuild virtual environment
pyenv activate genebuild
If you are operating within an environment prepared for Ensembl with the assembly registry you can use the $ENSCODE/ensembl-analysis/scripts/genebuild/create_annotation_configs.pl
.
You would need to edit $ENSCODE/ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Hive/Config/genome_annotation.ini
Then you can run
perl $ENSCODE/ensembl-analysis/scripts/genebuild/create_annotation_configs.pl --config_file $ENSCODE/ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Hive/Config/genome_annotation.ini --assembly_registry_host <host_name> --assembly_registry_port <port>
If you're setup do not work with the create_annotation_configs.pl script, you would need to edit $ENSCODE/ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Hive/Config/Genome_annotation_conf.pm
and fill in any information according to you environment
Then you can run
perl $ENSCODE/ensembl-hive/scripts/init_Pipeline.pl Bio::EnsEMBL::Analysis::Hive::Config::Genome_annotation_conf [extra parameters]
To start the pipeline you need the URL to your pipeline database which will be provided when running the init_Pipeline.pl script. If you initialised the pipeline automatically, you need to look at the command file created in your working_dir
directory to retrieve the information.
export EHIVE_URL=mysql://readwrite_user:password@host:port/dbname
You can now start the pipeline with
perl $ENSCODE/ensembl-hive/scripts/beekeeper.pl -url $EHIVE_URL -loop
If you only want to run some analyses, you can run
perl $ENSCODE/ensembl-hive/scripts/beekeeper.pl -url $EHIVE_URL -loop -analyses_pattern 1..5
To follow the pipeline steps, it is better to use GuiHive, a graphical interface to ensembl-hive, which allows you to change parameters, debug your problems and much more https://github.com/Ensembl/guiHive
You should first look at the job tab to know the reason of the failure
- Insufficient memory: you can either use a different resource or add a new one more suited to your needs
- Error in the code: I'm afraid you will need to do proper debugging
Once you are happy with your fix, you would need to reset the jobs with
perl $ENSCODE/ensembl-hive/scripts/beekeeper.pl -url $EHIVE_URL -reset_failed_jobs
and restart the pipeline
By default ensembl-hive redirect all output to /dev/null
unless you used some logging parameters.
You will need to run the problematic job with runWorker. First you will need to retrieve the job id using GuiHive or the pipeline database. Then you can run
perl $ENSCODE/ensembl-hive/scripts/runWorker.pl -url $EHIVE_URL -debug 1 -job_id XX
Using a higher value for -debug
is usually not useful as it is mostly seen as a boolean flag.
First you would need to go on the job tab and do a middle click/right click on the guihive
link. You would need to insert the password twice, this is normal and not a scam.
Then from the "GuiHive" of the sub pipeline, you need to make your changes/debugging the same way as a normal pipeline. Once this is done, you can simply reset the main pipeline and restart the main pipeline
We use the main/sub terminology to define the different part of the system. We could have used parent/child
- The "main" pipeline will run multiple analyses and will initialise at least one sub pipeline and run this sub pipeline
- The "sub" pipeline will run multiple analyses and can potentially be the "main" pipeline of a different pipeline. A sub pipeline can be run on its own
A pipeline needs three analyses to be called a "main" pipeline:
create_<sub pipeline name>_jobs
initialise_<sub pipeline name>
run_<sub pipeline name>
The main reason for this analysis is to provide an easy access to the sub pipeline guiHive and the pipeline database details
It is usually a Bio::EnsEMBL::Hive::RunnableDB::JobFactory
where the input list will have at least one element with four key value pair. The reason for this analysis it
- ehive_url: the sub pipeline database connection URI,
mysql://rw\_user:password@host:port/database
, useful for debugging as it can be used - external_link: the sub pipeline guiHive URL
- meta_pipeline_db: the sub pipeline database connection hash,
{'-dbname' => 'database','-driver' => 'mysql','-host' => 'host','-pass' => 'password','-port' => 3306,'-user' => 'rw_user'}
- pipeline_name: the name of the sub pipeline
Example analysis config
{
-logic_name => 'create_rnaseq_db_pipeline_job',
-module => 'Bio::EnsEMBL::Hive::RunnableDB::JobFactory',
-parameters => {
column_names => ['meta_pipeline_db', 'ehive_url', 'pipeline_name', 'external_link'],
inputlist => [[$rnaseq_db_pipe_db, $rnaseq_db_pipe_url, 'rnaseq_db_'.$self->o('production_name'), $rnaseq_db_guihive]],
guihive_host => $self->o('guihive_host'),
guihive_port => $self->o('guihive_port'),
},
-rc_name => 'default',
-max_retry_count => 0,
-flow_into => {
2 => ['initialise_rnaseq_db'],
}
},
This analysis will initialise the sub pipeline.
It must be a Bio::EnsEMBL::Analysis::Hive::RunnableDB::HiveMetaPipelineInit
, it will simply generate the commandline which will be run with $ENSCODE/ensembl-hive/scripts/init_pipeline.pl
There is a set of parameters which are needed:
- hive_config: the name of the config to use
- databases: list of keys which contains the connection details of any database to be used such as 'dna_db'. The key value pairs should not be in
extra_parameters
. - extra_parameters: any parameter to be provided to the initialisation script.
species_name => 'homo_sapiens'
would be used as-species_name homo_sapiens
- metadata_pipeline_db: the connection details of the pipeline database, this would usually be provided by the
create_<sub pipeline name>_jobs
analysis
There is one special parameter which is not required but can be helpful, commandline_params
, which will not process the value associated. It can be used for re-initialising a sub pipeline with something like -hive_force_init 1
or -hive_force_init 1 -drop_databases 1
if the pipeline has already been started/run and you want to drop any databases created by the pipeline.
Arrays and hashes cannot be easily passed to the init script on the command line. Also Hive does not overwrite arrays, it always add elements to the parameter if the array already exist.
Example analysis config
{
-logic_name => 'initialise_rnaseq_db',
-module => 'Bio::EnsEMBL::Analysis::Hive::RunnableDB::HiveMetaPipelineInit',
-parameters => {
hive_config => $self->o('hive_rnaseq_db_config'),
databases => ['rnaseq_db', 'rnaseq_refine_db', 'rnaseq_blast_db', 'dna_db'],
rnaseq_db => $self->o('rnaseq_db'),
rnaseq_refine_db => $self->o('rnaseq_refine_db'),
rnaseq_blast_db => $self->o('rnaseq_blast_db'),
dna_db => $self->o('dna_db'),
enscode_root_dir => $self->o('enscode_root_dir'),
extra_parameters => {
output_path => $self->o('output_path'),
user_r => $self->o('user_r'),
dna_db_host => $self->o('dna_db_host'),
dna_db_port => $self->o('dna_db_port'),
databases_host => $self->o('databases_host'),
databases_port => $self->o('databases_port'),
release_number => $self->o('release_number'),
production_name => $self->o('production_name'),
species_name => $self->o('species_name'),
registry_host => $self->o('registry_host'),
registry_port => $self->o('registry_port'),
registry_db => $self->o('registry_db'),
assembly_name => $self->o('assembly_name'),
assembly_accession => $self->o('assembly_accession'),
},
},
-rc_name => 'default',
-max_retry_count => 0,
-flow_into => {
1 => ['run_rnaseq_db'],
},
},
This analysis will run the sub pipeline
It must be a Bio::EnsEMBL::Analysis::Hive::RunnableDB::HiveMetaPipelineRun
, it will run beekeeper in a loop. If the job is a retry, it will first reset all failed jobs in the sub pipeline and start beekeeper.
There is only one parameter to be passed, beekeeper_script
, it is not required but recommended.
Example analysis config
{
-logic_name => 'run_rnaseq_db',
-module => 'Bio::EnsEMBL::Analysis::Hive::RunnableDB::HiveMetaPipelineRun',
-parameters => {
beekeeper_script => $self->o('hive_beekeeper_script'),
},
-rc_name => 'default',
-max_retry_count => 1,
},
The main pipeline will query ENA to retrieve the possible short read and long read accessions which would be used in the corresponding sub pipelines. Then start each sub pipeline below in the order they appear in this document.
By default, ENA is queried using the NCBI taxonomy id of the species, but you can provide either a single project id or a list of project ids to study_accession
It will create a registry file which will be used for the whole genome alignment and the DataChecks
It will provide stats on the assembly and on the repeat masking which will be sent to the email provided.
Before starting the alignments sub pipelines, it will reset a set of arrays in the "Transcript selection" sub pipeline to allow any related sub pipeline to provide the database connection details some analyses will need.
It will create the core database which is referenced as dna_db
, reference_db
and sometimes core_db
in configuration files. It will load a set of static tables. It will process the assembly report file to load the sequence names and synonyms and the dna. If UCSC sequence accessions exist in the report, they will be loaded as sequence synonyms
When there is a known mitochondrial sequence, it will load the mitochondrial dna and genes. Finally, it will dump the genome in a fasta file and index it for faidx access.
None
If RefSeq has not created a sister assembly (GCF_*), we will not load any mitochondrial data even if a RefSeq mitochondrial annotation exists. Because the mitochodrial sequence is not used in the annotation process it can be checked before the gene set is released with a query on the NCBI website, searching for a mitocondrial annotation from RefSeq for the species of interest.
It checks on the assembly report if a corresponding RefSeq assembly exists. It will then load the RefSeq sequence accessions as sequence synonyms and the RefSeq gene set using their GFF3 annotation file
The gene set is not used for annotation purpose. It can be used to compare the gene set generated and Ensmbl users appreciate the possibilty of looking at both gene sets in one location.
None
None
It will run RepeatMasker using RepBase with the closest clade library and it will run dustmasker and TRF.
It will verify the presence of a RepeatModeler library file and run RepeatMasker with this library if needed.
It will run Red, a different repeat masking program if replace_repbase_with_red_to_mask
is set to 1.
None
Red and RepeatMasker with a RepeatModeler library may mask the genome where protein gene families could be. However in the absence of a RepBase entry for the species of interest, they will be helpful.
It uses an Ensembl Compara pipeline to do the whole genome alignment of your species genome against a high quality assembly. We usually use human GRCh38 for all species except the rodents where we would use mouse GRCm39. You can choose a different species for the source. You will need to check the quality of the assembly and the quality of the annotation.
You will need to use a Compara master database.
It is possible to avoid running this part by setting skip_projection
to 1.
The results of this sub pipeline will be used for the projection sub pipeline.
None
The pipeline provides the best result when the two species are not too divergent. For example, it does not provide good information when used with any fish if human is the source.
It will use the whole genome alignment of the LastZ sub pipeline to project the high quality source annotation onto the genome of interest. It will also use CESAR2.0 to project the high quality source annotation onto the genome of interest. Then we select the best model for each projected gene based on coverage and identity. The selected model can be used later in the transcript selection sub pipeline.
- proj_sel database to transcript selection sub pipeline
- create_toplevel_slices
- split_slices_on_intergenic
- layer_annotation
None
We align a controlled set of proteins grouped by species/family/clade depending on the clade of the species of interest using GenBlast. The models are classified based on their coverage and identity. The models with the highest coverage and the highest identity will be preferably used when selecting the transcripts.
The controlled sets are found in modules/Bio/EnsEMBL/Analysis/Hive/Config/UniProtCladeDownloadStatic.pm
- genblast_nr database to transcript selection sub pipeline
- create_toplevel_slices
- split_slices_on_intergenic
- layer_annotation
None
We align the cDNAs specific to the species onto the genome using Exonerate. First we align all possible sequences, they will be used later to add UTRs. Then we only align the full length cDNAs using the first alignments to reduce the search space and using the reported CDS start and CDS end. When internal stops are found, the model is removed unless we found only one. In this case we replace the stop codon with an intron to avoid any assembly error to interfere.
We also align the species specific proteins retrieved from UniProt with protein existence (PE) level 1 and 2 (protein evidence and transcript evidence). We use PMatch to reduce the search space then Exonerate and GeneWise in two different analyses. We ran a final Exonerate on the whole genome as back up.
We align the species specific seleno proteins and mark the internal stop with an attribute
The final step will select the best model for each protein of a locus with the following ranking:
- cdna
- seleno protein
- edited cdna
- protein
- backup protein
-
bt database to transcript selection sub pipeline
- create_toplevel_slices
- split_slices_on_intergenic
- layer_annotation
-
cdna database to transcript selection sub pipeline
- run_utr_addition
- run_utr_addition_10GB
- run_utr_addition_30GB
- utr_memory_failover
None
It uses a known set of human IG/TR genes and align them using GenBlast with customised parameters to restrict the size of the introns. It collapses the models created at each locus
- igtr database to transcript selection sub pipeline
- create_toplevel_slices
- split_slices_on_intergenic
- layer_annotation
None
It aligns sequence from the RFam database to the genome using Blast and uses the Infernal software suite to assess the short non coding genes.
Sequences from miRBase are aligned to the genome and are filtered using a
None
The genebuild-mirna
virtual environment is needed but do not need to be activacted as we use the full path to Python.
We use STAR to align the set of Illumina short reads downloaded. Then we generate models with Scallop based on the STAR alignments. We determine the coding potential of the models using blast on a sub set of UniProt proteins, PE 1 and 2. For non vertebrate we also use CPC2 and Samba as UniProt does not have enough data. The models are collapsed by loci for the transcript selection pipeline.
-
rnalayer_nr database to transcript selection sub pipeline
- create_toplevel_slices
- split_slices_on_intergenic
- layer_annotation
-
rnalayer_nr database to transcript selection sub pipeline
- run_utr_addition
- run_utr_addition_10GB
- run_utr_addition_30GB
- utr_memory_failover
-
rnalayer_nr database to homology rnaseq sub pipeline
- genblast_rnaseq_support
- genblast_rnaseq_support_himem
-
scallop_blast database to main pipeline
- initialise_rnaseq
The fastq files downloaded will be deleted to free disk space as it can easily be above the terabyte
It will use all the splice sites found by the short read alignment step and compare them with the introns found in the protein models from GenBlast. The models will be ranked based on the completeness of the intron support.
- gb_rnaseq_nr database to transcript selection sub pipeline
- create_toplevel_slices
- split_slices_on_intergenic
- layer_annotation
It is not run when there is no short read data
We use Minimap2 to align the long reads to the genome and the resulting models are collapsed to help reduce the number of isoforms when there is a slight difference in final exons for example. It also try to find the correct canonical splice sites using the redundancy. The default data type is PacBio but ONT can be used too.
-
lrfinal database to transcript selection sub pipeline
- create_toplevel_slices
- split_slices_on_intergenic
- layer_annotation
-
lrfinal database to transcript selection sub pipeline
- run_utr_addition
- run_utr_addition_10GB
- run_utr_addition_30GB
- utr_memory_failover
The fastq files downloaded will be deleted to free disk space as it can easily be above the terabyte
It will look at all the protein coding data generated by sub pipelines and using a layer mechanism it will decide which model to use for each loci. It will also add the UTR regions to the models based on the transcriptomic data sets; cDNA, RNA-seq and long reads. It will trim the UTR regions when necessary.
It will look for pseudogenes. The criterion are single exon models which have a good blast hit with a multi exon model or if the model is within an LTR region. Multi exon models with very short introns are also seen as pseudogenes.
Long non coding RNA will be labelled but we need transcriptomic data.
Short non coding RNA will be copied over unless they overlap protein coding genes.
Some cleaning is done to remove low quality models which haven't been labelled as pseudogenes or lncRNA or when more information was needed about the gene set.
None
None
We copy the final gene set into the core database while making a last check for readthroughs and bad UTRs.
The stable id needs to be generated, either with the simple script when it's a new species or with the stable id mapping pipeline when it is an update to the assembly. If it is an update to the assembly, the mapping_required
parameter of run_stable_ids should be set to 1. Once the mapping has been done you can add a skip_analysis
parameter to the same analysis and set it to 1 to restart the pipeline.
Then the database is prepared to be ready for handover.
None
The external db ids of the supporting evidences are set using a script which when it breaks can brak many things. So if the analysis load_external_db_ids_and_optimise_af_tables fails, you have to be very careful rerunning the job.
The keys of the meta table are set for RapidRelease. If you want to handover for the Main website, you will need to manually do some tweaks
It will create an empty database from the core database and copy the genes from the following databases:
- cdna
- refseq
- lrinitial
The database is then prepared for handover.
None
When this pipeline is started after a successful core handover and the hopefully unlikely modification have been done on the database from the pipeline, the meta table should not need any updates
It will copy the database provided by the short read pipeline and sort the dna_align_feature table which contains the introns. This is necessary to speed up the MySQL queries as the table contains millions of rows. It will the assign the external db id for protein evidences and make sure the data_file table which contains the filename of the BAM/BigWig files. It will upload the web_data and analyses descriptions to the Production database for the configuration matrix to work as expected on the website. It will generate BigWig files from the tissue BAM files and generate a README and a md5sum file for the FTP. And finally prepare the database for handover.
None
When this pipeline is started after a successful core handover and the hopefully unlikely modification have been done on the database from the pipeline, the meta table should not need any updates