-
Notifications
You must be signed in to change notification settings - Fork 3
Updating gene and gene_alias tables
This manual is intended for users that have knowledge about the structure of the cBioPortal seed database.
When loading studies into cBioPortal it is possible for warnings to occur that are caused by an outdated seed database. Gene symbols can be deprecated or be assigned to a different Entrez Gene in a new release. Also Entrez Gene IDs can be added. This markdown explains how to update the seed database, in order to use the most recent Entrez Gene IDs.
The cBioPortal scripts package provides a method to update the gene
and gene_alias
tables. This requires the latest version of the NCBI Gene Info.
Execute these steps in case you want to reset your database to the most recent genes list from NCBI.
1- Remove all studies from your installation. You can use the study removal tool. Also empty tables mutation_event
and cna_event
TRUNCATE TABLE mutation_event;
TRUNCATE TABLE cna_event;
Another way of obtaining an empty database is starting a new MySQL database with the previous seed database.
2- If DB engine supports foreign key (FK) constraints, e.g. InnoDB, drop constraints:
ALTER TABLE cosmic_mutation
DROP FOREIGN KEY cosmic_mutation_ibfk_1;
ALTER TABLE sanger_cancer_census
DROP FOREIGN KEY sanger_cancer_census_ibfk_1;
ALTER TABLE uniprot_id_mapping
DROP FOREIGN KEY uniprot_id_mapping_ibfk_1;
3- Empty tables gene
and gene_alias
TRUNCATE TABLE gene_alias;
TRUNCATE TABLE gene;
4- Restart cBioPortal (restart webserver) to clean-up any cached gene lists.
5- You probably also want to update the gene lengths. This is dependent on the reference genome you want to use. Download .annotation.gtf.gz from the latest GENCODE release from http://www.gencodegenes.org/releases/current.html for hg38 / GRCh38. For the hg19 / GRCh37, click on 'Go to GRCh37 version of this release'.
After downloading, go to your downloads directory, decompress the file and run the following command (replace <gencode.v25.annotation.gtf> with the decompressed file):
grep -v ^# <gencode.v25.annotation.gtf> | perl -ne 'chomp; @c=split(/\t/); $c[0]=~s/^chr//; $c[3]--; $c[8]=~s/.*gene_name\s\"([^"]+)\".*/$1/; print join("\t",@c[0,3,4,8,5,6])."\n" if($c[2] eq "CDS" or $c[2] eq "exon")' > all_exon_loci.bed
6- To import gene data type the following commands when in the folder <your_cbioportal_dir>/core/src/main/scripts
:
export PORTAL_HOME=<your_cbioportal_dir>
./importGenes.pl --genes <Homo_sapiens_gene_info.txt> --exon-loci <all_exon_loci.bed>
7- gene
and gene_alias
tables to verify that they are filled correctly.
8- Additionally, there are several other tables you may want to update now.
-
Updating the pfam graphics if you want to see 3D structures, explained in this document
-
Updating the COSMIC coding mutations, can be downloaded from here and require the script
importCosmicData.pl
9- Clean-up old data:
SET SQL_SAFE_UPDATES = 0;
DELETE FROM cosmic_mutation where ENTREZ_GENE_ID not in (SELECT ENTREZ_GENE_ID from gene);
DELETE FROM sanger_cancer_census where ENTREZ_GENE_ID not in (SELECT ENTREZ_GENE_ID from gene);
DELETE FROM uniprot_id_mapping where ENTREZ_GENE_ID not in (SELECT ENTREZ_GENE_ID from gene);
DELETE FROM interaction where GENE_A not in (SELECT ENTREZ_GENE_ID from gene) or GENE_B not in (SELECT ENTREZ_GENE_ID from gene);
DELETE FROM drug_interaction where target not in (SELECT ENTREZ_GENE_ID from gene);
SET SQL_SAFE_UPDATES = 1;
commit;
10- If DB engine supports FK constraints, e.g. InnoDB, restore constraints:
ALTER TABLE cosmic_mutation
ADD FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`);
ALTER TABLE sanger_cancer_census
ADD FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`);
ALTER TABLE uniprot_id_mapping
ADD FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`);