
Updating gene and gene_alias tables

pieterlukasse edited this page May 24, 2016 · 20 revisions

The cBioPortal scripts package provides a simple script to update your local gene and gene_alias tables based on a new version of the NCBI genes file.

Cleaning up DB (in case of new installation)

Follow these steps if you want to reset your DB to the most recent gene list from NCBI.

Steps:

1- Remove all studies from your installation. You can use the study removal tool.

2- (if your DB engine supports FK constraints, e.g. InnoDB) Drop constraints:

ALTER TABLE cosmic_mutation
  DROP FOREIGN KEY cosmic_mutation_ibfk_1;
  
ALTER TABLE sanger_cancer_census
  DROP FOREIGN KEY sanger_cancer_census_ibfk_1;
    
ALTER TABLE uniprot_id_mapping
  DROP FOREIGN KEY uniprot_id_mapping_ibfk_1;

3- Empty the gene and gene_alias tables:

TRUNCATE TABLE gene_alias;
TRUNCATE TABLE gene;

4- Restart cBioPortal (restart the web server) to clear any cached gene lists.

5- Import the gene data again (see the section "Running the script" below).

6- ⚠️ Check the gene and gene_alias tables to verify that they are filled correctly.

7- Clean up old data that refers to genes no longer in the gene table:

DELETE FROM cosmic_mutation WHERE ENTREZ_GENE_ID NOT IN (SELECT ENTREZ_GENE_ID FROM gene);
DELETE FROM sanger_cancer_census WHERE ENTREZ_GENE_ID NOT IN (SELECT ENTREZ_GENE_ID FROM gene);
DELETE FROM uniprot_id_mapping WHERE ENTREZ_GENE_ID NOT IN (SELECT ENTREZ_GENE_ID FROM gene);
DELETE FROM interaction WHERE GENE_A NOT IN (SELECT ENTREZ_GENE_ID FROM gene) OR GENE_B NOT IN (SELECT ENTREZ_GENE_ID FROM gene);
DELETE FROM drug_interaction WHERE target NOT IN (SELECT ENTREZ_GENE_ID FROM gene);
DELETE FROM mutation_event WHERE ENTREZ_GENE_ID NOT IN (SELECT ENTREZ_GENE_ID FROM gene);
DELETE FROM cna_event WHERE ENTREZ_GENE_ID NOT IN (SELECT ENTREZ_GENE_ID FROM gene);

8- (if your DB engine supports FK constraints, e.g. InnoDB) Restore constraints:

ALTER TABLE cosmic_mutation
  ADD FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`);
  
ALTER TABLE sanger_cancer_census
  ADD FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`);

ALTER TABLE uniprot_id_mapping
  ADD FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`);
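
The check in step 6 can be scripted. The following is only a minimal sketch, not part of the cBioPortal scripts package: `check_gene_tables` and the `MYSQL_CMD` variable are hypothetical names, and the default database name `cbioportal` is an assumption that you should replace with your own client invocation (credentials, host, schema). It only reports row counts and flags an empty table.

```shell
# Hypothetical helper for step 6: report row counts for gene and gene_alias.
# MYSQL_CMD is an assumption; point it at however you normally invoke the
# mysql client for your cBioPortal schema (credentials, host, database name).
check_gene_tables() {
  status=0
  for table in gene gene_alias; do
    # -N drops the column header, -B gives plain batch (tab-separated) output.
    count=$(${MYSQL_CMD:-mysql cbioportal} -N -B -e "SELECT COUNT(*) FROM $table;") || return 1
    echo "$table: $count rows"
    # An empty table after the import means something went wrong.
    [ "$count" -gt 0 ] || status=1
  done
  return $status
}
```

Call `check_gene_tables` after step 5; a non-zero exit status means the query failed or at least one table is empty. Row counts alone do not prove the content is right, so spot-checking a few well-known genes is still worthwhile.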

Updating gene table without removing the existing studies

TODO - harder process (will probably not be needed once re-importing existing studies is made easy - which should be the case soon)

Running the script

To run the script, type the following commands from the folder <your_cbioportal_dir>/core/src/main/scripts:

 export PORTAL_HOME=<your_cbioportal_dir>

and then

./importGenes.pl <ncbi_genes.txt>

If you also wish to add gene lengths to your gene table, download the GENCODE annotation file for GRCh38 (gencode.v24.annotation.gtf). After downloading, go to your downloads directory and run the following command, which writes the CDS and exon records to a BED file:

grep -v ^# gencode.v24.annotation.gtf | perl -ne 'chomp; @c=split(/\t/); $c[0]=~s/^chr//; $c[3]--; $c[8]=~s/.*gene_name\s\"([^"]+)\".*/$1/; print join("\t",@c[0,3,4,8,5,6])."\n" if($c[2] eq "CDS" or $c[2] eq "exon")' > all_exon_loci.bed

TODO: add documentation about running importGenes.pl with ncbi_genes.txt and all_exon_loci.bed

Example:

./importGenes.pl Homo_sapiens_gene_info.txt