This repository contains the course materials for hands-on exercise of the "Variants Annotation and Phenotype Analysis" session in the Quantitative Genomics workshop at Columbia University on June 24-25, 2024.
Both the lectures and hands-on exercises will be taught via Zoom video conference online. To ensure the cross-platform compatibility, we will only use Rstudio Cloud to implement tools that are developed in Perl and Python.
Students need to go to Posit Cloud to create your own account before the lectures:
-
After you input your information, click "Sign up".
If you logout after signing up, you can go to RStudio Cloud and login with your username and password. After you are added to "2024 Quantitative Genomics" project, you can click "Annotation2024" to start the exercise. The direct link is provided in Slack channel.
When the student opens an assignment project, RStudio Cloud will automatically make a copy for them. In other words, each student will work on your own copy of the virtual machine without influencing each other's files.
By default, you will be in "Console" as shown below where you can run R code.
You can also click "Terminal" so that you can run linux commands.
During the exercise, you can feel free to swith to Terminal when you need to run linux commands and to Console when you need to run R code.
After login to Rstudio Cloud, you already have terminal access by clicking 'terminal' button and can run all basic Linux commands. If you are not familiar with basic Linux commands, you can follow one simple tutorial to learn several commands: https://maker.pro/linux/tutorial/basic-linux-commands-for-beginners. Some of the commands that we will use in the exercise include ls
, pwd
, cd
, mv
and wget
.
Typically you will go to the ANNOVAR website, fill in a registration form, and download the package there. For this exercise, we already prepared a folder in the cloud server that contains a "compact" version of ANNOVAR and necessary library files, to make it easier for users. The folder is called genomics_exercise
and you can simply use cd genomics_exercise
to go to this folder). To confirm which directory you are in, you can type pwd
. You will see the example below.
/cloud/project$ cd genomics_exercise/
/cloud/project/genomics_exercise$ pwd
/cloud/project/genomics_exercise
Please note that this is a sub-sampled version of ANNOVAR for the purpose of th exercise today (to reduce file size significantly to <200Mb), and it should not be used in any other real data analysis. You can view all the files in exercise1
directory:
Type cd exercise1
to enter the exercise1
directory. The sub-folder humandb
folder already contains several annotation databases for human genome that we will use in our exercise. (Note that users can find more annotation databases here.
perl table_annovar.pl example/ex2.vcf humandb/ -buildver hg19 -out myanno -remove -protocol refGeneWithVer,cytoBand,gnomad211_exome,dbnsfp47a -operation g,r,f,f -nastring . -vcfinput -polish
After that, you will find the result files myanno.hg19_multianno.txt
and myanno.hg19_multianno.vcf
.
The -protocol
and the -operation
arguments are the most important one to understand here. Protocol lists a set of tasks that you want to use for annotation, composed of a list of comma-separated keywords. Operation specifies the type of annotations that each step in the protocol uses, composed of a list of comma-separated letters such as g
, r
and f
, corresponding to gene-based, region-based and filter-based annotations, respectively.
In the command above, we specified a protocol with three annotation tasks, including RefGene annotation (a gene-based annotation), cytogenetic band annotation (a region-based annotation), allele frequency in gnoMAD version 2.1.1 (a filter-based annotation), and functional prediction from dbNSFP database version 4.7 as filter-based annotation. These three types of annotation are indicated as g
, r
and f
in the -operation argument, respectively. We also specify that dots ('.') be used to indicate lack of information, so variants that are not observed in the database will have '.' as the annotation. The -vcfinput
argument specifies that the input file is in VCF format. We also specify that the output file names should have "myanno" as the prefix, to make it easier to check output files.
If you have Excel installed, you can open the hg19_multianno.txt
file by Excel and examine the various tab-delimited fields. Since we are using Rstudio now, we could also easily import the tab-delimited file into Rstudio to examine the various rows and columns. We will do this later to filter and prioritize disease-relevant variants.
Next, we want to download a VCF file and then run ANNOVAR on this file.
In the exercise1
folder: (you don't actually need to run this step during the course)
wget http://molecularcasestudies.cshlp.org/content/suppl/2016/10/11/mcs.a001131.DC1/Supp_File_2_KBG_family_Utah_VCF_files.zip -O Supp_File_2_KBG_family_Utah_VCF_files.zip
To give some background information, this is a zip file as supplementary material of a published paper on exome sequencing of a family with undiagnosed genetic diseases. Through analysis of the exome data, the proband was confirmed to have KBG syndrome, a disease caused by loss of function mutations in ANKRD11 gene. There are several VCF files contained in the zip file, including those for parents, silings and the proband. We will only analyze proband in this exercise, but if you are interested, you may want to check whether this is a de novo mutation by analyzing parental genomes.
Next unzip the ZIP file (unzip
):
unzip Supp_File_2_KBG_family_Utah_VCF_files.zip
mv './File 2_KBG family Utah_VCF files/' VCF_files
Note that in the commands above, we rename the directory to "VCF_files" just to make it easier to read and operate.
Run ANNOVAR on the VCF file:
perl table_annovar.pl VCF_files/proband.vcf humandb/ -buildver hg19 -out proband.annovar -remove -protocol refGeneWithVer,gnomad211_exome,dbnsfp47a -operation g,f,f -nastring . -vcfinput
The proband.annovar.hg19_multianno.txt
file contains annotations for this exome.
Now compare this command from the previous table_annovar.pl command, we can see that this type we only request a few annotation tasks (refGeneWithVer and gnomad211_exome and dbnsfp47a), which are gene-based, filter-based and filter-based annotations, respectively.
We used "terminal" for the commands above to generate results. To do further analysis in R, we can now switch to the "console" in Rstudio.
In the exercise1
folder, run pwd
to obtain the working directory and copy this path.
Paste this path to R console or R studio interface to setup the working directory using function setwd()
. For example:
setwd("/cloud/project/genomics_exercise/exercise1")
Check variant distribution across chromesomes:
#load data
res <- read.table("proband.annovar.hg19_multianno.txt", fill=T, header=T, sep="\t", na.strings = ".", quote="")
#visualize variant frequency
par(mar=c(5.1, 4.1, 4.1, 2.1),mfrow=c(1,1))
table <- table(res$Chr)
chrlist <- paste0("chr",1:22)
chrlist <- c(chrlist, "chrX")
barplot(table[chrlist], ylab="Number of Variant", las=2)
At this point, you should see a barplot similar to the one below:
Check allele frequency distribution between non-synonymous, synonymous and intronic variants:
#visualize allele frequency
par(mar=c(2, 4, 1, 1),mfrow=c(3,1))
af_syn<-res[res$ExonicFunc.refGeneWithVer=="synonymous SNV",]$AF
hist(as.numeric(af_syn), ylab="Frequency", xlab="Allele frequency", main="Synonymous SNV", breaks=seq(0,1,.001))
af_nonsyn<-res[res$ExonicFunc.refGeneWithVer=="nonsynonymous SNV",]$AF
hist(as.numeric(af_nonsyn), ylab="Frequency", xlab="Allele frequency", main="nonSynonymous SNV",breaks=seq(0,1,.001))
af_intron<-res[res$Func.refGeneWithVer =="intronic",]$AF
hist(as.numeric(af_intron), ylab="Frequency", xlab="Allele frequency", main="Intronic", breaks=seq(0,1,.001))
You should see a figure similar to the one below:
Check allele frequency distribution across race:
#visualize allele frequency across race
par(mar = c(12, 5, 2, 2),mfrow=c(1,1))
res_sub <- res[res$Gene.refGeneWithVer=="NRXN2",]
variantname <- res_sub$Otherinfo6[1]
res_sub <- as.matrix(res_sub[1,16:23])
colnames(res_sub) <- c("African/African-American", "South Asian",
"Latino/Admixed American", "East Asian", "Non-Finnish European", "Finnish",
"Ashkenazi Jewish", "Other")
barplot(res_sub, ylab="Allele frequency", main=variantname, las=2)
You should see a figure similar to the one below:
Next, we examine the allele frequency distributions stratified by SIFT scores, which predict the impact of amino acid substitutions on protein function based on sequence homology and the physical properties of amino acids. We expect variants associated with more deleterious effects (SIFT score < 0.05) to be rarer.
#visualize allele frequency stratified by SIFT scores
res_tolerate = res[res$SIFT_score > 0.05 & !is.na(res$SIFT_score),];dim(res_tolerate)
res_deleterious = res[res$SIFT_score<0.05 & !is.na(res$SIFT_score),];dim(res_deleterious)
par(mar=c(2.5, 4.5, 1.5, 1.5)+2,mfrow=c(1,1))
hist_deleterious <- hist(as.numeric(res_deleterious$AF), breaks=seq(0, 1, 0.001), plot = FALSE)
hist_tolerate <- hist(as.numeric(res_tolerate$AF), breaks=seq(0, 1, 0.001), plot = FALSE)
plot(hist_tolerate, col=rgb(0, 0, 1, 0.5), ylim=c(0, max(c(hist_deleterious$counts, hist_tolerate$counts))),
xlab="Allele Frequency", ylab="Frequency", main="Allele Frequency Distribution", xlim=c(0, 1),
border=NA) # Set border to NA to remove outline
plot(hist_deleterious, col=rgb(1, 0, 0, 0.5), add=T, border=NA)
legend(x=0.6, y=115, legend=c("SIFT < 0.05", "SIFT > 0.05"), col=c("red", "blue"),
fill=c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5)),border=NA)
You should see a figure similar to the one below:
We next use ggplot2
package to compare allele frequency distributions stratified by two other predictive scores: MetaRNN and AlphaMissense. We can similarly find lower allele frequencies for predicted pathogenic variants (MetaRNN: T(olerated) and D(amaging), AlphaMissense: likely (B)enign, (A)mbiguous, or likely (P)athogenic ).
library(ggplot2)
library(dplyr)
library(ggpubr)
a = ggplot(res%>%filter(is.na(MetaRNN_pred)==F),aes(x = MetaRNN_pred, y = AF))+
geom_boxplot(width = 0.3)+theme_classic()+ggtitle('MetaRNN_pred')
b = ggplot(res%>%filter(is.na(AlphaMissense_pred)==F),aes(x = AlphaMissense_pred, y = AF))+
geom_boxplot(width = 0.3)+theme_classic()+ggtitle('AlphaMissense_pred')
ggarrange(a,b,ncol=2)
Feel free to use the table browser to examine the various rows and columns in this table and perform additional summary statistics. Later on, after the exercise 2 (phenotype analysis) below, we will go back to this table to show how combined analysis of genotype data and phenotype data can facilitate genetic diagnosis of rare diseases.
Phen2Gene is a phenotype-based gene prioritization tool from HPO IDs or clinical notes on patients. In the next exercise, we will first use Phen2Gene to prioritize genes based on clinical phenotypes of patients with Mendelian diseases.
We will do the exercise in a directory called exercise2
. In the terminal, assuming that you are currently in the exercise1
directory (you can use pwd
to check this), you can use command cd ../exercise2
to go into the exercise2 directory.
There are a few ways to run Phen2Gene: run it locally (need a few GB of space), use the Phen2Gene API and use Phen2Gene website.
The benefit of running Phen2Gene is if you do not have any idea of a candidate gene for your disease, you can use it in one of three scenarios:
- Ideally, you have a list of physician-curated HPO terms describing a patient phenotype and a list of potential candidate disease variants that overlap gene space and you want to narrow down the list of variants by prioritizing candidate disease genes, often in tandem with variant prioritization software, which cannot as of yet score STR expansions or SVs unlike Phen2Gene which is variant agnostic.
- You do not have variants, but you have HPO terms and would like to get some candidate genes for your disease that you may want to target sequence, as it is much cheaper than whole-genome or whole-exome sequencing.
- If you have clinical notes, you can use tools like EHR-Phenolyzer or Doc2HPO for processing clinical notes into HPO terms using natural language processing (NLP) techniques, then apply scenario 1 or 2 as relevant.
Other tools listed below (ClinPhen, AMELIE, etc) require a gene list, and Phen2Gene does not require any variant data or prior gene lists to provide high quality candidate genes. One would most likely use Phen2Gene for obtaining the genetic etiology of an undiagnosed rare disease with no obvious genetic cause.
- Go to Terminal, make sure you are in the
exercise2
directory first, and runcurl -i -H "Accept: application/json" -H "Content-Type: application/json" "https://phen2gene.wglab.org/api?HPO_list=HP:0000455;HP:0000574;HP:0030084;HP:0012471;HP:0000239;HP:0001572;HP:0000960;HP:0001250;HP:0000322;HP:0001831" | tail -n 1 > output.txt
where you generate JSON output inoutput.txt
However, since theoutput.txt
file is in JSON format, it is not very intuitive to view the content of the file. Instead, we will use the table browser in Rstudio to view a formatted version of the JSON file. - Go To Console, remember that we are probably in the
exercise1
directory, so we should first setexercise2
as the working directory.
setwd("../exercise2")
- Then we can run
# install JSON in R
install.packages("rjson")
# Load JSON in R
library("rjson")
# Read JSON results
result <- fromJSON(file = "output.txt")
# Convert them to array and name columns.
marray <- t(array( unlist(result$results), dim=c(5, length(result$results)) ) );
colnames(marray) <- names(result$results[[1]]);
# View the results in 2-D array. The second column is the rank of genes.
View (marray);
# Get top 100 genes based on Phen2Gene score
top100<-matrix(array(unlist(result$results)), 5)[1,][1:100]
write.table(top100, file="phen2gene_top100_genes", quote=F, col.names = F, row.names = F)
You can see that the top ranked genes are VPS13B, ARID1B, etc.
Under the exercise2
directory, we can run python /cloud/project/usr/Phen2Gene/phen2gene.py -f hpo_list.txt
where we generate Phen2Gene output file in out/output_file.associated_gene_list
.
The hpo_list.txt
file contains a list of HPO terms for a patient (one per line). If you want to see what is in the file, you can use cat hpo_list.txt
to see the list of terms. The output file is written to out/output_file.associated_gene_list
by default.
Use command more out/output_file.associated_gene_list
to check the predicted genes.
You can see that the top ranked genes are VPS13B, ARID1B, etc. Type q
to quit the mode.
Go to https://phen2gene.wglab.org. Click on the tab Patient notes
:
and paste this paragraph of clinical notes in the text box:
The proband had an abnormally large fontanelle, which resolved without treatment. The proband does not appear to have a sacral dimple. Other than the presence of flat arches, there are no obvious signs of foot abnormalities. The proband does not look like his other siblings, although there was a resemblance between him and his sister when they were the same age. Features of proband’s face can be seen, including bushy eyebrows, broad nasal tip, short philtrum, full lips and cupid bow of upper lip. Video of proband’s behavior. Proband is non-verbal, and hyperactive. He repetitively spins his toy. While playing, he gets up from his chair, walks a few steps, stomps his feet, and sits back down. Additional interview in August 2015, after the parents received the proband’s diagnosis of KBG syndrome. The proband develops new mannerisms every four to six months, the most recent being short, hard breaths through the nose and head turning. The proband has had a substantial decrease in the number of seizures after starting an Epidiolex (cannabidiol) treatment (70-80% decrease as described by the parents). The frequency of seizures increased after the proband fell and fractured his jaw. The mother describes the proband’s macrodontia. Although the mother and several siblings have large teeth, the macrodontia in the proband does not appear in any other member of the family. The proband’s features are compared to other characteristics usually found in other KBG patients. Unlike most KBG patients, the proband has full lips. Like most KBG patients, the proband has curved pinkies (diagnosed as clinodactyly), which are often found in KBG patients. Although the proband has relatively short toes, this trait may have been inherited from the father. The proband also has curved toenails, which commonly appear in autistic children.
Then click Submit.
You should see that ANKRD11 is in the top 3. The reason it is not 2 as in the previous example is that we manually curated HPO terms from the patient notes for each patient in our Phen2Gene paper and this is raw notes with no curation done and all HPO terms are being automatically extracted by Doc2HPO. Despite this, the discrepancy in our results is minimal.
Alternatively, you can also submit the HPO terms:
HP:0000707
HP:0007598
HP:0001156
HP:0012446
HP:0004209
HP:0000405
HP:0001627
HP:0002750
HP:0003235
HP:0000239
HP:0001572
HP:0002123
HP:0011220
HP:0004482
HP:0002069
HP:0012471
HP:0001566
HP:0001250
manually using the tab HPO IDs
(they are already the default terms in the website so you can just click Submit
on that tab).
And you should see that ANKRD11 is still number 2.
ClinPhen is another tool to analyze clinical notes. To run this tool,
- Go to Terminal and run the commands below to install it.
export PATH="$HOME/.local/bin":$PATH
pip install clinphen
- Download package nltk in Python. In terminal, type
python
, then type
import nltk
nltk.download('omw-1.4')
After completing installation, type exit()
to quit Python.
- Run ClinPhen
clinphen m_notes.txt
The expected results are below:
/cloud/project/genomics_exercise/exercise2$ clinphen m_notes.txt
HPO ID Phenotype name No. occurrences Earliness (lower = earlier) Example sentence
HP:0012471 Thick vermilion border 2 12 unlike most kbg patients the proband has full lips
HP:0000574 Thick eyebrow 1 9 features of proband s face can be seen including bushy eyebrows broad nasal tip short philtrum full lips and cupid bow of upper lip
HP:0000455 Broad nasal tip 1 10 features of proband s face can be seen including bushy eyebrows broad nasal tip short philtrum full lips and cupid bow of upper lip
HP:0000322 Short philtrum 1 11 features of proband s face can be seen including bushy eyebrows broad nasal tip short philtrum full lips and cupid bow of upper lip
HP:0002263 Exaggerated cupid's bow 1 13 features of proband s face can be seen including bushy eyebrows broad nasal tip short philtrum full lips and cupid bow of upper lip
HP:0001250 Seizures 1 32 the frequency of seizures increased after the proband fell and fractured his jaw
HP:0030084 Clinodactyly 1 42 like most kbg patients the proband has curved pinkies diagnosed as clinodactyly which are often found in kbg patients
You will get the results from ClinPhen. You can also type clinphen --help
if you are interested in more options.
To find candidate genes for this set of HPO terms, you can input HP:0012471;HP:0000574;HP:0000455;HP:0000322;HP:0002263;HP:0001250;HP:0030084
into Phen2Gene web server.
AMELIE is yet another tool for analyzing clinical notes, however it requires a candidate gene list. We put the real causal gene in a algorithm-allowed maximum 1000 gene list of random exome capture genes.
First copy this list of HPO terms to your clipboard. Then go to the AMELIE website and click "Submit Case."
Now, paste the HPO term list into "Case phenotypes."
Then, copy this list of genes to your clipboard.
Next, paste the gene list into "Case genotype" under the tab "Gene List" then click Submit.
You'll get a screen that looks like this:
And a wheel that tells you to wait a while.
After the wheel is done spinning your results should look like the following:
The number 1 gene, TAF1 is the correct causal gene. But Phen2Gene gets the same result.
Go to https://phencards.org/. Click on the tab Patient notes
:
and paste this paragraph of clinical notes in the text box:
The proband had an abnormally large fontanelle, which resolved without treatment. The proband does not appear to have a sacral dimple. Other than the presence of flat arches, there are no obvious signs of foot abnormalities. The proband does not look like his other siblings, although there was a resemblance between him and his sister when they were the same age. Features of proband’s face can be seen, including bushy eyebrows, broad nasal tip, short philtrum, full lips and cupid bow of upper lip. Video of proband’s behavior. Proband is non-verbal, and hyperactive. He repetitively spins his toy. While playing, he gets up from his chair, walks a few steps, stomps his feet, and sits back down. Additional interview in August 2015, after the parents received the proband’s diagnosis of KBG syndrome. The proband develops new mannerisms every four to six months, the most recent being short, hard breaths through the nose and head turning. The proband has had a substantial decrease in the number of seizures after starting an Epidiolex (cannabidiol) treatment (70-80% decrease as described by the parents). The frequency of seizures increased after the proband fell and fractured his jaw. The mother describes the proband’s macrodontia. Although the mother and several siblings have large teeth, the macrodontia in the proband does not appear in any other member of the family. The proband’s features are compared to other characteristics usually found in other KBG patients. Unlike most KBG patients, the proband has full lips. Like most KBG patients, the proband has curved pinkies (diagnosed as clinodactyly), which are often found in KBG patients. Although the proband has relatively short toes, this trait may have been inherited from the father. The proband also has curved toenails, which commonly appear in autistic children.
Then click Search.
This is raw notes with no curation done and all HPO terms are being automatically extracted by Doc2HPO. You should see that KBG syndrome is among the top disease list.
Alternatively, you can also submit the following list of HPO terms manually using the PhenCards web server to find the information of HPO terms. (Do not enter these terms into PhenCards which only find cross-reference for each one of these terms.)
HP:0000455; HP:0000574; HP:0030084; HP:0012471; HP:0000239; HP:0001572; HP:0000960; HP:0001250; HP:0000322; HP:0001831
The expected results are shown below in Phen2Gene server:
Now that we performed variant annotation in exercise1, and performed phenotype-based gene prioritization in exercise2, we want to see how to infer and prioritize candidate genes for the disease.
Go to Console, we can run
# Load top 100 genes generated by Phen2Gene
top100 <- read.table("out/output_file.associated_gene_list", header=T)
top100 <- top100$Gene[1:100]
# Load results of ANNOVAR
res <- read.table("../exercise1/proband.annovar.hg19_multianno.txt", fill=T, header=T, sep="\t", na.strings = ".", quote="")
# Filtering ANNOVAR annotations based on Phen2Gene output
res$Gene.refGeneWithVer
res <- res[res$Gene.refGeneWithVer %in% top100,]
res <- res[res$Func.refGeneWithVer=="exonic",]
res <- res[res$ExonicFunc.refGeneWithVer!="synonymous SNV",]
res <- res[res$AF_popmax<.005 | is.na(res$AF_popmax),]
View(res)
The analytical logic in the above command is that: first we take only exonic variants in the top 100 genes predicted by phenotype-driven gene prioritization software. Next, we remove synonymous SNVs from the list. Next, we keep only variants that are not observed in the gnomAD database, or observed but with allele frequency lower than 0.005.
- You can see that gene ANKRD11 is in the following list
Phenomizer is another a web-based application for clinical diagnostics in human genetics using semantic similarity searches in ontologies.
HP:0000455
HP:0000574
HP:0030084
HP:0012471
HP:0000239
HP:0001572
HP:0000960
HP:0001250
HP:0000322
HP:0001831
First copy one HPO term above to the search bar Phenomizer and click "Search".
Then right click the searched HPO term and select "Add to Patient's Features". You will see this HPO term added to the right panel of Phenomizer. Repeat this step for all HPO terms listed above.
Select "Yes" to make sure 'symmetric' mode algorithm checked on.
After all HPO terms added to the right panel, click "Get diagnosis" at the bottom right corner of Phenomizer.
You should see that KBG syndrome is among the top disease list
OARD (An acronym form of "oh alright") is an Open real-world based Annotation for Rare Diseases and its associated phenotypes.
This is a React web app to serve the web app of OARD. The backend is provided by OARD API. Currently it is hosted on the NCATS AWS server (https://rare.cohd.io/).
open annotation for rare diseases (OARD), a real-world-data-derived resource with annotation for rare-disease-related phenotypes. This resource is derived from the EHRs of two academic health institutions containing more than 10 million individuals spanning wide age ranges and different disease subgroups. By leveraging ontology mapping and advanced natural-language-processing (NLP) methods, OARD automatically and efficiently extracts concepts for both rare diseases and their phenotypic traits from billing codes and lab tests as well as over 100 million clinical narratives.
Go to https://rare.cohd.io/.
Select Method as mostFrequency
and Domain as Disease
to view most frequently occurred disease concept in a dataset:
Then click Submit:
Type HP:0012461
into Query Concept List 1 and select Method as singleConceptFreq
and Dataset as 1-all
to view single concept frequency in a dataset:
Then click Submit:
Type HP:0012461
into Query Concept List 1 and HP:0500111
into Query Concept List 2 and select Method as pairConceptFreq
and Dataset as 1-all
to view pair concept frequency in a dataset:
- Go to Terminal, make sure you are in the
exercise2
directory first, and run
curl "https://rare.cohd.io/api/frequencies/mostFrequency?dataset_id=2&domain_id=diseases" > output_oard1.txt
curl "https://rare.cohd.io/api/frequencies/singleConceptFreq?dataset_id=1&concept_id=90012461" > output_oard2.txt
curl "https://rare.cohd.io/api/frequencies/pairedConceptFreq?dataset_id=1&concept_id_1=90012461&concept_id_2=90500111" > output_oard3.txt
where you generate JSON outputs.
However, since the output.txt
file is in JSON format, it is not very intuitive to view the content of the file. Instead, we will use the table browser in Rstudio to view a formatted version of the JSON file.
- Go To Console, remember that we are probably in the
exercise1
directory, so we should first setexercise2
as the working directory.
setwd("../exercise2")
- Then we can run
library("rjson")
# Read JSON results
result1 <- fromJSON(file = "output_oard1.txt")
# Convert them to array and name columns.
marray1 <- t(array( unlist(result1$results), dim=c(7, length(result1$results)) ) );
colnames(marray1) <- names(result1$results[[1]]);
# View the results in 2-D array.
View (marray1);
result2 <- fromJSON(file = "output_oard2.txt")
# Convert them to array and name columns.
marray2 <- t(array( unlist(result2$results), dim=c(7, length(result2$results)) ) );
colnames(marray2) <- names(result2$results[[1]]);
# View the results in 2-D array.
View (marray2);
result3 <- fromJSON(file = "output_oard3.txt")
# Convert them to array and name columns.
marray3 <- t(array( unlist(result3$results), dim=c(14, length(result3$results)) ) );
colnames(marray3) <- names(result3$results[[1]]);
# View the results in 2-D array.
View (marray3);
PhenoSV is an interpretable phenotype-aware model to prioritize genes affected by structural variants (SVs). PhenoSV can predict pathogenicity of all types of SVs that disrupt either coding or noncoding genome regions, including deletions, duplications, insertions, inversions, and translocations. When phenotype information is available, PhenoSV further utilizes gene-phenotype associations (Phen2Gene) to prioritize disease-related SVs.
For small SVs, we can use the web server (PhenoSV.) to score and interpret SVs. PhenoSV also offers a command-line tool. In the next exercise, we will use PhenoSV to score a single SV with patient's clinical phenotypes.
We need to install database files for PhenoSV to operate. Note that the databases are large and may require more than 15 minutes to complete the installation.
#download PhenoSV database
cd /var/tmp
wget https://www.openbioinformatics.org/PhenoSV/PhenosvlightFile.tar
tar -xvf "PhenosvlightFile.tar"
rm "PhenosvlightFile.tar"
wget https://github.com/WGLab/Phen2Gene/releases/download/1.1.0/H2GKBs.zip
unzip -q "H2GKBs.zip"
rm "H2GKBs.zip"
cd "/cloud/project/genomics_exercise/exercise3/PhenoSV"
bash update_config.sh /var/tmp
After installing the database, We can use PhenoSV. In terminal, use command:
python3 phenosv/model/phenosv.py --c chr6 --s 156994830 --e 157006982 --svtype 'deletion' --noncoding 'tad' --HPO 'HP:0000707,HP:0007598' --model 'PhenoSV-light'
Elements Pathogenicity Type Phen2Gene PhenoSV
0 SV 0.607864 Non-coding SV 0.999126 0.607332
1 ARID1B 0.550785 Intronic 0.999126 0.550304
2 NOX3 0.155716 Regulatory 0.837460 0.130406
3 TFB1M 0.249238 Regulatory 0.544762 0.135775
In summary, a number of computational tools such as Phen2Gene, AMELIE and GADO can perform phenotype-driven gene prioritization. Phen2Gene provides webserver or API, or you can install and run locally (which is important to deploy it in batch processing mode), and it does not require a list of random genes to run either.
- ANNOVAR: Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data. Nucleic Acids Research, 38:e164, 2010
- Phen2Gene: Zhao, M. et al. Phen2Gene: rapid phenotype-driven gene prioritization for rare diseases. NAR Genom Bioinform, 2:lqaa032 (2020).
- AMELIE: Birgmeier, J. et al. AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature. Sci. Transl. Med. 12:eaau9113, (2020).
- GADO: Deelen, P. et al. Improving the diagnostic yield of exome- sequencing by predicting gene-phenotype associations using large-scale gene expression analysis. Nat. Commun. 10:2837 (2019).
- ClinPhen: Deisseroth, C. A. et al. ClinPhen extracts and prioritizes patient phenotypes directly from medical records to expedite genetic disease diagnosis. Genet. Med. 21:1585–1593 (2019).
- PhenCards: Havrilla, J.M. et al. PhenCards: a data resource linking human phenotype information to biomedical knowledge. Genom. Med. 13:91 (2021)
- OARD: Liu, C. et al. OARD: Open annotations for rare diseases and their phenotypes based on real-world data. Am. J. Hum. Genet. 109(9): 1591–1604, (2022)
- PhenoSV: Xu Z, Li Q, Marchionni L and Wang K. PhenoSV: interpretable phenotype-aware model for the prioritization of genes affected by structural variants. Nat. Commun. 14:7805 (2023)