Hidden Markov model (HMM) for separating genic and intergenic regions in bacterial DNA
grade = 100%
a) Using the genome sequence and gene annotation for Vibrio cholerae, create the configuration sheet
Code: q1_a.mlx
i) Average Length of Intergenic Regions: 1164.4 bp
ii) Average Length of Genic Regions: 991.4515 bp
iii) Nucleotide Frequency Table for intergenic regions
Nucleotide | Frequency |
---|---|
A | 0.2651 |
C | 0.2441 |
G | 0.2262 |
T | 0.2646 |
iv) See configurations.xlsx for the codon frequency table for genic regions (emission table)
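As an illustration of how the quantities in i) to iii) can be derived, here is a minimal Python sketch (not the contents of q1_a.mlx). It assumes the genome is a single string and the annotation is a sorted list of non-overlapping (start, end) pairs, 1-based and inclusive, with strand ignored. The function name `region_stats` is hypothetical.

```python
from collections import Counter

def region_stats(genome, genes):
    """Return (mean intergenic length, mean genic length, intergenic base frequencies).

    genes: sorted, non-overlapping (start, end) pairs, 1-based inclusive (assumed).
    """
    intergenic = []
    prev_end = 0  # end of the previous gene, 1-based (0 = before the genome)
    for start, end in genes:
        if start - 1 > prev_end:
            # Slice the gap between the previous gene and this one.
            intergenic.append(genome[prev_end:start - 1])
        prev_end = end
    if prev_end < len(genome):
        intergenic.append(genome[prev_end:])

    genic_lengths = [end - start + 1 for start, end in genes]
    counts = Counter("".join(intergenic))
    total = sum(counts[base] for base in "ACGT")  # ambiguous bases are ignored
    freqs = {base: counts[base] / total for base in "ACGT"}

    mean_intergenic = sum(len(s) for s in intergenic) / len(intergenic)
    mean_genic = sum(genic_lengths) / len(genic_lengths)
    return mean_intergenic, mean_genic, freqs
```

For example, `region_stats("AACCATGAAATAAGGTT", [(5, 13)])` treats positions 5 to 13 as the only gene and the two flanking 4 bp stretches as intergenic.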
The Viterbi algorithm is implemented in Viterbi.m, which uses the MATLAB function findGenes.m.
To run it, call the Viterbi function in MATLAB with the names of the three files as arguments:
Viterbi(fastaFilename, configFilename, outputFile)
The output is saved automatically in GFF3 format.
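For readers unfamiliar with the decoding step, the following is a minimal Python sketch of log-space Viterbi decoding. It is not the code in Viterbi.m: the generic state set, the dictionary-based parameters, and the function name `viterbi` are illustrative assumptions, and the actual model (with start, stop, and codon emissions) is richer than what a two-state example shows.

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Return the most probable state path for the observation sequence obs.

    log_init[s], log_trans[r][s], log_emit[s][symbol] are log-probabilities.
    """
    # score[s] = best log-probability of any path that ends in state s
    score = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in obs[1:]:
        prev = score
        pointers = {}
        score = {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            pointers[s] = best
            score[s] = prev[best] + log_trans[best][s] + log_emit[s][symbol]
        back.append(pointers)

    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

Working in log space avoids numerical underflow, which matters when the sequence is a whole bacterial genome.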
c) Run your program on the genome of Vibrio vulnificus, using the parameters obtained for Vibrio cholerae
The gene predictions are found in vulnificusOutput.gff3
Code for this section and the next is found in q1_d.mlx
The fraction of annotated genes that:
- Perfectly match a guessed gene: 59.8%
- Match stop but not start: 27.8%
- Match start but not stop: 0%
- Do not match at all: 12.4%
The fraction of guessed genes that:
- Perfectly match an annotated gene: 54.3%
- Match stop but not start: 25.3%
- Match start but not stop: 0%
- Do not match at all: 20.4%
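The four categories above can be computed along the following lines. This is a hedged Python sketch rather than the code in q1_d.mlx: it assumes each gene is a (strand, start, stop) tuple holding the biological start and stop coordinates, and the function name `match_fractions` is hypothetical.

```python
from collections import Counter

def match_fractions(reference, candidates):
    """Fraction of genes in `reference` falling into each match category
    when compared against `candidates`. Genes are (strand, start, stop)."""
    exact = set(candidates)
    starts = {(strand, start) for strand, start, _ in candidates}
    stops = {(strand, stop) for strand, _, stop in candidates}

    counts = Counter()
    for strand, start, stop in reference:
        if (strand, start, stop) in exact:
            counts["perfect"] += 1
        elif (strand, stop) in stops:
            counts["stop_only"] += 1
        elif (strand, start) in starts:
            counts["start_only"] += 1
        else:
            counts["none"] += 1

    categories = ("perfect", "stop_only", "start_only", "none")
    return {c: counts[c] / len(reference) for c in categories}
```

Calling `match_fractions(annotated, predicted)` gives the first table's categories, and swapping the arguments gives the second.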
e) What properties of annotated genes are associated with an elevated risk of being partially or completely missed by your predictor? What are the properties of genes predicted by your predictor that do not match an annotated gene?
There are two properties that cause genes to be missed.
- Genes that were completely missed tended to be short (100-500 bp). This is shown in the histogram below:
- Genes that were matched at their stop but not their start always had a predicted start close to the real one: usually less than 100 base pairs away, and almost always less than 200 base pairs away, as seen in this histogram:
In addition, the real start was usually upstream of the predicted start.
This would essentially be the same as the algorithm for question 1: there would be an intergenic state, a start state, a stop state, and 999 middle states.
The transition probabilities are:
- From start to middle1: P(L > 1)
- From start to stop: P(L = 1)
- From middle1 to middle2: P(L > 3)
- From middle2 to stop: P(L = 3)
and so on for the remaining middle states.
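One common way to turn a length distribution into per-state transition probabilities for such a chain is to condition on having reached each state. The Python sketch below assumes a hypothetical probability mass function over lengths 1..K, with chain position k reached only when L >= k. Under that assumption, the probability of continuing from position k is P(L > k | L >= k) and the probability of stopping is P(L = k | L >= k). How positions map onto the start/middle/stop naming above, and the function name `chain_transitions`, are assumptions.

```python
def chain_transitions(length_pmf):
    """length_pmf[k - 1] = P(L = k) for k = 1..K (assumed to sum to 1).

    Returns a list of (p_continue, p_stop) pairs, one per chain position.
    """
    survival = 1.0  # P(L >= k) for the current position k
    transitions = []
    for p in length_pmf:
        # Conditional probability of stopping here, given we got this far.
        p_stop = min(1.0, p / survival) if survival > 0 else 1.0
        transitions.append((1.0 - p_stop, p_stop))
        survival -= p
    return transitions
```

With this normalization the two outgoing probabilities of every state sum to 1, and multiplying the probabilities along any path through the chain recovers the original P(L = k).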