Comp 561 Computational Biology Methods and Research - Homework #3

Hidden markov model (HMM) for separating genic and intergenic regions in bacterial DNA

grade = 100%

Question 1

a) Using the genome sequence and gene annotation for Vibrio cholerae, create the configuration sheet

Code: q1_a.mlx

i) Average Length of Intergenic Regions: 1164.4bp

ii) Average Length of Genic Regions: 991.4515bp

iii) Nucleotide Frequency Table for intergenic regions

	Frequency
A	0.2651
C	0.2441
G	0.2262
T	0.2646

iv) see configurations.xlsx for the codon frequency table for genic regions (emission table)

b) Using the programming language of your choice, implement the Viterbi algorithm

The code for the Viterbi algorithm is found in Viterbi.m and it uses the matlab function findGenes.m

To run it, just call the Viterbi function in matlab with the names of the three files as the arguments.

Viterbi(fastaFilename, configFilename, outputFile)

This will automatically save the output in gff3 format.

c) Run your program on the genome of Vibrio vulnificus, using the parameters obtained for Vibrio cholerae

The gene predictions are found in vulnificusOutput.gff3

d) Evaluate the accuracy of your predictions against the gene annotation available

Code for this section and the next is found in q1_d.mlx

The fraction of annotated genes that:

Perfectly match a guessed gene: 59.8%
Match stop but not start: 27.8%
Match start but not stop: 0
Do not match at all: 12.4 %

The fraction of guessed genes that:

Perfectly match an annotated gene: 54.3%
Match stop but not start: 25.3%
Match start but not stop: 0
Do not match at all: 20.4 %

e) What properties of annotated genes are associated to an elevated risk of being partially or completely missed by your predictor? What are the properties of genes predicted by your predictor that do not match an annotated gene?

There are two properties that cause genes to be missed.

The genes that were completely missed tended to be small in length (100-500bp). This is shown in the histogram below:
The genes that were matched for their stop but not their start always had a starting point that was close to the real one. Usually less than 100 basepairs away, and almost always less than 200 basepairs away, as seen in this histogram:

Another thing to notice is that the real starting point was usually upstream from the predicted point.

Question 2

Basically this would be the same thing as the algorithm for question 1, there would be a intergenic state, start and stop state, and 999 middle states.

The probability of transition

From start to middle1 is P(L > 1)
From start to stop is P(L=1)
From middle1 to middle2 is P(L > 3).
From middle2 to stop is P(L=3).

etc.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
HW3_2020.pdf		HW3_2020.pdf
README.md		README.md
Vibrio_cholerae.GFC_11.37.gff3		Vibrio_cholerae.GFC_11.37.gff3
Vibrio_cholerae.GFC_11.dna.toplevel.fa		Vibrio_cholerae.GFC_11.dna.toplevel.fa
Vibrio_vulnificus.ASM74310v1.37.gff3		Vibrio_vulnificus.ASM74310v1.37.gff3
Vibrio_vulnificus.ASM74310v1.dna.toplevel.fa		Vibrio_vulnificus.ASM74310v1.dna.toplevel.fa
Viterbi.m		Viterbi.m
configurations.xlsx		configurations.xlsx
findGenes.m		findGenes.m
instructions.pdf		instructions.pdf
missedlengthhisto.jpg		missedlengthhisto.jpg
q1_a.mlx		q1_a.mlx
q1_d.mlx		q1_d.mlx
stopmatchhisto.jpg		stopmatchhisto.jpg
vulnificusOutput.gff3		vulnificusOutput.gff3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Comp 561 Computational Biology Methods and Research - Homework #3

Hidden markov model (HMM) for separating genic and intergenic regions in bacterial DNA

Question 1

a) Using the genome sequence and gene annotation for Vibrio cholerae, create the configuration sheet

b) Using the programming language of your choice, implement the Viterbi algorithm

c) Run your program on the genome of Vibrio vulnificus, using the parameters obtained for Vibrio cholerae

d) Evaluate the accuracy of your predictions against the gene annotation available

e) What properties of annotated genes are associated to an elevated risk of being partially or completely missed by your predictor? What are the properties of genes predicted by your predictor that do not match an annotated gene?

Question 2

About

ScienceMoo/Gene-Finding-HMM

Folders and files

Latest commit

History

Repository files navigation

Comp 561 Computational Biology Methods and Research - Homework #3

Hidden markov model (HMM) for separating genic and intergenic regions in bacterial DNA

Question 1

a) Using the genome sequence and gene annotation for Vibrio cholerae, create the configuration sheet

b) Using the programming language of your choice, implement the Viterbi algorithm

c) Run your program on the genome of Vibrio vulnificus, using the parameters obtained for Vibrio cholerae

d) Evaluate the accuracy of your predictions against the gene annotation available

e) What properties of annotated genes are associated to an elevated risk of being partially or completely missed by your predictor? What are the properties of genes predicted by your predictor that do not match an annotated gene?

Question 2

About

Resources

Stars

Watchers

Forks