Several models for distinguishing coding regions from
non-coding regions use Markov chains. These models rely
on statistical differences between coding and non-coding
regions.
A popular model is based on examining windows of 6
consecutive bases in the DNA sequence, which makes it a 5th-order
Markov model. We prepare two probability tables in advance, one
for coding regions and one for non-coding regions. Each table
has 4^6 = 4096 entries. For each 6-tuple of bases, the table
records the probability of observing the 6th base given that the
5 preceding bases appeared in our window. Given a sequence, we
estimate the likelihood that it is coding using these two tables.
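The two-table scheme can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any particular gene finder: the function names, the add-one pseudocount, and the use of a log-likelihood ratio to combine the two tables are all assumptions introduced here, and the sequences are assumed to contain only the letters A, C, G and T.

```python
import math
from collections import defaultdict
from itertools import product

BASES = "ACGT"
ORDER = 5  # 5 preceding bases, i.e. windows of 6 bases


def train_table(sequences, pseudocount=1.0):
    """Estimate P(6th base | 5 preceding bases) from training sequences.

    The pseudocount is an assumption of this sketch; it keeps unseen
    6-tuples from getting probability zero.
    """
    counts = defaultdict(float)
    for seq in sequences:
        for i in range(len(seq) - ORDER):
            window = seq[i:i + ORDER + 1]
            if all(b in BASES for b in window):
                counts[window] += 1
    table = {}
    for ctx in map("".join, product(BASES, repeat=ORDER)):
        total = sum(counts[ctx + b] for b in BASES) + 4 * pseudocount
        for b in BASES:
            table[ctx + b] = (counts[ctx + b] + pseudocount) / total
    return table  # 4**6 = 4096 entries


def log_likelihood(seq, table):
    """Sum of log P(base | 5 preceding bases) over all windows of seq."""
    return sum(math.log(table[seq[i:i + ORDER + 1]])
               for i in range(len(seq) - ORDER))


def coding_score(seq, coding_table, noncoding_table):
    """Log-likelihood ratio; a positive value favours 'coding'."""
    return log_likelihood(seq, coding_table) - log_likelihood(seq, noncoding_table)


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    coding_table = train_table(["ATGGCC" * 50])
    noncoding_table = train_table(["ATATTA" * 50])
    print(coding_score("ATGGCC" * 5, coding_table, noncoding_table))
```

In practice the tables would be estimated from large sets of annotated coding and non-coding sequences, and the score would be computed over a sliding window along the genome.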
This model does not take any reading-frame information
into account. It is therefore called a homogeneous model. A
non-homogeneous model is a model that has different tables
for the 3 possible reading frames. The problem with such models
when dealing with eukaryotic genomes is that exons are sometimes
too short and that splice junctions (donor and acceptor sites)
are hard to detect.
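A non-homogeneous variant might be sketched as follows. The sketch assumes one reading of "different tables for the 3 reading frames", namely one table per codon position of the predicted base. It also assumes that the training sequences are in frame (their first base is codon position 0) and contain only A, C, G and T. All names and the pseudocount are hypothetical choices made for illustration.

```python
import math
from collections import defaultdict
from itertools import product

BASES = "ACGT"
ORDER = 5  # 5 preceding bases, i.e. windows of 6 bases


def train_frame_tables(in_frame_seqs, pseudocount=1.0):
    """Build three tables; table f holds P(base | 5 preceding bases)
    for bases that fall at codon position f (0, 1 or 2)."""
    counts = [defaultdict(float) for _ in range(3)]
    for seq in in_frame_seqs:
        for i in range(ORDER, len(seq)):
            window = seq[i - ORDER:i + 1]
            if all(b in BASES for b in window):
                counts[i % 3][window] += 1
    tables = []
    for c in counts:
        table = {}
        for ctx in map("".join, product(BASES, repeat=ORDER)):
            total = sum(c[ctx + b] for b in BASES) + 4 * pseudocount
            for b in BASES:
                table[ctx + b] = (c[ctx + b] + pseudocount) / total
        tables.append(table)
    return tables


def frame_log_likelihood(seq, tables, offset):
    """Log-likelihood of seq, assuming seq[0] is at codon position `offset`."""
    return sum(math.log(tables[(i + offset) % 3][seq[i - ORDER:i + 1]])
               for i in range(ORDER, len(seq)))


def best_frame(seq, tables):
    """Return (offset, log-likelihood) for the most likely reading frame."""
    scores = [frame_log_likelihood(seq, tables, off) for off in range(3)]
    best = max(range(3), key=scores.__getitem__)
    return best, scores[best]
```

The short-exon problem shows up directly here: `frame_log_likelihood` sums over only `len(seq) - 5` windows, so a very short exon contributes few terms and the three frame scores are hard to tell apart.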
Peer Itsik
2000-12-25