Markov Sequence Models

Some models for distinguishing coding regions from non-coding regions use Markov chains. These models rely on statistical differences between coding and non-coding regions.
A popular model examines windows of 6 consecutive bases in the DNA sequence; this is a fifth-order Markov model. We'll prepare two probability tables in advance: one for coding regions and one for non-coding regions. Each table will have 4⁶ = 4096 entries. For each 6-tuple of bases, the table registers the probability of observing the sixth base, given that the 5 preceding bases appeared in our window. Given a sequence, we'll use the two tables to estimate the likelihood that it is coding.
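The two-table scheme above can be sketched in Python. This is only an illustration under stated assumptions, not the method of any particular gene finder: the function names `train_table` and `log_likelihood_ratio` are hypothetical, the tables are estimated from labelled training sequences by counting with add-one smoothing, and the final score is taken to be a log-likelihood ratio, which is one common way of comparing the two tables.

```python
import math
from collections import defaultdict
from itertools import product

ORDER = 5        # fifth-order model: 5 bases of context, windows of 6 bases
BASES = "ACGT"

def train_table(sequences, order=ORDER, pseudocount=1.0):
    """Estimate P(next base | preceding `order` bases) from training sequences.

    Returns a dict keyed by each (order + 1)-tuple of bases, i.e. 4**6 entries
    for order 5. The pseudocount (an assumption of this sketch) avoids zero
    probabilities for windows never seen in training.
    """
    counts = defaultdict(float)
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            window = seq[i - order:i + 1]
            if set(window) <= set(BASES):      # skip windows with N etc.
                counts[window] += 1
    table = {}
    for ctx in map("".join, product(BASES, repeat=order)):
        total = sum(counts[ctx + b] for b in BASES) + pseudocount * len(BASES)
        for b in BASES:
            table[ctx + b] = (counts[ctx + b] + pseudocount) / total
    return table

def log_likelihood_ratio(seq, coding, noncoding, order=ORDER):
    """Sum of log(P_coding / P_noncoding) over every window of the sequence.

    A positive score means the coding table explains the sequence better.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        window = seq[i - order:i + 1]
        if window in coding:                   # ignore ambiguous windows
            score += math.log(coding[window] / noncoding[window])
    return score
```

In use, one would call `train_table` once on known coding sequences and once on known non-coding sequences, then slide `log_likelihood_ratio` over candidate regions. The first 5 bases of a sequence are not scored here, since they have no full context.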
This model does not take any reading frame information into account. It is therefore called a homogeneous model. A non-homogeneous model is a model that has different tables for the 3 possible reading frames. The problem with such models when dealing with eukaryotic genomes is that exons are sometimes too short, and that it is hard to detect splice junctions (donor and acceptor sites).
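A non-homogeneous model of this kind might be sketched as follows. The names `train_frame_tables`, `frame_log_likelihood` and `best_frame` are hypothetical, and the sketch assumes that every training sequence is coding and starts at the first position of a codon. Each of the three tables predicts the bases that fall at one codon position, and a candidate sequence is scored under each of the three possible frame offsets.

```python
import math
from collections import defaultdict
from itertools import product

BASES = "ACGT"

def train_frame_tables(coding_seqs, order=5, pseudocount=1.0):
    """Build three tables; table f predicts bases at codon position f (0, 1, 2).

    Assumes (for this sketch) that each training sequence begins at codon
    position 0, so index i in the sequence has codon position i % 3.
    """
    counts = [defaultdict(float) for _ in range(3)]
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            window = seq[i - order:i + 1]
            if set(window) <= set(BASES):      # skip windows with N etc.
                counts[i % 3][window] += 1
    tables = []
    for f in range(3):
        table = {}
        for ctx in map("".join, product(BASES, repeat=order)):
            total = sum(counts[f][ctx + b] for b in BASES) + pseudocount * len(BASES)
            for b in BASES:
                table[ctx + b] = (counts[f][ctx + b] + pseudocount) / total
        tables.append(table)
    return tables

def frame_log_likelihood(seq, tables, offset, order=5):
    """Log-probability of seq, assuming seq[0] sits at codon position `offset`."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        window = seq[i - order:i + 1]
        if window in tables[0]:                # ignore ambiguous windows
            total += math.log(tables[(i + offset) % 3][window])
    return total

def best_frame(seq, tables, order=5):
    """Return the frame offset (0, 1, or 2) under which seq is most probable."""
    return max(range(3), key=lambda off: frame_log_likelihood(seq, tables, off, order))
```

The short-exon problem is visible in this sketch: a sequence contributes only `len(seq) - order` scored windows, so a very short exon gives the three frame scores little evidence to separate them.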

Peer Itsik
2000-12-25