As we have seen, a Hidden Markov Model (HMM)
is a Markov chain in which the states are not directly observable.
Instead, the output of the current state is observable. The output
symbol for each state is randomly chosen from a finite output
alphabet according to some probability distribution.
A Generalized Hidden Markov Model (GHMM) generalizes the HMM as
follows: in a GHMM, the output of a state need not be a
single symbol. Instead, the output may be a string of finite
length. For a given state, the length of the output
string, as well as the output string itself, may be randomly
chosen according to some probability distribution. The probability
distribution need not be the same for all states; for example,
one state might use a weight matrix model for generating the
output string, while another might use an HMM. Formally, a GHMM is
described by the following parameters:
- A finite set Q of states.
- An initial state probability distribution π.
- Transition probabilities Ti,j for i, j ∈ Q.
- Length distributions of the states (fq is the length distribution for state q).
- Probabilistic models for each of the states, according to which output strings are generated upon visiting a state.
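To make the generative process concrete, here is a minimal sketch in Python of sampling from a GHMM. All parameters (state names, length lists, symbol alphabets) are hypothetical toy values, not taken from the text; each state's "probabilistic model" is deliberately simplified to i.i.d. symbols.

```python
import random

# Toy GHMM (hypothetical parameters). Each visit to a state draws a duration
# from that state's length distribution, then emits a string of that length
# from the state's own (here, i.i.d.) symbol model.
states = ["exon", "intron"]
init = {"exon": 1.0, "intron": 0.0}                         # initial distribution pi
trans = {"exon": {"intron": 1.0}, "intron": {"exon": 1.0}}  # transition probs Ti,j
lengths = {"exon": [50, 120, 200], "intron": [80, 300]}     # fq: uniform over these
emit = {"exon": "ACGT", "intron": "AT"}                     # crude per-state model

def sample(n_segments, seed=0):
    rng = random.Random(seed)
    q = rng.choices(states, weights=[init[s] for s in states])[0]
    parse, seq = [], []
    for _ in range(n_segments):
        d = rng.choice(lengths[q])                          # duration di ~ fq
        seq.append("".join(rng.choice(emit[q]) for _ in range(d)))
        parse.append((q, d))
        nxt = list(trans[q].items())                        # next state ~ Ti,j
        q = rng.choices([s for s, _ in nxt], weights=[w for _, w in nxt])[0]
    return parse, "".join(seq)

parse, S = sample(4)
```

Note that the sampler returns both the hidden parse (states with durations) and the observed string; a gene finder sees only the string and must recover the parse.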
Figure 7.9:
A GHMM describing the structure of a eukaryotic gene. E states correspond
to exons, while I states correspond to introns.
The probabilistic model for gene structure suggested by Burge
and Karlin [1] is based on a GHMM (see Figure 7.9).
The states of the GHMM correspond to the different functional units of a gene,
such as promoter regions, exons, introns, etc. The transitions between
the states ensure that the order in which the model visits the various states
is biologically consistent.
The states for introns and internal exons are subdivided
according to their phase, i.e., their offset relative to the codon reading frame. For
i = 0, 1, 2,
the state Ii (respectively, Ei) corresponds to
introns (exons) starting i positions after a codon start. Note
that the only transition from Ii to an internal exon state is
to Ei. Also note that the model is divided into two symmetric
halves: the upper half of the figure (states with a "+"
superscript) models a gene on the forward strand, while the lower
half models a gene on the backward strand of the genomic sequence.
If the parameters (like π, the transition probabilities Ti,j, the length
distributions fq, etc.) are suitably
determined, then the model can be used for gene structure
prediction in the following manner.
Definition 7.1
A parse φ of a sequence S is an ordered sequence of states
(q1, ..., qn),
with an associated duration di for each state qi. The length of
φ is
d1 + ... + dn.
Suppose we are given a DNA sequence S and a
parse φ, both of length L. Let Si be the segment of S produced by qi, and let
P(Si|qi,di)
be the probability of generating Si by the sequence-generation model
of state qi with length di. The conditional probability of
the parse φ,
given that the generated sequence is S,
can be computed as

P(φ|S) = P(φ,S) / P(S),

where

P(φ,S) = πq1 fq1(d1) P(S1|q1,d1) ∏i=2..n Tqi-1,qi fqi(di) P(Si|qi,di).
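The joint probability of a parse and a sequence can be evaluated directly from these quantities. Below is a minimal sketch in Python, using hypothetical toy parameters and i.i.d. per-state symbol models in place of the richer submodels a real gene finder would use; the function name `log_joint` and all parameter values are illustrative assumptions.

```python
import math

# Sketch: log P(parse, S) for a toy GHMM (hypothetical parameters).
pi = {"E": 0.5, "I": 0.5}                 # initial distribution
T = {"E": {"I": 1.0}, "I": {"E": 1.0}}    # transition probabilities Ti,j
f = {"E": {3: 0.5, 6: 0.5}, "I": {4: 1.0}}  # length distributions fq
p_sym = {"E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
         "I": {"A": 0.5, "T": 0.5}}       # toy i.i.d. symbol models

def log_emit(q, seg):
    # log P(Si | qi, di) under the toy i.i.d. model of state q
    return sum(math.log(p_sym[q][c]) for c in seg)

def log_joint(parse, S):
    """parse: list of (state, duration) pairs; len(S) must equal the total duration."""
    pos, prev = 0, None
    lp = math.log(pi[parse[0][0]])        # pi_{q1}
    for q, d in parse:
        if prev is not None:
            lp += math.log(T[prev][q])    # T_{q_{i-1}, q_i}
        lp += math.log(f[q][d]) + log_emit(q, S[pos:pos + d])
        pos, prev = pos + d, q
    return lp

lp = log_joint([("E", 3), ("I", 4)], "ACGATAT")
```

Working in log space, as here, avoids the numerical underflow that the raw product would cause on realistic sequence lengths.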
The most probable parse,
φ* = argmaxφ P(φ|S),
can be computed by a Viterbi-like algorithm, and P(S) can be
computed by a forward-like algorithm.
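A minimal sketch of such a Viterbi-like dynamic program is given below, again on hypothetical toy parameters with i.i.d. per-state emission models. It recurses over all segment end positions and candidate durations (O(L^2) in the worst case); a real gene finder restricts candidate durations and uses much richer submodels.

```python
import math

# Toy GHMM (hypothetical parameters).
pi = {"E": 0.5, "I": 0.5}                    # initial distribution
T = {"E": {"I": 1.0}, "I": {"E": 1.0}}       # transition probabilities Ti,j
f = {"E": {3: 0.5, 6: 0.5}, "I": {4: 1.0}}   # length distributions fq
p_sym = {"E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
         "I": {"A": 0.5, "T": 0.5}}          # toy i.i.d. symbol models

def log_emit(q, seg):
    # log P(Si | qi, di); symbols outside the state's alphabet get -inf
    ps = p_sym[q]
    return sum(math.log(ps[c]) if c in ps else float("-inf") for c in seg)

def best_parse(S):
    # V[j][q] = best log-probability of a parse of S[:j] whose last state is q
    L = len(S)
    V = [{q: float("-inf") for q in pi} for _ in range(L + 1)]
    back = [{q: None for q in pi} for _ in range(L + 1)]
    for j in range(1, L + 1):
        for q in pi:
            for d, fd in f[q].items():       # candidate durations for state q
                i = j - d
                if i < 0:
                    continue
                e = math.log(fd) + log_emit(q, S[i:j])
                if i == 0:                   # first segment: use pi
                    cand, prev = math.log(pi[q]) + e, None
                else:                        # best predecessor state
                    cand, prev = max((V[i][p] + math.log(T[p][q]) + e, p)
                                     for p in pi if T[p].get(q, 0) > 0)
                if cand > V[j][q]:
                    V[j][q], back[j][q] = cand, (i, d, prev)
    q = max(V[L], key=V[L].get)              # best final state
    score, parse, j = V[L][q], [], L
    while j > 0:                             # trace back through segments
        i, d, prev = back[j][q]
        parse.append((q, d))
        j, q = i, prev
    return list(reversed(parse)), score

parse, score = best_parse("ACGATAT")
```

The forward-like computation of P(S) has the same structure, with the max over predecessors replaced by a (log-)sum.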
Itshack Pe`er 1999-02-03