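The probability that the sequence $X$ was generated by the model
given the path $\Pi$ is therefore:
\[
P(X \mid \Pi) = a_{\pi_0,\pi_1} \cdot \prod_{i=1}^{L} e_{\pi_i}(x_i) \cdot a_{\pi_i,\pi_{i+1}} \tag{6}
\]
where, for convenience, we denote $\pi_0 = begin$ and $\pi_{L+1} = end$.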
The model contains eight states corresponding to the four symbols
of the alphabet {A,C,G,T}:
State: | A+ | C+ | G+ | T+ | A- | C- | G- | T- |
Emitted Symbol: | A | C | G | T | A | C | G | T |
If the probability for staying in a CpG island is p and the
probability of staying outside it is q, then the transition
probabilities will be as described in table
6.2 (derived from the
transition probabilities given in table
6.1 under the assumption that we lose
memory when moving from/into a CpG island, and that we ignore
background probabilities).
In this special case the emission probability of each state
X+ or X- is exactly 1 for the symbol X and 0 for any
other symbol.
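The degenerate emissions of this model can be sketched in Python as follows. This is only an illustration: the names `ALPHABET`, `STATES` and `emission` are hypothetical and do not come from the notes, and the transition probabilities of table 6.2 are deliberately left out.

```python
# One "+" state (inside a CpG island) and one "-" state (outside) per symbol.
ALPHABET = "ACGT"
STATES = [symbol + sign for sign in "+-" for symbol in ALPHABET]


def emission(state: str, symbol: str) -> float:
    """Degenerate emission: state X+ or X- emits X with probability 1."""
    return 1.0 if state[0] == symbol else 0.0
```

For every state the emission probabilities sum to 1 over the alphabet, as any emission distribution must.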
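Let us consider another example, where the emission probabilities
are not degenerate: a sequence of coin tosses, each either h (heads)
or t (tails), produced by a fair coin (state F) or a biased coin (state B).
The corresponding HMM is: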
\begin{align}
a_{FF} &= a_{BB} = 0.9 \tag{7} \\
a_{FB} &= a_{BF} = 0.1 \tag{8} \\
e_F(h) &= 0.5, \quad e_F(t) = 0.5 \tag{9} \\
e_B(h) &= 0.75, \quad e_B(t) = 0.25 \tag{10}
\end{align}
Figure 6.1 gives a full description of the
model.
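As a rough illustration, the probability of a toss sequence along one particular state path can be computed by multiplying the transition and emission probabilities met along the way. The sketch below uses the values from equations (7)-(10). The names `TRANSITION`, `EMISSION`, `START` and `path_probability` are hypothetical, and the begin-state probabilities are an assumption: the notes do not give them here, so the sketch starts in F or B with equal probability and models no end state.

```python
# Transition and emission probabilities from equations (7)-(10).
TRANSITION = {("F", "F"): 0.9, ("F", "B"): 0.1,
              ("B", "B"): 0.9, ("B", "F"): 0.1}
EMISSION = {"F": {"h": 0.5, "t": 0.5},
            "B": {"h": 0.75, "t": 0.25}}
# ASSUMPTION (not in the notes): the begin state moves to F or B with
# equal probability, and there is no explicit end state.
START = {"F": 0.5, "B": 0.5}


def path_probability(tosses: str, path: str) -> float:
    """Probability of emitting `tosses` along `path` (one state per toss)."""
    if len(tosses) != len(path):
        raise ValueError("tosses and path must have the same length")
    if not tosses:
        return 1.0
    prob = START[path[0]]
    for i, (state, toss) in enumerate(zip(path, tosses)):
        prob *= EMISSION[state][toss]
        # Multiply by the transition into the next state, if there is one.
        if i + 1 < len(path):
            prob *= TRANSITION[(state, path[i + 1])]
    return prob
```

For instance, `path_probability("hht", "FFF")` and `path_probability("hht", "BBF")` give different values for the same tosses, which shows that the probability depends on the path as well as on the observed sequence.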
Returning to the general case, we have defined the probability
for a given sequence $X$ and a given path $\Pi$.
However, we do not know the actual sequence of states $\Pi$
that emitted $X$. We
therefore say that the generating path of $X$ is hidden.
In the CpG islands case (problem 6.2),
the optimal path $\Pi^*$ can help us find the location of the islands. Had
we known $\Pi^*$, we could have traversed it, determining that
all the parts that pass through the "+" states are CpG islands.
Similarly, in the coin-tossing example, the parts of $\Pi^*$ that pass
through the B (biased) state are suspected tosses of the biased
coin.
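Traversing a known path to pick out such parts can be sketched as below. The function name `flagged_segments` and the example paths are hypothetical, chosen only for illustration. The sketch assumes the path is already available and does not compute it.

```python
from itertools import groupby


def flagged_segments(path, is_flagged):
    """Return (start, end) index pairs, end exclusive, for every maximal
    run of consecutive states in `path` that satisfy `is_flagged`."""
    segments = []
    position = 0
    for flagged, run in groupby(path, key=is_flagged):
        length = sum(1 for _ in run)
        if flagged:
            segments.append((position, position + length))
        position += length
    return segments


# Coin tossing: positions where the path passes through state B.
coin_segments = flagged_segments("FFBBBFFB", lambda s: s == "B")

# CpG islands: positions where the path passes through "+" states.
cpg_path = ["A-", "C+", "G+", "T-"]
island_segments = flagged_segments(cpg_path, lambda s: s.endswith("+"))
```

Here `coin_segments` is `[(2, 5), (7, 8)]` and `island_segments` is `[(1, 3)]`, both as half-open index ranges into the path.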
A solution for the optimal path problem is described in the next section.