HMMs can be used for aligning a string versus a given profile, thus
helping us to solve the multiple alignment problem.
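We define a profile $P$ of length $L$ as a set of probabilities
consisting of, for each symbol $b$ of the alphabet and each position
$1 \le i \le L$, the probability $e_i(b)$ of observing the symbol $b$
at the $i$-th position. In this case the probability of a string
$X = x_1 \ldots x_L$ given the profile $P$ is:
$$\Pr(X \mid P) = \prod_{i=1}^{L} e_i(x_i) \tag{42}$$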
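We can calculate a likelihood score for the ungapped alignment of $X$
against the profile $P$:
$$\mathrm{Score}(X \mid P) = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{p(x_i)} \tag{43}$$
where $x_i$ denotes the $i$-th symbol of $X$ and $p(b)$ is the
background frequency of the symbol $b$.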
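As a minimal sketch of the two quantities above, the following Python computes the probability of a string under a profile and its log-odds score against a background distribution. The function names, the list-of-dictionaries representation of the profile, and the toy numbers are all assumptions made for illustration; the sketch also assumes that every probability involved is strictly positive, so that the logarithm is defined.

```python
import math


def profile_probability(x, profile):
    """Product over positions of e_i(x_i), i.e. the probability of x given the profile."""
    if len(x) != len(profile):
        raise ValueError("an ungapped alignment needs len(x) == profile length")
    prob = 1.0
    for symbol, emissions in zip(x, profile):
        prob *= emissions[symbol]
    return prob


def log_odds_score(x, profile, background):
    """Sum over positions of log(e_i(x_i) / p(x_i))."""
    if len(x) != len(profile):
        raise ValueError("an ungapped alignment needs len(x) == profile length")
    return sum(
        math.log(emissions[symbol] / background[symbol])
        for symbol, emissions in zip(x, profile)
    )


# Hypothetical toy profile of length 3 over the alphabet {A, C, G, T}.
toy_profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
uniform_background = {symbol: 0.25 for symbol in "ACGT"}

print(profile_probability("AGT", toy_profile))
print(log_odds_score("AGT", toy_profile, uniform_background))
```

A position whose emission distribution equals the background (the third position here) contributes zero to the score, which is why the score measures how much better the profile explains the string than the background does.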
This leads to the following HMM: all the states are match states
$M_1, \ldots, M_L$, which correspond to matches of the string's
symbols with the profile positions. These states are sequentially
linked (i.e., each match state $M_j$ is linked to its successor
$M_{j+1}$), as shown in figure 6.2. The emission probability of the
symbol $b$ from the state $M_j$ is $e_j(b)$.
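To allow insertions, we also add insertion states $I_j$ to the model.
We shall assume that each insertion state emits symbols according to
the background distribution:
$$e_{I_j}(b) = p(b)$$
Each insertion state $I_j$ has a link entering from the corresponding
match state $M_j$, a leaving link towards the next match state
$M_{j+1}$, and a self-loop (see figure 6.3). Assigning the appropriate
probabilities to these transitions corresponds to the application of
affine gap penalties, since the overall contribution of a gap of
length $h$ to the logarithmic likelihood score is:
$$\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (h-1) \log a_{I_j I_j}$$
where $a_{kl}$ denotes the probability of the transition from state
$k$ to state $l$. The first two terms act as a gap-opening penalty and
the last term as a gap-extension penalty.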
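The affine form of the gap contribution can be checked numerically. A gap of length $h$ at one insertion state enters it once from the match state, takes the self-loop $h-1$ times, and leaves once towards the next match state; if inserted symbols are emitted with their background frequencies, the emission terms cancel in the log-odds score and only these transition terms remain. The sketch below assumes that setting; the function name, parameter names and transition values are hypothetical.

```python
import math


def insertion_gap_log_score(h, a_match_to_insert, a_insert_to_insert, a_insert_to_match):
    """Log-score contribution of a gap of h inserted symbols at one insertion state."""
    if h < 1:
        raise ValueError("a gap must contain at least one inserted symbol")
    # Paid once per gap: entering and leaving the insertion state.
    gap_open = math.log(a_match_to_insert) + math.log(a_insert_to_match)
    # Paid once per additional inserted symbol: the self-loop.
    gap_extend = math.log(a_insert_to_insert)
    return gap_open + (h - 1) * gap_extend


# Hypothetical transition probabilities, chosen only for illustration.
for h in (1, 2, 3):
    print(h, insertion_gap_log_score(h, 0.1, 0.3, 0.7))
```

Each extra inserted symbol changes the score by the same amount, the logarithm of the self-loop probability, which is exactly the behaviour of an affine gap penalty.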
To allow deletions as well, we add the deletion states $D_j$.
These states cannot emit any symbol and are therefore called silent
(note that the begin/end states are silent as well). The deletion
states are sequentially linked, in a similar manner to the match
states, and they are also interleaved with the match states (see
figure 6.4).
To model both insertions and deletions, we have to add a link from
$D_j$ to $I_j$ and a link from $I_j$ to $D_{j+1}$.
The full HMM for modeling the profile $P$ of length $L$ comprises $L$
layers, each with three states: $M_j$, $I_j$ and $D_j$. To complete
the model, we add begin and end states, connected to the layers as
shown in figure 6.5. This model is due to Haussler et al. [5].
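The topology described in this section can be enumerated programmatically. The sketch below lists the links named in the text: match to match, match to insertion and back, the insertion self-loop, deletion to deletion, the interleaving of match and deletion states, and the links between deletion and insertion states. How the begin and end states attach to the layers is given only by the figure, so the choice made here (begin links to the first match and deletion states, and every state of the last layer links to end) is an assumption, as are the function name and the state labels.

```python
def profile_hmm_links(num_layers):
    """Return the set of directed links (source, target) of a profile HMM."""
    if num_layers < 1:
        raise ValueError("the profile must have at least one position")
    links = set()
    for j in range(1, num_layers + 1):
        m, i, d = f"M{j}", f"I{j}", f"D{j}"
        # Links that stay inside layer j.
        links.update({(m, i), (i, i), (d, i)})
        if j < num_layers:
            next_m, next_d = f"M{j + 1}", f"D{j + 1}"
            # Links from layer j into layer j + 1.
            links.update({
                (m, next_m), (i, next_m), (d, next_m),
                (m, next_d), (i, next_d), (d, next_d),
            })
    # Assumed attachment of the silent begin and end states.
    links.update({("Begin", "M1"), ("Begin", "D1")})
    last = num_layers
    links.update({(f"M{last}", "End"), (f"I{last}", "End"), (f"D{last}", "End")})
    return links


for source, target in sorted(profile_hmm_links(2)):
    print(source, "->", target)
```

Under these assumptions the number of links grows linearly with the profile length, so the model stays sparse even for long profiles.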