In examples 6.1.2a and 6.1.2b we constructed hidden Markov models knowing the transition and emission probabilities for the problems we had to solve. In real life, this may not be the case. We may be given $n$ strings $X^{(1)}, \ldots, X^{(n)}$, of lengths $L_1, \ldots, L_n$ respectively, which were all generated from the same HMM with a set of parameters $\theta$ (the transition and emission probabilities). The values of the probabilities in $\theta$, however, are unknown a priori. In order to construct the HMM that will best characterize $X^{(1)}, \ldots, X^{(n)}$, we will have to assign values to $\theta$ that maximize the probabilities of our strings according to the model. Since all strings are assumed to be generated independently, we can write:
$$P(X^{(1)}, \ldots, X^{(n)} \mid \theta) = \prod_{j=1}^{n} P(X^{(j)} \mid \theta) \qquad (32)$$
Using the logarithmic score, our goal is to find a parameter set $\theta^*$ such that

$$\theta^* = \mathop{\mathrm{argmax}}_{\theta} P(X^{(1)}, \ldots, X^{(n)} \mid \theta) \qquad (33)$$

$$\theta^* = \mathop{\mathrm{argmax}}_{\theta} \sum_{j=1}^{n} \log P(X^{(j)} \mid \theta) \qquad (34)$$
The strings $X^{(1)}, \ldots, X^{(n)}$ are usually called the training sequences.
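As a small illustration of the objective in (32)-(34), the following sketch evaluates the logarithmic score of a set of training strings. The callable `string_probability` is a hypothetical stand-in for $P(X^{(j)} \mid \theta)$ (it could, for instance, be the forward algorithm); it is not part of the original text.

```python
import math

def log_score(strings, theta, string_probability):
    """Equation (34): sum of log P(X^(j) | theta) over the training strings.

    `string_probability(x, theta)` is a hypothetical callable returning
    P(x | theta).  Maximizing this sum is equivalent to maximizing the
    product of the string probabilities in equation (32).
    """
    return sum(math.log(string_probability(x, theta)) for x in strings)
```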
Case 1: Assume we know the state sequences $\Pi^{(1)}, \ldots, \Pi^{(n)}$ corresponding to $X^{(1)}, \ldots, X^{(n)}$, respectively. We can scan these sequences and compute $A_{kl}$, the number of transitions from state $k$ to state $l$, and $E_k(b)$, the number of times the symbol $b$ was emitted in state $k$. The maximum likelihood estimators will be:

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad (35)$$

$$e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')} \qquad (36)$$
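A minimal sketch of Case 1 follows, under assumptions not stated in the text: the symbols and states are encoded as integers $0, \ldots, M-1$ and $0, \ldots, K-1$, and the strings and state paths are given as Python sequences. It counts transitions and emissions along the known state paths and normalizes as in (35)-(36).

```python
import numpy as np

def ml_estimate(xs, paths, K, M):
    """Maximum likelihood estimation when the state sequences are known."""
    A = np.zeros((K, K))   # A_kl: number of transitions from state k to state l
    E = np.zeros((K, M))   # E_k(b): number of emissions of symbol b in state k
    for x, path in zip(xs, paths):
        for i in range(len(x)):
            E[path[i], x[i]] += 1
            if i + 1 < len(x):
                A[path[i], path[i + 1]] += 1
    # Estimators (35)-(36): normalize each row of the count matrices.
    # A state that never occurs in the paths leaves an all-zero row, which is
    # exactly the zero-probability problem addressed by the pseudocounts below.
    a = A / A.sum(axis=1, keepdims=True)
    e = E / E.sum(axis=1, keepdims=True)
    return a, e
```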
To avoid zero probabilities when working with a small number of samples, it is recommended to work with $A'_{kl}$ and $E'_k(b)$, where:

$$A'_{kl} = A_{kl} + r_{kl} \qquad (37)$$

$$E'_k(b) = E_k(b) + r_k(b) \qquad (38)$$
Usually the Laplace correction, in which all $r_{kl}$ and $r_k(b)$ values equal 1, is applied; it has the intuitive interpretation of an a priori assumed uniform distribution. However, in some cases it may be beneficial to use other values for the correction, e.g. when some prior information about the transition or emission probabilities is available.
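Continuing the sketch above (it assumes the count matrices `A` and `E` computed there), the correction of (37)-(38) simply adds the pseudocounts to the raw counts before normalizing; here the Laplace correction is used, i.e. all pseudocounts equal 1.

```python
A_prime = A + 1.0   # A'_kl = A_kl + r_kl, with r_kl = 1 (Laplace correction)
E_prime = E + 1.0   # E'_k(b) = E_k(b) + r_k(b), with r_k(b) = 1
a = A_prime / A_prime.sum(axis=1, keepdims=True)
e = E_prime / E_prime.sum(axis=1, keepdims=True)
```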
Case 2: Usually, the state sequences are not known. In this case, the problem of finding the optimal set of parameters $\theta^*$ is known to be NP-complete. The Baum-Welch algorithm [2], which is a special case of the EM (Expectation-Maximization) technique, can be used for heuristically finding a solution to the problem.
Let $f_k^j(i)$ and $b_k^j(i)$ denote the forward and backward probabilities of the string $X^{(j)}$. The probability that the transition from state $k$ to state $l$ is used at position $i$ of $X^{(j)}$, given the current parameters $\theta$, is:

$$P(\pi_i = k, \pi_{i+1} = l \mid X^{(j)}, \theta) = \frac{f_k^j(i)\, a_{kl}\, e_l(x_{i+1}^j)\, b_l^j(i+1)}{P(X^{(j)})} \qquad (39)$$

Hence, we can compute the expected number of times each transition and emission is used over all the training sequences:

$$A_{kl} = \sum_{j=1}^{n} \frac{1}{P(X^{(j)})} \sum_{i} f_k^j(i)\, a_{kl}\, e_l(x_{i+1}^j)\, b_l^j(i+1) \qquad (40)$$

$$E_k(b) = \sum_{j=1}^{n} \frac{1}{P(X^{(j)})} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i) \qquad (41)$$

These expected counts take the place of the observed counts of Case 1, and the estimators (35)-(36) are used to obtain an updated parameter set $\theta$; the procedure is then repeated.
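The following is a sketch of the whole Baum-Welch iteration, under several assumptions that are not part of the text: symbols are encoded as integers, an explicit initial state distribution `pi` is re-estimated together with the transition and emission matrices, pseudocounts of 1 are applied as in (37)-(38), and no rescaling is performed, so it is only suitable for short training sequences (real implementations work in log space or rescale the forward/backward values).

```python
import numpy as np

def forward(x, pi, a, e):
    """f[i, k] = P(x_1..x_i, state_i = k); returns (f, P(X))."""
    L, K = len(x), len(pi)
    f = np.zeros((L, K))
    f[0] = pi * e[:, x[0]]
    for i in range(1, L):
        f[i] = e[:, x[i]] * (f[i - 1] @ a)
    return f, f[-1].sum()

def backward(x, a, e):
    """b[i, k] = P(x_{i+1}..x_L | state_i = k)."""
    L, K = len(x), a.shape[0]
    b = np.ones((L, K))
    for i in range(L - 2, -1, -1):
        b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
    return b

def baum_welch(seqs, K, M, iters=100, tol=1e-6, pseudo=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial guess for theta = (pi, a, e), rows normalized.
    pi = rng.random(K); pi /= pi.sum()
    a = rng.random((K, K)); a /= a.sum(axis=1, keepdims=True)
    e = rng.random((K, M)); e /= e.sum(axis=1, keepdims=True)
    prev_score = -np.inf
    for _ in range(iters):
        A = np.full((K, K), pseudo)   # expected transition counts, eq. (40)
        E = np.full((K, M), pseudo)   # expected emission counts,  eq. (41)
        P0 = np.zeros(K)              # posterior of the first state
        score = 0.0
        for x in seqs:
            f, px = forward(x, pi, a, e)
            b = backward(x, a, e)
            score += np.log(px)       # target function, eq. (34)
            for i in range(len(x) - 1):
                A += np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) * a / px
            for i, sym in enumerate(x):
                E[:, sym] += f[i] * b[i] / px
            P0 += f[0] * b[0] / px
        if score - prev_score < tol:  # convergence of the target function
            break
        prev_score = score
        # M-step: re-estimate theta with the estimators (35)-(36),
        # applied to the expected counts instead of observed ones.
        pi = P0 / P0.sum()
        a = A / A.sum(axis=1, keepdims=True)
        e = E / E.sum(axis=1, keepdims=True)
    return pi, a, e, score
```

The loop stops once the improvement of the target function (34) falls below a tolerance, which is the convergence criterion discussed next.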
The EM algorithm guarantees that the target function values $\sum_{j} \log P(X^{(j)} \mid \theta)$ are monotonically increasing, and since logarithms of probabilities are bounded above by 0, the algorithm is guaranteed to converge. It is important to notice that the convergence is of the target function and not in the parameter space: the values of $\theta$ may change drastically even between almost equal values of the target function, which may imply that the obtained solution is not stable.
The main problem with the Baum-Welch algorithm is that there may exist several local maxima of the target function, and it is not guaranteed that we reach the global maximum: the convergence may lead to a local maximum. A useful way to circumvent this pitfall is to run the algorithm several times, each time with different initial values for $\theta$. If we reach the same maximum most of the times, it is highly probable that this is indeed the global maximum. Another way is to start with meaningful values; for example, in the case of the CpG islands we might start from values obtained from real statistics.
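A tiny illustration of the restart strategy, assuming the `baum_welch` sketch above and a list `seqs` of integer-encoded training sequences (both hypothetical names), runs the algorithm with several random initializations and keeps the run with the best final score:

```python
# K and M are placeholder values for the number of states and symbols.
runs = [baum_welch(seqs, K=2, M=4, seed=s) for s in range(10)]
pi, a, e, score = max(runs, key=lambda run: run[-1])  # run[-1] is the final score
```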