In examples 6.1.2a and 6.1.2b we constructed hidden Markov models knowing the transition and emission probabilities for the problems we had to solve. In real life, this may not be the case. We may be given $n$ strings $X^{(1)}, \ldots, X^{(n)}$, of lengths $L_1, \ldots, L_n$ respectively, which were all generated from the same HMM with a set of parameters $\theta$ (the transition and emission probabilities). The values of the probabilities in $\theta$, however, are unknown a priori. In order to construct the HMM that will best characterize $X^{(1)}, \ldots, X^{(n)}$, we will have to assign values to $\theta$ that maximize the probabilities of our strings according to the model. Since all strings are assumed to be generated independently, we can write:
$$P(X^{(1)}, \ldots, X^{(n)} \mid \theta) = \prod_{j=1}^{n} P(X^{(j)} \mid \theta) \qquad (32)$$
Using the logarithmic score, our goal is to find a parameter set $\theta^*$ such that

$$\theta^* = \mathop{\mathrm{argmax}}_{\theta} P(X^{(1)}, \ldots, X^{(n)} \mid \theta) \qquad (33)$$

$$\theta^* = \mathop{\mathrm{argmax}}_{\theta} \sum_{j=1}^{n} \log P(X^{(j)} \mid \theta) \qquad (34)$$
The strings $X^{(1)}, \ldots, X^{(n)}$ are usually called the training sequences.
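As a small illustration of the objective in (32)-(34), the following sketch evaluates the logarithmic score of a set of training strings. The callable `string_probability` is a hypothetical stand-in for $P(X^{(j)} \mid \theta)$ (it could, for instance, be the forward algorithm); it is not part of the original text.

```python
import math

def log_score(strings, theta, string_probability):
    """Equation (34): sum of log P(X^(j) | theta) over the training strings.

    `string_probability(x, theta)` is a hypothetical callable returning
    P(x | theta).  Maximizing this sum is equivalent to maximizing the
    product of the string probabilities in equation (32).
    """
    return sum(math.log(string_probability(x, theta)) for x in strings)
```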
Case 1: Assume we know the state sequences $\Pi^{(1)}, \ldots, \Pi^{(n)}$ corresponding to $X^{(1)}, \ldots, X^{(n)}$, respectively. We can scan these sequences and compute $A_{kl}$, the number of transitions from state $k$ to state $l$, and $E_k(b)$, the number of times the symbol $b$ was emitted in state $k$. The maximum likelihood estimators will be:

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad (35)$$

$$e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')} \qquad (36)$$
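A minimal sketch of Case 1 follows, under assumptions not stated in the text: the symbols and states are encoded as integers $0, \ldots, M-1$ and $0, \ldots, K-1$, and the strings and state paths are given as Python sequences. It counts transitions and emissions along the known state paths and normalizes as in (35)-(36).

```python
import numpy as np

def ml_estimate(xs, paths, K, M):
    """Maximum likelihood estimation when the state sequences are known."""
    A = np.zeros((K, K))   # A_kl: number of transitions from state k to state l
    E = np.zeros((K, M))   # E_k(b): number of emissions of symbol b in state k
    for x, path in zip(xs, paths):
        for i in range(len(x)):
            E[path[i], x[i]] += 1
            if i + 1 < len(x):
                A[path[i], path[i + 1]] += 1
    # Estimators (35)-(36): normalize each row of the count matrices.
    # A state that never occurs in the paths leaves an all-zero row, which is
    # exactly the zero-probability problem addressed by the pseudocounts below.
    a = A / A.sum(axis=1, keepdims=True)
    e = E / E.sum(axis=1, keepdims=True)
    return a, e
```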
To avoid zero probabilities when working with a small number of samples, it is recommended to work with $A'_{kl}$ and $E'_k(b)$, where:

$$A'_{kl} = A_{kl} + r_{kl} \qquad (37)$$

$$E'_k(b) = E_k(b) + r_k(b) \qquad (38)$$
Usually the Laplace correction, in which all $r_{kl}$ and $r_k(b)$ values equal 1, is applied; it has the intuitive interpretation of an a priori assumed uniform distribution. However, in some cases it may be beneficial to use other values for the correction, e.g. when some prior information about the transition or emission probabilities is available.
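Continuing the sketch above (it assumes the count matrices `A` and `E` computed there), the correction of (37)-(38) simply adds the pseudocounts to the raw counts before normalizing; here the Laplace correction is used, i.e. all pseudocounts equal 1.

```python
A_prime = A + 1.0   # A'_kl = A_kl + r_kl, with r_kl = 1 (Laplace correction)
E_prime = E + 1.0   # E'_k(b) = E_k(b) + r_k(b), with r_k(b) = 1
a = A_prime / A_prime.sum(axis=1, keepdims=True)
e = E_prime / E_prime.sum(axis=1, keepdims=True)
```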
Case 2: Usually, the state sequences are not known. In this case, the problem of finding the optimal set of parameters $\theta^*$ is known to be NP-complete. The Baum-Welch algorithm [2], which is a special case of the EM (Expectation-Maximization) technique, can be used for heuristically finding a solution to the problem.
Let $f_k^j(i)$ and $b_k^j(i)$ denote the forward and backward probabilities of the string $X^{(j)}$. The probability that the transition from state $k$ to state $l$ is used at position $i$ of $X^{(j)}$, given the current parameters $\theta$, is:

$$P(\pi_i = k, \pi_{i+1} = l \mid X^{(j)}, \theta) = \frac{f_k^j(i)\, a_{kl}\, e_l(x_{i+1}^j)\, b_l^j(i+1)}{P(X^{(j)})} \qquad (39)$$

Hence, we can compute the expected number of times each transition and emission is used over all the training sequences:

$$A_{kl} = \sum_{j=1}^{n} \frac{1}{P(X^{(j)})} \sum_{i} f_k^j(i)\, a_{kl}\, e_l(x_{i+1}^j)\, b_l^j(i+1) \qquad (40)$$

$$E_k(b) = \sum_{j=1}^{n} \frac{1}{P(X^{(j)})} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i) \qquad (41)$$

These expected counts take the place of the observed counts of Case 1, and the estimators (35)-(36) are used to obtain an updated parameter set $\theta$; the procedure is then repeated.
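The following is a sketch of the whole Baum-Welch iteration, under several assumptions that are not part of the text: symbols are encoded as integers, an explicit initial state distribution `pi` is re-estimated together with the transition and emission matrices, pseudocounts of 1 are applied as in (37)-(38), and no rescaling is performed, so it is only suitable for short training sequences (real implementations work in log space or rescale the forward/backward values).

```python
import numpy as np

def forward(x, pi, a, e):
    """f[i, k] = P(x_1..x_i, state_i = k); returns (f, P(X))."""
    L, K = len(x), len(pi)
    f = np.zeros((L, K))
    f[0] = pi * e[:, x[0]]
    for i in range(1, L):
        f[i] = e[:, x[i]] * (f[i - 1] @ a)
    return f, f[-1].sum()

def backward(x, a, e):
    """b[i, k] = P(x_{i+1}..x_L | state_i = k)."""
    L, K = len(x), a.shape[0]
    b = np.ones((L, K))
    for i in range(L - 2, -1, -1):
        b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
    return b

def baum_welch(seqs, K, M, iters=100, tol=1e-6, pseudo=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial guess for theta = (pi, a, e), rows normalized.
    pi = rng.random(K); pi /= pi.sum()
    a = rng.random((K, K)); a /= a.sum(axis=1, keepdims=True)
    e = rng.random((K, M)); e /= e.sum(axis=1, keepdims=True)
    prev_score = -np.inf
    for _ in range(iters):
        A = np.full((K, K), pseudo)   # expected transition counts, eq. (40)
        E = np.full((K, M), pseudo)   # expected emission counts,  eq. (41)
        P0 = np.zeros(K)              # posterior of the first state
        score = 0.0
        for x in seqs:
            f, px = forward(x, pi, a, e)
            b = backward(x, a, e)
            score += np.log(px)       # target function, eq. (34)
            for i in range(len(x) - 1):
                A += np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) * a / px
            for i, sym in enumerate(x):
                E[:, sym] += f[i] * b[i] / px
            P0 += f[0] * b[0] / px
        if score - prev_score < tol:  # convergence of the target function
            break
        prev_score = score
        # M-step: re-estimate theta with the estimators (35)-(36),
        # applied to the expected counts instead of observed ones.
        pi = P0 / P0.sum()
        a = A / A.sum(axis=1, keepdims=True)
        e = E / E.sum(axis=1, keepdims=True)
    return pi, a, e, score
```

The loop stops once the improvement of the target function (34) falls below a tolerance, which is the convergence criterion discussed next.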
The EM algorithm guarantees that the target function values $\sum_{j} \log P(X^{(j)} \mid \theta)$ are monotonically increasing, and since logarithms of probabilities are bounded above by 0, the algorithm is guaranteed to converge. It is important to notice that the convergence is of the target function and not in the parameter space: the values of $\theta$ may change drastically even between almost equal values of the target function, which may imply that the obtained solution is not stable.
The main problem with the Baum-Welch algorithm is that there may exist several local maxima of the target function, and it is not guaranteed that we reach the global maximum: the convergence may lead to a local maximum. A useful way to circumvent this pitfall is to run the algorithm several times, each time with different initial values for $\theta$. If we reach the same maximum most of the times, it is highly probable that this is indeed the global maximum. Another way is to start with meaningful values; for example, in the case of the CpG islands we might start from values obtained from real statistics.
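A tiny illustration of the restart strategy, assuming the `baum_welch` sketch above and a list `seqs` of integer-encoded training sequences (both hypothetical names), runs the algorithm with several random initializations and keeps the run with the best final score:

```python
# K and M are placeholder values for the number of states and symbols.
runs = [baum_welch(seqs, K=2, M=4, seed=s) for s in range(10)]
pi, a, e, score = max(runs, key=lambda run: run[-1])  # run[-1] is the final score
```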