HMM-4-学习模型参数(Baum-Welch)

4 从观测序列中学习模型参数

4.1 最大期望对数似然

设有\(K\)条观测序列\( ^{[5]} \):

\[ X=\{O^k\}_{k=1}^K \]

模型参数\( ^{[5]} \):

\[ \lambda = (A,B,\Pi) \]

调整模型参数使得观测序列概率最大化,即可完成参数学习过程\( ^{[1]} \).

\[ \begin{align}
P(X|\lambda) &= \prod_{k=1}^K P(O^k | \lambda) \\
&= \prod_{k=1}^K \sum_Q P(Q, O^k | \lambda) \\
\log P(X|\lambda) &= \sum_{k=1}^K \log P(O^k | \lambda) \\
&= \sum_{k=1}^K \log \sum_Q P(Q, O^k | \lambda)
\end{align} \]

状态序列\( Q \)未知, 如遍历所有状态序列求和, 计算代价太高且太复杂. 改为计算对数期望. 并在后续简化计算.

Baum-Welch 算法:

\[ \begin{align}
Q(\lambda, \bar{\lambda}) &= \sum_{k=1}^K \sum_Q P(Q|O^k, \lambda) \log P(Q, O^k |\bar{\lambda}) \\
\bar{\lambda} &= \arg \max_{\bar{\lambda}} Q(\lambda, \bar{\lambda})
\end{align} \]

\( Q \)状态概率未知(此时还不存在最佳模型参数), 使用交替迭代的方式计算,第一步随机生成模型参数\( \lambda \),使用此参数在观测序列上计算得到期望值,再调整参数\( \bar{\lambda} \)使得\( Q(\lambda, \bar{\lambda}) \)最大化, 反复迭代. 最终收敛到极值.

4.1.1 对数概率期望

\[ \begin{align}
\bar{\lambda} &= \arg \max_{\bar{\lambda}} Q(\lambda, \bar{\lambda}) \\
& \Rightarrow \sum_{k=1}^K \sum_Q P(Q|O^k, \lambda) \log P(Q, O^k |\bar{\lambda}) \\
& \Rightarrow \sum_{k=1}^K \sum_Q P(Q|O^k, \lambda) \log \left[ \bar{\pi}_{q_1} \bar{b}_{q_1}(O_1^k) \bar{a}_{q_1 q_2} \bar{b}_{q_2}(O_2^k) \cdots \bar{a}_{q_{t-1} q_T} \bar{b}_{q_T}(O_T^k) \right] \\
& \Rightarrow \left[ \sum_{k=1}^K \sum_Q P(Q|O^k, \lambda) \log \bar{\pi}_{q_1} \right] + \left[ \sum_{k=1}^K \sum_Q \sum_{t=1}^{T_{O^k}-1} P(Q|O^k, \lambda) \log \bar{a}_{q_t q_{t+1}} \right] + \left[ \sum_{k=1}^K \sum_Q \sum_{t=1}^{T_{O^k}} P(Q|O^k, \lambda) \log \bar{b}_{q_t}(O_t^k) \right]
\end{align} \]

于是,在这里\( \bar{A}, \bar{B}, \bar{\Pi} \)参数被分为三组互相无关的求和,可分别计算最大化\( ^{[3][4]} \).

4.1.2 最大化

\[ \begin{align}
\bar{\Pi} &= \arg \max_{\bar{\Pi}} \sum_{k=1}^K \sum_Q P(Q|O^k, \lambda) \log \bar{\pi}_{q_1} \\
& \Rightarrow \sum_{k=1}^K \sum_{q_1} \sum_{q_2 q_3 \cdots q_{T_{O^k}}} P(q_1 q_2 \cdots q_{T_{O^k}} |O^k, \lambda) \log \bar{\pi}_{q_1} \\
& \Rightarrow \sum_{k=1}^K \sum_{i=1}^N P(q_1 = S_i |O^k, \lambda) \log \bar{\pi}_i \\
& s.t. \sum_{i=1}^N \bar{\pi}_i = 1 \text{ Lagrange Multiplier } \\
& \Rightarrow \max \sum_{k=1}^K \sum_{i=1}^N P(q_1 = S_i |O^k, \lambda) \log \bar{\pi}_i + \mu \left[ \sum_{i=1}^N \bar{\pi}_i – 1 \right] \\
& \Rightarrow \frac{\partial}{\partial \bar{\pi}_i} \sum_{k=1}^K \sum_{i=1}^N P(q_1 = S_i |O^k, \lambda) \log \bar{\pi}_i + \mu \left[ \sum_{i=1}^N \bar{\pi}_i – 1 \right] = 0 \\
& \Rightarrow \frac{\sum_{k=1}^K P(q_1 = S_i |O^k, \lambda)}{\bar{\pi}_i} + \mu = 0 \\
& \Rightarrow \sum_{k=1}^K P(q_1 = S_i |O^k, \lambda) + \mu \bar{\pi}_i = 0 \\
& \Rightarrow \sum_{i=1}^N \left[ \sum_{k=1}^K P(q_1 = S_i |O^k, \lambda) + \mu \bar{\pi}_i \right] = 0 \\
\text{note:} \\
& \frac{\partial}{\partial \mu} \sum_{k=1}^K \sum_{i=1}^N P(q_1 = S_i |O^k, \lambda) \log \bar{\pi}_i + \mu \left[ \sum_{i=1}^N \bar{\pi}_i – 1 \right] = \sum_{i=1}^N \bar{\pi}_i – 1 = 0 \\
\text{so:} \\
& \Rightarrow \sum_{k=1}^K \sum_{i=1}^N P(q_1 = S_i | O^k, \lambda) + \mu \sum_{i=1}^N \bar{\pi}_i = 0 \\
& \Rightarrow K + \mu = 0 \\
& \Rightarrow \frac{\sum_{k=1}^K P(q_1 = S_i |O^k, \lambda)}{\bar{\pi}_i} + \mu = K + \mu \\
{\bar{\pi}_i} &= \frac{\sum_{k=1}^K P(q_1 = S_i |O^k, \lambda)}{K} \\
\bar{A} &= \arg \max_{\bar{A}} \sum_{k=1}^K \sum_Q \sum_{t=1}^{T_{O^k}-1} P(Q|O^k, \lambda) \log \bar{a}_{q_t q_{t+1}} \\
& \Rightarrow \sum_{k=1}^K \sum_{q_t q_{t+1}} \sum_{q_1 \cdots q_{t-1}, q_{t+2} \cdots q_{T_{O^k}}} \sum_{t=1}^{T_{O^k}-1} P(q_1 \cdots q_{t-1},q_t,q_{t+1},q_{t+2} \cdots q_{T_{O^k}}|O^k, \lambda) \log \bar{a}_{q_t q_{t+1}} \\
& \Rightarrow \sum_{k=1}^K \sum_{q_t q_{t+1}} \sum_{t=1}^{T_{O^k}-1} P(q_t,q_{t+1}|O^k, \lambda) \log \bar{a}_{q_t q_{t+1}} \\
& \Rightarrow \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i,q_{t+1}=S_j|O^k, \lambda) \log \bar{a}_{ij} \\
& s.t. \sum_{j=1}^N \bar{a}_{ij} = 1 \text{ Lagrange Multiplier } \\
& \Rightarrow \max \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i,q_{t+1}=S_j|O^k, \lambda) \log \bar{a}_{ij} + \sum_{i=1}^N \mu_i \left[ \sum_{j=1}^N \bar{a}_{ij} – 1 \right] \\
& \Rightarrow \frac{\partial}{\partial \bar{a}_{ij}} \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i,q_{t+1}=S_j|O^k, \lambda) \log \bar{a}_{ij} + \sum_{i=1}^N \mu_i \left[ \sum_{j=1}^N \bar{a}_{ij} – 1 \right] = 0 \\
& \Rightarrow \frac{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i,q_{t+1}=S_j|O^k, \lambda)}{\bar{a}_{ij}} + \mu_i = 0 \\
& \Rightarrow \sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i,q_{t+1}=S_j|O^k, \lambda) + \mu_i \bar{a}_{ij} = 0 \\
& \Rightarrow \sum_{j=1}^N \left[ \sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i,q_{t+1}=S_j|O^k, \lambda) + \mu_i \bar{a}_{ij} \right] = 0 \\
\text{note:} \\
& \frac{\partial}{\partial \mu_i } \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i,q_{t+1}=S_j|O^k, \lambda) \log \bar{a}_{ij} + \sum_{i=1}^N \mu_i \left[ \sum_{j=1}^N \bar{a}_{ij} – 1 \right] = \sum_{j=1}^N \bar{a}_{ij} – 1 = 0 \\
\text{so:} \\
& \Rightarrow \sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} \sum_{j=1}^N P(q_t=S_i,q_{t+1}=S_j|O^k, \lambda) + \mu_i \sum_{j=1}^N \bar{a}_{ij} = 0 \\
& \Rightarrow \sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i|O^k, \lambda) + \mu_i = 0 \\
& \Rightarrow \frac{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i,q_{t+1}=S_j|O^k, \lambda)}{\bar{a}_{ij}} + \mu_i = \sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i|O^k, \lambda) + \mu_i \\
\bar{a}_{ij} &= \frac{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i,q_{t+1}=S_j|O^k, \lambda)}{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i|O^k, \lambda)} \\
\bar{B} &= \arg \max_{\bar{B}} \sum_{k=1}^K \sum_Q \sum_{t=1}^{T_{O^k}} P(Q|O^k,\lambda) \log \bar{b}_{q_t}(O_t^k) \\
& \Rightarrow \sum_{k=1}^K \sum_{q_t} \sum_{q_1 \cdots q_{t-1}, q_{t+1} \cdots q_{T_{O^k}}} \sum_{t=1}^{T_{O^k}} P(q_1 \cdots q_{t-1}, q_t, q_{t+1} \cdots q_{T_{O^k}}|O^k,\lambda) \log \bar{b}_{q_t}(O_t^k) \\
& \Rightarrow \sum_{k=1}^K \sum_{i=1}^N \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) \log \bar{b}_i(O_t^k) \\
& \Rightarrow \sum_{k=1}^K \sum_{i=1}^N \sum_{m=1}^M \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) \log \bar{b}_i(m)1(O_t^k=v_m) \\
& s.t. \sum_{m=1}^M \bar{b}_i(m) = 1 \text{ Lagrange Multiplier } \\
& \Rightarrow \max \sum_{k=1}^K \sum_{i=1}^N \sum_{m=1}^M \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) \log \bar{b}_i(m)1(O_t^k=v_m) + \sum_{i=1}^N \mu_i \left[ \sum_{m=1}^M \bar{b}_i(m) – 1 \right] \\
& \Rightarrow \frac{\partial}{\partial \bar{b}_i(m)} \sum_{k=1}^K \sum_{i=1}^N \sum_{m=1}^M \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) \log \bar{b}_i(m)1(O_t^k=v_m) + \sum_{i=1}^N \mu_i \left[ \sum_{m=1}^M \bar{b}_i(m) – 1 \right] = 0 \\
& \Rightarrow \frac{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) 1(O_t^k=v_m)}{\bar{b}_i(m)} + \mu_i = 0 \\
& \Rightarrow \sum_{k=1}^K \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) 1(O_t^k=v_m) + \mu_i \bar{b}_i(m) = 0 \\
& \Rightarrow \sum_{m=1}^M \left[ \sum_{k=1}^K \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) 1(O_t^k=v_m) + \mu_i \bar{b}_i(m) \right] = 0 \\
\text{note:} \\
& \frac{\partial}{ \partial \mu_i } \sum_{k=1}^K \sum_{i=1}^N \sum_{m=1}^M \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) \log \bar{b}_i(m)1(O_t^k=v_m) + \sum_{i=1}^N \mu_i \left[ \sum_{m=1}^M \bar{b}_i(m) – 1 \right] = \sum_{m=1}^M \bar{b}_i(m) – 1 = 0 \\
\text{so:} \\
& \Rightarrow \sum_{k=1}^K \sum_{t=1}^{T_{O^k}} \sum_{m=1}^M P(q_t=S_i|O^k,\lambda) 1(O_t^k=v_m) + \mu_i \sum_{m=1}^M \bar{b}_i(m) = 0 \\
& \Rightarrow \sum_{k=1}^K \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) + \mu_i = 0 \\
& \Rightarrow \frac{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) 1(O_t^k=v_m)}{\bar{b}_i(m)} + \mu_i = \sum_{k=1}^K \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) + \mu_i \\
\bar{b}_i(m) &= \frac{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) 1(O_t^k=v_m)}{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda)}
\end{align} \]

于是最终得到了参数更新:

\[\begin{align}
{\bar{\pi}_i} &= \frac{\sum_{k=1}^K P(q_1 = S_i |O^k, \lambda)}{K} \\
\bar{a}_{ij} &= \frac{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i,q_{t+1}=S_j|O^k, \lambda)}{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} P(q_t=S_i|O^k, \lambda)} \\
\bar{b}_i(m) &= \frac{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda) 1(O_t^k=v_m)}{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}} P(q_t=S_i|O^k,\lambda)}
\end{align}\]

定义:

\[ \begin{align}
\xi_t(i, j) & \equiv P(q_t=S_i, q_{t+1}=S_j | O, \lambda) \\
&= \frac{P(q_t=S_i, q_{t+1}=S_j, O | \lambda)}{P(O | \lambda)} \\
&= \frac{P(O | q_t=S_i, q_{t+1}=S_j, \lambda)P(q_t=S_i, q_{t+1}=S_j | \lambda)}{P(O | \lambda)} \\
&= \frac{P(O | q_t=S_i, q_{t+1}=S_j, \lambda)P(q_{t+1}=S_j | q_t=S_i, \lambda)P(q_t=S_i | \lambda)}{P(O | \lambda)} \\
&= \frac{P(O_1 \cdots O_t | q_t=S_i, \lambda)P(O_{t+1} | q_{t+1}=S_j, \lambda)P(O_{t+2} \cdots O_T | q_{t+1}=S_j, \lambda)P(q_{t+1}=S_j | q_t=S_i, \lambda)P(q_t=S_i | \lambda)}{P(O | \lambda)} \\
&= \frac{P(O_1 \cdots O_t, q_t=S_i | \lambda)P(O_{t+1} | q_{t+1}=S_j, \lambda)P(O_{t+2} \cdots O_T | q_{t+1}=S_j, \lambda)P(q_{t+1}=S_j | q_t=S_i, \lambda)}{P(O | \lambda)} \\
&= \frac{ \alpha_t(i) a_{ij} b_j(O_{t+1}) \beta_{t+1}(j) }{ \sum_k^N \sum_l^N \alpha_t(k) a_{kl} b_l(O_{t+1}) \beta_{t+1}(l) } \\
\gamma_t(i) & \equiv P(q_t=S_i | O, \lambda) = \sum_{j=1}^N \xi_t(i,j)
\end{align} \]

\( \gamma_T(i) \)计算时会超出序列长度,可以更改为:

\[ \gamma_t(i) = \frac{ \alpha_t(i) \beta_t(i) }{ \sum_l^N \alpha_t(l) \beta_t(l) } \]

于是更新右侧可以改写为\( ^{[1]} \):

\[\begin{align}
{\bar{\pi}_i} &= \frac{\sum_{k=1}^K \gamma_1^k(i)}{K} \\
\bar{a}_{ij} &= \frac{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} \xi_t^k(i, j)}{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}-1} \gamma_t^k(i)} \\
\bar{b}_i(m) &= \frac{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}} \gamma_t^k(i) 1(O_t^k=v_m)}{\sum_{k=1}^K \sum_{t=1}^{T_{O^k}} \gamma_t^k(i)}
\end{align}\]

4.1.3 停止迭代

当更新前后参数差小于特定值时停止迭代,模型参数学习完成.

\[ || \lambda – \bar{\lambda} ||_1 < \epsilon \text{ or } || \lambda – \bar{\lambda} ||_2 < \epsilon \]

4.1.4 局部最优

Baum-Welch算法得到的参数无法保证是全局最优,其效果严重依赖随机初始化参数值,为尽量提高参数优化效果,工程中,需要使用多组随机参数作为Baum-Welch算法初值参数,计算完毕后再择优选择最终得到的最好的参数值.

4.2 数值稳定性

使用\( \alpha_t(i), \beta_t(i) \)时,当观测序列过长,计算过程会超出机器精度,这里需要使用 \( \hat{\alpha}_t(i), \hat{\beta}_t(i) \)替代\( ^{[2]} \).

\[\begin{align}
\hat{\xi}_t(i, j) &= \frac{1}{C_{t+1}} \hat{\alpha}_t(i) a_{ij} b_j(O_{t+1}) \hat{\beta}_{t+1}(j) \\
\hat{\gamma}_t(i) &= \hat{\alpha}_t(i) \hat{\beta}_t(i)
\end{align}\]

4.3 代码试验

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
import numpy as np
from matplotlib import pyplot as plt
 
np.random.seed(42)
 
# normal forward algorithm
class Alpha:
  def __init__(self, O, Lambda):
    self.O = O
    self.A, self.B, self.Pi = Lambda
    self.T = len(O)
    self.N = self.A.shape[0]
    self.M = self.B.shape[1]
    assert( self.A.shape[0] == self.A.shape[1] == self.B.shape[0] == self.Pi.shape[0] )
    assert( min(self.O) >= 0 and max(self.O) < self.B.shape[1] )
    '''
    \alpha_t(i) \equiv P(O_1 \cdots O_t, q_t = S_i | \lambda)
    notice: param t: 0=>t1, 1=>t2, ..., T-1=>T
    '''
    self.alpha = np.zeros(shape=(self.T, self.N))
    for t in range(self.T):
      if t == 0:
        self.alpha[t] = [self.Pi[i] * self.B[i, self.O[t]] for i in range(self.N)]
      else:
        self.alpha[t] = [sum([self.alpha[t-1, h] * self.A[h, i] for h in range(self.N)]) * self.B[i, self.O[t]] for i in range(self.N)]
 
  def __call__(self, t, i):
    return self.alpha[t, i]
 
# scaling forward algorithm
class Alpha_Hat:
  def __init__(self, O, Lambda):
    self.O = O
    self.A, self.B, self.Pi = Lambda
    self.T = len(O)
    self.N = self.A.shape[0]
    self.M = self.B.shape[1]
    assert( self.A.shape[0] == self.A.shape[1] == self.B.shape[0] == self.Pi.shape[0] )
    assert( min(self.O) >= 0 and max(self.O) < self.B.shape[1] )
    '''
    \hat{\alpha}_t(i) & \equiv P(q_t = S_i | O_1 \cdots O_t, \lambda)
    notice: param t: 0=>t1, 1=>t2, ..., T-1=>T
    '''
    self.C = np.zeros(shape=(self.T,))
    self.alpha_hat = np.zeros(shape=(self.T, self.N))
    for t in range(self.T):
      if t==0:
        self.alpha_hat[t] = [self.Pi[i] * self.B[i, self.O[t]] for i in range(self.N)]
      else:
        self.alpha_hat[t] = [sum([self.alpha_hat[t-1, h] * self.A[h, i] for h in range(self.N)]) * self.B[i, self.O[t]] for i in range(self.N)]
      self.C[t] = self.alpha_hat[t].sum()
      self.alpha_hat[t] /= self.