HMM-2: Computing the Probability of an Observation Sequence (forward-backward)

2 The forward-backward algorithm

Given the model parameters, compute the probability of any given observation sequence \( O \)\( ^{[2]} \).

\[ P(O|\lambda) = \sum_Q P(O, Q | \lambda) \]

There are \( N^T \) possible state sequences \( Q \), so evaluating this sum directly has complexity \( O(N^T) \), which is clearly too expensive. The forward-backward algorithm achieves an acceptable cost instead.
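
As a sanity check on the sum above, \( P(O|\lambda) \) can be computed by brute force when \( N \) and \( T \) are tiny. The sketch below assumes a randomly generated toy model (the names T, N, M, O, Pi, A, B are illustrative, not taken from the text) and enumerates all \( N^T \) state sequences; the forward algorithm of the next section reproduces the same value at a fraction of the cost.

import itertools
import numpy as np

np.random.seed(0)
T, N, M = 4, 3, 5                                # keep N**T small enough to enumerate
O  = np.random.randint(0, M, size=T)             # observations O_1..O_T (0-based indices)
Pi = np.random.dirichlet(np.ones(N))             # initial distribution pi_i
A  = np.random.dirichlet(np.ones(N), size=N)     # transition matrix a_ij
B  = np.random.dirichlet(np.ones(M), size=N)     # emission matrix b_j(m)

p = 0.0
for Q in itertools.product(range(N), repeat=T):  # all N**T state sequences
  joint = Pi[Q[0]] * B[Q[0], O[0]]               # P(q_1) * P(O_1 | q_1)
  for t in range(1, T):
    joint *= A[Q[t-1], Q[t]] * B[Q[t], O[t]]     # transition prob * emission prob at step t
  p += joint                                     # accumulate P(O, Q | lambda)
print(p)                                         # P(O | lambda) by exhaustive enumeration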

2.1 Forward algorithm

Both the forward and the backward algorithm run in \( O(N^2 T) \) time; a vectorized sketch of the forward pass follows the derivation below.

  • Definition:

\[ \alpha_t(i) \equiv P(O_1 \cdots O_t, q_t = S_i | \lambda) \]

  • Initialization:

\[ \begin{align}
\alpha_1(i) &\equiv P(O_1,q_1=S_i|\lambda) \\
&= P(O_1 | q_1=S_i,\lambda)P(q_1=S_i|\lambda) \\
&= \pi_i b_i(O_1)
\end{align} \]

  • Recursion:

\[ \begin{align}
\alpha_{t+1}(j) &\equiv P(O_1 \cdots O_{t+1}, q_{t+1}=S_j|\lambda) \\
&= P(O_1 \cdots O_{t+1} | q_{t+1}=S_j, \lambda) P(q_{t+1}=S_j | \lambda) \\
&= P(O_1 \cdots O_t|q_{t+1}=S_j,\lambda)P(O_{t+1}|q_{t+1}=S_j,\lambda)P(q_{t+1}=S_j|\lambda) \\
&= P(O_1 \cdots O_t, q_{t+1}=S_j|\lambda)P(O_{t+1}|q_{t+1}=S_j,\lambda) \\
&= P(O_{t+1}|q_{t+1}=S_j,\lambda) \sum_{i=1}^N P(O_1 \cdots O_t, q_t=S_i, q_{t+1}=S_j|\lambda) \\
&= P(O_{t+1}|q_{t+1}=S_j,\lambda) \sum_{i=1}^N P(O_1 \cdots O_t, q_{t+1}=S_j | q_t=S_i, \lambda)P(q_t=S_i|\lambda) \\
&= P(O_{t+1}|q_{t+1}=S_j,\lambda) \sum_{i=1}^N P(O_1 \cdots O_t | q_t=S_i, \lambda) P(q_{t+1} = S_j | q_t=S_i, \lambda) P(q_t=S_i|\lambda) \\
&= P(O_{t+1}|q_{t+1}=S_j,\lambda) \sum_{i=1}^N P(O_1 \cdots O_t, q_t=S_i | \lambda) P(q_{t+1} = S_j | q_t=S_i, \lambda) \\
&= \left[ \sum_{i=1}^N \alpha_t(i) a_{ij} \right] b_j(O_{t+1})
\end{align} \]

  • Termination:

\[ \begin{align}
P(O|\lambda) &= \sum_{i=1}^N P(O, q_T = S_i | \lambda) \\
&= \sum_{i=1}^N \alpha_T(i)
\end{align} \]
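
The four steps above collapse into a single matrix-vector recursion, \( \alpha_{t+1} = (\alpha_t A) \odot b(O_{t+1}) \), which makes the \( O(N^2 T) \) cost explicit. A minimal vectorized sketch, again on a randomly generated toy model (names are illustrative, not from the text):

import numpy as np

np.random.seed(0)
T, N, M = 100, 10, 50
O  = np.random.randint(0, M, size=T)         # O[t] stores observation O_{t+1} (0-based)
Pi = np.random.dirichlet(np.ones(N))         # pi_i
A  = np.random.dirichlet(np.ones(N), size=N) # A[i, j] = a_ij
B  = np.random.dirichlet(np.ones(M), size=N) # B[j, m] = b_j(m)

alpha = Pi * B[:, O[0]]                      # alpha_1(i) = pi_i b_i(O_1)
for t in range(1, T):
  alpha = (alpha @ A) * B[:, O[t]]           # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(O_{t+1})
print(alpha.sum())                           # P(O | lambda) = sum_i alpha_T(i)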

2.1.1 Numerical stability

As with the model probability in the previous part, the recursion multiplies together many values smaller than 1.0; and because each step also contains a sum, log probabilities cannot be used directly. To keep the computation within machine precision, the intermediate \( \alpha_{t}(i) \) can be rescaled\( ^{[1][5]} \); a vectorized sketch of the scaled recursion appears at the end of this subsection.

The derivation below relies on the chain rule of probability\( ^{[3]} \) and the definition of conditional probability\( ^{[4]} \).

  • Definition:

\[ \begin{align}
C_t & \equiv P(O_t | O_1 \cdots O_{t-1}, \lambda) \\
\hat{\alpha}_t(i) & \equiv P(q_t = S_i | O_1 \cdots O_t, \lambda) \\
&= \frac{ P(q_t = S_i, O_1 \cdots O_t | \lambda) }{ P(O_1 \cdots O_t | \lambda) } \\
&= \frac{\alpha_t(i)}{\prod_{u=1}^t C_u}
\end{align} \]

  • Recursion:

\[ \begin{align}
\hat{\alpha}_{t+1}(j) & \equiv P(q_{t+1} = S_j | O_1 \cdots O_{t+1}, \lambda) \\
&= \frac{ P(q_{t+1} = S_j, O_1 \cdots O_{t+1} | \lambda)}{ P( O_1 \cdots O_{t+1} | \lambda)} \\
&= \frac{\alpha_{t+1}(j)}{P( O_1 \cdots O_{t+1} | \lambda)} \\
&= \frac{ \left[ \sum_{i=1}^N \alpha_t(i) a_{ij} \right] b_j(O_{t+1}) }{P( O_1 \cdots O_{t+1} | \lambda)} \\
&=\frac{\left[ \sum_{i=1}^N P(O_1 \cdots O_t, q_t = S_i | \lambda) a_{ij} \right] b_j(O_{t+1})}{P(O_1 \cdots O_{t+1} | \lambda)} \\
&=\frac{\left[ \sum_{i=1}^N P(q_t = S_i | O_1 \cdots O_t, \lambda) a_{ij} \right] b_j(O_{t+1})}{P(O_{t+1} |O_1 \cdots O_t, \lambda)} \\
&=\frac{\left[ \sum_{i=1}^N \hat{\alpha}_t(i) a_{ij} \right] b_j(O_{t+1})}{P(O_{t+1} | O_1 \cdots O_t, \lambda)} \\
&=\frac{\left[ \sum_{i=1}^N \hat{\alpha}_t(i) a_{ij} \right] b_j(O_{t+1})}{C_{t+1}} \\
\text{where:} \\
\left[ \sum_{i=1}^N \hat{\alpha}_t(i) a_{ij} \right] b_j(O_{t+1}) &= \left[ \sum_{i=1}^N P(q_t = S_i | O_1 \cdots O_t, \lambda) P(q_{t+1}=S_j | q_t = S_i, \lambda) \right] P(O_{t+1} | q_{t+1}=S_j, \lambda) \\
&= \sum_{i=1}^N P(q_t = S_i | O_1 \cdots O_t, \lambda) P(q_{t+1}=S_j | q_t = S_i, \lambda) P(O_{t+1} | q_{t+1}=S_j, \lambda) \\
& = \sum_{i=1}^N P(q_t = S_i, q_{t+1}=S_j, O_{t+1} | O_1 \cdots O_t, \lambda) \\
& = P(q_{t+1}=S_j, O_{t+1} | O_1 \cdots O_t, \lambda) \\
\text{thus:} \\
C_{t+1} &= \sum_{j=1}^N P(q_{t+1}=S_j, O_{t+1} | O_1 \cdots O_t, \lambda) \\
&= P(O_{t+1} | O_1 \cdots O_t, \lambda) \\
\text{so:} \\
C_{t+1} &= \sum_{j=1}^N \left[ \sum_{i=1}^N \hat{\alpha}_t(i) a_{ij} \right] b_j(O_{t+1}) \\
\end{align} \]

  • Termination:

\[ \begin{align}
P(O | \lambda) &= \prod_{t=1}^T C_t \\
\log P(O | \lambda) &= \sum_{t=1}^T \log C_t
\end{align} \]
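
Putting the definitions together: each step normalizes the forward vector by its sum, and that sum is exactly the scaling factor \( C_t \). A minimal sketch on a randomly generated toy model (names are illustrative); with \( T = 1000 \) here, the unscaled \( \alpha_t \) would underflow double precision, while the scaled version returns \( \log P(O|\lambda) \) without trouble:

import numpy as np

np.random.seed(0)
T, N, M = 1000, 10, 50
O  = np.random.randint(0, M, size=T)
Pi = np.random.dirichlet(np.ones(N))
A  = np.random.dirichlet(np.ones(N), size=N)
B  = np.random.dirichlet(np.ones(M), size=N)

logP = 0.0
alpha_hat = Pi * B[:, O[0]]                  # unnormalized alpha_1(i) = pi_i b_i(O_1)
for t in range(T):
  if t > 0:
    alpha_hat = (alpha_hat @ A) * B[:, O[t]] # one forward step from the normalized vector
  C = alpha_hat.sum()                        # C = P(current observation | all previous observations)
  alpha_hat = alpha_hat / C                  # normalized: P(current state | observations so far)
  logP += np.log(C)                          # log P(O | lambda) = sum_t log C_t
print(logP)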

2.2 Backward algorithm

  • Definition:

\[ \beta_t(i) \equiv P(O_{t+1} \cdots O_T | q_t = S_i, \lambda) \]

  • Initialization:

\[ \beta_T(i) = 1 \]

  • Recursion:

\[ \begin{align}
\beta_t(i) &\equiv P(O_{t+1} \cdots O_T|q_t=S_i, \lambda) \\
&= \sum_{j=1}^N P(O_{t+1} \cdots O_T, q_{t+1}=S_j|q_t=S_i,\lambda) \\
&= \sum_{j=1}^N P(O_{t+1} \cdots O_T | q_{t+1}=S_j, q_t=S_i, \lambda) P(q_{t+1} = S_j | q_t=S_i,\lambda) \\
&= \sum_{j=1}^N P(O_{t+1} | q_{t+1}=S_j, q_t=S_i, \lambda) P(O_{t+2} \cdots O_T | q_{t+1}=S_j, q_t=S_i, \lambda) P(q_{t+1} = S_j | q_t=S_i,\lambda) \\
&= \sum_{j=1}^N P(O_{t+1} | q_{t+1} = S_j,\lambda)P(O_{t+2} \cdots O_T | q_{t+1}=S_j, \lambda) P(q_{t+1} = S_j|q_t=S_i,\lambda) \\
&= \sum_{j=1}^N a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)
\end{align} \]

In the next-to-last step, \( q_t=S_i \) is dropped from the conditioning. This follows from the HMM assumptions: the next state depends only on the current state (anything further back can be ignored), and each observation depends only on the state that emits it.

  • Termination:

\[ \begin{align}
\beta_0(i) &= \sum_{j=1}^N \pi_{j} b_j(O_1) \beta_1(j) \\
P(O|\lambda) &= \beta_0(i) \quad \text{for any } i
\end{align} \]
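
As with the forward pass, the backward recursion is one matrix-vector product per step, \( \beta_t = A \left( b(O_{t+1}) \odot \beta_{t+1} \right) \). A minimal sketch on a randomly generated toy model (names are illustrative) that also checks the termination identity against the forward result:

import numpy as np

np.random.seed(0)
T, N, M = 100, 10, 50
O  = np.random.randint(0, M, size=T)
Pi = np.random.dirichlet(np.ones(N))
A  = np.random.dirichlet(np.ones(N), size=N)
B  = np.random.dirichlet(np.ones(M), size=N)

beta = np.ones(N)                            # beta_T(i) = 1
for t in range(T - 1, 0, -1):                # compute beta_t from beta_{t+1}, t = T-1, ..., 1
  beta = A @ (B[:, O[t]] * beta)             # beta_t(i) = sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j); O[t] is O_{t+1}
backward_P = np.sum(Pi * B[:, O[0]] * beta)  # beta_0 = sum_j pi_j b_j(O_1) beta_1(j)

alpha = Pi * B[:, O[0]]                      # forward pass for comparison
for t in range(1, T):
  alpha = (alpha @ A) * B[:, O[t]]
print(backward_P, alpha.sum())               # both equal P(O | lambda)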

2.2.1 Numerical stability

  • Definition:

\[ \begin{align}
\hat{\beta}_t(i) & \equiv \frac{ P(O_{t+1} \cdots O_T|q_t=S_i, \lambda) }{ P(O_{t+1} \cdots O_T | O_1 \cdots O_t, \lambda) } \\
&= \frac{ \beta_t(i) }{ \prod_{u=t+1}^T C_u }
\end{align} \]

The scaling is the same as for \( \hat{\alpha}_t(i) \) above, only running in the reverse direction. Note:

\[ \begin{align}
\prod_{u=1}^t C_u \prod_{u=t+1}^T C_u &= P(O_1 \cdots O_t | \lambda) P(O_{t+1} \cdots O_T | O_1 \cdots O_t, \lambda) \\
&= P(O_1 \cdots O_T | \lambda) \\
\alpha_t(i) \beta_t(i) &= P(O_1 \cdots O_t, q_t = S_i | \lambda) P(O_{t+1} \cdots O_T | q_t = S_i, \lambda) \\
&= P(O_1 \cdots O_t, q_t = S_i, O_{t+1} \cdots O_T | \lambda) \\
\hat{\alpha}_t(i) \hat{\beta}_t(i) &= \frac{\alpha_t(i)}{\prod_{u=1}^t C_u} \frac{ \beta_t(i) }{ \prod_{u=t+1}^T C_u } \\
&= P(q_t = S_i | O_1 \cdots O_T, \lambda)
\end{align} \]

  • Recursion:

\[ \begin{align}
\hat{\beta}_t(i) & \equiv \frac{ P(O_{t+1} \cdots O_T|q_t=S_i, \lambda) }{ P(O_{t+1} \cdots O_T | O_1 \cdots O_t, \lambda) } \\
&= \frac{ \beta_t(i) }{ \prod_{u=t+1}^T C_u } \\
&= \frac{ \sum_{j=1}^N a_{ij} b_j(O_{t+1}) \beta_{t+1}(j) }{ \prod_{u=t+1}^T C_u } \\
&= \frac{ \sum_{j=1}^N a_{ij} b_j(O_{t+1}) \beta_{t+1}(j) }{ C_{t+1} \prod_{u=t+2}^T C_u } \\
&= \frac{ \sum_{j=1}^N a_{ij} b_j(O_{t+1}) \frac{\beta_{t+1}(j)}{\prod_{u=t+2}^T C_u} }{ C_{t+1} } \\
&= \frac{ \sum_{j=1}^N a_{ij} b_j(O_{t+1}) \hat{\beta}_{t+1}(j)}{ C_{t+1} }
\end{align} \]
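
The scaled backward pass reuses the factors \( C_t \) already produced by the forward pass, so no new normalization has to be computed. A minimal sketch on a randomly generated toy model (names are illustrative) that also verifies \( \sum_i \hat{\alpha}_t(i) \hat{\beta}_t(i) = 1 \), i.e. that the posterior \( P(q_t = S_i | O, \lambda) \) sums to one:

import numpy as np

np.random.seed(0)
T, N, M = 1000, 10, 50
O  = np.random.randint(0, M, size=T)
Pi = np.random.dirichlet(np.ones(N))
A  = np.random.dirichlet(np.ones(N), size=N)
B  = np.random.dirichlet(np.ones(M), size=N)

# Scaled forward pass: row k of alpha_hat holds alpha_hat_{k+1}, C[k] holds C_{k+1}.
C = np.zeros(T)
alpha_hat = np.zeros((T, N))
a = Pi * B[:, O[0]]
for k in range(T):
  if k > 0:
    a = (alpha_hat[k - 1] @ A) * B[:, O[k]]
  C[k] = a.sum()
  alpha_hat[k] = a / C[k]

# Scaled backward pass: beta_hat_T(i) = 1, and
# beta_hat_t(i) = [sum_j a_ij b_j(O_{t+1}) beta_hat_{t+1}(j)] / C_{t+1}.
beta_hat = np.ones((T, N))                   # row k holds beta_hat_{k+1}
for k in range(T - 2, -1, -1):
  beta_hat[k] = (A @ (B[:, O[k + 1]] * beta_hat[k + 1])) / C[k + 1]

k = T // 2
print(np.sum(np.log(C)))                     # log P(O | lambda) = sum_t log C_t
print(np.sum(alpha_hat[k] * beta_hat[k]))    # sum_i P(q_t = S_i | O, lambda) = 1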

2.3 Code experiment

import numpy as np
 
np.random.seed(42)
 
T,N,M = 100,10,1000
 
O = np.random.randint(low=0, high=M, size=T).flatten() # O = { O_1,O_2,\cdots,O_T } 
Pi = np.random.dirichlet(np.ones(N), size=1).flatten() # \Pi = [\pi_i]
A = np.random.dirichlet(np.ones(N), size=N) # A = [a_{ij}]
B = np.random.dirichlet(np.ones(M), size=N) # B = [b_j(m)]
Lambda = (A, B, Pi) # \lambda = (A,B,\Pi)
 
alpha_cache = np.zeros(shape=(T,N))
def alpha(t):
  '''
  \alpha_t(i) \equiv P(O_1 \cdots O_t, q_t = S_i | \lambda)
  notice: param t: 0=>t1, 1=>t2, ..., T-1=>T
  '''
  if t == 0:
    alpha_cache[t] = [Pi[i] * B[i, O[t]] for i in range(N)]
  else:
    alpha_cache[t] = [sum([alpha_cache[t-1, h] * A[h, i] for h in range(N)]) * B[i, O[t]] for i in range(N)]
  return alpha_cache[t]
 
# notice: param t: 0=>t1, 1=>t2, ..., T-1=>T
for t in range(T):
  alpha(t)
 
# P(O|\lambda) = \sum_{i=1}^N \alpha_T(i)
# notice: param t: 0=>t1, 1=>t2, ..., T-1=>T
alpha_P = sum([ alpha_cache[T-1, i] for i in range(N) ])
 
beta_cache = np.ones(shape=(T+1, N))
def beta(t):
  '''
  \beta_t(i) \equiv P(O_{t+1} \cdots O_T | q_t = S_i, \lambda)
  notice: param t: T=>T,T-1=>T-1,...,1=>t1,0=>t0
  '''
  if t == 0:
    beta_cache[t] = np.array([ Pi[j] * B[j, O[t]] * beta_cache[t+1, j] for j in range(N)]).sum()
  else:
    beta_cache[t] = np.array([ A[:,j] * B[j, O[t]] * beta_cache[t+1, j] for j in range(N)]).sum(axis=0)
  return beta_cache[t]
 
# notice: param t: T=>T,T-1=>T-1,...,1=>t1,0=>t0
for t in range(T-1, -1, -1):
  beta(t)
 
# P(O|\lambda) = \beta_0(i) where i is any value
# notice: param t: T=>T,T-1=>T-1,...,1=>t1,0=>t0
beta_P = beta_cache[0,0]
 
C_cache = np.zeros(shape=(T,))
alpha_hat_cache = np.zeros(shape=(T,N))
def alpha_hat(t):
  '''
  \hat{\alpha}_t(i) & \equiv P(q_t = S_i | O_1 \cdots O_t, \lambda)
  notice: param t: 0=>t1, 1=>t2, ..., T-1=>T
  '''
  if t==0:
    alpha_hat_cache[t] = [Pi[i] * B[i, O[t]] for i in range(N)]    
  else:
    alpha_hat_cache[t] = [sum([alpha_hat_cache[t-1, h] * A[h, i] for h in range(N)]) * B[i, O[t]] for i in range(N)]
  C_cache[t] = alpha_hat_cache[t].sum()
  alpha_hat_cache[t] /= C_cache[t]
  return alpha_hat_cache[t]
 
# notice: param t: 0=>t1, 1=>t2, ..., T-1=>T
for t in range(T):
  alpha_hat(t)
 
# P(O | \lambda) &= \prod_{t=1}^T C_t
# \log P(O | \lambda) &= \sum_{t=1}^T \log C_t
# notice: param t: 0=>t1, 1=>t2, ..., T-1=>T
CPO = np.prod(C_cache)
logCPO = np.sum( np.log(C_cache) )
 
beta_hat_cache = np.ones(shape=(T+1, N))
def beta_hat(t):
  '''
  \hat{\beta}_t(i) = \frac{ \beta_t(i) }{ \prod_{u=t+1}^T C_u } 
  notice: param t: T=>T,T-1=>T-1,...,1=>t1,0=>t0
  '''
  if t == 0:
    beta_hat_cache[t] = np.array([ Pi[j] * B[j, O[t]] * beta_hat_cache[t+1, j] for j in range(N)]).sum()
  else:
    beta_hat_cache[t] = np.array([ A[:,j] * B[j, O[t]] * beta_hat_cache[t+1, j] for j in range(N)]).sum(axis=0)
  beta_hat_cache[t] /= C_cache[t]
  return beta_hat_cache[t]
 
# notice: param t: T=>T,T-1=>T-1,...,1=>t1,0=>t0
for t in range(T-1, -1, -1):
  beta_hat(t)
 
t = T//2 # 0=>t1, 1=>t2, ..., T-1=>T
print(alpha_P, r'P(O | \lambda) = \sum_{i=1}^N \alpha_T(i)')
print(beta_P, r'P(O | \lambda) = \sum_{j=1}^N \pi_{j} b_j(O_1) \beta_1(j)')
print(CPO, r'P(O | \lambda) = \prod_{t=1}^T C_t')
print(np.exp(logCPO), r'P(O | \lambda) = \exp \left[ \sum_{t=1}^T \log C_t \right]')
print(np.sum([alpha_cache[t,i] * beta_cache[t+1,i] for i in range(N)]), r'P(O | \lambda) = \sum_{i=1}^N \alpha_t(i) \beta_t(i)')
print(np.sum([alpha_hat_cache[t,i] * beta_hat_cache[t+1,i] * CPO for i in range(N)]), r'P(O | \lambda) = \sum_{i=1}^N \left[ \hat{\alpha}_t(i) \hat{\beta}_t(i) \prod_{u=1}^T C_u \right]')

output:
6.8789476175101185e-304 \(P(O | \lambda) = \sum_{i=1}^N \alpha_T(i)\)
6.878947617510119e-304 \(P(O | \lambda) = \sum_{j=1}^N \pi_{j} b_j(O_1) \beta_1(j)\)
6.878947617510125e-304 \(P(O | \lambda) = \prod_{t=1}^T C_t\)
6.878947617510068e-304 \(P(O | \lambda) = \exp \left[ \sum_{t=1}^T \log C_t \right]\)
6.878947617510122e-304 \(P(O | \lambda) = \sum_{i=1}^N \alpha_t(i) \beta_t(i)\)
6.8789476175101225e-304 \(P(O | \lambda) = \sum_{i=1}^N \left[ \hat{\alpha}_t(i) \hat{\beta}_t(i) \prod_{u=1}^T C_u \right]\)

References:
[1] Rabiner, L. R. 1989. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.” Proceedings of the IEEE 77:257–286.
[2] Daniel Jurafsky and James H. Martin. “Speech and Language Processing.” Section 15.3 Hidden Markov Models, p. 421.
[3] Chain rule (probability). Wikipedia.
[4] Conditional probability. Wikipedia.
[5] Christopher M. Bishop. “Pattern Recognition and Machine Learning.” Springer, 2006. Section 13.2.4 Scaling factors, pp. 627–629.
