6 工程中算法需做变换,避免超出计算机浮点数范围

在实践中,因为规范因子$ Z_w(x) $在序列较长时值会非常大,因此所有和规范因子相关运算全部使用$ \log \sum \exp $替代。对应的概率操作也需要变换。

6.1 c/c++/python配合实现,解决全局归一速度慢的问题

CRF模型是全局参数优化(即参数在整个被标记序列长度上考虑损失和概率归一,而不是MEMM那样只考虑局部转换),这带来速度非常缓慢的问题,因为每次 x,y就是一个非常长的序列,所有特征函数均需要在其上做完整扫描。用python实现性能会非常缓慢。因此这里用c/c++和python配合实现。线性代数部分用 boost:ublas实现。
代码比较长,请参看github/shizhuolin/zh-word-segment

inline vector<ublas::matrix<double>> xmatrices(const YSET_T& yset,const FEATURES_T& features,const ublas::vector<double>& weights,const wstring& x)
{
    size_t n = x.length();
    vector<ublas::matrix<double>> ms(n+1);
    for(size_t i=0; i<=n; ++i)
        ms.at(i) = xmatrixi(yset, features, weights, x, i);
    return ms;
}

inline ublas::matrix<double> xalphas(const vector<ublas::matrix<double>>& ms)
{
    size_t mslen = ms.size();
    ublas::matrix<double> alpha(mslen, ms[0].size1());
    ublas::row<ublas::matrix<double>>(alpha, 0) = ublas::row<ublas::matrix<double>>(ms[0], 0);
    for(size_t i=1; i<mslen-1; ++i)
    {
        for(size_t j=0; j<ms[i].size2(); ++j)
        {
            ublas::vector<double> values = ublas::row<ublas::matrix<double>>(alpha, i-1)
                                           + ublas::column<ublas::matrix<double>>(ms[i], j);
            alpha(i, j) = logsumexp( values );
        }
    }
    ublas::row<ublas::matrix<double>>(alpha, mslen-1) = ublas::row<ublas::matrix<double>>(alpha, mslen-2)
                                   + ublas::column<ublas::matrix<double>>(ms[mslen-1], 0);
    return alpha;
}
def viterbi(observation):
    T = len(observation)
    delta = [None] * (T + 1)
    psi = [None] * (T + 1)
    delta[0] = dict([(label, model_weight('^', label, (observation, 0))) for label in labels])
    psi[0] = dict( [ (label, None) for label in labels ] )
    for t in range(1, len(observation)):
        Ot = observation,t
        delta[t] = dict([(label, max([delta[t-1][prev_label] + model_weight(prev_label, label, Ot)               for prev_label in labels]))    for label in labels])
        psi[t] = dict([(label, max([(delta[t-1][prev_label] + model_weight(prev_label, label, Ot), prev_label) for prev_label in labels])[1]) for label in labels ])
    Ot = observation,T
    delta[T] = max( [ delta[T-1][label]+model_weight(label, '$', Ot) for label in labels ] )
    psi[T] = max( [(delta[T-1][label]+model_weight(label, '$', Ot), label) for label in labels  ] )[1]
    q = [None] * (T+1)
    q[T] = psi[T]
    for t in range(T-1, -1, -1):
        q[t] = psi[t][q[t+1]]
    return q[1:]

6.2 PKU数据集,实测效果

计算速度较慢,全程花费近10小时,还有优化余地。效果相比MEMM得到极大提升(特征模版完全相同),因扫描一次参数实在太慢, 不在做后续优化。和之前一样使用PKU集训练和其gold检验。
CRF参数为,L2-sigma=10, 最小特征次数限制1次

algorithmPRFOOVOOV RecallIV Recall
maximum matching0.8360.9040.8690.0580.0590.956
maximum probability0.8590.9190.8880.0580.0850.970
HMM0.8040.7890.7960.0580.3640.815
Full Second Order HMM0.8240.7990.8110.0580.3600.825
MEMM0.9090.8910.9000.0580.3830.923
CRF0.9320.9120.9220.0580.5560.934

参考:
[1] LogSumExp: https://en.wikipedia.org/wiki/LogSumExp

Tag:none

Add a new comment.