Chinese Word Segmentation, POS and Sequence Labeling with CRFs, Part 6: Experimental Results

6 In practice the algorithm must be transformed to stay within floating-point range

In practice, the normalization factor \( Z_w(x) \) becomes extremely large for long sequences, so every computation involving it is replaced by its \( \log \sum \exp \) (log-sum-exp) form, and the corresponding probability operations are transformed accordingly.
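The log-sum-exp trick can be sketched in Python as a minimal standalone version (the actual implementation lives in the C++ code below): shifting by the maximum keeps every `exp` argument at or below zero, so nothing overflows.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values)) via the max-shift trick."""
    m = max(values)
    if m == float("-inf"):
        # every term is exp(-inf) = 0, so the log of the sum is -inf
        return float("-inf")
    # log sum exp(v) = m + log sum exp(v - m), and v - m <= 0 never overflows
    return m + math.log(sum(math.exp(v - m) for v in values))
```

A naive `math.log(sum(math.exp(v) for v in values))` would overflow for inputs like `[1000.0, 1000.0]`, while the shifted form returns `1000 + log 2` exactly.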

6.1 Combining C/C++ and Python to address the slow global normalization

The CRF optimizes its parameters globally: the loss and the probability are normalized over the entire labeled sequence, unlike MEMM, which only normalizes locally at each transition. This makes training very slow, because each (x, y) pair is a long sequence and every feature function must scan it in full. A pure-Python implementation would be far too slow, so C/C++ and Python are combined here, with the linear algebra done in boost::ublas.
The code is fairly long; see github/shizhuolin/zh-word-segment for the full source.

// Build the n+1 log-potential matrices M_i(x) for one input sequence x.
inline vector<ublas::matrix<double>> xmatrices(const YSET_T& yset,
                                               const FEATURES_T& features,
                                               const ublas::vector<double>& weights,
                                               const wstring& x)
{
    size_t n = x.length();
    vector<ublas::matrix<double>> ms(n + 1);
    for (size_t i = 0; i <= n; ++i)
        ms.at(i) = xmatrixi(yset, features, weights, x, i);
    return ms;
}
 
// Forward pass in log space: alpha(i, y) = log-sum of all path scores ending
// in label y at position i. Template arguments are left to deduction so the
// const matrices in ms bind correctly.
inline ublas::matrix<double> xalphas(const vector<ublas::matrix<double>>& ms)
{
    size_t mslen = ms.size();
    ublas::matrix<double> alpha(mslen, ms[0].size1());
    // start: the first row of M_0 holds the scores out of the start label
    ublas::row(alpha, 0) = ublas::row(ms[0], 0);
    for (size_t i = 1; i < mslen - 1; ++i)
    {
        for (size_t j = 0; j < ms[i].size2(); ++j)
        {
            // log-sum-exp over the previous labels replaces the plain sum
            ublas::vector<double> values = ublas::row(alpha, i - 1)
                                           + ublas::column(ms[i], j);
            alpha(i, j) = logsumexp(values);
        }
    }
    // stop: add the column leading into the stop label element-wise;
    // log Z_w(x) is then the log-sum-exp of this final row
    ublas::row(alpha, mslen - 1) = ublas::row(alpha, mslen - 2)
                                   + ublas::column(ms[mslen - 1], 0);
    return alpha;
}
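For clarity, the same forward recursion can be mirrored in plain Python. This is a hypothetical sketch using nested lists for the log-potential matrices, with the same start/stop conventions as `xalphas`: row 0 of the first matrix holds the start scores, column 0 of the last matrix the stop scores.

```python
import math

def _logsumexp(values):
    """Stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_forward(ms):
    """Log-domain forward pass over a list of K x K log-potential matrices."""
    K = len(ms[0])
    n = len(ms)
    alphas = [[0.0] * K for _ in range(n)]
    alphas[0] = list(ms[0][0])                      # start -> first label
    for i in range(1, n - 1):
        for j in range(K):                          # log-sum over previous labels
            alphas[i][j] = _logsumexp(
                [alphas[i - 1][k] + ms[i][k][j] for k in range(K)])
    # stop: add the column into the stop label element-wise
    alphas[n - 1] = [alphas[n - 2][k] + ms[n - 1][k][0] for k in range(K)]
    return alphas

def log_z(ms):
    """log of the normalization factor Z_w(x)."""
    return _logsumexp(log_forward(ms)[-1])
```

For a short sequence, `log_z` agrees with the brute-force log of the sum over all label paths, which is a convenient correctness check for the recursion.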
Decoding is a log-space Viterbi search implemented in Python; '^' and '$' are the start and stop labels:

def viterbi(observation):
    T = len(observation)
    delta = [None] * (T + 1)
    psi = [None] * (T + 1)
    # start: score each label coming out of the start symbol '^'
    delta[0] = {label: model_weight('^', label, (observation, 0)) for label in labels}
    psi[0] = {label: None for label in labels}
    for t in range(1, T):
        Ot = (observation, t)
        delta[t], psi[t] = {}, {}
        for label in labels:
            # best previous label and its score, in one pass
            score, prev = max((delta[t - 1][p] + model_weight(p, label, Ot), p)
                              for p in labels)
            delta[t][label], psi[t][label] = score, prev
    # stop: transition every label into the end symbol '$'
    Ot = (observation, T)
    delta[T], psi[T] = max((delta[T - 1][label] + model_weight(label, '$', Ot), label)
                           for label in labels)
    # backtrack from the best final label
    q = [None] * (T + 1)
    q[T] = psi[T]
    for t in range(T - 1, -1, -1):
        q[t] = psi[t][q[t + 1]]
    return q[1:]
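To sanity-check the decoder, here is a hypothetical toy harness: `labels` and `model_weight` below are stand-ins for the trained CRF model, chosen only to exercise the search, and the decoder is a compact copy of the one above so the sketch runs standalone.

```python
labels = ['B', 'I']

def model_weight(prev_label, label, Ot):
    """Toy log-weights: prefer 'B' at position 0 and 'I' at every later position."""
    observation, t = Ot
    if t >= len(observation):          # transition into the stop label '$'
        return 0.0
    if t == 0:
        return 1.0 if label == 'B' else 0.0
    return 1.0 if label == 'I' else 0.0

def viterbi(observation):
    """Compact copy of the log-space Viterbi decoder above."""
    T = len(observation)
    delta = [None] * (T + 1)
    psi = [None] * (T + 1)
    delta[0] = {label: model_weight('^', label, (observation, 0)) for label in labels}
    psi[0] = {label: None for label in labels}
    for t in range(1, T):
        Ot = (observation, t)
        delta[t], psi[t] = {}, {}
        for label in labels:
            score, prev = max((delta[t - 1][p] + model_weight(p, label, Ot), p)
                              for p in labels)
            delta[t][label], psi[t][label] = score, prev
    Ot = (observation, T)
    delta[T], psi[T] = max((delta[T - 1][label] + model_weight(label, '$', Ot), label)
                           for label in labels)
    q = [None] * (T + 1)
    q[T] = psi[T]
    for t in range(T - 1, -1, -1):
        q[t] = psi[t][q[t + 1]]
    return q[1:]
```

Under these toy weights `viterbi("abc")` returns `['B', 'I', 'I']`: the start bonus selects 'B' at position 0 and the backtrack through `psi` recovers the 'I' continuations.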

6.2 Measured results on the PKU dataset

Training is slow: the full run took nearly 10 hours, and there is still room for optimization. The results are a large improvement over MEMM with exactly the same feature templates. Because even a single pass over the parameters is so slow, no further tuning was done. As before, the model is trained on the PKU training set and evaluated against its gold standard.
CRF hyperparameters: L2 sigma = 10, minimum feature count = 1.

algorithm              P      R      F      OOV Rate  OOV Recall  IV Recall
maximum matching       0.836  0.904  0.869  0.058     0.059       0.956
maximum probability    0.859  0.919  0.888  0.058     0.085       0.970
HMM                    0.804  0.789  0.796  0.058     0.364       0.815
Full Second Order HMM  0.824  0.799  0.811  0.058     0.360       0.825
MEMM                   0.909  0.891  0.900  0.058     0.383       0.923
CRF                    0.932  0.912  0.922  0.058     0.556       0.934

References:
[1] LogSumExp: https://en.wikipedia.org/wiki/LogSumExp
