Journal of Beijing University of Aeronautics and Astronautics, 2008, Vol. 34, Issue (11): 1276-1279.

一种基于N元语法分布的语言模型自适应方法

尹继豪1, 姜志国1, 樊孝忠2   

  1. 1. 北京航空航天大学 宇航学院, 北京 100191;
    2. 北京理工大学 计算机科学技术学院, 北京 100081
  • 收稿日期:2007-11-29 出版日期:2008-11-30 发布日期:2010-09-16
  • 作者简介:尹继豪(1980-),男,河南叶县人,讲师,yjh@buaa.edu.cn.
  • 基金资助:

    教育部博士点基金资助项目(20050007023)

Statistical language model adaptation based on N-gram distribution

Yin Jihao1, Jiang Zhiguo1, Fan Xiaozhong2   

  1. 1. School of Astronautics, Beijing University of Aeronautics and Astronautics, Beijing 100191, China;
    2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
  • Received:2007-11-29 Online:2008-11-30 Published:2010-09-16

Abstract: The N-gram distribution of a corpus describes its characteristics well. To make effective use of general-domain training data, an approach to statistical language model adaptation based on N-gram distribution is proposed. The approach defines a small, high-quality, in-domain set of task-specific data, called the seed set, and a large, general-domain set of out-of-task data of uneven quality, called the training set. The language model is adapted towards the specific domain by adjusting the N-gram distribution of the training set so that it resembles that of the seed set, which improves domain-specific word recognition. Experimental results show clear improvements over conventional methods: compared with the simple interpolation method, perplexity and word error rate decrease by 11.1% and 6.9%, respectively.
