Statistical language model adaptation based on N-gram distribution

Yin Jihao; Jiang Zhiguo; Fan Xiaozhong

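1. School of Astronautics, Beijing University of Aeronautics and Astronautics, Beijing 100191, China
2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China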

Funding: Doctoral Program Foundation of the Ministry of Education of China (20050007023)

About the author:
Yin Jihao (born 1980), male, from Yexian, Henan; lecturer; yjh@buaa.edu.cn.

CLC number: TP 391
Publication history
- Received: 2007-11-29
- Published online: 2008-11-30

Abstract

The N-gram distribution characterizes a corpus well. To make effective use of general-domain training data, a language model adaptation method based on the N-gram distribution is proposed. The method defines a small, high-quality, in-domain seed set and a large, general-domain training set of uneven quality, and adapts the N-gram distribution of the training set so that it resembles that of the seed set, which improves word recognition in the specific domain. Experimental results show that, compared with the conventional simple interpolation method, the proposed method reduces word perplexity and word error rate by 11.1% and 6.9%, respectively.

Keywords: N-gram distribution; seed set; training set; adaptation
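
The abstract names simple interpolation as the baseline and word perplexity as one of the evaluation measures. As a rough illustration of those two ideas only, the sketch below interpolates a seed-set bigram model with a training-set bigram model and scores a held-out sentence. It is not the paper's adaptation method: the toy sentences, the add-one smoothing, the interpolation weight `lam`, and all function names are assumptions made for this example.

```python
import math
from collections import Counter

def build_vocab(*corpora):
    """Collect every word seen in the given corpora, plus the end marker."""
    vocab = {"</s>"}
    for corpus in corpora:
        for sent in corpus:
            vocab.update(sent.split())
    return vocab

def make_model(sentences, vocab):
    """Return an add-one-smoothed bigram model P(word | prev) (assumed smoothing)."""
    bigrams, contexts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    v = len(vocab)

    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (contexts[prev] + v)

    return prob

def interpolate(p_seed, p_train, lam):
    """Simple linear interpolation: lam * P_seed + (1 - lam) * P_train."""
    def prob(prev, word):
        return lam * p_seed(prev, word) + (1 - lam) * p_train(prev, word)
    return prob

def perplexity(prob, sentences):
    """Per-word perplexity: exp of the negative mean log-probability."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_sum += math.log(prob(prev, word))
            n += 1
    return math.exp(-log_sum / n)

# Hypothetical toy data: a small in-domain seed set and a larger general set.
seed = ["book a flight to boston", "show flights to denver"]
train = ["the weather is nice", "book a table for two", "show me the news"]
test = ["book a flight to denver"]

vocab = build_vocab(seed, train, test)
p_seed = make_model(seed, vocab)
p_train = make_model(train, vocab)
p_mix = interpolate(p_seed, p_train, lam=0.7)  # lam chosen arbitrarily

for name, model in [("seed", p_seed), ("train", p_train), ("mix", p_mix)]:
    print(name, round(perplexity(model, test), 2))
```

In practice the interpolation weight would be tuned on held-out in-domain data rather than fixed by hand; the value here only shows where that parameter enters.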
