Statistical language model adaptation based on N-gram distribution
Abstract: The N-gram distribution of a corpus captures its characteristics well. To make effective use of general-domain training data, an adaptation method for statistical language models based on N-gram distribution is proposed. The method defines a small, high-quality in-domain seed set and a large general-domain training set of uneven quality, and adapts the N-gram distribution of the training set toward that of the seed set so as to improve domain-specific word recognition. Experimental results show clear improvements over conventional methods: compared with the simple interpolation method, the proposed approach reduces perplexity by 11.1% and word error rate by 6.9%.
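The simple interpolation baseline mentioned in the abstract combines an in-domain (seed) model and a general-domain (training) model as P(w) = λ·P_seed(w) + (1−λ)·P_train(w). Below is a minimal sketch of that baseline with unigram maximum-likelihood estimates and a perplexity computation; the toy corpora, the value of λ, and the probability floor are illustrative assumptions, not the paper's data or method.

```python
# Sketch of the simple-interpolation baseline the abstract compares
# against: P(w) = lam * P_seed(w) + (1 - lam) * P_train(w).
# Corpora, lam, and the floor value are illustrative assumptions.
import math
from collections import Counter

def unigram_probs(tokens):
    """Maximum-likelihood unigram distribution of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolated_prob(word, p_seed, p_train, lam=0.5):
    """Linearly interpolate the seed and training unigram models."""
    return lam * p_seed.get(word, 0.0) + (1 - lam) * p_train.get(word, 0.0)

def perplexity(test_tokens, p_seed, p_train, lam=0.5, floor=1e-10):
    """Perplexity of the interpolated model on a test sequence.

    Unseen words are floored rather than smoothed, for brevity.
    """
    log_sum = 0.0
    for w in test_tokens:
        p = max(interpolated_prob(w, p_seed, p_train, lam), floor)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(test_tokens))

# Hypothetical seed (in-domain) and training (general-domain) corpora.
seed = "the model adapts the model".split()
train = "a large corpus trains a large model".split()
test = "the model".split()

p_seed = unigram_probs(seed)
p_train = unigram_probs(train)
ppl = perplexity(test, p_seed, p_train, lam=0.5)
```

The adaptation method proposed in the abstract goes further than this baseline, reshaping the N-gram distribution of the training set itself toward that of the seed set rather than only mixing model probabilities at query time.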
Key words:
- N-gram distribution /
- seed set /
- training set /
- adaptation