Citation: WU Zhen, RAN Xiaoyan, MIAO Quan, et al. Industry classification technology based on fastText algorithm[J]. Journal of Beijing University of Aeronautics and Astronautics, 2022, 48(2): 193-198. doi: 10.13700/j.bh.1001-5965.2020.0402 (in Chinese)
As China's economy grows rapidly and its capacity for technological innovation continues to improve, efficiently organizing and classifying information has become the basis for personalized industry management and tracking analysis. Based on the characteristics of industry information and the way it develops, this paper proposes a Chinese industry classification model based on fastText. First, a keyword database for industry classification is constructed; next, word segmentation and weight calculation are carried out using a feature lexicon; finally, a classifier model is built to classify industries automatically. In the experiment, 80 000 test documents covering business scope, enterprise information, and public opinion information were selected. The results show that the proposed model achieves higher classification accuracy than Bayes, decision tree, KNN, and other classification algorithms, which indicates that it works well in practice.
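The three stages named in the abstract (keyword lexicon, segmentation with weighting, classifier) can be illustrated with a minimal sketch. The abstract does not say which segmentation or weighting scheme the authors used, so forward maximum matching and TF-IDF are assumptions made here for illustration only. The lexicon entries, the label `I65`, and all function names are hypothetical. The final step only formats training lines in the `__label__` convention that fastText's supervised mode reads; it does not train a model.

```python
import math
from collections import Counter

# Hypothetical industry keyword lexicon (illustrative entries, not from the paper).
LEXICON = {"软件", "开发", "软件开发", "餐饮", "服务", "餐饮服务"}
MAX_LEN = max(len(word) for word in LEXICON)


def segment(text, lexicon=LEXICON, max_len=MAX_LEN):
    """Forward maximum matching (an assumed scheme): at each position take the
    longest lexicon word; fall back to a single character if none matches."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def tfidf(docs_tokens):
    """TF-IDF weights per document (an assumed weighting, idf = ln(N/df) + 1)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(tok for toks in docs_tokens for tok in set(toks))
    weights = []
    for toks in docs_tokens:
        term_freq = Counter(toks)
        weights.append({
            tok: (count / len(toks)) * (math.log(n_docs / doc_freq[tok]) + 1.0)
            for tok, count in term_freq.items()
        })
    return weights


def to_fasttext_line(label, tokens):
    """Format one training example in fastText's supervised input convention."""
    return f"__label__{label} " + " ".join(tokens)


docs = [segment("软件开发服务"), segment("餐饮服务业")]
print(docs)
print(tfidf(docs))
print(to_fasttext_line("I65", docs[0]))  # "I65" is an illustrative label
```

Lines produced this way could be written to a text file and passed to a fastText supervised trainer. How the computed weights enter the authors' model is not stated in the abstract, so the sketch leaves them as a separate output.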