Industry classification technology based on fastText algorithm

WU Zhen¹, RAN Xiaoyan¹, MIAO Quan¹,², LIU Chunyan³, ZHANG Dong¹, WEI Na³

1. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China
2. Beijing Branch of National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100055, China
3. Great Wall Computer Software & System Inc., Beijing 100190, China

Corresponding author: MIAO Quan, E-mail: miaoq@cert.org.cn

doi: 10.13700/j.bh.1001-5965.2020.0402

CLC number: TP391.1

Publication history
- Received: 2020-08-09
- Accepted: 2020-09-25
- Published online: 2022-02-20

Abstract:
With the rapid development of China's economy and the continuous improvement of its capacity for technological innovation, organizing and classifying information efficiently has become the basis for personalized industry management and tracking analysis. Based on the characteristics and development patterns of industry information, this paper proposes a Chinese industry classification model built on the fastText algorithm. First, a keyword lexicon for industry classification is constructed, and this feature lexicon is used for word segmentation and weight calculation. Then, a classifier model is built to classify Chinese industry text automatically. Finally, experiments on 80 000 test documents covering business scope, enterprise information, and public opinion information show that the proposed model achieves higher classification accuracy than Bayes, decision tree, KNN, and other classification algorithms, and that it performs well in application.

Keywords: natural language processing; industry classification; fastText algorithm; keywords; grammar model
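The weight-calculation step can be illustrated with a standard TF-IDF computation, which is the weighting suggested by the fastText_TF_IDF model name in Table 3. The abstract does not give the authors' exact formula, so the sketch below uses the textbook definition (term frequency times the log of inverse document frequency). The function name `tf_idf` and the sample documents are hypothetical, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per pre-segmented document.

    Assumption: textbook TF-IDF, not necessarily the paper's exact formula.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-segmented business-scope snippets.
docs = [
    ["software", "development", "services"],
    ["integrated", "circuit", "manufacturing"],
    ["software", "consulting", "services"],
]
weights = tf_idf(docs)
```

In this toy corpus, "development" appears in only one document and therefore receives a higher weight than "software", which appears in two. A term that occurs in every document receives a weight of zero.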

Figure 1. Flowchart of classification and identification

Figure 2. Hierarchical Softmax

Figure 3. Architecture of fastText model
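The figure images are not available in this text, so the following is only a minimal sketch of the general fastText classifier as described by Joulin et al.: word vectors are averaged into a single text vector, which is passed through a linear layer and a softmax. For simplicity the sketch uses a flat softmax instead of the hierarchical softmax of Figure 2 and omits n-gram features. All names, vectors, and weights below are hypothetical toy values.

```python
import math


def average_embeddings(tokens, embeddings, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(tokens, embeddings, output_weights, dim):
    """Text vector (mean of word vectors) -> linear layer -> softmax."""
    hidden = average_embeddings(tokens, embeddings, dim)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return softmax(scores)


# Toy 2-dimensional embeddings and a 2-class output layer (one row per class).
embeddings = {
    "software": [1.0, 0.0],
    "development": [0.8, 0.2],
    "chip": [0.0, 1.0],
}
output_weights = [[2.0, -1.0], [-1.0, 2.0]]
probs = predict_proba(["software", "development"], embeddings, output_weights, dim=2)
```

With these toy values the averaged text vector is close to the first axis, so the first class receives the higher probability. Hierarchical softmax replaces the flat softmax with a product of binary decisions along a tree path, which lowers the cost of scoring many classes.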

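Table 1. Industry classification results

Level 1 category	Level 2 category	Level 3 category
Information equipment manufacturing	Computer manufacturing	Complete computer manufacturing
		Computer components manufacturing
		Computer peripheral equipment manufacturing
		Industrial control computer and system manufacturing
		Information security equipment manufacturing
		Other computer manufacturing
	Communication equipment manufacturing	Communication system equipment manufacturing
		Communication terminal equipment manufacturing
		Communication accessories and other communication equipment manufacturing
	Radio and television equipment manufacturing	Radio and television program production and transmission equipment manufacturing
		Radio and television receiving equipment manufacturing
		Radio and television special accessories manufacturing
		Professional audio equipment manufacturing
		Applied television equipment and other radio and television equipment manufacturing
	Radar and supporting equipment manufacturing	Radar and supporting equipment manufacturing
	Non-professional audiovisual equipment manufacturing	Television set manufacturing
		Audio equipment manufacturing
		Video recording and playback equipment manufacturing
		Main supporting parts for household electronic appliances
	Smart consumer equipment manufacturing	Wearable smart device manufacturing
		Smart in-vehicle equipment manufacturing
		Smart unmanned aerial vehicle manufacturing
		Service consumer robot manufacturing
		Other smart consumer equipment manufacturing
	Electronic device manufacturing	Electronic vacuum device manufacturing
		Discrete semiconductor device manufacturing
		Integrated circuit manufacturing
		Display device manufacturing
		Semiconductor lighting device manufacturing
		Optoelectronic device manufacturing
		Other electronic device manufacturing
	Electronic component and special electronic material manufacturing	Resistor, capacitor, and inductor manufacturing
		Electronic circuit manufacturing
		Sensitive element and sensor manufacturing
		Electroacoustic device and parts manufacturing
		Special electronic material manufacturing
		Other electronic component manufacturing
	Electronic information electromechanical product manufacturing	Electronic micromotor manufacturing
		Electronic wire and cable manufacturing
		Optical fiber and optical cable manufacturing
		Battery manufacturing
	Special instrument manufacturing	Electronic measuring instrument manufacturing
	Other electronic equipment manufacturing	Other electronic equipment manufacturing
Information transmission services	Telecommunications	Fixed telecommunications services
		Mobile telecommunications services
		Other telecommunications services
	Radio and television transmission services	Cable radio and television transmission services
		Wireless radio and television transmission services
	Satellite transmission services	Radio and television satellite transmission services
		Other satellite transmission services
Software and information technology services	Software development	Basic software development
		Supporting software development
		Application software development
		Other software development
	Integrated circuit design	Integrated circuit design
	Information system integration and Internet of Things technology services	Information system integration services
		Internet of Things technology services
	Operation and maintenance services	Operation and maintenance services
	Information processing and storage support services	Information processing and storage support services
	Information technology consulting services	Information technology consulting services
	Digital content services	Geographic remote sensing information services
		Animation and game digital content services
		Other digital content services
	Other information technology services	Call centers
		Other information technology services not elsewhere classified
Internet and related services	Internet access and related services	Internet access and related services
	Internet information services	Internet search services
		Internet game services
		Other Internet information services
	Internet platforms	Internet production service platforms
		Internet life service platforms
		Internet science and technology innovation platforms
		Internet public service platforms
		Other Internet platforms
	Internet security services	Internet security services
	Internet data services	Internet data services
	Other Internet services	Other Internet services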

Table 2. Number of text samples per category in each dataset

Category	DS1	DS2	DS3	DS4	DS5	Test1
1	3 000	2 000	5 000	2 000	10 000	2 000
2	3 000	2 000	5 000	2 000	10 000	2 000
3	3 000	2 000	5 000	2 000	10 000	2 000
4	3 000	2 000	5 000	2 000	10 000	2 000
5	3 000	2 000	5 000	2 000	10 000	2 000
6	3 000	2 000	5 000	2 000	10 000	2 000
7	3 000	2 000	5 000	2 000	10 000	2 000
8	3 000	2 000	5 000	2 000	10 000	2 000
9	3 000	2 000	5 000	2 000	10 000	2 000
10	3 000	2 000	5 000	2 000	10 000	2 000
11	3 000	2 000	5 000	2 000	10 000	2 000
12	3 000	2 000	5 000	2 000	10 000	2 000
13	3 000	2 000	5 000	2 000	10 000	2 000
14	3 000	2 000	5 000	2 000	10 000	2 000
15	3 000	2 000	5 000	2 000	10 000	2 000
16	3 000	2 000	5 000	2 000	10 000	2 000
17	3 000	2 000	5 000	2 000	10 000	2 000
18	3 000	2 000	5 000	2 000	10 000	2 000
19	3 000	2 000	5 000	2 000	10 000	2 000
20	3 000	2 000	5 000	2 000	10 000	2 000
21	3 000	2 000	5 000	2 000	10 000	2 000
22	3 000	2 000	5 000	2 000	10 000	2 000
23	3 000	2 000	5 000	2 000	10 000	2 000
24	3 000	2 000	5 000	2 000	10 000	2 000
25	3 000	2 000	5 000	2 000	10 000	2 000
26	3 000	2 000	5 000	2 000	10 000	2 000
27	3 000	2 000	5 000	2 000	10 000	2 000
28	3 000	2 000	5 000	2 000	10 000	2 000
29	3 000	2 000	5 000	2 000	10 000	2 000
30	3 000	2 000	5 000	2 000	10 000	2 000
31	3 000	2 000	5 000	2 000	10 000	2 000

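Labelled samples such as those counted in Table 2 are commonly fed to fastText's supervised mode as one line per document, with the label written as a `__label__` prefix. This text does not say which fastText implementation or hyperparameters the authors used, so the sketch below only shows the file-format convention. The helper name `to_fasttext_line` and the sample tokens are hypothetical.

```python
def to_fasttext_line(label, tokens):
    """Format one sample as '__label__<label> tok1 tok2 ...'."""
    return f"__label__{label} " + " ".join(tokens)


# Hypothetical (category id, segmented text) pairs.
samples = [
    (1, ["computer", "manufacturing"]),
    (2, ["communication", "equipment", "manufacturing"]),
]
lines = [to_fasttext_line(label, tokens) for label, tokens in samples]
```

A file of such lines could then be passed to a fastText trainer, for example `fasttext.train_supervised(input=...)` in the official Python bindings.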

Table 3. Classification results on each dataset (%)

Model	Dataset	Precision	Recall	F₁
Bayes	DS1	73.6	71.8	72.7
	DS2	74.1	70.2	72.1
	DS3	75.2	73.8	74.5
	DS4	69.9	68.5	69.2
	DS5	70.3	67.6	68.9
Decision tree	DS1	78.0	72.4	75.1
	DS2	76.8	75.2	76.0
	DS3	77.7	74.9	76.3
	DS4	75.8	72.1	73.9
	DS5	72.1	70.2	71.1
KNN	DS1	76.2	72.1	74.1
	DS2	75.4	70.3	72.8
	DS3	70.1	68.9	69.5
	DS4	72.5	70.8	71.6
	DS5	72.4	69.8	71.1
TextCNN	DS1	80.1	75.2	77.6
	DS2	79.2	75.0	77.0
	DS3	81.1	76.2	78.6
	DS4	77.9	74.3	76.1
	DS5	80.5	78.2	79.3
fastText	DS1	82.2	80.1	81.14
	DS2	81.3	79.8	80.5
	DS3	81.6	79.9	80.7
	DS4	82.5	80.0	81.2
	DS5	84.0	81.1	82.5
fastText_TF_IDF	DS1	83.73	81.91	82.81(+1.67)
	DS2	82.6	79.1	80.81(+0.31)
	DS3	83.1	80.5	81.78(+1.08)
	DS4	82.1	79.1	80.57
	DS5	84.2	82.0	83.08(+0.58)
Mean		83.1	80.5	81.8

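The F₁ column in Table 3 is consistent with the usual definition of F₁ as the harmonic mean of precision and recall. The short sketch below recomputes two entries from the precision and recall values reported in the table. The function name `f1_score` and the dictionary keys are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (inputs in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Table 3.
rows = {
    "Bayes/DS1": (73.6, 71.8),
    "fastText_TF_IDF/DS1": (83.73, 81.91),
}
f1_values = {name: round(f1_score(p, r), 2) for name, (p, r) in rows.items()}
```

The fastText_TF_IDF/DS1 value comes out as 82.81, matching the table. The Bayes/DS1 value rounds to 72.7 at one decimal place, which also matches the table.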
