Large-scale IoT malware analysis and classification method

HE Qinglin^{1,2}, WANG Lihong^{2}, LUO Bing^{1}, YANG Libin^{3}

doi: 10.13700/j.bh.1001-5965.2020.0401

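1. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China
2. School of Computer Science and Engineering, Beihang University, Beijing 100083, China
3. School of Cybersecurity, Northwestern Polytechnical University, Xi'an 710072, China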

Funding: National Key R&D Program of China (2017YFC1201204)

Corresponding author: WANG Lihong, E-mail: wlh@isc.org.cn

CLC number: TP393.4; TP312
Publication history
- Received: 2020-08-09
- Accepted: 2020-09-05
- Published online: 2022-02-20

Abstract:
Internet of things (IoT) malware has grown rapidly and attacks IoT devices of all kinds across the network. Because so much of its code is open source, however, its family characteristics are not distinct, so a finer-grained classification method is needed to support the discovery of advanced-threat samples and the tracking of attack organizations. To address this problem, we carried out a large-scale analysis of 157 911 IoT malware samples captured from May 2019 to May 2020 and labeled a dataset of 12 278 samples covering 9 family branches. We then propose an IoT malware classification method that extracts complex structural features, namely the function call graph (FCG) and text, through static reverse analysis, and learns feature vectors from them with graph representation learning and text representation learning. On the labeled dataset the method achieves an average recall of 88.1%. The method has also been applied in practical work, where it performs well.
Keywords: Internet of things (IoT); malware; classification; graph learning; text learning
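
The abstract names graph representation learning on the FCG without giving details here. As a rough illustration only, the sketch below turns a call graph into a feature histogram with Weisfeiler-Lehman subtree relabeling and compares two graphs by cosine similarity. It is not the paper's implementation: the function names, the initial labels ("entry", "net", "crypto"), the toy graphs, and the choice to treat call edges as undirected are all assumptions made for the example.

```python
from collections import Counter
import hashlib


def wl_histogram(edges, labels, iterations=2):
    """Count Weisfeiler-Lehman subtree labels for one call graph.

    edges: iterable of (caller, callee) pairs, treated as undirected here.
    labels: dict mapping each function to an initial (hypothetical) label.
    """
    neighbours = {node: set() for node in labels}
    for caller, callee in edges:
        neighbours[caller].add(callee)
        neighbours[callee].add(caller)

    current = dict(labels)
    histogram = Counter(current.values())
    for _ in range(iterations):
        updated = {}
        for node, label in current.items():
            # New label = own label plus the sorted multiset of neighbour labels.
            around = ",".join(sorted(current[n] for n in neighbours[node]))
            signature = label + "|" + around
            updated[node] = hashlib.sha1(signature.encode()).hexdigest()[:8]
        current = updated
        histogram.update(current.values())
    return histogram


def similarity(h1, h2):
    """Cosine similarity between two WL label histograms."""
    dot = sum(h1[k] * h2[k] for k in h1.keys() & h2.keys())
    norm = (sum(v * v for v in h1.values()) * sum(v * v for v in h2.values())) ** 0.5
    return dot / norm if norm else 0.0


# Toy graphs: a and b share the same structure under different function names;
# c has the same functions arranged as a chain.
labels_a = {"main": "entry", "scan": "net", "attack": "net", "xor_dec": "crypto"}
edges_a = [("main", "scan"), ("main", "attack"), ("scan", "xor_dec")]
labels_b = {"sub_1": "entry", "sub_2": "net", "sub_3": "net", "sub_4": "crypto"}
edges_b = [("sub_1", "sub_2"), ("sub_1", "sub_3"), ("sub_3", "sub_4")]
edges_c = [("sub_1", "sub_2"), ("sub_2", "sub_3"), ("sub_3", "sub_4")]

sim_same = similarity(wl_histogram(edges_a, labels_a), wl_histogram(edges_b, labels_b))
sim_diff = similarity(wl_histogram(edges_a, labels_a), wl_histogram(edges_c, labels_b))
```

Because the relabeling depends only on labels and structure, renamed but structurally identical graphs score 1, while the rearranged graph scores lower. The resulting histograms could be fed to any standard classifier.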

Figure 1. Schematic diagram of sample analysis and processing

Figure 2. Schematic diagram of the FCGs of two Mozi samples

Figure 3. Schematic diagram of sample feature extraction and classification method

Figure 4. Feature vector learning based on text representation learning

Table 1. CPU architecture distribution of the 157 911 malware samples

CPU architecture	Number of samples	Share/%
ARM	45 663	28.9
MIPS	28 569	18.1
X86	16 956	10.7
MC68000	11 576	7.3
PowerPC	10 202	6.5
SuperH SH	9 793	6.2
Sparc	7 703	4.9
X86-64	5 210	3.3
Other	22 239	14.1

Table 2. Packers used by IoT malware

Packer type	Number of samples	Share/%
UPX variants	41 026	26.0
Standard UPX 3.94	15 373	9.7
Standard UPX 3.95	2 148	1.4
Other standard UPX versions	623	0.4
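
This section does not say how packers were identified. As a hedged illustration of why standard UPX builds and UPX variants end up in separate rows, the sketch below looks for two markers that stock UPX typically leaves in a packed file: the `UPX!` magic and an `$Id: UPX x.yz` version string. That stock builds keep both markers is an assumption of this example, and `classify_packer` and its labels are hypothetical names. A variant that alters these markers is not recognized by this check and would need other evidence.

```python
import re

UPX_MAGIC = b"UPX!"
UPX_VERSION = re.compile(rb"\$Id: UPX (\d+\.\d+)")


def classify_packer(data: bytes) -> str:
    """Return a coarse packer label for the raw bytes of a binary (heuristic)."""
    has_magic = UPX_MAGIC in data
    match = UPX_VERSION.search(data)
    if has_magic and match:
        # Both markers present: report the version taken from the banner.
        return "standard UPX " + match.group(1).decode()
    if has_magic:
        return "UPX marker without version string"
    return "no standard UPX marker"


# Synthetic byte strings standing in for real files.
stock = b"\x7fELF...UPX!...$Id: UPX 3.94 Copyright (C) 1996-2017 the UPX Team. $"
altered = b"\x7fELF...XXXX...packed payload..."
label_stock = classify_packer(stock)
label_altered = classify_packer(altered)
```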

Table 3. Top 10 vulnerabilities exploited by the samples

Vulnerability	Affected devices	Number of samples found
CVE-2017-17215	Huawei HG532 home routers	18 025
CVE-2014-8361	Cameras using the Realtek SDK	7 108
CVE-2017-6884	Zyxel home routers	6 945
Redis command execution	Devices running the Redis service	5 433
CVE-2018-10561	GPON routers	2 369
JAWS command execution	VPower DVR and others	1 804
Vacron command execution	Vacron NVR devices	1 534
Linksys command execution	Linksys series routers	1 281
Dlink command execution	Dlink series routers	1 250
Netgear command execution	Netgear series routers	837

Table 4. Labeled IoT malware dataset

Category	Number of samples	CPU types covered	Description
1	241	ARM, MIPS, X86	Echobot family samples
2	123	ARM, MIPS	Mozi family samples
3	2 676	ARM, MIPS, X86	UnHAnaA family samples
4	598	ARM, MIPS, X86	JoSho family samples
5	4 501	ARM, MIPS, X86	Loligang family samples
6	2 185	ARM, MIPS, X86	Yakuza family samples
7	807	ARM, MIPS, X86	Sora family samples
8	234	ARM, MIPS, X86	Fbot family samples
9	913	ARM, MIPS, X86	Owari family samples

Table 5. Size of the instruction set used by malware on the three most common CPU architectures

CPU type	Size of the instruction set used in the samples
ARM	179
MIPS	198
X86	195
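
How these counts were obtained is not described in this section. One plausible reading, assumed here, is that each figure is the number of distinct instruction mnemonics seen in disassembled samples of that architecture. The sketch below counts distinct mnemonics in a toy ARM listing. The `address: mnemonic operands` line format and the function name `distinct_mnemonics` are assumptions for illustration, not the output format of any particular disassembler.

```python
def distinct_mnemonics(listing: str) -> set:
    """Collect distinct instruction mnemonics from a disassembly listing.

    Assumes one instruction per line in the form '<address>: <mnemonic> <operands>'.
    """
    mnemonics = set()
    for line in listing.splitlines():
        _, sep, instruction = line.partition(":")
        if not sep or not instruction.strip():
            continue  # skip blank lines and lines without an address prefix
        mnemonics.add(instruction.split()[0].lower())
    return mnemonics


# Hypothetical five-instruction ARM function.
toy_listing = """\
8000: push {r4, lr}
8004: mov r4, r0
8008: bl 0x8100
800c: mov r0, r4
8010: pop {r4, pc}
"""
arm_mnemonics = distinct_mnemonics(toy_listing)
vocabulary_size = len(arm_mnemonics)
```

A vocabulary of a few hundred tokens per architecture is small enough that instruction sequences can be treated like sentences over a compact word list, which is the setting text representation learning expects.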

Table 6. Classification recall and F₁ score by feature set and category

Category	V₁ R/%	V₁ F₁	V₂ R/%	V₂ F₁	V₂+V₃ R/%	V₂+V₃ F₁
1	100	1.0	100	1.0	100	1.0
2	87	0.91	90	0.92	92	0.93
3	85	0.92	89	0.93	90	0.93
4	47	0.52	67	0.72	70	0.74
5	89	0.91	92	0.94	94	0.95
6	93	0.93	94	0.94	96	0.96
7	93	0.97	93	0.97	94	0.97
8	82	0.84	86	0.89	88	0.91
9	59	0.65	68	0.73	69	0.73
Mean	82	0.85	87	0.89	88.1	0.92
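
For readers who want to check the table, the sketch below shows the standard definitions of recall, precision, and F₁, and reproduces the 88.1% mean recall as the unweighted average of the nine per-category recalls in the V₂+V₃ column. The recall values are copied from the table. The confusion counts (`tp`, `fp`, `fn`) are invented toy numbers used only to exercise the formulas, and unweighted averaging is an assumption that happens to match the reported recall.

```python
def recall(tp: int, fn: int) -> float:
    """Share of true members of a class that were recovered."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of predictions for a class that were correct."""
    return tp / (tp + fp)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Per-category recall (%) for the V2+V3 feature set, categories 1 to 9.
recall_v2_v3 = [100, 92, 90, 70, 94, 96, 94, 88, 69]
mean_recall = round(sum(recall_v2_v3) / len(recall_v2_v3), 1)

# Toy confusion counts (hypothetical, not taken from the paper).
tp, fp, fn = 8, 4, 2
toy_f1 = f1_score(precision(tp, fp), recall(tp, fn))
```

Averaging per category gives the small classes, such as categories 2 and 8, the same weight as the large ones, so weak categories like 4 and 9 pull the mean down visibly.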
