Malware family classification method based on abstract assembly instructions

LI Yu; LUO Senlin; HAO Jingwei; PAN Limin

doi: 10.13700/j.bh.1001-5965.2020.0568

School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China

Funds:

2020 Information Security Software Project of the Ministry of Industry and Information Technology CEIEC-2020-ZM02-0134

Corresponding author: PAN Limin, E-mail: panlimin2016@gmail.com

CLC number: TP309.5

Publication history
- Received: 2020-09-30
- Accepted: 2020-11-13
- Published online: 2022-02-20

Abstract:
The large number of malware variants poses a serious threat to network security. In malware family classification methods based on assembly instructions, operand semantics are closely tied to the runtime environment and are therefore hard to extract; the resulting loss of instruction semantics makes it difficult to classify malware variants correctly. To address this problem, a malware family classification method based on abstract assembly instructions is proposed. Instructions are reconstructed by abstracting operand types, which frees operand semantics from the constraints of the runtime environment. A word attention mechanism and a bidirectional gated recurrent unit (Bi-GRU) are used to build an instruction embedding network that captures the behavioral semantics of instructions. A bidirectional recurrent neural network (Bi-RNN) then learns the instruction sequences common to each malware family, reducing the interference that variant techniques introduce into instruction sequences. The original instructions and the family common instruction sequences are fused to construct feature images, and a convolutional neural network performs the family classification. Experimental results on a public dataset show that the proposed method can effectively extract operand information, resist interference from irrelevant instructions in malware variants, and classify malware variants by family.
- malware family classification
- visualization
- abstract assembly instructions
- convolutional neural network
- bidirectional recurrent neural network (Bi-RNN)
- word attention mechanism

Figure 1. Functional block diagram of MCAI

Figure 2. Operand type abstracting
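
As a rough illustration of the operand type abstraction named in this caption and described in the abstract, the following Python sketch replaces each concrete operand with a type token. The categories (`REG`, `MEM`, `IMM`, `SYM`), the register list, and the function names are assumptions made for this example; the source does not specify the abstraction rules actually used by the method.

```python
import re

# Hypothetical operand categories; the method's real taxonomy is not given here.
REGISTERS = {
    "eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp",
    "ax", "bx", "cx", "dx", "al", "bl", "cl", "dl",
}
# Hex with 0x prefix, hex with trailing h, or plain decimal.
IMMEDIATE = re.compile(r"^(0x[0-9a-f]+|[0-9][0-9a-f]*h|-?\d+)$", re.IGNORECASE)


def abstract_operand(operand: str) -> str:
    """Map one concrete operand to an environment-independent type token."""
    op = operand.strip().lower()
    if op in REGISTERS:
        return "REG"
    if "[" in op and "]" in op:   # e.g. [ebp+8], dword ptr [eax]
        return "MEM"
    if IMMEDIATE.match(op):       # e.g. 0x10, 4, 0FFh
        return "IMM"
    return "SYM"                  # labels, API names, anything else


def abstract_instruction(line: str) -> str:
    """Rebuild an instruction as its opcode plus abstracted operand types."""
    opcode, _, rest = line.strip().partition(" ")
    operands = [o for o in rest.split(",") if o.strip()]
    return " ".join([opcode.lower()] + [abstract_operand(o) for o in operands])
```

Under these assumed rules, two instructions that differ only in a concrete address or constant, such as `mov eax, [ebp+8]` and `mov eax, [ebp+12]`, map to the same abstract instruction, which is the kind of environment independence the abstract describes.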

Figure 3. Instruction embedding network
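
The word attention step of the instruction embedding network can be sketched in plain Python as attention pooling over the token states of one instruction, in the style of hierarchical attention networks. This is a simplified sketch: in the described method the token states would come from the Bi-GRU, whereas here they are passed in directly. The context vector `u_w`, the function names, and the omission of the learned projection layer are assumptions for illustration only.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden: List[Vector], u_w: Vector) -> Tuple[Vector, List[float]]:
    """Weight each token state by its dot product with the context vector
    u_w and return the weighted sum as the instruction embedding."""
    scores = [sum(h_i * u_i for h_i, u_i in zip(h, u_w)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return pooled, weights


# Three token states (e.g. opcode and two abstract operands), 2-dimensional.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embedding, attn = attention_pool(states, u_w=[1.0, 0.0])
```

Tokens whose states align with the context vector receive larger weights, so the pooled vector emphasizes the parts of an instruction that the trained network considers most informative.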

Figure 4. Family common instruction sequence learning

Figure 5. Family common instruction sequence prediction

Figure 6. Flowchart of comparative experiments

Figure 7. Precision of comparison experiment

Figure 8. F₁-score of comparison experiment

Figure 9. Recall of comparison experiment

Table 1. Family labels and number of samples in the dataset

No.	Family	Number of samples
1	Ramnit	1 541
2	Lollipop	2 478
3	Kelihos_ver3	2 942
4	Vundo	475
5	Simda	42
6	Tracur	751
7	Kelihos_ver1	398
8	Obfuscator.ACY	1 228
9	Gatak	1 013
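
The family sizes in Table 1 are strongly imbalanced. The short Python sketch below uses only the counts from the table to compute the total and each family's share; the variable names are illustrative.

```python
# Sample counts copied from Table 1.
family_counts = {
    "Ramnit": 1541,
    "Lollipop": 2478,
    "Kelihos_ver3": 2942,
    "Vundo": 475,
    "Simda": 42,
    "Tracur": 751,
    "Kelihos_ver1": 398,
    "Obfuscator.ACY": 1228,
    "Gatak": 1013,
}

total = sum(family_counts.values())
shares = {name: count / total for name, count in family_counts.items()}
largest = max(family_counts, key=family_counts.get)
smallest = min(family_counts, key=family_counts.get)
imbalance_ratio = family_counts[largest] / family_counts[smallest]

# Print families from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16}{family_counts[name]:>6}{share:>9.2%}")
```

The counts sum to 10 868 samples, and the largest family (Kelihos_ver3) is roughly 70 times the size of the smallest (Simda), so accuracy alone can hide poor performance on small families.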

Table 2. Results of ablation experiments (%)

Method	Accuracy	Recall	Precision	F₁
1	91.49	88.87	87.81	88.32
2	80.99	71.53	69.53	70.28
3	92.49	88.04	87.27	87.57
4	96.04	91.13	96.28	93.01
5	98.51	98.36	97.07	97.66

Table 3. Comparative experimental results (%)

Method	Accuracy	Recall	Precision	F₁
MCSC	91.4	88.87	87.81	88.32
RMVC	94.20	95.92	89.07	91.00
MCAI	98.51	98.36	97.07	97.66
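
The F₁ values in Tables 2 and 3 are not the harmonic mean of the reported precision and recall (for method 4 in Table 2, the harmonic mean of 96.28 and 91.13 is about 93.63, not 93.01). That pattern would be consistent with per-family scores averaged over the nine families. Assuming macro averaging, which the source does not state, the Python sketch below computes the four metrics from a confusion matrix. The matrix is a made-up toy example, not data from the paper.

```python
from typing import Dict, List


def macro_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Accuracy plus macro-averaged recall, precision and F1.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                          # true class is k
        predicted = sum(cm[i][k] for i in range(n))  # predicted as k
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": sum(cm[k][k] for k in range(n)) / total,
        "recall": sum(recalls) / n,
        "precision": sum(precisions) / n,
        "f1": sum(f1s) / n,
    }


# Toy 3-class confusion matrix (illustrative only).
toy = [
    [50, 0, 0],
    [5, 40, 5],
    [0, 2, 8],
]
scores = macro_metrics(toy)
harmonic = (2 * scores["precision"] * scores["recall"]
            / (scores["precision"] + scores["recall"]))
```

In this toy case the macro F₁ in `scores["f1"]` differs from `harmonic`, the harmonic mean of macro precision and macro recall, which mirrors the gap seen in the tables.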
