Malicious code clone detection technology based on deep learning

SHEN Yuan; YAN Hanbing; XIA Chunhe; HAN Zhihui

doi:10.13700/j.bh.1001-5965.2020.0400

Volume 48 Issue 2

Feb. 2022


Journal of Beijing University of Aeronautics and Astronautics > 2022 > 48(2): 282-290.

Citation:

SHEN Yuan, YAN Hanbing, XIA Chunhe, et al. Malicious code clone detection technology based on deep learning[J]. Journal of Beijing University of Aeronautics and Astronautics, 2022, 48(2): 282-290. doi: 10.13700/j.bh.1001-5965.2020.0400 (in Chinese)

1. School of Computer Science and Engineering, Beihang University, Beijing 100083, China
2. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China

Funds:

National Natural Science Foundation of China (U1736218)

National Science and Technology Major Project (2018YFB0804704)

Beihang Youth Top Talent Support Program (YWF-20-BJ-J-103)

Corresponding author: YAN Hanbing, E-mail: yhb@cert.org.cn
Received Date: 09 Aug 2020
Accepted Date: 21 Jan 2021
Publish Date: 20 Feb 2022

Abstract

Malicious code clone detection has become an effective way to analyze malicious code homology and advanced persistent threat (APT) attacks. In this paper, we collect samples attributed to different APT groups from public threat intelligence and propose a deep-learning-based malicious code clone detection framework. The framework measures the similarity between functions in newly discovered malicious code and the malicious code in a library of known APT group samples, so that malware can be analyzed efficiently and the source of an APT attack identified quickly. We statically analyze malicious code by disassembling it, use its key function call graph and disassembly code as features, and then classify the malicious code against the APT group library with a neural network model. In an extensive evaluation against our previous model (MCrab), the improved model outperforms it: it effectively detects and classifies malicious code clones and achieves a higher detection rate.
- deep learning
- advanced persistent threat (APT) groups
- clone detection
- control flow graph (CFG)
- system function call graph
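
The matching step described in the abstract, comparing a function from a new sample against functions in a library of known APT group samples, can be illustrated with a minimal sketch. It is not the paper's model: the abstract describes a neural network over function call graphs and disassembly code, whereas this example substitutes a simple opcode-count vector and cosine similarity as a stand-in for learned embeddings. The function names, group labels, and instruction listings are all hypothetical.

```python
from collections import Counter
from math import sqrt


def opcode_vector(disassembly):
    """Count opcode mnemonics (the first token of each instruction line)."""
    return Counter(line.split()[0] for line in disassembly if line.strip())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[op] for op, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(function, library):
    """Return (group, score) for the library function most similar to `function`."""
    query = opcode_vector(function)
    best_group, best_score = None, -1.0
    for group, functions in library.items():
        for known in functions:
            score = cosine_similarity(query, opcode_vector(known))
            if score > best_score:
                best_group, best_score = group, score
    return best_group, best_score


# Hypothetical toy data: group names and instruction listings are made up.
library = {
    "group_a": [["push ebp", "mov ebp, esp", "xor eax, eax", "call decrypt", "ret"]],
    "group_b": [["mov eax, 1", "add eax, ebx", "add eax, ecx", "ret"]],
}
new_function = ["push ebp", "mov ebp, esp", "xor ecx, ecx", "call decode", "ret"]
group, score = attribute(new_function, library)
```

Here `new_function` shares its opcode sequence with the `group_a` entry even though the operands differ, so it is attributed to `group_a`. A learned embedding would replace `opcode_vector` in a real system; the nearest-match lookup would stay the same in shape.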
