Keywords:
- Deep learning /
- Advanced persistent threat (APT) group /
- Clone detection /
- Control flow graph (CFG) /
- System function call graph
Abstract: Malicious code clone detection has become an effective approach to malicious code homology analysis and to tracing the source of advanced persistent threat (APT) attacks. In this paper, we collect samples of different APT groups from public threat intelligence and propose a deep learning based malicious code clone detection framework. The framework measures the similarity between functions in newly discovered malicious code and the malicious code held in known APT group repositories, so that malware can be analyzed efficiently and the source of an APT attack identified quickly. We statically analyze malicious code through disassembly, use its key system function call graph and disassembly code as features, and classify the malicious code in the APT group repository with a neural network model. Extensive evaluation and comparison with our previous model (MCrab) show that the improved model outperforms MCrab, detects and classifies malicious code clones effectively, and achieves a high detection rate.
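The feature-extraction step described above, deriving a system function call graph from disassembled code, can be illustrated with a minimal sketch. Everything here is a simplifying assumption made for illustration: the one-call-per-line input format, the function names (`sub_401000`), and the API names are all invented, and a real pipeline would use a disassembler such as IDA Pro or radare2 with a much richer instruction model.

```python
from collections import defaultdict

def extract_call_graph(disasm_lines):
    """Build a mapping from each function to the APIs it calls.

    Assumes a toy textual format of "func_name: call TargetAPI"
    per line; lines that do not match are ignored.
    """
    graph = defaultdict(list)
    for line in disasm_lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip lines that are not in the toy format
        func, instr = line.split(":", 1)
        parts = instr.split()
        # keep only direct call instructions with a single operand
        if len(parts) == 2 and parts[0] == "call":
            graph[func.strip()].append(parts[1])
    return dict(graph)

# Toy disassembly fragment (names invented for illustration)
sample = [
    "sub_401000: call CreateFileA",
    "sub_401000: call WriteFile",
    "sub_402000: call RegSetValueExA",
]
print(extract_call_graph(sample))
```

The resulting per-function API lists are one plausible starting point for the graph and sequence features that the neural network model would then consume.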
Table 1. APT group samples dataset

| APT group | Samples | Functions | Training set | Validation set | Test set | Label |
|---|---|---|---|---|---|---|
| patchwork | 964 | 83 230 | 62 430 | 8 323 | 20 800 | 0 |
| hangover | 768 | 75 636 | 56 727 | 7 564 | 18 909 | 1 |
| stuxnet | 829 | 79 624 | 59 718 | 7 962 | 19 906 | 2 |
| darkhotel | 658 | 53 290 | 39 968 | 5 329 | 13 322 | 3 |
| lazarus | 721 | 65 997 | 49 497 | 6 560 | 16 500 | 4 |
| apt28 | 538 | 62 548 | 46 911 | 6 255 | 15 637 | 5 |
| deeppanda | 682 | 75 629 | 56 722 | 7 563 | 18 907 | 6 |
| apt10 | 721 | 85 476 | 64 107 | 8 548 | 21 369 | 7 |
| gaza | 530 | 48 926 | 36 695 | 4 893 | 12 231 | 8 |
| turla | 561 | 50 362 | 37 772 | 5 036 | 12 590 | 9 |
| Total | 6 972 | 680 718 | 510 547 | 68 033 | 170 171 | – |
Table 2. Comparison with different models
| Model | Accuracy |
|---|---|
| CNN-static | 0.836 |
| CNN-non-static | 0.857 |
| CNN-multichannel | 0.879 |
| DCNN | 0.886 |
| LSTM | 0.902 |
| Bi-LSTM | 0.881 |
| C-LSTM | 0.910 |
| CNN-SLSTM | 0.953 |