Citation: SHEN Yuan, YAN Hanbing, XIA Chunhe, et al. Malicious code clone detection technology based on deep learning[J]. Journal of Beijing University of Aeronautics and Astronautics, 2022, 48(2): 282-290. doi: 10.13700/j.bh.1001-5965.2020.0400 (in Chinese)
Malicious code clone detection has become an effective way to analyze malicious code homology and advanced persistent threat (APT) attacks. In this paper, we collect samples of different APT organizations from public threat intelligence and propose a deep-learning-based malicious code clone detection framework. The framework measures the similarity between the functions in newly discovered malicious code and the malicious code in a library of known APT organizations, so that malware can be analyzed efficiently and the source of an APT attack identified quickly. We statically analyze malicious code by disassembling it, use its key function call graph and disassembly code as features, and then classify the malicious code against the APT organization library with a neural network model. Extensive evaluation shows that the improved model outperforms our previous model (MCrab): it detects and classifies malicious code clones effectively and achieves a higher detection rate.
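The matching step described in the abstract, comparing a function from a new sample against functions in a library of known APT samples, can be illustrated with a minimal sketch. This is not the paper's model: a plain opcode-count vector and cosine similarity stand in for the learned features, and the names `opcode_vector`, `cosine`, `attribute`, and the group labels are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def opcode_vector(disassembly):
    """Count opcode mnemonics (the first token of each instruction line).

    Assumption: a bag of opcodes replaces the learned function embedding.
    """
    return Counter(line.split()[0] for line in disassembly if line.strip())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(new_function, library):
    """Return (group, score) for the most similar known function.

    `library` maps a hypothetical group label to a list of disassembled
    functions, each given as a list of instruction strings.
    """
    query = opcode_vector(new_function)
    best_group, best_score = None, 0.0
    for group, functions in library.items():
        for known in functions:
            score = cosine(query, opcode_vector(known))
            if score > best_score:
                best_group, best_score = group, score
    return best_group, best_score


if __name__ == "__main__":
    # Toy data, invented for this example only.
    library = {
        "group_a": [["push ebp", "mov ebp, esp", "call sub_401000", "ret"]],
        "group_b": [["xor eax, eax", "inc eax", "jmp loc_1"]],
    }
    sample = ["push ebp", "mov ebp, esp", "call sub_402000", "ret"]
    print(attribute(sample, library))
```

In a learned system, the opcode counts would be replaced by an embedding produced by the neural network, while the nearest-match lookup over the library could keep the same shape.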