Malicious code clone detection technology based on deep learning

SHEN Yuan, YAN Hanbing, XIA Chunhe, HAN Zhihui

Citation: SHEN Yuan, YAN Hanbing, XIA Chunhe, et al. Malicious code clone detection technology based on deep learning[J]. Journal of Beijing University of Aeronautics and Astronautics, 2022, 48(2): 282-290. doi: 10.13700/j.bh.1001-5965.2020.0400 (in Chinese)

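doi: 10.13700/j.bh.1001-5965.2020.0400
Funds: 

National Natural Science Foundation of China U1736218

National Science and Technology Major Project 2018YFB0804704

Beihang Youth Top Talent Support Program YWF-20-BJ-J-103

More Information
    Corresponding author:

    YAN Hanbing, E-mail: yhb@cert.org.cn

  • CLC number: TP391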

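  • Abstract:

    Malicious code clone detection has become an effective approach to malware homology analysis and to tracing the source of advanced persistent threat (APT) attacks. Samples from different APT groups were collected from public threat intelligence, and a deep-learning-based malicious code clone detection framework is proposed. The framework measures the similarity between functions in newly discovered malicious code and the malicious code in a repository of known APT groups, so that malware can be analyzed efficiently and the source of an APT attack identified quickly. The malicious code is analyzed statically by disassembly, and its key system function call graph and its disassembled code serve as its features. A neural network model then classifies the malicious code in the APT group repository. Extensive evaluation and a comparison with the MCrab model show that the improved model outperforms MCrab, performs malicious code clone detection and classification effectively, and achieves a high detection rate.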
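    The matching step described in the abstract (compare a function from a new sample against functions in a labeled repository) can be illustrated with a toy sketch. This is not the paper's neural model: it swaps in hand-written cosine and Jaccard similarity over opcode counts and system-call names purely to show the data flow. The function names, the dictionary layout, the group labels, the equal weighting, and the sample instructions are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def opcode_vector(instructions):
    """Count opcode mnemonics (the first token of each disassembled line)."""
    return Counter(line.split()[0] for line in instructions if line.strip())

def cosine(a, b):
    """Cosine similarity between two opcode count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard(a, b):
    """Overlap between two sets of system function names."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity(f, g, w=0.5):
    """Weighted mix of code similarity and system-call similarity (w is an assumed weight)."""
    code = cosine(opcode_vector(f["asm"]), opcode_vector(g["asm"]))
    calls = jaccard(f["calls"], g["calls"])
    return w * code + (1 - w) * calls

def best_match(query, repository):
    """Return (group label, score) of the most similar repository function."""
    return max(((group, similarity(query, func)) for group, func in repository),
               key=lambda pair: pair[1])

# Hypothetical repository of labeled functions and one query function.
repository = [
    ("group_a", {"asm": ["push ebp", "mov ebp, esp", "call CreateFileA", "ret"],
                 "calls": {"CreateFileA", "WriteFile"}}),
    ("group_b", {"asm": ["xor eax, eax", "inc eax", "jmp loop", "ret"],
                 "calls": {"RegOpenKeyExA"}}),
]
query = {"asm": ["push ebp", "mov ebp, esp", "call CreateFileA", "call WriteFile", "ret"],
         "calls": {"CreateFileA", "WriteFile"}}

group, score = best_match(query, repository)
print(group, round(score, 3))
```

    In the framework the abstract describes, a trained neural network takes the place of the hand-written `similarity` function; the surrounding lookup against a labeled repository is the part this sketch is meant to convey.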

     

  • Figure 1.  Prototype system GMCrab workflow

    Figure 2.  Deployment of GMCrab

    Figure 3.  CFG example

    Figure 4.  Example of system function call

    Figure 5.  Architecture of CNN-SLSTM for semantic model

    Figure 6.  Comparison of accuracy and loss for different models

    Table 1.  APT group samples dataset

    APT group    Samples   Functions   Training set   Validation set   Test set   Label
    patchwork    964       83 230      62 430         8 323            20 800     0
    hangover     768       75 636      56 727         7 564            18 909     1
    stuxnet      829       79 624      59 718         7 962            19 906     2
    darkhotel    658       53 290      39 968         5 329            13 322     3
    lazarus      721       65 997      49 497         6 560            16 500     4
    apt28        538       62 548      46 911         6 255            15 637     5
    deeppanda    682       75 629      56 722         7 563            18 907     6
    apt10        721       85 476      64 107         8 548            21 369     7
    gaza         530       48 926      36 695         4 893            12 231     8
    turla        561       50 362      37 772         5 036            12 590     9
    Total        6 972     680 718     510 547        68 033           170 171
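
    The columns of Table 1 can be cross-checked with a few lines of arithmetic. In every row the training and test counts add up to the function count (roughly a 75/25 split), while the validation count is about 10% of the function count, so it is not a third disjoint partition. The page does not say how the validation set was drawn; the sketch below only verifies the sums, and the names `ROWS`, `TOTALS`, and `check_split` are made up for this example.

```python
# (samples, functions, training, validation, test) per group, copied from Table 1.
ROWS = {
    "patchwork": (964, 83230, 62430, 8323, 20800),
    "hangover":  (768, 75636, 56727, 7564, 18909),
    "stuxnet":   (829, 79624, 59718, 7962, 19906),
    "darkhotel": (658, 53290, 39968, 5329, 13322),
    "lazarus":   (721, 65997, 49497, 6560, 16500),
    "apt28":     (538, 62548, 46911, 6255, 15637),
    "deeppanda": (682, 75629, 56722, 7563, 18907),
    "apt10":     (721, 85476, 64107, 8548, 21369),
    "gaza":      (530, 48926, 36695, 4893, 12231),
    "turla":     (561, 50362, 37772, 5036, 12590),
}
TOTALS = (6972, 680718, 510547, 68033, 170171)

def check_split(rows, totals):
    """Verify row and column sums, then return (training share, validation share) per group."""
    for name, (_, functions, train, _, test) in rows.items():
        if train + test != functions:
            raise ValueError(f"{name}: training + test != functions")
    column_sums = tuple(sum(column) for column in zip(*rows.values()))
    if column_sums != totals:
        raise ValueError(f"column sums {column_sums} do not match the Total row")
    return {name: (train / functions, val / functions)
            for name, (_, functions, train, val, _) in rows.items()}

shares = check_split(ROWS, TOTALS)
for name, (train_share, val_share) in shares.items():
    print(f"{name:10s} train={train_share:.3f} validation={val_share:.3f}")
```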

    Table 2.  Comparison of different models

    Model              Accuracy
    CNN-static         0.836
    CNN-non-static     0.857
    CNN-multichannel   0.879
    DCNN               0.886
    LSTM               0.902
    Bi-LSTM            0.881
    C-LSTM             0.910
    CNN-SLSTM          0.953
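
    To make the gaps in Table 2 explicit, the accuracy values can be turned into margins over each baseline. The snippet below is plain arithmetic on the reported numbers; the variable names are illustrative only.

```python
# Accuracy values copied from Table 2.
accuracy = {
    "CNN-static": 0.836,
    "CNN-non-static": 0.857,
    "CNN-multichannel": 0.879,
    "DCNN": 0.886,
    "LSTM": 0.902,
    "Bi-LSTM": 0.881,
    "C-LSTM": 0.910,
    "CNN-SLSTM": 0.953,
}

# Model with the highest reported accuracy.
best = max(accuracy, key=accuracy.get)

# Absolute accuracy margin of the best model over every other model.
margins = {model: round(accuracy[best] - value, 3)
           for model, value in accuracy.items() if model != best}

for model, margin in sorted(margins.items(), key=lambda item: item[1]):
    print(f"{best} vs {model}: +{margin:.3f}")
```

    On these numbers the closest baseline is C-LSTM, which trails CNN-SLSTM by 0.043 in accuracy.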
Publication history
  • Received:  2020-08-09
  • Accepted:  2021-01-21
  • Published online:  2022-02-20
