
Multi-modal mask Transformer network for social event classification

CHEN Hong, QIAN Shengsheng, LI Zhangming, FANG Quan, XU Changsheng

CHEN H, QIAN S S, LI Z M, et al. Multi-modal mask Transformer network for social event classification[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(2): 579-587 (in Chinese). doi: 10.13700/j.bh.1001-5965.2022.0388


doi: 10.13700/j.bh.1001-5965.2022.0388
Funds: National Natural Science Foundation of China (61832002)

    Corresponding author: E-mail: csxu@nlpr.ia

  • CLC number: TP391

  • Abstract:

    The key to multi-modal social event classification is to exploit the features of the image and text modalities fully and accurately. However, most existing methods simply concatenate an event's image and text features, so irrelevant contextual information across modalities causes mutual interference. It is therefore not enough to model only the relations between modalities; the irrelevant contextual elements across modalities (i.e., regions or words) must also be accounted for. To overcome these limitations, this paper proposes a novel social event classification method based on a multi-modal mask Transformer network (MMTN). An image-text encoding network learns better representations of the text and the image. The resulting representations are fed into the multi-modal mask Transformer network, which fuses the multi-modal information, models the inter-modal relations by computing similarities between the modalities, and masks the irrelevant context across modalities. Extensive experiments on two benchmark datasets show that the proposed model achieves state-of-the-art performance.
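As a rough illustration of the similarity-based masking described in the abstract, the sketch below computes cosine similarities between image-region and word features and masks low-similarity pairs before the attention softmax. The function name, the choice of cosine similarity, and the default threshold are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def masked_cross_modal_attention(img_feats, txt_feats, threshold=0.1):
    """Mask low-similarity region-word pairs before cross-modal attention.

    img_feats: (R, d) image-region features; txt_feats: (W, d) word features.
    Names and the default threshold are illustrative, not from the paper.
    """
    # Cosine similarity between every image region and every word
    img_n = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt_n = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    sim = img_n @ txt_n.T                                  # (R, W)

    # Scaled dot-product attention scores, with irrelevant pairs masked out
    scores = img_feats @ txt_feats.T / np.sqrt(img_feats.shape[1])
    scores = np.where(sim >= threshold, scores, -np.inf)

    # Row-wise softmax over words; masked pairs get zero weight
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ txt_feats                             # (R, d) text-aware regions
```

Note that if every word were masked for some region, the row softmax would degenerate; a practical implementation would fall back to unmasked attention for such rows.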

     

  • Figure 1.  Overall framework of the proposed method

    Figure 2.  Schematic diagram of the mask network

    Figure 3.  Comparison of experimental results with different thresholds of the mask module
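Figure 3 compares results under different mask-module thresholds. The toy sweep below (all similarity values randomly generated, not taken from the paper) merely illustrates how raising the threshold masks a growing fraction of cross-modal region-word pairs:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy cosine similarities between 10 image regions and 20 words (illustrative only)
sim = np.clip(rng.normal(0.2, 0.3, size=(10, 20)), -1.0, 1.0)

# A higher threshold masks more region-word pairs before attention
for threshold in (0.0, 0.1, 0.2, 0.3):
    masked_fraction = float((sim < threshold).mean())
    print(f"threshold={threshold:.1f}  masked pairs: {masked_fraction:.0%}")
```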

    Table 1.   CrisisMMD[11] dataset

    Event | Texts | Images
    Hurricane Irma | 4 041 | 4 525
    Hurricane Harvey | 4 000 | 4 443
    Hurricane Maria | 4 000 | 4 562
    Mexico earthquake | 1 239 | 1 382
    California wildfires | 1 486 | 1 589
    Sri Lanka floods | 832 | 1 025
    Iraq-Iran earthquake | 499 | 600
    All events | 16 097 | 18 126

    Table 2.   PHEME[12] dataset

    Event | Texts | Images
    Charlie Hebdo | 2 079 | 911
    Sydney siege | 1 221 | 413
    Ferguson | 1 143 | 355
    Ottawa shooting | 890 | 231
    Germanwings-crash | 469 | 179
    All events | 5 802 | 2 089

    Table 3.   Experimental results of different models on the CrisisMMD dataset (%)

    Model | Accuracy | F1 score | Weighted F1 score
    MMBT[20] | 87.82 | 84.78 | 87.62
    CBP[21] | 91.54 | 89.36 | 91.23
    CBGP[22] | 92.12 | 91.46 | 92.45
    MDL-DR[23] | 86.77 | 85.73 | 86.91
    Multi-RC[25] | 83.95 | 82.23 | 83.95
    SCBD[7] | 93.66 | 95.10 | 93.66
    MCAN[16] | 94.82 | 94.87 | 94.85
    MMTN | 96.72 | 97.09 | 96.72

    Table 4.   Experimental results of different models on the PHEME dataset (%)

    Model | Accuracy | F1 score | Weighted F1 score
    MMBT[20] | 85.92 | 84.27 | 88.66
    CBP[21] | 89.42 | 88.52 | 90.33
    CBGP[22] | 86.14 | 87.79 | 90.12
    MDL-DR[23] | 85.3 | 85.1 | 85.4
    Multi-RC[25] | 83.67 | 82.61 | 83.17
    SCBD[7] | 90.90 | 91.26 | 90.91
    MCAN[16] | 91.19 | 91.55 | 91.21
    MMTN | 94.65 | 94.89 | 94.65

    Table 5.   Comparison of detection performance of different MMTN variants on the CrisisMMD[11] dataset (%)

    Model | Accuracy | F1 score | Weighted F1 score
    $\mathrm{MMTN}\neg\boldsymbol{F}_{VW}$ | 94.82 | 95.63 | 94.84
    $\mathrm{MMTN}\neg\boldsymbol{F}_{WV}$ | 93.86 | 93.63 | 93.85
    $\mathrm{MMTN}\neg\mathrm{M}$ | 95.72 | 96.35 | 95.72
    $\mathrm{MMTN}\neg\mathrm{FI}$ | 94.76 | 94.75 | 94.77
    $\mathrm{MMTN}$ | 96.72 | 97.09 | 96.72

    Table 6.   Comparison of detection performance of different MMTN variants on the PHEME[12] dataset (%)

    Model | Accuracy | F1 score | Weighted F1 score
    $\mathrm{MMTN}\neg\boldsymbol{F}_{VW}$ | 91.31 | 91.61 | 91.29
    $\mathrm{MMTN}\neg\boldsymbol{F}_{WV}$ | 91.71 | 92.42 | 91.71
    $\mathrm{MMTN}\neg\mathrm{M}$ | 93.44 | 93.65 | 93.44
    $\mathrm{MMTN}\neg\mathrm{FI}$ | 92.11 | 92.78 | 92.17
    $\mathrm{MMTN}$ | 94.65 | 94.89 | 94.65
  • [1] KUMAR S, BARBIER G, ABBASI M, et al. TweetTracker: An analysis tool for humanitarian and disaster relief[C]//Proceedings of the International AAAI Conference on Web and Social Media. Washington, D. C.: AAAI, 2011, 5(1): 661-662.
    [2] SHEKHAR H, SETTY S. Disaster analysis through tweets[C]//Proceedings of the 2015 International Conference on Advances in Computing, Communications and Informatics. Piscataway: IEEE Press, 2015: 1719-1723.
    [3] STOWE K, PAUL M J, PALMER M, et al. Identifying and categorizing disaster-related tweets[C]//Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media. Stroudsburg: Association for Computational Linguistics, 2016.
    [4] TO H, AGRAWAL S, KIM S H, et al. On identifying disaster-related tweets: Matching-based or learning-based? [C]//Proceedings of the 2017 IEEE Third International Conference on Multimedia Big Data. Piscataway: IEEE Press, 2017: 330-337.
    [5] MOUZANNAR H, RIZK Y, AWAD M. Damage identification in social media posts using multimodal deep learning[C]//Proceedings of the 15th International Conference on Information Systems for Crisis Response and Management. Rochester: ISCRAM, 2018.
    [6] KELLY S, ZHANG X B, AHMAD K. Mining multimodal information on social media for increased situational awareness[C]//Proceedings of the 14th International Conference on Information Systems for Crisis Response and Management. Albi: ISCRAM, 2017.
    [7] ABAVISANI M, WU L W, HU S L, et al. Multimodal categorization of crisis events in social media[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 14667-14677.
    [8] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
    [9] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional Transformers for language understanding[EB/OL]. (2019-05-24) [2022-02-16]. https://arxiv.org/abs/1810.04805.
    [10] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
    [11] ALAM F, OFLI F, IMRAN M. CrisisMMD: Multimodal twitter datasets from natural disasters[C]//Proceedings of the International AAAI Conference on Web and Social Media. Washington, D. C. : AAAI, 2018, 12(1): 465-473.
    [12] KOCHKINA E, LIAKATA M, ZUBIAGA A. All-in-one: Multi-task learning for rumour verification[C]//Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe: Association for Computational Linguistics, 2018: 3402-3413.
    [13] LI X K, CARAGEA D, ZHANG H Y, et al. Localizing and quantifying damage in social media images[C]//Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. Piscataway: IEEE Press, 2018: 194-201.
    [14] NALLURU G, PANDEY R, PUROHIT H. Relevancy classification of multimodal social media streams for emergency services[C]//Proceedings of the 2019 IEEE International Conference on Smart Computing. Piscataway: IEEE Press, 2019: 121-125.
    [15] NIE X S, WANG B W, LI J J, et al. Deep multiscale fusion hashing for cross-modal retrieval[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(1): 401-410. doi: 10.1109/TCSVT.2020.2974877
    [16] WU Y, ZHAN P W, ZHANG Y J, et al. Multimodal fusion with co-attention networks for fake news detection[C]//Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Stroudsburg: Association for Computational Linguistics, 2021: 2560-2569.
    [17] MAO Y D, JIANG Q P, CONG R M, et al. Cross-modality fusion and progressive integration network for saliency prediction on stereoscopic 3D images[J]. IEEE Transactions on Multimedia, 2021, 24: 2435-2448.
    [18] QI P, CAO J, LI X R, et al. Improving fake news detection by using an entity-enhanced framework to fuse diverse multimodal clues[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 1212-1220.
    [19] DENG J, DONG W, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]//Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2009: 248-255.
    [20] KIELA D, BHOOSHAN S, FIROOZ H, et al. Supervised multimodal bitransformers for classifying images and text[EB/OL]. (2019-09-06) [2022-02-18]. https://arxiv.org/abs/1909.02950.
    [21] FUKUI A, PARK D H, YANG D, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding[C]// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2016: 457-468.
    [22] KIELA D, GRAVE E, JOULIN A, et al. Efficient large-scale multi-modal classification[EB/OL]. (2018-02-06) [2022-02-18]. https://arxiv.org/abs/1802.02892.
    [23] OFLI F, ALAM F, IMRAN M. Analysis of social media data using multimodal deep learning for disaster response[EB/OL]. (2020-04-14)[2022-02-18] .https://arxiv.org/abs/2004.11838.
    [24] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10) [2022-02-20]. https://arxiv.org/abs/1409.1556.
    [25] LI X, CARAGEA D. Improving disaster-related tweet classification with a multimodal approach[C]//Proceedings of the 17th ISCRAM Conference. Blacksburg: ISCRAM, 2020: 893-902.
    [26] CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN Encoder–Decoder for statistical machine translation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2014: 1724-1734.
    [27] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90. doi: 10.1145/3065386
    [28] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 2261-2269.
Publication history
  • Received: 2022-05-19
  • Accepted: 2022-11-04
  • Available online: 2023-01-12
  • Issue published: 2024-02-27
