
Image captioning model based on divergence-based and spatial consistency constraints

JIANG Wenhui, CHEN Zhiliang, CHENG Yibo, FANG Yuming, ZUO Yifan

Citation: JIANG W H, CHEN Z L, CHENG Y B, et al. Image captioning model based on divergence-based and spatial consistency constraints[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(2): 456-465 (in Chinese). doi: 10.13700/j.bh.1001-5965.2022.0400

doi: 10.13700/j.bh.1001-5965.2022.0400

Funds: National Natural Science Foundation of China (62161013, 62162029); Key R&D Program of Jiangxi Province (20203BBE53033); Natural Science Foundation of Jiangxi Province (20224BAB212010, 20212BAB202011, 20224BAB212012, 20232BAB202001)

Corresponding author: E-mail: fa0001ng@e.ntu.edu.sg

CLC number: TP391

Abstract:

Multi-head attention is a widely used component of image captioning models. Through its multi-branch structure, it captures distinct properties of the input features and thereby improves the discriminability of the feature representation. However, because the branches are learned independently, their modeling is partly redundant; moreover, the attention mechanism may focus on unimportant image regions, making the generated captions less accurate. To address these problems, this paper proposes a loss function that serves as a regularization term of the training objective to improve both the diversity and the accuracy of multi-head attention. For diversity, a divergence-based regularization is proposed that encourages different branches of the multi-head attention to focus on different parts of the described object, which simplifies the modeling target of each branch; the branches are then fused to form a complete and more discriminative visual representation. For accuracy, a spatial consistency regularization is designed: by modeling the spatial correlation of the multi-head attention, it encourages the attended image regions to be as concentrated as possible, suppressing the influence of background regions and improving the accuracy of the attention mechanism. The divergence-based regularization and the spatial consistency regularization are applied jointly, which ultimately improves the accuracy of the image captioning model. The proposed method is validated on the MS COCO dataset and compared with a variety of representative works. Experimental results show that the proposed method significantly improves the accuracy of image captioning.
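The divergence-based regularization (DBR) and spatial consistency regularization (SCR) described above are loss terms added to the training objective of a Transformer captioning model. The sketch below is only an illustration of how such terms could be computed on the attention weights; it does not reproduce the exact formulations of the paper. The pairwise-cosine-similarity form of the divergence term, the attention-weighted spatial-variance form of the consistency term, and all names and weights (attn, centers, lambda_dbr, lambda_scr) are assumptions made for illustration.

```python
# Minimal PyTorch sketch of two attention regularizers (illustrative only).
import torch
import torch.nn.functional as F


def divergence_regularizer(attn: torch.Tensor) -> torch.Tensor:
    """Encourage different heads to attend to different regions.

    attn: attention weights of shape (batch, heads, queries, regions),
          each row summing to 1 over the region axis.
    Returns the mean pairwise cosine similarity between heads, which the
    training loss should push down.
    """
    b, h, q, r = attn.shape
    a = F.normalize(attn, p=2, dim=-1)               # unit-norm per head/query
    sim = torch.einsum("bhqr,bgqr->bhgq", a, a)      # head-vs-head similarity
    off_diag = sim * (1.0 - torch.eye(h, device=attn.device).view(1, h, h, 1))
    return off_diag.sum() / (b * q * h * (h - 1))


def spatial_consistency_regularizer(attn: torch.Tensor,
                                    centers: torch.Tensor) -> torch.Tensor:
    """Encourage the attended regions to be spatially concentrated.

    attn:    attention weights of shape (batch, heads, queries, regions).
    centers: normalized (x, y) centers of the image regions,
             shape (batch, regions, 2).
    Penalizes the attention-weighted spatial variance of the region centers,
    so attention mass spread over distant (background) regions is discouraged.
    """
    # Expected center under the attention distribution: (batch, heads, queries, 2)
    mean = torch.einsum("bhqr,brc->bhqc", attn, centers)
    # Squared distance of every region center to that expected center
    diff = centers[:, None, None, :, :] - mean[:, :, :, None, :]
    var = torch.einsum("bhqr,bhqr->bhq", attn, diff.pow(2).sum(-1))
    return var.mean()


# Usage sketch: add both terms to the captioning loss during training,
# weighted by hyper-parameters lambda_dbr and lambda_scr (assumed names).
# loss = caption_loss + lambda_dbr * divergence_regularizer(attn) \
#        + lambda_scr * spatial_consistency_regularizer(attn, centers)
```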

     

Figure 1. Visualization of the attention results of different branches of the multi-head attention mechanism

Figure 2. Overall framework of the proposed method

Figure 3. Visualization of the multi-head attention mechanism

Figure 4. Examples of captions generated by different regularization methods

Table 1. Effect of different regularization methods on image captioning performance

Method           BLEU-1  BLEU-4  CIDEr  METEOR
Transformer      76.4    36.4    116.9  28.3
+Cosine          76.0    36.4    115.4  28.0
+SRIP            76.4    36.7    116.5  28.3
+DBR             76.5    36.8    117.9  28.3
+SCR             76.4    36.9    117.5  28.4
Proposed method  76.5    36.9    118.4  28.4
Note: Bold indicates the best result in each column.

Table 2. Effect of the number of branches h on image captioning performance

h    Transformer  +DBR
4    116.4        116.8
8    116.9        117.9
16   117.0        118.3
32   116.5        117.4

Table 3. Performance comparison between the proposed method and mainstream methods on the MS COCO test set

Method           BLEU-1  BLEU-4  METEOR  CIDEr
Up-Down          79.8    36.3    27.7    120.1
STMA             80.2    37.7    28.2    125.9
SCAN             80.2    38.0    28.5    126.1
AAT              —       38.7    28.6    128.6
RD               —       38.6    28.7    128.3
Transformer      80.1    38.8    28.7    127.2
ORT              80.5    38.6    28.7    128.3
M2               80.8    39.1    29.2    131.2
CaptionNet       80.4    38.5    28.8    127.6
TAA              78.6    37.1    27.5    119.6
X-LAN            80.8    39.7    29.5    132.0
DIC              —       38.7    28.4    128.2
GET              81.5    39.5    29.3    131.6
CAVP             —       38.6    28.3    126.3
SG2Caps          —       33.0    26.2    112.3
Proposed method  81.7    39.3    29.2    132.0
ORT+DBR+SCR      81.0    39.3    29.1    131.3
M2+DBR+SCR       81.1    39.5    29.9    131.8
Publication history
Received: 2022-05-21
Accepted: 2022-07-29
Published online: 2022-10-12
Issue published: 2024-02-27
