



BI Jianqi, LIU Maofu, HU Huijun, DAI Jianhua

Citation: BI Jianqi, LIU Maofu, HU Huijun, et al. Image captioning based on dependency syntax[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 431-440. doi: 10.13700/j.bh.1001-5965.2020.0443 (in Chinese)


doi: 10.13700/j.bh.1001-5965.2020.0443

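Major Projects of National Social Science Foundation of China 11&ZD189

Pre-research Foundation of Whole Army Shared Information System Equipment 31502030502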


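    BI Jianqi   Male, master's student. Main research interest: natural language processing.

    LIU Maofu   Male, Ph.D., professor, doctoral supervisor. Main research interests: natural language processing, image analysis and understanding.

    HU Huijun   Female, Ph.D., associate professor, master's supervisor. Main research interests: intelligent information processing, image analysis and understanding.

    DAI Jianhua   Male, Ph.D., professor, doctoral supervisor. Main research interests: artificial intelligence, intelligent information processing.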


    Corresponding author: LIU Maofu, E-mail: liumaofu@wust.edu.cn

  • CLC number: TP37

Image captioning based on dependency syntax





  • Figure 1. Framework of the proposed model
    Figure 2. A dependency syntax example of an image caption
    Figure 3. Comparison of generated captions
    Figure 4. Effect of different K values on the experimental results
    Figure 5. Classification results when K is 10
    Figure 6. Example of the diversity of captions generated by the model
    Figure 7. Comparison of image attention
    Figure 8. Similar image and dependency syntactic template
    Figure 9. Examples of image captioning based on dependency syntax

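    Table 1. Hyperparameter settings

    Parameter                              Value
    Image feature vector (dimensions)      14×14×2048
    Word vector (dimensions)               512
    Dependency syntax vector (dimensions)  512
    LSTM hidden vector (dimensions)        512
    Number of self-attention heads         8
    Batch size                             32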
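
    The settings in Table 1 can be gathered into a single configuration object. The following is a minimal sketch, not the authors' code: the class name `CaptionHyperparams`, its field names, and both helper methods are hypothetical, and `head_dim` assumes the self-attention width equals the 512-dimensional hidden size, which the table does not state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CaptionHyperparams:
    """Hypothetical container for the values listed in Table 1."""

    image_feature_shape: Tuple[int, int, int] = (14, 14, 2048)  # grid height, grid width, channels
    word_vector_dim: int = 512
    dependency_vector_dim: int = 512
    lstm_hidden_dim: int = 512
    num_attention_heads: int = 8
    batch_size: int = 32

    def num_image_regions(self) -> int:
        # A 14 x 14 feature grid gives one feature vector per spatial cell.
        height, width, _ = self.image_feature_shape
        return height * width

    def head_dim(self) -> int:
        # Assumption: attention operates on vectors of size lstm_hidden_dim,
        # split evenly across the heads.
        if self.lstm_hidden_dim % self.num_attention_heads != 0:
            raise ValueError("hidden size must be divisible by the number of heads")
        return self.lstm_hidden_dim // self.num_attention_heads


config = CaptionHyperparams()
```

    Under these assumptions, `config.num_image_regions()` is 196 and `config.head_dim()` is 64. Freezing the dataclass keeps one experiment's settings from being changed accidentally during a run.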

    Table 2. Experimental results on Flickr30K dataset

    NIC+ATT(Baseline) 62.84 39.00 25.07 17.52 17.98 44.57 30.18 11.06
    AdaptAtt 60.69 41.80 25.92 18.63 19.71 45.61 33.36
    NIC+WC+WA+RL 24.50 21.50 51.60 58.40
    MLO/MLPF-LSTM+(BS) 66.20 47.20 33.10 23.00 19.60
    CACNN-GAN(ResNet-152) 69.30 49.90 35.80 25.90 22.30
    NIC+DS(Top-5) 57.09 39.35 28.66 20.73 20.81 48.24 49.78 17.58
    NIC+DSSA(Top-5) 58.62 40.46 29.81 22.62 20.96 49.98 51.74 17.56
    NIC+DS(Top-10) 59.76 44.53 31.48 24.75 21.31 51.36 50.91 18.43
    NIC+DSSA(Top-10) 61.81 47.33 33.97 26.06 23.57 52.81 52.48 18.62

    Table 3. Experimental results on Flickr8K dataset

    NIC+ATT(Baseline) 60.32 37.88 24.66 16.33 18.48 46.16 34.99 11.17
    NIC+DS(Top-10) 57.76 41.16 30.70 27.78 19.54 48.81 36.69 14.47
    NIC+DSSA(Top-10) 59.45 45.86 36.05 29.36 21.92 50.06 40.24 15.72

    Table 4. Experimental results on Flickr8K-CN dataset

    NIC+ATT(Baseline) 59.16 36.30 22.73 16.02 16.87 43.59 31.09 10.82
    NIC+DS(Top-10) 56.28 40.03 29.42 25.61 17.48 46.21 34.16 13.45
    NIC+DSSA(Top-10) 58.72 46.86 33.05 28.16 20.57 49.10 38.48 14.36

    Table 5. Number of conjunctions in captions

    Model               Number of conjunctions
    NIC                 1
    NIC+DSSA            66
    Reference captions  58
  • Received: 2020-08-21
  • Accepted: 2020-09-05
  • Published online: 2021-03-20


