
Image captioning based on dependency syntax

BI Jianqi, LIU Maofu, HU Huijun, DAI Jianhua

Citation: BI Jianqi, LIU Maofu, HU Huijun, et al. Image captioning based on dependency syntax[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 431-440. doi: 10.13700/j.bh.1001-5965.2020.0443 (in Chinese)

Image captioning based on dependency syntax

doi: 10.13700/j.bh.1001-5965.2020.0443
Funds:

Major Projects of National Social Science Foundation of China 11&ZD189

Pre-research Foundation of Whole Army Shared Information System Equipment 31502030502

More Information
    About the authors:

    BI Jianqi   Male, M.S. candidate. Research interests: natural language processing

    LIU Maofu   Male, Ph.D., professor, doctoral supervisor. Research interests: natural language processing, image analysis and understanding

    HU Huijun   Female, Ph.D., associate professor, master's supervisor. Research interests: intelligent information processing, image analysis and understanding

    DAI Jianhua   Male, Ph.D., professor, doctoral supervisor. Research interests: artificial intelligence, intelligent information processing

    Corresponding author: LIU Maofu, E-mail: liumaofu@wust.edu.cn

  • CLC number: TP37

  • Abstract:

    Existing image captioning models can use part-of-speech sequences and syntax trees to make the generated captions more grammatical, but the captions are mostly simple sentences, and little work has studied how linguistic information can make deep learning models more interpretable. Fusing dependency syntax information into a deep learning model to supervise image caption generation also makes the deep learning model more interpretable. An image structural attention mechanism, based on dependency syntax and visual information, computes the relationships between image regions and produces image region relation features. These region relation features are fused with the image region features and, together with the word embeddings of the caption, passed through a long short-term memory (LSTM) network to generate the image caption. In the testing stage, the content overlap between the test image and images in the training set is computed from their content keywords, so that a dependency syntax template matching the test image is retrieved indirectly; based on this template, the model generates diverse image captions. Experimental results verify the model's ability to improve the diversity and syntactic complexity of image captions, and show that the dependency syntax information in the model enhances the interpretability of the deep learning model.
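
    As a rough illustration of the generation pipeline summarized above, one decoding step could be sketched in PyTorch as follows. This is a minimal sketch under assumptions, not the authors' implementation: the module names, the syntax-guided attention and fusion scheme, and the vocabulary size are illustrative only; the dimensions follow Table 1.

        # Minimal sketch of one decoding step: syntax-guided attention over region
        # features -> region-relation feature -> fusion with pooled region features
        # -> LSTM step producing word logits. Dimensions follow Table 1.
        import torch
        import torch.nn as nn

        class StructuralAttentionDecoder(nn.Module):
            def __init__(self, region_dim=2048, hidden_dim=512, num_heads=8, vocab_size=10000):
                super().__init__()
                self.region_proj = nn.Linear(region_dim, hidden_dim)   # 14x14x2048 region features -> 512
                self.syntax_proj = nn.Linear(hidden_dim, hidden_dim)   # dependency syntax vector (512)
                self.relation_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
                self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)      # relation feature + pooled regions
                self.embed = nn.Embedding(vocab_size, hidden_dim)      # word embeddings (512)
                self.lstm = nn.LSTMCell(2 * hidden_dim, hidden_dim)    # LSTM hidden size 512
                self.out = nn.Linear(hidden_dim, vocab_size)

            def step(self, regions, syntax_vec, word_ids, state=None):
                # regions: (B, 196, 2048); syntax_vec: (B, 512); word_ids: (B,)
                r = self.region_proj(regions)                          # (B, 196, 512)
                q = self.syntax_proj(syntax_vec).unsqueeze(1)          # syntax-guided query (B, 1, 512)
                relation, _ = self.relation_attn(q, r, r)              # region-relation feature (B, 1, 512)
                visual = self.fuse(torch.cat([relation.squeeze(1), r.mean(dim=1)], dim=-1))
                h, c = self.lstm(torch.cat([visual, self.embed(word_ids)], dim=-1), state)
                return self.out(h), (h, c)                             # word logits and new LSTM state

    The test-stage template retrieval can be sketched in the same spirit. The abstract only states that content overlap between keyword sets is computed, so the Jaccard similarity, the Top-K cut-off, and all names below are assumptions:

        # Retrieve dependency syntax templates from the training images whose content
        # keywords overlap most with those of the test image (Jaccard overlap assumed).
        def retrieve_templates(test_keywords, train_examples, top_k=10):
            """train_examples: list of (keyword_set, dependency_syntax_template) pairs."""
            def jaccard(a, b):
                return len(a & b) / len(a | b) if (a or b) else 0.0

            ranked = sorted(train_examples,
                            key=lambda ex: jaccard(set(test_keywords), set(ex[0])),
                            reverse=True)
            # One caption is then generated per retrieved template, yielding diverse captions.
            return [template for _, template in ranked[:top_k]]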

     

  • Figure 1.  Framework of the proposed model

    Figure 2.  A dependency syntax example of an image caption

    Figure 3.  Comparison of generated captions

    Figure 4.  Experimental results affected by different K values

    Figure 5.  Results of classification when K is 10

    Figure 6.  Example of diversity of captions generated by the model

    Figure 7.  Comparison of image attention

    Figure 8.  Similar images and dependency syntax templates

    Figure 9.  Examples of image captioning based on dependency syntax

    Table 1.  Hyperparameter settings

    Parameter                              Value
    Image feature vector dimensions        14×14×2048
    Word embedding dimensions              512
    Dependency syntax vector dimensions    512
    LSTM hidden vector dimensions          512
    Number of self-attention heads         8
    Batch size                             32
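
    The Table 1 settings can also be collected into a single configuration object; a minimal sketch (key names are ours, not from the paper):

        # Table 1 hyperparameters as one dictionary (illustrative key names).
        HPARAMS = {
            "image_feature_shape": (14, 14, 2048),  # image feature vector dimensions
            "word_embedding_dim": 512,
            "dependency_syntax_dim": 512,
            "lstm_hidden_dim": 512,
            "num_self_attention_heads": 8,
            "batch_size": 32,
        }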

    Table 2.  Experimental results on the Flickr30K dataset

    Model BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr Len
    NIC+ATT(Baseline) 62.84 39.00 25.07 17.52 17.98 44.57 30.18 11.06
    AdaptAtt 60.69 41.80 25.92 18.63 19.71 45.61 33.36
    NIC+WC+WA+RL 24.50 21.50 51.60 58.40
    MLO/MLPF-LSTM+(BS) 66.20 47.20 33.10 23.00 19.60
    CACNN-GAN(ResNet-152) 69.30 49.90 35.80 25.90 22.30
    NIC+DS(Top-5) 57.09 39.35 28.66 20.73 20.81 48.24 49.78 17.58
    NIC+DSSA(Top-5) 58.62 40.46 29.81 22.62 20.96 49.98 51.74 17.56
    NIC+DS(Top-10) 59.76 44.53 31.48 24.75 21.31 51.36 50.91 18.43
    NIC+DSSA(Top-10) 61.81 47.33 33.97 26.06 23.57 52.81 52.48 18.62

    Table 3.  Experimental results on the Flickr8K dataset

    Model BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr Len
    NIC+ATT(Baseline) 60.32 37.88 24.66 16.33 18.48 46.16 34.99 11.17
    NIC+DS(Top-10) 57.76 41.16 30.70 27.78 19.54 48.81 36.69 14.47
    NIC+DSSA(Top-10) 59.45 45.86 36.05 29.36 21.92 50.06 40.24 15.72

    Table 4.  Experimental results on the Flickr8K-CN dataset

    Model BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr Len
    NIC+ATT(Baseline) 59.16 36.30 22.73 16.02 16.87 43.59 31.09 10.82
    NIC+DS(Top-10) 56.28 40.03 29.42 25.61 17.48 46.21 34.16 13.45
    NIC+DSSA(Top-10) 58.72 46.86 33.05 28.16 20.57 49.10 38.48 14.36

    Table 5.  Statistics of the number of conjunctions in captions

    Model                 Number of conjunctions
    NIC                   1
    NIC+DSSA              66
    Reference captions    58
  • [1] XU K, BA J, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning, 2015: 2048-2057.
    [2] LU C, KRISHNA R, BERNSTEIN M S, et al. Visual relationship detection with language priors[C]//Proceedings of the 14th European Conference on Computer Vision, 2016: 852-869.
    [3] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems 28, 2015: 91-99.
    [4] GUO Y, CHENG Z, NIE L, et al. Quantifying and alleviating the language prior problem in visual question answering[C]//Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM Press, 2019: 75-84.
    [5] DESHPANDE A, ANEJA J, WANG L, et al. Fast, diverse and accurate image captioning guided by part-of-speech[C]//Proceedings of the 32nd IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 10695-10704.
    [6] WANG Y, LIN Z, SHEN X, et al. Skeleton key: Image captioning by skeleton-attribute decomposition[C]//Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 7272-7281.
    [7] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 5998-6008.
    [8] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014: 3104-3112.
    [9] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(4): 652-663.
    [10] ZHU Z, XUE Z, YUAN Z. Topic-guided attention for image captioning[C]//Proceedings of the 25th IEEE International Conference on Image Processing. Piscataway: IEEE Press, 2018: 2615-2619.
    [11] WANG T, HU H, HE C. Image caption with endogenous-exogenous attention[J]. Neural Processing Letters, 2019, 50(1): 431-443. doi: 10.1007/s11063-019-09979-7
    [12] LIU F, LIU Y, REN X, et al. Aligning visual regions and textual concepts: Learning fine-grained image representations for image captioning[EB/OL]. (2019-05-15)[2020-08-01]. https://arxiv.org/abs/1905.06139v1.
    [13] FALENSKA A, KUHN J. The (non-) utility of structural features in BiLSTM-based dependency parsers[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 117-128.
    [14] LI Z, PENG X, ZHANG M, et al. Semi-supervised domain adaptation for dependency parsing[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 2386-2395.
    [15] WANG X, TU Z, WANG L, et al. Self-attention with structural position representations[EB/OL]. (2019-09-01)[2020-08-01]. https://arxiv.org/abs/1909.00383.
    [16] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
    [17] MANNING C D, SURDEANU M, BAUER J, et al. The Stanford CoreNLP natural language processing toolkit[C]//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014: 55-60.
    [18] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: A method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002: 311-318.
    [19] BANERJEE S, LAVIE A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005: 65-72.
    [20] LIN C Y. Rouge: A package for automatic evaluation of summaries[C]//Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004: 74-81.
    [21] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: Consensus-based image description evaluation[C]//Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2015: 4566-4575.
    [22] LU J, XIONG C, PARIKH D, et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning[C]//Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 3242-3250.
    [23] FAN Z, WEI Z, HUANG X, et al. Bridging by word: Image grounded vocabulary construction for visual captioning[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2019: 6514-6524.
    [24] TANG P J, WANG H L, XU K S. Multi-objective layer-wise optimization and multi-level probability fusion for image description generation using LSTM[J]. Acta Automatica Sinica, 2018, 44(7): 1237-1249 (in Chinese).
    [25] XUE Z Y, GUO P Y, ZHU X B, et al. Image description method based on generative adversarial networks[J]. Journal of Software, 2018, 29(2): 30-43 (in Chinese).
    [26] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[EB/OL]. (2014-06-10)[2020-08-01]. https://arxiv.org/abs/1406.2661.
    [27] LIU M, HU H, LI L, et al. Chinese image caption generation via visual attention and topic modeling[J/OL]. IEEE Transactions on Cybernetics, 2020(2020-06-22)[2020-08-01]. https://ieeexplore.ieee.org/document/9122435.
Publication history
  • Received:  2020-08-21
  • Accepted:  2020-09-05
  • Published online:  2021-03-20
