Volume 47 Issue 3
Mar. 2021
BI Jianqi, LIU Maofu, HU Huijun, et al. Image captioning based on dependency syntax[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 431-440. doi: 10.13700/j.bh.1001-5965.2020.0443 (in Chinese)

Image captioning based on dependency syntax

doi: 10.13700/j.bh.1001-5965.2020.0443
Funds:

Major Projects of National Social Science Foundation of China 11&ZD189

Pre-research Foundation of Whole Army Shared Information System Equipment 31502030502

  • Corresponding author: LIU Maofu, E-mail: liumaofu@wust.edu.cn
  • Received Date: 21 Aug 2020
  • Accepted Date: 05 Sep 2020
  • Publish Date: 20 Mar 2021
  • Abstract: Current image captioning models can automatically apply part-of-speech sequences and syntactic trees to keep the generated text grammatical, but they generally produce simple sentences, and no groundbreaking work on language modeling has yet improved the interpretability of deep learning models. In this work, dependency syntax is integrated into the deep learning model to supervise image captioning, which makes the model more interpretable. An image structure attention mechanism, which recognizes the relationships between image regions based on dependency syntax, computes the visual relations and extracts relation features. The fusion of the image region relation features and the image region features, together with the word embeddings, is fed into a Long Short-Term Memory (LSTM) network to generate image captions. At test time, content keywords are produced for the test and training images; through the content overlap between two images, the dependency syntax template corresponding to the test image can be indirectly extracted, and diverse descriptions are then generated according to this template. Experimental results verify that the proposed model improves the diversity and syntactic complexity of the generated captions, and indicate that dependency syntax can enhance the interpretability of the deep learning model.
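    The following is a minimal PyTorch-style sketch of the decoding pipeline described in the abstract: an image structure attention module whose attention weights are restricted by a dependency-syntax-derived relation mask, fusion of the resulting relation features with the plain region features, and an LSTM that consumes the fused visual context together with word embeddings. It is an illustration under assumed interfaces, not the authors' implementation; names such as StructureAttention, relation_mask, and the fusion layer are hypothetical.

    # Hypothetical sketch (not the authors' released code) of the ideas in the abstract.
    import torch
    import torch.nn as nn

    class StructureAttention(nn.Module):
        """Attends over image regions; a 0/1 relation mask (assumed to come from
        dependency relations between region labels) restricts which region pairs interact."""
        def __init__(self, dim: int):
            super().__init__()
            self.query = nn.Linear(dim, dim)
            self.key = nn.Linear(dim, dim)
            self.value = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, regions: torch.Tensor, relation_mask: torch.Tensor) -> torch.Tensor:
            # regions: (B, R, dim); relation_mask: (B, R, R)
            scores = self.query(regions) @ self.key(regions).transpose(1, 2) * self.scale
            scores = scores.masked_fill(relation_mask == 0, float("-inf"))
            weights = torch.softmax(scores, dim=-1)
            return weights @ self.value(regions)  # image region relation features

    class CaptionDecoder(nn.Module):
        """Fuses relation features with region features and feeds the fused visual
        context plus word embeddings to an LSTM that emits word logits."""
        def __init__(self, vocab_size: int, dim: int = 512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.struct_att = StructureAttention(dim)
            self.fuse = nn.Linear(2 * dim, dim)
            self.lstm = nn.LSTM(2 * dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab_size)

        def forward(self, regions, relation_mask, words):
            rel = self.struct_att(regions, relation_mask)          # (B, R, dim)
            visual = self.fuse(torch.cat([regions, rel], dim=-1))  # fused region features
            context = visual.mean(dim=1, keepdim=True)             # pooled visual context
            emb = self.embed(words)                                # (B, T, dim)
            ctx = context.expand(-1, emb.size(1), -1)
            hidden, _ = self.lstm(torch.cat([emb, ctx], dim=-1))
            return self.out(hidden)                                # (B, T, vocab)

    if __name__ == "__main__":
        B, R, T, V, D = 2, 36, 5, 1000, 512
        decoder = CaptionDecoder(vocab_size=V, dim=D)
        regions = torch.randn(B, R, D)
        mask = torch.ones(B, R, R)              # toy all-ones mask, only for a shape check
        words = torch.randint(0, V, (B, T))
        print(decoder(regions, mask, words).shape)  # torch.Size([2, 5, 1000])

    In the described model the mask would be derived from dependency relations between image regions rather than the all-ones placeholder used in this toy example.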

     
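    As a companion sketch of the test-time step, the snippet below retrieves a dependency syntax template for a test image by keyword overlap with training images. The keyword sets, the Jaccard overlap measure, and the (head, relation, dependent) template format are assumptions made for illustration, not details confirmed by the paper.

    # Hypothetical sketch of the test-time template retrieval described in the abstract.
    from typing import Dict, List, Set, Tuple

    def keyword_overlap(a: Set[str], b: Set[str]) -> float:
        """Jaccard overlap between two content-keyword sets."""
        return len(a & b) / len(a | b) if a | b else 0.0

    def retrieve_template(
        test_keywords: Set[str],
        train_keywords: Dict[str, Set[str]],
        train_templates: Dict[str, List[Tuple[str, str, str]]],
    ) -> List[Tuple[str, str, str]]:
        """Return the dependency syntax template (head, relation, dependent triples)
        of the training image whose keywords overlap most with the test image."""
        best_id = max(train_keywords,
                      key=lambda i: keyword_overlap(test_keywords, train_keywords[i]))
        return train_templates[best_id]

    if __name__ == "__main__":
        train_keywords = {"img1": {"man", "horse", "field"}, "img2": {"dog", "frisbee", "grass"}}
        train_templates = {
            "img1": [("riding", "nsubj", "man"), ("riding", "obj", "horse")],
            "img2": [("catching", "nsubj", "dog"), ("catching", "obj", "frisbee")],
        }
        print(retrieve_template({"dog", "grass", "park"}, train_keywords, train_templates))
        # -> [('catching', 'nsubj', 'dog'), ('catching', 'obj', 'frisbee')]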
