Image captioning model based on divergence-based and spatial consistency constraints
-
Abstract:
The multi-head attention mechanism is widely used in image captioning models. Through its multi-branch structure, it models distinct properties of the input features, making the learned representations more discriminative. However, the independence of the branches leads to redundant modeling, and the attention mechanism may attend to unimportant image regions, so the generated descriptions are not accurate enough. To address these problems, a loss function is proposed as a regularization term of the training objective to improve the diversity and accuracy of multi-head attention. For diversity, a divergence-based regularization is proposed that encourages the different branches to attend to different parts of the described object, simplifying the modeling task of each branch; the branches are then fused to form a complete and more discriminative visual representation. For accuracy, a spatial consistency regularization is designed: by modeling the spatial relationships of multi-head attention, it encourages the attended image regions to be as concentrated as possible, suppressing the influence of background regions and improving the accuracy of the attention mechanism. The two regularizations act jointly to improve the accuracy of the image captioning model. The proposed method is validated on the MS COCO dataset and compared with a number of representative works. Experimental results show that it significantly improves the accuracy of image captioning.
Abstract: The multi-head attention mechanism has been widely adopted in image captioning; it is appealing for its ability to jointly attend to information from different representation subspaces. However, because each head captures properties of the input independently, diversity between the heads' representations is not guaranteed. Meanwhile, most existing attention models suffer from "attention defocus": they fail to concentrate on the correct image regions when generating the target words, so the generated sentences are not accurate enough. To address these problems, we propose a novel training objective that serves as an auxiliary regularization function to improve the diversity and accuracy of the multi-head attention mechanism. First, we present a divergence-based regularization that encourages each head to concentrate on a different part of the target object; the partial representations are then aggregated into a complete, more discriminative representation of the target. Second, we introduce a spatial consistency regularization that models the spatial relationships among the attended regions: by encouraging the attended regions to be spatially focused, it suppresses background interference and improves captioning accuracy. The two regularizations are applied jointly during training. We compare the proposed method with state-of-the-art methods on the challenging MS COCO dataset, and the experimental results demonstrate its superior performance.
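To make the two regularizers concrete, here is a minimal pure-Python sketch. This is not the paper's exact formulation: the function names, the cosine-similarity form of the divergence term, and the attention-weighted-variance form of the spatial term are illustrative assumptions. Attention maps are taken as probability distributions over image regions, and each region is summarized by its center coordinates.

```python
import math

def _cosine(u, v):
    # Cosine similarity between two attention distributions.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def divergence_regularizer(heads):
    """Mean pairwise cosine similarity between the attention
    distributions of the heads; minimizing this term pushes the
    branches to attend to different parts of the target."""
    h = len(heads)
    pairs = [(i, j) for i in range(h) for j in range(i + 1, h)]
    return sum(_cosine(heads[i], heads[j]) for i, j in pairs) / len(pairs)

def spatial_consistency_regularizer(att, centers):
    """Attention-weighted variance of the attended regions' (x, y)
    centers; minimizing it concentrates attention spatially and
    suppresses scattered background regions."""
    cx = sum(a * x for a, (x, _) in zip(att, centers))
    cy = sum(a * y for a, (_, y) in zip(att, centers))
    return sum(a * ((x - cx) ** 2 + (y - cy) ** 2)
               for a, (x, y) in zip(att, centers))
```

During training, both terms would be added to the usual captioning loss with trade-off weights (hypothetical names `lam1`, `lam2`): `loss = caption_loss + lam1 * divergence_regularizer(heads) + lam2 * spatial_consistency_regularizer(att, centers)`.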
-
Key words:
- multi-head attention mechanism /
- image captioning /
- diversity /
- spatial consistency /
- model fusion
-
Table 1. The effect of different regularization methods on image captioning performance

| Method | BLEU-1 | BLEU-4 | CIDEr | METEOR |
| --- | --- | --- | --- | --- |
| Transformer | 76.4 | 36.4 | 116.9 | 28.3 |
| +Cosine | 76.0 | 36.4 | 115.4 | 28.0 |
| +SRIP | 76.4 | 36.7 | 116.5 | 28.3 |
| +DBR | **76.5** | 36.8 | 117.9 | 28.3 |
| +SCR | 76.4 | **36.9** | 117.5 | **28.4** |
| Ours | **76.5** | **36.9** | **118.4** | **28.4** |

Note: bold indicates the best result in each column.

Table 2. The effect of different numbers of branches on image captioning performance

| h | Transformer | +DBR |
| --- | --- | --- |
| 4 | 116.4 | 116.8 |
| 8 | 116.9 | 117.9 |
| 16 | 117.0 | 118.3 |
| 32 | 116.5 | 117.4 |

Table 3. Performance comparison with mainstream methods on the MS COCO test set

| Method | BLEU-1 | BLEU-4 | METEOR | CIDEr |
| --- | --- | --- | --- | --- |
| Up-Down | 79.8 | 36.3 | 27.7 | 120.1 |
| STMA | 80.2 | 37.7 | 28.2 | 125.9 |
| SCAN | 80.2 | 38.0 | 28.5 | 126.1 |
| AAT | — | 38.7 | 28.6 | 128.6 |
| RD | — | 38.6 | 28.7 | 128.3 |
| Transformer | 80.1 | 38.8 | 28.7 | 127.2 |
| ORT | 80.5 | 38.6 | 28.7 | 128.3 |
| M2 | 80.8 | 39.1 | 29.2 | 131.2 |
| CaptionNet | 80.4 | 38.5 | 28.8 | 127.6 |
| TAA | 78.6 | 37.1 | 27.5 | 119.6 |
| X-LAN | 80.8 | 39.7 | 29.5 | 132.0 |
| DIC | — | 38.7 | 28.4 | 128.2 |
| GET | 81.5 | 39.5 | 29.3 | 131.6 |
| CAVP | — | 38.6 | 28.3 | 126.3 |
| SG2Caps | — | 33.0 | 26.2 | 112.3 |
| Ours | 81.7 | 39.3 | 29.2 | 132.0 |
| ORT+DBR+SCR | 81.0 | 39.3 | 29.1 | 131.3 |
| M2+DBR+SCR | 81.1 | 39.5 | 29.9 | 131.8 |

-
[1] WAN B Y, JIANG W H, FANG Y M, et al. Revisiting image captioning via maximum discrepancy competition[J]. Pattern Recognition, 2022, 122: 108358. doi: 10.1016/j.patcog.2021.108358
[2] TAN Y L, TANG P J, ZHANG L, et al. From image to language: Image captioning and description[J]. Journal of Image and Graphics, 2021, 26(4): 727-750 (in Chinese). doi: 10.11834/jig.200177
[3] SHI Y L, YANG W Z, DU H X, et al. Overview of image captions based on deep learning[J]. Acta Electronica Sinica, 2021, 49(10): 2048-2060 (in Chinese). doi: 10.12263/DZXB.20200669
[4] XU K, BA J L, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning. New York: ACM, 2015: 2048-2057.
[5] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2018: 6077-6086.
[6] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[7] VOITA E, TALBOT D, MOISEEV F, et al. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2019: 5797-5808.
[8] YANG B S, LI J A, WONG D F, et al. Context-aware self-attention networks[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 387-394. doi: 10.1609/aaai.v33i01.3301387
[9] LI J A, TU Z P, YANG B S, et al. Multi-head attention with disagreement regularization[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2018: 2897-2903.
[10] LIU C X, MAO J H, SHA F, et al. Attention correctness in neural image captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2017, 31(1): 4176-4182.
[11] ROHRBACH A, HENDRICKS L A, BURNS K, et al. Object hallucination in image captioning[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2018: 4035-4045.
[12] ZHOU Y E, WANG M, LIU D Q, et al. More grounded image captioning by distilling image-text matching model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 4777-4786.
[13] MA C Y, KALANTIDIS Y, ALREGIB G, et al. Learning to generate grounded visual captions without localization supervision[C]//Computer Vision – ECCV 2020. Berlin: Springer, 2020: 353-370.
[14] PAPANDREOU G, ZHU T, CHEN L C, et al. PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model[C]//Computer Vision – ECCV 2018. Berlin: Springer, 2018: 282-299.
[15] HE S, TAVAKOLI H R, BORJI A, et al. Human attention in image captioning: Dataset and analysis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2020: 8528-8537.
[16] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]//Computer Vision – ECCV 2014. Berlin: Springer, 2014: 740-755.
[17] BI J Q, LIU M F, HU H J, et al. Image captioning based on dependency syntax[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 431-440 (in Chinese). doi: 10.13700/j.bh.1001-5965.2020.0443
[18] JI J Z, XU C, ZHANG X D, et al. Spatio-temporal memory attention for image captioning[J]. IEEE Transactions on Image Processing, 2020, 29: 7615-7628. doi: 10.1109/TIP.2020.3004729
[19] ZHA Z J, LIU D Q, ZHANG H W, et al. Context-aware visual policy network for fine-grained image captioning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(2): 710-722. doi: 10.1109/TPAMI.2019.2909864
[20] ZHANG W Q, SHI H C, TANG S L, et al. Consensus graph representation learning for better grounded image captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(4): 3394-3402. doi: 10.1609/aaai.v35i4.16452
[21] YANG X, ZHANG H W, CAI J F. Deconfounded image captioning: A causal retrospect[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12996-13010.
[22] PAN Y W, YAO T, LI Y H, et al. X-linear attention networks for image captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 10968-10977.
[23] JI J Y, LUO Y P, SUN X S, et al. Improving image captioning by leveraging intra- and inter-layer global representation in transformer network[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(2): 1655-1663. doi: 10.1609/aaai.v35i2.16258
[24] GUO L T, LIU J, LU S C, et al. Show, tell, and polish: Ruminant decoding for image captioning[J]. IEEE Transactions on Multimedia, 2020, 22(8): 2149-2162. doi: 10.1109/TMM.2019.2951226
[25] NGUYEN K, TRIPATHI S, DU B, et al. In defense of scene graphs for image captioning[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2022: 1387-1396.
[26] SHEN T, ZHOU T Y, LONG G D, et al. DiSAN: Directional self-attention network for RNN/CNN-free language understanding[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2018: 5446-5455.
[27] STRUBELL E, VERGA P, ANDOR D, et al. Linguistically-informed self-attention for semantic role labeling[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2018: 5027-5038.
[28] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 1179-1195.
[29] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2015: 3128-3137.
[30] JIANG H Z, MISRA I, ROHRBACH M, et al. In defense of grid features for visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 10264-10273.
[31] CORNIA M, STEFANINI M, BARALDI L, et al. Meshed-memory transformer for image captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 10575-10584.
[32] BANSAL N, CHEN X H, WANG Z Y. Can we gain more from orthogonality regularizations in training deep CNNs?[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. New York: ACM, 2018: 4266-4276.
[33] HUANG L, WANG W M, XIA Y X, et al. Adaptively aligned image captioning via adaptive attention time[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems. New York: ACM, 2019: 8942-8951.
[34] HERDADE S, KAPPELER A, BOAKYE K, et al. Image captioning: Transforming objects into words[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems. New York: ACM, 2019: 11137-11147.
[35] YANG L Y, WANG H L, TANG P J, et al. CaptionNet: A tailor-made recurrent neural network for generating image descriptions[J]. IEEE Transactions on Multimedia, 2021, 23: 835-845. doi: 10.1109/TMM.2020.2990074
[36] YAN C G, HAO Y M, LI L, et al. Task-adaptive attention for image captioning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(1): 43-51. doi: 10.1109/TCSVT.2021.3067449