
Language-guided target segmentation method based on multi-granularity feature fusion

谭荃戈 王蓉 吴澳

Cite this article: 谭荃戈, 王蓉, 吴澳. 语言引导的多粒度特征融合目标分割方法[J]. 北京航空航天大学学报, 2024, 50(2): 542-550. doi: 10.13700/j.bh.1001-5965.2022.0384
Citation: TAN Q G, WANG R, WU A. Language-guided target segmentation method based on multi-granularity feature fusion[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(2): 542-550 (in Chinese). doi: 10.13700/j.bh.1001-5965.2022.0384

doi: 10.13700/j.bh.1001-5965.2022.0384
Funds: National Natural Science Foundation of China (62076246)
Corresponding author: E-mail: dbdxwangrong@163.com
CLC number: V221+.3; TB553

  • Abstract:

    Language-guided target segmentation aims to match the target described by a text expression with the entity it refers to, so as to understand the relationship between the text and the entities and to locate the referred target. The task has important application value in scenarios such as information extraction, text classification, and machine translation. Building on the Refvos model, a language-guided target segmentation method based on multi-granularity feature fusion is proposed, which can locate a specified target precisely. Swin Transformer and BERT networks are used to extract multi-granularity visual features and text features respectively, improving the representation of both the whole and the details. The text features are fused with the visual features of each granularity, so that language guidance strengthens the expression of the specified target. A convolutional long short-term memory (ConvLSTM) network then refines the multi-granularity fused features, exchanging information across granularities to obtain finer segmentation results. The proposed method is trained and tested on the UNC, UNC+, G-Ref, and ReferIt datasets. Experimental results show that, compared with Refvos, the proposed method improves IoU by 0.92% and 4.1% on the val and testB subsets of UNC, and by 1.83%, 0.63%, and 1.75% on the val, testA, and testB subsets of UNC+. It achieves IoU of 40.16% on G-Ref and 64.37% on ReferIt, on par with state-of-the-art methods, demonstrating the effectiveness and advancement of the proposed method.
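
    The pipeline summarized in the abstract (multi-granularity visual features from a Swin Transformer, a sentence-level BERT embedding, per-granularity language-guided fusion, and ConvLSTM refinement across granularities) can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the module names (LanguageGuidedFusion, ConvLSTMCell, segment) and the concatenation-based fusion operator are hypothetical simplifications for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageGuidedFusion(nn.Module):
    """Fuse a sentence-level text embedding with one visual feature map.
    Hypothetical module: the paper's exact fusion operator may differ."""

    def __init__(self, vis_dim, txt_dim, out_dim):
        super().__init__()
        self.proj = nn.Conv2d(vis_dim + txt_dim, out_dim, kernel_size=1)

    def forward(self, vis, txt):
        # vis: (B, C_v, H, W); txt: (B, C_t) sentence embedding from BERT
        txt_map = txt[:, :, None, None].expand(-1, -1, vis.size(2), vis.size(3))
        return F.relu(self.proj(torch.cat([vis, txt_map], dim=1)))


class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell [17]; here the "time" axis is the sequence of
    granularities, so information flows from coarse to fine feature maps."""

    def __init__(self, in_dim, hid_dim, k=3):
        super().__init__()
        self.hid_dim = hid_dim
        self.conv = nn.Conv2d(in_dim + hid_dim, 4 * hid_dim, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


def segment(visual_feats, text_feat, fusions, cell, head):
    """visual_feats: list of Swin Transformer feature maps, coarse to fine;
    text_feat: (B, C_t) BERT embedding; head: 1x1 conv producing mask logits."""
    h = c = None
    for vis, fuse in zip(visual_feats, fusions):
        fused = fuse(vis, text_feat)          # language-guided fusion per granularity
        if h is None:
            h = fused.new_zeros(fused.size(0), cell.hid_dim, *fused.shape[-2:])
            c = torch.zeros_like(h)
        else:
            # carry the coarser-level state up to the current resolution
            h = F.interpolate(h, size=fused.shape[-2:], mode="bilinear", align_corners=False)
            c = F.interpolate(c, size=fused.shape[-2:], mode="bilinear", align_corners=False)
        h, c = cell(fused, (h, c))            # cross-granularity refinement
    return head(h)                            # segmentation mask logits
```

    In this sketch, visual_feats would come from the stages of a Swin Transformer backbone and text_feat from BERT's sentence-level output; the fusion modules' out_dim must match the ConvLSTM cell's input dimension.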

     

  • Figure 1.  Refvos model

    Figure 2.  Vision Transformer model

    Figure 3.  Convolutional long short-term memory network

    Figure 4.  Overall framework of the proposed method

    Figure 5.  Swin Transformer structure

    Figure 6.  Window transform self-attention module

    Figure 7.  Visual feature extraction

    Figure 8.  Text feature extraction

    Figure 9.  Computation flow of multi-granularity feature optimization

    Figure 10.  Segmentation results for "the woman in the center with the scarf"

    Figure 11.  Segmentation results for "the leftmost man"

    Figure 12.  Segmentation results for "the second layer of the cake"

    Table 1.  Overall IoU comparison of experimental results (%)

    Method       | UNC[18]: val / testA / testB | UNC+[18]: val / testA / testB | G-Ref[19]: val | ReferIt[20]: test
    RMI[2]       | 45.18 / 45.69 / 45.57 | 29.86 / 30.48 / 29.50 | 34.52 | 58.73
    DMN[3]       | 49.78 / 54.83 / 45.13 | 38.88 / 44.22 / 32.29 | 36.76 | 52.81
    RRN[5]       | 55.33 / 57.26 / 53.95 | 39.75 / 42.15 / 36.11 | 36.45 | 63.63
    MattNet[24]  | 56.51 / 62.37 / 51.70 | 46.67 / 52.39 / 40.08 | –     | –
    CMSA[6]      | 58.32 / 60.61 / 55.09 | 43.76 / 47.60 / 37.89 | 39.98 | 63.80
    Refvos[10]   | 59.45 / 63.19 / 54.17 | 44.71 / 49.73 / 36.17 | –     | –
    STEP[7]      | 60.04 / 63.46 / 58.97 | 48.18 / 52.33 / 40.41 | 46.40 | 64.13
    BRINET[25]   | 61.35 / 63.37 / 59.57 | 48.57 / 52.87 / 42.13 | 47.57 | –
    CMPC[8]      | 61.36 / 64.53 / 59.64 | 49.56 / 53.44 / 43.23 | 49.05 | 65.53
    Proposed     | 60.37 / 62.31 / 58.27 | 46.54 / 50.36 / 37.92 | 40.16 | 64.37

    Table 2.  Overall IoU comparison of ablation experiment results (%)

    Method                        | val   | testA | testB
    Single granularity            | 55.31 | 55.70 | 53.85
    Without feature optimization  | 57.99 | 59.62 | 56.62
    Proposed method               | 60.37 | 62.31 | 58.27
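
    The "overall IoU" reported in Tables 1 and 2 is, in the referring-segmentation literature, usually the cumulative intersection over cumulative union across the whole test split, rather than a per-image average. The sketch below only illustrates that common convention (the function name and binary-mask array format are assumptions, not taken from the paper):

```python
import numpy as np


def overall_iou(pred_masks, gt_masks):
    """Cumulative (overall) IoU over a test split.
    pred_masks, gt_masks: iterables of binary (H, W) arrays, one pair per sample."""
    inter = union = 0
    for p, g in zip(pred_masks, gt_masks):
        p, g = p.astype(bool), g.astype(bool)
        inter += np.logical_and(p, g).sum()
        union += np.logical_or(p, g).sum()
    return 100.0 * inter / union if union > 0 else 0.0  # reported as a percentage
```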
  • [1] HU R H, ROHRBACH M, DARRELL T. Segmentation from natural language expressions[C]//Proceeding of the European Conference on Computer Vision. Berlin: Springer, 2016: 108-124.
    [2] LIU C X, LIN Z, SHEN X H, et al. Recurrent multimodal interaction for referring image segmentation[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2017: 1280-1289.
    [3] MARGFFOY-TUAY E, PÉREZ J C, BOTERO E, et al. Dynamic multimodal instance segmentation guided by natural language queries[C]//Proceedings of the European Conference on Computer Vision. New York: ACM, 2018: 656–672.
    [4] LEI T, ZHANG Y. Training RNNs as fast as CNNs[EB/OL]. (2017-09-12)[2022-01-05]. https://arxiv.org/abs/1709.02755v2.
    [5] LI R Y, LI K C, KUO Y C, et al. Referring image segmentation via recurrent refinement networks[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2018: 5745-5753.
    [6] YE L W, ROCHAN M, LIU Z, et al. Cross-modal self-attention network for referring image segmentation[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 10494-10503.
    [7] CHEN D J, JIA S H, LO Y C, et al. See-through-text grouping for referring image segmentation[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2019: 7453-7462.
    [8] HUANG S F, HUI T R, LIU S, et al. Referring image segmentation via cross-modal progressive comprehension[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 10485-10494.
    [9] HUI T R, LIU S, HUANG S F, et al. Linguistic structure guided context modeling for referring image segmentation[C]//Proceeding of the European Conference on Computer Vision. Berlin: Springer, 2020: 59-75.
    [10] BELLVER M, VENTURA C, SILBERER C, et al. A closer look at referring expressions for video object segmentation[J]. Multimedia Tools and Applications, 2023, 82(3): 4419-4438. doi: 10.1007/s11042-022-13413-x
    [11] CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation[EB/OL]. (2017-12-05)[2022-01-12]. https://doi.org/10.48550/arXiv.1706.05587.
    [12] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2018-11-11)[2022-01-12]. https://arxiv.org/abs/1810.04805.
    [13] LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: Hierarchical vision transformer using shifted windows[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2021: 9992-10002.
    [14] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
    [15] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03)[2021-12-11]. https://doi.org/10.48550/arXiv.2010.11929.
    [16] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780. doi: 10.1162/neco.1997.9.8.1735
    [17] SHI X J, CHEN Z R, WANG H, et al. Convolutional LSTM network: A machine learning approach for precipitation nowcasting[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. New York: ACM, 2015: 802-810.
    [18] YU L C, POIRSON P, YANG S, et al. Modeling context in referring expressions[C]//Proceeding of the European Conference on Computer Vision. Berlin: Springer, 2016: 69-85.
    [19] MAO J H, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 11-20.
    [20] KAZEMZADEH S, ORDONEZ V, MATTEN M, et al. ReferItGame: Referring to objects in photographs of natural scenes[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2014: 787-798.
    [21] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]//Proceeding of the European Conference on Computer Vision. Berlin: Springer, 2014: 740-755.
    [22] ESCALANTE H J, HERNÁNDEZ C A, GONZALEZ J A, et al. The segmented and annotated IAPR TC-12 benchmark[J]. Computer Vision and Image Understanding, 2010, 114(4): 419-428. doi: 10.1016/j.cviu.2009.03.008
    [23] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90. doi: 10.1145/3065386
    [24] YU L C, LIN Z, SHEN X H, et al. MAttNet: Modular attention network for referring expression comprehension[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2018: 1307-1315.
    [25] HU Z W, FENG G, SUN J Y, et al. Bi-directional relationship inferring network for referring image segmentation[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 4423-4432.
Figures (12) / Tables (2)
Publication history
  • Received: 2022-05-18
  • Accepted: 2022-06-23
  • Published online: 2022-10-25
  • Issue date: 2024-02-27
