
A referring image segmentation method based on bidirectional vision-language interaction module

DENG Yupeng, GUO Fang, WANG Rong, SONG Zhenfeng

Citation: DENG Y P, GUO F, WANG R, et al. A referring image segmentation method based on bidirectional vision-language interaction module[J]. Journal of Beijing University of Aeronautics and Astronautics, 2025, 51(12): 4353-4360 (in Chinese). doi: 10.13700/j.bh.1001-5965.2024.0462


doi: 10.13700/j.bh.1001-5965.2024.0462

More Information
    Corresponding author. E-mail: songzhenfeng@ppsuc.edu.cn

  • CLC number: V221+.3; TB553


Funds: 

National Natural Science Foundation of China (62076246); Fundamental Research Funds for the Central Universities (2023JKF01ZK07)

  • Abstract:

    Referring image segmentation aims to segment the target region described by a natural language expression from an image. Existing methods fuse cross-modal features insufficiently, which leads to inaccurate discrimination between the target region and the background. To address this, a referring image segmentation method based on bidirectional vision-language interaction is proposed. First, a multi-layer visual feature encoding network extracts multi-scale visual features, strengthening each pixel's perception of its surroundings and enriching its semantic information. The multi-scale visual features are then fed into a bidirectional vision-language attention decoding module, where a cross-modal attention mechanism aligns the two modalities and enhances the context awareness of the cross-modal features. Finally, a pixel-wise segmentation mask is obtained by bilinear-interpolation upsampling. In comparative experiments with mainstream methods on the RefCOCO, RefCOCO+, and G-Ref datasets, the proposed method achieves overall IoU values of 74.85%, 66.18%, and 64.95%, respectively, surpassing the LAVT algorithm by 2.12%, 4.04%, and 3.71%, which demonstrates that the proposed method effectively improves referring image segmentation performance.
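    The decoding stage described in the abstract (cross-modal attention between visual and language features, followed by bilinear upsampling to a pixel-wise mask) can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the authors' released code: the class name BidirectionalCrossAttention, the feature dimensions, and the single-scale fusion are assumptions (the paper fuses features at multiple scales).

```python
# Minimal sketch of a bidirectional vision-language attention step
# (illustrative only; names and dimensions are assumptions, not the
# authors' released implementation, which fuses multiple scales).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # vision attends to language, and language attends to vision
        self.v2l = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis, lang):
        # vis:  (B, H*W, C) flattened visual features
        # lang: (B, L, C)   token-level language features (e.g., from BERT)
        vis_out, _ = self.v2l(query=vis, key=lang, value=lang)    # language-aware pixels
        lang_out, _ = self.l2v(query=lang, key=vis, value=vis)    # vision-aware words
        return vis + vis_out, lang + lang_out                     # residual fusion

# Toy usage: fuse a 1/16-resolution feature map with a 20-token sentence,
# predict a 1-channel mask, and upsample bilinearly to the input size.
B, C, H, W, L = 2, 256, 30, 30, 20
vis = torch.randn(B, C, H, W)
lang = torch.randn(B, L, C)

block = BidirectionalCrossAttention(C)
vis_seq = vis.flatten(2).transpose(1, 2)                          # (B, H*W, C)
vis_fused, lang_fused = block(vis_seq, lang)
vis_fused = vis_fused.transpose(1, 2).reshape(B, C, H, W)

mask_logits = nn.Conv2d(C, 1, kernel_size=1)(vis_fused)          # (B, 1, H, W)
mask = F.interpolate(mask_logits, size=(480, 480),
                     mode="bilinear", align_corners=False)       # pixel-wise mask
```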


  • Figure 1. Overall framework of referring image segmentation based on bidirectional vision-language interaction module

    Figure 2. Structure of Vicinity Transformer block

    Figure 3. Architecture of VVT

    Figure 4. Structure of language encoder

    Figure 5. Bidirectional vision-language interaction attention

    Figure 6. Visualization of segmentation results

    Figure 7. Visualization of feature maps and segmentation results

    Table 1. Comparison of OIoU results of different methods on three datasets (%)

    Method       Backbone      RefCOCO                  RefCOCO+                 G-Ref
                               val     testA   testB    val     testA   testB    val(U)  test(U)  val(G)
    MCN[2]       Darknet-53    62.44   64.20   59.71    50.62   54.99   44.69    49.22   49.40    -
    BRINet[21]   ResNet-101    60.98   62.99   59.21    48.17   52.32   42.11    63.46   -        -
    EFN[22]      ResNet-101    62.76   65.69   59.67    51.50   55.24   43.01    66.70   -        -
    LTS[23]      Darknet-53    65.43   67.76   63.08    54.21   58.32   48.02    54.40   54.24    -
    VLT[12]      Darknet-53    65.65   68.29   62.73    55.50   59.20   49.36    52.99   56.65    -
    ReSTR[7]     ViT-B-16      67.22   69.30   64.45    55.78   60.44   48.27    54.48   70.18    -
    CRIS[24]     ResNet-101    70.47   73.18   66.10    62.27   68.08   53.68    59.87   60.36    -
    DMMI[25]     Swin-B        74.13   77.13   70.16    63.98   69.73   57.03    63.46   64.19    61.98
    SADLR[14]    Swin-B        74.24   76.25   70.06    64.28   69.09   55.19    63.60   63.56    61.16
    LAVT[20]     Swin-B        72.73   75.82   68.79    62.14   68.38   55.10    61.24   62.09    -
    Ours         VVT           74.83   77.86   71.70    66.18   72.41   58.24    64.95   64.25    63.76

    Table 2. Comparative experimental results of visual backbone networks (%)

    Visual backbone     P@0.5   P@0.7   P@0.9   mIoU    OIoU
    ResNet-50           84.97   75.71   36.83   74.52   71.87
    Swin Transformer    86.51   78.66   38.29   76.02   73.62
    VVT                 86.95   80.57   39.89   77.53   74.83

    Table 3. Comparative experimental results of cross-modal fusion methods (%)

    Fusion method       P@0.5   P@0.7   P@0.9   mIoU    OIoU
    Mul                 79.36   72.17   31.72   72.31   68.46
    PWAM                82.84   79.28   38.72   76.58   73.31
    Attention-Guided    86.95   80.57   39.89   77.53   74.83
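    For reference, the three metrics used in Tables 1 to 3 are standard in the referring segmentation literature: overall IoU (OIoU) accumulates intersection and union over all test samples, mean IoU (mIoU) averages per-sample IoU so that small objects count equally, and P@X is the percentage of samples whose IoU exceeds the threshold X. A minimal NumPy sketch (function and variable names are our own, not from the paper):

```python
import numpy as np

def referring_seg_metrics(preds, gts, thresholds=(0.5, 0.7, 0.9)):
    """preds, gts: lists of boolean masks, one pair per test sample."""
    inter_total, union_total, ious = 0, 0, []
    for pred, gt in zip(preds, gts):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        inter_total += inter
        union_total += union
        ious.append(inter / union if union > 0 else 1.0)  # empty-vs-empty counts as perfect
    ious = np.asarray(ious)
    metrics = {
        "OIoU": inter_total / union_total,   # dataset-level IoU (favors large objects)
        "mIoU": ious.mean(),                 # per-sample average (treats objects equally)
    }
    for t in thresholds:
        metrics[f"P@{t}"] = (ious > t).mean()  # fraction of samples above IoU threshold t
    return metrics

# Toy check with two random 10x10 masks
rng = np.random.default_rng(0)
preds = [rng.random((10, 10)) > 0.5 for _ in range(2)]
gts   = [rng.random((10, 10)) > 0.5 for _ in range(2)]
print(referring_seg_metrics(preds, gts))
```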
  • [1] HU R H, ROHRBACH M, DARRELL T. Segmentation from natural language expressions[C]//Proceedings of the Computer Vision – ECCV 2016. Cham: Springer International Publishing, 2016: 108-124.
    [2] LUO G, ZHOU Y Y, SUN X S, et al. Multi-task collaborative network for joint referring expression comprehension and segmentation[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 10031-10040.
    [3] LIU C X, LIN Z, SHEN X H, et al. Recurrent multimodal interaction for referring image segmentation[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2017: 1280-1289.
    [4] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
    [5] BELLVER M, VENTURA C, SILBERER C, et al. A closer look at referring expressions for video object segmentation[J]. Multimedia Tools and Applications, 2023, 82(3): 4419-4438. doi: 10.1007/s11042-022-13413-x
    [6] CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation[EB/OL]. (2017-06-17)[2022-05-20]. https://doi.org/10.48550/arXiv.1706.05587.
    [7] KIM N, KIM D, KWAK S, et al. ReSTR: convolution-free referring image segmentation using transformers[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 18124-18133.
    [8] MARGFFOY-TUAY E, PÉREZ J C, BOTERO E, et al. Dynamic multimodal instance segmentation guided by natural language queries[C]//Proceedings of the Computer Vision – ECCV 2018. Cham: Springer International Publishing, 2018: 656-672.
    [9] LIU S, HUI T R, HUANG S F, et al. Cross-modal progressive comprehension for referring segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(9): 4761-4775.
    [10] YE L W, ROCHAN M, LIU Z, et al. Cross-modal self-attention network for referring image segmentation[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 10494-10503.
    [11] LI W, GAO C, NIU G, et al. UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning[EB/OL]. (2020-12-31)[2022-05-21]. https://doi.org/10.48550/arXiv.2012.15409.
    [12] DING H H, LIU C, WANG S C, et al. Vision-language transformer and query generation for referring segmentation[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2021: 16301-16310.
    [13] LI J, LI D, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[C]//Proceedings of the International Conference on Machine Learning. Honolulu: PMLR, 2023: 19730-19742.
    [14] YANG Z, WANG J Q, TANG Y S, et al. Semantics-aware dynamic localization and refinement for referring image segmentation[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(3): 3222-3230. doi: 10.1609/aaai.v37i3.25428
    [15] SUN W X, QIN Z, DENG H, et al. Vicinity vision transformer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(10): 12635-12649. doi: 10.1109/TPAMI.2023.3285569
    [16] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Minneapolis: Association for Computational Linguistics, 2019: 4171-4186.
    [17] YU L C, POIRSON P, YANG S, et al. Modeling context in referring expressions[C]// Computer Vision – ECCV 2016. Cham: Springer International Publishing, 2016: 69-85.
    [18] MAO J H, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 11-20.
    [19] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the Computer Vision – ECCV 2014. Cham: Springer International Publishing, 2014: 740-755.
    [20] YANG Z, WANG J Q, TANG Y S, et al. LAVT: language-aware vision transformer for referring image segmentation[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 18134-18144.
    [21] FENG G, HU Z W, ZHANG L H, et al. Bidirectional relationship inferring network for referring image localization and segmentation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(5): 2246-2258. doi: 10.1109/TNNLS.2021.3106153
    [22] FENG G, HU Z W, ZHANG L H, et al. Encoder fusion network with co-attention embedding for referring image segmentation[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 15501-15510.
    [23] JING Y, KONG T, WANG W, et al. Locate then segment: a strong pipeline for referring image segmentation[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 9853-9862.
    [24] WANG Z Q, LU Y, LI Q, et al. CRIS: CLIP-driven referring image segmentation[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 11676-11685.
    [25] HU Y T, WANG Q X, SHAO W Q, et al. Beyond one-to-one: rethinking the referring image segmentation[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2023: 4044-4054.
Publication history
  • Received: 2022-06-21
  • Accepted: 2024-09-27
  • Available online: 2024-10-16
  • Issue published: 2025-12-31
