
Remote sensing image-text retrieval based on layout-semantic joint representation

ZHANG Ruoyu, NIE Jie, SONG Ning, ZHENG Chengyu, WEI Zhiqiang

Cite this article: 张若愚,聂婕,宋宁,等. 基于布局化-语义联合表征遥感图文检索方法[J]. 北京航空航天大学学报,2024,50(2):671-683. doi: 10.13700/j.bh.1001-5965.2022.0527
Citation: ZHANG R Y, NIE J, SONG N, et al. Remote sensing image-text retrieval based on layout semantic joint representation[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(2): 671-683 (in Chinese). doi: 10.13700/j.bh.1001-5965.2022.0527

doi: 10.13700/j.bh.1001-5965.2022.0527

Funds: National Key Research and Development Program of China (2021YFF070400); National Natural Science Foundation of China (62072418, 62172376); the Fundamental Research Funds for the Central Universities (202042008)

Corresponding author: E-mail: niejie@ouc.edu.cn

CLC number: P407.8
  • Abstract:

    Remote sensing image-text retrieval can extract valuable information from remote sensing data of diverse categories and complex content, and is of great significance for environmental assessment, urban planning, and disaster prediction. However, cross-modal remote sensing image-text retrieval suffers from a key problem: the spatial layout information of remote sensing images is ignored. This manifests in two ways: ① long-distance modeling of remote sensing targets is difficult; ② adjacent secondary targets are overwhelmed. To address these problems, a cross-modal remote sensing image-text retrieval method based on layout-semantic joint representation (SL-SJR) is proposed, consisting of a dominant-semantic-supervised layout visual feature extraction (DSSL) module, a layout visual-global semantic cross guidance (LV-GSCG) module, and a multi-view matching (MVM) module. The DSSL module performs layout modeling of the image under the supervision of dominant semantic category features. The LV-GSCG module computes the similarity between the layout visual features and the global semantic features extracted from the text to realize interaction between features of different modalities. The MVM module establishes a multi-view metric matching mechanism guided by cross-modal features to eliminate the semantic gap between cross-modal data. Experiments on four benchmark remote sensing image-text datasets show that the proposed method achieves state-of-the-art performance on most cross-modal remote sensing image-text retrieval tasks.
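    To make the division of labor between the three modules easier to follow, below is a minimal, self-contained sketch of the retrieval-time data flow. Everything in it is a placeholder inferred from the abstract (the class name SLSJRSketch, the linear projections, mean pooling, and cosine similarity are illustrative assumptions, not the authors' architecture).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SLSJRSketch(nn.Module):
    """Placeholder data flow for the three modules described in the abstract."""

    def __init__(self, dim=512):
        super().__init__()
        # DSSL stand-in: maps per-patch visual features to layout visual features
        # (the dominant-semantic supervision is a training-time loss, not shown)
        self.dssl = nn.Linear(dim, dim)
        # Text stand-in: projects sentence features to a global semantic vector
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, patch_feats, text_feats):
        # patch_feats: (B, N, D) features of N image blocks; text_feats: (B, D)
        layout = self.dssl(patch_feats)              # layout visual features
        global_sem = self.text_proj(text_feats)      # global semantics from text
        # LV-GSCG stand-in: similarity between (pooled) layout visual features
        # and the text's global semantic features
        img_vec = F.normalize(layout.mean(dim=1), dim=-1)
        txt_vec = F.normalize(global_sem, dim=-1)
        return img_vec @ txt_vec.t()                 # (B, B) similarity matrix

sim = SLSJRSketch()(torch.randn(4, 16, 512), torch.randn(4, 512))
print(sim.shape)  # torch.Size([4, 4]); MVM would then fuse such similarity views
```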

     

  • Figure 1.  Overall architecture of the model

    Figure 2.  Layout visual feature extraction module under dominant semantic supervision

    Figure 3.  Layout visual-global semantic cross guidance module

    Figure 4.  Image-to-text generation results

    Figure 5.  Text-to-image generation results

    Table 1.  Comparative experimental results on RSICD (i2t: image-to-text retrieval; t2i: text-to-image retrieval)

    Method        R1(i2t)  R1(t2i)  R5(i2t)  R5(t2i)  R10(i2t)  R10(t2i)  Rm
    SCAN t2i      4.42     4.02     11.20    11.54    17.68     18.60     11.24
    SCAN i2t      5.90     3.86     13.21    16.83    19.96     26.49     14.38
    CAMP-triplet  5.22     4.30     13.05    17.10    21.02     27.54     14.71
    CAMP-bce      4.50     3.03     10.08    15.12    16.48     23.05     12.04
    MTFN          5.01     4.96     12.86    12.24    21.30     29.12     14.24
    AMFMN         5.39     5.05     15.32    18.19    28.82     29.70     17.07
    Ours          5.31     5.32     19.12    19.69    31.47     32.50     18.90
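    The Rm column is consistent with the mean of the six recall values in each row; for the proposed method, (5.31 + 5.32 + 19.12 + 19.69 + 31.47 + 32.50) / 6 ≈ 18.90. A trivial helper for checking the tables:

```python
def mean_recall(recalls):
    """Rm: average of R1/R5/R10 over both retrieval directions (6 values)."""
    assert len(recalls) == 6
    return sum(recalls) / len(recalls)

# Reproduces the Rm entry for "Ours" in Table 1
print(f"{mean_recall([5.31, 5.32, 19.12, 19.69, 31.47, 32.50]):.2f}")  # 18.90
```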

    Table 2.  Comparative experimental results on RSITMD

    Method        R1(i2t)  R1(t2i)  R5(i2t)  R5(t2i)  R10(i2t)  R10(t2i)  Rm
    SCAN t2i      10.59    10.04    28.72    29.54    38.41     42.91     26.70
    SCAN i2t      10.84    9.62     25.86    29.80    37.26     41.06     25.74
    CAMP-triplet  11.82    8.69     27.40    27.17    38.12     43.60     26.13
    CAMP-bce      9.21     6.81     22.56    25.65    35.73     40.05     23.34
    MTFN          10.80    9.82     27.68    30.28    36.40     48.27     27.21
    AMFMN         11.37    9.04     27.70    32.46    39.29     49.68     28.25
    Ours          13.05    9.82     29.42    35.39    39.82     52.65     30.03

    Table 3.  Comparative experimental results on UCM

    Method        R1(i2t)  R1(t2i)  R5(i2t)  R5(t2i)  R10(i2t)  R10(t2i)  Rm
    SCAN t2i      13.68    11.06    40.92    48.28    63.28     69.40     41.10
    SCAN i2t      12.54    11.24    42.15    46.26    65.45     74.93     42.10
    CAMP-triplet  10.37    8.30     38.71    45.13    62.78     70.50     39.30
    CAMP-bce      14.22    10.91    39.02    47.95    62.98     72.56     41.27
    MTFN          10.89    14.06    38.16    49.38    60.05     76.22     41.46
    AMFMN         11.90    13.81    41.42    45.52    61.90     70.67     40.87
    Ours          14.76    11.33    39.52    50.86    61.43     82.29     43.37

    Table 4.  Comparative experimental results on Sydney

    Method        R1(i2t)  R1(t2i)  R5(i2t)  R5(t2i)  R10(i2t)  R10(t2i)  Rm
    SCAN t2i      17.26    16.83    40.60    56.04    58.05     71.24     43.34
    SCAN i2t      19.48    15.62    44.81    58.20    55.28     73.53     44.49
    CAMP-triplet  17.80    14.50    45.28    45.63    61.44     69.55     42.37
    CAMP-bce      14.26    13.41    42.53    50.32    57.35     73.86     41.96
    MTFN          16.23    14.05    42.09    56.95    52.68     77.18     43.20
    AMFMN         15.52    14.14    43.10    56.21    60.34     75.52     44.14
    Ours          15.52    16.90    50.00    55.17    67.24     80.69     47.59

    Table 5.  Ablation experimental results

    Method      R1(i2t)  R1(t2i)  R5(i2t)  R5(t2i)  R10(i2t)  R10(t2i)  Rm
    AMFMN       10.62    8.50     27.65    33.01    40.04     51.64     28.58
    AMFMN+A     11.06    9.16     27.87    34.78    42.26     54.34     29.91
    AMFMN+A+B   13.05    9.82     29.42    35.39    39.82     52.65     30.03
    Note: Bold values indicate the best results.

    Table 6.  Ablation experimental results for the number of image blocks

    Blocks  Block size  R1(i2t)  R1(t2i)  R5(i2t)  R5(t2i)  R10(i2t)  R10(t2i)  Rm
    64      32          8.85     9.42     27.21    34.07    42.04     54.38     29.33
    16      64          12.83    8.19     27.88    34.12    42.26     55.53     30.13
    4       128         11.73    8.72     28.98    33.89    40.93     54.03     29.71
    Note: Bold values indicate the best results.
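    All three settings in Table 6 tile the same input: 64 blocks of 32×32, 16 of 64×64, and 4 of 128×128 each cover a 256×256 image, so the ablation varies patch granularity at fixed resolution. A minimal sketch of such a non-overlapping block partition, assuming square inputs divisible by the block size (not the authors' code):

```python
import torch

def partition_blocks(image: torch.Tensor, block: int) -> torch.Tensor:
    """Split (C, H, W) into a sequence of non-overlapping (C, block, block) tiles."""
    c, h, w = image.shape
    assert h % block == 0 and w % block == 0
    tiles = image.unfold(1, block, block).unfold(2, block, block)  # (C, H//b, W//b, b, b)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, block, block)

img = torch.randn(3, 256, 256)
for b in (32, 64, 128):
    print(b, partition_blocks(img, b).shape[0])  # 32 -> 64 blocks, 64 -> 16, 128 -> 4
```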

    Table 7.  Ablation experimental results for parameters α1 and α2 in the similarity measurement

    (α1, α2)          R1(i2t)  R1(t2i)  R5(i2t)  R5(t2i)  R10(i2t)  R10(t2i)  Rm
    (1, 1)            0.44     0.40     2.43     2.21     4.65      3.81      2.32
    (0.1, 0.1)        10.62    8.89     28.10    31.02    41.15     47.70     27.91
    (0.05, 0.05)      10.62    9.96     28.10    34.34    41.37     49.87     29.04
    (0.01, 0.01)      12.83    8.19     27.88    34.12    42.26     55.53     30.13
    (0.0005, 0.0005)  10.37    8.27     26.55    34.25    39.38     55.00     28.97
    (0.0001, 0.0001)  9.96     7.92     23.67    30.49    38.05     52.30     27.06
    Note: Bold values indicate the best results.
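    The page does not spell out how α1 and α2 enter the similarity measurement. The natural reading, assumed here, is a weighted combination of one main and two auxiliary similarity views; the symbols S0, S1, S2 below are illustrative labels, not the paper's notation:

```latex
S(I, T) = S_0(I, T) + \alpha_1 S_1(I, T) + \alpha_2 S_2(I, T)
```

    Under this reading, Table 7 suggests the auxiliary views help only as small corrections: (0.01, 0.01) performs best (Rm = 30.13), while (1, 1) collapses retrieval almost entirely.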

    Table 8.  Ablation experimental results for β in the loss function

    (β, 1−β)    R1(i2t)  R1(t2i)  R5(i2t)  R5(t2i)  R10(i2t)  R10(t2i)  Rm
    (1, 0)      13.05    9.82     29.42    35.39    39.82     52.65     30.025
    (0.5, 0.5)  12.83    8.19     27.88    34.12    42.26     55.53     30.135
    (0, 1)      11.28    11.37    28.32    34.69    43.58     49.87     29.85
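    The (β, 1−β) header suggests β interpolates between two loss terms; under that assumption (L1 and L2 are illustrative labels for the two unnamed terms):

```latex
\mathcal{L} = \beta \mathcal{L}_1 + (1 - \beta) \mathcal{L}_2
```

    By Rm, the balanced setting (0.5, 0.5) is best (30.135), marginally ahead of using the first term alone (30.025).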
  • [1] LIU Y J, LI X F, REN Y B. A deep learning model for oceanic mesoscale eddy detection based on multi-source remote sensing imagery[C]// IGARSS 2020 - 2020 IEEE International Geoscience and Remote Sensing Symposium. Piscataway: IEEE Press, 2020: 6762-6765.
    [2] ZHANG Q L, SETO K C. Mapping urbanization dynamics at regional and global scales using multi-temporal DMSP/OLS nighttime light data[J]. Remote Sensing of Environment, 2011, 115(9): 2320-2329. doi: 10.1016/j.rse.2011.04.032
    [3] NOGUEIRA K, FADEL S G, DOURADO I C, et al. Exploiting ConvNet diversity for flooding identification[J]. IEEE Geoscience and Remote Sensing Letters, 2018, 15(9): 1446-1450. doi: 10.1109/LGRS.2018.2845549
    [4] CHENG Q M, ZHOU Y Z, FU P, et al. A deep semantic alignment network for cross-modal image-text retrieval in remote sensing[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 4284-4297. doi: 10.1109/JSTARS.2021.3070872
    [5] YUAN Z Q, ZHANG W K, FU K, et al. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 4404119.
    [6] GE Y, MA L, YE F M, et al. Remote sensing image retrieval based on multi-scale pooling and norm attention mechanism[J]. Journal of Electronics & Information Technology, 2022, 44(2): 543-551 (in Chinese).
    [7] LI Y F, FAN X J, YANG X B, et al. Remote sensing image classification framework based on self-attention convolutional neural network[J]. Journal of Beijing Forestry University, 2021, 43(10): 81-88 (in Chinese). doi: 10.12171/j.1000-1522.20210196
    [8] YUAN Z Q, ZHANG W K, TIAN C Y, et al. Remote sensing cross-modal text-image retrieval based on global and local information[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 3163706.
    [9] RONG X E, SUN X, DIAO W H, et al. Historical information-guided class-incremental semantic segmentation in remote sensing images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5622618.
    [10] ZHANG Z Y, HAN X, LIU Z Y, et al. ERNIE: Enhanced language representation with informative entities[C]//Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 1441-1451.
    [11] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[C]//International Conference on Learning Representations. Washington DC: ICLR, 2020: 16-28.
    [12] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]//European Conference on Computer Vision. Berlin: Springer, 2020: 213-229.
    [13] SRINIVAS A, LIN T Y, PARMAR N, et al. Bottleneck transformers for visual recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 16514-16524.
    [14] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
    [15] LIU S L, ZHANG L, YANG X, et al. Query2Label: a simple transformer way to multi-label classification[C]//IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 661-670.
    [16] MESSINA N, AMATO G, FALCHI F, et al. Towards efficient cross-modal visual textual retrieval using transformer-encoder deep features[C]//2021 International Conference on Content-Based Multimedia Indexing. Piscataway: IEEE Press, 2021: 1-6.
    [17] GABEUR V, SUN C, ALAHARI K, et al. Multi-modal transformer for video retrieval[C]//European Conference on Computer Vision. Berlin: Springer, 2020: 214-229.
    [18] MALEKI D, TIZHOOSH H R. LILE: look in-depth before looking elsewhere: a dual attention network using transformers for cross-modal information retrieval in histopathology archives[C]//Proceedings of Machine Learning Research. PMLR, 2022: 3002-3013.
    [19] SHI Z W, ZOU Z X. Can a machine generate humanlike language descriptions for a remote sensing image?[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(6): 3623-3634. doi: 10.1109/TGRS.2017.2677464
    [20] LU X Q, WANG B Q, ZHENG X T, et al. Exploring models and data for remote sensing image caption generation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2018, 56(4): 2183-2195. doi: 10.1109/TGRS.2017.2776321
    [21] HOXHA G, MELGANI F, SLAGHENAUFFI J. A new CNN-RNN framework for remote sensing image captioning[C]//2020 Mediterranean and Middle-East Geoscience and Remote Sensing Symposium. Piscataway: IEEE Press, 2020: 1-4.
    [22] LI X L, ZHANG X T, HUANG W, et al. Truncation cross entropy loss for remote sensing image captioning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2021, 59(6): 5246-5257. doi: 10.1109/TGRS.2020.3010106
    [23] LU X Q, WANG B Q, ZHENG X T. Sound active attention framework for remote sensing image captioning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58(3): 1985-2000. doi: 10.1109/TGRS.2019.2951636
    [24] FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: improving visual-semantic embeddings with hard negatives[C]//British Machine Vision Conference. London: British Machine Vision Association, 2017: 1707-1717.
    [25] LEE K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching[C]//European Conference on Computer Vision. Berlin: Springer, 2018: 212-228.
    [26] WANG T, XU X, YANG Y, et al. Matching images and text with multi-modal tensor fusion and re-ranking[C]//Proceedings of the 27th ACM International Conference on Multimedia. New York: ACM, 2019: 12-20.
    [27] DEVLIN J, CHENG H, FANG H, et al. Language models for image captioning: the quirks and what works[J]. Computer Science, 2015, 2(53): 100-105.
    [28] ABDULLAH T, BAZI Y, AL RAHHAL M M, et al. TextRS: deep bidirectional triplet network for matching text to remote sensing images[J]. Remote Sensing, 2020, 12(3): 405. doi: 10.3390/rs12030405
    [29] QU B, LI X L, TAO D L, et al. Deep semantic understanding of high resolution remote sensing image[C]// International Conference on Computer. Piscataway: IEEE Press, 2016: 124-128.
    [30] WANG Z H, LIU X H, LI H S, et al. CAMP: cross-modal adaptive message passing for text-image retrieval[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE Press, 2019: 5763-5772.
Publication history
  • Received: 2022-06-22
  • Accepted: 2022-07-08
  • Published online: 2023-01-12
  • Issue published: 2024-02-27
