Zero-shot object detection based on multi-modal joint semantic perception

DUAN Lijuan, YUAN Ying, WANG Wenjian, LIANG Fangfang

Citation: DUAN L J, YUAN Y, WANG W J, et al. Zero-shot object detection based on multi-modal joint semantic perception[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(2): 368-375 (in Chinese). doi: 10.13700/j.bh.1001-5965.2022.0392

doi: 10.13700/j.bh.1001-5965.2022.0392
Funds: National Natural Science Foundation of China (62176009, 62106065); Program of Beijing Municipal Education Commission (KZ201910005008)
Corresponding author E-mail: ljduan@bjut.edu.cn
  • CLC number: TP391

  • Abstract:

    Zero-shot object detection uses semantic embeddings as guiding information: the visual features of unseen objects and the class semantic embeddings are mapped into a common space, and objects are classified by their distance in that space. However, because the semantic information comes from a single source, the visual information lacks a reliable representation and background is easily confused with unseen objects, making it hard to align vision and semantics consistently. To address this, a visual context module is used to capture the contextual information of visual features, and a semantic optimization module interactively fuses the textual context with the visual context, enriching the visual representation so that the model perceives the discriminative semantics of the foreground and thus performs zero-shot object detection effectively. Experiments on two splits of MS-COCO show improved precision and recall for both zero-shot object detection and generalized zero-shot object detection, demonstrating the effectiveness of the proposed method.
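    The abstract describes the pipeline only at a high level, so the following is a minimal PyTorch sketch of the underlying zero-shot classification idea: region visual features and class semantic embeddings are projected into one joint space, and each region is scored by its similarity to every class embedding. The context-fusion step here (mean-pooled image context mixed into each region feature) is only a stand-in for the paper's visual context and semantic optimization modules; the module names, dimensions, and fusion scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbeddingHead(nn.Module):
    """Scores region proposals against class semantic embeddings in a shared space."""

    def __init__(self, visual_dim=1024, semantic_dim=300, joint_dim=256):
        super().__init__()
        # Stand-in for context fusion: mix each region feature with a global
        # (image-level) context vector before projecting it.
        self.context_fuse = nn.Linear(2 * visual_dim, visual_dim)
        # Project region visual features into the joint space.
        self.visual_proj = nn.Linear(visual_dim, joint_dim)
        # Project class semantic embeddings (word vectors or caption embeddings,
        # including those of unseen classes) into the same joint space.
        self.semantic_proj = nn.Linear(semantic_dim, joint_dim)

    def forward(self, region_feats, class_embeds):
        # region_feats: (num_regions, visual_dim) pooled RoI features
        # class_embeds: (num_classes, semantic_dim) one embedding per class
        context = region_feats.mean(dim=0, keepdim=True).expand_as(region_feats)
        fused = F.relu(self.context_fuse(torch.cat([region_feats, context], dim=-1)))
        v = F.normalize(self.visual_proj(fused), dim=-1)
        s = F.normalize(self.semantic_proj(class_embeds), dim=-1)
        # Cosine similarity in the joint space: each region is assigned to the
        # nearest (most similar) class embedding, seen or unseen.
        return v @ s.t()


if __name__ == "__main__":
    head = JointEmbeddingHead()
    regions = torch.randn(8, 1024)   # e.g. RoI features from a detector backbone
    classes = torch.randn(20, 300)   # semantic embeddings for all classes
    scores = head(regions, classes)  # (8, 20) similarity scores
    print(scores.argmax(dim=-1))     # predicted class index per region
```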

     

  • Figure 1.  Overall framework of zero-shot object detection based on multi-modal joint semantic perception

    Figure 2.  Text context extraction process

    Figure 3.  Results visualisation

    Table 1.  Experimental results of ZSD on the 48/17 and 65/15 datasets

    | Method          | Seen/unseen split | Recall/% (threshold 0.4) | Recall/% (threshold 0.5) | Recall/% (threshold 0.6) | mAP/% (threshold 0.5) |
    |-----------------|-------------------|--------------------------|--------------------------|--------------------------|-----------------------|
    | SB[2]           | 48/17             | 34.46                    | 22.14                    | 11.31                    | 0.32                  |
    | DSES[2]         | 48/17             | 40.23                    | 27.19                    | 13.63                    | 0.54                  |
    | TD[5]           | 48/17             | 45.50                    | 34.30                    | 18.10                    |                       |
    | CG[17]          | 48/17             |                          |                          |                          | 7.2                   |
    | CG[17]          | 65/15             |                          |                          |                          | 10.9                  |
    | PL[12]          | 48/17             |                          | 43.59                    |                          | 10.10                 |
    | PL[12]          | 65/15             |                          | 37.72                    |                          | 12.40                 |
    | BLC[13]         | 48/17             | 49.63                    | 46.39                    | 41.86                    | 9.90                  |
    | BLC[13]         | 65/15             | 54.18                    | 51.65                    | 47.86                    | 13.10                 |
    | BLC(ms)[13]     | 48/17             | 51.33                    | 48.87                    | 45.03                    | 10.60                 |
    | BLC(ms)[13]     | 65/15             | 57.23                    | 54.68                    | 51.22                    | 14.70                 |
    | Proposed method | 48/17             | 51.40                    | 48.69                    | 45.29                    | 12.10                 |
    | Proposed method | 65/15             | 57.26                    | 55.03                    | 51.50                    | 14.90                 |

    Note: ms denotes multi-scale training and testing.

    Table 2.  Experimental results of GZSD on the 48/17 and 65/15 datasets

    | Method          | Seen/unseen split | mAP/% (seen) | mAP/% (unseen) | mAP/% (HM) | Recall/% (seen) | Recall/% (unseen) | Recall/% (HM) |
    |-----------------|-------------------|--------------|----------------|------------|-----------------|-------------------|---------------|
    | DSES[2]         | 48/17             |              |                |            | 15.02           | 15.32             | 15.17         |
    | PL[12]          | 48/17             | 35.92        | 4.12           | 7.39       | 38.24           | 26.32             | 31.18         |
    | PL[12]          | 65/15             | 34.07        | 12.40          | 18.18      | 36.38           | 37.16             | 36.76         |
    | BLC[13]         | 48/17             | 42.10        | 4.50           | 8.20       | 57.56           | 46.39             | 51.37         |
    | BLC[13]         | 65/15             | 36.00        | 13.10          | 19.20      | 56.39           | 51.65             | 53.92         |
    | Proposed method | 48/17             | 44.10        | 5.10           | 9.10       | 61.23           | 49.18             | 54.55         |
    | Proposed method | 65/15             | 36.30        | 13.50          | 21.10      | 57.53           | 56.54             | 57.03         |

    Note: HM denotes the harmonic mean of the seen-class and unseen-class values.
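    For reference, assuming HM is the standard harmonic mean as the table note states, a quick check against the last row (recall of the proposed method on the 65/15 split) gives:

```latex
\mathrm{HM}=\frac{2\,P_{\text{seen}}\,P_{\text{unseen}}}{P_{\text{seen}}+P_{\text{unseen}}},
\qquad
\frac{2\times 57.53\times 56.54}{57.53+56.54}\approx 57.03
```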

    Table 3.  Recall of unseen categories on the 48/17 dataset (%)

    | Category   | Baseline (BLC[13]) | Proposed method |
    |------------|--------------------|-----------------|
    | Overall    | 46.4               | 48.7            |
    | bus        | 77.4               | 77.7            |
    | dog        | 88.4               | 93.8            |
    | cow        | 71.9               | 78.6            |
    | elephant   | 77.2               | 78.3            |
    | umbrella   | 0                  | 0.3             |
    | tie        | 0                  | 0               |
    | skateboard | 41.7               | 64.9            |
    | cup        | 38.0               | 44.3            |
    | knife      | 45.6               | 39.9            |
    | cake       | 34.3               | 39.9            |
    | couch      | 65.2               | 70.2            |
    | keyboard   | 23.8               | 19.2            |
    | sink       | 14.1               | 17.0            |
    | scissors   | 20.8               | 12.5            |
    | airplane   | 48.3               | 58.0            |
    | cat        | 79.9               | 85.3            |
    | snowboard  | 61.8               | 74.0            |

    Table 4.  Recall of unseen categories on the 65/15 dataset (%)

    | Category      | Baseline (BLC[13]) | Proposed method |
    |---------------|--------------------|-----------------|
    | Overall       | 51.3               | 55.0            |
    | airplane      | 58.7               | 67.0            |
    | train         | 72.0               | 77.4            |
    | parking meter | 10.2               | 9.6             |
    | cat           | 96.1               | 96.9            |
    | bear          | 91.6               | 91.8            |
    | suitcase      | 46.9               | 53.7            |
    | frisbee       | 44.1               | 49.2            |
    | snowboard     | 65.4               | 69.0            |
    | fork          | 37.9               | 41.1            |
    | sandwich      | 82.5               | 86.1            |
    | hot dog       | 73.6               | 77.6            |
    | toilet        | 43.8               | 50.1            |
    | mouse         | 7.9                | 10.9            |
    | toaster       | 35.9               | 42.3            |
    | hair drier    | 2.7                | 2.7             |

    Table 5.  Performance of each module

    | Baseline (BLC[13]) | Visual context module | Semantic optimization module | mAP/% |
    |--------------------|-----------------------|------------------------------|-------|
    |                    |                       |                              | 10.6  |
    |                    |                       |                              | 10.0  |
    |                    |                       |                              | 11.2  |
    |                    |                       |                              | 12.1  |

    Table 6.  Selection of semantic information

    | Semantic information | mAP/% |
    |----------------------|-------|
    |                      | 11.2  |
    | Word vectors         | 11.3  |
    | Image captions       | 12.1  |
  • [1] TAN C, XU X, SHEN F. A survey of zero shot detection: Methods and applications[J]. Cognitive Robotics, 2021, 1: 159-167. doi: 10.1016/j.cogr.2021.08.001
    [2] BANSAL A, SIKKA K, SHARMA G, et al. Zero-shot object detection[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2018: 384-400.
    [3] ZHANG L, WANG X, YAO L, et al. Zero-shot object detection via learning an embedding from semantic space to visual space[C]// Proceedings of the 29th International Joint Conference on Artificial Intelligence and 17th Pacific Rim International Conference on Artificial Intelligence. [S. l.]: IJCAI, 2020: 906-912.
    [4] GUPTA D, ANANTHARAMAN A, MAMGAIN N, et al. A multi-space approach to zero-shot object detection[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway: IEEE Press, 2020: 1209-1217.
    [5] LI Z, YAO L, ZHANG X, et al. Zero-shot object detection with textual descriptions[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2019: 8690-8697.
    [6] YANG X, TANG K, ZHANG H, et al. Auto-encoding scene graphs for image captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 10685-10694.
    [7] FENG Y, MA L, LIU W, et al. Unsupervised image captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 4125-4134.
    [8] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2014: 740-755.
    [9] WANG W, ZHENG V W, YU H, et al. A survey of zero-shot learning: Settings, methods, and applications[J]. ACM Transactions on Intelligent Systems and Technology, 2019, 10(2): 1-37.
    [10] ZHU P, WANG H, SALIGRAMA V. Don’t even look once: Synthesizing features for zero-shot detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 11693-11702.
    [11] RAHMAN S, KHAN S, BARNES N. Polarity loss for zero-shot object detection[EB/OL]. (2020-04-02)[2022-05-01]. https://arxiv.org/abs/1811.08982v2.
    [12] RAHMAN S, KHAN S, BARNES N. Improved visual-semantic alignment for zero-shot object detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 11932-11939.
    [13] ZHENG Y, HUANG R, HAN C, et al. Background learnable cascade for zero-shot object detection[C]//Proceedings of the Asian Conference on Computer Vision. Berlin: Springer, 2020: 107-123.
    [14] CAI Z, VASCONCELOS N. Cascade R-CNN: Delving into high quality object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2018: 6154-6162.
    [15] GU X, LIN T Y, KUO W, et al. Zero-shot detection via vision and language knowledge distillation[EB/OL]. (2022-05-12)[2022-05-15]. https://arxiv.org/abs/2104.13921v1.
    [16] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the International Conference on Machine Learning. [S. l.]: PMLR, 2021: 8748-8763.
    [17] LI Y, SHAO Y, WANG D. Context-guided super-class inference for zero-shot detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 944-945.
    [18] GU Z, ZHOU S, NIU L, et al. Context-aware feature generation for zero-shot semantic segmentation[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1921-1929.
    [19] YANG Z, WANG Y, CHEN X, et al. Context-Transformer: Tackling object confusion for few-shot detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 12653-12660.
    [20] YU F, KOLTUN V. Multi-scale context aggregation by dilated convolutions[EB/OL]. (2016-04-30)[2022-05-01]. https://arxiv.org/abs/1511.07122.
    [21] XU K, BA J, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//Proceedings of the International Conference on Machine Learning. [S. l.]: PMLR, 2015: 2048-2057.
    [22] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10)[2022-05-01]. https://arxiv.org/abs/1409.1556.
    [23] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[EB/OL]. (2016-05-19)[2022-05-01]. https://arxiv.org/abs/1409.0473v5.
    [24] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780. doi: 10.1162/neco.1997.9.8.1735
    [25] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2019-05-24)[2022-05-01]. https://arxiv.org/abs/1810.04805v2.
    [26] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
    [27] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
    [28] LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 2117-2125.
    [29] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[C]//Proceedings of the 26th International Conference on Neural Information Processing Systems. New York: ACM, 2013: 3111-3119.
Publication history
  • Received: 2022-05-19
  • Accepted: 2022-06-23
  • Published online: 2022-10-18
  • Issue date: 2024-02-27
