Cross-modal video retrieval algorithm based on multi-semantic clues

DING Luo, LI Yifan, YU Chenglong, LIU Yang, WANG Xuan, QI Shuhan

Citation: DING Luo, LI Yifan, YU Chenglong, et al. Cross-modal video retrieval algorithm based on multi-semantic clues[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 596-604. doi: 10.13700/j.bh.1001-5965.2020.0470 (in Chinese)


doi: 10.13700/j.bh.1001-5965.2020.0470
Funds:

National Natural Science Foundation of China 61902093

Natural Science Foundation of Guangdong Province 2020A1515010652

    Author biographies:

    DING Luo: male, M.S. candidate. Research interests: multi-modal retrieval and object detection.

    LI Yifan: male, Ph.D. candidate. Research interests: visual question answering and object recognition.

    QI Shuhan: male, Ph.D., professor, and M.S. supervisor. Research interests: computer vision, multimedia information retrieval, and computer game playing.

    Corresponding author:

    QI Shuhan, E-mail: shuhanqi@cs.hitsz.edu.cn

  • CLC number: TP391.4

  • Abstract:

    Most existing cross-modal video retrieval algorithms ignore the rich semantic clues in the data, so the features they generate have poor representational power. To address this problem, a cross-modal video retrieval model based on multi-semantic clues is designed. The model uses a multi-head self-attention mechanism to capture the frames within the video modality that contribute most to its semantics, selectively attending to the important information in the video data to obtain its global features. A bidirectional gated recurrent unit (GRU) captures the interaction features between contexts inside the multi-modal data. Local information in the video and text data is mined by jointly encoding the subtle differences between local data. Together, the global features, contextual interaction features, and local features form the multi-semantic clues of the multi-modal data, which better exploit the semantic information in the data and thereby improve retrieval performance. On this basis, an improved triplet distance metric loss function is proposed, which adopts a hard negative mining strategy based on similarity ranking to improve the learning of cross-modal features. Experiments on the MSR-VTT dataset show that, compared with the state-of-the-art methods, the proposed algorithm improves the text-to-video retrieval task by 11.1%. Experiments on the MSVD dataset show that, compared with current advanced methods, the proposed algorithm improves the overall recall (Recall Sum) on the text-to-video retrieval task by 5.0%.
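The triplet ranking loss with hard negative mining described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, the margin value, and the convention that matched video-caption pairs sit on the diagonal of the batch similarity matrix are all assumptions.

```python
import numpy as np

def triplet_loss_hard_negatives(sim, margin=0.2):
    """Triplet ranking loss over a batch similarity matrix.

    sim[i, j] is the similarity between video i and caption j;
    the diagonal holds the matched (positive) pairs. For each
    positive pair, only the hardest negative (the non-matching
    sample ranked most similar) contributes to the loss.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                   # similarities of matched pairs
    off = sim - np.eye(n) * 1e9          # mask out the diagonal
    hard_t = off.max(axis=1)             # hardest negative caption per video
    hard_v = off.max(axis=0)             # hardest negative video per caption
    loss_t = np.maximum(0.0, margin + hard_t - pos)
    loss_v = np.maximum(0.0, margin + hard_v - pos)
    return (loss_t + loss_v).mean()
```

When the matched pairs already score higher than every negative by more than the margin, the loss is zero; otherwise only the hardest violating negatives drive the gradient, which is what makes hard negative mining sharper than averaging over all negatives.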

     

  • Figure 1. Schematic diagram of the proposed MCCR model

    Figure 2. Test samples of six video-to-text retrieval queries on the MSR-VTT dataset, with the retrieval results of each encoding part of MCCR (Level-1, Level-2, Level-3) and their combinations

    Figure 3. Text-to-video retrieval

    Figure 4. Video-to-text retrieval

    Table 1. Results on the MSR-VTT dataset

    Algorithm       Text-to-video retrieval      Video-to-text retrieval      Recall Sum
                    R@1   R@5   R@10   MedR      R@1    R@5   R@10   MedR
    VSE++           5.0   16.4  24.6   47        7.7    20.3  31.2   28       105.2
    W2VV            5.5   17.6  25.9   51        9.1    24.6  36.0   23       118.7
    Fusion          7.0   20.9  29.7   38        12.5   31.3  42.4   14       143.8
    Ours (VSE++)    7.6   21.7  31.2   31        12.2   29.4  42.4   18       144.5
    Ours            7.8   23.0  33.1   29        13.1   30.7  43.1   15       150.8
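For reference, the R@K (recall at rank K, in percent) and MedR (median rank of the correct item) figures in these tables are standard retrieval metrics. A generic way to compute them from a query-candidate similarity matrix is sketched below; this is not the authors' evaluation code, and the assumption that the correct candidate for query i is candidate i is illustrative.

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1/R@5/R@10 (percent) and median rank (MedR).

    sim[i, j]: similarity of query i to candidate j; the correct
    candidate for query i is assumed to be j == i. Rank 1 means
    the correct candidate scored highest.
    """
    order = np.argsort(-sim, axis=1)      # candidates sorted by descending score
    ranks = np.array([np.where(order[i] == i)[0][0] + 1
                      for i in range(sim.shape[0])])
    recall = {k: float((ranks <= k).mean() * 100) for k in (1, 5, 10)}
    return recall, float(np.median(ranks))
```

The "Recall Sum" column in the tables is simply the sum of the six recall values (R@1, R@5, R@10 in both retrieval directions), so higher is better, while a lower MedR is better.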

    Table 2. Results on the MSVD dataset

    Algorithm       Text-to-video retrieval      Video-to-text retrieval      Recall Sum
                    R@1    R@5   R@10   MedR     R@1    R@5   R@10   MedR
    VSE++           15.4   39.6  53.3   9        21.2   43.4  52.2   9        225.1
    W2VV            15.4   39.2  51.4   10       16.3   33.4  44.8   14       200.5
    Fusion          18.9   46.1  60.9   6        30.6   49.1  61.5   6        267.1
    Ours (VSE++)    19.7   48.2  61.0   6        31.7   50.7  61.8   6        272.5
    Ours            20.9   49.0  62.6   5        32.2   51.1  62.2   5        278.0

    Table 3. Ablation analysis results on the MSR-VTT dataset

    Method          Text-to-video retrieval      Video-to-text retrieval      Recall Sum
                    R@1   R@5   R@10   MedR      R@1    R@5   R@10   MedR
    Level-1         6.4   18.9  27.1   46        11.9   28.3  39.2   22       131.8
    Level-2         6.3   19.7  28.8   38        10.0   26.2  38.3   20       128.8
    Level-3         7.3   21.5  31.2   32        10.6   27.3  38.5   20       136.4
    Level-(1+2)     7.2   21.3  29.6   37        12.1   30.5  40.9   17       141.6
    Level-(1+3)     7.4   21.2  32.3   30        12.4   29.9  42.5   16       147.1
    Level-(2+3)     7.6   22.4  32.2   31        11.9   30.6  42.4   16       147.2
    Level-(1+2+3)   7.8   23.0  33.1   29        13.1   30.7  43.1   15       150.8
Figures (4) / Tables (3)
Publication history
  • Received: 2020-08-26
  • Accepted: 2020-09-04
  • Published: 2021-03-20
