Journal of Beijing University of Aeronautics and Astronautics, 2021, Vol. 47, Issue 3: 596-604. doi: 10.13700/j.bh.1001-5965.2020.0470



Cross-modal video retrieval algorithm based on multi-semantic clues

DING Luo1, LI Yifan1, YU Chenglong2, LIU Yang1, WANG Xuan1,3, QI Shuhan1,3   

1. School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China;
    2. School of Digital Media, Shenzhen Institute of Information Technology, Shenzhen 518172, China;
    3. Peng Cheng Laboratory, Shenzhen 518055, China
  • Received: 2020-08-26 Published: 2021-04-08
  • Corresponding author: QI Shuhan, E-mail: shuhanqi@cs.hitsz.edu.cn
  • About the authors: DING Luo, male, M.S. candidate; research interests: multimodal retrieval and object detection. LI Yifan, male, Ph.D. candidate; research interests: visual question answering and object recognition. QI Shuhan, male, Ph.D., professor, and master's supervisor; research interests: computer vision, multimedia information retrieval, and machine game playing.
  • Supported by:
    National Natural Science Foundation of China (61902093); Natural Science Foundation of Guangdong Province (2020A1515010652)



Abstract: Most existing cross-modal video retrieval algorithms map heterogeneous data into a common space so that semantically similar data lie close together and semantically dissimilar data lie far apart; that is, they establish only a global similarity relationship between the modalities. However, these methods ignore the rich semantic clues in the data, which weakens the representational power of the generated features. To solve this problem, we propose a cross-modal video retrieval model based on multi-semantic clues. The model uses a multi-head self-attention mechanism to capture the frames that contribute most to the semantics of a video, selectively attending to the important information in the video data to obtain a global feature. A bidirectional gated recurrent unit (GRU) is used to capture the contextual interaction features within each modality. The method also mines local information in the video and text data by jointly encoding the fine-grained differences between local components. Together, the global features, contextual interaction features, and local features form the multi-semantic clues of the multimodal data, which better exploit the semantic information in the data and thereby improve retrieval performance. In addition, an improved triplet distance metric loss function is proposed, which adopts a hard negative mining strategy based on similarity ranking and improves the learning of cross-modal features. Experiments on the MSR-VTT dataset show that, compared with state-of-the-art methods, the proposed method improves performance on the text-to-video retrieval task by 11.1%; experiments on the MSVD dataset show that it improves the overall recall on the text-to-video retrieval task by 5.0%.
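The paper itself provides no code, but the first component described above is straightforward to illustrate. Below is a minimal PyTorch sketch of how multi-head self-attention can weight the frames of a video and pool them into a single global feature; the class name, dimensions, and mean-pooling step are hypothetical assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GlobalVideoEncoder(nn.Module):
    """Pool a sequence of frame features into one global vector
    via multi-head self-attention (hypothetical sketch)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames):               # frames: (batch, n_frames, dim)
        # Each frame attends to every other frame, so frames that are
        # semantically important receive larger attention weights.
        ctx, _ = self.attn(frames, frames, frames)
        return ctx.mean(dim=1)               # (batch, dim) global feature

frames = torch.randn(2, 30, 512)             # 2 videos, 30 frame features each
print(GlobalVideoEncoder()(frames).shape)    # torch.Size([2, 512])
```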
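The bidirectional GRU branch for contextual interaction features could be sketched in the same spirit. The hidden size and pooling choice here are assumptions made only so the sketch is self-contained.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Capture contextual interactions within a frame (or word)
    sequence using a bidirectional GRU (hypothetical sketch)."""
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        # hidden * 2 == dim, so the output keeps the input width.
        self.gru = nn.GRU(dim, hidden, batch_first=True,
                          bidirectional=True)

    def forward(self, seq):                  # seq: (batch, steps, dim)
        states, _ = self.gru(seq)            # (batch, steps, hidden * 2)
        # Mean-pool the per-step states into one context feature.
        return states.mean(dim=1)

words = torch.randn(2, 20, 512)              # e.g., 20 word embeddings
print(ContextEncoder()(words).shape)         # torch.Size([2, 512])
```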
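Finally, a triplet loss with hard negative mining based on similarity ranking might look roughly like the sketch below: within a batch, the negatives are ranked by similarity to the anchor and the top-ranked one is used as the hard negative. The margin value, the use of in-batch negatives, and the cosine similarity are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def triplet_loss_hard_negatives(video_emb, text_emb, margin=0.2):
    """Triplet ranking loss where, for each anchor, the hardest
    in-batch negative (highest similarity) is mined (sketch)."""
    # Cosine similarity matrix; the diagonal holds positive pairs.
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    sim = v @ t.T                                    # (batch, batch)
    pos = sim.diag()
    # Mask out positives; the maximum of what remains is the
    # top entry of the similarity-sorted negatives per anchor.
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    neg = sim.masked_fill(mask, float('-inf'))
    hard_v2t = neg.max(dim=1).values    # hardest caption per video
    hard_t2v = neg.max(dim=0).values    # hardest video per caption
    loss = (torch.clamp(margin + hard_v2t - pos, min=0)
            + torch.clamp(margin + hard_t2v - pos, min=0))
    return loss.mean()

v, t = torch.randn(8, 512), torch.randn(8, 512)
print(triplet_loss_hard_negatives(v, t))
```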

Key words: cross-modal video retrieval, multi-semantic clues, multi-head self-attention mechanism, distance metric loss function, multimodal


