Volume 47 Issue 3
Mar. 2021
DING Luo, LI Yifan, YU Chenglong, et al. Cross-modal video retrieval algorithm based on multi-semantic clues[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 596-604. doi: 10.13700/j.bh.1001-5965.2020.0470 (in Chinese)

Cross-modal video retrieval algorithm based on multi-semantic clues

doi: 10.13700/j.bh.1001-5965.2020.0470
Funds:

National Natural Science Foundation of China 61902093

Natural Science Foundation of Guangdong Province 2020A1515010652

More Information
  • Corresponding author: QI Shuhan, E-mail: shuhanqi@cs.hitsz.edu.cn
  • Received Date: 26 Aug 2020
  • Accepted Date: 04 Sep 2020
  • Publish Date: 20 Mar 2021
  • Abstract: Most existing cross-modal video retrieval algorithms map heterogeneous data into a common space so that semantically similar data lie close together and dissimilar data lie far apart; that is, they establish only a global similarity relationship between data of different modalities. These methods, however, ignore the rich semantic clues within the data, which limits the quality of the generated features. To address this problem, we propose a cross-modal retrieval model based on multi-semantic clues. The model captures the semantically important frames within the video modality through a multi-head self-attention mechanism, attending to the salient information in the video data to obtain its global features. A bidirectional Gated Recurrent Unit (GRU) captures contextual interaction features within the multi-modal data. The method also mines local information in video and text data by jointly encoding the fine-grained differences between local data. Together, the global features, contextual interaction features, and local features form the multi-semantic clues of the multi-modal data, which better exploit the semantic information in the data and improve retrieval performance. In addition, an improved triplet distance-measurement loss function is proposed, which adopts hard-negative mining based on similarity sorting and improves the learning of cross-modal features. Experiments show that, compared with state-of-the-art methods, the proposed approach improves the text-to-video retrieval task by 11.1% on the MSR-VTT dataset and by 5.0% on the MSVD dataset.
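The improved triplet loss described in the abstract can be sketched as follows. This is a minimal NumPy illustration, assuming a VSE++-style bidirectional ranking loss in which the "similarity sorting" hard-negative mining selects, for each anchor, the top-ranked non-matching item in the batch; the function name, margin value, and exact mining rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def triplet_loss_hard_negative(sim, margin=0.2):
    """Bidirectional triplet ranking loss with hardest-negative mining.

    sim[i, j] is the similarity between text i and video j in a batch;
    diagonal entries are the matched (positive) pairs. Sketch only:
    the paper's similarity-sorting variant may differ in detail.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                    # similarity of matched pairs
    mask = np.eye(n, dtype=bool)
    neg = np.where(mask, -np.inf, sim)    # exclude positives from mining
    # "Similarity sorting": the hardest negative is the highest-ranked non-match
    hard_t2v = neg.max(axis=1)            # hardest video for each text anchor
    hard_v2t = neg.max(axis=0)            # hardest text for each video anchor
    loss_t2v = np.maximum(0.0, margin - pos + hard_t2v)
    loss_v2t = np.maximum(0.0, margin - pos + hard_v2t)
    return float((loss_t2v + loss_v2t).mean())
```

When every positive pair outranks its hardest negative by more than the margin, the loss is zero; otherwise the hardest negatives dominate the gradient, which is what makes this mining strategy effective for cross-modal feature learning.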

     

  • [1]
    张鸿, 吴飞, 庄越挺. 跨媒体相关性推理与检索研究[J]. 计算机研究与发展, 2008, 45(5): 869. https://www.cnki.com.cn/Article/CJFDTOTAL-JFYZ200805023.htm

    ZHANG H, WU F, ZHUANG Y T. Research on cross-media correlation inference and retrieval[J]. Computer Research and Development, 2008, 45(5): 869(in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-JFYZ200805023.htm
    [2]
    DONG J, LI X, SNOEK C G M. Predicting visual features from text for image and video caption retrieval[J]. IEEE Transactions on Multimedia, 2018, 20(12): 3377-3388. doi: 10.1109/TMM.2018.2832602
    [3]
    MITHUN N C, LI J, METZE F, et al. Learning joint embedding with multimodal cues for cross-modal video-text retrieval[C]//Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. New York: ACM Press, 2018: 19-27.
    [4]
    DONG J, LI X, SNOEK C G M. Word2VisualVec: Image and video to sentence matching by visual feature prediction[EB/OL]. (2016-04-23)[2020-08-01]. https://arxiv.org/abs/1604.06838.
    [5]
    TORABI A, TANDON N, SIGAL L. Learning language-visual embedding for movie understanding with natural-language[EB/OL]. (2016-08-26)[2020-08-01]. https://arxiv.org/abs/1609.08124.
    [6]
    RASIWASIA N, COSTA P J, COVIELLO E, et al. A new approach to cross-modal multimedia retrieval[C]//Proceedings of the 18th ACM International Conference on Multimedia. New York: ACM Press, 2010: 251-260.
    [7]
    FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: Improving visual-semantic embeddings with hard negatives[EB/OL]. (2017-07-18)[2020-08-01]. https://arxiv.org/abs/1707.05612.
    [8]
    GONG Y, KE Q, ISARD M, et al. A multi-view embedding space for modeling internet images, tags, and their semantics[J]. International Journal of Computer Vision, 2014, 106(2): 210-233. doi: 10.1007/s11263-013-0658-4
    [9]
    HODOSH M, YOUNG P, HOCKENMAIER J. Framing image description as a ranking task: Data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47(24): 853-899. doi: 10.5555/2832747.2832838
    [10]
    李志欣, 施智平, 陈宏朝, 等. 基于语义学习的图像多模态检索[J]. 计算机工程, 2013, 39(3): 258-263. https://www.cnki.com.cn/Article/CJFDTOTAL-JSJC201303053.htm

    LI Z X, SHI Z P, CHEN H C, et al. Multi-modal image retrieval based on semantic learning[J]. Computer Engineering, 2013, 39(3): 258-263(in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-JSJC201303053.htm
    [11]
    WANG L, LI Y, HUANG J, et al. Learning two-branch neural networks for image-text matching tasks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(2): 394-407.
    [12]
    FROME A, CORRADO G S, SHLENS J, et al. Devise: A deep visual-semantic embedding model[C]//Advances in Neural Information Processing Systems. New York: ACM Press, 2013: 2121-2129.
    [13]
    KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models[EB/OL]. (2014-11-10)[2020-08-01]. https://arxiv.org/abs/1411.2539.
    [14]
    XU R, XIONG C, CHEN W, et al. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework[C]//Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015: 6.
    [15]
    MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[EB/OL]. (2013-01-16)[2020-08-03]. https: //arxiv.org/abs/1301.3781.
    [16]
    HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
    [17]
    SONG Y, SOLEYMANI M. Polysemous visual-semantic embedding for cross-modal retrieval[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 1979-1988.
    [18]
    CHO K, VAN MERRIËNBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[EB/OL]. (2014-01-03)[2020-08-01]. https://arxiv.org/abs/1406.1078.
    [19]
    KIM Y. Convolutional neural networks for sentence classification[EB/OL]. (2014-08-25)[2020-08-01]. https://arxiv.org/abs/1408.5882.
    [20]
    YOUNG P, LAI A, HODOSH M, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions[J]. Transactions of the Association for Computational Linguistics, 2014, 2(1): 67-78. http://www.researchgate.net/publication/303721259_From_image_descriptions_to_visual_denotations_New_similarity_metrics_for_semantic_inference_over_event_descriptions
    [21]
    XU J, MEI T, YAO T, et al. MSR-VTT: A large video description dataset for bridging video and language[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 5288-5296.
    [22]
    CHEN D, DOLAN W B. Collecting highly parallel data for paraphrase evaluation[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011: 190-200.

    Figures(4)  / Tables(3)
