-
Abstract:
As the most intuitive representation of video content, the video thumbnail plays an important role on video-sharing sites and is one of the key factors that determine whether a user clicks to watch a video. A descriptive sentence about the video content, paired with a thumbnail that matches that sentence, is usually more attractive to users. This paper therefore proposes a deep visual-semantic embedding model to build a complete video thumbnail recommendation framework. The model first uses a convolutional neural network (CNN) to extract visual features of video keyframes and a recurrent neural network (RNN) to extract semantic features of the descriptive sentence, then embeds both into a visual-semantic latent space of the same dimension. By comparing the correlation between the visual and semantic features, the keyframes most closely related to a given descriptive sentence are recommended as video thumbnails. Experiments on web videos of different categories show that the proposed method can effectively recommend thumbnail sequences that are relevant to a given descriptive sentence, improving the user's browsing experience.
-
Keywords:
- video thumbnail /
- keyframe /
- convolutional neural network (CNN) /
- recurrent neural network (RNN) /
- visual-semantic embedding
Abstract: The video thumbnail, as the most intuitive form of video content, plays an important role on video-sharing sites and is one of the key elements that attract users to click and watch a video. Moreover, a descriptive sentence related to the video content, paired with a thumbnail associated with that sentence, is often more attractive to users. Therefore, a complete video thumbnail recommendation framework based on a deep visual-semantic embedding model is proposed in this paper. The model uses a convolutional neural network to extract the visual features of video keyframes, and a recurrent neural network to extract the semantic features of descriptive sentences. After embedding the visual and semantic features into a visual-semantic latent space of the same dimension, the keyframes most relevant to the content of a descriptive sentence are recommended as video thumbnails by comparing the correlation between the visual and semantic features. Experiments on web videos of different categories show that the proposed method can effectively recommend content-related video thumbnail sequences for given descriptive sentences and enhance the user experience.
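The ranking step described in the abstract can be sketched as follows. This is a schematic illustration only: the trained CNN/RNN feature extractors and the learned projection matrices (here named `W_v` and `W_s`) are assumed to be given, and the toy features and identity projections below are stand-in values, not the paper's actual model.

```python
# Schematic sketch of ranking keyframes against a sentence in a shared
# visual-semantic space. W_v and W_s stand in for learned projections.
import math

def project(feature, weights):
    """Linearly project a feature vector into the joint embedding space."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(u, v):
    # Both inputs are already unit-length, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

def rank_keyframes(keyframe_feats, sentence_feat, W_v, W_s):
    """Return keyframe indices sorted from most to least sentence-relevant."""
    s = l2_normalize(project(sentence_feat, W_s))
    scored = [(cosine(l2_normalize(project(f, W_v)), s), i)
              for i, f in enumerate(keyframe_feats)]
    return [i for _, i in sorted(scored, reverse=True)]

# Toy example: 2-D "CNN" features for three keyframes, one "RNN" sentence
# feature, and identity projections so the joint space equals the input space.
W_v = W_s = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
sentence = [0.6, 0.8]
print(rank_keyframes(frames, sentence, W_v, W_s))  # → [1, 2, 0]
```

In the toy run, the middle keyframe points closest to the sentence direction, so it ranks first; the framework would surface the top-ranked keyframes as the recommended thumbnail sequence.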
-
Table 1. Hit rates of web videos in different categories

Category            Videos  Sentences  Exact hit rate (A)/%  Hit rate (A+B)/%
Education                3         15                  29.3              56.0
Entertainment            3         15                  30.7              63.3
Film                     2         10                  43.0              72.0
Games & animation        2         10                  20.0              51.0
News & politics          3         15                  27.3              71.3
Lifestyle                4         20                  53.0              87.0
Sports                   3         15                  33.3              66.7
Total                   20        100                  35.0              68.3
-
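The two hit-rate columns in Table 1 can be computed from per-sentence judgments as sketched below. This is an illustrative assumption about the bookkeeping, not the paper's evaluation code: each recommendation is assumed to be labeled an exact hit ("A"), a partial hit ("B"), or a miss, and the sample judgment list is made-up data.

```python
# Sketch of computing the hit rates in Table 1, assuming each recommended
# thumbnail is judged "A" (exact hit), "B" (partial hit), or "miss".
from collections import Counter

def hit_rates(judgments):
    """Return (exact hit rate A, combined hit rate A+B) as percentages."""
    counts = Counter(judgments)
    total = len(judgments)
    exact = 100.0 * counts["A"] / total
    combined = 100.0 * (counts["A"] + counts["B"]) / total
    return exact, combined

judgments = ["A", "B", "miss", "A", "B"]  # 5 made-up judged recommendations
print(hit_rates(judgments))  # → (40.0, 80.0)
```

Applied per category and over all 100 sentences, this bookkeeping yields the (A) and (A+B) columns of Table 1.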