融合语义信息的视频摘要生成

滑蕊; 吴心筱; 赵文天

doi:10.13700/j.bh.1001-5965.2020.0447

融合语义信息的视频摘要生成

doi: 10.13700/j.bh.1001-5965.2020.0447

北京理工大学计算机学院, 北京 100081

基金项目:

国家自然科学基金 61673062

国家自然科学基金 62072041

详细信息

作者简介:
滑蕊   女，硕士研究生。主要研究方向：视频摘要

吴心筱   女，博士，副教授，博士生导师。主要研究方向：视觉与语言、视频内容理解、机器学习

赵文天   男，博士研究生。主要研究方向：图像和视频描述生成

通讯作者:
吴心筱, E-mail: wuxinxiao@bit.edu.cn

中图分类号: TP391
计量
- 文章访问数: 722
- HTML全文浏览量: 104
- PDF下载量: 76
- 被引次数: 0
出版历程
- 收稿日期: 2020-08-24
- 录用日期: 2020-09-27
- 网络出版日期: 2021-03-20

Video summarization by learning semantic information

School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China

Funds:

National Natural Science Foundation of China 61673062

National Natural Science Foundation of China 62072041

More Information

Corresponding author: WU Xinxiao, E-mail: wuxinxiao@bit.edu.cn

摘要

摘要:
视频摘要任务旨在通过生成简短的视频片段来表示原视频的主要内容，针对现有方法缺乏对语义信息探索的问题，提出了一种融合语义信息的视频摘要生成模型，学习视频特征使其包含丰富的语义信息，进而同时生成描述原始视频内容的视频摘要和文本摘要。该模型分为3个模块：帧级分数加权模块、视觉-语义嵌入模块、视频文本描述生成模块。帧级分数加权模块结合卷积网络与全连接层以获取帧级重要性分数；视觉-语义嵌入模块将视觉特征与文本特征映射到同一空间，以使2种特征相互靠近；视频文本描述生成模块最小化视频摘要的生成描述与文本标注真值之间的距离，以生成带有语义信息的视频摘要。测试时，在获取视频摘要的同时，该模型获得简短的文本摘要作为副产品，可以帮助人们更直观地理解视频内容。在SumMe和TVSum数据集上的实验表明：该模型通过融合语义信息，比现有先进方法取得了更好的性能，在这2个数据集上F-score指标分别提高了0.5%和1.6%。
- 视频摘要 /
- 视觉-语义嵌入空间 /
- 视频文本描述 /
- 视频关键帧 /
- 长短期记忆(LSTM)模型
Abstract:
Video summarization aims to generate short and compact summary to represent original video. However, the existing methods focus more on representativeness and diversity of representation, but less on semantic information. In order to fully exploit semantic information of video content, we propose a novel video summarization model that learns a visual-semantic embedding space, so that the video features contain rich semantic information. It can generate video summaries and text summaries that describe the original video simultaneously. The model is mainly divided into three modules: frame-level score weighting module that combines convolutional layers and fully connected layers; visual-semantic embedding module that embeds the video and text in a common embedding space and make them lose to each other to achieve the purpose of mutual promotion of two features; video caption generation module that generates video summary with semantic information by minimizing the distance between the generated description of the video summary and the manually annotated text of the original video. During the test, while obtaining the video summary, we obtain a short text summary as a by-product, which can help people understand the video content more intuitively. Experiments on SumMe and TVSum datasets show that the proposed model achieves better performance than the existing advanced methods by fusing semantic information, and improves F-score by 0.5% and 1.6%, respectively.
- video summarization /
- visual-semantic embedding space /
- video captioning /
- video key frame /
- Long Short-Term Memory (LSTM) model

HTML全文

图 1 融合语义信息的视频摘要生成流程

Figure 1. Flowchart of video summarization by learning semantic information

下载: 全尺寸图片幻灯片

图 2 帧级分数加权模块框架

Figure 2. Framework of frame-level score weighting module

下载: 全尺寸图片幻灯片

图 3 TVSum数据集中生成视频摘要的示例

Figure 3. Examples of video summarization in TVSum

下载: 全尺寸图片幻灯片

图 4 TVSum数据集中生成文本摘要的示例

Figure 4. Examples of text summarization in TVSum

下载: 全尺寸图片幻灯片

表 1 与6个最新方法之间的F-score比较

Table 1. between our frameworks and six state-of- the-art methods

实验方法	F-score/%
实验方法	SumMe	TVSum
vsLSTM^[4]	37.6	54.2
dppLSTM^[4]	38.6	54.7
SUM-GAN_sup^[5]	41.7	56.3
DR-DSN_sup^[17]	42.1	58.1
SASUM_sup^[11]	45.3	58.2
CSNet_sup^[24]	48.6	58.5
本文方法(无监督)	45.5	57.3
本文方法(有监督)	49.1	60.1

下载: 导出CSV

表 2 不同数据集生成的文本摘要评测

Table 2. Evaluation of text summaries generated by different datasets

数据集	BLEU-1/%	ROUGE-L/%	CIDEr/%
SumMe	28.3	27.6	9.6
TVSum	32.8	29.9	12.7

下载: 导出CSV

表 3 TVSum数据集上的消融实验结果

Table 3. Results of ablation experiment on TVSum

实验编号	嵌入空间	描述生成	F-score/%
1	×	×	50.4
2	√	×	51.5
3	×	√	58.3
4	√	√	60.1

下载: 导出CSV

参考文献(24)

[1]	刘波. 视频摘要研究综述[J]. 南京信息工程大学学报, 2020, 12(3): 274-278. https://www.cnki.com.cn/Article/CJFDTOTAL-NJXZ202003003.htm LIU B. Survey of video summary[J]. Journal of Nanjing University of Information Science & Technology, 2020, 12(3): 274-278(in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-NJXZ202003003.htm
[2]	RAV-ACHA A, PRITCH Y, PELEG S. Making a long video short: Dynamic video synopsis[C]//Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2006: 435-441.
[3]	ZHAO B, XING E P. Quasi real-time summarization for consumer videos[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2014: 2513-2520.
[4]	ZHANG K, CHAO W, SHA F, et al. Video summarization with long short-term memory[C]//European Conference on Computer Vision. Berlin: Springer, 2016: 766-782.
[5]	MAHASSENI B, LAM M, TODOROVIC S, et al. Unsupervised video summarization with adversarial LSTM networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 2982-2991.
[6]	ZHAO B, LI X, LU X, et al. HSA-RNN: Hierarchical structure-adaptive RNN for video summarization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2018: 7405-7414.
[7]	冀中, 江俊杰. 基于解码器注意力机制的视频摘要[J]. 天津大学学报, 2018, 51(10): 1023-1030. https://www.cnki.com.cn/Article/CJFDTOTAL-TJDX201810004.htm JI Z, JIANG J J. Video summarization based on decoder attention mechanism[J]. Transactions of Tianjin University, 2018, 51(10): 1023-1030(in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-TJDX201810004.htm
[8]	李依依, 王继龙. 自注意力机制的视频摘要模型[J]. 计算机辅助设计与图形学学报, 2020, 32(4): 652-659. https://www.cnki.com.cn/Article/CJFDTOTAL-JSJF202004016.htm LI Y Y, WANG J L. Self-attention based video summarization[J]. Journal of Computer-Aided Design & Computer Graphics, 2020, 32(4): 652-659(in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-JSJF202004016.htm
[9]	CHEN Y, TAO L, WANG X, et al. Weakly supervised video summarization by hierarchical reinforcement learning[C]//Proceedings of the ACM Multimedia Asia. New York: ACM Press, 2019: 1-6.
[10]	CHOI J, OH T, KWEON I S, et al. Contextually customized video summaries via natural language[C]//Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). Piscataway: IEEE Press, 2018: 1718-1726.
[11]	WEI H, NI B, YAN Y, et al. Video summarization via semantic attended networks[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2018: 216-223.
[12]	SHARGHI A, BORJI A, LI C, et al. Improving sequential determinantal point processes for supervised video summarization[C]//European Conference on Computer Vision. Berlin: Springer, 2018: 533-550.
[13]	ROCHAN M, YE L, WANG Y, et al. Video summarization using fully convolutional sequence networks[C]//European Conference on Computer Vision. Berlin: Springer, 2018: 358-374.
[14]	ZHANG Y, KAMPFFMEYER M, LIANG X, et al. Query-conditioned three-player adversarial network for video summarization[C]//British Machine Vision Conference. Berlin: Springer, 2018.
[15]	YUAN L, TAY F E, LI P, et al. Cycle-sum: Cycle-consistent adversarial LSTM networks for unsupervised video summarization[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2019: 9143-9150.
[16]	ZHANG Y, KAMPFFMEYER M, ZHAO X, et al. DTR-GAN: Dilated temporal relational adversarial network for video summarization[C]//Proceedings of the ACM Turing Celebration Conference-China. New York: ACM Press, 2019: 1-6.
[17]	ZHOU K, QIAO Y, XIANG T. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2018: 7582-7589.
[18]	WANG L, ZHU Y, PAN H. Unsupervised reinforcement learning for video summarization reward function[C]//Proceedings of the 2019 International Conference on Image, Video and Signal Processing. New York: ACM Press, 2019: 40-44.
[19]	XU J, MEI T, YAO T, et al. Msr-VTT: A large video description dataset for bridging video and language[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 5288-5296.
[20]	POTAPOV D, DOUZE M, HARCHAOUI Z, et al. Category-specific video summarization[C]//European Conference on Computer Vision. Berlin: Springer, 2014: 540-555.
[21]	GONG B, CHAO W L, GRAUMAN K, et al. Diverse sequential subset selection for supervised video summarization[C]//Advances in Neural Information Processing Systems. New York: Curran Associates, 2014: 2069-2077.
[22]	SONG Y, VALLMITJANA J, STENT A, et al. TVSum: Summarizing web videos using titles[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2015: 5179-5187.
[23]	GYGLI M, GRABNER H, RIEMENSCHNEIDER H, et al. Creating summaries from user videos[C]//European Conference on Computer Vision. Berlin: Springer, 2014: 505-520.
[24]	JUNG Y, CHO D, KIM D, et al. Discriminative feature learning for unsupervised video summarization[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2019: 8537-8544.