Journal of Beijing University of Aeronautics and Astronautics ›› 2021, Vol. 47 ›› Issue (3): 650-657. doi: 10.13700/j.bh.1001-5965.2020.0447


Video summarization by learning semantic information

HUA Rui, WU Xinxiao, ZHAO Wentian   

  1. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
  • Received: 2020-08-24  Published: 2021-04-08
  • Corresponding author: WU Xinxiao, E-mail: wuxinxiao@bit.edu.cn
  • About the authors: HUA Rui, female, master's student; research interest: video summarization. WU Xinxiao, female, Ph.D., associate professor and doctoral supervisor; research interests: vision and language, video content understanding, and machine learning. ZHAO Wentian, male, Ph.D. candidate; research interest: image and video caption generation.
  • Supported by: National Natural Science Foundation of China (61673062, 62072041)


Abstract: Video summarization aims to generate a short and compact summary that represents the main content of the original video. Existing methods, however, emphasize the representativeness and diversity of the selected frames while paying little attention to semantic information. To fully exploit the semantic information of video content, we propose a novel video summarization model that learns a visual-semantic embedding space so that the video features carry rich semantic information, allowing the model to simultaneously generate a video summary and a text summary describing the original video. The model consists of three modules: a frame-level score weighting module, which combines convolutional layers and fully connected layers to predict frame-level importance scores; a visual-semantic embedding module, which maps the visual and textual features into a common embedding space and draws them close to each other so that the two modalities reinforce one another; and a video caption generation module, which produces a video summary with semantic information by minimizing the distance between the caption generated for the video summary and the manually annotated text of the original video. At test time, the model yields a short text summary as a by-product of the video summary, which helps people understand the video content more intuitively. Experiments on the SumMe and TVSum datasets show that, by fusing semantic information, the proposed model outperforms existing state-of-the-art methods, improving F-score by 0.5% and 1.6%, respectively.
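The interaction of the score-weighting and embedding modules can be sketched at a high level. The snippet below is an illustrative toy, not the authors' implementation (the paper uses convolutional and fully connected layers plus an LSTM captioner): frame features are pooled into a single video vector by normalized importance scores, and a max-margin triplet loss pulls that vector toward its paired text embedding in the shared space. The feature dimensions, margin value, and function names are all assumptions.

```python
import math
import random

def l2_normalize(vec, eps=1e-8):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) + eps
    return [x / norm for x in vec]

def weighted_video_embedding(frame_feats, frame_scores):
    """Pool per-frame features into one video vector using frame-level
    importance scores (toy stand-in for the score-weighting module)."""
    total = sum(frame_scores) + 1e-8
    weights = [s / total for s in frame_scores]
    dim = len(frame_feats[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frame_feats))
        for d in range(dim)
    ]
    return l2_normalize(pooled)

def triplet_margin_loss(video_emb, text_pos, text_neg, margin=0.2):
    """Max-margin ranking loss: pull the video embedding toward its paired
    text embedding and push it away from a mismatched one."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return max(0.0, margin - dot(video_emb, text_pos) + dot(video_emb, text_neg))

# Toy demo: an 8-frame clip with random 16-dim frame features.
random.seed(0)
frames = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
scores = [random.random() for _ in range(8)]          # frame importance scores
video = weighted_video_embedding(frames, scores)
text_pos = l2_normalize([random.gauss(0, 1) for _ in range(16)])
text_neg = l2_normalize([random.gauss(0, 1) for _ in range(16)])
print(triplet_margin_loss(video, text_pos, text_neg))  # non-negative loss
```

In a trained system the frame scores would come from the network rather than being random, and minimizing this loss over matched video-text pairs is what makes the pooled video features semantically meaningful.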

Key words: video summarization, visual-semantic embedding space, video captioning, video key frame, Long Short-Term Memory (LSTM) model
