Journal of Beijing University of Aeronautics and Astronautics, 2021, Vol. 47, Issue (3): 478-485. doi: 10.13700/j.bh.1001-5965.2020.0457

Deep multimodal feature fusion for micro-video classification

ZHANG Lijuan, CUI Tianshu, JING Peiguang, SU Yuting

  1. School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
  • Received: 2020-08-24 Published: 2021-04-08
  • Corresponding author: JING Peiguang E-mail: pgjing@tju.edu.cn
  • About the authors: ZHANG Lijuan, female, master's student; main research interest: multimedia information processing. CUI Tianshu, male, master's student; main research interest: multimedia information processing. JING Peiguang, male, Ph.D., associate professor, master's supervisor; main research interests: multimedia computing, machine learning, high-order data analysis. SU Yuting, male, Ph.D., professor, doctoral supervisor; main research interests: multimedia computing, machine learning.
  • Supported by:
    National Natural Science Foundation of China (61802277); China Postdoctoral Science Foundation (2019M651038)

Abstract: Micro-video has become one of the most representative products of the new media era. Its short duration and heavy editing make traditional video classification models ill-suited to the micro-video classification task. To address the characteristics of this problem, a micro-video classification algorithm based on deep multimodal feature fusion is proposed. The algorithm feeds visual and acoustic modal information into a domain separation network, which divides the entire feature space into a shared domain common to all modalities and private domains unique to the acoustic and visual modalities, respectively. Optimizing the domain separation network preserves the differences and similarities among the modal features to the greatest extent. Experiments on a public micro-video classification dataset show that the proposed algorithm effectively reduces redundancy in feature fusion and raises the average classification accuracy to 0.813.

Key words: micro-video, multimodal learning, deep network, classification, feature space
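
The shared/private split described in the abstract can be illustrated with a small sketch. The abstract does not give the network architecture or loss functions, so everything here is an assumption for illustration only: the encoders are single linear maps, the similarity term is a mean squared distance between the two modalities' shared features, and the difference term is the squared Frobenius norm of the shared-private cross-correlation (a form commonly used in domain separation networks). All names (`linear`, `similarity_loss`, `difference_loss`, `fuse`) and all dimensions are hypothetical and not taken from the paper.

```python
import random


def linear(x, w):
    """Project vector x with weight matrix w (one row per output dimension)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def init_weights(d_out, d_in, rng):
    """Small random weights; a stand-in for a trained encoder (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]


def similarity_loss(shared_a, shared_b):
    """Mean squared distance between two batches of shared-domain features.

    Assumed form: pulls the shared representations of both modalities together.
    """
    total = 0.0
    for a, b in zip(shared_a, shared_b):
        total += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return total / len(shared_a)


def difference_loss(shared, private):
    """Squared Frobenius norm of shared^T private, summed over the batch.

    Assumed form: it is zero when every shared dimension is uncorrelated with
    every private dimension, which pushes the two subspaces apart.
    """
    d_s, d_p = len(shared[0]), len(private[0])
    total = 0.0
    for i in range(d_s):
        for j in range(d_p):
            corr = sum(s[i] * p[j] for s, p in zip(shared, private))
            total += corr ** 2
    return total


def fuse(shared_v, shared_a, private_v, private_a):
    """Fuse one sample: averaged shared part followed by both private parts."""
    shared = [(v + a) / 2 for v, a in zip(shared_v, shared_a)]
    return shared + private_v + private_a


# Toy data: hypothetical feature sizes, not values from the paper.
rng = random.Random(0)
BATCH, D_VIS, D_AUD, D_SHARED, D_PRIVATE = 4, 8, 6, 3, 2
visual = [[rng.gauss(0.0, 1.0) for _ in range(D_VIS)] for _ in range(BATCH)]
audio = [[rng.gauss(0.0, 1.0) for _ in range(D_AUD)] for _ in range(BATCH)]

w_shared_v = init_weights(D_SHARED, D_VIS, rng)
w_shared_a = init_weights(D_SHARED, D_AUD, rng)
w_private_v = init_weights(D_PRIVATE, D_VIS, rng)
w_private_a = init_weights(D_PRIVATE, D_AUD, rng)

shared_v = [linear(x, w_shared_v) for x in visual]
shared_a = [linear(x, w_shared_a) for x in audio]
private_v = [linear(x, w_private_v) for x in visual]
private_a = [linear(x, w_private_a) for x in audio]

# Objective a trainer would minimise alongside the classification loss.
loss = (
    similarity_loss(shared_v, shared_a)
    + difference_loss(shared_v, private_v)
    + difference_loss(shared_a, private_a)
)

# One fused vector per micro-video, ready for a downstream classifier.
fused = [
    fuse(sv, sa, pv, pa)
    for sv, sa, pv, pa in zip(shared_v, shared_a, private_v, private_a)
]
```

Averaging the two shared features before concatenation is one way to avoid carrying the common information twice, which is the kind of fusion redundancy the abstract refers to; the paper itself may fuse the features differently.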
