Deep multimodal feature fusion for micro-video classification

ZHANG Lijuan, CUI Tianshu, JING Peiguang, SU Yuting

Citation: ZHANG Lijuan, CUI Tianshu, JING Peiguang, et al. Deep multimodal feature fusion for micro-video classification[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 478-485. doi: 10.13700/j.bh.1001-5965.2020.0457 (in Chinese)

doi: 10.13700/j.bh.1001-5965.2020.0457
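    About the authors:

    ZHANG Lijuan   Female, master's student. Main research interest: multimedia information processing

    CUI Tianshu   Male, master's student. Main research interest: multimedia information processing

    JING Peiguang   Male, Ph.D., associate professor, master's supervisor. Main research interests: multimedia computing, machine learning, high-order data analysis

    SU Yuting   Male, Ph.D., professor, doctoral supervisor. Main research interests: multimedia computing, machine learning

    Corresponding author:

    JING Peiguang, E-mail: pgjing@tju.edu.cn

  • CLC number: TP181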

Funds: 

National Natural Science Foundation of China 61802277

China Postdoctoral Science Foundation 2019M651038

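  • Abstract:

    Micro-videos have become one of the most representative products of the new media era. Their inherent characteristics, such as short duration and heavy editing, make traditional video classification models poorly suited to the micro-video classification task. To address the characteristics of the general micro-video classification problem, this paper proposes a micro-video classification algorithm based on deep multimodal feature fusion. The proposed algorithm feeds visual-modality and audio-modality information into a domain separation network, which divides the whole feature space into a public domain shared by all modalities and private domains that are unique to the audio modality and the visual modality respectively. By optimizing the domain separation network, the differences and similarities among the features of the different modalities are preserved to the greatest extent. Experiments on a public micro-video classification dataset show that the proposed algorithm effectively reduces redundancy during feature fusion and raises the average precision of classification to 0.813.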
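The abstract does not give the loss functions or the fusion rule, so the following is only a minimal sketch of how a public/private split could be scored and fused. It assumes an orthogonality ("difference") penalty between the shared and private features of each modality, a squared-distance ("similarity") penalty between the shared features of the two modalities, and fusion by averaging the shared parts and concatenating the private parts. All function names, the unweighted sum of the losses, and the toy numbers are hypothetical.

```python
# Illustrative sketch: the loss forms and the fusion rule below are
# assumptions, not the paper's published formulation.

def difference_loss(shared, private):
    """Squared Frobenius norm of shared^T @ private over a batch.

    It is zero when every shared dimension is orthogonal to every private
    dimension across the batch, which discourages redundant features.
    """
    total = 0.0
    for i in range(len(shared[0])):
        for j in range(len(private[0])):
            dot = sum(s[i] * p[j] for s, p in zip(shared, private))
            total += dot * dot
    return total


def similarity_loss(shared_a, shared_b):
    """Mean squared distance between the shared features of two modalities."""
    sq = sum((x - y) ** 2
             for row_a, row_b in zip(shared_a, shared_b)
             for x, y in zip(row_a, row_b))
    return sq / len(shared_a)


def fuse(shared_visual, shared_audio, private_visual, private_audio):
    """Average the shared parts, then append both private parts per sample."""
    fused = []
    for sv, sa, pv, pa in zip(shared_visual, shared_audio,
                              private_visual, private_audio):
        common = [(x + y) / 2 for x, y in zip(sv, sa)]
        fused.append(common + list(pv) + list(pa))
    return fused


# Toy batch of two samples with 2-D features per branch (made-up numbers).
shared_visual = [[1.0, 0.0], [0.0, 0.0]]
shared_audio = [[1.0, 0.0], [0.0, 0.0]]
private_visual = [[0.0, 0.0], [0.0, 2.0]]
private_audio = [[0.0, 0.0], [3.0, 0.0]]

loss = (similarity_loss(shared_visual, shared_audio)
        + difference_loss(shared_visual, private_visual)
        + difference_loss(shared_audio, private_audio))
features = fuse(shared_visual, shared_audio, private_visual, private_audio)
```

In this toy batch the shared features of both modalities agree and are orthogonal to the private features, so every penalty term vanishes. In a trained network these terms would be added to the classification loss with tunable weights.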

     

  • Figure 1.  Micro-video classification network based on deep multimodal feature fusion

    Figure 2.  Influence of different values of parameters α, β, γ on micro-video classification task

    Table 1.  Performance comparison of micro-video classification in different networks

    Category      Algorithm           AR     AP     Micro-F1  Macro-F1
    Single-modal  C3D                 0.555  0.568  0.569     0.560
    Single-modal  R2+1D               0.622  0.643  0.651     0.631
    Single-modal  ResNet3D            0.669  0.692  0.695     0.678
    Single-modal  GoogleNet           0.679  0.709  0.719     0.691
    Single-modal  I3D                 0.706  0.727  0.733     0.715
    Single-modal  S3D-G               0.741  0.750  0.755     0.744
    Multimodal    C3D (multimodal)    0.612  0.623  0.632     0.617
    Multimodal    I3D (multimodal)    0.751  0.763  0.772     0.755
    Multimodal    TSN                 0.763  0.772  0.781     0.770
    Multimodal    CTSN                0.771  0.780  0.785     0.773
    Multimodal    SlowFast Network    0.780  0.782  0.794     0.781
    Multimodal    Proposed algorithm  0.782  0.795  0.813     0.789

    Table 2.  Performance comparison of micro-video feature classification in different networks

    Method                  AR     AP     Micro-F1  Macro-F1
    I3D (visual modality)   0.720  0.718  0.733     0.720
    I3D (audio modality)    0.328  0.319  0.341     0.328
    Early fusion            0.751  0.763  0.772     0.755
    Public-domain network   0.771  0.777  0.781     0.774
    Private-domain network  0.751  0.744  0.763     0.748
    Proposed algorithm      0.782  0.795  0.813     0.789
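The tables report Micro-F1 and Macro-F1 without stating how they are computed. As a rough illustration, the sketch below assumes the standard definitions for single-label classification: Micro-F1 pools true positives, false positives, and false negatives over all classes before computing F1, while Macro-F1 averages the per-class F1 scores. The function name and the toy labels are hypothetical.

```python
def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t, p, n):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy example with an imbalanced label set (made-up data).
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
micro, macro = micro_macro_f1(y_true, y_pred)
```

Under these definitions the frequent class dominates Micro-F1, whereas Macro-F1 weights each class equally, which is one reason the two columns can differ on an imbalanced dataset.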
Publication history
  • Received:  2020-08-24
  • Accepted:  2020-08-28
  • Published:  2021-03-20
