Volume 47 Issue 3
Mar. 2021
ZHANG Lijuan, CUI Tianshu, JING Peiguang, et al. Deep multimodal feature fusion for micro-video classification[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 478-485. doi: 10.13700/j.bh.1001-5965.2020.0457 (in Chinese)

Deep multimodal feature fusion for micro-video classification

doi: 10.13700/j.bh.1001-5965.2020.0457
Funds:

National Natural Science Foundation of China 61802277

China Postdoctoral Science Foundation 2019M651038

More Information
  • Corresponding author: JING Peiguang, E-mail: pgjing@tju.edu.cn
  • Received Date: 24 Aug 2020
  • Accepted Date: 28 Aug 2020
  • Publish Date: 20 Mar 2021
  • Abstract: Micro-video has become one of the most representative products of the new media era. Its short duration and heavy editing make traditional video classification models ill-suited to the micro-video classification task. Motivated by these characteristics, a micro-video classification algorithm based on deep multimodal feature fusion is proposed. The algorithm feeds visual and acoustic modal information into a domain separation network, which partitions the full feature space into a shared domain common to all modalities and private domains specific to the acoustic and visual modalities respectively. By optimizing the domain separation network, both the differences and the similarities among modal features are preserved to the greatest extent. Experiments on a public micro-video classification dataset show that the proposed algorithm effectively reduces redundancy in feature fusion and improves the average classification accuracy to 0.813.
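The shared/private partition of the feature space described in the abstract can be sketched as follows. This is an illustrative NumPy mock-up under assumed dimensions and random stand-in features, not the authors' implementation: the encoder weights, the mean-squared similarity loss on the shared codes, and the orthogonality penalty between private and shared codes are all assumptions chosen to mirror the stated idea of preserving both similarities and differences between modalities.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Linear encoder followed by a tanh nonlinearity (illustrative)."""
    return np.tanh(x @ W)

# Toy dimensions (assumed): d-dim modality features -> k-dim subspaces
d, k, n = 128, 32, 16                    # feature dim, subspace dim, batch size
visual = rng.standard_normal((n, d))     # stand-in visual features
audio = rng.standard_normal((n, d))      # stand-in acoustic features

W_shared = rng.standard_normal((d, k)) * 0.1   # shared-domain encoder
W_priv_v = rng.standard_normal((d, k)) * 0.1   # visual private encoder
W_priv_a = rng.standard_normal((d, k)) * 0.1   # acoustic private encoder

h_sv = encode(visual, W_shared)   # shared-domain code of the visual modality
h_sa = encode(audio, W_shared)    # shared-domain code of the acoustic modality
h_pv = encode(visual, W_priv_v)   # visual-only (private-domain) code
h_pa = encode(audio, W_priv_a)    # acoustic-only (private-domain) code

# Similarity objective: shared codes of the two modalities should align.
similarity_loss = np.mean((h_sv - h_sa) ** 2)

# Difference objective: each private code should be orthogonal to the
# corresponding shared code, so the private domain carries only
# modality-specific information.
def ortho_penalty(a, b):
    return np.linalg.norm(a.T @ b, "fro") ** 2

difference_loss = ortho_penalty(h_pv, h_sv) + ortho_penalty(h_pa, h_sa)

# Fused representation for the classifier: shared part plus both private parts.
fused = np.concatenate([h_sv + h_sa, h_pv, h_pa], axis=1)
print(fused.shape)  # (16, 96)
```

In a trained network both losses would be minimized jointly with the classification loss, so that concatenating the shared and private codes fuses the modalities with little redundancy between the two parts.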

     

