Citation: | ZHANG Lijuan, CUI Tianshu, JING Peiguang, et al. Deep multimodal feature fusion for micro-video classification[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 478-485. doi: 10.13700/j.bh.1001-5965.2020.0457 (in Chinese) |
Micro-video has become one of the most representative media forms of the new media era. Its short duration and heavy editing make traditional video classification models poorly suited to the micro-video classification task. Motivated by these characteristics, a micro-video classification algorithm based on deep multimodal feature fusion is proposed. The algorithm feeds the visual and acoustic modal features into a domain separation network, which divides the feature space into a shared domain common to all modalities and private domains specific to the acoustic and visual modalities, respectively. By optimizing the domain separation network, both the differences and the similarities among modal features are preserved to the greatest possible extent. Experiments on a public micro-video classification dataset show that the proposed algorithm effectively reduces redundancy in feature fusion and raises the average classification accuracy to 0.813.
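The shared/private split described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear encoders, dimensions, and loss weights below are illustrative assumptions, and the two objectives follow the standard domain-separation formulation (a soft-orthogonality "difference" loss that pushes each modality's private code away from the shared subspace, and a "similarity" loss that pulls the two modalities' shared codes together).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 128-d visual/acoustic features, 32-d subspaces, 16 samples.
d_in, d_sub, n = 128, 32, 16

# One shared encoder applied to both modalities, plus one private encoder
# per modality -- all plain linear projections in this sketch.
W_shared = rng.standard_normal((d_in, d_sub)) * 0.1
W_priv_v = rng.standard_normal((d_in, d_sub)) * 0.1
W_priv_a = rng.standard_normal((d_in, d_sub)) * 0.1

visual = rng.standard_normal((n, d_in))    # visual modal features
acoustic = rng.standard_normal((n, d_in))  # acoustic modal features

# Shared-domain and private-domain codes for each modality.
s_v, s_a = visual @ W_shared, acoustic @ W_shared
p_v, p_a = visual @ W_priv_v, acoustic @ W_priv_a

def difference_loss(shared, private):
    """Squared Frobenius norm of the cross-correlation between shared and
    private codes; minimizing it keeps the private subspace distinct."""
    return np.sum((shared.T @ private) ** 2)

def similarity_loss(a, b):
    """Mean squared distance between the modalities' shared codes;
    minimizing it aligns them in the common subspace."""
    return np.mean((a - b) ** 2)

# Combined separation objective (classification loss omitted here).
loss = (difference_loss(s_v, p_v) + difference_loss(s_a, p_a)
        + similarity_loss(s_v, s_a))

# The fused representation passed to the classifier would concatenate
# the shared and private codes, reducing cross-modal redundancy.
fused = np.concatenate([s_v, s_a, p_v, p_a], axis=1)
```

In a trained model the encoders would be deep networks and the losses would be minimized jointly with the classification objective; the sketch only shows how the separation terms are computed.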