A multimodal sentiment analysis based on audio and video features optimization and cross-modal Transformer

LIN Yishan¹, ZUO Jing¹, LU Shuhua¹,²

doi: 10.13700/j.bh.1001-5965.2024.0247

1. College of Information and Cyber Security, People's Public Security University of China, Beijing 102600, China
2. Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 102600, China

Funds:

Double First-Class Innovation Research Project for People's Public Security University of China (2023SYL08)

Corresponding author: E-mail: lushuhua@ppsuc.edu.cn

CLC number: TP391

Publication history
- Received: 2024-04-23
- Accepted: 2024-07-05
- Published online: 2024-08-15
- Issue publication date: 2026-06-30

Abstract:
To address the low quality of audio and video modal features and the insufficient interaction between modalities in multimodal sentiment analysis, a method based on audio and video feature optimization and a cross-modal Transformer (CMT) is proposed. First, an audio and video feature optimization mechanism (AVFOM) is designed; through synergy with textual features, it increases the density of sentiment information in the audio and video features and thereby improves their quality. Second, a CMT structure with text as the dominant modality is designed to achieve full pairwise interaction between the text-audio and text-video modalities and to learn information that is consistent across modalities. In addition, a label generation method based on a self-supervised learning strategy is introduced to perform unimodal sentiment prediction tasks and learn the characteristics specific to each modality. Extensive experiments on two public datasets, CMU-MOSI and CMU-MOSEI, show that the proposed method outperforms many current state-of-the-art methods and effectively improves the accuracy of multimodal sentiment analysis.

Keywords:
- multimodal
- sentiment analysis
- Transformer
- self-supervised learning
- audio and video features optimization
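
The abstract describes the CMT as a text-dominant structure that lets text interact pairwise with audio and with video, but this page does not give the layer equations. The following is a minimal sketch of one common way to realize such an interaction, assuming single-head scaled dot-product attention in which text supplies the queries and audio supplies the keys and values. The function name, dimensions, and random weights are hypothetical and are not taken from the paper, whose setup reports 16 attention heads.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(text, other, w_q, w_k, w_v):
    """Text-queried attention over another modality (assumed design).

    text:  (T_text, d_text) sequence of text features
    other: (T_other, d_other) sequence of audio or video features
    Returns the attended features (T_text, d_model) and the
    attention weights (T_text, T_other).
    """
    q = text @ w_q                              # queries come from text
    k = other @ w_k                             # keys come from the other modality
    v = other @ w_v                             # values come from the other modality
    scores = q @ k.T / np.sqrt(q.shape[-1])     # scaled dot-product scores
    weights = softmax(scores, axis=-1)          # each text step attends over all audio steps
    return weights @ v, weights


# Hypothetical toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
T_text, T_audio, d_text, d_audio, d_model = 6, 10, 16, 8, 12
text = rng.normal(size=(T_text, d_text))
audio = rng.normal(size=(T_audio, d_audio))
w_q = rng.normal(size=(d_text, d_model)) * 0.1
w_k = rng.normal(size=(d_audio, d_model)) * 0.1
w_v = rng.normal(size=(d_audio, d_model)) * 0.1

fused, weights = cross_modal_attention(text, audio, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Under this assumption the output keeps the text sequence length, so a text-video branch built the same way would produce features that align step by step with the text-audio branch.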

Figure 1. The overall structure of the proposed model

Figure 2. Audio and video features optimization mechanism

Figure 3. Cross-modal Transformer

Figure 4. Feature visualization

Table 1. Experimental setup

Dataset	Learning rate	Batch size	Attention heads	Transformer layers
CMU-MOSI^[27]	6.8×10⁻⁵	32	16	1
CMU-MOSEI^[28]	9.0×10⁻⁶	32	16	1

Table 2. Comparison with baseline models on the CMU-MOSI dataset

Model	MAE	Corr	Acc-2/%	Acc-7/%	F₁/%
TFN₂^[15]	0.901	0.698	*/80.8	34.90	*/80.7
MulT₁^[33]	0.871	0.698	*/83.0	40.00	*/82.8
Self-MM^[32]	0.713	0.798	84.0/85.98	*	84.42/85.95
MISA^[34]	0.783	0.761	81.8/83.4	42.30	81.7/83.6
MAG-Bert₂^[35]	0.731	0.798	82.5/84.3	*	82.6/84.3
MTSA^[36]	0.696	0.806	*/86.8	46.40	*/86.8
TETFN^[37]	0.717	0.800	84.05/86.10	*	83.83/86.07
MTAMW^[38]	0.712	0.794	84.40/86.59	46.84	84.20/86.46
CRNet^[39]	0.712	0.797	*/86.4	47.40	*/86.4
FRDIN^[40]	0.682	0.813	85.8/87.4	46.59	85.3/87.5
MIBSA^[41]	0.728	0.798	*/87.00	43.10	*/87.20
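Ours	0.592	0.862	86.90/89.11	50.30	86.96/89.12
Note: Results for MulT₁ are taken from AOBERT^[20]; results for TFN₂ and MAG-Bert₂ are taken from VLP2MSA^[42]; all other results are from the original papers. "*" indicates that the original paper did not report the result. For Acc-2 and F₁, the value to the left of "/" is computed under the negative/non-negative setting and the value to the right under the negative/positive setting. The best values are shown in bold in the original table.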
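
The note above distinguishes two Acc-2 settings without defining them on this page. The sketch below shows how these metrics are commonly computed in prior work on CMU-MOSI, assuming real-valued sentiment labels in [-3, 3]: negative/non-negative splits at zero and keeps every sample, negative/positive drops samples whose label is exactly zero, and Acc-7 rounds clipped values to the nearest integer class. These conventions, the function names, and the toy data are assumptions for illustration and are not confirmed by the source.

```python
def mae(preds, labels):
    """Mean absolute error between predictions and labels."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def acc2_neg_nonneg(preds, labels):
    """Binary accuracy, negative vs. non-negative (zero counts as non-negative)."""
    hits = sum((p >= 0) == (y >= 0) for p, y in zip(preds, labels))
    return hits / len(labels)


def acc2_neg_pos(preds, labels):
    """Binary accuracy, negative vs. positive (samples labeled exactly 0 are excluded)."""
    pairs = [(p, y) for p, y in zip(preds, labels) if y != 0]
    hits = sum((p > 0) == (y > 0) for p, y in pairs)
    return hits / len(pairs)


def acc7(preds, labels):
    """Seven-class accuracy after clipping to [-3, 3] and rounding to an integer."""
    def to_class(v):
        return round(max(-3.0, min(3.0, v)))
    hits = sum(to_class(p) == to_class(y) for p, y in zip(preds, labels))
    return hits / len(labels)


# Hypothetical toy data, not taken from either dataset.
labels = [-2.0, -0.4, 0.0, 0.0, 1.2, 2.6]
preds = [-1.6, 0.3, -0.2, 0.1, 0.8, 2.9]

print("MAE:", mae(preds, labels))
print("Acc-2 (negative/non-negative):", acc2_neg_nonneg(preds, labels))
print("Acc-2 (negative/positive):", acc2_neg_pos(preds, labels))
print("Acc-7:", acc7(preds, labels))
```

On the toy data the two Acc-2 values differ because the two zero-labeled samples are counted in one setting and dropped in the other, which is why the tables report both.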

Table 3. Comparison with baseline models on the CMU-MOSEI dataset

Model	MAE	Corr	Acc-2/%	Acc-7/%	F₁/%
TFN₂^[15]	0.593	0.700	*/82.50	50.20	*/82.10
MulT₁^[33]	0.580	0.703	*/82.50	51.80	*/82.30
Self-MM^[32]	0.530	0.765	82.81/85.17	*	82.53/85.30
MISA^[34]	0.555	0.756	83.60/85.50	52.20	83.80/85.30
MAG-Bert₂^[35]	0.543	0.755	82.51/84.82	*	82.77/84.71
MTSA^[36]	0.541	0.774	*/85.50	52.90	*/85.30
TETFN^[37]	0.551	0.748	84.25/85.18	*	84.18/85.27
MTAMW^[38]	0.525	0.782	83.09/86.49	53.73	83.48/86.45
CRNet^[39]	0.541	0.771	*/86.20	53.80	*/86.10
FRDIN^[40]	0.525	0.778	83.30/86.30	54.40	83.70/86.20
MIBSA^[41]	0.568	0.753	*/86.70	52.40	*/85.80
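Ours	0.519	0.791	83.66/86.77	54.41	83.24/86.76
Note: Results for MulT₁ are taken from AOBERT^[20]; results for TFN₂ and MAG-Bert₂ are taken from VLP2MSA^[42]; all other results are from the original papers. "*" indicates that the original paper did not report the result. For Acc-2 and F₁, the value to the left of "/" is computed under the negative/non-negative setting and the value to the right under the negative/positive setting. The best values are shown in bold in the original table.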

Table 4. Ablation experiments on the CMU-MOSI dataset

Model	MAE	Corr	Acc-2/%	Acc-7/%	F₁/%
Ours	0.592	0.862	86.90/89.11	50.30	86.96/89.12
w/o SentiLARE	0.837	0.743	82.14/84.14	39.43	82.20/84.13
w/o AVFOM	0.617	0.856	85.86/88.16	48.51	85.92/88.17
w/o CMT	0.613	0.854	85.71/87.38	47.47	85.78/87.41
w/o UPT	0.613	0.855	86.01/88.16	48.07	86.12/88.22
w/o audio	0.618	0.854	86.16/88.01	46.28	86.23/88.03
w/o video	0.621	0.847	85.86/88.01	49.40	85.89/87.99
w/o text	1.426	0.012	44.70/45.98	15.60	49.83/48.46
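
One way to read the ablation table is as the change each removal causes relative to the full model. The short sketch below computes the MAE increase and the Acc-7 drop for every variant, using values copied from the table; the variable names are hypothetical and the script adds no new results.

```python
# (MAE, Acc-7 in %) for the full model and each ablated variant, copied from Table 4.
full = (0.592, 50.30)
variants = {
    "w/o SentiLARE": (0.837, 39.43),
    "w/o AVFOM": (0.617, 48.51),
    "w/o CMT": (0.613, 47.47),
    "w/o UPT": (0.613, 48.07),
    "w/o audio": (0.618, 46.28),
    "w/o video": (0.621, 49.40),
    "w/o text": (1.426, 15.60),
}

# Positive numbers mean the variant is worse than the full model:
# higher MAE (error) and lower Acc-7 (accuracy).
delta = {
    name: (mae_v - full[0], full[1] - acc7_v)
    for name, (mae_v, acc7_v) in variants.items()
}

# List variants from the largest to the smallest MAE increase.
for name, (d_mae, d_acc7) in sorted(delta.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:14s} MAE +{d_mae:.3f}  Acc-7 -{d_acc7:.2f}")
```

Every variant comes out worse than the full model on both metrics, with the removal of text causing by far the largest change and the removal of SentiLARE the second largest.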

References (42)

[1]	GANDHI A, ADHVARYU K, PORIA S, et al. Multimodal sentiment analysis: a systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions[J]. Information Fusion, 2023, 91: 424-444.
[2]	HUANG C Q, ZHANG J L, WU X M, et al. TeFNA: text-centered fusion network with crossmodal attention for multimodal sentiment analysis[J]. Knowledge-Based Systems, 2023, 269: 110502.
[3]	CHEN C, HONG H S, GUO J, et al. Inter-intra modal representation augmentation with trimodal collaborative disentanglement network for multimodal sentiment analysis[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 1476-1488.
[4]	CHEN Q P, HUANG G M, WANG Y B. The weighted cross-modal attention mechanism with sentiment prediction auxiliary task for multimodal sentiment analysis[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 2689-2695.
[5]	YUAN Z Q, LI W, XU H, et al. Transformer-based feature reconstruction network for robust multimodal sentiment analysis[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 4400-4407.
[6]	SUN H, CHEN Y W, LIN L F. TensorFormer: a tensor-based multimodal Transformer for multimodal sentiment analysis and depression detection[J]. IEEE Transactions on Affective Computing, 2023, 14(4): 2776-2786.
[7]	SUN Z K, SARMA P, SETHARES W, et al. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(5): 8992-8999.
[8]	WU Y, LIN Z J, ZHAO Y Y, et al. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis[C]//Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Kerrville: Association for Computational Linguistics, 2021: 4730-4738.
[9]	HU G M, LIN T E, ZHAO Y, et al. UniMSE: towards unified multimodal sentiment analysis and emotion recognition[EB/OL]. (2022-11-21)[2024-01-10]. https://doi.org/10.48550/arXiv.2211.11256.
[10]	MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web[C]//Proceedings of the 13th International Conference on Multimodal Interfaces. New York: ACM, 2011: 169-176.
[11]	PORIA S, CAMBRIA E, GELBUKH A. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Kerrville: Association for Computational Linguistics, 2015: 2539-2544.
[12]	ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension[C]//Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. New York: ACM, 2018: 5642-5649.
[13]	KAMPMAN O, BAREZI E J, BERTERO D, et al. Investigating audio, visual, and text fusion methods for end-to-end automatic personality prediction[EB/OL]. (2018-05-16)[2024-01-12]. https://doi.org/10.48550/arXiv.1805.00705.
[14]	NOJAVANASGHARI B, GOPINATH D, KOUSHIK J, et al. Deep multimodal fusion for persuasiveness prediction[C]//Proceedings of the 18th ACM International Conference on Multimodal Interaction. New York: ACM, 2016: 284-288.
[15]	ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[EB/OL]. (2017-07-23)[2024-01-13]. https://doi.org/10.48550/arXiv.1707.07250.
[16]	LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors[EB/OL]. (2018-05-31)[2024-01-13]. https://doi.org/10.48550/arXiv.1806.00064.
[17]	MAJUMDER N, HAZARIKA D, GELBUKH A, et al. Multimodal sentiment analysis using hierarchical fusion with context modeling[J]. Knowledge-Based Systems, 2018, 161: 124-133.
[18]	MAI S J, HU H F, XING S L. Divide, conquer and combine: hierarchical feature fusion network with local and global perspectives for multimodal affective computing[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Kerrville: Association for Computational Linguistics, 2019: 481-492.
[19]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the Advances in Neural Information Processing Systems. Red Hook: Curran Associates, 2017: 5998-6008.
[20]	KIM K, PARK S. AOBERT: All-modalities-in-One BERT for multimodal sentiment analysis[J]. Information Fusion, 2023, 92(C): 37-45.
[21]	MA L Y, YAO Y, LIANG T, et al. Multi-scale cooperative multimodal Transformers for multimodal sentiment analysis in videos[EB/OL]. (2022-06-17)[2024-01-15]. https://doi.org/10.48550/arXiv.2206.07981.
[22]	XU M, LIANG F F, SU X Y, et al. CMJRT: cross-modal joint representation Transformer for multimodal sentiment analysis[J]. IEEE Access, 2022, 10: 131671-131679.
[23]	WANG L, PENG J J, ZHENG C Z, et al. A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning[J]. Information Processing & Management, 2024, 61(3): 103675.
[24]	FU Y P, ZHANG Z Y, YANG R D, et al. Hybrid cross-modal interaction learning for multimodal sentiment analysis[J]. Neurocomputing, 2024, 571: 127201.
[25]	KE P, JI H Z, LIU S Y, et al. SentiLARE: sentiment-aware language representation learning with linguistic knowledge[EB/OL]. (2020-09-24)[2024-01-17]. https://doi.org/10.48550/arXiv.1911.02493.
[26]	LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. (2019-07-26)[2024-01-17]. https://doi.org/10.48550/arXiv.1907.11692.
[27]	ZADEH A, ZELLERS R, PINCUS E, et al. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos[EB/OL]. (2016-08-12)[2024-01-18]. https://doi.org/10.48550/arXiv.1606.06259.
[28]	BAGHER ZADEH A, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Kerrville: Association for Computational Linguistics, 2018: 2236-2246.
[29]	DEGOTTEX G, KANE J, DRUGMAN T, et al. COVAREP: a collaborative voice analysis repository for speech technologies[C]//Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2014: 960-964.
[30]	CHEONG J H, JOLLY E, XIE T K, et al. Py-Feat: Python facial expression analysis toolbox[J]. Affective Science, 2023, 4(4): 781-796.
[31]	LIN H, ZHANG P L, LING J D, et al. PS-mixer: a polar-vector and strength-vector mixer model for multimodal sentiment analysis[J]. Information Processing & Management, 2023, 60(2): 103229.
[32]	YU W M, XU H, YUAN Z Q, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(12): 10790-10797.
[33]	TSAI Y H H, BAI S J, LIANG P P, et al. Multimodal Transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Kerrville: Association for Computational Linguistics, 2019: 6558-6569.
[34]	HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1122-1131.
[35]	RAHMAN W, HASAN M K, LEE S, et al. Integrating multimodal information in large pretrained Transformers[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Kerrville: Association for Computational Linguistics, 2020: 2359-2369.
[36]	YANG B, SHAO B, WU L J, et al. Multimodal sentiment analysis with unidirectional modality translation[J]. Neurocomputing, 2022, 467: 130-137.
[37]	WANG D, GUO X T, TIAN Y M, et al. TETFN: a text enhanced Transformer fusion network for multimodal sentiment analysis[J]. Pattern Recognition, 2023, 136: 109259.
[38]	WANG Y F, HE J H, WANG D, et al. Multimodal Transformer with adaptive modality weighting for multimodal sentiment analysis[J]. Neurocomputing, 2024, 572: 127181.
[39]	SHI H, PU Y Y, ZHAO Z P, et al. Co-space representation interaction network for multimodal sentiment analysis[J]. Knowledge-Based Systems, 2024, 283: 111149.
[40]	ZENG Y F, LI Z X, CHEN Z B, et al. A feature-based restoration dynamic interaction network for multimodal sentiment analysis[J]. Engineering Applications of Artificial Intelligence, 2024, 127: 107335.
[41]	LIU W, CAO S C, ZHANG S. Multimodal consistency-specificity fusion based on information bottleneck for sentiment analysis[J]. Journal of King Saud University-Computer and Information Sciences, 2024, 36(2): 101943.
[42]	YI G F, FAN C H, ZHU K, et al. VLP2MSA: expanding vision-language pre-training to multimodal sentiment analysis[J]. Knowledge-Based Systems, 2024, 283: 111136.