Abstract: RGB-thermal infrared (RGBT) object tracking aims to exploit the complementary strengths of visible and thermal infrared data to achieve robust tracking. Mainstream methods typically introduce modality weights to fuse multimodal information, but simply assigning a weight to each modality cannot fully exploit the complementary advantages of the RGB and thermal infrared modalities. To address this, a multimodal bidirectional information enhancement network for RGBT tracking (MBIENet) is proposed. A feature aggregation module is designed to aggregate modality-shared and modality-specific features for modeling the appearance of the target; a novel multimodal bidirectional modulation fusion module is proposed to effectively fuse the complementary information of the two modalities and reduce the impact of redundant and useless features on the tracker; and a lightweight channel-spatial attention module is proposed to adaptively adjust the contributions of the two modalities under different conditions. Experimental results on the GTOT, RGBT234, and LasHeR datasets show that the proposed tracker outperforms current mainstream trackers in both precision rate and success rate.
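To make the bidirectional modulation idea concrete, the following PyTorch sketch shows one plausible reading of it: each modality predicts FiLM-style per-channel scale and shift terms that gate the other modality's features before a 1×1 fusion convolution. This is a minimal illustration under assumed layer shapes and gating choices (the class name, 1×1 convolutions, and sigmoid gate are ours), not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): bidirectional FiLM-style
# modulation between RGB and thermal infrared (TIR) feature maps.
import torch
import torch.nn as nn

class BidirectionalModulationFusion(nn.Module):
    """Each modality predicts per-channel scale/shift terms that modulate the
    other modality's features before a 1x1 fusion convolution (assumed design)."""
    def __init__(self, channels: int):
        super().__init__()
        self.rgb_to_tir = nn.Conv2d(channels, 2 * channels, kernel_size=1)
        self.tir_to_rgb = nn.Conv2d(channels, 2 * channels, kernel_size=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_tir: torch.Tensor) -> torch.Tensor:
        g_t, b_t = self.rgb_to_tir(f_rgb).chunk(2, dim=1)  # RGB modulates TIR
        g_r, b_r = self.tir_to_rgb(f_tir).chunk(2, dim=1)  # TIR modulates RGB
        f_tir_mod = torch.sigmoid(g_t) * f_tir + b_t  # gate suppresses redundant channels
        f_rgb_mod = torch.sigmoid(g_r) * f_rgb + b_r
        return self.fuse(torch.cat([f_rgb_mod, f_tir_mod], dim=1))

# Example: fuse two 256-channel feature maps from a shared backbone.
fusion = BidirectionalModulationFusion(256)
fused = fusion(torch.randn(1, 256, 20, 20), torch.randn(1, 256, 20, 20))
```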
Table 1. Ablation experiment results

Module        | Precision rate          | Success rate
              | RGBT234  LasHeR  GTOT   | RGBT234  LasHeR  GTOT
MBIENet-FAM   | 0.818    0.477   0.892  | 0.568    0.348   0.707
MBIENet-MBMF  | 0.813    0.471   0.886  | 0.560    0.342   0.701
MBIENet-CSA   | 0.822    0.479   0.899  | 0.574    0.347   0.718
MBIENet       | 0.829    0.484   0.902  | 0.582    0.355   0.720

Table 2. Comparison of attention mechanisms

Module          | Precision rate   | Success rate     | Speed/(frame·s⁻¹)
                | GTOT    RGBT234  | GTOT    RGBT234  |
MBIENet Concat  | 0.896   0.822    | 0.709   0.574    | 1.8
MBIENet CBAM    | 0.897   0.824    | 0.714   0.576    | 1.38
MBIENet SE      | 0.899   0.825    | 0.718   0.580    | 1.6
MBIENet CSA     | 0.902   0.829    | 0.720   0.582    | 1.54
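The CSA variant compared above can be read as a lightweight serial combination of SE-style channel gating and a CBAM-style spatial gate. The sketch below is an illustrative approximation; the reduction ratio, 7×7 spatial kernel, and mean/max pooling statistics are assumptions, not the paper's module.

```python
# Illustrative sketch only: SE-style channel gate followed by a CBAM-style
# spatial gate; hyperparameters here are assumptions, not the paper's CSA.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel branch: global average pool, bottleneck, sigmoid gate.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial branch: 7x7 conv over per-pixel mean/max channel statistics.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)  # reweight channels
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial_gate(stats)  # reweight spatial locations

# Example: weight one modality's feature map before fusion so its contribution
# can adapt to the scene.
csa = ChannelSpatialAttention(256)
weighted = csa(torch.randn(1, 256, 20, 20))
```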