Abstract: RGB-thermal infrared (RGBT) object tracking aims to exploit the complementary strengths of visible and thermal infrared data to achieve robust tracking. Mainstream methods typically introduce modality weights to fuse multimodal information, but simply assigning a weight to each modality cannot fully exploit the complementary advantages of the RGB and thermal infrared modalities. To address this, a multimodal bidirectional information enhancement network for RGBT tracking (MBIENet) is proposed. A feature aggregation module is designed to aggregate modality-shared and modality-specific features for modeling the appearance of the target; a novel multimodal bidirectional modulation fusion module is proposed to effectively fuse the complementary information of the two modalities and reduce the impact of redundant and useless features on the tracker; and a lightweight channel-spatial attention module is proposed to adaptively adjust the contributions of the two modalities under different conditions. Experimental results on the GTOT, RGBT234, and LasHeR datasets show that the proposed tracker outperforms current mainstream trackers in both precision rate and success rate.
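To make the bidirectional modulation idea concrete, the following PyTorch sketch shows one plausible reading of it: each modality predicts FiLM-style per-channel scale and shift terms that gate the other modality's features before a 1×1 fusion convolution. This is a minimal illustration under assumed layer shapes and gating choices (the class name, 1×1 convolutions, and sigmoid gate are ours), not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): bidirectional FiLM-style
# modulation between RGB and thermal infrared (TIR) feature maps.
import torch
import torch.nn as nn

class BidirectionalModulationFusion(nn.Module):
    """Each modality predicts per-channel scale/shift terms that modulate the
    other modality's features before a 1x1 fusion convolution (assumed design)."""
    def __init__(self, channels: int):
        super().__init__()
        self.rgb_to_tir = nn.Conv2d(channels, 2 * channels, kernel_size=1)
        self.tir_to_rgb = nn.Conv2d(channels, 2 * channels, kernel_size=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_tir: torch.Tensor) -> torch.Tensor:
        g_t, b_t = self.rgb_to_tir(f_rgb).chunk(2, dim=1)  # RGB modulates TIR
        g_r, b_r = self.tir_to_rgb(f_tir).chunk(2, dim=1)  # TIR modulates RGB
        f_tir_mod = torch.sigmoid(g_t) * f_tir + b_t  # gate suppresses redundant channels
        f_rgb_mod = torch.sigmoid(g_r) * f_rgb + b_r
        return self.fuse(torch.cat([f_rgb_mod, f_tir_mod], dim=1))

# Example: fuse two 256-channel feature maps from a shared backbone.
fusion = BidirectionalModulationFusion(256)
fused = fusion(torch.randn(1, 256, 20, 20), torch.randn(1, 256, 20, 20))
```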
Table 1. Ablation experiment results

Module        | Precision rate          | Success rate
              | RGBT234  LasHeR  GTOT   | RGBT234  LasHeR  GTOT
MBIENet-FAM   | 0.818    0.477   0.892  | 0.568    0.348   0.707
MBIENet-MBMF  | 0.813    0.471   0.886  | 0.560    0.342   0.701
MBIENet-CSA   | 0.822    0.479   0.899  | 0.574    0.347   0.718
MBIENet       | 0.829    0.484   0.902  | 0.582    0.355   0.720

Table 2. Comparison of attention mechanisms

Module          | Precision rate   | Success rate     | Speed/(frame·s⁻¹)
                | GTOT    RGBT234  | GTOT    RGBT234  |
MBIENet Concat  | 0.896   0.822    | 0.709   0.574    | 1.8
MBIENet CBAM    | 0.897   0.824    | 0.714   0.576    | 1.38
MBIENet SE      | 0.899   0.825    | 0.718   0.580    | 1.6
MBIENet CSA     | 0.902   0.829    | 0.720   0.582    | 1.54
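The CSA variant compared above can be read as a lightweight serial combination of SE-style channel gating and a CBAM-style spatial gate. The sketch below is an illustrative approximation; the reduction ratio, 7×7 spatial kernel, and mean/max pooling statistics are assumptions, not the paper's module.

```python
# Illustrative sketch only: SE-style channel gate followed by a CBAM-style
# spatial gate; hyperparameters here are assumptions, not the paper's CSA.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel branch: global average pool, bottleneck, sigmoid gate.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial branch: 7x7 conv over per-pixel mean/max channel statistics.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)  # reweight channels
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial_gate(stats)  # reweight spatial locations

# Example: weight one modality's feature map before fusion so its contribution
# can adapt to the scene.
csa = ChannelSpatialAttention(256)
weighted = csa(torch.randn(1, 256, 20, 20))
```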