Abstract: This paper addresses how to exploit the target semantics obtained by image semantic segmentation in complex scenes in order to refine the scale of a target tracker's output. An attention-optimized image semantic segmentation network is designed that improves both the tracker's output and its feature input, and that can be used as a plug-and-play module with a variety of tracking algorithms. The segmentation mask is used to obtain the target's rotated bounding box; the rotated and axis-aligned bounding boxes are then used to denoise the features at the tracker's input stage, reducing the influence of background noise on the tracker's discrimination. The network structure, its training, the calibration of the rotated bounding box, and the denoising of the tracker's input features are discussed in turn. Comparative experiments on the public datasets OTB100, VOT2016, and VOT2018 verify the accuracy and robustness of the target motion model in optimizing target scale during tracking.
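The two steps named in the abstract, deriving a rotated bounding box from a segmentation mask and suppressing background in the tracker's input features, can be illustrated with a minimal sketch. It assumes a binary mask and a single-channel feature map given as nested lists. The PCA-based box fitting, the function names `oriented_box_from_mask` and `suppress_outside_box`, and the `margin` parameter are illustrative assumptions, not the method of the paper.

```python
import math


def oriented_box_from_mask(mask):
    """Fit a PCA-aligned box to the foreground pixels of a binary mask.

    Returns (cx, cy, width, height, angle_rad), measured between pixel
    centres, or None if the mask is empty. Illustrative only: this is a
    PCA fit, not a minimum-area rectangle.
    """
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pts:
        return None
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    # Second-order central moments of the foreground coordinates.
    sxx = sum((x - mx) ** 2 for x, _ in pts) / n
    syy = sum((y - my) ** 2 for _, y in pts) / n
    sxy = sum((x - mx) * (y - my) for x, y in pts) / n
    # Orientation of the principal axis of the 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(angle), math.sin(angle)
    # Project every pixel onto the principal axis (u) and its normal (v).
    us = [(x - mx) * c + (y - my) * s for x, y in pts]
    vs = [-(x - mx) * s + (y - my) * c for x, y in pts]
    u_min, u_max, v_min, v_max = min(us), max(us), min(vs), max(vs)
    # Box centre: midpoint of the extents, mapped back to image coordinates.
    uc, vc = (u_min + u_max) / 2.0, (v_min + v_max) / 2.0
    cx = mx + uc * c - vc * s
    cy = my + uc * s + vc * c
    return cx, cy, u_max - u_min, v_max - v_min, angle


def suppress_outside_box(feature, box, margin=0.5):
    """Zero every feature value whose pixel centre lies outside the rotated box.

    `margin` (in pixels, a hypothetical parameter) widens the box slightly
    so that pixels on the boundary are kept.
    """
    cx, cy, w, h, angle = box
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for y, row in enumerate(feature):
        new_row = []
        for x, value in enumerate(row):
            u = (x - cx) * c + (y - cy) * s
            v = -(x - cx) * s + (y - cy) * c
            inside = abs(u) <= w / 2.0 + margin and abs(v) <= h / 2.0 + margin
            new_row.append(value if inside else 0.0)
        out.append(new_row)
    return out


# Usage: a diagonal target in a 5x5 patch.
demo_mask = [[1 if x == y else 0 for x in range(5)] for y in range(5)]
demo_box = oriented_box_from_mask(demo_mask)
demo_feature = suppress_outside_box([[1.0] * 5 for _ in range(5)], demo_box)
```

In this toy case an axis-aligned box around the diagonal target would cover the whole patch, while the rotated box keeps only the pixels along the diagonal. This is the kind of background reduction the abstract attributes to the rotated bounding box.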
Table 1. Performance of different bounding-box calibration strategies on VOT2016

Table 2. Performance comparison on the VOT2016 and VOT2018 video sequence benchmarks

Algorithm        Accuracy              Robustness            EAO
                 VOT2016   VOT2018     VOT2016   VOT2018     VOT2016   VOT2018
ECO-S            0.5295    0.5405      16.5817   7.2398      0.3293    0.4083
ECO[27]          0.4847    0.4978      15.0437   13.5112     0.3089    0.3077
SiamMask[5]      0.5470    0.5337      17.9393   7.4410      0.3251    0.4016
SiamRPN-S        0.5702    0.5515      19.2720   10.2551     0.3206    0.3727
SiamRPN[8]       0.5386    0.5399      21.0817   14.2040     0.2579    0.3177
DeepSRDCF-S      0.5072    0.5263      19.5438   17.4695     0.2773    0.2799
DeepSRDCF[33]    0.5220    0.5016      20.3462   23.9644     0.2756    0.2282

Table 3. Comparison of computational overhead on OTB100, VOT2016, and VOT2018
[1] DING W Z, XU Q J, LIU S Y, et al. SAMF: a self-adaptive protein modeling framework[J]. Bioinformatics, 2021, 37(22): 4075-4082.
[2] TANG L N, XU X C, WANG X, et al. An automatic object attention likelihood map correlation filter for visual tracking[C]//Proceedings of the IEEE 11th Asia-Pacific Conference on Antennas and Propagation. Piscataway: IEEE Press, 2024: 1-2.
[3] LI B L, WANG Y, XU Y M, et al. DSST: a dual student model guided student–teacher framework for semi-supervised medical image segmentation[J]. Biomedical Signal Processing and Control, 2024, 90: 105890.
[4] GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2014: 580-587.
[5] HU W M, WANG Q, ZHANG L, et al. SiamMask: a framework for fast online object tracking and segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3072-3089.
[6] LEE S, KIM S. Comparative analysis of IR-IR image matching applying the deep learning-based template matching techniques[C]//Proceedings of the 23rd International Conference on Control, Automation and Systems. Piscataway: IEEE Press, 2023: 441-444.
[7] HELD D, THRUN S, SAVARESE S. Learning to track at 100 FPS with deep regression networks[C]//Proceedings of the Computer Vision-ECCV. Berlin: Springer, 2016: 749-765.
[8] WANG S Q, QIAN K, SHEN J L, et al. AD-SiamRPN: anti-deformation object tracking via an improved siamese region proposal network on hyperspectral videos[J]. Remote Sensing, 2023, 15(7): 1731.
[9] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[10] HARIHARAN B, ARBELÁEZ P, GIRSHICK R, et al. Simultaneous detection and segmentation[C]//Proceedings of the Computer Vision-ECCV. Berlin: Springer, 2014: 297-312.
[11] PINHEIRO P O, COLLOBERT R, DOLLÁR P. Learning to segment object candidates[C]//Proceedings of the 29th International Conference on Neural Information Processing Systems. New York: ACM, 2015(2): 1990-1998.
[12] DAI J F, HE K M, SUN J. Instance-aware semantic segmentation via multi-task network cascades[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 3150-3158.
[13] HARIHARAN B, ARBELÁEZ P, GIRSHICK R, et al. Hypercolumns for object segmentation and fine-grained localization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2015: 447-456.
[14] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[15] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10)[2024-02-01]. https://arxiv.org/abs/1409.1556.
[16] SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2015: 1-9.
[17] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
[18] SHELHAMER E, LONG J, DARRELL T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence. Piscataway: IEEE Press, 2016: 640-651.
[19] XIE S N, TU Z W. Holistically-nested edge detection[J]. International Journal of Computer Vision, 2017, 125(1): 3-18.
[20] SERMANET P, KAVUKCUOGLU K, CHINTALA S, et al. Pedestrian detection with unsupervised multi-stage feature learning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2013: 3626-3633.
[21] KOONCE B. ResNet 50[C]//Proceedings of the Convolutional Neural Networks with Swift for Tensorflow. Berkeley: Apress, 2021: 63-72.
[22] HAN F, JIANG S K, WU J M, et al. Real-time object tracking in the wild with siamese network[J]. Multimedia Tools and Applications, 2023, 82(16): 24327-24343.
[23] CAO D, DAI R H, WANG J, et al. Fast visual tracking with squeeze and excitation region proposal network[J]. Human-centric Computing and Information Sciences, 2023, 13(7): 20.
[24] WANG Q, GAO J, XING J L, et al. DCFNet: discriminant correlation filters network for visual tracking[EB/OL]. (2017-04-13)[2024-02-01]. https://arxiv.org/abs/1704.04057.
[25] ZHANG H Y, LIU G X, ZHANG Y, et al. Robust multi-model visual tracking with distractor-aware template-coupled correlation filters joint learning[J]. IEEE Transactions on Multimedia, 2024, 26: 1813-1828.
[26] FANG H, LIAO G S, LIU Y J, et al. Shadow-assisted moving target tracking based on multidiscriminant correlation filters network in video SAR[J]. IEEE Geoscience and Remote Sensing Letters, 2023, 20: 4006205.
[27] DANELLJAN M, BHAT G, SHAHBAZ KHAN F, et al. ECO: efficient convolution operators for tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 6638-6646.
[28] RAHMAN M M. Target focused shallow transformer framework for efficient visual tracking[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(21): 23409-23410.
[29] FU J Y, LIANG Q F, XIE Q S, et al. Object tracking based on foreground adaptive bounding box and motion state redetection[C]//Proceedings of the 3rd International Conference on Artificial Intelligence and Computer Engineering. Bellingham: SPIE, 2023: 155.
[30] HUANG Y J, CHEN K, WANG Z Y, et al. A dense pedestrian tracking method based on fusion features under multi-vision[J]. Journal of Beijing University of Aeronautics and Astronautics, 2025, 51(7): 2513-2525 (in Chinese).
[31] CHEN K, SONG X, YUAN H T, et al. Fully convolutional encoder-decoder with an attention mechanism for practical pedestrian trajectory prediction[J]. IEEE Transactions on Intelligent Transportation Systems, 2022, 23(11): 20046-20060.
[32] CHEN K, ZHU H H, TANG D B, et al. Future pedestrian location prediction in first-person videos for autonomous vehicles and social robots[J]. Image and Vision Computing, 2023, 134: 104671.
[33] DANELLJAN M, HAGER G, SHAHBAZ KHAN F, et al. Convolutional features for correlation filter based visual tracking[C]//Proceedings of the IEEE International Conference on Computer Vision Workshops. Piscataway: IEEE Press, 2015: 58-66.

