Adaptive algorithm for target tracking scale based on image semantic segmentation

陈凯; 赵晓冬; 黄煜杰; 王鹏飞; 雷一辰; 张彪

doi:10.13700/j.bh.1001-5965.2024.0197


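1. College of Mechanical and Electrical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
2. China Electronics Technology Group 28th Research Institute, Nanjing 210007, China

Funds:

National Natural Science Foundation of China (52202417); China Postdoctoral Science Foundation (2022TQ0155, 2022M721605); Open Project Program of State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (VRLAB2023A02); Young Elite Scientists Sponsorship Program by China Association for Science and Technology (2023QNRC001); Young Elite Scientists Sponsorship Program by Jiangsu Association for Science and Technology (JSTJ-2023-XH032)

Corresponding author: E-mail: xdzhao@nuaa.edu.cn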

CLC number: TP391.4
Publication history
- Received: 2024-04-07
- Accepted: 2024-05-24
- Published online: 2024-06-08
- Issue published: 2026-05-26


Abstract:
This paper addresses how to fully exploit the semantic information of a target, obtained through image semantic segmentation, in complex scenes, and how to use it to optimize the scale of a target tracker's output. To this end, an image semantic segmentation network optimized with an attention mechanism is designed. The network optimizes both the tracker's output and its feature input, and it can be used as a plug-and-play component with a variety of tracking algorithms. The semantic segmentation mask is used to obtain the rotated bounding box of the target, and the features at the tracker's input stage are denoised according to the rotated and non-rotated (axis-aligned) bounding boxes, which weakens the influence of background noise on the tracker's discriminator. The structure of the designed network, its training, the calibration of the target's rotated bounding box, and the denoising of the tracker's input features are discussed in turn. Comparative experiments on the public datasets OTB100, VOT2016 and VOT2018 verify the accuracy and robustness of the target motion model in optimizing the target scale during tracking.

Keywords: target tracking; semantic segmentation; attention mechanism; scale optimization; background denoising
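
The abstract describes deriving the target's box boundaries from a segmentation mask. The following is a minimal sketch of that idea, assuming NumPy and a 2D binary mask. The function names `min_max_box` and `pca_oriented_box` are hypothetical, and the PCA-aligned box is only a simple stand-in for a rotated box; it is not the calibration strategy evaluated in the paper and is not a minimum-area rectangle.

```python
from typing import Tuple

import numpy as np


def min_max_box(mask: np.ndarray) -> Tuple[int, int, int, int]:
    """Axis-aligned box (x_min, y_min, x_max, y_max), inclusive pixel indices."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("mask has no foreground pixels")
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def pca_oriented_box(mask: np.ndarray) -> np.ndarray:
    """Rotated box aligned with the principal axes of the foreground pixels.

    Returns a (4, 2) array of corner coordinates (x, y).
    Assumption: PCA alignment is used here purely for illustration.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("mask has no foreground pixels")
    points = np.stack([xs, ys], axis=1).astype(float)
    center = points.mean(axis=0)
    centered = points - center
    # Eigenvectors of the 2x2 covariance matrix give the principal directions.
    covariance = centered.T @ centered / len(points)
    _, axes = np.linalg.eigh(covariance)  # columns are orthonormal axes
    projected = centered @ axes  # pixel coordinates in the principal-axis frame
    lo, hi = projected.min(axis=0), projected.max(axis=0)
    corners = np.array(
        [[lo[0], lo[1]], [hi[0], lo[1]], [hi[0], hi[1]], [lo[0], hi[1]]]
    )
    # Map the corners back to image coordinates.
    return corners @ axes.T + center


if __name__ == "__main__":
    # A thin diagonal stripe: the axis-aligned box covers the whole square,
    # while the oriented box follows the stripe.
    stripe = np.eye(20, dtype=bool)
    print(min_max_box(stripe))
    print(pca_oriented_box(stripe))
```

For an elongated, tilted target the axis-aligned box encloses a large share of background, which is the motivation for comparing it with a rotated box.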


Figure 1. Overall flow chart of the proposed algorithm

Figure 2. Structure of the proposed image semantic segmentation network

Figure 3. Structure of the upsampling optimization module

Figure 4. Structure of the deep residual module with the feature correction module added

Figure 5. Comparison of three algorithms for generating bounding boxes from binary masks

Figure 6. Schematic diagram of three masks for denoising the target background

Figure 7. Output response of the discriminator of the correlation filter tracker ECO after denoising the target features with semantic expansion and with a cosine window

Figure 8. Success rate and accuracy on the OTB2015 benchmark

Figure 9. Qualitative results of ECO-S and SiamMask on two video sequences from the OTB100 dataset

Figure 10. Experimental comparison on the VOT2016 benchmark

Figure 11. Experimental comparison on the VOT2018 benchmark

Figure 12. Qualitative results of ECO-S and SiamMask on two video sequences from the VOT datasets
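
Figures 6 and 7 concern suppressing background in the tracker's input, either with a mask or with a cosine window. The sketch below illustrates both operations under stated assumptions: NumPy is used, the patch is an image or feature array of shape (H, W) or (H, W, C), and the helper names `box_mask`, `apply_mask` and `cosine_window` are hypothetical. It is not the paper's implementation; in particular, which three masks the paper compares is not specified here.

```python
import numpy as np


def box_mask(shape, box):
    """Boolean mask that is True inside an axis-aligned box.

    `box` is (x_min, y_min, x_max, y_max) with inclusive pixel indices.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros(shape, dtype=bool)
    mask[y0:y1 + 1, x0:x1 + 1] = True
    return mask


def apply_mask(patch, mask, fill=0.0):
    """Keep values where `mask` is True and replace the rest with `fill`."""
    patch = np.asarray(patch, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if patch.ndim == 3:
        mask = mask[..., None]  # broadcast the mask over the channel axis
    return np.where(mask, patch, fill)


def cosine_window(height, width):
    """2D Hann (cosine) window: 1 near the centre, falling to 0 at the border."""
    return np.outer(np.hanning(height), np.hanning(width))


if __name__ == "__main__":
    patch = np.ones((5, 5))
    hard = apply_mask(patch, box_mask(patch.shape, (1, 1, 3, 3)))
    soft = patch * cosine_window(*patch.shape)
    print(hard)
    print(soft)
```

The mask removes background pixels outright, whereas the cosine window only attenuates values by their distance from the patch centre, regardless of the target's actual shape.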

Table 1. Performance of different bounding-box calibration strategies on VOT2016

Algorithm	mIoU/%	mAP@0.5/%	mAP@0.7/%
SiamFC [6]	50.48	56.42	9.28
SiamRPN [8]	60.02	76.20	32.47
SiamMask-Min-max	65.05	82.99	43.09
SiamMask-MBR	67.15	85.42	50.86
SiamMask-Opt	71.68	90.77	60.47
ECO-Min-max	69.33	85.14	50.23
ECO-MBR	72.57	88.31	56.74
ECO-Opt	74.97	91.85	68.56
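
Table 1 reports overlap-based scores. As a rough illustration of how such scores are built from per-frame overlaps, the sketch below computes the intersection over union (IoU) of two boxes and summarizes a list of IoU values. It makes several assumptions: boxes are axis-aligned and given in continuous coordinates (rotated boxes would need polygon intersection), the names `box_iou` and `summarize_ious` are hypothetical, and the "share of frames at or above a threshold" is only one simple reading of a thresholded score, not the paper's evaluation protocol.

```python
import numpy as np


def box_iou(a, b):
    """IoU of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def summarize_ious(ious, thresholds=(0.5, 0.7)):
    """Mean IoU plus the share of frames whose IoU reaches each threshold."""
    ious = np.asarray(ious, dtype=float)
    summary = {"mIoU": float(ious.mean())}
    for t in thresholds:
        # Assumption: a frame counts as a hit when its IoU is >= t.
        summary[f"rate@{t}"] = float((ious >= t).mean())
    return summary


if __name__ == "__main__":
    predicted = [(0, 0, 2, 2), (1, 1, 4, 4)]
    ground_truth = [(1, 1, 3, 3), (1, 1, 4, 4)]
    per_frame = [box_iou(p, g) for p, g in zip(predicted, ground_truth)]
    print(summarize_ious(per_frame))
```

A stricter threshold penalizes loosely fitting boxes more heavily, which is consistent with the larger gaps between strategies in the mAP@0.7 column than in the mAP@0.5 column.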

Table 2. Performance comparison on VOT2016 and VOT2018 video sequence benchmarks

Algorithm	Accuracy (VOT2016)	Accuracy (VOT2018)	Robustness (VOT2016)	Robustness (VOT2018)	EAO (VOT2016)	EAO (VOT2018)
ECO-S	0.5295	0.5405	16.5817	7.2398	0.3293	0.4083
ECO [27]	0.4847	0.4978	15.0437	13.5112	0.3089	0.3077
SiamMask [5]	0.5470	0.5337	17.9393	7.4410	0.3251	0.4016
SiamRPN-S	0.5702	0.5515	19.2720	10.2551	0.3206	0.3727
SiamRPN [8]	0.5386	0.5399	21.0817	14.2040	0.2579	0.3177
DeepSRDCF-S	0.5072	0.5263	19.5438	17.4695	0.2773	0.2799
DeepSRDCF [33]	0.5220	0.5016	20.3462	23.9644	0.2756	0.2282

Table 3. Comparison of computational overhead on OTB100, VOT2016 and VOT2018

Algorithm	Parameters	Floating-point operation rate on OTB100/(10⁹ s⁻¹)	Floating-point operation rate on VOT2016/(10⁹ s⁻¹)	Floating-point operation rate on VOT2018/(10⁹ s⁻¹)
ECO-S	5.92×10⁶	0.39	0.46	0.51
ECO [27]	5.34×10⁶	0.40	0.46	0.50
SiamMask [5]	12.96×10⁶	0.86	0.97	1.07
SiamRPN-S	11.13×10⁶	0.81	0.89	0.91
SiamRPN [8]	10.61×10⁶	0.75	0.81	0.85
DeepSRDCF-S	12.68×10⁶	0.86	0.96	1.07
DeepSRDCF [33]	12.02×10⁶	0.85	0.96	1.06

References (33)

[1]	DING W Z, XU Q J, LIU S Y, et al. SAMF: a self-adaptive protein modeling framework[J]. Bioinformatics, 2021, 37(22): 4075-4082.
[2]	TANG L N, XU X C, WANG X, et al. An automatic object attention likelihood map correlation filter for visual tracking[C]//Proceedings of the IEEE 11th Asia-Pacific Conference on Antennas and Propagation. Piscataway: IEEE Press, 2024: 1-2.
[3]	LI B L, WANG Y, XU Y M, et al. DSST: a dual student model guided student–teacher framework for semi-supervised medical image segmentation[J]. Biomedical Signal Processing and Control, 2024, 90: 105890.
[4]	GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2014: 580-587.
[5]	HU W M, WANG Q, ZHANG L, et al. SiamMask: a framework for fast online object tracking and segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3072-3089.
[6]	LEE S, KIM S. Comparative analysis of IR-IR image matching applying the deep learning-based template matching techniques[C]//Proceedings of the 23rd International Conference on Control, Automation and Systems. Piscataway: IEEE Press, 2023: 441-444.
[7]	HELD D, THRUN S, SAVARESE S. Learning to track at 100 FPS with deep regression networks[C]//Proceedings of the Computer Vision-ECCV. Berlin: Springer, 2016: 749-765.
[8]	WANG S Q, QIAN K, SHEN J L, et al. AD-SiamRPN: anti-deformation object tracking via an improved siamese region proposal network on hyperspectral videos[J]. Remote Sensing, 2023, 15(7): 1731.
[9]	LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[10]	HARIHARAN B, ARBELÁEZ P, GIRSHICK R, et al. Simultaneous detection and segmentation[C]//Proceedings of the Computer Vision-ECCV. Berlin: Springer, 2014: 297-312.
[11]	PINHEIRO P O, COLLOBERT R, DOLLÁR P. Learning to segment object candidates[C]//Proceedings of the 29th International Conference on Neural Information Processing Systems. New York: ACM, 2015(2): 1990-1998.
[12]	DAI J F, HE K M, SUN J. Instance-aware semantic segmentation via multi-task network cascades[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 3150-3158.
[13]	HARIHARAN B, ARBELÁEZ P, GIRSHICK R, et al. Hypercolumns for object segmentation and fine-grained localization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2015: 447-456.
[14]	KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[15]	SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10)[2024-02-01]. https://arxiv.org/abs/1409.1556.
[16]	SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2015: 1-9.
[17]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
[18]	SHELHAMER E, LONG J, DARRELL T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence. Piscataway: IEEE Press, 2016: 640-651.
[19]	XIE S N, TU Z W. Holistically-nested edge detection[J]. International Journal of Computer Vision, 2017, 125(1): 3-18.
[20]	SERMANET P, KAVUKCUOGLU K, CHINTALA S, et al. Pedestrian detection with unsupervised multi-stage feature learning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2013: 3626-3633.
[21]	KOONCE B. ResNet 50[C]//Proceedings of the Convolutional Neural Networks with Swift for Tensorflow. Berkeley: Apress, 2021: 63-72.
[22]	HAN F, JIANG S K, WU J M, et al. Real-time object tracking in the wild with siamese network[J]. Multimedia Tools and Applications, 2023, 82(16): 24327-24343.
[23]	CAO D, DAI R H, WANG J, et al. Fast visual tracking with squeeze and excitation region proposal network[J]. Human-centric Computing and Information Sciences, 2023, 13(7): 20.
[24]	WANG Q, GAO J, XING J L, et al. DCFNet: discriminant correlation filters network for visual tracking[EB/OL]. (2017-04-13)[2024-02-01]. https://arxiv.org/abs/1704.04057.
[25]	ZHANG H Y, LIU G X, ZHANG Y, et al. Robust multi-model visual tracking with distractor-aware template-coupled correlation filters joint learning[J]. IEEE Transactions on Multimedia, 2024, 26: 1813-1828.
[26]	FANG H, LIAO G S, LIU Y J, et al. Shadow-assisted moving target tracking based on multidiscriminant correlation filters network in video SAR[J]. IEEE Geoscience and Remote Sensing Letters, 2023, 20: 4006205.
[27]	DANELLJAN M, BHAT G, SHAHBAZ KHAN F, et al. ECO: efficient convolution operators for tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 6638-6646.
[28]	RAHMAN M M. Target focused shallow transformer framework for efficient visual tracking[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(21): 23409-23410.
[29]	FU J Y, LIANG Q F, XIE Q S, et al. Object tracking based on foreground adaptive bounding box and motion state redetection[C]//Proceedings of the 3rd International Conference on Artificial Intelligence and Computer Engineering. Bellingham: SPIE, 2023: 155.
[30]	HUANG Y J, CHEN K, WANG Z Y, et al. A dense pedestrian tracking method based on fusion features under multi-vision[J]. Journal of Beijing University of Aeronautics and Astronautics, 2025, 51(7): 2513-2525 (in Chinese).
[31]	CHEN K, SONG X, YUAN H T, et al. Fully convolutional encoder-decoder with an attention mechanism for practical pedestrian trajectory prediction[J]. IEEE Transactions on Intelligent Transportation Systems, 2022, 23(11): 20046-20060.
[32]	CHEN K, ZHU H H, TANG D B, et al. Future pedestrian location prediction in first-person videos for autonomous vehicles and social robots[J]. Image and Vision Computing, 2023, 134: 104671.
[33]	DANELLJAN M, HAGER G, SHAHBAZ KHAN F, et al. Convolutional features for correlation filter based visual tracking[C]//Proceedings of the IEEE International Conference on Computer Vision Workshops. Piscataway: IEEE Press, 2015: 58-66.