Abstract: In recent years, video event detection has attracted increasing attention in the field of computer vision. Uncertainty estimation can alert decision-making systems or personnel when detection outputs are unreliable, thereby reducing decision errors. This paper introduces uncertainty estimation into the event detection task and proposes an end-to-end video event detection algorithm that fuses uncertainty estimation. The algorithm estimates the localization and classification uncertainty of predicted video events and optimizes network performance by reducing the predicted uncertainty. In addition, uncertainty estimation is combined with the non-maximum suppression (NMS) strategy to further select high-quality prediction boxes. Experimental results show that adding the uncertainty branch improves model performance: on the J-HMDB-21 dataset, the proposed algorithm improves mAP50 by 0.8% over state-of-the-art algorithms, and on the atomic visual actions (AVA) dataset it improves mAP50 by 1.3% over other end-to-end algorithms.
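The abstract's idea of "optimizing the network by reducing the predicted uncertainty" follows the uncertainty-attenuated regression loss popularized by He et al. [21], which is cited below. As a rough illustration only (not the authors' code: the function name, tensor shapes, and the plain Gaussian negative log-likelihood form are assumptions), the localization branch could be trained as follows:

```python
import torch

def box_nll_loss(pred_mean: torch.Tensor,
                 pred_log_var: torch.Tensor,
                 target: torch.Tensor) -> torch.Tensor:
    """Gaussian negative log-likelihood for box regression.

    pred_mean    -- (N, 4) predicted box offsets
    pred_log_var -- (N, 4) predicted log-variance (localization uncertainty)
    target       -- (N, 4) ground-truth offsets
    """
    # Up to a constant, -log N(target; mean, sigma^2) =
    #   (target - mean)^2 / (2 * sigma^2) + 0.5 * log(sigma^2).
    # The first term fits the box; the second penalizes inflated variance,
    # so minimizing the loss also drives the predicted uncertainty down.
    inv_var = torch.exp(-pred_log_var)
    return (0.5 * inv_var * (target - pred_mean) ** 2
            + 0.5 * pred_log_var).mean()
```

A classification head can be given an analogous variance output, so both uncertainties named in the abstract receive a gradient signal during training.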
Table 1. Influence of different initialization variances on algorithm performance
Table 2. Influence of different video frame parameters on algorithm performance
Algorithm                         Algorithm type         Data type   mAP50/%
I3D[10]                           two-stage              V+F         15.6
ACRN, S3D[30]                     two-stage              V+F         17.4
STEP, I3D[31]                     two-stage              V+F         18.6
RTPR[29]                          two-stage              V+F         22.3
LFB, R101+NL[32]                  two-stage (offline)    V           27.4
ACAR, R50, 8x8 (64-f)[15]         two-stage (offline)    V           28.3
SlowFast, R50, 8x8 (64-f)[33]     two-stage (offline)    V           24.8
YOWO (32-f)[12]                   one-stage              V           18.3
UC-YOWO (32-f)                    one-stage              V           18.5
UC-YOWO+StdNMS (32-f)             one-stage              V           19.6
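The UC-YOWO+StdNMS row above folds the predicted localization standard deviation into the suppression step. The exact rescoring rule is not given in this excerpt; below is a minimal sketch assuming each box carries a scalar std and that confidence is simply discounted by it (the name std_nms, the 1/(1+sigma) discount, and the greedy IoU loop are all illustrative assumptions, not the paper's definition):

```python
import torch
from torchvision.ops import box_iou  # assumes torchvision is available

def std_nms(boxes: torch.Tensor, scores: torch.Tensor,
            sigmas: torch.Tensor, iou_thr: float = 0.5) -> torch.Tensor:
    """Uncertainty-aware NMS sketch.

    boxes  -- (N, 4) candidate boxes in (x1, y1, x2, y2) form
    sigmas -- (N,)   per-box localization std (e.g., mean of coordinate stds)
    """
    # Discount confident scores that carry high localization uncertainty,
    # so well-localized boxes win ties during suppression.
    adj_scores = scores / (1.0 + sigmas)
    order = adj_scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        # Suppress remaining boxes that overlap the kept box too much.
        ious = box_iou(boxes[i].unsqueeze(0), boxes[order[1:]])[0]
        order = order[1:][ious <= iou_thr]
    return torch.tensor(keep, dtype=torch.long)
```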
[1] WEI D F, TIAN Y, WEI L Q, et al. Efficient dual attention slowfast networks for video action recognition[J]. Computer Vision and Image Understanding, 2022, 222: 103484. doi: 10.1016/j.cviu.2022.103484
[2] QIN M X, WANG Z, LI H L, et al. Obstacle avoidance control of UAV formation based on distributed model prediction[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(6): 1969-1981 (in Chinese).
[3] ZHENG Y X, HAO P Y, WU D E, et al. Medical image segmentation based on multi-layer features and spatial information distillation[J]. Journal of Beijing University of Aeronautics and Astronautics, 2022, 48(8): 1409-1417 (in Chinese).
[4] SINGH G, AKRIGG S, DI MAIO M, et al. ROAD: The road event awareness dataset for autonomous driving[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 1036-1054. doi: 10.1109/TPAMI.2022.3150906
[5] TANG C M, WANG X, BAI Y C, et al. Learning spatial-frequency transformer for visual object tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(9): 5102-5116. doi: 10.1109/TCSVT.2023.3249468
[6] SU Y T, LU Y, LIU J, et al. Spatio-temporal mitosis detection in time-lapse phase-contrast microscopy image sequences: A benchmark[J]. IEEE Transactions on Medical Imaging, 2021, 40(5): 1319-1328. doi: 10.1109/TMI.2021.3052854
[7] GKIOXARI G, MALIK J. Finding action tubes[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2015: 759-768.
[8] WEINZAEPFEL P, HARCHAOUI Z, SCHMID C. Learning to track for spatio-temporal action localization[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2015: 3164-3172.
[9] PENG X J, SCHMID C. Multi-region two-stream R-CNN for action detection[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2016: 744-759.
[10] GU C H, SUN C, ROSS D A, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2018: 6047-6056.
[11] WU J C, KUANG Z H, WANG L M, et al. Context-aware RCNN: A baseline for action detection in videos[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2020: 440-456.
[12] KÖPÜKLÜ O, WEI X Y, RIGOLL G. You only watch once: A unified CNN architecture for real-time spatiotemporal action localization[EB/OL]. (2019-11-15)[2022-10-10].
[13] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: A large video database for human motion recognition[C]//Proceedings of the International Conference on Computer Vision. Piscataway: IEEE Press, 2011: 2556-2563.
[14] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 4724-4733.
[15] PAN J T, CHEN S Y, SHOU M Z, et al. Actor-context-actor relation network for spatio-temporal action localization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 464-474.
[16] FENG Y T, JIANG J W, HUANG Z Y, et al. Relation modeling in spatio-temporal action localization[EB/OL]. (2021-06-15)[2022-10-10].
[17] LAKSHMINARAYANAN B, PRITZEL A, BLUNDELL C. Simple and scalable predictive uncertainty estimation using deep ensembles[EB/OL]. (2016-12-05)[2022-10-10].
[18] FINN C, ABBEEL P, LEVINE S. Model-agnostic meta-learning for fast adaptation of deep networks[EB/OL]. (2017-03-09)[2022-10-10].
[19] KIM T, YOON J, DIA O, et al. Bayesian model-agnostic meta-learning[EB/OL]. (2018-06-11)[2022-10-10].
[20] HALL D, DAYOUB F, SKINNER J, et al. Probabilistic object detection: Definition and evaluation[C]//Proceedings of the IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE Press, 2020: 1020-1029.
[21] HE Y H, ZHU C C, WANG J R, et al. Bounding box regression with uncertainty for accurate object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 2883-2892.
[22] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[EB/OL]. (2017-08-07)[2022-10-10].
[23] OH S, HOOGS A, PERERA A, et al. A large-scale benchmark dataset for event recognition in surveillance video[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2011: 3153-3160.
[24] EVERINGHAM M, VAN GOOL L, WILLIAMS C K I, et al. The PASCAL visual object classes challenge 2012 (VOC2012) results[EB/OL]. (2012-11-13)[2022-10-10].
[25] GOYAL P, DOLLÁR P, GIRSHICK R, et al. Accurate, large minibatch SGD: Training ImageNet in 1 hour[EB/OL]. (2017-06-08)[2022-10-09].
[26] HOU R, CHEN C, SHAH M. Tube convolutional neural network (T-CNN) for action detection in videos[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2017: 5823-5832.
[27] KALOGEITON V, WEINZAEPFEL P, FERRARI V, et al. Action tubelet detector for spatio-temporal action localization[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2017: 4415-4423.
[28] WEI J C, WANG H L, YI Y, et al. P3D-CTN: Pseudo-3D convolutional tube network for spatio-temporal action detection in videos[C]//Proceedings of the IEEE International Conference on Image Processing. Piscataway: IEEE Press, 2019: 300-304.
[29] LI D, QIU Z F, DAI Q, et al. Recurrent tubelet proposal and recognition networks for action detection[M]. Berlin: Springer International Publishing, 2018: 306-322.
[30] SUN C, SHRIVASTAVA A, VONDRICK C, et al. Actor-centric relation network[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2018: 335-351.
[31] YANG X T, YANG X D, LIU M Y, et al. STEP: Spatio-temporal progressive learning for video action detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 264-272.
[32] WU C Y, FEICHTENHOFER C, FAN H Q, et al. Long-term feature banks for detailed video understanding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 284-293.
[33] XIAO F Y, LEE Y J, GRAUMAN K, et al. Audiovisual slowfast networks for video recognition[EB/OL]. (2020-01-23)[2022-10-12].