
融合不确定性估计的端到端视频事件检测算法

庞枫骞 赵鸿飞 康营营

庞枫骞,赵鸿飞,康营营. 融合不确定性估计的端到端视频事件检测算法[J]. 北京航空航天大学学报,2024,50(12):3759-3770 doi: 10.13700/j.bh.1001-5965.2022.0897
PANG F Q,ZHAO H F,KANG Y Y. Uncertainty estimation fused end-to-end video event detection algorithm[J]. Journal of Beijing University of Aeronautics and Astronautics,2024,50(12):3759-3770 (in Chinese) doi: 10.13700/j.bh.1001-5965.2022.0897

融合不确定性估计的端到端视频事件检测算法

doi: 10.13700/j.bh.1001-5965.2022.0897
基金项目: 国家自然科学基金(62001009);北京市教育委员会科技计划一般项目(KM202210009003);北方工业大学科研启动基金
    Corresponding author: E-mail: fqpang@ncut.edu.cn

  • CLC number: TP751.1; TP391.4

Uncertainty estimation fused end-to-end video event detection algorithm

Funds: National Natural Science Foundation of China (62001009); Research and Development Program of Beijing Municipal Education Commission (KM202210009003); Scientific Research Initiation Foundation of North China University of Technology
  • Abstract:

    Video event detection has received increasing attention in the computer vision community in recent years. Uncertainty estimation can alert a decision-making system or operator when a detection result is unreliable, thereby reducing decision errors. This work introduces uncertainty estimation into the event detection task and proposes an end-to-end video event detection algorithm that fuses uncertainty estimation. The algorithm estimates both the localization uncertainty and the classification uncertainty of predicted video events, and improves network performance by reducing the uncertainty of the network's predictions. Combining uncertainty estimation with a non-maximum suppression (NMS) strategy further selects high-quality prediction boxes. Experimental results show that adding the uncertainty branch improves model performance: on the J-HMDB-21 dataset, the proposed algorithm improves the mAP50 metric by 0.8% over state-of-the-art algorithms, and on the spatio-temporal Atomic Visual Actions (AVA) dataset it improves mAP50 by 1.3% compared with other end-to-end algorithms.
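The abstract only summarizes the localization-uncertainty objective; the full loss is given in the paper itself. As an illustrative sketch (not the authors' exact formulation), the Gaussian negative log-likelihood commonly used for box regression with a predicted variance, in the style of He et al. [21] which this line of work builds on, can be written as:

```python
import numpy as np

def uncertainty_box_loss(mu, log_sigma2, target):
    """Gaussian negative log-likelihood for box regression.

    mu         : predicted box offsets, shape (N, 4)
    log_sigma2 : predicted log-variance per coordinate, shape (N, 4)
    target     : ground-truth offsets, shape (N, 4)

    Minimizing this term pulls mu toward the target, while on hard
    samples the network can inflate log_sigma2 (report high
    uncertainty) instead of paying the full squared error.
    """
    inv_sigma2 = np.exp(-log_sigma2)
    nll = 0.5 * inv_sigma2 * (mu - target) ** 2 + 0.5 * log_sigma2
    return float(nll.mean())
```

The variance the network learns to report this way is exactly the signal a downstream NMS stage can exploit to filter unreliable boxes.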

     

  • 图 1  UC-YOWO算法结构

    Figure 1.  UC-YOWO algorithm structure

    图 2  UC-YOWO 算法子模块结构

    Figure 2.  UC-YOWO algorithm sub-module structure

    图 3  不确定性损失函数设计

    Figure 3.  Uncertainty loss function design

    图 4  NMS 算法示意图

    Figure 4.  Schematic diagram of NMS algorithm

    图 5  不同λ参数的相对 mAP

    Figure 5.  Relative mAP for different λ values

    图 6  J-HMDB-21数据集[13]检测结果

    Figure 6.  Detection results on the J-HMDB-21 dataset[13]

    图 7  不确定性值和 IoU 散点图

    Figure 7.  Scatter plot of uncertainty and IoU

    图 8  去除关键帧时的不确定性

    Figure 8.  Uncertainties while removing keyframes

    图 9  替换关键帧时的不确定性

    Figure 9.  Uncertainties while replacing keyframes

    图 10  UC-YOWO 算法对新类别的检测结果

    Figure 10.  UC-YOWO algorithm test results for new categories
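The exact Std-NMS rule sketched in Figure 4 is defined in the paper; as a hedged illustration of the general idea only, one can rank candidates by confidence discounted by the predicted localization standard deviation and then suppress greedily by IoU. Both the name `std_nms` and the discount rule `score / (1 + std)` below are illustrative assumptions, not the authors' formula:

```python
import numpy as np

def iou(a, b):
    """IoU between one box a = (x1, y1, x2, y2) and an (N, 4) array b."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def std_nms(boxes, scores, stds, iou_thr=0.5):
    """Greedy NMS that prefers low-uncertainty boxes.

    boxes  : (N, 4) candidate boxes
    scores : (N,) classification confidences
    stds   : (N,) predicted localization standard deviations
    Returns the indices of the kept boxes.
    """
    # Rank by confidence discounted by uncertainty (illustrative rule).
    order = np.argsort(-(scores / (1.0 + stds)))
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Drop candidates that overlap the kept box too strongly.
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thr]
    return keep
```

With this ranking, a slightly lower-scoring box can survive suppression if its predicted localization uncertainty is much smaller, which is the behavior the paper attributes to combining uncertainty estimation with NMS.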

    表  1  不同初始化方差对算法性能的影响

    Table  1.   Influence of different initialization variance on algorithm performance

    σ0        VIRAT[23] mAP50/%    J-HMDB-21[13] mAP50/%
              37.2                 65.4
    10⁻¹      37.1                 65.1
    10⁻²      37.4                 65.5
    10⁻³      37.9                 66.9
    10⁻⁴      36.9                 66.7

    表  2  不同视频帧参数对算法性能的影响

    Table  2.   Influence of different video frame parameters on algorithm performance

    Frame length/frames    Sampling interval/frames    J-HMDB-21[13] mAP50/%    AVA[10] mAP50/%    VIRAT[23] mAP50/%
    8                      d = 1                       66.9                     16.5               37.9
    8                      d = 2                       65.8                     16.0               35.0
    8                      d = 3                       62.7                     16.0               34.6
    16                     d = 1                       74.8                     17.8               41.0
    32                     d = 1                       71.2                     18.5               44.9

    表  3  现有不同算法在J-HMDB-21数据集[13]的对比

    Table  3.   Comparison of different deep algorithms in J-HMDB-21 data set[13]

    Algorithm              Type         Data    mAP50/%
    Peng w/o MR[9]         Two-stage    V       56.9
    Peng w/ MR[9]          Two-stage    V       58.5
    T-CNN[26]              Two-stage    V       61.3
    ACT[27]                Two-stage    V+F     65.7
    P3D-CTN[28]            One-stage    V       71.1
    YOWO[12]               One-stage    V       74.4
    UC-YOWO                One-stage    V       74.7
    UC-YOWO+Std-NMS        One-stage    V       75.2

    表  4  现有不同算法在AVA数据集[10]的对比

    Table  4.   Comparison of different deep algorithms in AVA data set[10]

    Algorithm                        Type                   Data    mAP50/%
    I3D[10]                          Two-stage              V+F     15.6
    ACRN, S3D[30]                    Two-stage              V+F     17.4
    STEP, I3D[31]                    Two-stage              V+F     18.6
    RTPR[29]                         Two-stage              V+F     22.3
    LFB, R101+NL[32]                 Two-stage (offline)    V       27.4
    ACAR, R50, 8x8, (64-f)[30]       Two-stage (offline)    V       28.3
    SlowFast, R50, 8x8, (64-f)[33]   Two-stage (offline)    V       24.8
    YOWO (32-f)[12]                  One-stage              V       18.3
    UC-YOWO (32-f)                   One-stage              V       18.5
    UC-YOWO+Std-NMS (32-f)           One-stage              V       19.6
  • [1] WEI D F, TIAN Y, WEI L Q, et al. Efficient dual attention slowfast networks for video action recognition[J]. Computer Vision and Image Understanding, 2022, 222: 103484. doi: 10.1016/j.cviu.2022.103484
    [2] 秦明星, 王忠, 李海龙, 等. 基于分布式模型预测的无人机编队避障控制[J]. 北京航空航天大学学报, 2024, 50(6): 1969-1981.

    QIN M X, WANG Z, LI H L, et al. Obstacle avoidance control of UAV formation based on distributed model prediction[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(6): 1969-1981(in Chinese).
    [3] 郑宇祥, 郝鹏翼, 吴冬恩, 等. 结合多层特征及空间信息蒸馏的医学影像分割[J]. 北京航空航天大学学报, 2022, 48(8): 1409-1417.

    ZHENG Y X, HAO P Y, WU D E, et al. Medical image segmentation based on multi-layer features and spatial information distillation[J]. Journal of Beijing University of Aeronautics and Astronautics, 2022, 48(8): 1409-1417(in Chinese).
    [4] SINGH G, AKRIGG S, DI MAIO M, et al. ROAD: The road event awareness dataset for autonomous driving[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 1036-1054. doi: 10.1109/TPAMI.2022.3150906
    [5] TANG C M, WANG X, BAI Y C, et al. Learning spatial-frequency transformer for visual object tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(9): 5102-5116. doi: 10.1109/TCSVT.2023.3249468
    [6] SU Y T, LU Y, LIU J, et al. Spatio-temporal mitosis detection in time-lapse phase-contrast microscopy image sequences: A benchmark[J]. IEEE Transactions on Medical Imaging, 2021, 40(5): 1319-1328. doi: 10.1109/TMI.2021.3052854
    [7] GKIOXARI G, MALIK J. Finding action tubes[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2015: 759-768.
    [8] WEINZAEPFEL P, HARCHAOUI Z, SCHMID C. Learning to track for spatio-temporal action localization[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2015: 3164-3172.
    [9] PENG X J, SCHMID C. Multi-region two-stream R-CNN for action detection[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2016: 744-759.
    [10] GU C H, SUN C, ROSS D A, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2018: 6047-6056.
    [11] WU J C, KUANG Z H, WANG L M, et al. Context-aware RCNN: A baseline for action detection in videos[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2020: 440-456.
    [12] KÖPÜKLÜ O, WEI X Y, RIGOLL G. You only watch once: A unified CNN architecture for real-time spatiotemporal action localization[EB/OL]. (2019-11-15)[2022-10-10].
    [13] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: A large video database for human motion recognition[C]//Proceedings of the International Conference on Computer Vision. Piscataway: IEEE Press, 2011: 2556-2563.
    [14] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 4724-4733.
    [15] PAN J T, CHEN S Y, SHOU M Z, et al. Actor-context-actor relation network for spatio-temporal action localization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 464-474.
    [16] FENG Y T, JIANG J W, HUANG Z Y, et al. Relation modeling in spatio-temporal action localization[EB/OL]. (2021-06-15)[2022-10-10].
    [17] LAKSHMINARAYANAN B, PRITZEL A, BLUNDELL C. Simple and scalable predictive uncertainty estimation using deep ensembles[EB/OL]. (2016-12-05)[2022-10-10].
    [18] FINN C, ABBEEL P, LEVINE S. Model-agnostic meta-learning for fast adaptation of deep networks[EB/OL]. (2017-03-09)[2022-10-10].
    [19] KIM T, YOON J, DIA O, et al. Bayesian model-agnostic meta-learning[EB/OL]. (2018-06-11)[2022-10-10].
    [20] HALL D, DAYOUB F, SKINNER J, et al. Probabilistic object detection: definition and evaluation[C]//Proceedings of the IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE Press, 2020: 1020-1029.
    [21] HE Y H, ZHU C C, WANG J R, et al. Bounding box regression with uncertainty for accurate object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 2883-2892.
    [22] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[EB/OL]. (2017-08-07)[2022-10-10].
    [23] OH S, HOOGS A, PERERA A, et al. A large-scale benchmark dataset for event recognition in surveillance video[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2011: 3153-3160.
    [24] EVERINGHAM M, VAN GOOL L, WILLIAMS C K I, et al. The PASCAL visual object classes challenge 2012 (VOC2012) results[EB/OL]. (2012-11-13)[2022-10-10].
    [25] GOYAL P, DOLLÁR P, GIRSHICK R, et al. Accurate, large minibatch SGD: Training ImageNet in 1 hour[EB/OL]. (2017-06-08)[2022-10-09].
    [26] HOU R, CHEN C, SHAH M. Tube convolutional neural network (T-CNN) for action detection in videos[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2017: 5823-5832.
    [27] KALOGEITON V, WEINZAEPFEL P, FERRARI V, et al. Action tubelet detector for spatio-temporal action localization[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2017: 4415-4423.
    [28] WEI J C, WANG H L, YI Y, et al. P3D-CTN: Pseudo-3D convolutional tube network for spatio-temporal action detection in videos[C]//Proceedings of the IEEE International Conference on Image Processing. Piscataway: IEEE Press, 2019: 300-304.
    [29] LI D, QIU Z F, DAI Q, et al. Recurrent tubelet proposal and recognition networks for action detection[M]. Berlin: Springer International Publishing, 2018: 306-322.
    [30] SUN C, SHRIVASTAVA A, VONDRICK C, et al. Actor-centric relation network[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2018: 335-351.
    [31] YANG X T, YANG X D, LIU M Y, et al. STEP: Spatio-temporal progressive learning for video action detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 264-272.
    [32] WU C Y, FEICHTENHOFER C, FAN H Q, et al. Long-term feature banks for detailed video understanding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 284-293.
    [33] XIAO F Y, LEE Y J, GRAUMAN K, et al. Audiovisual slowfast networks for video recognition[EB/OL]. (2020-01-23)[2022-10-12].
Publication history
  • Received:  2022-11-03
  • Accepted:  2022-12-09
  • Published online:  2023-02-15
  • Issue published:  2024-12-31
