Single shot multibox detector based on asynchronous convolution factorization and shunt structure
-
Abstract: In the single shot multibox detector (SSD), the regression computations on the multi-layer regression feature maps are relatively independent of one another, and SSD-based variants that improve detection accuracy struggle to retain real-time speed. To address both problems, a single-stage object detector based on asynchronous convolution factorization and a shunt structure (FA-SSD) is proposed. The shunt structure, built on the asynchronous convolution factorization algorithm, cross-connects the multi-layer feature maps and thereby improves the unity and coordination of the regression computations. The original high-level mainstream structure is also optimized: the mainstream and the shunt reduce the feature-map size in two different ways, max pooling and asynchronous convolution factorization respectively, which preserves spatially correlated information while increasing feature diversity. Experimental results show that, when trained on VOC2007trainval and VOC2012trainval with all images resized to 300 pixel × 300 pixel, FA-SSD reaches a mean average precision (mAP) of 80.5% on VOC2007test at a detection speed above 30 frame/s.
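As a rough illustration of the two downsampling routes named in the abstract, the following Python sketch halves a small feature map once by 2×2 max pooling and once by a factorized strided convolution. The abstract does not define asynchronous convolution factorization, so the sketch assumes one plausible reading: a 3×3 stride-2 convolution is replaced by a 1×3 convolution that strides only along the width, followed by a 3×1 convolution that strides only along the height, in the spirit of the factorization in [19]. The function names, the kernels and the single-channel list-of-lists layout are hypothetical and are not taken from FA-SSD.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 (a trailing odd row/column is dropped)."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, w - 1, 2)]
            for i in range(0, h - 1, 2)]


def conv_1x3_stride_w(fmap, kernel, stride=2):
    """1x3 convolution with zero padding 1; the stride acts along the width only."""
    w = len(fmap[0])
    out = []
    for row in fmap:
        padded = [0.0] + list(row) + [0.0]
        out.append([sum(k * padded[j + d] for d, k in enumerate(kernel))
                    for j in range(0, w, stride)])
    return out


def conv_3x1_stride_h(fmap, kernel, stride=2):
    """3x1 convolution with zero padding 1; the stride acts along the height only."""
    transposed = [list(col) for col in zip(*fmap)]
    out_t = conv_1x3_stride_w(transposed, kernel, stride)
    return [list(row) for row in zip(*out_t)]


def factorized_downsample(fmap, k_row, k_col):
    """ASSUMED reading of the factorization: stride along width, then along height."""
    return conv_3x1_stride_h(conv_1x3_stride_w(fmap, k_row), k_col)


# Toy 4x4 single-channel feature map with values 0..15.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]

pooled = max_pool_2x2(fmap)                                # mainstream-style route
shunt = factorized_downsample(fmap, [1, 2, 1], [1, 2, 1])  # shunt-style route

print("max pooling :", len(pooled), "x", len(pooled[0]))
print("factorized  :", len(shunt), "x", len(shunt[0]))
```

For even-sized inputs both routes return maps of the same size, so a later layer could combine them; how FA-SSD actually merges the mainstream and the shunt is not stated in the abstract.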
-
Table 1. Detection results of different algorithms on VOC2007test
| Method | Training data | Pre-trained | Backbone | Input size | Number of proposals | GPU | Speed/(frame·s⁻¹) | mAP/% |
|---|---|---|---|---|---|---|---|---|
| Fast R-CNN[8] | 07+12 | √ | VGGNet | 600×1000* | 300 | K40 | 3.125 | 66.9 |
| Faster R-CNN[9] | 07+12 | √ | VGGNet | 600×1000* | 300 | K40 | 5 | 73.2 |
| R-FCN[22] | 07+12 | √ | VGGNet | 600×1000 | 300 | K40 | 5.8 | 75.6 |
| YOLOv2[12] | 07+12 | √ | Darknet-19 | 352×352 | | Titan X | 81 | 73.7 |
| SSD300[13] | 07+12 | × | VGGNet | 300×300 | 8732 | Titan X | 46 | 74.3 |
| SSD300[13] | 07+12 | √ | VGGNet | 300×300 | 8732 | Titan X | 46 | 77.2 |
| SSD300*[13] | 07+12 | × | VGGNet | 300×300 | 8732 | 1080Ti | 43.5 | 74 |
| DSOD300[16] | 07+12 | × | DS/64-192-48-1 | 300×300 | 8732 | Titan X | 17.4 | 77.7 |
| DSSD321[15] | 07+12 | √ | ResNet | 321×321 | 17080 | Titan X | 9.5 | 78.6 |
| FA-SSD300 | 07+12 | × | VGGNet | 300×300 | 8732 | 1080Ti | 30 | 79.0 |
| FA-SSD300 | 07+12 | √ | VGGNet | 300×300 | 8732 | 1080Ti | 30 | 80.5 |
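The figure of 8 732 boxes listed for the 300×300 SSD-style detectors in Table 1 can be reproduced from the default-box layout of the original SSD300 [13]. The short Python sketch below does that arithmetic. The layer names and per-location box counts are those of SSD300 in [13]; that FA-SSD300 uses exactly the same layout is an assumption, since the table only shows that its total is identical.

```python
# (layer name, feature-map side length, default boxes per location) for SSD300 [13].
# ASSUMPTION: FA-SSD300 keeps this layout; the table only confirms the same total.
SSD300_LAYERS = [
    ("conv4_3", 38, 4),
    ("fc7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]


def boxes_per_layer(layers):
    """Number of default boxes contributed by each regression feature map."""
    return {name: side * side * per_loc for name, side, per_loc in layers}


per_layer = boxes_per_layer(SSD300_LAYERS)
total_boxes = sum(per_layer.values())

for name, count in per_layer.items():
    print(f"{name:>9}: {count}")
print("total:", total_boxes)
```

The largest map (38×38) alone contributes about two thirds of the boxes, which is why the low-level regression layer dominates the cost of the prediction stage.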
Table 2. Comparison of specific category detections on VOC2007test
| Category | Fast R-CNN[8] | Faster R-CNN[9] | ION[22] | R-FCN[23] | MR-CNN[24] | SSD300[13] | DSSD321[15] | FA-SSD300 |
|---|---|---|---|---|---|---|---|---|
| Aero | 77.0 | 76.5 | 79.2 | 79.9 | 80.3 | 79.5 | 81.9 | 86.4 |
| Bike | 78.1 | 79 | 83.1 | 87.2 | 84.1 | 83.9 | 84.9 | 85.9 |
| Bird | 69.3 | 70.9 | 77.6 | 81.5 | 78.5 | 76 | 80.5 | 79.6 |
| Boat | 59.4 | 65.5 | 65.6 | 72 | 70.8 | 69.6 | 68.4 | 73.3 |
| Bottle | 38.3 | 52.1 | 54.9 | 69.8 | 68.5 | 50.4 | 53.9 | 53.6 |
| Bus | 81.6 | 83.1 | 85.4 | 86.8 | 88 | 87 | 85.6 | 90.2 |
| Car | 78.6 | 84.7 | 85.1 | 88.5 | 85.9 | 85.7 | 86.2 | 89.2 |
| Cat | 86.7 | 86.4 | 87 | 89.8 | 87.8 | 88.1 | 88.9 | 91.7 |
| Chair | 42.8 | 52 | 54.4 | 67 | 60.3 | 60.3 | 61.1 | 60.0 |
| Cow | 78.8 | 81.9 | 80.6 | 88.1 | 85.2 | 81.5 | 83.5 | 84.3 |
| Table | 68.9 | 65.7 | 73.8 | 74.5 | 73.7 | 77 | 78.7 | 80.9 |
| Dog | 84.7 | 84.8 | 85.3 | 89.8 | 87.2 | 86.1 | 86.7 | 89.1 |
| Horse | 82.0 | 84.6 | 82.2 | 90.6 | 86.5 | 87.5 | 88.7 | 87.4 |
| Mbike | 76.6 | 77.5 | 82.2 | 79.9 | 85 | 83.97 | 86.7 | 86.5 |
| Person | 69.9 | 76.7 | 74.4 | 81.2 | 76.4 | 79.4 | 79.7 | 83.3 |
| Plant | 31.8 | 38.8 | 47.1 | 53.7 | 48.5 | 52.3 | 51.7 | 54.2 |
| Sheep | 70.1 | 73.6 | 75.8 | 81.8 | 76.3 | 77.9 | 78 | 83.2 |
| Sofa | 74.8 | 73.9 | 72.7 | 81.5 | 75.5 | 79.4 | 80.9 | 82.3 |
| Train | 80.4 | 83 | 84.2 | 85.9 | 85 | 87.6 | 87.9 | 89.2 |
| Tv | 70.4 | 72.6 | 80.4 | 79.9 | 81 | 76.8 | 79.4 | 78.5 |
| mAP/% | 70.0 | 73.2 | 75.6 | 80.5 | 78.2 | 77.2 | 78.6 | 80.5 |
-
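The mAP row of Table 2 is the arithmetic mean of the 20 per-category average precisions. As a quick consistency check, the Python sketch below recomputes that mean for the FA-SSD300 column, with the values copied from the table. Because the per-category entries are already rounded to one decimal, the recomputed mean is expected to agree with the reported 80.5% only to within about 0.1; the variable names are illustrative.

```python
# Per-category AP (%) of FA-SSD300 on VOC2007test, copied from Table 2.
FA_SSD300_AP = {
    "aero": 86.4, "bike": 85.9, "bird": 79.6, "boat": 73.3, "bottle": 53.6,
    "bus": 90.2, "car": 89.2, "cat": 91.7, "chair": 60.0, "cow": 84.3,
    "table": 80.9, "dog": 89.1, "horse": 87.4, "mbike": 86.5, "person": 83.3,
    "plant": 54.2, "sheep": 83.2, "sofa": 82.3, "train": 89.2, "tv": 78.5,
}

REPORTED_MAP = 80.5  # last row of Table 2


def mean_average_precision(ap_by_class):
    """mAP as the unweighted mean of the per-category APs."""
    return sum(ap_by_class.values()) / len(ap_by_class)


recomputed = mean_average_precision(FA_SSD300_AP)
print(f"recomputed mAP: {recomputed:.2f}  reported: {REPORTED_MAP}")

# Categories where the detector is weakest, useful when reading the table.
weakest = sorted(FA_SSD300_AP, key=FA_SSD300_AP.get)[:3]
print("lowest-AP categories:", weakest)
```

The three lowest entries are small or thin objects (bottle, plant, chair), the same categories on which the other 300×300 single-stage detectors in the table score lowest.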
[1] VIOLA P, JONES M. Rapid object detection using a boosted cascade of simple features[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2003: 511-518.
[2] DALAL N, TRIGGS B. Histograms of oriented gradients for human detection[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2005: 886-893.
[3] FELZENSZWALB P, MCALLESTER D, RAMANAN D. A discriminatively trained, multiscale, deformable part model[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2008: 1-8.
[4] EVERINGHAM M, GOOL L V, WILLIAMS C K I, et al. The PASCAL visual object classes (VOC) challenge[J]. International Journal of Computer Vision, 2010, 88(2): 303-338.
[5] LI X D, YE M, LI T. Review of object detection based on convolutional neural networks[J]. Application Research of Computers, 2017, 34(10): 2881-2886 (in Chinese). doi: 10.3969/j.issn.1001-3695.2017.10.001
[6] GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2014: 580-587.
[7] HE K, ZHANG X, REN S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2014, 37(9): 346-361.
[8] GIRSHICK R. Fast R-CNN[C]//IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE Press, 2015: 1440-1448.
[9] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2015: 91-99.
[10] LIN T Y, DOLLAR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2017: 936-944.
[11] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: Unified, real-time object detection[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2015: 779-788.
[12] REDMON J, FARHADI A. YOLO9000: Better, faster, stronger[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2017: 6517-6525.
[13] LIU W, ANGUELOV D, ERHAN D, et al. SSD: Single shot multibox detector[C]//European Conference on Computer Vision. Berlin: Springer, 2016: 21-37.
[14] REDMON J, FARHADI A. YOLOv3: An incremental improvement[EB/OL]. (2018-04-08)[2018-09-21]. http://cn.arxiv.org/pdf/1804.02767v1.
[15] FU C Y, LIU W, RANGA A, et al. DSSD: Deconvolutional single shot detector[EB/OL]. (2017-01-23)[2018-09-21]. http://cn.arxiv.org/pdf/1701.06659.
[16] SHEN Z, LIU Z, LI J, et al. DSOD: Learning deeply supervised object detectors from scratch[C]//IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE Press, 2017: 1937-1945.
[17] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-03-10)[2018-09-21]. http://cn.arxiv.org/pdf/1409.1556.
[18] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE Press, 2015: 770-778.
[19] SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2016: 2818-2826.
[20] IOFFE S, SZEGEDY C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[EB/OL]. (2015-03-02)[2018-09-26]. https://arxiv.org/abs/1502.03167.
[21] RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
[22] BELL S, ZITNICK C L, BALA K, et al. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2016: 2874-2883.
[23] DAI J, LI Y, HE K, et al. R-FCN: Object detection via region-based fully convolutional networks[EB/OL]. (2016-06-21)[2018-09-26]. https://arxiv.org/abs/1605.06409.
[24] HE K, GKIOXARI G, DOLLAR P, et al. Mask R-CNN[C]//IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE Press, 2017: 1-13.