Abstract:
To estimate the camera pose from a single RGB image, a deep encoder-decoder dual-stream convolutional neural network (CNN) is proposed that improves the accuracy of visual localization. First, an encoder extracts high-dimensional features from the input image; then, a decoder raises the spatial resolution of these features; finally, a multi-scale pose estimator outputs the pose parameters. Because position and orientation have different characteristics, the network splits into two streams from the decoder onward to process position and orientation separately, and skip connections are added between the encoder and decoder to preserve spatial information. Experimental results show that the proposed network is clearly more accurate than comparable state-of-the-art algorithms, with an especially large improvement in camera orientation accuracy.
Keywords:
- visual localization
- encoder-decoder architecture
- convolutional neural network (CNN)
- skip connection
- dual-stream network
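The abstract outlines the full data path: an encoder for high-dimensional features, a skip-connected decoder to recover spatial resolution, and two decoder streams ending in multi-scale pose heads. The following is a minimal PyTorch sketch of that layout; all layer sizes, and names such as BiLocNetSketch and DecoderStream, are illustrative assumptions rather than the authors' implementation (GroupNorm is used in the spirit of [26]).

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    # 3x3 conv + GroupNorm + ReLU; GroupNorm follows the spirit of [26].
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.GroupNorm(8, cout),
        nn.ReLU(inplace=True),
    )

class DecoderStream(nn.Module):
    """One decoder stream (position or orientation) with skip connections."""
    def __init__(self, out_dim):
        super().__init__()
        self.up1 = nn.ConvTranspose2d(256, 128, 2, stride=2)  # upsample x2
        self.fuse1 = conv_block(128 + 128, 128)               # skip from encoder
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.fuse2 = conv_block(64 + 64, 64)
        # Multi-scale heads: predict at two decoder resolutions, then average.
        self.head1 = nn.Linear(128, out_dim)
        self.head2 = nn.Linear(64, out_dim)

    def forward(self, f3, f2, f1):
        d1 = self.fuse1(torch.cat([self.up1(f3), f2], dim=1))
        d2 = self.fuse2(torch.cat([self.up2(d1), f1], dim=1))
        p1 = self.head1(d1.mean(dim=(2, 3)))  # global average pool + linear
        p2 = self.head2(d2.mean(dim=(2, 3)))
        return (p1 + p2) / 2

class BiLocNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 64, stride=2)      # shared encoder
        self.enc2 = conv_block(64, 128, stride=2)
        self.enc3 = conv_block(128, 256, stride=2)
        self.pos_stream = DecoderStream(out_dim=3)   # position x, y, z
        self.ori_stream = DecoderStream(out_dim=4)   # orientation quaternion

    def forward(self, img):
        f1 = self.enc1(img)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        return self.pos_stream(f3, f2, f1), self.ori_stream(f3, f2, f1)

x = torch.randn(1, 3, 256, 256)
pos, ori = BiLocNetSketch()(x)
print(pos.shape, ori.shape)  # torch.Size([1, 3]) torch.Size([1, 4])
```

The position stream regresses a 3-D translation and the orientation stream a 4-D quaternion, matching the usual 6-DOF parameterization of PoseNet-style methods [11].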
Table 1. Position error and orientation error of different algorithms in different scenes (each cell: position error/m, orientation error/(°))

Scene     PoseNet[11]    Bayesian PoseNet[14]    LSTM-Pose[12]    VidLoc[13]    Hourglass[15]    BiLocNet
Chess     0.32, 8.12     0.28, 7.05              0.24, 5.77       0.18, N/A     0.15, 6.17       0.13, 5.13
Fire      0.47, 14.4     0.43, 12.52             0.34, 11.9       0.26, N/A     0.27, 10.84      0.29, 10.48
Heads     0.29, 12.0     0.25, 12.72             0.21, 13.7       0.14, N/A     0.19, 11.63      0.16, 12.67
Office    0.48, 7.68     0.30, 8.92              0.30, 8.08       0.26, N/A     0.21, 8.48       0.25, 6.82
Pumpkin   0.47, 8.42     0.36, 7.53              0.33, 7.00       0.36, N/A     0.25, 8.12       0.25, 5.23
Kitchen   0.59, 8.64     0.45, 9.80              0.37, 8.83       0.32, N/A     0.27, 10.15      0.26, 6.95
Stairs    0.47, 13.8     0.42, 13.06             0.40, 13.7       0.26, N/A     0.29, 12.46      0.33, 9.86
Average   0.44, 10.44    0.35, 10.22             0.31, 9.85       0.25, N/A     0.23, 9.69       0.23, 8.16

Table 2. Comparison of network accuracy with different estimators
Estimator                     Position error/m    Orientation error/(°)
Fully connected estimator     0.26                8.03
GAP estimator                 0.14                5.21
Multi-scale pose estimator    0.13                5.13
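As a hedged illustration of what the first two rows of Table 2 contrast, the sketch below builds a fully connected head, which flattens the whole feature map, and a GAP head, which pools it to a single vector first; the 64×32×32 feature size is an assumption.

```python
import torch
import torch.nn as nn

C, out_dim = 64, 7  # 3-D position + 4-D quaternion

# Fully connected estimator: flatten the whole feature map (parameter-heavy).
fc_head = nn.Sequential(nn.Flatten(), nn.Linear(C * 32 * 32, out_dim))

# GAP estimator: global average pooling removes the spatial dimensions first.
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(C, out_dim))

feat = torch.randn(2, C, 32, 32)
print(fc_head(feat).shape, gap_head(feat).shape)  # both torch.Size([2, 7])
```

The multi-scale pose estimator of the last row corresponds to attaching GAP-style heads at several decoder resolutions and fusing their outputs, as in the DecoderStream sketch above.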
Table 3. Comparison of network accuracy between the α-weighted and σ-weighted loss functions
Loss function    Position error/m    Orientation error/(°)
α weights        0.19                6.55
σ weights        0.17                5.64
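Table 3 contrasts a fixed α weighting of the position and orientation terms, as in PoseNet [11], with a σ weighting learned from homoscedastic uncertainty [21, 23], where s = log σ² is kept as a trainable parameter. A minimal sketch of both (variable names and the α value are illustrative):

```python
import torch
import torch.nn as nn

def alpha_loss(pos_err, ori_err, alpha=500.0):
    # L = L_x + α · L_q, with α hand-tuned (PoseNet [11] uses values in the hundreds).
    return pos_err + alpha * ori_err

class SigmaLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.s_x = nn.Parameter(torch.zeros(()))  # log σ² for position
        self.s_q = nn.Parameter(torch.zeros(()))  # log σ² for orientation

    def forward(self, pos_err, ori_err):
        # L = L_x · exp(−s_x) + s_x + L_q · exp(−s_q) + s_q, per [21, 23].
        return (pos_err * torch.exp(-self.s_x) + self.s_x
                + ori_err * torch.exp(-self.s_q) + self.s_q)

pos_err = torch.tensor(0.8)
ori_err = torch.tensor(0.05)
print(alpha_loss(pos_err, ori_err), SigmaLoss()(pos_err, ori_err))
```

Because s_x and s_q are optimized jointly with the network, the σ form learns the position/orientation trade-off instead of fixing it by hand, which matches the accuracy gain in Table 3.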
Table 4. Comparison of network accuracy with and without transfer learning
Transfer learning    Position error/m    Orientation error/(°)
With                 0.32                10.64
Without              0.29                10.48
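Table 4 probes whether ImageNet transfer learning helps; consistent with the findings of [25], training from scratch is slightly more accurate here. A minimal sketch of how the two settings might be set up, assuming torchvision ≥ 0.13 and using ResNet-34 as a stand-in for the paper's encoder:

```python
# Hypothetical setup for the Table 4 ablation: the same encoder either
# initialized from ImageNet weights (transfer learning) or from scratch.
import torchvision.models as models

# With transfer learning: load ImageNet-pretrained weights.
encoder_pretrained = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)

# Without transfer learning: random initialization, as studied in [25].
encoder_scratch = models.resnet34(weights=None)
```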
[1] CHEN D M, BAATZ G, KOSER K, et al. City-scale landmark identification on mobile devices[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2011.
[2] TORII A, SIVIC J, PAJDLA T, et al. Visual place recognition with repetitive structures[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2013: 883-890.
[3] SCHINDLER G, BROWN M, SZELISKI R. City-scale location recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2007: 1-7.
[4] ARTH C, PIRCHHEIM C, VENTURA J, et al. Instant outdoor localization and SLAM initialization from 2.5D maps[J]. IEEE Transactions on Visualization and Computer Graphics, 2015, 21(11): 1309-1318. doi: 10.1109/TVCG.2015.2459772
[5] POGLITSCH C, ARTH C, SCHMALSTIEG D, et al. A particle filter approach to outdoor localization using image-based rendering[C]//IEEE International Symposium on Mixed and Augmented Reality (ISMAR). Piscataway, NJ: IEEE Press, 2015: 132-135.
[6] SATTLER T, LEIBE B, KOBBELT L. Improving image-based localization by active correspondence search[C]//Proceedings of European Conference on Computer Vision. Berlin: Springer, 2012: 752-765.
[7] LI Y, SNAVELY N, HUTTENLOCHER D, et al. Worldwide pose estimation using 3D point clouds[C]//Proceedings of European Conference on Computer Vision. Berlin: Springer, 2012: 15-29.
[8] CHOUDHARY S, NARAYANAN P J. Visibility probability structure from SfM datasets and applications[C]//Proceedings of European Conference on Computer Vision. Berlin: Springer, 2012: 130-143.
[9] SVARM L, ENQVIST O, OSKARSSON M, et al. Accurate localization and pose estimation for large 3D models[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2014: 532-539.
[10] SHOTTON J, GLOCKER B, ZACH C, et al. Scene coordinate regression forests for camera relocalization in RGB-D images[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2013: 2930-2937.
[11] KENDALL A, GRIMES M, CIPOLLA R. PoseNet: A convolutional network for real-time 6-DOF camera relocalization[C]//Proceedings of IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE Press, 2015: 2938-2946.
[12] WALCH F, HAZIRBAS C, LEAL-TAIXÉ L, et al. Image-based localization with spatial LSTMs[EB/OL]. (2016-11-23)[2018-12-25]. https://arxiv.org/pdf/1611.07890v1.
[13] CLARK R, WANG S, MARKHAM A, et al. VidLoc: A deep spatio-temporal model for 6-DOF video-clip relocalization[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2017: 6856-6864.
[14] KENDALL A, CIPOLLA R. Modelling uncertainty in deep learning for camera relocalization[C]//Proceedings of IEEE International Conference on Robotics and Automation (ICRA). Piscataway, NJ: IEEE Press, 2016: 4762-4769.
[15] MELEKHOV I, YLIOINAS J, KANNALA J, et al. Image-based localization using hourglass networks[EB/OL]. (2017-08-24)[2018-12-25]. https://arxiv.org/abs/1703.07971.
[16] LI R, LIU Q, GUI J, et al. Indoor relocalization in challenging environments with dual-stream convolutional neural networks[J]. IEEE Transactions on Automation Science and Engineering, 2018, 15(2): 651-662. doi: 10.1109/TASE.2017.2664920
[17] RADWAN N, VALADA A, BURGARD W. VLocNet++: Deep multitask learning for semantic visual localization and odometry[EB/OL]. [2018-12-25]. https://arxiv.org/abs/1804.08366.
[18] BRAHMBHATT S, GU J, KIM K, et al. Geometry-aware learning of maps for camera localization[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2018: 2616-2625.
[19] LI X, YLIOINAS J, KANNALA J. Full-frame scene coordinate regression for image-based localization[EB/OL]. (2018-01-25)[2018-12-25]. https://arxiv.org/abs/1802.03237.
[20] LASKAR Z, MELEKHOV I, KALIA S, et al. Camera relocalization by computing pairwise relative poses using convolutional neural network[C]//Proceedings of IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE Press, 2017: 929-938.
[21] KENDALL A, GAL Y, CIPOLLA R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2018: 7482-7491.
[22] SZEGEDY C, IOFFE S, VANHOUCKE V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[C]//Thirty-First AAAI Conference on Artificial Intelligence, 2017: 4-12.
[23] KENDALL A, GAL Y. What uncertainties do we need in Bayesian deep learning for computer vision?[EB/OL]. (2017-10-05)[2018-12-26]. https://arxiv.org/abs/1703.04977.
[24] IZADI S, KIM D, HILLIGES O, et al. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera[C]//Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. New York: ACM, 2011: 559-568.
[25] HE K M, GIRSHICK R, DOLLÁR P. Rethinking ImageNet pre-training[EB/OL]. (2018-11-21)[2018-12-25]. https://arxiv.org/abs/1811.08883.
[26] WU Y X, HE K M. Group normalization[C]//Proceedings of the European Conference on Computer Vision (ECCV), 2018: 3-19.