Abstract: Indoor spatial layout estimation is one of the current research hotspots in computer vision and plays an important role in tasks such as object detection, augmented reality, and robot navigation. To perceive the layout of indoor scenes more effectively, this paper proposes an indoor spatial layout estimation method based on multi-task supervised learning, which extracts the spatial segmentation map of an indoor scene in an end-to-end manner. An encoder-decoder network is designed around the segmentation characteristics of indoor layout images, and multi-task supervised learning is introduced so that the network jointly infers the indoor spatial layout and the semantic edges of each region. A joint loss function is defined to continuously optimize the segmentation results during training. To better express the layout relationship between regions, the edge predictions of each region are used to locally refine the network output and infer the final spatial layout of the indoor scene. Experiments on the public LSUN and Hedau datasets show that the proposed method effectively improves indoor spatial layout estimation, achieving pixel errors of 7.54% and 7.08%, respectively, and outperforming the comparison methods overall.
Key words:
- layout estimation
- indoor scene
- multi-task supervised learning
- end-to-end
- semantic edge
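The abstract describes a joint loss that supervises three outputs at once: the 5-class layout segmentation, an overall layout edge map, and per-class semantic edges (cf. the SegBlock, TEdgeBlock, and AEdgeBlock heads in Table 1). Below is a minimal PyTorch sketch of such a loss; the term weights, the specific loss functions, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def joint_loss(seg_logits, tedge_logits, aedge_logits,
               seg_gt, tedge_gt, aedge_gt,
               w_seg=1.0, w_tedge=1.0, w_aedge=1.0):  # weights assumed
    # seg_logits: (B, 5, H, W) layout logits; seg_gt: (B, H, W) class indices.
    l_seg = F.cross_entropy(seg_logits, seg_gt)
    # tedge/aedge: (B, 1, H, W) and (B, 5, H, W) edge logits, {0, 1} targets.
    l_tedge = F.binary_cross_entropy_with_logits(tedge_logits, tedge_gt)
    l_aedge = F.binary_cross_entropy_with_logits(aedge_logits, aedge_gt)
    # Weighted sum drives all three tasks during a single backward pass.
    return w_seg * l_seg + w_tedge * l_tedge + w_aedge * l_aedge
```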
Table 1. Relevant parameters of the encoder and decoder

| Module | Layer No. | Block type | Input feature map | Output feature map | Stride | Sampling rate |
|---|---|---|---|---|---|---|
| Encoder | 1 | DSBlock | 3×512×512 | 32×256×256 | 2 | |
| | 2 | IRBlock | 32×256×256 | 16×256×256 | 1 | |
| | 3~4 | IRBlock | 16×256×256 | 24×128×128 | 2 | |
| | 5~7 | IRBlock | 24×128×128 | 32×64×64 | 2 | |
| | 8~11 | IRBlock | 32×64×64 | 64×64×64 | 2 | |
| | 12~14 | IRBlock | 64×64×64 | 96×64×64 | 1 | |
| | 15~17 | IRBlock | 96×64×64 | 160×64×64 | 2 | |
| | 18 | IRBlock | 160×64×64 | 320×64×64 | 1 | |
| | 19 | DSCBlock | 320×64×64 | 256×64×64 | 1 | 1 |
| | 20 | DSCBlock | 320×64×64 | 256×64×64 | 1 | 6 |
| | 21 | DSCBlock | 320×64×64 | 256×64×64 | 1 | 12 |
| | 22 | DSCBlock | 320×64×64 | 256×64×64 | 1 | 8 |
| | 23 | IPBlock | 320×64×64 | 256×64×64 | 1 | |
| Decoder | 1 | SCBlock | 24×128×128 | 48×128×128 | 1 | |
| | 2 | CCBlock | 304×128×128 | 256×128×128 | 1 | |
| | 3-1 | SegBlock | 256×128×128 | 5×128×128 | 1 | |
| | 3-2 | TEdgeBlock | 256×128×128 | 1×128×128 | 1 | |
| | 3-3 | AEdgeBlock | 256×128×128 | 5×128×128 | 1 | |
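To make Table 1 concrete, the following PyTorch sketch mirrors its shapes: a 512×512 input, a 320×64×64 encoder output, parallel atrous DSCBlock branches plus an image-pooling branch, fusion with the 24×128×128 low-level feature through SCBlock/CCBlock, and three heads at 128×128. The block internals (plain convolutions standing in for the DSBlock/IRBlock stack) and the 1×1 projection after concatenation are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in encoder: the paper uses a DSBlock followed by 18
        # MobileNetV2-style IRBlocks (Table 1, layers 1-18).
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())   # 256x256
        self.low  = nn.Sequential(nn.Conv2d(32, 24, 3, 2, 1), nn.ReLU())  # 128x128
        self.deep = nn.Sequential(nn.Conv2d(24, 320, 3, 2, 1), nn.ReLU()) # 64x64
        # Parallel atrous branches (DSCBlocks, layers 19-22); rates from
        # Table 1 (the "8" may be a typo for DeepLab's usual 18).
        self.aspp = nn.ModuleList(
            nn.Conv2d(320, 256, 3, padding=r, dilation=r) for r in (1, 6, 12, 8))
        self.pool_branch = nn.Conv2d(320, 256, 1)    # IPBlock, layer 23
        self.project = nn.Conv2d(256 * 5, 256, 1)    # fusion conv (assumed)
        # Decoder: skip connection (SCBlock) and concat conv (CCBlock).
        self.scblock = nn.Conv2d(24, 48, 1)          # 24 -> 48 channels
        self.ccblock = nn.Conv2d(256 + 48, 256, 3, padding=1)
        # Three task heads at 1/4 resolution, as in Table 1.
        self.seg_head   = nn.Conv2d(256, 5, 1)       # SegBlock: 5-class layout
        self.tedge_head = nn.Conv2d(256, 1, 1)       # TEdgeBlock: total edge
        self.aedge_head = nn.Conv2d(256, 5, 1)       # AEdgeBlock: per-class edges

    def forward(self, x):                            # x: (B, 3, 512, 512)
        low = self.low(self.stem(x))                 # (B, 24, 128, 128)
        deep = self.deep(low)                        # (B, 320, 64, 64)
        branches = [b(deep) for b in self.aspp]
        pooled = F.adaptive_avg_pool2d(deep, 1)      # global context
        pooled = F.interpolate(self.pool_branch(pooled), size=deep.shape[2:])
        feat = self.project(torch.cat(branches + [pooled], dim=1))
        feat = F.interpolate(feat, scale_factor=2)   # up to 128x128
        feat = self.ccblock(torch.cat([feat, self.scblock(low)], dim=1))
        return self.seg_head(feat), self.tedge_head(feat), self.aedge_head(feat)
```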
Table 2. Comparison between the proposed method and existing methods on the LSUN dataset
Table 3. Comparison between the proposed method and existing methods on the Hedau dataset
Table 4. Ablation results of each module of the proposed method

| Baseline | Improved encoder | Multi-task supervision & feature fusion | Post-processing | Params/MB | Per-image prediction time/ms | Pixel error/% |
|---|---|---|---|---|---|---|
| √ | × | × | × | 209.6 | 45 | 11.15 |
| √ | √ | × | × | 22.1 | 26 | 11.34 |
| √ | × | √ | × | 209.8 | 46 | 8.41 |
| √ | √ | √ | × | 22.4 | 28 | 8.63 |
| √ | √ | √ | √ | 22.4 | 28 | 7.54 |