Abstract: Indoor spatial layout estimation is one of the current research hotspots in computer vision and plays an important role in tasks such as object detection, augmented reality, and robot navigation. To perceive the layout of indoor scenes more effectively, this paper proposes an indoor spatial layout estimation method based on multi-task supervised learning, which extracts the spatial segmentation map of an indoor scene in an end-to-end manner. An encoder-decoder network is designed around the segmentation characteristics of indoor layout images, and multi-task supervised learning is introduced so that the network jointly infers the indoor spatial layout and the semantic edges of each region. A joint loss function is defined to continuously optimize the segmentation results during training. To better express the layout relationship between regions, the edge predictions of each region are used to locally refine the network output and infer the final spatial layout of the indoor scene. Experiments on the public LSUN and Hedau datasets show that the proposed method effectively improves indoor spatial layout estimation, achieving pixel errors of 7.54% and 7.08%, respectively, and outperforming the comparison methods overall.
Key words:
- layout estimation
- indoor scene
- multi-task supervised learning
- end-to-end
- semantic edge
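The abstract describes a joint loss that supervises three outputs at once: the 5-class layout segmentation, an overall layout edge map, and per-class semantic edges (cf. the SegBlock, TEdgeBlock, and AEdgeBlock heads in Table 1). Below is a minimal PyTorch sketch of such a loss; the term weights, the specific loss functions, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def joint_loss(seg_logits, tedge_logits, aedge_logits,
               seg_gt, tedge_gt, aedge_gt,
               w_seg=1.0, w_tedge=1.0, w_aedge=1.0):  # weights assumed
    # seg_logits: (B, 5, H, W) layout logits; seg_gt: (B, H, W) class indices.
    l_seg = F.cross_entropy(seg_logits, seg_gt)
    # tedge/aedge: (B, 1, H, W) and (B, 5, H, W) edge logits, {0, 1} targets.
    l_tedge = F.binary_cross_entropy_with_logits(tedge_logits, tedge_gt)
    l_aedge = F.binary_cross_entropy_with_logits(aedge_logits, aedge_gt)
    # Weighted sum drives all three tasks during a single backward pass.
    return w_seg * l_seg + w_tedge * l_tedge + w_aedge * l_aedge
```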
Table 1. Relevant parameters of the encoder and decoder

| Module | Layer No. | Block type | Input feature map | Output feature map | Stride | Sampling rate |
|---|---|---|---|---|---|---|
| Encoder | 1 | DSBlock | 3×512×512 | 32×256×256 | 2 | |
| | 2 | IRBlock | 32×256×256 | 16×256×256 | 1 | |
| | 3~4 | IRBlock | 16×256×256 | 24×128×128 | 2 | |
| | 5~7 | IRBlock | 24×128×128 | 32×64×64 | 2 | |
| | 8~11 | IRBlock | 32×64×64 | 64×64×64 | 2 | |
| | 12~14 | IRBlock | 64×64×64 | 96×64×64 | 1 | |
| | 15~17 | IRBlock | 96×64×64 | 160×64×64 | 2 | |
| | 18 | IRBlock | 160×64×64 | 320×64×64 | 1 | |
| | 19 | DSCBlock | 320×64×64 | 256×64×64 | 1 | 1 |
| | 20 | DSCBlock | 320×64×64 | 256×64×64 | 1 | 6 |
| | 21 | DSCBlock | 320×64×64 | 256×64×64 | 1 | 12 |
| | 22 | DSCBlock | 320×64×64 | 256×64×64 | 1 | 8 |
| | 23 | IPBlock | 320×64×64 | 256×64×64 | 1 | |
| Decoder | 1 | SCBlock | 24×128×128 | 48×128×128 | 1 | |
| | 2 | CCBlock | 304×128×128 | 256×128×128 | 1 | |
| | 3-1 | SegBlock | 256×128×128 | 5×128×128 | 1 | |
| | 3-2 | TEdgeBlock | 256×128×128 | 1×128×128 | 1 | |
| | 3-3 | AEdgeBlock | 256×128×128 | 5×128×128 | 1 | |
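To make Table 1 concrete, the following PyTorch sketch mirrors its shapes: a 512×512 input, a 320×64×64 encoder output, parallel atrous DSCBlock branches plus an image-pooling branch, fusion with the 24×128×128 low-level feature through SCBlock/CCBlock, and three heads at 128×128. The block internals (plain convolutions standing in for the DSBlock/IRBlock stack) and the 1×1 projection after concatenation are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in encoder: the paper uses a DSBlock followed by 18
        # MobileNetV2-style IRBlocks (Table 1, layers 1-18).
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())   # 256x256
        self.low  = nn.Sequential(nn.Conv2d(32, 24, 3, 2, 1), nn.ReLU())  # 128x128
        self.deep = nn.Sequential(nn.Conv2d(24, 320, 3, 2, 1), nn.ReLU()) # 64x64
        # Parallel atrous branches (DSCBlocks, layers 19-22); rates from
        # Table 1 (the "8" may be a typo for DeepLab's usual 18).
        self.aspp = nn.ModuleList(
            nn.Conv2d(320, 256, 3, padding=r, dilation=r) for r in (1, 6, 12, 8))
        self.pool_branch = nn.Conv2d(320, 256, 1)    # IPBlock, layer 23
        self.project = nn.Conv2d(256 * 5, 256, 1)    # fusion conv (assumed)
        # Decoder: skip connection (SCBlock) and concat conv (CCBlock).
        self.scblock = nn.Conv2d(24, 48, 1)          # 24 -> 48 channels
        self.ccblock = nn.Conv2d(256 + 48, 256, 3, padding=1)
        # Three task heads at 1/4 resolution, as in Table 1.
        self.seg_head   = nn.Conv2d(256, 5, 1)       # SegBlock: 5-class layout
        self.tedge_head = nn.Conv2d(256, 1, 1)       # TEdgeBlock: total edge
        self.aedge_head = nn.Conv2d(256, 5, 1)       # AEdgeBlock: per-class edges

    def forward(self, x):                            # x: (B, 3, 512, 512)
        low = self.low(self.stem(x))                 # (B, 24, 128, 128)
        deep = self.deep(low)                        # (B, 320, 64, 64)
        branches = [b(deep) for b in self.aspp]
        pooled = F.adaptive_avg_pool2d(deep, 1)      # global context
        pooled = F.interpolate(self.pool_branch(pooled), size=deep.shape[2:])
        feat = self.project(torch.cat(branches + [pooled], dim=1))
        feat = F.interpolate(feat, scale_factor=2)   # up to 128x128
        feat = self.ccblock(torch.cat([feat, self.scblock(low)], dim=1))
        return self.seg_head(feat), self.tedge_head(feat), self.aedge_head(feat)
```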
Table 2. Comparison between the proposed method and existing methods on the LSUN dataset
Table 3. Comparison between the proposed method and existing methods on the Hedau dataset
Table 4. Ablation results of each module of the proposed method

| Baseline | Improved encoder | Multi-task supervision & feature fusion | Post-processing | Params/MB | Per-image prediction time/ms | Pixel error/% |
|---|---|---|---|---|---|---|
| √ | × | × | × | 209.6 | 45 | 11.15 |
| √ | √ | × | × | 22.1 | 26 | 11.34 |
| √ | × | √ | × | 209.8 | 46 | 8.41 |
| √ | √ | √ | × | 22.4 | 28 | 8.63 |
| √ | √ | √ | √ | 22.4 | 28 | 7.54 |