FastSAM-assisted representation enhancement for self-supervised monocular depth estimation
-
Abstract: To address the poor performance of self-supervised monocular depth estimation in thin-structure and boundary regions, this paper proposes a self-supervised monocular depth estimation method based on FastSAM-assisted representation enhancement. First, FastSAM is introduced to provide the depth network with rich semantic information without adding extra supervision. Second, a semantic guidance module (SGM) is proposed to explore the correlation between semantic features and depth features and to strengthen the global feature representation. In addition, an edge guidance module (EGM) is designed to direct the network to attend more to local features, improving depth estimation at boundaries. Extensive experiments show that the proposed method outperforms state-of-the-art methods, with a clear advantage in thin-structure and boundary regions.
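The abstract only names the two modules; the code below is a minimal PyTorch sketch of one plausible reading, not the authors' exact design. It assumes SGM is cross-attention from depth-decoder features (queries) to FastSAM semantic features (keys/values), and EGM is a Sobel edge prior that re-weights local features. All class names, channel widths, and the attention/edge formulations are illustrative assumptions.

```python
# Illustrative sketch only: module internals are assumptions, not the paper's
# verified design. SGM: depth features attend globally to semantic features.
# EGM: an image-gradient (Sobel) prior emphasizes features near edges.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGM(nn.Module):
    """Semantic guidance: cross-attention from depth to semantic features."""
    def __init__(self, depth_ch, sem_ch, embed_ch=128, heads=4):
        super().__init__()
        self.q = nn.Conv2d(depth_ch, embed_ch, 1)
        self.kv = nn.Conv2d(sem_ch, embed_ch * 2, 1)
        self.attn = nn.MultiheadAttention(embed_ch, heads, batch_first=True)
        self.proj = nn.Conv2d(embed_ch, depth_ch, 1)

    def forward(self, depth_feat, sem_feat):
        b, _, h, w = depth_feat.shape
        # Match the semantic map to the depth-feature resolution.
        sem_feat = F.interpolate(sem_feat, size=(h, w), mode="bilinear",
                                 align_corners=False)
        q = self.q(depth_feat).flatten(2).transpose(1, 2)   # (B, HW, C)
        k, v = self.kv(sem_feat).chunk(2, dim=1)
        k = k.flatten(2).transpose(1, 2)
        v = v.flatten(2).transpose(1, 2)
        out, _ = self.attn(q, k, v)                         # global context
        out = out.transpose(1, 2).reshape(b, -1, h, w)
        return depth_feat + self.proj(out)                  # residual fusion

class EGM(nn.Module):
    """Edge guidance: emphasize decoder features near image gradients."""
    def __init__(self, ch):
        super().__init__()
        kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", kx.view(1, 1, 3, 3))
        self.register_buffer("ky", kx.t().contiguous().view(1, 1, 3, 3))
        self.fuse = nn.Conv2d(ch + 1, ch, 3, padding=1)

    def forward(self, feat, image):
        gray = image.mean(dim=1, keepdim=True)
        gx = F.conv2d(gray, self.kx, padding=1)
        gy = F.conv2d(gray, self.ky, padding=1)
        edge = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)         # Sobel magnitude
        edge = F.interpolate(edge, size=feat.shape[2:], mode="bilinear",
                             align_corners=False)
        return feat + self.fuse(torch.cat([feat, edge], dim=1)) * edge
```

Cross-attention is quadratic in spatial size, so a module like this is only practical at the coarse pyramid levels, which is consistent with the P5/P4/P3 embedding study in Table 3.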
-
Key words:
- depth estimation
- self-supervised
- FastSAM
- semantic information
- feature representation
-
Table 1. Quantitative results of different methods on the KITTI dataset

Method                Supervision  Abs Rel↓  Sq Rel↓  RMS↓   RMS log↓  δ1↑    δ2↑    δ3↑
Monodepth2[7]         M            0.115     0.903    4.863  0.193     0.877  0.959  0.981
SAFENet[16]           M+Se         0.112     0.788    4.582  0.187     0.878  0.963  0.983
SGDepth[25]           M+Se         0.113     0.835    4.693  0.191     0.879  0.961  0.981
SGRLDepth[28]         M            0.102     0.698    4.381  0.178     0.896  0.964  0.984
WaveletMonodepth[29]  S            0.106     0.824    4.824  0.205     0.870  0.949  0.975
FSREDepth[17]         M+Se         0.105     0.722    4.547  0.182     0.886  0.964  0.984
HRDepth[22]           M            0.109     0.792    4.632  0.185     0.884  0.962  0.983
DepthHints[30]        S            0.102     0.762    4.602  0.189     0.880  0.960  0.981
VADepth[31]           M            0.109     0.785    4.624  0.190     0.875  0.960  0.982
DIFFNet[2]            M            0.102     0.764    4.483  0.180     0.896  0.965  0.983
Ours                  M            0.101     0.688    4.383  0.178     0.900  0.965  0.984

Note: "↓" means lower is better, "↑" means higher is better. M denotes monocular self-supervision, S stereo self-supervision, and Se semantic supervision. In the original typesetting, the best result is shown in bold and the second best is underlined.
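The column headers in Tables 1-3 are the standard monocular-depth evaluation metrics introduced by Eigen et al. [8]; a minimal NumPy implementation for reference:

```python
# Standard depth-evaluation metrics used in Tables 1-3:
# Abs Rel, Sq Rel, RMS, RMS log, and the delta accuracy thresholds.
import numpy as np

def depth_metrics(pred, gt):
    """pred, gt: 1-D arrays of valid (masked) depth values in metres."""
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = (thresh < 1.25).mean()          # delta1: ratio within 1.25
    d2 = (thresh < 1.25 ** 2).mean()     # delta2: ratio within 1.25^2
    d3 = (thresh < 1.25 ** 3).mean()     # delta3: ratio within 1.25^3
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rms = np.sqrt(np.mean((pred - gt) ** 2))
    rms_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    return abs_rel, sq_rel, rms, rms_log, d1, d2, d3
```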
Table 2. Results of ablation experiments

Model           Abs Rel↓  Sq Rel↓  RMS↓   RMS log↓  δ1↑    δ2↑    δ3↑
Baseline        0.109     0.816    4.605  0.186     0.878  0.962  0.983
Baseline + SGM  0.102     0.709    4.436  0.180     0.893  0.964  0.984
Baseline + EGM  0.105     0.743    4.501  0.183     0.886  0.963  0.983
Ours (full)     0.101     0.688    4.383  0.178     0.900  0.965  0.984
Table 3. Quantitative results of different embedding schemes for the semantic guidance module

Number (levels)  Abs Rel↓  Sq Rel↓  RMS↓   RMS log↓  δ1↑    δ2↑    δ3↑
0                0.109     0.816    4.605  0.186     0.878  0.962  0.983
1 (P5)           0.106     0.783    4.545  0.184     0.884  0.963  0.983
2 (P5+P4)        0.104     0.761    4.507  0.182     0.887  0.964  0.984
2 (P5+P3)        0.103     0.743    4.474  0.182     0.889  0.963  0.984
3 (P5+P4+P3)     0.102     0.709    4.436  0.180     0.893  0.964  0.984
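Table 3 varies how many decoder pyramid levels (coarsest P5 down to P3) receive an SGM. The snippet below schematically wires up the best scheme, 3 (P5+P4+P3), continuing the SGM sketch given after the abstract; the level names and channel widths are illustrative assumptions.

```python
import torch.nn as nn
# Continues the SGM sketch above. One SGM per coarse decoder level, matching
# the 3(P5+P4+P3) row of Table 3; finer levels pass through unchanged.
sgms = nn.ModuleDict({
    "P5": SGM(depth_ch=512, sem_ch=256),   # channel widths are assumptions
    "P4": SGM(depth_ch=256, sem_ch=256),
    "P3": SGM(depth_ch=128, sem_ch=256),
})

def decode_with_sgm(feats, sem_feat):
    """feats: {level: decoder feature map}; sem_feat: FastSAM feature map."""
    return {lvl: sgms[lvl](f, sem_feat) if lvl in sgms else f
            for lvl, f in feats.items()}
```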
-
[1] CHAI G Q, BO X S, LIU H J, et al. Self-supervised monocular depth estimation based on uncertainty[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(12): 3780-3787 (in Chinese).
[2] ZHOU H, GREENWOOD D, TAYLOR S. Self-supervised monocular depth estimation with internal feature fusion[EB/OL]. (2023-11-20)[2021-10-18]. https://doi.org/10.48550/arXiv.2110.09482.
[3] ZHANG D D, WANG C P, FU Q. Camouflaged object detection network based on human visual mechanisms[J]. Journal of Beijing University of Aeronautics and Astronautics, 2025, 51(7): 2553-2561 (in Chinese).
[4] LEE J H, HAN M, KO D W, et al. From big to small: multi-scale local planar guidance for monocular depth estimation[EB/OL]. (2023-11-21)[2019-07-24]. https://doi.org/10.48550/arXiv.1907.10326.
[5] PATIL V, SAKARIDIS C, LINIGER A, et al. P3Depth: monocular depth estimation with a piecewise planarity prior[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 1600-1611.
[6] LIU C, KUMAR S, GU S, et al. VA-DepthNet: a variational approach to single image depth prediction[EB/OL]. (2023-11-22)[2023-02-13]. https://doi.org/10.48550/arXiv.2302.06556.
[7] GODARD C, MAC AODHA O, FIRMAN M, et al. Digging into self-supervised monocular depth estimation[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2019: 3827-3837.
[8] EIGEN D, PUHRSCH C, FERGUS R. Depth map prediction from a single image using a multi-scale deep network[C]//Proceedings of Advances in Neural Information Processing Systems. 2014.
[9] BHAT S F, ALHASHIM I, WONKA P. AdaBins: depth estimation using adaptive bins[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 4008-4017.
[10] BHAT S F, ALHASHIM I, WONKA P. LocalBins: improving depth estimation by learning local distributions[C]//Computer Vision–ECCV 2022. Berlin: Springer, 2022: 480-496.
[11] GUO X Y, LI H S, YI S, et al. Learning monocular depth by distilling cross-domain stereo networks[C]//Computer Vision–ECCV 2018. Berlin: Springer, 2018: 506-523.
[12] ZHAO S S, FU H, GONG M M, et al. Geometry-aware symmetric domain adaptation for monocular depth estimation[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2019: 9780-9790.
[13] RANFTL R, BOCHKOVSKIY A, KOLTUN V. Vision transformers for dense prediction[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2021: 12159-12168.
[14] KIM D, KA W, AHN P, et al. Global-local path networks for monocular depth estimation with vertical CutDepth[EB/OL]. (2023-11-30)[2022-01-19]. https://doi.org/10.48550/arXiv.2201.07436.
[15] ZHANG N, NEX F, VOSSELMAN G, et al. Lite-Mono: a lightweight CNN and transformer architecture for self-supervised monocular depth estimation[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2023: 18537-18546.
[16] CHOI J, JUNG D, LEE D, et al. SAFENet: self-supervised monocular depth estimation with semantic-aware feature extraction[EB/OL]. (2023-12-01)[2020-08-06]. https://doi.org/10.48550/arXiv.2010.02893.
[17] JUNG H, PARK E, YOO S. Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2021: 12622-12632.
[18] LI R, MAO Q, WANG P, et al. Semantic-guided representation enhancement for self-supervised monocular trained depth estimation[EB/OL]. (2023-12-03)[2020-12-15]. https://doi.org/10.48550/arXiv.2012.08048.
[19] KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything[EB/OL]. (2023-12-04)[2023-04-05]. https://doi.org/10.48550/arXiv.2304.02643.
[20] ZHAO X, DING W C, AN Y Q, et al. Fast segment anything[EB/OL]. (2023-12-05)[2023-06-21]. https://doi.org/10.48550/arXiv.2306.12156.
[21] PENG R, WANG R, LAI Y, et al. Excavating the potential capacity of self-supervised monocular depth estimation[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2021: 15540-15549.
[22] LYU X Y, LIU L, WANG M M, et al. HR-Depth: high resolution self-supervised monocular depth estimation[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2294-2301.
[23] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]//Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015. Berlin: Springer, 2015: 234-241.
[24] NEKRASOV V, DHARMASIRI T, SPEK A, et al. Real-time joint semantic segmentation and depth estimation using asymmetric annotations[C]//Proceedings of the 2019 International Conference on Robotics and Automation. Piscataway: IEEE Press, 2019: 7101-7107.
[25] KLINGNER M, TERMÖHLEN J A, MIKOLAJCZYK J, et al. Self-supervised monocular depth estimation: solving the dynamic object problem by semantic guidance[C]//Computer Vision–ECCV 2020. Berlin: Springer, 2020: 582-600.
[26] GEIGER A, LENZ P, URTASUN R. Are we ready for autonomous driving? The KITTI vision benchmark suite[C]//Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2012: 3354-3361.
[27] ZHOU T H, BROWN M, SNAVELY N, et al. Unsupervised learning of depth and ego-motion from video[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 6612-6619.
[28] GUIZILINI V, HOU R, LI J, et al. Semantically-guided representation learning for self-supervised monocular depth[EB/OL]. (2023-12-15)[2020-02-27]. https://doi.org/10.48550/arXiv.2002.12319.
[29] RAMAMONJISOA M, FIRMAN M, WATSON J, et al. Single image depth prediction with wavelet decomposition[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 11084-11093.
[30] WATSON J, FIRMAN M, BROSTOW G, et al. Self-supervised monocular depth hints[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2019: 2162-2171.
[31] XIANG J, WANG Y, AN L F, et al. Visual attention-based self-supervised absolute depth estimation using geometric priors in autonomous driving[J]. IEEE Robotics and Automation Letters, 2022, 7(4): 11998-12005.
[32] SAXENA A, SUN M, NG A Y. Make3D: learning 3D scene structure from a single still image[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(5): 824-840.
-

