217 results for "monocular depth estimation"
Deep Learning-Based Monocular Depth Estimation Methods—A State-of-the-Art Review
Monocular depth estimation from Red-Green-Blue (RGB) images is a well-studied ill-posed problem in computer vision that has been investigated intensively over the past decade using Deep Learning (DL) approaches. Recent approaches to monocular depth estimation mostly rely on Convolutional Neural Networks (CNNs). Estimating depth from two-dimensional images plays an important role in various applications, including scene reconstruction, 3D object detection, robotics, and autonomous driving. This survey provides a comprehensive overview of the topic, including the problem formulation and a short description of traditional methods for depth estimation. Relevant datasets and 13 state-of-the-art deep learning-based approaches for monocular depth estimation are reviewed, evaluated, and discussed. We conclude with a perspective on open challenges in monocular depth estimation that require further investigation.
Monocular Depth Estimation Using Deep Learning: A Review
In recent decades, significant advancements in robotics and autonomous vehicles have increased the demand for precise depth measurements. Depth estimation (DE) is a long-standing task in computer vision that can be addressed through numerous approaches, and it is vital in diverse applications such as augmented reality and target tracking. Conventional monocular DE (MDE) methods rely on depth cues for depth prediction, while various deep learning techniques have demonstrated their potential for tackling this traditionally ill-posed problem. The principal purpose of this paper is to present a state-of-the-art review of current developments in deep learning-based MDE. To this end, the paper highlights the critical points of state-of-the-art work on MDE from several aspects, including input data representations and training paradigms (supervised, semi-supervised, and unsupervised learning) together with the datasets and evaluation metrics used. Finally, limitations regarding the accuracy of DL-based MDE models, computational time requirements, real-time inference, transferability, input image shape, domain adaptation, and generalization are discussed to open new directions for future research.
DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation
This paper aims to address the problem of supervised monocular depth estimation. We start with a meticulous pilot study to demonstrate that long-range correlation is essential for accurate depth estimation. Moreover, the Transformer and convolution are good at long-range and close-range depth estimation, respectively. Therefore, we propose to adopt a parallel encoder architecture consisting of a Transformer branch and a convolution branch. The former can model global context with an effective attention mechanism, while the latter aims to preserve local information, as the Transformer lacks the spatial inductive bias needed to model such content. However, independent branches lead to a shortage of connections between features. To bridge this gap, we design a hierarchical aggregation and heterogeneous interaction module to enhance the Transformer features and model the affinity between the heterogeneous features in a set-to-set translation manner. Due to the prohibitive memory cost introduced by global attention on high-resolution feature maps, we adopt a deformable scheme to reduce the complexity. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins. The effectiveness of each proposed module is evaluated through extensive ablation studies.
Real-Time Single Image Depth Perception in the Wild with Handheld Devices
Depth perception is paramount for tackling real-world problems, ranging from autonomous driving to consumer applications. For the latter, depth estimation from a single image would represent the most versatile solution since a standard camera is available on almost any handheld device. Nonetheless, two main issues limit the practical deployment of monocular depth estimation methods on such devices: (i) the low reliability when deployed in the wild and (ii) the resources needed to achieve real-time performance, often not compatible with low-power embedded systems. Therefore, in this paper, we deeply investigate all these issues, showing how they are both addressable by adopting appropriate network design and training strategies. Moreover, we also outline how to map the resulting networks on handheld devices to achieve real-time performance. Our thorough evaluation highlights the ability of such fast networks to generalize well to new environments, a crucial feature required to tackle the extremely varied contexts faced in real applications. Indeed, to further support this evidence, we report experimental results concerning real-time, depth-aware augmented reality and image blurring with smartphones in the wild.
Robust 3D Multi-Object Tracking via 4D mmWave Radar-Camera Fusion and Disparity-Domain Depth Recovery
4D millimeter-wave radar provides high-precision ranging capability and exhibits strong robustness under adverse weather and low-visibility conditions, but its point clouds are relatively sparse and suffer from severe elevation-angle measurement noise. Monocular cameras, by contrast, provide rich semantic information and high recall, yet are fundamentally limited by scale ambiguity. To exploit the complementary characteristics of these two sensors, this paper proposes a radar-camera fusion 3D multi-object tracking framework that does not rely on complex 3D annotated data. First, on the radar signal-processing side, a Gaussian distribution-based adaptive angle compression method and IMU-based velocity compensation are introduced to effectively suppress measurement noise, and an improved DBSCAN clustering scheme with recursive cluster splitting and historical static-box guidance is employed to generate high-quality radar detections. Second, a disparity-domain metric depth recovery method is proposed. This method uses filtered radar points as sparse metric anchors, performs robust fitting with RANSAC, and applies Kalman filtering for temporal smoothing, thereby converting the relative depth output of the visual foundation model Depth Anything V2 into metric depth. Finally, a hierarchical fusion strategy is designed at both the detection and tracking levels to achieve stable cross-modal state association. Experimental results on a self-collected dataset show that the proposed method achieves an overall MOTA of 77.93%, outperforming single-modality baselines and other comparison methods by 11 to 31 percentage points. This study provides an effective solution for low-cost and robust environment perception in complex dynamic scenarios.
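The disparity-domain depth recovery step lends itself to a compact illustration: fit a scale and shift that maps relative depth to metric depth at sparse radar anchors, using RANSAC to reject outliers. The sketch below assumes a linear (affine) relative-to-metric model and illustrative function names; it is not the paper's exact formulation.

```python
import numpy as np

def ransac_scale_shift(rel_depth, metric_anchors, iters=200, thresh=0.5, seed=0):
    """Robustly fit metric = a * rel + b with RANSAC.

    rel_depth      : relative depths at radar anchor pixels, shape (N,)
    metric_anchors : metric radar ranges at the same pixels, shape (N,)
    """
    rng = np.random.default_rng(seed)
    n = len(rel_depth)
    best_inliers, best_ab = 0, (1.0, 0.0)
    for _ in range(iters):
        i, j = rng.choice(n, size=2, replace=False)
        if rel_depth[i] == rel_depth[j]:
            continue  # degenerate sample
        # two-point proposal for the affine mapping
        a = (metric_anchors[i] - metric_anchors[j]) / (rel_depth[i] - rel_depth[j])
        b = metric_anchors[i] - a * rel_depth[i]
        inliers = np.abs(a * rel_depth + b - metric_anchors) < thresh
        if inliers.sum() > best_inliers:
            best_inliers = inliers.sum()
            # refine on the consensus set with least squares
            A = np.stack([rel_depth[inliers], np.ones(best_inliers)], axis=1)
            a, b = np.linalg.lstsq(A, metric_anchors[inliers], rcond=None)[0]
            best_ab = (a, b)
    return best_ab

# synthetic check: true mapping metric = 8*rel + 2, with two radar outliers
rel = np.linspace(0.1, 1.0, 30)
metric = 8.0 * rel + 2.0
metric[[3, 11]] += 15.0   # simulate noisy elevation measurements
a, b = ransac_scale_shift(rel, metric)
```

In the paper's pipeline, a Kalman filter would additionally smooth the recovered mapping over time before converting the Depth Anything V2 output to metric depth.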
WeatherMono: A CNN-Transformer Architecture for Self-Supervised Monocular Depth Estimation in Rainy and Foggy Conditions
In rainy and foggy conditions, the scattering of light and the occlusion effects of atmospheric particles distort the reflected light from object surfaces, leading to inconsistent depth information. As a result, depth estimation models trained under clear weather conditions fail to generalize effectively to adverse weather conditions. To address this challenge, we propose a novel CNN-Transformer architecture, WeatherMono, for self-supervised monocular depth estimation under rainy and foggy weather. Rainy and foggy images often contain large regions of low contrast and blurry features. By combining Convolutional Neural Networks (CNNs) with Transformers, WeatherMono effectively captures both local and global contextual information, thus improving depth estimation accuracy. Specifically, we introduce a Multi-Scale Deformable Convolution (MDC) module and a Global-Local Feature Interaction (GLFI) module. The MDC module extracts detailed local features in rainy and foggy environments, while the GLFI module incorporates an efficient multi-head attention mechanism into the Transformer encoder, enabling more effective capture of both local and global information. This enhances the model’s ability to comprehend image features, strengthens its capability to handle low-contrast and blurry images, and ultimately improves the accuracy of depth estimation in adverse weather conditions. Experiments on WeatherKITTI show that WeatherMono achieves an AbsRel of 0.097, outperforming WeatherDepth (0.104) and RoboDepth (0.107). On DrivingStereo, it achieves an AbsRel of 0.149 (rain) and 0.101 (fog). Extensive qualitative and quantitative experiments demonstrate that WeatherMono significantly outperforms existing methods in terms of both accuracy and robustness under rainy and foggy conditions.
EFDepth: A Monocular Depth Estimation Model for Multi-Scale Feature Optimization
To address the accuracy issues in monocular depth estimation caused by insufficient feature extraction and inadequate context modeling, a multi-scale feature optimization model named EFDepth is proposed to improve prediction performance. The framework adopts an encoder-decoder structure: the encoder (EC-Net) is composed of MobileNetV3-E and ETFBlock, with features optimized through multi-scale dilated convolution; the decoder (LapFA-Net) combines the Laplacian pyramid and the FMA module to enhance cross-scale feature fusion and output accurate depth maps. Comparative experiments between EFDepth and algorithms including Lite-mono, Hr-depth, and Lapdepth were conducted on the KITTI dataset. The results show that, for the three error metrics (RMSE: Root Mean Square Error, AbsRel: Absolute Relative Error, and SqRel: Squared Relative Error), EFDepth is 1.623, 0.030, and 0.445 lower than the average values of the comparison algorithms, respectively, and for the three accuracy metrics it is 0.052, 0.023, and 0.011 higher than the average values, respectively. These results indicate that EFDepth outperforms the comparison methods on most metrics, providing an effective reference for monocular depth estimation and 3D reconstruction of complex scenes.
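The error and accuracy figures quoted throughout these abstracts follow the standard monocular-depth evaluation protocol. For reference, a minimal implementation of those metrics (the thresholded accuracies are the conventional delta < 1.25^k measures):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth error and accuracy metrics.

    pred, gt : predicted and ground-truth depths (same shape, gt > 0).
    Returns (AbsRel, SqRel, RMSE, (d1, d2, d3)).
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    abs_rel = np.mean(np.abs(pred - gt) / gt)      # Absolute Relative Error
    sq_rel  = np.mean((pred - gt) ** 2 / gt)       # Squared Relative Error
    rmse    = np.sqrt(np.mean((pred - gt) ** 2))   # Root Mean Square Error
    ratio   = np.maximum(pred / gt, gt / pred)     # symmetric depth ratio
    d1 = np.mean(ratio < 1.25)                     # accuracy: delta < 1.25
    d2 = np.mean(ratio < 1.25 ** 2)                # accuracy: delta < 1.25^2
    d3 = np.mean(ratio < 1.25 ** 3)                # accuracy: delta < 1.25^3
    return abs_rel, sq_rel, rmse, (d1, d2, d3)
```

In practice these are computed over valid ground-truth pixels only, often after median scaling for self-supervised models; that preprocessing is omitted here for brevity.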
RepACNet: A Lightweight Reparameterized Asymmetric Convolution Network for Monocular Depth Estimation
Monocular depth estimation (MDE) is a cornerstone task in 2D/3D scene reconstruction and recognition, with widespread applications in autonomous driving, robotics, and augmented reality. However, existing state-of-the-art methods face a fundamental trade-off between computational efficiency and estimation accuracy, limiting their deployment in resource-constrained real-world scenarios; lightweight yet effective models are therefore highly desirable for deployment on resource-constrained mobile devices. To address this problem, we present RepACNet, a novel lightweight network built on reparameterized asymmetric convolution designs and a CNN-based architecture that integrates MLP-Mixer components. First, we propose the Reparameterized Token Mixer with Asymmetric Convolution (RepTMAC), an efficient block that captures long-range dependencies while maintaining linear computational complexity. Unlike Transformer-based methods, our approach achieves global feature interaction with tiny overhead. Second, we introduce Squeeze-and-Excitation Consecutive Dilated Convolutions (SECDCs), which integrate adaptive channel attention with dilated convolutions to capture depth-specific features across multiple scales. We validate the effectiveness of our approach through extensive experiments on two widely recognized benchmarks, NYU Depth v2 and KITTI Eigen. The experimental results demonstrate that our model achieves competitive performance while maintaining significantly fewer parameters compared to state-of-the-art models.
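The "reparameterized asymmetric convolution" in the title refers to a known trick: because convolution is linear, parallel square, horizontal, and vertical kernels used during training can be fused into a single square kernel at inference time by zero-padded addition. A toy single-channel check of that equivalence (RepACNet's actual blocks are more elaborate):

```python
import numpy as np

def conv2d(x, k):
    """'Valid' 2D cross-correlation of a single-channel image with kernel k."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k_sq = rng.standard_normal((3, 3))   # square training branch
k_h  = rng.standard_normal((1, 3))   # horizontal (1x3) branch
k_v  = rng.standard_normal((3, 1))   # vertical (3x1) branch

# training-time output: sum of the three parallel branches
# (the 1x3 / 3x1 branches act on the center row / column of each 3x3 window)
y_train = conv2d(x, k_sq) + conv2d(x[1:-1, :], k_h) + conv2d(x[:, 1:-1], k_v)

# inference-time: fuse the branches into one 3x3 kernel by zero-padded addition
k_fused = k_sq.copy()
k_fused[1, :] += k_h[0]      # pad 1x3 into the middle row
k_fused[:, 1] += k_v[:, 0]   # pad 3x1 into the middle column
y_fused = conv2d(x, k_fused)
```

The fused kernel reproduces the multi-branch output exactly, which is what lets such networks train with richer structure but run with plain convolutions.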
A Monocular Depth Estimation Method for Autonomous Driving Vehicles Based on Gaussian Neural Radiance Fields
Monocular depth estimation is one of the key tasks in autonomous driving: it derives depth information of the scene from a single image and is a fundamental component of vehicle decision-making and perception. However, current approaches face challenges such as visual artifacts, scale ambiguity, and occlusion handling. These limitations lead to suboptimal performance in complex environments, reducing model efficiency and generalization and hindering broader use in autonomous driving and other applications. To address these challenges, this paper introduces a Neural Radiance Field (NeRF)-based monocular depth estimation method for autonomous driving. It employs a Gaussian probability-based ray sampling strategy to effectively address the problem of massive numbers of sampling points in large, complex scenes and to reduce computational costs. To improve generalization, a lightweight spherical network incorporating a fine-grained adaptive channel attention mechanism is designed to capture detailed pixel-level features. These features are subsequently mapped to 3D spatial sampling locations, resulting in diverse and expressive point representations that improve the generalizability of the NeRF model. Our approach exhibits remarkable performance on the KITTI benchmark, surpassing traditional methods in depth estimation tasks. This work contributes significant technical advancements for practical monocular depth estimation in autonomous driving applications.
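The Gaussian ray-sampling idea can be sketched in a few lines: rather than spacing samples uniformly along each camera ray, draw depths from a Gaussian centered on a coarse depth guess, so that samples concentrate near the expected surface. The parameterization below is an illustrative assumption, not the paper's exact scheme.

```python
import numpy as np

def gaussian_ray_samples(depth_guess, sigma, n_samples, near, far, rng):
    """Sample depths along one ray from N(depth_guess, sigma^2), clipped to [near, far].

    depth_guess : coarse depth estimate for this ray (e.g. from a prior pass)
    sigma       : spread of samples around the guess
    near, far   : scene depth bounds
    Returns sorted sample depths, ready for volume-rendering integration.
    """
    t = rng.normal(depth_guess, sigma, size=n_samples)
    return np.sort(np.clip(t, near, far))

rng = np.random.default_rng(0)
samples = gaussian_ray_samples(depth_guess=10.0, sigma=1.0, n_samples=64,
                               near=0.5, far=50.0, rng=rng)
```

Compared with uniform stratified sampling, far fewer samples are wasted on empty space, which is the cost saving the abstract attributes to this strategy in large scenes.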
Enhancing long-range depth estimation via heterogeneous CNN-transformer encoding and cross-dimensional semantic fusion
Monocular depth estimation enables 3D scene reconstruction from a single 2D image, offering a cost-effective solution widely applied in autonomous driving and UAVs. However, existing deep neural networks often fail to balance local texture details with global contextual information, leading to significant inaccuracies in distant-region depth prediction. To address this challenge, we introduce a novel monocular depth estimation framework featuring a heterogeneous encoder and a Cross-dimensional Semantic Fusion (CSF) module. The heterogeneous encoder integrates the initial convolutional layers of ResNet-50 with the hierarchical attention mechanism of Swin Transformer to efficiently capture both local details and long-range dependencies. Specifically targeting the characteristics of distant objects—low pixel occupancy but high semantic relevance—the CSF module enhances feature aggregation in the decoder through multi-scale interactions and spatial-channel coupling. Additionally, the decoder incorporates a Depth-Separable Upsampling Block (DSUB) and a Multi-scale Self-Attention (MSA) module to refine detail restoration and ensure spatial consistency. Experiments validate the superiority of our method. On the KITTI dataset, it achieves leading results: 0.050 Abs-Rel, 2.107 RMSE, and a long-range error of 0.2725. The SUN RGB-D dataset demonstrates strong generalization with an Abs-Rel of 0.142. This framework significantly advances long-range depth estimation research and shows broad application prospects.