Search Results

35 results for "RGB-D perception"
Visual Localization and Mapping in Dynamic and Changing Environments
The real-world deployment of fully autonomous mobile robots depends on a robust simultaneous localization and mapping (SLAM) system capable of handling dynamic environments, where objects move in front of the robot, and changing environments, where objects are moved or replaced after the robot has already mapped the scene. This paper proposes Changing-SLAM, a method for robust visual SLAM in both dynamic and changing environments. This is achieved by combining a Bayesian filter with a long-term data association algorithm. The method also employs an efficient dynamic-keypoint filtering algorithm based on object detection that correctly identifies non-dynamic features inside detected bounding boxes, preventing a depletion of features that could cause tracking loss. Furthermore, a new RGB-D dataset, the PUC-USP dataset, was developed specifically for object-level evaluation of changing environments. Six sequences were created using a mobile robot, an RGB-D camera, and a motion capture system, designed to capture different scenarios that could lead to tracking failure or map corruption. Changing-SLAM does not assume a given camera pose or a known map, and it operates in real time. The proposed method was evaluated on benchmark datasets and compared with other state-of-the-art methods, proving to be highly accurate.
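
As a rough illustration of the kind of per-keypoint Bayesian filtering such a pipeline might use (this is not the authors' actual formulation), the sketch below updates a keypoint's probability of being dynamic from a binary observation supplied by an object detector; the prior, likelihood values, and thresholding idea are assumptions made for illustration.

    # Hypothetical sketch: per-keypoint dynamic/static belief update via Bayes' rule.
    # The likelihood values below are assumptions, not taken from the Changing-SLAM paper.

    def update_dynamic_belief(p_dynamic, observed_in_moving_box,
                              p_obs_given_dynamic=0.9, p_obs_given_static=0.2):
        """Return the posterior probability that a keypoint is dynamic."""
        if observed_in_moving_box:
            num = p_obs_given_dynamic * p_dynamic
            den = num + p_obs_given_static * (1.0 - p_dynamic)
        else:
            num = (1.0 - p_obs_given_dynamic) * p_dynamic
            den = num + (1.0 - p_obs_given_static) * (1.0 - p_dynamic)
        return num / den

    # Example: a keypoint observed inside a detected moving object's bounding box twice in a row.
    p = 0.5
    for obs in (True, True):
        p = update_dynamic_belief(p, obs)
    print(f"posterior P(dynamic) = {p:.3f}")  # keypoints above a threshold would be excluded from tracking
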
RA6D: Reliability-Aware 6D Pose Estimation via Attention-Guided Point Cloud in Aerosol Environments
We address the problem of 6D object pose estimation in aerosol environments, where RGB and depth sensors experience correlated degradation due to scattering and absorption. Handling such spatially varying degradation typically requires depth restoration, but obtaining ground-truth complete depth in aerosol conditions is prohibitively expensive. To overcome this limitation without relying on costly depth completion, we propose RA6D, a framework that integrates attention-guided reliability modeling with feature distillation. The attention map generated during RGB dehazing reflects aerosol distribution and provides a compact indicator of depth reliability. By embedding this attention as an additional feature in an Attention-Guided Point cloud (AGP), the network can adaptively respond to spatially varying degradation. In addition, to address the scarcity of aerosol-domain data, we employ clean-to-aerosol feature distillation, transferring robust representations learned under clean conditions. Experiments on aerosol benchmarks show that RA6D achieves higher accuracy and significantly faster inference than restoration-based pipelines, offering a practical solution for real-time robotic perception under severe visual degradation.
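
A minimal sketch of one way an "attention-guided point cloud" could be assembled: back-project the depth map with a pinhole camera model and attach the dehazing attention value as a per-point reliability channel. The function name, intrinsics, and random inputs below are illustrative assumptions, not details taken from the paper.

    import numpy as np

    # Hypothetical sketch of an attention-guided point cloud: back-project a depth map
    # with a pinhole model and append the dehazing attention value as a fourth channel.
    # Intrinsics and array shapes are assumptions for illustration.

    def build_attention_guided_points(depth, attention, fx, fy, cx, cy):
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        pts = np.stack([x, y, z, attention], axis=-1).reshape(-1, 4)
        return pts[pts[:, 2] > 0]  # drop pixels with no valid depth

    depth = np.random.uniform(0.5, 3.0, (480, 640))
    attention = np.random.uniform(0.0, 1.0, (480, 640))  # stand-in for the dehazing attention map
    agp = build_attention_guided_points(depth, attention, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
    print(agp.shape)  # (N, 4): x, y, z, reliability
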
Depth Matters: Geometry-Aware RGB-D-Based Transformer-Enabled Deep Reinforcement Learning for Mapless Navigation
Autonomous navigation in unknown environments demands policies that can jointly perceive semantic context and geometric safety. Existing Transformer-enabled deep reinforcement learning (DRL) frameworks, such as the Goal-guided Transformer Soft Actor–Critic (GoT-SAC), rely on temporal stacking of multiple RGB frames, which encodes short-term motion cues but lacks explicit spatial understanding. This study introduces a geometry-aware RGB-D early fusion modality that replaces temporal redundancy with cross-modal alignment between appearance and depth. Within the GoT-SAC framework, we integrate a pixel-aligned RGB-D input into the Transformer encoder, enabling the attention mechanism to simultaneously capture semantic textures and obstacle geometry. A comprehensive systematic ablation study was conducted across five modality variants (4RGB, RGB-D, G-D, 4G-D, and 4RGB-D) and three fusion strategies (early, parallel, and late) under identical hyperparameter settings in a controlled simulation environment. The proposed RGB-D early fusion achieved a 40.0% success rate and +94.1 average reward, surpassing the canonical 4RGB baseline (28.0% success, +35.2 reward), while a tuned configuration further improved performance to 54.0% success and +146.8 reward. These results establish early pixel-level multimodal fusion (RGB-D) as a principled and efficient successor to temporal stacking, yielding higher stability, sample efficiency, and geometry-aware decision-making. This work provides the first controlled evidence that spatially aligned multimodal fusion within Transformer-based DRL significantly enhances mapless navigation performance and offers a reproducible foundation for sim-to-real transfer in autonomous mobile robots.
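
To make the early-fusion idea concrete, here is a minimal sketch (assuming PyTorch and made-up normalization constants) that stacks normalized RGB and depth into a single pixel-aligned 4-channel tensor before it reaches an encoder; it is not the GoT-SAC implementation.

    import torch

    # Hypothetical sketch of pixel-aligned early fusion: RGB and depth are normalized
    # and stacked into one 4-channel tensor. The max depth range is an assumption.

    def early_fuse(rgb_uint8, depth_m, max_depth=10.0):
        rgb = rgb_uint8.float() / 255.0               # (3, H, W) in [0, 1]
        depth = (depth_m / max_depth).clamp(0, 1)     # (1, H, W) in [0, 1]
        return torch.cat([rgb, depth], dim=0)         # (4, H, W), pixel-aligned

    rgb = torch.randint(0, 256, (3, 64, 64), dtype=torch.uint8)
    depth = torch.rand(1, 64, 64) * 10.0
    x = early_fuse(rgb, depth)
    print(x.shape)  # torch.Size([4, 64, 64])
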
Probabilistic Indoor 3D Object Detection from RGB-D via Gaussian Distribution Estimation
Conventional object detectors represent each object by a deterministic bounding box, regressing its center and size from RGB images. However, such discrete parameterization ignores the inherent uncertainty in object appearance and geometric projection, which can be more naturally modeled as a probabilistic density field. Recent works have introduced Gaussian-based formulations that treat objects as distributions rather than boxes, yet they remain limited to 2D images or require late fusion between image and depth modalities. In this paper, we propose a unified Gaussian-based framework for direct 3D object detection from RGB-D inputs. Our method is built upon a vision transformer backbone to effectively capture global context. Instead of separately embedding RGB and depth features or refining depth within region proposals, our method takes a full four-channel RGB-D tensor and predicts the mean and covariance of a 3D Gaussian distribution for each object in a single forward pass. We extend a pretrained vision transformer to accept four-channel inputs by augmenting the patch embedding layer while preserving ImageNet-learned representations. This formulation allows the detector to represent both object location and geometric uncertainty in 3D space. By optimizing divergence metrics such as the Kullback–Leibler or Bhattacharyya distances between predicted and target distributions, the network learns a physically consistent probabilistic representation of objects. Experimental results on the SUN RGB-D benchmark demonstrate that our approach achieves competitive performance compared to state-of-the-art point-cloud-based methods while offering uncertainty-aware and geometrically interpretable 3D detections.
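
The distributional objective mentioned above has a standard closed form; the sketch below computes the Kullback–Leibler divergence between two 3D Gaussians with NumPy. The example means and covariances are invented, and the paper's exact loss may differ.

    import numpy as np

    # Standard closed-form KL divergence between two multivariate Gaussians,
    # the kind of distributional distance the abstract cites as a training objective.

    def kl_gaussian(mu0, cov0, mu1, cov1):
        k = mu0.shape[0]
        cov1_inv = np.linalg.inv(cov1)
        diff = mu1 - mu0
        return 0.5 * (np.trace(cov1_inv @ cov0)
                      + diff @ cov1_inv @ diff
                      - k
                      + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

    # Predicted vs. ground-truth object distributions (means in metres, diagonal covariances).
    mu_pred, cov_pred = np.array([1.0, 0.5, 2.0]), np.diag([0.04, 0.04, 0.09])
    mu_gt,   cov_gt   = np.array([1.1, 0.5, 2.1]), np.diag([0.05, 0.05, 0.08])
    print(kl_gaussian(mu_pred, cov_pred, mu_gt, cov_gt))
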
Collaborative Viewpoint Adjusting and Grasping via Deep Reinforcement Learning in Clutter Scenes
For the robotic grasping of randomly stacked objects in a cluttered environment, active multi-viewpoint methods can improve grasping performance by improving the robot's perception of the environment. However, in many scenes it is redundant to always use multiple viewpoints for grasp detection, which reduces the robot's grasping efficiency. To improve grasping performance, we present a Viewpoint Adjusting and Grasping Synergy (VAGS) strategy based on deep reinforcement learning that directly coordinates viewpoint adjustment and grasping. To improve the training efficiency of VAGS, we propose a Dynamic Action Exploration Space (DAES) method based on ε-greedy exploration to reduce training time. To address the sparse reward problem in reinforcement learning, a reward function is designed to evaluate the impact of adjusting the camera pose on grasping performance. According to experimental findings in simulation and the real world, the VAGS method improves the grasping success rate and the scene clearing rate. Compared with direct grasping alone, our proposed strategy increases the grasping success rate and the scene clearing rate by 10.49% and 11%, respectively.
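
As a loose illustration of ε-greedy exploration over a shrinking candidate set (in the spirit of, but not identical to, the DAES idea described above), the following sketch uses an assumed linear ε schedule and a toy action set.

    import random

    # Hypothetical sketch: epsilon-greedy action selection where random exploration is
    # restricted to a candidate set that shrinks as epsilon decays. The schedule and
    # the toy action values are assumptions for illustration only.

    def select_action(q_values, step, eps_start=1.0, eps_end=0.1, decay_steps=10_000):
        eps = max(eps_end, eps_start - (eps_start - eps_end) * step / decay_steps)
        k = max(1, int(len(q_values) * eps))  # exploration set shrinks with epsilon
        ranked = sorted(range(len(q_values)), key=lambda a: q_values[a], reverse=True)
        if random.random() < eps:
            return random.choice(ranked[:k])  # explore within the shrinking candidate set
        return ranked[0]                      # exploit the greedy action

    q = [0.2, 0.8, 0.1, 0.5]  # toy Q-values for {adjust-view-left, grasp, adjust-view-right, push}
    print(select_action(q, step=500))
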
Deep Learning Computer Vision-Based Automated Localization and Positioning of the ATHENA Parallel Surgical Robot
Manual alignment between the trocar, surgical instrument, and robot during minimally invasive surgery (MIS) can be time-consuming and error-prone, and many existing systems do not provide autonomous localization and pose estimation. This paper presents an artificial intelligence (AI)-assisted, vision-guided framework for automated localization and positioning of the ATHENA parallel surgical robot. The proposed approach combines an Intel RealSense RGB–depth (RGB-D) camera with a You Only Look Once version 11 (YOLO11) object detection model to estimate the 3D spatial coordinates of key surgical components in real time. The estimated coordinates are streamed over Transmission Control Protocol/Internet Protocol (TCP/IP) to a programmable logic controller (PLC) using Modbus/TCP, enabling closed-loop robot positioning for automated docking. Experimental validation in a controlled setup designed to replicate key intraoperative constraints demonstrated submillimeter positioning accuracy (≤0.8 mm), an average end-to-end latency of 67 ms, and a 42% reduction in setup time compared with manual alignment, while remaining robust under variable lighting. These results indicate that the proposed perception-to-control pipeline is a practical step toward reliable autonomous robotic docking in MIS workflows.
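
As a deliberately simplified sketch of the perception-to-control hand-off, the snippet below streams an estimated 3D coordinate to a controller over a plain TCP socket; the actual system described above uses Modbus/TCP registers on a PLC, and the JSON payload, host, and port here are purely illustrative assumptions.

    import json
    import socket

    # Simplified illustrative sketch only: plain JSON over TCP, not the Modbus/TCP
    # framing used by the real ATHENA setup. Host, port, and payload are assumptions.

    def send_target(host, port, name, xyz_mm):
        payload = json.dumps({"object": name, "x": xyz_mm[0], "y": xyz_mm[1], "z": xyz_mm[2]})
        with socket.create_connection((host, port), timeout=1.0) as sock:
            sock.sendall(payload.encode("utf-8"))

    # send_target("192.168.0.10", 5020, "trocar", (412.3, 105.7, 289.4))
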
RGB-D terrain perception and dense mapping for legged robots
This paper addresses unstructured terrain modeling for navigation with legged robots. We present an improved elevation grid concept adapted to the specific requirements of a small legged robot with limited perceptual capabilities. We propose an extension of the elevation grid update mechanism that incorporates a formal treatment of spatial uncertainty. Moreover, this paper presents uncertainty models for a structured-light RGB-D sensor and a stereo vision camera used to produce a dense depth map. The uncertainty model for the stereo vision camera is based on uncertainty propagation from calibration, through the undistortion and rectification algorithms, allowing calculation of the uncertainty of measured 3D point coordinates. The proposed uncertainty models were used to construct a terrain elevation map using the Videre Design STOC stereo vision camera and Kinect-like range sensors. We provide experimental verification of the proposed mapping method and a comparison with another recently published terrain mapping method for walking robots.
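
A minimal sketch of an uncertainty-aware elevation-cell update in the spirit of the abstract: each new height measurement is fused with the stored estimate using a one-dimensional Kalman update, with placeholder variances rather than the paper's sensor uncertainty models.

    # Illustrative sketch: variance-weighted fusion of a height measurement into an
    # elevation-grid cell (a 1-D Kalman update). The variances are placeholders.

    def fuse_cell(height, var, z_meas, var_meas):
        """Fuse a new height measurement into an elevation-grid cell."""
        k = var / (var + var_meas)           # Kalman gain
        height_new = height + k * (z_meas - height)
        var_new = (1.0 - k) * var
        return height_new, var_new

    cell = (0.10, 0.05)                      # initial height estimate [m] and variance [m^2]
    for z, v in [(0.14, 0.02), (0.12, 0.01)]:
        cell = fuse_cell(*cell, z, v)
    print(cell)  # fused height and reduced variance
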
A Semi-Supervised Semantic Segmentation Method for Blast-Hole Detection
The goal of blast-hole detection is to help place explosive charges into blast-holes. This process is challenging because it requires extracting sample features in complex environments and detecting a wide variety of blast-holes. Detection techniques based on deep learning with RGB-D semantic segmentation have emerged in recent years and achieved good results. However, semantic segmentation based on deep learning usually requires a large amount of labeled data, which places a large burden on dataset production. To address the scarcity of training data available for blast-hole detection by explosive charging equipment, this paper extends the core idea of semi-supervised learning to RGB-D semantic segmentation and devises an ERF-AC-PSPNet model based on a symmetric encoder–decoder structure. The model adds a residual connection layer and a dilated convolution layer for down-sampling, followed by an attention complementary module to acquire the feature maps, and uses a pyramid scene parsing network to achieve hole segmentation during decoding. A new semi-supervised learning method based on pseudo-labeling and self-training is proposed to train the model for intelligent detection of blast-holes. The pseudo-labeling scheme is based on the HOG algorithm and depth data, and proved effective in experiments. To verify the validity of the method, we carried out experiments on images of blast-holes collected at a mine site. Compared to previous segmentation methods, our method is less dependent on labeled data and achieved IoU of 0.810, 0.867, 0.923, and 0.945 at labeling ratios of 1/8, 1/4, 1/2, and 1, respectively.
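
A high-level, illustrative sketch of a pseudo-labeling/self-training loop like the one described above; the stub train and predict functions stand in for the actual ERF-AC-PSPNet training and the HOG/depth-based pseudo-labeling, so only the control flow reflects the technique.

    import random

    # Illustrative-only sketch of pseudo-labeling with self-training. The stub model
    # and data stand in for ERF-AC-PSPNet and the blast-hole images.

    def train(model, dataset):          # stub: a real implementation would run SGD here
        model["seen"] += len(dataset)

    def predict(model, image):          # stub: returns a fake mask and a confidence score
        return "mask", random.uniform(0.7, 1.0)

    def self_train(model, labeled, unlabeled, rounds=3, conf_threshold=0.9):
        train(model, labeled)                           # supervised warm-up on labeled data
        for _ in range(rounds):
            pseudo = []
            for img in unlabeled:
                mask, conf = predict(model, img)
                if conf >= conf_threshold:
                    pseudo.append((img, mask))          # keep only confident pseudo-labels
            train(model, labeled + pseudo)              # retrain on labeled + pseudo-labeled data
        return model

    model = {"seen": 0}
    print(self_train(model, labeled=[("img", "gt")] * 8, unlabeled=["img"] * 32))
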
RGB-D salient object detection: A survey
Salient object detection, which simulates human visual perception in locating the most significant object(s) in a scene, has been widely applied to various computer vision tasks. Now, the advent of depth sensors means that depth maps can easily be captured; this additional spatial information can boost the performance of salient object detection. Although various RGB-D based salient object detection models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this field remains lacking. In this paper, we provide a comprehensive survey of RGB-D based salient object detection models from various perspectives, and review related benchmark datasets in detail. Further, as light fields can also provide depth maps, we review salient object detection models and popular benchmark datasets from this domain too. Moreover, to investigate the ability of existing models to detect salient objects, we have carried out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models. Finally, we discuss several challenges and open directions of RGB-D based salient object detection for future research. All collected models, benchmark datasets, datasets constructed for attribute-based evaluation, and related code are publicly available at https://github.com/taozh2017/RGBD-SODsurvey .
Unifying Terrain Awareness for the Visually Impaired through Real-Time Semantic Segmentation
Navigational assistance aims to help visually-impaired people to ambulate the environment safely and independently. This topic becomes challenging as it requires detecting a wide variety of scenes to provide higher level assistive awareness. Vision-based technologies with monocular detectors or depth sensors have sprung up within several years of research. These separate approaches have achieved remarkable results with relatively low processing time and have improved the mobility of impaired people to a large extent. However, running all detectors jointly increases the latency and burdens the computational resources. In this paper, we put forward seizing pixel-wise semantic segmentation to cover navigation-related perception needs in a unified way. This is critical not only for the terrain awareness regarding traversable areas, sidewalks, stairs and water hazards, but also for the avoidance of short-range obstacles, fast-approaching pedestrians and vehicles. The core of our unification proposal is a deep architecture, aimed at attaining efficient semantic understanding. We have integrated the approach in a wearable navigation system by incorporating robust depth segmentation. A comprehensive set of experiments prove the qualified accuracy over state-of-the-art methods while maintaining real-time speed. We also present a closed-loop field test involving real visually-impaired users, demonstrating the effectivity and versatility of the assistive framework.