5,540 result(s) for "visual features"
An Improved Underwater Visual SLAM through Image Enhancement and Sonar Fusion
To enhance the performance of visual SLAM in underwater environments, this paper presents an enhanced front-end method based on visual feature enhancement. The method comprises three modules aimed at optimizing and improving the matching capability of visual features from different perspectives. Firstly, to address issues related to insufficient underwater illumination and uneven distribution of artificial light sources, a brightness-consistency recovery method is proposed. This method employs an adaptive histogram equalization algorithm to balance the brightness of images. Secondly, a method for denoising underwater suspended particulates is introduced to filter out noise from images. After image-level processing, a combined underwater acousto–optic feature-association method is proposed, which associates acoustic features from sonar with visual features, thereby providing distance information for visual features. Finally, utilizing the AFRL dataset, the improved system incorporating the proposed enhancement methods is evaluated for its performance against the OKVIS framework. The system achieves a better trajectory estimation accuracy compared to OKVIS and demonstrates robustness in underwater environments.
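As an illustration of the brightness-balancing step described above, the following is a minimal sketch of tile-based histogram equalization in NumPy. It is not the paper's method: the function names `equalize_tile` and `tiled_equalize` and the 2x2 tile grid are assumptions made for this example, and the sketch omits the contrast limiting and tile-boundary interpolation that practical adaptive equalizers usually add.

```python
import numpy as np


def equalize_tile(tile: np.ndarray) -> np.ndarray:
    """Histogram-equalize one 8-bit grayscale tile using its cumulative histogram."""
    hist = np.bincount(tile.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest occupied level
    denom = tile.size - cdf_min
    if denom == 0:                     # constant tile: nothing to stretch
        return tile.copy()
    # Map each grey level so the occupied range spreads over 0..255.
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255).astype(np.uint8)
    return lut[tile]


def tiled_equalize(img: np.ndarray, tiles=(2, 2)) -> np.ndarray:
    """Equalize each tile independently (hypothetical simplification of adaptive equalization).

    Assumes a uint8 grayscale image with at least one pixel per tile.
    """
    out = np.empty_like(img)
    row_groups = np.array_split(np.arange(img.shape[0]), tiles[0])
    col_groups = np.array_split(np.arange(img.shape[1]), tiles[1])
    for rows in row_groups:
        for cols in col_groups:
            r0, r1 = rows[0], rows[-1] + 1
            c0, c1 = cols[0], cols[-1] + 1
            out[r0:r1, c0:c1] = equalize_tile(img[r0:r1, c0:c1])
    return out


if __name__ == "__main__":
    # A dark horizontal gradient standing in for an under-lit underwater frame.
    dark = np.tile(np.arange(64, dtype=np.uint8), (64, 1))
    balanced = tiled_equalize(dark)
    print(dark.max(), balanced.max())
```

Because each tile is stretched using only its own histogram, regions lit unevenly by an artificial source are balanced separately, which is the motivation for adaptive rather than global equalization.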
Brain Decoding of Multiple Subjects for Estimating Visual Information Based on a Probabilistic Generative Model
Brain decoding is a process of decoding human cognitive contents from brain activities. However, improving the accuracy of brain decoding remains difficult due to the unique characteristics of the brain, such as the small sample size and high dimensionality of brain activities. Therefore, this paper proposes a method that effectively uses multi-subject brain activities to improve brain decoding accuracy. Specifically, we distinguish between the shared information common to multi-subject brain activities and the individual information based on each subject’s brain activities, and both types of information are used to decode human visual cognition. Both types of information are extracted as features belonging to a latent space using a probabilistic generative model. In the experiment, a publicly available dataset and five subjects were used, and the estimation accuracy was validated using a confidence score ranging from 0 to 1, where a larger value indicates better performance. The proposed method achieved a confidence score of 0.867 for the best subject and an average of 0.813 across the five subjects, the best among the compared methods. The experimental results show that the proposed method decodes visual cognition more accurately than existing methods that do not distinguish the shared information from the individual information.
A Concise and Varied Visual Features-Based Image Captioning Model with Visual Selection
Image captioning has gained increasing attention in recent years. Visual characteristics found in input images play a crucial role in generating high-quality captions. Prior studies have used visual attention mechanisms to dynamically focus on localized regions of the input image, improving the effectiveness of identifying relevant image regions at each step of caption generation. However, providing image captioning models with the capability of selecting the most relevant visual features from the input image and attending to them can significantly improve the utilization of these features. Consequently, this leads to enhanced captioning network performance. In light of this, we present an image captioning framework that efficiently exploits the extracted representations of the image. Our framework comprises three key components: the Visual Feature Detector module (VFD), the Visual Feature Visual Attention module (VFVA), and the language model. The VFD module is responsible for detecting a subset of the most pertinent features from the local visual features, creating an updated visual features matrix. Subsequently, the VFVA directs its attention to the visual features matrix generated by the VFD, resulting in an updated context vector employed by the language model to generate an informative description. Integrating the VFD and VFVA modules introduces an additional layer of processing for the visual features, thereby contributing to enhancing the image captioning model’s performance. Using the MS-COCO dataset, our experiments show that the proposed framework competes well with state-of-the-art methods, effectively leveraging visual representations to improve performance. The implementation code can be found here: (accessed on 30 July 2024).
DP-AMF: Depth-Prior–Guided Adaptive Multi-Modal and Global–Local Fusion for Single-View 3D Reconstruction
Single-view 3D reconstruction remains fundamentally ill-posed, as a single RGB image lacks scale and depth cues, often yielding ambiguous results under occlusion or in texture-poor regions. We propose DP-AMF, a novel Depth-Prior–Guided Adaptive Multi-Modal and Global–Local Fusion framework that integrates high-fidelity depth priors—generated offline by the MARIGOLD diffusion-based estimator and cached to avoid extra training cost—with hierarchical local features from ResNet-32/ResNet-18 and semantic global features from DINO-ViT. A learnable fusion module dynamically adjusts per-channel weights to balance these modalities according to local texture and occlusion, and an implicit signed-distance field decoder reconstructs the final mesh. Extensive experiments on 3D-FRONT and Pix3D demonstrate that DP-AMF reduces Chamfer Distance by 7.64%, increases F-Score by 2.81%, and boosts Normal Consistency by 5.88% compared to strong baselines, while qualitative results show sharper edges and more complete geometry in challenging scenes. DP-AMF achieves these gains without substantially increasing model size or inference time, offering a robust and effective solution for complex single-view reconstruction tasks.
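For readers unfamiliar with the Chamfer Distance reported above, here is a minimal NumPy sketch of one common symmetric definition (mean nearest-neighbor Euclidean distance in both directions). The exact variant used in the paper (squared or unsquared, summed or averaged) is not stated in the abstract, so this form and the function name `chamfer_distance` are assumptions.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumed variant: mean unsquared nearest-neighbor distance, summed over both directions.
    """
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # a -> b term plus b -> a term.
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


if __name__ == "__main__":
    pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    shifted = pts + np.array([0.0, 0.0, 1.0])
    print(chamfer_distance(pts, pts), chamfer_distance(pts, shifted))
```

Lower values mean the reconstructed surface samples lie closer to the ground-truth samples, which is why a reduction in this metric is reported as an improvement.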
CNN-based search model fails to account for human attention guidance by simple visual features
Recently, Zhang et al. (Nature Communications, 9(1), 3730, 2018) proposed an interesting model of attention guidance that uses visual features learnt by convolutional neural networks (CNNs) for object classification. I adapted this model for search experiments, with accuracy as the measure of performance. Simulation of our previously published feature and conjunction search experiments revealed that the CNN-based search model proposed by Zhang et al. considerably underestimates human attention guidance by simple visual features. Using target-distractor differences instead of target features for attention guidance, or computing the attention map at lower layers of the network, could improve the performance. Still, the model fails to reproduce qualitative regularities of human visual search. The most likely explanation is that standard CNNs that are trained on image classification have not learnt medium- or high-level features required for human-like attention guidance.
A 3D Tracking and Registration Method Based on Point Cloud and Visual Features for Augmented Reality Aided Assembly System
To improve the robustness and applicability of 3D tracking and registration for augmented reality (AR) aided mechanical assembly systems, a 3D registration and tracking method based on point cloud and visual features is proposed. First, the reference model point cloud is used to define the absolute tracking coordinate system, which determines the locating datum of the virtual assembly guidance information. Then, by adding visual feature matching from the color images of the depth sensor to the iterative closest point (ICP) registration process, the robustness of tracking and registration is improved. To obtain a sufficient number of visual feature matching points in this process, a visual feature matching strategy based on orientation vector consistency is proposed. Finally, loop closure detection and global pose optimization based on key frames are added to the tracking and registration process. The experimental results show that the proposed method has good real-time performance and accuracy, and the running speed can reach 30 frames per second. Moreover, it shows good robustness when the camera is moving fast and the depth information is inaccurate, and its overall performance is better than that of the point-cloud-based KinectFusion method.
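The ICP registration that this method builds on can be sketched as follows. This is a plain point-to-point ICP in NumPy with brute-force nearest-neighbor matching, not the authors' implementation: the function names `best_rigid_transform` and `icp` and the fixed iteration count are assumptions for illustration. In a hybrid scheme like the one described, correspondences obtained from visual feature matching could be added to the nearest-neighbor pairs before the alignment step; that extension is not shown here.

```python
import numpy as np


def best_rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares R, t such that R @ src[i] + t is close to dst[i] (Kabsch method)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def icp(src: np.ndarray, dst: np.ndarray, iterations: int = 20):
    """Point-to-point ICP: alternate nearest-neighbor matching and rigid alignment."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iterations):
        # Match every current source point to its closest destination point.
        dists = np.linalg.norm(cur[:, None, :] - dst[None, :, :], axis=-1)
        matches = dst[dists.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matches)
        cur = cur @ R.T + t
        # Compose the incremental transform with the running estimate.
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = rng.normal(size=(40, 3))
    angle = 0.05
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                       [np.sin(angle), np.cos(angle), 0.0],
                       [0.0, 0.0, 1.0]])
    scan = model @ R_true.T + np.array([0.02, -0.01, 0.03])
    R_est, t_est = icp(model, scan)
    print(np.round(R_est, 3), np.round(t_est, 3))
```

Plain ICP depends on the nearest-neighbor matches being mostly correct, which tends to break down under fast camera motion; this is the weakness that additional feature-based correspondences are meant to address.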
Local self-attention in transformer for visual question answering
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Various VQA models have applied the Transformer structure due to its excellent ability to model global dependencies through self-attention. However, balancing global and local dependency modeling in traditional Transformer structures is an ongoing issue. A Transformer-based VQA model that only models global dependencies cannot effectively capture image context information. Thus, this paper proposes a novel Local Self-Attention in Transformer (LSAT) for a visual question answering model to address these issues. The LSAT model simultaneously models intra-window and inter-window attention by setting local windows for visual features. Therefore, the LSAT model can effectively avoid redundant information in global self-attention while capturing rich contextual information. This paper uses grid visual features to conduct extensive experiments and ablation studies on the VQA benchmark datasets VQA 2.0 and CLEVR. The experimental results show that the LSAT model outperforms the benchmark model in all indicators when the appropriate local window size is selected. Specifically, the best test results of LSAT using grid visual features on the VQA 2.0 and CLEVR datasets were 71.94% and 98.72%, respectively. Experimental results and ablation studies demonstrate that the proposed method has good performance. Source code is available at https://github.com/shenxiang-vqa/LSAT.
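The idea of restricting self-attention to local windows over grid features can be sketched as below. This is a simplified NumPy illustration of the intra-window part only, not the LSAT code from the linked repository: it uses the raw features as queries, keys, and values (no learned projections, no multi-head split, no inter-window attention), and the function name `window_self_attention` is an assumption.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_self_attention(feats: np.ndarray, window: int) -> np.ndarray:
    """Scaled dot-product self-attention inside non-overlapping window x window blocks.

    feats has shape (H, W, C); H and W must be divisible by `window`.
    """
    H, W, C = feats.shape
    assert H % window == 0 and W % window == 0
    # Partition the grid: (H/w, w, W/w, w, C) -> (num_windows, w*w, C).
    x = feats.reshape(H // window, window, W // window, window, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, C)
    # Attention scores are computed only among positions in the same window.
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(C)
    out = softmax(scores) @ x
    # Undo the partition to recover the (H, W, C) layout.
    out = out.reshape(H // window, W // window, window, window, C)
    return out.transpose(0, 2, 1, 3, 4).reshape(H, W, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = rng.normal(size=(4, 4, 8))
    print(window_self_attention(grid, window=2).shape)
```

With an H x W grid, global attention compares every position with every other, while the windowed version compares each position only with the `window * window` positions in its block, which is how local windows reduce redundant global interactions.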
No evidence for proactive suppression of explicitly cued distractor features
Visual search benefits from advance knowledge of nontarget features. However, it is unknown whether these negatively cued features are suppressed in advance (proactively) or during search (reactively). To test this, we presented color cues varying from trial-to-trial that predicted target or nontarget colors. Experiment 1 (N = 96) showed that both target and nontarget cues speeded search. To test whether attention proactively modified cued feature representations, in Experiment 2 (N = 200), we interleaved color probe and search trials and had participants detect the color of a briefly presented ring that could either match the cued color or not. People detected positively cued colors better than other colors, whereas negatively cued colors were detected no better or worse than other colors. These results demonstrate that nontarget features are not suppressed proactively, and instead suggest that anticipated nontarget features are ignored via reactive mechanisms.
Saccades influence functional modularity in the human cortical vision network
Visual cortex is thought to show both dorsoventral and hemispheric modularity, but it is not known if the same functional modules emerge spontaneously from an unsupervised network analysis, or how they interact when saccades necessitate increased sharing of spatial information. Here, we address these issues by applying graph theory analysis to fMRI data obtained while human participants decided whether an object’s shape or orientation changed, with or without an intervening saccade across the object. BOLD activation from 50 vision-related cortical nodes was used to identify local and global network properties. Modularity analysis revealed three sub-networks during fixation: a bilateral parietofrontal network linking areas implicated in visuospatial processing and two lateralized occipitotemporal networks linking areas implicated in object feature processing. When horizontal saccades required visual comparisons between visual hemifields, functional interconnectivity and information transfer increased, and the two lateralized ventral modules became functionally integrated into a single bilateral sub-network. This network included ‘between module’ connectivity hubs in lateral intraparietal cortex and dorsomedial occipital areas previously implicated in transsaccadic integration. These results provide support for functional modularity in the visual system and show that the hemispheric sub-networks are modified and functionally integrated during saccades.
Automatic robot Manoeuvres detection using computer vision and deep learning techniques: a perspective of internet of robotics things (IoRT)
To minimize any impediments in real-time Internet of Things (IoT)-enabled robotics applications, this study demonstrates how to build and deploy a framework using computer vision and deep learning. In contrast to robotic path planning algorithms based on geolocation, we focus on sensor-captured streams/images and geographical information to enable the Internet of Robotic Things (IoRT) to evolve. The application will collect real-time data from moving robots in various situations and at various intervals and use it for research projects. The data, collected as videos/images, are delivered to the robotics application using visual sensor nodes. In this study, automatically anticipating the moves of a moving robot early on can aid in issuing commands to monitor and regulate the robot's future activities before they occur. To do so, we propose a framework using efficient computer vision techniques and a deep learning classifier. The computer vision methods are designed for frame quality improvement, object segmentation, and feature estimation. The Long Short-Term Memory (LSTM) classifier detects robot motions automatically from initial sequential features. We designed the proposed model around an LSTM classifier to perform early prediction from the initial sequential features of partial video frames and to overcome the problems of exploding and vanishing gradients. LSTM helps to reduce the prediction duration with higher accuracy. It also enables the central system of a robotic application to prevent collisions caused by impediments in indoor or outdoor settings. The simulation results on publicly available research datasets demonstrate the proposed model’s efficiency and robustness compared to state-of-the-art approaches. The overall accuracy of the proposed model improved by approximately 5%, and computational complexity was reduced by approximately 84%.