1,959 results for "Multi-modal fusion"
Transformer-Based Multi-Modal Data Fusion Method for COPD Classification and Physiological and Biochemical Indicators Identification
As the number of modalities in biomedical data continues to increase, the significance of multi-modal data becomes evident in capturing complex relationships between biological processes, thereby complementing disease classification. However, current multi-modal fusion methods for biomedical data do not fully exploit intra- and inter-modal interactions, and powerful fusion methods are still rarely applied to biomedical data. In this paper, we propose a novel multi-modal data fusion method that addresses these limitations. Our proposed method utilizes a graph neural network and a 3D convolutional network to identify intra-modal relationships. By doing so, we can extract meaningful features from each modality while preserving crucial information. To fuse information from different modalities, we employ the Low-rank Multi-modal Fusion method, which effectively integrates multiple modalities while reducing noise and redundancy. Additionally, our method incorporates the Cross-modal Transformer to automatically learn relationships between different modalities, facilitating enhanced information exchange and representation. We validate the effectiveness of our proposed method using lung CT imaging data and physiological and biochemical data obtained from patients diagnosed with Chronic Obstructive Pulmonary Disease (COPD). Our method demonstrates superior disease classification accuracy compared to various fusion methods and their variants.
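As a rough illustration of the low-rank fusion idea mentioned above, the sketch below factorises the multi-modal tensor product into per-modality rank-r factors. The class name `LowRankFusion`, the two-modality setup, and all dimensions are illustrative assumptions, not the paper's implementation (which also combines graph and 3D convolutional encoders with a cross-modal Transformer).

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Minimal two-modality low-rank fusion sketch (rank-r factorised tensor fusion)."""
    def __init__(self, dim_a, dim_b, dim_out, rank=4):
        super().__init__()
        # One factor per modality; the +1 accounts for the constant appended to each feature.
        self.factor_a = nn.Parameter(torch.randn(rank, dim_a + 1, dim_out) * 0.1)
        self.factor_b = nn.Parameter(torch.randn(rank, dim_b + 1, dim_out) * 0.1)
        self.rank_weights = nn.Parameter(torch.ones(rank) / rank)
        self.bias = nn.Parameter(torch.zeros(dim_out))

    def forward(self, feat_a, feat_b):
        ones = feat_a.new_ones(feat_a.size(0), 1)
        a = torch.cat([feat_a, ones], dim=1)                     # (B, dim_a + 1)
        b = torch.cat([feat_b, ones], dim=1)                     # (B, dim_b + 1)
        proj_a = torch.einsum('bi,rio->bro', a, self.factor_a)   # (B, rank, dim_out)
        proj_b = torch.einsum('bi,rio->bro', b, self.factor_b)
        fused = proj_a * proj_b                                  # element-wise product = low-rank tensor fusion
        return torch.einsum('r,bro->bo', self.rank_weights, fused) + self.bias

# e.g. hypothetical imaging features (256-d) fused with biochemical indicators (32-d)
fusion = LowRankFusion(256, 32, 128)
out = fusion(torch.randn(8, 256), torch.randn(8, 32))            # (8, 128)
```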
NVTrans‐UNet: Neighborhood vision transformer based U‐Net for multi‐modal cardiac MR image segmentation
With the rapid development of artificial intelligence and image processing technology, medical imaging has become a critical tool for clinical diagnosis and disease treatment. Extracting and segmenting the regions of interest in cardiac images is crucial to the diagnosis of cardiovascular diseases. Because of the heart's erratic diastolic and systolic motion, the boundaries in cardiac Magnetic Resonance (MR) images are quite fuzzy. Moreover, it is hard to obtain complete information from a single modality because of the heart's complex structure. Furthermore, conventional CNN-based segmentation methods are weak in feature extraction. To overcome these challenges, we propose a multi-modal method for cardiac image segmentation, called NVTrans-UNet. Firstly, we employ the Neighborhood Vision Transformer (NVT) module, which takes advantage of Neighborhood Attention (NA) and inductive biases; it better extracts local information from the cardiac image while reducing computational cost. Secondly, we introduce a Multi-modal Gated Fusion (MGF) network, which automatically adjusts the contributions of different modal feature maps and makes full use of multi-modal information. Thirdly, a bottleneck layer with Atrous Spatial Pyramid Pooling (ASPP) is proposed to expand the feature receptive field. Finally, a mixed loss is added to focus on the fuzzy boundaries and achieve accurate segmentation. We evaluated our model on the MyoPS 2020 dataset. The Dice score for myocardial infarction (MI) was 0.642 ± 0.171, and the Dice score for myocardial infarction + edema (MI + ME) was 0.574 ± 0.110. Compared with the baseline, the MI Dice score increases by 11.2% and the MI + ME Dice score by 12.5%. The results show the effectiveness of the proposed NVTrans-UNet in the segmentation of MI and ME.
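A minimal sketch of the gated-fusion idea (two modality feature maps blended by a learned per-pixel gate) is given below; the module name `GatedFusion2D`, the channel count, and the single-gate design are assumptions for illustration and are simpler than the paper's MGF network.

```python
import torch
import torch.nn as nn

class GatedFusion2D(nn.Module):
    """Illustrative gated fusion of two modality feature maps (not the paper's exact MGF)."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_m1, feat_m2):
        # Per-pixel, per-channel weight deciding how much each modality contributes.
        g = self.gate(torch.cat([feat_m1, feat_m2], dim=1))
        return g * feat_m1 + (1.0 - g) * feat_m2

fuse = GatedFusion2D(64)
fused = fuse(torch.randn(2, 64, 96, 96), torch.randn(2, 64, 96, 96))  # (2, 64, 96, 96)
```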
Multi-Modal 3D Object Detection in Autonomous Driving: A Survey
The past decade has witnessed the rapid development of autonomous driving systems. However, it remains a daunting task to achieve full autonomy, especially when it comes to understanding the ever-changing, complex driving scenes. To alleviate the difficulty of perception, self-driving vehicles are usually equipped with a suite of sensors (e.g., cameras, LiDARs), hoping to capture the scenes with overlapping perspectives to minimize blind spots. Fusing these data streams and exploiting their complementary properties is thus rapidly becoming the current trend. Nonetheless, combining data captured by sensors with drastically different ranging/imaging mechanisms is not a trivial task; many factors need to be considered and optimized. If not handled carefully, data from one sensor may act as noise to data from another, and fusing them may yield even poorer results. Thus far, there have been no in-depth guidelines for designing multi-modal fusion based 3D perception algorithms. To fill this void and motivate further investigation, this survey conducts a thorough study of dozens of recent deep learning based multi-modal 3D detection networks (with a special emphasis on LiDAR-camera fusion), focusing on their fusion stage (i.e., when to fuse), fusion inputs (i.e., what to fuse), and fusion granularity (i.e., how to fuse). These design choices play a critical role in determining the performance of the fusion algorithm. In this survey, we first introduce the background of popular sensors used for self-driving, their data properties, and the corresponding object detection algorithms. Next, we discuss existing datasets that can be used for evaluating multi-modal 3D object detection algorithms. Then we present a review of multi-modal fusion based 3D detection networks, taking a close look at their fusion stage, fusion input, and fusion granularity, and how these design choices evolve with time and technology. After the review, we discuss open challenges as well as possible solutions. We hope that this survey can help researchers get familiar with the field and embark on investigations in the area of multi-modal 3D object detection.
A Comprehensive Survey on Deep Learning Multi-Modal Fusion: Methods, Technologies and Applications
Multi-modal fusion has gradually become a fundamental technique in many fields, such as autonomous driving, smart healthcare, sentiment analysis, and human-computer interaction, and is rapidly becoming a dominant research direction owing to its powerful perception and judgment capabilities. In complex scenes, multi-modal fusion exploits the complementary characteristics of multiple data streams to fuse different data types and achieve more accurate predictions. However, achieving outstanding performance is challenging because of equipment performance limitations, missing information, and data noise. This paper comprehensively reviews existing methods based on multi-modal fusion techniques and provides a detailed and in-depth analysis. According to the data fusion stage, multi-modal fusion has four primary methods: early fusion, deep fusion, late fusion, and hybrid fusion. The paper surveys the three major multi-modal fusion technologies that can significantly enhance the effect of data fusion and further explores the applications of multi-modal fusion technology in various fields. Finally, it discusses the challenges and explores potential research opportunities. Multi-modal tasks still need intensive study because of data heterogeneity and quality issues. Preserving complementary information and eliminating redundant information between modalities is critical in multi-modal technology. Invalid data fusion methods may introduce extra noise and lead to worse results. This paper provides a comprehensive and detailed summary in response to these challenges.
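To make the fusion-stage taxonomy concrete, here is a toy sketch contrasting early (input-level) and late (decision-level) fusion; the encoders, dimensions, and prediction-averaging rule are arbitrary placeholders and are not taken from the survey.

```python
import torch
import torch.nn as nn

# Toy per-modality encoders and heads standing in for any backbone.
enc_a, enc_b = nn.Linear(100, 64), nn.Linear(40, 64)
joint_head = nn.Linear(100 + 40, 10)            # early fusion operates on concatenated inputs
head_a, head_b = nn.Linear(64, 10), nn.Linear(64, 10)

def early_fusion(x_a, x_b):
    """Fuse at the input level: a single joint model sees both modalities."""
    return joint_head(torch.cat([x_a, x_b], dim=1))

def late_fusion(x_a, x_b):
    """Fuse at the decision level: independent per-modality models, averaged predictions."""
    return 0.5 * (head_a(enc_a(x_a)) + head_b(enc_b(x_b)))

x_a, x_b = torch.randn(4, 100), torch.randn(4, 40)
print(early_fusion(x_a, x_b).shape, late_fusion(x_a, x_b).shape)  # both (4, 10)
```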
A Semantically-Driven Multimodal Sentiment Analysis Framework With Temporal and Synergistic Attention
Multimodal sentiment analysis aims to identify emotional tendencies from text, audio, and visual data, but existing methods often struggle with weak temporal modeling within modalities and shallow cross-modal fusion. The proposed temporal-modeling and synergistic-attention-based multimodal sentiment analysis framework addresses these issues. Word-level features are first extracted from all modalities, then modeled using a state-gated long short-term memory network combined with multi-head attention to capture temporal emotional dynamics while filtering noise. A hierarchical collaborative attention mechanism is further designed to enable deep, fine-grained cross-modal semantic interactions. Experiments on the Carnegie Mellon University Multimodal Corpus of Sentiment Intensity (CMU-MOSI) and Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) datasets show that the framework achieves an F1 score of 87.3% and a mean absolute error of 0.426, a 1.2–1.5% improvement in F1 while reducing the mean absolute error to its lowest value, outperforming existing state-of-the-art approaches and demonstrating its effectiveness in modeling complex multimodal emotions.
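The sketch below shows the general shape of such a per-modality temporal encoder (a recurrent layer followed by multi-head self-attention over time). It substitutes a standard BiLSTM for the paper's state-gated LSTM, omits the hierarchical collaborative attention, and uses illustrative dimensions throughout.

```python
import torch
import torch.nn as nn

class TemporalAttnEncoder(nn.Module):
    """Sketch: per-modality temporal encoder (BiLSTM + multi-head self-attention)."""
    def __init__(self, in_dim, hidden=128, heads=4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)

    def forward(self, x):                    # x: (B, T, in_dim) word-aligned features
        h, _ = self.lstm(x)                  # temporal modeling
        h, _ = self.attn(h, h, h)            # self-attention over time steps
        return h.mean(dim=1)                 # utterance-level summary vector

enc_text = TemporalAttnEncoder(in_dim=300)
summary = enc_text(torch.randn(2, 20, 300))  # (2, 256)
```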
A defensive attention mechanism to detect deepfake content across multiple modalities
Recently, the realistic nature of multi-modal deepfake content has attracted much attention from researchers, who have employed a variety of handcrafted features, learned features, and deep learning techniques to achieve promising performance in recognizing facial deepfakes. However, attackers continue to create deepfakes that surpass their earlier efforts by manipulating multiple modalities, making deepfake identification across modalities difficult. To exploit the merits of attention-based network architectures, we propose a novel cross-modal attention architecture on a bi-directional recurrent convolutional network to capture fake content in audio and video. For effective deepfake detection, the system records the spatial-temporal deformations of audio-video sequences and investigates the correlation between these modalities. We propose a self-attenuated VGG16 deep model for extracting visual features for facial fake recognition. The system also incorporates a recurrent neural network with self-attention to effectively extract false audio elements. The cross-modal attention mechanism effectively learns the divergence between the two modalities. In addition, we include multi-modal fake examples to create a well-balanced bespoke dataset that addresses the drawbacks of small and unbalanced training samples. We test the effectiveness of our proposed multi-modal deepfake detection strategy against state-of-the-art methods on a variety of existing datasets.
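A hedged sketch of the cross-modal attention idea follows: audio features attend to video features and vice versa, so that inconsistencies between the two streams become visible to a downstream classifier. The class name, head count, and dimensions are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch: audio queries attend to video keys/values, and vice versa."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):                  # (B, Ta, dim), (B, Tv, dim)
        audio_ctx, _ = self.a2v(audio, video, video)  # audio enriched with video context
        video_ctx, _ = self.v2a(video, audio, audio)  # video enriched with audio context
        return audio_ctx, video_ctx

xm = CrossModalAttention(256)
a_ctx, v_ctx = xm(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
```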
Towards Automatic Depression Detection: A BiLSTM/1D CNN-Based Model
Depression is a global mental health problem, the worst cases of which can lead to self-injury or suicide. An automatic depression detection system is of great help in facilitating clinical diagnosis and early intervention of depression. In this work, we propose a new automatic depression detection method utilizing speech signals and linguistic content from patient interviews. Specifically, the proposed method consists of three components: a Bidirectional Long Short-Term Memory (BiLSTM) network with an attention layer to handle linguistic content, a One-Dimensional Convolutional Neural Network (1D CNN) to handle speech signals, and a fully connected network that integrates the outputs of the previous two models to assess the depressive state. Evaluated on two publicly available datasets, our method achieves state-of-the-art performance compared with existing methods. In addition, because our method utilizes audio and text features simultaneously, it is less susceptible to misleading information provided by patients. In conclusion, our method can automatically evaluate depressive state without requiring an expert to conduct a psychological evaluation on site, and it greatly improves both detection accuracy and efficiency.
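The description above maps naturally onto a two-branch network; the following sketch wires a BiLSTM-with-attention text branch and a 1D CNN audio branch into a shared classifier. Feature dimensions, layer sizes, and pooling choices are illustrative guesses rather than the published configuration.

```python
import torch
import torch.nn as nn

class TextAudioDepressionNet(nn.Module):
    """Sketch of a BiLSTM (text) + 1D CNN (audio) fusion classifier; dimensions are illustrative."""
    def __init__(self, emb_dim=300, audio_dim=40, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)                       # word-level attention scores
        self.cnn = nn.Sequential(
            nn.Conv1d(audio_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden + hidden, 64), nn.ReLU(), nn.Linear(64, 2),
        )

    def forward(self, text_emb, audio_feats):
        # text_emb: (B, T, emb_dim); audio_feats: (B, audio_dim, frames)
        h, _ = self.bilstm(text_emb)
        w = torch.softmax(self.attn(h), dim=1)                     # attention weights over words
        text_vec = (w * h).sum(dim=1)
        audio_vec = self.cnn(audio_feats).squeeze(-1)
        return self.classifier(torch.cat([text_vec, audio_vec], dim=1))

net = TextAudioDepressionNet()
logits = net(torch.randn(4, 60, 300), torch.randn(4, 40, 200))     # (4, 2)
```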
Absolute and Relative Depth-Induced Network for RGB-D Salient Object Detection
Detecting salient objects in complicated scenarios is a challenging problem. In addition to semantic features from the RGB image, spatial information from the depth image also provides useful cues about the object. Therefore, it is crucial to rationally integrate RGB and depth features for the RGB-D salient object detection task. Most existing RGB-D saliency detectors modulate RGB semantic features with absolute depth values; however, they ignore the appearance contrast and structural knowledge indicated by relative depth values between pixels. In this work, we propose a depth-induced network (DIN) for RGB-D salient object detection that takes full advantage of both absolute and relative depth information and further enforces deep fusion across the RGB-D modalities. Specifically, an absolute depth-induced module (ADIM) is proposed to hierarchically integrate absolute depth values and RGB features, allowing interaction between appearance and structural information in the encoding stage. A relative depth-induced module (RDIM) is designed to capture detailed saliency cues by exploring contrastive and structural information from relative depth values in the decoding stage. By combining the ADIM and RDIM, we can accurately locate salient objects with clear boundaries, even in complex scenes. The proposed DIN is a lightweight network, and its model size is much smaller than that of state-of-the-art algorithms. Extensive experiments on six challenging benchmarks show that our method outperforms most existing RGB-D salient object detection models.
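As a loose illustration of the distinction between absolute and relative depth cues, the sketch below gates RGB features with projected absolute depth and adds a simple relative-depth contrast term; the gating and mean-subtraction used here are simplified stand-ins, not the ADIM/RDIM designs.

```python
import torch
import torch.nn as nn

class DepthModulation(nn.Module):
    """Sketch: modulate RGB features with absolute depth plus a relative-depth contrast cue."""
    def __init__(self, channels):
        super().__init__()
        self.depth_proj = nn.Conv2d(1, channels, kernel_size=3, padding=1)

    def forward(self, rgb_feat, depth):       # rgb_feat: (B, C, H, W); depth: (B, 1, H, W)
        absolute = torch.sigmoid(self.depth_proj(depth))          # absolute-depth gating
        relative = depth - depth.mean(dim=(2, 3), keepdim=True)   # crude relative-depth contrast
        return rgb_feat * absolute + rgb_feat * torch.tanh(relative)

mod = DepthModulation(64)
out = mod(torch.randn(2, 64, 56, 56), torch.randn(2, 1, 56, 56))  # (2, 64, 56, 56)
```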
EPCNet: Implementing an ‘Artificial Fovea’ for More Efficient Monitoring Using the Sensor Fusion of an Event-Based and a Frame-Based Camera
Efficient object detection is crucial to real-time monitoring applications such as autonomous driving or security systems. Modern RGB cameras can produce high-resolution images for accurate object detection. However, increased resolution results in increased network latency and power consumption. To minimise this latency, Convolutional Neural Networks (CNNs) often have a resolution limitation, requiring images to be down-sampled before inference, causing significant information loss. Event-based cameras are neuromorphic vision sensors with high temporal resolution, low power consumption, and high dynamic range, making them preferable to regular RGB cameras in many situations. This project proposes the fusion of an event-based camera with an RGB camera to mitigate the trade-off between temporal resolution and accuracy, while minimising power consumption. The cameras are calibrated to create a multi-modal stereo vision system where pixel coordinates can be projected between the event and RGB camera image planes. This calibration is used to project bounding boxes detected by clustering of events into the RGB image plane, thereby cropping each RGB frame instead of down-sampling to meet the requirements of the CNN. Using the Common Objects in Context (COCO) dataset evaluator, the average precision (AP) for the bicycle class in RGB scenes improved from 21.08 to 57.38. Additionally, AP increased across all classes from 37.93 to 46.89. To reduce system latency, a novel object detection approach is proposed where the event camera acts as a region proposal network, and a classification algorithm is run on the proposed regions. This achieved a 78% improvement over baseline.
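A simplified sketch of the projection-and-crop step is given below, assuming the stereo calibration can be approximated by a single planar homography `H` mapping event-camera pixels to RGB pixels; the real system's stereo calibration and event-clustering region proposals are more involved, and the `pad` margin is an arbitrary choice.

```python
import numpy as np

def project_box(box_xyxy, H):
    """Map an event-camera bounding box into the RGB image plane via a 3x3 homography H."""
    x1, y1, x2, y2 = box_xyxy
    corners = np.array([[x1, y1, 1], [x2, y1, 1],
                        [x2, y2, 1], [x1, y2, 1]], dtype=float).T   # (3, 4) homogeneous corners
    mapped = H @ corners
    mapped = mapped[:2] / mapped[2]                                  # back to pixel coordinates
    xs, ys = mapped
    return xs.min(), ys.min(), xs.max(), ys.max()

def crop_for_cnn(rgb_frame, box, pad=16):
    """Crop the RGB frame around the projected box instead of down-sampling the whole image."""
    h, w = rgb_frame.shape[:2]
    x1, y1, x2, y2 = map(int, box)
    return rgb_frame[max(0, y1 - pad):min(h, y2 + pad),
                     max(0, x1 - pad):min(w, x2 + pad)]
```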
IFE-CMT: Instance-Aware Fine-Grained Feature Enhancement Cross Modal Transformer for 3D Object Detection
In recent years, multi-modal 3D object detection algorithms have developed rapidly. However, current algorithms primarily focus on designing overall fusion strategies for multi-modal features and neglect finer-grained representations, which leads to a decline in the detection accuracy of small objects. To address this issue, this paper proposes the Instance-aware Fine-grained feature Enhancement Cross Modal Transformer (IFE-CMT) model. We design an Instance feature Enhancement Module (IE-Module) that accurately extracts object features from multi-modal data and uses them to enhance the overall features, while avoiding view transformations and maintaining low computational overhead. Additionally, we design a new point cloud branch network that effectively expands the network's receptive field, enhancing the model's semantic expression capabilities while preserving the texture details of objects. Experimental results on the nuScenes dataset demonstrate that, compared to the CMT model, our proposed IFE-CMT model improves mAP and NDS by 2.1% and 0.8% on the validation set, and by 1.9% and 0.7% on the test set, respectively. Notably, for small object categories such as bicycles and motorcycles, mAP improves by 6.6% and 3.7%, respectively, significantly enhancing the detection accuracy of small objects.