Catalogue Search | MBRL

GAC-Net: A Geometric–Attention Fusion Network for Sparse Depth Completion from LiDAR and Image

by Zhu, Kuang; Sun, Min; Zhao, Leyang in 3D geometric representation, Accuracy, Artificial intelligence

2025

Depth completion aims to reconstruct dense depth maps from sparse LiDAR measurements guided by RGB images. Although BPNet enhanced depth structure perception through a bilateral propagation module and achieved state-of-the-art performance at the time, there is still room for improvement in leveraging 3D geometric priors and adaptively fusing heterogeneous modalities. To this end, we proposed GAC-Net, a Geometric–Attention Fusion Network that enhances geometric representation and cross-modal fusion. Specifically, we designed a dual-branch PointNet++-S encoder, where two PointNet++ modules with different receptive fields are applied to extract scale-aware geometric features from the back-projected sparse point cloud. These features are then fused using a channel attention mechanism to form a robust global 3D representation. A Channel Attention-Based Feature Fusion Module (CAFFM) was further introduced to adaptively integrate this geometric prior with RGB and depth features. Experiments on the KITTI depth completion benchmark demonstrated the effectiveness of GAC-Net, achieving an RMSE of 680.82 mm, ranking first among all peer-reviewed methods at the time of submission.
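The channel-attention fusion of the two geometric branches can be illustrated with a minimal sketch. It assumes each branch outputs a list of channels (each channel a list of per-point values) and that the branch weights come from a softmax over globally averaged channel descriptors. The abstract does not specify the gating used in GAC-Net, so the function name `channel_attention_fuse` and this weighting scheme are hypothetical.

```python
import math


def channel_attention_fuse(branch_a, branch_b):
    """Fuse two feature maps channel by channel with softmax branch weights.

    branch_a, branch_b: lists of C channels, each a list of N floats.
    Assumption: a learned gating network is replaced here by a plain
    softmax over the per-channel global averages of the two branches.
    """
    if len(branch_a) != len(branch_b):
        raise ValueError("both branches need the same number of channels")
    fused = []
    for ch_a, ch_b in zip(branch_a, branch_b):
        # Global average pooling gives one descriptor per channel and branch.
        desc_a = sum(ch_a) / len(ch_a)
        desc_b = sum(ch_b) / len(ch_b)
        # Softmax across the two branches (max subtracted for stability).
        m = max(desc_a, desc_b)
        exp_a, exp_b = math.exp(desc_a - m), math.exp(desc_b - m)
        w_a = exp_a / (exp_a + exp_b)
        w_b = exp_b / (exp_a + exp_b)
        # Weighted sum of the two branches for this channel.
        fused.append([w_a * x + w_b * y for x, y in zip(ch_a, ch_b)])
    return fused
```

With identical inputs both branches receive weight 0.5 and the features pass through unchanged; when one branch has a much larger channel descriptor, its features dominate the fused output.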

Journal Article

HyMSS-GAD: a hybrid multi-stage framework for multi-view graph anomaly detection with structural, contextual, and geometric reasoning

by Brahim, Kamel; Ebrahim, Nadhem; Elloumi, Mourad in 639/166, 639/705, Attributed networks

2026

Graph anomaly detection has become an important task in discovering abnormal patterns within attributed networks, where anomalies can occur due to structural, contextual, or geometric mismatch. Current methods are mainly based on either a reconstruction-based or a contrastive objective, and seldom consider the relationship between heterogeneous modalities and higher-order graph geometry. To address this gap, we present HyMSS-GAD, a Hybrid Multi-Stage Framework for Graph Anomaly Detection that combines contextual, structural, and geometric reasoning in a five-step pipeline. First, a cross-modal contrastive learning module learns aligned representations from feature- and topology-based modalities, using InfoNCE and alignment regularization. Second, a motif-based structural reconstruction module discovers higher-order connectivity roles with deterministic motif enumeration and autoencoder-based reconstruction. Third, we apply an attention-driven fusion mechanism to dynamically combine contextual and structural embeddings into a single representation. Fourth, we introduce a curvature-aware decoder to predict and reconstruct Ollivier–Ricci curvature for geometry-based anomaly detection within graph manifolds. Finally, we develop a multi-view anomaly scoring strategy to combine contextual, structural, and geometric residuals into an interpretable anomaly score. In-depth evaluations on five standard benchmark datasets, namely Cora, Citeseer, PubMed, ACM, and Amazon, show that HyMSS-GAD consistently outperforms state-of-the-art baseline models. Moreover, curvature residuals offer increased interpretability by indicating the bridge-node regions of communities and the anomalous boundary regions. Overall, HyMSS-GAD is shown to be a scalable, explainable, geometrically informed model for graph anomaly detection on a diverse set of attributed networks.
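The InfoNCE objective named in the first stage can be sketched as follows. This is a generic, one-directional InfoNCE over cosine similarities, assuming that row i of the feature view and row i of the topology view describe the same node. The function names and the temperature value are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss; (view_a[i], view_b[i]) is the positive pair.

    All other rows of view_b act as negatives for anchor view_a[i].
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log softmax probability assigned to the positive pair
        total += log_denom - logits[i]
    return total / n
```

The loss is small when each node's two embeddings are more similar to each other than to the embeddings of other nodes, and grows when the views are misaligned.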

Journal Article

Multimodal Model for Automated Pain Assessment: Leveraging Video and fNIRS

by Shin, Jieun; Kim, Soohyung; Divakaran, Anjitha in attention-based fusion, Automation, brain–computer interface

2025

Pain assessment is a challenging task for clinicians due to its subjective nature, particularly in individuals with communication difficulties, cognitive impairments, or severe disabilities. Traditional methods such as the Visual Analogue Scale (VAS), Numerical Rating Scale (NRS), and Verbal Rating Scale (VRS) rely heavily on patient feedback, which can be inconsistent and subjective. To address these limitations, developing objective and reliable pain assessment tools that incorporate advanced technologies, such as multimodal data integration from video and fNIRS, is important for improving clinical outcomes. However, challenges such as noise susceptibility in fNIRS signals must be carefully addressed to realize their full potential. Recent studies have explored automatic pain assessment using machine learning and deep learning techniques, which require high-quality data that can accurately represent pain categories. In response to the introduction of a new dataset in the AI4Pain Challenge, we proposed MMAPA, a multimodal neural network model that uses attention-based fusion to improve overall accuracy. Our model leverages video and fNIRS modalities as well as manually extracted statistical features. We also implemented fNIRS signal preprocessing and artifact noise filtering, which significantly improved performance on both the fNIRS and statistical feature branches. On the hidden test set, our model achieved an accuracy of 51.33%, outperforming the official baseline of 43.33%. To evaluate generalizability, we further tested our method on the BioVid Heat Pain Database, where our fusion model achieved the highest accuracy in the 10-fold cross-validation setting, outperforming PainAttNet and unimodal variants. These results highlight the effectiveness of our multimodal attention-based approach in improving pain classification performance across datasets.

Journal Article

Detection of Computer Graphics Using Attention-Based Dual-Branch Convolutional Neural Network from Fused Color Components

by Wang, Hongxia; Li, Haoliang; He, Peisong in 3-D graphics, attention-based fusion, Computer graphics

2020

With the development of 3D rendering techniques, people can easily create photorealistic computer graphics (CG) with advanced software, which is of great benefit to the video game and film industries. On the other hand, the abuse of CGs has threatened the integrity and authenticity of digital images. In the last decade, several CG detection methods have been proposed. However, existing methods cannot provide reliable detection results for CGs with small patch sizes and post-processing operations. To overcome this limitation, we proposed an attention-based dual-branch convolutional neural network (AD-CNN) to extract robust representations from fused color components. In pre-processing, the raw RGB components and their blurred versions, obtained with a Gaussian low-pass filter, are stacked channel-wise as the input to the AD-CNN, which helps the network learn more generalized patterns. The proposed AD-CNN starts with a dual-branch structure in which two branches work in parallel and share an identical shallow CNN architecture, except that the first convolutional layer in each branch uses a different kernel size to exploit low-level forensic traces at multiple scales. The output features from each branch are jointly optimized by the attention-based fusion module, which automatically assigns asymmetric weights to the branches. Finally, the fused feature is fed into fully-connected layers to obtain the final detection results. Comparative and self-analysis experiments demonstrate better detection capability and robustness than other state-of-the-art methods under various experimental settings, especially for small image patches and post-processed images.
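The pre-processing step, stacking the raw RGB components with their Gaussian-blurred versions, can be sketched in a few lines. The abstract does not give the filter size or padding, so the 3x3 binomial kernel and edge replication used here are assumptions, as are the function names.

```python
def gaussian_blur_3x3(channel):
    """Blur one H x W channel with a 3x3 binomial (Gaussian-like) kernel.

    Assumption: borders are handled by replicating the edge pixels.
    """
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # weights sum to 16
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    acc += kernel[dy + 1][dx + 1] * channel[yy][xx]
            out[y][x] = acc / 16.0
    return out


def stack_with_blur(rgb):
    """Return a 6-channel input: the 3 raw channels, then 3 blurred ones."""
    return list(rgb) + [gaussian_blur_3x3(c) for c in rgb]
```

For an H x W RGB patch given as three channel grids, `stack_with_blur` yields six H x W channels that a network could take as its input tensor.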

Journal Article

MV-RiskNet: Multi-View Attention-Based Deep Learning Model for Regional Epidemic Risk Prediction and Mapping

by Okudan, Beyzanur; Karcioglu, Abdullah Ammar in attention-based fusion, Classification, COVID-19

2026

Regional epidemic risk prediction requires holistic modeling of heterogeneous data sources such as demographic structure, health capacity, geographical features and human mobility. In this study, a unique multi-modal epidemiological dataset integrating demographic, health, geographic and mobility indicators of Türkiye and its neighboring countries was collected. Türkiye’s neighboring countries are Greece, Bulgaria, Georgia, Armenia, Iran, and Iraq. This dataset, created by combining raw data from these neighboring countries, provides a comprehensive regional representation that allows for both quantitative classification and spatial mapping of epidemiological risk. To address the class imbalance problem, Conditional GAN (CGAN), a class-conditional synthetic example generation approach that enhances high-risk category representation, was used. In this study, we proposed a multi-view deep learning model named MV-RiskNet, which effectively models the multi-dimensional data structure by processing each view in an independent subnetwork and integrating the representations with an attention-based fusion mechanism for regional epidemic risk prediction. The model was compared experimentally against Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Autoencoder classifier, and Graph Convolutional Network (GCN) models. The proposed MV-RiskNet with CGAN model achieved better results than the other models, with 97.22% accuracy and 97.40% F1-score. The generated risk maps reveal regional clustering patterns in a spatially consistent manner, and attention analyses show that demographic and geographic features are the dominant determinants, while mobility plays a complementary role, especially in high-risk regions.

Journal Article

ARF-Net: a multi-modal aesthetic attention-based fusion

by Gavrilova, Marina; Iffath, Fariha in Accuracy, Aesthetics, Algorithms

2024

Over the last decade, Online Social Media platforms have witnessed a dramatic expansion due to the substantial reliance of individuals on these communication channels. These platforms are widely utilized to convey emotions, share opinions, and express preferences through various means such as artworks, multimedia contents, and blogs. Researchers are exploring these individual-specific traits for biometric identification. Aesthetic biometric systems utilize users’ unique preferences across various subjective forms such as images, music, and textual contents. This study introduces a novel multi-modal aesthetic system, with a primary contribution to the development of an attention-based fusion method for person identification. The proposed identification system leverages a deep pre-trained model for high-level feature extraction from visual and auditory modalities. The paper introduces a novel fusion architecture named attention-based residual fusion network (ARF-Net) to incorporate two heterogeneous aesthetic feature vectors. The proposed model yielded a 99.38% identification accuracy on the Aesthetic Image Audio 32 (AIA32) dataset and 98.02% identification accuracy on Aesthetic Image Audio 52 (AIA52) dataset, outperforming other aesthetic biometric systems. The proposed architecture stands out for its efficiency, showcasing a lightweight architecture with minimal parameters, ensuring optimal performance in different modalities.

Journal Article

EAML: ensemble self-attention-based mutual learning network for document image classification

by Coustaty, Mickaël; Rusiñol, Marçal; Bakkali, Souhail in Ablation, Artificial neural networks, Attention

2021

In the recent past, complex deep neural networks have received huge interest in various document understanding tasks such as document image classification and document retrieval. As many document types have a distinct visual style, learning only visual features with deep CNNs to classify document images has encountered the problems of low inter-class discrimination and high intra-class structural variation between categories. In parallel, text-level understanding jointly learned with the corresponding visual properties within a given document image has considerably improved classification accuracy. In this paper, we design a self-attention-based fusion module that serves as a block in our ensemble trainable network. It allows the network to simultaneously learn the discriminant features of image and text modalities throughout the training stage. In addition, we encourage mutual learning by transferring positive knowledge between image and text modalities during the training stage. This constraint is realized by adding a truncated Kullback–Leibler divergence loss (Tr-KLDReg) as a new regularization term to the conventional supervised setting. To the best of our knowledge, this is the first work to leverage a mutual learning approach along with a self-attention-based fusion module to perform document image classification. The experimental results illustrate the effectiveness of our approach in terms of accuracy in both the single-modal and multi-modal settings. The proposed ensemble self-attention-based mutual learning model thus outperforms the state-of-the-art classification results on the benchmark RVL-CDIP and Tobacco-3482 datasets.
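The mutual-learning idea, penalizing disagreement between the image and text branches with a Kullback–Leibler term, can be illustrated with a minimal sketch. It shows only the plain, untruncated symmetric KL divergence between the two branches' softmax outputs; the abstract does not define the truncation rule of Tr-KLDReg, so that part is deliberately left out, and all function names here are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_learning_penalty(image_logits, text_logits):
    """Symmetric KL between the two branches' predicted distributions.

    Assumption: plain KL in both directions; no truncation is applied.
    """
    p = softmax(image_logits)
    q = softmax(text_logits)
    return kl_divergence(p, q) + kl_divergence(q, p)
```

In a training loop such a penalty would be added, with some weighting factor, to each branch's supervised classification loss.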

Journal Article

Modified RefineNet with Attention-Based Fusion for Multi-Class Classification of Corn and Pepper Plant Diseases

by Srinivasulu, Maramreddy; Maiti, Sandipan in Accuracy, Adaptation, Agricultural practices

2026

Early and precise detection of plant diseases is essential for safeguarding crop yield and ensuring sustainable agricultural practices. In this study, we propose the Modified RefineNet with Attention-Based Fusion (MoRefNet-AF), a modified RefineNet architecture enhanced with attention-based fusion for multi-class classification of corn (maize) and pepper leaf diseases. Unlike the original RefineNet, which was segmentation-oriented and computationally heavy, MoRefNet-AF is redesigned for lightweight and discriminative classification. The modifications include replacing standard convolutions with depthwise separable convolutions for efficiency, adopting the Mish activation function for smoother gradient flow, redesigning the multi-resolution fusion module with concatenation and shared convolution for richer cross-scale integration, and incorporating Squeeze-and-Excitation (SE) blocks for adaptive channel recalibration. Additionally, Chained Residual Pooling (CRP) with atrous convolutions enhances contextual representation, while global average pooling with dense layers improves classification readiness. When evaluated on a curated six-class dataset combining PlantVillage and Mendeley leaf disease repositories, MoRefNet-AF achieved 99.88% accuracy, 99.74% precision, 99.73% recall, 99.95% F1-score, and 99.73% specificity. These results outperform strong baselines including ResNet152V2, DenseNet201, EfficientNet-B0, and ConvNeXt-Tiny, while maintaining only 0.3 M parameters. With its compact design and TensorFlow Lite (v2.13) compatibility, MoRefNet-AF offers a robust, lightweight, and real-time deployable solution for precision agriculture and smart plant disease monitoring.
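The efficiency gain from swapping standard convolutions for depthwise separable ones can be checked with a short parameter count. The layer sizes in the usage note are arbitrary illustrative values, not figures from the paper, and bias terms are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution (no bias).

    Depthwise step: one kernel x kernel filter per input channel.
    Pointwise step: a 1x1 convolution mixing c_in channels into c_out.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise
```

For a hypothetical 3x3 layer mapping 64 to 128 channels, the standard form needs 73,728 weights and the separable form 8,768, roughly an eightfold reduction.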

Journal Article

Employability assessment using multimodal deep learning framework

by Zhou, Fengjin in Accountability, Accuracy, Artificial Intelligence

2026

The integration of artificial intelligence into talent acquisition has accelerated the development of multimodal frameworks for employability assessment, offering greater accuracy, scalability, and objectivity. In this paper, we propose a novel Multimodal Deep Learning Framework that unifies textual resumes, video interviews, and audio responses into a single, interpretable decision system. Our approach leverages state-of-the-art neural architectures, including Transformer-based models for textual analysis, wav2vec 2.0 embeddings for speech, and three-dimensional facial expression modeling with EfficientNet and OpenFace. A dynamic attention-driven fusion module adaptively balances contributions from different modalities, while built-in explainability mechanisms support transparent and fair decision-making. We evaluate the framework on a rigorously curated multimodal dataset annotated by HR professionals. Results demonstrate that our method outperforms unimodal and hybrid baselines, achieving a 14% improvement in F1-score and 90% top-1 accuracy in employability prediction. Importantly, integrated bias mitigation techniques reduce gender- and ethnicity-related disparities by more than 25%, underscoring the framework’s potential for fair, responsible, and practical AI deployment in modern talent analytics.

Journal Article

Multi-modal Land Cover Classification of Historical Aerial Images and Topographic Maps Exploiting Attention-based Feature Fusion

by Hovenbitzer, Michael; Thiemann, Frank; Sester, Monika in Classification, Image classification, Land cover

2025

Knowledge about past and present land cover is of interest for the assessment of the current status of our environment and, thus, for proper planning of the future. Information on past land cover is exclusively contained in an implicit way in historic remote sensing imagery and historic topographic maps. To make this information explicit, pixel-wise classification methods based on neural networks can be used. The method proposed in this paper aims to automatically predict land cover based on historic aerial imagery and scanned topographic maps. The proposed deep learning-based classifier extracts features at different scales from both modalities and fuses the most complex topographic map features of the smallest scale to enrich the ones derived from the aerial images. Both the multi-modal features and those of the aerial images at larger scales are mapped to pixel-wise predictions by means of a decoder. Comprehensive experiments show that the results of the proposed multi-modal classifier are superior to those of a uni-modal aerial image classifier; the multi-modal mIOU of 82.3% is 1.4% larger than that of the uni-modal classifier. This demonstrates that aerial image classification can benefit from additional information contained in topographic maps.
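The mIOU metric reported above can be computed from a pixel-wise confusion matrix, as in this minimal sketch. It assumes rows index the reference class and columns the predicted class, and it skips classes that occur in neither; the function name is illustrative.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[r][c] counts pixels of reference class r predicted as c.
    Classes absent from both reference and prediction are ignored.
    """
    n = len(confusion)
    ious = []
    for cls in range(n):
        tp = confusion[cls][cls]
        fn = sum(confusion[cls]) - tp                       # missed pixels
        fp = sum(confusion[r][cls] for r in range(n)) - tp  # false alarms
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)
```

For example, a two-class matrix `[[3, 1], [1, 3]]` gives an IoU of 3/5 for each class and therefore a mean of 0.6.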

Journal Article
