635 result(s) for "multimodal feature extraction"
MSSA: memory-driven and simplified scaled attention for enhanced image captioning
Image captioning is a key task in computer vision and natural language processing. It involves creating clear and accurate descriptions of what we see in images, helping to connect visuals with words in a meaningful way. This paper introduces MSSA (Memory-Driven and Simplified Scaled Attention), a novel framework for image captioning designed to enhance multimodal integration and caption generation. MSSA leverages Extended Multimodal Feature Extraction, incorporating a diverse range of features, including geometric features encoding spatial properties of bounding boxes, color features representing pixel intensity distributions in RGB space, texture features capturing local variations using Local Binary Patterns (LBP), edge features describing boundary structures via Canny edge detection, and frequency-domain features detecting orientation- and frequency-specific patterns through Gabor filters. This comprehensive feature set provides a richer understanding of complex visual scenes. The framework integrates two key mechanisms: Memory-Driven Attention (MDA) and Simplified Scaled Attention (SSA). MDA iteratively refines the alignment of visual and multimodal features using an LSTM-based memory mechanism, ensuring dynamic adaptation to contextually relevant image and textual elements. SSA generates context vectors by leveraging scaled dot-product attention, enabling efficient modeling of spatial, semantic, and contextual interactions while maintaining computational simplicity through the removal of complex gating mechanisms. Extensive experiments on the MSCOCO dataset demonstrate that MSSA outperforms state-of-the-art methods across several evaluation metrics. The proposed framework combines robust feature extraction with a simplified attention module, and we support the “streamlined” claim by reporting concrete efficiency evidence (Params/FP32 size, FLOPs, and inference latency) within our LSTM-based captioning pipeline, without implying a direct runtime advantage over Transformer-based captioning models. The code and resources for MSSA are publicly available at https://github.com/alamgirustc/MSSA.
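As a concrete illustration of the kind of handcrafted multimodal descriptors this abstract lists (geometric, color, LBP texture, Canny edges, Gabor responses), a minimal Python sketch using OpenCV and scikit-image is given below; the function name and parameter choices are illustrative assumptions, not code from the MSSA repository.

```python
# Hypothetical sketch of per-region multimodal feature extraction in the spirit
# of the MSSA abstract (geometric, color, LBP texture, edge, and Gabor features).
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def extract_region_features(image_bgr, box):
    """Concatenate simple multimodal descriptors for one bounding box (x, y, w, h)."""
    x, y, w, h = box
    region = image_bgr[y:y + h, x:x + w]
    gray = cv2.cvtColor(region, cv2.COLOR_BGR2GRAY)

    # Geometric: normalized box position and size.
    H, W = image_bgr.shape[:2]
    geom = np.array([x / W, y / H, w / W, h / H])

    # Color: per-channel mean and standard deviation in BGR space.
    color = np.concatenate([region.mean(axis=(0, 1)), region.std(axis=(0, 1))]) / 255.0

    # Texture: histogram of uniform local binary patterns.
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    tex, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)

    # Edges: fraction of Canny edge pixels.
    edges = cv2.Canny(gray, 100, 200)
    edge_density = np.array([edges.mean() / 255.0])

    # Frequency domain: mean response to a small bank of oriented Gabor filters.
    gabor = []
    for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        kernel = cv2.getGaborKernel((9, 9), sigma=2.0, theta=theta, lambd=4.0, gamma=0.5)
        gabor.append(np.abs(cv2.filter2D(gray.astype(np.float32), -1, kernel)).mean())
    gabor = np.array(gabor)

    return np.concatenate([geom, color, tex, edge_density, gabor])
```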
Multimodal representations of transfer learning with snake optimization algorithm on bone marrow cell classification using biomedical histopathological images
Bone marrow (BM) plays a crucial role in the hematopoietic process, producing all of the body’s blood cells and maintaining the overall immune and health system. Red and yellow marrow are the two kinds of BM. Comprehensive identification of BM cells supports the early and precise recognition of blood disorders, and their detection and classification are a crucial basis for haematology diagnostics. Manual BM cell detection and classification, as currently performed in medical laboratories, can be insufficient because it is prolonged and challenging. Recently, with the fast growth of deep learning (DL) and machine learning (ML), object detection methods have been progressively applied to cell detection. DL is a subfield of artificial intelligence (AI) that can automatically assess subtle graphical features to make accurate predictions and has recently been popularized in various imaging-related tasks. This study proposes a Multimodal Transfer Learning with Snake Optimization on Bone Marrow Cell Classification (MTLSO-BMCC) technique using biomedical histopathological images. The main intention of the MTLSO-BMCC technique is to identify and classify BM cells using histopathological images (HI). To achieve this, the presented MTLSO-BMCC method first performs image preprocessing with a median filter (MF) for noise removal. Multimodal feature extraction is then accomplished with the InceptionV3, Deep SqueezeNet, and SE-DenseNet models. The MTLSO-BMCC technique employs the hybrid kernel extreme learning machine (HKELM) method for BM cell classification. Finally, the snake optimization algorithm (SOA) is applied to tune the parameters of the HKELM model. An extensive simulation of the MTLSO-BMCC methodology is conducted on the BM Cell Classification dataset. The experimental validation of the MTLSO-BMCC methodology showed a superior accuracy of 98.60% over existing approaches.
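The multi-backbone feature extraction step described above can be approximated with a short PyTorch sketch: median-filtered images are passed through several pretrained CNNs and their outputs are concatenated for a downstream classifier. The backbone choices here are stand-ins (torchvision has no SE-DenseNet, so DenseNet121 is used), and the HKELM classifier and snake optimization stages are not reproduced.

```python
# Rough sketch of multi-backbone feature extraction, loosely following the
# MTLSO-BMCC abstract. For simplicity the 1000-d class-score outputs stand in
# for pooled features; backbones and sizes are illustrative assumptions.
import torch
import torchvision.models as models
from torchvision.transforms.functional import resize

def multimodal_features(batch):
    """batch: (N, 3, H, W) tensor of median-filtered histopathological images."""
    inception = models.inception_v3(weights=None).eval()
    squeezenet = models.squeezenet1_0(weights=None).eval()
    densenet = models.densenet121(weights=None).eval()   # stand-in for SE-DenseNet

    with torch.no_grad():
        # InceptionV3 is designed for 299x299 inputs; the others use 224x224.
        f1 = inception(resize(batch, [299, 299]))         # (N, 1000)
        f2 = squeezenet(resize(batch, [224, 224]))        # (N, 1000)
        f3 = densenet(resize(batch, [224, 224]))          # (N, 1000)

    # Concatenated multimodal feature vector for a downstream classifier.
    return torch.cat([f1, f2, f3], dim=1)                 # (N, 3000)
```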
MFSM-Net: Multimodal Feature Fusion for the Semantic Segmentation of Urban-Scale Textured 3D Meshes
The semantic segmentation of textured 3D meshes is a critical step in constructing city-scale realistic 3D models. Compared to colored point clouds, textured 3D meshes have the advantage of high-resolution texture image patches embedded on each mesh face. However, existing studies predominantly focus on their geometric structures, with limited utilization of these high-resolution textures. Inspired by the binocular perception of humans, this paper proposes a multimodal feature fusion network based on 3D geometric structures and 2D high-resolution texture images for the semantic segmentation of textured 3D meshes. Methodologically, the 3D feature extraction branch computes the centroid coordinates and face normals of mesh faces as initial 3D features, followed by a multi-scale Transformer network to extract high-level 3D features. The 2D feature extraction branch employs orthographic views of city scenes captured from a top-down perspective and uses a U-Net to extract high-level 2D features. To align features across 2D and 3D modalities, a Bridge view-based alignment algorithm is proposed, which visualizes the 3D mesh indices to establish pixel-level associations with orthographic views, achieving the precise alignment of multimodal features. Experimental results demonstrate that the proposed method achieves competitive performance in city-scale textured 3D mesh semantic segmentation, validating the effectiveness and potential of the cross-modal fusion strategy.
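A minimal numpy sketch of the initial per-face 3D features mentioned in the abstract (face centroids and normals) follows; array names are ours, and the multi-scale Transformer, U-Net branch, and bridge-view alignment are not shown.

```python
# Minimal sketch of the initial per-face 3D features described in the MFSM-Net
# abstract: centroid coordinates and unit normals of each triangular mesh face.
import numpy as np

def face_centroids_and_normals(vertices, faces):
    """vertices: (V, 3) float coordinates; faces: (F, 3) vertex indices."""
    tri = vertices[faces]                         # (F, 3, 3) triangle corners
    centroids = tri.mean(axis=1)                  # (F, 3) face centroids
    normals = np.cross(tri[:, 1] - tri[:, 0],     # (F, 3) un-normalized normals
                       tri[:, 2] - tri[:, 0])
    normals /= np.linalg.norm(normals, axis=1, keepdims=True) + 1e-12
    return np.concatenate([centroids, normals], axis=1)   # (F, 6) initial features
```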
A Comprehensive Review of Multimodal Analysis in Education
Multimodal learning analytics (MMLA) has become a prominent approach for capturing the complexity of learning by integrating diverse data sources such as video, audio, physiological signals, and digital interactions. This comprehensive review synthesises findings from 177 peer-reviewed studies to examine the foundations, methodologies, tools, and applications of MMLA in education. It provides a detailed analysis of data collection modalities, feature extraction pipelines, modelling techniques—including machine learning, deep learning, and fusion strategies—and software frameworks used across various educational settings. Applications are categorised by pedagogical goals, including engagement monitoring, collaborative learning, simulation-based environments, and inclusive education. The review identifies key challenges, such as data synchronisation, model interpretability, ethical concerns, and scalability barriers. It concludes by outlining future research directions, with emphasis on real-world deployment, longitudinal studies, explainable artificial intelligence, emerging modalities, and cross-cultural validation. This work aims to consolidate current knowledge, address gaps in practice, and offer practical guidance for researchers and practitioners advancing multimodal approaches in education.
Palmprint and Face Multi-Modal Biometric Recognition Based on SDA-GSVD and Its Kernelization
When extracting discriminative features from multimodal data, current methods rarely concern themselves with the data distribution. In this paper, we present an assumption that is consistent with the viewpoint of discrimination: a person’s overall biometric data should be regarded as one class in the input space, and their different biometric data can form different Gaussian distributions, i.e., different subclasses. Hence, we propose a novel multimodal feature extraction and recognition approach based on subclass discriminant analysis (SDA). Specifically, one person’s different biometric data are treated as different subclasses of one class, and a transformed space is calculated in which the difference among subclasses belonging to different persons is maximized and the difference within each subclass is minimized. The obtained multimodal features are then used for classification. Two solutions are presented to overcome the singularity problem encountered in the calculation: PCA preprocessing and the generalized singular value decomposition (GSVD) technique. Further, we provide nonlinear extensions of SDA-based multimodal feature extraction, namely feature fusion based on KPCA-SDA and KSDA-GSVD. In KPCA-SDA, we first apply kernel PCA to each single modality before performing SDA, while in KSDA-GSVD, we directly perform kernel SDA to fuse multimodal data, applying GSVD to avoid the singularity problem. For simplicity, two typical types of biometric data are considered in this paper, i.e., palmprint data and face data. Compared with several representative multimodal biometric recognition methods, experimental results show that our approaches outperform related multimodal recognition methods and that KSDA-GSVD achieves the best recognition performance.
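For readers unfamiliar with SDA, one common formulation of the subclass discriminant criterion the abstract builds on is sketched below in our own notation (subclass means, priors, and a trace-ratio objective); the exact weighting used in the paper may differ.

```latex
% Subclass means \mu_{c,k} and priors p_{c,k} for subclass k of class c.
% Between-subclass scatter over subclass pairs drawn from different classes:
\Sigma_B \;=\; \sum_{c=1}^{C}\sum_{k=1}^{K_c}\;\sum_{c'\neq c}\sum_{k'=1}^{K_{c'}}
  p_{c,k}\,p_{c',k'}\,\bigl(\mu_{c,k}-\mu_{c',k'}\bigr)\bigl(\mu_{c,k}-\mu_{c',k'}\bigr)^{\top}
% Within-subclass scatter:
\Sigma_W \;=\; \sum_{c=1}^{C}\sum_{k=1}^{K_c}\;\sum_{x_i \in \mathcal{X}_{c,k}}
  \bigl(x_i-\mu_{c,k}\bigr)\bigl(x_i-\mu_{c,k}\bigr)^{\top}
% Projection sought by SDA (solved as a generalized eigenproblem; GSVD or PCA
% preprocessing handles the case where \Sigma_W is singular):
W^{\star} \;=\; \arg\max_{W}\;
  \frac{\operatorname{tr}\!\bigl(W^{\top}\Sigma_B\,W\bigr)}
       {\operatorname{tr}\!\bigl(W^{\top}\Sigma_W\,W\bigr)}
```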
Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer
Human-object interaction (HOI) detection is an important computer vision task for recognizing the interactions between humans and surrounding objects in an image or video. HOI datasets have a serious long-tailed data distribution problem because it is challenging to build a dataset that contains all potential interactions. Many HOI detectors have addressed this issue by utilizing visual-language models. However, due to the computation mechanism of the Transformer, visual-language models are not good at extracting the local features of input samples. Therefore, we propose a novel local feature enhanced Transformer to motivate encoders to extract multi-modal features that contain more information. Moreover, it is worth noting that the application of prompt learning in HOI detection is still at a preliminary stage. Consequently, we propose a multi-modal adaptive prompt module, which uses an adaptive learning strategy to facilitate the interaction of language and visual prompts. On the HICO-DET and SWIG-HOI datasets, the proposed model achieves 24.21% mAP and 14.29% mAP on full interactions, respectively. Our code is available at https://github.com/small-code-cat/AMP-HOI.
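A generic sketch of learnable prompt tokens, the basic building block behind prompt modules like the one described above, is given below in PyTorch; the class name, sizes, and the idea of prepending prompts to a frozen encoder's input are illustrative assumptions rather than the AMP-HOI implementation.

```python
# Generic learnable-prompt sketch: a small set of trainable embeddings is
# prepended to the token sequence fed to a (typically frozen) encoder.
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, d_model: int, n_prompts: int = 8):
        super().__init__()
        self.encoder = encoder                                 # e.g. a frozen vision or text encoder
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, token_embeddings):                       # (N, L, d_model)
        n = token_embeddings.size(0)
        prompt = self.prompts.unsqueeze(0).expand(n, -1, -1)   # (N, n_prompts, d_model)
        return self.encoder(torch.cat([prompt, token_embeddings], dim=1))
```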
A Uniform Multi-Modal Feature Extraction and Adaptive Local–Global Feature Fusion Structure for RGB-X Marine Animal Segmentation
Marine animal segmentation aims at segmenting marine animals in complex ocean scenes, which plays an important role in underwater intelligence research. Due to the complexity of underwater scenes, relying solely on a single RGB image or learning from a specific combination of multi-modal information may not be very effective. Therefore, we propose a uniform multi-modal feature extraction and adaptive local–global feature fusion structure for RGB-X marine animal segmentation. It is applicable to various settings such as RGB-D (RGB+depth) and RGB-O (RGB+optical flow) marine animal segmentation. Specifically, we first fine-tune the SAM encoder using parallel LoRA and adapters to separately extract RGB information and auxiliary information. Then, the Adaptive Local–Global Feature Fusion (ALGFF) module is proposed to progressively fuse multi-modal and multi-scale features in a simple and dynamic way. Experimental results on both RGB-D and RGB-O datasets demonstrate that our model achieves superior performance in underwater scene segmentation tasks.
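As background for the parallel LoRA fine-tuning mentioned above, here is a minimal PyTorch sketch of a LoRA-augmented linear layer: the pretrained weight is frozen and a trainable low-rank update is added. Names and hyperparameters are illustrative; the actual RGB-X encoder, adapters, and ALGFF module are not shown.

```python
# Minimal LoRA sketch: a frozen linear layer (e.g. an attention projection in a
# pretrained encoder) plus a trainable low-rank residual update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep the pretrained weights frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus low-rank residual: W x + scale * (B A) x
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```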
Research on the prediction of English topic richness in the context of multimedia data
With the evolution of the Internet and multimedia technologies, mining multimedia data to predict topic richness holds significant practical implications for public opinion monitoring and the competition for data discourse power. This study introduces an algorithm for predicting English topic richness based on the Transformer model, applied specifically to the Twitter platform. Initially, relevant data is organized and extracted following an analysis of Twitter’s characteristics. Subsequently, a feature fusion approach is employed to mine, extract, and construct features from Twitter blogs and users, encompassing blog features, topic features, and user features, which are amalgamated into multimodal features. Lastly, the combined features are used to train the Transformer model. Through experimentation on the Twitter topic richness dataset, our algorithm achieves an accuracy of 82.3%, affirming the efficacy and superior performance of the proposed approach.
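A hedged PyTorch sketch of the fuse-then-Transformer idea follows: blog, topic, and user feature vectors are projected to a common width, treated as tokens, and classified by a small Transformer encoder. All dimensions and names are assumptions, not the paper's configuration.

```python
# Illustrative fuse-then-Transformer classifier: each feature group becomes one
# token of the fused sequence processed by a small Transformer encoder.
import torch
import torch.nn as nn

class FusionTransformerClassifier(nn.Module):
    def __init__(self, dims=(128, 32, 16), d_model=64, n_classes=3):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, blog, topic, user):
        # Project each group to d_model and stack them as a 3-token sequence.
        tokens = torch.stack([p(f) for p, f in zip(self.proj, (blog, topic, user))], dim=1)
        return self.head(self.encoder(tokens).mean(dim=1))   # (N, n_classes)
```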
Enhancing Human Activity Recognition through Integrated Multimodal Analysis: A Focus on RGB Imaging, Skeletal Tracking, and Pose Estimation
Human activity recognition (HAR) is pivotal in advancing applications ranging from healthcare monitoring to interactive gaming. Traditional HAR systems, primarily relying on single data sources, face limitations in capturing the full spectrum of human activities. This study introduces a comprehensive approach to HAR by integrating two critical modalities: RGB imaging and advanced pose estimation features. Our methodology leverages the strengths of each modality to overcome the drawbacks of unimodal systems, providing a richer and more accurate representation of activities. We propose a two-stream network that processes skeletal and RGB data in parallel, enhanced by pose estimation techniques for refined feature extraction. The integration of these modalities is facilitated through advanced fusion algorithms, significantly improving recognition accuracy. Extensive experiments conducted on the UTD multimodal human action dataset (UTD-MHAD) demonstrate that the proposed approach outperforms existing state-of-the-art algorithms. This study not only sets a new benchmark for HAR systems but also highlights the importance of feature engineering in capturing the complexity of human movements and of integrating optimal features. Our findings pave the way for more sophisticated, reliable, and applicable HAR systems in real-world scenarios.
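A rough PyTorch sketch of a two-stream, score-level fusion baseline in the spirit of this abstract is shown below; the branch architectures, the joint count, and the 27-class output (matching UTD-MHAD's action count) are illustrative assumptions, not the authors' network.

```python
# Two-stream HAR sketch: one branch for skeletal joint coordinates, one for RGB
# frames, fused at the class-score level (late fusion).
import torch
import torch.nn as nn

class TwoStreamHAR(nn.Module):
    def __init__(self, n_joints=20, n_classes=27):
        super().__init__()
        self.skeleton_branch = nn.Sequential(
            nn.Flatten(), nn.Linear(n_joints * 3, 256), nn.ReLU(), nn.Linear(256, n_classes))
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, n_classes))

    def forward(self, skeleton, rgb):   # skeleton: (N, n_joints, 3), rgb: (N, 3, H, W)
        # Late fusion: average the per-stream class scores.
        return 0.5 * (self.skeleton_branch(skeleton) + self.rgb_branch(rgb))
```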