Catalogue Search | MBRL

A systematic literature review on incomplete multimodal learning: techniques and challenges

by Huang, Mengjie; Yang, Rui; You, Junxian in Incomplete multimodal learning, Information sources, Literature reviews

2025

Recently, machine learning technologies have been successfully applied across various fields. However, most existing machine learning models rely on unimodal data for inference, which hinders their ability to generalize to complex application scenarios. This limitation has led to the development of multimodal learning, a field that integrates information from different modalities to enhance models' capabilities. In practical applications, however, data often suffers from missing or incomplete modalities, so models must remain robust and effectively infer complete information when modalities are missing. The emerging research direction of incomplete multimodal learning (IML) aims to facilitate effective learning from incomplete multimodal training sets, ensuring that models can dynamically and robustly handle new instances with arbitrary missing modalities during the testing phase. This paper offers a comprehensive review of IML methods. It categorizes existing approaches by their information sources into two main types: methods based on internal information and methods based on external information. These categories are further subdivided into data-based, feature-based, knowledge transfer-based, graph knowledge enhancement-based, and human-in-the-loop-based methods. The paper conducts comparative analyses from two perspectives: comparisons among similar methods and comparisons among different types of methods. Finally, it offers insights into research trends in IML.

Journal Article

MutualFormer: Multi-modal Representation Learning via Cross-Diffusion Attention

by Jiang, Bo; Wang, Xixi; Luo, Bin in Affinity, Cognitive tasks, Diffusion

2024

Aggregating multi-modal data to obtain reliable data representations is attracting increasing attention. Recent studies demonstrate that Transformer models usually work well for multi-modal tasks. Existing Transformers generally adopt either the cross-attention (CA) mechanism or simple concatenation to achieve information interaction among different modalities, which generally ignores the issue of the modality gap. In this work, we rethink the Transformer and extend it to MutualFormer for multi-modal data representation. Rather than CA, MutualFormer employs our new design of cross-diffusion attention (CDA) to conduct information communication among different modalities. Compared with CA, the proposed CDA has three main advantages. First, the cross-affinities in CDA are defined based on the individual modal affinities (token metrics), which naturally alleviates the modality/domain gap present in the traditional CA definition based on token features. Second, CDA provides a general scheme that can either be used for multi-modal representation or serve as a post-optimization for existing CA models. Third, CDA is implemented efficiently. We successfully apply MutualFormer to several multi-modal learning tasks. Extensive experiments demonstrate the effectiveness of the proposed MutualFormer.

Journal Article

Relative Norm Alignment for Tackling Domain Shift in Deep Multi-modal Classification

by Planamente, Mirco; Peirone, Simone Alberto; Plizzari, Chiara in Ablation, Accuracy, Algorithms

2024

Multi-modal learning has gained significant attention due to its ability to enhance machine learning algorithms. However, it brings challenges related to modality heterogeneity and domain shift. In this work, we address these challenges by proposing a new approach called Relative Norm Alignment (RNA) loss. RNA loss exploits the observation that variations in marginal distributions between modalities manifest as discrepancies in their mean feature norms, and rebalances feature norms across domains, modalities, and classes. This rebalancing improves the accuracy of models on test data from unseen (“target”) distributions. In the context of Unsupervised Domain Adaptation (UDA), we use unlabeled target data to enhance feature transferability. We achieve this by combining RNA loss with an adversarial domain loss and an Information Maximization term that regularizes predictions on target data. We present a comprehensive analysis and ablation of our method for both Domain Generalization and UDA settings, testing our approach on different modalities for tasks such as first and third person action recognition, object recognition, and fatigue detection. Experimental results show that our approach achieves competitive or state-of-the-art performance on the proposed benchmarks, showing the versatility and effectiveness of our method in a wide range of applications.
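The idea of rebalancing mean feature norms between modalities can be illustrated with a small sketch. The abstract does not give the formula, so the penalty below (the squared deviation of the ratio of mean L2 norms from 1) is an assumed form chosen for illustration, and the function names are hypothetical. It is written in plain Python over lists of floats rather than a deep learning framework.

```python
import math


def mean_feature_norm(features):
    """Average L2 norm over a batch of feature vectors (lists of floats)."""
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in features]
    return sum(norms) / len(norms)


def relative_norm_alignment_penalty(features_a, features_b):
    """Assumed penalty: (mean_norm_a / mean_norm_b - 1) ** 2.

    The value is zero when both modalities have the same mean feature norm
    and grows as the two mean norms drift apart.
    """
    ratio = mean_feature_norm(features_a) / mean_feature_norm(features_b)
    return (ratio - 1.0) ** 2


# Hypothetical toy batches for two modalities (e.g. RGB and audio features).
rgb_batch = [[3.0, 4.0]]      # L2 norm 5
audio_batch = [[0.0, 10.0]]   # L2 norm 10
penalty = relative_norm_alignment_penalty(rgb_batch, audio_batch)
```

In a training setup, a term like this would typically be added to the classification loss with a weighting factor, which is likewise an assumption here.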

Journal Article

I2DFormer+: Learning Image to Document Summary Attention for Zero-Shot Image Classification

by Gool, Luc Van; Tombari, Federico; Xian, Yongqin in Documents, Embedding, Image classification

2024

Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this work, we argue that online textual documents, e.g., Wikipedia, contain rich visual descriptions of object classes and can therefore be used as powerful unsupervised side information for ZSL. To this end, we propose I2DFormer+, a novel transformer-based ZSL framework that jointly learns to encode images and documents by aligning both modalities in a shared embedding space. I2DFormer+ utilizes our novel Document Summary Transformer (DSTransformer), a text transformer that learns to encode a sequence of text into a fixed set of summary tokens. These summary tokens are utilized by a cross-modal attention module that learns fine-grained interactions between image patches and the summary of the document. Consequently, I2DFormer+ not only learns highly discriminative document embeddings that capture visual similarities but also gains the ability to explain which regions of the image are important for the decision. Quantitatively, we demonstrate that I2DFormer+ significantly outperforms previous unsupervised semantic embeddings under both zero-shot and generalized zero-shot learning settings on three public datasets. Qualitatively, we show that our method leads to highly interpretable results. Furthermore, we scale our model to the large-scale zero-shot learning setting and show state-of-the-art performance on two challenging ImageNet benchmarks.

Journal Article

Diagram Perception Networks for Textbook Question Answering via Joint Optimization

by Chai, Qi; Tao, Jing; Liu, Jun in Ablation, Accumulation, Artificial neural networks

2024

Textbook question answering requires a system to accurately answer questions with or without diagrams, given multimodal contexts that include rich paragraphs and diagrams. Existing methods usually use a pipelined approach to extract the most relevant paragraph from multimodal contexts and employ only convolutional neural networks to comprehend diagram semantics under the supervision of answer labels. The former results in error accumulation, while the latter leads to poor diagram understanding. To remedy these issues, we propose an end-to-end DIagraM Perception network for textbook question answering (DIMP), which is jointly optimized under the supervision of relation prediction, diagram classification, and question answering. Specifically, knowledge extraction is treated as a sequence classification task and optimized through the supervision of answer labels to alleviate error accumulation. To capture diagram semantics effectively, DIMP uses an explicit relation-aware method that first parses a diagram into several graphs under specific relations and then captures the information propagation within them. Evaluation on two benchmark datasets shows that our method achieves competitive or better results than current state-of-the-art methods without large-scale data pre-training or the construction of auxiliary tasks. We provide comprehensive ablation studies and thorough analyses to determine which factors contribute to this success. We also provide in-depth analyses of relational graph learning and joint optimization.

Journal Article

Learning Text-to-Video Retrieval from Image Captioning

by Ventura, Lucas; Schmid, Cordelia; Varol, Gül in Ablation, Annotations, Artificial Intelligence

2025

We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper and therefore more scalable than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD. Code and models will be made publicly available.
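The caption-conditioned temporal pooling mentioned in this abstract can be sketched roughly as follows. The abstract does not specify the scoring function, so this example assumes dot-product relevance followed by a softmax with a temperature; the function name, the temperature parameter, and the toy embeddings are all hypothetical.

```python
import math


def caption_weighted_pooling(frame_embeddings, caption_embedding, temperature=1.0):
    """Pool frame embeddings into one video embedding, weighted by caption relevance.

    Relevance is assumed to be the dot product between each frame embedding
    and the caption embedding; weights are a softmax over those scores.
    Returns (pooled_embedding, weights).
    """
    scores = [
        sum(f * c for f, c in zip(frame, caption_embedding)) / temperature
        for frame in frame_embeddings
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]

    dim = len(frame_embeddings[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frame_embeddings))
        for d in range(dim)
    ]
    return pooled, weights


# Hypothetical 2-D embeddings: the caption matches the first frame far better.
frames = [[1.0, 0.0], [0.0, 1.0]]
caption = [10.0, 0.0]
video_embedding, frame_weights = caption_weighted_pooling(frames, caption)
```

With a caption embedding that is uninformative (all zeros), the weights become uniform and the result reduces to plain mean pooling over frames.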

Journal Article

Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

by Cornia, Marcella; Cucchiara, Rita; Fiameni, Giuseppe in Datasets, Experiments, Language

2024

This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need for object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.

Journal Article

Multimodal Learning Technology Aimed at Exploring the Innovative Path of Library Intelligence Service

2025

[Purpose/Significance] The evolution of smart libraries has ushered in a new era, marked by the integration of multimodal learning technologies that combine information from various modalities such as speech, images, and video. This technology is transforming traditional information service systems by providing a more interactive, efficient, and personalized user experience. Unlike traditional studies that focus on single-mode interactions, this research examines the role of multimodal technologies in transforming library services and increasing user engagement. The study highlights its contributions to the field of library science, particularly in improving knowledge dissemination, enhancing user-centered services, and addressing emerging challenges in digital information management. These findings not only enrich the theoretical framework of smart libraries, but also provide practical insights into the design and deployment of advanced information services. [Method/Process] This study takes a multidisciplinary approach, drawing on library science, information technology, and human-computer interaction theories. It systematically reviews the historical development and theoretical foundations of multimodal learning technologies while emphasizing their relevance to intelligent library ecosystems. The analysis is organized around key application areas, including intelligent navigation, intelligent question-and-answer systems, user education with intelligent support, and immersive reading experiences. These areas were explored through a combination of case studies and a detailed analysis of current library practices. To evaluate the practical impact of these technologies, the study employed qualitative methods, analyzing user feedback and system performance metrics. The research also identifies current barriers to adoption, such as data privacy concerns, technology costs, and disparities in user acceptance across different demographics. [Results/Conclusions] The results show that multimodal learning technologies significantly enhance the functionality and user experience of smart libraries. They improve the accuracy of information retrieval, enable more interactive and immersive learning environments, and support personalized services tailored to individual needs. Despite these advantages, challenges remain, particularly in areas such as securing user data, reducing deployment costs, and increasing accessibility for underprivileged users. The study proposes actionable strategies to address these issues, including enhancing system interoperability, refining ethical frameworks, and fostering human-computer collaboration to reduce barriers to technology adoption. It also identifies gaps in current research, such as the need for more empirical studies of long-term user interaction patterns and the scalability of multimodal systems in large library networks. Future studies could also explore the integration of emerging technologies such as augmented reality (AR) and artificial intelligence (AI) into multimodal library services to further improve their efficiency and reach. By providing a robust framework and practical strategies, this study contributes to the ongoing discourse on smart library innovation and paves the way for more sustainable and inclusive information service models. It underscores the potential of multimodal technologies to redefine library science and advance the global digital information landscape.

Journal Article

Toward multimodal learning analytics in simulation-based collaborative learning: A design ethnography of maritime training

by Sellberg, Charlott; Sharma, Amit in Annan teknik, Automation, Collaborative learning

2025

Collaborative learning in high-fidelity simulators is an important part of how master mariner students prepare for their future careers at sea by becoming part of a ship’s bridge team. This study aims to inform the design of multimodal learning analytics to be used for providing automated feedback to master mariner students engaged in collaborative learning activities in high-fidelity navigation simulators. Through a design ethnographic approach, we analyze video records of everyday training practices at a simulator center in Scandinavia, exploring (a) how feedback is delivered to students during collaborative activities in full-mission simulators and (b) which sensors are needed, and why, for capturing the multimodal nature of professional performance, communication, and collaboration in simulation-based collaborative learning. Our detailed analysis of two episodes from the data corpus shows how the delivery of feedback during simulations consists of recurring, multidimensional, and multimodal feedback cycles, comprising instructors’ close monitoring of students’ actions to continuously assess the fit between the learning objectives and the ongoing task. Through these embedded assessments, instructors provide feedback that draws on the rich semiotic resources of the simulated environment while considering aspects of realism and authenticity. Considering the multidimensional and multimodal nature of feedback in professional learning contexts, we identify technologies and sensors needed for capturing professional performance in simulated environments.

Journal Article

A survey of multimodal federated learning: background, applications, and perspectives

by Lin, Xiaogang; He, Lipeng; Shi, Yicong in Artificial intelligence, Benchmarks, Cellular telephones

2024

Multimodal Federated Learning (MMFL) is a novel machine learning technique that extends traditional Federated Learning (FL) to support collaborative training of local models using data available in various modalities. With the generation and storage of vast amounts of multimodal data from the internet, sensors, and mobile devices, as well as the rapid iteration of artificial intelligence models, the demand for multimodal models is growing rapidly. While FL has been widely studied in the past few years, most existing research has focused on unimodal settings. With the hope of inspiring more applications and research within the MMFL paradigm, we conduct a comprehensive review of the progress and challenges in various aspects of state-of-the-art MMFL. Specifically, we analyze the research motivation for MMFL, propose a new classification of existing research, discuss the available datasets and application scenarios, and put forward perspectives on the opportunities and challenges faced by MMFL.

Journal Article
