Catalogue Search | MBRL
Explore the vast range of titles available.
225 result(s) for "Cross-modal retrieval"
Learning DALTS for cross-modal retrieval
by Wang, Wenmin; Yu, Zheng
in Adaptation; B6135 Optical, image and video signal processing; C5260B Computer vision and image processing techniques
2019
Cross-modal retrieval has been recently proposed to find an appropriate subspace, where the similarity across different modalities such as image and text can be directly measured. In this study, different from most existing works, the authors propose a novel model for cross-modal retrieval based on a domain-adaptive limited text space (DALTS) rather than a common space or an image space. Experimental results on three widely used datasets, Flickr8K, Flickr30K and Microsoft Common Objects in Context (MSCOCO), show that the proposed method, dubbed DALTS, is able to learn superior text space features which can effectively capture the necessary information for cross-modal retrieval. Meanwhile, DALTS achieves promising improvements in accuracy for cross-modal retrieval compared with the current state-of-the-art methods.
Journal Article
Efficient low-rank multi-component fusion with component-specific factors in image-recipe retrieval
by Zhao, Wenyu; Cao, Buqing; Chen, Jinjun
in Computer Communication Networks; Computer Science; Data Structures and Information Theory
2024
Image-recipe retrieval is the task of retrieving closely related recipes from a collection given a food image, and vice versa. The modality gap between images and recipes makes it a challenging task. Recent studies usually focus on learning consistent image and recipe representations to bridge the semantic gap. Though existing methods have significantly improved image-recipe retrieval, several challenges remain: 1) Previous studies usually directly concatenate the textual embeddings of different recipe components to generate recipe representations, so only simple rather than complex interactions are considered. 2) They commonly focus on textual feature extraction from recipes; the methods to extract image features are relatively simple, and most studies utilize the ResNet-50 model. 3) Apart from the retrieval learning loss (triplet loss, for example), several auxiliary loss functions (such as adversarial loss and reconstruction loss) are commonly used to match the recipe and image representations. To deal with these issues, we introduce a novel Low-rank Multi-component Fusion method with Component-Specific Factors (LMF-CSF) to model the different textual components in a recipe and produce superior textual representations. Furthermore, we also pay attention to image feature extraction: a visual transformer is used to learn better image representations. The enhanced representations from the two modalities are then directly fed into a triplet loss function for image-recipe retrieval learning. Experimental results on the Recipe1M dataset indicate that our LMF-CSF method outperforms the current state-of-the-art image-recipe retrieval baselines.
Journal Article
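The low-rank multi-component fusion idea in the abstract above can be sketched in a few lines. This is a minimal numpy illustration of low-rank tensor fusion with component-specific factors, not the authors' LMF-CSF implementation; the component names, dimensions, and rank are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
rank, out_dim = 4, 8
# hypothetical recipe components and their embedding dimensions
dims = {"title": 5, "ingredients": 6, "instructions": 7}

# one low-rank factor set per recipe component (component-specific factors)
factors = {name: rng.standard_normal((rank, d + 1, out_dim)) for name, d in dims.items()}

def low_rank_fuse(features):
    # append a constant 1 to each feature so lower-order (unimodal and
    # pairwise) terms survive the elementwise product across components
    fused = np.ones((rank, out_dim))
    for name, x in features.items():
        x1 = np.concatenate([x, [1.0]])                 # (d+1,)
        fused *= np.einsum("d,rdo->ro", x1, factors[name])
    return fused.sum(axis=0)                            # sum the rank-1 terms

feats = {n: rng.standard_normal(d) for n, d in dims.items()}
z = low_rank_fuse(feats)                                # fused recipe embedding
```

The elementwise product over components models their full multiplicative interaction, while the rank-r factorization keeps the parameter count linear in the number of components.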
Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective
2024
Fine-grained representation is fundamental to species classification based on deep learning, and in this context, cross-modal contrastive learning is an effective method. The diversity of species coupled with the inherent contextual ambiguity of natural language poses a primary challenge in the cross-modal representation alignment of conservation area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment based on contextual understanding. However, during the contrastive learning process, apart from learning the differences in the data itself, a pair of encoders inevitably learns the differences caused by encoder fluctuations. The latter leads to convergence shortcuts, resulting in poor representation quality and an inaccurate reflection of the similarity relationships between samples in the original dataset within the shared space of features. To achieve fine-grained cross-modal representation alignment, we first propose a residual attention network to enhance consistency during momentum updates in cross-modal encoders. Building upon this, we propose momentum encoding from a multi-task perspective as a bridge for cross-modal information, effectively improving cross-modal mutual information, representation quality, and optimizing the distribution of feature points within the cross-modal shared semantic space. By acquiring momentum encoding queues for cross-modal semantic understanding through multi-tasking, we align ambiguous natural language representations around the invariant image features of factual information, alleviating contextual ambiguity and enhancing model robustness. Experimental validation shows that our proposed multi-task perspective of cross-modal momentum encoders outperforms similar models on standardized image classification tasks and image–text cross-modal retrieval tasks on public datasets by up to 8% on the leaderboard, demonstrating the effectiveness of the proposed method. Qualitative experiments on our self-built conservation area image–text paired dataset show that our proposed method accurately performs cross-modal retrieval and generation tasks among 8142 species, proving its effectiveness on fine-grained cross-modal image–text conservation area image datasets.
Journal Article
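The momentum-updated encoders referenced in the abstract above follow the standard exponential-moving-average formulation (as in MoCo-style contrastive learning). A minimal sketch of the update rule, independent of the paper's residual attention details:

```python
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    # EMA of the query encoder's weights; the key (momentum) encoder is
    # never updated by gradients, which damps encoder fluctuations
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

q = [np.ones(3)]    # stand-in query-encoder weights
k = [np.zeros(3)]   # stand-in key-encoder weights
k = momentum_update(k, q, m=0.9)
# each key weight moves 10% of the way toward the query weight
```

A large momentum coefficient (e.g. 0.999) makes the key encoder evolve slowly, so the encoding queue stays consistent across training steps.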
A Fine-Grained Semantic Alignment Method Specific to Aggregate Multi-Scale Information for Cross-Modal Remote Sensing Image Retrieval
2023
Due to the swift growth in the scale of remote sensing imagery, scholars have progressively directed their attention towards achieving efficient and adaptable cross-modal retrieval for remote sensing images. They have also steadily tackled the distinctive challenge posed by the multi-scale attributes of these images. However, existing studies primarily concentrate on the characterization of these features, neglecting the comprehensive investigation of the complex relationship between multi-scale targets and the semantic alignment of these targets with text. To address this issue, this study introduces a fine-grained semantic alignment method that adequately aggregates multi-scale information (referred to as FAAMI). The proposed approach comprises multiple stages. Initially, we employ a computing-friendly cross-layer feature connection method to construct a multi-scale feature representation of an image. Subsequently, we devise an efficient feature consistency enhancement module to rectify the incongruous semantic discrimination observed in cross-layer features. Finally, a shallow cross-attention network is employed to capture the fine-grained semantic relationship between multiple-scale image regions and the corresponding words in the text. Extensive experiments were conducted using two datasets: RSICD and RSITMD. The results demonstrate that the performance of FAAMI surpasses that of recently proposed advanced models in the same domain, with significant improvements observed in R@K and other evaluation metrics. Specifically, the mR values achieved by FAAMI are 23.18% and 35.99% for the two datasets, respectively.
Journal Article
Deep adversarial metric learning for cross-modal retrieval
2019
Cross-modal retrieval has become a highlighted research topic, providing a flexible retrieval experience across multimedia data such as image, video, text and audio. The core of existing cross-modal retrieval approaches is to narrow down the gap between different modalities by finding a maximally correlated embedding space. Recently, researchers have leveraged Deep Neural Networks (DNNs) to learn nonlinear transformations for each modality, obtaining transformed features in a common subspace where cross-modal matching can be performed. However, the statistical characteristics of the original features for each modality are not explicitly preserved in the learned subspace. Inspired by recent advances in adversarial learning, we propose a novel Deep Adversarial Metric Learning approach, termed DAML, for cross-modal retrieval. DAML nonlinearly maps labeled data pairs of different modalities into a shared latent feature subspace, in which the intra-class variation is minimized, the inter-class variation is maximized, and the difference of each data pair captured from two modalities of the same class is minimized. In addition to maximizing the correlations between modalities, we add a regularization term by introducing adversarial learning. In particular, we introduce a modality classifier to predict the modality of a transformed feature, which ensures that the transformed features are also statistically indistinguishable. Experiments on three popular multimodal datasets show that DAML achieves superior performance compared to several state-of-the-art cross-modal retrieval methods.
Journal Article
Intramodal consistency in triplet-based cross-modal learning for image retrieval
by Araya, Mauricio; Ñanculef, Ricardo; Mallea, Mario
in Ablation; Artificial Intelligence; Computer Science
2025
Cross-modal retrieval requires building a common latent space that captures and correlates information from different data modalities, usually images and texts. Cross-modal training based on the triplet loss with hard negative mining is a state-of-the-art technique to address this problem. This paper shows that such an approach is not always effective in handling intra-modal similarities. Specifically, we found that this method can lead to inconsistent similarity orderings in the latent space, where intra-modal pairs with unknown ground-truth similarity are ranked higher than cross-modal pairs representing the same concept. To address this problem, we propose two novel loss functions that leverage intra-modal similarity constraints available in a training triplet but not used by the original formulation. Additionally, this paper explores the application of this framework to unsupervised image retrieval problems, where cross-modal training can provide the supervisory signals that are otherwise missing in the absence of category labels. To the best of our knowledge, we are the first to evaluate cross-modal training for intra-modal retrieval without labels. We present comprehensive experiments on MS-COCO and Flickr30k, demonstrating the advantages and limitations of the proposed methods in cross-modal and intra-modal retrieval tasks in terms of performance and novelty measures. We also conduct a case study on the ROCO dataset to assess the performance of our method on medical images and present an ablation study on one of our approaches to understand the impact of the different components of the proposed loss function. Our code is publicly available on GitHub at https://github.com/MariodotR/FullHN.git.
Journal Article
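The triplet loss with hard negative mining that the paper above builds on can be sketched in a few lines. This is a generic numpy version of the VSE++-style bidirectional retrieval loss, not the code from the linked repository:

```python
import numpy as np

def triplet_hard_negative(img, txt, margin=0.2):
    # img, txt: L2-normalized embeddings; row i of each is a matched pair
    sim = img @ txt.T                               # cosine similarity matrix
    pos = np.diag(sim)                              # similarity of true pairs
    neg_mask = ~np.eye(len(sim), dtype=bool)
    # hardest negative: the most similar NON-matching item in each direction
    hard_i2t = np.where(neg_mask, sim, -np.inf).max(axis=1)
    hard_t2i = np.where(neg_mask, sim, -np.inf).max(axis=0)
    loss = (np.maximum(0.0, margin + hard_i2t - pos)
            + np.maximum(0.0, margin + hard_t2i - pos))
    return loss.mean()
```

Note that the loss only constrains cross-modal similarities; nothing here orders image-image or text-text pairs, which is exactly the intra-modal consistency gap the paper targets.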
Learning with Noisy Correspondence
2024
This paper studies a new learning paradigm for noisy labels, i.e., noisy correspondence (NC). Unlike the well-studied noisy labels that concern errors in the category annotation of a sample, NC refers to errors in the alignment relationship of two data points. Such false positive pairs are common, especially in data harvested from the Internet, yet they are neglected by most existing works. Taking cross-modal retrieval as a showcase, we propose a method called learning with noisy correspondence (LNC). In brief, LNC first roughly separates the original data into clean and noisy subsets and then rectifies the false positive pairs using a novel adaptive prediction function. Finally, LNC adopts a novel triplet loss with soft margins to endow cross-modal retrieval with robustness to NC. To verify the effectiveness of the proposed LNC, we conduct experiments on six benchmark datasets in image-text and video-text retrieval tasks. Besides the effectiveness of LNC, the experimental results show the necessity of an explicit solution to NC, faced not only by the standard model training paradigm but also by the pre-training and fine-tuning paradigms.
Journal Article
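A soft-margin triplet loss of the kind described above can be sketched as follows. The weight `w` is a stand-in for the paper's adaptive prediction of how likely a pair is a true correspondence; this is an illustrative assumption, not the exact LNC formula:

```python
import numpy as np

def soft_margin_triplet(pos_sim, neg_sim, w, base_margin=0.2):
    # w in [0, 1]: estimated probability that each "positive" pair is a true
    # correspondence. A likely-noisy pair (small w) gets a small soft margin,
    # so the model is not forced to pull it above its negatives as strongly.
    margins = base_margin * w
    return np.maximum(0.0, margins + neg_sim - pos_sim).mean()
```

With `w = 1` everywhere this reduces to the standard hard-margin triplet loss, so clean data is handled exactly as before.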
Cross-Modal Simplex Center Learning for Speech-Face Association
2025
Speech-face association aims to achieve identity matching between facial images and voice segments by aligning cross-modal features. Existing research primarily focuses on learning shared-space representations and computing one-to-one similarities between cross-modal sample pairs to establish their correlation. However, these approaches do not fully account for intra-class variations between the modalities or the many-to-many relationships among cross-modal samples, which are crucial for robust association modeling. To address these challenges, we propose a novel framework that leverages global information to align voice and face embeddings while effectively correlating identity information embedded in both modalities. First, we jointly pre-train face recognition and speaker recognition networks to encode discriminative features from facial images and voice segments. This shared pre-training step ensures the extraction of complementary identity information across modalities. Subsequently, we introduce a cross-modal simplex center loss, which aligns samples with identity centers located at the vertices of a regular simplex inscribed on a hypersphere. This design enforces an equidistant and balanced distribution of identity embeddings, reducing intra-class variations. Furthermore, we employ an improved triplet center loss that emphasizes hard sample mining and optimizes inter-class separability, enhancing the model’s ability to generalize across challenging scenarios. Extensive experiments validate the effectiveness of our framework, demonstrating superior performance across various speech-face association tasks, including matching, verification, and retrieval. Notably, in the challenging gender-constrained matching task, our method achieves a remarkable accuracy of 79.22%, significantly outperforming existing approaches. These results highlight the potential of the proposed framework to advance the state of the art in cross-modal identity association.
Journal Article
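The "regular simplex inscribed on a hypersphere" construction from the abstract above can be written down directly. A sketch, assuming n identity centers embedded in R^n (one convenient realization; the paper's embedding dimension and loss are not reproduced here):

```python
import numpy as np

def simplex_vertices(n_classes):
    # center the standard-basis vectors and renormalize: the resulting rows
    # are n equidistant unit vectors, i.e. a regular simplex inscribed in
    # the unit hypersphere
    eye = np.eye(n_classes)
    centered = eye - eye.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

V = simplex_vertices(4)
# all pairwise distances between distinct vertices are equal
d = np.linalg.norm(V[:, None] - V[None, :], axis=-1)
```

Using these fixed equidistant vertices as identity centers gives every class the same angular separation, which is what enforces the balanced embedding distribution the paper describes.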
Mutual contextual relation-guided dynamic graph networks for cross-modal image-text retrieval
by Guntreddi, Venkataramana; Raju, Akella S. Narasimha; Sitharamulu, V.
in 631/114; 692/1537; 692/308
2025
With the rapid growth of multimodal data on the web, cross-modal retrieval has become increasingly important for applications such as multimedia search and content recommendation. It aims to align visual and textual features to retrieve semantically relevant content across modalities. However, this remains a challenging task due to the inherent heterogeneity and semantic gap between image and text representations. This paper proposes a mutual contextual relation-guided dynamic graph network that integrates ViT, BERT and GCNN to construct a unified and interpretable multimodal representation space for image-text matching. The ViT and BERT features are structured into a dynamic cross-modal feature graph (DCMFG), where nodes represent image features and text features, and edges are dynamically updated based on mutual contextual relations, i.e., neighboring relations extracted using KNN. An attention-guided mechanism refines graph connections, ensuring adaptive and context-aware alignment between modalities. The mutual contextual relation helps identify relevant neighborhood structures among image and text nodes, enabling the graph to capture both local and global associations. Meanwhile, the attention mechanism dynamically weights edges, enhancing the propagation of important cross-modal interactions. This emphasizes meaningful connections among different modality nodes, improving interpretability by revealing how image regions and text features interact. This approach overcomes the limitations of existing models that rely on static feature alignment and insufficient modeling of contextual relationships. Experimental results on the benchmark datasets MirFlickr-25K and NUS-WIDE demonstrate significant performance improvements over state-of-the-art methods in precision and recall, validating its effectiveness for cross-modal retrieval.
Journal Article
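The KNN-derived neighboring relations used above to build dynamic graph edges can be sketched as a cosine-similarity k-nearest-neighbour adjacency. A generic illustration of the idea, not the paper's DCMFG code:

```python
import numpy as np

def knn_edges(nodes, k=2):
    # edge i -> j iff j is among i's k most cosine-similar neighbors
    x = nodes / np.linalg.norm(nodes, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)                  # exclude self-loops
    nbrs = np.argsort(-sim, axis=1)[:, :k]          # top-k per node
    adj = np.zeros_like(sim, dtype=bool)
    rows = np.repeat(np.arange(len(nodes)), k)
    adj[rows, nbrs.ravel()] = True
    return adj

# two tight clusters of stand-in node features
pts = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
adj = knn_edges(pts, k=1)
```

In a dynamic graph network, this adjacency would be recomputed as node features evolve, so edges track the current neighborhood structure rather than a fixed one.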
InstructSee: Instruction-Aware and Feedback-Driven Multimodal Retrieval with Dynamic Query Generation
2025
In recent years, cross-modal retrieval has garnered significant attention due to its potential to bridge heterogeneous data modalities, particularly in aligning visual content with natural language. Despite notable progress, existing methods often struggle to accurately capture user intent when queries are expressed through complex or evolving instructions. To address this challenge, we propose a novel cross-modal representation learning framework that incorporates an instruction-aware dynamic query generation mechanism, augmented by the semantic reasoning capabilities of large language models (LLMs). The framework dynamically constructs and iteratively refines query representations conditioned on natural language instructions and guided by user feedback, thereby enabling the system to effectively infer and adapt to implicit retrieval intent. Extensive experiments on standard multimodal retrieval benchmarks demonstrate that our method significantly improves retrieval accuracy and adaptability, outperforming fixed-query baselines and showing enhanced cross-modal alignment and generalization across diverse retrieval tasks.
Journal Article