37 result(s) for "multimodal knowledge graph application"
A Survey on Multimodal Knowledge Graphs: Construction, Completion and Applications
As an essential part of artificial intelligence, a knowledge graph describes the real-world entities, concepts and their various semantic relationships in a structured way and has been gradually popularized in a variety of practical scenarios. The majority of existing knowledge graphs mainly concentrate on organizing and managing textual knowledge in a structured representation, while paying little attention to the multimodal resources (e.g., pictures and videos), which can serve as the foundation for the machine perception of a real-world data scenario. To this end, in this survey, we comprehensively review the related advances of multimodal knowledge graphs, covering multimodal knowledge graph construction, completion and typical applications. For construction, we outline the methods of named entity recognition, relation extraction and event extraction. For completion, we discuss the multimodal knowledge graph representation learning and entity linking. Finally, the mainstream applications of multimodal knowledge graphs in miscellaneous domains are summarized.
RoEMF: rotational embedding multimodal fusion for link prediction
Multimodal link prediction aims to identify missing head and tail entities in the relational triples of multimodal knowledge graphs. However, each modality contains distinct information, and how to effectively fuse multimodal data has become a complex challenge. To address this issue, the rotational embedding multimodal fusion (RoEMF) model was proposed based on rotary position encoding (RoPE). The model employs a multi-head cross-attention mechanism, combined with RoPE, to enhance the representation of positional and contextual information, thereby improving the multimodal data fusion. It focuses on integrating information from different subspaces while capturing cross-modal correlations to mitigate potential data loss, enhances feature fusion, and optimizes the heterogeneity of the representation. Additionally, the cross-modal joint decision loss was proposed to reduce the model’s reliance on single-modal data, aiding in the identification of missing head and tail entities, while enhancing the accuracy and generalization ability of multimodal link prediction. Experimental results on three public MMKG benchmarks demonstrate the outstanding performance of RoEMF compared with other methods in link prediction.
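The rotary position encoding (RoPE) that the abstract above builds on can be illustrated with a minimal sketch. This is generic RoPE only, not the RoEMF model: the function names `rope` and `dot`, the `base` value of 10000, and the toy vectors are assumptions chosen for illustration. It shows the property that makes RoPE useful inside attention: the dot product of two rotated vectors depends only on the difference between their positions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of `vec` by a position-dependent angle.

    Generic rotary position encoding; `base` is an assumed default.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even embedding dimension")
    out = []
    for i in range(0, d, 2):
        # Pair i/2 uses frequency base^(-i/d); angle grows linearly with position.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (hypothetical values).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Both pairs of positions are 4 apart, so the scores should match.
score_a = dot(rope(q, 3), rope(k, 7))
score_b = dot(rope(q, 13), rope(k, 17))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair is only rotated, the vector norm is unchanged; position information enters the attention score purely through the relative angle.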
CA-VLN: Collaborative Agents in MLLM-Powered Visual-Language Navigation
Generalization to unseen environments remains a fundamental challenge in Vision-Language Navigation. To tackle this issue, we propose a novel framework that leverages world knowledge embedded within Multimodal Large Language Models. We introduce Collaborative Agents in Visual-Language Navigation (CA-VLN), a framework based on a dual-agent architecture. This architecture comprises a Knowledge Agent, which infuses the action prediction process with semantic context and commonsense reasoning, and a Hierarchical History Agent, which constructs a detailed episodic memory to enable long-horizon planning. The collaboration between these agents facilitates a dynamic interplay between high-level semantic understanding and grounded episodic experience. Extensive experiments on the R2R, REVERIE and SOON datasets demonstrate that our model achieves state-of-the-art performance, significantly improving generalization and navigation success in previously unobserved environments.
MM-HGNN: Multimodal Representation Learning Heterogeneous Graph Neural Network
Multimodal learning on heterogeneous graphs is very challenging because of the diverse structures and data modalities. Existing graph neural networks cannot efficiently capture both the multimodality of the data and the inherent heterogeneity of such graphs. In this paper, we propose the Multimodal Representation Learning Heterogeneous Graph Neural Network (MM-HGNN) to tackle these challenges. MM-HGNN introduces a novel Modality Transferability Function to quantify the heterogeneity between different modalities, which allows the model to dynamically adjust the attention scores and give precedence to unique, non-redundant information. Additionally, it integrates modality-level attention that adaptively distributes attention over different modalities according to their relevance, enhancing feature representations for tasks such as node classification. To further improve representation learning, a splicing mechanism is proposed to integrate outputs from multiple network layers, combining high-level features for more expressive node embeddings. We validate the effectiveness of MM-HGNN through extensive experiments on the IMDB and Amazon datasets. Our model outperforms several state-of-the-art methods under the Macro-F1, Micro-F1, and AUC metrics by a large margin, which demonstrates its strong capability in dealing with challenging multimodal and heterogeneous data. Comprehensive ablation studies further emphasize the contributions of each key component in improving the overall performance.
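For readers unfamiliar with the Macro-F1 and Micro-F1 metrics named in the abstract above, the following is a minimal sketch of how the two differ for single-label classification. The function name `f1_scores` and the toy labels are assumptions for illustration and are not taken from the paper.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Macro: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(per_class) / len(per_class)
    # Micro: F1 computed from counts pooled over all classes.
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    return macro, micro

# Hypothetical toy labels: class "a" is three times as frequent as "b".
macro, micro = f1_scores(["a", "a", "a", "b"], ["a", "a", "b", "b"])
print(macro, micro)
```

On imbalanced data the two can diverge noticeably, which is one reason papers often report both.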
Game-on: graph attention network based multimodal fusion for fake news detection
Fake news being spread on social media platforms has a disruptive and damaging impact on our lives. Multimedia content improves the visibility of posts more than text data but is also being used for creating fake news. Previous multimodal works have tried to address the problem of modeling heterogeneous modalities in identifying fake news. However, these works have the following limitations: (1) inefficient encoding of inter-modal relations by utilizing a simple concatenation operator on the modalities at a later stage in a model, which might result in information loss; (2) training very deep neural networks with a disproportionate number of parameters on small multimodal datasets results in higher chances of overfitting. To address these limitations, we propose GAME-ON, a Graph Neural Network based end-to-end trainable framework that allows granular interactions within and across different modalities to learn more robust data representations for multimodal fake news detection. We use two publicly available fake news datasets, Twitter and Weibo, for evaluations. GAME-ON outperforms on Twitter by an average of 11% and achieves state-of-the-art performance on Weibo while using 91% fewer parameters than the best comparable state-of-the-art baseline. For deployment in real-world applications, GAME-ON can be used as a lightweight model (lower memory and latency requirements), which makes it more feasible than previous state-of-the-art models.
Deep learning for drug-drug interaction prediction: A comprehensive review
The prediction of drug-drug interactions (DDIs) is a crucial task for drug safety research, and identifying potential DDIs helps us to explore the mechanism behind combinatorial therapy. Traditional wet chemical experiments for DDI are cumbersome and time-consuming, and are too small in scale, limiting the efficiency of DDI predictions. Therefore, it is particularly crucial to develop improved computational methods for detecting drug interactions. With the development of deep learning, several computational models based on deep learning have been proposed for DDI prediction. In this review, we summarized the high-quality DDI prediction methods based on deep learning in recent years, and divided them into four categories: neural network-based methods, graph neural network-based methods, knowledge graph-based methods, and multimodal-based methods. Furthermore, we discuss the challenges of existing methods and future potential perspectives. This review reveals that deep learning can significantly improve DDI prediction performance compared to traditional machine learning. Deep learning models can scale to large-scale datasets and accept multiple data types as input, thus making DDI predictions more efficient and accurate.
Multimodal heterogeneous graph fusion for automated obstructive sleep apnea-hypopnea syndrome diagnosis
Polysomnography is the diagnostic gold standard for obstructive sleep apnea-hypopnea syndrome (OSAHS), requiring medical professionals to analyze apnea-hypopnea events from multidimensional data throughout the sleep cycle. This complex process is susceptible to variability based on the clinician’s experience, leading to potential inaccuracies. Existing automatic diagnosis methods often overlook multimodal physiological signals and medical prior knowledge, leading to limited diagnostic capabilities. This study presents a novel heterogeneous graph convolutional fusion network (HeteroGCFNet) leveraging multimodal physiological signals and domain knowledge for automated OSAHS diagnosis. This framework constructs two types of graph representations: physical space graphs, which map the spatial layout of sensors on the human body, and process knowledge graphs, which detail the physiological relationships among breathing patterns, oxygen saturation, and vital signals. The framework leverages heterogeneous graph convolutional neural networks to extract both localized and global features from these graphs. Additionally, a multi-head fusion module combines these features into a unified representation for effective classification, enhancing focus on relevant signal characteristics and cross-modal interactions. This study evaluated the proposed framework on a large-scale OSAHS dataset, combined from publicly available sources and data provided by a collaborative university hospital. It demonstrated superior diagnostic performance compared to conventional machine learning models and existing deep learning approaches, effectively integrating domain knowledge with data-driven learning to produce explainable representations and robust generalization capabilities, which can potentially be utilized for clinical use. Code is available at https://github.com/AmbitYuki/HeteroGCFNet.
Owner name entity recognition in websites based on heterogeneous and dynamic graph transformer
Identifying owners of devices on the Internet can enable numerous network security applications. For example, accurate Owner Name Entity Recognition (ONER) of websites is critical to find influenced owners in light of new security threats. In this situation, as a specific task of Multimodal Named Entity Recognition (MNER), ONER is essential and helpful for network security. Currently, most existing MNER models only use texts and images, so they cannot effectively utilize the multimodal data of devices to achieve ONER accurately. Also, most of the existing MNER models separately use information in each modality and between modalities. Thus, the fusion is inconsistent and the results are unsatisfactory. Therefore, the paper proposes HDGT, a Heterogeneous and Dynamic Graph Transformer, to improve the performance of ONER. The core components in HDGT to realize MNER are a dynamic graph and a two-stream mechanism, which can learn the relationship between different modalities during training as well as the graph’s structure. The paper manually labels a multimodal dataset containing texts, images, and domains to prove the performance of HDGT. Also, the paper conducts experiments on existing and public MNER datasets. The results show that HDGT achieves an F1 score of 84.88% on the recognition of owner entities, 75.21% F1 on Twitter2015, and 87.03% F1 on Twitter2017, which outperforms other existing MNER models.
Bidirectional transformer with knowledge graph for video captioning
Models based on the transformer architecture have risen to prominence for video captioning. However, most models improve only the encoder or only the decoder, because when the encoder and decoder are improved simultaneously, the shortcomings of either side may be amplified. Based on the transformer architecture, we connect a bidirectional decoder and an encoder that integrates fine-grained spatio-temporal features, objects, and relationships between the objects in the video. Experiments show that improvements in the encoder amplify the information leakage of the bidirectional decoder and further produce a worse result. To tackle this problem, we generate pseudo reverse captions and propose a Bidirectional Transformer with Knowledge Graph (BTKG), which integrates the outputs of two encoders into the forward and backward decoders of the bidirectional decoder, respectively. In addition, we make fine-grained improvements on the interior of the different encoders according to four modal features of the video. Experiments on two mainstream benchmark datasets, i.e., MSVD and MSR-VTT, demonstrate the effectiveness of BTKG, which achieves state-of-the-art performance in significant metrics. Moreover, the sentences generated by BTKG contain scene words and modifiers that are more in line with human language habits. Code is available at https://github.com/nickchen121/BTKG.
Multimodal deep representation learning for protein interaction identification and protein family classification
Background: Protein-protein interactions (PPIs) are constantly involved in dynamic pathological and biological processes. Thus, it is crucial to comprehend PPIs thoroughly so that we are able to illuminate disease occurrence, achieve the optimal drug-target therapeutic effect and describe protein complex structures. However, compared to the protein sequences obtainable from various species and organisms, the number of revealed protein-protein interactions is relatively limited. To address this dilemma, many research efforts have sought to facilitate the discovery of novel PPIs. Among these methods, PPI prediction techniques that rely merely on protein sequence data are more widespread than other methods which require extensive biological domain knowledge. Results: In this paper, we propose a multi-modal deep representation learning structure by incorporating protein physicochemical features with the graph topological features from the PPI networks. Specifically, our method not only takes the protein sequence information into account but also learns the topological representations for each protein node in the PPI networks. We construct a stacked auto-encoder architecture together with a continuous bag-of-words (CBOW) model based on generated metapaths to study the PPI predictions. Following that, we utilize supervised deep neural networks to identify the PPIs and classify the protein families. The PPI prediction accuracy for eight species ranged from 96.76% to 99.77%, which signifies that our multi-modal deep representation learning framework achieves superior performance compared to other computational methods. Conclusion: To the best of our knowledge, this is the first multi-modal deep representation learning framework for examining the PPI networks.