Catalogue Search | MBRL

Building Semantic Knowledge Graphs from (Semi-)Structured Data: A Review

by Roman, Dumitru; Soylu, Ahmet; Ryen, Vetle in Data retrieval, Data sources, Datasets

2022

Knowledge graphs have, for the past decade, been a hot topic both in public and private domains, typically used for large-scale integration and analysis of data using graph-based data models. One of the central concepts in this area is the Semantic Web, with the vision of providing a well-defined meaning to information and services on the Web through a set of standards. Particularly, linked data and ontologies have been quite essential for data sharing, discovery, integration, and reuse. In this paper, we provide a systematic literature review on knowledge graph creation from structured and semi-structured data sources using Semantic Web technologies. The review takes into account four prominent publication venues, namely, Extended Semantic Web Conference, International Semantic Web Conference, Journal of Web Semantics, and Semantic Web Journal. The review highlights the tools, methods, types of data sources, ontologies, and publication methods, together with the challenges, limitations, and lessons learned in the knowledge graph creation processes.

Journal Article

Improving the Quality of Linked Data Using Statistical Distributions

by Bizer, Christian; Paulheim, Heiko in Algorithms, Analysis, Construction

2014

Linked Data on the Web is either created from structured data sources (such as relational databases), from semi-structured sources (such as Wikipedia), or from unstructured sources (such as text). In the latter two cases, the generated Linked Data will likely be noisy and incomplete. In this paper, we present two algorithms that exploit statistical distributions of properties and types for enhancing the quality of incomplete and noisy Linked Data sets: SDType adds missing type statements, and SDValidate identifies faulty statements. Neither of the algorithms uses external knowledge, i.e., they operate only on the data itself. We evaluate the algorithms on the DBpedia and NELL knowledge bases, showing that they are both accurate and scalable. Both algorithms have been used for building the DBpedia 3.9 release: with SDType, 3.4 million missing type statements have been added, while using SDValidate, 13,000 erroneous RDF statements have been removed from the knowledge base.
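The idea of inferring missing types from the statistical distribution of properties can be illustrated with a small sketch. This is not the authors' SDType implementation: the function names, the unweighted averaging of per-property distributions, and the 0.5 threshold are all assumptions made for illustration, and only the subject position of each triple is considered.

```python
from collections import Counter, defaultdict


def type_distributions(triples, types):
    """For each property, estimate P(type | resource is a subject of that property).

    triples: iterable of (subject, property, object) tuples.
    types:   dict mapping already-typed resources to a set of type names.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for subject, prop, _obj in triples:
        if subject in types:
            totals[prop] += 1
            for type_name in types[subject]:
                counts[prop][type_name] += 1
    return {
        prop: {t: c / totals[prop] for t, c in counts[prop].items()}
        for prop in counts
    }


def predict_types(resource, triples, dists, threshold=0.5):
    """Average the per-property distributions over the resource's properties.

    Assumption: every property votes with equal weight (a simplification).
    """
    props = [p for s, p, _o in triples if s == resource and p in dists]
    if not props:
        return {}
    scores = Counter()
    for prop in props:
        for type_name, prob in dists[prop].items():
            scores[type_name] += prob / len(props)
    return {t: s for t, s in scores.items() if s >= threshold}


# Hypothetical toy data: "c" has no type statement yet.
triples = [
    ("a", "birthPlace", "x"),
    ("b", "birthPlace", "y"),
    ("c", "birthPlace", "z"),
]
types = {"a": {"Person"}, "b": {"Person"}}

dists = type_distributions(triples, types)
predicted = predict_types("c", triples, dists)
```

In this toy data, every typed subject of `birthPlace` is a `Person`, so the untyped resource `c` receives that type with a score of 1.0.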

Journal Article

Counterfactual Learning on Graphs: A Survey

in Graph neural networks, Graphical representations, Graphs

2025

Graph-structured data are pervasive in the real world, for example in social networks, molecular graphs and transaction networks. Graph neural networks (GNNs) have achieved great success in representation learning on graphs, facilitating various downstream tasks. However, GNNs have several drawbacks: they lack interpretability, can easily inherit bias from the data, and cannot model causal relations. Recently, counterfactual learning on graphs has shown promising results in alleviating these drawbacks. Various approaches have been proposed for counterfactual fairness, explainability, link prediction and other applications on graphs. To facilitate the development of this promising direction, in this survey, we categorize and comprehensively review papers on graph counterfactual learning. We divide existing methods into four categories based on problems studied. For each category, we provide background and motivating examples, a general framework summarizing existing works and a detailed review of these works. We point out promising future research directions at the intersection of graph-structured data, counterfactual learning, and real-world applications. To offer a comprehensive view of resources for future studies, we compile a collection of open-source implementations, public datasets, and commonly-used evaluation metrics. This survey aims to serve as a “one-stop-shop” for building a unified understanding of graph counterfactual learning categories and current resources.

Journal Article

Robust and scalable content-and-structure indexing

by Zacchiroli, Stefano; Wellenzohn, Kevin; Pietri, Antoine in Archives & records, Computer Science, Data models

2023

Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To get an index that is robust against queries with varying selectivities, we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS’s robustness and scalability by indexing data from the Software Heritage (SWH) archive, which is the world’s largest, publicly available source code archive.

Journal Article

A hyperbolic approach for learning communities on graphs

by Hajri, Hatem; Gerald, Thomas; Baskiotis, Nicolas in Algorithms, Clustering, Embedding

2023

Detecting communities on graphs has received significant interest in recent literature. Current state-of-the-art approaches tackle this problem by coupling Euclidean graph embedding with community detection. Considering the success of hyperbolic representations of graph-structured data in recent years, an ongoing challenge is to set up a hyperbolic approach to the community detection problem. The present paper meets this challenge by introducing a Riemannian geometry based framework for learning communities on graphs. The proposed methodology combines graph embedding on hyperbolic spaces with Riemannian K-means or Riemannian mixture models to perform community detection. The usefulness of this framework is illustrated through several experiments on generated community graphs and real-world social networks as well as comparisons with the most powerful baselines. The code implementing hyperbolic community embedding is available online at https://www.github.com/tgeral68/HyperbolicGraphAndGMM.
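Clustering in a hyperbolic space replaces the Euclidean distance with a geodesic distance. As a minimal sketch, assuming the Poincaré ball model (the abstract does not say which model of hyperbolic space is used), the standard distance between two points inside the unit ball can be computed as follows. The function name is illustrative and is not taken from the linked repository.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points in the open unit ball.

    Uses the standard Poincaré ball formula:
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Distance from the origin to a point halfway to the boundary.
d = poincare_distance((0.0, 0.0), (0.5, 0.0))
```

Distances grow without bound as points approach the boundary of the ball, which is what allows tree-like graphs to be embedded with low distortion.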

Journal Article

Semantic enrichment for enhancing LAM data and supporting digital humanities. Review article

by Zeng, Marcia Lei in Cultural heritage, Data, Data quality

2019

With the rapid development of the digital humanities (DH) field, demands for historical and cultural heritage data have generated deep interest in the data provided by libraries, archives, and museums (LAMs). In order to enhance LAM data’s quality and discoverability while enabling a self-sustaining ecosystem, “semantic enrichment” has become a strategy increasingly used by LAMs in recent years. This article introduces a number of semantic enrichment methods and efforts that can be applied to LAM data at various levels, aiming to support deeper and wider exploration and use of LAM data in DH research. The real cases, research projects, experiments, and pilot studies shared in this article demonstrate endless potential for LAM data, whether they are structured, semi-structured, or unstructured, regardless of what types of original artifacts carry the data. Following their roadmaps would encourage more effective initiatives and strengthen this effort to maximize LAM data’s discoverability, use- and reuse-ability, and their value in the mainstream of DH and Semantic Web.

Journal Article

The ontological politics of synthetic data: Normalities, outliers, and intersectional hallucinations

by Johnson, Ericka; Lee, Francis; Hajisharif, Saghi in Classification, Data, data bias

2025

Synthetic data is increasingly used as a substitute for real data due to ethical, legal, and logistical reasons. However, the rise of synthetic data also raises critical questions about its entanglement with the politics of classification and the reproduction of social norms and categories. This paper aims to problematize the use of synthetic data by examining how its production is intertwined with the maintenance of certain worldviews and classifications. We argue that synthetic data, like real data, is embedded with societal biases and power structures, leading to the reproduction of existing social inequalities. Through empirical examples, we demonstrate how synthetic data tends to highlight majority elements as the “normal” and minimize minority elements, and that the slight changes to the data structures that create synthetic data will also inevitably result in what we term “intersectional hallucinations.” These hallucinations are inherent to synthetic data and cannot be entirely eliminated without compromising the purpose of creating synthetic datasets. We contend that decisions about synthetic data involve determining which intersections are essential and which can be disregarded, a practice which will imbue these decisions with norms and values. Our study underscores the need for critical engagement with the mathematical and statistical choices in synthetic data production and advocates for careful consideration of the ontological and political implications of these choices during curatorial style production of synthetic structured data.

Journal Article

Unsupervised approach to text line extraction in Belfort civil registers of births

by Heyberger, Laurent; Gechter, Franck; Guyeux, Christophe in 20th century, Annotations, Births

2025

Historical documents are invaluable resources for understanding the development of civilizations and cultures. However, the transcription process of these documents poses many challenges, such as complex layouts, degradation, varied handwriting styles, and skewed text. This paper presents an unsupervised approach for text line extraction in the Belfort Civil Registers of Births, a historical dataset containing a mix of printed and handwritten text with marginal annotations. The proposed method employs a series of image processing techniques to identify text line cores. The method also utilizes a dynamic gap identification and segment point localization strategy based on text density and histogram analysis to effectively identify the borders of the text lines in polygon shape. An XML file generation tool is then utilized to structure the resulting components and link them with their corresponding text. The method exhibits competitive accuracy in segmenting text lines on both the Belfort dataset and standard benchmarks such as the Saint Gall and READ Bozen datasets. This work contributes to the preservation and accessibility of historical documents by facilitating accurate transcription and structured data representation.

Journal Article

Logformer: Cascaded Transformer for System Log Anomaly Detection

by Xie, Linjiang; Liu, Yao; Zhou, Chenghao in Algorithms, Anomalies, Embedding

2023

Modern large-scale enterprise systems produce large volumes of logs that record detailed system runtime status and key events at key points. These logs are valuable for analyzing performance issues and understanding the status of the system. Anomaly detection plays an important role in service management and system maintenance, and guarantees the reliability and security of online systems. Logs are universal semi-structured data, which causes difficulties for traditional manual detection and pattern-matching algorithms. While some deep learning algorithms utilize neural networks to detect anomalies, these approaches have an over-reliance on manually designed features, resulting in the effectiveness of anomaly detection depending on the quality of the features. At the same time, the aforementioned methods ignore the underlying contextual information present in adjacent log entries. We propose a novel model called Logformer with two cascaded transformer-based heads to capture latent contextual information from adjacent log entries, and leverage pre-trained embeddings based on logs to improve the representation of the embedding space. The proposed model achieves comparable results on HDFS and BGL datasets in terms of metric accuracy, recall and F1-score. Moreover, the consistent rise in F1-score proves that the representation of the embedding space with pre-trained embeddings is closer to the semantic information of the log.

Journal Article

Graph-structured data generation and analysis for anomaly detection in an automated manufacturing process

by Gao, Xinpu; Yang, Jeongsam; Kim, Namki in Algorithms, Analog data, Anomalies

2024

During automated manufacturing processes, multiple sensors are attached to facilities to collect and analyze analog data for detecting operational anomalies. However, owing to facility devices being interlinked by a control system, simultaneous examination of the control system and analog data enhances the accuracy of anomaly detection and diagnosis of root causes. We proposed a system detecting anomalies by integrating an internal control system with external analog data and representing it in a graph structure. The system generates and combines the adjacency and feature matrices for training a convolutional autoencoder model to identify operational anomalies. Performance tests revealed distinct operational patterns in the cycle data flagged by the model as anomalies. The system diagnosed the root cause of anomalies, such as control operation sequencing, timing variances, and shifts in analog or video signals. This approach may enhance the productivity and quality of the manufacturing processes by facilitating anomaly detection and cause diagnosis.
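Representing interlinked devices as a graph typically means building an adjacency matrix (which devices are connected) and a feature matrix (one row of measurements per device). The sketch below shows one possible way to build and combine the two. The device names, the feature values, the undirected edges, and the row-wise concatenation are all hypothetical assumptions for illustration; the abstract does not specify how the authors combine the matrices.

```python
def build_matrices(nodes, edges, features):
    """Build adjacency and feature matrices and concatenate them row by row.

    nodes:    ordered list of device names.
    edges:    list of (device, device) pairs, treated as undirected (assumption).
    features: dict mapping each device name to a list of numeric readings.
    """
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    adjacency = [[0] * n for _ in range(n)]
    for src, dst in edges:
        adjacency[index[src]][index[dst]] = 1
        adjacency[index[dst]][index[src]] = 1
    feature_matrix = [list(features[name]) for name in nodes]
    # Assumption: "combining" is modelled as [A | X], one row per device.
    combined = [a_row + f_row for a_row, f_row in zip(adjacency, feature_matrix)]
    return adjacency, feature_matrix, combined


# Hypothetical three-device cell: a controller wired to a motor and a sensor.
nodes = ["plc", "motor", "sensor"]
edges = [("plc", "motor"), ("plc", "sensor")]
features = {"plc": [1.0, 0.0], "motor": [0.4, 12.5], "sensor": [0.9, 3.3]}

adjacency, feature_matrix, combined = build_matrices(nodes, edges, features)
```

A matrix of this shape, produced once per production cycle, could then serve as the input to an autoencoder, with a high reconstruction error flagging an anomalous cycle.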

Journal Article
