Catalogue Search | MBRL
Explore the vast range of titles available.
73 result(s) for "Halevy, Alon"
Principles of data integration
Principles of Data Integration is the first comprehensive textbook of data integration, covering theoretical principles and implementation issues as well as current challenges raised by the semantic web and cloud computing. The book offers a range of data integration solutions, enabling you to focus on what is most relevant to the problem at hand.
Semantic‐Integration Research in the Database Community: A Brief Survey
2005
Semantic integration has been a long‐standing challenge for the database community. It has received steady attention over the past two decades, and has now become a prominent area of database research. In this article, we first review database applications that require semantic integration and discuss the difficulties underlying the integration process. We then describe recent progress and identify open research issues. We focus in particular on schema matching, a topic that has received much attention in the database community, but also discuss data matching (for example, tuple deduplication) and open issues beyond the match discovery context (for example, reasoning with matches, match verification and repair, and reconciling inconsistent data values). For previous surveys of database research on semantic integration, see Rahm and Bernstein (2001); Ouksel and Sheth (1999); and Batini, Lenzerini, and Navathe (1986).
Journal Article
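As a concrete instance of the schema matching this survey focuses on, here is a minimal sketch that scores attribute correspondences by name similarity alone; the schemas and threshold are invented for illustration, and real matchers also exploit data instances, types, and structure.

```python
from difflib import SequenceMatcher

def match_schemas(source_attrs, target_attrs, threshold=0.5):
    """Naive name-based schema matcher: propose source -> target
    attribute correspondences whose string similarity clears a
    threshold. Real matchers also use data values and structure."""
    matches = []
    for s in source_attrs:
        for t in target_attrs:
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score >= threshold:
                matches.append((s, t, round(score, 2)))
    return sorted(matches, key=lambda m: -m[2])

# Hypothetical schemas from two real-estate listing sources.
print(match_schemas(["location", "listed-price", "agent_name"],
                    ["address", "price", "agent"]))
# [('agent_name', 'agent', 0.67), ('listed-price', 'price', 0.59)]
```

Note that pure name similarity misses correspondences such as location to address, which is precisely why the literature surveyed here moves to instance-based and learning-based matchers.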
Learning to Match the Schemas of Data Sources: A Multistrategy Approach
2003
The problem of integrating data from multiple data sources--either on the Internet or within enterprises--has received much attention in the database and AI communities. The focus has been on building data integration systems that provide a uniform query interface to the sources. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the query interface and the source schemas. Examples of mappings are "element location maps to address" and "price maps to listed-price". We propose a multistrategy learning approach to automatically find such mappings. The approach applies multiple learner modules, where each module exploits a different type of information either in the schemas of the sources or in their data, then combines the predictions of the modules using a meta-learner. Learner modules employ a variety of techniques, ranging from Naive Bayes and nearest-neighbor classification to entity recognition and information retrieval. We describe the LSD system, which employs this approach to find semantic mappings. To further improve matching accuracy, LSD exploits domain integrity constraints, user feedback, and nested structures in XML data. We test LSD experimentally on several real-world domains. The experiments validate the utility of multistrategy learning for data integration and show that LSD proposes semantic mappings with a high degree of accuracy.
Journal Article
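A minimal sketch of the multistrategy idea described above: each base learner returns per-label confidences for a source schema element, and a meta-learner combines them with weights (learned from training data in LSD; hard-coded here for illustration). All module names and numbers are hypothetical.

```python
from collections import defaultdict

def combine_predictions(predictions, weights):
    """Meta-learner in the spirit of LSD's multistrategy approach:
    each base module returns {target_label: confidence} for one
    source element; combine as a weighted sum and pick the best."""
    scores = defaultdict(float)
    for module, preds in predictions.items():
        for label, conf in preds.items():
            scores[label] += weights[module] * conf
    return max(scores.items(), key=lambda kv: kv[1])

# Hypothetical outputs of two base learners for one source element.
preds = {
    "name_matcher": {"address": 0.2, "listed-price": 0.7},
    "naive_bayes":  {"address": 0.1, "listed-price": 0.8},
}
weights = {"name_matcher": 0.4, "naive_bayes": 0.6}
print(combine_predictions(preds, weights))
# best label: 'listed-price' (score ≈ 0.76)
```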
Factuality challenges in the era of large language models and opportunities for fact-checking
by Ciampaglia, Giovanni Luca; Chakraborty, Tanmoy; DiResta, Renee
in Access to information
2024
The emergence of tools based on large language models (LLMs), such as OpenAI’s ChatGPT and Google’s Gemini, has garnered immense public attention owing to their advanced natural language generation capabilities. These remarkably natural-sounding tools have the potential to be highly useful for various tasks. However, they also tend to produce false, erroneous or misleading content—commonly referred to as hallucinations. Moreover, LLMs can be misused to generate convincing, yet false, content and profiles on a large scale, posing a substantial societal challenge by potentially deceiving users and spreading inaccurate information. This makes fact-checking increasingly important. Despite their issues with factual accuracy, LLMs have shown proficiency in various subtasks that support fact-checking, which is essential to ensure factually accurate responses. In light of these concerns, we explore issues related to factuality in LLMs and their impact on fact-checking. We identify key challenges, imminent threats and possible solutions to these factuality issues. We also thoroughly examine these challenges, existing solutions and potential prospects for fact-checking. By analysing the factuality constraints within LLMs and their impact on fact-checking, we aim to contribute to a path towards maintaining accuracy at a time of confluence of generative artificial intelligence and misinformation.
Large language models (LLMs) present challenges, including a tendency to produce false or misleading content and the potential to create misinformation or disinformation. Augenstein and colleagues explore issues related to factuality in LLMs and their impact on fact-checking.
Journal Article
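One of the fact-checking subtasks the article alludes to, claim verification against retrieved evidence, can be sketched as a thin wrapper around any text-in/text-out model; the prompt wording and label set below are assumptions for illustration, not taken from the paper.

```python
from typing import Callable

PROMPT = """Claim: {claim}

Evidence:
{evidence}

Based only on the evidence above, answer with exactly one of:
SUPPORTED, REFUTED, NOT ENOUGH INFO."""

def verify_claim(claim: str, evidence: list[str],
                 llm: Callable[[str], str]) -> str:
    """Claim verification, one subtask where LLMs can assist
    fact-checking. `llm` is any text-in/text-out model call;
    the prompt and labels are illustrative, not from the paper."""
    prompt = PROMPT.format(
        claim=claim,
        evidence="\n".join(f"- {e}" for e in evidence))
    verdict = llm(prompt).strip().upper()
    return verdict if verdict in {"SUPPORTED", "REFUTED",
                                  "NOT ENOUGH INFO"} else "NOT ENOUGH INFO"

# Toy stand-in model, so the sketch runs without an API.
print(verify_claim("The moon is made of cheese",
                   ["Apollo samples show the moon is rock."],
                   lambda p: "REFUTED"))
```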
Semantic Integration
by Doan, AnHai; Noy, Natalya F.; Halevy, Alon Y.
in Artificial intelligence; Common sense; Data exchange
2005
Sharing data across disparate sources requires solving many problems of semantic integration, such as matching ontologies or schemas, detecting duplicate tuples, reconciling inconsistent data values, modeling complex relations between concepts in different sources, and reasoning with semantic mappings. This issue of AI Magazine includes papers that discuss various methods on establishing mappings between ontology elements or data fragments. The collection includes papers that discuss semantic‐integration issues in such contexts as data integration and web services. The issue also includes a brief survey of semantic‐integration research in the database community.
Journal Article
Identifying Aspects for Web-Search Queries
by Wu, F.; Halevy, A.; Madhavan, J.
in Artificial intelligence; Clusters; Knowledge bases (artificial intelligence)
2011
Many web-search queries serve as the beginning of an exploration of an unknown space of information, rather than looking for a specific web page. To answer such queries effectively, the search engine should attempt to organize the space of relevant information in a way that facilitates exploration. We describe the Aspector system that computes aspects for a given query. Each aspect is a set of search queries that together represent a distinct information need relevant to the original search query. To serve as an effective means to explore the space, Aspector computes aspects that are orthogonal to each other and have high combined coverage. Aspector combines two sources of information to compute aspects. We discover candidate aspects by analyzing query logs, and cluster them to eliminate redundancies. We then use a mass-collaboration knowledge base (e.g., Wikipedia) to compute candidate aspects for queries that occur less frequently and to group together aspects that are likely to be semantically related. We present a user study that indicates that the aspects we compute are rated favorably against three competing alternatives: related searches proposed by Google, cluster labels assigned by the Clusty search engine, and navigational searches proposed by Bing.
Journal Article
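A toy rendering of Aspector's query-log stage as described in the abstract: mine refinements of the base query from a log, then greedily drop near-duplicates. The log, counts, and similarity threshold are invented; the real system additionally uses Wikipedia for rare queries.

```python
from collections import Counter
from difflib import SequenceMatcher

def candidate_aspects(query, log, min_count=2, sim=0.75):
    """Sketch of Aspector's first stage: mine refinements of `query`
    from a query log, then greedily merge near-duplicate candidates."""
    suffixes = Counter(
        q[len(query):].strip() for q in log
        if q.startswith(query) and len(q) > len(query)
    )
    aspects = []
    for cand, n in suffixes.most_common():
        if n < min_count:
            break
        # Keep only candidates dissimilar to every kept aspect.
        if all(SequenceMatcher(None, cand, a).ratio() < sim
               for a in aspects):
            aspects.append(cand)
    return aspects

# Invented toy query log.
log = ["vietnam travel visa", "vietnam travel visas",
       "vietnam travel visa", "vietnam weather", "vietnam weather",
       "vietnam food", "vietnam food", "paris hotels"]
print(candidate_aspects("vietnam", log))
# -> ['travel visa', 'weather', 'food']
```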
Multimodal Neural Databases
by Silvestri, Fabrizio; Rodolà, Emanuele; Trappolini, Giovanni
in Information retrieval; Multimedia; Queries
2023
The rise in loosely-structured data available through text, images, and other modalities has called for new ways of querying them. Multimedia Information Retrieval has filled this gap and has witnessed exciting progress in recent years. Tasks such as search and retrieval of extensive multimedia archives have undergone massive performance improvements, driven to a large extent by recent developments in multimodal deep learning. However, methods in this field remain limited in the kinds of queries they support; in particular, they are unable to answer database-like queries. For this reason, inspired by recent work on neural databases, we propose a new framework, which we name Multimodal Neural Databases (MMNDBs). MMNDBs can answer complex database-like queries that involve reasoning over different input modalities, such as text and images, at scale. In this paper, we present the first architecture able to fulfill this set of requirements and test it with several baselines, showing the limitations of currently available models. The results show the potential of these new techniques to process unstructured data coming from different modalities, paving the way for future research in the area. Code to replicate the experiments will be released at https://github.com/GiovanniTRA/MultimodalNeuralDatabases
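To make "database-like queries over modalities" concrete, here is a minimal sketch of one such operator, a COUNT whose WHERE clause is a learned predicate; in an actual MMNDB the predicate would be a multimodal model, so a plain callable stands in for it here. All names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Item:
    kind: str        # "text" or "image"
    payload: object  # raw text, or image bytes/array

def neural_count(items: Iterable[Item],
                 predicate: Callable[[Item], bool]) -> int:
    """COUNT with a learned WHERE clause, one of the database-like
    operators MMNDBs target. In a real system the predicate is a
    multimodal model scoring each item against a natural-language
    condition; a plain callable stands in here."""
    return sum(1 for item in items if predicate(item))

# Toy usage: a trivial keyword predicate in place of the model.
items = [Item("text", "a red car parked outside"),
         Item("text", "two dogs on a beach"),
         Item("text", "a red bicycle")]
print(neural_count(items, lambda it: "red" in str(it.payload)))  # 2
```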
Leveraging LLMs to Create Content Corpora for Niche Domains
2025
Constructing specialized content corpora from vast, unstructured web sources for domain-specific applications poses substantial data curation challenges. In this paper, we introduce a streamlined approach for generating high-quality, domain-specific corpora by efficiently acquiring, filtering, structuring, and cleaning web-based data. We showcase how Large Language Models (LLMs) can be leveraged to address complex data curation at scale, and propose a strategic framework incorporating LLM-enhanced techniques for structured content extraction and semantic deduplication. We validate our approach in the behavior education domain through its integration into 30 Day Me, a habit-formation application. Our data pipeline, named 30DayGen, enabled the extraction and synthesis of 3,531 unique 30-day challenges from over 15K webpages. A user survey reports a satisfaction score of 4.3 out of 5, with 91% of respondents indicating willingness to use the curated content for their habit-formation goals.
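The semantic-deduplication step mentioned in the abstract can be sketched as a greedy pass over embeddings; the encoder, threshold, and toy data below are assumptions for illustration rather than details of the 30DayGen pipeline.

```python
import numpy as np

def semantic_dedup(texts, embed, threshold=0.8):
    """Greedy semantic deduplication: keep a text only if its
    embedding is not too close (cosine similarity) to anything
    already kept. `embed` is any text -> vector encoder."""
    kept, kept_vecs = [], []
    for t in texts:
        v = np.asarray(embed(t), dtype=float)
        v = v / np.linalg.norm(v)
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept.append(t)
            kept_vecs.append(v)
    return kept

# Toy encoder (character histogram); a real pipeline would use a
# sentence-embedding model.
def toy_embed(t):
    v = np.zeros(26)
    for c in t.lower():
        if c.isalpha():
            v[ord(c) - ord("a")] += 1
    return v

print(semantic_dedup(["drink water daily", "drink water every day",
                      "read 10 pages"], toy_embed))
# -> ['drink water daily', 'read 10 pages']
```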
Learnings from Data Integration for Augmented Language Models
2023
One of the limitations of large language models is that they do not have access to up-to-date, proprietary or personal data. As a result, there are multiple efforts to extend language models with techniques for accessing external data. In that sense, LLMs share the vision of data integration systems whose goal is to provide seamless access to a large collection of heterogeneous data sources. While the details and the techniques of LLMs differ greatly from those of data integration, this paper shows that some of the lessons learned from research on data integration can elucidate the research path we are conducting today on language models.
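The analogy the paper draws can be made concrete with a sketch of a mediator-style router for an augmented language model: a routing function (in practice an LLM prompt or a trained classifier) selects which wrapped source answers a question, mirroring source selection in classic data integration architectures. All names here are illustrative.

```python
from typing import Callable

def mediate(question: str,
            sources: dict[str, Callable[[str], str]],
            route: Callable[[str, list[str]], str]) -> str:
    """Data-integration-style mediator for an augmented LM: `route`
    picks which registered source can answer the question, then the
    answer is fetched from that source's wrapper."""
    choice = route(question, list(sources))
    return sources[choice](question)

# Toy wrappers and a keyword router standing in for learned routing.
sources = {
    "calendar": lambda q: "meeting at 10am",
    "web":      lambda q: "top search result ...",
}
answer = mediate("when is my next meeting?", sources,
                 lambda q, names: "calendar" if "meeting" in q else "web")
print(answer)  # meeting at 10am
```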