Catalogue Search | MBRL

Natural Questions: A Benchmark for Question Answering Research

بواسطة Parikh, Ankur , Uszkoreit, Jakob , Redfield, Olivia في Aggregate data , Annotations , Answers

2019

We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotated sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.

Journal Article

شارك هذا الكتاب

أضف إلى رفتي

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

بواسطة Berant, Jonathan , Roth, Dan , Geva, Mor في Answers , Benchmarks , Computational linguistics

2021

A key limitation in current datasets for is that the required steps for answering the question are mentioned in it . In this work, we introduce S QA, a question answering (QA) benchmark where the required reasoning steps are in the question, and should be inferred using a . A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Moreover, we annotate each question with (1) a decomposition into reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the answers to each step. Overall, S QA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Analysis shows that questions in S QA are short, topic-diverse, and cover a wide range of strategies. Empirically, we show that humans perform well (87%) on this task, while our best baseline reaches an accuracy of ∼ 66 .

Journal Article

شارك هذا الكتاب

أضف إلى رفتي

CoQA: A Conversational Question Answering Challenge

بواسطة Chen, Danqi , Reddy, Siva , Manning, Christopher D. في Answers , Comprehension , Computational linguistics

2019

Humans gather information through conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. We introduce CoQA, a novel dataset for building nversational uestion nswering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets (e.g., coreference and pragmatic reasoning). We evaluate strong dialogue and reading comprehension models on CoQA. The best system obtains an F1 score of 65.4%, which is 23.4 points behind human performance (88.8%), indicating that there is ample room for improvement. We present CoQA as a challenge to the community at .

Journal Article

شارك هذا الكتاب

أضف إلى رفتي

How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering

بواسطة Ding, Haibo , Araki, Jun , Neubig, Graham في Answers , Calibration , Common sense

2021

Recent works have shown that language models (LM) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question, “How can we know when language models know, with confidence, the answer to a particular query?” We examine this question from the point of view of , the property of a probabilistic model’s predicted probabilities actually being well correlated with the probabilities of correctness. We examine three strong generative models—T5, BART, and GPT-2—and study whether their probabilities on QA tasks are well calibrated, finding the answer is a relatively emphatic . We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness through fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of our methods. We also perform analysis to study the strengths and limitations of these methods, shedding light on further improvements that may be made in methods for calibrating LMs. We have released the code at .

Journal Article

شارك هذا الكتاب

أضف إلى رفتي

FeTaQA: Free-form Table Question Answering

بواسطة Mao, Ziming , Rosand, Ben , Kong, Riley في Answers , Comprehension , Datasets

2022

Existing table question answering datasets contain abundant factual questions that primarily evaluate a QA system’s comprehension of query and tabular data. However, restricted by their short-form answers, these datasets fail to include question–answer interactions that represent more advanced and naturally occurring information needs: questions that ask for reasoning and integration of information pieces retrieved from a structured knowledge source. To complement the existing datasets and to reveal the challenging nature of the table-based question answering task, we introduce FeTaQA, a new dataset with 10K Wikipedia-based , , , pairs. FeTaQA is collected from noteworthy descriptions of Wikipedia tables that contain information people tend to seek; generation of these descriptions requires advanced processing that humans perform on a daily basis: Understand the question and table, retrieve, integrate, infer, and conduct text planning and surface realization to generate an answer. We provide two benchmark methods for the proposed task: a pipeline method based on semantic parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both methods.

Journal Article

شارك هذا الكتاب

أضف إلى رفتي

PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

بواسطة Wu, Yuxiang , Piktus, Aleksandra , Minervini, Pasquale في Accuracy , Answers , Computational linguistics

2021

Open-domain Question Answering models that directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared with conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models fall short of the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce (PAQ), a very large resource of 65M automatically generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ and test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) while retaining high accuracy. Lastly, we demonstrate RePAQ’s strength at , abstaining from answering when it is likely to be incorrect. This enables RePAQ to “back-off” to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.

Journal Article

شارك هذا الكتاب

أضف إلى رفتي

Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering

بواسطة Wen, Elliott , Nanayakkara, Suranga , Rana, Rajib في Adaptation , Augmentation , Computational linguistics

2023

Retrieval Augment Generation (RAG) is a recent advancement in Open-Domain Question Answering (ODQA). RAG has only been trained and explored with a Wikipedia-based external knowledge base and is not optimized for use in other specialized domains such as healthcare and news. In this paper, we evaluate the impact of joint training of the retriever and generator components of RAG for the task of domain adaptation in ODQA. We propose , an extension to RAG that can adapt to a domain-specific knowledge base by updating all components of the external knowledge base during training. In addition, we introduce an auxiliary training signal to inject more domain-specific knowledge. This auxiliary signal forces to reconstruct a given sentence by accessing the relevant information from the external knowledge base. Our novel contribution is that, unlike RAG, RAG-end2end does joint training of the retriever and generator for the end QA task and domain adaptation. We evaluate our approach with datasets from three domains: COVID-19, News, and Conversations, and achieve significant performance improvements compared to the original RAG model. Our work has been open-sourced through the HuggingFace Transformers library, attesting to our work’s credibility and technical consistency.

Journal Article

شارك هذا الكتاب

أضف إلى رفتي

TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages

بواسطة Clark, Jonathan H , Collins, Michael , Garrette, Dan في Answers , Computational linguistics , Data collection

2020

Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA—a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology—the set of linguistic features each language expresses—such that we expect models performing well on this set to generalize across a large number of the world’s languages. We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don’t know the answer yet, and the data is collected directly in each language without the use of translation.

Journal Article

شارك هذا الكتاب

أضف إلى رفتي

TopiOCQA: Open-domain Conversational Question Answering with Topic Switching

بواسطة Dhuliawala, Shehzaad , Suleman, Kaheer , de Vries, Harm في Answers , Computational linguistics , Conversation

2022

In a conversational question answering scenario, a questioner seeks to extract information about a topic through a series of interdependent questions and answers. As the conversation progresses, they may switch to related topics, a phenomenon commonly observed in information-seeking search sessions. However, current datasets for conversational question answering are limiting in two ways: 1) they do not contain topic switches; and 2) they assume the reference text for the conversation is given, that is, the setting is not open-domain. We introduce (pronounced Tapioca), an open-domain conversational dataset with topic switches based on Wikipedia. contains 3,920 conversations with information-seeking questions and free-form answers. On average, a conversation in our dataset spans 13 question-answer turns and involves four topics (documents). poses a challenging test-bed for models, where efficient retrieval is required on multiple turns of the same conversation, in conjunction with constructing valid responses using conversational history. We evaluate several baselines, by combining state-of-the-art document retrieval methods with neural reader models. Our best model achieves F1 of 55.8, falling short of human performance by 14.2 points, indicating the difficulty of our dataset. Our dataset and code are available at .

Journal Article

شارك هذا الكتاب

أضف إلى رفتي

MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering

بواسطة Longpre, Shayne , Lu, Yi , Daiber, Joachim في Annotations , Answers , Benchmarks

2021

Progress in cross-lingual modeling depends on challenging, realistic, and diverse evaluation sets. We introduce Multilingual Knowledge Questions and Answers (MKQA), an open- domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). Answers are based on heavily curated, language- independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering. We benchmark a variety of state- of-the-art methods and baselines for generative and extractive question answering, trained on Natural Questions, in zero shot and translation settings. Results indicate this dataset is challenging even in English, but especially in low-resource languages.

Journal Article

شارك هذا الكتاب

أضف إلى رفتي

محدد اللغة

MBRLGlobalSearch

محدد اللغة

Catalogue Search | MBRL

نتائج البحث

استكشف المجموعة الواسعة من العناوين المتاحة.

MBRLSearchResults

MBRLHappinessMeter