Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HOTPOTQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HOTPOTQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
translated by 谷歌翻译
Multi-hop Machine reading comprehension is a challenging task with aim of answering a question based on disjoint pieces of information across the different passages. The evaluation metrics and datasets are a vital part of multi-hop MRC because it is not possible to train and evaluate models without them, also, the proposed challenges by datasets often are an important motivation for improving the existing models. Due to increasing attention to this field, it is necessary and worth reviewing them in detail. This study aims to present a comprehensive survey on recent advances in multi-hop MRC evaluation metrics and datasets. In this regard, first, the multi-hop MRC problem definition will be presented, then the evaluation metrics based on their multi-hop aspect will be investigated. Also, 15 multi-hop datasets have been reviewed in detail from 2017 to 2022, and a comprehensive analysis has been prepared at the end. Finally, open issues in this field have been discussed.
translated by 谷歌翻译
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K questionanswer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. We also present two baseline algorithms: a featurebased classifier and a state-of-the-art neural network, that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that Trivi-aQA is a challenging testbed that is worth significant future study. 1
translated by 谷歌翻译
We present QuAC, a dataset for Question Answering in Context that contains 14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation. We also report results for a number of reference models, including a recently state-ofthe-art reading comprehension architecture extended to model dialog context. Our best model underperforms humans by 20 F1, suggesting that there is significant room for future work on this data. Dataset, baseline, and leaderboard available at http://quac.ai.
translated by 谷歌翻译
多跳的推理需要汇总多个文档来回答一个复杂的问题。现有方法通常将多跳问题分解为更简单的单跳问题,以解决说明可解释的推理过程的问题。但是,他们忽略了每个推理步骤的支持事实的基础,这往往会产生不准确的分解。在本文中,我们提出了一个可解释的逐步推理框架,以在每个中间步骤中同时合并单跳支持句子识别和单跳问题生成,并利用当前跳跃的推断,直到推理最终结果。我们采用统一的读者模型来进行中级跳跃推理和最终的跳跃推理,并采用关节优化,以更准确,强大的多跳上推理。我们在两个基准数据集HOTPOTQA和2WIKIMULTIHOPQA上进行实验。结果表明,我们的方法可以有效地提高性能,并在不分解监督的情况下产生更好的解释推理过程。
translated by 谷歌翻译
虽然通过简单的因素问题回答,文本理解的大量进展,但更加全面理解话语仍然存在重大挑战。批判性地反映出文本的人将造成好奇心驱动,通常是开放的问题,这反映了对内容的深刻理解,并要求复杂的推理来回答。建立和评估这种类型的话语理解模型的关键挑战是缺乏注释数据,特别是因为找到了这些问题的答案(可能根本不回答),需要高度的注释载荷的高认知负荷。本文提出了一种新的范式,使可扩展的数据收集能够针对新闻文件的理解,通过话语镜头查看这些问题。由此产生的语料库DCQA(疑问回答的话语理解)包括在607名英语文件中的22,430个问题答案对组成。 DCQA以自由形式,开放式问题的形式捕获句子之间的话语和语义链接。在评估集中,我们向问题上的问题提交了来自好奇数据集的问题,我们表明DCQA提供了有价值的监督,以回答开放式问题。我们还在使用现有的问答资源设计预训练方法,并使用合成数据来适应不可批售的问题。
translated by 谷歌翻译
近年来,在挑战的多跳QA任务方面有令人印象深刻的进步。然而,当面对输入文本中的一些干扰时,这些QA模型可能会失败,并且它们进行多跳推理的可解释性仍然不确定。以前的逆势攻击作品通常编辑整个问题句,这对测试基于实体的多跳推理能力有限。在本文中,我们提出了一种基于多跳推理链的逆势攻击方法。我们将从查询实体开始的多跳推理链与构造的图表中的答案实体一起制定,这使我们能够将问题对齐到每个推理跳跃,从而攻击任何跃点。我们将问题分类为不同的推理类型和对应于所选推理跳的部分问题,以产生分散注意力的句子。我们在HotpotQA DataSet上的三个QA模型上测试我们的对抗方案。结果表明,对答案和支持事实预测的显着性能降低,验证了我们推理基于链条推理模型的攻击方法的有效性以及它们的脆弱性。我们的对抗重新培训进一步提高了这些模型的性能和鲁棒性。
translated by 谷歌翻译
有关应答数据集和模型的研究在研究界中获得了很多关注。其中许多人释放了自己的问题应答数据集以及模型。我们在该研究领域看到了巨大的进展。本调查的目的是识别,总结和分析许多研究人员释放的现有数据集,尤其是在非英语数据集以及研究代码和评估指标等资源中。在本文中,我们审查了问题应答数据集,这些数据集可以以法语,德语,日语,中文,阿拉伯语,俄语以及多语言和交叉的问答数据集进行英语。
translated by 谷歌翻译
为了实现长文档理解的构建和测试模型,我们引入质量,具有中文段的多项选择QA DataSet,具有约5,000个令牌的平均长度,比典型的当前模型更长。与经过段落的事先工作不同,我们的问题是由阅读整个段落的贡献者编写和验证的,而不是依赖摘要或摘录。此外,只有一半的问题是通过在紧缩时间限制下工作的注释器来应答,表明略读和简单的搜索不足以一直表现良好。目前的模型在此任务上表现不佳(55.4%),并且落后于人类性能(93.5%)。
translated by 谷歌翻译
Extractive reading comprehension systems can often locate the correct answer to a question in a context document, but they also tend to make unreliable guesses on questions for which the correct answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to identify. To address these weaknesses, we present SQuAD 2.0, the latest version of the Stanford Question Answering Dataset (SQuAD). SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. SQuAD 2.0 is a challenging natural language understanding task for existing models: a strong neural system that gets 86% F1 on SQuAD 1.1 achieves only 66% F1 on SQuAD 2.0.
translated by 谷歌翻译
As large language models (LLMs) grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition, leaving significant room for future improvements.
translated by 谷歌翻译
Machine reading comprehension (MRC) is a long-standing topic in natural language processing (NLP). The MRC task aims to answer a question based on the given context. Recently studies focus on multi-hop MRC which is a more challenging extension of MRC, which to answer a question some disjoint pieces of information across the context are required. Due to the complexity and importance of multi-hop MRC, a large number of studies have been focused on this topic in recent years, therefore, it is necessary and worth reviewing the related literature. This study aims to investigate recent advances in the multi-hop MRC approaches based on 31 studies from 2018 to 2022. In this regard, first, the multi-hop MRC problem definition will be introduced, then 31 models will be reviewed in detail with a strong focus on their multi-hop aspects. They also will be categorized based on their main techniques. Finally, a fine-grain comprehensive comparison of the models and techniques will be presented.
translated by 谷歌翻译
This paper proposes to tackle opendomain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
translated by 谷歌翻译
We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com.
translated by 谷歌翻译
We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises of 1,010,916 anonymized questionssampled from Bing's search query logs-each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages-extracted from 3,563,535 web documents retrieved by Bing-that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguishes MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset makes it attractive for benchmarking machine reading comprehension and question-answering models.
translated by 谷歌翻译
In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from.The claims are classified as SUPPORTED, RE-FUTED or NOTENOUGHINFO by annotators achieving 0.6841 in Fleiss κ. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.
translated by 谷歌翻译
In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.
translated by 谷歌翻译
多跳问题回答(QA)需要对多个文档进行推理,以回答一个复杂的问题并提供可解释的支持证据。但是,提供支持证据不足以证明模型已经执行了所需的推理来达到正确的答案。大多数现有的多跳质量检查方法也无法回答大部分子问题,即使他们的父母问题得到了正确的回答。在本文中,我们为多跳QA提出了基于及时的保护学习(PCL)框架,该框架从多跳QA任务中获取了新知识,同时保留了在单跳QA任务上学习的旧知识,从而减轻了遗忘。具体来说,我们首先在现有的单跳质量检查任务上训练模型,然后冻结该模型,并通过为多跳质量检查任务分配其他子网络来扩展它。此外,为了调整预训练的语言模型以刺激特定多跳问题所需的推理类型,我们学习了新型子网络的软提示,以执行特定于类型的推理。 HOTPOTQA基准测试的实验结果表明,PCL具有多跳质量质量质量检查的竞争力,并且在相应的单跳子问题上保留了良好的性能,这表明PCL通过忘记通过忘记来减轻知识丧失的功效。
translated by 谷歌翻译
Humans gather information through conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. We introduce CoQA, a novel dataset for building Conversational Question Answering systems. 1 Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. We evaluate strong dialogue and reading comprehension models on CoQA. The best system obtains an F1 score of 65.4%, which is 23.4 points behind human performance (88.8%), indicating there is ample room for improvement. We present CoQA as a challenge to the community at https://stanfordnlp. github.io/coqa.
translated by 谷歌翻译
问题回答(QA)是最重要的自然语言处理(NLP)任务之一。它旨在使用NLP技术根据大规模的非结构化语料库生成对给定问题的相应答案。随着深度学习的发展,正在提出越来越具有挑战性的质量检查数据集,并且许多用于解决它们的新方法也正在出现。在本文中,我们研究了在深度学习时代发布的有影响力的质量检查数据集。具体来说,我们首先引入两个最常见的质量检查任务 - 文本问题答案和视觉问题 - 分别涵盖最具代表性的数据集,然后给出质量检查研究的一些当前挑战。
translated by 谷歌翻译