Open-domain question answering (QA) is an important problem in AI and NLP, and it is emerging as a bellwether for progress on the generalizability of AI methods and techniques. Much of the progress in open-domain QA systems has come from advances in information retrieval methods and corpus construction. In this paper, we focus on the recently introduced ARC Challenge dataset, which contains 2,590 multiple-choice questions authored for grade-school science exams. These questions were selected to be the most challenging for current QA systems, and current state-of-the-art performance is only slightly better than random chance. We present a system that rewrites a given question into queries used to retrieve supporting text from a large corpus of science-related text. Our rewriter is able to incorporate background knowledge from ConceptNet and, combined with a generic textual entailment system trained on SciTail that identifies support in the retrieved results, outperforms several strong baselines on the end-to-end QA task, despite only being trained to identify essential terms in the original source question. We use a generalizable decision method over the retrieved evidence and answer candidates to select the best answer. By combining query rewriting, background knowledge, and textual entailment, our system is able to outperform several strong baselines on the ARC dataset.
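To make the shape of the pipeline this abstract describes more concrete, here is a minimal Python sketch: a crude stand-in for the essential-term selector rewrites the question into queries, a caller-supplied `retrieve` function returns supporting passages, and a caller-supplied `entails` function plays the role of the textual entailment model. All names and the term-selection heuristic are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Dict, List

def rewrite_question(essential_terms: List[str], choice: str) -> List[str]:
    """Build simple retrieval queries from essential terms plus an answer choice."""
    return [" ".join(essential_terms + choice.split())]

def answer_question(question: str,
                    choices: List[str],
                    retrieve: Callable[[str], List[str]],
                    entails: Callable[[str, str], float]) -> str:
    """Score each choice by its best entailment score over retrieved support."""
    # Placeholder essential-term selector: keep longer, non-trivial tokens.
    essential_terms = [w for w in question.rstrip("?").split() if len(w) > 3]
    scores: Dict[str, float] = {}
    for choice in choices:
        hypothesis = f"{question.rstrip('?')} {choice}"
        support: List[str] = []
        for query in rewrite_question(essential_terms, choice):
            support.extend(retrieve(query))
        # Best supporting passage determines the score for this choice.
        scores[choice] = max((entails(p, hypothesis) for p in support), default=0.0)
    return max(scores, key=scores.get)
```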
We present a reading comprehension challenge in which questions can only be answered by taking into account information from multiple sentences. We solicit and verify questions and answers for this challenge through a 4-step crowdsourcing experiment. Our challenge dataset contains ∼6k questions for 800+ paragraphs across 7 different domains (elementary school science, news, travel guides, fiction stories, etc.), bringing linguistic diversity to the texts and to the questions' wordings. On a subset of our dataset, we found human solvers to achieve an F1-score of 86.4%. We analyze a range of baselines, including a recent state-of-the-art reading comprehension system, and demonstrate the difficulty of this challenge despite high human performance. The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that requires reasoning skills.
Question answering is one of the most important and difficult applications at the border of information retrieval and natural language processing, especially when we consider complex science questions that require some form of inferential reasoning to determine the correct answer. In this paper, we present a two-step method that combines information retrieval techniques optimized for question answering with deep learning models for natural language inference in order to solve multiple-choice questions in the science domain. For each question-answer pair, we use a standard retrieval-based model to find relevant candidate contexts and decompose the main problem into two sub-problems. First, a correctness score is assigned to each candidate answer based on its context, using a retrieval model from Lucene. Second, we use a deep learning architecture to compute whether a candidate answer can be inferred from a selected context composed of sentences retrieved from the knowledge base. Finally, all these solvers are combined using a simple neural network that predicts the correct answer. This proposed two-step model outperforms the best retrieval-based solver by over 3% in absolute accuracy.
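As a rough illustration of the two-step idea, the sketch below assumes a hypothetical `ir_solver` and `nli_solver` that each return a per-choice score, and combines them with a single sigmoid unit. The combiner weights and solver interfaces are placeholders invented for illustration, not the paper's actual components.

```python
import numpy as np
from typing import Callable, Dict, Iterable

def combine(ir_score: float, nli_score: float, w: np.ndarray, b: float) -> float:
    """One-unit combiner: sigmoid(w . [ir_score, nli_score] + b)."""
    z = float(np.dot(w, np.array([ir_score, nli_score])) + b)
    return 1.0 / (1.0 + np.exp(-z))

def pick_answer(question: str,
                choices: Iterable[str],
                ir_solver: Callable[[str, str], float],
                nli_solver: Callable[[str, str], float],
                w: np.ndarray = np.array([1.0, 1.0]),
                b: float = 0.0) -> str:
    """Score every choice with both solvers and return the highest combined score."""
    scores: Dict[str, float] = {
        c: combine(ir_solver(question, c), nli_solver(question, c), w, b)
        for c in choices
    }
    return max(scores, key=scores.get)
```

In practice the combiner weights would be learned from question-answer pairs rather than fixed by hand as they are here.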
Open-domain question answering remains a challenging task, as it requires models that are capable of understanding questions and answers, collecting useful information, and reasoning over evidence. Previous work typically formulates this task as a reading comprehension or entailment problem over evidence retrieved from search engines. However, existing techniques struggle to retrieve indirectly related evidence when no directly related evidence is provided, especially for complex questions whose requirements are hard to parse precisely. In this paper, we propose a retriever-reader model that learns to attend on essential terms during the question answering process. We build (1) an essential term selector, which first identifies the most important words in a question, then reformulates the query and searches for related evidence; and (2) an enhanced reader that distinguishes between essential terms and distracting words to predict the answer. We evaluate our model on multiple open-domain QA datasets, where it outperforms the existing state of the art, notably with a relative improvement of 8.1% on the AI2 Reasoning Challenge (ARC) dataset.
We present a new kind of question answering dataset, OpenBookQA, modeled after open book exams for assessing human understanding of a subject. The open book that accompanies our questions is a set of 1,329 elementary-level science facts. Roughly 6,000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge obtained from other sources (e.g., a suit of armor is made of metal). While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic, in the context of common knowledge, and the language it is expressed in. Human performance on OpenBookQA is close to 92%, but many state-of-the-art pre-trained QA methods perform surprisingly poorly, worse than several simple baselines we develop. Our oracle experiments, designed to circumvent the knowledge retrieval bottleneck, demonstrate the value of both the open book and additional facts. We leave it as a challenge to solve the retrieval problem in this multi-hop setting and to close the large gap to human performance.
In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as SUPPORTED, REFUTED or NOTENOUGHINFO by annotators achieving 0.6841 in Fleiss κ. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.
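A minimal sketch of a pipeline of the kind described, assuming hypothetical `retrieve_sentences` and `classify` components supplied by the caller; it is only meant to show the shape of the task, not the FEVER baselines themselves.

```python
from typing import Callable, List, Tuple

LABELS = ("SUPPORTED", "REFUTED", "NOTENOUGHINFO")

def verify(claim: str,
           retrieve_sentences: Callable[[str, int], List[str]],
           classify: Callable[[str, List[str]], Tuple[str, float]],
           k: int = 5) -> Tuple[str, List[str]]:
    """Return a label from LABELS plus the evidence sentences the label is based on."""
    evidence = retrieve_sentences(claim, k)   # candidate evidence from Wikipedia-like source
    if not evidence:
        return "NOTENOUGHINFO", []
    label, _confidence = classify(claim, evidence)  # claim-vs-evidence classifier
    assert label in LABELS
    return label, evidence
```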
Reading comprehension (RC), in contrast to information retrieval, requires integrating information and reasoning about events, entities, and their relations across a full document. Question answering is conventionally used to assess RC ability, in both artificial agents and children learning to read. However, existing RC datasets and tasks are dominated by questions that can be solved by selecting answers using superficial information (e.g., local context similarity or global term frequency); they thus fail to test for the essential integrative aspect of RC. To encourage progress on deeper comprehension of language, we present a new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts. These tasks are designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience. We show that although humans solve the tasks easily, standard RC models struggle on the tasks presented here. We provide an analysis of the dataset and the challenges it presents.
Commonsense knowledge and commonsense reasoning are among the main bottlenecks in machine intelligence. In the NLP community, many benchmark datasets and tasks have been created to address commonsense reasoning for language understanding. These tasks are designed to assess machines' ability to acquire and learn commonsense knowledge in order to reason about and understand natural language text. As these tasks become instrumental and a driving force for commonsense research, this paper aims to provide an overview of existing tasks and benchmarks, knowledge resources, and learning and inference approaches toward commonsense reasoning for natural language understanding. Through this, our goal is to support a better understanding of the state of the art, its limitations, and future challenges.
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, which provide high-quality distant supervision for answering the questions. We show that, compared to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross-sentence reasoning to find answers. We also present two baseline algorithms: a feature-based classifier and a state-of-the-art neural network that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that TriviaQA is a challenging testbed deserving significant future study. Data and code are available at http://nlp.cs.washington.edu/triviaqa/
This paper introduces DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset designed to address real-world MRC. DuReader has three advantages over previous MRC datasets: (1) data sources: questions and documents are based on Baidu Search and Baidu Zhidao, and answers are manually generated; (2) question types: it provides rich annotations for more question types, especially yes-no and opinion questions, leaving more opportunities for the research community; (3) scale: it contains 200K questions, 420K answers and 1M documents, making it the largest Chinese MRC dataset to date. Experiments show that human performance is well above that of current state-of-the-art baseline systems, leaving ample room for the community to make improvements. To help the community make these improvements, both DuReader and the baseline systems have been released online. We also organized a shared competition to encourage the exploration of more models. Since the release of the task, the baselines have improved significantly.
Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained by any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain their predictions; and (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform the necessary comparisons. We show that HotpotQA is challenging for the latest QA systems, and that the supporting facts enable models to improve performance and make explainable predictions.
This paper presents our recent work on the design and development of a new, large scale dataset, which we name MS MARCO, for MAchine Reading COmprehension. This new dataset is aimed to overcome a number of well-known weaknesses of previous publicly available datasets for the same task of reading comprehension and question answering. In MS MARCO, all questions are sampled from real anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated. Finally, a subset of these queries has multiple answers. We aim to release one million queries and the corresponding answers in the dataset, which, to the best of our knowledge, is the most comprehensive real-world dataset of its kind in both quantity and quality. We are currently releasing 100,000 queries with their corresponding answers to inspire work in reading comprehension and question answering along with gathering feedback from the research community.
Question answering (QA) systems are easily distracted by irrelevant or redundant words in questions, especially when faced with long or multi-sentence questions in difficult domains. This paper introduces and studies the notion of essential question terms with the goal of improving such QA solvers. We illustrate the importance of essential question terms by showing that humans' ability to answer questions drops significantly when essential terms are eliminated from questions. We then develop a classifier that reliably (90% mean average precision) identifies and ranks essential terms in questions. Finally, we use the classifier to demonstrate that the notion of question term essentiality allows state-of-the-art QA solvers for elementary-level science questions to make better and more informed decisions, improving performance by up to 5%. We also introduce a new dataset of over 2,200 crowd-sourced essential terms annotated science questions.
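To give a concrete sense of what ranking question terms by essentiality could look like in its simplest form, the sketch below scores tokens with a few hand-picked surface features and weights. These features and weights are invented for illustration and are not the paper's classifier, which is a trained model.

```python
from typing import List, Tuple

STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "which", "what", "in", "from"}

def essentiality_features(term: str, position: int) -> List[float]:
    """A few crude surface features for how 'essential' a term might be."""
    return [
        float(term.lower() not in STOPWORDS),  # content word
        float(term[0].isupper()),              # proper-noun-ish
        min(len(term), 10) / 10.0,             # longer terms tend to carry content
        float(position > 2),                   # outside the leading wh-phrase
    ]

def rank_terms(question: str,
               weights: Tuple[float, ...] = (2.0, 1.0, 0.5, 0.5)) -> List[Tuple[str, float]]:
    """Score every token and return them sorted by descending essentiality."""
    terms = question.rstrip("?").split()
    scored = [(t, sum(w * f for w, f in zip(weights, essentiality_features(t, i))))
              for i, t in enumerate(terms)]
    return sorted(scored, key=lambda x: x[1], reverse=True)

print(rank_terms("Which organ system filters waste from the blood?"))
```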
This article provides a comprehensive and comparative overview of question answering technology. It presents the question answering task from an information retrieval perspective and emphasises the importance of retrieval models, i.e., representations of queries and information documents, and retrieval functions which are used for estimating the relevance between a query and an answer candidate. The survey suggests a general question answering architecture that steadily increases the complexity of the representation level of questions and information objects. On the one hand, natural language queries are reduced to keyword-based searches; on the other hand, knowledge bases are queried with structured or logical queries obtained from the natural language questions, and answers are obtained through reasoning. We discuss different levels of processing yielding bag-of-words-based and more complex representations integrating part-of-speech tags, classification of the expected answer type, semantic roles, discourse analysis, translation into a SQL-like language and logical representations.
In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently come close to the level of non-expert humans, suggesting limited headroom for further research. This paper recaps lessons learned from the GLUE benchmark and presents SuperGLUE, a new benchmark styled after GLUE with a set of more difficult language understanding tasks, improved resources, and a new public leaderboard. SuperGLUE will be available soon at super.gluebenchmark.com.
We propose a novel method for exploiting the semantic structure of text to answer multiple-choice questions. The approach is especially suitable for domains that require reasoning over a diverse set of linguistic constructs but have limited training data. To address these challenges, we present the first system, to the best of our knowledge, that reasons over a wide range of semantic abstractions of the text, which are derived using off-the-shelf, general-purpose, pre-trained natural language modules such as semantic role labelers, coreference resolvers, and dependency parsers. Representing multiple abstractions as a family of graphs, we translate question answering (QA) into a search for an optimal subgraph that satisfies certain global and local properties. This formulation generalizes several prior structured QA systems. Our system, SEMANTICILP, demonstrates strong performance on two domains simultaneously. In particular, on a collection of challenging science QA datasets, it outperforms various state-of-the-art approaches, including neural models, broad coverage information retrieval, and specialized techniques using structured knowledge bases, by 2%-6%.
Most work in machine reading focuses on question answering problems where the answer is directly expressed in the text to read. However, many real-world question answering problems require the reading of text not because it contains the literal answer, but because it contains a recipe to derive an answer together with the reader's background knowledge. One example is the task of interpreting regulations to answer "Can I...?" or "Do I have to...?" questions such as "I am working in Canada. Do I have to carry on paying UK National Insurance?" after reading a UK government website about this topic. This task requires both the interpretation of rules and the application of background knowledge. It is further complicated by the fact that, in practice, most questions are underspecified, and a human assistant will regularly ask clarification questions such as "How long have you been working abroad?" when the answer cannot be directly derived from the question and text. In this paper, we formalise this task and develop a crowd-sourcing strategy to collect 32k task instances based on real-world rules and crowd-generated questions and scenarios. We analyse the challenges of this task and assess its difficulty by evaluating the performance of rule-based and machine-learning baselines. We observe promising results when no background knowledge is necessary, and substantial room for improvement whenever background knowledge is needed.
Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literatures on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning, achieving 47.0% F1.
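To give a concrete sense of the discrete operations involved, here is a toy Python example; the passage, extraction regex, and operation set are simplifications invented for illustration, not anything from the benchmark or its models.

```python
import re
from typing import List

def extract_numbers(passage: str) -> List[float]:
    """Pull out numeric mentions from a passage."""
    return [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", passage)]

# The kinds of discrete operations DROP questions call for.
OPS = {
    "count": lambda nums: len(nums),
    "sum": lambda nums: sum(nums),
    "max": lambda nums: max(nums),
    "sorted": lambda nums: sorted(nums),
}

passage = "The Bears scored touchdowns of 12, 7 and 35 yards in the first half."
print(OPS["count"](extract_numbers(passage)))  # 3 touchdowns
print(OPS["max"](extract_numbers(passage)))    # longest: 35.0 yards
```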
Most reading comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently there exist no resources to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence, effectively performing multi-hop (alias multi-step) inference. We devise a methodology to produce datasets for this task, given a collection of query-answer pairs and thematically linked documents. Two datasets from different domains are induced, and we identify potential pitfalls and devise circumvention strategies. We evaluate two previously proposed competitive models and find that they can integrate information across documents. However, both models struggle to select relevant information, as providing documents guaranteed to be relevant greatly improves their performance. While the models outperform several strong baselines, their best accuracy reaches 42.9%, compared to human performance at 74.0%, leaving ample room for improvement.
We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com