Reading comprehension (RC), in contrast to information retrieval, requires integrating information and reasoning about events, entities, and their relations across a full document. Question answering is conventionally used to assess RC ability, in both artificial agents and children learning to read. However, existing RC datasets and tasks are dominated by questions that can be solved by selecting answers using superficial information (e.g., local context similarity or global term frequency); they thus fail to test for the essential integrative aspect of RC. To encourage progress on deeper comprehension of language, we present a new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts. These tasks are designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience. We show that although humans solve the tasks easily, standard RC models struggle on the tasks presented here. We provide an analysis of the dataset and the challenges it presents.
We present WIKIREADING, a large-scale natural language understanding task and publicly available dataset with 18 million instances. The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. The task contains a rich variety of challenging classification and extraction sub-tasks, making it well-suited for end-to-end models such as deep neural networks (DNNs). We compare various state-of-the-art DNN-based architectures for document classification, information extraction, and question answering. We find that models supporting a rich answer space, such as word or character sequences, perform best. Our best-performing model, a word-level sequence to sequence model with a mechanism to copy out-of-vocabulary words, obtains an accuracy of 71.8%.
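The copy mechanism mentioned above can be pictured with a small sketch. The snippet below (random weights and assumed shapes; an illustration of the general idea, not the paper's implementation) mixes a distribution over a fixed vocabulary with an attention distribution over source positions, so an out-of-vocabulary source word can still be emitted:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, T = 20, 8, 4               # vocab size, hidden dim, source length (assumed)
enc = rng.normal(size=(T, d))    # encoder states, one per source token
dec = rng.normal(size=d)         # current decoder state
W_gen = rng.normal(size=(d, V))  # output projection onto the fixed vocab

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p_vocab = softmax(dec @ W_gen)           # distribution over in-vocab words
attn = softmax(enc @ dec)                # attention over source positions
gate = 1.0 / (1.0 + np.exp(-dec.sum()))  # illustrative generate-vs-copy gate

# Extended distribution: V in-vocab slots, then one copy slot per source token,
# so an out-of-vocabulary source word can still receive probability mass.
p_final = np.concatenate([(1 - gate) * p_vocab, gate * attn])
print(p_final.argmax())
```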
This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
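As a rough illustration of the retrieval component described above, the sketch below (toy corpus and illustrative bucket count; not the authors' code) hashes word bigrams into a fixed number of buckets and ranks documents by TF-IDF-weighted overlap with the question:

```python
import math
from collections import Counter

NUM_BUCKETS = 2 ** 20  # DrQA hashes bigrams into 2^24 bins; smaller here

def bigram_buckets(text):
    toks = text.lower().split()
    grams = toks + [" ".join(p) for p in zip(toks, toks[1:])]
    # Python's hash is salted per process, which is fine for a one-run demo.
    return [hash(g) % NUM_BUCKETS for g in grams]

def build_index(docs):
    doc_tfs = [Counter(bigram_buckets(d)) for d in docs]   # term frequencies
    df = Counter(b for tf in doc_tfs for b in tf)          # document frequencies
    n = len(docs)
    idf = {b: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for b, c in df.items()}
    return doc_tfs, idf

def rank(question, doc_tfs, idf):
    q = Counter(bigram_buckets(question))
    # Dot product of TF-IDF weighted bag-of-bigram vectors (hence idf squared).
    scores = [sum(q[b] * tf[b] * idf.get(b, 0.0) ** 2 for b in q) for tf in doc_tfs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

docs = ["Paris is the capital of France .",
        "The Eiffel Tower is in Paris ."]
doc_tfs, idf = build_index(docs)
print(rank("What is the capital of France ?", doc_tfs, idf))  # [0, 1]
```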
This paper presents an end-to-end neural network model, named Neural Generative Question Answering (GENQA), that can generate answers to simple factoid questions, based on the facts in a knowledge base. More specifically, the model is built on the encoder-decoder framework for sequence-to-sequence learning, while equipped with the ability to enquire the knowledge base, and is trained on a corpus of question-answer pairs, with their associated triples in the knowledge base. Empirical study shows the proposed model can effectively deal with the variations of questions and answers, and generate right and natural answers by referring to the facts in the knowledge base. The experiment on question answering demonstrates that the proposed model can outperform an embedding-based QA model as well as a neural dialogue model trained on the same data.
A long-term goal of machine learning is to build intelligent conversational agents. One recent popular approach is to train end-to-end models on a large amount of real dialog transcripts between humans (Sordoni et al., 2015; Vinyals & Le, 2015; Shang et al., 2015). However, this approach leaves many questions unanswered, as the precise successes and shortcomings of each model are hard to assess. A contrasting recent proposal is the bAbI tasks (Weston et al., 2015b), synthetic data that measure the ability of learning machines at various reasoning tasks over toy language. Unfortunately, those tests are very small and hence may encourage methods that do not scale. In this work, we propose a suite of new tasks of a much larger scale that attempt to bridge the gap between the two regimes. Choosing the domain of movies, we provide tasks that test the ability of models to answer factual questions (utilizing OMDB), provide personalization (utilizing MovieLens), carry short conversations about the two, and finally to perform on natural dialogs from Reddit. We provide a dataset covering ∼75k movie entities and with ∼3.5M training examples. We present results of various models on these tasks, and evaluate their performance.
With the rapid growth of knowledge bases (KBs) on the web, how to take full advantage of them becomes increasingly important. Question answering over knowledge base (KB-QA) is one of the promising approaches to access this substantial knowledge. Meanwhile, as neural network-based (NN-based) methods develop, NN-based KB-QA has already achieved impressive results. However, previous work has placed little emphasis on question representation: the question is converted into a fixed vector regardless of its candidate answers, a simple strategy that struggles to express the proper information of the question. Hence, we present an end-to-end neural network model that represents the questions and their corresponding scores dynamically according to the various candidate answer aspects via a cross-attention mechanism. In addition, we leverage the global knowledge inside the underlying KB, aiming to integrate the rich KB information into the representation of the answers. This also alleviates the out-of-vocabulary (OOV) problem, which helps the cross-attention model represent the question more precisely. The experimental results on WebQuestions demonstrate the effectiveness of the proposed approach.
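To make the cross-attention idea concrete, here is a minimal sketch (random embeddings and assumed dimensions; an illustration of the general idea, not the paper's model) in which the question words are re-weighted separately for each candidate-answer aspect, so the question representation changes with the aspect being scored:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_words, n_aspects = 8, 6, 3
Q = rng.normal(size=(n_words, d))          # question word embeddings
aspects = rng.normal(size=(n_aspects, d))  # e.g., answer entity / type / relation

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

score = 0.0
for a in aspects:
    attn = softmax(Q @ a)   # attention of this aspect over question words
    q_dyn = attn @ Q        # aspect-specific question representation
    score += q_dyn @ a      # similarity of the re-weighted question to the aspect
print(float(score))         # aggregate score for this candidate answer
```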
Open-domain question answering (QA) is an important problem in AI and NLP that is emerging as a bellwether for progress on the generalizability of AI methods and techniques. Much of the progress in open-domain QA systems has been realized through advances in information retrieval methods and corpus construction. In this paper, we focus on the recently introduced ARC Challenge dataset, which contains 2,590 multiple-choice questions authored for grade-school science exams. These questions were selected to be the most challenging for current QA systems, and current state-of-the-art performance is only slightly better than random chance. We present a system that rewrites a given question into queries that are used to retrieve supporting text from a large corpus of science-related text. Our rewriter is able to incorporate background knowledge from ConceptNet and, in tandem with a generic textual entailment system trained on SciTail that identifies support in the retrieved results, outperforms several strong baselines on the end-to-end QA task despite only being trained to identify essential terms in the original source question. We use a generalizable decision methodology over the retrieved evidence and answer candidates to select the best answer. By combining query rewriting, background knowledge, and textual entailment, our system is able to outperform several strong baselines on the ARC dataset.
This paper presents our recent work on the design and development of a new, large-scale dataset, which we name MS MARCO, for MAchine Reading COmprehension. This new dataset is aimed to overcome a number of well-known weaknesses of previous publicly available datasets for the same task of reading comprehension and question answering. In MS MARCO, all questions are sampled from real anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated. Finally, a subset of these queries has multiple answers. We aim to release one million queries and the corresponding answers in the dataset, which, to the best of our knowledge, is the most comprehensive real-world dataset of its kind in both quantity and quality. We are currently releasing 100,000 queries with their corresponding answers to inspire work in reading comprehension and question answering, along with gathering feedback from the research community.
Commonsense knowledge and commonsense reasoning are among the main bottlenecks in machine intelligence. In the NLP community, many benchmark datasets and tasks have been created to address commonsense reasoning for language understanding. These tasks are designed to assess machines' ability to acquire and learn commonsense knowledge in order to reason about and understand natural language text. As these tasks become instrumental and a driving force for commonsense research, this paper aims to provide an overview of existing tasks and benchmarks, knowledge resources, and learning and inference approaches toward commonsense reasoning for natural language understanding. Through this, our goal is to support a better understanding of the state of the art, its limitations, and future challenges.
We present RACE, a new dataset for benchmark evaluation of methods in the reading comprehension task. Collected from the English exams for middle and high school Chinese students aged 12 to 18, RACE consists of nearly 28,000 passages and nearly 100,000 questions generated by human experts (English instructors), and covers a variety of topics which are carefully designed for evaluating the students' ability in understanding and reasoning. In particular, the proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of the state-of-the-art models (43%) and the ceiling human performance (95%). We hope this new dataset can serve as a valuable resource for research and evaluation in machine comprehension. The dataset is freely available at
Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain their predictions; (4) we offer a new type of factoid comparison question to test QA systems' ability to extract relevant facts and perform the necessary comparisons. We show that HotpotQA is challenging for the latest QA systems, and that the supporting facts enable models to improve performance and make explainable predictions.
We study the task of generating from Wikipedia articles question-answer pairs that cover content beyond a single sentence. We propose a neural network approach that incorporates coreference knowledge via a novel gating mechanism. Compared to models that only take into account sentence-level information (Heilman and Smith, 2010; Du et al., 2017; Zhou et al., 2017), we find that the linguistic knowledge introduced by the coreference representation aids question generation significantly, producing models that outperform the current state-of-the-art. We apply our system (composed of an answer span extraction system and the passage-level QG system) to the 10,000 top-ranking Wikipedia articles and create a corpus of over one million question-answer pairs. We also provide a qualitative analysis for this large-scale generated corpus from Wikipedia.
We describe a new class of learning models called memory networks. Memory networks reason with inference components combined with a long-term memory component; they learn how to use these jointly. The long-term memory can be read and written to, with the goal of using it for prediction. We investigate these models in the context of question answering (QA), where the long-term memory effectively acts as a (dynamic) knowledge base, and the output is a textual response. We evaluate them on a large-scale QA task and on a smaller, but more complex, toy task generated from a simulated world. In the latter, we show the reasoning power of such models by chaining multiple supporting sentences to answer questions that require understanding the intension of verbs.
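A single memory "hop" of the kind described above can be sketched in a few lines. The toy example below (random weights and bag-of-words embeddings; shapes are assumptions, not the paper's implementation) scores each memory slot against the question, takes a softmax-weighted sum, and maps the result to an answer distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 16                  # vocabulary size and embedding dim (assumed)
A = rng.normal(size=(V, d))    # input memory / question embedding
C = rng.normal(size=(V, d))    # output memory embedding
W = rng.normal(size=(d, V))    # final projection onto the answer vocabulary

def bow(ids, emb):
    return emb[ids].sum(axis=0)   # bag-of-words sentence embedding

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memnet_hop(memory_ids, question_ids):
    m = np.stack([bow(s, A) for s in memory_ids])  # memory keys
    c = np.stack([bow(s, C) for s in memory_ids])  # memory values
    u = bow(question_ids, A)                       # question state
    p = softmax(m @ u)                             # attention over memory slots
    o = p @ c                                      # retrieved memory content
    return softmax((u + o) @ W)                    # distribution over answers

story = [[1, 2, 3], [4, 5, 6]]   # token ids of two supporting sentences
question = [2, 7]
print(memnet_hop(story, question).argmax())
```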
Existing knowledge-based question answering systems often rely on small amounts of annotated training data. While shallow methods like relation extraction are robust to data scarcity, they are less expressive than deep meaning representation methods like semantic parsing, and thereby fail at answering questions involving multiple constraints. Here we alleviate this problem by empowering a relation extraction method with additional evidence from Wikipedia. We first present a neural network based relation extractor to retrieve the candidate answers from Freebase, and then infer over Wikipedia to validate these answers. Experiments on the WebQuestions question answering dataset show that our method achieves an F1 of 53.3%, a substantial improvement over the state-of-the-art.
Question answering is one of the most important and difficult applications at the border of information retrieval and natural language processing, especially when we talk about complex science questions which require some form of inference to determine the correct answer. In this paper, we present a two-step method that combines information retrieval techniques optimized for question answering with deep learning models for natural language inference in order to solve multiple-choice questions in the science domain. For each question-answer pair, we use standard retrieval-based models to find relevant candidate contexts and decompose the main problem into two different sub-problems. First, correctness scores are assigned to each candidate answer based on the context, using retrieval models from Lucene. Second, we use a deep learning architecture to compute whether a candidate answer can be inferred from some selected contexts consisting of sentences retrieved from the knowledge base. Finally, all these solvers are combined using a simple neural network to predict the correct answer. This proposed two-step model outperforms the best retrieval-based solver by over 3% in absolute accuracy.
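The final combination step can be pictured with a toy sketch (made-up solver scores; the weights would normally be learned, not random): a minimal linear layer that weighs the per-solver correctness scores of each answer choice and picks the highest-scoring one:

```python
import numpy as np

rng = np.random.default_rng(3)
n_solvers, n_choices = 2, 4
w = rng.normal(size=n_solvers)   # combination weights (learned in practice)
b = 0.0

# Rows: answer choices; columns: scores from the retrieval and inference solvers.
solver_scores = rng.random((n_choices, n_solvers))
combined = solver_scores @ w + b
print(int(combined.argmax()))    # index of the predicted answer choice
```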
Information Extraction (IE) refers to automatically extracting structured relation tuples from unstructured texts. Common IE solutions, including Relation Extraction (RE) and open IE systems, can hardly handle cross-sentence tuples, and are severely restricted by limited relation types as well as informal relation specifications (e.g., free-text based relation tuples). In order to overcome these weaknesses, we propose a novel IE framework named QA4IE, which leverages flexible question answering (QA) approaches to produce high quality relation triples across sentences. Based on the framework, we develop a large IE benchmark with high quality human evaluation. This benchmark contains 293K documents, 2M golden relation triples, and 636 relation types. We compare our system with some IE baselines on our benchmark and the results show that our system achieves great improvements.
We present QuAC, a dataset for Question Answering in Context that contains 14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation. We also report results for a number of reference models, including a recent state-of-the-art reading comprehension architecture extended to model dialog context. Our best model underperforms humans by 20 F1, suggesting that there is significant room for future work on this data. Dataset, baselines, and leaderboard are available at http://quac.ai.
We present NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs. Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN, with answers consisting of spans of text from the corresponding articles. We collect this dataset through a four-stage process designed to solicit exploratory questions that require reasoning. A thorough analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing textual entailment. We measure human performance on the dataset and compare it to several strong neural models. The performance gap between humans and machines (0.198 in F1) indicates that significant progress can be made on NewsQA through future research. The dataset is freely available at https://datasets.maluuba.com/NewsQA.
Question answering (QA) has significantly benefited from deep learning techniques in recent years. However, domain-specific QA remains a challenge due to the large amount of data required to train a neural network. This paper studies the answer sentence selection task in the Bible domain, answering questions by selecting relevant verses from the Bible. To this end, we create a new dataset, BibleQA, based on Bible trivia questions, and propose three neural network models for our task. We pre-train our models on a large-scale QA dataset, SQuAD, and investigate the effect of transferring weights on model accuracy. Furthermore, we measure the models' accuracies with different context lengths and different Bible translations. We confirm that transfer learning yields a significant improvement in model accuracy. Shorter context lengths achieve relatively good results, whereas longer context lengths decrease model accuracy. We also find that using more modern Bible translations in the dataset has a positive effect on the task.
A critical task for question answering is the final answer selection stage, which has to combine multiple signals available about each answer candidate. This paper proposes EviNets: a novel neural network architecture for factoid question answering. EviNets scores candidate answer entities by combining the available supporting evidence, e.g., structured knowledge bases and unstructured text documents. EviNets represents each piece of evidence with a dense embedding vector, scores their relevance to the question, and aggregates the support for each candidate to predict their final scores. Each of the components is generic and allows plugging in a variety of models for semantic similarity scoring and information aggregation. We demonstrate the effectiveness of EviNets in experiments on the existing TREC QA and WikiMovies benchmarks, and on the new Yahoo! Answers dataset introduced in this paper. EviNets can be extended to other information types and could facilitate future work on combining evidence signals for joint reasoning in question answering.
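The embed-score-aggregate scheme summarized above can be sketched as follows (made-up embeddings and a simple dot-product scorer; an illustration of the general idea, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
q = rng.normal(size=d)                 # question embedding
evidence = rng.normal(size=(5, d))     # one embedding per piece of evidence
candidate_of = [0, 0, 1, 1, 2]         # candidate answer each piece supports

relevance = evidence @ q               # dot-product relevance to the question
support = np.zeros(3)                  # aggregated support per candidate
for r, c in zip(relevance, candidate_of):
    support[c] += max(r, 0.0)          # accumulate positive evidence only
print(int(support.argmax()))           # highest-supported candidate wins
```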