Humans gather information through conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. We introduce CoQA, a novel dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. We evaluate strong dialogue and reading comprehension models on CoQA. The best system obtains an F1 score of 65.4%, which is 23.4 points behind human performance (88.8%), indicating there is ample room for improvement. We present CoQA as a challenge to the community at https://stanfordnlp.github.io/coqa.
We present QuAC, a dataset for Question Answering in Context that contains 14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of free-form questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation. We also report results for a number of reference models, including a recent state-of-the-art reading comprehension architecture extended to model dialog context. Our best model underperforms humans by 20 F1, suggesting that there is significant room for future work on this data. Dataset, baseline, and leaderboard available at http://quac.ai.
To enable building and testing models for long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with English context passages that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Current models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).
Conversational question answering (CQA) systems aim to provide natural-language answers to users in information-seeking conversations. Existing CQA benchmarks compare models against pre-collected human-human conversations, using the ground-truth answers provided in the conversation history. It remains unclear whether we can rely on this static evaluation for model development, and whether current systems can generalize well to real-world human-machine conversations. In this work, we conduct a large-scale human evaluation of state-of-the-art CQA systems, where human evaluators converse with the models and judge the correctness of their answers. We find that the distribution of human-machine conversations differs drastically from that of human-human conversations, and that human evaluation and gold-history evaluation disagree in terms of model ranking. We further investigate how to improve automatic evaluation, and propose a question rewriting mechanism based on predicted history that correlates better with human judgments. Finally, we discuss the impact of various modeling strategies and future directions towards better conversational question answering systems.
We present RACE, a new dataset for benchmark evaluation of methods in the reading comprehension task. Collected from the English exams for middle and high school Chinese students aged between 12 and 18, RACE consists of nearly 28,000 passages and nearly 100,000 questions generated by human experts (English instructors), and covers a variety of topics which are carefully designed for evaluating the students' ability in understanding and reasoning. In particular, the proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of the state-of-the-art models (43%) and the ceiling human performance (95%). We hope this new dataset can serve as a valuable resource for research and evaluation in machine comprehension. The dataset is freely available at http://www.cs.cmu.edu/~glai1/data/race/ and the code is available at https://github.com/qizhex/RACE_AR_baselines
We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com.
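The F1 scores quoted throughout these abstracts (for SQuAD here, and for CoQA and QuAC above) are token-overlap F1 between a predicted answer string and a gold answer string. A minimal sketch of that metric, assuming plain whitespace tokenization and lowercasing rather than the official evaluation script's full normalization (which also strips articles and punctuation):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Count tokens shared by the two answers, with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in the south of France", "the south of France"))  # ~0.889
```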
Research on question answering datasets and models has received a great deal of attention in the research community, and many researchers have released their own question answering datasets along with models. We have seen tremendous progress in this research area. The purpose of this survey is to identify, summarize, and analyze the existing datasets released by these researchers, especially non-English datasets, as well as resources such as research code and evaluation metrics. In this paper, we review question answering datasets available in French, German, Japanese, Chinese, Arabic, and Russian, in addition to English, as well as multilingual and cross-lingual question answering datasets.
Multi-hop machine reading comprehension is a challenging task that aims to answer a question based on disjoint pieces of information spread across different passages. Evaluation metrics and datasets are a vital part of multi-hop MRC, because it is not possible to train and evaluate models without them; moreover, the challenges posed by datasets are often an important motivation for improving existing models. Given the increasing attention to this field, it is necessary and worthwhile to review them in detail. This study aims to present a comprehensive survey of recent advances in multi-hop MRC evaluation metrics and datasets. We first present the definition of the multi-hop MRC problem, then investigate the evaluation metrics with respect to their multi-hop aspects. We also review in detail 15 multi-hop datasets published from 2017 to 2022, and provide a comprehensive analysis at the end. Finally, open issues in this field are discussed.
The amount of information available on the internet has grown over the past decade. This digitization has created a need for automated answering systems that can extract useful information from redundant and scattered knowledge sources. Such systems are designed to use natural language understanding (NLU) to surface the most relevant answer to a user query from these vast knowledge sources, and thus fall within the field of question answering (QA). Question answering involves, but is not limited to, steps such as mapping a user question onto a relevant query, retrieving relevant information, and finding the best-suited answer from the retrieved information. Recent improvements in deep learning models have brought convincing performance gains on all of these tasks. In this review, research directions in the QA field are analyzed according to question type, answer type, source of evidence and answers, and modeling approach. This is followed by open challenges in areas such as automatic question generation, similarity detection, and low resource availability for some languages. Finally, a survey of available datasets and evaluation measures is presented.
We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises 1,010,916 anonymized questions sampled from Bing's search query logs, each with a human-generated answer, plus 182,669 answers that were completely rewritten by humans. In addition, the dataset contains 8,841,823 passages, extracted from 3,563,535 web documents retrieved by Bing, that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answer at all. Using this dataset, we propose three tasks of varying difficulty: (i) predict whether a question is answerable given a set of context passages, and extract and synthesize the answer as a human would; (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context; and (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguish MS MARCO from other well-known publicly available datasets for machine reading comprehension and question answering. We believe that the scale and real-world nature of this dataset make it attractive for benchmarking machine reading comprehension and question-answering models.
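For the third task, passage ranking, MS MARCO submissions are conventionally scored with mean reciprocal rank over the top 10 retrieved passages (MRR@10). A minimal sketch of that metric, assuming each query comes with a set of relevant passage ids:

```python
def mrr_at_k(rankings: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Mean reciprocal rank: for each query, take 1/rank of the first
    relevant passage within the top k, or 0 if none appears there."""
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        for rank, pid in enumerate(ranked_ids[:k], start=1):
            if pid in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(rankings)

# One query whose relevant passage is ranked second -> MRR = 0.5
print(mrr_at_k([["p3", "p7", "p1"]], [{"p7"}]))
```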
In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.
In information-seeking conversations, users converse with an agent to ask a series of questions that are often under- or over-specified. An ideal agent would first identify that it is in such a situation by searching its underlying knowledge sources, and then interact appropriately with the user to resolve it. However, most existing studies either fail to incorporate such agent-side initiative or do so artificially. In this work, we present INSCIT (pronounced "insight"), a dataset for information-seeking conversations with mixed-initiative interactions. It contains 4.7K user-agent turns from 805 human-human conversations, in which the agent searches Wikipedia and either asks for clarification or provides relevant information to address user queries. We define two subtasks, namely evidence passage identification and response generation, as well as a new human evaluation protocol to assess model performance. We report results for two strong baselines based on state-of-the-art models for conversational knowledge identification and open-domain question answering. Both models significantly underperform humans and fail to generate coherent and informative responses, suggesting ample room for improvement in future research.
Question answering (QA) is one of the most important natural language processing (NLP) tasks. It aims to use NLP techniques to generate a corresponding answer to a given question based on large-scale unstructured corpora. With the development of deep learning, increasingly challenging QA datasets are being proposed, and many new methods for solving them are emerging. In this paper, we survey influential QA datasets released in the deep learning era. Specifically, we first introduce the two most common QA tasks, textual question answering and visual question answering, covering the most representative datasets for each, and then outline some current challenges of QA research.
While substantial progress has been made in text comprehension through simple factoid question answering, a more holistic understanding of discourse remains a major challenge. Someone critically reflecting on a text will pose curiosity-driven, often open-ended questions, which reflect a deep understanding of the content and require complex reasoning to answer. A key challenge in building and evaluating models for this type of discourse comprehension is the lack of annotated data, especially since finding the answers to such questions (which may not be answerable at all) imposes a high cognitive load on annotators. This paper presents a new paradigm that enables scalable data collection targeting the comprehension of news documents, viewing these questions through the lens of discourse. The resulting corpus, DCQA (Discourse Comprehension by Question Answering), consists of 22,430 question-answer pairs across 607 English documents. DCQA captures both discourse and semantic links between sentences in the form of free-form, open-ended questions. On an evaluation set annotated with questions from the INQUISITIVE dataset, we show that DCQA provides valuable supervision for answering open-ended questions. We also design pre-training methods utilizing existing question-answering resources, and use synthetic data to accommodate unanswerable questions.
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. We also present two baseline algorithms: a feature-based classifier and a state-of-the-art neural network that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that TriviaQA is a challenging testbed that is worth significant future study.
With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output shares only verifiable information about the external world. In this work, we present a new evaluation framework, Attributable to Identified Sources (AIS), for assessing the output of natural language generation models when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline that allows annotators to evaluate model output appropriately according to the AIS guidelines. We empirically validate this approach via human evaluation studies on three generation datasets (two in the conversational QA domain and one in summarization), showing that AIS can serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release the guidelines for the human evaluation studies.
In open-domain dialogue, intelligent agents should exhibit the use of knowledge; however, there are few convincing demonstrations of this to date. The most popular sequence to sequence models typically "generate and hope" generic utterances that can be memorized in the weights of the model when mapping from input utterance(s) to output, rather than employing recalled knowledge as context. Use of knowledge has so far proved difficult, in part because of the lack of a supervised learning benchmark task which exhibits knowledgeable open dialogue with clear grounding. To that end we collect and release a large dataset with conversations directly grounded with knowledge retrieved from Wikipedia. We then design architectures capable of retrieving knowledge, reading and conditioning on it, and finally generating natural responses. Our best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while our new benchmark allows for measuring further improvements in this important research direction.
Most works on modeling the conversation history in conversational question answering (CQA) report a main result on common CQA benchmarks. While existing models show impressive results on CQA leaderboards, it remains unclear whether they are robust to shifts in setting (sometimes to more realistic ones), in training data size (e.g., from large to small sets), and in domain. In this work, we design and conduct the first large-scale robustness study of history modeling approaches for CQA. We find that high benchmark scores do not necessarily translate to strong robustness, and that various methods perform very differently under different settings. Equipped with the insights from our study, we design a novel prompt-based history modeling approach and demonstrate its strong robustness across various settings. Our approach is inspired by existing methods that highlight historic answers in the passage. However, instead of highlighting by modifying the passage token embeddings, we add textual prompts directly in the passage text. Our approach is simple, easy to plug into practically any model, and highly effective, so we recommend it as a starting point for future model developers. We also hope that our study and insights will raise awareness of the importance of robustness-focused evaluation, in addition to obtaining high leaderboard scores, leading to better CQA systems.
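The description suggests a mechanism along the following lines: given the answer spans from previous turns, insert marker strings around those spans in the passage text itself before the reader model sees it. A minimal sketch under that reading; the `<H{i}>` marker format and the helper name are hypothetical illustration choices, not the authors' implementation:

```python
def add_history_prompts(passage: str, history_answers: list[str]) -> str:
    """Mark previous-turn answers directly in the passage text, so the
    model sees them as ordinary tokens rather than modified embeddings.

    The '<Hi>'/'</Hi>' markers are an assumed format for illustration.
    """
    for i, answer in enumerate(history_answers, start=1):
        if answer in passage:
            passage = passage.replace(answer, f"<H{i}> {answer} </H{i}>", 1)
    return passage

passage = "Alice moved to Paris in 2019. She works at a museum."
print(add_history_prompts(passage, ["Paris", "a museum"]))
# Alice moved to <H1> Paris </H1> in 2019. She works at <H2> a museum </H2>.
```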
This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
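A minimal sketch of a retriever in this spirit, hashing unigram and bigram features into a fixed-width vector and scoring articles by TF-IDF similarity; the hash width, tokenizer, and cosine scoring here are illustrative choices, not the paper's exact configuration:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "Hamlet is a tragedy written by William Shakespeare.",
    "Paris is the capital and most populous city of France.",
]

# Hash unigrams and bigrams into a fixed-width sparse vector,
# then reweight the raw counts with TF-IDF.
hasher = HashingVectorizer(ngram_range=(1, 2), n_features=2**20,
                           alternate_sign=False, norm=None)
tfidf = TfidfTransformer()
article_vecs = tfidf.fit_transform(hasher.transform(articles))

# Score a query against every article and return the best match.
query_vec = tfidf.transform(hasher.transform(["who wrote Hamlet"]))
scores = cosine_similarity(query_vec, article_vecs)[0]
print(articles[scores.argmax()])  # the Shakespeare article
```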
Existing datasets that contain boolean questions, such as BoolQ and TyDi QA, provide the user with a yes/no response to the question. However, a one-word response is not sufficient for an explainable system. We promote explainability by releasing a new set of annotations marking the supporting evidence in the existing TyDi QA and BoolQ datasets. We show that our annotations can be used to train models that extract improved evidence spans compared to models relying on existing resources. We confirm our findings with a user study which shows that our extracted evidence spans enhance the user experience. We also provide further insight into the challenges of answering boolean questions, such as passages containing conflicting yes and no answers, and varying degrees of evidence prediction.