虽然通过简单的因素问题回答,文本理解的大量进展,但更加全面理解话语仍然存在重大挑战。批判性地反映出文本的人将造成好奇心驱动,通常是开放的问题,这反映了对内容的深刻理解,并要求复杂的推理来回答。建立和评估这种类型的话语理解模型的关键挑战是缺乏注释数据,特别是因为找到了这些问题的答案(可能根本不回答),需要高度的注释载荷的高认知负荷。本文提出了一种新的范式,使可扩展的数据收集能够针对新闻文件的理解,通过话语镜头查看这些问题。由此产生的语料库DCQA(疑问回答的话语理解)包括在607名英语文件中的22,430个问题答案对组成。 DCQA以自由形式,开放式问题的形式捕获句子之间的话语和语义链接。在评估集中,我们向问题上的问题提交了来自好奇数据集的问题,我们表明DCQA提供了有价值的监督,以回答开放式问题。我们还在使用现有的问答资源设计预训练方法,并使用合成数据来适应不可批售的问题。
translated by 谷歌翻译
Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HOTPOTQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HOTPOTQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
translated by 谷歌翻译
为了实现长文档理解的构建和测试模型,我们引入质量,具有中文段的多项选择QA DataSet,具有约5,000个令牌的平均长度,比典型的当前模型更长。与经过段落的事先工作不同,我们的问题是由阅读整个段落的贡献者编写和验证的,而不是依赖摘要或摘录。此外,只有一半的问题是通过在紧缩时间限制下工作的注释器来应答,表明略读和简单的搜索不足以一直表现良好。目前的模型在此任务上表现不佳(55.4%),并且落后于人类性能(93.5%)。
translated by 谷歌翻译
随着近期自然语言生成(NLG)模型的各种应用程序的改进,它变得必须具有识别和评估NLG输出是否仅共享关于外部世界的可验证信息的手段。在这项工作中,我们提出了一个归属于识别的来源(AIS)的新评估框架,用于评估自然语言生成模型的输出,当这种输出涉及外部世界时。我们首先定义AIS,并引入两级注释管道,用于允许注释器根据AIS指南适当地评估模型输出。通过人为评估研究,我们在三个代数据集(会话QA域中的两个中和总结一下,概括地验证了这种方法,表明AIS可以作为测量模型生成的语句是否支持基础来源的常见框架。我们释放人类评估研究指南。
translated by 谷歌翻译
We present QuAC, a dataset for Question Answering in Context that contains 14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation. We also report results for a number of reference models, including a recently state-ofthe-art reading comprehension architecture extended to model dialog context. Our best model underperforms humans by 20 F1, suggesting that there is significant room for future work on this data. Dataset, baseline, and leaderboard available at http://quac.ai.
translated by 谷歌翻译
Humans gather information through conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. We introduce CoQA, a novel dataset for building Conversational Question Answering systems. 1 Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. We evaluate strong dialogue and reading comprehension models on CoQA. The best system obtains an F1 score of 65.4%, which is 23.4 points behind human performance (88.8%), indicating there is ample room for improvement. We present CoQA as a challenge to the community at https://stanfordnlp. github.io/coqa.
translated by 谷歌翻译
Question Answering (QA) is a growing area of research, often used to facilitate the extraction of information from within documents. State-of-the-art QA models are usually pre-trained on domain-general corpora like Wikipedia and thus tend to struggle on out-of-domain documents without fine-tuning. We demonstrate that synthetic domain-specific datasets can be generated easily using domain-general models, while still providing significant improvements to QA performance. We present two new tools for this task: A flexible pipeline for validating the synthetic QA data and training downstream models on it, and an online interface to facilitate human annotation of this generated data. Using this interface, crowdworkers labelled 1117 synthetic QA pairs, which we then used to fine-tune downstream models and improve domain-specific QA performance by 8.75 F1.
translated by 谷歌翻译
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K questionanswer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. We also present two baseline algorithms: a featurebased classifier and a state-of-the-art neural network, that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that Trivi-aQA is a challenging testbed that is worth significant future study. 1
translated by 谷歌翻译
有关应答数据集和模型的研究在研究界中获得了很多关注。其中许多人释放了自己的问题应答数据集以及模型。我们在该研究领域看到了巨大的进展。本调查的目的是识别,总结和分析许多研究人员释放的现有数据集,尤其是在非英语数据集以及研究代码和评估指标等资源中。在本文中,我们审查了问题应答数据集,这些数据集可以以法语,德语,日语,中文,阿拉伯语,俄语以及多语言和交叉的问答数据集进行英语。
translated by 谷歌翻译
In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.
translated by 谷歌翻译
自Bert(Devlin等,2018)以来,学习上下文化的单词嵌入一直是NLP中的事实上的标准。然而,学习上下文化短语嵌入的进展受到缺乏人类通知的语句基准基准的阻碍。为了填补这一空白,我们提出了PIC- 〜28K名词短语的数据集伴随着它们的上下文Wikipedia页面,以及一套三个任务,这些任务增加了评估短语嵌入质量的难度。我们发现,在我们的数据集中进行的培训提高了排名模型的准确性,并明显地将问题答案(QA)模型推向了近人类的准确性,而在语义搜索上,鉴于询问短语和段落,在语义搜索上是95%的精确匹配(EM)。有趣的是,我们发现这种令人印象深刻的性能的证据是因为质量检查模型学会了更好地捕获短语的共同含义,而不管其实际背景如何。也就是说,在我们的短语中歧义歧义(PSD)任务上,SOTA模型的精度大大下降(60%EM),在两个不同情况下未能区分相同短语的两种不同感觉。在我们的3任任务基准测试中的进一步结果表明,学习上下文化的短语嵌入仍然是一个有趣的开放挑战。
translated by 谷歌翻译
There are many potential benefits to news readers accessing diverse sources. Modern news aggregators do the hard work of organizing the news, offering readers a plethora of source options, but choosing which source to read remains challenging. We propose a new framework to assist readers in identifying source differences and gaining an understanding of news coverage diversity. The framework is based on the generation of Discord Questions: questions with a diverse answer pool, explicitly illustrating source differences. To assemble a prototype of the framework, we focus on two components: (1) discord question generation, the task of generating questions answered differently by sources, for which we propose an automatic scoring method, and create a model that improves performance from current question generation (QG) methods by 5%, (2) answer consolidation, the task of grouping answers to a question that are semantically similar, for which we collect data and repurpose a method that achieves 81% balanced accuracy on our realistic test set. We illustrate the framework's feasibility through a prototype interface. Even though model performance at discord QG still lags human performance by more than 15%, generated questions are judged to be more interesting than factoid questions and can reveal differences in the level of detail, sentiment, and reasoning of sources in news coverage.
translated by 谷歌翻译
包含布尔问题的现有数据集(如Booolq和Tydi QA)为用户提供对问题的是/否响应。然而,一个单词响应不足以可说明的系统。我们通过释放一组标记现有TYDI QA和Booolq数据集的证据的新辅助来促进解释性。我们表明,与依赖现有资源的模型相比,我们的注释可用于培训提取改进证据跨度的模型。我们通过用户学习确认我们的调查结果表明我们提取的证据涵盖了增强用户体验。我们还提供进一步了解回答布尔问题的挑战,例如包含冲突的是和无答案的段落,以及预测证据的不同程度。
translated by 谷歌翻译
最近,对建立问题的兴趣越来越兴趣,其中跨多种模式(如文本和图像)的原因。但是,使用图像的QA通常仅限于从预定义的选项集中挑选答案。此外,在现实世界中的图像,特别是在新闻中,具有与文本共同参考的对象,其中来自两个模态的互补信息。在本文中,我们提出了一种新的QA评估基准,并在新闻文章中提出了1,384个问题,这些文章需要跨媒体接地图像中的物体接地到文本上。具体地,该任务涉及需要推理图像标题对的多跳问题,以识别接地的视觉对象,然后从新闻正文文本中预测跨度以回答问题。此外,我们介绍了一种新颖的多媒体数据增强框架,基于跨媒体知识提取和合成问题答案生成,自动增强可以为此任务提供弱监管的数据。我们在我们的基准测试中评估了基于管道和基于端到端的预先预测的多媒体QA模型,并表明他们实现了有希望的性能,而在人类性能之后大幅滞后,因此留下了未来工作的大型空间,以便在这一具有挑战性的新任务上的工作。
translated by 谷歌翻译
学术研究是解决以前从未解决过的问题的探索活动。通过这种性质,每个学术研究工作都需要进行文献审查,以区分其Novelties尚未通过事先作品解决。在自然语言处理中,该文献综述通常在“相关工作”部分下进行。鉴于研究文件的其余部分和引用的论文列表,自动相关工作生成的任务旨在自动生成“相关工作”部分。虽然这项任务是在10年前提出的,但直到最近,它被认为是作为科学多文件摘要问题的变种。然而,即使在今天,尚未标准化了自动相关工作和引用文本生成的问题。在这项调查中,我们进行了一个元研究,从问题制定,数据集收集,方法方法,绩效评估和未来前景的角度来比较相关工作的现有文献,以便为读者洞察到国家的进步 - 最内容的研究,以及如何进行未来的研究。我们还调查了我们建议未来工作要考虑整合的相关研究领域。
translated by 谷歌翻译
This paper proposes to tackle opendomain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
translated by 谷歌翻译
问题回答(QA)是最重要的自然语言处理(NLP)任务之一。它旨在使用NLP技术根据大规模的非结构化语料库生成对给定问题的相应答案。随着深度学习的发展,正在提出越来越具有挑战性的质量检查数据集,并且许多用于解决它们的新方法也正在出现。在本文中,我们研究了在深度学习时代发布的有影响力的质量检查数据集。具体来说,我们首先引入两个最常见的质量检查任务 - 文本问题答案和视觉问题 - 分别涵盖最具代表性的数据集,然后给出质量检查研究的一些当前挑战。
translated by 谷歌翻译
In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages from the Indic sub-continent belonging to four different families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we only use human annotators to curate or translate\footnote{for IndicXParaphrase, where an automatic translation system is used, a second human verification and correction step is done.} our datasets. To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME to show that it outperforms existing multilingual language models such as XLM-R and MuRIL.
translated by 谷歌翻译
We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com.
translated by 谷歌翻译
As large language models (LLMs) grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition, leaving significant room for future improvements.
translated by 谷歌翻译