We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K questionanswer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. We also present two baseline algorithms: a featurebased classifier and a state-of-the-art neural network, that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that Trivi-aQA is a challenging testbed that is worth significant future study. 1
translated by 谷歌翻译
Multi-hop Machine reading comprehension is a challenging task with aim of answering a question based on disjoint pieces of information across the different passages. The evaluation metrics and datasets are a vital part of multi-hop MRC because it is not possible to train and evaluate models without them, also, the proposed challenges by datasets often are an important motivation for improving the existing models. Due to increasing attention to this field, it is necessary and worth reviewing them in detail. This study aims to present a comprehensive survey on recent advances in multi-hop MRC evaluation metrics and datasets. In this regard, first, the multi-hop MRC problem definition will be presented, then the evaluation metrics based on their multi-hop aspect will be investigated. Also, 15 multi-hop datasets have been reviewed in detail from 2017 to 2022, and a comprehensive analysis has been prepared at the end. Finally, open issues in this field have been discussed.
translated by 谷歌翻译
This paper proposes to tackle opendomain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
translated by 谷歌翻译
有关应答数据集和模型的研究在研究界中获得了很多关注。其中许多人释放了自己的问题应答数据集以及模型。我们在该研究领域看到了巨大的进展。本调查的目的是识别,总结和分析许多研究人员释放的现有数据集,尤其是在非英语数据集以及研究代码和评估指标等资源中。在本文中,我们审查了问题应答数据集,这些数据集可以以法语,德语,日语,中文,阿拉伯语,俄语以及多语言和交叉的问答数据集进行英语。
translated by 谷歌翻译
Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HOTPOTQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HOTPOTQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
translated by 谷歌翻译
We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com.
translated by 谷歌翻译
问题回答(QA)是最重要的自然语言处理(NLP)任务之一。它旨在使用NLP技术根据大规模的非结构化语料库生成对给定问题的相应答案。随着深度学习的发展,正在提出越来越具有挑战性的质量检查数据集,并且许多用于解决它们的新方法也正在出现。在本文中,我们研究了在深度学习时代发布的有影响力的质量检查数据集。具体来说,我们首先引入两个最常见的质量检查任务 - 文本问题答案和视觉问题 - 分别涵盖最具代表性的数据集,然后给出质量检查研究的一些当前挑战。
translated by 谷歌翻译
自动问题应答(QA)系统的目的是以时间有效的方式向用户查询提供答案。通常在数据库(或知识库)或通常被称为语料库的文件集合中找到答案。在过去的几十年里,收购知识的扩散,因此生物医学领域的新科学文章一直是指数增长。因此,即使对于领域专家,也难以跟踪域中的所有信息。随着商业搜索引擎的改进,用户可以在某些情况下键入其查询并获得最相关的一小组文档,以及在某些情况下从文档中的相关片段。但是,手动查找所需信息或答案可能仍然令人疑惑和耗时。这需要开发高效的QA系统,该系统旨在为用户提供精确和精确的答案提供了生物医学领域的自然语言问题。在本文中,我们介绍了用于开发普通域QA系统的基本方法,然后彻底调查生物医学QA系统的不同方面,包括使用结构化数据库和文本集合的基准数据集和几种提出的方​​法。我们还探讨了当前系统的局限性,并探索潜在的途径以获得进一步的进步。
translated by 谷歌翻译
We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises of 1,010,916 anonymized questionssampled from Bing's search query logs-each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages-extracted from 3,563,535 web documents retrieved by Bing-that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguishes MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset makes it attractive for benchmarking machine reading comprehension and question-answering models.
translated by 谷歌翻译
In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from.The claims are classified as SUPPORTED, RE-FUTED or NOTENOUGHINFO by annotators achieving 0.6841 in Fleiss κ. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.
translated by 谷歌翻译
Humans gather information through conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. We introduce CoQA, a novel dataset for building Conversational Question Answering systems. 1 Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. We evaluate strong dialogue and reading comprehension models on CoQA. The best system obtains an F1 score of 65.4%, which is 23.4 points behind human performance (88.8%), indicating there is ample room for improvement. We present CoQA as a challenge to the community at https://stanfordnlp. github.io/coqa.
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译
Extractive reading comprehension systems can often locate the correct answer to a question in a context document, but they also tend to make unreliable guesses on questions for which the correct answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to identify. To address these weaknesses, we present SQuAD 2.0, the latest version of the Stanford Question Answering Dataset (SQuAD). SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. SQuAD 2.0 is a challenging natural language understanding task for existing models: a strong neural system that gets 86% F1 on SQuAD 1.1 achieves only 66% F1 on SQuAD 2.0.
translated by 谷歌翻译
虽然通过简单的因素问题回答,文本理解的大量进展,但更加全面理解话语仍然存在重大挑战。批判性地反映出文本的人将造成好奇心驱动,通常是开放的问题,这反映了对内容的深刻理解,并要求复杂的推理来回答。建立和评估这种类型的话语理解模型的关键挑战是缺乏注释数据,特别是因为找到了这些问题的答案(可能根本不回答),需要高度的注释载荷的高认知负荷。本文提出了一种新的范式,使可扩展的数据收集能够针对新闻文件的理解,通过话语镜头查看这些问题。由此产生的语料库DCQA(疑问回答的话语理解)包括在607名英语文件中的22,430个问题答案对组成。 DCQA以自由形式,开放式问题的形式捕获句子之间的话语和语义链接。在评估集中,我们向问题上的问题提交了来自好奇数据集的问题,我们表明DCQA提供了有价值的监督,以回答开放式问题。我们还在使用现有的问答资源设计预训练方法,并使用合成数据来适应不可批售的问题。
translated by 谷歌翻译
近年来,低资源机器阅读理解(MRC)取得了重大进展,模型在各种语言数据集中获得了显着性能。但是,这些模型都没有为URDU语言定制。这项工作探讨了通过将机器翻译的队伍与来自剑桥O级书籍的Wikipedia文章和Urdu RC工作表组合的人生成的样本组合了机器翻译的小队,探讨了乌尔通题的半自动创建了数据集(UQuad1.0)。 UQuad1.0是一个大型URDU数据集,用于提取机器阅读理解任务,由49K问题答案成对组成,段落和回答格式。在UQuad1.0中,通过众包的原始SquAd1.0和大约4000对的机器翻译产生45000对QA。在本研究中,我们使用了两种类型的MRC型号:基于规则的基线和基于先进的变换器的模型。但是,我们发现后者优于其他人;因此,我们已经决定专注于基于变压器的架构。使用XLMroberta和多语言伯特,我们分别获得0.66和0.63的F1得分。
translated by 谷歌翻译
Machine reading comprehension (MRC) is a long-standing topic in natural language processing (NLP). The MRC task aims to answer a question based on the given context. Recently studies focus on multi-hop MRC which is a more challenging extension of MRC, which to answer a question some disjoint pieces of information across the context are required. Due to the complexity and importance of multi-hop MRC, a large number of studies have been focused on this topic in recent years, therefore, it is necessary and worth reviewing the related literature. This study aims to investigate recent advances in the multi-hop MRC approaches based on 31 studies from 2018 to 2022. In this regard, first, the multi-hop MRC problem definition will be introduced, then 31 models will be reviewed in detail with a strong focus on their multi-hop aspects. They also will be categorized based on their main techniques. Finally, a fine-grain comprehensive comparison of the models and techniques will be presented.
translated by 谷歌翻译
We present QuAC, a dataset for Question Answering in Context that contains 14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation. We also report results for a number of reference models, including a recently state-ofthe-art reading comprehension architecture extended to model dialog context. Our best model underperforms humans by 20 F1, suggesting that there is significant room for future work on this data. Dataset, baseline, and leaderboard available at http://quac.ai.
translated by 谷歌翻译
过去十年互联网上可用的信息和信息量增加。该数字化导致自动应答系统需要从冗余和过渡知识源中提取富有成效的信息。这些系统旨在利用自然语言理解(NLU)从此巨型知识源到用户查询中最突出的答案,从而取决于问题答案(QA)字段。问题答案涉及但不限于用户问题映射的步骤,以获取相关查询,检索相关信息,从检索到的信息等找到最佳合适的答案等。当前对深度学习模型的当前改进估计所有这些任务的令人信服的性能改进。在本综述工作中,根据问题的类型,答案类型,证据答案来源和建模方法进行分析QA场的研究方向。此细节随后是自动问题生成,相似性检测和语言的低资源可用性等领域的开放挑战。最后,提出了对可用数据集和评估措施的调查。
translated by 谷歌翻译
最近的开放式域问题回答表明,新颖的测试问题之间的模型性能和那些在很大程度上与培训问题重叠的模型性能存在很大差异。然而,目前尚不清楚新颖的问题的哪些方面使他们成为挑战。在进行系统泛化的研究时,我们根据三个类别介绍和注释问题,这些类别测量了不同的水平和概括的种类:培训设定重叠,组成泛化(Comp-Gen)和新颖的实体概括(新实体)。在评估六个流行的参数和非参数模型时,我们发现,对于既定的自然问题和TriviaQA数据集,即使是Comp-Gen /新颖实体的最强的模型性能也是13.1 / 5.4%和9.6 / 1.5%,而与此相比降低对于完整的测试集 - 表示这些类型的问题所带来的挑战。此外,我们表明,虽然非参数模型可以相对良好地处理含有新颖实体的问题,但它们与那些需要组成泛化的问题斗争。最后,我们发现关键问题是:来自检索组件的级联错误,问题模式的频率和实体的频率。
translated by 谷歌翻译
在该职位论文中,我们提出了一种新方法,以基于问题的产生和实体链接来生成文本的知识库(KB)。我们认为,所提出的KB类型具有传统符号KB的许多关键优势:尤其是由小型模块化组件组成,可以在组合上合并以回答复杂的查询,包括涉及“多跳跃”的关系查询和查询。“推论。但是,与传统的KB不同,该信息商店与常见的用户信息需求相符。
translated by 谷歌翻译