Recent work on open domain question answering (QA) assumes strong supervision of the supporting evidence and/or assumes a blackbox information retrieval (IR) system to retrieve evidence candidates. We argue that both are suboptimal, since gold evidence is not always available, and QA is fundamentally different from IR. We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system. In this setting, evidence retrieval from all of Wikipedia is treated as a latent variable. Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets. On datasets where the questioner already knows the answer, a traditional IR system such as BM25 is sufficient. On datasets where a user is genuinely seeking an answer, we show that learned retrieval is crucial, outperforming BM25 by up to 19 points in exact match.
translated by 谷歌翻译
Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dualencoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks. 1 * Equal contribution 1 The code and trained models have been released at https://github.com/facebookresearch/DPR.
translated by 谷歌翻译
我们提出了一种用于在生成答案时将信息与多个检索文件中的信息组合的可检索增强的开放式开放式开放式开放域问题训练方法。我们将检索决策模拟作为相关文件集的潜在变量。由于通过对所检索的文件集的边缘化,因此使用期望最大化算法估计这一点。我们迭代地估计我们的潜在变量的价值(给定问题的这些相关文档集),然后使用此估计来更新检索器和读取器参数。我们假设这种端到端的训练允许训练信号流到读者,然后比上演明智的训练更好地流到猎犬。这导致检索器能够为问题和读者选择更多相关文档,这些文件在更准确的文档中培训以生成答案。三个基准数据集的实验表明,我们所提出的方法优于所有现有的相当大小的方法2-3%绝对精确匹配点,实现了新的最先进的结果。我们的结果还展示了学习检索以改善答复的可行性,而无明确监督检索决策。
translated by 谷歌翻译
大型语言模型在各种任务上显示出令人印象深刻的几次结果。但是,当知识是此类结果的关键时,就像问题回答和事实检查之类的任务一样,似乎需要存储知识的大量参数计数。众所周知,检索增强模型可以在不需要多个参数的情况下在知识密集的任务上表现出色,但是目前尚不清楚它们是否在几个弹药设置中工作。在这项工作中,我们介绍了地图集,这是一个经过精心设计和预先训练的增强语言模型,能够通过很少的培训示例学习知识密集型任务。我们对包括MMLU,苏格兰短裙和归类等各种任务进行评估,并研究文档索引内容的影响,表明它可以很容易地进行更新。值得注意的是,在自然问题上仅使用64个示例在自然问题上达到超过42 \%的准确性,尽管参数少了50倍,但比540B参数模型的表现优于540b参数模型。
translated by 谷歌翻译
This paper proposes to tackle opendomain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
translated by 谷歌翻译
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a;Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
translated by 谷歌翻译
知识密集型语言任务(苏格兰信)通常需要大量信息来提供正确的答案。解决此问题的一种流行范式是将搜索系统与机器读取器相结合,前者检索支持证据,后者检查它们以产生答案。最近,读者组成部分在大规模预培养的生成模型的帮助下见证了重大进展。同时,搜索组件中的大多数现有解决方案都依赖于传统的``索引 - retrieve-then-Rank''管道,该管道遭受了巨大的内存足迹和端到端优化的困难。受到最新构建基于模型的IR模型的努力的启发,我们建议用新颖的单步生成模型替换传统的多步搜索管道,该模型可以极大地简化搜索过程并以端到端的方式进行优化。我们表明,可以通过一组经过适当设计的预训练任务来学习强大的生成检索模型,并被采用以通过进一步的微调来改善各种下游苏格兰短裙任务。我们将预训练的生成检索模型命名为Copusbrain,因为有关该语料库的所有信息均以其参数进行编码,而无需构造其他索引。经验结果表明,在苏格兰语基准上的检索任务并建立了新的最新性能,Copusbrain可以极大地超过强大的基准。我们还表明,在零农源和低资源设置下,科体班运行良好。
translated by 谷歌翻译
信息检索的任务是许多自然语言处理系统的重要组成部分,例如开放式域问题回答。尽管传统方法是基于手工制作的功能,但基于神经网络的连续表示最近获得了竞争结果。使用此类方法的一个挑战是获取监督数据以训练回猎犬模型,该模型对应于一对查询和支持文档。在本文中,我们提出了一种技术,以学习以知识蒸馏的启发,并不需要带注释的查询和文档对。我们的方法利用读者模型的注意分数,用于根据检索文档解决任务,以获取猎犬的合成标签。我们评估我们的方法回答,获得最新结果。
translated by 谷歌翻译
对于开放式域问题的密集检索已被证明通过在问题通道对的大型数据集上培训来实现令人印象深刻的性能。我们调查是否可以以自我监督的方式学习密集的检索,并有效地应用没有任何注释。我们观察到这种情况下的检索斗争的现有借用模型,并提出了一种设计用于检索的新预制方案:重复跨度检索。我们在文档中使用经常性跨度来创建用于对比学习的伪示例。由此产生的模型 - 蜘蛛 - 在广泛的ODQA数据集上没有任何示例,并且与BM25具有竞争力,具有强烈的稀疏基线。此外,蜘蛛通常优于DPR在其他数据集的问题上培训的DPR培训的强大基线。我们将蜘蛛与BM25结合的混合猎犬改进了所有数据集的组件,并且通常与域中DPR模型具有竞争力,这些模型培训数万例培训。
translated by 谷歌翻译
信息检索是自然语言处理中的重要组成部分,用于知识密集型任务,如问题应答和事实检查。最近,信息检索已经看到基于神经网络的密集检索器的出现,作为基于术语频率的典型稀疏方法的替代方案。这些模型在数据集和任务中获得了最先进的结果,其中提供了大型训练集。但是,它们不会很好地转移到没有培训数据的新域或应用程序,并且通常因未经监督的术语 - 频率方法(例如BM25)的术语频率方法而言。因此,自然问题是如果没有监督,是否有可能训练密集的索取。在这项工作中,我们探讨了对比学习的限制,作为培训无人监督的密集检索的一种方式,并表明它导致强烈的检索性能。更确切地说,我们在15个数据集中出现了我们的模型胜过BM25的Beir基准测试。此外,当有几千例的示例可用时,我们显示微调我们的模型,与BM25相比,这些模型导致强大的改进。最后,当在MS-Marco数据集上微调之前用作预训练时,我们的技术在Beir基准上获得最先进的结果。
translated by 谷歌翻译
We present SpaceQA, to the best of our knowledge the first open-domain QA system in Space mission design. SpaceQA is part of an initiative by the European Space Agency (ESA) to facilitate the access, sharing and reuse of information about Space mission design within the agency and with the public. We adopt a state-of-the-art architecture consisting of a dense retriever and a neural reader and opt for an approach based on transfer learning rather than fine-tuning due to the lack of domain-specific annotated data. Our evaluation on a test set produced by ESA is largely consistent with the results originally reported by the evaluated retrievers and confirms the need of fine tuning for reading comprehension. As of writing this paper, ESA is piloting SpaceQA internally.
translated by 谷歌翻译
问题答案(QA)是自然语言处理中最具挑战性的最具挑战性的问题之一(NLP)。问答(QA)系统试图为给定问题产生答案。这些答案可以从非结构化或结构化文本生成。因此,QA被认为是可以用于评估文本了解系统的重要研究区域。大量的QA研究致力于英语语言,调查最先进的技术和实现最先进的结果。然而,由于阿拉伯QA中的研究努力和缺乏大型基准数据集,在阿拉伯语问答进展中的研究努力得到了很大速度的速度。最近许多预先接受的语言模型在许多阿拉伯语NLP问题中提供了高性能。在这项工作中,我们使用四个阅读理解数据集来评估阿拉伯QA的最先进的接种变压器模型,它是阿拉伯语 - 队,ArcD,AQAD和TYDIQA-GoldP数据集。我们微调并比较了Arabertv2基础模型,ArabertV0.2大型型号和ARAElectra模型的性能。在最后,我们提供了一个分析,了解和解释某些型号获得的低绩效结果。
translated by 谷歌翻译
Systems for knowledge-intensive tasks such as open-domain question answering (QA) usually consist of two stages: efficient retrieval of relevant documents from a large corpus and detailed reading of the selected documents to generate answers. Retrievers and readers are usually modeled separately, which necessitates a cumbersome implementation and is hard to train and adapt in an end-to-end fashion. In this paper, we revisit this design and eschew the separate architecture and training in favor of a single Transformer that performs Retrieval as Attention (ReAtt), and end-to-end training solely based on supervision from the end QA task. We demonstrate for the first time that a single model trained end-to-end can achieve both competitive retrieval and QA performance, matching or slightly outperforming state-of-the-art separately trained retrievers and readers. Moreover, end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings, making our model a simple and adaptable solution for knowledge-intensive tasks. Code and models are available at https://github.com/jzbjyb/ReAtt.
translated by 谷歌翻译
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit nonparametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
translated by 谷歌翻译
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducable evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
translated by 谷歌翻译
使用来自表格(TableQA)的信息回答自然语言问题是最近的兴趣。在许多应用程序中,表未孤立,但嵌入到非结构化文本中。通常,通过将其部分与表格单元格内容或非结构化文本跨度匹配,并从任一源中提取答案来最佳地回答问题。这导致了HybridQA数据集引入的TextableQA问题的新空间。现有的表格表示对基于变换器的阅读理解(RC)架构的适应性未通过单个系统解决两个表示的不同模式。培训此类系统因对遥远监督的需求而进一步挑战。为了降低认知负担,培训实例通常包括问题和答案,后者匹配多个表行和文本段。这导致嘈杂的多实例培训制度不仅涉及表的行,而且涵盖了链接文本的跨度。我们通过提出Mitqa来回应这些挑战,这是一个新的TextableQA系统,明确地模拟了表行选择和文本跨度选择的不同但密切相关的概率空间。与最近的基线相比,我们的实验表明了我们的方法的优越性。该方法目前在HybridQA排行榜的顶部,并进行了一个试验集,在以前公布的结果上实现了对em和f1的21%的绝对改善。
translated by 谷歌翻译
我们介绍了Art,这是一种新的语料库级自动编码方法,用于培训密集检索模型,不需要任何标记的培训数据。密集的检索是开放域任务(例如Open QA)的核心挑战,在该任务中,最先进的方法通常需要大量的监督数据集,并具有自定义的硬性采矿和肯定式示例。相反,艺术品仅需要访问未配对的投入和输出(例如问题和潜在的答案文件)。它使用新的文档 - 重新定义自动编码方案,其中(1)输入问题用于检索一组证据文档,并且(2)随后使用文档来计算重建原始问题的概率。基于问题重建的检索培训可以有效地学习文档和问题编码器,以后可以将其纳入完整的QA系统中,而无需任何进一步的填充。广泛的实验表明,ART在多个QA检索基准测试基准上获得最先进的结果,并且仅来自预训练的语言模型的一般初始化,从而消除了对标记的数据和特定于任务的损失的需求。
translated by 谷歌翻译
多跳的推理(即跨两个或多个文档的推理)是NLP模型的关键要素,该模型利用大型语料库表现出广泛的知识。为了检索证据段落,多跳模型必须与整个啤酒花的快速增长的搜索空间抗衡,代表结合多个信息需求的复杂查询,并解决有关在训练段落之间跳出的最佳顺序的歧义。我们通过Baleen解决了这些问题,Baleen可以提高多跳检索的准确性,同时从多跳的训练信号中学习强大的训练信号的准确性。为了驯服搜索空间,我们提出了凝结的检索,该管道总结了每个跃点后检索到单个紧凑型上下文的管道。为了建模复杂的查询,我们引入了一个重点的后期相互作用检索器,该检索器允许同一查询表示的不同部分匹配不同的相关段落。最后,为了推断无序的训练段落中的跳跃依赖性,我们设计了潜在的跳跃订购,这是一种弱者的策略,在该策略中,受过训练的检索员本身选择了啤酒花的顺序。我们在检索中评估Baleen的两跳问答和多跳的要求验证,并确定最先进的绩效。
translated by 谷歌翻译
基于强大的预训练语言模型(PLM)的密集检索方法(DR)方法取得了重大进步,并已成为现代开放域问答系统的关键组成部分。但是,他们需要大量的手动注释才能进行竞争性,这是不可行的。为了解决这个问题,越来越多的研究作品最近着重于在低资源场景下改善DR绩效。这些作品在培训所需的资源和采用各种技术的资源方面有所不同。了解这种差异对于在特定的低资源场景下选择正确的技术至关重要。为了促进这种理解,我们提供了针对低资源DR的主流技术的彻底结构化概述。根据他们所需的资源,我们将技术分为三个主要类别:(1)仅需要文档; (2)需要文件和问题; (3)需要文档和提问对。对于每种技术,我们都会介绍其一般形式算法,突出显示开放的问题和利弊。概述了有希望的方向以供将来的研究。
translated by 谷歌翻译
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K questionanswer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. We also present two baseline algorithms: a featurebased classifier and a state-of-the-art neural network, that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that Trivi-aQA is a challenging testbed that is worth significant future study. 1
translated by 谷歌翻译