Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducable evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
translated by 谷歌翻译
Language models (LMs) now excel at many tasks such as few-shot learning, question answering, reasoning, and dialog. However, they sometimes generate unsupported or misleading content. A user cannot easily determine whether their outputs are trustworthy or not, because most LMs do not have any built-in mechanism for attribution to external evidence. To enable attribution while still preserving all the powerful advantages of recent generation models, we propose RARR (Retrofit Attribution using Research and Revision), a system that 1) automatically finds attribution for the output of any text generation model and 2) post-edits the output to fix unsupported content while preserving the original output as much as possible. When applied to the output of several state-of-the-art LMs on a diverse set of generation tasks, we find that RARR significantly improves attribution while otherwise preserving the original input to a much greater degree than previously explored edit models. Furthermore, the implementation of RARR requires only a handful of training examples, a large language model, and standard web search.
translated by 谷歌翻译
知识密集型任务,例如开放域问题答案(QA),需要访问大量的世界知识或领域知识。知识密集型任务的一种常见方法是采用检索到阅读的管道,该管道首先从诸如Wikipedia之类的外部语料库中检索少数相关的上下文文档,然后预测在检索文档的条件下得到答案。在本文中,我们提出了一种新的观点,可以通过用大型语言模型生成器代替文档检索器来解决知识密集型任务。我们称我们的方法生成-Read Read(GenRead),该方法首先提示大型语言模型根据给定问题生成上下文文档,然后读取生成的文档以产生最终答案。此外,我们提出了一种基于聚类的提示方法,该方法选择了不同的提示,从而产生了涵盖不同观点的生成文档,从而更好地回忆了可接受的答案。我们对三个不同的知识密集任务进行了广泛的实验,包括开放域质量检查,事实检查和对话系统。值得注意的是,GenRead在Triviaqa和WebQ上实现了71.6和54.4的精确匹配分数,显着超过了最先进的检索到+4.0和+3.9的最先进的dpr-fid,而无需从任何外部知识源中检索任何文档。最后,我们证明可以通过结合检索和生成来进一步提高模型性能。
translated by 谷歌翻译
随着近期自然语言生成(NLG)模型的各种应用程序的改进,它变得必须具有识别和评估NLG输出是否仅共享关于外部世界的可验证信息的手段。在这项工作中,我们提出了一个归属于识别的来源(AIS)的新评估框架,用于评估自然语言生成模型的输出,当这种输出涉及外部世界时。我们首先定义AIS,并引入两级注释管道,用于允许注释器根据AIS指南适当地评估模型输出。通过人为评估研究,我们在三个代数据集(会话QA域中的两个中和总结一下,概括地验证了这种方法,表明AIS可以作为测量模型生成的语句是否支持基础来源的常见框架。我们释放人类评估研究指南。
translated by 谷歌翻译
最近的开放式域问题回答表明,新颖的测试问题之间的模型性能和那些在很大程度上与培训问题重叠的模型性能存在很大差异。然而,目前尚不清楚新颖的问题的哪些方面使他们成为挑战。在进行系统泛化的研究时,我们根据三个类别介绍和注释问题,这些类别测量了不同的水平和概括的种类:培训设定重叠,组成泛化(Comp-Gen)和新颖的实体概括(新实体)。在评估六个流行的参数和非参数模型时,我们发现,对于既定的自然问题和TriviaQA数据集,即使是Comp-Gen /新颖实体的最强的模型性能也是13.1 / 5.4%和9.6 / 1.5%,而与此相比降低对于完整的测试集 - 表示这些类型的问题所带来的挑战。此外,我们表明,虽然非参数模型可以相对良好地处理含有新颖实体的问题,但它们与那些需要组成泛化的问题斗争。最后,我们发现关键问题是:来自检索组件的级联错误,问题模式的频率和实体的频率。
translated by 谷歌翻译
检索增强的代表在许多知识密集型的NLP任务中表现出最先进的表现,例如打开问题应答和事实验证。考虑到检索到的段落,这些模型训练以产生最终输出,这可能与原始查询无关,导致学习虚假线索或回答记忆。这项工作介绍了一种融入通道的证据性的方法 - 是否段落包含正确的证据来支持输出 - 培训发电机。我们介绍了一个多任务学习框架,共同生成最终输出并预测每个段落的证据性,利用新的任务不可行方法来获得{\ IT Silver}分证分性标签进行监督。我们在三个知识密集型任务中的五个数据集的实验表明,我们的新的证据引导发电机具有相同尺寸模型的直接对应的直接对应,并使Faviq-Ambig的最先进。我们将这些改进归因于辅助多任务学习和银证处分性挖掘技术。
translated by 谷歌翻译
Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dualencoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks. 1 * Equal contribution 1 The code and trained models have been released at https://github.com/facebookresearch/DPR.
translated by 谷歌翻译
大型语言模型在各种任务上显示出令人印象深刻的几次结果。但是,当知识是此类结果的关键时,就像问题回答和事实检查之类的任务一样,似乎需要存储知识的大量参数计数。众所周知,检索增强模型可以在不需要多个参数的情况下在知识密集的任务上表现出色,但是目前尚不清楚它们是否在几个弹药设置中工作。在这项工作中,我们介绍了地图集,这是一个经过精心设计和预先训练的增强语言模型,能够通过很少的培训示例学习知识密集型任务。我们对包括MMLU,苏格兰短裙和归类等各种任务进行评估,并研究文档索引内容的影响,表明它可以很容易地进行更新。值得注意的是,在自然问题上仅使用64个示例在自然问题上达到超过42 \%的准确性,尽管参数少了50倍,但比540B参数模型的表现优于540b参数模型。
translated by 谷歌翻译
我们微调GPT-3使用基于文本的Web浏览环境来回答长形问题,允许模型搜索和导航Web。通过建立任务,以便通过人类执行,我们能够使用模仿学习培训在任务上的模型,然后通过人体反馈优化答案质量。为了使人为评估事实精度更容易,模型必须在浏览支持答案时收集引用。我们在ELI5上培训并评估我们的模型,Reddit用户提出的问题数据集。我们的最佳模型是通过使用行为克隆进行微调GPT-3获得的,然后对训练训练的奖励模型进行拒绝采样来获得以预测人类偏好。这种模式的答案是人类56%的答案,我们的人类示威者的时间和69%的时间到Reddit的最高投票答复。
translated by 谷歌翻译
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit nonparametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
translated by 谷歌翻译
关于信息检索的许多最新研究集中在如何从一项任务(通常具有丰富的监督数据)转移到有限的其他各种任务,并隐含地假设可以从一个任务概括到所有其余的任务。但是,这忽略了这样一个事实,即有许多多样化和独特的检索任务,每个任务都针对不同的搜索意图,查询和搜索域。在本文中,我们建议使用几乎没有散热的检索,每个任务都有一个简短的描述和一些示例。为了扩大一些示例的功能,我们提出了针对检索器(即将到来)的及时基本查询生成,该查询将大型语言模型(LLM)作为几个弹片查询生成器,并根据生成的数据创建特定于任务的检索器。通过LLM的概括能力提供动力,即要来源使得可以仅基于一些示例{没有自然问题或MS MARCO来训练%问题生成器或双重编码器,就可以仅基于一些示例{没有}来创建特定于任务的端到端检索。出乎意料的是,LLM提示不超过8个示例,允许双重编码器在MARCO(例如Colbert V2)上训练的大量工程模型平均在11个检索套件中超过1.2 NDCG。使用相同生成数据的进一步培训标准尺寸的重新级别可获得5.0点NDCG的改进。我们的研究确定,查询产生比以前观察到的更有效,尤其是在给出少量特定于任务知识的情况下。
translated by 谷歌翻译
Recent work on open domain question answering (QA) assumes strong supervision of the supporting evidence and/or assumes a blackbox information retrieval (IR) system to retrieve evidence candidates. We argue that both are suboptimal, since gold evidence is not always available, and QA is fundamentally different from IR. We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system. In this setting, evidence retrieval from all of Wikipedia is treated as a latent variable. Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets. On datasets where the questioner already knows the answer, a traditional IR system such as BM25 is sufficient. On datasets where a user is genuinely seeking an answer, we show that learned retrieval is crucial, outperforming BM25 by up to 19 points in exact match.
translated by 谷歌翻译
Retrieval-augmented in-context learning has emerged as a powerful approach for addressing knowledge-intensive tasks using frozen language models (LM) and retrieval models (RM). Existing work has combined these in simple "retrieve-then-read" pipelines in which the RM retrieves passages that are inserted into the LM prompt. To begin to fully realize the potential of frozen LMs and RMs, we propose Demonstrate-Search-Predict (DSP), a framework that relies on passing natural language texts in sophisticated pipelines between an LM and an RM. DSP can express high-level programs that bootstrap pipeline-aware demonstrations, search for relevant passages, and generate grounded predictions, systematically breaking down problems into small transformations that the LM and RM can handle more reliably. We have written novel DSP programs for answering questions in open-domain, multi-hop, and conversational settings, establishing in early evaluations new state-of-the-art in-context learning results and delivering 37-200%, 8-40%, and 80-290% relative gains against vanilla LMs, a standard retrieve-then-read pipeline, and a contemporaneous self-ask pipeline, respectively.
translated by 谷歌翻译
在寻求信息的对话中,用户与代理商进行对话,以提出一系列通常可以不足或过度指定的问题。理想的代理商首先将通过搜索其基本知识来源,然后与用户进行适当互动以解决它,从而确定他们处于这种情况。但是,大多数现有研究都无法或人为地纳入此类代理端计划。在这项工作中,我们介绍了Inscit(发音为Insight),这是一种用于与混合互动相互作用的信息寻求对话的数据集。它包含从805个人类对话中进行的4.7k用户代理转弯,代理商对Wikipedia进行搜索,并要求澄清或提供相关信息以解决用户查询。我们定义了两个子任务,即证据通过识别和响应产生,以及一种新的人类评估协议来评估模型绩效。我们根据对话知识识别和开放域问题的最新模型报告了两个强大的基线的结果。这两种模型都显着不足,并且没有产生连贯和信息丰富的反应,这表明未来的研究有足够的改进空间。
translated by 谷歌翻译
我们介绍了Art,这是一种新的语料库级自动编码方法,用于培训密集检索模型,不需要任何标记的培训数据。密集的检索是开放域任务(例如Open QA)的核心挑战,在该任务中,最先进的方法通常需要大量的监督数据集,并具有自定义的硬性采矿和肯定式示例。相反,艺术品仅需要访问未配对的投入和输出(例如问题和潜在的答案文件)。它使用新的文档 - 重新定义自动编码方案,其中(1)输入问题用于检索一组证据文档,并且(2)随后使用文档来计算重建原始问题的概率。基于问题重建的检索培训可以有效地学习文档和问题编码器,以后可以将其纳入完整的QA系统中,而无需任何进一步的填充。广泛的实验表明,ART在多个QA检索基准测试基准上获得最先进的结果,并且仅来自预训练的语言模型的一般初始化,从而消除了对标记的数据和特定于任务的损失的需求。
translated by 谷歌翻译
我们介绍了Sparrow,这是一个寻求信息的对话代理,与提示的语言模型基线相比,训练有素,更有帮助,正确和无害。我们使用从人类反馈中的强化学习来培训我们的模型,以帮助人类评估者判断代理人的行为。首先,为了使我们的代理人更有帮助和无害,我们将良好对话的要求分解为代理人应遵循的自然语言规则,并分别向评估者询问每个规则。我们证明,这种崩溃使我们能够收集对代理行为的更多针对性的人类判断,并允许更有效的规则条件奖励模型。其次,我们的代理商在收集对模型声明的偏好判决时提供了支持事实主张的来源的证据。对于事实问题,麻雀提供的证据支持了78%的时间。比基线比基线更享受麻雀,同时对人类的对抗性探测更具弹性,在探测时只有8%的时间违反了我们的规则。最后,我们进行了广泛的分析,表明尽管我们的模型学会遵守我们的规则,但它可以表现出分布偏见。
translated by 谷歌翻译
对于开放式对话问题应答(CQA),可以检索最相关的段落来回答问题,但与标准通道检索相比,这是具有挑战性,因为它需要了解完整的对话背景而不是单个查询。此外,重新列车良好的检索者(例如用于非对话查询)最初开发的搜索引擎都可以昂贵。为了便于他们的使用,我们开发了一个查询重写模型Conqrr,可以将上下文中的会话问题重写为独立的问题。它培训了一种新颖的奖励功能,可以直接优化检索,并且可以使用强化学习来适应任何固定的黑箱猎犬。我们展示了Conqrr在最近的开放式CQA数据集上实现了最先进的结果,这是来自三个不同来源的对话组合。我们还开展了广泛的实验,以表明CONQRR为任何给定的固定猎犬的有效性。
translated by 谷歌翻译
在该职位论文中,我们提出了一种新方法,以基于问题的产生和实体链接来生成文本的知识库(KB)。我们认为,所提出的KB类型具有传统符号KB的许多关键优势:尤其是由小型模块化组件组成,可以在组合上合并以回答复杂的查询,包括涉及“多跳跃”的关系查询和查询。“推论。但是,与传统的KB不同,该信息商店与常见的用户信息需求相符。
translated by 谷歌翻译
Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend GenQA to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all 5 languages and monolingual generative pipelines for three out of five languages studied.
translated by 谷歌翻译
Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprised of 12k dialogue turns generated by neural dialogue systems trained on three knowledgegrounded dialogue corpora. We collect human annotations assessing the extent to which the models' responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/ google/BEGIN-dataset.
translated by 谷歌翻译