知识依赖任务通常使用两个知识来源:参数,在培训时间和上下文中学到的,作为推理时间的段落给出。要了解模型如何使用这些来源,我们正式化知识冲突问题,其中上下文信息与学到的信息相矛盾。分析流行模型的行为,我们衡量其过度依赖记忆信息(幻觉的原因),并揭示加剧这种行为的重要因素。最后,我们提出了一种简单的方法来减轻对参数知识的过度依赖,这最大限度地减少了幻觉,并提高了分配的推广4%-7%。我们的调查结果表明了从业者评估模型倾向于幻觉而不是阅读的重要性,并表明我们的缓解战略鼓励向不断发展的信息(即时间依赖查询)概括。为鼓励这些做法,我们发布了我们的框架,以产生知识冲突。
translated by 谷歌翻译
最近的开放式域问题回答表明,新颖的测试问题之间的模型性能和那些在很大程度上与培训问题重叠的模型性能存在很大差异。然而,目前尚不清楚新颖的问题的哪些方面使他们成为挑战。在进行系统泛化的研究时,我们根据三个类别介绍和注释问题,这些类别测量了不同的水平和概括的种类:培训设定重叠,组成泛化(Comp-Gen)和新颖的实体概括(新实体)。在评估六个流行的参数和非参数模型时,我们发现,对于既定的自然问题和TriviaQA数据集,即使是Comp-Gen /新颖实体的最强的模型性能也是13.1 / 5.4%和9.6 / 1.5%,而与此相比降低对于完整的测试集 - 表示这些类型的问题所带来的挑战。此外,我们表明,虽然非参数模型可以相对良好地处理含有新颖实体的问题,但它们与那些需要组成泛化的问题斗争。最后,我们发现关键问题是:来自检索组件的级联错误,问题模式的频率和实体的频率。
translated by 谷歌翻译
Question answering models commonly have access to two sources of "knowledge" during inference time: (1) parametric knowledge - the factual knowledge encoded in the model weights, and (2) contextual knowledge - external knowledge (e.g., a Wikipedia passage) given to the model to generate a grounded answer. Having these two sources of knowledge entangled together is a core issue for generative QA models as it is unclear whether the answer stems from the given non-parametric knowledge or not. This unclarity has implications on issues of trust, interpretability and factuality. In this work, we propose a new paradigm in which QA models are trained to disentangle the two sources of knowledge. Using counterfactual data augmentation, we introduce a model that predicts two answers for a given question: one based on given contextual knowledge and one based on parametric knowledge. Our experiments on the Natural Questions dataset show that this approach improves the performance of QA models by making them more robust to knowledge conflicts between the two knowledge sources, while generating useful disentangled answers.
translated by 谷歌翻译
Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dualencoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks. 1 * Equal contribution 1 The code and trained models have been released at https://github.com/facebookresearch/DPR.
translated by 谷歌翻译
Open-Domain Question Answering (ODQA) requires models to answer factoid questions with no context given. The common way for this task is to train models on a large-scale annotated dataset to retrieve related documents and generate answers based on these documents. In this paper, we show that the ODQA architecture can be dramatically simplified by treating Large Language Models (LLMs) as a knowledge corpus and propose a Self-Prompting framework for LLMs to perform ODQA so as to eliminate the need for training data and external knowledge corpus. Concretely, we firstly generate multiple pseudo QA pairs with background passages and one-sentence explanations for these QAs by prompting LLMs step by step and then leverage the generated QA pairs for in-context learning. Experimental results show our method surpasses previous state-of-the-art methods by +8.8 EM averagely on three widely-used ODQA datasets, and even achieves comparable performance with several retrieval-augmented fine-tuned models.
translated by 谷歌翻译
大型语言模型在各种任务上显示出令人印象深刻的几次结果。但是,当知识是此类结果的关键时,就像问题回答和事实检查之类的任务一样,似乎需要存储知识的大量参数计数。众所周知,检索增强模型可以在不需要多个参数的情况下在知识密集的任务上表现出色,但是目前尚不清楚它们是否在几个弹药设置中工作。在这项工作中,我们介绍了地图集,这是一个经过精心设计和预先训练的增强语言模型,能够通过很少的培训示例学习知识密集型任务。我们对包括MMLU,苏格兰短裙和归类等各种任务进行评估,并研究文档索引内容的影响,表明它可以很容易地进行更新。值得注意的是,在自然问题上仅使用64个示例在自然问题上达到超过42 \%的准确性,尽管参数少了50倍,但比540B参数模型的表现优于540b参数模型。
translated by 谷歌翻译
Recent work in open-domain question answering (ODQA) has shown that adversarial poisoning of the input contexts can cause large drops in accuracy for production systems. However, little to no work has proposed methods to defend against these attacks. To do so, we introduce a new method that uses query augmentation to search for a diverse set of retrieved passages that could answer the original question. We integrate these new passages into the model through the design of a novel confidence method, comparing the predicted answer to its appearance in the retrieved contexts (what we call Confidence from Answer Redundancy, e.g. CAR). Together these methods allow for a simple but effective way to defend against poisoning attacks and provide gains of 5-20% exact match across varying levels of data poisoning.
translated by 谷歌翻译
Entities, as important carriers of real-world knowledge, play a key role in many NLP tasks. We focus on incorporating entity knowledge into an encoder-decoder framework for informative text generation. Existing approaches tried to index, retrieve, and read external documents as evidence, but they suffered from a large computational overhead. In this work, we propose an encoder-decoder framework with an entity memory, namely EDMem. The entity knowledge is stored in the memory as latent representations, and the memory is pre-trained on Wikipedia along with encoder-decoder parameters. To precisely generate entity names, we design three decoding methods to constrain entity generation by linking entities in the memory. EDMem is a unified framework that can be used on various entity-intensive question answering and generation tasks. Extensive experimental results show that EDMem outperforms both memory-based auto-encoder models and non-memory encoder-decoder models.
translated by 谷歌翻译
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit nonparametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
translated by 谷歌翻译
在该职位论文中,我们提出了一种新方法,以基于问题的产生和实体链接来生成文本的知识库(KB)。我们认为,所提出的KB类型具有传统符号KB的许多关键优势:尤其是由小型模块化组件组成,可以在组合上合并以回答复杂的查询,包括涉及“多跳跃”的关系查询和查询。“推论。但是,与传统的KB不同,该信息商店与常见的用户信息需求相符。
translated by 谷歌翻译
Recent advances in open-domain question answering (ODQA) have demonstrated impressive accuracy on standard Wikipedia style benchmarks. However, it is less clear how robust these models are and how well they perform when applied to real-world applications in drastically different domains. While there has been some work investigating how well ODQA models perform when tested for out-of-domain (OOD) generalization, these studies have been conducted only under conservative shifts in data distribution and typically focus on a single component (ie. retrieval) rather than an end-to-end system. In response, we propose a more realistic and challenging domain shift evaluation setting and, through extensive experiments, study end-to-end model performance. We find that not only do models fail to generalize, but high retrieval scores often still yield poor answer prediction accuracy. We then categorize different types of shifts and propose techniques that, when presented with a new dataset, predict if intervention methods are likely to be successful. Finally, using insights from this analysis, we propose and evaluate several intervention methods which improve end-to-end answer F1 score by up to 24 points.
translated by 谷歌翻译
知识密集型任务,例如开放域问题答案(QA),需要访问大量的世界知识或领域知识。知识密集型任务的一种常见方法是采用检索到阅读的管道,该管道首先从诸如Wikipedia之类的外部语料库中检索少数相关的上下文文档,然后预测在检索文档的条件下得到答案。在本文中,我们提出了一种新的观点,可以通过用大型语言模型生成器代替文档检索器来解决知识密集型任务。我们称我们的方法生成-Read Read(GenRead),该方法首先提示大型语言模型根据给定问题生成上下文文档,然后读取生成的文档以产生最终答案。此外,我们提出了一种基于聚类的提示方法,该方法选择了不同的提示,从而产生了涵盖不同观点的生成文档,从而更好地回忆了可接受的答案。我们对三个不同的知识密集任务进行了广泛的实验,包括开放域质量检查,事实检查和对话系统。值得注意的是,GenRead在Triviaqa和WebQ上实现了71.6和54.4的精确匹配分数,显着超过了最先进的检索到+4.0和+3.9的最先进的dpr-fid,而无需从任何外部知识源中检索任何文档。最后,我们证明可以通过结合检索和生成来进一步提高模型性能。
translated by 谷歌翻译
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducable evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
translated by 谷歌翻译
Despite their impressive performance on diverse tasks, large language models (LMs) still struggle with tasks requiring rich world knowledge, implying the limitations of relying solely on their parameters to encode a wealth of world knowledge. This paper aims to understand LMs' strengths and limitations in memorizing factual knowledge, by conducting large-scale knowledge probing experiments of 10 models and 4 augmentation methods on PopQA, our new open-domain QA dataset with 14k questions. We find that LMs struggle with less popular factual knowledge, and that scaling fails to appreciably improve memorization of factual knowledge in the tail. We then show that retrieval-augmented LMs largely outperform orders of magnitude larger LMs, while unassisted LMs remain competitive in questions about high-popularity entities. Based on those findings, we devise a simple, yet effective, method for powerful and efficient retrieval-augmented LMs, which retrieves non-parametric memories only when necessary. Experimental results show that this significantly improves models' performance while reducing the inference costs.
translated by 谷歌翻译
自动问题应答(QA)系统的目的是以时间有效的方式向用户查询提供答案。通常在数据库(或知识库)或通常被称为语料库的文件集合中找到答案。在过去的几十年里,收购知识的扩散,因此生物医学领域的新科学文章一直是指数增长。因此,即使对于领域专家,也难以跟踪域中的所有信息。随着商业搜索引擎的改进,用户可以在某些情况下键入其查询并获得最相关的一小组文档,以及在某些情况下从文档中的相关片段。但是,手动查找所需信息或答案可能仍然令人疑惑和耗时。这需要开发高效的QA系统,该系统旨在为用户提供精确和精确的答案提供了生物医学领域的自然语言问题。在本文中,我们介绍了用于开发普通域QA系统的基本方法,然后彻底调查生物医学QA系统的不同方面,包括使用结构化数据库和文本集合的基准数据集和几种提出的方​​法。我们还探讨了当前系统的局限性,并探索潜在的途径以获得进一步的进步。
translated by 谷歌翻译
本文介绍了学习迭代查询细化的元策略的设计代理的首先成功步骤。我们的方法使用机器读取来指导从聚合搜索结果中选择细化项。然后,使用简单但有效的搜索操作员能够赋予代理,以对查询和搜索结果发挥细粒度和透明控制。我们开发一种新颖的方式来发电综合搜索会话,它通过(自我)监督学习来利用基于变压器的语言模型的力量。我们还提出了一种强化学习代理,具有动态约束的动作,从划痕中了解互动搜索策略。我们使用传统的基于术语的BM25排名函数获得与最近神经方法相当的检索和回答质量性能。我们对搜索政策进行了深入的分析。
translated by 谷歌翻译
为了解决现实世界应用需求的日益增长,知识密集型NLP(KI-NLP)的研究应通过捕获真正开放域环境的挑战:网络规模知识,结构缺乏,质量不一致,和噪音。为此,我们提出了一种新的设置,用于评估现有的KI-NLP任务,其中我们将背景语料库概括为通用Web快照。我们重新保证Kilt,最初为维基百科最初开发的标准Ki-NLP基准测试,并要求系统使用CCNet的子集 - 球体语料库 - 作为知识源。与维基百科相比,球体是较大的数量级,更好地反映了互联网上的全部知识。我们发现,尽管潜在的覆盖范围,规模挑战,结构缺乏,质量较低,来自领域的检索可以实现最先进的检索系统,以匹配和甚至优于基于Wikipedia的模型在几个kilt上任务 - 即使我们积极过滤看起来像维基百科的内容。我们还观察到Wikipedia的单一密集通道指数可以胜过稀疏的BM25版本,而在球体上尚不实现。为了促进进一步研究该领域,并尽量减少社区对专有黑匣子搜索引擎的依赖,我们将分享我们的指数,评估指标和基础设施。
translated by 谷歌翻译
会话问题应答(CQA)系统旨在为用户提供自然语言答案,以信息寻求对话。现有的CQA基准测试与预先收集的人类谈话进行比较模型,使用在会话历史中提供的地面真理答案。它仍然尚不清楚我们是否可以依赖于模型开发的这种静态评估,以及当前系统是否能够充分地概括为现实世界的人机对话。在这项工作中,我们开展了最先进的CQA系统的大规模人类评估,人类评估人员与模型交谈并判断了答案的正确性。我们发现,人机对话的分布与人类谈话的分配急剧不同,并且在模型排名方面存在人和金历史评估之间的分歧。我们进一步调查了如何改进自动评估,并提出基于预测历史的问题重写机制,与人类判断更好地相关。最后,我们讨论了各种建模策略和未来方向对更好的会话问题应答系统的影响。
translated by 谷歌翻译
知识密集型语言任务(苏格兰信)通常需要大量信息来提供正确的答案。解决此问题的一种流行范式是将搜索系统与机器读取器相结合,前者检索支持证据,后者检查它们以产生答案。最近,读者组成部分在大规模预培养的生成模型的帮助下见证了重大进展。同时,搜索组件中的大多数现有解决方案都依赖于传统的``索引 - retrieve-then-Rank''管道,该管道遭受了巨大的内存足迹和端到端优化的困难。受到最新构建基于模型的IR模型的努力的启发,我们建议用新颖的单步生成模型替换传统的多步搜索管道,该模型可以极大地简化搜索过程并以端到端的方式进行优化。我们表明,可以通过一组经过适当设计的预训练任务来学习强大的生成检索模型,并被采用以通过进一步的微调来改善各种下游苏格兰短裙任务。我们将预训练的生成检索模型命名为Copusbrain,因为有关该语料库的所有信息均以其参数进行编码,而无需构造其他索引。经验结果表明,在苏格兰语基准上的检索任务并建立了新的最新性能,Copusbrain可以极大地超过强大的基准。我们还表明,在零农源和低资源设置下,科体班运行良好。
translated by 谷歌翻译
We present SpaceQA, to the best of our knowledge the first open-domain QA system in Space mission design. SpaceQA is part of an initiative by the European Space Agency (ESA) to facilitate the access, sharing and reuse of information about Space mission design within the agency and with the public. We adopt a state-of-the-art architecture consisting of a dense retriever and a neural reader and opt for an approach based on transfer learning rather than fine-tuning due to the lack of domain-specific annotated data. Our evaluation on a test set produced by ESA is largely consistent with the results originally reported by the evaluated retrievers and confirms the need of fine tuning for reading comprehension. As of writing this paper, ESA is piloting SpaceQA internally.
translated by 谷歌翻译