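Large language models have achieved high performance on various question answering (QA) benchmarks, but the explainability of their output remains elusive. Structured explanations, called entailment trees, were recently suggested as a way to explain and inspect a QA system's answer. To better generate such entailment trees, we propose an architecture called Iterative Retrieval-Generation Reasoner (IRGR). Our model is able to explain a given hypothesis by systematically generating a step-by-step explanation from textual premises. The IRGR model iteratively searches for suitable premises, constructing a single entailment step at a time. Contrary to previous approaches, our method combines generation steps with the retrieval of premises, allowing the model to leverage intermediate conclusions and mitigating the input size limit of baseline encoder-decoder models. We conduct experiments using the EntailmentBank dataset, where we outperform existing benchmarks on premise retrieval and entailment tree generation, with around 300% gain in overall correctness.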
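The iterative retrieve-then-generate loop described in this abstract can be sketched in a few lines. This is a toy illustration only, not the IRGR implementation: the word-overlap retriever, the string-joining `generate_step`, and all function names are hypothetical stand-ins for the learned retriever and generator, and the stopping rule is an assumption.

```python
from __future__ import annotations


def overlap(a: str, b: str) -> int:
    """Count word types shared by two sentences (case-insensitive)."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def retrieve(query: str, pool: list[str], k: int = 2) -> list[str]:
    """Toy retriever: top-k pool sentences with non-zero word overlap."""
    candidates = [s for s in pool if overlap(query, s) > 0]
    return sorted(candidates, key=lambda s: overlap(query, s), reverse=True)[:k]


def generate_step(premises: list[str]) -> str:
    """Stand-in for a learned generator: joins premises into a conclusion."""
    return " and ".join(premises)


def build_steps(hypothesis: str, facts: list[str], max_steps: int = 5):
    """Build one entailment step at a time, feeding conclusions back in."""
    pool = list(facts)
    query = hypothesis
    steps: list[tuple[list[str], str]] = []
    for _ in range(max_steps):
        premises = retrieve(query, pool, k=2)
        if len(premises) < 2:  # assumed stopping rule: nothing left to combine
            break
        conclusion = generate_step(premises)
        steps.append((premises, conclusion))
        # The intermediate conclusion replaces its premises in the pool
        # and also extends the query used for the next retrieval.
        pool = [s for s in pool if s not in premises] + [conclusion]
        query = hypothesis + " " + conclusion
    return steps
```

In this sketch a fact that shares no words with the hypothesis can still be retrieved in a later step, once an intermediate conclusion has extended the query. That is the benefit the abstract attributes to leveraging intermediate conclusions.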
translated by 谷歌翻译
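Entailment trees have been proposed to simulate the human reasoning process of explanation generation in the context of open-domain textual question answering. In practice, however, manually constructing these explanation trees is a laborious process that requires active human involvement. Given the complexity of capturing the line of reasoning from question to answer, or from claim to premises, the question arises of how to help the user efficiently construct multi-level entailment trees given a large set of available facts. In this paper, we frame the construction of entailment trees as a sequence of active premise selection steps: for each intermediate node in an explanation tree, the expert needs to annotate positive and negative examples of premise facts from a large candidate list. We then iteratively fine-tune pre-trained Transformer models with the resulting positive and tightly controlled negative samples, aiming to balance the encoding of semantic relationships and explanatory entailment relationships. Experimental evaluation confirms measurable efficiency gains of the proposed active fine-tuning method in facilitating entailment tree construction: an improvement of 20% in explanatory premise selection compared with several alternatives.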
The emergence of large pretrained models has enabled language models to achieve superior performance in common NLP tasks, including language modeling and question answering, compared to previous static word representation methods. Augmenting these models with a retriever to retrieve the related text and documents as supporting information has shown promise in effectively solving NLP problems in a more interpretable way given that the additional knowledge is injected explicitly rather than being captured in the models' parameters. In spite of the recent progress, our analysis on retriever-augmented language models shows that this class of language models still lack reasoning over the retrieved documents. In this paper, we study the strengths and weaknesses of different retriever-augmented language models such as REALM, kNN-LM, FiD, ATLAS, and Flan-T5 in reasoning over the selected documents in different tasks. In particular, we analyze the reasoning failures of each of these models and study how the models' failures in reasoning are rooted in the retriever module as well as the language model.
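Current abstractive summarization models either suffer from a lack of clear interpretability or provide incomplete rationales by only highlighting parts of the source document. To this end, we propose the Summarization Program (SP), an interpretable modular framework consisting of an (ordered) list of binary trees, each encoding the step-by-step generative process of an abstractive summary sentence from the source document. A Summarization Program contains one root node per summary sentence, and a distinct tree connects each summary sentence (root node) to the document sentences (leaf nodes) from which it is derived, with the connecting nodes containing intermediate generated sentences. Edges represent different modular operations involved in summarization, such as sentence fusion, compression, and paraphrasing. We first propose an efficient best-first search method over neural modules, SP-Search, which identifies SPs for human summaries by directly optimizing for ROUGE scores. Next, using these programs as automatic supervision, we propose seq2seq models that generate Summarization Programs, which are then executed to obtain final summaries. We demonstrate that SP-Search effectively represents the generative process behind human summaries using modules that are typically faithful to their intended behavior. We also conduct a simulation study to show that Summarization Programs improve the interpretability of summarization models by allowing humans to better simulate model reasoning. Summarization Programs constitute a promising step toward interpretable and modular abstractive summarization, a complex task previously addressed primarily through black-box end-to-end neural systems. Our code is available at https://github.com/swarnaHub/SummarizationPrograms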
As large language models (LLMs) grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition, leaving significant room for future improvements.
Despite the success of large language models (LLMs) in various natural language processing (NLP) tasks, the stored knowledge in these models may inevitably be incomplete, out-of-date, or incorrect. This motivates the need to utilize external knowledge to assist LLMs. Unfortunately, current methods for incorporating external knowledge often require additional training or fine-tuning, which can be costly and may not be feasible for LLMs. To address this issue, we propose a novel post-processing approach, rethinking with retrieval (RR), which retrieves relevant external knowledge based on the decomposed reasoning steps obtained from the chain-of-thought (CoT) prompting. This lightweight approach does not require additional training or fine-tuning and is not limited by the input length of LLMs. We evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks: commonsense reasoning, temporal reasoning, and tabular reasoning. Our results show that RR can produce more faithful explanations and improve the performance of LLMs.
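Existing accounts of explanation emphasise the role of prior experience in the solution of new problems. However, most contemporary models for multi-hop textual inference construct explanations by considering each test case in isolation. This paradigm is known to suffer from semantic drift, which causes the construction of spurious explanations that lead to wrong conclusions. In contrast, we investigate an abductive framework for explainable multi-hop inference that adopts the retrieve-reuse-revise paradigm largely studied in case-based reasoning. Specifically, we present a novel framework that addresses and explains unseen inference problems by retrieving and adapting prior natural language explanations from similar training examples. We empirically evaluate the case-based abductive framework on downstream commonsense and scientific reasoning tasks. Our experiments demonstrate that the proposed framework can be effectively integrated with sparse and dense pre-trained encoding mechanisms or downstream transformers, comparing favourably with existing explainable approaches. Moreover, we study the impact of the retrieve-reuse-revise paradigm on explainability and semantic drift, showing that it boosts the quality of the constructed explanations, resulting in improved downstream inference performance.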
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG): models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
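The difference between the two formulations mentioned in this abstract (one passage for the whole sequence versus a possibly different passage per token) can be made concrete with a small numeric sketch. The function names and all probabilities below are invented for illustration, and this is one reading of the abstract's description rather than the paper's code: `doc_probs[z]` plays the role of the retriever's probability for passage `z`, and `token_probs[z][i]` the generator's probability of the i-th target token given passage `z`.

```python
import math


def rag_sequence(doc_probs, token_probs):
    """Same passage for the whole sequence: sum over passages of
    p(passage) times the product of that passage's token probabilities."""
    return sum(pz * math.prod(tp) for pz, tp in zip(doc_probs, token_probs))


def rag_token(doc_probs, token_probs):
    """Passage can change per token: product over positions of the
    passage-weighted average token probability."""
    n_tokens = len(token_probs[0])
    return math.prod(
        sum(pz * tp[i] for pz, tp in zip(doc_probs, token_probs))
        for i in range(n_tokens)
    )


# Two passages, two target tokens; each passage supports a different token.
doc_probs = [0.5, 0.5]
token_probs = [[0.9, 0.1], [0.1, 0.9]]
print(rag_sequence(doc_probs, token_probs))
print(rag_token(doc_probs, token_probs))
```

With these made-up numbers the per-token variant assigns the sequence a higher probability, because each token can draw on the passage that supports it.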
Machine reading comprehension (MRC) is a long-standing topic in natural language processing (NLP). The MRC task aims to answer a question based on a given context. Recent studies focus on multi-hop MRC, a more challenging extension of MRC in which answering a question requires combining several disjoint pieces of information spread across the context. Given the complexity and importance of multi-hop MRC, a large number of studies have addressed this topic in recent years, so a review of the related literature is both necessary and worthwhile. This study investigates recent advances in multi-hop MRC approaches based on 31 studies from 2018 to 2022. First, the multi-hop MRC problem definition is introduced; then the 31 models are reviewed in detail, with a strong focus on their multi-hop aspects, and categorized by their main techniques. Finally, a fine-grained, comprehensive comparison of the models and techniques is presented.
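It is often challenging to solve a complex problem from scratch, but much easier if we can access other similar problems together with their solutions, a paradigm known as case-based reasoning (CBR). We propose a neuro-symbolic CBR approach (CBR-KBQA) for question answering over large knowledge bases. CBR-KBQA consists of a non-parametric memory that stores cases (questions and logical forms) and a parametric model that can generate a logical form for a new question by retrieving cases that are relevant to it. On several KBQA datasets that contain complex questions, CBR-KBQA achieves competitive performance. For example, on the ComplexWebQuestions dataset, CBR-KBQA outperforms the current state of the art by 11% in accuracy. Furthermore, we show that CBR-KBQA is capable of using new cases without any further training: by incorporating a few human-labeled examples in the case memory, CBR-KBQA is able to successfully generate logical forms containing unseen KB entities as well as relations.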
Recent work has shown that large language models are capable of generating natural language reasoning steps or Chains-of-Thoughts (CoT) to answer a multi-step question when prompted to do so. This is insufficient, however, when the necessary knowledge is not available or up-to-date within a model's parameters. A straightforward approach to address this is to retrieve text from an external knowledge source using the question as a query and prepend it as context to the model's input. This, however, is also insufficient for multi-step QA where what to retrieve depends on what has already been derived. To address this issue we propose IRCoT, a new approach that interleaves retrieval with CoT for multi-step QA, guiding the retrieval with CoT and in turn using retrieved results to improve CoT. Our experiments with GPT3 show substantial improvements in retrieval (up to 22 points) and downstream QA (up to 16 points) over the baselines on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. Notably, our method also works well for much smaller models such as T5-Flan-large (0.7B) without any additional training.
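The interleaving of retrieval and chain-of-thought described in this abstract can be illustrated with a toy loop. This is a hedged sketch under stated assumptions, not the IRCoT implementation: `retrieve` is a hypothetical word-overlap retriever, `reason_step` is a caller-supplied stand-in for the language model, and the "answer is" stopping phrase is an assumption made for the example.

```python
from __future__ import annotations

from typing import Callable


def retrieve(query: str, corpus: list[str], seen: list[str]) -> list[str]:
    """Toy retriever: the single best unseen paragraph by word overlap."""
    q = set(query.lower().split())
    unseen = [p for p in corpus if p not in seen]
    ranked = sorted(unseen, key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:1]


def interleave(question: str, corpus: list[str],
               reason_step: Callable[[str, list[str], list[str]], str],
               max_steps: int = 4) -> tuple[list[str], list[str]]:
    """Alternate between one reasoning sentence and one retrieval call."""
    context = retrieve(question, corpus, seen=[])  # first query: the question
    chain: list[str] = []
    for _ in range(max_steps):
        sentence = reason_step(question, context, chain)
        chain.append(sentence)
        if "answer is" in sentence.lower():  # assumed termination signal
            break
        # The newest reasoning sentence becomes the next retrieval query.
        context += retrieve(sentence, corpus, seen=context)
    return chain, context
```

A scripted `reason_step` is enough to try the loop: the first reasoning sentence mentions an entity absent from the question, and that entity is what brings the second paragraph into the context.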
Deep Learning and Machine Learning based models have become extremely popular in text processing and information retrieval. However, the non-linear structures present inside the networks make these models largely inscrutable. A significant body of research has focused on increasing the transparency of these models. This article provides a broad overview of research on the explainability and interpretability of natural language processing and information retrieval methods. More specifically, we survey approaches that have been applied to explain word embeddings, sequence modeling, attention modules, transformers, BERT, and document ranking. The concluding section suggests some possible directions for future research on this topic.
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality - among other traits - by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human annotated and six programmatically perturbed diagnostics datasets - covering a diverse set of tasks that require reasoning skills and show that ROSCOE can consistently outperform baseline metrics.
We present Hybrid Infused Reranking for Passages Retrieval (HYRR), a framework for training rerankers based on a hybrid of BM25 and neural retrieval models. Retrievers based on hybrid models have been shown to outperform both BM25 and neural models alone. Our approach exploits this improved performance when training a reranker, leading to a robust reranking model. The reranker, a cross-attention neural model, is shown to be robust to different first-stage retrieval systems, achieving better performance than rerankers simply trained upon the first-stage retrievers in the multi-stage systems. We present evaluations on a supervised passage retrieval task using MS MARCO and zero-shot retrieval tasks using BEIR. The empirical results show strong performance on both evaluations.
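The abstract does not say how BM25 and neural scores are combined in the hybrid retriever. One common choice, shown here purely as an assumption, is to min-max normalize each score list and interpolate linearly. The function names, the `alpha` weight, and the example scores are all hypothetical. Both score dictionaries are assumed to be non-empty.

```python
from __future__ import annotations


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; all-equal scores map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sparse: dict[str, float], dense: dict[str, float],
                alpha: float = 0.5) -> list[str]:
    """Rank documents by alpha * sparse + (1 - alpha) * dense (normalized).

    A document missing from one retriever's list scores 0.0 for it.
    """
    s, d = min_max(sparse), min_max(dense)
    fused = {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
             for doc in set(s) | set(d)}
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


bm25_scores = {"d1": 12.0, "d2": 7.0, "d3": 2.0}   # made-up sparse scores
dense_scores = {"d1": 0.2, "d2": 0.9, "d3": 0.1}   # made-up dense scores
print(hybrid_rank(bm25_scores, dense_scores))
```

In a multi-stage setup like the one the abstract describes, a ranked list of this kind could serve as the candidate pool on which a reranker is trained.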