This paper proposes a question-answering system that can answer questions whose supporting evidence is spread over multiple (potentially long) documents. The system, called Visconde, uses a three-step pipeline to perform the task: decompose, retrieve, and aggregate. The first step decomposes the question into simpler questions using a few-shot large language model (LLM). Then, a state-of-the-art search engine is used to retrieve candidate passages from a large collection for each decomposed question. In the final step, we use the LLM in a few-shot setting to aggregate the contents of the passages into the final answer. The system is evaluated on three datasets: IIRC, Qasper, and StrategyQA. Results suggest that current retrievers are the main bottleneck and that readers are already performing at the human level as long as relevant passages are provided. The system is also shown to be more effective when the model is induced to give explanations before answering a question. Code is available at \url{https://github.com/neuralmind-ai/visconde}.
translated by 谷歌翻译
知识密集型任务,例如开放域问题答案(QA),需要访问大量的世界知识或领域知识。知识密集型任务的一种常见方法是采用检索到阅读的管道,该管道首先从诸如Wikipedia之类的外部语料库中检索少数相关的上下文文档,然后预测在检索文档的条件下得到答案。在本文中,我们提出了一种新的观点,可以通过用大型语言模型生成器代替文档检索器来解决知识密集型任务。我们称我们的方法生成-Read Read(GenRead),该方法首先提示大型语言模型根据给定问题生成上下文文档,然后读取生成的文档以产生最终答案。此外,我们提出了一种基于聚类的提示方法,该方法选择了不同的提示,从而产生了涵盖不同观点的生成文档,从而更好地回忆了可接受的答案。我们对三个不同的知识密集任务进行了广泛的实验,包括开放域质量检查,事实检查和对话系统。值得注意的是,GenRead在Triviaqa和WebQ上实现了71.6和54.4的精确匹配分数,显着超过了最先进的检索到+4.0和+3.9的最先进的dpr-fid,而无需从任何外部知识源中检索任何文档。最后,我们证明可以通过结合检索和生成来进一步提高模型性能。
translated by 谷歌翻译
Recent work has shown that large language models are capable of generating natural language reasoning steps or Chains-of-Thoughts (CoT) to answer a multi-step question when prompted to do so. This is insufficient, however, when the necessary knowledge is not available or up-to-date within a model's parameters. A straightforward approach to address this is to retrieve text from an external knowledge source using the question as a query and prepend it as context to the model's input. This, however, is also insufficient for multi-step QA where \textit{what to retrieve} depends on \textit{what has already been derived}. To address this issue we propose IRCoT, a new approach that interleaves retrieval with CoT for multi-step QA, guiding the retrieval with CoT and in turn using retrieved results to improve CoT. Our experiments with GPT3 show substantial improvements in retrieval (up to 22 points) and downstream QA (up to 16 points) over the baselines on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. Notably, our method also works well for much smaller models such as T5-Flan-large (0.7B) without any additional training.
translated by 谷歌翻译
开卷问答是问答任务在系统的目的是找到一个给定的文件(开卷)和一个话题常识答案的一个子集。本文提出了一种从亚马逊网络服务(AWS)技术文件没有特定的域标记数据(零次)的语料库回答自然语言问题的解决方案。这些问题可以是 - 否,没有答案,简答题,较长的答案,或任何以上的组合。该解决方案包括两个步骤的体系结构,其中一个检索找到正确的文件和一个提取器发现所检索文档中的答案。我们基于AWS技术文档客户实际问题为开卷QA引入一个新的测试数据集。基于采掘语言模型的几个信息检索系统和提取模型试验后,溶液试图找到在同通的是,没有没有答案和文字答案。该模型是在斯坦福大学的问题回答的培训数据集 - (Rajpurkaret人,2016)的阵容,自然问题的数据集(Kwiatkowski等,2019)。我们能够实现49%的F1和39%确切匹配的分数(EM)终端到终端的,没有特定领域的培训。
translated by 谷歌翻译
我们介绍了Realtime QA,这是一个动态的问答(QA)平台,该平台宣布问题并定期评估系统(此版本每周)。实时质量检查询问当前世界,质量检查系统需要回答有关新事件或信息的问题。因此,它挑战了QA数据集中的静态,常规假设,并追求瞬时应用。我们在包括GPT-3和T5在内的大型语言模型上建立了强大的基线模型。我们的基准是一项持续的努力,该初步报告在过去一个月中提出了实时评估结果。我们的实验结果表明,GPT-3通常可以根据新的退休文档正确更新其生成结果,从而突出了最新信息检索的重要性。尽管如此,我们发现GPT-3倾向于在检索文件时返回过时的答案,这些文件没有提供足够的信息来找到答案。这表明了未来研究的重要途径:开放式域质量检查系统是否可以确定无法回答的案例,并与用户甚至检索模块进行通信以修改检索结果?我们希望实时质量检查能够刺激问题答案及其他问题的瞬时应用。
translated by 谷歌翻译
Despite the success of large language models (LLMs) in various natural language processing (NLP) tasks, the stored knowledge in these models may inevitably be incomplete, out-of-date, or incorrect. This motivates the need to utilize external knowledge to assist LLMs. Unfortunately, current methods for incorporating external knowledge often require additional training or fine-tuning, which can be costly and may not be feasible for LLMs. To address this issue, we propose a novel post-processing approach, rethinking with retrieval (RR), which retrieves relevant external knowledge based on the decomposed reasoning steps obtained from the chain-of-thought (CoT) prompting. This lightweight approach does not require additional training or fine-tuning and is not limited by the input length of LLMs. We evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks: commonsense reasoning, temporal reasoning, and tabular reasoning. Our results show that RR can produce more faithful explanations and improve the performance of LLMs.
translated by 谷歌翻译
问题答案(QA)是自然语言处理中最具挑战性的最具挑战性的问题之一(NLP)。问答(QA)系统试图为给定问题产生答案。这些答案可以从非结构化或结构化文本生成。因此,QA被认为是可以用于评估文本了解系统的重要研究区域。大量的QA研究致力于英语语言,调查最先进的技术和实现最先进的结果。然而,由于阿拉伯QA中的研究努力和缺乏大型基准数据集,在阿拉伯语问答进展中的研究努力得到了很大速度的速度。最近许多预先接受的语言模型在许多阿拉伯语NLP问题中提供了高性能。在这项工作中,我们使用四个阅读理解数据集来评估阿拉伯QA的最先进的接种变压器模型,它是阿拉伯语 - 队,ArcD,AQAD和TYDIQA-GoldP数据集。我们微调并比较了Arabertv2基础模型,ArabertV0.2大型型号和ARAElectra模型的性能。在最后,我们提供了一个分析,了解和解释某些型号获得的低绩效结果。
translated by 谷歌翻译
This paper proposes to tackle opendomain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
translated by 谷歌翻译
关于信息检索的许多最新研究集中在如何从一项任务(通常具有丰富的监督数据)转移到有限的其他各种任务,并隐含地假设可以从一个任务概括到所有其余的任务。但是,这忽略了这样一个事实,即有许多多样化和独特的检索任务,每个任务都针对不同的搜索意图,查询和搜索域。在本文中,我们建议使用几乎没有散热的检索,每个任务都有一个简短的描述和一些示例。为了扩大一些示例的功能,我们提出了针对检索器(即将到来)的及时基本查询生成,该查询将大型语言模型(LLM)作为几个弹片查询生成器,并根据生成的数据创建特定于任务的检索器。通过LLM的概括能力提供动力,即要来源使得可以仅基于一些示例{没有自然问题或MS MARCO来训练%问题生成器或双重编码器,就可以仅基于一些示例{没有}来创建特定于任务的端到端检索。出乎意料的是,LLM提示不超过8个示例,允许双重编码器在MARCO(例如Colbert V2)上训练的大量工程模型平均在11个检索套件中超过1.2 NDCG。使用相同生成数据的进一步培训标准尺寸的重新级别可获得5.0点NDCG的改进。我们的研究确定,查询产生比以前观察到的更有效,尤其是在给出少量特定于任务知识的情况下。
translated by 谷歌翻译
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit nonparametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
translated by 谷歌翻译
大型语言模型在各种任务上显示出令人印象深刻的几次结果。但是,当知识是此类结果的关键时,就像问题回答和事实检查之类的任务一样,似乎需要存储知识的大量参数计数。众所周知,检索增强模型可以在不需要多个参数的情况下在知识密集的任务上表现出色,但是目前尚不清楚它们是否在几个弹药设置中工作。在这项工作中,我们介绍了地图集,这是一个经过精心设计和预先训练的增强语言模型,能够通过很少的培训示例学习知识密集型任务。我们对包括MMLU,苏格兰短裙和归类等各种任务进行评估,并研究文档索引内容的影响,表明它可以很容易地进行更新。值得注意的是,在自然问题上仅使用64个示例在自然问题上达到超过42 \%的准确性,尽管参数少了50倍,但比540B参数模型的表现优于540b参数模型。
translated by 谷歌翻译
We present SpaceQA, to the best of our knowledge the first open-domain QA system in Space mission design. SpaceQA is part of an initiative by the European Space Agency (ESA) to facilitate the access, sharing and reuse of information about Space mission design within the agency and with the public. We adopt a state-of-the-art architecture consisting of a dense retriever and a neural reader and opt for an approach based on transfer learning rather than fine-tuning due to the lack of domain-specific annotated data. Our evaluation on a test set produced by ESA is largely consistent with the results originally reported by the evaluated retrievers and confirms the need of fine tuning for reading comprehension. As of writing this paper, ESA is piloting SpaceQA internally.
translated by 谷歌翻译
自动问题应答(QA)系统的目的是以时间有效的方式向用户查询提供答案。通常在数据库(或知识库)或通常被称为语料库的文件集合中找到答案。在过去的几十年里,收购知识的扩散,因此生物医学领域的新科学文章一直是指数增长。因此,即使对于领域专家,也难以跟踪域中的所有信息。随着商业搜索引擎的改进,用户可以在某些情况下键入其查询并获得最相关的一小组文档,以及在某些情况下从文档中的相关片段。但是,手动查找所需信息或答案可能仍然令人疑惑和耗时。这需要开发高效的QA系统,该系统旨在为用户提供精确和精确的答案提供了生物医学领域的自然语言问题。在本文中,我们介绍了用于开发普通域QA系统的基本方法,然后彻底调查生物医学QA系统的不同方面,包括使用结构化数据库和文本集合的基准数据集和几种提出的方​​法。我们还探讨了当前系统的局限性,并探索潜在的途径以获得进一步的进步。
translated by 谷歌翻译
Researchers produce thousands of scholarly documents containing valuable technical knowledge. The community faces the laborious task of reading these documents to identify, extract, and synthesize information. To automate information gathering, document-level question answering (QA) offers a flexible framework where human-posed questions can be adapted to extract diverse knowledge. Finetuning QA systems requires access to labeled data (tuples of context, question and answer). However, data curation for document QA is uniquely challenging because the context (i.e. answer evidence passage) needs to be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic in real-world applications. We present a three-stage document QA approach: (1) text extraction from PDF; (2) evidence retrieval from extracted texts to form well-posed contexts; (3) QA to extract knowledge from contexts to return high-quality answers -- extractive, abstractive, or Boolean. Using QASPER for evaluation, our detect-retrieve-comprehend (DRC) system achieves a +7.19 improvement in Answer-F1 over existing baselines while delivering superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical scientific document QA.
translated by 谷歌翻译
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K questionanswer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. We also present two baseline algorithms: a featurebased classifier and a state-of-the-art neural network, that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that Trivi-aQA is a challenging testbed that is worth significant future study. 1
translated by 谷歌翻译
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducable evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
translated by 谷歌翻译
Answering complex questions that require making latent decisions is a challenging task, especially when limited supervision is available. Recent works leverage the capabilities of large language models (LMs) to perform complex question answering in a few-shot setting by demonstrating how to output intermediate rationalizations while solving the complex question in a single pass. We introduce ``Successive Prompting'', where we iteratively break down a complex task into a simple task, solve it, and then repeat the process until we get the final solution. Successive prompting decouples the supervision for decomposing complex questions from the supervision for answering simple questions, allowing us to (1) have multiple opportunities to query in-context examples at each reasoning step (2) learn question decomposition separately from question answering, including using synthetic data, and (3) use bespoke (fine-tuned) components for reasoning steps where a large LM does not perform well. The intermediate supervision is typically manually written, which can be expensive to collect. We introduce a way to generate a synthetic dataset which can be used to bootstrap a model's ability to decompose and answer intermediate questions. Our best model (with successive prompting) achieves an improvement of ~5% absolute F1 on a few-shot version of the DROP dataset when compared with a state-of-the-art model with the same supervision.
translated by 谷歌翻译
问题回答(QA)是信息检索和信息提取领域内的一项自然理解任务,由于基于机器阅读理解的模型的强劲发展,近年来,近年来,近年来的计算语言学和人工智能研究社区引起了很多关注。基于读者的质量检查系统是一种高级搜索引擎,可以使用机器阅读理解(MRC)技术在开放域或特定领域特定文本中找到正确的查询或问题的答案。 MRC和QA系统中的数据资源和机器学习方法的大多数进步尤其是在两种资源丰富的语言中显着开发的,例如英语和中文。像越南人这样的低资源语言见证了关于质量检查系统的稀缺研究。本文介绍了XLMRQA,这是第一个在基于Wikipedia的文本知识源(使用UIT-Viquad语料库)上使用基于变压器的读取器的越南质量检查系统,使用深​​层神经网络模型优于DRQA和BERTSERINI,优于两个可靠的QA系统分别为24.46%和6.28%。从三个系统获得的结果中,我们分析了问题类型对质量检查系统性能的影响。
translated by 谷歌翻译
Open-Domain Question Answering (ODQA) requires models to answer factoid questions with no context given. The common way for this task is to train models on a large-scale annotated dataset to retrieve related documents and generate answers based on these documents. In this paper, we show that the ODQA architecture can be dramatically simplified by treating Large Language Models (LLMs) as a knowledge corpus and propose a Self-Prompting framework for LLMs to perform ODQA so as to eliminate the need for training data and external knowledge corpus. Concretely, we firstly generate multiple pseudo QA pairs with background passages and one-sentence explanations for these QAs by prompting LLMs step by step and then leverage the generated QA pairs for in-context learning. Experimental results show our method surpasses previous state-of-the-art methods by +8.8 EM averagely on three widely-used ODQA datasets, and even achieves comparable performance with several retrieval-augmented fine-tuned models.
translated by 谷歌翻译
该论文为罗马尼亚语提供了一个开放域的答案系统,回答了Covid-19相关问题。QA系统管道涉及自动问题处理,自动查询生成,Web搜索前10个最相关的文档,并使用用于提取质量质量质量质量质量质量质量的BERT模型回答提取,并在我们手动创建的COVID-19数据集上进行了培训。该论文将介绍质量检查系统及其与罗马尼亚语言技术的集成,COVID-19数据集以及对质量检查性能的不同评估。
translated by 谷歌翻译