与语音界面进行互动以查询问答(QA)系统越来越流行。通常,质量保证系统依靠通道检索来选择候选上下文并阅读理解以提取最终答案。尽管人们一直在关注质量检查系统的阅读理解部分,以防止自动语音识别(ASR)模型引入的错误,但段落检索部分仍未开发。但是,此类错误会影响通过检索的性能,从而导致端到端的性能较低。为了解决这一差距,我们通过合成的ASR噪声增强了两个现有的大规模通道排名和开放域QA数据集,并研究了ASR噪声的问题,并研究词汇和密度捕捞器的鲁棒性。此外,我们研究了不同领域的数据增强技术的普遍性。每个域都是不同的语言方言或口音。最后,我们创建了一个新数据集,其中包含人类用户提出的问题,并使用其转录表明,在处理自然ASR噪声而不是合成ASR噪声时,检索性能会进一步降低。
translated by 谷歌翻译
Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dualencoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks. 1 * Equal contribution 1 The code and trained models have been released at https://github.com/facebookresearch/DPR.
translated by 谷歌翻译
我们在11个类型的类型不同语言中展示了一个用于单语言检索的多语言基准数据集的Tydi先生,旨在评估与学习的密集表示的排名。该资源的目标是以非英语语言的密集检索技术进行培训,最近的观察结果是当应用于分发超出数据时的表示学习的现有技术表现不佳。作为一个起点,我们基于我们称之为“MDPR”的多语言调整,为此新数据集提供零拍摄线。实验表明,尽管MDPR的有效性远低于BM25,但仍然似乎提供了有价值的相关信号,改善了BM25导致稀疏致密的杂种。除了对我们的结果分析外,我们还讨论了未来的挑战,并在多语言密集检索中展示了一个研究议程。Tydi先生可以在https://github.com/castorini/mr.tydi下载。
translated by 谷歌翻译
口头问题答案(SQA)是要从一个问题中找到口语文件的答案,这对于个人助理回复用户的查询至关重要。现有的SQA方法均取决于自动语音识别(ASR)成绩单。不仅需要对ASR进行大量的注释数据,这些数据是时间且成本良好的低资源语言的收集,而且更重要的是,问题的答案通常包括名称实体或不可能的唱片词正确识别。此外,ASR旨在最大程度地减少所有单词的识别错误,包括与SQA任务无关的许多函数单词。因此,尽管非常困难,但始终是高度期望的无ASR转录本(无文本)的SQA。这项工作提出了离散的口语自适应学习(双重),利用未标记的数据进行预训练,并通过SQA下游任务进行了微调。口语答案的时间间隔可以直接从口语文件预测。我们还发布了一个新的SQA基准语料库NMSQA,以了解具有更现实的方案的数据。我们从经验上表明,双重收益结果与通过级联ASR和文本质量质量质量质量质量质量质量质量质量质量质量质量质量质量质量数据相媲美,并与现实世界中的数据相当。我们的代码和模型将是开源的。
translated by 谷歌翻译
当前的密集文本检索模型面临两个典型的挑战。首先,他们采用暹罗双重编码架构来独立编码查询和文档,以快速索引和搜索,同时忽略了较细粒度的术语互动。这导致了次优的召回表现。其次,他们的模型培训高度依赖于负面抽样技术,以在其对比损失中构建负面文档。为了应对这些挑战,我们提出了对抗猎犬速率(AR2),它由双重编码猎犬加上跨编码器等级组成。这两种模型是根据最小群体对手的共同优化的:检索员学会了检索负面文件以欺骗排名者,而排名者学会了对包括地面和检索的候选人进行排名,并提供渐进的直接反馈对双编码器检索器。通过这款对抗性游戏,猎犬逐渐生产出更难的负面文件来训练更好的排名者,而跨编码器排名者提供了渐进式反馈以改善检索器。我们在三个基准测试基准上评估AR2。实验结果表明,AR2始终如一地胜过现有的致密回收者方法,并在所有这些方法上实现了新的最新结果。这包括对自然问题的改进R@5%至77.9%(+2.1%),Triviaqa R@5%至78.2%(+1.4)和MS-Marco MRR@10%至39.5%(+1.3%)。代码和型号可在https://github.com/microsoft/ar2上找到。
translated by 谷歌翻译
我们介绍了关于多语言信息访问(MIA)2022共享任务的研讨会的结果,评估了16种类型上多样性的语言中的跨语性开放回程答案(QA)系统。在此任务中,我们在14种类型上多样化的语言中调整了两个大规模的跨语性开放式质疑QA数据集,并使用了2种代表性不足的语言中的新注释的开放式QA数据:Tagalog和Tamil。四个团队提交了他们的系统。利用迭代开采的最佳系统是不同的负面示例和较大的预审慎模型达到32.2 F1,表现优于我们的基线4.5分。第二最佳系统使用实体感知的上下文化表示文档检索,并在泰米尔语(20.8 F1)方面取得了重大改进,而其他大多数系统的得分几乎为零。
translated by 谷歌翻译
我们介绍了Art,这是一种新的语料库级自动编码方法,用于培训密集检索模型,不需要任何标记的培训数据。密集的检索是开放域任务(例如Open QA)的核心挑战,在该任务中,最先进的方法通常需要大量的监督数据集,并具有自定义的硬性采矿和肯定式示例。相反,艺术品仅需要访问未配对的投入和输出(例如问题和潜在的答案文件)。它使用新的文档 - 重新定义自动编码方案,其中(1)输入问题用于检索一组证据文档,并且(2)随后使用文档来计算重建原始问题的概率。基于问题重建的检索培训可以有效地学习文档和问题编码器,以后可以将其纳入完整的QA系统中,而无需任何进一步的填充。广泛的实验表明,ART在多个QA检索基准测试基准上获得最先进的结果,并且仅来自预训练的语言模型的一般初始化,从而消除了对标记的数据和特定于任务的损失的需求。
translated by 谷歌翻译
关于信息检索的许多最新研究集中在如何从一项任务(通常具有丰富的监督数据)转移到有限的其他各种任务,并隐含地假设可以从一个任务概括到所有其余的任务。但是,这忽略了这样一个事实,即有许多多样化和独特的检索任务,每个任务都针对不同的搜索意图,查询和搜索域。在本文中,我们建议使用几乎没有散热的检索,每个任务都有一个简短的描述和一些示例。为了扩大一些示例的功能,我们提出了针对检索器(即将到来)的及时基本查询生成,该查询将大型语言模型(LLM)作为几个弹片查询生成器,并根据生成的数据创建特定于任务的检索器。通过LLM的概括能力提供动力,即要来源使得可以仅基于一些示例{没有自然问题或MS MARCO来训练%问题生成器或双重编码器,就可以仅基于一些示例{没有}来创建特定于任务的端到端检索。出乎意料的是,LLM提示不超过8个示例,允许双重编码器在MARCO(例如Colbert V2)上训练的大量工程模型平均在11个检索套件中超过1.2 NDCG。使用相同生成数据的进一步培训标准尺寸的重新级别可获得5.0点NDCG的改进。我们的研究确定,查询产生比以前观察到的更有效,尤其是在给出少量特定于任务知识的情况下。
translated by 谷歌翻译
已经表明,在一个域上训练的双编码器经常概括到其他域以获取检索任务。一种广泛的信念是,一个双编码器的瓶颈层,其中最终得分仅仅是查询向量和通道向量之间的点产品,它过于局限,使得双编码器是用于域外概括的有效检索模型。在本文中,我们通过缩放双编码器模型的大小{\ em同时保持固定的瓶颈嵌入尺寸固定的瓶颈的大小来挑战这一信念。令人惊讶的是,令人惊讶的是,缩放模型尺寸会对各种缩放提高检索任务,特别是对于域外泛化。实验结果表明,我们的双编码器,\ textbf {g} enovalizable \ textbf {t} eTrievers(gtr),优先级%colbert〜\ cite {khattab2020colbertt}和现有的稀疏和密集的索取Beir DataSet〜\ Cite {Thakur2021Beir}显着显着。最令人惊讶的是,我们的消融研究发现,GTR是非常数据的高效,因为它只需要10 \%MARCO监督数据,以实现最佳域的性能。所有GTR模型都在https://tfhub.dev/google/collections/gtr/1发布。
translated by 谷歌翻译
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will be publishing for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparisons. In this work, we present the details of data collection and annotation and the performance of the baseline models. We also perform sensitivity analysis of pipeline models' performance (speech recognizer + text model) to the speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.
translated by 谷歌翻译
We present SpaceQA, to the best of our knowledge the first open-domain QA system in Space mission design. SpaceQA is part of an initiative by the European Space Agency (ESA) to facilitate the access, sharing and reuse of information about Space mission design within the agency and with the public. We adopt a state-of-the-art architecture consisting of a dense retriever and a neural reader and opt for an approach based on transfer learning rather than fine-tuning due to the lack of domain-specific annotated data. Our evaluation on a test set produced by ESA is largely consistent with the results originally reported by the evaluated retrievers and confirms the need of fine tuning for reading comprehension. As of writing this paper, ESA is piloting SpaceQA internally.
translated by 谷歌翻译
对于开放式域问题的密集检索已被证明通过在问题通道对的大型数据集上培训来实现令人印象深刻的性能。我们调查是否可以以自我监督的方式学习密集的检索,并有效地应用没有任何注释。我们观察到这种情况下的检索斗争的现有借用模型,并提出了一种设计用于检索的新预制方案:重复跨度检索。我们在文档中使用经常性跨度来创建用于对比学习的伪示例。由此产生的模型 - 蜘蛛 - 在广泛的ODQA数据集上没有任何示例,并且与BM25具有竞争力,具有强烈的稀疏基线。此外,蜘蛛通常优于DPR在其他数据集的问题上培训的DPR培训的强大基线。我们将蜘蛛与BM25结合的混合猎犬改进了所有数据集的组件,并且通常与域中DPR模型具有竞争力,这些模型培训数万例培训。
translated by 谷歌翻译
除局部相关性外,开放域的Factoid问题回答的段落排名还需要一个段落以包含答案(答案)。尽管最近的一些研究将一些阅读能力纳入了排名者以说明答复性,但排名仍然受到该领域通常可用的训练数据的嘈杂性质的阻碍,这将考虑任何包含答案实体作为正样本的段落。但是,段落中的答案实体不一定与给定的问题有关。为了解决该问题,我们提出了一种基于生成对抗性神经网络的通道重新管理的方法,称为\ ttt {pregan},除了局部相关性外,还结合了关于答复性的歧视者。目的是强迫发电机对局部相关的段落进行排名,并包含答案。五个公共数据集的实验表明,\ ttt {pregan}可以更好地对适当的段落进行排名,从而提高质量检查系统的有效性,并在不使用外部数据的情况下优于现有方法。
translated by 谷歌翻译
Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend GenQA to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all 5 languages and monolingual generative pipelines for three out of five languages studied.
translated by 谷歌翻译
低资源语言的自动语音识别(ASR)改善了语言少数群体的访问,以便人工智能(AI)提供的技术优势。在本文中,我们通过创建一个新的粤语数据集来解决香港广东语言的数据稀缺问题。我们的数据集多域粤语语料库(MDCC)由73.6小时的清洁阅读语音与成绩单配对,从香港的粤语有声读物收集。它结合了哲学,政治,教育,文化,生活方式和家庭领域,涵盖了广泛的主题。我们还查看所有现有的粤语数据集,并在两个最大的数据集(MDCC和公共语音ZH-HK)上执行实验。我们根据其语音类型,数据源,总大小和可用性分析现有数据集。使用Fairseq S2T变压器,最先进的ASR模型进行实验结果,显示了我们数据集的有效性。此外,我们通过在MDCC和常见的声音ZH-HK上应用多数据集学习来创建一个强大而强大的粤语ASR模型。
translated by 谷歌翻译
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducable evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
translated by 谷歌翻译
We present Hybrid Infused Reranking for Passages Retrieval (HYRR), a framework for training rerankers based on a hybrid of BM25 and neural retrieval models. Retrievers based on hybrid models have been shown to outperform both BM25 and neural models alone. Our approach exploits this improved performance when training a reranker, leading to a robust reranking model. The reranker, a cross-attention neural model, is shown to be robust to different first-stage retrieval systems, achieving better performance than rerankers simply trained upon the first-stage retrievers in the multi-stage systems. We present evaluations on a supervised passage retrieval task using MS MARCO and zero-shot retrieval tasks using BEIR. The empirical results show strong performance on both evaluations.
translated by 谷歌翻译
We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises of 1,010,916 anonymized questionssampled from Bing's search query logs-each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages-extracted from 3,563,535 web documents retrieved by Bing-that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguishes MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset makes it attractive for benchmarking machine reading comprehension and question-answering models.
translated by 谷歌翻译
我们介绍了CVSS,这是一种大规模的多语言对语音转换(S2ST)语料库,从21种语言覆盖了21种语言的句子级并行S2ST对。通过将Covost 2从Covost 2的翻译文本综合将翻译文本与最先进的TTS系统合成语音,源自公共语音语音语料库和COVOST 2语音到文本转换(ST)语料库。提供了两个版本的翻译演讲:1)CVSS-C:所有翻译演讲都是一种高质量的规范声音; 2)CVSS-T:翻译语音从相应的源语音传输。此外,CVSS提供标准化的翻译文本,它与翻译语音中的发音匹配。在每个版本的CVSS上,我们建立了基线多语言直接S2ST模型和Cascade S2ST模型,验证了语料库的有效性。为了构建强大的Cascade S2ST基准,我们在Covost 2上培训了St模型,这优于前一种最先进的培训,而无需额外的数据。尽管如此,直接S2ST模型的性能在从头开始训练时接近强级联基线,并且在匹配ST模型中初始化时,仅在ASR转换转换时的0.1或0.7bleu差异。
translated by 谷歌翻译
我们提出了Drboost,一个受升压启发的密集检索合奏。Drboost在阶段接受培训:通过仅关注当前合奏制作的检索错误来依次学习和专注于每个组件模型。最终的表示是所有组件模型的输出矢量的串联,使其成为测试时间标准密集检索器的替代品。与标准密集检索模型相比,Drboost享有几个优点。它产生的表示是4x更紧凑,同时提供可比的检索结果。它还在具有粗量化的近似搜索下进行令人惊讶的良好,从而减少另一个4x的延迟和带宽需求。在实践中,这可以在从内存中服务索引之间的服务指数之间的区别,为更便宜的部署铺平道路。
translated by 谷歌翻译