This paper presents a pre-training technique called query-as-context that uses query prediction to improve dense retrieval. Previous research has applied query prediction to document expansion in order to alleviate the problem of lexical mismatch in sparse retrieval. However, query prediction has not yet been studied in the context of dense retrieval. Query-as-context pre-training assumes that the predicted query is a special context for the document and uses contrastive learning or contextual masked auto-encoding learning to compress the document and query into dense vectors. The technique is evaluated on large-scale passage retrieval benchmarks and shows considerable improvements compared to existing strong baselines such as coCondenser and CoT-MAE, demonstrating its effectiveness. Our code will be available at https://github.com/caskcsg/ir/tree/main/cotmae-qc .
translated by 谷歌翻译
密集的段落检索旨在根据查询和段落的密集表示(即矢量)从大型语料库中检索查询的相关段落。最近的研究探索了改善预训练的语言模型,以提高密集的检索性能。本文提出了COT-MAE(上下文掩盖自动编码器),这是一种简单而有效的生成性预训练方法,可用于密集通道检索。 COT-MAE采用了不对称的编码器架构,该体系结构学会通过自我监督和上下文监督的掩盖自动编码来将句子语义压缩到密集的矢量中。精确,自我监督的掩盖自动编码学会学会为文本跨度内的令牌的语义建模,并学习上下文监督的蒙版自动编码学学习以建模文本跨度之间的语义相关性。我们对大规模通道检索基准进行实验,并显示出对强基础的大量改进,证明了COT-MAE的效率很高。
translated by 谷歌翻译
在本文中,我们提出了一个新的密集检索模型,该模型通过深度查询相互作用学习了各种文档表示。我们的模型使用一组生成的伪Queries编码每个文档,以获取查询信息的多视文档表示。它不仅具有较高的推理效率,例如《香草双编码模型》,而且还可以在文档编码中启用深度查询文档的交互,并提供多方面的表示形式,以更好地匹配不同的查询。几个基准的实验证明了所提出的方法的有效性,表现出色的双重编码基准。
translated by 谷歌翻译
Dense retrieval aims to map queries and passages into low-dimensional vector space for efficient similarity measuring, showing promising effectiveness in various large-scale retrieval tasks. Since most existing methods commonly adopt pre-trained Transformers (e.g. BERT) for parameter initialization, some work focuses on proposing new pre-training tasks for compressing the useful semantic information from passages into dense vectors, achieving remarkable performances. However, it is still challenging to effectively capture the rich semantic information and relations about passages into the dense vectors via one single particular pre-training task. In this work, we propose a multi-task pre-trained model, MASTER, that unifies and integrates multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture. Concretely, MASTER utilizes a multi-decoder architecture to integrate three types of pre-training tasks: corrupted passages recovering, related passage recovering and PLMs outputs recovering. By incorporating a shared deep encoder, we construct a representation bottleneck in our architecture, compressing the abundant semantic information across tasks into dense vectors. The first two types of tasks concentrate on capturing the semantic information of passages and relationships among them within the pre-training corpus. The third one can capture the knowledge beyond the corpus from external PLMs (e.g. GPT-2). Extensive experiments on several large-scale passage retrieval datasets have shown that our approach outperforms the previous state-of-the-art dense retrieval methods. Our code and data are publicly released in https://github.com/microsoft/SimXNS
translated by 谷歌翻译
Long document retrieval aims to fetch query-relevant documents from a large-scale collection, where knowledge distillation has become de facto to improve a retriever by mimicking a heterogeneous yet powerful cross-encoder. However, in contrast to passages or sentences, retrieval on long documents suffers from the scope hypothesis that a long document may cover multiple topics. This maximizes their structure heterogeneity and poses a granular-mismatch issue, leading to an inferior distillation efficacy. In this work, we propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers. While preserving the conventional dense retrieval paradigm, it first produces global-consistent representations crossing different fine granularity and then applies multi-granular aligned distillation merely during training. In experiments, we evaluate our framework on two long-document retrieval benchmarks, which show state-of-the-art performance.
translated by 谷歌翻译
近年来,在应用预训练的语言模型(例如Bert)上,取得了巨大进展,以获取信息检索(IR)任务。在网页中通常使用的超链接已被利用用于设计预训练目标。例如,超链接的锚文本已用于模拟查询,从而构建了巨大的查询文档对以进行预训练。但是,作为跨越两个网页的桥梁,尚未完全探索超链接的潜力。在这项工作中,我们专注于建模通过超链接连接的两个文档之间的关系,并为临时检索设计一个新的预训练目标。具体而言,我们将文档之间的关系分为四组:无链接,单向链接,对称链接和最相关的对称链接。通过比较从相邻组采样的两个文档,该模型可以逐渐提高其捕获匹配信号的能力。我们提出了一个渐进的超链接预测({php})框架,以探索预训练中超链接的利用。对两个大规模临时检索数据集和六个提问数据集的实验结果证明了其优于现有的预训练方法。
translated by 谷歌翻译
已经表明,在一个域上训练的双编码器经常概括到其他域以获取检索任务。一种广泛的信念是,一个双编码器的瓶颈层,其中最终得分仅仅是查询向量和通道向量之间的点产品,它过于局限,使得双编码器是用于域外概括的有效检索模型。在本文中,我们通过缩放双编码器模型的大小{\ em同时保持固定的瓶颈嵌入尺寸固定的瓶颈的大小来挑战这一信念。令人惊讶的是,令人惊讶的是,缩放模型尺寸会对各种缩放提高检索任务,特别是对于域外泛化。实验结果表明,我们的双编码器,\ textbf {g} enovalizable \ textbf {t} eTrievers(gtr),优先级%colbert〜\ cite {khattab2020colbertt}和现有的稀疏和密集的索取Beir DataSet〜\ Cite {Thakur2021Beir}显着显着。最令人惊讶的是,我们的消融研究发现,GTR是非常数据的高效,因为它只需要10 \%MARCO监督数据,以实现最佳域的性能。所有GTR模型都在https://tfhub.dev/google/collections/gtr/1发布。
translated by 谷歌翻译
对于开放式域问题的密集检索已被证明通过在问题通道对的大型数据集上培训来实现令人印象深刻的性能。我们调查是否可以以自我监督的方式学习密集的检索,并有效地应用没有任何注释。我们观察到这种情况下的检索斗争的现有借用模型,并提出了一种设计用于检索的新预制方案:重复跨度检索。我们在文档中使用经常性跨度来创建用于对比学习的伪示例。由此产生的模型 - 蜘蛛 - 在广泛的ODQA数据集上没有任何示例,并且与BM25具有竞争力,具有强烈的稀疏基线。此外,蜘蛛通常优于DPR在其他数据集的问题上培训的DPR培训的强大基线。我们将蜘蛛与BM25结合的混合猎犬改进了所有数据集的组件,并且通常与域中DPR模型具有竞争力,这些模型培训数万例培训。
translated by 谷歌翻译
密集的检索方法可以克服词汇差距并导致显着改善的搜索结果。但是,它们需要大量的培训数据,这些数据不适用于大多数域。如前面的工作所示(Thakur等,2021b),密集检索的性能在域移位下严重降低。这限制了密集检索方法的使用,只有几个具有大型训练数据集的域。在本文中,我们提出了一种新颖的无监督域适配方法生成伪标签(GPL),其将查询发生器与来自跨编码器的伪标记相结合。在六种代表性域专用数据集中,我们发现所提出的GPL可以优于箱子外的最先进的密集检索方法,最高可达8.9点NDCG @ 10。 GPL需要来自目标域的少(未标记)数据,并且在其培训中比以前的方法更强大。我们进一步调查了六种最近训练方法在检索任务的域改编方案中的作用,其中只有三种可能会产生改善的结果。最好的方法,Tsdae(Wang等,2021)可以与GPL结合,在六个任务中产生了1.0点NDCG @ 10的另一个平均改善。
translated by 谷歌翻译
The dual-encoder has become the de facto architecture for dense retrieval. Typically, it computes the latent representations of the query and document independently, thus failing to fully capture the interactions between the query and document. To alleviate this, recent work expects to get query-informed representations of documents. During training, it expands the document with a real query, while replacing the real query with a generated pseudo query at inference. This discrepancy between training and inference makes the dense retrieval model pay more attention to the query information but ignore the document when computing the document representation. As a result, it even performs worse than the vanilla dense retrieval model, since its performance depends heavily on the relevance between the generated queries and the real query. In this paper, we propose a curriculum sampling strategy, which also resorts to the pseudo query at training and gradually increases the relevance of the generated query to the real query. In this way, the retrieval model can learn to extend its attention from the document only to both the document and query, hence getting high-quality query-informed document representations. Experimental results on several passage retrieval datasets show that our approach outperforms the previous dense retrieval methods1.
translated by 谷歌翻译
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
translated by 谷歌翻译
信息检索是自然语言处理中的重要组成部分,用于知识密集型任务,如问题应答和事实检查。最近,信息检索已经看到基于神经网络的密集检索器的出现,作为基于术语频率的典型稀疏方法的替代方案。这些模型在数据集和任务中获得了最先进的结果,其中提供了大型训练集。但是,它们不会很好地转移到没有培训数据的新域或应用程序,并且通常因未经监督的术语 - 频率方法(例如BM25)的术语频率方法而言。因此,自然问题是如果没有监督,是否有可能训练密集的索取。在这项工作中,我们探讨了对比学习的限制,作为培训无人监督的密集检索的一种方式,并表明它导致强烈的检索性能。更确切地说,我们在15个数据集中出现了我们的模型胜过BM25的Beir基准测试。此外,当有几千例的示例可用时,我们显示微调我们的模型,与BM25相比,这些模型导致强大的改进。最后,当在MS-Marco数据集上微调之前用作预训练时,我们的技术在Beir基准上获得最先进的结果。
translated by 谷歌翻译
深度及时调整(DPT)在大多数自然语言处理〜(NLP)任务中取得了巨大成功。然而,在微调〜(ft)仍然占主导地位的密集检索中,它并没有得到很好的评价。当使用相同的骨干模型〜(例如,Roberta)部署多个检索任务时,基于FT的方法在部署成本方面是不友好的:每个新的检索模型都需要在不重复使用的情况下反复部署骨干模型。为了在这种情况下降低部署成本,这项工作调查了在密集检索中应用DPT。面临的挑战是,直接在密集检索中直接应用DPT在很大程度上表现不佳。为了弥补性能下降,我们建议针对基于DPT的检索器的两种模型不合时宜的和任务不足的策略,即以检索为导向的中间体预处理和统一的负面采矿,作为一种一般方法,可以与任何预先培训的语言模型兼容和检索任务。实验结果表明,所提出的方法(称为DPTDR)在MS-Marco和自然问题上都优于先前的最新模型。我们还进行消融研究以检查每种策略在DPTDR中的有效性。我们认为,这项工作有助于该行业,因为它节省了巨大的部署和成本,并增加了计算资源的实用性。我们的代码可在https://github.com/tangzhy/dptdr上找到。
translated by 谷歌翻译
稀疏的词汇表现学习已经证明了在近期模型中提高通道检索效果,例如Deepumact,Unicoil和Splade。本文介绍了一种简单而有效的方法,用于通过引入稀疏屏蔽方案来控制稀疏性和自学方法来控制诽谤和自学方法来模拟脱锁表示模拟缺陷表示来缩小通道检索的词汇表格的简单但有效的方法。我们模型的基本实施具有更精致的方法,实现了有效性和效率之间的良好平衡。我们的方法简单地为未来的词汇表达学习探索开辟了门,以便检索。
translated by 谷歌翻译
To improve the performance of the dual-encoder retriever, one effective approach is knowledge distillation from the cross-encoder ranker. Existing works construct the candidate passages following the supervised learning setting where a query is paired with a positive passage and a batch of negatives. However, through empirical observation, we find that even the hard negatives from advanced methods are still too trivial for the teacher to distinguish, preventing the teacher from transferring abundant dark knowledge to the student through its soft label. To alleviate this issue, we propose ADAM, a knowledge distillation framework that can better transfer the dark knowledge held in the teacher with Adaptive Dark exAMples. Different from previous works that only rely on one positive and hard negatives as candidate passages, we create dark examples that all have moderate relevance to the query through mixing-up and masking in discrete space. Furthermore, as the quality of knowledge held in different training instances varies as measured by the teacher's confidence score, we propose a self-paced distillation strategy that adaptively concentrates on a subset of high-quality instances to conduct our dark-example-based knowledge distillation to help the student learn better. We conduct experiments on two widely-used benchmarks and verify the effectiveness of our method.
translated by 谷歌翻译
我们提出了一种以最小计算成本提高广泛检索模型的性能的框架。它利用由基本密度检索方法提取的预先提取的文档表示,并且涉及训练模型以共同评分每个查询的一组检索到的候选文档,同时在其他候选的上下文中暂时转换每个文档的表示。以及查询本身。当基于其与查询的相似性进行评分文档表示时,该模型因此意识到其“对等”文档的表示。我们表明,我们的方法导致基本方法的检索性能以及彼此隔离的评分候选文档进行了大量改善,如在一对培训环境中。至关重要的是,与基于伯特式编码器的术语交互重型器不同,它在运行时在任何第一阶段方法的顶部引发可忽略不计的计算开销,允许它与任何最先进的密集检索方法容易地结合。最后,同时考虑给定查询的一组候选文档,可以在检索中进行额外的有价值的功能,例如评分校准和减轻排名中的社会偏差。
translated by 谷歌翻译
知识密集型语言任务(苏格兰信)通常需要大量信息来提供正确的答案。解决此问题的一种流行范式是将搜索系统与机器读取器相结合,前者检索支持证据,后者检查它们以产生答案。最近,读者组成部分在大规模预培养的生成模型的帮助下见证了重大进展。同时,搜索组件中的大多数现有解决方案都依赖于传统的``索引 - retrieve-then-Rank''管道,该管道遭受了巨大的内存足迹和端到端优化的困难。受到最新构建基于模型的IR模型的努力的启发,我们建议用新颖的单步生成模型替换传统的多步搜索管道,该模型可以极大地简化搜索过程并以端到端的方式进行优化。我们表明,可以通过一组经过适当设计的预训练任务来学习强大的生成检索模型,并被采用以通过进一步的微调来改善各种下游苏格兰短裙任务。我们将预训练的生成检索模型命名为Copusbrain,因为有关该语料库的所有信息均以其参数进行编码,而无需构造其他索引。经验结果表明,在苏格兰语基准上的检索任务并建立了新的最新性能,Copusbrain可以极大地超过强大的基准。我们还表明,在零农源和低资源设置下,科体班运行良好。
translated by 谷歌翻译
We present Hybrid Infused Reranking for Passages Retrieval (HYRR), a framework for training rerankers based on a hybrid of BM25 and neural retrieval models. Retrievers based on hybrid models have been shown to outperform both BM25 and neural models alone. Our approach exploits this improved performance when training a reranker, leading to a robust reranking model. The reranker, a cross-attention neural model, is shown to be robust to different first-stage retrieval systems, achieving better performance than rerankers simply trained upon the first-stage retrievers in the multi-stage systems. We present evaluations on a supervised passage retrieval task using MS MARCO and zero-shot retrieval tasks using BEIR. The empirical results show strong performance on both evaluations.
translated by 谷歌翻译
当前的密集文本检索模型面临两个典型的挑战。首先,他们采用暹罗双重编码架构来独立编码查询和文档,以快速索引和搜索,同时忽略了较细粒度的术语互动。这导致了次优的召回表现。其次,他们的模型培训高度依赖于负面抽样技术,以在其对比损失中构建负面文档。为了应对这些挑战,我们提出了对抗猎犬速率(AR2),它由双重编码猎犬加上跨编码器等级组成。这两种模型是根据最小群体对手的共同优化的:检索员学会了检索负面文件以欺骗排名者,而排名者学会了对包括地面和检索的候选人进行排名,并提供渐进的直接反馈对双编码器检索器。通过这款对抗性游戏,猎犬逐渐生产出更难的负面文件来训练更好的排名者,而跨编码器排名者提供了渐进式反馈以改善检索器。我们在三个基准测试基准上评估AR2。实验结果表明,AR2始终如一地胜过现有的致密回收者方法,并在所有这些方法上实现了新的最新结果。这包括对自然问题的改进R@5%至77.9%(+2.1%),Triviaqa R@5%至78.2%(+1.4)和MS-Marco MRR@10%至39.5%(+1.3%)。代码和型号可在https://github.com/microsoft/ar2上找到。
translated by 谷歌翻译
尽管在训练有素的语言模型方面取得了进展,但缺乏用于预训练的句子表示的统一框架。因此,它要求针对特定方案采用不同的预训练方法,并且预培训的模型可能受到其普遍性和表示质量的限制。在这项工作中,我们扩展了最近提出的MAE风格的预训练策略RELOMAE,以便可以有效地支持各种句子表示任务。扩展的框架由两个阶段组成,在整个过程中进行了逆转录。第一阶段对通用语料库进行了逆转,例如Wikipedia,BookCorpus等,从中可以从中学习基本模型。第二阶段发生在特定于领域的数据上,例如Marco和NLI,在该数据中,基本模型是基于逆转和对比度学习的。这两个阶段的训练前输出可能会服务于不同的应用,其有效性通过全面的实验验证。具体来说,基本模型被证明对零射击检索有效,并且在贝尔基准上取得了显着的性能。继续进行预训练的模型进一步受益于更多的下游任务,包括针对马可女士的特定领域的密集检索,自然问题以及句子嵌入式标准STS的质量和延性端的转移任务。这项工作的经验见解可能会激发预训练的句子代表的未来设计。我们的预培训模型和源代码将发布给公共社区。
translated by 谷歌翻译