基于语义空间中密集表示的检索模型已成为第一阶段检索的必不可少的分支。这些检索员受益于代表学习朝着压缩全球序列级嵌入的进步。但是,它们很容易忽略本地的显着短语和实体在文本中提到的,这些短语通常在第一阶段的检索中扮演枢轴角色。为了减轻这种弱点,我们提议使一个密集的检索器对齐一个表现出色的词典意识代表模型。对齐方式是通过弱化的知识蒸馏来实现的,以通过两个方面来启发猎犬 - 1)词汇扬声的对比目标,以挑战密集编码器和2)一个配对的等级正规化,以使密集的模型的行为倾向于其他人的行为。我们在三个公共基准上评估了我们的模型,这表明,凭借可比的词典觉得回收犬作为老师,我们提议的密集人可以带来一致而重大的改进,甚至超过教师。此外,我们发现我们对密集猎犬的改进是与标准排名蒸馏的补充,这可以进一步提高最先进的性能。
translated by 谷歌翻译
To improve the performance of the dual-encoder retriever, one effective approach is knowledge distillation from the cross-encoder ranker. Existing works construct the candidate passages following the supervised learning setting where a query is paired with a positive passage and a batch of negatives. However, through empirical observation, we find that even the hard negatives from advanced methods are still too trivial for the teacher to distinguish, preventing the teacher from transferring abundant dark knowledge to the student through its soft label. To alleviate this issue, we propose ADAM, a knowledge distillation framework that can better transfer the dark knowledge held in the teacher with Adaptive Dark exAMples. Different from previous works that only rely on one positive and hard negatives as candidate passages, we create dark examples that all have moderate relevance to the query through mixing-up and masking in discrete space. Furthermore, as the quality of knowledge held in different training instances varies as measured by the teacher's confidence score, we propose a self-paced distillation strategy that adaptively concentrates on a subset of high-quality instances to conduct our dark-example-based knowledge distillation to help the student learn better. We conduct experiments on two widely-used benchmarks and verify the effectiveness of our method.
translated by 谷歌翻译
Long document retrieval aims to fetch query-relevant documents from a large-scale collection, where knowledge distillation has become de facto to improve a retriever by mimicking a heterogeneous yet powerful cross-encoder. However, in contrast to passages or sentences, retrieval on long documents suffers from the scope hypothesis that a long document may cover multiple topics. This maximizes their structure heterogeneity and poses a granular-mismatch issue, leading to an inferior distillation efficacy. In this work, we propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers. While preserving the conventional dense retrieval paradigm, it first produces global-consistent representations crossing different fine granularity and then applies multi-granular aligned distillation merely during training. In experiments, we evaluate our framework on two long-document retrieval benchmarks, which show state-of-the-art performance.
translated by 谷歌翻译
排名者在事实上的“检索和rerank”管道中起着必不可少的作用,但其训练仍然落后 - 从中​​度的负面因素或/和/和/和作为回收者的辅助模块中学习。在这项工作中,我们首先确定了强大的排名者的两个主要障碍,即是由训练有素的回猎犬和非理想的负面负面的固有标签噪声,该噪声是为高能力的排名所采样的。因此,我们提出多个检索器,因为负面发电机改善了排名者的鲁棒性,其中i)涉及广泛的分发标签噪声,使排名者与每个噪声分布相对,而ii)与排名相对较接近排名负分配,导致更具挑战性的培训。为了评估我们的强大排名者(称为r $^2 $ anker),我们在各种环境中进行了有关流行通道检索基准测试的各种实验,包括BM25级,全等级,回收者蒸馏等。经验结果验证了新的州 - 新州 - 新州 - 我们模型的效果。
translated by 谷歌翻译
在本文中,我们提出了一个新的密集检索模型,该模型通过深度查询相互作用学习了各种文档表示。我们的模型使用一组生成的伪Queries编码每个文档,以获取查询信息的多视文档表示。它不仅具有较高的推理效率,例如《香草双编码模型》,而且还可以在文档编码中启用深度查询文档的交互,并提供多方面的表示形式,以更好地匹配不同的查询。几个基准的实验证明了所提出的方法的有效性,表现出色的双重编码基准。
translated by 谷歌翻译
我们提出了一种以最小计算成本提高广泛检索模型的性能的框架。它利用由基本密度检索方法提取的预先提取的文档表示,并且涉及训练模型以共同评分每个查询的一组检索到的候选文档,同时在其他候选的上下文中暂时转换每个文档的表示。以及查询本身。当基于其与查询的相似性进行评分文档表示时,该模型因此意识到其“对等”文档的表示。我们表明,我们的方法导致基本方法的检索性能以及彼此隔离的评分候选文档进行了大量改善,如在一对培训环境中。至关重要的是,与基于伯特式编码器的术语交互重型器不同,它在运行时在任何第一阶段方法的顶部引发可忽略不计的计算开销,允许它与任何最先进的密集检索方法容易地结合。最后,同时考虑给定查询的一组候选文档,可以在检索中进行额外的有价值的功能,例如评分校准和减轻排名中的社会偏差。
translated by 谷歌翻译
当前的密集文本检索模型面临两个典型的挑战。首先,他们采用暹罗双重编码架构来独立编码查询和文档,以快速索引和搜索,同时忽略了较细粒度的术语互动。这导致了次优的召回表现。其次,他们的模型培训高度依赖于负面抽样技术,以在其对比损失中构建负面文档。为了应对这些挑战,我们提出了对抗猎犬速率(AR2),它由双重编码猎犬加上跨编码器等级组成。这两种模型是根据最小群体对手的共同优化的:检索员学会了检索负面文件以欺骗排名者,而排名者学会了对包括地面和检索的候选人进行排名,并提供渐进的直接反馈对双编码器检索器。通过这款对抗性游戏,猎犬逐渐生产出更难的负面文件来训练更好的排名者,而跨编码器排名者提供了渐进式反馈以改善检索器。我们在三个基准测试基准上评估AR2。实验结果表明,AR2始终如一地胜过现有的致密回收者方法,并在所有这些方法上实现了新的最新结果。这包括对自然问题的改进R@5%至77.9%(+2.1%),Triviaqa R@5%至78.2%(+1.4)和MS-Marco MRR@10%至39.5%(+1.3%)。代码和型号可在https://github.com/microsoft/ar2上找到。
translated by 谷歌翻译
密集的段落检索旨在根据查询和段落的密集表示(即矢量)从大型语料库中检索查询的相关段落。最近的研究探索了改善预训练的语言模型,以提高密集的检索性能。本文提出了COT-MAE(上下文掩盖自动编码器),这是一种简单而有效的生成性预训练方法,可用于密集通道检索。 COT-MAE采用了不对称的编码器架构,该体系结构学会通过自我监督和上下文监督的掩盖自动编码来将句子语义压缩到密集的矢量中。精确,自我监督的掩盖自动编码学会学会为文本跨度内的令牌的语义建模,并学习上下文监督的蒙版自动编码学学习以建模文本跨度之间的语义相关性。我们对大规模通道检索基准进行实验,并显示出对强基础的大量改进,证明了COT-MAE的效率很高。
translated by 谷歌翻译
稀疏的词汇表现学习已经证明了在近期模型中提高通道检索效果,例如Deepumact,Unicoil和Splade。本文介绍了一种简单而有效的方法,用于通过引入稀疏屏蔽方案来控制稀疏性和自学方法来控制诽谤和自学方法来模拟脱锁表示模拟缺陷表示来缩小通道检索的词汇表格的简单但有效的方法。我们模型的基本实施具有更精致的方法,实现了有效性和效率之间的良好平衡。我们的方法简单地为未来的词汇表达学习探索开辟了门,以便检索。
translated by 谷歌翻译
我们介绍了Art,这是一种新的语料库级自动编码方法,用于培训密集检索模型,不需要任何标记的培训数据。密集的检索是开放域任务(例如Open QA)的核心挑战,在该任务中,最先进的方法通常需要大量的监督数据集,并具有自定义的硬性采矿和肯定式示例。相反,艺术品仅需要访问未配对的投入和输出(例如问题和潜在的答案文件)。它使用新的文档 - 重新定义自动编码方案,其中(1)输入问题用于检索一组证据文档,并且(2)随后使用文档来计算重建原始问题的概率。基于问题重建的检索培训可以有效地学习文档和问题编码器,以后可以将其纳入完整的QA系统中,而无需任何进一步的填充。广泛的实验表明,ART在多个QA检索基准测试基准上获得最先进的结果,并且仅来自预训练的语言模型的一般初始化,从而消除了对标记的数据和特定于任务的损失的需求。
translated by 谷歌翻译
Dense retrieval aims to map queries and passages into low-dimensional vector space for efficient similarity measuring, showing promising effectiveness in various large-scale retrieval tasks. Since most existing methods commonly adopt pre-trained Transformers (e.g. BERT) for parameter initialization, some work focuses on proposing new pre-training tasks for compressing the useful semantic information from passages into dense vectors, achieving remarkable performances. However, it is still challenging to effectively capture the rich semantic information and relations about passages into the dense vectors via one single particular pre-training task. In this work, we propose a multi-task pre-trained model, MASTER, that unifies and integrates multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture. Concretely, MASTER utilizes a multi-decoder architecture to integrate three types of pre-training tasks: corrupted passages recovering, related passage recovering and PLMs outputs recovering. By incorporating a shared deep encoder, we construct a representation bottleneck in our architecture, compressing the abundant semantic information across tasks into dense vectors. The first two types of tasks concentrate on capturing the semantic information of passages and relationships among them within the pre-training corpus. The third one can capture the knowledge beyond the corpus from external PLMs (e.g. GPT-2). Extensive experiments on several large-scale passage retrieval datasets have shown that our approach outperforms the previous state-of-the-art dense retrieval methods. Our code and data are publicly released in https://github.com/microsoft/SimXNS
translated by 谷歌翻译
知识蒸馏是将知识从强大的教师转移到有效的学生模型的有效方法。理想情况下,我们希望老师越好,学生越好。但是,这种期望并不总是成真。通常,由于教师和学生之间的不可忽略的差距,更好的教师模型通过蒸馏导致不良学生。为了弥合差距,我们提出了一种渐进式蒸馏方法,以进行致密检索。产品由教师渐进式蒸馏和数据进行渐进的蒸馏组成,以逐步改善学生。我们对五个广泛使用的基准,MARCO通道,TREC Passage 19,TREC文档19,MARCO文档和自然问题进行了广泛的实验,其中POD在蒸馏方法中实现了密集检索的最新方法。代码和模型将发布。
translated by 谷歌翻译
This paper presents a pre-training technique called query-as-context that uses query prediction to improve dense retrieval. Previous research has applied query prediction to document expansion in order to alleviate the problem of lexical mismatch in sparse retrieval. However, query prediction has not yet been studied in the context of dense retrieval. Query-as-context pre-training assumes that the predicted query is a special context for the document and uses contrastive learning or contextual masked auto-encoding learning to compress the document and query into dense vectors. The technique is evaluated on large-scale passage retrieval benchmarks and shows considerable improvements compared to existing strong baselines such as coCondenser and CoT-MAE, demonstrating its effectiveness. Our code will be available at https://github.com/caskcsg/ir/tree/main/cotmae-qc .
translated by 谷歌翻译
信息检索是自然语言处理中的重要组成部分,用于知识密集型任务,如问题应答和事实检查。最近,信息检索已经看到基于神经网络的密集检索器的出现,作为基于术语频率的典型稀疏方法的替代方案。这些模型在数据集和任务中获得了最先进的结果,其中提供了大型训练集。但是,它们不会很好地转移到没有培训数据的新域或应用程序,并且通常因未经监督的术语 - 频率方法(例如BM25)的术语频率方法而言。因此,自然问题是如果没有监督,是否有可能训练密集的索取。在这项工作中,我们探讨了对比学习的限制,作为培训无人监督的密集检索的一种方式,并表明它导致强烈的检索性能。更确切地说,我们在15个数据集中出现了我们的模型胜过BM25的Beir基准测试。此外,当有几千例的示例可用时,我们显示微调我们的模型,与BM25相比,这些模型导致强大的改进。最后,当在MS-Marco数据集上微调之前用作预训练时,我们的技术在Beir基准上获得最先进的结果。
translated by 谷歌翻译
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
translated by 谷歌翻译
Knowledge distillation is often used to transfer knowledge from a strong teacher model to a relatively weak student model. Traditional knowledge distillation methods include response-based methods and feature-based methods. Response-based methods are used the most widely but suffer from lower upper limit of model performance, while feature-based methods have constraints on the vocabularies and tokenizers. In this paper, we propose a tokenizer-free method liberal feature-based distillation (LEAD). LEAD aligns the distribution between teacher model and student model, which is effective, extendable, portable and has no requirements on vocabularies, tokenizer, or model architecture. Extensive experiments show the effectiveness of LEAD on several widely-used benchmarks, including MS MARCO Passage, TREC Passage 19, TREC Passage 20, MS MARCO Document, TREC Document 19 and TREC Document 20.
translated by 谷歌翻译
Dense retrievers have made significant strides in obtaining state-of-the-art results on text retrieval and open-domain question answering (ODQA). Yet most of these achievements were made possible with the help of large annotated datasets, unsupervised learning for dense retrieval models remains an open problem. In this work, we explore two categories of methods for creating pseudo query-document pairs, named query extraction (QExt) and transferred query generation (TQGen), to augment the retriever training in an annotation-free and scalable manner. Specifically, QExt extracts pseudo queries by document structures or selecting salient random spans, and TQGen utilizes generation models trained for other NLP tasks (e.g., summarization) to produce pseudo queries. Extensive experiments show that dense retrievers trained with individual augmentation methods can perform comparably well with multiple strong baselines, and combining them leads to further improvements, achieving state-of-the-art performance of unsupervised dense retrieval on both BEIR and ODQA datasets.
translated by 谷歌翻译
密集的检索方法可以克服词汇差距并导致显着改善的搜索结果。但是,它们需要大量的培训数据,这些数据不适用于大多数域。如前面的工作所示(Thakur等,2021b),密集检索的性能在域移位下严重降低。这限制了密集检索方法的使用,只有几个具有大型训练数据集的域。在本文中,我们提出了一种新颖的无监督域适配方法生成伪标签(GPL),其将查询发生器与来自跨编码器的伪标记相结合。在六种代表性域专用数据集中,我们发现所提出的GPL可以优于箱子外的最先进的密集检索方法,最高可达8.9点NDCG @ 10。 GPL需要来自目标域的少(未标记)数据,并且在其培训中比以前的方法更强大。我们进一步调查了六种最近训练方法在检索任务的域改编方案中的作用,其中只有三种可能会产生改善的结果。最好的方法,Tsdae(Wang等,2021)可以与GPL结合,在六个任务中产生了1.0点NDCG @ 10的另一个平均改善。
translated by 谷歌翻译
基于强大的预训练语言模型(PLM)的密集检索方法(DR)方法取得了重大进步,并已成为现代开放域问答系统的关键组成部分。但是,他们需要大量的手动注释才能进行竞争性,这是不可行的。为了解决这个问题,越来越多的研究作品最近着重于在低资源场景下改善DR绩效。这些作品在培训所需的资源和采用各种技术的资源方面有所不同。了解这种差异对于在特定的低资源场景下选择正确的技术至关重要。为了促进这种理解,我们提供了针对低资源DR的主流技术的彻底结构化概述。根据他们所需的资源,我们将技术分为三个主要类别:(1)仅需要文档; (2)需要文件和问题; (3)需要文档和提问对。对于每种技术,我们都会介绍其一般形式算法,突出显示开放的问题和利弊。概述了有希望的方向以供将来的研究。
translated by 谷歌翻译
Dual encoders are now the dominant architecture for dense retrieval. Yet, we have little understanding of how they represent text, and why this leads to good performance. In this work, we shed light on this question via distributions over the vocabulary. We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space. We show that the resulting distributions over vocabulary tokens are intuitive and contain rich semantic information. We find that this view can explain some of the failure cases of dense retrievers. For example, the inability of models to handle tail entities can be explained via a tendency of the token distributions to forget some of the tokens of those entities. We leverage this insight and propose a simple way to enrich query and passage representations with lexical information at inference time, and show that this significantly improves performance compared to the original model in out-of-domain settings.
translated by 谷歌翻译