LED: Lexicon-Enlightened Dense Retriever for Large-Scale Retrieval

Kai Zhang^{1}, Chongyang Tao^{2}, Tao Shen^{2}, Can Xu^{2}, Xiubo Geng^{2}, Binxing Jiao^{2}, Daxin Jiang^{2}

^{1} The Ohio State University
^{2} Microsoft, Beijing, China

{zhang.13253}@osu.edu, {chotao,shentao,caxu,xigeng,binxjia,djiang}@microsoft.com

Work done during the internship at Microsoft.

Abstract

Retrieval models based on dense representations in semantic space have become an indispensable branch for first-stage retrieval. These retrievers benefit from surging advances in representation learning towards compressive global sequence-level embeddings. However, they are prone to overlook local salient phrases and entity mentions in texts, which usually play pivotal roles in first-stage retrieval. To mitigate this weakness, we propose to align a dense retriever with a well-performing lexicon-aware representation model. The alignment is achieved by weakened knowledge distillation that enlightens the retriever in two ways: 1) a lexicon-augmented contrastive objective to challenge the dense encoder and 2) a pair-wise rank-consistent regularization to incline the dense model’s behavior towards that of the lexicon-aware model. We evaluate our model on three public benchmarks, and the results show that, with a performance-comparable lexicon-aware retriever as the teacher, our proposed dense retriever achieves consistent and significant improvements and even outperforms its teacher. In addition, we find that our improvement to the dense retriever is complementary to standard ranker distillation, which can further lift state-of-the-art performance.

1 Introduction

Large-scale passage retrieval Cai et al. (2021) aims to fetch relevant passages from a million- or billion-scale collection for a given query to meet users’ information needs. It plays an important role in many downstream applications, including open-domain question answering Karpukhin et al. (2020), search engines Zou et al. (2021), and recommendation systems Zhang et al. (2019). Recent years have witnessed an upsurge of interest in dense passage retrievers and their remarkable performance on first-stage retrieval. Built upon powerful pre-trained language models (PLMs) Devlin et al. (2019); Liu et al. (2019); Radford et al. (2018), dense retrievers Karpukhin et al. (2020); Xiong et al. (2021); Qu et al. (2021) encode queries and passages into a joint low-dimensional semantic space in a Siamese manner (i.e., dual-encoder), so that passages can be pre-indexed offline while queries are encoded online and searched via approximate nearest neighbor (ANN) search Johnson et al. (2021), reaching an efficiency-effectiveness trade-off.

Although dense retrieval has become indispensable in modern systems, a long-standing challenge is that dense representations in a latent semantic space are abstractive and condensed, exposing the systems to the risk that pivotal phrases and mentions are overlooked, which leads to sub-optimal efficacy. For example, DPR Karpukhin et al. (2020) did not regard “Thoros of Myr” as an entity mention in the query “Who plays Thoros of Myr in Game of Thrones?”, and ANCE Xiong et al. (2021) could not recognize “active margin” as the locally salient phrase in the query “What is an active margin”. As a remedy, prior works resort to either coupling a dense retriever with term-matching scores (e.g., TF-IDF, BM25) Gao et al. (2021a); Lin and Ma (2021); Dai and Callan (2019); Mallia et al. (2021) or learning BM25 ranking into a dense model as additional features to complement the original one Chen et al. (2021). However, these approaches are limited by superficial combinations and by BM25 scoring, which is almost unlearnable.

To circumvent the demerits of superficial hybrids and of learning from inferior lexicon-based representations upon PLMs, we propose a lexicon-enlightened dense (LED) retriever learning framework that injects rich lexicon information into a single dense encoder while keeping its sequence-level semantic representation capability. Rather than using the prevailing BM25 as the lexicon-rich source, we leverage a recently advanced lexicon-centric representation learning model transferred from large-scale masked language modeling (MLM), and align a dense encoder with it through two new weakened distillation objectives. On the one hand, we present lexicon-augmented contrastive learning, which incorporates the hard negatives provided by lexicon-aware retrievers into contrastive training. Intuitively, the negatives given by the lexicon-aware models can be regarded as adversarial examples that challenge the dense model, thereby transferring lexical knowledge to it. On the other hand, we propose a novel pair-wise rank-consistent regularization that inclines the dense model’s behavior towards that of the lexicon-aware one. Compared to distribution regularization such as KL-divergence Zhang et al. (2022) and strict fine-grained distillation such as Margin-MSE Hofstätter et al. (2020), our method provides weak supervision signals from the lexicon-aware retriever, leading to desirable partial knowledge injection while maintaining the dense retriever’s own properties.

We evaluate our method on three real-world human-annotated benchmarks. Experimental results show that our methods consistently and significantly improve dense retriever’s performance, even outdoing its teacher. Notably, these significant improvements are brought by the supervision of a performance-comparable lexicon-aware retriever. Besides, a detailed analysis of retrieval results shows that our knowledge distillation strategies indeed equip the dense retriever with the lexicon-aware capabilities. Lastly, we found our improvement on the dense retriever is complementary to the standard ranker distillation, achieving further improvement and a new state-of-the-art performance.

Our contributions are summarized as follows: (1) To the best of our knowledge, we are the first to consider improving the dense retriever by imitating the retriever based on the lexicon-aware representation model upon PLM; (2) We propose two strategies including lexicon-augmented contrastive training and pair-wise rank-consistent regularization to inject lexical knowledge into the dense retriever; (3) Evaluation results on three benchmarks show that our method brings consistent and significant improvements to the dense retriever with a comparable lexicon-aware retriever as a teacher and a new state-of-the-art performance is achieved.

2 Related Work

Figure 1: The training framework of our LED retriever. The Lexical teacher is independently trained following a two-stage process. After warming up, LED is trained with negatives mined by self and two lexicon-aware retrievers for lexicon-augmented contrastive learning, during which the Lexical model enhances LED with pair-wise rank-consistent regularization.

Current passage retrieval systems are widely deployed as retrieve-then-rank pipelines Huang et al. (2020); Zou et al. (2021). The first-stage retriever (e.g., dual-encoder) Xiong et al. (2021); Qu et al. (2021); Ren et al. (2021a) selects a small number of candidate passages (usually at most thousands) from the entire collection, and the second-stage ranker (i.e., cross-encoder Zhou et al. (2022)) re-scores these candidates to provide a more accurate passage ordering. In this paper, we focus on enhancing the first-stage retriever.

Dense Retriever.

Built upon pre-trained language models Devlin et al. (2019); Liu et al. (2019), a dense retriever Karpukhin et al. (2020); Qu et al. (2021) captures the semantic meaning of an entire sequence by encoding the text as a continuous representation in a low-dimensional space (e.g., 768 dimensions). In this way, the dense retriever can handle the vocabulary and semantic mismatch issues of traditional term-based techniques such as BM25 Robertson and Zaragoza (2009). To train a better dense retriever, various techniques have been proposed for providing hard negatives, including reusing in-batch negatives Karpukhin et al. (2020); Luan et al. (2021); Qu et al. (2021), iterative sampling Xiong et al. (2021), mining by a well-trained model or dynamic sampling Zhan et al. (2021); Zhang et al. (2022), and denoising by a cross-encoder Qu et al. (2021). To build retrieval-specific pre-trained language models, Lee et al. (2019) proposed an unsupervised pre-training task, namely the Inverse Cloze Task (ICT), while Gao and Callan (2021a) decoupled the model architecture during pre-training and further designed corpus-level contrastive learning Gao and Callan (2021b) for better passage representations.

Lexicon-Aware Retriever.

Another paradigm of work Formal et al. (2021b); Gao et al. (2021a); Lin and Ma (2021) takes advantage of strong PLMs to build lexicon-aware sparse retrievers by term-importance Dai and Callan (2019); Lin and Ma (2021) and top coordinate terms Formal et al. (2021b, a). These models have lexical properties and could be coupled with inverted indexing techniques. Based on contextualized representation generated by PLMs Devlin et al. (2019), Dai and Callan (2019) learned to estimate individual term weights, Mallia et al. (2021) further optimized the sum of query terms weights for better term interaction, COIL-series works used token-level interactions on weight vector Gao et al. (2021a) or scalar Lin and Ma (2021) to obtain exact word matching scores, and Formal et al. (2021b, a) trained a retriever encoding passages as vocabulary-size highly sparse embeddings. Recently, Chen et al. (2021) trained a PLM-based retriever from scratch with data generated by BM25. The trained lexicon-aware retriever could encode texts as low-dimensional embeddings and have identical lexical properties and comparable performance with BM25.

Hybrid Retriever.

Dense sequence-level retrievers and lexicon-aware retrievers have distinctive strengths and are complementary to each other Chen et al. (2021). Researchers have therefore combined their advantages in an ensemble system by directly aggregating scores Kuzi et al. (2020), by weighted sum, or by concatenation Chen et al. (2021).

Knowledge Distillation.

Cross-encoders empirically outperform dual-encoders because they take the query and passage as a single input, so that the attention mechanism is applied between them, leading to in-depth token-level interactions. This superior performance has motivated many works Hofstätter et al. (2020); Zhang et al. (2022) to enhance dual-encoders by distilling knowledge from a cross-encoder. KL-divergence, which minimizes the distance between the teacher and student distributions, has proven effective in many works Zhang et al. (2022); Santhanam et al. (2021). Margin-MSE Hofstätter et al. (2020) aims to minimize the difference between the teacher and student margins for a pair of passages, and it has been applied in later works Formal et al. (2021a); Hofstätter et al. (2021). Reddi et al. (2021) used the teacher’s top passages as positive examples to teach students point-wise. ListNet Cao et al. (2007); Xiao et al. (2022) ensures the consistency of list-wise ranking order by minimizing the difference in score distributions over passages.

The above methods are designed for the cross-encoder teacher that is much stronger than the dense student. But in our situation, the lexicon-aware representation model can only achieve comparable performance with a dense retriever. Besides, we find that existing distillation methods may not be perfect choices in our setting, as pointed out in our experiments (i.e., Tab. 2). Therefore, we propose a novel approach including lexicon-augmented contrastive objective and pair-wise rank-consistent regularization to transfer lexical knowledge.

3 Methodology

We first introduce the task formalization, general training framework, and retriever architectures in Sec. 3.1. Then we present our Lexicon-Enlightened Dense (LED) retriever in Sec. 3.2.

3.1 Preliminary

Task Definition.

In the first-stage retrieval, given a query $q$, a retriever is required to fetch the top-$k$ relevant passages from a million- or even billion-scale passage collection $C$. Due to the efficiency requirement, the dual-encoder is widely applied in this task for its lightweight metric calculation. Formally, a dual-encoder maps a text $x$ (either a query $q$ or a passage $p$) to a $d$-dimensional embedding, i.e.,

$$\mathbf{x} = \text{Dual-Enc}(x; θ) \in \mathbb{R}^{d}, \tag{1}$$

where $θ$ could be the dense retriever ($θ^{den}$) or the lexicon-aware retriever ($θ^{lex}$). With the encoded query $\mathbf{q}$ and passage $\mathbf{p}$, we calculate the relevance score via dot product for retrieval, i.e.,

$$R(q, p; θ) = \mathbf{q}^{\top} \mathbf{p}. \tag{2}$$

Learning Framework for Retriever.

To train the dual-encoder $θ$, we utilize contrastive learning following previous works Xiong et al. (2021); Gao et al. (2021b). Specifically, given a query $q$, a labeled positive passage $p^{+}$, and negatives $p \in N$, the contrastive loss optimizes the dual-encoder $θ$ by maximizing the relevance of $q$ and $p^{+}$ while minimizing that of $q$ and $p \in N$, i.e.,

$$L_{θ}^{cl} = -\log \frac{\exp(R(q, p^{+}; θ))}{\sum_{p \in \{p^{+}\} \cup N} \exp(R(q, p; θ))}, \tag{3}$$

where the negatives $N$ can be drawn from the top-ranked passages in the retrieval results of BM25 Nguyen et al. (2016) or of a trained retriever Zhan et al. (2021); Zhang et al. (2022), i.e.,

$$N = \{p \mid p \sim P(C \setminus \{p^{+}\} \mid q; θ^{samp})\}, \tag{4}$$

where $P$ is a probability distribution over $C$, which can be defined as non-parametric (e.g., $θ^{samp} = \emptyset$) or parametric (e.g., $θ^{samp} \neq \emptyset$).
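As a minimal sketch of the contrastive objective in Eq. 3, the snippet below computes the loss for one query in plain Python. The toy embeddings and the function names `dot` and `contrastive_loss` are illustrative assumptions, not the authors' implementation; a real system would use a tensor library and batched encoders.

```python
import math

def dot(u, v):
    """Relevance score of Eq. 2: dot product of two embeddings."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(q, p_pos, negatives):
    """Eq. 3: negative log-softmax of the positive over the positive plus negatives."""
    scores = [dot(q, p_pos)] + [dot(q, p) for p in negatives]
    m = max(scores)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]

# Toy 2-dimensional embeddings (hypothetical values).
q = [1.0, 0.0]
p_pos = [2.0, 0.0]
negatives = [[0.0, 1.0], [1.0, 1.0]]
loss = contrastive_loss(q, p_pos, negatives)
print(round(loss, 4))
```

The loss shrinks as the positive's score grows relative to the negatives, which is why harder negatives (those scored close to the positive) give a stronger training signal.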

Dense & Lexicon-Aware Retrievers.

Both the dense retriever ($θ^{den}$) and the lexicon-aware retriever ($θ^{lex}$) follow the dual-encoder architecture, and their encoders are built upon PLMs such as BERT Devlin et al. (2019). Precisely, a PLM ($θ^{plm}$) encodes a given text (i.e., query or passage), $x = \{t_{1}, t_{2}, \dots, t_{n}\}$, into contextualized embeddings, i.e.,

$$H^{x} = \text{PLM}(x; θ^{plm}) = \text{PLM}([CLS], t_{1}, t_{2}, \dots, t_{n}, [SEP]), \tag{5}$$

where $H^{x} = [h_{[CLS]}^{x}, h_{1}^{x}, \dots, h_{n}^{x}, h_{[SEP]}^{x}]$, and $[CLS]$ and $[SEP]$ are special tokens designed for sentence representation and separation by recent PLMs Devlin et al. (2019); Liu et al. (2019). The dense retriever Xiong et al. (2021); Qu et al. (2021) represents a text by the embedding of the special token $[CLS]$ (i.e., $h_{[CLS]}^{x}$) as follows,

$$\mathbf{x}^{den} = \text{Dual-Enc}(x; θ^{den}) = \text{CLS-Pool}(H^{x}), \tag{6}$$

where $θ^{den} = θ^{plm}$ with no additional parameters.

For the lexicon-aware retriever, we adopt SPLADE Formal et al. (2021a), which learns the weights of terms in the PLM vocabulary for each token in the input $x$ via the Masked Language Modeling (MLM) layer and sparse regularization, and then max-pools these weights into a discrete text representation after log-saturation. Formally, with $H^{x}$ produced by the PLM ($θ^{plm}$), an MLM layer ($θ^{mlm}$) linearly transforms it into $\tilde{H}^{x}$, and the term-weight representation of $x$ is then obtained as follows,

$$\mathbf{x}^{lex} = \text{Dual-Enc}(x; θ^{lex}) = \text{MAX-Pool}(\log(1 + \text{ReLU}(W^{e} \cdot \tilde{H}^{x}))), \tag{7}$$

where $W^{e} \in \mathbb{R}^{|V| \times e}$ is the transpose of the input embedding matrix in the PLM, used as the MLM head, and $θ^{lex} = \{θ^{plm}, θ^{mlm}, W^{e}\}$.
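The pooling step of Eq. 7 can be illustrated with a small sketch. It assumes the per-token vocabulary logits (the product of the MLM head and the transformed hidden states) are already available as a list of rows; the function name `lexicon_representation` and the toy numbers are hypothetical.

```python
import math

def lexicon_representation(token_logits):
    """Log-saturated ReLU followed by max-pooling over input tokens.

    token_logits[i][v] is the logit of vocabulary term v at input token i.
    Returns one non-negative weight per vocabulary term.
    """
    vocab_size = len(token_logits[0])
    # log(1 + ReLU(z)) dampens large logits and zeroes out negative ones.
    saturated = [[math.log1p(max(0.0, z)) for z in row] for row in token_logits]
    # Max-pool across token positions for every vocabulary term.
    return [max(row[v] for row in saturated) for v in range(vocab_size)]

# Two input tokens, vocabulary of three terms (toy values).
logits = [[2.0, -1.0, 0.0],
          [0.5, -3.0, 1.0]]
x_lex = lexicon_representation(logits)
print(x_lex)
```

Terms whose logits are negative at every position receive weight zero, which is what makes the resulting vocabulary-sized vector sparse.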

The dense encoder represents texts as global sequence-level embeddings and is good at global semantic matching, while the lexicon-aware encoder represents local term-level embeddings and handles salient phrases and entity mentions well. Both encoders can be optimized with the Eq. 3.

3.2 Lexical Enlightened Dense Retriever

Fig. 1 illustrates the training workflow of our LED retriever. Specifically, we follow a two-stage training procedure. In the Warm-up stage, we independently train the dense retriever and the lexicon-aware one by Eq. 3, both with BM25 negatives ($N^{bm25}$). This stage ends up with two retrievers, namely the Lexical Warm-up ($θ^{lex1}$) and the Dense Warm-up ($θ^{den1}$). Starting from the warm-up checkpoint, in the Continue Training stage, we train the lexical retriever by Eq. 3 with negatives ($N^{lex1}$) sampled by the Lexical Warm-up ($θ^{lex1}$), resulting in the Lexical ($θ^{lex2}$), which serves as the teacher for later lexical knowledge teaching.

With a well-trained lexicon-aware teacher ($θ^{lex2}$) and a dense student after warming up ($θ^{den1}$), we further enlighten the dense retriever ($θ^{den1}$) by transferring knowledge from the well-performing lexicon-aware representation model ($θ^{lex2}$). The knowledge transfer is achieved from two perspectives: 1) a lexicon-augmented contrastive objective to challenge the dense encoder and 2) a rank-consistent regularization to incline the dense model’s behavior towards that of the lexicon-aware one. We will detail the two objectives in the following paragraphs.

Lexicon-Augmented Contrastive.

Following previous work Zhan et al. (2021), we use the dense negatives ($N^{den1}$) sampled by the Dense Warm-up ($θ^{den1}$) to boost dense retrieval. Meanwhile, inspired by Chen et al. (2021), who trained a lexical retriever with negatives provided by term-based techniques such as BM25, we try to enhance the dense retriever from the perspective of negative augmentation. Considering the similar backbone and the identical optimization objective of the dense and lexicon-aware retrievers, intuitively, the dense retriever can use the lexical negatives ($N^{lex1}$) to partially follow the training process of the lexical teacher ($θ^{lex2}$), thus learning lexicon-aware ability.

Going one step further, we also use the lexical negatives ($N^{lex2}$) to learn more lexical knowledge. Compared with the dense negatives, these lexical negatives ($N^{lex1}$ and $N^{lex2}$) provide more diverse examples for robust retriever training. They can also be regarded as adversarial examples that challenge the dense retriever, thus enhancing the lexicon-aware capability of the dense model. Formally, the lexicon-augmented contrastive loss for LED is,

$$L_{θ^{led}}^{cl} = -\log \frac{\exp(R(q, p^{+}; θ^{led}))}{\sum_{p \in \{p^{+}\} \cup N^{mix}} \exp(R(q, p; θ^{led}))}, \tag{8}$$

where $N^{mix} = N^{lex1} \cup N^{lex2} \cup N^{den1}$.
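Pooling the negatives from the three sources can be sketched as follows. The passage ids are made up, the helper name `mix_negatives` is hypothetical, and dropping labeled positives from the pool is an assumption made here so that the mixed set contains only negatives; the paper does not spell out how the pools are merged or sampled.

```python
def mix_negatives(positives, *negative_pools):
    """Union of several hard-negative pools for one query.

    Keeps first-seen order, removes duplicates, and skips labeled positives
    (the last step is an assumption of this sketch).
    """
    seen = set(positives)
    mixed = []
    for pool in negative_pools:
        for pid in pool:
            if pid not in seen:
                seen.add(pid)
                mixed.append(pid)
    return mixed

# Hypothetical passage ids mined by the three retrievers for one query.
n_lex1 = [7, 3, 9]    # from the Lexical Warm-up
n_lex2 = [3, 12, 5]   # from the Lexical teacher
n_den1 = [9, 5, 21]   # from the Dense Warm-up
n_mix = mix_negatives([5], n_lex1, n_lex2, n_den1)
print(n_mix)
```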

Rank-Consistent Regularization.

From the retrieval behavior perspective, for given query-passage pairs, we utilize the lexicon-aware teacher ($θ^{lex2}$) to generate ranking pairs that regularize and guide LED’s retrieval behavior.

Specifically, given a query $q$ and passages from $D^{q} = \{p^{+}\} \cup N^{mix}$, the Lexical ($θ^{lex2}$) scores each query-passage pair (abbreviated as $R(p; θ^{lex2})$) with Eq. 2 and generates ranking pairs as follows,

$$K^{q} = \{(p_{i}, p_{j}) \mid p_{i}, p_{j} \in D^{q},\ R(p_{i}; θ^{lex2}) > R(p_{j}; θ^{lex2})\}. \tag{9}$$

Then, a pair-wise rank-consistent regularization is employed to incline the dense model’s behavior towards that of the lexicon-aware one by minimizing the following margin-based ranking loss,

$$L_{θ^{led}}^{ll} = \frac{1}{|K^{q}|} \sum_{(p_{i}, p_{j}) \in K^{q}} \max[0, R(p_{j}; θ^{led}) - R(p_{i}; θ^{led})], \tag{10}$$

where $R(p; θ^{led})$ is the abbreviation of the relevance $R(q, p; θ^{led})$ calculated by LED ($θ^{led}$) with Eq. 2. Since every pair in $K^{q}$ has $p_{i}$ ranked above $p_{j}$ by the teacher, the loss is non-zero only when LED scores $p_{j}$ above $p_{i}$. Compared to logits distillation with a list-wise KL-divergence loss, our training objective provides a weak pair-wise supervision signal, thus keeping the effect of lexical knowledge on dense properties at a minimum level. In particular, we do not penalize the dense student as long as its ranking of a given pair is the same as the teacher’s, with no strict requirement on the score difference as in Margin-MSE Hofstätter et al. (2020). Experiments in Tab. 2 demonstrate the merit of our method.
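The pair generation and the rank-consistent penalty can be sketched in a few lines. The hinge is written so that the penalty is zero whenever the student orders a pair the same way as the teacher, which is the behavior the text describes; the score values and function names are illustrative assumptions.

```python
def ranking_pairs(teacher_scores):
    """Index pairs (i, j) where the teacher scores passage i strictly above passage j."""
    n = len(teacher_scores)
    return [(i, j) for i in range(n) for j in range(n)
            if teacher_scores[i] > teacher_scores[j]]

def rank_consistent_loss(student_scores, teacher_scores):
    """Mean hinge penalty over teacher-ordered pairs.

    A pair contributes only when the student ranks it opposite to the teacher;
    the size of the teacher's score gap is never used.
    """
    pairs = ranking_pairs(teacher_scores)
    if not pairs:
        return 0.0
    total = sum(max(0.0, student_scores[j] - student_scores[i]) for i, j in pairs)
    return total / len(pairs)

teacher = [12.0, 7.5, 3.0]    # lexicon-aware teacher scores (toy values)
agree = [0.9, 0.4, 0.1]       # student with the same ordering
disagree = [0.2, 0.4, 0.1]    # student that swaps the first two passages
print(rank_consistent_loss(agree, teacher))
print(rank_consistent_loss(disagree, teacher))
```

Note that the teacher and student scores live on very different scales in this example, yet the agreeing student incurs no penalty, because only the ordering of each pair matters.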

Training and Inference.

To incorporate lexicon-aware ability while keeping the sequence-level semantic representation ability for passage retrieval, we combine the contrastive loss ($L_{θ^{led}}^{cl}$) in Eq. 8 and the lexical learning loss ($L_{θ^{led}}^{ll}$) in Eq. 10 to train our LED retriever ($θ^{led}$) as follows,

$$L_{θ^{led}} = L_{θ^{led}}^{cl} + λ L_{θ^{led}}^{ll}, \tag{11}$$

where $λ$ is a hyperparameter that controls how strongly training is inclined to transfer lexicon-aware knowledge from the lexicon-aware teacher ($θ^{lex2}$).

For inference, LED pre-computes the embeddings of all passages in the entire collection $C$ and builds an index with FAISS Johnson et al. (2021). Then LED encodes queries online and retrieves the top-$k$ ranked passages based on the relevance score.
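The offline-index and online-search split can be sketched with an exact inner-product search. This is a dependency-free stand-in for an ANN library, not the FAISS API; the class name `BruteForceIndex`, the passage ids, and the embeddings are all hypothetical.

```python
import heapq

class BruteForceIndex:
    """Exact inner-product search over pre-computed passage embeddings."""

    def __init__(self, passage_embeddings):
        # Offline step: store passage id -> embedding once for the whole collection.
        self.passages = dict(passage_embeddings)

    def search(self, query_embedding, k):
        # Online step: score every passage against the query and keep the top k.
        scored = (
            (sum(a * b for a, b in zip(query_embedding, emb)), pid)
            for pid, emb in self.passages.items()
        )
        return [(pid, score) for score, pid in heapq.nlargest(k, scored)]

index = BruteForceIndex({"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]})
hits = index.search([1.0, 0.2], k=2)
print([pid for pid, _ in hits])
```

A production system would replace the linear scan with an approximate index, trading a small loss in recall for much lower latency on a collection of millions of passages.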

4 Experiments

Methods	PLM	Ranker Taught	MS MARCO Dev MRR@10	MS MARCO Dev R@50	MS MARCO Dev R@1k	TREC DL 19 NDCG@10	TREC DL 19 R@1k	TREC DL 20 NDCG@10	TREC DL 20 R@1k
Lexicon-Aware Retriever
BM25 Robertson and Zaragoza (2009)	-		18.7	59.2	85.7	49.7	-	48.7	-
DeepCT Dai and Callan (2019)	BERT$_{base}$		24.3	69.0	91.0	55.1	-	55.6	-
COIL-full Gao et al. (2021a)	BERT$_{base}$		35.5	-	96.3	70.4	-	-	-
UniCOIL Lin and Ma (2021)	BERT$_{base}$		35.2	80.7	95.8	-	-	-	-
SPLADE-max Formal et al. (2021a)	DistilBERT		34.0	-	96.5	68.4	\underline{85.1}	-	-
DistilSPLADE-max Formal et al. (2021a)	DistilBERT	✓	36.8	-	97.9	\underline{72.9}	86.5	-	-
UniCOIL $Λ$ Chen et al. (2021)	BERT$_{base}$		34.1	82.1	97.0	-	-	-	-
Dense Retriever
ANCE Xiong et al. (2021)	RoBERTa$_{base}$		33.0	-	95.9	64.8	-	64.6	-
ADORE Zhan et al. (2021)	RoBERTa$_{base}$		34.7	-	-	68.3	-	66.6	-
TAS-B Hofstätter et al. (2021)	DistilBERT	✓	34.7	-	97.8	71.7	84.3	\underline{68.5}	-
ColBERTv1 Khattab and Zaharia (2020)	BERT$_{base}$		36.0	82.9	96.8	-	-	-	-
ColBERTv2 Santhanam et al. (2021)	BERT$_{base}$	✓	\underline{39.7}	\underline{86.8}	98.4	-	-	-	-
coCondenser Gao and Callan (2021b)	BERT$_{base}$		38.2	-	98.4	-	-	-	-
PAIR Ren et al. (2021a)	ERNIE$_{base}$	✓	37.9	86.4	98.2	-	-	-	-
RocketQAv2 Ren et al. (2021b)	ERNIE$_{base}$	✓	38.8	86.2	98.1	-	-	-	-
AR2-G Zhang et al. (2022)	coCon$_{base}$	✓	39.5	-	-	-	-	-	-
LED	coCon$_{base}$		39.6	86.6	\underline{98.3}	70.5	75.3	67.9	76.0
LED (w/ RT)	coCon$_{base}$	✓	40.2	87.6	98.4	74.4	77.9	70.2	\underline{78.3}
Our Performance Breakdown
LEX (Warm-up)	DistilBERT		36.1	84.2	97.5	67.4	75.2	66.4	77.7
LEX (Continue)	DistilBERT		38.3	85.9	98.0	72.8	80.0	67.7	79.4
DEN (Warm-up)	coCon$_{base}$		36.1	83.5	97.7	64.7	70.9	65.9	73.1
DEN (Continue)	coCon$_{base}$		38.1	86.3	98.4	69.1	74.4	67.8	76.4
DEN (w/ RT)	coCon$_{base}$	✓	39.6	86.7	98.4	71.8	74.9	69.7	75.6

Table 1: Experimental results on MS MARCO, TREC DL 2019, and TREC DL 2020 datasets (%). We mark the best results in bold and the second-best underlined. coCon$_{base}$ refers to the base version of coCondenser Gao and Callan (2021b). DEN and LEX refer to the dense and lexicon-aware retrievers, respectively. ‘Warm-up’ indicates the retrievers obtained after the warm-up stage. ‘Continue’ indicates the model that is continually trained with the negatives produced by the warm-up model. ‘w/ RT’ indicates that the retrievers are further enhanced by ranker distillation with a KL-divergence loss Zhou et al. (2022).

We evaluate our retriever on three public human-annotated real-world benchmarks, namely MS-MARCO Nguyen et al. (2016), TREC Deep Learning 2019 Craswell et al. (2020a), and TREC Deep Learning 2020 Craswell et al. (2020b). MS-MARCO Dev has 6980 queries, TREC 2019 has 43 queries, and TREC 2020 includes 54 queries. In all three benchmarks, first-stage retrievers are required to fetch relevant passages from an 8-million scale collection. Due to limited space, please refer to Appendix A for more details on evaluation metrics.

4.1 Baselines and Implementation Details

We compare our model with previous state-of-the-art retrieval models, including traditional approaches (e.g., BM25 Robertson and Zaragoza (2009)), lexicon-aware retrievers (e.g., DeepCT Dai and Callan (2019), SPLADE Formal et al. (2021a)), and dense retrievers (e.g., ANCE Xiong et al. (2021), RocketQAv2 Ren et al. (2021b), AR2 Zhang et al. (2022)). Due to limited space, please refer to Appendix B for more baseline details and Appendix C for implementation details.

4.2 Main Results

Tab. 1 presents the evaluation results on the aforementioned three public benchmarks.

Firstly, our LED retriever achieves performance comparable to the state-of-the-art methods ColBERTv2 Santhanam et al. (2021) and AR2 Zhang et al. (2022) on MS MARCO Dev, although both baselines are taught by a powerful ranking model (i.e., a cross-encoder). After coupling with a similar ranker distillation, our dense retriever (i.e., LED (w/ RT)) is further improved and outperforms the state-of-the-art baselines, showing the compatibility of distillation from a lexicon-aware sparse retriever with ranker distillation. (Like AR2 Zhang et al. (2022) and ColBERTv2 Santhanam et al. (2021), we use KL-divergence to distill the ranker’s scores into the LED model, but we use ERNIE-2.0-base Sun et al. (2020) instead of the ERNIE-2.0-large used in AR2.) Note that we use neither the heavy ranker teacher of AR2 Zhang et al. (2022) nor the multiple-vector representation applied in ColBERTv2 Santhanam et al. (2021).

Secondly, LED (w/ RT) achieves better performance than DEN (w/ RT) on all three datasets, demonstrating that our training method can transfer some complementary lexicon-aware knowledge not covered by the cross-encoder. The weak intensity of the supervision signal makes our lexical-enlightened strategy a promising plug-and-play technique for other dense retrievers.

Thirdly, our LED retriever taught by a smaller lexicon-aware retriever performs on par with the dense retriever taught by a strong cross-encoder (i.e., DEN (w/ RT)), showing the effectiveness of injecting lexical knowledge into the dense retriever. The reason is two-fold: (1) The dual-encoder architecture of the lexicon-aware teacher allows its relevance calculation to be easily integrated into in-batch techniques to scale up the amount of teaching. (2) More importantly, the lexicon-aware retriever can provide self-mined hard negatives for more direct supervision, while the cross-encoder can only provide a score distribution over given passages.

Lastly, we notice that our LED performs significantly better than DEN (Continue) on both the MS MARCO Dev and TREC DL 19/20 datasets, which demonstrates the superiority of our proposed training objective. However, our LED achieves slightly worse results than LEX (Continue) on the TREC DL datasets. This phenomenon may stem from the fact that a pure lexicon-aware retriever can perform better and more robustly on out-of-distribution data, which is aligned with the observations in SPLADE Formal et al. (2021a).

4.3 Further Analysis

Comparison of Teaching Strategies.

Methods	MRR@10	R@1k
No Distillation	38.1	98.4
Filter Qu et al. (2021)	38.4	98.4
Margin-MSE Hofstätter et al. (2020)	38.5	98.3
ListNet Xiao et al. (2022)	38.7	98.2
KL-Divergence Zhang et al. (2022)	39.0	98.4
Ours	39.6	98.3

Table 2: Evaluation results of different teaching strategies on MS MARCO Dev (%).

Tab. 2 compares our proposed pair-wise rank-consistent regularization with other teaching strategies. Filter means that negatives with high scores (i.e., likely false negatives) are filtered out by LEX (Continue). The other three strategies (i.e., Margin-MSE, ListNet, and KL-Divergence) are borrowed from knowledge distillation in IR. From the table, we find that all strategies bring performance gains on MRR@10, even an indirect one like Filter. This observation suggests that learning from the lexicon-aware representation model leads to a better dense retriever. Our rank-consistent regularization also outperforms the other strategies on MRR@10 (39.6 versus 39.0 for the strongest alternative, KL-Divergence), showing the superiority of our method. Among the distillation objectives, the point-wise objective (i.e., Margin-MSE) brings the least gain, followed by the list-wise objectives (i.e., ListNet and KL-Divergence), while our pair-wise rank-consistent regularization brings the largest gain. This implies that a soft teaching objective is more effective for transferring knowledge from the lexicon-aware model than strict objectives. In fact, forcing dense retrievers to align with the fine-grained score differences of the LEX often leads to training collapse. In our experiments, we found that Margin-MSE enhances the dense retriever only when the hyperparameters are chosen carefully, especially with a small distillation loss weight.
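To make the contrast between strict and soft objectives concrete, the sketch below evaluates three losses on a single positive/negative pair where the student already agrees with the teacher's ordering but scores on a different scale. These are textbook forms of Margin-MSE, KL-divergence, and a pair-wise hinge, written under the assumption that KL is taken from the teacher distribution to the student distribution; they are not the exact implementations compared in Tab. 2, and the numbers are toy values.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Squared gap between the student margin and the teacher margin."""
    return ((student_pos - student_neg) - (teacher_pos - teacher_neg)) ** 2

def kl_divergence(student_scores, teacher_scores):
    """KL(teacher || student) over softmax-normalized candidate scores."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def pairwise_hinge(student_pos, student_neg):
    """Penalty only if the student ranks the pair opposite to the teacher."""
    return max(0.0, student_neg - student_pos)

# The student agrees with the teacher's order but uses a much smaller score scale.
student_pos, student_neg = 0.9, 0.4
teacher_pos, teacher_neg = 12.0, 7.5
mse = margin_mse(student_pos, student_neg, teacher_pos, teacher_neg)
kl = kl_divergence([student_pos, student_neg], [teacher_pos, teacher_neg])
hinge = pairwise_hinge(student_pos, student_neg)
print(mse, round(kl, 3), hinge)
```

In this example the two strict objectives still push the student to match the teacher's score scale, whereas the pair-wise hinge is already satisfied.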

Comparison of Ensemble Retrievers.

Systems	MRR@10	R@50	R@1k
ANCE + BM25	34.7	81.6	96.9
RocketQA + BM25	38.1	85.9	98.0
RocketQA + UniCOIL	38.8	86.5	97.3
RocketQA + BM25 $Λ$	37.9	85.7	98.0
RocketQA + UniCOIL $Λ$	38.6	86.3	98.5
DEN (Continue) + LED	39.3	86.9	98.5
DEN (Continue) + DEN (w/ RT)	39.4	87.0	98.5
DEN (Continue) + LEX (Continue)	40.4	88.4	98.7
DEN (w/ RT) + LEX (Continue)	40.7	88.4	98.7
LED + LEX (Continue)	40.9	88.3	98.6
LED (w/ RT) + LEX (Continue)	41.1	88.5	98.7

Table 3: Comparison with ensemble systems on MS MARCO Dev (%). The results in the first block are copied from Chen et al. (2021). $Λ$ refers to a dense retriever trained with data generated by lexicon-based methods such as BM25 and UniCOIL.

We are also curious whether our LED can improve the performance of ensemble retrievers. With LEX (Continue) ($θ^{lex2}$) and LED ($θ^{led}$), we simply sum the normalized relevance scores of the two retrievers and then return a new order of retrieval results. Tab. 3 gives the evaluation results of our systems and other strong baselines reported in Chen et al. (2021). We can observe:

(1) Aligned with the results in SPAR Chen et al. (2021), the ensemble of two dense retrievers (i.e., DEN (Continue) + LED and DEN (Continue) + DEN (w/ RT)) is not as performant as that of one dense and one lexicon-aware retriever. In particular, the ensemble of two dense retrievers is even less competitive than a single LED or DEN (w/ RT). This result is reasonable because the two base models have similar retrieval behaviors, and the stronger one is impeded by the weaker one when they have the same weight in the ensemble system.

(2) Although learning from LEX (Continue) will not introduce new knowledge to the hybrid ensemble system, LED + LEX can further boost the performance of DEN + LEX. The reason behind this is that the LED scores golden query-passage pairs higher than DEN, so these pairs are ranked higher in the later ensemble process. This behavior could be regarded as an instance-level weighted score aggregation inside the network and it is more feasible to obtain than tuning the weights of retrievers for each query in the ensemble system.

(3) LED (w/ RT) + LEX (Continue) is slightly better than LED + LEX (Continue) and DEN (w/ RT) + LEX (Continue), once again proving that our lexical rank-consistent regularization is complementary to the ranker distillation.
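The score aggregation used for the ensemble systems can be sketched as below. The text only says that normalized relevance scores are summed, so the per-query min-max normalization, the zero score for passages missing from one list, and the tie-break by passage id are all assumptions of this sketch; the passage ids and scores are hypothetical.

```python
def min_max(scores):
    """Rescale one retriever's scores for a query to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}

def ensemble(scores_a, scores_b):
    """Sum the normalized scores of two retrievers and return passage ids, best first."""
    a, b = min_max(scores_a), min_max(scores_b)
    fused = {pid: a.get(pid, 0.0) + b.get(pid, 0.0) for pid in set(a) | set(b)}
    return sorted(fused, key=lambda pid: (-fused[pid], pid))

dense = {"p1": 0.9, "p2": 0.5, "p3": 0.1}        # dot products from a dense retriever
lexical = {"p2": 20.0, "p3": 14.0, "p4": 8.0}    # dot products from a lexicon-aware retriever
ranking = ensemble(dense, lexical)
print(ranking)
```

Normalizing before summing matters because the two retrievers produce scores on very different scales, so an unnormalized sum would be dominated by the lexicon-aware scores.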

Impact of Different Components.

Retrievers	MRR@10	R@1k
DEN (Continue)	38.1	98.4
LED	39.6	98.3
w/o Rank Regularization	37.9	98.5
w/o LEX Continue Negs ($N^{lex2}$)	39.4	98.3
w/o LEX Warm-up Negs ($N^{lex1}$)	39.4	98.3
w/o LEX Mixed Negs ($N^{lex1} \cup N^{lex2}$)	39.2	98.4

Table 4: Ablation Study on MS MARCO Dev (%). Negs is short for negatives.

We conduct an ablation study to investigate the impact of the lexical hard negatives and the rank-consistent regularization. Tab. 4 reports the results of removing each component. We observe that the pair-wise rank-consistent regularization plays an important role in lexical learning, because removing it brings significant performance degradation (MRR@10 drops from 39.6 to 37.9). Without regularization, a dense retriever trained with mixed hard negatives achieves comparable or even slightly worse performance than the dense retriever trained with only its own negatives. The reason is that a part of the LEX-sampled negatives, being self-adversarial examples for LEX, may not effectively challenge pure dense retrievers and may instead disturb the dense retriever’s self-adversarial training over its own hard negatives. When the dense retriever is equipped with lexicon-aware capability by our rank-consistent regularization, the LEX-augmented negatives become very effective in challenging the retriever. This indirectly supports that our method can inject lexicon-rich information into a dense retriever: the performance improves significantly only when lexical hard negatives are combined with rank regularization from the lexicon-aware teacher. In addition, the negatives provided by LEX (Warm-up) and LEX (Continue) are both helpful for the contrastive training of the dense retriever, and removing both of them results in a more obvious performance drop.

Figure 2: Distributions of model predictions for the DEN (Continue), LEX (Continue), and LED retrievers over MS MARCO Dev. For visual clarity, we use as data samples the query-passage pairs on which LEX and DEN predict discrepantly. The discrepancy is determined by that there is a …
(a) Top-ranked samples

Effects on Model Predictions.

To further examine the effect of learning lexicon-aware capability on LED, Fig. 2 illustrates how the prediction distribution of the dense retriever shifts before and after lexical enlightenment. We make the following observations: (1) In both sets of query-passage pairs, the score distributions of LED clearly shift from the DEN distributions toward those of LEX, showing that lexical knowledge has been learned. (2) LED’s distribution still overlaps more with DEN’s than with LEX’s, which suggests that our rank-consistent regularization preserves the dense-retriever properties of LED, thanks to its weak supervision signal.

Zoom-in Study of Retrieval Ranking.

Tab. 5 shows how the average rank of golden passages varies across rank ranges, bucketed by the LEX-predicted rank of each golden passage. We observe that: (1) More than 50% of golden passages are ranked in the top 5 by LEX, paving the way for good lexical teaching. (2) The average rank of golden passages under LED improves over DEN in every bucket up to the top 100. These buckets cover approximately 90% of the golden passages, so most answers fall in ranges where the retriever ranks them higher on average after learning lexical knowledge, which supports the effectiveness of our lexical knowledge transfer. Similar or even larger gains are observed for LED (w/ RT) in most of these buckets, again indicating that our method is complementary to distillation from a cross-encoder. (3) For queries on which LEX performs unfavorably (i.e., the ground truth is ranked below 100), LED and LED (w/ RT) are negatively affected by the lexicon-aware teacher’s mistakes. Interestingly, their original rankings of these ground truths are not very high either. These queries are therefore intractable for both dense and lexicon-aware retrievers, and we leave them for future work.
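The two percentages quoted above can be checked directly from the bucket counts in Tab. 5. The short sketch below does that arithmetic; the only assumption is that the denominator is the sum of the listed buckets, i.e., golden passages that LEX ranks within the top 1000.

```python
# Counts of golden passages per LEX-rank bucket, copied from Tab. 5.
bucket_counts = {
    "Top 1": 1787,
    "(1, 5]": 2242,
    "(5, 10]": 875,
    "(10, 50]": 1428,
    "(50, 100]": 358,
    "(100, 500]": 445,
    "(500, 1000]": 69,
}

total = sum(bucket_counts.values())

# Share of golden passages that LEX ranks within the top 5.
top5_share = (bucket_counts["Top 1"] + bucket_counts["(1, 5]"]) / total

# Share of golden passages that LEX ranks within the top 100.
beyond_100 = ("(100, 500]", "(500, 1000]")
top100_share = sum(
    count for name, count in bucket_counts.items() if name not in beyond_100
) / total

print(f"top 5: {top5_share:.1%}, top 100: {top100_share:.1%}")
# prints "top 5: 55.9%, top 100: 92.9%"
```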

Ranges	Count	LEX (avg. rank)	DEN (avg. rank)	LED (avg. rank)	LED (w/ RT) (avg. rank)
Top 1	1,787	1.0	2.5	2.3	2.4
(1, 5]	2,242	3.1	6.4	5.2	4.8
(5, 10]	875	7.8	14.7	13.8	12.7
(10, 50]	1,428	23.3	31.4	31.1	28.9
(50, 100]	358	70.5	80.0	75.9	74.0
(100, 500]	445	216.8	156.3	166.7	154.1
(500, 1000]	69	698.0	298.9	334.7	289.1

Table 5: The average rank of golden passages by four retrievers on MS MARCO Dev. We bin the dev examples into buckets with the rank predicted by LEX and calculate the average ranking of other retrievers by group.
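The bucketing procedure described in the caption can be sketched as follows. The function names, the half-open bucket representation, and the toy query ids are illustrative assumptions, not the authors' analysis script.

```python
from collections import defaultdict

# Bucket edges follow Tab. 5: Top 1, (1, 5], (5, 10], ..., (500, 1000].
BUCKETS = [(0, 1), (1, 5), (5, 10), (10, 50), (50, 100), (100, 500), (500, 1000)]

def bucket_of(rank):
    """Return the (low, high] bucket containing a 1-based rank, or None."""
    for low, high in BUCKETS:
        if low < rank <= high:
            return (low, high)
    return None

def average_rank_by_bucket(lex_ranks, other_ranks):
    """Group queries by the LEX rank of their golden passage, then average
    another retriever's rank of the same golden passage within each group."""
    grouped = defaultdict(list)
    for qid, lex_rank in lex_ranks.items():
        bucket = bucket_of(lex_rank)
        if bucket is not None and qid in other_ranks:
            grouped[bucket].append(other_ranks[qid])
    return {bucket: sum(ranks) / len(ranks) for bucket, ranks in grouped.items()}

# Toy ranks of golden passages (hypothetical query ids, not MS MARCO data).
lex = {"q1": 1, "q2": 1, "q3": 4, "q4": 120}
den = {"q1": 2, "q2": 4, "q3": 6, "q4": 90}

averages = average_rank_by_bucket(lex, den)
```

In this toy run, the two queries that LEX ranks first have an average DEN rank of 3.0, which mirrors how each DEN, LED, and LED (w/ RT) cell of the table is obtained.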

5 Conclusion

In this paper, we build a lexicon-enlightened dense retriever by transferring knowledge from a lexicon-aware representation model. Specifically, we enlighten the dense retriever from two aspects: a lexicon-augmented contrastive objective and a pair-wise rank-consistent regularization. Experimental results on three benchmarks show that, with a performance-comparable lexicon-aware representation model as the teacher, a dense retriever can be consistently and significantly improved and can even outperform its teacher. Extensive analysis and discussion demonstrate the effectiveness, compatibility, and interpretability of our method.

6 Limitations

Although our method effectively enhances the dense retriever, the average ranking changes (Tab. 5) and the distribution shift (Fig. 2(b)) reveal that it also transfers some of the weaknesses of the lexicon-aware retriever. Moreover, training an LED retriever requires first building a lexicon-aware retriever and then using it to teach the dense retriever, so the overall training flow takes longer.

References

Y. Cai, Y. Fan, J. Guo, F. Sun, R. Zhang, and X. Cheng (2021) Semantic models for the first-stage retrieval: A comprehensive review. CoRR abs/2103.04831.
Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, Z. Ghahramani (Ed.), ACM International Conference Proceeding Series, Vol. 227, pp. 129–136.
X. Chen, K. Lakhotia, B. Oguz, A. Gupta, P. S. H. Lewis, S. Peshterliev, Y. Mehdad, S. Gupta, and W. Yih (2021) Salient phrase aware dense retrieval: can a dense retriever imitate a sparse one?. CoRR abs/2110.06918.
N. Craswell, B. Mitra, E. Yilmaz, D. Campos, and E. M. Voorhees (2020a) Overview of the TREC 2019 deep learning track. CoRR abs/2003.07820.
N. Craswell, B. Mitra, E. Yilmaz, and D. Campos (2020b) Overview of the TREC 2020 deep learning track. In Proceedings of the Twenty-Ninth Text REtrieval Conference, TREC 2020, Virtual Event [Gaithersburg, Maryland, USA], November 16-20, 2020, E. M. Voorhees and A. Ellis (Eds.), NIST Special Publication, Vol. 1266.
Z. Dai and J. Callan (2019) Context-aware sentence/passage term importance estimation for first stage retrieval. CoRR abs/1910.10687.
J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186.
T. Formal, C. Lassance, B. Piwowarski, and S. Clinchant (2021a) SPLADE v2: sparse lexical and expansion model for information retrieval. CoRR abs/2109.10086.
T. Formal, B. Piwowarski, and S. Clinchant (2021b) SPLADE: sparse lexical and expansion model for first stage ranking. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, and T. Sakai (Eds.), pp. 2288–2292.
L. Gao and J. Callan (2021a) Condenser: a pre-training architecture for dense retrieval. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih (Eds.), pp. 981–993.
L. Gao and J. Callan (2021b) Unsupervised corpus aware language model pre-training for dense passage retrieval. CoRR abs/2108.05540.
L. Gao, Z. Dai, and J. Callan (2021a) COIL: revisit exact lexical match in information retrieval with contextualized inverted list. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou (Eds.), pp. 3030–3042.
T. Gao, X. Yao, and D. Chen (2021b) SimCSE: simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih (Eds.), pp. 6894–6910.
S. Hofstätter, S. Althammer, M. Schröder, M. Sertkan, and A. Hanbury (2020) Improving efficient neural ranking models with cross-architecture knowledge distillation. CoRR abs/2010.02666.
S. Hofstätter, S. Lin, J. Yang, J. Lin, and A. Hanbury (2021) Efficiently teaching an effective dense retriever with balanced topic aware sampling. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, and T. Sakai (Eds.), pp. 113–122.
J. Huang, A. Sharma, S. Sun, L. Xia, D. Zhang, P. Pronin, J. Padmanabhan, G. Ottaviano, and L. Yang (2020) Embedding-based retrieval in Facebook search. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, R. Gupta, Y. Liu, J. Tang, and B. A. Prakash (Eds.), pp. 2553–2561.
J. Johnson, M. Douze, and H. Jégou (2021) Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7 (3), pp. 535–547.
V. Karpukhin, B. Oguz, S. Min, P. S. H. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 6769–6781.
O. Khattab and M. Zaharia (2020) ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, J. Huang, Y. Chang, X. Cheng, J. Kamps, V. Murdock, J. Wen, and Y. Liu (Eds.), pp. 39–48.
S. Kuzi, M. Zhang, C. Li, M. Bendersky, and M. Najork (2020) Leveraging semantic and lexical matching to improve the recall of document retrieval systems: A hybrid approach. CoRR abs/2010.01195.
K. Lee, M. Chang, and K. Toutanova (2019) Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 6086–6096.
J. Lin and X. Ma (2021) A few brief notes on DeepImpact, COIL, and a conceptual framework for information retrieval techniques. CoRR abs/2106.14807.
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
Y. Luan, J. Eisenstein, K. Toutanova, and M. Collins (2021) Sparse, dense, and attentional representations for text retrieval. Trans. Assoc. Comput. Linguistics 9, pp. 329–345.
A. Mallia, O. Khattab, T. Suel, and N. Tonellotto (2021) Learning passage impacts for inverted indexes. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, and T. Sakai (Eds.), pp. 1723–1727.
T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, T. R. Besold, A. Bordes, A. S. d’Avila Garcez, and G. Wayne (Eds.), CEUR Workshop Proceedings, Vol. 1773.
Y. Qu, Y. Ding, J. Liu, K. Liu, R. Ren, W. X. Zhao, D. Dong, H. Wu, and H. Wang (2021) RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou (Eds.), pp. 5835–5847.
A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. (2018) Improving language understanding by generative pre-training.
S. J. Reddi, R. K. Pasumarthi, A. K. Menon, A. S. Rawat, F. X. Yu, S. Kim, A. Veit, and S. Kumar (2021) RankDistil: knowledge distillation for ranking. In The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, A. Banerjee and K. Fukumizu (Eds.), Proceedings of Machine Learning Research, Vol. 130, pp. 2368–2376.
R. Ren, S. Lv, Y. Qu, J. Liu, W. X. Zhao, Q. She, H. Wu, H. Wang, and J. Wen (2021a) PAIR: leveraging passage-centric similarity relation for improving dense passage retrieval. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), Findings of ACL, Vol. ACL/IJCNLP 2021, pp. 2173–2183.
R. Ren, Y. Qu, J. Liu, W. X. Zhao, Q. She, H. Wu, H. Wang, and J. Wen (2021b) RocketQAv2: A joint training method for dense passage retrieval and passage re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih (Eds.), pp. 2825–2835.
S. E. Robertson and H. Zaragoza (2009) The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3 (4), pp. 333–389.
V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108.
K. Santhanam, O. Khattab, J. Saad-Falcon, C. Potts, and M. Zaharia (2021) ColBERTv2: effective and efficient retrieval via lightweight late interaction. CoRR abs/2112.01488.
Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang (2020) ERNIE 2.0: A continual pre-training framework for language understanding. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 8968–8975.
S. Xiao, Z. Liu, W. Han, J. Zhang, D. Lian, Y. Gong, Q. Chen, F. Yang, H. Sun, Y. Shao, D. Deng, Q. Zhang, and X. Xie (2022) Distill-VQ: learning retrieval oriented vector quantization by distilling knowledge from dense embeddings. CoRR abs/2204.00185.
L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. N. Bennett, J. Ahmed, and A. Overwijk (2021) Approximate nearest neighbor negative contrastive learning for dense text retrieval. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, and S. Ma (2021) Optimizing dense retrieval model training with hard negatives. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, and T. Sakai (Eds.), pp. 1503–1512.
H. Zhang, Y. Gong, Y. Shen, J. Lv, N. Duan, and W. Chen (2022) Adversarial retriever-ranker for dense text retrieval. In International Conference on Learning Representations.
S. Zhang, L. Yao, A. Sun, and Y. Tay (2019) Deep learning based recommender system: A survey and new perspectives. ACM Comput. Surv. 52 (1), pp. 5:1–5:38.
Y. Zhou, T. Shen, X. Geng, C. Tao, C. Xu, G. Long, B. Jiao, and D. Jiang (2022) Towards robust ranker for text retrieval. arXiv.
L. Zou, S. Zhang, H. Cai, D. Ma, S. Cheng, S. Wang, D. Shi, Z. Cheng, and D. Yin (2021) Pre-trained language model based ranking in Baidu search. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, F. Zhu, B. C. Ooi, and C. Miao (Eds.), pp. 4014–4022.

Appendix A Evaluation Metrics

We report MRR@10, Recall@50, and Recall@1000 for MS MARCO Dev, and NDCG@10 and R@1000 for both TREC Deep Learning 2019 and TREC Deep Learning 2020. We use the official TREC evaluation tool, trec_eval (https://github.com/usnistgov/trec_eval).
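For readers unfamiliar with these metrics, the following minimal sketch shows how MRR@k and Recall@k are conventionally computed from a ranked list and a set of relevant passage ids. It is only an illustration under standard definitions; the reported numbers come from trec_eval, and the function names and toy run below are hypothetical.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for position, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / position
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_over_queries(metric, run, qrels, **kwargs):
    """Average a per-query metric over every query that has judgments."""
    scores = [metric(run.get(qid, []), rel, **kwargs) for qid, rel in qrels.items()]
    return sum(scores) / len(scores)

# Toy run and judgments (hypothetical ids).
run = {"q1": ["p3", "p1", "p9"], "q2": ["p5", "p2", "p7"]}
qrels = {"q1": {"p1"}, "q2": {"p8"}}

mrr10 = mean_over_queries(mrr_at_k, run, qrels, k=10)   # (1/2 + 0) / 2
recall2 = mean_over_queries(recall_at_k, run, qrels, k=2)  # (1 + 0) / 2
```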

Appendix B Baselines

We compare with previous state-of-the-art baselines, including traditional term-based techniques such as BM25 Robertson and Zaragoza (2009), as well as dense and lexicon-aware retrievers. For lexicon-aware retrievers, DeepCT Dai and Callan (2019) was trained to predict term weights. COIL Gao et al. (2021a) used contextualized representations for exact term matching, and UniCOIL Lin and Ma (2021) compressed the vectors in COIL into scalars. SPLADE-max and DistilSPLADE-max Formal et al. (2021a) were both trained with Eq. 7, and the latter was further enhanced by a cross-encoder. UniCOIL $\Lambda$ was a lexicon-aware model trained with UniCOIL’s top-ranked passages and negatives Chen et al. (2021). For dense retrievers, ANCE Xiong et al. (2021) selected hard training negatives from the entire collection. ADORE Zhan et al. (2021) used self-mined static negatives and then dynamic negatives. TAS-B Hofstätter et al. (2021) proposed balanced topic-aware negative sampling strategies for effective teaching. ColBERTv1 Khattab and Zaharia (2020) and ColBERTv2 Santhanam et al. (2021) utilized late interaction, and the latter further incorporated ranker distillation. coCondenser Gao and Callan (2021b) augmented the MLM loss with contrastive learning and was built on a model architecture Gao and Callan (2021a) with decoupled sentence and token interaction. PAIR Ren et al. (2021a) introduced a passage-centric loss to assist the contrastive loss and combined it with cross-encoder teaching. RocketQAv2 Ren et al. (2021b) utilized K-L divergence to align the list-wise distributions between retriever and ranker, and proposed hybrid data augmentation. AR2-G Zhang et al. (2022) used an adversarial framework to train the retriever and ranker simultaneously.

Notably, AR2 (https://github.com/microsoft/AR2/tree/main/AR2) used a Recall@N evaluation that differs from the official TREC Recall@N. Therefore, we do not report its Recall@N performance in Tab. 1.

Appendix C Implementation Details

All experiments run on a single NVIDIA Tesla A100 GPU with 80GB of memory and a fixed random seed. We train our models with mixed precision to speed up training and to meet the large memory demand. Training takes about 32 hours. For the lexical teacher, we train a DistilBERT (https://huggingface.co/distilbert-base-uncased) Sanh et al. (2019) following SPLADE-max Formal et al. (2021a). In the warm-up stage, we train the lexical retriever with a batch size of $48$, $5$ negatives per query, and a learning rate of $3 \times 10^{-5}$ for three epochs. In the second stage, we keep all hyperparameters unchanged except that we lower the learning rate to $2 \times 10^{-5}$ and draw negative passages from the top $200$ self-mined hard negatives.

For the dense retriever, we train from the coCondenser Gao and Callan (2021b) checkpoint (https://huggingface.co/Luyu/co-condenser-marco) with a batch size of $16$, $7$ negatives per query, and a learning rate of $1 \times 10^{-5}$ for three epochs. In the second stage, with other hyperparameters unchanged, we set the learning rate to $5 \times 10^{-6}$ and randomly select $32$ hard negatives from the mixture of the top $200$ negatives mined by each of the warm-up dense student, the warm-up lexical teacher, and the final lexical teacher. The number of negatives per query, $32$, is selected from $\{8, 16, 24, 32\}$. A higher number of negatives per query means that more pair-wise ranks are constructed by the lexical teacher, leading to more lexical knowledge transfer.
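The second-stage negative selection described above can be sketched as follows. The text does not specify how the three top-200 lists are merged, so this sketch assumes a de-duplicated union with uniform sampling and with annotated positives filtered out; the function name `sample_mixed_negatives` and the integer passage ids are likewise hypothetical.

```python
import random

def sample_mixed_negatives(mined_lists, positives, top_k=200, n_samples=32, seed=0):
    """Pool the top_k candidates from each miner, then sample n_samples of them.

    Assumptions (not from the paper): duplicates across miners are merged,
    annotated positives are excluded, and sampling is uniform over the pool.
    """
    seen = set(positives)
    pool = []
    for ranked in mined_lists:
        for pid in ranked[:top_k]:
            if pid not in seen:
                seen.add(pid)
                pool.append(pid)
    rng = random.Random(seed)
    return rng.sample(pool, min(n_samples, len(pool)))

# Toy ranked lists standing in for the three miners named in the text.
dense_warmup = list(range(0, 300))
lex_warmup = list(range(150, 450))
lex_final = list(range(400, 700))

negatives = sample_mixed_negatives(
    [dense_warmup, lex_warmup, lex_final], positives={5}
)
```

Sampling from the merged pool rather than from each list separately means that a passage mined by several retrievers is not over-represented, which is a design choice of this sketch rather than a documented detail.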

For the rank-consistent regularization, we set the loss weight $\lambda = 1.2$ after searching over $\{1.0, 1.2, 1.5, 1.8, 2.0\}$.

Appendix D Further Discussions

We conduct extra experiments to explore the robustness and effectiveness of our lexical enhancement methods.

D.1 Impacts of Hyperparameters

Figure 3: Effects of the number of negatives per query on the performance of LED on MS MARCO Dev.

Figure 4: Effects of the regularization weight $\lambda$ on the performance of LED on MS MARCO Dev.

Query	ID: 1090413// state the benefits of internet
Passage+	ID: 7998365// What Are Some Benefits of Using the Internet?<sep>Some of the benefits of the Internet include reduced geographical distance and fast communication. The Internet is also a hub of information where users can simply upload, download and publish ideas…
Rank	LEX: 1; DEN: 3; LED: 1; LED w/ RT: 1
Retrieved	DEN’s 1st. ID: 7339157 // -<sep>Advantages of the Internet. The Internet provides opportunities galore, and can be used for a variety of things. Some of the things that you can do via the Internet are: 1 E-mail: E-mail is an online correspondence system. 2 With e-mail you can send and receive instant electronic messages, which works like writing letters.
Query	ID: 68832// can i drink ginger tea before breakfast
Passage+	ID: 7389957// Ginger and Lemon, a Perfect Combination for Weight Loss<sep>Heat one cup of water and remove it from heat once it is boiling. Once the water is ready, add a small slice of ginger and let it steep for 5 minutes, covered. Once the 5 minutes is up, add the juice from one lemon and drink.It’s best to drink this tea before breakfast. This delicious lemonade combines the classic recipe that we all know, with the powers of ginger.f you want to lose weight by taking advantage of ginger and lemon as two very good friends, you also need to follow a healthy diet, free of junk food, soft drinks, flours, fats, sodium, and other things. You also need to do physical activity and drink at least two liters of water a day.
Rank	LEX: 1; DEN: 7; LED: 1; LED w/ RT: 1
Retrieved	DEN’s 1st. ID: 602379 // One of the most recognized benefits of ginger tea is its ability to relieve nausea. A cup of tea before the traveling can prevent nausea and vomiting. Ginger tea is excellent for pregnant women who have problems with morning sickness. Always drink a cup of ginger tea once you feel the first signs of nausea.
Query	ID:1033652// what is the purpose of pencil tool
Passage+	ID: 7212314// Pencil<sep>Should I remove Pencil by Evolus Co? Pencil is built for the purpose of providing a free and open-source GUI prototyping tool that people can easily install and use to create mockups in popular desktop platforms.
Rank	LEX: 1; DEN: 11; LED: 1; LED w/ RT: 1
Retrieved	DEN’s 1st. ID:313304 // Pencil<sep>This article is about the writing implement. For other uses, see Pencil (disambiguation). A pencil is a writing implement or art medium constructed of a narrow, solid pigment core inside a protective casing which prevents the core from being broken or leaving marks on the users hand during use.

Table 6: Case study on MS MARCO Dev. ‘Passage+’ denotes the golden passage of the corresponding query. ‘Rank’ indicates the rank of the golden passage under each retriever.

Number of Negatives in Batch

Fig. 3 illustrates the impact of varying the number of negative passages per query on LED. As the number of negatives increases, MRR@10 goes up, whereas R@1k peaks at $16$ negatives and then decreases gradually. The main table shows that the lexical retriever is weaker than the dense retriever on R@1k (98.0 $<$ 98.4). Taken together, the two trends indicate that with more negatives the teacher constructs more rank pairs and transfers more lexical knowledge, and that imitating the lexical retriever too closely also passes the teacher’s weakness on to the student.
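To make the link between the number of negatives and the amount of teacher supervision concrete, one can count rank pairs under a simplifying assumption that is ours and used only for illustration: the teacher induces one ordered pair for every two distinct negatives of a query, with ties ignored. For the candidate values of $n$ negatives per query considered in the search, this gives:

```latex
\[
P(n) = \binom{n}{2} = \frac{n(n-1)}{2},
\qquad
P(8) = 28,\quad P(16) = 120,\quad P(24) = 276,\quad P(32) = 496 .
\]
```

Under this assumption the number of teacher-ordered pairs grows quadratically in $n$, so moving from $16$ to $32$ negatives roughly quadruples the pair-wise supervision per query.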

Regularization Weight

Fig. 4 shows the performance under different values of the regularization weight $\lambda$. The performance does not fluctuate significantly as $\lambda$ changes, demonstrating the robustness of our lexical enhancement strategies. Interestingly, the increase in MRR@10 comes with a drop in R@1k to some extent, again showing that a well-enhanced LED also inherits the weakness of the lexical teacher.

D.2 Case Study

Tab. 6 shows three case queries with the rankings of the four retrievers. With lexicon-aware capability, LED and LED w/ RT retrieve the golden passages as the top-1 result, like their teacher LEX.

In the first case, DEN matches “benefits” in the query to “advantages” in the passage, since both are positive words. In contrast, LEX and the LED-series retrievers exactly match the phrase “benefits of the Internet”. In the second query, although DEN captures “drink ginger tea” and returns a passage about the benefits of ginger tea as the top-1 result, it overlooks the local salient phrase “before breakfast”, leading to inferior retrieval results. In the third query, “Pencil tool” refers to a specific GUI prototyping tool (as highlighted in the positive passage). DEN misunderstands the mention “Pencil tool” and returns a passage about the ordinary pencil, which is irrelevant to the user’s intention. These three cases show that retrievers with lexicon-aware capability (i.e., LEX, LED, LED w/ RT) capture salient phrases and entity mentions well, providing more precise retrieval results on some queries.