Isotropic Representation Can Improve Dense Retrieval

Euna Jung (xlpczv@snu.ac.kr), GSCST at Seoul National University, Seoul, South Korea
Jungwon Park (quoded97@snu.ac.kr), GSCST at Seoul National University, Seoul, South Korea
Jaekeol Choi (jaekeol.choi@snu.ac.kr), Seoul National University & Naver Corp., Seoul, South Korea
Sungyoon Kim (clifter0122@snu.ac.kr), Seoul National University, Seoul, South Korea
Wonjong Rhee (wrhee@snu.ac.kr), GSCST, GSAI, AIIS at Seoul National University, Seoul, South Korea

Abstract.

Recent advances in language representation modeling have broadly affected the design of dense retrieval models. In particular, many high-performing dense retrieval models compute representations of queries and documents using BERT and then apply cosine-similarity-based scoring to determine relevance. BERT representations, however, are known to follow an anisotropic distribution with a narrow cone shape, and such a distribution can be undesirable for cosine-similarity-based scoring. In this work, we first show that the representations of BERT-based dense retrieval (DR) models also follow an anisotropic distribution. To cope with the problem, we introduce two unsupervised post-processing methods, Normalizing Flow and whitening, and develop a token-wise method in addition to the sequence-wise method for applying them to the representations of dense retrieval models. We show that the proposed methods effectively make the representations more isotropic, and we then perform experiments with ColBERT and RepBERT to show that the document re-ranking performance (NDCG at 10) can be improved by 5.17% $\sim$ 8.09% for ColBERT and 6.88% $\sim$ 22.81% for RepBERT. To examine the potential of isotropic representation for improving the robustness of DR models, we investigate out-of-distribution tasks where the test dataset differs from the training dataset. The results show that isotropic representation achieves generally improved performance. For instance, when the training dataset is MS-MARCO and the test dataset is Robust04, isotropy post-processing can improve the baseline performance by up to 24.98%. Furthermore, we show that an isotropic model trained with an out-of-distribution dataset can even outperform a baseline model trained with the in-distribution dataset. The code will be made available at https://github.com/SNU-DRL/IsotropicIR.git.

dense retrieval, isotropic representation, Normalizing Flow, whitening, robustness

1. Introduction

Dense retrieval (DR) maps queries and documents to a representation space and estimates the relevance score between them based on their similarity. Recent DR models often encode representations of queries and documents using BERT and compute the similarity between them using cosine similarity or dot product. The representations of language models such as BERT, however, are known to follow an anisotropic distribution (Gao et al., 2019; Wang et al., 2019; Luo et al., 2020; Kovaleva et al., 2021; Timkey and van Schijndel, 2021; Yu et al., 2022). An anisotropic distribution is a directionally non-uniform distribution, such as a narrow cone (Li et al., 2020). This phenomenon can harm the performance of DR because a score estimated with cosine similarity can be misleading if the representations are anisotropically distributed. In this study, we aim to show that BERT representations of DR also follow an anisotropic distribution and to improve the performance of BERT-based DR by enforcing isotropy on the representations. In the field of Semantic Textual Similarity (STS), Li et al. (2020) and Su et al. (2021) used representation transformation methods to relieve the problem of anisotropic representations. They adopted Normalizing Flow or whitening before computing cosine similarities so that isotropy could be enforced on BERT representations, and showed that the methods could improve the STS performance. In our study, we also apply these post-processing methods to BERT representations of DR after making proper adjustments for DR. To the best of our knowledge, we are the first to introduce and study these methods in the DR field. To demonstrate the effectiveness of isotropic representation for DR, we experiment with three document re-ranking datasets, MS-MARCO, Robust04, and ClueWeb09b, and two representative DR models, ColBERT (Khattab and Zaharia, 2020) and RepBERT (Zhan et al., 2020).
ColBERT is a multi-vector DR model that estimates the relevance score by computing the cosine similarity among the token representations in query and document. RepBERT is a single-vector DR model that encodes each of query and document to a single vector and estimates the relevance score by calculating the cosine similarity between the two sequence representations. Since the post-processing methods of STS transform the representation for each sequence (i.e., sentence), they cannot be directly applied to multi-vector DR models that compute the token-wise similarity. Therefore, we consider a token-wise transformation in addition to the sequence-wise transformation for improving the isotropy. As we will show later, the token-wise transformation turns out to be useful even for RepBERT, which is a single-vector DR model. We empirically show the effectiveness of these post-processing methods and compare the token-wise and the sequence-wise isotropy transformations when applicable. By enforcing isotropy on the BERT representations, we show that we can significantly improve the re-ranking performance of both ColBERT and RepBERT. Adopting Normalizing Flow or whitening increases the performance of ColBERT by 5.17% $\sim$ 8.09% and the performance of RepBERT by 6.88% $\sim$ 22.81% on NDCG at 10 (NDCG@10) across the three datasets. In the RepBERT experiments, we found that either the token-wise method or the sequence-wise method can perform better depending on the characteristics of the dataset. Overall, token-wise methods performed better for Robust04 and ClueWeb09b, which have short queries, and sequence-wise methods performed better for MS-MARCO, which has long queries.
To examine the potential of isotropic representation beyond the basic re-ranking task in the In-Distribution (ID) setting, where the source data used for training and the target data used for testing are the same, we additionally investigate the Out-Of-Distribution (OOD) setting, where the source data is different from the target data. We evaluate the robustness of ColBERT and RepBERT for OOD tasks following Wu et al. (2021), who measured the robustness of DR by varying the source and target data. In our experiments, we found that enforcing isotropy on BERT representations can improve the robustness of DR models by 5.27% $\sim$ 24.98% for NDCG at 10 (NDCG@10). The OOD performance of ColBERT trained on MS-MARCO can even surpass the ID performance on Robust04 and ClueWeb09b when the post-processing methods are applied. The result is surprising in that the OOD performance of neural ranking models (NRMs) generally falls behind the ID performance, as shown in Wu et al. (2021). A summary of the ColBERT results can be found in Figure 1, and the contributions of this study can be summarized as follows.

We point out for the first time that BERT representations used in DR follow an anisotropic distribution that negatively affects the DR performance.
We adopt Normalizing Flow and whitening for enforcing representation isotropy, and propose token-wise and sequence-wise methods for DR models.
We show the effectiveness of the proposed methods for re-ranking tasks (in-domain).
We show the capability of the proposed methods for improving robustness of DR models by investigating out-of-distribution tasks.

2. Related Works

2.1. Dense Retrieval and Similarity Function

Depending on whether DR uses a single vector or multiple vectors for encoding each of queries and documents, DR models are divided into single-vector and multi-vector models. RepBERT (Zhan et al., 2020), ANCE (Xiong et al., 2020), and RocketQA (Qu et al., 2020; Ren et al., 2021) are examples of single-vector DR models. ColBERT (Khattab and Zaharia, 2020; Santhanam et al., 2021) and COIL (Gao et al., 2021) are examples of multi-vector DR models, and they are generally known to perform better. Both types of DR models estimate the relevance score using a similarity function between the representation vectors of query and document. The similarity function is typically a simple cosine or a dot product, although more expressive functions such as networks consisting of multiple layers can be used, too. For an efficient computation of dense retrieval, the similarity function needs to be decomposable such that the representations of the documents can be pre-computed and stored (Karpukhin et al., 2020), and an efficient Approximate Nearest Neighbor (ANN) retrieval (Johnson et al., 2019; Guo et al., 2020) can be performed. Most of the decomposable similarity functions are based on cosine similarity (Mussmann and Ermon, 2016; Ram and Gray, 2012). In this paper, we focus on two representative DR models that are based on cosine similarity - RepBERT that is a single-vector DR and ColBERT that is a multi-vector DR.
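The decomposability requirement can be illustrated with a small sketch. It assumes NumPy and uses random stand-in vectors (the names `doc_vecs`, `query_vec`, and `l2_normalize` are hypothetical, not from the paper): once document vectors are normalized offline, cosine scoring at query time reduces to a single matrix-vector product.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so that a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(5, 8))   # hypothetical document representations
query_vec = rng.normal(size=(8,))    # hypothetical query representation

# Offline: normalize and store the document vectors once.
doc_index = l2_normalize(doc_vecs)

# Online: normalize the query and score all documents with one product.
scores = doc_index @ l2_normalize(query_vec)
ranking = np.argsort(-scores)        # document indices, best first
```

An ANN library would replace the exhaustive product in the last step; the sketch only shows why the document side can be pre-computed.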

2.2. Anisotropic Distribution of BERT Representations

Representations of large-scale language models are known to follow anisotropic distributions (Gao et al., 2019; Wang et al., 2019; Luo et al., 2020; Kovaleva et al., 2021; Timkey and van Schijndel, 2021; Yu et al., 2022). Some studies (Gao et al., 2019; Yu et al., 2022) assert that the anisotropic distribution is caused by rare tokens. Other studies (Luo et al., 2020; Kovaleva et al., 2021; Timkey and van Schijndel, 2021) show the existence of outlier dimensions having extreme values in representation vectors and attribute the anisotropic distribution to the outlier dimensions. Although the anisotropic distribution can negatively affect the relevance score estimation between a query and a document, this phenomenon has not been studied for DR models before. In this study, we show that the representations of BERT-based DR models also suffer from the anisotropic distribution.

2.3. Enforcing Isotropy for STS Task

Reimers and Gurevych (2019) proposed a framework named Sentence-BERT (S-BERT) for the Semantic Textual Similarity (STS) task. To estimate the semantic similarity of two input sentences, S-BERT computes the cosine similarity between the two sentences’ BERT representations. As S-BERT employs cosine similarity, the anisotropic distribution of BERT representations can be problematic. To make the distribution of BERT representations isotropic, Li et al. (2020) employed a Normalizing Flow model named Glow, and Su et al. (2021) applied a linear whitening that makes the representation vectors have zero mean and identity covariance. Both methods increase isotropy of the representations. DR and S-BERT are similar because they both model the relationship between the two input sequences based on the cosine similarity of the representations. DR models, however, are different because the two input sequences, query and document, have distinct characteristics and because multi-vector DR models such as ColBERT compute token-wise cosine similarity. In this DR study, we consider token-wise transformation as well as sequence-wise transformation.

2.4. Robustness of Ranking Models

The robustness of a ranking model refers to the ability of the model to operate properly in abnormal situations, and it is an essential factor when a ranking model is deployed in a real-world application (Wang et al., 2012; Wu et al., 2021). While the robustness of ranking models can be defined with multi-dimensional factors (Shafique et al., 2020; Goren et al., 2018), OOD generalizability can be regarded as one of the most important factors, if not the most important factor, of robustness for Neural Ranking Models (NRMs) because NRMs tend to show relatively poor performance on OOD tasks compared to traditional ranking models (Wu et al., 2021). Some existing works have focused on improving the OOD generalizability of ranking models (Ma et al., 2019; Zhang et al., 2019; Thakur et al., 2022; Wu et al., 2021). None of these works, however, was able to show a case where the OOD performance surpasses the ID performance. In this study, we show that the OOD performance of NRMs can be improved by enforcing representation isotropy to the extent that the OOD model outperforms the baseline ID model.

3. Backgrounds

3.1. DR models: ColBERT and RepBERT

In our study, we investigate two well-known DR models that employ cosine similarity for scoring. The first is ColBERT (Khattab and Zaharia, 2020), which estimates the relevance score by summing up the token-wise cosine similarity between query and document token representations. Let $x_{j}^{q} \in \mathbb{R}^{D}$ and $x_{j}^{d_{i}} \in \mathbb{R}^{D}$ denote the $D$-dimensional representations of the $j^{\text{th}}$ token of a query and of the $i^{\text{th}}$ associated document, respectively. ColBERT computes the relevance between a query and a document as below, where $|q|$ is the number of tokens in the query $q$.

$$\mathrm{score}_{i} = \sum_{j=1}^{|q|} \max_{k} \cos\left(x_{j}^{q}, x_{k}^{d_{i}}\right) \tag{1}$$

In our work, a token representation refers to a representation vector of a token in an input sequence.
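As a minimal sketch of Eq. (1), assuming NumPy and toy two-dimensional token vectors (the function names and example values are illustrative, not taken from the ColBERT code base), the score is the sum over query tokens of the largest cosine similarity to any document token:

```python
import numpy as np

def cosine_matrix(a, b, eps=1e-12):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return a @ b.T

def colbert_score(query_tokens, doc_tokens):
    """Eq. (1): for each query token take the max over document tokens, then sum."""
    sim = cosine_matrix(query_tokens, doc_tokens)   # shape (|q|, |d_i|)
    return float(sim.max(axis=1).sum())

# Toy example with two query tokens and two document tokens.
q_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
d_tokens = np.array([[1.0, 0.0], [1.0, 1.0]])
score = colbert_score(q_tokens, d_tokens)           # 1 + 1/sqrt(2)
```

The first query token matches the first document token exactly, and the second query token is closest to the second document token, which gives the two terms of the sum.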

While ColBERT’s scoring utilizes all of the individual token representations as the input, RepBERT (Zhan et al., 2020) first generates a single sequence representation from the token representations by averaging them. Then, it computes the cosine similarity between the query-sequence’s representation vector and the document-sequence’s representation vector as shown below, where $| d_{i} |$ is the number of tokens in the document. In our work, a sequence representation refers to a representation vector that contains the information of an input sequence.

$$\mathrm{score}_{i} = \cos\left(\frac{1}{|q|} \sum_{j=1}^{|q|} x_{j}^{q}, \; \frac{1}{|d_{i}|} \sum_{j=1}^{|d_{i}|} x_{j}^{d_{i}}\right) \tag{2}$$
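Eq. (2) can be sketched in the same way. The snippet below assumes NumPy and toy token vectors, and it ignores padding and attention masks, which a real implementation would need to handle; the function name is hypothetical.

```python
import numpy as np

def repbert_score(query_tokens, doc_tokens, eps=1e-12):
    """Eq. (2): cosine similarity between mean-pooled query and document vectors."""
    q = query_tokens.mean(axis=0)   # single query representation
    d = doc_tokens.mean(axis=0)     # single document representation
    return float(q @ d / max(np.linalg.norm(q) * np.linalg.norm(d), eps))

q_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])   # mean is (0.5, 0.5)
d_tokens = np.array([[1.0, 0.0], [1.0, 1.0]])   # mean is (1.0, 0.5)
score = repbert_score(q_tokens, d_tokens)       # 3 / sqrt(10)
```

Unlike the token-wise scoring of Eq. (1), all token-level detail is collapsed before the similarity is computed, so only one vector per document needs to be stored.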

A typical BERT-based DR model consists of a pre-trained BERT model followed by an aggregator that performs the scoring. Such a DR model is trained by fine-tuning with an information retrieval dataset. This training process is shown in Figure 2(a), and the aggregator in the figure implements either Eq. (1) or Eq. (2) for ColBERT and RepBERT.

3.2. Metrics

For a given set of representations, we measure the degree of isotropy using two metrics. Each metric is computed per test batch and then averaged over all test batches.

3.2.1. $I (W)$

Consider a set of representation vectors given by $W$, where each row of $W$ corresponds to a representation vector. Mu et al. (2017) suggested $I(W)$ for measuring isotropy, which utilizes the partition function $q(a) = \sum_{i=1}^{N} \exp(w_{i} a^{T})$ introduced by Arora et al. (2016), where $w_{i}$ is the $i^{\text{th}}$ row of $W$ and $N$ is the number of rows. Then, $I(W)$ is approximated as

$$I(W) \approx \frac{\min_{a \in E} q(a)}{\max_{a \in E} q(a)}, \tag{3}$$

where $E$ represents the set of eigenvectors of $W^{T} W$. The range of $I(W)$ is $[0, 1]$. If the representations follow the standard Gaussian distribution, $I(W)$ becomes close to one. In previous studies, $I(W)$ was used to measure the isotropy of representations (Yu et al., 2022; Wang et al., 2019).
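A rough sketch of the approximation in Eq. (3), assuming NumPy and synthetic data (the function name `isotropy_I` is hypothetical). Eigenvector signs returned by the solver are arbitrary, so the exact value can depend on the solver; the sketch only illustrates the qualitative behavior.

```python
import numpy as np

def isotropy_I(W):
    """Approximate I(W): ratio of min to max partition function over eigenvectors of W^T W."""
    _, eigvecs = np.linalg.eigh(W.T @ W)     # columns are unit eigenvectors
    q = np.exp(W @ eigvecs).sum(axis=0)      # q(a) = sum_i exp(w_i . a) for each eigenvector a
    return float(q.min() / q.max())

rng = np.random.default_rng(0)
W_iso = rng.normal(size=(2000, 4))           # roughly isotropic: standard Gaussian rows
W_aniso = W_iso + 3.0                        # shifted away from the origin: cone-like

i_iso = isotropy_I(W_iso)
i_aniso = isotropy_I(W_aniso)
```

With the shifted data, one eigenvector aligns with the common offset, so its partition function differs from the others by orders of magnitude and the ratio collapses toward zero.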

3.2.2. Average cosine similarity

Suppose that the representation vectors are isotropically spread around the origin. In this case, the average cosine similarity between two randomly selected representation vectors should be close to zero. Based on this simple heuristic, the average cosine similarity $\mathrm{avgcos}(W)$ is adopted as the second metric of isotropy. It can be expressed as

$$\mathrm{avgcos}(W) = \frac{2}{n(n-1)} \sum_{i < j} \cos(w_{i}, w_{j}), \tag{4}$$

where $w_{i}$ represents the $i^{\text{th}}$ row of the representation matrix $W$, $n$ is the number of rows, and the sum runs over the $n(n-1)/2$ unordered pairs of distinct rows. Ethayarajh (2019) also used the average cosine similarity of representations to measure the degree of isotropy.
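The average cosine similarity can be sketched as the mean over all unordered pairs of distinct rows, which is the quantity the text describes. The snippet assumes NumPy; the function name and the toy matrix are illustrative.

```python
import numpy as np

def avg_cos(W, eps=1e-12):
    """Mean cosine similarity over all unordered pairs of distinct rows of W."""
    Wn = W / np.maximum(np.linalg.norm(W, axis=1, keepdims=True), eps)
    sim = Wn @ Wn.T                              # full cosine matrix
    upper = np.triu_indices(W.shape[0], k=1)     # index pairs with i < j
    return float(sim[upper].mean())

W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
value = avg_cos(W)   # pairs give 0, 1/sqrt(2), 1/sqrt(2), so the mean is sqrt(2)/3
```

For large batches the full $n \times n$ matrix may be too large to hold in memory; computing it block by block is one option.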

4. Methodology

Figure 2(a). Baseline: training a DR model by fine-tuning BERT.

4.1. Enforcing Isotropy

The baseline method of training a BERT-based DR model is shown in Figure 2(a). In our work, we enhance the isotropy of BERT representations with Normalizing Flow or whitening, as shown in Figures 2(b) and 2(c). Both post-processing methods are entirely unsupervised.

4.1.1. Normalizing Flow

Normalizing Flow transforms a simple and tractable distribution, such as a standard Gaussian, into the target data distribution by applying a series of invertible and (almost everywhere) differentiable mappings (Tabak and Vanden-Eijnden, 2010; Tabak and Turner, 2013; Kobyzev et al., 2020). By doing this, flow-based models can easily infer the target density and perform sampling from it. Formally speaking, let $Z$ be a random variable of the simple distribution, and $X$ be a random variable of the target distribution. Let $g$ be an invertible and differentiable mapping satisfying $X = g (Z)$ , and $f$ be the inverse of $g$ . The probability density of the target distribution can then be computed using the probability density of the simple distribution and volume correction:

$$p_{X}(x) = p_{Z}(f(x)) \, \left| \det D f(x) \right|, \tag{5}$$

where $p_{X}$ and $p_{Z}$ are the densities of $X$ and $Z$ , respectively, and $D f (x) = \frac{\partial f}{\partial x}$ is the Jacobian matrix of $f$ . Here, the determinant term represents the volume correction induced by the change of variables. This approach is referred to as density estimation. On the other hand, sampling data can be described as the following:

$$x = g(z), \quad \text{where } z \sim p_{Z}(z). \tag{6}$$
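The change of variables can be checked on a one-dimensional toy case. The sketch below assumes an affine map $g(z) = az + b$ with arbitrarily chosen $a$ and $b$ (hypothetical values, not from the paper); the density obtained from Eq. (5) should then coincide with the Gaussian density with mean $b$ and standard deviation $|a|$.

```python
import math

A_SCALE, B_SHIFT = 2.0, 1.0   # hypothetical parameters of the toy flow

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def g(z):
    """Sampling direction, Eq. (6): simple variable to target variable."""
    return A_SCALE * z + B_SHIFT

def f(x):
    """Density-estimation direction: inverse of g."""
    return (x - B_SHIFT) / A_SCALE

def density_x(x):
    """Eq. (5): p_X(x) = p_Z(f(x)) * |det Df(x)|, where Df(x) = 1 / A_SCALE here."""
    return std_normal_pdf(f(x)) * abs(1.0 / A_SCALE)

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

p_flow = density_x(0.3)
p_closed_form = gaussian_pdf(0.3, B_SHIFT, abs(A_SCALE))
```

Without the determinant term the flow density would not integrate to one, which is why it is called a volume correction.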

Normalizing Flow can be used for either density estimation or sampling. Between the two purposes, we use Normalizing Flow for density estimation, i.e., to construct flow function $f$ that maps BERT representations into an isotropic Gaussian distribution as Li et al. (2020) did. Among the various flow-based models, we use NICE and Glow in our experiments. More details on each model can be found in the original papers (Dinh et al., 2014; Kingma and Dhariwal, 2018).

As illustrated in Figure 2(b)-left, we aim to transform the BERT representations of query and document into the standard Gaussian distribution using the Normalizing Flow’s density estimation. To do so, in Eq. (5), we set $p_{Z}$ as the density of the standard Gaussian distribution. Then, we maximize the likelihood of BERT representations, that is, $p_{X}$ in Eq. (5). Consequently, the flow function $f$ is trained to transform the BERT representations $X$ into the standard Gaussian distribution. Because there is no need for any annotation information, the training is unsupervised. For the inference shown in Figure 2(b)-right, we simply cascade an aggregator that performs the cosine similarity calculation.
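The training objective can be sketched with a single additive coupling layer in the style of NICE. This is an illustration under simplifying assumptions, not the authors' implementation: it uses NumPy, random stand-in data instead of BERT outputs, and a fixed (untrained) shift function where a real flow would use a learned network and stack several layers.

```python
import numpy as np

def additive_coupling(x, shift_fn):
    """y1 = x1, y2 = x2 + m(x1). The Jacobian is unit triangular, so log|det| = 0."""
    d = x.shape[1] // 2
    x1, x2 = x[:, :d], x[:, d:]
    return np.concatenate([x1, x2 + shift_fn(x1)], axis=1)

def inverse_coupling(y, shift_fn):
    """Exact inverse of additive_coupling."""
    d = y.shape[1] // 2
    y1, y2 = y[:, :d], y[:, d:]
    return np.concatenate([y1, y2 - shift_fn(y1)], axis=1)

def nll_standard_gaussian(z, log_det=0.0):
    """Negative log of Eq. (5) with p_Z set to the standard Gaussian, averaged over rows."""
    dim = z.shape[1]
    log_pz = -0.5 * (z ** 2).sum(axis=1) - 0.5 * dim * np.log(2.0 * np.pi)
    return float(-(log_pz + log_det).mean())

rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 2))          # hypothetical fixed weights; a real flow learns them

def shift_fn(h):
    return np.tanh(h @ weights)

x = rng.normal(loc=2.0, size=(6, 4))       # stand-in for BERT representations
z = additive_coupling(x, shift_fn)         # f(x)
loss = nll_standard_gaussian(z)            # quantity minimized during unsupervised training
```

Minimizing this loss over the parameters of the flow pushes $f(x)$ toward a standard Gaussian, and no relevance labels are involved.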

For DR, we propose two different implementations of Normalizing Flow. In the token-wise implementation, Normalizing Flow is applied to each individual output token of BERT. In this case, each row of $X$ is a token representation of BERT. In the sequence-wise implementation, Normalizing Flow is applied to the entire sequence of BERT output. In this case, each row of $X$ is a sequence representation of BERT. Both are applicable to RepBERT, which is a single-vector model, but only the token-wise implementation is applicable to ColBERT because it is a multi-vector model, so isotropy needs to be enforced at the token level and not at the sequence level.
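The difference between the two implementations lies in what a row of $X$ is. The sketch below assumes NumPy, random stand-in token vectors, and mean pooling as the sequence representation (following the RepBERT description in Section 3.1); padding and masking are ignored, and the function names are hypothetical.

```python
import numpy as np

def token_wise_rows(batch):
    """One row per token: stack the token vectors of every sequence."""
    return np.concatenate(batch, axis=0)

def sequence_wise_rows(batch):
    """One row per sequence: mean-pool the tokens of each sequence first."""
    return np.stack([tokens.mean(axis=0) for tokens in batch], axis=0)

rng = np.random.default_rng(0)
dim = 8
# Three hypothetical sequences with 3, 5, and 2 tokens.
batch = [rng.normal(size=(n_tokens, dim)) for n_tokens in (3, 5, 2)]

X_token = token_wise_rows(batch)        # shape (10, 8)
X_sequence = sequence_wise_rows(batch)  # shape (3, 8)
```

In the token-wise case the transformation is fitted on and applied to `X_token` before any pooling, whereas in the sequence-wise case it operates on the pooled vectors.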

4.1.2. Whitening

Whitening is a linear transformation that renders the data distribution spherical. That is, it eliminates the structures of location, scale, and correlations in the distribution (Friedman, 1987). Let $\{x_{i}\}_{i=1}^{N}$ be a set of BERT representations where each $x_{i}$ is a row vector in $\mathbb{R}^{D}$. To apply whitening, a mean vector $\mu$ and a covariance matrix $\Sigma$ of $\{x_{i}\}_{i=1}^{N}$ are pre-computed as

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_{i} \tag{7}$$

$$\Sigma = \frac{1}{N-1} \sum_{i=1}^{N} (x_{i} - \mu)^{T} (x_{i} - \mu), \tag{8}$$

where $\Sigma$ denotes the unbiased covariance estimate. The covariance matrix $\Sigma$ has the singular value decomposition (SVD)

$$\Sigma = U \Lambda U^{T}, \tag{9}$$

where $\Lambda \in \mathbb{R}^{D \times D}$ is a diagonal matrix of positive elements, and $U \in \mathbb{R}^{D \times D}$ is an orthogonal matrix, i.e., $U^{T} U = I$. Then, the whitened representation vector $z_{i}$ can be represented as

$$z_{i} = (x_{i} - \mu) \, U \sqrt{\Lambda^{-1}}, \tag{10}$$

where each $z_{i}$ follows a distribution of zero mean and identity covariance (Su et al., 2021).

For DR, we first pre-compute the mean and covariance using a train set of the source data as shown in Figure 2(c)-left. For the inference shown in Figure 2(c)-right, we obtain the whitened output $z_{i}$ in Eq. (10) using the pre-computed mean and covariance and cascade an aggregator that performs the cosine similarity calculation. As in the Normalizing Flow, we consider both token-wise and sequence-wise implementations of whitening.
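Eqs. (7) to (10) can be sketched as follows, assuming NumPy and synthetic correlated data in place of BERT representations (the function names and the mixing matrix are hypothetical). The statistics are fitted once and then reused at inference time.

```python
import numpy as np

def fit_whitening(X):
    """Pre-compute the mean (Eq. 7) and the whitening matrix U sqrt(Lambda^{-1}) (Eqs. 8-10)."""
    mu = X.mean(axis=0, keepdims=True)
    centered = X - mu
    sigma = centered.T @ centered / (X.shape[0] - 1)   # unbiased covariance, Eq. (8)
    U, lam, _ = np.linalg.svd(sigma)                   # Sigma = U diag(lam) U^T, Eq. (9)
    return mu, U @ np.diag(1.0 / np.sqrt(lam))

def whiten(X, mu, W):
    """Eq. (10): apply the pre-computed transformation to row vectors."""
    return (X - mu) @ W

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.3, 0.0],
                   [0.0, 0.0, 1.5, 0.2],
                   [0.0, 0.0, 0.0, 0.5]])              # hypothetical, induces correlations
X_train = rng.normal(size=(500, 4)) @ mixing + 5.0      # stand-in for source-data representations

mu, W = fit_whitening(X_train)
Z = whiten(X_train, mu, W)
```

On the data used for fitting, `Z` has zero mean and identity covariance up to numerical error. On target data from a different distribution this holds only approximately, and if some singular values were close to zero a small constant would have to be added before inverting them (an assumption not discussed in the text).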

4.2. Robustness for out-of-distribution data

While an anisotropic distribution is likely to overfit to the characteristics of the training dataset, we hypothesize that an isotropic distribution is less likely to overfit. To study this hypothesis, we need to compare the OOD performance before and after enforcing isotropy. For this purpose, we train a DR model on the source data and evaluate the re-ranking performance on different target data. This concept of evaluating robustness is commonly referred to as OOD generalizability on an unforeseen corpus (Wu et al., 2021), and we use the terms OOD robustness and OOD generalizability interchangeably in this work.

As the baseline of OOD generalizability, we consider the same OOD method considered in the previous works where isotropy is not enforced. In the baseline, training is performed as in Figure 2(a) using the source data and evaluation is performed with the same network but by using the different target data. In our work, our main interest lies in the improvement that we can obtain by enforcing isotropy. Therefore, the isotropy experiments are performed in a similar manner as in the baseline where the source data and target data differ but the rest of training and inference process are the same as shown in Figure 2(b) and 2(c). Because we consider three different datasets in this work, we investigate six different combinations of OOD experiments where the source data differs from the target data.
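The six OOD combinations are simply the ordered pairs of distinct datasets. A small sketch using only the Python standard library (variable names are illustrative):

```python
from itertools import permutations

datasets = ["MS-MARCO", "Robust04", "ClueWeb09b"]
methods = ["Fine-tuning (Ft)", "Ft -> Whitening", "Ft -> NICE", "Ft -> Glow"]

# Ordered (source, target) pairs with source != target.
ood_pairs = list(permutations(datasets, 2))

# Every pair is evaluated with the baseline and with each post-processing method.
ood_runs = [(source, target, method) for source, target in ood_pairs for method in methods]
```

The order matters because training on MS-MARCO and testing on Robust04 is a different experiment from the reverse direction.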

5. Experiments

| Dataset | Method | P@20 | NDCG@10 | $I(W)$ ($\uparrow$) | $\mathrm{avgcos}(W)$ ($\downarrow$) |
|---|---|---|---|---|---|
| MS-MARCO | Fine-tuning (Ft) | 0.5585 | 0.5904 | 0.6105 | 0.2229 |
| | Ft $\to$ Whitening | 0.5771 (3.33%) $^{**}$ | 0.6304 (6.77%) $^{*}$ | 0.9178 | 0.0088 |
| | Ft $\to$ NICE | 0.5732 (2.64%) $^{*}$ | 0.6224 (5.41%) $^{*}$ | 0.6362 | 0.1905 |
| | Ft $\to$ Glow | 0.5779 (3.47%) $^{*}$ | 0.6382 (8.09%) $^{*}$ | 0.8372 | 0.0267 |
| Robust04 | Fine-tuning (Ft) | 0.3455 | 0.4022 | 0.6017 | 0.2249 |
| | Ft $\to$ Whitening | 0.3528 (2.11%) | 0.4230 (5.17%) $^{**}$ | 0.8912 | 0.0155 |
| | Ft $\to$ NICE | 0.3504 (1.40%) | 0.4119 (2.42%) $^{*}$ | 0.7685 | 0.0636 |
| | Ft $\to$ Glow | 0.3463 (0.22%) | 0.4118 (2.39%) $^{*}$ | 0.8239 | 0.0307 |
| ClueWeb09b | Fine-tuning (Ft) | 0.2900 | 0.2903 | 0.6142 | 0.2192 |
| | Ft $\to$ Whitening | 0.3033 (4.57%) $^{**}$ | 0.3032 (4.41%) $^{**}$ | 0.9152 | 0.0089 |
| | Ft $\to$ NICE | 0.3055 (5.34%) $^{**}$ | 0.3073 (5.84%) $^{**}$ | 0.7568 | 0.0741 |
| | Ft $\to$ Glow | 0.3050 (5.17%) $^{**}$ | 0.3019 (3.97%) $^{**}$ | 0.8566 | 0.0196 |

Table 1. Performance of ColBERT on three re-ranking datasets. We compare the performance of a fine-tuned model before and after enforcing isotropy on BERT representations. Precision at 20 and NDCG at 10 are used as re-ranking metrics, and $I(W)$ and $\mathrm{avgcos}(W)$ are used to measure the isotropy of the representations. Note: $^{*}\,p \leq 0.05$, $^{**}\,p \leq 0.01$ (1-tailed).

| Dataset | Method | Token-wise P@20 | Token-wise NDCG@10 | Sequence-wise P@20 | Sequence-wise NDCG@10 |
|---|---|---|---|---|---|
| MS-MARCO | Fine-tuning (Ft) | 0.3884 | 0.3304 | 0.3884 | 0.3304 |
| | Ft $\to$ Whitening | 0.4287 (10.39%) $^{**}$ | 0.3875 (17.28%) $^{**}$ | 0.4287 (10.38%) $^{*}$ | 0.4058 (22.81%) $^{**}$ |
| | Ft $\to$ NICE | 0.3864 (-0.5%) | 0.3392 (2.64%) | 0.3903 (0.50%) | 0.3373 (2.07%) |
| | Ft $\to$ Glow | 0.3945 (1.59%) $^{**}$ | 0.3648 (10.40%) $^{*}$ | 0.4329 (11.48%) $^{**}$ | 0.3979 (20.42%) $^{**}$ |
| Robust04 | Fine-tuning (Ft) | 0.3012 | 0.3495 | 0.3012 | 0.3495 |
| | Ft $\to$ Whitening | 0.3148 (4.53%) | 0.3735 (6.88%) $^{*}$ | 0.2965 (-1.56%) | 0.3470 (-0.70%) |
| | Ft $\to$ NICE | 0.3043 (1.02%) | 0.3522 (0.80%) | 0.2898 (-3.79%) | 0.3334 (-4.61%) |
| | Ft $\to$ Glow | 0.3135 (4.08%) | 0.3710 (6.16%) $^{*}$ | 0.3059 (1.57%) | 0.3540 (1.30%) |
| ClueWeb09b | Fine-tuning (Ft) | 0.2227 | 0.1962 | 0.2227 | 0.1962 |
| | Ft $\to$ Whitening | 0.2617 (17.53%) $^{**}$ | 0.2407 (22.71%) $^{**}$ | 0.2423 (8.79%) | 0.2232 (13.80%) $^{*}$ |
| | Ft $\to$ NICE | 0.2373 (6.54%) | 0.2111 (7.63%) | 0.2161 (-2.98%) | 0.1868 (-4.75%) |
| | Ft $\to$ Glow | 0.2562 (15.03%) $^{*}$ | 0.2359 (20.24%) $^{**}$ | 0.2510 (12.71%) $^{*}$ | 0.2314 (17.96%) $^{*}$ |

Table 2. Performance of RepBERT on three re-ranking datasets. We compare the performance of a fine-tuned model before and after enforcing isotropy on BERT representations. For RepBERT, we compare the performances of token-wise and sequence-wise representation transformation methods. Note: $^{*}\,p \leq 0.05$, $^{**}\,p \leq 0.01$ (1-tailed).

| Source data | Target data | Method | P@20 | NDCG@10 | $I(W)$ ($\uparrow$) | $\mathrm{avgcos}(W)$ ($\downarrow$) |
|---|---|---|---|---|---|---|
| MS-MARCO | Robust04 | Fine-tuning (Ft) | 0.2818 | 0.3434 | 0.5884 | 0.2477 |
| | | Ft $\to$ Whitening | 0.3436 (21.92%) $^{**}$ | 0.4118 (19.95%) $^{**}$ | 0.8485 | 0.0219 |
| | | Ft $\to$ NICE | 0.3465 (22.97%) $^{**}$ | 0.4123 (20.07%) $^{**}$ | 0.6771 | 0.1378 |
| | | Ft $\to$ Glow | 0.3593 (27.50%) $^{**}$ | 0.4291 (24.98%) $^{**}$ | 0.7994 | 0.0408 |
| MS-MARCO | ClueWeb09b | Fine-tuning (Ft) | 0.2967 | 0.2976 | 0.5462 | 0.3230 |
| | | Ft $\to$ Whitening | 0.3071 (3.51%) $^{*}$ | 0.3121 (4.87%) | 0.8943 | 0.0118 |
| | | Ft $\to$ NICE | 0.3090 (4.17%) $^{*}$ | 0.3133 (5.27%) | 0.6557 | 0.1649 |
| | | Ft $\to$ Glow | 0.3065 (3.31%) | 0.3093 (3.91%) | 0.8351 | 0.0275 |
| Robust04 | MS-MARCO | Fine-tuning (Ft) | 0.4533 | 0.3835 | 0.6273 | 0.2055 |
| | | Ft $\to$ Whitening | 0.4760 (4.99%) $^{**}$ | 0.4175 (8.86%) $^{**}$ | 0.8086 | 0.0365 |
| | | Ft $\to$ NICE | 0.4762 (5.04%) $^{**}$ | 0.4187 (9.19%) $^{**}$ | 0.7614 | 0.0654 |
| | | Ft $\to$ Glow | 0.4779 (5.42%) $^{**}$ | 0.4195 (9.39%) $^{**}$ | 0.7862 | 0.0479 |
| Robust04 | ClueWeb09b | Fine-tuning (Ft) | 0.2488 | 0.2225 | 0.6177 | 0.2191 |
| | | Ft $\to$ Whitening | 0.2672 (7.37%) $^{**}$ | 0.2452 (10.16%) $^{**}$ | 0.7871 | 0.0467 |
| | | Ft $\to$ NICE | 0.2722 (9.37%) $^{**}$ | 0.2480 (11.46%) $^{**}$ | 0.7454 | 0.0744 |
| | | Ft $\to$ Glow | 0.2711 (8.94%) $^{**}$ | 0.2455 (10.31%) $^{**}$ | 0.7692 | 0.0573 |
| ClueWeb09b | MS-MARCO | Fine-tuning (Ft) | 0.4696 | 0.4507 | 0.6110 | 0.2319 |
| | | Ft $\to$ Whitening | 0.4957 (5.55%) $^{**}$ | 0.4690 (4.06%) $^{*}$ | 0.9005 | 0.0107 |
| | | Ft $\to$ NICE | 0.4934 (5.07%) $^{**}$ | 0.4762 (5.66%) $^{**}$ | 0.7646 | 0.0692 |
| | | Ft $\to$ Glow | 0.4909 (4.54%) $^{**}$ | 0.4700 (4.27%) $^{*}$ | 0.8517 | 0.0212 |
| ClueWeb09b | Robust04 | Fine-tuning (Ft) | 0.3112 | 0.3590 | 0.6064 | 0.2369 |
| | | Ft $\to$ Whitening | 0.3381 (8.67%) $^{**}$ | 0.4025 (12.11%) $^{**}$ | 0.8420 | 0.0240 |
| | | Ft $\to$ NICE | 0.3369 (8.28%) $^{**}$ | 0.3978 (10.81%) | 0.7427 | 0.0799 |
| | | Ft $\to$ Glow | 0.3409 (9.55%) $^{**}$ | 0.4050 (12.83%) $^{**}$ | 0.7999 | 0.0405 |

Table 3. Out-of-distribution generalizability of ColBERT. Performance and isotropy are evaluated on the target data using the models trained with the source data. Note: $^{*}\,p \leq 0.05$, $^{**}\,p \leq 0.01$ (1-tailed).

5.1. Experimental Settings

5.1.1. Models, datasets, and ranking metrics

As explained earlier, ColBERT (Khattab and Zaharia, 2020) and RepBERT (Zhan et al., 2020) are investigated. In the case of ColBERT, we removed the linear layer just before the cosine similarity to clearly examine the effects of Normalizing Flow and whitening. As for the datasets, we examined three popular document re-ranking datasets, Robust04 (Voorhees and others, 2004), WebTrack 2009 (ClueWeb09b) (Callan et al., 2009), and MS-MARCO (Nguyen et al., 2016), as in MacAvaney et al. (2019). For Robust04, we used the document collections from TREC Disks 4 and 5 (520K documents, 7.5K triplet data samples; https://trec.nist.gov/data-disks.html). For ClueWeb09b, the document collections from ClueWeb09b (50M web pages, 4.5K triplet data samples; https://lemurproject.org/clueweb09/) of WebTrack 2009 were used. We also used the large document collections from MS-MARCO (22G documents, 372K triplet data samples; https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019). To evaluate the ranking models, we used P@20 and NDCG@10 as the performance metrics.

5.1.2. Training and optimization

Following Huston and Croft (2014), we divided each of Robust04 and ClueWeb09b into five folds: three folds for training, one for validation, and the remaining one for testing. For MS-MARCO, we divided the dataset into training, validation, and test data. Three different random seeds were used in each experiment. Using the three performance values obtained with the three random seeds, we conducted a one-tailed $t$-test under the assumption of homoscedasticity.
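A one-tailed equal-variance $t$-test on three seeds per condition has $3 + 3 - 2 = 4$ degrees of freedom. The sketch below uses only the Python standard library. The per-seed scores are made-up numbers for illustration and are not results from this paper; the critical values are the standard one-tailed values of Student's $t$ for four degrees of freedom.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(treated, baseline):
    """Student's t statistic for two samples under the equal-variance assumption."""
    n1, n2 = len(treated), len(baseline)
    pooled_var = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(baseline)) / (n1 + n2 - 2)
    return (mean(treated) - mean(baseline)) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))

# One-tailed critical values of Student's t with 4 degrees of freedom.
T_CRITICAL_DF4 = {0.05: 2.132, 0.01: 3.747}

baseline_scores = [0.40, 0.41, 0.39]   # hypothetical per-seed NDCG@10 values
treated_scores = [0.43, 0.44, 0.42]    # hypothetical per-seed NDCG@10 values

t_stat = pooled_t_statistic(treated_scores, baseline_scores)   # about 3.674
significant_05 = t_stat > T_CRITICAL_DF4[0.05]
significant_01 = t_stat > T_CRITICAL_DF4[0.01]
```

In this made-up example the improvement would be marked with one asterisk but not two. With only three seeds per condition the test has little power, so small but real improvements may not reach significance.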

For fine-tuning the BERT weights as shown in Figure 2(a), we used a learning rate of 1e-4, batch size of 16, and an Adam optimizer. We set the maximum epoch to be 30 for Robust04 and ClueWeb09b and 10 for MS-MARCO, following the advice of Zhang et al. (2020). We then selected the model with the highest validation performance among the checkpoints of all epochs.

To train Normalizing Flow as shown in Figure 2(b), we stack a Normalizing Flow network on top of the fine-tuned BERT and train it in an unsupervised manner for either ten epochs (Robust04 and ClueWeb09b) or three epochs (MS-MARCO), while keeping the fine-tuned BERT weights frozen. For the NICE network, we used a five-layer network with 1000 units in each layer. For the Glow network, we used two levels and a depth of three, following Li et al. (2020). For both NICE and Glow, we used a learning rate of 1e-4.

For whitening, we pre-compute mean and covariance as shown in Figure 2(c). Specifically, we collect training data for ten epochs (Robust04 and ClueWeb09b) or three epochs (MS-MARCO) and use the collected representation vectors for the pre-computation of mean and covariance.

For all the experiments, we tuned the hyper-parameters only lightly.

5.1.3. Implementation

Our experiments are implemented using Python 3.8, PyTorch 1.10, and huggingface-hub 0.4.0. We fine-tuned the pre-trained BERT-base-uncased model provided by the Hugging Face Transformers library. We used RTX 3090 GPUs, each of which has 25.6G memory.

5.2. Experimental Results

In this section, we examine the effect of isotropy enforcement on the BERT representations and present the experimental results.

5.2.1. Isotropic representations improve re-ranking performance of ColBERT

In Table 1, we compare the re-ranking performance of fine-tuned ColBERT before and after applying Normalizing Flow or whitening to the BERT representations. Because ColBERT computes the cosine similarity between the multiple token representations of a query and a document, we perform token-wise post-processing. The results demonstrate that the degree of isotropy is enhanced and the re-ranking performance is improved after the post-processing. For example, when we whiten the BERT representations of ColBERT fine-tuned on MS-MARCO, $I(W)$ increases from 0.6105 to 0.9178, and $\mathrm{avgcos}(W)$ drops from 0.2229 to 0.0088. At the same time, P@20 and NDCG@10 improve by 3.33% and 6.77%, respectively. Across all three datasets, whitening, NICE, and Glow all improve the re-ranking performance of the fine-tuned ColBERT. These results confirm that transforming BERT token representations to follow an isotropic distribution consistently improves the performance of ColBERT.

5.2.2. Isotropic representations improve re-ranking performance of RepBERT

Table 2 presents the results for RepBERT. We compare the performance improvement of both token-wise and sequence-wise post-processing methods. Although RepBERT computes the cosine similarity between sequence representations of queries and documents, we also examine token-wise methods in addition to the sequence-wise methods because sequence-wise transformation might not be powerful enough if the query and document sequences have sufficiently different characteristics. As for the overall improvement, the results in Table 2 show that the performance is improved for all the cases except for one when the token-wise method is used, but not so consistently when the sequence-wise method is used. The peak improvement is as large as 22.81%. Between token-wise and sequence-wise methods, it can be observed that token-wise performs better for Robust04 and ClueWeb09b and sequence-wise performs better for MS-MARCO. We provide a discussion on this issue in Section 6.3.

Figure 3. OOD performance of ColBERT. The source data is MS-MARCO and the target data is Robust04. The red line corresponds to the baseline ID performance for fine-tuning where the source and target data are both Robust04.

Among the three post-processing methods, Glow is the most robust method for RepBERT in the sense that the performance is always improved regardless of the choice of dataset or the choice of token-wise or sequence-wise method. While not as consistent, whitening almost always achieves the best performance for each dataset as long as token-wise is used for Robust04 and ClueWeb09b and sequence-wise is used for MS-MARCO. Overall, whitening improves the performance of the fine-tuned model by 6.88% $\sim$ 22.81% on NDCG@10 over the three datasets.

5.2.3. Isotropic representations make DR models robust to OOD data.

(a) Representation vectors of a pre-trained BERT model.

We show the effectiveness of isotropic representations for OOD generalizability in Table 3. The results show that the baseline OOD generalizability of a fine-tuned model is always improved by applying a post-processing method that enforces isotropy on the BERT representations. P@20 improves by 4.17% $\sim$ 27.50%, and NDCG@10 by 5.27% $\sim$ 24.98%, across the three datasets. The overall result indicates that isotropically distributed BERT representations are robust in the sense that they perform better on an OOD dataset that was unseen during training.

In particular, when ColBERT is trained with MS-MARCO as the source data and used for inference with the target data of either Robust04 or ClueWeb09b, the OOD performance with isotropy even surpasses the ID performance. This is visualized in Figure 3 for Robust04. In the figure, the ID performance without enforcing isotropy is shown as a red line (the ID performance values can be found in Table 1). With fine-tuning only, the baseline OOD performance drops by 18.44% and 14.63% on P@20 and NDCG@10, respectively, relative to the red line. After the fine-tuned OOD model undergoes post-processing (whitening, NICE, or Glow), however, it mostly performs better than the baseline ID performance shown by the red line. This result is notable because the OOD models were trained without the target data, yet they outperform the ID models that were trained with it. The conclusion remains the same even if we choose the red line to be the best ID performance with isotropy that can be found in Table 1. Although we did not include the results for RepBERT, similar results can be obtained for RepBERT as well.

6. Discussion

6.1. Handling of Outlier Dimensions

As explained in Section 2.2, the existence of outlier dimensions with extreme values has been known for the representations of BERT and other language models. For example, Luo et al. (2020) pointed out that the outlier dimensions can have a negative impact on the task performance and showed that an improvement was possible by simply clipping the outliers. Timkey and van Schijndel (2021) also observed that a small subset of outlier dimensions dominated the cosine similarity of two representation vectors, and showed that aligning the representation space with post-processing methods could improve the performance of word similarity/relatedness judgment tasks. In this study, we show that the outlier dimensions can also be observed in the field of DR, as shown in Figure 4(b). Even though we did not explicitly aim to handle such outlier dimensions, it can be seen in Figures 4(c) and 4(d) that the outlier dimensions are tempered and become hardly observable after enforcing isotropy.
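As an illustration of what an outlier dimension looks like numerically, the sketch below flags dimensions whose average magnitude is far above that of the other dimensions. The 3-sigma rule, the function name `outlier_dimensions`, and the synthetic data are assumptions made for this example; they are not the criterion used by the cited works or by our analysis.

```python
import numpy as np

def outlier_dimensions(X: np.ndarray, n_sigma: float = 3.0) -> np.ndarray:
    """Indices of dimensions whose mean absolute value exceeds the
    across-dimension average by more than n_sigma standard deviations."""
    magnitude = np.abs(X).mean(axis=0)  # one value per dimension
    threshold = magnitude.mean() + n_sigma * magnitude.std()
    return np.flatnonzero(magnitude > threshold)

rng = np.random.default_rng(0)
# Hypothetical representations: 1000 vectors with 128 dimensions.
X = rng.normal(size=(1000, 128))
# Inject one artificial outlier dimension with extreme values.
X[:, 17] += 20.0

flagged = outlier_dimensions(X)
```

A dimension flagged in this way would dominate any cosine similarity computed on the raw vectors, which is the effect described by Timkey and van Schijndel (2021).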

6.2. Normalizing Flow vs. Whitening

Whitening is a linear transformation that does not require any learning. On the other hand, Normalizing Flow is a nonlinear transformation that requires learning the network parameters. Therefore, whitening suffers from the constraint of linearity but is not exposed to the risk of sub-optimal learning. In all of our experiments, whitening enhanced isotropy further than Normalizing Flow. In terms of performance, however, the linear transformation with a higher isotropy level did not guarantee better performance. By inspecting the bold numbers in Tables 1, 2, and 3, it can be confirmed that neither of the two methods consistently outperforms the other. Perhaps an additional improvement will be possible if an advanced nonlinear transformation for isotropy is developed and applied to DR.
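The learning-free nature of whitening can be seen in a short sketch. The code below follows the common SVD-based formulation in the spirit of Su et al. (2021): it estimates a mean and covariance from a set of vectors and derives a linear map in closed form. The function names, the `eps` constant, and the synthetic data are assumptions for illustration, and the procedure in Section 4.1.2 may differ in detail.

```python
import numpy as np

def fit_whitening(X: np.ndarray, eps: float = 1e-12):
    """Estimate the mean and a linear map that whitens the rows of X."""
    mu = X.mean(axis=0, keepdims=True)
    cov = np.cov((X - mu).T)            # d x d covariance matrix
    U, S, _ = np.linalg.svd(cov)        # cov = U diag(S) U^T
    W = U @ np.diag(1.0 / np.sqrt(S + eps))
    return mu, W

def apply_whitening(X: np.ndarray, mu: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Center the vectors and apply the fitted linear map."""
    return (X - mu) @ W

rng = np.random.default_rng(0)
d = 16
# Synthetic correlated vectors with a non-zero mean (assumption).
mixing = np.eye(d) + 0.3 * rng.normal(size=(d, d))
X = rng.normal(size=(2000, d)) @ mixing + 4.0

mu, W = fit_whitening(X)
Z = apply_whitening(X, mu, W)
whitened_cov = np.cov(Z.T)              # approximately the identity matrix
```

No gradient step is involved: the map is fully determined by the first and second moments of the input vectors, in contrast to a Normalizing Flow whose parameters must be trained.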

6.3. Token-wise vs. Sequence-wise

As explained in Section 4.1, isotropy can be enforced with a token-wise transformation or with a sequence-wise transformation. For ColBERT, which is a multi-vector DR model, only the token-wise method is applicable because the score calculation in Eq. (1) requires token-level operations. For RepBERT, which is a single-vector DR model, both the token-wise method and the sequence-wise method can be applied because the final scoring calculation in Eq. (2) is based on a single query vector and a single document vector. The RepBERT results in Table 2 show that token-wise performs better for Robust04 and ClueWeb09b but sequence-wise performs better for MS-MARCO. One possible explanation is query length. While MS-MARCO tends to have long queries of complete sentences, Robust04 and ClueWeb09b tend to have short queries that consist of a small number of keywords (Jung et al., 2022). Because the representations of the queries play an important role in NRMs, it might be natural for the token-wise transformation to perform better for Robust04 and ClueWeb09b while the sequence-wise transformation performs better for MS-MARCO.
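The distinction between the two modes can be sketched as follows, using whitening as the transformation. The sketch assumes that token-wise means fitting and applying the transformation on individual token vectors before any pooling, and that sequence-wise means fitting and applying it on mean-pooled sequence vectors. All function names, the mean pooling, and the synthetic corpus are assumptions for illustration rather than the exact procedure of Section 4.1.

```python
import numpy as np

def fit_whitening(X, eps=1e-12):
    mu = X.mean(axis=0, keepdims=True)
    U, S, _ = np.linalg.svd(np.cov((X - mu).T))
    return mu, U @ np.diag(1.0 / np.sqrt(S + eps))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 8
# Hypothetical corpus: 200 sequences of 12 token vectors with a shared offset.
corpus = rng.normal(size=(200, 12, dim)) + 3.0

# Token-wise statistics come from every token vector in the corpus.
tok_mu, tok_W = fit_whitening(corpus.reshape(-1, dim))
# Sequence-wise statistics come from the pooled sequence vectors.
seq_mu, seq_W = fit_whitening(corpus.mean(axis=1))

def single_vector_token_wise(q_tokens, d_tokens):
    # Transform each token first, then pool into one vector per sequence.
    q = ((q_tokens - tok_mu) @ tok_W).mean(axis=0)
    d = ((d_tokens - tok_mu) @ tok_W).mean(axis=0)
    return cosine(q, d)

def single_vector_sequence_wise(q_tokens, d_tokens):
    # Pool first, then transform the single pooled vector.
    q = ((q_tokens.mean(axis=0, keepdims=True) - seq_mu) @ seq_W)[0]
    d = ((d_tokens.mean(axis=0, keepdims=True) - seq_mu) @ seq_W)[0]
    return cosine(q, d)

def multi_vector_token_wise(q_tokens, d_tokens):
    # Late interaction: max cosine over document tokens, summed over query tokens.
    q = (q_tokens - tok_mu) @ tok_W
    d = (d_tokens - tok_mu) @ tok_W
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return float((q @ d.T).max(axis=1).sum())

query, doc = corpus[0, :4], corpus[1]   # a 4-token query and a 12-token document
raw = cosine(query.mean(axis=0), doc.mean(axis=0))
tok = single_vector_token_wise(query, doc)
seq = single_vector_sequence_wise(query, doc)
late = multi_vector_token_wise(query, doc)
```

The multi-vector score has no sequence-wise counterpart because it never forms a pooled vector, which is why only the token-wise method applies to a model of the ColBERT type.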

6.4. Robustness and OOD Generalization

As explained in Section 2.4, robustness is an essential requirement of DR, and OOD generalization can be considered one of the most important factors of robustness. Table 3 confirms that a significant improvement in OOD generalization can be attained by enforcing isotropy. When the source data is MS-MARCO, the OOD model with isotropy performed even better than the baseline ID model. An example is visualized in Figure 3. While we have analyzed only the most fundamental scenarios, the results imply that it might be possible to further improve the robustness of DR models by enforcing isotropy while also considering more complex schemes, such as the use of multiple source datasets or more advanced methods for enforcing isotropy.

7. Conclusion

In this work, we have confirmed that the representations of BERT-based DR models are anisotropically distributed. Such anisotropy can negatively affect the DR models that adopt cosine similarity for the relevance score estimation. Normalizing Flow and whitening can be applied to improve the representation isotropy, and we have shown how to apply these post-processing methods in token-wise and sequence-wise manners for the DR models. With the proposed methods, we were able to improve the re-ranking performance of ColBERT and RepBERT for all the cases that we have studied. In addition to the commonly studied re-ranking with an in-distribution dataset, we have shown that isotropy of representations can be an essential factor for enhancing the robustness of DR models. For the out-of-distribution tasks, we were able to achieve large improvements in many cases. Based on our results, isotropy can be deemed a crucial element for studying and improving the representations of DR models.

References

S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski (2016) A latent variable model approach to pmi-based word embeddings. Transactions of the Association for Computational Linguistics 4, pp. 385–399. Cited by: §3.2.1.
J. Callan, M. Hoy, C. Yoo, and L. Zhao (2009) Clueweb09 data set. Cited by: §5.1.1.
L. Dinh, D. Krueger, and Y. Bengio (2014) Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §4.1.1.
K. Ethayarajh (2019) How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. arXiv preprint arXiv:1909.00512. Cited by: §3.2.2.
J. H. Friedman (1987) Exploratory projection pursuit. Journal of the American statistical association 82 (397), pp. 249–266. Cited by: §4.1.2.
J. Gao, D. He, X. Tan, T. Qin, L. Wang, and T. Liu (2019) Representation degeneration problem in training natural language generation models. arXiv preprint arXiv:1907.12009. Cited by: §1, §2.2.
L. Gao, Z. Dai, and J. Callan (2021) COIL: revisit exact lexical match in information retrieval with contextualized inverted list. arXiv preprint arXiv:2104.07186. Cited by: §2.1.
G. Goren, O. Kurland, M. Tennenholtz, and F. Raiber (2018) Ranking robustness under adversarial document manipulations. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 395–404. Cited by: §2.4.
R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, and S. Kumar (2020) Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pp. 3887–3896. Cited by: §2.1.
S. Huston and W. B. Croft (2014) Parameters learned in the comparison of retrieval models using term dependencies. Ir, University of Massachusetts. Cited by: §5.1.2.
J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data 7 (3), pp. 535–547. Cited by: §2.1.
E. Jung, J. Choi, and W. Rhee (2022) Semi-siamese bi-encoder neural ranking model using lightweight fine-tuning. In Proceedings of the ACM Web Conference 2022, pp. 502–511. Cited by: §6.3.
V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906. Cited by: §2.1.
O. Khattab and M. Zaharia (2020) Colbert: efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp. 39–48. Cited by: §1, §2.1, §3.1, §5.1.1.
D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. Advances in neural information processing systems 31. Cited by: §4.1.1.
I. Kobyzev, S. J. Prince, and M. A. Brubaker (2020) Normalizing flows: an introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence 43 (11), pp. 3964–3979. Cited by: §4.1.1.
O. Kovaleva, S. Kulshreshtha, A. Rogers, and A. Rumshisky (2021) BERT busters: outlier dimensions that disrupt transformers. arXiv preprint arXiv:2105.06990. Cited by: §1, §2.2.
B. Li, H. Zhou, J. He, M. Wang, Y. Yang, and L. Li (2020) On the sentence embeddings from pre-trained language models. arXiv preprint arXiv:2011.05864. Cited by: §1, §2.3, §4.1.1, §5.1.2.
Z. Luo, A. Kulmizev, and X. Mao (2020) Positional artefacts propagate through masked language model embeddings. arXiv preprint arXiv:2011.04393. Cited by: §1, §2.2, §6.1.
X. Ma, P. Xu, Z. Wang, R. Nallapati, and B. Xiang (2019) Domain adaptation with bert-based domain classification and data selection. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pp. 76–83. Cited by: §2.4.
S. MacAvaney, A. Yates, A. Cohan, and N. Goharian (2019) CEDR: contextualized embeddings for document ranking. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1101–1104. Cited by: §5.1.1.
J. Mu, S. Bhat, and P. Viswanath (2017) All-but-the-top: simple and effective postprocessing for word representations. arXiv preprint arXiv:1702.01417. Cited by: §3.2.1.
S. Mussmann and S. Ermon (2016) Learning and inference via maximum inner product search. In International Conference on Machine Learning, pp. 2587–2596. Cited by: §2.1.
T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS marco: a human generated machine reading comprehension dataset. In CoCo@ NIPS, Cited by: §5.1.1.
Y. Qu, Y. Ding, J. Liu, K. Liu, R. Ren, W. X. Zhao, D. Dong, H. Wu, and H. Wang (2020) RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191. Cited by: §2.1.
P. Ram and A. G. Gray (2012) Maximum inner-product search using cone trees. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 931–939. Cited by: §2.1.
N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084. Cited by: §2.3.
R. Ren, Y. Qu, J. Liu, W. X. Zhao, Q. She, H. Wu, H. Wang, and J. Wen (2021) Rocketqav2: a joint training method for dense passage retrieval and passage re-ranking. arXiv preprint arXiv:2110.07367. Cited by: §2.1.
K. Santhanam, O. Khattab, J. Saad-Falcon, C. Potts, and M. Zaharia (2021) Colbertv2: effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488. Cited by: §2.1.
M. Shafique, M. Naseer, T. Theocharides, C. Kyrkou, O. Mutlu, L. Orosa, and J. Choi (2020) Robust machine learning systems: challenges, current trends, perspectives, and the road ahead. IEEE Design & Test 37 (2), pp. 30–57. Cited by: §2.4.
J. Su, J. Cao, W. Liu, and Y. Ou (2021) Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316. Cited by: §1, §2.3, §4.1.2.
E. G. Tabak and C. V. Turner (2013) A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66 (2), pp. 145–164. Cited by: §4.1.1.
E. G. Tabak and E. Vanden-Eijnden (2010) Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences 8 (1), pp. 217–233. Cited by: §4.1.1.
N. Thakur, N. Reimers, and J. Lin (2022) Domain adaptation for memory-efficient dense retrieval. arXiv preprint arXiv:2205.11498. Cited by: §2.4.
W. Timkey and M. van Schijndel (2021) All bark and no bite: rogue dimensions in transformer language models obscure representational quality. arXiv preprint arXiv:2109.04404. Cited by: §1, §2.2, §6.1.
E. M. Voorhees et al. (2004) Overview of trec 2004.. In Trec, Cited by: §5.1.1.
L. Wang, P. N. Bennett, and K. Collins-Thompson (2012) Robust ranking models via risk-sensitive optimization. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pp. 761–770. Cited by: §2.4.
L. Wang, J. Huang, K. Huang, Z. Hu, G. Wang, and Q. Gu (2019) Improving neural language generation with spectrum control. In International Conference on Learning Representations, Cited by: §1, §2.2, §3.2.1.
C. Wu, R. Zhang, J. Guo, Y. Fan, and X. Cheng (2021) Are neural ranking models robust?. arXiv preprint arXiv:2108.05018. Cited by: §1, §2.4, §4.2.
L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. Bennett, J. Ahmed, and A. Overwijk (2020) Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808. Cited by: §2.1.
S. Yu, J. Song, H. Kim, S. Lee, W. Ryu, and S. Yoon (2022) Rare tokens degenerate all tokens: improving neural text generation via adaptive gradient gating for rare token embeddings. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 29–45. Cited by: §1, §2.2, §3.2.1.
J. Zhan, J. Mao, Y. Liu, M. Zhang, and S. Ma (2020) RepBERT: contextualized text embeddings for first-stage retrieval. arXiv preprint arXiv:2006.15498. Cited by: §1, §2.1, §3.1, §5.1.1.
T. Zhang, F. Wu, A. Katiyar, K. Q. Weinberger, and Y. Artzi (2020) Revisiting few-sample bert fine-tuning. arXiv preprint arXiv:2006.05987. Cited by: §5.1.2.
Y. Zhang, T. Liu, M. Long, and M. Jordan (2019) Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning, pp. 7404–7413. Cited by: §2.4.
