Addressing Token Uniformity in Transformers via Singular Value Transformation

Hanqi Yan Department of Computer Science, University of Warwick, United Kingdom Lin Gui Department of Computer Science, University of Warwick, United Kingdom Wenjie Li Department of Computing, The Hong Kong Polytechnic University, China Yulan He

Abstract

Token uniformity is commonly observed in transformer-based models, in which different tokens share a large proportion of similar information after going through multiple stacked self-attention layers in a transformer. In this paper, we propose to use the distribution of singular values of outputs of each transformer layer to characterise the phenomenon of token uniformity and empirically illustrate that a less skewed singular value distribution can alleviate the ‘token uniformity’ problem. Based on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that apart from alleviating token uniformity, the transformation function should preserve the local neighbourhood structure in the original embedding space. Our proposed singular value transformation function is applied to a range of transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT, and improved performance is observed in semantic textual similarity evaluation and a range of GLUE tasks. Our source code is available at https://github.com/hanqi-qi/tokenUni.git.

1 Introduction

In Natural Language Processing (NLP), approaches built on the transformer architecture have achieved state-of-the-art performance in many tasks [DBLP:conf/uai/VeitchSB20]. However, recent work identified an anisotropy problem in language representations generated by transformer-based deep models [DBLP:conf/emnlp/Ethayarajh19, DBLP:conf/iclr/GaoHTQWL19, DBLP:conf/emnlp/LiZHWYL20], i.e., the learned embeddings occupy a narrow cone in the representation space. Such an anisotropic shape is very different from what would be expected in an expressive embedding space [DBLP:journals/tacl/AroraLLMR16, DBLP:conf/iclr/MuV18]. This problem is called token uniformity or information diffusion, i.e., different tokens share a large proportion of similar information after going through multiple stacked self-attention layers in a transformer. pmlr-v119-goyal20a showed that using different transformer-encoded tokens in an input sample as a classification unit can achieve almost the same result.

Recently, DBLP:journals/corr/abs-2103-03404 found that pure self-attention networks, i.e., transformers without skip-connections and MLPs, have their outputs converging to a rank one matrix, and such rank deficiency can lead to token uniformity. They therefore concluded that skip-connection and MLP help alleviate the token uniformity problem. However, in our experiments, we still observe the token uniformity problem in the full transformer model with self-attention layers, skip-connections and MLPs, even when its output hidden state matrices are full-rank.

In this paper, we instead investigate the token uniformity problem by exploring the distribution of singular values of the transformer-encoded hidden states of input samples. Our analysis reveals that the learned embedding space is a high-dimensional cone-like hypersphere which is bounded by the singular values. Also, a skewed probability distribution of singular values is indicative of token uniformity (see §3.1). Therefore, making the distribution less skewed towards small singular values can help alleviate the token uniformity issue. Unlike existing methods [DBLP:conf/iclr/GaoHTQWL19, DBLP:conf/iclr/Wang0HHWG20] that implicitly or explicitly guide the spectra training of the output embedding matrix by adding a regularisation term to control the singular value decay, we propose a novel approach to address token uniformity by smoothing the singular value distribution (see §4.2). In order to verify the effectiveness of our proposed singular value transformation function in transformer-based structures, we apply it to four commonly used large-scale pretrained language models (PTLMs). In particular, the singular value distribution of the final layer output from a PTLM is modified using our proposed transformation function. Then, the transformed singular values are used to reconstruct the hidden states in the last layer of the PTLM, which are subsequently used for prediction in downstream tasks. Our extensive experiments on a variety of NLP tasks including semantic textual similarity evaluation and a range of General Language Understanding Evaluation (GLUE) tasks [DBLP:conf/iclr/WangSMHLB19] across thirteen datasets show that our proposed transformation function can effectively reduce the skewness of singular value distributions in qualitative analysis and achieve noticeable performance gains.

Our contribution can be summarised as follows:

We have presented both geometric interpretation and empirical study of the token uniformity problem. Based on our observations, we have designed a set of desirable properties of singular value distributions and proposed a singular value transformation function to alleviate the token uniformity issue.
We have proposed a range of methods to evaluate the transformed features in terms of uniformity and the preservation of the local neighbourhood structure.
Our proposed transformation function has been applied to four widely-used PTLMs and evaluated on both unsupervised and supervised tasks. The results demonstrate the effectiveness of our proposed method on addressing the token uniformity problem while preserving the local neighbourhood structure in the original embedding space.

2 Related Work

Transformer-based masked language models, such as BERT [devlin2018bert], ALBERT [lan2019albert], RoBERTa [liu2019roberta] and DistilBERT [sanh2019distilbert], have achieved significant success in NLP. However, token uniformity, i.e., different tokens sharing similar representations, is commonly observed with increasing network depth. Many studies [DBLP:conf/icml/GaneaGBS19, DBLP:conf/nips/YangLSL19] claimed that token uniformity is caused by rank collapse of the layer-wise outputs because the transformer architecture learns the token representation based on the normalised weighted sum of the context representations.

Another line of work, which observed token uniformity in empirical studies, argued that desirable word representations should be isotropic and focused on studying the distribution of the learned embeddings [DBLP:conf/iclr/MuV18]. DBLP:conf/iclr/GaoHTQWL19 and DBLP:conf/naacl/BisPL21 defined the problem as ‘representation degeneration’ and gave a theoretical analysis, which asserts that this phenomenon is caused by frequencies of rare words. DBLP:conf/iclr/Wang0HHWG20 proposed to add an exponential decay term to the training objective so as to control the singular value distribution. All the aforementioned work focused on token-level features and tasks. More recent work argued that sentence-level features can also be anisotropic due to the anisotropy in word features. Contrastive learning can also alleviate the anisotropy problem both theoretically and empirically [DBLP:conf/iclr/CarlssonGGHS21, DBLP:conf/emnlp/GaoYC21, DBLP:conf/uai/GuY21]. DBLP:conf/emnlp/LiZHWYL20 proposed BERT-flow to transform the representations learned by PTLMs into a standard Gaussian distribution in order to make the token/sentence representations isotropic. Other researchers turned to the whitening-based post-processing strategy to normalise sentence embeddings and derive an almost perfectly isotropic representation distribution [DBLP:journals/corr/abs-2103-15316, DBLP:journals/corr/abs-2104-01767].

We argue that, on the one hand, addressing rank collapse does not necessarily solve the token uniformity problem, as it is still observed even with network components such as skip-connections which can guarantee a full-rank feature space [DBLP:journals/corr/abs-2103-03404]. On the other hand, while whitening methods can effectively solve the token uniformity problem, they fail to preserve the local neighbourhood structure of the original embedding space, which is important for downstream tasks. We propose a novel singular value transformation function which can alleviate token uniformity and at the same time preserve the local neighbourhood structure.

3 Singular Value Distribution of Transformer Block Outputs

In a typical transformer block $ℓ$ , assuming the input token is $x^{l - 1}$ , the information propagation process is given by:

	$v^{l} = Self-Attention (x^{l - 1}), Φ (v^{l}) = φ (W^{l} v^{l} + b^{l}),$
	$x^{l} = LayerNorm (Φ (v^{l}) + x^{l - 1})$		(1)

where $φ$ is an element-wise nonlinear function applied to a feed-forward layer, whose weight matrix, $W^{l} \in R^{n_{l} \times n_{l - 1}}$ , transforms the feature dimension from $n_{l - 1}$ to $n_{l}$ , and $Self-Attention (x^{l - 1})$ returns the weighted value vector of all input representations, where the weights are derived by multiplying the query vector of the current input $x^{l - 1}$ with the key vectors from other inputs. Between every two transformer blocks, there is a skip-connection and a layer normalisation. The former mechanism bypasses the transformer block $ℓ$ and adds the input $x^{l - 1}$ directly to the output $Φ (v^{l})$ of this block, as in Equation (1), while the latter normalises the input across the feature dimension.

Taking BERT as an example, we assume that the input of the model is $X = x_{1} \oplus x_{2} \oplus \dots \oplus x_{m}, X \in R^{n \times m}$ , where $x_{i} \in R^{n}$ , $m$ is the number of tokens in an input sentence (we do not distinguish special tokens such as $[CLS]$ ), and $\oplus$ is the concatenation operator. The output of a transformer block $ℓ$ is denoted as $X^{l} \in R^{n_{l} \times m}$ , where $n_{l}$ is the dimension of the output feature in layer $ℓ$ . Without loss of generality, we assume $n_{l} > m$ for all layers since the embedding size for tokens is larger than the maximum length of sentences in most BERT models.

Existing work mainly focused on the rank of features learned by a transformer-based language model. For example, it was stated in [DBLP:journals/corr/abs-2103-03404] that with the growth of depth in a pure transformer model, the rank of the output representation matrix will converge to 1 exponentially. However, in practice, a position embedding matrix, which is usually full rank, is added to the output representation of each layer. In addition, skip connections are used. Therefore, rank collapse rarely occurs. It can be observed from the empirical cumulative distribution function of singular values from different layers of BERT, derived from real-world data and shown in Figure 2, that there is no zero singular value.

In this section, we instead study the token uniformity problem through the singular value density of the representation matrix $X$ for a deep network with $ℓ$ transformer blocks rather than the rank of $X$ . Since the distribution of the singular values of $X^{l}$ determines the extent to which the input signals become distorted or stretched as they propagate through the network, we aim to understand the entire singular value spectrum of $X^{l}$ in transformers. In particular, we want to study the degree of skewness of the singular value distribution, because highly skewed distributions indicate strong anisotropy in the embedded feature space, which is the root cause of token uniformity.

3.1 Singular Value Vanishing in Transformer

In this subsection, we give a geometric interpretation of the problem of vanishing singular values in transformers. We assume that $X^{l} \in R^{n_{l} \times m}$ is a full rank matrix. We can perform SVD on $(X^{l})^{⊺} = U Σ V$ , where $U$ and $V$ are orthogonal matrices and $Σ$ is the diagonal singular value matrix. Without loss of generality, we sort the singular values in descending order, $λ_{1} \geq λ_{2} \geq \dots \geq λ_{m} \geq 0$ , where $λ_{k}, k \in \{1, \dots, m\}$ , is a diagonal element of $Σ$ . We can choose a positive value $C$ such that $λ_{1} \geq \dots \geq λ_{k} \geq C \geq λ_{k + 1} \geq \dots \geq 0$ , which defines two subspaces, denoted as $S_{[1, k]}^{l}$ and $S_{[k + 1, m]}^{l}$ , respectively. For any token embedding $x \in X^{l}$ , we can find a point $x^{'}$ in the subspace $S_{[1, k]}^{l}$ such that the distance between them is no larger than $C$ . That is, we are able to establish the following bound (the proof is shown in Appendix A):

Theorem: $\forall x \in X^{l}$ , $\exists x^{'} \in S_{[1, k]}^{l}$ , where the subspace $S_{[1, k]}^{l}$ is defined based on $λ_{k} \geq C \geq λ_{k + 1}$ , then $∥ x - x^{'} ∥_{2} \leq C$ .

According to this result, the embedding space is bounded by two components, the largest singular value in the subspace $S_{[1, k]}^{l}$ , and the upper bound $C$ of the small singular values, as the radius to span the $k$ -dimensional $S_{[1, k]}^{l}$ subspace into the $m$ -dimensional space. Furthermore, the weights of the $m$ components in the $ℓ$ -th layer are constrained by self-attention: $\sum_{i = 1}^{m} α_{i}^{l} = 1$ , which indicates that the embedding space is a spherical cone-like $k$ -dimensional hypersphere. This phenomenon has been observed in many studies [DBLP:conf/iclr/GaoHTQWL19, DBLP:conf/emnlp/0004GXMYS20, DBLP:conf/iclr/Wang0HHWG20, DBLP:conf/emnlp/LiZHWYL20].
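The bound in the theorem can be checked numerically. The following is a minimal sketch, assuming NumPy and a random matrix standing in for real hidden states; the helper name `residual_norms` and the choice of $C$ as the median singular value are illustrative assumptions, not part of the original method. Each token embedding is projected onto the directions whose singular values are at least $C$, and the distance to that projection is compared with $C$.

```python
import numpy as np

def residual_norms(X, C):
    """X: (n, m) matrix whose columns are token embeddings.

    Returns the distance of every token to its projection onto the
    subspace spanned by the singular directions with value >= C,
    together with the number k of such directions.
    """
    # SVD of X^T, so that each row corresponds to one token.
    U, s, Vt = np.linalg.svd(X.T, full_matrices=False)  # s is descending
    k = int(np.sum(s >= C))
    # Keep only the top-k singular directions.
    X_proj = (U[:, :k] * s[:k]) @ Vt[:k]
    return np.linalg.norm(X.T - X_proj, axis=1), k

# Toy data (assumption): 8 tokens with 32-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
C = float(np.median(np.linalg.svd(X, compute_uv=False)))
res, k = residual_norms(X, C)
print(k, res.max() <= C)
```

The check relies on the rows of the orthogonal matrix $U$ having unit norm, so the discarded part of each token has norm at most the largest discarded singular value, which is below $C$.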

We assume the Probability Density Function (PDF) of the distribution of singular values is $f_{p} (λ)$ , where $λ$ denotes a singular value in the learned embedding space. We can then obtain the value of $k$ based on the Cumulative Distribution Function (CDF) of singular values larger than $C$ and the number of input tokens $m$ : $k = \lceil \int_{C}^{\infty} m \cdot f_{p} (λ) d λ \rceil$ .

Therefore, the shape of the embedding space is now decided by two parameters: $C$ , the radius of the hypercone, and $m - k$ , the number of almost vanished dimensions when $C$ is small. However, due to the complex network operations in transformer blocks, it is difficult to derive the exact form of $f_{p} (λ)$ . Hence, we resort to an empirical study to understand the singular value distribution, which appears to be an exponential long-tail distribution, and use the skewness to measure the risk of dimension vanishing in the rest of this paper.
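As an empirical counterpart, $k$ can be obtained by counting the observed singular values that are at least $C$, and the skewness can be estimated from the same sample. The sketch below assumes NumPy and uses the standard third standardised moment as the skewness estimator, which is an assumption since the exact estimator is not specified here; the singular values are made-up toy numbers.

```python
import numpy as np

def count_large(singular_values, C):
    """Empirical k: number of singular values that are at least C."""
    return int(np.sum(np.asarray(singular_values) >= C))

def skewness(values):
    """Third standardised moment (assumed estimator)."""
    v = np.asarray(values, dtype=float)
    centred = v - v.mean()
    return float(np.mean(centred ** 3) / np.mean(centred ** 2) ** 1.5)

# Toy long-tailed spectrum (assumption, not measured data).
s = np.array([10.0, 1.0, 0.5, 0.3, 0.2, 0.1])
k = count_large(s, C=0.4)      # 3 values are >= 0.4
vanished = len(s) - k          # m - k almost vanished dimensions
skew = skewness(s)             # positive for a long right tail
```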

3.2 Empirical Study of Token Uniformity in BERT

Existing studies have observed the token uniformity issue in PTLMs and skewed singular value distributions of outputs from the intermediate network layers [pmlr-v119-goyal20a]. However, few of them have explored the impact of different shapes of singular value distributions on downstream task performance. Here, taking BERT as an example, we empirically illustrate the singular value distribution of outputs from different intermediate transformer blocks on the Corpus of Linguistic Acceptability (CoLA) dataset. The empirical CDF of singular values of the hidden states (i.e., intermediate token representations) from BERT in layers 2, 4, 6, 8, 10 and 12 is shown in Figure 1. It clearly reveals that the CDF is steeper when closer to the origin, which indicates that the probability of a singular value $λ$ being less than a small $x$ is high ( $F (x) = \Pr (λ < x)$ ).

With the increase of network depth, the CDF curve tends to be steeper, indicating that the shape of the spherical cone-like embedding space tends to be long and narrow, leading to token uniformity. In the last layer for the prediction (i.e., Layer 12), the representation is projected to an embedding space guided by supervised label information, therefore showing a lower degree of token uniformity. Nevertheless, simply leveraging the supervision from labels is not enough to address the problem of vanished dimensions during deep network training.

Figure 1: The empirical CDF of singular values from different layers of BERT (on the GLUE-CoLA dataset), $x$ -axis: normalised singular values; $y$ -axis: CDF of singular values. A flatter curve indicates a more balanced distribution of singular values. For a given $x$ , a larger $F (x)$ means that a higher percentage of singular values are less than $x$ . Hence, the top curve in this figure indicates more vanished dimensions in the embedding space.

We calculate the average cosine similarity between every token pair, and $[CLS]$ tokens from different BERT layers as a proxy measure of token uniformity. Figure 2 shows the skewness of singular values and token uniformity measurement increase as the network layer goes deeper. We also observe the gradual vanishing of smaller singular values as the median of the singular value decreases drastically (from 0.12 to 0.0397). Our results empirically show that the degree of skewness of singular value distributions is indicative of the degree of token uniformity.
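The proxy measure described above can be sketched as follows. This is a minimal illustration assuming NumPy and a hidden-state matrix with one row per token; the function name `token_uniformity` is hypothetical, and averaging over distinct token pairs only is an assumption about how the pairs are enumerated.

```python
import numpy as np

def token_uniformity(H):
    """Mean cosine similarity over all distinct token pairs.

    H: (m, n) array with one row per token and m >= 2.
    """
    H = np.asarray(H, dtype=float)
    unit = H / np.linalg.norm(H, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Upper triangle without the diagonal: each pair counted once.
    rows, cols = np.triu_indices(H.shape[0], k=1)
    return float(sim[rows, cols].mean())

# Identical tokens are maximally uniform, orthogonal tokens are not.
uniform = token_uniformity(np.ones((3, 4)))
spread = token_uniformity(np.eye(3))
```

A value close to 1 means the tokens point in nearly the same direction, which is the situation described as token uniformity.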

Figure 2: Singular value distribution of the outputs from BERT layers 0, 7 and 12 (from top to bottom) in the GLUE-MRPC dataset. The third moment (skewness), token uniformity and $[CLS]$ uniformity values increase as the BERT layer goes deeper, while the median of singular values decreases drastically, close to vanishing.

4 Transformation Function

Having empirically illustrated the changes of the singular value distributions of the transformer layer outputs and the measures of token uniformity across the transformer blocks, we now provide insights into designing a desirable singular value transformation function.

4.1 Motivation

As in the geometric interpretation presented in Section 3, the highly skewed singular value distribution in the embedded feature space would mean that the axis of the ellipsoid is significantly longer than the corresponding axis of the sphere, leading to the token uniformity problem. A variety of techniques have been developed to alleviate this problem, and the most popular ones are a series of normalisation methods that can be explained in a unified framework of constraining the contribution of every feature onto a sphere [DBLP:journals/corr/IoffeS15, wu2018group, ba2016layer]. A notable example is layer normalisation, an essential module in the Transformer architecture, which scales the variance of an individual activation across the layers to one. One common property of these normalisation methods is that they preserve the trace of the covariance matrix (i.e., the first moment of its singular value distribution), but they do not control higher moments of the distribution, such as the skewness. A consequence is that there may still be a large imbalance in singular values. Here, we propose a transformation function that can adjust the skewness of singular value distributions by modifying small singular values to avoid dimension vanishing.

4.2 Properties of Desirable Singular Value Transformation

We want to alleviate the token uniformity problem in PTLMs by adjusting the singular value distributions of outputs of transformer layers (see Section 3.2). SVD is computationally expensive: although approximation methods exist which can speed up the computation of SVD, the time cost is still 1.5 times higher than that of the original transformer-based language models. In addition, a common practice is to fine-tune a PTLM on downstream tasks. Therefore, rather than applying the transformation in each transformer block, we propose to apply it only in the last transformer block to modify the final output token distribution. (We have also applied the transformation function to different layers of transformers, but have not observed significant improvements.)

On the one hand, we do not want the singular value distribution to be too skewed towards very few large singular values. On the other hand, we do not want it to be too flat as we want to keep the relative difference between singular values learned from powerful pre-trained models. To this end, we propose the following three desirable properties of a singular value transformation function $f (λ)$ :

1) $f^{'} (λ) \geq 0$

The large PTLMs have achieved promising performance in a broad spectrum of NLP tasks, showing their capabilities in learning good feature representations. In order to preserve the original data manifold that is mainly defined by the feature singular values, we would like to keep the original order of the singular values. As the input to $f$ is a sorted sequence of singular values, $f$ should be monotonically increasing:

	$f (λ_{i}) > f (λ_{j}), \quad \text{if } λ_{i} > λ_{j}, \quad i, j \in [0, n_{l} - 1].$

2) $f^{''} (λ) \leq 0$

To make the transformed singular value distribution more balanced while keeping the largest singular value unchanged, intuitively, we should increase the smaller singular values. The increment $Δ_{i}$ for each singular value is defined as $f (λ_{i}) - λ_{i}$ and $Δ_{0} = 0$ (i.e., the largest singular value is kept unchanged). To reduce the gap between larger and smaller singular values, we propose a simple solution in which smaller singular values should have an equal or larger increment than larger ones while preserving the original order of the singular values. That is:

	$Δ_{i} \geq Δ_{j}, \quad \text{if } λ_{i} < λ_{j},$

i.e., $\frac{f^{'} (λ_{i}) - f^{'} (λ_{j})}{λ_{i} - λ_{j}} \leq 0$ , so the first-order derivative of $f$ should be monotonically non-increasing, which is equivalent to $f^{''} (λ) \leq 0$ .

3) $f (λ_{m a x}) \approx λ_{m a x}$

To guarantee a bounded embedding space, we require the transformation to keep the largest singular value unchanged as much as possible. Existing studies have also shown that the largest singular value of the data covariance matrix is significant to model training [DBLP:conf/nips/PenningtonW17].

4.3 SoftDecay Function

We develop a non-linear and trainable function built on the soft-exponential function [godfrey2015continuum].

	$f (x) = - \frac{\ln (1 - α (x + α))}{α}$		(2)

Specifically, when $α < 0$ , these curves are monotonically increasing with a decreasing slope, which is consistent with properties (1) and (2). For property (3), it can be proved that for any $λ$ , $λ \geq f (λ)$ when $α < 0$ . Hence, we have $λ_{m a x} \geq f (λ_{m a x}) \geq λ_{2}$ , where $λ_{2}$ is the second largest singular value.
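For reference, the first two properties and the inequality $λ \geq f (λ)$ can be read off from the derivatives of the function in Equation (2). The short derivation below is a sketch that holds for $α < 0$ on the domain where $1 - α (x + α) > 0$; this domain restriction is an assumption made explicit here.

```latex
\begin{aligned}
f(x)   &= -\frac{\ln\bigl(1 - \alpha(x + \alpha)\bigr)}{\alpha}, \\
f'(x)  &= \frac{1}{1 - \alpha(x + \alpha)} > 0, \\
f''(x) &= \frac{\alpha}{\bigl(1 - \alpha(x + \alpha)\bigr)^{2}} < 0
          \quad (\alpha < 0), \\
f(x)   &\le x + \alpha < x
          \quad \text{since } \ln(1 + u) \le u
          \text{ with } u = -\alpha(x + \alpha).
\end{aligned}
```

One consequence of the last line is that $f (x)$ is positive only when $x > |α|$ , so very small singular values relative to $|α|$ need care in practice.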

Combining the desirable properties of singular value distributions with the non-linear transformation function, we describe our proposed transformation in Algorithm 1:

Input: Original representations $X \in R^{n_{l} \times m}$ , where $m$ is the number of tokens and $n_{l}$ is the embedding dimension.

1: SVD decomposition: $U, Σ, V^{⊺} = SVD (X)$

2: Apply transformation: $\hat{Σ} = SoftDecay (Σ)$

3: Rescaling factor: $K = \max (λ) / \max (\hat{λ})$

4: Compute transformed singular values: $\tilde{λ} = K \hat{λ}$

5: Compute transformed representation: $\tilde{X} = U \tilde{Σ} V^{⊺}$

Output: Transformed representation $\tilde{X}$

Algorithm 1 SoftDecay transformation
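The five steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function names, the fixed value $α = -0.5$ and the toy matrix with prescribed singular values are all assumptions, and $α$ is treated as a constant here instead of a trainable parameter. The sketch also assumes that every singular value exceeds $|α|$ , so that the transformed values stay positive.

```python
import numpy as np

def soft_decay(x, alpha):
    """Equation (2); assumes alpha < 0 and 1 - alpha * (x + alpha) > 0."""
    return -np.log(1.0 - alpha * (x + alpha)) / alpha

def soft_decay_transform(X, alpha=-0.5):
    """Apply the five steps of Algorithm 1 to an (n, m) matrix X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # 1: SVD
    s_hat = soft_decay(s, alpha)                       # 2: transform
    K = s.max() / s_hat.max()                          # 3: rescaling factor
    s_tilde = K * s_hat                                # 4: rescaled values
    X_tilde = (U * s_tilde) @ Vt                       # 5: reconstruct
    return X_tilde, s, s_tilde

# Toy matrix (assumption) with singular values 10, 5, 2, 1.5 and 1.
rng = np.random.default_rng(0)
n, m = 16, 5
Q1, _ = np.linalg.qr(rng.normal(size=(n, m)))  # orthonormal columns
Q2, _ = np.linalg.qr(rng.normal(size=(m, m)))  # orthogonal matrix
X = (Q1 * np.array([10.0, 5.0, 2.0, 1.5, 1.0])) @ Q2.T

X_tilde, s, s_tilde = soft_decay_transform(X, alpha=-0.5)
print(np.round(s, 3))
print(np.round(s_tilde, 3))
```

On this toy spectrum the largest singular value is unchanged after rescaling, the order is preserved, and the smaller values are raised, which is the qualitative behaviour the three properties ask for.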

4.4 Transformed Feature Evaluation

Existing research in text representation learning showed that the features should be roughly isotropic (i.e., directionally uniform) [DBLP:conf/iclr/MuV18, DBLP:conf/emnlp/0004GXMYS20, DBLP:conf/iclr/Wang0HHWG20, DBLP:conf/emnlp/LiZHWYL20, DBLP:journals/corr/abs-2103-15316] to prevent the feature space squeezing into a narrow cone and preserve as much information of the data as possible. We argue that the evaluation of transformed features should consider both the uniformity and the preservation of local neighbourhood structure in the original embedding space.

Uniformity.

We propose to measure the distribution uniformity in three different ways. First, we examine the feature similarity (TokenUni):

	$TokenUni (x_{i}, x_{j}) = \cos (f (x_{i}), f (x_{j}))$		(3)

where $f (\cdot)$ transforms an input feature by the SoftDecay.

Second, we use the Radial Basis Function (RBF) kernel, $RBF_{dis}$ , to measure feature similarity, as it has shown great potential in evaluating representation uniformity [DBLP:conf/icml/0001I20].

	$RBF_{dis} (x_{i}, x_{j}) = \exp (- \frac{∥ f (x_{i}) - f (x_{j}) ∥^{2}}{t}),$		(4)

where $t$ is a constant. We use the logarithmic value of $RBF_{dis}$ in experiments.

Finally, as a few predominant singular values will result in an anisotropic embedding space, we can check the difference of variances in different directions or singular values and use the Explained Variance ( $EV_{k}$ ) [DBLP:conf/aaai/ZhouL021]:

	$EV_{k} (f (X)) = \frac{\sum_{i = 1}^{k} λ_{i}^{2}}{\sum_{j = 1}^{m} λ_{j}^{2}},$		(5)

where $λ_{i}$ is the $i$ -th singular value sorted in descending order and $m$ is the number of all the singular values. In the extreme case when $EV_{1}$ approaches 1, most of the variation concentrates on one direction, and the feature space squeezes into a narrow cone.
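The three uniformity measures in Equations (3) to (5) can be written compactly. The sketch below assumes NumPy and that the inputs are features that have already been transformed by $f$ ; the function names and the default $t = 2.0$ are illustrative assumptions.

```python
import numpy as np

def token_uni(a, b):
    """Equation (3): cosine similarity of two transformed features."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def log_rbf_dis(a, b, t=2.0):
    """Logarithm of Equation (4): -||a - b||^2 / t (t is an assumed constant)."""
    return float(-np.sum((a - b) ** 2) / t)

def explained_variance(features, k=1):
    """Equation (5): share of squared singular values in the top k."""
    s = np.linalg.svd(features, compute_uv=False)  # descending order
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

a = np.array([1.0, 2.0, 3.0])
same_direction = token_uni(a, 2.0 * a)            # collinear vectors
isotropic = explained_variance(np.eye(4), k=1)    # equal singular values
collapsed = explained_variance(np.outer(a, a))    # rank-one features
```

With equal singular values each direction explains the same share, while a rank-one feature matrix puts all the variance in a single direction, which corresponds to the narrow-cone case.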

Preservation of Local Neighbourhood.

Ideally, the transformed embeddings should preserve the local neighbourhood structure in the original embedding space. Inspired by Locally Linear Embedding [roweis2000nonlinear], we propose the Local Structure Discrepancy Score (LSDS) to measure the degree of preserving the original local neighbourhood. First, for a data point $x_{i}$ in the original embedding space, we choose its $k$ -nearest-neighbours, then define the weight connecting $x_{i}$ and its neighbour $x_{j}$ as the distance measured by the RBF kernel, $w_{i j} = \exp (- ∥ x_{i} - x_{j} ∥^{2} / t)$ . In the transformed space, the new feature $f (x_{i})$ is supposed to be close to the linear combination of its original neighbours in the transformed space, weighted by the distance computed in the original space:

	$LSDS (x_{i}) = ∥ f (x_{i}) - \sum_{j \in N (x_{i})} w_{i j} f (x_{j}) ∥^{2},$		(6)

where $N (x_{i})$ denotes the $k$ -nearest-neighbours of $x_{i}$ .
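A direct reading of Equation (6) is sketched below. It assumes NumPy, points stored as rows, Euclidean nearest neighbours found in the original space, and the weights used exactly as defined above without any normalisation; the function name `lsds` and the default values of `k` and `t` are illustrative assumptions.

```python
import numpy as np

def lsds(X_orig, X_trans, k=2, t=1.0):
    """Local Structure Discrepancy Score for every point.

    X_orig, X_trans: (N, d) arrays of original and transformed features.
    Neighbours and weights come from the original space, while the
    reconstruction error is measured in the transformed space.
    """
    X_orig = np.asarray(X_orig, dtype=float)
    X_trans = np.asarray(X_trans, dtype=float)
    diff = X_orig[:, None, :] - X_orig[None, :, :]
    sq = np.sum(diff ** 2, axis=-1)          # (N, N) squared distances
    np.fill_diagonal(sq, np.inf)             # a point is not its own neighbour
    nbrs = np.argsort(sq, axis=1)[:, :k]     # indices of k nearest neighbours
    w = np.exp(-np.take_along_axis(sq, nbrs, axis=1) / t)  # RBF weights
    recon = np.einsum("ik,ikd->id", w, X_trans[nbrs])
    return np.sum((X_trans - recon) ** 2, axis=1)

# Two points at distance 1 with an identity transformation.
pts = np.array([[0.0], [1.0]])
scores = lsds(pts, pts, k=1, t=1.0)
```

In this two-point example the weight is $\exp (-1)$ , so the scores are $\exp (-2)$ for the point at 0 and 1 for the point at 1; lower scores indicate that the local neighbourhood is better preserved.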

5 Experiments

We implement our proposed transformation function on four transformer-based Pre-Trained Language Models (PTLMs), BERT [devlin2018bert], ALBERT [lan2019albert], RoBERTa [liu2019roberta] and DistilBERT [sanh2019distilbert], and evaluate on semantic textual similarity (STS) datasets and General Language Understanding Evaluation (GLUE) tasks [DBLP:conf/iclr/WangSMHLB19], including unsupervised and supervised comparison. (Model training details and additional results can be found in the supplementary material.)

Model	STSB	STS-12	STS-13	STS-14	STS-15	STS-16	SICK-R	Avg. ( $Δ$ %)
Results based on Bert-base-cased
BERT	59.05	57.72	58.38	61.97	70.28	69.63	63.75	62.97
SBERT-WK [DBLP:journals/taslp/WangK20]	16.07	26.66	14.74	24.32	28.84	34.32	41.54	26.64
BERT-flow(NLI) [DBLP:conf/emnlp/LiZHWYL20]	58.56	59.54	64.69	64.66	72.92	71.84	65.44	65.38
BERT-whitening(NLI) [DBLP:journals/corr/abs-2103-15316]	68.19	61.69	65.70	66.02	75.11	73.11	63.60	67.63
BERT-whitening(NLI)-256 [DBLP:journals/corr/abs-2103-15316]	67.51	61.46	66.71	66.17	74.82	72.10	64.90	67.67
WhiteBERT [DBLP:journals/corr/abs-2104-01767]	68.72	62.20	68.52	67.35	74.73	72.42	60.43	67.77( $↑$ 7.6)
SoftDecay	72.41**	65.16**	72.10**	69.49**	77.09**	77.05**	65.55**	71.26( $↑$ 12.0)
Results based on DistilBERT-base
DistilBERT	61.45	59.68	59.60	63.54	70.95	69.90	63.84	64.12
WhiteBERT [DBLP:journals/corr/abs-2104-01767]	69.41	61.82	66.90	67.69	74.27	72.81	59.43	67.48( $↑$ 5.2)
SoftDecay	71.10**	63.33**	70.62**	68.39**	76.34**	75.29**	63.40**	69.78( $↑$ 8.8)
Results based on ALBERT-base
ALBERT	46.18	51.02	43.94	50.79	60.83	55.35	54.99	51.87
WhiteBERT [DBLP:journals/corr/abs-2104-01767]	61.76	58.33	62.89	59.92	68.84	65.90	58.03	62.24( $↑$ 19.9)
SoftDecay	63.30**	59.42**	62.93**	61.09**	70.84**	68.60**	62.26**	64.06( $↑$ 23.5)
Results based on RoBERTa-base
RoBERTa	57.54	58.56	50.37	59.62	66.64	63.21	60.75	59.53
WhiteBERT [DBLP:journals/corr/abs-2104-01767]	68.18	62.21	67.13	67.63	74.78	71.43	58.80	67.17( $↑$ 12.83)
SoftDecay	69.47**	62.97**	67.65**	68.09**	75.33**	73.26**	62.87**	68.50( $↑$ 15.10)

Table 1: Spearman’s rank correlation results on STS tasks using sentence representation learning methods applied to different PTLMs. Results with ** are significant at $p < 0.001$ , and those with * at $p < 0.05$ , by comparing with the best baseline. The improvement $Δ$ % is calculated by comparing with the base PTLM (first row in each PTLM group).

5.1 Unsupervised Evaluation on STS

Setup

The STS task is a widely-used benchmark for evaluating sentence representation learning. We conduct experiments on seven STS datasets, namely SICK-R [DBLP:conf/lrec/MarelliMBBBZ14] and the STS tasks [DBLP:conf/semeval/AgirreCDG12, DBLP:conf/starsem/AgirreCDGG13, DBLP:conf/lrec/MarelliMBBBZ14, DBLP:conf/semeval/AgirreBCCDGGLMM15, DBLP:conf/semeval/AgirreBCDGMRW16]. We compare our approach with unsupervised methods for adjusting anisotropy in STS tasks, including BERT-flow [DBLP:conf/emnlp/LiZHWYL20], SBERT-WK [DBLP:journals/taslp/WangK20], BERT-whitening [DBLP:journals/corr/abs-2103-15316] and WhiteBERT [DBLP:journals/corr/abs-2104-01767]. BERT-flow argued that ideal token/sentence representations should be isotropic and proposed to transform the representations learned by PTLMs into a standard Gaussian distribution. Similar to BERT-flow, SBERT-WK also used Natural Language Inference datasets to train the top transformation layer while keeping parameters in the PTLM fixed. BERT-whitening and WhiteBERT dissect BERT-based word models through geometric analysis of the feature space. Our SoftDecay is directly applied to the last layer of the original PTLMs to derive the transformed sentence representation without any fine-tuning. (Here, we empirically search for the best value of $α$ in $[- 0.2, - 0.4, - 0.6, - 0.8, - 1.0]$ .)

Results

It can be observed from Table 1 that: (1) whitening-based methods (BERT-Whitening and WhiteBERT), which transform the derived representations to be perfectly isotropic, perform better than the other baselines such as BERT-flow, which applies a flow-based approach to generate sentence embeddings from a Gaussian distribution. (2) Our proposed SoftDecay gives significantly superior results across all seven datasets, with up to 23.5% improvement over the base PTLMs and 5% over the best baseline on BERT-based methods. (3) When comparing the results from different PTLMs, we observe more significant improvements on the ALBERT-based models (23%), and modest improvements on the DistilBERT-based models (8%). This is somewhat expected as the token uniformity issue is more likely to occur in deeper models. Therefore, less obvious improvements are found on DistilBERT with only 6 layers, compared to the others with 12 layers. The cross-layer parameter sharing in ALBERT could potentially lead to more serious token uniformity, and thus ALBERT benefits more from the mitigation strategies.

To further understand how SoftDecay alleviates token uniformity, we show the CDF of singular values from DistilBERT and ALBERT before and after applying SoftDecay in Figure 3. We can observe that before applying SoftDecay, the outputs of ALBERT across various layers are very similar, while the outputs of DistilBERT differ more across layers. After applying SoftDecay, the singular value distribution of the last layer output of ALBERT (the red curve) is less skewed compared to that of DistilBERT (the brown curve).

Figure 3: Cumulative distribution function (CDF) of singular values from DistilBERT (left column) and ALBERT (right column), before (top) and after (bottom) applying SoftDecay on the STS dataset.

Figure 4: Data points are tSNE mapping results of sentence (pair) representations in STS-15, from left to right derived from the vanilla BERT, BERT+whitening and BERT+SoftDecay. The two sentences in each pair are denoted by two different colors, e.g., black and red in BERT. The metrics measuring uniformity and local neighbourhood structure (see §4.4) are listed on the right. We can see that our method preserves the local neighbourhood structure better than Whitening, with a lower LSDS, and addresses token uniformity in BERT well, with lower scores in the first three metrics.

Feature Evaluation

To gain insights into the characteristics of desirable features for the STS task, we visualise the sentence representations in STS-15 via tSNE and present the results using our proposed metrics in Figure 4. BERT-Whitening transforms vanilla features from BERT into a perfectly isotropic distribution, which is evidenced by the uniformity measures: nearly all the features are orthogonal to each other, as TokenUni is zero and they have the smallest $RBF_{dis}$ . It also has the lowest $EV_{k}$ score for its top singular value. However, BERT-Whitening fails to preserve the local neighbourhood of BERT embeddings in its transformed space, as shown by its larger Local Structure Discrepancy Score (LSDS) compared to SoftDecay. By contrast, SoftDecay not only significantly improves the uniformity compared to the vanilla BERT feature distribution, but also maintains a similar distribution shape. Our results show that transforming learned text representations into isotropic distributions does not necessarily lead to better performance. Our proposed SoftDecay is better at preserving the local neighbourhood structure in the transformed embedding space, leading to superior results compared to the others. (The full results of uniformity and structural evaluation of different methods over the seven STS datasets can be found in Appendix C. In Appendix C.3, we further discuss a comparison between SoftDecay and a representative contrastive learning method, SimCSE [DBLP:conf/emnlp/GaoYC21], which also aims to alleviate the anisotropy problem in language representations.)

Dataset (size)	BERT	+SoftDecay( $Δ$ %)	ALBERT	+SoftDecay( $Δ$ %)	DistilBERT	+SoftDecay( $Δ$ %)
CoLA(8.5k)	59.57	59.84*( $↑$ 0.45)	46.47	48.91**( $↑$ 5.25)	50.60	50.73*( $↑$ 0.26)
SST2(67k)	92.32	93.12**( $↑$ 0.87)	90.02	89.91*( $↓$ 0.12)	90.48	91.40**( $↑$ 1.00)
MRPC-Acc(3.7k)	84.00	85.20**( $↑$ 1.43)	85.54	85.05( $↓$ 0.57)	84.56	84.31*( $↓$ 0.30)
MRPC-F1(3.7k)	89.50	89.65( $↑$ 0.17)	89.67	89.28( $↓$ 0.43)	89.16	89.00( $↓$ 0.18)
QNLI(105k)	91.25	91.98**( $↑$ 0.80)	89.99	90.24*( $↑$ 0.28)	87.66	88.81**( $↑$ 1.31)
RTE(2.5k)	64.98	68.23**( $↑$ 5.00)	66.43	68.23**( $↑$ 2.71)	56.68	59.21**( $↑$ 4.46)

Table 2: Sentence-level classification results on five representative GLUE validation datasets. Matthews correlation is used to evaluate CoLA; Accuracy/F1 is used for the other datasets. $Δ$% represents the relative improvement over the baseline.
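As a worked illustration of the $Δ$% column, the following minimal Python sketch computes the relative change of a score over its baseline. The function name `relative_improvement` is hypothetical and chosen only for illustration; the two input values are the BERT scores on RTE from Table 2.

```python
def relative_improvement(baseline: float, score: float) -> float:
    """Relative change of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0


# BERT on RTE (Table 2): baseline 64.98, with SoftDecay 68.23.
delta = relative_improvement(64.98, 68.23)
print(f"{delta:.2f}%")  # agrees with the 5.00 entry reported for RTE
```

A negative value corresponds to a drop relative to the baseline, which is what the downward arrows in the table denote.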

	MNLI	MNLI(mm)	QQP	QNLI	SST2	COLA	MRPC	RTE	Average( $Δ %$ )
S-BERT	83.9	83.1	71.3	90.5	90.9	47.0	85.3	61.6	76.7
BERT-CT	82.3	81.9	70.1	89.7	91.3	48.8	84.4	61.1	76.2
SoftDecay	84.6**	84.0**	71.6*	90.9*	93.3**	50.3**	86.2**	64.5**	78.2 ( $↑$ 2.6%)

Table 3: GLUE test results returned by the GLUE leaderboard. The first two rows are reported in BERT-CT [DBLP:conf/iclr/CarlssonGGHS21]. Our results outperform BERT-CT by 2.6% on average.

5.2 Supervised evaluation on GLUE datasets

Setup

We evaluate our method on five sentence-level classification datasets in GLUE [DBLP:conf/iclr/WangSMHLB19]: grammatical acceptability assessment on the Corpus of Linguistic Acceptability (CoLA) [DBLP:journals/tacl/WarstadtSB19], sentiment classification on the Stanford Sentiment Treebank (SST2) [DBLP:conf/emnlp/SocherPWCMNP13], paraphrase detection on the Microsoft Research Paraphrase Corpus (MRPC) [DBLP:conf/acl-iwp/DolanB05], and natural language inference on the Question-Answering NLI (QNLI) data and the Recognizing Textual Entailment (RTE) data. (We exclude WNLI as it has only 634 training samples and is often excluded in previous work [devlin2018bert]. We also exclude STS-B as it is a benchmark in the STS task.)

We apply our proposed SoftDecay on top of the last encoder layer in BERT, ALBERT and DistilBERT, and then fine-tune the PTLM weights along with $α$ on different tasks. In addition to the PTLMs, we include two more baselines, both sentence-level embedding learning models: Sentence-BERT (S-BERT for short) [DBLP:conf/emnlp/ReimersG19] and BERT-CT [DBLP:conf/iclr/CarlssonGGHS21]. (In Appendix D.1, we further compare SoftDecay with a method that adds regularisation during training in order to alleviate the anisotropy problem in language representations [DBLP:conf/iclr/Wang0HHWG20].)

S-BERT adds a pooling operation to the output of BERT to derive a sentence embedding and fine-tunes a siamese BERT network structure on sentence pairs.
BERT-CT improves the PTLMs by incorporating contrastive loss in the training objective to retain a semantically distinguishable sentence representation.

The two methods aim at making the sentence-level embeddings more discriminative, which in turn alleviates the token uniformity problem.

Since GLUE did not release the test set, the test results can only be obtained by submitting the trained models to the GLUE leaderboard (https://gluebenchmark.com/leaderboard). We show the test results returned by the GLUE leaderboard in Table 3.

Results

It can be observed from Table 2 that SoftDecay is more effective on the BERT-based model, while it gives less noticeable improvement on DistilBERT. This is similar to what we observed for the STS tasks, since DistilBERT has fewer layers. For the vanilla PTLMs, BERT gives better results on all the single-sentence tasks (the exception being MRPC, a sentence-pair paraphrase detection task). With SoftDecay, all three models achieve better results on the inference tasks (QNLI and RTE), especially on the smaller dataset RTE. The CDF of singular value distributions on RTE before and after applying SoftDecay, shown in Figure 5, further verifies the effectiveness of our proposed transformation function. We also observe that models trained on a larger training set tend to generate more similar representations. (We investigate the impact of the training set size on model performance in Appendix D.2, Figure 3.) On MRPC, SoftDecay is effective on BERT, but gives a slight performance drop on ALBERT and DistilBERT. One possible reason is the much smaller training set size. On the GLUE test results shown in Table 3, we observe that SoftDecay outperforms both S-BERT and BERT-CT across all tasks.

Figure 5: CDF of singular value distributions on RTE before (left) and after (right) applying SoftDecay on BERT. It is clear that SoftDecay can produce a set of larger singular values, as evidenced by the curves of $F(x)$.

6 Conclusion and Future Work

In this paper, we have empirically shown that the degree of skewness of singular value distributions correlates with the degree of token uniformity. To address the token uniformity problem, we have proposed a singular value transformation function by alleviating the skewness of the singular values. We have also shown that a perfect isotropic feature space fails to capture the local neighborhood information and leads to inferior performance in downstream tasks. Our proposed transformation function has been evaluated on unsupervised and supervised tasks. Experimental results show that our methods can more effectively address token uniformity compared to existing approaches.

Our paper explores the token uniformity issue in information propagation in the transformer encoder, where self-attention is used. It would be interesting to extend our approach to the encoder-decoder structure and explore its performance in language generation tasks. One promising future direction is to improve the generation diversity via addressing the token uniformity since it has been previously shown that anisotropy is related to the word occurrence frequencies [DBLP:conf/emnlp/0004GXMYS20, DBLP:conf/naacl/BisPL21]. As such, in the decoding phase, sampling words from more isotropic word embedding distributions could potentially lead to more diverse results.

Acknowledgements

This work was funded by the UK Engineering and Physical Sciences Research Council (grant no. EP/T017112/1, EP/V048597/1). YH is supported by a Turing AI Fellowship funded by the UK Research and Innovation (grant no. EP/V020579/1).

References

Appendix A Proof of Theorem in Section 3

Theorem: $\forall x \in X^{l}$, $\exists x' \in S_{[1,k]}^{l}$, where the subspace $S_{[1,k]}^{l}$ is defined based on $λ_{k} \geq C \geq λ_{k+1}$, then $\| x - x' \|_{2} \leq C$.

Proof. We assume that $X^{l}$ can be represented as an $n_{l} \times m$ matrix:

$$X^{l} = \begin{bmatrix} \vec{x}_{1} \\ \vec{x}_{2} \\ \vdots \\ \vec{x}_{n_{l}} \end{bmatrix},$$

where $\vec{x}_{i} \in \mathbb{R}^{m}$ is an $m$-dimensional embedding of a token in the output of the $l$-th layer. After performing SVD on $X^{l}$, we have:

$$\begin{bmatrix} \vec{x}_{1} \\ \vec{x}_{2} \\ \vdots \\ \vec{x}_{n_{l}} \end{bmatrix} = \begin{bmatrix} \vec{u}_{1} \\ \vec{u}_{2} \\ \vdots \\ \vec{u}_{m} \\ \vdots \\ \vec{u}_{n_{l}} \end{bmatrix} \cdot \begin{bmatrix} λ_{1} & 0 & \dots & 0 \\ 0 & λ_{2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & λ_{m} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 0 \end{bmatrix} \cdot \begin{bmatrix} \vec{v}_{1} \\ \vec{v}_{2} \\ \vdots \\ \vec{v}_{m} \end{bmatrix},$$

where the unitary matrices $U = [\vec{u}_{1}^{\top}, \vec{u}_{2}^{\top}, \dots, \vec{u}_{n_{l}}^{\top}]^{\top}$ and $V = [\vec{v}_{1}^{\top}, \vec{v}_{2}^{\top}, \dots, \vec{v}_{m}^{\top}]^{\top}$ are the $n_{l} \times n_{l}$ left singular matrix and the $m \times m$ right singular matrix, respectively. Therefore, the two collections of vectors, i.e. $\vec{u}_{i} = \{u_{i1}, u_{i2}, \dots, u_{i n_{l}}\}$ and $\vec{v}_{i} = \{v_{i1}, v_{i2}, \dots, v_{im}\}$, are two subsets of bases for the $m$-dimensional vector space ($m \ll n_{l}$). Without loss of generality, we assume $\vec{x}_{i} \in X^{l}$ can be represented by its corresponding left singular vector, the singular values, and the right singular matrix $V$, which yields:

$$\begin{aligned} \vec{x}_{i} &= \vec{u}_{i} \cdot \begin{bmatrix} λ_{1} & 0 & \dots & 0 \\ 0 & λ_{2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & λ_{m} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 0 \end{bmatrix} \cdot \begin{bmatrix} \vec{v}_{1} \\ \vec{v}_{2} \\ \vdots \\ \vec{v}_{m} \end{bmatrix} \\ &= \begin{bmatrix} λ_{1} \cdot u_{i1}, & λ_{2} \cdot u_{i2}, & \dots, & λ_{m} \cdot u_{im} \end{bmatrix} \cdot \begin{bmatrix} \vec{v}_{1} \\ \vec{v}_{2} \\ \vdots \\ \vec{v}_{m} \end{bmatrix} \end{aligned}$$

If we separate the singular values into two parts by $C$, where $λ_{k} \geq C \geq λ_{k+1} \geq 0$, we can rewrite the expression for $\vec{x}_{i}$ above as:

$$\begin{aligned} \vec{x}_{i} &= \sum_{j=1}^{m} λ_{j} \cdot u_{ij} \cdot \vec{v}_{j} \\ &= \sum_{j=1}^{k} λ_{j} \cdot u_{ij} \cdot \vec{v}_{j} + \sum_{j=k+1}^{m} λ_{j} \cdot u_{ij} \cdot \vec{v}_{j} \end{aligned}$$

By defining $\vec{x}_{i}' = \sum_{j=1}^{k} λ_{j} \cdot u_{ij} \cdot \vec{v}_{j}$, where the singular values are taken from the larger group, we have:

$$\begin{aligned} \| \vec{x}_{i} - \vec{x}_{i}' \| &= \Big\| \sum_{j=k+1}^{m} λ_{j} \cdot u_{ij} \cdot \vec{v}_{j} \Big\| \\ &= \big| \big\langle \vec{λ}^{[k+1,m]} \otimes \vec{u}_{i}^{[k+1,m]},\; V^{(m-k) \times m} \big\rangle \big| \end{aligned}$$

where $\| \cdot \|$ is the norm, $\otimes$ is the pairwise product, and $| \langle \cdot, \cdot \rangle |$ is the inner product in a vector space; $\vec{λ}^{[k+1,m]}$ and $\vec{u}_{i}^{[k+1,m]}$ are the sub-vectors of the singular values and of $\vec{u}_{i}$ from the $(k+1)$-th to the $m$-th dimension, respectively; and $V^{(m-k) \times m}$ is the corresponding right singular sub-matrix, which has $m-k$ rows. According to Hölder's inequality, we have:

$$\| \vec{x}_{i} - \vec{x}_{i}' \| \leq \| \vec{λ}^{[k+1,m]} \otimes \vec{u}_{i}^{[k+1,m]} \| \cdot \| V^{(m-k) \times m} \|$$

Since $V$ is a unitary matrix, $V^{\top} \cdot V = I$, which yields $\| V^{(m-k) \times m} \| = 1$. Hence,

$$\begin{aligned} \| \vec{x}_{i} - \vec{x}_{i}' \| &\leq \| \vec{λ}^{[k+1,m]} \otimes \vec{u}_{i}^{[k+1,m]} \| \\ &= \sqrt{\sum_{j=k+1}^{m} λ_{j}^{2} \cdot u_{ij}^{2}} \end{aligned}$$

Considering $\| \vec{u}_{i} \| = 1$ and $λ_{k+1} \leq C$, we have $\| \vec{u}_{i}^{[k+1,m]} \| \leq 1$ and $λ_{j} \leq C$ when $j \geq k+1$. Therefore,

$$\| \vec{x}_{i} - \vec{x}_{i}' \| \leq C \cdot \sqrt{\sum_{j=k+1}^{m} u_{ij}^{2}} \leq C$$

$□$
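The bound can also be checked numerically. The following NumPy sketch is an illustration rather than part of the proof: it draws a random matrix as a stand-in for $X^{l}$ (an assumption; real layer outputs are not Gaussian), keeps only the top $k$ singular values, and compares each row-wise reconstruction error with $λ_{k+1}$, which is the smallest admissible $C$. The helper name `truncation_errors` is hypothetical.

```python
import numpy as np


def truncation_errors(X: np.ndarray, k: int):
    """Row-wise L2 error of the rank-k SVD truncation of X, plus lambda_{k+1}."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    if not 0 < k < s.size:
        raise ValueError("k must satisfy 0 < k < number of singular values")
    # x_i' = sum_{j<=k} lambda_j * u_ij * v_j for every row i
    X_k = (U[:, :k] * s[:k]) @ Vt[:k]
    errors = np.linalg.norm(X - X_k, axis=1)
    return errors, s[k]  # s[k] is lambda_{k+1} (0-based indexing)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))      # toy stand-in: 200 tokens, 16 dimensions
errors, bound = truncation_errors(X, k=4)
print(errors.max() <= bound)        # every row stays within the bound
```

Any $C$ with $λ_{k} \geq C \geq λ_{k+1}$ is at least as large as the value returned here, so the check covers the theorem's statement for this toy matrix.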

A case study where the vectors in the unitary matrix $U$ follow a uniform distribution in an $L_{2}$-norm based metric space

The theorem states that the learned features from a transformer-based language model can be represented as a closure which is defined as a $C$-neighbour of a $k$-dimensional space. Here, we present a case study, assuming the vectors $\vec{u}_{i}$ in the unitary matrix $U$ follow a uniform distribution within an $L_{2}$-norm based metric space.

Under such an assumption, the probability $P (\sum_{j=k+1}^{m} \sqrt{u_{ij}^{2}} \leq d)$ is the integral of the probability density function over the corresponding area of an $n$-sphere, denoted as $S_{n-1}$, defined by $\sum_{j=k+1}^{m} \sqrt{u_{ij}^{2}}$. It is clear that $P (\sum_{j=k+1}^{m} \sqrt{u_{ij}^{2}} \leq d) \geq 0$. Hence, we only discuss the upper bound of $P$ in the following. We denote the sub-area where $\sum_{j=k+1}^{m} \sqrt{u_{ij}^{2}} \leq d$ as $S_{ϕ}$. To simplify the notation, without loss of generality, we re-order the elements in $\vec{u}_{i} \in \mathbb{R}^{n}$ such that its last $k$ dimensions correspond to the small singular values. Then, we have

$$\begin{aligned} P (\vec{u}_{i} \in S_{ϕ}) &= \int_{S_{ϕ}} \frac{Γ (n/2)}{2 π^{n/2}} \prod_{i=1}^{n-2} \sin^{n-1-i} (ψ_{i}) \, d ϕ_{1} \dots d ϕ_{(n-1)} \\ &\leq \frac{Γ (n/2)}{2 π^{n/2}} \cdot d^{k} \int_{S_{ϕ}} \prod_{i=1}^{n-k} \sin^{n-k-i} (ψ_{i}) \, d ϕ_{1} \dots d ϕ_{(n-k+1)} \cdot \int_{S_{ϕ}} \prod_{i=1}^{k} \sin^{k-i} (ψ_{n-k-1+i}) \, d ϕ_{k} \dots d ϕ_{(n-1)} \\ &\leq \frac{Γ (n/2)}{2 π^{n/2}} \cdot \frac{2 π^{(n-k)/2}}{Γ ((n-k)/2 + 1)} \cdot \frac{2 π^{k/2}}{Γ (k/2 + 1)} \cdot (1 - d^{n-k}) \cdot d^{2k} \\ &\leq \frac{2}{k (n-k)} \cdot \frac{Γ (n/2)}{Γ ((n-k)/2) \, Γ (k/2)} \cdot d^{2k} (1 - d^{n-k}) \\ &= \frac{2}{k (n-k) \cdot B (k/2, (n-k)/2)} \cdot d^{2k} (1 - d^{n-k}), \end{aligned} \tag{8}$$

where $B (\cdot, \cdot)$ is the beta function. The result shows that the probability of a singular vector residing in the sub-area $S_{ϕ}$ converges to 0 exponentially as $k$ grows. As such, when $k$, the number of smaller singular vectors, is large, the distance between the embedding space and the subspace spanned by the larger singular vectors is bounded by $C$, the smallest value in the larger singular value group.

Appendix B Model Configurations and Training Details

Unsupervised Setting

In the unsupervised setting on the STS task, we use the datasets processed by [DBLP:journals/corr/abs-2104-01767] and follow their evaluation pipeline, replacing their Whitening function with our SoftDecay function in their released code (https://github.com/Jun-jie-Huang/WhiteningBERT). We do not use any dataset to train the transformation function; instead, we choose a fixed $α$ empirically ($α$ is the hyper-parameter in Eq. (3)). As we did not see significant changes across different values of $α$, we set $α$ to $-0.6$ for all the datasets and PTLMs. For metrics calculation, we use $t = 0.5$ in RBF$_{dis}$ and choose the nearest 12 points to reconstruct the query point in LSDS.

Supervised Setting

We apply SoftDecay to the output of the last layer of a PTLM provided by Hugging Face, before layer normalisation. We use the default parameters configured in BERT-base-uncased (https://huggingface.co/docs/transformers/master/en/model_doc/bert), ALBERT-base-v1 (https://huggingface.co/docs/transformers/master/en/model_doc/albert), RoBERTa-base (https://huggingface.co/docs/transformers/master/en/model_doc/roberta) and DistilBERT-base-uncased (https://huggingface.co/docs/transformers/master/en/model_doc/distilbert) as the baselines. For the hyper-parameter setting, we search the initial $α$ for different datasets from $[-0.2, -0.5, -0.8]$, and set different learning rates from $[2e-3, 2e-5]$ for the transformation layer and the pretrained models. (As SVD generates an error in the RoBERTa-base model, we exclude it from the GLUE evaluation.)
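The general mechanics of a singular value transformation applied to a layer output can be sketched as follows. This NumPy example is a minimal illustration under stated assumptions: it operates on a single toy matrix of token representations, and `concave_power` is a hypothetical stand-in that merely satisfies the three properties listed in §4.2 ($f'(λ) \geq 0$, $f''(λ) \leq 0$, $f(λ_{max}) \approx λ_{max}$). It is not the SoftDecay function of §4.3, and the exponent `p` has no counterpart in the paper.

```python
import numpy as np


def concave_power(s: np.ndarray, p: float = 0.5) -> np.ndarray:
    """Hypothetical stand-in transform: monotone, concave, keeps the maximum fixed."""
    s_max = s.max()
    return s_max * (s / s_max) ** p


def transform_singular_values(H: np.ndarray, f=concave_power) -> np.ndarray:
    """Rebuild H after applying f to its singular values; singular vectors are kept."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return (U * f(s)) @ Vt  # scales column j of U by f(s_j)


rng = np.random.default_rng(0)
H = rng.normal(size=(12, 8))           # toy sizes: 12 tokens, 8 hidden dimensions
H_new = transform_singular_values(H)

s_old = np.linalg.svd(H, compute_uv=False)
s_new = np.linalg.svd(H_new, compute_uv=False)
print(s_old[-1] / s_old[0], s_new[-1] / s_new[0])  # smallest-to-largest ratio
```

With this stand-in, the ratio of the smallest to the largest singular value increases after the transformation, i.e., the spectrum becomes less skewed while the largest singular value is unchanged.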

		BERT	+SoftDecay	ALBERT	+SoftDecay	DistilBERT	+SoftDecay
STS-B	Evs	0.6259	0.0252	0.6987	0.0326	0.7301	0.0341
	RBF $_{d i s}$	-1.4624	-3.8534	-1.1602	-3.8016	-1.0549	-3.8052
	TokenUni	0.6195	0.0274	0.6983	0.036	0.7282	0.037
SICK	Evs	0.7383	0.0212	0.7711	0.0274	0.8135	0.0289
	RBF $_{d i s}$	-1.0323	-3.8671	-0.8979	-3.8268	-0.7367	-3.8241
	TokenUni	0.7361	0.023	0.7706	0.0295	0.8130	0.0311
STS-12	Evs	0.6219	0.0182	0.7052	0.0247	0.7321	0.0245
	RBF $_{d i s}$	-1.4785	-3.8717	-1.4785	-1.1438	-3.8308	-3.8381
	TokenUni	0.6193	0.0203	0.7058	0.0273	0.7021	0.0329
STS-13	Evs	0.5823	0.0221	0.6632	0.0287	0.7015	0.0302
	RBF $_{d i s}$	-1.6189	-3.8706	-1.3032	-3.8258	-1.1594	-3.8262
	TokenUni	0.5817	0.024	0.6637	0.031	0.7021	0.0329
STS-14	Evs	0.5933	0.6729	0.0204	0.0151	0.712	0.0202
	RBF $_{d i s}$	-1.593	-3.9124	-1.2712	-3.8787	-1.1288	-3.8855
	TokenUni	0.5929	0.016	0.6743	0.0217	0.7127	0.0215
STS-15	Evs	0.6072	0.0183	0.6827	0.0239	0.7225	0.0248
	RBF $_{d i s}$	-1.5177	-3.8706	-1.2178	-3.8379	-1.0772	-3.8313
	TokenUni	0.6057	0.0216	0.6848	0.0273	0.7228	0.0291
STS-16	Evs	0.6049	0.0267	0.6824	0.0333	0.7190	0.0363
	RBF $_{d i s}$	-1.5262	-3.8375	-1.5262	-1.2095	-3.7952	-3.7869
	TokenUni	0.6054	0.0286	0.6864	0.0360	0.7201	0.0390

Table 4: Uniformity metrics (EVs, TokenUni, RBF$_{dis}$) evaluate the isotropy of the transformed feature space compared with the vanilla PTLM features. Smaller values mean that the features are more uniformly distributed. It can be seen that SoftDecay greatly improves the uniformity.

Appendix C Additional results on Semantic Textual Similarity Dataset

In this section, we first examine the potential reasons for the improvement by comparing the representations learnt by the baseline models (i.e., vanilla PTLMs and WhiteningBERT) and by our proposed SoftDecay, through quantitative evaluation results and visualisation results (see §C.1 and §C.2). We then discuss a comparison between SoftDecay and a representative contrastive learning method, SimCSE [DBLP:conf/emnlp/GaoYC21], which also aims to alleviate the anisotropy problem in language representations.

C.1 Feature Evaluation Results on STS Datasets

We show in Table 4 and Figure 6 both the uniformity and the local neighbourhood preservation evaluation results of different methods over the seven STS datasets. The lower scores returned by SoftDecay in Table 4, in comparison to the base PTLMs, verify its capability of alleviating the anisotropic feature space derived from BERT. In Figure 6, SoftDecay preserves the local neighbourhood structure better across all the datasets, which explains its superior performance compared with Whitening, which ignores the original local manifold structure.

Figure 6: Local Structure Discrepancy Score (LSDS) for Whitening and SoftDecay transformed Representations. Smaller scores are preferred as the original local neighborhood information learnt in the pretrained model is preserved better.

C.2 Visualisation of Features in STS Datasets

We show the representations of sentence pairs generated from BERT, with Whitening and with SoftDecay, via tSNE for the remaining STS datasets in Figure 7. In STS-B, STS-13 and STS-16, the mapped representations under Whitening are not unit Gaussian due to some abnormal data points. Our proposed SoftDecay gives better uniformity scores than vanilla BERT and better LSDS than WhiteningBERT, as shown in Figure 6 and Table 4.

Figure 7: The tSNE visualisation of representations of sentence pairs in the datasets SICK-R, STS-B and STS-12 to STS-16 (except STS-15), shown in different columns. From top to bottom, the representations are derived from vanilla BERT, BERT+Whitening and BERT+SoftDecay. For each sentence pair, the two sentences are denoted by different colors, e.g., black and red in BERT. We can see clear clusters in BERT and BERT+SoftDecay for the STS-B, STS-12 and STS-14 datasets.

C.3 Comparison with Contrastive Learning on STS

The objective of contrastive learning methods is to align semantically related positive data pairs and to make the learned representations evenly distributed in the resulting embedding space [DBLP:conf/icml/0001I20]. The latter property naturally addresses the token uniformity issue. Therefore, we further compare SoftDecay with a representative contrastive learning method, SimCSE [DBLP:conf/emnlp/GaoYC21], on STS. As SimCSE needs to be trained on datasets to fine-tune its parameters, we conduct experiments using SimCSE following its original setup: (1) Unsupervised. Train the model on 1 million sentences sampled from English Wikipedia, and pass the same sentence twice to a pre-trained encoder with standard dropout to generate two different sentence embeddings as a positive pair. Other sentences in the same mini-batch are taken as negative pairs; (2) Supervised. Train the model on the natural language inference datasets MNLI and SNLI, and use the annotated entailment and contradiction pairs as positive and negative sentence pairs, respectively. The results are shown in Table 5. It can be observed that SoftDecay outperforms SimCSE in general, especially under the supervised setting. The end goal of our approach (via increasing the weights of small singular values in the output embedding space) is similar to that of SimCSE (via random dropout masks) under the unsupervised setting, as both aim to learn an isotropic embedding distribution. However, in the supervised SimCSE, the contrastive loss is calculated on a subset of training pairs; as such, it is relatively difficult to achieve universal isotropy, which is not the case in our approach.
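The in-batch objective described in setup (1) can be sketched as follows. This NumPy example is a simplified illustration, not SimCSE's released implementation: `z1` and `z2` stand for the two dropout-perturbed embeddings of the same batch of sentences, and the function name and the temperature value are assumptions made for the example.

```python
import numpy as np


def in_batch_contrastive_loss(z1: np.ndarray, z2: np.ndarray,
                              temperature: float = 0.05) -> float:
    """Cross-entropy over cosine similarities; row i of z1 should match row i of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (batch, batch) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the positive pairs; off-diagonal entries act as negatives.
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                      # toy batch of 8 sentence embeddings
noise = 0.01 * rng.normal(size=z.shape)           # stand-in for dropout noise
print(in_batch_contrastive_loss(z, z + noise))    # small: positives are aligned
print(in_batch_contrastive_loss(z, np.roll(z, 1, axis=0)))  # large: positives misaligned
```

Minimising this loss pulls the two views of each sentence together while pushing apart the other sentences in the mini-batch, which is the mechanism that spreads the representations more evenly.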

Model	STSB	STS-12	STS-13	STS-14	STS-15	STS-16	SICK-R
Trained on wiki-text (unsupervised)
SimCSE [DBLP:conf/emnlp/GaoYC21]	74.48	66.01	81.48	71.77	77.55	76.53	69.36
SoftDecay	75.81	63.25	78.67	70.41	79.37	77.69	71.15
Trained on MNLI and SNLI dataset (supervised)
SimCSE [DBLP:conf/emnlp/GaoYC21]	82.26	77.37	78.12	77.81	84.65	81.10	78.73
SoftDecay	83.51	75.31	81.70	79.88	86.33	81.37	79.04

Table 5: Comparison with contrastive learning method, SimCSE. Our methods demonstrate overall better results under the supervised setting.

Appendix D Additional Results on GLUE datasets

In this section, we first show the results of comparing SoftDecay with another method, which applies regularisation during training to alleviate the anisotropy issue. Then, we display the Cumulative distribution function (CDF) of singular value distributions before and after applying SoftDecay.

D.1 Comparing with another singular value transformation function

In addition to Sentence-BERT (S-BERT for short) [DBLP:conf/emnlp/ReimersG19] and BERT-CT [DBLP:conf/iclr/CarlssonGGHS21], we also compare with another method which applies regularisation on the output embedding matrix with an exponentially decayed singular value prior distribution during training (ExpDecay for short) [DBLP:conf/iclr/Wang0HHWG20].

ExpDecay is designed for an encoder-decoder architecture in language generation. The singular value distribution of the output embedding matrix is derived from the decoder. This approach is not directly applicable to our setup since we do not use the encoder-decoder architecture here. Nevertheless, we modify our training objective by adding a term on the singular values $\{λ_{k}\}_{k=1}^{K}$ of the output feature $X$: $γ_{e} \sum_{k=1}^{K} (λ_{k} - c_{1} e^{- c_{2} k^{γ}})$, where $γ_{e}$ is a hyperparameter used to adjust the weight of the added term, and $c_{1}$, $c_{2}$ and $γ$ are hyperparameters in the desirable exponential prior term of singular values. We empirically set $c_{1} = c_{2} = 1$, $γ = 2$ and $γ_{e} = 1e-4$.
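As a concrete reading of the added term, the sketch below evaluates it for a given list of singular values, following the expression as it is written above and using the stated settings for $c_{1}$, $c_{2}$, $γ$ and the weight. The function name `exp_decay_term` and the example singular values are illustrative assumptions; during training, the singular values would come from an SVD of the output feature $X$.

```python
import math


def exp_decay_term(singular_values, c1=1.0, c2=1.0, gamma=2.0, weight=1e-4):
    """weight * sum_k (lambda_k - c1 * exp(-c2 * k**gamma)), with k starting at 1."""
    total = 0.0
    for k, lam in enumerate(singular_values, start=1):
        prior = c1 * math.exp(-c2 * k ** gamma)  # exponentially decayed prior value
        total += lam - prior
    return weight * total


# Toy singular values (illustrative only), sorted in descending order.
print(exp_decay_term([1.0, 0.5, 0.1]))
```

Because the prior decays very quickly in $k$ under these settings, the term is dominated by the singular values themselves beyond the first few indices, which hints at how sensitive the regulariser is to the choice of $c_{2}$ and $γ$.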

By comparing with the results of ExpDecay in Table 6, we do not see substantial improvement using the fixed exponential decay term. This can be explained by 1) the difficulty of balancing two losses when adding the exponential decay term to the training objective function; and 2) the sensitivity of the hyper-parameters in the prior decay term in ExpDecay. In our method, we only have a single parameter $α$ in Eq. (2), and its value can be automatically adjusted during training to fit the downstream tasks under the supervised setting.

Dataset (size)	BERT	+SoftDecay( $Δ$ %)	+ExpDecay( $Δ$ %)
CoLA(8.5k)	59.57	59.84*( $↑$ 0.45)	59.37( $↓$ 0.34)
SST2(67k)	92.32	93.12**( $↑$ 0.87)	92.43( $↑$ 1.19)
MRPC-Acc(3.7k)	84.00	85.20**( $↑$ 1.43)	83.25( $↓$ 0.89)
MRPC-F1(3.7k)	89.50	89.65( $↑$ 0.17)	87.92( $↓$ 1.21)
QNLI(105k)	91.25	91.98**( $↑$ 0.80)	89.21( $↓$ 2.23)
RTE(2.5k)	64.98	68.23**( $↑$ 5.00)	64.98( $↑$ 0.00)

Table 6: Sentence-level classification results on five representative GLUE validation datasets. Matthews correlation is used to evaluate CoLA; Accuracy/F1 is used for the other datasets. $Δ$% represents the relative improvement over the baseline. Better results than BERT are in bold. No substantial improvements are observed using ExpDecay.

D.2 Singular Value Distribution

The effects of dataset size on NLI dataset

We highlight the different singular value distributions in QNLI and RTE, two datasets for the language inference task (see Figure 8).

Figure 8: The CDF of singular values on the QNLI (left) and RTE (right) datasets, derived from vanilla BERT. For the same percentage 0.8, the larger dataset QNLI has smaller $Δ L_{i}$ across all the layers, which indicates a more serious token uniformity issue.

BERT-Based Model Results

For the BERT-based model, we show the CDF of singular values on all the evaluated datasets in Figure 9. We observe that by applying SoftDecay (bottom row of Figure 9), the CDF of singular values in the last layer becomes flatter compared to that in vanilla BERT (top row of Figure 9).
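CDF curves of this kind can be computed for any matrix of layer outputs along the following lines. This is a sketch under assumptions: the singular values are normalised by the largest one (the exact normalisation used for the figures is not restated here), a random matrix stands in for a real layer output, and the helper name `singular_value_cdf` is hypothetical.

```python
import numpy as np


def singular_value_cdf(H: np.ndarray):
    """Return ascending normalised singular values x and the empirical CDF F(x)."""
    s = np.linalg.svd(H, compute_uv=False)
    x = np.sort(s / s.max())                  # ascending; the largest value is 1
    F = np.arange(1, x.size + 1) / x.size     # fraction of singular values <= x
    return x, F


rng = np.random.default_rng(0)
H = rng.normal(size=(64, 32))                 # toy stand-in for one layer's output
x, F = singular_value_cdf(H)
print(x[:3], F[:3])
```

A more skewed spectrum puts most of the normalised singular values near zero, so $F(x)$ rises steeply at small $x$; a less skewed spectrum shifts the curve towards larger $x$.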

ALBERT-Based and DistilBERT-based Model Results

We also show the results for ALBERT (Figure 10 and Figure 11) and DistilBERT (Figure 12 and Figure 13). By comparing with the vanilla PTLMs (the top row of each figure), we notice that the application of SoftDecay has a larger impact on ALBERT than on DistilBERT, especially on the CoLA dataset. For DistilBERT, the feature space becomes anisotropic gradually as the layers go deeper.

Figure 9: Cumulative distribution function (CDF) of singular value distributions. The upper ones are from vanilla BERT, bottom ones are from BERT+SoftDecay. From left to right, the evaluation datasets are SST-2, MRPC, QNLI and CoLA. Different curves represent distributions derived from different model layers. The x-axis represents the normalised singular values sorted in an ascending order. SoftDecay adjusts the anisotropy of the feature space with the effect more noticeable in MRPC and less obvious in QNLI.

Figure 10: CDF of SST-2, MRPC and QNLI datasets. The upper row results are from the vanilla ALBERT, the bottom ones are from ALBERT+SoftDecay.

Figure 12: CDF of SST-2, MRPC and QNLI datasets. The upper row results are from the vanilla DistilBERT, the bottom ones are from DistilBERT+SoftDecay.
