ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers

Yutong Xie & Jianpeng Zhang & Yong Xia & Anton van den Hengel & Qi Wu
The University of Adelaide.
DAMO Academy, Alibaba Group.
Northwestern Polytechnical University.
{yutong.xie678}@gmail.com; {qi.wu01}@adelaide.edu.au
Corresponding author.
Abstract

Although Transformers have successfully transitioned from their language modelling origins to image-based applications, their quadratic computational complexity remains a challenge, particularly for dense prediction. In this paper we propose a content-based sparse attention method, as an alternative to dense self-attention, aiming to reduce the computation complexity while retaining the ability to model long-range dependencies. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost. Besides, we further extend the clustering-guided attention from single-scale to multi-scale, which is conducive to dense prediction tasks. We label the proposed Transformer architecture ClusTR, and demonstrate that it achieves state-of-the-art performance on various vision tasks but at lower computational cost and with fewer parameters. For instance, our ClusTR small model with 22.7M parameters achieves 83.2% Top-1 accuracy on ImageNet. Source code and ImageNet models will be made publicly available.

1 Introduction

Transformers have driven rapid progress in natural language processing, and have become the predominant model in the field as a result (Vaswani et al., 2017; Brown et al., 2020). The first Transformer to achieve image recognition performance comparable to the firmly established CNN models (e.g., ResNet (He et al., 2016) and EfficientNet (Tan and Le, 2019)) was ViT (Dosovitskiy et al., 2021). ViT splits images into patches, resulting in a sequence of visual tokens. In contrast to the local receptive fields of CNNs, each token in ViT is able to interact with every other token, irrespective of location, thus enabling the modelling of long-range dependencies.

Although its strength has been demonstrated in various tasks, ViT still suffers from quadratic complexity in both computation and memory due to the dense token-to-token self-attention. This particularly hinders its application to dense prediction, such as semantic segmentation. Inspired by CNN models (Krizhevsky et al., 2012; Szegedy et al., 2015; He et al., 2016), recent research (Liu et al., 2021; Wang et al., 2021; Heo et al., 2021; Chu et al., 2021) has developed pyramid architectures for Transformers. Varying the token length and the number of channels across stages enables greater computational and memory efficiency. To further reduce complexity, Swin Transformer (Liu et al., 2021) limited self-attention to a local window, and enabled cross-window connection through window shifting. This means the computational burden scales linearly with the number of tokens, but at the cost of long-range dependencies. Pyramid Vision Transformer (PVT) (Wang et al., 2021) reduced the spatial dimension of keys and values using large-kernel, large-stride convolution. Such spatial-reduction attention suffers from two drawbacks. First, the reduced tokens lack fine-grained information. As shown in Figure 1, a downsampled token covers a wide range of content. Taking the token located in the second row and second column as an example, the “woman” occupies only a small part of the token, which also contains a small part of the “child” and a large region of “sky”. This may lead to ambiguous semantics for such tokens. Second, background tokens, such as the sky and the beach, take up a large portion of the entire sequence; they carry redundant information yet consume most of the computation. These deficiencies may degrade performance.

Figure 1: Comparison of grid-based self-attention and our clustering-guided self-attention. The downsampled grid tokens not only weaken the fine-grained information but also suffer from the redundant information caused by the background or large objects. With clustering, the reduced tokens are endowed with explicit and concentrated semantics. A large object like “sky” occupies only one token of the entire reduced sequence, so the other tokens can carry rich and diverse semantic information.

In this paper, we propose a content-based sparse attention to build an efficient and versatile vision Transformer. We aim to reduce the self-attention complexity by cutting down the number of key and value tokens. Unlike the grid-based downsampling used by Wang et al. (2021, 2022b), shown in Figure 1, we cluster the tokens according to their content similarity and aggregate the tokens in the same cluster into one representative token. This offers two merits for the vision domain: the clustered tokens carry rich and explicit semantic information that is not diluted by the background or other large objects; and the number of reduced tokens can be set more flexibly than with grid downsampling. We then use the query tokens together with the clustered key and value tokens to perform self-attention, which we call clustering-guided self-attention. Moreover, multi-scale information plays a critical role in vision tasks (Zhang et al., 2021; Chen et al., 2021a; Ren et al., 2022). We can easily obtain multi-scale aggregated tokens by adjusting the number of clusters, and we further introduce a multi-scale self-attention to strengthen multi-scale long-range dependencies. We name the proposed Transformer ClusTR after its clustering operation, and scale it to three variants, ClusTR-T/S/B, corresponding to the tiny, small, and base models.

We demonstrate the effectiveness of our clustering-based self-attention on several tasks, including classification, segmentation, detection, and pose estimation. The experimental results show that ClusTR outperforms competitive CNN-based and Transformer-based counterparts. For instance, ClusTR achieves 83.2% and 84.1% Top-1 accuracy on ImageNet with 22.7M and 40.3M parameters, respectively. Our contributions are summarized as follows:

  • We propose a content-based self-attention Transformer that clusters and aggregates the visual tokens according to their semantic information. Our clustering-guided self-attention not only maintains the ability for long-range context modelling but also reduces the quadratic computation complexity.

  • We introduce the multi-scale modelling to our clustering-guided self-attention, leading to the multi-scale self-attention that brings benefits to various dense prediction tasks.

  • Our ClusTR, as a versatile Transformer backbone, achieves state-of-the-art performance on four representative vision tasks: classification, segmentation, detection, and pose estimation.

2 Related work

Vision Transformer. The Transformer, a dominant architecture in language modelling, has recently been extended to computer vision. Dosovitskiy et al. (2021) designed the Vision Transformer (ViT), which achieved performance comparable or even superior to competitive CNN counterparts on image recognition tasks. Subsequently, many attempts have been made to explore the potential of Transformers in various vision tasks, including segmentation (Zheng et al., 2021), detection (Carion et al., 2020), low-level vision (Chen et al., 2021b), and image generation (Jiang et al., 2021). ViT is known to depend heavily on large amounts of data due to its weak inductive bias (d’Ascoli et al., 2021). DeiT (Touvron et al., 2021a) uses an efficient Transformer optimization strategy that distils from another strong classifier to reduce data consumption. T2T-ViT (Yuan et al., 2021) models the local image structure via a Tokens-to-Token (T2T) transformation. CaiT (Touvron et al., 2021b) uses layer scaling to increase the stability of the optimization when training large-scale Transformers. Although they achieve record performance on ImageNet (Deng et al., 2009), these methods suffer from the quadratic complexity of dense self-attention, which puts them at a disadvantage relative to CNN models for dense prediction tasks or high-resolution images. Inspired by CNN models, the community has developed pyramid Transformer structures (Wang et al., 2021; Liu et al., 2021; Heo et al., 2021; Li et al., 2021; Chu et al., 2021; Chen et al., 2022a; Ren et al., 2022), which break with established patterns in ViT, such as the fixed token length and fixed channel count. These methods can serve as versatile backbones for both image classification and dense prediction tasks.
Among these pyramid Transformer variants, downsampling tokens at each stage is a common but essential operation, which is implemented by convolution with strides (Wang et al., 2021, 2022b), patch merging with linear projections (Liu et al., 2021), or clustering-based patch embedding (Zeng et al., 2022a).

Efficient sparse self-attention. Unlike language models, vision Transformers usually accept a large number of visual tokens, resulting in high computational complexity when computing dense self-attention, especially on high-resolution images. Efficient sparse self-attention, an alternative to vanilla dense self-attention, allows for arbitrary sparsity patterns instead of interacting with all tokens in a dense manner. It can be categorized into location-based and content-based methods. Location-based sparse attention assumes that not all tokens are valuable and thus only interacts with a portion of the tokens. Typical examples in this category include sliding local-window attention, global attention, and combinations of the two (Beltagy et al., 2020; Zaheer et al., 2020), which have been widely explored in language modelling. In the vision field, Liu et al. (2021) achieved efficient self-attention by limiting self-attention to a local region and connecting regions through shifting windows. Wang et al. (2021) reduced the key and value tokens by aggregating each local region into a single token through convolution with large kernels and large strides. However, these hand-designed sparsity patterns do not necessarily match the characteristics of the data, which may hurt performance. To address this issue, content-based sparse attention partitions the tokens according to their content correlation. Roy et al. (2021) clustered the tokens using the $k$-means algorithm and performed self-attention within each cluster. Kitaev et al. (2019) presented an efficient locality-sensitive hashing scheme to divide tokens into chunks. Wang et al. (2022a) proposed the $k$NN attention, which selects the top-$k$ tokens from the keys for each query and ignores the rest when computing the attention matrix, thus filtering out noisy tokens and speeding up training.
Although sparse attention has been studied in these attempts, our ClusTR differs in the following aspects: 1) Compared with Wang et al. (2021) and Liu et al. (2021), ClusTR breaks the rigid rules of grid-based token aggregation and makes full use of the token representation for efficient vision modelling. 2) Roy et al. (2021), Kitaev et al. (2019), and Wang et al. (2022a) limited the range of self-attention to achieve efficiency, so that only similar tokens in the same cluster can communicate with each other. In contrast, ClusTR removes this constraint on the self-attention range and explores global attention patterns over the diverse clustered tokens. 3) Moreover, with the proposed multi-scale attention, ClusTR is superior to these single-scale attention methods on dense prediction tasks.

3 Method

As an efficient vision Transformer, ClusTR differs from its counterparts in the self-attention mechanism. As shown in Figure 2, we group vision tokens and aggregate the semantically similar tokens in the same cluster, aiming to reduce the computational complexity of self-attention. The clustering-guided self-attention extends easily to a multi-scale version, which benefits from multi-scale aggregation. In the following, we delve into the ClusTR self-attention and architecture details.

Figure 2: Comparison of self-attention in ViT, Swin, PVT and our proposed method. ViT performs the dense token-to-token self-attention; Swin Transformer divides all tokens into several windows and performs the window-based self-attention; PVT aggregates tokens in a grid by using strided convolution. The proposed method groups vision tokens according to the feature similarity, resulting in compact but semantic tokens for efficient self-attention.

3.1 kNN-based Density Peaks Clustering

We denote the set of vision tokens as $X = \{x_i\}_{i=1}^{N} \in \mathbb{R}^{N \times C}$, where $N$ and $C$ represent the number of tokens and the channel dimension of each token, respectively. Following Rodriguez and Laio (2014), we characterize cluster centers as tokens that have a higher density than their neighbors and lie at a relatively large distance from other tokens with higher densities. For a token $x_i$, its local density $\rho_i$ is defined as

$$\rho_i = \exp\Big(-\frac{1}{k}\sum_{x_j \in \mathrm{kNN}(x_i)} d(x_i, x_j)^2\Big) \tag{1}$$

where $d(x_i, x_j)$ refers to the Euclidean distance between $x_i$ and $x_j$, and $\mathrm{kNN}(x_i)$ is the set of the $k$ nearest neighbors of $x_i$. Here, we also define another variable $\delta_i$ for the token $x_i$, which measures the distance between $x_i$ and other high-density tokens.

$$\delta_i = \begin{cases} \min_{j:\,\rho_j > \rho_i} d(x_i, x_j), & \text{if } \exists\, j \text{ such that } \rho_j > \rho_i \\ \max_{j} d(x_i, x_j), & \text{otherwise} \end{cases} \tag{2}$$

If $x_i$ is a cluster center, its local density $\rho_i$ should be higher than that of its neighbors. Besides, it should also lie at a relatively large distance from other higher-density tokens. To this end, a decision value $\gamma_i = \rho_i \times \delta_i$ can be computed to locate the density peaks efficiently. The cluster centers are the tokens with both a large density $\rho_i$ and a large distance $\delta_i$. After that, each remaining token is assigned to the same cluster as its nearest token with higher density. Based on the cluster index, we can partition all tokens in $X$ into $M$ clusters, denoted by $\{\mathcal{C}_m\}_{m=1}^{M}$.

The tokens in the same cluster are aggregated to generate a cluster representative token, formulated by

$$\hat{x}_m = \frac{\sum_{x_i \in \mathcal{C}_m} w_i x_i}{\sum_{x_i \in \mathcal{C}_m} w_i}, \quad m = 1, \dots, N/r \tag{3}$$

where $r$ is the token reduction ratio, $M = N/r$, and $w_i$ is the learnable parameter for each token $x_i$. Note that the number of aggregated cluster representative tokens $M$ is far smaller than that of the original visual tokens $N$, i.e., $M \ll N$. Such clustering-guided token aggregation condenses a large number of visual tokens, which benefits the efficiency of self-attention.
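The clustering and aggregation steps of this subsection can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the density is taken as the exponential of the negative mean squared distance to the $k$ nearest neighbors, the aggregation is a normalized weighted average, and the per-token weights (learnable in the paper) are passed in as plain numbers. The names `dpc_knn` and `aggregate` are hypothetical.

```python
import math

def dpc_knn(tokens, num_clusters, k):
    """Label each token with a cluster index in [0, num_clusters). Needs >= 2 tokens."""
    n = len(tokens)
    dist = [[math.dist(a, b) for b in tokens] for a in tokens]

    # Assumed density: exp(-mean squared distance to the k nearest neighbours).
    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(d * d for d in nearest) / len(nearest)))

    # delta: distance to the nearest denser token (largest distance if none exists).
    delta, parent = [], []
    for i in range(n):
        higher = [j for j in range(n) if rho[j] > rho[i]]
        p = min(higher, key=lambda j: dist[i][j]) if higher else i
        parent.append(p)
        delta.append(dist[i][p] if higher else max(dist[i]))

    # Cluster centres are the tokens with the largest decision value rho * delta.
    centres = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)[:num_clusters]
    label = {c: m for m, c in enumerate(centres)}

    # Visit tokens from dense to sparse so every parent is labelled before its child.
    for i in sorted(range(n), key=lambda i: rho[i], reverse=True):
        if i in label:
            continue
        if parent[i] != i:
            label[i] = label[parent[i]]
        else:  # density tie at the top: fall back to the nearest centre
            label[i] = label[min(centres, key=lambda c: dist[i][c])]
    return [label[i] for i in range(n)]

def aggregate(tokens, labels, num_clusters, weights=None):
    """Weighted average of the tokens in each cluster (uniform weights by default)."""
    weights = weights or [1.0] * len(tokens)
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    totals = [0.0] * num_clusters
    for x, m, w in zip(tokens, labels, weights):
        totals[m] += w
        for c in range(dim):
            sums[m][c] += w * x[c]
    return [[s / t for s in row] for row, t in zip(sums, totals)]

# Two well-separated groups of 2-D "tokens", each with a dense centre.
blob_a = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
blob_b = [(x + 10, y + 10) for x, y in blob_a]
tokens = blob_a + blob_b
labels = dpc_knn(tokens, num_clusters=2, k=2)
reps = aggregate(tokens, labels, num_clusters=2)
```

In this toy input the two dense centres are selected as cluster centres, every other token inherits the label of its centre, and the ten tokens collapse into two representative tokens.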

3.2 Clustering-Guided Self-Attention

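The attention module is one of the core components of the Transformer. Following Vaswani et al. (2017), most Transformers and their variants apply the multi-head self-attention mechanism to model long-range dependencies. For each head, the query $Q$, key $K$, and value $V$ have the size of $N \times d$. The scaled dot-product attention can be formulated as

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\Big(\frac{QK^\top}{\sqrt{d}}\Big)V \tag{4}$$

where $\sqrt{d}$ is the scaling factor. Although the above self-attention can be implemented efficiently using highly optimized matrix multiplication, it still suffers from high computation complexity, i.e., $\mathcal{O}(N^2)$, especially for the abundant vision tokens. To address this issue, we propose a clustering-guided efficient self-attention that clusters and aggregates the semantically similar tokens in the same cluster to reduce the computation complexity. Based on the clustering algorithm in Sec. 3.1, the proposed efficient self-attention is reformulated as

$$\mathrm{Attention}(Q, \hat{K}, \hat{V}) = \mathrm{Softmax}\Big(\frac{Q\hat{K}^\top}{\sqrt{d}}\Big)\hat{V} \tag{5}$$

where $\hat{K}$ and $\hat{V}$, of size $(N/r) \times d$, are generated from the cluster representative tokens. After clustering, the number of key and value tokens is reduced by a factor of $r$ (the token reduction ratio), reducing the computation complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N^2/r)$. Based on the single-head attention, the multi-head attention can be implemented in parallel as

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O, \quad \mathrm{head}_i = \mathrm{Attention}(XW_i^Q, \hat{X}W_i^K, \hat{X}W_i^V) \tag{6}$$

where $\mathrm{Concat}(\cdot)$ refers to the concatenation operation, and $W^O$ aggregates the feature representation of the $h$ attention heads through a linear projection. $W_i^Q$, $W_i^K$, and $W_i^V$ are linear projections to generate query, key, and value tokens; $X$ denotes the input tokens and $\hat{X}$ the cluster representative tokens.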
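To make the cost saving concrete, the following single-head sketch attends from all $N$ query tokens to only $M = N/r$ cluster representatives. It is a simplified illustration, not the paper's implementation: the query, key, and value projections are taken to be the identity, the cluster assignment is supplied by hand, and a plain per-cluster mean stands in for the learned weighted aggregation. The helper names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; keys/values may be fewer than queries."""
    scale = math.sqrt(len(queries[0]))
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in keys]
        weights = softmax(scores)
        output.append([sum(w * v[c] for w, v in zip(weights, values))
                       for c in range(len(values[0]))])
    return output

def cluster_means(tokens, labels, num_clusters):
    """Stand-in for the learned weighted aggregation: a plain per-cluster mean."""
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for x, m in zip(tokens, labels):
        counts[m] += 1
        for c in range(dim):
            sums[m][c] += x[c]
    return [[s / n for s in row] for row, n in zip(sums, counts)]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # N = 4 tokens
labels = [0, 0, 1, 1]                                      # assumed clustering, r = 2
reps = cluster_means(tokens, labels, num_clusters=2)       # M = N / r = 2 tokens
out = attention(tokens, reps, reps)                        # N x M scores, not N x N

dense_scores = len(tokens) * len(tokens)     # entries in the dense score matrix
clustered_scores = len(tokens) * len(reps)   # entries after clustering
```

Every query still receives an output, so the output keeps $N$ rows, while the score matrix shrinks from $N \times N$ to $N \times N/r$ entries.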

3.3 Multi-Scale Self-Attention

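Here we extend the proposed clustering-guided self-attention from single-scale to multi-scale. For the multi-scale aggregation, we replace the single $r$ in Eq. 3 with a set of factors $\{r_s\}_{s=1}^{S}$, where $S$ refers to the number of scales. Then, the multi-scale clustering can be described as

$$\hat{X}^{(s)} = \{\hat{x}^{(s)}_m\}_{m=1}^{N/r_s}, \quad s = 1, \dots, S \tag{7}$$

where each $\hat{x}^{(s)}_m$ is obtained from Eq. 3 with $r = r_s$. The computational complexity of multi-scale attention is $\mathcal{O}\big(\sum_{s=1}^{S} N^2 / r_s\big)$. The multi-head multi-scale clustering-guided self-attention can be described as

$$\mathrm{MultiHead}(X) = \mathrm{Concat}\big(\{\mathrm{head}_{i,s}\}_{i,s}\big)\,W^O, \quad \mathrm{head}_{i,s} = \mathrm{Attention}\big(XW_{i,s}^Q, \hat{X}^{(s)}W_{i,s}^K, \hat{X}^{(s)}W_{i,s}^V\big) \tag{8}$$

where the linear projection $W^O$ is used to aggregate the feature representation of attention heads and scales.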
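A minimal sketch of the multi-scale idea follows: the same queries attend to cluster representatives at several reduction ratios, and the per-scale outputs are concatenated along the channel axis. The assumptions are loud ones: the representatives for each scale are supplied directly rather than produced by clustering, all projections are the identity, and the final linear projection that fuses heads and scales is omitted. The function names are hypothetical.

```python
import math

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over a (possibly short) key list."""
    scale = math.sqrt(len(queries[0]))
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        output.append([sum((e / total) * v[c] for e, v in zip(exps, values))
                       for c in range(len(values[0]))])
    return output

def multi_scale_attention(queries, reps_per_scale):
    """Attend to each scale's representatives and concatenate along channels."""
    per_scale = [attend(queries, reps, reps) for reps in reps_per_scale]
    return [[v for out in per_scale for v in out[i]] for i in range(len(queries))]

queries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # N = 4 tokens
coarse = [[0.5, 0.5]]                                       # r = 4 -> 1 representative
fine = [[0.95, 0.05], [0.05, 0.95]]                         # r = 2 -> 2 representatives
fused = multi_scale_attention(queries, [coarse, fine])

# Score-matrix entries summed over scales: N * (N/4) + N * (N/2).
cost = sum(len(queries) * len(reps) for reps in (coarse, fine))
```

With a single coarse representative, every query receives the same coarse context, while the fine scale still distinguishes the two groups of tokens; concatenation keeps both views available to a subsequent projection.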

3.4 ClusTR Transformer Architecture

The architecture of our ClusTR.
Figure 3: The architecture of our ClusTR.

The basic ClusTR model is composed of four stages, as shown in Figure 3. We follow Ren et al. (2022) and employ the overlapped patch embedding at the beginning of each stage to model local continuity. Based on the clustering-guided self-attention, the Transformer block of ClusTR can be computed as

$$X' = X + \mathrm{MSA}(\mathrm{LN}(X)), \qquad X'' = X' + \mathrm{FFN}(\mathrm{LN}(X')) \tag{9}$$

where MSA is the clustering-guided multi-head self-attention, LN is the layer normalization, and FFN is the fully connected feedforward network. Note that the token reduction ratio $r$ can be set to any value during the clustering process. To balance efficiency and accuracy, we set $r$ to $\{64, 16\}$, $\{16, 4\}$, $\{4, 1\}$, and $1$ from the first to the last stage, respectively (Table 1). We build the tiny model, called ClusTR-T, which has a similar model size and computation complexity to PVT-Tiny/PVTv2-B1. Based on this, we scale up ClusTR-T to the small and base variants, called ClusTR-S and ClusTR-B, which raise the model size and computation cost to 22.7M parameters / 4.8G FLOPs and 40.2M parameters / 7.5G FLOPs, respectively, compared with 11.7M / 2.2G for the tiny version (Table 2). The specific architecture details and hyper-parameters can be found in Table 1.

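| Stage | Output size | ClusTR-T (L / C / H / R) | ClusTR-S (L / C / H / R) | ClusTR-B (L / C / H / R) |
|---|---|---|---|---|
| Stage1 | W/4 × H/4 | 1 / 64 / 1 / {64,16} | 3 / 64 / 1 / {64,16} | 3 / 64 / 1 / {64,16} |
| Stage2 | W/8 × H/8 | 2 / 128 / 2 / {16,4} | 5 / 128 / 2 / {16,4} | 5 / 128 / 2 / {16,4} |
| Stage3 | W/16 × H/16 | 6 / 256 / 4 / {4,1} | 13 / 256 / 4 / {4,1} | 18 / 320 / 5 / {4,1} |
| Stage4 | W/32 × H/32 | 1 / 512 / 8 / 1 | 2 / 512 / 8 / 1 | 3 / 512 / 8 / 1 |

Table 1: Architecture details of ClusTR variants. Here ‘L, C, H’ represents the number of Transformer layers, channels, and heads, respectively, and ‘R’ is the set of token reduction ratios used in each stage.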
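As a worked example of what the reduction ratios in Table 1 imply, the short script below counts the query tokens and the clustered key/value tokens per stage. It assumes a 224 × 224 input (the resolution listed in Table 2) and that a ratio $r$ yields $N/r$ cluster representatives; the function name `kv_token_counts` is hypothetical.

```python
def kv_token_counts(image_size, stride, ratios):
    """Return (number of query tokens, clustered key/value tokens per ratio)."""
    side = image_size // stride
    n = side * side
    return n, [n // r for r in ratios]

# (downsampling stride, reduction ratios) per stage, read from Table 1.
stages = {
    "Stage1": (4, (64, 16)),
    "Stage2": (8, (16, 4)),
    "Stage3": (16, (4, 1)),
    "Stage4": (32, (1,)),
}

counts = {name: kv_token_counts(224, stride, ratios)
          for name, (stride, ratios) in stages.items()}
for name, (n, kv) in counts.items():
    print(name, n, kv)
# Stage1 3136 [49, 196]
# Stage2 784 [49, 196]
# Stage3 196 [49, 196]
# Stage4 49 [49]
```

Under these assumptions, the ratios are chosen so that every stage attends to the same small number of key/value tokens (49 and 196) regardless of how many query tokens it has.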

4 Experiment

We evaluate ClusTR on four representative computer vision tasks, including image classification, semantic segmentation, object detection, and pose estimation. We also investigate the effectiveness of each part of ClusTR in the ablation section.

4.1 Classification on ImageNet-1K

Dataset: We conduct image classification experiments on the ImageNet-1K dataset (Deng et al., 2009), which includes 1.28 million training images and 50K validation images from 1,000 categories.

Setting: We randomly crop 224 × 224 regions as the input. Following Wang et al. (2022b), we apply a rich set of data augmentations to diversify the training set, including random cropping, random flipping, random erasing, label-smoothing regularization, CutMix, and Mixup. We adopt the AdamW optimizer (Loshchilov and Hutter, 2018) with a cosine decaying learning rate (Loshchilov and Hutter, 2017), a momentum of 0.9, and a weight decay of 0.05, to train our ClusTR model. We set the initial learning rate to 0.001, the batch size to 1024, and the number of epochs to 300, which are common choices for ImageNet training. At inference time, we take a 224 × 224 center crop as the input and adopt Top-1 accuracy as the evaluation metric.
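For reference, a cosine-decayed learning rate with the values quoted above (initial rate 0.001, 300 epochs) can be sketched as follows. The paper does not state a warm-up phase or a minimum learning rate for classification, so this sketch assumes no warm-up and a floor of zero; `cosine_lr` is a hypothetical helper, not part of any library.

```python
import math

def cosine_lr(epoch, base_lr=1e-3, total_epochs=300, min_lr=0.0):
    """Cosine decay from base_lr at epoch 0 to min_lr at total_epochs."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_lr(e) for e in (0, 75, 150, 225, 300)]
```

The rate starts at the initial value, passes through half of it at the midpoint of training, and reaches the floor at the final epoch.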

Results: In Table 2, we compare ClusTR to other advanced backbones based on ConvNets, MLPs, and Transformers. ClusTR outperforms the state-of-the-art Transformer-based architectures with comparable or fewer parameters and computation budgets, surpassing Swin Transformer by 1.9% (ClusTR-S 83.2 vs. Swin-T 81.3) and PVTv2 by 1.5% (ClusTR-T 80.2 vs. PVTv2-b1 78.7). Compared to the state-of-the-art ConvNet-based methods, ClusTR strikes a better balance between accuracy and complexity. With a similar complexity budget, ClusTR achieves a 1.1% gain over ConvNets (ClusTR-S 83.2 vs. ConvNeXt-T 82.1). With comparable accuracy (ClusTR-S 83.2 vs. ConvNeXt-S 83.1), ClusTR roughly halves the model complexity of ConvNeXt (ClusTR-S 22.7M/4.8G vs. ConvNeXt-S 50M/8.7G). This advantageous accuracy-complexity trade-off also holds when compared to MLP-based methods.

| Methods | Resolution | Params. (M) | FLOPs (G) | Top-1 (%) | Reference |
|---|---|---|---|---|---|
| ConvNets | | | | | |
| RegNetY-4G (Radosavovic et al., 2020) | 224 | 21.0 | 4.0 | 80.0 | CVPR20 |
| RegNetY-8G (Radosavovic et al., 2020) | 224 | 39.0 | 8.0 | 81.7 | CVPR20 |
| ConvNeXt-T (Liu et al., 2022) | 224 | 29.0 | 4.5 | 82.1 | CVPR22 |
| ConvNeXt-S (Liu et al., 2022) | 224 | 50.0 | 8.7 | 83.1 | CVPR22 |
| MLPs | | | | | |
| CycleMLP-T (Chen et al., 2022b) | 224 | 28.0 | 4.4 | 81.3 | ICLR22 |
| CycleMLP-S (Chen et al., 2022b) | 224 | 50.0 | 8.5 | 82.9 | ICLR22 |
| AS-MLP-T (Lian et al., 2022) | 224 | 28.0 | 4.4 | 81.3 | ICLR22 |
| AS-MLP-S (Lian et al., 2022) | 224 | 50.0 | 8.5 | 83.1 | ICLR22 |
| Transformers | | | | | |
| PVT-T (Wang et al., 2021) | 224 | 13.0 | 1.9 | 75.1 | ICCV21 |
| PVT-ACmix-T (Pan et al., 2022) | 224 | 13.0 | 2.0 | 78.0 | CVPR22 |
| PVTv2-b1 (Wang et al., 2022b) | 224 | 13.1 | 2.1 | 78.7 | CVM22 |
| QuadTree-B-b1 (Tang et al., 2022) | 224 | 13.6 | 2.3 | 80.0 | ICLR22 |
| ClusTR-T | 224 | 11.7 | 2.2 | 80.2 | Ours |
| PVT-S (Wang et al., 2021) | 224 | 24.5 | 3.8 | 79.8 | ICCV21 |
| Swin-T (Liu et al., 2021) | 224 | 29.0 | 4.5 | 81.3 | ICCV21 |
| Twins-SVT-S (Chu et al., 2021) | 224 | 24.0 | 2.9 | 81.7 | NeurIPS21 |
| PVTv2-b2 (Wang et al., 2022b) | 224 | 25.4 | 4.0 | 82.0 | CVM22 |
| HRViT-b2 (Gu et al., 2022) | 224 | 32.5 | 5.1 | 82.3 | CVPR22 |
| TCFormer (Zeng et al., 2022b) | 224 | 25.6 | 5.9 | 82.4 | CVPR22 |
| CrossFormer-S (Wang et al., 2022c) | 224 | 30.7 | 4.9 | 82.5 | ICLR22 |
| RegionViT-S (Chen et al., 2022a) | 224 | 30.6 | 5.3 | 82.6 | ICLR22 |
| CSWin-T (Dong et al., 2022) | 224 | 23.0 | 4.3 | 82.7 | CVPR22 |
| QuadTree-B-b2 (Tang et al., 2022) | 224 | 24.2 | 4.5 | 82.7 | ICLR22 |
| ClusTR-S | 224 | 22.7 | 4.8 | 83.2 | Ours |
| PVT-L (Wang et al., 2021) | 224 | 61.4 | 9.8 | 81.7 | ICCV21 |
| HRViT-b3 (Gu et al., 2022) | 224 | 37.9 | 5.7 | 82.8 | CVPR22 |
| Swin-S (Liu et al., 2021) | 224 | 50.0 | 8.7 | 83.0 | ICCV21 |
| RegionViT-M (Chen et al., 2022a) | 224 | 41.2 | 7.4 | 83.1 | ICLR22 |
| Twins-SVT-B (Chu et al., 2021) | 224 | 56.0 | 8.6 | 83.2 | NeurIPS21 |
| CrossFormer-B (Wang et al., 2022c) | 224 | 52.0 | 9.2 | 83.4 | ICLR22 |
| PVTv2-b4 (Wang et al., 2022b) | 224 | 62.6 | 10.1 | 83.6 | CVM22 |
| Quadtree-B-b3 (Tang et al., 2022) | 224 | 46.3 | 7.8 | 83.7 | ICLR22 |
| ClusTR-B | 224 | 40.2 | 7.5 | 84.1 | Ours |

Table 2: Image classification performance of different backbones on the ImageNet-1K validation set. Here ‘Params.’ refers to the number of model parameters, and FLOPs are calculated for an input size of 224 × 224.

4.2 Semantic Segmentation on ADE20K

Dataset: We conduct semantic segmentation experiments on the ADE20K dataset (Zhou et al., 2017), which includes 20,210 training images and 2,000 validation images from 150 fine-grained semantic categories.

Settings: We randomly resize and crop image patches as the input and set the batch size to 16. We employ ClusTR-S, pre-trained on ImageNet, as the backbone, and evaluate it with two segmentation architectures, i.e., Semantic FPN (Kirillov et al., 2019) and UperNet (Xiao et al., 2018b). The segmentation training process follows the default settings of Wang et al. (2022b) and Liu et al. (2021). When training the Semantic FPN, we adopt the AdamW optimizer (Loshchilov and Hutter, 2018) with an initial learning rate of 0.0001 and a weight decay of 0.0001, and set the number of iterations to 80K. As for UperNet, we adopt the AdamW optimizer with an initial learning rate of 0.00006 and a weight decay of 0.01, and set the number of iterations to 160K. We also warm up the model linearly for the first 1500 iterations. During the test, we re-scale the shorter side of the input image to pixels and adopt the mIOU metric for evaluation.

Results: As shown in Table 3, ClusTR outperforms the state-of-the-art backbones, both ConvNet-based and Transformer-based, with both the Semantic FPN and UperNet architectures. Compared to the ConvNet-based backbones, the proposed ClusTR achieves better segmentation performance (ClusTR 48.0 vs. ResNet 36.7 with Semantic FPN; ClusTR 49.6 vs. ConvNeXt 46.0 with UperNet) while using fewer parameters. Compared with the Transformer-based methods, ClusTR also achieves the best results with comparable or even fewer parameters, surpassing CrossFormer (Wang et al., 2022c) by 2.0% with Semantic FPN and MPViT (Lee et al., 2022) by 1.3% with UperNet.

| Methods | Semantic FPN 80K: Params. (M) | Semantic FPN 80K: mIOU (%) | UperNet 160K: Params. (M) | UperNet 160K: mIOU (%) |
|---|---|---|---|---|
| ResNet-50 (He et al., 2016) | 28.5 | 36.7 | - | - |
| PVT-S (Wang et al., 2021) | 28.2 | 39.8 | - | - |
| Swin-T* (Liu et al., 2021) | 31.9 | 41.5 | 59.9 | 44.5 |
| CycleMLP-b2 (Chen et al., 2022b) | 30.6 | 43.4 | - | - |
| ConvNeXt-T (Liu et al., 2022) | - | - | 60.0 | 46.0 |
| Twins-SVT-S (Chu et al., 2021) | 28.3 | 43.2 | 54.4 | 46.2 |
| RegionViT-S+ (Chen et al., 2022a) | 35.7 | 45.3 | - | - |
| CrossFormer-S (Wang et al., 2022c) | 34.3 | 46.0 | 62.3 | 47.6 |
| MPViT-S (Lee et al., 2022) | - | - | 52.0 | 48.3 |
| ClusTR-S (Ours) | 26.4 | 48.0 | 52.5 | 49.6 |

Table 3: Semantic segmentation performance of different backbones on the ADE20K validation set. Here ‘*’ indicates that the numbers are cited from the reproduced results of Twins.

4.3 Object detection on COCO

Dataset: We perform object detection and instance segmentation experiments on the COCO2017 dataset (Lin et al., 2014), which includes 118,287 training images and 5,000 validation images from 80 categories.

Settings: We use the ClusTR-S pre-trained on ImageNet as the backbone of two mainstream detectors, i.e., RetinaNet (Lin et al., 2017) and Mask R-CNN (He et al., 2017). We follow the default settings of PVTv2 (Wang et al., 2022b) and mmdetection (Chen et al., 2019). We adopt the AdamW optimizer with a batch size of 16, and perform the training schedule with 12 epochs. During training, we re-scale the shorter side of the input image to pixels while keeping the longer side no more than pixels. During test, the shorter side of input images is resized to pixels, and the bbox mAP (AP$^b$) and mask mAP (AP$^m$) are used as evaluation metrics.

Results: As shown in Table 4, with comparable or fewer parameters, our ClusTR model surpasses both ConvNet- and Transformer-based state-of-the-art backbones when using Mask R-CNN for object detection and instance segmentation. Compared to ConvNet backbones, our model outperforms ResNet (He et al., 2016) by 9.0 points in box AP and 8.1 points in mask AP. Compared to Transformer backbones, our model gains 6.6 box AP / 4.7 mask AP over PVT, and 4.8 box AP / 3.4 mask AP over Swin. Table 5 reports the detection performance of different backbones when using RetinaNet as the detector. Our model achieves 45.8 box AP with only 32.4M parameters, outperforming other competitors, especially in detecting small objects. These results are expected, since the proposed clustering-guided self-attention is able to pay equal attention to diverse objects regardless of their size, which is particularly beneficial for small objects.

| Methods | Params. (M) | AP$^b$ | AP$^b_{50}$ | AP$^b_{75}$ | AP$^m$ | AP$^m_{50}$ | AP$^m_{75}$ |
|---|---|---|---|---|---|---|---|
| ResNet-50 (He et al., 2016) | 44.2 | 38.0 | 58.6 | 41.4 | 34.4 | 55.1 | 36.7 |
| PVT-S (Wang et al., 2021) | 44.1 | 40.4 | 62.9 | 43.8 | 37.8 | 60.1 | 40.3 |
| Swin-T (Liu et al., 2021) | 47.8 | 42.2 | 64.6 | 46.2 | 39.1 | 61.6 | 42.0 |
| Twins-SVT-S (Chu et al., 2021) | 44.0 | 43.4 | 66.0 | 47.3 | 40.3 | 63.2 | 43.4 |
| CrossFormer-S (Wang et al., 2022c) | 50.2 | 45.4 | 68.0 | 49.7 | 41.4 | 64.8 | 44.6 |
| ClusTR-S (Ours) | 42.3 | 47.0 | 68.7 | 51.6 | 42.5 | 65.9 | 45.9 |

Table 4: Detection and instance segmentation performance of Mask R-CNN with different backbones on the COCO validation set.

| Methods | Params. (M) | AP | AP$_{50}$ | AP$_{75}$ | AP$_S$ | AP$_M$ | AP$_L$ |
|---|---|---|---|---|---|---|---|
| ResNet-50 (He et al., 2016) | 37.7 | 36.3 | 55.3 | 38.6 | 19.3 | 40.0 | 48.8 |
| PVT-S (Wang et al., 2021) | 34.2 | 40.4 | 61.3 | 43.0 | 25.0 | 42.9 | 55.7 |
| CycleMLP-b2 (Chen et al., 2022b) | 36.5 | 40.6 | 61.4 | 43.2 | 22.9 | 44.4 | 54.5 |
| Swin-T (Liu et al., 2021) | 38.5 | 41.5 | 62.1 | 44.2 | 25.1 | 44.9 | 55.5 |
| Twins-SVT-S (Chu et al., 2021) | 34.3 | 43.0 | 64.2 | 46.3 | 28.0 | 46.4 | 57.5 |
| RegionViT-B (Chen et al., 2022a) | 83.4 | 43.3 | 65.2 | 46.4 | 29.2 | 46.4 | 57.0 |
| CrossFormer-S (Wang et al., 2022c) | 40.8 | 44.4 | 65.8 | 47.4 | 28.2 | 48.4 | 59.4 |
| Shunted-S (Ren et al., 2022) | 32.1 | 45.4 | 65.9 | 49.2 | 28.7 | 49.3 | 60.0 |
| ClusTR-S (Ours) | 32.4 | 45.8 | 66.4 | 49.5 | 30.4 | 49.5 | 61.2 |

Table 5: Detection performance of RetinaNet with different backbones on the COCO validation set.

4.4 2D Whole-body Pose Estimation on COCO

Dataset: We perform pose estimation experiments on the COCOWholeBody V1.0 dataset (Jin et al., 2020), which contains 133 keypoints, including 17 for the body, 6 for the feet, 68 for the face, and 42 for the hands.

Implementation details: We follow the same settings in (Zeng et al., 2022a), and adopt the AdamW optimizer with an initial learning rate of 0.0005 (Loshchilov and Hutter, 2017), a momentum of 0.9, and a weight decay of 0.01. We set the batch size to 512, and the number of epochs to 210. The OKS-based Average Precision (AP) and Average Recall (AR) are used as evaluation metrics.

Results: In Table 6, we compare ClusTR with other advanced models on the COCOWholeBody V1.0 dataset. Our model achieves new state-of-the-art whole-body pose estimation performance (59.4% AP and 69.7% AR), outperforming the ConvNet-based HRNet by 4.1 AP and 7.1 AR, and surpassing the Transformer-based TCFormer by 2.2 AP and 1.9 AR.

| Methods | Resolution | body AP | body AR | foot AP | foot AR | face AP | face AR | hand AP | hand AR | whole AP | whole AR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ZoomNet* (Jin et al., 2020) | 384×288 | 74.3 | 80.2 | 79.8 | 86.9 | 62.3 | 70.1 | 40.1 | 49.8 | 54.1 | 65.8 |
| SBL-Res152* (Xiao et al., 2018a) | 256×192 | 68.2 | 76.4 | 66.2 | 78.8 | 62.4 | 72.8 | 48.2 | 60.6 | 54.8 | 66.1 |
| HRNet-w32* (Sun et al., 2019) | 256×192 | 70.0 | 74.6 | 56.7 | 64.5 | 63.7 | 68.8 | 47.3 | 54.6 | 55.3 | 62.6 |
| PVTv2-b2 (Wang et al., 2022b) | 256×192 | 69.6 | 77.3 | 69.0 | 80.3 | 64.9 | 74.8 | 54.5 | 65.9 | 57.5 | 68.0 |
| TCFormer (Zeng et al., 2022a) | 256×192 | 69.1 | 77.0 | 69.8 | 81.3 | 64.9 | 74.6 | 53.5 | 65.0 | 57.2 | 67.8 |
| ClusTR-S (Ours) | 256×192 | 71.4 | 78.8 | 73.3 | 83.8 | 66.5 | 75.7 | 55.9 | 67.1 | 59.4 | 69.7 |

Table 6: Pose estimation performance of different backbones on the COCOWholeBody V1.0 dataset. Here ‘*’ indicates that the numbers are cited from the reproduced results of TCFormer.

4.5 Ablations

We perform the following ablation experiments to further verify the effectiveness of ClusTR. All classification experiments are conducted based on ClusTR-T and the number of training epochs is set to 100. The segmentation experiments are conducted based on the pre-trained ClusTR-T and the Semantic FPN segmentation architecture.

Grid-based vs. clustering-guided token aggregation: Token aggregation is an important operation in the self-attention process that dramatically reduces the computation complexity. We compare the clustering-guided token aggregation to the convolution-based grid token aggregation. Following Wang et al. (2021), we use convolution with large strides to implement the grid token aggregation. All other settings are kept the same for a fair comparison. Table 7 shows that our clustering-guided method not only reduces the parameters and FLOPs, but also improves Top-1 accuracy by 0.5 points (grid-based 76.7 vs. clustering-guided 77.2).

| Token Aggregation | Params. (M) | FLOPs (G) | Top1 (%) |
|---|---|---|---|
| Grid-based | 13.2 | 2.1 | 76.7 |
| Clustering | 10.8 | 2.0 | 77.2 |

Table 7: Comparison of different token aggregation methods on the ImageNet-1K dataset.

Comparison with other sparse attentions: We also compare the clustering-guided attention with the spatial-reduction attention (SRA) used in PVT (Wang et al., 2021) and the k-NN based sparse attention (Wang et al., 2022a) in Table 8. The k-NN attention achieves a slight performance gain (0.1 points) over SRA without increasing parameters or FLOPs. Notably, ClusTR not only outperforms k-NN attention by 0.4 points but also uses about 18% fewer parameters (10.8M vs. 13.2M). This suggests that the proposed clustering-guided attention is better at modelling token-wise dependencies, leading to better performance.

| Methods | Params. (M) | FLOPs (G) | Top1 (%) |
|---|---|---|---|
| Spatial-reduction attention (Wang et al., 2021) | 13.2 | 2.1 | 76.7 |
| k-NN attention (Wang et al., 2022a) | 13.2 | 2.1 | 76.8 |
| ClusTR (Ours) | 10.8 | 2.0 | 77.2 |
Table 8: Comparison of different sparse attentions on the ImageNet-1K dataset.
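
For readers unfamiliar with the k-NN baseline, the following is a minimal single-head sketch of top-k sparse attention, in which each query attends only to its `top_k` highest-scoring keys. It is a generic illustration written for this comparison, not the KVT code, and the names `knn_attention` and `top_k` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def knn_attention(q, k, v, top_k):
    """Single-head attention where each query keeps only its top_k keys."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Column indices of the top_k largest scores in every row.
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    # Keep the selected scores and mask everything else with -inf.
    masked = np.full_like(scores, -np.inf)
    kept = np.take_along_axis(scores, idx, axis=-1)
    np.put_along_axis(masked, idx, kept, axis=-1)
    weights = softmax(masked, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.standard_normal((6, 8))    # 6 query tokens, dimension 8
k = rng.standard_normal((10, 8))   # 10 key tokens
v = rng.standard_normal((10, 8))   # 10 value tokens

out, weights = knn_attention(q, k, v, top_k=3)
print(out.shape, (weights > 0).sum(axis=1))
```

Note that this form of sparsity still computes the full query-key score matrix before masking, so it does not shrink the key and value sequence the way token aggregation does.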

Single-scale vs. multi-scale attention: In Table 9, we compare single-scale attention under two sets of reduction ratios with multi-scale attention. In the single-scale setting, the smaller reduction ratios preserve more detailed information and thus yield better accuracy, especially on dense prediction (+0.2 points on ImageNet and +0.6 points on segmentation). Multi-scale attention in turn outperforms both single-scale settings, by at least 0.5 points on ImageNet and at least 0.8 points on segmentation, at the cost of a slight increase in parameters (+0.9M) and FLOPs (+0.1G to +0.2G).

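| Attention | Ratio (Stage1) | Ratio (Stage2) | Ratio (Stage3) | Ratio (Stage4) | Params. (M) | FLOPs (G) | Top1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| Single-scale | 64 | 16 | 4 | 1 | 10.8 | 2.0 | 77.2 | 41.2 |
| Single-scale | 16 | 4 | 1 | 1 | 10.8 | 2.1 | 77.4 | 41.8 |
| Multi-scale | {64, 16} | {16, 4} | {4, 1} | 1 | 11.7 | 2.2 | 77.9 | 42.6 |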
Table 9: Comparison of single-scale and multi-scale attention on the ImageNet-1K and ADE-20K datasets.
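
To make the reduction ratios in Table 9 concrete, the sketch below counts how many aggregated key/value tokens each configuration would keep per stage. It rests on assumptions that are not stated in this section: a 224×224 input, a PVT-style pyramid with strides 4, 8, 16 and 32, and a reduction ratio that divides the token count (rather than the side length). The helper `kv_token_counts` is hypothetical.

```python
def kv_token_counts(image_size, strides, ratios):
    """Return, per stage, the number of key/value tokens kept at each ratio.

    Assumption: a ratio r reduces N tokens to N // r aggregated tokens.
    """
    counts = []
    for stride, stage_ratios in zip(strides, ratios):
        num_tokens = (image_size // stride) ** 2
        counts.append([num_tokens // r for r in stage_ratios])
    return counts


IMAGE_SIZE = 224           # assumption: ImageNet-style input resolution
STRIDES = (4, 8, 16, 32)   # assumption: PVT-style four-stage pyramid

single_coarse = kv_token_counts(IMAGE_SIZE, STRIDES, [(64,), (16,), (4,), (1,)])
single_fine = kv_token_counts(IMAGE_SIZE, STRIDES, [(16,), (4,), (1,), (1,)])
multi_scale = kv_token_counts(IMAGE_SIZE, STRIDES, [(64, 16), (16, 4), (4, 1), (1,)])

print("single-scale (64, 16, 4, 1):", single_coarse)
print("single-scale (16, 4, 1, 1): ", single_fine)
print("multi-scale:                ", multi_scale)
```

Under these assumptions the coarser single-scale setting keeps 49 key/value tokens in every stage, the finer one keeps 196 in the first three stages, and the multi-scale setting has both granularities available in those stages. How the two scales are combined inside the attention layer is described in Section 3.4 and is not modelled here.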

5 Conclusion

Dense self-attention in Transformers suffers from high computational complexity on vision tasks, especially in dense prediction scenarios or on high-resolution images. In this work, we propose a content-based sparse attention that clusters vision tokens and aggregates the tokens within each cluster. The clustering-guided self-attention not only reduces the computational complexity but also gives each aggregated token explicit and concentrated semantics, thus contributing to better performance. Moreover, we extend it from single-scale to multi-scale self-attention, which benefits dense prediction tasks. Based on the proposed self-attention method, we build a versatile Transformer model called ClusTR. Extensive experiments demonstrate the effectiveness of ClusTR, which achieves state-of-the-art performance on various vision tasks, including image recognition, semantic segmentation, object detection, and pose estimation.

References

  • I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv:2004.05150. Cited by: §2.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901. Cited by: §1.
  • N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European conference on computer vision, pp. 213–229. Cited by: §2.
  • C. Chen, R. Panda, and Q. Fan (2022a) RegionViT: regional-to-local attention for vision transformers. In International Conference on Learning Representations, Cited by: §2, Table 2, Table 3, Table 5.
  • C. R. Chen, Q. Fan, and R. Panda (2021a) Crossvit: cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 357–366. Cited by: §1.
  • H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao (2021b) Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12299–12310. Cited by: §2.
  • K. Chen, J. Wang, J. Pang, et al. (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §4.3.
  • S. Chen, E. Xie, G. Chongjian, R. Chen, D. Liang, and P. Luo (2022b) CycleMLP: a mlp-like architecture for dense prediction. In International Conference on Learning Representations, Cited by: Table 2, Table 3, Table 5.
  • X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen (2021) Twins: revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems 34, pp. 9355–9366. Cited by: §1, §2, Table 2, Table 3, Table 4, Table 5.
  • S. d’Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and L. Sagun (2021) Convit: improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, pp. 2286–2296. Cited by: §2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2, §4.1.
  • X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo (2022) Cswin transformer: a general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12124–12134. Cited by: Table 2.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §1, §2.
  • J. Gu, H. Kwon, D. Wang, W. Ye, M. Li, Y. Chen, L. Lai, V. Chandra, and D. Z. Pan (2022) Multi-scale high-resolution vision transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12094–12103. Cited by: Table 2.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §4.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §1, §4.3, Table 3, Table 4, Table 5.
  • B. Heo, S. Yun, D. Han, S. Chun, J. Choe, and S. J. Oh (2021) Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11936–11945. Cited by: §1, §2.
  • Y. Jiang, S. Chang, and Z. Wang (2021) Transgan: two pure transformers can make one strong gan, and that can scale up. Advances in Neural Information Processing Systems 34. Cited by: §2.
  • S. Jin, L. Xu, J. Xu, C. Wang, W. Liu, C. Qian, W. Ouyang, and P. Luo (2020) Whole-body human pose estimation in the wild. In European Conference on Computer Vision, pp. 196–214. Cited by: §4.4, Table 6.
  • A. Kirillov, R. Girshick, K. He, and P. Dollar (2019) Panoptic feature pyramid networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2.
  • N. Kitaev, L. Kaiser, and A. Levskaya (2019) Reformer: the efficient transformer. In International Conference on Learning Representations, Cited by: §2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25. Cited by: §1.
  • Y. Lee, J. Kim, J. Willette, and S. J. Hwang (2022) MPViT: multi-path vision transformer for dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7287–7296. Cited by: §4.2, Table 3.
  • Y. Li, K. Zhang, J. Cao, R. Timofte, and L. Van Gool (2021) Localvit: bringing locality to vision transformers. arXiv preprint arXiv:2104.05707. Cited by: §2.
  • D. Lian, Z. Yu, X. Sun, and S. Gao (2022) AS-mlp: an axial shifted mlp architecture for vision. In International Conference on Learning Representations, Cited by: Table 2.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §4.3.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.3.
  • Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022. Cited by: §1, §2, §2, §4.2, Table 2, Table 3, Table 4, Table 5.
  • Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, and S. Xie (2022) A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986. Cited by: Table 2, Table 3.
  • I. Loshchilov and F. Hutter (2017) Sgdr: stochastic gradient descent with warm restarts. In ICLR, Cited by: §4.1, §4.4.
  • I. Loshchilov and F. Hutter (2018) Fixing weight decay regularization in adam. Cited by: §4.1, §4.2.
  • X. Pan, C. Ge, R. Lu, S. Song, G. Chen, Z. Huang, and G. Huang (2022) On the integration of self-attention and convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 815–825. Cited by: Table 2.
  • I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár (2020) Designing network design spaces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10428–10436. Cited by: Table 2.
  • S. Ren, D. Zhou, S. He, J. Feng, and X. Wang (2022) Shunted self-attention via multi-scale token aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10853–10862. Cited by: §1, §2, §3.4, Table 5.
  • A. Rodriguez and A. Laio (2014) Clustering by fast search and find of density peaks. science 344 (6191), pp. 1492–1496. Cited by: §3.1.
  • A. Roy, M. Saffar, A. Vaswani, and D. Grangier (2021) Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics 9, pp. 53–68. Cited by: §2.
  • K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5693–5703. Cited by: Table 6.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1.
  • M. Tan and Q. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. Cited by: §1.
  • S. Tang, J. Zhang, S. Zhu, and P. Tan (2022) Quadtree attention for vision transformers. In International Conference on Learning Representations, Cited by: Table 2.
  • H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou (2021a) Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. Cited by: §2.
  • H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou (2021b) Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 32–42. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §1, §3.2.
  • P. Wang, X. Wang, F. Wang, M. Lin, S. Chang, W. Xie, H. Li, and R. Jin (2022a) Kvt: k-nn attention for boosting vision transformers. In European conference on computer vision, Cited by: §2, §4.5, Table 8.
  • W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 568–578. Cited by: §1, §1, §2, §2, §4.5, §4.5, Table 2, Table 3, Table 4, Table 5, Table 8.
  • W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao (2022b) Pvtv2: improved baselines with pyramid vision transformer. Computational Visual Media 8 (3), pp. 1–10. Cited by: §1, §2, §4.1, §4.2, §4.3, Table 2, Table 6.
  • W. Wang, L. Yao, L. Chen, B. Lin, D. Cai, X. He, and W. Liu (2022c) CrossFormer: a versatile vision transformer hinging on cross-scale attention. In International Conference on Learning Representations, Cited by: §4.2, Table 2, Table 3, Table 4, Table 5.
  • B. Xiao, H. Wu, and Y. Wei (2018a) Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV), pp. 466–481. Cited by: Table 6.
  • T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun (2018b) Unified perceptual parsing for scene understanding. In European Conference on Computer Vision, Cited by: §4.2.
  • L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, and S. Yan (2021) Tokens-to-token vit: training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558–567. Cited by: §2.
  • M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. (2020) Big bird: transformers for longer sequences. Advances in Neural Information Processing Systems 33, pp. 17283–17297. Cited by: §2.
  • W. Zeng, S. Jin, W. Liu, C. Qian, P. Luo, W. Ouyang, and X. Wang (2022a) Not all tokens are equal: human-centric visual analysis via token clustering transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11101–11111. Cited by: §2, §4.4, Table 6.
  • W. Zeng, S. Jin, W. Liu, C. Qian, P. Luo, W. Ouyang, and X. Wang (2022b) Not all tokens are equal: human-centric visual analysis via token clustering transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11101–11111. Cited by: Table 2.
  • P. Zhang, X. Dai, J. Yang, B. Xiao, L. Yuan, L. Zhang, and J. Gao (2021) Multi-scale vision longformer: a new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2998–3008. Cited by: §1.
  • S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al. (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6881–6890. Cited by: §2.
  • B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641. Cited by: §4.2.