gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window

Mocho Go, Hideyuki Tachibana

PKSHA Technology, Inc.
Hongo, Bunkyo City, Tokyo, Japan

Abstract

Following its success in the language domain, the self-attention mechanism (Transformer) has been adopted in the vision domain and has recently achieved great success. In a separate stream, the multi-layer perceptron (MLP) has also been explored in the vision domain. These architectures, which depart from traditional CNNs, have attracted considerable attention, and many methods have been proposed. To combine parameter efficiency and performance with the locality and hierarchy that matter in image recognition, we propose gSwin, which merges the two streams: Swin Transformer and (multi-head) gMLP. We show that gSwin achieves better accuracy than Swin Transformer on three vision tasks (image classification, object detection and semantic segmentation) with a smaller model size.

1 Introduction

Since the great success of AlexNet (Krizhevsky, Sutskever, and Hinton, 2012) in ILSVRC2012 ushered in the era of deep learning, convolutional neural networks (CNNs) (Fukushima, 1980; LeCun et al., 1989; Zeiler and Fergus, 2014; Szegedy et al., 2015; Simonyan and Zisserman, 2014; He et al., 2016; Tan and Le, 2019), which naturally take into account the spatial shift invariance and the scale hierarchy, have been dominant in the image domain for the past decade. On the other hand, looking outside the image domain, the Transformer architecture based on the stacked multi-head self-attention (MSA) (Vaswani et al., 2017) that directly connects one token to another has been proposed in the language domain, and Transformer-based models such as BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) have had great success. Following the success in the language domain, there has been active research in recent years on the application of transformers in the vision domain, viz. the Vision Transformer (ViT) (Dosovitskiy et al., 2021), with remarkably promising results.

A shortcoming of the Transformer is that its multi-head attention module is large, which makes it computationally expensive and data inefficient. For example, ViT has such a large capacity that it is not adequately trained on a medium-sized dataset such as ImageNet-1K (Deng et al., 2009). To remedy such problems, a number of improved ViTs and learning strategies have been proposed (Section 2.1). In this vein, some Transformer architectures that explicitly incorporate the visual hierarchy have been proposed, including the Swin Transformer (Liu et al., 2021b), motivated by the observation that the scale hierarchy, an important property of vision, is naturally incorporated in CNNs but not in Transformers (Section 2.3).

Another important stream of research was model simplification, in particular a series of methods using only MLPs (Tolstikhin et al., 2021; Touvron et al., 2021a; Liu et al., 2021a). These methods were proposed based on the question of whether self-attention modules are really necessary for the vision tasks and the fact that an MLP can approximate arbitrary functions (Hornik, Stinchcombe, and White, 1989) without special structures such as convolution or self-attention (Section 2.2).

In this paper, we propose gSwin as a method to reintegrate the two pathways that have evolved in different directions since branching off from ViT. Our gSwin is based on the basic framework of gMLP, which consists of MLPs and a simple gating mechanism, and it uses the same windowing scheme as Swin Transformer to capture the hierarchical structure of vision. According to our experiments, gSwin-T outperforms Swin-T by +0.4 top-1 accuracy on ImageNet-1K, +0.5 box AP and +0.4 mask AP on COCO, and +1.9 mIoU on ADE20K, and gSwin-S is competitive with Swin-S, while both gSwin models are smaller than their Swin Transformer counterparts.

2 Related Work

Figure 1: Comparison of existing vision models. Layers for specific downstream tasks are omitted for brevity.

2.1 Transformers for Vision

Although the mechanism of attention that focuses directly on a specific region in an image has actually been studied in the image domain for quite some time (Mnih et al., 2014; Xu et al., 2015; Parmar et al., 2018; Woo et al., 2018; Cordonnier, Loukas, and Jaggi, 2019; Hu, Shen, and Sun, 2018; Zhang et al., 2020), it is only recently that vision models built principally on self-attention mechanisms have been proposed (Carion et al., 2020; Wu et al., 2020; Dosovitskiy et al., 2021). Of these, ViT (Dosovitskiy et al., 2021) (Figure 1) produced competitive results and marked the beginning of research on vision models based on the Transformer architecture. Although it has only been a short time since they were proposed, Transformer for vision has become popular, and it has also been applied in related domains including video (Liu et al., 2021c; Arnab et al., 2021) and 3D point cloud (Guo et al., 2021b).

In the 2D vision domain, a vast amount of research to improve ViT has been actively conducted. One trend is oriented toward improving learning strategies, such as DeiT (Touvron et al., 2021b) and MoCo v3 (Chen, Xie, and He, 2021). Another direction is the improvement of model architectures, including CaiT (Touvron et al., 2021c), DeepViT (Zhou et al., 2021), HaloNet (Vaswani et al., 2021), etc. A major stream of research on improving model structure is the incorporation of the spatial structure of images, which has not been explicitly used in Transformers except for the customary use of positional encoding and the patch partitioning before the initial layer. One approach to incorporating spatial structure is to consider hybrid architectures with the auxiliary use of convolution, e.g. BoTNet (Srinivas et al., 2021), ConViT (d’Ascoli et al., 2021), LeViT (Graham et al., 2021), CvT (Wu et al., 2021), LocalViT (Li et al., 2021), etc. Others adopted a hierarchical structure, which vanilla Transformer models lack despite its importance in the vision domain (see Section 2.3), e.g. T2T-ViT (local token aggregation) (Yuan et al., 2021), HVT (pooling-based aggregation) (Pan et al., 2021), TNT (alternating local/global self-attention) (Han et al., 2021), MViT (fine-to-coarse aggregation using multi-head pooling attention) (Fan et al., 2021), PVT (pyramid structure for dense prediction) (Wang et al., 2021), ViL (efficient transformer on hierarchical structure) (Zhang et al., 2021), Swin Transformer (restricted self-attention in shifted windows) (Liu et al., 2021b), etc.

2.2 MLP Models for Vision

Looking back in history, neural models for image recognition have been based on the multi-layer perceptron (MLP) since the early days, when Rosenblatt (1958) proposed the perceptron in the 1950s. Nevertheless, in the modern context, MLP-Mixer (Tolstikhin et al., 2021), ResMLP (Touvron et al., 2021a) and gMLP (Liu et al., 2021a), which were proposed almost simultaneously, are the pioneers of the recent trend of using only axis-wise MLPs, or more precisely, axis-wise dense affine projections with non-linear activation functions.

Of these, MLP-Mixer and ResMLP simply apply axis-wise MLPs along the channel and spatial axes alternately. Following these studies, a number of vision models based on axis-wise MLPs have been proposed recently. One of the current trends is to construct complex receptive fields, despite using only MLPs, by reshaping the tensors in the intermediate layers, e.g. Hire-MLP (Guo et al., 2021a), Vision Permutator (Hou et al., 2022). Some methods in this line even exclude the spatial-mixing MLPs, e.g. CycleMLP (Chen et al., 2021), $S^{2}$-MLP (Yu et al., 2022), AS-MLP (Lian et al., 2021). In these methods, only the channel-mixing MLPs are used, and information is exchanged between tokens by spatially shifting each channel separately.

A method that deserves special mention in a different line is gMLP (Liu et al., 2021a) (Figure 1). Its noteworthy characteristic is the use of the spatial gating unit (SGU), a gating mechanism for spatial mixing that mimics an attention mechanism. Another feature of this method is its simplicity: only one linear (fully-connected projection) layer is used where other related methods require a stack of two or more dense layers. The gMLP model has achieved performance competitive with the standard Transformer models BERT (Devlin et al., 2018) and ViT (Dosovitskiy et al., 2021) in the language and vision domains, respectively. It is also reported that gMLP can learn shift-invariant features.

2.3 Hierarchy and Locality of Vision

In neuroscience, a hierarchical structure of the visual cortex was already proposed in the 1960s by Hubel and Wiesel (1962). It has been a common understanding that local patches of images are matched with Gabor-like patterns in the early stage, viz. the primary visual cortex (V1), and that the results are integrated at later stages for more global and intelligent information processing (Marčelja, 1980). The Neocognitron (Fukushima, 1980), one of the earliest CNN models, was based on insights from the physiological hierarchy model above. Subsequent CNN models in the deep learning era also naturally utilize the hierarchical structure of images.

In contrast, neither Vision Transformers nor Vision MLPs had structures that inherently incorporate the ‘spatial inductive bias’ of visual hierarchy. It is therefore natural to consider integrating the visual hierarchy model into Transformers, as described in Section 2.1. In addition to the physiological motivation, such a built-in hierarchical structure has the advantage that the outputs of intermediate layers can be used in downstream tasks that require multiple resolutions, including semantic segmentation, as in e.g. UPerNet (Xiao et al., 2018). Furthermore, divide-and-conquer-like hierarchical architectures, which often reduce the computational complexity logarithmically, naturally improve efficiency.

One such proposal is the Swin Transformer (Liu et al., 2021b). It first divides the image into many small patches and applies self-attention within those patches, which are shifted by half for each layer. The structure is then gradually integrated from local to global in a bottom-up manner (Figure 1). The essential elements in the structure of Swin Transformer are the bottom-up hierarchical structure for each resolution and the shifting of windows. Indeed, it has been reported that as long as this structure is maintained, the attention can actually be replaced by average pooling (PoolFormer), or even an identity map, although performance is slightly reduced (Yu et al., 2021). This fact is evidence that the hierarchical structure of the Swin Transformer is advantageous as a vision model.

3 Methodology

3.1 Overall Network Architecture of gSwin

Figure 2: Overall Architecture of gSwin.

In this paper, we aim to achieve high performance while the computational and parameter efficiency is on par with the existing methods, by incorporating both merits of the two methods that have evolved in different directions from ViT (Dosovitskiy et al., 2021) (Figure 1). In particular, we consider the two daughters of ViT, viz. gMLP (Liu et al., 2021a) (Figure 1) and Swin Transformer (Liu et al., 2021b) (Figure 1). From Figure 1, we may observe that gMLP’s overall structure is largely the same as ViT, but the structure of individual blocks has been significantly modified. On the other hand, Swin Transformer’s overall structure has been significantly modified in terms of the introduction of a hierarchical structure, but the internal structure of the individual blocks is largely the same as in ViT, except for the introduction of shifted windows. Thus, there is little overlap and a high degree of independence in the differences from ViT in each of these methods. Therefore, by incorporating both at the same time, higher improvements can be expected.

Based on these observations, we propose gSwin as a new vision model that inherits features of both gMLP and Swin Transformer. Figure 2 shows the architecture of gSwin. In the following subsection, we describe the Window-SGU.

Note that gSwin is neither the same as nor a variant of MetaFormer (Yu et al., 2021), which is a general form for a series of models in which the self-attention module in Swin Transformer is replaced with another module. In that study, the authors considered using a gMLP module instead of self-attention, but the resulting architecture is not the same as our gSwin. For example, gSwin does not have the subsequent LayerNorm, MLP and residual connections after each gMLP block. Our method is more parameter efficient, which will be shown experimentally later in this paper.

3.2 Spatial Gating Unit (SGU) and Window-SGU

Let us first recall the gMLP (Liu et al., 2021a) model. Figure 1 shows the overall architecture of gMLP, which consists of a stack of axis-wise fully-connected layers, layer normalization (Ba, Kiros, and Hinton, 2016), the GELU (Hendrycks and Gimpel, 2016) activation function and the spatial gating unit (SGU). Here, let us briefly introduce the SGU. Let $Z \in R^{H \times W \times 2 C}$ be the input features, and let $Z_{1}, Z_{2} \in R^{(H W) \times C}$ be the reshaped and split features, $[Z_{1}, Z_{2}] \leftarrow \mathrm{split}(\mathrm{reshape}(Z))$. Then the SGU performs the following operation,

$$\text{SGU:} \quad Y = Z_{1} ⊙ (W Z_{2} + b), \tag{1}$$

where $W \in R^{(H W) \times (H W)}$ and $b \in R^{(H W)}$ are trainable parameters, and $⊙$ denotes the element-wise multiplication. Such a gating mechanism allows us to explicitly incorporate the second-order terms $Z_{1} [h_{1}, w_{1}, :] ⊙ Z_{2} [h_{2}, w_{2}, :]$ associated with the pairwise relationship between all the site pairs $(h_{1}, w_{1})$ and $(h_{2}, w_{2})$. The gMLP paper does not go so far as to claim that such higher-order terms are the primary reason for the effectiveness of Transformers (in which up to ternary relations are incorporated), but it presents the SGU as an alternative to attention for explicitly incorporating higher-order relations, implicitly assuming that they could be a contributing factor. According to (Liu et al., 2021a, Figure 3), the resulting learned weight parameters $W$ were almost-sparse Gabor-like patterns. Furthermore, (Liu et al., 2021a, Figure 9) suggests that Toeplitz-like (i.e., shift-invariant) structures are automatically acquired. These facts imply that it is not necessary to model $W$ as a matrix with full degrees of freedom. Instead, it would be sufficient to consider only local interactions by partitioning the features into small patches.
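As an illustration of Eq. (1), the following is a minimal NumPy sketch of the SGU for a single image. The function name `spatial_gating_unit`, the row-major flattening of sites, and the choice to split the channel axis into a first and second half are assumptions made for this example; they are not taken from the gMLP implementation.

```python
import numpy as np

def spatial_gating_unit(Z, W, b):
    """Eq. (1): Y = Z1 * (W @ Z2 + b) for one image.

    Z: (H, W_img, 2C) input features
    W: (H*W_img, H*W_img) spatial mixing weights
    b: (H*W_img,) spatial bias, broadcast over channels
    """
    H, W_img, C2 = Z.shape
    C = C2 // 2
    tokens = Z.reshape(H * W_img, C2)          # reshape: one row per site
    Z1, Z2 = tokens[:, :C], tokens[:, C:]      # split along the channel axis (assumed order)
    gate = W @ Z2 + b[:, None]                 # mix sites, independently per channel
    return (Z1 * gate).reshape(H, W_img, C)    # element-wise gating

rng = np.random.default_rng(0)
H, W_img, C = 4, 4, 3
Z = rng.normal(size=(H, W_img, 2 * C))
W = rng.normal(size=(H * W_img, H * W_img)) * 0.01
b = np.ones(H * W_img)
Y = spatial_gating_unit(Z, W, b)               # shape (4, 4, 3)
```

Setting $W$ to zero and $b$ to one makes the gate equal to one everywhere, so the unit returns $Z_{1}$ unchanged, which is a convenient sanity check for the sketch.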

Based on this motivation, in this paper, we propose the Window-SGU module, which has basically the same structure as the SGU module of gMLP, but instead of mixing tokens over the entire image, the feature map is divided into small patches and token mixing is performed independently in each of them. That is, we first split the input feature $Z \in R^{H \times W \times 2 C}$ into small patches $Z^{(i)} \in R^{h \times w \times 2 C}$ where $1 < h ≪ H, 1 < w ≪ W$. (The areas that extend beyond the $H \times W$ region during this patch partitioning are handled with zero padding as usual.) After the patch partitioning, each patch $Z^{(i)} \in R^{h \times w \times 2 C}$ is reshaped and split into $Z_{1}^{(i)}, Z_{2}^{(i)} \in R^{(h w) \times C}$, and then the SGU is applied to each patch independently as follows,

$$\text{Window-SGU:} \quad Y^{(i)} = Z_{1}^{(i)} ⊙ (W_{win} Z_{2}^{(i)} + b_{win}), \tag{2}$$

where $W_{win} \in R^{(h w) \times (h w)}, b_{win} \in R^{(h w)}$. The resulting patches $Y^{(i)}$ are tiled again and the boundaries are processed appropriately to obtain $Y \in R^{H \times W \times C}$.
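The window partitioning, per-window gating of Eq. (2), and re-tiling can be sketched in NumPy as follows. This is a simplified illustration that assumes $H$ and $W$ are multiples of $h$ and $w$ (so no padding or boundary handling is needed) and a single head; the function name `window_sgu` is hypothetical.

```python
import numpy as np

def window_sgu(Z, W_win, b_win, h, w):
    """Eq. (2) applied to every non-overlapping h x w window of Z.

    Z: (H, W, 2C) with H % h == 0 and W % w == 0 (assumed; no padding here)
    W_win: (h*w, h*w), b_win: (h*w,), shared by all windows
    """
    H, W, C2 = Z.shape
    C = C2 // 2
    # Partition into windows: (nH, nW, h*w, 2C)
    win = Z.reshape(H // h, h, W // w, w, C2).transpose(0, 2, 1, 3, 4)
    win = win.reshape(H // h, W // w, h * w, C2)
    Z1, Z2 = win[..., :C], win[..., C:]
    # The same weights act on every window; matmul broadcasts over (nH, nW)
    Y = Z1 * (W_win @ Z2 + b_win[:, None])
    # Tile the windows back into an (H, W, C) feature map
    Y = Y.reshape(H // h, W // w, h, w, C).transpose(0, 2, 1, 3, 4)
    return Y.reshape(H, W, C)

rng = np.random.default_rng(0)
h = w = 7
C = 2
Z = rng.normal(size=(14, 14, 2 * C))
W_win = rng.normal(size=(h * w, h * w))
b_win = rng.normal(size=h * w)
Y = window_sgu(Z, W_win, b_win, h, w)          # shape (14, 14, 2)
```

Because each window is processed independently, changing the input inside one window leaves the output of every other window untouched, which is the locality that the shifted partitioning discussed next has to compensate for.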

If the patch partitioning is always fixed, there will always be neighboring elements that never interact, resulting in block noise. To prevent this, the patch partitioning must be changed between layers so that neighboring elements interact. For layers with shifts, we cannot use cyclic shifting as proposed in (Liu et al., 2021b), because SGUs lack the masking ability that self-attention has. A naive approach is to add zero-padding so that all windows have a uniform shape, which costs FLOPs. Here, we propose another approach that is equivalent to zero-padding but more efficient: the Padding-free shift. Let the shape of the windows be $(h, w) = (7, 7)$; then we have 9 groups of windows of shape $(3, 3), (3, 7), (3, 4), (7, 3), (7, 7), \dots$, located at the upper-left, upper, upper-right, left, center, etc. We calculate SGUs independently for each group and then concatenate all windows. The weight and bias of each group are copied from the corresponding entries of the main group at the center, of shape $(7, 7)$; e.g., $W_{upper} \in R^{(3 \times 7) \times (3 \times 7)}$ is defined as

$$W_{upper} [(x, y), (x^{'}, y^{'})] = W_{center} [(4 + x, y), (4 + x^{'}, y^{'})]. \tag{3}$$

The Padding-free shift reduces FLOPs of gSwin-T from 3.8G (zero-padding) to 3.6G.
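The equivalence between zero-padding and the Padding-free shift can be checked numerically for the upper group. The sketch below assumes 0-indexed coordinates in Eq. (3), row-major flattening of sites, and that the $3 \times 7$ window corresponds to the bottom three rows of a zero-padded $7 \times 7$ window; all variable names are illustrative.

```python
import numpy as np

def sgu(Z1, Z2, W, b):
    """Plain SGU on flattened tokens: Z1, Z2 are (n, C); W is (n, n); b is (n,)."""
    return Z1 * (W @ Z2 + b[:, None])

h = w = 7
rows = 3            # the "upper" boundary window keeps only 3 of the 7 rows
C = 2
rng = np.random.default_rng(0)
W_center = rng.normal(size=(h * w, h * w))
b_center = rng.normal(size=h * w)
Z1 = rng.normal(size=(rows, w, C))
Z2 = rng.normal(size=(rows, w, C))

# (a) Zero-padding: place the 3 x 7 window in the bottom rows of a 7 x 7 window,
#     run the full SGU, and crop the valid rows afterwards.
pad = ((h - rows, 0), (0, 0), (0, 0))
Y_pad = sgu(np.pad(Z1, pad).reshape(h * w, C),
            np.pad(Z2, pad).reshape(h * w, C),
            W_center, b_center).reshape(h, w, C)[h - rows:]

# (b) Padding-free: slice the weights as in Eq. (3), i.e. rows 4..6 (0-indexed).
keep = np.arange((h - rows) * w, h * w)        # flat indices of rows 4, 5, 6
W_upper = W_center[np.ix_(keep, keep)]         # shape (21, 21)
b_upper = b_center[keep]
Y_free = sgu(Z1.reshape(rows * w, C), Z2.reshape(rows * w, C),
             W_upper, b_upper).reshape(rows, w, C)

same = np.allclose(Y_pad, Y_free)              # both routes give the same output
```

The padded route multiplies a $49 \times 49$ matrix while the padding-free route multiplies a $21 \times 21$ one, which illustrates where the FLOPs saving comes from.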

4 Experiment

In this paper, following previous studies, we conducted evaluation experiments on the following three benchmark tasks: image classification using ImageNet-1K (Deng et al., 2009), and, as downstream tasks, object detection using MS-COCO (Lin et al., 2014) and semantic segmentation using ADE20K (Zhou et al., 2019).

4.1 Image Classification

Method	$C$	#layer	#heads
gSwin-VT	$60$	$\{2, 4, 10, 4\}$	$6$
gSwin-T	$64$	$\{4, 4, 16, 4\}$	$12$
gSwin-S	$72$	$\{4, 4, 32, 4\}$	$12$

Table 1: Parameter settings of our method.

Method	ImageNet	COCO	ADE20K
gSwin-VT	$0.25$	$0.25$	$0.2$
gSwin-T	$0.35$	$0.3$	$0.3$
gSwin-S	$0.5$	$0.4$	$0.4$

Table 2: Parameter settings of drop path rates.

method	#param.	FLOPs	Top-1 acc.
DeiT-S	22M	4.6G	79.8%
DeiT-B	86M	17.5G	81.8%
Swin-T	28M	4.5G	81.29(3)%
Swin-S	50M	8.7G	83.02(7)%
gMLP-Ti	6M	1.4G	72.3%
gMLP-S	20M	4.5G	79.6%
gMLP-B	73M	15.8G	81.6%
Hire-MLP-Ti	18M	2.1G	79.7%
Hire-MLP-S	33M	4.2G	82.1%
Hire-MLP-B	58M	8.1G	83.2%
PoolFormer-S12	12M	2.0G	77.2%
PoolFormer-S24	21M	3.6G	80.3%
PoolFormer-S36	31M	5.2G	81.4%
PoolFormer-M36	56M	9.1G	82.1%
PoolFormer-M48	73M	11.9G	82.5%
gSwin-VT (ours)	16M	2.3G	80.32(1)%
gSwin-T (ours)	22M	3.6G	81.71(5)%
gSwin-S (ours)	40M	7.0G	83.01(4)%

Table 3: ImageNet-1K (Deng et al., 2009) classification top-1 accuracy (↑). For all methods, the input image has $224^{2}$ resolution.

The experimental condition was as follows. The dataset was ImageNet-1K (Deng et al., 2009), divided into training and validation sets consisting of $\sim$ 1.3M and 50k images, respectively.

Table 1 shows the parameter settings for the three variants of the proposed method (the #heads column is explained in Section 4.4.1). We set the window size to $7$ for all gSwin models.

The training conditions were as follows (most of them align with those of Swin Transformer (Liu et al., 2021b) for fair comparison). The optimizer was AdamW (Kingma and Ba, 2014). Data augmentation and regularization techniques were adopted, including RandAugment (Cubuk et al., 2020; Wightman, 2019) and drop-path regularization (Larsson, Maire, and Shakhnarovich, 2016), whose rates are shown in Table 2 (tuned by grid search). The initial learning rate was 0.001 and the weight decay was 0.05. The input image size was $224^{2}$, the batch size was 1024, and the number of epochs was 300 with a cosine scheduler (the first 20 epochs for linear warm-up). Of these, the epoch with the best top-1 accuracy on the validation dataset was selected (in case of a tie, the selection was based on the top-5 accuracy). The training was performed twice with different seeds. We report the average (and unbiased variance) values of the two trials.
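For concreteness, the learning-rate schedule described above (20 epochs of linear warm-up followed by cosine decay over 300 epochs from an initial rate of 0.001) might look like the sketch below. The per-epoch granularity, the warm-up starting value, and the final rate of zero are assumptions, since the text does not specify them; the function name `learning_rate` is illustrative.

```python
import math

def learning_rate(epoch, base_lr=1e-3, warmup_epochs=20, total_epochs=300, min_lr=0.0):
    """Linear warm-up followed by cosine decay, evaluated once per epoch."""
    if epoch < warmup_epochs:
        # Assumed: ramp linearly from base_lr / warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up schedule that has elapsed, in [0, 1]
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(300)]
```

Under these assumptions the rate peaks at 0.001 at the end of warm-up and falls to half of that value midway through the cosine phase (epoch 160).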

We compared the image recognition performance of our method with that of DeiT (Touvron et al., 2021b), Swin Transformer (Liu et al., 2021b), gMLP (Liu et al., 2021a) and Hire-MLP (Guo et al., 2021a). Table 3 shows the results. Compared to Swin Transformer, gSwin-T achieves +0.4% top-1 accuracy with 21% fewer parameters, and gSwin-S achieves the same accuracy with 20% fewer parameters. We may observe that gSwin (proposed) needs fewer parameters and floating point operations than the other methods to achieve the same level of accuracy.

4.2 Object Detection

method	#param.	${AP}^{box}$	${AP}^{mask}$
Swin-T	86M	50.52(5)%	43.78(6)%
Swin-S	107M	51.98(3)%	44.99(6)%
PoolFormer-S12	32M	37.3%	34.6%
PoolFormer-S24	41M	40.1%	37.0%
PoolFormer-S36	51M	41.0%	37.7%
gSwin-VT (ours)	73M	49.48(2)%	42.87(2)%
gSwin-T (ours)	79M	50.97(2)%	44.16(3)%
gSwin-S (ours)	97M	52.00(7)%	45.03(5)%

Table 4: COCO (Lin et al., 2014) object detection and instance segmentation.

We performed an evaluation experiment on object detection and instance segmentation to test the performance of gSwin on downstream tasks. The dataset we used in this experiment was COCO (Lin et al., 2014). The evaluation metrics were box and mask AP (average precision). We used Cascade Mask R-CNN (He et al., 2017; Cai and Vasconcelos, 2018) as the object detection framework, as in Swin Transformer (Liu et al., 2021b).

When training for object detection, we reused checkpoints of the ImageNet-1K experiment rather than training a new model from scratch. The specific procedure was as follows. First, the checkpoint that achieved the highest accuracy on the ImageNet-1K validation dataset was used as the starting point. Then, we continued training for 36 epochs on 4 V100 GPUs, where the batch size was 4 for each GPU. The optimizer was AdamW (Kingma and Ba, 2014), the initial learning rate was 0.0001, and the weight decay was 0.05. We used multi-scale training (the input image was resized so that the shorter side is between 480 and 800 pixels and the longer side is at most 1333 pixels). During training, we evaluated the AP scores every epoch and selected the checkpoint with the highest box AP. This transfer learning was done three times with different random seeds for each of the two ImageNet-1K training runs, which resulted in 6 different models. We report the average score of these 6 models.
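The multi-scale resizing rule can be sketched as below. This is only an illustration of the stated constraints (shorter side between 480 and 800 pixels, longer side at most 1333 pixels); the uniform integer sampling, the rounding, and the function name `multiscale_size` are assumptions, and detection frameworks may instead sample from a fixed list of scales.

```python
import random

def multiscale_size(height, width, short_range=(480, 800), max_long=1333, rng=random):
    """Pick a training resolution that keeps the aspect ratio."""
    target_short = rng.randint(*short_range)             # assumed: sampled per image
    short, long = min(height, width), max(height, width)
    # Scale the shorter side to the target unless that would push the
    # longer side past max_long, in which case the cap takes priority.
    scale = min(target_short / short, max_long / long)
    return round(height * scale), round(width * scale)

new_h, new_w = multiscale_size(480, 640, rng=random.Random(0))
```

For a very elongated image such as $300 \times 1200$, the cap on the longer side dominates, so the shorter side ends up below 480 pixels regardless of the sampled target.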

From the results in Table 4, we observe that gSwin-T achieves +0.45/+0.38 box/mask AP with 8% fewer parameters than Swin-T, and gSwin-S achieves similar APs with 9% fewer parameters than Swin-S.

4.3 Semantic Segmentation

method	#param.	mIoU	mIoU(aug)
Swin-T	60M	44.3(1)%	45.8(1)%
Swin-S	81M	47.7(1)%	49.1(1)%
PoolFormer-S12	16M	37.2%
PoolFormer-S24	23M	40.3%
PoolFormer-S36	35M	42.0%
PoolFormer-M36	60M	42.4%
PoolFormer-M48	77M	42.7%
gSwin-VT (ours)	45M	43.4(1)%	45.1(1)%
gSwin-T (ours)	52M	46.1(1)%	47.6(1)%
gSwin-S (ours)	70M	48.2(1)%	49.7(1)%

Table 5: ADE20K (Zhou et al., 2019) semantic segmentation.

We also performed an evaluation experiment on a semantic segmentation task. The dataset we used in this experiment was ADE20K (Zhou et al., 2019). The evaluation metric was mIoU (mean intersection over union). We used UPerNet (Xiao et al., 2018) as the base framework, as in (Liu et al., 2021b).
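As a reminder of how the metric is defined, a minimal single-image sketch of mIoU is given below. It is a simplification: benchmark evaluation typically accumulates intersections and unions over the whole validation set and handles an ignore label, neither of which is modelled here, and skipping classes that never occur is an assumption of this example.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection over union from two integer label maps of equal shape."""
    # Confusion matrix: rows are ground-truth classes, columns are predictions
    cm = np.bincount(target.ravel() * num_classes + pred.ravel(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm)                              # correctly labelled pixels per class
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter  # predicted + true - overlap
    valid = union > 0                                # assumed: skip classes that never occur
    return (inter[valid] / union[valid]).mean()

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)        # IoUs are 1/2 and 2/3
```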

When training for semantic segmentation, we reused checkpoints of the ImageNet-1K experiment. The specific procedure was as follows. First, the checkpoint that achieved the highest accuracy on the ImageNet-1K validation dataset was used as the starting point. Then, we continued training for 160k steps with a linear scheduler (the first 1.5k steps for linear warm-up) on 4 GPUs, where the batch size was 4 for each GPU. The optimizer was AdamW (Kingma and Ba, 2014), the initial learning rate was $6 \times 10^{- 5}$, the weight decay was 0.01, and the input image size was $512^{2}$. The augmentations we adopted were random horizontal flip, random rescaling (with the resolution ratio within $[0.5, 2.0]$) and random photometric distortion, as in Swin Transformer (Liu et al., 2021b). During training, we evaluated the mIoU score every 4k steps and selected the checkpoint with the highest validation score. This transfer learning was done three times with different random seeds for each of the two ImageNet-1K training runs, which resulted in 6 different models. We report the average score of these 6 models.

Results on ADE20K are shown in Table 5, where the mIoU(aug) column is the score with test-time multi-scale ( $[0.5, 0.75, 1.0, 1.25, 1.5, 1.75] \times$ resolution) and horizontal-flip augmentation. These show that, compared to Swin Transformer, gSwin-T achieves +1.8/+1.85% mIoU with/without augmentation with 13% fewer parameters, and gSwin-S achieves +0.46/+0.57% mIoU with 14% fewer parameters.

4.4 Ablation Studies

	ImageNet	ADE20K
#head.	Top-1	mIoU	mIoU(aug)
1*	80.91%	43.8(1)%	45.3(1)%
3*	81.46%	45.5(3)%	47.2(3)%
6	81.61(1)%	45.8(1)%	47.2(1)%
12	81.71(5)%	46.1(1)%	47.6(1)%
24	81.80(9)%	46.2(1)%	47.7(1)%
48	81.78(10)%	45.7(1)%	47.1(1)%
12	81.71(5)%	46.1(1)%	47.6(1)%
12 (no pos.)*	81.27%	44.8(1)%	46.5(1)%

Table 6: Ablation study on the number of heads and relative positional bias in gSwin-T. * indicates that only one checkpoint for ImageNet-1K was trained.

4.4.1 Multi-Head

The SGU in Liu et al. (2021a) is uniform along the channel axis. We propose the multi-head SGU, by analogy with the multi-head mechanism that improves the accuracy of Transformer-based models (Vaswani et al., 2017). The multi-head SGU changes the term $W_{win} Z_{2}^{(i)} + b_{win}$ in (2) as follows,

$$Y^{(i)} [x, y, c] = Z_{1}^{(i)} [x, y, c] \left( b_{win} [x, y, c] + \sum_{x^{'}, y^{'}} W_{win} [(x, y), (x^{'}, y^{'}), c] \, Z_{2}^{(i)} [x^{'}, y^{'}, c] \right), \tag{4}$$

and the channel-axis of $W_{win}$ and $b_{win}$ are grouped into $K$ heads ( $K$ is assumed to be a divisor of $C$ ) as follows:

1st head: $W_{win} [:, :, 1] = \dots = W_{win} [:, :, C / K]$ ,
2nd head: $W_{win} [:, :, C / K + 1] = \dots = W_{win} [:, :, 2 C / K]$ ,
$\dots$ ,
$K$ -th head: $W_{win} [:, :, (K - 1) C / K + 1] = \dots = W_{win} [:, :, C]$ .

In the case $K = 1$, this reduces to the normal SGU; $W_{win} [:, :, c] = W_{win} [:, :]$. The number of parameters to be learned increases linearly with $K$, but it can be easily checked that the FLOPs are independent of the choice of $K$. The Window-SGU limits the representation capacity of spatial interactions, which can be observed from the fact that the number of parameters in $W$ is reduced from $H^{2} W^{2}$ to $h^{2} w^{2}$ (the original gMLP uses $14^{2}$ tokens, therefore the typical ratio is $7^{4} / 14^{4} = 1 / 16$ ). Increasing $K$ lets tokens interact in more complicated patterns at once and helps to keep the model capacity needed to represent complex spatial interactions.
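The grouping of channels into heads in Eq. (4) can be sketched for a single window as follows. The storage layout (one $(h w) \times (h w)$ matrix and one bias vector per head, with consecutive channels belonging to the same head) and the function name `multi_head_window_sgu` are assumptions made for illustration.

```python
import numpy as np

def multi_head_window_sgu(Z1, Z2, W_heads, b_heads):
    """Eq. (4) for one window with K heads.

    Z1, Z2: (n, C) tokens of one window, n = h * w
    W_heads: (K, n, n), b_heads: (K, n); K must divide C
    """
    n, C = Z1.shape
    K = W_heads.shape[0]
    Z2h = Z2.reshape(n, K, C // K)                    # consecutive channels share a head
    mixed = np.einsum("kij,jkc->ikc", W_heads, Z2h)   # each head mixes only its own channels
    gate = mixed + b_heads.T[:, :, None]              # bias is shared within a head
    return Z1 * gate.reshape(n, C)

rng = np.random.default_rng(0)
n, C, K = 49, 12, 4
Z1 = rng.normal(size=(n, C))
Z2 = rng.normal(size=(n, C))
W_heads = rng.normal(size=(K, n, n))
b_heads = rng.normal(size=(K, n))
Y = multi_head_window_sgu(Z1, Z2, W_heads, b_heads)   # shape (49, 12)

num_params = W_heads.size + b_heads.size              # K * (n * n + n), linear in K
```

Every channel is still multiplied by exactly one $n \times n$ matrix, so the multiply-add count is $n^{2} C$ for any $K$, while the parameter count grows as $K (n^{2} + n)$ in this layout.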

We conducted ablation studies on the choice of the hyperparameter $K$ using the gSwin-T model. Table 6 shows the results for $K = 1, 3, 6, 12, 24, 48$. The single-head model is clearly worse than the multi-head models; the accuracy increases with $K$ and saturates at $K \sim 12$, which is roughly consistent with the ratio $1 / 16$ in the argument about the model capacity. From these results, we chose $K = 12$ for gSwin-T and gSwin-S ( $K = 6$ for gSwin-VT, to keep the model small).

4.4.2 Relative position bias

Figure 3: A head of the 11th layer of the 3rd block of gSwin-T with relative positional bias.

Liu et al. (2021b) used a relative positional bias to enhance attention mechanisms with the inductive bias of shift-invariance, which gives significant improvements to Swin Transformer. In contrast, as discussed in Section 3.2, Liu et al. (2021a) showed that the SGU can learn local and shift-invariant spatial interactions from scratch, a property that is not inherent in attention mechanisms. We modified $W$ in the SGU to $W = W^{'} + W_{rel}$, where the bias term $W_{rel}$ has the same shape as $W$ but its components depend only on relative positions. We observed that our gSwin can be trained without this term, but the term helps gSwin achieve better accuracy with a negligible increase in model size, as shown in Table 6. Figure 3 and Figure 4 show some (total) weights $W$ with and without the bias term, and we may observe that both can learn shift-invariance.
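One way to realize a term $W_{rel}$ whose components depend only on relative positions is to expand a small table indexed by the offset between two sites. The sketch below assumes a table of shape $(2h - 1) \times (2w - 1)$, in the style of the relative position bias of Swin Transformer; the exact parameterization used in gSwin is not specified in the text, and the function name `relative_bias_matrix` is hypothetical.

```python
import numpy as np

def relative_bias_matrix(table, h, w):
    """Expand a (2h-1, 2w-1) table into W_rel of shape (h*w, h*w).

    W_rel[(x, y), (x', y')] = table[x - x' + h - 1, y - y' + w - 1],
    so each entry depends only on the offset between the two sites.
    """
    xs, ys = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    xs, ys = xs.ravel(), ys.ravel()                 # row-major site coordinates
    dx = xs[:, None] - xs[None, :] + (h - 1)        # offsets shifted to be >= 0
    dy = ys[:, None] - ys[None, :] + (w - 1)
    return table[dx, dy]

h = w = 7
rng = np.random.default_rng(0)
table = rng.normal(size=(2 * h - 1, 2 * w - 1))     # 169 values under this assumption
W_prime = rng.normal(size=(h * w, h * w)) * 0.01    # free term W'
W_rel = relative_bias_matrix(table, h, w)
W = W_prime + W_rel                                 # total weight used by the SGU
```

Under this assumed parameterization the table holds $13^{2} = 169$ values, compared with $49^{2} = 2401$ entries in $W^{'}$, which would be in line with the small increase in model size reported above.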

5 Conclusion

We proposed gSwin, a new vision MLP that merges (multi-head) gMLP and Swin Transformer, as a version of gMLP that can generate hierarchical feature maps and is compatible with downstream tasks where feature pyramids are crucial. gSwin is more parameter-efficient than Swin Transformer at comparable or better accuracy on ImageNet-1K, COCO and ADE20K, and could be applied to other downstream tasks, just as Swin Transformer is widely used. We hope that vision MLPs will attract attention again.

Acknowledgments

This paper is based on results obtained from a project subsidized by the New Energy and Industrial Technology Development Organization (NEDO).

References

Arnab et al. (2021) Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; and Schmid, C. 2021. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6836–6846.
Ba, Kiros, and Hinton (2016) Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Cai and Vasconcelos (2018) Cai, Z.; and Vasconcelos, N. 2018. Cascade R-CNN: Delving Into High Quality Object Detection. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6154–6162.
Carion et al. (2020) Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-end object detection with transformers. In European conference on computer vision, 213–229. Springer.
Chen et al. (2021) Chen, S.; Xie, E.; Ge, C.; Liang, D.; and Luo, P. 2021. CycleMLP: A MLP-like architecture for dense prediction. arXiv preprint arXiv:2107.10224.
Chen, Xie, and He (2021) Chen, X.; Xie, S.; and He, K. 2021. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9640–9649.
Cordonnier, Loukas, and Jaggi (2019) Cordonnier, J.-B.; Loukas, A.; and Jaggi, M. 2019. On the relationship between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584.
Cubuk et al. (2020) Cubuk, E. D.; Zoph, B.; Shlens, J.; and Le, Q. V. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 702–703.
Deng et al. (2009) Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255. IEEE.
Devlin et al. (2018) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dosovitskiy et al. (2021) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
d’Ascoli et al. (2021) d’Ascoli, S.; Touvron, H.; Leavitt, M. L.; Morcos, A. S.; Biroli, G.; and Sagun, L. 2021. ConViT: Improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, 2286–2296. PMLR.
Fan et al. (2021) Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; and Feichtenhofer, C. 2021. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6824–6835.
Fukushima (1980) Fukushima, K. 1980. Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biological Cybernetics, 36: 193–202.
Graham et al. (2021) Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; and Douze, M. 2021. LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12259–12269.
Guo et al. (2021a) Guo, J.; Tang, Y.; Han, K.; Chen, X.; Wu, H.; Xu, C.; Xu, C.; and Wang, Y. 2021a. Hire-MLP: Vision MLP via hierarchical rearrangement. arXiv preprint arXiv:2108.13341.
Guo et al. (2021b) Guo, M.-H.; Cai, J.-X.; Liu, Z.-N.; Mu, T.-J.; Martin, R. R.; and Hu, S.-M. 2021b. PCT: Point cloud transformer. Computational Visual Media, 7(2): 187–199.
Han et al. (2021) Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; and Wang, Y. 2021. Transformer in transformer. Advances in Neural Information Processing Systems, 34.
He et al. (2017) He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. B. 2017. Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV), 2980–2988.
He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
Hendrycks and Gimpel (2016) Hendrycks, D.; and Gimpel, K. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
Hornik, Stinchcombe, and White (1989) Hornik, K.; Stinchcombe, M.; and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural networks, 2(5): 359–366.
Hou et al. (2022) Hou, Q.; Jiang, Z.; Yuan, L.; Cheng, M.-M.; Yan, S.; and Feng, J. 2022. Vision permutator: A permutable MLP-like architecture for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Hu, Shen, and Sun (2018) Hu, J.; Shen, L.; and Sun, G. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7132–7141.
Hubel and Wiesel (1962) Hubel, D. H.; and Wiesel, T. N. 1962. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1): 106.
Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky, Sutskever, and Hinton (2012) Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
Larsson, Maire, and Shakhnarovich (2016) Larsson, G.; Maire, M.; and Shakhnarovich, G. 2016. FractalNet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648.
LeCun et al. (1989) LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4): 541–551.
Li et al. (2021) Li, Y.; Zhang, K.; Cao, J.; Timofte, R.; and Van Gool, L. 2021. LocalViT: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707.
Lian et al. (2021) Lian, D.; Yu, Z.; Sun, X.; and Gao, S. 2021. AS-MLP: An axial shifted MLP architecture for vision. arXiv preprint arXiv:2107.08391.
Lin et al. (2014) Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European conference on computer vision, 740–755. Springer.
Liu et al. (2021a) Liu, H.; Dai, Z.; So, D.; and Le, Q. V. 2021a. Pay attention to MLPs. Advances in Neural Information Processing Systems, 34: 9204–9215.
Liu et al. (2021b) Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021b. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022.
Liu et al. (2021c) Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; and Hu, H. 2021c. Video swin transformer. arXiv preprint arXiv:2106.13230.
Marčelja (1980) Marčelja, S. 1980. Mathematical description of the responses of simple cortical cells. JOSA, 70(11): 1297–1300.
Mnih et al. (2014) Mnih, V.; Heess, N.; Graves, A.; et al. 2014. Recurrent models of visual attention. Advances in neural information processing systems, 27.
Pan et al. (2021) Pan, Z.; Zhuang, B.; Liu, J.; He, H.; and Cai, J. 2021. Scalable visual transformers with hierarchical pooling. arXiv e-prints, arXiv–2103.
Parmar et al. (2018) Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; and Tran, D. 2018. Image transformer. In International Conference on Machine Learning, 4055–4064. PMLR.
Radford et al. (2018) Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training.
Rosenblatt (1958) Rosenblatt, F. 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6): 386.
Simonyan and Zisserman (2014) Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Srinivas et al. (2021) Srinivas, A.; Lin, T.-Y.; Parmar, N.; Shlens, J.; Abbeel, P.; and Vaswani, A. 2021. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16519–16529.
Szegedy et al. (2015) Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1–9.
Tan and Le (2019) Tan, M.; and Le, Q. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, 6105–6114. PMLR.
Tolstikhin et al. (2021) Tolstikhin, I. O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. 2021. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34.
Touvron et al. (2021a) Touvron, H.; Bojanowski, P.; Caron, M.; Cord, M.; El-Nouby, A.; Grave, E.; Izacard, G.; Joulin, A.; Synnaeve, G.; Verbeek, J.; et al. 2021a. ResMLP: Feedforward networks for image classification with data-efficient training. arXiv preprint arXiv:2105.03404.
Touvron et al. (2021b) Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; and Jégou, H. 2021b. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357. PMLR.
Touvron et al. (2021c) Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; and Jégou, H. 2021c. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 32–42.
Vaswani et al. (2021) Vaswani, A.; Ramachandran, P.; Srinivas, A.; Parmar, N.; Hechtman, B.; and Shlens, J. 2021. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12894–12904.
Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
Wang et al. (2021) Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; and Shao, L. 2021. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 568–578.
Wightman (2019) Wightman, R. 2019. PyTorch Image Models. https://github.com/rwightman/pytorch-image-models.
Woo et al. (2018) Woo, S.; Park, J.; Lee, J.-Y.; and Kweon, I. S. 2018. CBAM: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), 3–19.
Wu et al. (2020) Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Yan, Z.; Tomizuka, M.; Gonzalez, J.; Keutzer, K.; and Vajda, P. 2020. Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677.
Wu et al. (2021) Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; and Zhang, L. 2021. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 22–31.
Xiao et al. (2018) Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; and Sun, J. 2018. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 418–434.
Xu et al. (2015) Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, 2048–2057. PMLR.
Yu et al. (2022) Yu, T.; Li, X.; Cai, Y.; Sun, M.; and Li, P. 2022. S$^{2}$-MLP: Spatial-shift MLP architecture for vision. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 297–306.
Yu et al. (2021) Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; and Yan, S. 2021. MetaFormer is actually what you need for vision. arXiv preprint arXiv:2111.11418.
Yuan et al. (2021) Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.-H.; Tay, F. E.; Feng, J.; and Yan, S. 2021. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 558–567.
Zeiler and Fergus (2014) Zeiler, M. D.; and Fergus, R. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision, 818–833. Springer.
Zhang et al. (2020) Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. 2020. ResNeSt: Split-attention networks. arXiv preprint arXiv:2004.08955.
Zhang et al. (2021) Zhang, P.; Dai, X.; Yang, J.; Xiao, B.; Yuan, L.; Zhang, L.; and Gao, J. 2021. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2998–3008.
Zhou et al. (2019) Zhou, B.; Zhao, H.; Puig, X.; Xiao, T.; Fidler, S.; Barriuso, A.; and Torralba, A. 2019. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127(3): 302–321.
Zhou et al. (2021) Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.; Jiang, Z.; Hou, Q.; and Feng, J. 2021. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886.