Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification

Xixi Wang, Xiao Wang, Member, IEEE, Bo Jiang*, Bin Luo, Senior Member, IEEE

The authors are all with the School of Computer Science and Technology, Anhui University, Hefei 230601, China. Corresponding author: Bo Jiang.

Abstract

Few-shot classification, which aims to recognize unseen classes using very limited samples, has attracted increasing attention. It is usually formulated as a metric learning problem. The core issue of few-shot classification is how to learn (1) consistent representations for images in both support and query sets and (2) an effective metric for images between support and query sets. In this paper, we show that the two challenges can be modeled simultaneously via a unified Query-Support TransFormer (QSFormer) model. Specifically, the proposed QSFormer involves a global query-support sample Transformer (sampleFormer) branch and a local patch Transformer (patchFormer) learning branch. sampleFormer aims to capture the dependence of samples in support and query sets for image representation. It adopts the Encoder, Decoder and Cross-Attention to respectively model the Support representation, Query (image) representation and Metric learning for the few-shot classification task. As a complement to the global learning branch, we adopt a local patch Transformer to extract a structural representation for each image sample by capturing the long-range dependence of local image patches. In addition, a novel Cross-scale Interactive Feature Extractor (CIFE) is proposed to extract and fuse multi-scale CNN features as an effective backbone module for the proposed few-shot learning method. All modules are integrated into a unified framework and trained in an end-to-end manner. Extensive experiments on four popular datasets demonstrate the effectiveness and superiority of the proposed QSFormer.

Index Terms: Few-Shot Learning, Transformer, Metric Learning, Deep Learning.

I Introduction

Current deep neural networks learn from large-scale training samples and achieve good performance on many tasks. However, in many scenarios, data collection and annotation are expensive, and it is usually very challenging to collect enough data to train deep neural networks. Few-shot classification, which aims to recognize unseen/query classes by using very limited seen/support samples, has therefore attracted increasing attention.


Fig. 1: Illustration of our proposed unified Query-Support Transformer for few-shot learning. It models the feature engineering on query/support samples and metric learning simultaneously.

Many deep learning methods [34, 14, 48] have been proposed to address the few-shot learning problem. These methods can be roughly classified into three types, i.e., generation-based methods, optimization-based methods and metric-based methods. Metric-based methods distinguish support and query samples by using image representation and metric learning techniques. The core issues for metric-based few-shot classification are twofold: 1) how to learn consistent representations for images in both support and query sets, and 2) how to conduct effective metric learning for images between support and query sets. Existing works [40, 35, 2, 48, 3] usually first employ Convolutional Neural Networks (CNNs) to learn image feature representations and then use a metric function to directly compute the similarities (e.g., cosine) between query and support images for few-shot classification. Although good performance can be achieved, many recent studies [6, 27] demonstrate that CNNs mainly capture local relations due to their limited receptive field. To address this issue, some researchers [11, 5, 51] propose to combine or replace CNNs with Transformer networks to model the long-range relationships of local image patches and obtain better image representations. However, they may still obtain sub-optimal performance for the following two reasons: 1) Existing works generally adopt Transformers (or CNN+Transformer) as the backbone network for representing each image individually, which ignores the inherent relationships among samples in query and support sets. 2) Existing works generally adopt a two-stage learning scheme, i.e., ‘representation learning + metric learning’. Although the two stages are usually learned together in an end-to-end manner, this decoupling may lead to sub-optimal learning results.

To address these challenges, in this work, we propose a unified Query-Support Transformer architecture for few-shot learning, termed QSFormer. The core of QSFormer is our newly designed query-support sample Transformer (named sampleFormer) module, which explores the relationships of samples so that sample representation and metric learning are coupled in a unified module for few-shot classification. Specifically, as shown in Figure 1, we adopt the Encoder, Decoder and Cross-Attention in our sampleFormer architecture to model the Support representation, Query (image) representation and Metric learning in the few-shot classification task, respectively. For the support branch, we represent all support images as a sequence of image tokens and feed them into the Transformer encoder to enhance the support features. The query branch receives a sequence of query image tokens and learns their representations. Meanwhile, it interacts with the support branch via cross-attention to model the similarities/affinities between query and support tokens, thereby naturally achieving metric learning in the decoding procedure.

Based on our newly proposed sampleFormer, we further extend it by introducing two additional new modules for high-performance few-shot learning, including Cross-scale Interactive Feature Extractor (CIFE) and local patch Transformer (patchFormer) module. Specifically, as shown in Figure 2, given the query and support images, we first use CIFE as the backbone module to extract the image features. Then, the sampleFormer takes the embedded image tokens as input and outputs global metrics. Meanwhile, the local/patch correspondence of query-support image pairs is also considered using the patchFormer. The global and local metrics are combined for few-shot classification. Note that, the whole network can be optimized in an end-to-end way.

In summary, the contributions of this paper are as follows:

  • We propose a unified Query-Support Transformer (termed QSFormer) for few-shot learning, which models the representation learning and metric learning simultaneously.

  • We propose a novel Sample Transformer module (sampleFormer) to capture the sample relationships in few-shot problem setting. Also, we propose a patch Transformer (patchFormer) module for few-shot image representation and metric learning.

  • We propose a Cross-scale Interactive Feature Extractor for image representation by considering the interaction of different CNN levels.

  • Extensive experiments on four widely used few-shot classification datasets demonstrate the effectiveness and superiority of our proposed method.

II Related Work

Few-shot Learning. Current few-shot learning algorithms can be broadly divided into two categories: optimization-based approaches [2, 14] and metric-based approaches [40, 13, 48, 44]. Our method is more relevant to the metric-based approaches, which mainly focus on the representation learning and metric learning of samples. Specifically, Sung et al. [37] propose a Relation Network (RN) for few-shot learning, which computes the relation scores between query examples and the few examples of each new class to classify the examples of new classes. Hou et al. [13] develop a Cross Attention Network, which highlights the target object regions to enhance the feature representation by producing cross attention maps for each feature. Zhang et al. [48] introduce Earth Mover’s Distance to capture a structural distance between the local image representations for few-shot classification. Xie et al. [44] introduce a deep Brownian Distance Covariance approach to learn image representations and then use distance metric for classification.

Transformer for Few-shot Classification. Transformer [39] has universal modeling capability owing to its core self-attention mechanism. In recent years, Transformer has been employed by a large number of researchers for various visual tasks, including object tracking [43, 47], object detection [1, 10], object re-identification [18, 15], multi-label classification [22, 4], medical image segmentation [38, 28], and so on. For few-shot learning tasks, some works [46, 21, 16, 11, 51, 5] demonstrate that the Transformer architecture is also promising. For example, Ye et al. [46] develop a Few-Shot Embedding Adaptation Transformer (FEAT) to instantiate set-to-set transformation and thus make instance embeddings task-specific for few-shot learning. Liu et al. [21] propose a Universal Representation Transformer (URT) layer that combines feature representations from multiple domains for multi-domain few-shot classification. Zhmoginov et al. [51] introduce a transformer-based model, called HyperTransformer (HT), which encodes task-dependent variations in the weights of a small CNN model for few-shot learning. These works mainly employ the Transformer architecture for representation learning. In contrast, in our work, we develop a Query-Support Transformer (QSFormer) to accomplish feature representation and metric learning simultaneously.

Fig. 2: An overview of the proposed QSFormer framework, which mainly consists of Cross-scale Interactive Feature Extractor (CIFE), Sample Transformer Module, Patch Transformer Module, Metric Learning and Few-shot Classification. More details can be found in Section III.

III The Proposed Method

The purpose of few-shot classification is to classify unseen samples when only a small number of labeled samples are available. Many recent approaches [13, 48, 7, 42] indicate that the episode mechanism provides an effective way for the few-shot classification task, and we follow them in both training and testing phases. Formally, let $\mathcal{D}_{train}$, $\mathcal{D}_{val}$ and $\mathcal{D}_{test}$ respectively represent the meta-training, meta-validation and meta-testing sets, whose classes are mutually disjoint. Taking the $C$-way $K$-shot few-shot classification task as an example, each episode consists of a support set $\mathcal{S}$ and a query set $\mathcal{Q}$. Concretely, we randomly select $C$ classes and $K$ labeled samples per class to form the support set, i.e., $\mathcal{S} = \{(x_j^s, y_j^s)\}_{j=1}^{n_s}$ with $n_s = CK$. Meanwhile, we randomly sample $T$ samples per class to form the query set, i.e., $\mathcal{Q} = \{(x_i^q, y_i^q)\}_{i=1}^{n_q}$ with $n_q = CT$.
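The episode construction described above can be illustrated with a minimal sketch. The function name `sample_episode`, the toy image identifiers and the choice to draw support and query items without overlap are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import random

def sample_episode(items_by_class, n_way, k_shot, n_query, rng):
    """Sample one episode: a support set and a query set over the same classes."""
    classes = rng.sample(sorted(items_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw k_shot + n_query distinct items so support and query never overlap
        # (an assumption of this sketch).
        items = rng.sample(items_by_class[label], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query

# Toy data: 20 classes with 30 hypothetical image ids each.
toy = {c: [f"img_{c}_{i}" for i in range(30)] for c in range(20)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
print(len(support), len(query))
```

With 5 classes, 1 support sample and 15 query samples per class, the episode holds 5 support and 75 query samples.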

As shown in Figure 2, we propose a novel Query-Support Transformer (QSFormer) framework for few-shot learning, which contains the following four parts:

  • Cross-Scale Interactive Feature Extractor (CIFE): we propose a cross-scale interactive feature extractor as backbone network to obtain the spatial enhanced support/query CNN feature representations.

  • Sample Transformer Module: we introduce a query-support sample Transformer (sampleFormer) module to couple image sample representation and global metric learning of samples together for few-shot learning.

  • Patch Transformer Module: we also propose a patch Transformer (patchFormer) module to model the context correlation of patches in each image sample to conduct the local metric learning between query-support sample pairs.

  • Metric Learning and Few-shot Classification: we acquire the final metric by combining the global metric obtained via sampleFormer and the local metric obtained via patchFormer, and finally achieve few-shot classification.

Below, we introduce the details of these modules.

III-A Cross-scale Interactive Feature Extractor

We introduce a novel Cross-scale Interactive Feature Extractor (CIFE) as backbone module, which aims to obtain the ego-context CNN feature representations for support and query samples.

As shown in Figure 3, taking the support image set $X^s$ as input, we first use the pre-trained ResNet-12 to generate the initial multi-scale feature representations $F_l^s \in \mathbb{R}^{n_s \times c_l \times h_l \times w_l}$, where $n_s$ represents the number of support samples in each episode and $c_l$, $h_l$ and $w_l$ denote the channel, height and width of the support feature map in the $l$-th level, respectively. Then, we employ a Transformer architecture [39] consisting of multi-head self-attention (MSA), layer normalization (LN), feed-forward network (FFN) and residual connection to achieve the interaction of multi-scale features. Finally, we obtain the spatially enhanced feature representations for support samples as $\hat{F}^s$. Similarly, we obtain the spatially enhanced features for query samples as $\hat{F}^q$. The parameters of CIFE are shared between the support and query branches. In practice, the interaction is applied to the last two levels, as described in the Implementation Details.

Fig. 3: Illustration of Cross-scale Interactive Feature Extractor (CIFE) for feature extraction.

III-B Sample Transformer Module

To achieve both image sample representation and metric learning of samples in a unified module, we design a novel query-support sample Transformer module, named sampleFormer. The proposed sampleFormer mainly consists of Encoder and Decoder, as shown in Figure 2.

Encoder. The purpose of the Encoder is to mine the relationships of samples in the support set to obtain better support feature representations. To this end, based on the aforementioned support features $\hat{F}^s$, we first apply image tokenization, which uses a global average pooling and a reshape operation to obtain the token sequence $Z^s \in \mathbb{R}^{n_s \times d}$ of support samples, where each token denotes a support image sample. As shown in Figure 2, the main component of the encoder is the attention mechanism, whose inputs are Query $Q^s$, Key $K^s$ and Value $V^s$, obtained by conducting three linear projections on $Z^s$, respectively. Next, it employs the dot-product operation to obtain a correlation/affinity matrix of different support samples as

$$A^s = \mathrm{Softmax}\left(\frac{Q^s (K^s)^{\top}}{\sqrt{d}}\right) \tag{1}$$

where $d$ denotes the dimension of support features. It learns the representations for support samples by conducting the message passing operation as

$$\tilde{Z}^s = \mathrm{LN}\left(A^s V^s + Z^s\right) \tag{2}$$

where $\mathrm{LN}$ refers to layer normalization. Besides, we add a Feed-Forward Network (FFN) [6] and a residual operation to obtain the final support sample representations as

$$\hat{Z}^s = \mathrm{LN}\left(\mathrm{FFN}(\tilde{Z}^s) + \tilde{Z}^s\right) \tag{3}$$

where $\hat{Z}^s \in \mathbb{R}^{n_s \times d}$, $n_s$ denotes the number of support samples and $d$ is the feature dimension. The FFN consists of two fully-connected layers.
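The encoder steps described above (affinities among support tokens, message passing with a residual connection and layer normalization, then a two-layer feed-forward network) can be sketched in dependency-free Python. This is a single-head illustration under several assumptions: scaled dot-product attention with a row-wise softmax, a ReLU between the two FFN layers, layer normalization without learned scale or shift, and random weights. None of these details is confirmed by the text, and plain lists are used only to keep the sketch self-contained.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(a):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def layer_norm(a, eps=1e-5):
    """Normalize every row to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for row in a:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def encode_support(tokens, w_q, w_k, w_v, w_1, w_2):
    """One encoder layer over support-sample tokens (one token per image)."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    affinity = softmax_rows(scores)                       # sample-to-sample affinities
    mixed = layer_norm(add(matmul(affinity, v), tokens))  # message passing + residual
    hidden = [[max(0.0, h) for h in row] for row in matmul(mixed, w_1)]  # ReLU (assumed)
    return affinity, layer_norm(add(matmul(hidden, w_2), mixed))

rng = random.Random(0)
n_support, dim = 5, 8  # 5-way 1-shot episode, hypothetical feature size
tokens = random_matrix(n_support, dim, rng)
weights = [random_matrix(dim, dim, rng) for _ in range(5)]
affinity, encoded = encode_support(tokens, *weights)
print(len(affinity), len(affinity[0]), len(encoded), len(encoded[0]))
```

The affinity matrix has one row and one column per support image, so attention here mixes information across samples rather than across patches of one image.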

Decoder. The Decoder aims to explore the dependence of samples in the query set to learn the representations for query samples, and also mines the intrinsic metrics of samples in the query and support sets. Specifically, it takes the aforementioned encoded support features $\hat{Z}^s$ and query feature embeddings $\hat{F}^q$ as its inputs. Image tokenization is applied on $\hat{F}^q$ to obtain the initial query token sequence $Z^q \in \mathbb{R}^{n_q \times d}$, where each token denotes a query image sample. Similar to the Encoder branch, we first leverage the self-attention message passing mechanism to model the relationships among query samples and learn representations for query samples as

$$A^q = \mathrm{Softmax}\left(\frac{Q^q (K^q)^{\top}}{\sqrt{d}}\right) \tag{4}$$

$$\tilde{Z}^q = \mathrm{LN}\left(A^q V^q + Z^q\right) \tag{5}$$

where $Q^q$, $K^q$ and $V^q$ are linear projections of $Z^q$ and $\mathrm{LN}$ denotes layer normalization.

Afterward, based on the support features $\hat{Z}^s$ and query features $\tilde{Z}^q$, we employ a cross-attention mechanism to explore the relationships between support and query samples for query sample representations. Specifically, it first computes the cross-affinities between support and query samples as follows

$$A^{qs} = \mathrm{Softmax}\left(\frac{\bar{Q}^q (\bar{K}^s)^{\top}}{\sqrt{d}}\right) \tag{6}$$

Then, it learns query sample representations by aggregating the information from support samples as follows

$$\hat{Z}^q = \mathrm{LN}\left(A^{qs} \bar{V}^s + \tilde{Z}^q\right) \tag{7}$$

where $\hat{Z}^q \in \mathbb{R}^{n_q \times d}$ and $\mathrm{LN}$ denotes layer normalization. $\bar{Q}^q$ is computed by conducting a linear projection on $\tilde{Z}^q$. $\bar{K}^s$ and $\bar{V}^s$ are obtained by conducting two different linear projections on $\hat{Z}^s$, respectively.

Remark. The above cross-affinities naturally reflect the similarities/affinities between support and query samples. In our work, we regard them as the global metric for all support and query samples, i.e.,

$$\mathcal{M}_g = A^{qs} \tag{8}$$

where $\mathcal{M}_g \in \mathbb{R}^{n_q \times n_s}$ contains the similarities for all query-support sample pairs in each episode. For convenience, in the following, we also use $\mathcal{M}_g(i,j)$ to denote the metric between images $x_i^q$ and $x_j^s$, where $i = 1, \dots, n_q$ and $j = 1, \dots, n_s$. We can utilize $\mathcal{M}_g$ for query sample classification, as discussed in Section III-D. Therefore, both query/support sample representation and metric learning in the few-shot learning task are conducted simultaneously in our sampleFormer architecture. This is one main aspect of the proposed sampleFormer module.
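To make the remark concrete, the sketch below computes a query-to-support cross-affinity matrix and reads it directly as a global metric. It assumes single-head scaled dot-product attention with a row-wise softmax and random projection weights; the function name `cross_affinity` and all sizes are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(a):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def cross_affinity(query_tokens, support_tokens, w_q, w_k):
    """Affinity of every query token to every support token."""
    q = matmul(query_tokens, w_q)      # queries come from the query branch
    k = matmul(support_tokens, w_k)    # keys come from the encoded support branch
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    return softmax_rows(scores)

rng = random.Random(0)
n_query, n_support, dim = 6, 5, 8  # hypothetical episode and feature sizes
query_tokens = random_matrix(n_query, dim, rng)
support_tokens = random_matrix(n_support, dim, rng)
w_q, w_k = random_matrix(dim, dim, rng), random_matrix(dim, dim, rng)

global_metric = cross_affinity(query_tokens, support_tokens, w_q, w_k)
# Index of the most similar support sample for each query sample.
nearest = [max(range(n_support), key=row.__getitem__) for row in global_metric]
print(len(global_metric), len(global_metric[0]), nearest)
```

Each row of `global_metric` scores one query image against all support images, which is why the same matrix can serve both for feature aggregation and for classification.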

III-C Patch Transformer Module

As a complement to the above sampleFormer branch, we also develop a query-support Patch Transformer Module (patchFormer) to capture more of the visual content of each image sample for the local metric. As shown in Figure 2, patchFormer mainly consists of multi-head self-attention (MSA) and a residual connection. Here, we omit the Feed-Forward Network used in the regular Transformer [6] for simplicity. The parameters of MSA are shared between the support and query branches.

Concretely, for each input support sample $x_j^s$ and query sample $x_i^q$, we first obtain their feature embeddings $\hat{F}_j^s$ and $\hat{F}_i^q$ by using the above CIFE, followed by patch tokenization [6] to obtain the initial patch token sequence for each support and query image, i.e., $P_j^s$ and $P_i^q$. Then, we employ multi-head self-attention (MSA) [39] with shared weights and a residual operation to transform the support and query image patch features as

$$\hat{P}_j^s = \mathrm{LN}\left(\mathrm{MSA}(P_j^s) + P_j^s\right), \qquad \hat{P}_i^q = \mathrm{LN}\left(\mathrm{MSA}(P_i^q) + P_i^q\right) \tag{9}$$

where $\mathrm{LN}$ denotes layer normalization.

Based on the above patch representations $\hat{P}_j^s$ and $\hat{P}_i^q$, we then adopt the Earth Mover’s Distance (EMD) [12, 48] to compute their structural similarity. It first computes the distance between all patch pairs and then acquires the optimal matching between patches of the two images that has the minimum distance cost. Finally, it returns the image-level metric by aggregating the metrics of all matched patch pairs. In this paper, we denote this metric as the local metric between support sample $x_j^s$ and query sample $x_i^q$, i.e.,

$$\mathcal{M}_l(i,j) = \mathrm{EMD}\left(\hat{P}_i^q, \hat{P}_j^s\right) \tag{10}$$
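The matching idea behind the local metric can be sketched on a tiny example. This is a simplification rather than the EMD solver of [48]: it assumes both images have the same number of patches with uniform weights, in which case an optimal transport plan can be taken to be a one-to-one matching, and it finds that matching by brute force over permutations, which is only feasible for a handful of patches. The cosine-based cost and the name `emd_similarity` are likewise assumptions.

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity between two patch feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def emd_similarity(patches_a, patches_b):
    """Optimal one-to-one patch matching under uniform patch weights."""
    n = len(patches_a)
    assert n == len(patches_b), "this sketch needs equal patch counts"
    # Pairwise matching cost: 1 - cosine similarity.
    cost = [[1.0 - cosine(a, b) for b in patches_b] for a in patches_a]
    # Brute-force search for the permutation with minimum total cost.
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    total_cost = sum(cost[i][best[i]] for i in range(n))
    # Aggregate matched pairs into an image-level similarity.
    return 1.0 - total_cost / n, best

patches_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
# The second image holds the same patches in a different spatial order.
patches_b = [patches_a[2], patches_a[0], patches_a[3], patches_a[1]]
similarity, matching = emd_similarity(patches_a, patches_b)
print(round(similarity, 6), matching)
```

Because the second image is a shuffled copy of the first, the optimal matching undoes the shuffle and the similarity is (numerically) one, which illustrates why a matching-based metric is insensitive to where a structure appears in the image.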

III-D Metric Learning and Few-Shot Classification

Given the support samples $\{x_j^s\}$ with known labels and an input query sample $x_i^q$, few-shot classification aims to determine the label of the query sample. To achieve this, we first obtain the sample-based global metric $\mathcal{M}_g(i,j)$ via Equ. (8) and the patch-based local metric $\mathcal{M}_l(i,j)$ via Equ. (10), and then combine them to obtain the final metric/similarity $\mathcal{M}(i,j)$ between $x_i^q$ and $x_j^s$ as

(11)

where $\alpha$ is a tradeoff parameter.

Then, we conduct few-shot classification by using the nearest neighbor classification strategy, i.e., the label of query $x_i^q$ is determined by the label of the support sample $x_j^s$ that is most similar to it, as used in previous works [40, 48].
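A small worked example may help to see how the tradeoff parameter interacts with nearest neighbor classification. The sketch assumes a convex combination of the two metrics (weight `alpha` on the global metric and `1 - alpha` on the local one); this form is an assumption chosen for illustration, and the metric values and labels below are made up.

```python
def classify_queries(global_metric, local_metric, support_labels, alpha):
    """Fuse global and local similarities, then pick the nearest support sample."""
    predictions = []
    for g_row, l_row in zip(global_metric, local_metric):
        # Assumed fusion rule: convex combination of the two similarities.
        fused = [alpha * g + (1.0 - alpha) * l for g, l in zip(g_row, l_row)]
        best = max(range(len(fused)), key=fused.__getitem__)
        predictions.append(support_labels[best])
    return predictions

# Rows are query samples, columns are support samples (toy values).
global_metric = [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
local_metric = [[0.6, 0.5, 0.2], [0.2, 0.9, 0.3]]
support_labels = ["cat", "dog", "bird"]

predictions = classify_queries(global_metric, local_metric, support_labels, alpha=0.5)
global_only = classify_queries(global_metric, local_metric, support_labels, alpha=1.0)
print(predictions, global_only)
```

In this toy case the second query is assigned to "bird" by the global metric alone but to "dog" once the local metric is taken into account, showing how the two branches can complement each other.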

Loss Function. In the training phase, we employ two loss functions for the proposed QSFormer. First, for the sampleFormer module, we introduce a contrastive loss as suggested in [24, 20], which encourages the positive query-support sample pairs with the same label (i.e., $y_i^q = y_j^s$) to be close and the negative query-support sample pairs with different labels (i.e., $y_i^q \neq y_j^s$) to be far apart in each episode. This loss function can be written as follows,

(12)

where $\mathcal{M}_g(i,j)$ is the global metric between query $x_i^q$ and support sample $x_j^s$. Second, the whole network is trained in an end-to-end way by minimizing the Cross-Entropy (CE) loss function [48]. Thus, the total loss function can be formulated as

(13)

where $\hat{y}$ is the label prediction obtained by our method and $y$ denotes the corresponding ground-truth label. $\lambda$ is the balancing hyper-parameter.
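The interplay of the two training objectives can be sketched as follows. The exact form of the contrastive term is not reproduced here; the sketch assumes a softmax-based (InfoNCE-style) loss over one row of the global metric with a hypothetical `temperature`, and assumes that the balancing hyper-parameter `lam` weights the contrastive term. All names and values are illustrative.

```python
import math

def cross_entropy(class_scores, target_index):
    """Softmax cross-entropy for one query sample."""
    peak = max(class_scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in class_scores))
    return log_norm - class_scores[target_index]

def contrastive_term(metric_row, support_labels, query_label, temperature=0.1):
    """Negative log of the softmax mass placed on same-label support samples."""
    scaled = [m / temperature for m in metric_row]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    positive = sum(e for e, lab in zip(exps, support_labels) if lab == query_label)
    return -math.log(positive / sum(exps))

def total_loss(class_scores, metric_row, support_labels, query_label, lam):
    """Cross-entropy plus a weighted contrastive term (assumed combination)."""
    ce = cross_entropy(class_scores, query_label)
    con = contrastive_term(metric_row, support_labels, query_label)
    return ce + lam * con

support_labels = [0, 1, 2, 3, 4]            # 5-way 1-shot: one support sample per class
class_scores = [0.2, 0.1, 1.5, 0.0, 0.3]    # toy class scores for a query of class 2
good_row = [0.1, 0.1, 0.9, 0.1, 0.2]        # positive pair has the highest similarity
bad_row = [0.9, 0.1, 0.1, 0.1, 0.2]         # a negative pair has the highest similarity

loss_good = total_loss(class_scores, good_row, support_labels, 2, lam=0.4)
loss_bad = total_loss(class_scores, bad_row, support_labels, 2, lam=0.4)
print(loss_good < loss_bad)
```

The contrastive term is smaller when the same-label support sample receives the largest similarity, so minimizing it pushes positive pairs together and negative pairs apart within an episode.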

| Method | Backbone | miniImageNet 1-shot | miniImageNet 5-shot | tieredImageNet 1-shot | tieredImageNet 5-shot |
|---|---|---|---|---|---|
| DHL [50] | Conv4 | 61.99 | 78.71 | 57.89 | 73.62 |
| cosine classifier [2]* | ResNet12 | 59.64 ± 0.27 | 75.80 ± 0.21 | 55.87 ± 0.31 | 80.92 ± 0.23 |
| TADAM [25] | ResNet12 | 58.50 ± 0.30 | 76.70 ± 0.30 | - | - |
| ECM [31] | ResNet12 | 59.00 | 77.46 | 63.99 | 81.97 |
| TPN [23] | ResNet12 | 59.46 | 75.65 | 59.91 ± 0.94 | 73.30 ± 0.75 |
| ProtoNet [35]* | ResNet12 | 63.03 ± 0.29 | 78.72 ± 0.21 | 68.68 ± 0.34 | 85.09 ± 0.23 |
| MTL [36] | ResNet12 | 61.20 ± 1.80 | 75.50 ± 0.80 | - | - |
| DC [19] | ResNet12 | 62.53 ± 0.19 | 79.77 ± 0.19 | - | - |
| MetaOptNet [17] | ResNet12 | 62.64 ± 0.82 | 78.63 ± 0.46 | 65.99 ± 0.72 | 81.56 ± 0.53 |
| MatchNet [40]* | ResNet12 | 61.24 ± 0.29 | 73.93 ± 0.23 | 71.01 ± 0.33 | 83.12 ± 0.24 |
| Meta-Baseline [3] | ResNet12 | 63.17 ± 0.23 | 79.26 ± 0.17 | 68.62 ± 0.27 | 83.74 ± 0.18 |
| CAN [13] | ResNet12 | 63.85 ± 0.48 | 79.44 ± 0.34 | 69.89 ± 0.51 | 84.23 ± 0.37 |
| PPA [29] | WRN-28-10 | 59.60 ± 0.41 | 73.74 ± 0.19 | 65.65 ± 0.92 | 83.40 ± 0.65 |
| wDAE-GNN [9] | WRN-28-10 | 61.07 ± 0.15 | 76.75 ± 0.11 | 68.18 ± 0.16 | 83.09 ± 0.12 |
| LEO [34] | WRN-28-10 | 61.76 ± 0.08 | 77.59 ± 0.12 | 66.33 ± 0.05 | 81.44 ± 0.09 |
| FEAT [46]* | ResNet12 | 64.75 ± 0.28 | 79.96 ± 0.20 | 71.34 ± 0.33 | 85.28 ± 0.23 |
| HT [51] | Transformer | 54.10 | 68.50 | 56.10 | 73.30 |
| DeepEMD [48]* | ResNet12 | 65.43 ± 0.28 | 79.28 ± 0.20 | 69.84 ± 0.32 | 84.06 ± 0.23 |
| DeepBDC [44]* | ResNet12 | 60.76 ± 0.28 | 78.25 ± 0.20 | 63.03 ± 0.31 | 81.57 ± 0.22 |
| QSFormer (Ours) | ResNet12 | 65.24 ± 0.28 | 79.96 ± 0.20 | 72.47 ± 0.31 | 85.43 ± 0.22 |

TABLE I: 5-way result comparison of ours and state-of-the-art methods on miniImageNet and tieredImageNet datasets. Most results are from [48] or the original papers. The 1st, 2nd and 3rd best results are respectively in Red, Blue and Green. * denotes that the method is reproduced with our settings.

Implementation Details. For a fair comparison, the ResNet-12 [2, 48] with fully connected layers removed is adopted as the backbone module. It is first pre-trained from scratch and then trained episodically within the meta-learning framework by following [3, 48]. We empirically conduct the feature interaction on the last two levels in CIFE to obtain the enhanced sample features. We randomly sample 50/1000/5000 episodes from the training/validation/testing set on the four public datasets. We compute the average accuracy and the corresponding 95% confidence interval to obtain the final performance on the four datasets. Our proposed method is implemented in Python on a server with a single 11G NVIDIA 2080Ti GPU. More hyper-parameter settings on the four benchmarks for the proposed QSFormer are shown in Table VI.
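The reported "mean ± interval" numbers can be reproduced from per-episode accuracies with a few lines. The sketch assumes the common normal approximation (1.96 standard errors) for the 95% confidence interval, which the text does not state explicitly; the accuracies below are toy values.

```python
import math
import statistics

def mean_and_ci95(accuracies):
    """Mean accuracy and half-width of a normal-approximation 95% interval."""
    mean = statistics.fmean(accuracies)
    # Standard error of the mean, scaled by the 1.96 normal quantile (assumed).
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, half_width

episode_accuracies = [60.0, 70.0, 80.0, 70.0]  # toy per-episode accuracies (%)
mean, half_width = mean_and_ci95(episode_accuracies)
print(f"{mean:.2f} +/- {half_width:.2f}")
```

Under this approximation, 5000 test episodes with a per-episode standard deviation of roughly 10 accuracy points would give a half-width of about 0.28, which is comparable in size to the intervals reported in the result tables.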

IV Experiments

IV-A Datasets and Evaluation Metric

To verify our proposed QSFormer, we conduct extensive experiments on four popular public datasets for the few-shot classification task, including miniImageNet [40], tieredImageNet [32], Fewshot-CIFAR100 [25] and Caltech-UCSD Birds-200-2011 [41]. We also conduct cross-domain experiments to evaluate the domain transfer ability of the proposed model. Recognition accuracy is adopted as the evaluation metric for our experiments. More details of the datasets are as follows.

miniImageNet. This dataset is a sub-dataset of ImageNet [33]. It contains a total of 100 classes with 600 samples in each class. As suggested in work [30], we divide these classes into training set, validation set and testing set, which respectively contains 64, 16 and 20 classes.

tieredImageNet. It contains 608 classes from 34 super-classes, with a total of 779,165 samples. Following [32], we split 34 super-classes into 20 super-classes (351 classes) for meta-training, 6 super-classes (97 classes) for meta-validation and 8 super-classes (160 classes) for meta-testing.

FC100. Fewshot-CIFAR100 is built upon the CIFAR100 dataset for few-shot classification task. It’s named FC100 for short hereafter. It contains a total of 60,000 images from 100 classes. To reduce the information overlap, we group the 100 classes into 20 super-classes by following work [25]. Then, we divide these super-classes into training set, validation set and testing set, which contains 12, 4 and 4 super-classes respectively.

CUB. The Caltech-UCSD Birds-200-2011 dataset is an extended version of the CUB-200 dataset. It is termed CUB for short hereafter. CUB was originally presented for the fine-grained bird classification task. It contains a total of 11,788 images from 200 classes. As suggested by [46], we divide the 200 classes into 100 classes for meta-training, 50 classes for meta-validation and 50 classes for meta-testing.

miniImageNet → CUB. Following [2], we train a model on the miniImageNet dataset and evaluate it on the CUB dataset to verify the transfer ability of the model. Specifically, in this experimental setting, we use all 100 classes of miniImageNet, with 600 samples per class, for meta-training and use the meta-testing set (50 classes) of the CUB dataset for meta-testing.

IV-B Comparison with State-of-the-art Methods

As shown in Table I, we report our results and compare them with other state-of-the-art (SOTA) approaches on the miniImageNet [40] and tieredImageNet [32] datasets. From this table, we can find that the proposed QSFormer beats many SOTA models on the miniImageNet dataset. For example, QSFormer exceeds the Transformer-based HT [51] method by +11.14 and +11.46 points in 1-shot and 5-shot tasks, respectively. Our model also outperforms the attention-based CAN [13] on the 1-shot/5-shot task by +1.39/+0.52 points. Compared with FEAT [46], which is also built on ResNet12 and Transformer, the proposed QSFormer obtains a better 1-shot result (65.24 vs. 64.75) and an equal 5-shot result (79.96).

From Table I, we can also see that QSFormer achieves the best performance on the tieredImageNet dataset, i.e., 72.47 ± 0.31 and 85.43 ± 0.22 in 1-shot and 5-shot tasks. It exceeds CAN [13] by +2.58 and +1.20 points in 1-shot and 5-shot tasks. Similar conclusions can also be drawn from the experimental results on the Fewshot-CIFAR100 [25] and CUB [41] datasets, as illustrated in Table II and Table III. All in all, the proposed QSFormer attains SOTA performance on multiple FSL datasets, which demonstrates the effectiveness and advantages of our proposed QSFormer model.

| Method | 1-shot | 5-shot |
|---|---|---|
| cosine classifier [2]* | 39.47 ± 0.23 | 56.29 ± 0.25 |
| FEAT [46]* | 42.28 ± 0.26 | 56.37 ± 0.25 |
| TADAM [25] | 40.10 ± 0.40 | 56.10 ± 0.40 |
| ProtoNet [35]* | 40.91 ± 0.26 | 56.66 ± 0.25 |
| MTL [36] | 45.10 ± 1.8 | 57.60 ± 0.9 |
| DC [19] | 42.04 ± 0.17 | 57.05 ± 0.16 |
| MetaOptNet [17] | 41.10 ± 0.60 | 55.50 ± 0.60 |
| MatchNet [40]* | 41.90 ± 0.27 | 54.41 ± 0.25 |
| TDE-FSL [45] | 44.61 ± 0.96 | 57.93 ± 0.81 |
| DeepEMD [48]* | 45.58 ± 0.26 | 62.08 ± 0.25 |
| DeepBDC [44]* | 43.57 ± 0.25 | 59.49 ± 0.25 |
| QSFormer (Ours) | 46.51 ± 0.26 | 61.58 ± 0.25 |

TABLE II: 5-way result comparison of ours and state-of-the-art methods on Fewshot-CIFAR100 dataset. The 1st, 2nd and 3rd best results are respectively in Red, Blue and Green. * denotes that the method is reproduced with our settings.

| Method | 1-shot | 5-shot |
|---|---|---|
| MELR [7] | 70.26 ± 0.50 | 85.01 ± 0.32 |
| IEPT [49] | 69.97 ± 0.49 | 84.33 ± 0.33 |
| MVT [26] | - | 85.35 ± 0.55 |
| FEAT [46]* | 75.00 ± 0.29 | 86.24 ± 0.19 |
| cosine classifier [2]* | 62.09 ± 0.29 | 80.04 ± 0.21 |
| ProtoNet [35]* | 70.93 ± 0.30 | 85.55 ± 0.19 |
| MatchNet [40]* | 70.21 ± 0.30 | 82.69 ± 0.22 |
| RelationNet [37] | 66.20 ± 0.99 | 82.30 ± 0.58 |
| MAML [8] | 67.28 ± 1.08 | 83.47 ± 0.59 |
| DEML [52] | 66.95 ± 1.06 | 77.11 ± 0.78 |
| DeepEMD [48]* | 70.71 ± 0.30 | 86.13 ± 0.19 |
| DeepBDC [44]* | 65.45 ± 0.29 | 85.01 ± 0.19 |
| QSFormer (Ours) | 75.44 ± 0.29 | 86.30 ± 0.19 |

TABLE III: 5-way result comparison of ours and state-of-the-art methods on Caltech-UCSD Birds-200-2011 dataset. The 1st, 2nd and 3rd best results are respectively in Red, Blue and Green. * denotes that the method is reproduced with our settings.

| Methods | 1-shot | 5-shot |
|---|---|---|
| ProtoNet [35] | 50.01 ± 0.82 | 72.02 ± 0.67 |
| MatchNet [40] | 51.65 ± 0.84 | 69.14 ± 0.72 |
| cosine classifier [2] | 44.17 ± 0.78 | 69.01 ± 0.74 |
| Baseline [2] | - | 65.57 ± 0.70 |
| Baseline++ [2] | - | 62.04 ± 0.76 |
| FEAT [46]* | 52.67 ± 0.29 | 72.65 ± 0.25 |
| DeepEMD [48] | 54.24 ± 0.86 | 78.86 ± 0.65 |
| DeepBDC [44]* | 50.28 ± 0.27 | 76.49 ± 0.23 |
| QSFormer (Ours) | 55.04 ± 0.29 | 77.12 ± 0.24 |

TABLE IV: Cross-domain experiments (miniImageNet → CUB). * denotes that the method is reproduced with our settings. The red represents the best results and blue denotes the second-best results.

| # | Baseline | CIFE | sampleFormer | patchFormer | miniImageNet | tieredImageNet | FC100 | CUB |
|---|---|---|---|---|---|---|---|---|
| 1 | ✓ | | | | 59.64 ± 0.27 | 55.87 ± 0.31 | 39.47 ± 0.23 | 62.09 ± 0.29 |
| 2 | ✓ | ✓ | | | 61.15 ± 0.28 | 70.73 ± 0.32 | 41.54 ± 0.25 | 65.95 ± 0.30 |
| 3 | ✓ | ✓ | ✓ | | 63.97 ± 0.28 | 71.64 ± 0.32 | 45.46 ± 0.26 | 72.93 ± 0.29 |
| 4 | ✓ | ✓ | ✓ | ✓ | 65.24 ± 0.28 | 72.47 ± 0.31 | 46.51 ± 0.26 | 75.44 ± 0.29 |

TABLE V: Ablation study for the different components of the proposed QSFormer. The best results are highlighted in bold.

| Hyper-parameters | miniImageNet | tieredImageNet | FC100 | CUB | miniImageNet → CUB |
|---|---|---|---|---|---|
| Optimizer | SGD | SGD | SGD | SGD | SGD |
| Initial LR | 5e-4 | 5e-4 | 1e-4 | 5e-4 | 5e-4 |
| Steps of LR decay | 10 | 10 | 10 | 10 | 10 |
| Coefficient of LR decay | 0.9 | 0.5 | 0.9 | 0.95 | 0.9 |
| N | 3 | 3 | 4 | 2 | 3 |
| Number of heads | 10, 8 | 8, 8 | 8, 1 | 8, 1 | 10, 8 |
| Dropout rates | 0.5, 0.5, 0.5, 0.1 | 0.5, 0.5, 0.5, 0.1 | 0.5, 0.5, 0.5, 0.1 | 0.1, 0.5, 0.5, 0.1 | 0.5, 0.5, 0.5, 0.1 |
| | 0.7 | 0.5 | 0.5 | 0.05 | 0.7 |
| | 0.1 | 0.1 | 0.4 | 0.3 | 0.1 |
| Epochs | 100 | 100 | 50 | 150 | 100 |

TABLE VI: Hyperparameter settings of our proposed QSFormer.

| Methods | Metric | miniImageNet | tieredImageNet | FC100 | CUB |
|---|---|---|---|---|---|
| cosine classifier [2]* | Cosine | 59.64 ± 0.27 | 55.87 ± 0.31 | 39.47 ± 0.23 | 62.09 ± 0.29 |
| MatchNet [40]* | Cosine | 61.24 ± 0.29 | 71.01 ± 0.33 | 41.90 ± 0.27 | 70.20 ± 0.30 |
| ProtoNet [35]* | Euclidean | 63.03 ± 0.29 | 68.68 ± 0.34 | 40.91 ± 0.26 | 70.93 ± 0.30 |
| DeepEMD [48]* | EMD | 65.43 ± 0.28 | 69.84 ± 0.32 | 45.58 ± 0.26 | 70.71 ± 0.30 |
| QSFormer | Ours | 65.24 ± 0.28 | 72.47 ± 0.31 | 46.51 ± 0.26 | 75.44 ± 0.29 |

TABLE VII: Performance comparison of the classical methods based on different metric learning. * denotes that the compared method is reproduced with our settings. The bold black represents the best results.

IV-C Ablation Study

To better understand the effectiveness of our proposed QSFormer, in this section, we conduct extensive ablation studies, including component analysis, similarity metric analysis, cross-domain analysis, etc.

Component Analysis. Our proposed QSFormer mainly contains three components: Cross-scale Interactive Feature Extractor (CIFE), Sample Transformer Module (sampleFormer) and Patch Transformer Module (patchFormer). The experimental results of ablation study are shown in Table V. We reproduce cosine classifier method [2] consisting of CNN network and cosine distance as the Baseline network for comparison. From Table V, we can observe: (1) By comparing #1 with #2, the performance of Baseline network can be significantly improved with the help of CIFE, which demonstrates the effectiveness of CIFE. (2) By comparing #2 with #3, we can find that sampleFormer significantly improves the performance of model based on #2, which indicates the effectiveness of sampleFormer module. (3) By adding patchFormer into #3, we further improve the performance of whole network, which shows the effectiveness of patchFormer module. All these experiments fully validate the effectiveness of each component in our proposed QSFormer framework.

Similarity Metric Analysis. To verify the effectiveness of the proposed QSFormer on metric learning, we visualize the similarity distributions of the Baseline and QSFormer on the more challenging 5-way 1-shot task, as shown in Figure 4. For the 5-way 1-shot task, each query sample generates the similarity results of one positive query-support sample pair (i.e., “Q-S pos”) and four negative query-support sample pairs (i.e., “Q-S neg”) during the metric learning process. To facilitate the comparison of the similarity results of “Q-S pos” and “Q-S neg”, we average the similarity values of the four “Q-S neg” pairs corresponding to each query sample. For this experiment, we perform 10 episodes, where each episode randomly selects query samples for classification, which yields one “Q-S pos” and one averaged “Q-S neg” similarity value per query sample. Subsequently, we count the number of “Q-S pos” and “Q-S neg” values within each range of the normalized similarity and thus produce the similarity distribution shown in Figure 4. We can observe that: (1) the similarity values of “Q-S pos” obtained by the Baseline method are generally below 0.5, while those of “Q-S neg” are above 0.25. (2) In our proposed QSFormer, the similarity values of “Q-S pos” are mostly above 0.5, while those of “Q-S neg” are mostly below 0.25. Therefore, our proposed QSFormer can separate positive and negative query-support sample pairs more accurately.
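The counting procedure behind such a similarity distribution can be sketched as follows. The helper names, the number of bins and the toy similarity rows are assumptions for illustration; similarities are assumed to be normalized to the range [0, 1] and each query is assumed to have exactly one positive support sample, as in the 1-shot setting.

```python
def pos_neg_similarities(metric_row, support_labels, query_label):
    """Return the positive similarity and the mean of the negative similarities."""
    pos = [s for s, lab in zip(metric_row, support_labels) if lab == query_label]
    neg = [s for s, lab in zip(metric_row, support_labels) if lab != query_label]
    return pos[0], sum(neg) / len(neg)

def histogram(values, n_bins=4):
    """Count values in equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        idx = min(int(v * n_bins), n_bins - 1)  # put 1.0 into the last bin
        counts[idx] += 1
    return counts

support_labels = [0, 1, 2, 3, 4]  # 5-way 1-shot
# Toy (query label, similarity row) pairs.
episode = [
    (0, [0.8, 0.1, 0.2, 0.3, 0.2]),
    (2, [0.2, 0.1, 0.7, 0.1, 0.2]),
    (4, [0.3, 0.2, 0.1, 0.2, 0.6]),
]
pairs = [pos_neg_similarities(row, support_labels, label) for label, row in episode]
pos_hist = histogram([p for p, _ in pairs])
neg_hist = histogram([n for _, n in pairs])
print(pos_hist, neg_hist)
```

Plotting the two histograms side by side gives the kind of picture used to judge how well positive and negative pairs are separated.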

In addition, we also compare our QSFormer with other metric learning algorithms, including cosine classifier [2], MatchNet [40], ProtoNet [35] and DeepEMD [48]. These compared methods are reproduced with the same settings and training schemes as ours for a fairer comparison. As shown in Table VII, our proposed method obtains the best performance on three of the four public datasets (tieredImageNet, FC100 and CUB) and is second only to DeepEMD on miniImageNet (65.24 vs. 65.43), which demonstrates the effectiveness of our proposed QSFormer for metric learning.

Cross-domain Analysis. To validate the transferability of our proposed QSFormer, we conduct a cross-domain experiment following [2, 48]. The model is trained on the miniImagenet dataset and tested on the CUB dataset. As shown in Table IV, our proposed QSFormer achieves the best performance in the 1-shot setting (55.04 ± 0.29) and the second-best result in the 5-shot setting (77.12 ± 0.24). These results suggest that the proposed QSFormer learns discriminative information that transfers across domains and adaptively explores the correspondence between query and support samples.

Fig. 4: Comparison of similarity distribution between (a) Baseline and (b) our QSFormer. The similarities of “Q-S pos” become larger while the similarities of “Q-S neg” become smaller, which indicates they are more easily separated.

Parameter Analysis. There are two important parameters in our model: the balance parameter in Equ. (11), which weights the local and global metrics, and the number of sampleFormer layers. In this section, we conduct experiments on the FC100 dataset under the 5-way 1-shot setting to examine their influence. As shown in Figure 5, the performance is relatively stable when the balance parameter is adjusted slightly within the range (0.2, 0.6). For the number of sampleFormer layers, the performance increases continuously as the number of layers grows from 2 to 4. We therefore set both parameters for our experiments based on these observations.
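
A sensitivity study of this kind can be sketched as a simple sweep over candidate values of the balance parameter. The snippet below is a hedged illustration only: Equ. (11) is not reproduced in this section, so the weighted sum `global_sim + lam * local_sim` is an assumed form, and the names `combine_scores`, `episode_accuracy` and `sweep_balance` as well as the toy scores are hypothetical.

```python
import numpy as np

def combine_scores(global_sim, local_sim, lam):
    # ASSUMPTION: a plain weighted sum of the two metrics; the actual
    # form of Equ. (11) in the paper may differ.
    return np.asarray(global_sim, dtype=float) + lam * np.asarray(local_sim, dtype=float)

def episode_accuracy(scores, labels):
    """Fraction of query samples whose highest-scoring class is correct."""
    return float(np.mean(np.argmax(scores, axis=1) == np.asarray(labels)))

def sweep_balance(global_sim, local_sim, labels, lams):
    """Evaluate accuracy for every candidate balance value."""
    return {float(lam): episode_accuracy(combine_scores(global_sim, local_sim, lam), labels)
            for lam in lams}

# Synthetic 2-way scores for two query samples (illustrative numbers only).
toy_global = [[0.60, 0.40], [0.55, 0.45]]
toy_local = [[0.50, 0.50], [0.00, 1.00]]
toy_labels = [0, 1]
results = sweep_balance(toy_global, toy_local, toy_labels, lams=[0.0, 0.5])
```

In practice the accuracy would be averaged over many test episodes for each candidate value, and the same loop could be repeated over the number of sampleFormer layers by retraining the model for each setting.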

Fig. 5: Ablation study of two parameters (i.e., the balance parameter and the number of sampleFormer layers).

V Conclusion

In this paper, we propose a novel unified Query-Support Transformer (QSFormer) to deeply exploit the sample relationships between the query and support sets for the few-shot classification task. QSFormer mainly contains a sample Transformer (sampleFormer) module and a patch Transformer (patchFormer) module. sampleFormer is designed to match the problem setting of few-shot classification, i.e., it couples sample representation and metric learning between the query and support sets within a single Transformer architecture. Meanwhile, as a complement, patchFormer is adopted to model the local structural metric between query and support samples. A new CNN feature extractor (CIFE) is also proposed to provide an effective CNN backbone for our approach. Extensive experiments demonstrate the effectiveness and superiority of our proposed QSFormer approach.

References

  • [1] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In Proceedings of European Conference on Computer Vision, pp. 213–229. Cited by: §II.
  • [2] W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang (2019) A closer look at few-shot classification. In Proceedings of the International Conference on Learning Representations. Cited by: §I, §II, §III-D, TABLE I, §IV-A, §IV-C, §IV-C, §IV-C, TABLE II, TABLE III, TABLE IV, TABLE VII.
  • [3] Y. Chen, Z. Liu, H. Xu, T. Darrell, and X. Wang (2021) Meta-baseline: exploring simple meta-learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9062–9071. Cited by: §I, §III-D, TABLE I.
  • [4] Z. Chen, Q. Cui, B. Zhao, R. Song, X. Zhang, and O. Yoshie (2022) SST: spatial and semantic transformers for multi-label image recognition. IEEE Transactions on Image Processing 31, pp. 2570–2583. Cited by: §II.
  • [5] B. Dong, P. Zhou, S. Yan, and W. Zuo (2022) Self-promoted supervision for few-shot transformer. arXiv preprint arXiv:2203.07057. Cited by: §I, §II.
  • [6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §I, §III-B, §III-C, §III-C.
  • [7] N. Fei, Z. Lu, T. Xiang, and S. Huang (2021) MELR: meta-learning via modeling episode-level relationships for few-shot learning. In Proceedings of the International Conference on Learning Representations. Cited by: §III, TABLE III.
  • [8] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, pp. 1126–1135. Cited by: TABLE III.
  • [9] S. Gidaris and N. Komodakis (2019) Generating classification weights with gnn denoising autoencoders for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21–30. Cited by: TABLE I.
  • [10] T. Guan, J. Wang, S. Lan, R. Chandra, Z. Wu, L. Davis, and D. Manocha (2022) M3detr: multi-representation, multi-scale, mutual-relation 3d object detection with transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 772–782. Cited by: §II.
  • [11] Y. He, W. Liang, D. Zhao, H. Zhou, W. Ge, Y. Yu, and W. Zhang (2022) Attribute surrogates learning and spectral tokens pooling in transformers for few-shot learning. arXiv preprint arXiv:2203.09064. Cited by: §I, §II.
  • [12] F. L. Hitchcock (1941) The distribution of a product from several sources to numerous localities. Journal of Mathematics and Physics 20 (1-4), pp. 224–230. Cited by: §III-C.
  • [13] R. Hou, H. Chang, B. Ma, S. Shan, and X. Chen (2019) Cross attention network for few-shot classification. Advances in Neural Information Processing Systems 32. Cited by: §II, TABLE I, §III, §IV-B, §IV-B.
  • [14] M. A. Jamal and G. Qi (2019) Task agnostic meta-learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11719–11727. Cited by: §I, §II.
  • [15] M. Jia, X. Cheng, S. Lu, and J. Zhang (2022) Learning disentangled representation implicitly via transformer for occluded person re-identification. IEEE Transactions on Multimedia, pp. 1–11. Cited by: §II.
  • [16] B. Jiang, K. Zhao, and J. Tang (2022) RGTransformer: region-graph transformer for image representation and few-shot classification. IEEE Signal Processing Letters 29, pp. 792–796. Cited by: §II.
  • [17] K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019) Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10657–10665. Cited by: TABLE I, TABLE II.
  • [18] S. Liao and L. Shao (2021) TransMatcher: deep image matching through transformers for generalizable person re-identification. Advances in Neural Information Processing Systems 34, pp. 1992–2003. Cited by: §II.
  • [19] Y. Lifchitz, Y. Avrithis, S. Picard, and A. Bursuc (2019) Dense classification and implanting for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9258–9267. Cited by: TABLE I, TABLE II.
  • [20] C. Liu, Y. Fu, C. Xu, S. Yang, J. Li, C. Wang, and L. Zhang (2021) Learning a few-shot embedding model with contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 8635–8643. Cited by: §III-D.
  • [21] L. Liu, W. Hamilton, G. Long, J. Jiang, and H. Larochelle (2021) A universal representation transformer layer for few-shot image classification. In Proceedings of the International Conference on Learning Representations. Cited by: §II.
  • [22] S. Liu, L. Zhang, X. Yang, H. Su, and J. Zhu (2021) Query2label: a simple transformer way to multi-label classification. arXiv preprint arXiv:2107.10834. Cited by: §II.
  • [23] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang (2019) Learning to propagate labels: transductive propagation network for few-shot learning. In Proceedings of the International Conference on Learning Representations. Cited by: TABLE I.
  • [24] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §III-D.
  • [25] B. Oreshkin, P. Rodríguez López, and A. Lacoste (2018) Tadam: task dependent adaptive metric for improved few-shot learning. Advances in Neural Information Processing Systems 31, pp. 719–729. Cited by: TABLE I, §IV-A, §IV-A, §IV-B, TABLE II.
  • [26] S. Park, S. Han, J. Baek, I. Kim, J. Song, H. B. Lee, J. Han, and S. J. Hwang (2020) Meta variance transfer: learning to augment from the others. In Proceedings of the International Conference on Machine Learning, pp. 7510–7520. Cited by: TABLE III.
  • [27] Z. Peng, W. Huang, S. Gu, L. Xie, Y. Wang, J. Jiao, and Q. Ye (2021) Conformer: local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 367–376. Cited by: §I.
  • [28] O. Petit, N. Thome, C. Rambour, L. Themyr, T. Collins, and L. Soler (2021) U-net transformer: self and cross attention for medical image segmentation. In Proceedings of International Workshop on Machine Learning in Medical Imaging, pp. 267–276. Cited by: §II.
  • [29] S. Qiao, C. Liu, W. Shen, and A. L. Yuille (2018) Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238. Cited by: TABLE I.
  • [30] S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. Cited by: §IV-A.
  • [31] A. Ravichandran, R. Bhotika, and S. Soatto (2019) Few-shot learning with embedded class models and shot-free meta training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 331–339. Cited by: TABLE I.
  • [32] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-learning for semi-supervised few-shot classification. In Proceedings of the International Conference on Learning Representations. Cited by: §IV-A, §IV-A, §IV-B.
  • [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §IV-A.
  • [34] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2018) Meta-learning with latent embedding optimization. In Proceedings of the International Conference on Learning Representations, pp. 6907–6917. Cited by: §I, TABLE I.
  • [35] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems 30, pp. 4080–4090. Cited by: §I, TABLE I, §IV-C, TABLE II, TABLE III, TABLE IV, TABLE VII.
  • [36] Q. Sun, Y. Liu, T. Chua, and B. Schiele (2019) Meta-transfer learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 403–412. Cited by: TABLE I, TABLE II.
  • [37] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §II, TABLE III.
  • [38] J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, and V. M. Patel (2021) Medical transformer: gated axial-attention for medical image segmentation. In Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 36–46. Cited by: §II.
  • [39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems 30. Cited by: §II, §III-A, §III-C.
  • [40] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. Advances in Neural Information Processing Systems 29. Cited by: §I, §II, §III-D, TABLE I, §IV-A, §IV-B, §IV-C, TABLE II, TABLE III, TABLE IV, TABLE VII.
  • [41] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §IV-A, §IV-B.
  • [42] J. Wang, B. Song, D. Wang, and H. Qin (2022) Two-stream network with phase map for few-shot classification. Neurocomputing 472, pp. 45–53. Cited by: §III.
  • [43] N. Wang, W. Zhou, J. Wang, and H. Li (2021) Transformer meets tracker: exploiting temporal context for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1571–1580. Cited by: §II.
  • [44] J. Xie, F. Long, J. Lv, Q. Wang, and P. Li (2022) Joint distribution matters: deep brownian distance covariance for few-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7972–7981. Cited by: §II, TABLE I, TABLE II, TABLE III, TABLE IV.
  • [45] L. Xing, S. Shao, W. Liu, A. Han, X. Pan, and B. Liu (2022) Learning task-specific discriminative embeddings for few-shot image classification. Neurocomputing 488, pp. 1–13. Cited by: TABLE II.
  • [46] H. Ye, H. Hu, D. Zhan, and F. Sha (2020) Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8808–8817. Cited by: §II, TABLE I, §IV-A, §IV-B, TABLE II, TABLE III, TABLE IV.
  • [47] E. Yu, Z. Li, S. Han, and H. Wang (2022) Relationtrack: relation-aware multiple object tracking with decoupled representation. IEEE Transactions on Multimedia, pp. 1–12. Cited by: §II.
  • [48] C. Zhang, Y. Cai, G. Lin, and C. Shen (2020) Deepemd: few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12203–12213. Cited by: §I, §II, §III-C, §III-D, §III-D, §III-D, TABLE I, §III, §IV-C, §IV-C, TABLE II, TABLE III, TABLE IV, TABLE VII.
  • [49] M. Zhang, J. Zhang, Z. Lu, T. Xiang, M. Ding, and S. Huang (2021) IEPT: instance-level and episode-level pretext tasks for few-shot learning. In Proceedings of the International Conference on Learning Representations. Cited by: TABLE III.
  • [50] X. Zhang, Y. Zhang, Z. Zhang, and J. Liu (2022) Discriminative learning of imaginary data for few-shot classification. Neurocomputing 467, pp. 406–417. Cited by: TABLE I.
  • [51] A. Zhmoginov, M. Sandler, and M. Vladymyrov (2022) HyperTransformer: model generation for supervised and semi-supervised few-shot learning. arXiv preprint arXiv:2201.04182. Cited by: §I, §II, TABLE I, §IV-B.
  • [52] F. Zhou, B. Wu, and Z. Li (2018) Deep meta-learning: learning to learn in the concept space. arXiv preprint arXiv:1802.03596. Cited by: TABLE III.