FAKD: Feature Augmented Knowledge Distillation for Semantic Segmentation

Jianlong Yuan

^{1}

, Qian Qi

^{1}

, Fei Du

^{1}

, Zhibin Wang

^{1}

, Fan Wang

^{1}

, Yifan Liu

^{2}

Abstract

In this work, we explore data augmentations for knowledge distillation on semantic segmentation. To avoid over-fitting to the noise in the teacher network, a large number of training examples is essential for knowledge distillation. Image-level argumentation techniques like flipping, translation or rotation are widely used in previous knowledge distillation framework. Inspired by the recent progress on semantic directions on feature-space, we propose to include augmentations in feature space for efficient distillation. Specifically, given a semantic direction, an infinite number of augmentations can be obtained for the student in the feature space. Furthermore, the analysis shows that those augmentations can be optimized simultaneously by minimizing an upper bound for the losses defined by augmentations. Based on the observation, a new algorithm is developed for knowledge distillation in semantic segmentation. Extensive experiments on four semantic segmentation benchmarks demonstrate that the proposed method can boost the performance of current knowledge distillation methods without any significant overhead. Code is available at: https://github.com/jianlong-yuan/FAKD.

\affiliations

¹Alibaba Group
²University of Adelaide
{gongyuan.yjl, qi.qian, dufei.df, zhibin.waz, fan.w}@alibaba-inc.com
yifan.liu04@adelaide.edu.au

Introduction

Semantic segmentation/Pixel labeling is a fundamental problem in computer vision, which is widely used in scene parsing, human body parsing, and many downstream applications. It is a per-pixel classification problem that assigns each pixel in an image to a specific set of predefined classes. With the development of deep learning, semantic segmentation has made tremendous progress in recent years and achieved impressive results on large benchmark datasets Everingham10; mottaghi_cvpr14; cordts2016cityscapes; zhou2017scene. However, advanced segmentation models usually require complicated model designs and expensive computations. As a result, it limits the potential for computationally constrained applications on devices.

To alleviate the computational cost, some efforts have been devoted to designing lightweight networks, especially for semantic segmentation yu2018bisenet; mehta2018espnet; zhao2018icnet; yu2021bisenet; chen2019fasterseg; li2019dfanet. In addition, knowledge distillation (KD) attracts attention to train lightweight semantic segmentation networks effectively by leveraging the knowledge from the teacher network. The conventional KD method for the classification problem is to apply a Kullback–Leibler (KL) divergence between the output of the teacher and the student network for each example. However, mimicking every pixel’s class distribution is inappropriate for semantic segmentation since the noise accumulated from the pixel-level activation may degenerate the performance. SKDS liu2019structured; liu2020structured aggregates a sub-set of different spatial locations according pair-wise relations. Moreover, IFVD hou2020inter exploits the inter-class relations among different locations. CWD shu2021channel proposes channel-wise knowledge distillation by minimizing the channel distribution, indicating the spatial location for each class. CIRKD yang2022cross focuses on transferring structured pixel-to-pixel and pixel-to-region relations among the whole images.

However, the teacher and student can have different training procedures, which makes the transfer challenging even with the sophisticated objective function for semantic segmentation. Moreover, optimization based on a limited number of training examples can make the knowledge transfer for the whole data distribution more challenging. To capture the data distribution on unseen data, augmentation is a prevalent strategy to improve the generalization of deep models. Concretely, image-level data augmentation operators, including random crop, random scale, and flip he2016deep, are widely applied in existing methods. Apparently, the number of applicable augmentation operators for each image is limited, which may be insufficient for knowledge distillation that has to mimic the diverse patterns from the teacher. Moreover, this kind of image space based augmentations has to be fed forward from the input to the whole network, thus only a limited number of augmentations can be adopted for optimization.

Recently, some works upchurch2017deep; li2016convolutional; bengio2013better; wang2019implicit; wang2021regularizing propose there are many semantic directions in the deep feature space, such that translating a data sample along one of these directions in the feature space produces a feature representation corresponding to another sample with the same class but different semantics. Meanwhile, compared to the augmentation in raw images, that in high-level features is more flexible and can be obtained along with arbitrary semantic directions. Therefore, motivated by these, we propose to augment the samples in the feature space to apply sufficient and diverse augmentations for knowledge distillation.

Specifically, the direction of intra-class variation that contains the semantic information of each class is essential and adopted for the perturbation in feature space. Based on the intra-class variance, we augment examples in feature space according to the corresponding Gaussian distribution. Thanks to this procedure, we can have an infinite number of augmentations from feature space, and an upper bound of the loss function can be optimized accordingly to transfer the knowledge from the teacher sufficiently. With massive augmented examples for the student, the nice properties, e.g., large margin, from the teacher can be preserved for generalization.

Finally, based on the distribution definition, different knowledge distillation objective functions will have different surrogate functions for the upper bounds of infinite augmentations. In this work, two types of Kullback–Leibler divergence-based knowledge distillation loss are investigated, and the corresponding loss functions are derived. Experiment results show that the proposed feature augmentation method can improve the performance over baseline KL–based knowledge distillation losses on four different datasets.

Concretely, our contributions are three-fold:

We propose a novel feature augmented knowledge distillation (FAKD) method for semantic segmentation. Specifically, the student model will mimic the teacher model with infinite augmented samples. To the best of our knowledge, it is the first time that the feature-level augmentation method has been explored in knowledge distillation for semantic segmentation.
We provide a theoretical analysis from the infinite number of augmentations in feature space to the upper-bound loss, so as to theoretically demonstrate our proposal for different knowledge distillation methods.
We demonstrate the effectiveness of our approach applying various network structures on four benchmark datasets: ADE20K, Pascal Context, Cityscapes, and Pascal VOC.

Related Work

Semantic Segmentation

Semantic segmentation can be regarded as a pixel-level classification problem. The fully convolutional network(FCN) long2015fully is a pioneering work in the field of semantic segmentation, which can adapt to any scale input. To improve the performance of the segmentation network, many researchers try to improve FCN in different ways. In chen2017deeplab; yu2015multi; zhao2017pyramid; yang2018denseaspp; peng2017large; chen2018encoder; yuan2020multi, the receptive field is enlarged to capture more details. In zhang2018context; he2019adaptive; lin2017refinenet; zhou2019context; yuan2020object; yuan2018ocnet; yu2020context, the contextual information is combined to improve the semantic understanding of the model. In ding2019boundary; bertasius2016semantic; yuan2020segfix; zhen2020joint; takikawa2019gated; yu2018learning; chen2016semantic, the boundary information is considered to boost the segmentation accuracy further. In wang2018non; fu2019dual; zhong2020squeeze; huang2019ccnet; zhao2018psanet, some new attention modules are designed to enrich the representations of the feature map. Moreover, xie2021segformer; zheng2021rethinking; wang2021end introduce transformer blocks into the Semantic Segmentation task, which boosts the performance with a large margin. Although the state-of-the-art performance keeps improving nowadays, the models have become larger than before. The high computational cost limits the application of the segmentation models over resource-limited mobile devices.

Efficient segmentation networks attract wide attention due to the need for real-time inference. Most of the works pay attention to designing lightweight networks with cheap operations. ENet paszke2016enet is a efficient segmentation network with early downsampling, lightweight decoder. ESPNet mehta2018espnet introduces more cheap spatial pyramid of dilated convolution. ICNet zhao2018icnet proposes a cascade structure to use low-resolution and high-resolution features. BiSeNet yu2018bisenet combines a spatial path and a context path to process features efficiently.

Knowledge distillation

Knowledge distillation has been extensively studied in recent years. The core idea of knowledge distillation is to transfer meaningful knowledge from a cumbersome teacher into a smaller and faster student. Most knowledge distillation methods for image classification networks could be classified into three classes, probability-based, feature-based and relation-based methods. hinton2015distilling; zhou2021rethinking distill the output logits of the teacher network as soft labels to the student. Feature-based knowledge distillation methods romero2014fitnets; heo2019comprehensive; zagoruyko2016paying focuses on the feature maps. Finally, relation-based KD peng2019correlation; park2019relational; qian2022improved aligns correlations among multiple instances between the student and teacher networks. However, these image-level KD methods are not suitable for pixel-level semantic segmentation.

Some works pay attention to utilizing knowledge distillation to semantic segmentation. The pioneering work liu2019structured; liu2020structured proposes Structured Knowledge Distillation (SKDS), which enables the output of the student model to transfer the structure knowledge among pixels from the teacher model to the student model through pairwise relationships and adversarial training. wang2020intra focuses on intra-class feature variations between pixels with the same label, where the set of cosine distances between the features of each pixel and its corresponding class prototype is constructed to convey structural knowledge. shu2021channel proposes channel-wise distillation to guide the student to mimic the teacher’s channel-wise distribution, which indicates the spatial location for each class. yang2022cross focuses on transferring structured pixel-to-pixel and pixel-to-region relations among the whole images. These methods try to develop objective functions to align useful knowledge between the teacher and the student network. Our approach, however, is to face how effective knowledge distillation can be performed when teacher and student training procedures are not aligned in real-life scenarios.

Figure 1: Illustration of the proposed feature augmented knowledge distillation (FAKD). The parameters of the teacher network are fixed during training. The feature of the student network is augmented to mimic the effect of infinite samples. The weight for the ‘CLS $_{s}$ ’ is shared.

Methods

The overall pipeline of the proposed method, feature augmented knowledge distillation (FAKD), is illustrated in Figure 1. We will first revisit two Kullback–Leibler (KL) divergence-based knowledge distillation losses and then investigate the new loss functions with an infinite number of augmentations.

Revisit Knowledge Distillation

KL divergence is a standard metric to evaluate the difference between distributions, which is also widely used in designing knowledge distillation objective functions. In this subsection, we will revisit two popular knowledge distillation objective functions, i.e., pixel-wise distillation hinton2015distilling and channel-wise distillation shu2021channel.

The pixel-wise distillation (PD) is straightforward that mimics the output from the teacher for each pixel. It directly extends the knowledge distillation protocol for classification to semantic segmentation and is widely used as a baseline method in existing semantic segmentation distillation works liu2019structured; hou2020inter. The channel-wise distillation (CWD) shu2021channel aligns the output distribution for each channel rather than the pixel. It achieves state-of-the-art performance on benchmark semantic segmentation datasets.

Let $s_{i} \in R^{A}$ and $t_{i} \in R^{A}$ be the feature of the student and teacher of the $i$ pixel. $A$ is the number of the feature channel before the final classifier. A linear classifier is usually applied to the feature to get the final logit, for example for the teacher’s output of the pixel $i$ for the $c$ class, $q_{t_{i}}^{c} = ω_{c}^{⊤} t_{i} + b_{c}$ . The loss function for the pixel-wise distillation hinton2015distilling; liu2019structured; liu2020structured can be written as

ℓ_{P D} = \frac{1}{M} M \sum i = 1 C \sum c = 1 ⎛ ⎝ - \frac{e^{ω_{c}^{⊤} t_{i} + b_{c}}}{\sum_{k = 1}^{C} e^{ω_{k}^{⊤} t_{k} + b_{k}}} log ⎛ ⎝ \frac{e^{ω_{c}^{⊤} s_{i} + b_{c}}}{\sum_{k = 1}^{C} e^{ω_{k}^{⊤} s_{k} + b_{k}}} ⎞ ⎠ ⎞ ⎠,

(1)

where $M$ donates the number of pixels and $C$ is the number of categories. $W = [ω_{1}, \dots, ω_{C}]^{⊤} \in R^{C \times A}$ and $B = [b_{1}, \dots, b_{C}]^{⊤} \in R^{C}$ are the classification convolution weights and bias, respectively. The softmax operator is applied to every pixel prediction for all classes to find out the class distribution for all pixels. Then the KL divergence between two class distributions will be minimized by $ℓ_{P D}$ .

For CWD shu2021channel, it aims to minimize the KL divergence between the channel-wise probability map of the two networks, which can be written as

ℓ_{C W D} = \frac{τ^{2}}{C} M \sum i = 1 C \sum c = 1 ⎛ ⎜ ⎜ ⎝ - \frac{e^{\frac{ω_{c}^{⊤} t_{i} + b_{c}}{τ}}}{\sum_{k = 1}^{M} e^{\frac{ω_{c}^{⊤} t_{k} + b_{c}}{τ}}} log ⎛ ⎜ ⎜ ⎝ \frac{e^{\frac{ω_{c}^{⊤} s_{i} + b_{c}}{τ}}}{\sum_{k = 1}^{M} e^{\frac{ω_{c}^{⊤} s_{k} + b_{c}}{τ}}} ⎞ ⎟ ⎟ ⎠ ⎞ ⎟ ⎟ ⎠

(2)

where $τ$ is the temperature factor. Note that the main difference between Eq. 2 and Eq. 1 lies in the softmax operator. Different normalization operators leverage different distributions.

Feature Augmented Knowledge Distillation

To improve the training effectiveness, the feature from the student $s_{i}$ can be perturbed $N$ times while all of these augmented examples should be aligned with the feature from the teacher, i.e., $t_{i}$ . Consequently, the distillation loss can be optimized with the set of pairs of ${(s_{i}^{1}, t_{i}), \dots, (s_{i}^{N}, t_{i})}_{i = 1}^{M}$ . We will have the analysis for CWD while a similar analysis can be conducted on PD.

Let $p_{t_{i}}^{c}$ donate the output from the teacher.

p_{t_{i}}^{c} = \frac{e^{\frac{ω_{c}^{⊤} t_{i} + b_{c}}{τ}}}{\sum_{k = 1}^{M} e^{\frac{ω_{c}^{⊤} t_{k} + b_{c}}{τ}}}

The CWD loss in Eq. 2 can be rewritten with the augmented examples.

\frac{τ^{2}}{C} M \sum i = 1 C \sum c = 1 \frac{1}{N} N \sum n = 1 - p_{t_{i}}^{c} log ⎛ ⎜ ⎜ ⎝ \frac{e^{\frac{ω_{c}^{⊤} s_{i}^{n} + b_{c}}{τ}}}{\sum_{k = 1}^{M} e^{\frac{ω_{c}^{⊤} s_{k}^{n} + b_{c}}{τ}}} ⎞ ⎟ ⎟ ⎠

(3)

With a large $N$ , the information from the teacher can be leveraged more sufficiently. Now we consider an extreme case such that $N \to \infty$ . With the infinity augmentations, the expectation over the set of augmentations is optimized instead, and the loss function becomes

ℓaug=τ2CE{^sj}Mj=1M∑i=1C∑c=1pcti(log(M∑k=1ew⊤c(^si−^sk)τ)).

(4)

The loss can be upper-bounded by an empirical loss as follows.

Theorem 1.

By assuming that ${^s}_{i} \sim N (s_{i}, λ_{i} Σ_{i})$ , we have

	$ℓ_{a u g} \leq$
	$\frac{τ^{2}}{C} M \sum i = 1 C \sum c = 1 p_{t_{i}}^{c} log (C \sum k = 1 [e^{\frac{w_{c}^{⊤} (s_{i} - s_{k})}{τ} + \frac{w_{c}^{⊤} (λ_{c} Σ_{i} + λ_{k} Σ_{k}) w_{c}}{2 τ}}]) .$

According to the Jason’s Inequality, the loss in Eq. 4 can be upper-bounded as

ℓ_{a u g} \leq \frac{τ^{2}}{C} M \sum i = 1 C \sum c = 1 p_{t_{i}}^{c} log (E (M \sum k = 1 e^{\frac{w_{c}^{⊤} ({^s}_{i} - {^s}_{k})}{τ}})) .

(5)

We follow existing work gal2015bayesian; kendall2017uncertainties and apply Gaussian distribution to approximate the distribution for deep features. Specifically, we assume that ${^s}_{i} \sim N (s_{i}, λ_{i} Σ_{i})$ , where $Σ_{i}$ are the statistics covariance of the semantic distribution of the $i$ -th example. Particularly, following wang2019implicit, the covariance is updated from the batch data during the training process , and more details are in Supplementary. With the assumption, ${^s}_{i} - {^s}_{k}$ also follows the Gaussian Distribution:

\frac{w_{c}^{⊤} ({^s}_{i} - {^s}_{k})}{τ} \sim N (\frac{w_{c}^{⊤} (s_{i} - s_{k})}{τ}, \frac{w_{c}^{⊤} (λ_{i} Σ_{i} + λ_{k} Σ_{k}) w_{c}}{τ}) .

(6)

For a variable $x$ that follows Gaussian distribution $N (μ, Σ)$ , the moment-generating function shows that $E [e^{a^{⊤} x}] = e^{a^{⊤} μ + \frac{1}{2} a^{⊤} Σ a}$ . Therefore, by taking the statistics for each example, the upper-bound in Eq. 5 can be simplified as

ℓ_{a u g}^{C W D} = \frac{τ^{2}}{C} M \sum i = 1 C \sum c = 1 p_{t_{i}}^{c} log (C \sum k = 1 [e^{\frac{w_{c}^{⊤} (s_{i} - s_{k})}{} τ + \frac{w_{c}^{⊤} (λ_{c} Σ_{i} + λ_{k} Σ_{k}) w_{c}}{2 τ}}])

(7)

Remark

By minimizing the empirical loss in Eq. 7, an infinite number of augmentations will be leveraged for learning the student, which can help capture the knowledge from the teacher. With the same process, the upper-bound for PD with infinite augmentations can be obtained as follows.

ℓ_{a u g}^{P D} = \frac{1}{M} M \sum i = 1 C \sum c = 1 \frac{e^{ω_{c}^{⊤} t_{i} + b_{c}}}{\sum_{k = 1}^{C} e^{ω_{k}^{⊤} t_{k} + b_{k}}} log (C \sum k = 1 [e^{(w_{k}^{⊤} - w_{c}^{⊤}) s_{i} + b_{k} - b_{c} + \frac{λ}{2} (w_{k}^{⊤} - w_{c}^{⊤})^{⊤} Σ_{c} (w_{k}^{⊤} - w_{c}^{⊤})}])

(8)

The proposed method is summarized in the Algorithm 1. Firstly, a powerful teacher network, a lightweight student network, and covariance matrices are defined. Teacher and student networks are composed of feature extraction and classification. Taking a forward as an example, for each mini-batch, the teacher’s classification logits, the student’s classification logits, and the student’s feature before logits are obtained. Then, the extracted features and the classification parameters are used to update the covariance. At the same time, $λ$ will be updated with the current number of iterations. Finally, the parameters of the student model will be optimized with the corresponding loss.

1 Teacher: Teacher network with extraction function

f_{t}

and classification function

g_{t}

;

2 Student: Student network with extraction function

f_{s}

and classification function

g_{s}

;

3 Input: Training image dataset

A

, super parameter

λ

and Covariance

Σ

initialized with zero;

4 for $s t e p = 1, . . ., n_{s t e p s}$ do

x, y

=Sample(

A

p_{t}

g_{t} (f_{t} (x))

s

f_{s} (x)

p_{s}

g_{s} (s)

9 update

Σ

and

λ

with current

s

y

and

s t e p

10 loss

=

ℓ_{s e g}

(

p_{s}

y

)

+

ℓ_{a u g}

(t(

W_{g_{s}}

Σ

s

p_{s}

λ

p_{t}

);

11 minimizing the loss and update student parameters

θ_{s}

;

13 end for

Algorithm 1 The FAKD framework

Experimental Setup

Dataset. We employ four popular semantic segmentation datasets to conduct our experiments. ADE20K zhou2017scene contains 20k/2k/3k images for train/val/test with 150 semantic classes. Cityscapes cordts2016cityscapes is an urban scene parsing dataset that contains 2975/500/1525 finely annotated images used for train/val/test. And the performance is evaluated on 19 classes. Pascal Context mottaghi_cvpr14 provides dense annotations which contains 4998/5105/9637 train/val/test images. We use 59 object categories for training and testing. Pascal VOC contains 21 classes including 20 object categories and one background class. Following the procedure of zhao2017pyramid; chen2018encoder , we use augmented data with annotation of resulting 10582, 1449, and 1456 images for train/val. Our results are all reported on the validation set.

Evaluation metrics. We report the mean Intersection over Union (mIoU) and pixel accuracy (mAcc), a standard metric for semantic segmentation.

Network architectures. On each dataset, the same series of teacher models and student models are used.

Implementation details. Following the standard data augmentation, we employ random flipping, cropping and scaling in the range of [0.5, 2]. All experiments are optimized by SGD with a momentum of 0.9, a batch size of 16, and 512 x 512 crop size. We use an initial learning rate of 0.01 for ADE20K, Cityscapes, and Pascal VOC. In addition, we use an initial learning rate of 0.004 for Pascal Context. The number of the total training iterations is 40K. Following previous methods chen2018encoder; zhao2017pyramid, we use the poly learning rate policy where current learning rate equals to the base one multiplying $(1 - \frac{i t e r}{m a x_{i t e r}})^{0.9}$ . We report the single-scale testing result. To be fair, each method uses the same parameters again for the same data set.

Compared distillation methods. On each dataset, we compare with state-of-the-art segmentation distillation methods including SKDS liu2019structured, IFVD hou2020inter and CWD shu2021channel, CIRKD yang2022cross. We re-run SKDS, IFVD, CWD and FAKD on 4 NIVIDIA V100 GPUs. In addition, due to the limitation of GPU memory, we re-run CIRKD on 4 NVIDIA A100. We take same super parameters as CWD in our method. And all of teacher models are from mmseg2020.

Compared with State-of-the-art Methods

ADE20K. We conduct the comparison experiments with other state-of-the-art algorithms on Table 1. We also apply our proposed framework to different teachers following liu2019structured; wang2020intra; shu2021channel; yang2022cross to make a fair comparison. All of the students’ backbones except that with $^{†}$ are initialized with the pre-trained weights on ImageNet classification. With the help of different teachers, we find that all of the methods improve different student networks. Specifically, our method achieves the best segmentation performance. From Table 1, we can see that our method can effectively improve the performance by $1.48 %$ , $1.82 %$ , $1.1 %$ , $1.4 %$ , $1.05 %$ , respectively. Compared with the student, our method can improve the performance by $5.88 %$ , $9.85 %$ , $3.77 %$ , $5.91 %$ , $8.06 %$ , respectively. It demonstrates that our method does not depend on a specific model structure. Meanwhile, our method can dramatically improve the results when the student model has not learned the knowledge, as PSPnet-R18 $^{†}$ . From Figure in Supplementary, we further show the qualitative segmentation results. Our results have more detailed segmentation results.

Methods	mIoU	mAcc(%)
T:PSPNet	44.39	54.74
S:PSPnet-R18	29.42	38.48
SKDS	31.80	42.25
IFVD	32.15	42.53
CIRKD	32.25	43.02
CWD	33.82	42.41
Ours	35.30	44.06
S:PSPnet-R18 $^{†}$	17.11	22.99
SKDS	20.79	27.74
IFVD	20.75	27.6
CIRKD	22.90	30.68
CWD	25.14	34.13
Ours	26.96	34.13
T:HRNet	42.02	53.52
S:HRNet18s	28.69	37.86
SKDS	30.49	40.19
IFVD	30.57	40.42
CIRKD	31.34	41.45
CWD	31.36	39.68
Ours	32.46	41.92
T:DeeplabV3Plus	45.47	56.41
S:Deeplab-MV2	22.38	31.71
SKDS	24.65	35.07
IFVD	24.53	35.13
CIRKD	25.21	36.17
CWD	26.89	35.79
Ours	28.29	38.31
T:ISANet	43.80	54.39
S: ISANet-R18	27.68	36.92
SKDS	28.70	38.51
IFVD	29.66	39.80
CIRKD	29.79	40.48
CWD	34.69	43.05
Ours	35.74	44.55

Table 1: Performance comparison with state-of-the-art distillation methods over various student and teacher segmentation networks on ADE20K.

Pascal Context. Table 2 summarizes our results on Pascal Context validation set. Our method achieves the best performance consistently. It surpasses the best completing CWD shu2021channel with a $0.45 %$ mIoU improvement, $0.11 %$ mIoU improvements, and $0.92 %$ mIoU improvement on PSPnet-R18, HRNet18s, and Deeplab-MV2. More segmentation results are shown in Figure in Supplementary.

Methods	mIoU	mAcc(%)
T:PSPNet	52.47	63.15
S:PSPnet-R18	43.07	53.79
SKDS	43.93	54.01
IFVD	44.75	54.99
CIRKD	44.83	55.3
CWD	45.92	55.55
Ours	46.37	56.39
T:HRNet	51.12	61.39
S:HRNet18s	40.82	51.70
SKDS	42.91	53.63
IFVD	43.12	54.03
CIRKD	43.45	54.1
CWD	45.50	56.01
Ours	45.61	56.13
T:DeeplabV3Plus	53.20	64.04
S:Deeplab-MV2	37.16	49.10
SKDS	39.18	51.13
IFVD	38.80	50.79
CIRKD	39.99	52.66
CWD	42.52	53.24
Ours	43.44	54.29

Table 2: Performance comparison with state-of-the-art distillation methods over various student and teacher segmentation networks on Pascal Context.

Methods	Cityscapes		Pacal VOC
Methods	mIoU	mAcc(%)	mIoU	mAcc(%)
T:PSPNet	79.74	86.56	78.52	86.11
S:PSPnet-R18	68.99	75.19	70.52	81.04
SKDS	69.33	75.37	70.35	80.22
IFVD	71.08	77.46	70.92	81.31
CIRKD	72.23	78.79	70.13	80.24
CWD	74.29	80.95	73.36	82.63
Ours	74.75	82.0	73.97	82.96
T:HRNet	80.65	87.39	76.24	84.95
S:HRNet18s	73.77	82.89	67.47	78.99
SKDS	74.75	83.23	67.58	79.10
IFVD	75.33	83.83	67.5	78.89
CIRKD	74.63	83.72	67.36	79.22
CWD	75.54	84.08	68.39	79.78
Ours	75.93	84.75	69.04	80.37
T:DeeplabV3Plus	80.98	88.7	78.62	86.55
S:Deeplab-MV2	70.49	80.11	62.56	80.09
SKDS	70.81	79.31	62.85	80.25
IFVD	71.82	80.88	67.50	78.89
CIRKD	72.39	81.84	63.57	79.63
CWD	73.35	82.41	67.61	82.03
Ours	74.08	83.83	67.62	82.13
T:ISANet	80.61	88.29	78.46	87.33
S: ISANet-R18	71.45	78.65	68.71	79.81
SKDS	70.65	77.53	67.86	80.47
IFVD	70.30	77.79	69.11	80.93
CIRKD	72.00	79.32	69.0	80.83
CWD	71.61	80.02	72.83	83.99
Ours	72.44	81.04	73.17	84.25

Table 3: Performance comparison with state-of-the-art distillation methods over various student and teacher segmentation networks on Cityscapes dataset and Pascal VOC dataset.

Cityscapes. Table 3 lists the performance results of other state-of-the-art methods and ours on the Cityscapes dataset. We can observe that all structured knowledge distillation methods improve the student networks under the teacher’s supervision. Our method achieves the best segmentation performance across various student networks. Compared with other state-of-the-art methods, we can see that our method can effectively improve the performance by $0.46 %$ , $0.39 %$ , $0.73 %$ , $0.83 %$ , respectively. Meanwhile, compared with the student, our method can improve the performance by $5.76 %$ , $2.16 %$ , $3.59 %$ , $0.99 %$ , respectively. It is also shown that our method works from natural scenes to street scenes. As shown in Figure in Supplementary, our approach allows for better handling of the truck and bus categories, enabling them to be completed. In addition, the segmentation results of our method are more detailed.

Pascal VOC. In Table 3, we show the segmentation performance of various distillation methods on Pascal VOC. Compared with the student without distillation, we can see that our algorithm can effectively improve the performance by $3.45 %$ , $1.57 %$ , $5.06 %$ , $4.46 %$ , respectively. Meanwhile, our methods achieve the best performance on different students. We further show the qualitative segmentation results visually in Figure in Supplementary.

Ablation Study

In this subsection, we conduct experiments to explore the effectiveness of our methods under different knowledge distillation settings. All ablation experiments are carried out on the ADE20k dataset. We choose PSPnet-R101 as the teacher and PSPnet-R18 as the student.

Ablation study of different equations. As shown in Table 4, compared with $ℓ_{P D}$ , $ℓ_{a u g}^{P D}$ introduces infinite samples with $1.09 %$ improvement. Meanwhile, compared with $ℓ_{C W D}$ , $ℓ_{a u g}^{C W D}$ also introduces infinite samples with $1.48 %$ improvement. The introduction of infinite samples allows for a more accurate simulation of the decision boundary of the teacher model.

Methods	mIoU	mAcc(%)
$ℓ_{P D}$	31.75	42.23
$ℓ_{a u g}^{P D}$	32.84	42.33
$ℓ_{C W D}$	33.82	42.41
$ℓ_{a u g}^{C W D}$	35.30	44.06

Table 4: Ablation study of different equations on ADE20K. Introducing infinite samples in

ℓ_{P D}

and

ℓ_{C W D}

both bring improvements.

Ratio	mIoU	mAcc(%)
0.5	35.06	43.91
1.0	35.30	44.06
1.5	34.63	42.97
2.5	31.95	39.04

Table 5: Experiment for different

λ

λ

is introduced in Equation 6. And

1.0

is the best choice for

λ

Figure 2: Analysis of samples with different difficulties. All categories are divided into difficult and easy samples according to the performance. The abscissa represents the difficulty of the category, which is becoming simpler and simpler from left to right. The ordinate represents a performance improvement.

Ablation study of $λ$ . As shown in $ℓ_{a u g}^{C W D}$ , there has $λ$ to adjust the ratio. However, since covariance is tunneled training process by data set statistics, it is less stable in the early stage of training. For stabilize training, $λ$ uses the cosine update method. As shown in Table 5 , we investigate the impact of $λ$ in our FAKD, and $λ = 1.0$ is the best choice.

Figure 3: Visual of samples with or without co-variance. By augmenting the feature with co-variance, the activation of the meaningful region is strengthen.

Illustrations of the effectiveness of our methods in terms of class IoU scores using the network PSPnet-ResNet18. Both CWD and our method are helpful for improving the performance especially for the hard classes with low IoU scores. The improvement from our method is more significant for hard objects. — (a) ADE20K

Ablation study of hard example mining. As shown in Figure 2, the item with $Σ$ in $ℓ_{a u g}^{C W D}$ and $ℓ_{a u g}^{P D}$ is reflected as a dynamically adjusted complex sample. So we divided the different difficult samples and analyzed the performance improvement for samples of different difficulties. From the results, it can be seen that our method can effectively improve the problematic samples. Also, we visualized the feature maps as Illustrated in Figure 3. It can also be seen from the feature map that adding our method enables the feature map to highlight some problematic regions more. Each class IoU score is shown in Figure 4.

Conclusion

This paper presents a novel feature augmented knowledge distillation (FAKD) method for semantic segmentation to increase the training samples for the student network in the feature space. Our method augments feature maps and constrains all augmented features with the teacher network. Experiments on four public segmentation datasets demonstrate the effectiveness of our FAKD. Our method also demonstrates superiority on long-tail problems. The performance for classes with less training samples has been improved with a larger margin. We hope our work inspires future research exploring semantic feature augmentation for knowledge distillation.