Data-free Dense Depth Distillation

Junjie Hu, Chenyou Fan, Mete Ozay, Hualie Jiang, and Tin Lun Lam

J. Hu is with the Shenzhen Institute of Artificial Intelligence and Robotics for Society, the Chinese University of Hong Kong, Shenzhen. E-mail: hujunjie@cuhk.edu.cn. C. Fan is with the School of Artificial Intelligence, South China Normal University, China. E-mail: fanchenyou@scnu.edu.cn. M. Ozay is with Samsung Research, UK. E-mail: meteozay@gmail.com. H. Jiang and T. L. Lam are with the School of Science and Engineering, the Chinese University of Hong Kong, Shenzhen. E-mail: hualiejiang@link.cuhk.edu.cn, tllam@cuhk.edu.cn.

† Corresponding author: Tin Lun Lam

Abstract

We study data-free knowledge distillation (KD) for monocular depth estimation (MDE), which learns a lightweight network for real-world depth perception by compressing a trained expert model under the teacher-student framework while lacking training data in the target domain. Owing to the essential difference between dense regression and image recognition, previous methods of data-free KD are not applicable to MDE. To strengthen the applicability in the real world, in this paper, we seek to apply KD with out-of-distribution simulated images. The major challenges are i) the lack of prior information about the object distribution of the original training data; and ii) the domain shift between the real world and the simulation. To cope with the first difficulty, we apply object-wise image mixing to generate new training samples for maximally covering distributed patterns of objects in the target domain. To tackle the second difficulty, we propose to utilize a transformation network that efficiently learns to fit the simulated data to the feature distribution of the teacher model. We evaluate the proposed approach for various depth estimation models and two different datasets. As a result, our method outperforms the baseline KD by a good margin and even achieves slightly better performance than the baseline while using as few as $1 / 6$ of the images, demonstrating a clear superiority.

Monocular depth estimation, knowledge distillation, data-free KD, dense distillation

I Introduction

As a cost-effective alternative solution to depth sensors, monocular depth estimation (MDE) predicts scene depth from only RGB images and has wide applications in various tasks, such as scene understanding [max-S-and-D], autonomous driving [song2021self], 3D reconstruction [XiyueGuo2021], and augmented reality [Du2020DepthLab]. In recent years, accuracy of MDE methods has been significantly boosted and dominated by deep learning based approaches [fu2018deep, Hu2019RevisitingSI, jiang2021plnet], where the advances are attributed to modeling and estimating depth by complex nonlinear functions using large-scale deep convolutional neural networks (CNNs).

On the other hand, many practical applications, e.g., robot navigation, demand a lightweight model due to the hardware limitations and requirement for computationally efficient inference. In these cases, we can either perform model compression on a well-trained large network [Wofk2019FastDepthFM] or apply supervised learning to directly train a compact network [Mancini2016FastRM]. These solutions assume that the original training data of the target domain is known and can be freely accessed. However, since data privacy and security are invariably a severe concern in the real world, the training data is routinely unknown in practice, especially for industrial applications. A potential solution under this practical constraint is to distill preserved knowledge from a well-trained and publicly available model. The task is called data-free knowledge distillation (KD) [Lopes2017DataFreeKD] and has been shown effective for image recognition.

Most existing methods of data-free KD proposed to synthesize training images from random noise [yin2020dreaming, Fang2019DataFreeAD]. Specifically, assuming that $y$ is a target object attribute, it is an element that inherently exists in the last layer of a classifier and is easily pre-specified, such that we can enforce a classifier to produce the desired output by gradually optimizing its input data. We refer to this property as the inherent constraint of classification. Unfortunately, due to the essential difference between outputs of models obtained in depth estimation and object classification tasks, the inherent constraint does not hold for MDE, making most existing data-free approaches incompatible.

Fig. 1: A visualization of the proposed framework for model compression in monocular depth estimation tasks. We propose to use simulated images as an alternative solution to the challenges of applying knowledge distillation for monocular depth estimation when original training data is not available.

Given the above challenges, in this paper, we propose to leverage out-of-distribution (OOD) images as an alternative for applying KD. For the MDE task, intuitively, we consider three critical elements for choosing the alternative set: i) scene similarity, ii) the number of images, and iii) domain gap. An effective yet impractical solution is to collect a dataset similar to the original training data. In reality, data collection is always costly and time-consuming. Besides, due to the lack of prior information about scene structures of the original training data, we do not have sufficient clues to guide this data collection process. For these reasons, we prefer using synthetic images collected from simulators. In this way, we can easily obtain enough data to satisfy requirement ii), and may accordingly sample useful scenarios to ensure i) to a certain extent. However, this choice triggers iii), i.e., a significant domain gap between simulated and real-world data.

Fig. 2: A flowchart of the proposed approach for distilling a model trained in the real world with simulated images in a forward computation. We first mix two images $x_{i}^{'}$ and $x_{j}^{'}$ sampled from the simulated dataset $X^{'}$ to generate a new sample $\hat{x}^{'}$ , and use a transformation network to fit $\hat{x}^{'}$ to the feature distribution of the trained teacher. Then, the distillation is applied from the teacher $N_{t}$ to the target student $N_{s}$ with the new input $G (\hat{x}^{'})$ .

We analyze the effect of these three factors on the accuracy of KD through empirical experiments. Unsurprisingly, high scene similarity, sufficient data, and a small domain gap contribute to better accuracy. Another valuable observation is that the teacher still estimates meaningful depth maps that correctly represent relative depths among objects even from simulated images. It reveals that CNNs may utilize some geometric cues [hu2019visualization, hu2019analysis, Dijk2019HowDN], or can learn some domain-invariant features [Chen2021S2RDepthNetLA] for inferring depths, rather than the straightforward fitting. Therefore, it is still possible to perform KD even though the predicted depth maps are completely wrong in scales. This phenomenon encourages us to develop a data-free dense depth distillation framework with OOD simulated images.

The problem formulated in this paper is visualized in Fig. 1 where we aim to learn a lightweight model on the target domain by distilling from an expert teacher pre-trained with the private data, utilizing a set of simulated images. Following previous methods of data-free KD, we only have prior knowledge of model outputs. For image recognition, this prior information is the detailed target category. For MDE, we are only aware of the model’s deployment environments, e.g., indoor or outdoor scenarios. In general, the difficulties are two-fold. The first is the unknown distributed patterns of objects in the target domain. The second is the unavoidable domain discrepancy between the transfer and original training sets.

Our distillation framework is composed of two sub-branches. The first branch applies the plain KD using original simulated images to ensure a lower bound performance. The second branch generates additional training samples to tackle the above challenges with two technical proposals. Specifically, to handle the first challenge, we generate additional training images to cover the distributed patterns of objects in the target domain by applying random object-wise mixing between two simulated images. The object-wise mixing is achieved by utilizing semantic maps provided by most modern simulators. To tackle the second challenge, we propose to regularize the OOD simulated images to fit the target domain. However, as the original data is unavailable, such a transformation is intractable. Inspired by DeepInversion [yin2020dreaming], we formulate it as an image-to-feature adaptation problem by leveraging the running statistics in batch normalization layers. To solve the issue of slow optimization, we propose a transformation network that turns the batch-wise optimization into a learning problem. Fig. 2 shows the diagram of these technical components, where we learn the transformation network and the target student network at the same time.

To the best of our knowledge, we are the first to distill knowledge for MDE in data-free scenarios. We extensively evaluate the proposed method for different depth estimation models and multiple datasets, including NYU-v2 and ScanNet. On all datasets, our approach demonstrates the best performance. It outperforms the baseline KD by a good margin and shows slightly better performance than the baseline while using as few as $1 / 6$ of the images.

In summary, our contributions include:

We are the first to study data-free KD for monocular depth estimation. We tackle this unexplored problem with the proposal of using OOD simulated images in a novel data-free dense depth distillation framework.
We perform preliminary studies to understand the essential requirements for selecting the OOD data by analyzing how a depth estimator reacts to different types of OOD datasets.
We apply object-wise image mixing to generate new training images to cover the objects’ distributed patterns in the target domain.
We propose to learn a transformation network to efficiently regularize the simulated images to fit the feature distribution of the teacher model.
Our method obtains consistent performance improvements for various MDE models and different datasets.

The rest of the paper is organized as follows. In Section II, we introduce the related background and techniques, including monocular depth estimation, knowledge distillation, and image mixing. In Section III, we first give a formal analysis of the difficulties of applying data-free KD to MDE and then present our method in detail. Section IV presents detailed experimental settings and results to verify the effectiveness of our method. Section V concludes the paper.

II Related Work

II-A Monocular Depth Estimation

Monocular depth estimation (MDE) aims to predict scene depths from only a single image. Deep learning-based approaches have dominated recent progress [laina2016deeper, ma2017sparse, fu2018deep, huynh2020guiding, jiang2021plnet] in which the advanced performances are attributed to modeling and estimating depth using large and complex CNNs with data-driven learning.

On the other hand, deploying MDE algorithms into real-world applications often faces practical challenges, such as limited hardware resources and inefficient computation. Therefore, an emerging requirement of MDE is to develop lightweight models to meet the above demands. This problem has been specifically considered in previous studies [Mancini2016FastRM, Nekrasov2019RealTimeJS, Wofk2019FastDepthFM, hu2021boosting] where several different lightweight networks have been designed.

However, lightweight networks inevitably degrade MDE performance due to the trade-off between model complexity and accuracy. Hence, it remains an open question how to reduce the model complexity while maintaining high accuracy. One potential solution to this problem is knowledge distillation (KD), which transfers the knowledge from a cumbersome teacher network to a compact student network with a decent accuracy improvement. However, KD requires the original training dataset for implementation. Currently, there are no existing solutions in data-free scenarios for MDE.

II-B Knowledge Distillation

Knowledge distillation [Hinton2015DistillingTK] was initially introduced in image recognition where either the soft label or the one-hot label predicted by the teacher is used to supervise the student learning. Existing methods can be generally categorized into two classes considering whether they can access the original training set: 1) the standard data-aware KD, and 2) data-free KD.

For data-aware KD, its effectiveness has also been demonstrated on various vision tasks, such as image recognition [Hinton2015DistillingTK], semantic segmentation [Liu2020StructuredKD], object detection [chen2017learning], and depth estimation [Pilzer2019RefineAD, Wang2021KnowledgeDF], etc. In addition to the conventional setup, researchers have proposed to improve KD via distilling intermediate features [Huang2017LikeWY, Liu2020StructuredKD], distilling from multiple teachers [Tarvainen2017MeanTA, Liu2019KnowledgeFI], employing an additional assistant network [Mirzadeh2020ImprovedKD], and adversarial distillation [Chung2020FeaturemaplevelOA, Shen2019MEALME].

For data-free KD, researchers have resorted to synthesizing the training set from random noise [Lopes2017DataFreeKD, yin2020dreaming, Fang2019DataFreeAD, Fang2021UpT1, Yoo2019KnowledgeEW] or employing other large-scale data from different domains [Chen2021LearningSN, Xu2019PositiveUnlabeledCO, Fang2021MosaickingTD, nayak2021effectiveness]. However, existing methods are most effective for classification tasks due to the natural property of deep classification models and cannot be applied to MDE. We will elaborate on these observations in Sec. III-A. In this paper, we propose the first method of data-free distillation for MDE. Our method leverages data from simulated environments to distill a model trained on a real-world dataset.

We especially clarify the differences between data-free KD and domain adaptation (DA), e.g., sim-to-real adaptation, since they may cause some misunderstandings. Data-free KD essentially differs from DA in two aspects. First, DA transfers a model from the original domain to a different domain, while data-free KD aims at preserving the model accuracy in the original domain. Second, both RGB images in the original and the new domains are usually considered prior information and can be freely accessed in DA. In contrast, training images in the original domain are unknown for data-free KD.

II-C Image Mixing for Data Augmentation

Image Mixing is a common technique used for applying data augmentation in semi-supervised learning. One can generate new training pairs in data-scarce scenarios by linearly blending two images and their respective labels (or pseudo-labels). Then, those mixed images and labels are utilized for model training. Classical methods of image mixing include MixUp [zhang2018mixup], MixMatch [Berthelot2019MixMatchAH] for pixel-wise blending, and CutMix [Yun2019CutMixRS], ClassMix [Olsson2021ClassMixSD] for mask-based mixing, i.e., exchanging parts of patches or objects between two images. Among them, the former three methods are applicable to classification, and the ClassMix is tailored to image segmentation.

In our method, we apply object-wise mixing to cover the distributed objects in the original training dataset. Unlike the above methods, we only mix RGB images because mixing depth maps will destroy the geometric relations among objects and yield wrong target labels. In a nutshell, instead of generating training pairs of the mixed images and the mixed labels, we use pairs of the mixed images and predictions from them as additional training samples.

III Method

III-A Preliminary

III-A1 Knowledge Distillation

Suppose that $N_{t}$ is a model trained using data from the target domain $D = \{X, Y\}$ where $X$ and $Y$ denote the input data (i.e., images) and the label space, respectively. For any $x \in X$ , its corresponding label is estimated by $y = N_{t} (x)$ .

KD aims at learning a smaller network $N_{s}$ with the supervision from $N_{t}$ . Usually, $N_{t}$ is called the teacher network and $N_{s}$ is called the student network, respectively. Then, the learning is formulated as:

$$\min_{N_{s}} \sum_{x \in X, y \in Y} λ H (N_{t} (x), N_{s} (x)) + (1 - λ) H (y, N_{s} (x)) \tag{1}$$

where $H$ is a loss function and $λ > 0$ is a weighting coefficient that is usually set to a relatively large value, e.g., 0.9, to give more weight to the teacher predictions than to the ground truths. In practice, the second term of Eq. (1) is sometimes discarded. In these cases, i.e., for $λ = 1$ , Eq. (1) simplifies to

$$\min_{N_{s}} \sum_{x \in X} H (N_{t} (x), N_{s} (x)) . \tag{2}$$
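To make the role of $λ$ concrete, the following minimal sketch evaluates the per-sample objective of Eq. (1) on toy vectors. The names `l1_loss` and `kd_objective`, and the use of a mean absolute error for $H$, are illustrative assumptions and are not taken from the paper.

```python
def l1_loss(a, b):
    """Mean absolute difference between two equally sized sequences (stand-in for H)."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def kd_objective(teacher_out, student_out, label=None, lam=0.9):
    """Per-sample objective of Eq. (1); it reduces to Eq. (2) when lam == 1."""
    loss = lam * l1_loss(teacher_out, student_out)
    if lam < 1.0:
        if label is None:
            raise ValueError("a ground-truth label is required when lam < 1")
        loss += (1.0 - lam) * l1_loss(label, student_out)
    return loss


# Toy example: two "pixels" of predicted depth.
teacher = [1.0, 2.0]
student = [1.5, 2.5]
ground_truth = [3.5, 2.5]
with_labels = kd_objective(teacher, student, ground_truth, lam=0.9)
teacher_only = kd_objective(teacher, student, lam=1.0)
```

With `lam=1.0` no label is needed, which is the setting that data-free distillation has to rely on.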

Fig. 3: Histograms of depths predicted by the teacher for different datasets. Panel (a): NYU-v2.

III-A2 Data-free Knowledge Distillation

As shown above, the standard KD requires knowing the original training data sampled from $X$ . Contrarily, data-free KD attempts to learn the student without being aware of $X$ . It is formulated by

$$\min_{N_{s}} \sum_{x^{'} \in X^{'}} H (N_{t} (x^{'}), N_{s} (x^{'})) \tag{3}$$

where $X^{'}$ is a proxy to $X$ and can be either i) a set of images synthesized from $N_{t}$ , or ii) other alternative OOD datasets. Then, Eq. (3) can be solved by searching for the optimal $X^{'}$ .

For image recognition, the success is attributed to a natural property that provides an inherent constraint for identifying $X^{'}$ . As $y$ denotes an object category, it corresponds to an index of the SoftMax outputs of the last fully connected layer and thus provides prior information about the desired model output. Then, $X^{'}$ is constructed by

$$\arg\min_{x^{'}} \sum_{x^{'} \in X^{'}} H (N_{t} (x^{'}), y) + R (x^{'}) \tag{4}$$

where $R$ denotes regularization terms.

Proposition 1.

The first term of Eq. (4) is an inherently strong constraint of image recognition that enforces the output consistency such that $x^{'} = x$ .

Proof.

Suppose that $y = N_{t} (x)$ . If $y = N_{t} (x^{'})$ , then we have $N_{t} (x^{'}) = N_{t} (x)$ , equivalently, $x^{'} = x$ . ∎

We can specify any category corresponding to an actual label of $Y$ and generate sufficient images from random noise. Besides, in some works, this inherent constraint is used to transform the OOD data to the target distribution [Fang2021MosaickingTD] or to identify, from a large-scale dataset, the low-entropy data most relevant to the distribution of the target domain for efficient KD [Chen2021LearningSN]. Finally, we have $P_{X^{'}} = P_{X}$ to ensure KD, where $P_{X^{'}}$ and $P_{X}$ denote the distributions of $X^{'}$ and $X$ , respectively.

Unfortunately, such an inherent constraint does not hold for depth estimation. In the case of MDE, the output is a high-resolution two-dimensional map with correlated objects, not a score for a category. Therefore, we can hardly pre-specify a target depth map and learn to generate its corresponding input image as previous approaches in data-free scenarios.

III-B Depth Distillation with OOD data

Given the above difficulties, data-free KD for MDE seems intractable, since no correct depth maps are available. A plausible way is to use some OOD data if we can decipher the essential requirements for $X^{'}$ . Here, we consider that three factors are essential for selecting $X^{'}$ : i) scene structure similarity to $X$ , ii) data-scale for performing KD, and iii) domain gap between $X^{'}$ and $X$ .

We conducted preliminary experiments to analyze how a network reacts to different types of OOD data. Specifically, we use a model trained on the NYU-v2 [NYUv2] dataset as the teacher and apply KD with several different OOD datasets with the same number of randomly sampled images.

	Dataset ( $X^{'}$ )	Scene type	Domain	Images	$δ_{1}$
(a)	NYU-v2 [NYUv2]	indoor scene	real world	50K	0.808
(b)	ScanNet [dai2017scannet]	indoor scene	real world	50K	0.787
(c)	ImageNet [Deng2009ImageNetAL]	single object	real world	50K	0.685
(d)	Random noises	-	-	50K	0.194
(e)	KITTI [Uhrig2017THREEDV]	outdoor scene	real world	50K	0.705
(f)	SceneNet [mccormac2016scenenet]	indoor scene	simulation	50K	0.712
(g)	SceneNet [mccormac2016scenenet]	indoor scene	simulation	300K	0.742

TABLE I: Results of the student model employed on the NYU-v2 test set. The student model is trained via knowledge distillation with different OOD data. Except for (g), all datasets have approximately 50K images.

Fig. 3 shows the histogram of depths predicted by the teacher for different datasets, where Fig. 3 (a) denotes the histogram of the target domain. First, it is observed that the teacher yields similar depth histograms for those datasets as all exhibit a long-tail distribution. Second, although the teacher tended to produce smaller depths for OOD data, the outputted depths are still constrained in the target distribution. Based on these observations, we can say that the teacher is able to inherently produce depths in the target depth distribution even for data sampled from different domains. However, this constraint is not sufficient to ensure KD, as shown in Table I where random Gaussian noises led to the lowest performance even though they yield similar depth histograms.

Then, we analyze the effect of the above three factors by comparing target scenarios, data scale, and data domain in Table I. Except for $(g)$ , all datasets have 50,000 (50K) images and $(b) = (f) > (e) > (c) > (d)$ , in terms of scene similarity to the original data, i.e., $(a)$ . The $δ_{1}$ accuracy is consistent with the scene similarity in input space. In addition, $(g)$ is an augmented version of $(f)$ obtained by increasing the data scale. Not surprisingly, high scene-structure similarity, a small domain gap, and a larger data scale all improve performance.

However, it is challenging to satisfy all three conditions simultaneously. Considering the difficulties of data collection in real-world applications, we prefer to apply data-free KD for MDE with simulated images.

Despite a significant domain gap, we observe only an $11.9\%$ relative drop in $δ_{1}$ accuracy from $(a)$ to $(f)$ (0.808 to 0.712). A reasonable interpretation is that some monocular cues, such as lines and object boundaries, are essential to activate the related neurons and form a meaningful internal representation inside the teacher model. This can be validated in Fig. 4, where the depth maps inferred from the simulated images are perceptually correct despite being wrong in absolute scale.
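The quoted drop can be checked directly against Table I. The short sketch below assumes that the figure is a relative decrease of the $δ_{1}$ accuracy of row (f) with respect to row (a); the helper name `relative_drop` is hypothetical.

```python
def relative_drop(reference, value):
    """Relative decrease of `value` with respect to `reference`, in percent."""
    return 100.0 * (reference - value) / reference


delta1_nyu = 0.808       # Table I, row (a): KD with NYU-v2 images
delta1_scenenet = 0.712  # Table I, row (f): KD with 50K SceneNet images
drop = relative_drop(delta1_nyu, delta1_scenenet)
print(f"{drop:.1f}%")  # prints 11.9%
```

The absolute difference is 0.096 in $δ_{1}$ , so the percentage should be read as relative to the NYU-v2 result.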

Fig. 4: Visualization of the simulated images and the depth maps estimated by the teacher.

III-C Learning to Regularize Feature Distribution

We have shown that the trained model would map the out-of-distribution data to the depth histogram distribution in the target space. Admittedly, Eq. (3) is equivalent to Eq. (2) if $P_{X^{'}} = P_{X}$ . However, due to the domain gap, there is a significant discrepancy between $X$ and $X^{'}$ . Thus, we wish to mitigate this domain gap and accordingly improve KD. Since the original training data is unavailable, we leverage the running average statistics captured in batch normalization (BN) layers, as in DeepInversion [yin2020dreaming], to regularize $x^{'}$ . Specifically, assuming that feature statistics follow a Gaussian distribution defined by mean $μ$ and variance $σ^{2}$ , $x^{'}$ is optimized through the following loss

$$ℓ_{BN} = \sum_{l \in [L]} \| u_{l} (x^{'}) - \bar{u}_{l} \|_{2} + \sum_{l \in [L]} \| σ_{l}^{2} (x^{'}) - \bar{σ}_{l}^{2} \|_{2} \tag{5}$$

where $u_{l} (x^{'})$ and $σ_{l}^{2} (x^{'})$ are the batch-wise mean and variance of feature maps of the $l$ -th convolutional layer of $N_{t}$ , respectively. $\bar{u}_{l}$ and $\bar{σ}_{l}^{2}$ are the running mean and variance of the $l$ -th BN layer of $N_{t}$ , respectively. Eq. (5) allows regularizing $x^{'}$ to approach the feature distribution of the teacher model. However, this optimization requires thousands of iterations¹¹13000 iterations in DeepInversion. for a single batch and is highly time-consuming. To tackle this problem, we use an additional network $G$ for data transformation. Then, Eq. (5) can be rewritten as

$$ℓ_{BN} = \sum_{l \in [L]} \| u_{l} (G (x^{'})) - \bar{u}_{l} \|_{2} + \sum_{l \in [L]} \| σ_{l}^{2} (G (x^{'})) - \bar{σ}_{l}^{2} \|_{2} \tag{6}$$

It is essential to ensure the fidelity of the original scenes to avoid arbitrarily meaningless transformation. Thus, we adopt an image reconstruction loss. Finally, the transformation of $x^{'}$ is formulated by

$$\min_{G} \sum_{x^{'} \in X^{'}} (α ℓ_{BN} + β ℓ_{rec}) \tag{7}$$

where $ℓ_{rec} = \| x^{'} - G (x^{'}) \|_{1}$ is the reconstruction error that penalizes the $ℓ_{1}$ norm of the image difference, and $α$ and $β$ are weighting coefficients.
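The statistics-matching term of Eq. (5) and Eq. (6) can be sketched on plain Python lists as follows. This is only an illustration under stated assumptions: the data layout `features[n][c]`, the helper names, and the use of the biased (population) variance for the batch statistics are choices made here, not details given in the paper.

```python
import math


def channel_stats(features):
    """Per-channel mean and biased variance over the batch and spatial positions.

    `features[n][c]` is the flattened list of activations of sample n, channel c.
    """
    num_channels = len(features[0])
    means, variances = [], []
    for c in range(num_channels):
        values = [v for sample in features for v in sample[c]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        variances.append(var)
    return means, variances


def l2_distance(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def bn_loss(layer_features, running_means, running_vars):
    """Sum over layers of the mean term and the variance term of Eq. (5)."""
    total = 0.0
    for feats, r_mean, r_var in zip(layer_features, running_means, running_vars):
        mean, var = channel_stats(feats)
        total += l2_distance(mean, r_mean) + l2_distance(var, r_var)
    return total


# One layer, two samples, one channel with two spatial positions each.
features = [[[0.0, 2.0]], [[2.0, 4.0]]]
matched = bn_loss([features], [[2.0]], [[2.0]])     # statistics agree
mismatched = bn_loss([features], [[5.0]], [[2.0]])  # mean is off by 3
```

In the learned variant of Eq. (6), `layer_features` would be computed from $G (x^{'})$ instead of $x^{'}$ , so that the gradient of this loss updates $G$ rather than the image itself.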

III-D Distillation from Mixed Images

Since we have no clues about the training data, including objects, textures, scene structures, etc., we naturally consider applying data augmentation to maximally cover the distributed patterns of objects in the target domain. We randomly exchange half of the objects between two simulated images to obtain a new image with the help of semantic maps collected from the simulator. More formally, for two images $x_{i}^{'}$ and $x_{j}^{'}$ where $x_{i}^{'} \in X^{'}$ , $x_{j}^{'} \in X^{'}$ , we generate a new mixed image $\hat{x}^{'}$ by

$$\hat{x}^{'} = m ⊙ x_{i}^{'} + (1 - m) ⊙ x_{j}^{'} \tag{8}$$

where $m$ is a binary mask obtained from the semantic map of $x_{i}^{'}$ , and randomly selects half of the classes from $x_{i}^{'}$ .
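A minimal sketch of the mixing in Eq. (8) is given below, using nested lists in place of image tensors. The function name `object_mix`, the rounding rule for "half of the classes" (rounded down, but at least one class), and the list-based data layout are assumptions made for illustration.

```python
import random


def object_mix(image_i, image_j, semantic_i, rng=None):
    """Mix two images as in Eq. (8) using a class-based binary mask.

    Images and the semantic map are 2-D lists of equal size; a pixel may be any
    value (e.g. an RGB tuple). Pixels whose class is among the selected half of
    the classes in `semantic_i` come from image_i, all others from image_j.
    """
    if rng is None:
        rng = random.Random()
    classes = sorted({label for row in semantic_i for label in row})
    kept = set(rng.sample(classes, max(1, len(classes) // 2)))
    mask = [[1 if label in kept else 0 for label in row] for row in semantic_i]
    mixed = [
        [pi if m else pj for pi, pj, m in zip(row_i, row_j, row_m)]
        for row_i, row_j, row_m in zip(image_i, image_j, mask)
    ]
    return mixed, mask


# Toy 2x2 example with three semantic classes in the first image.
img_i = [["i", "i"], ["i", "i"]]
img_j = [["j", "j"], ["j", "j"]]
sem_i = [[0, 0], [1, 2]]
mixed, mask = object_mix(img_i, img_j, sem_i, random.Random(0))
```

Only the RGB images are mixed here; the depth labels for the mixed image are taken from the teacher's prediction rather than from mixed depth maps.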

This object-wise mixing operation leads to significant artifacts around object boundaries. To remove this noise, $G$ is applied to the augmented images instead, and Eq. (7) is rewritten as

$$\min_{G} \sum_{\hat{x}^{'} \in \hat{X}^{'}} (α ℓ_{BN} + β ℓ_{rec}) \tag{9}$$

where $\hat{X}^{'}$ denotes the augmented set.

Input: $X^{'}$ : OOD images collected from simulator; $N_{t}$ : the teacher model trained on target domain; $α$ , $β$ : weighting coefficients used for defining loss in training $G$ ; Adam optimizer, initial learning rate: 0.0001, weight decay: 1e-4, training iterations: $iterations$ .
Output: $N_{s}$ : the student model; $G$ : the transformation model;
Freeze $N_{t}$ ;
Initialize $N_{s}$ and $G$ ;
for $j = 1$ to $iterations$ do
    Set gradients of $N_{s}$ and $G$ to 0;
    Select a batch $x^{'}$ from $X^{'}$ ;
    Let $x_{i}^{'} = x^{'}$ and $x_{j}^{'} = random\_shuffle (x^{'})$ ;
    Generate mixed images $\hat{x}^{'}$ by Eq. (8);
    ▹ Updating the student network
    Calculate $N_{t} (x^{'})$ , $N_{s} (x^{'})$ , $N_{t} (G (\hat{x}^{'}))$ , $N_{s} (G (\hat{x}^{'}))$ ;
    Calculate the depth loss by Eq. (10);
    Update $N_{s}$ ;
    ▹ Updating the transformation network
    Calculate the loss $ℓ_{BN}$ by Eq. (9);
    Update $G$ ;
end for

Algorithm 1 Data-free Depth Distillation

Teacher (Backbone) $\to$ Student (Backbone)		ResNet-34 [hu2021boosting] $\to$ ResNet-34		ResNet-34 [hu2021boosting] $\to$ MobileNet-v2		ResNet-50 [laina2016deeper] $\to$ ResNet-18		ResNet-50 [Hu2019RevisitingSI] $\to$ ResNet-18		SeNet-154 [Chen2019structure-aware] $\to$ ResNet-34	
Parameter Reduction		None		21.9 M $\to$ 1.7 M		63.6 M $\to$ 13.7 M		67.6 M $\to$ 14.9 M		258.4 M $\to$ 38.7 M	
Method	Data	REL ↓	$δ_{1}$ ↑	REL ↓	$δ_{1}$ ↑	REL ↓	$δ_{1}$ ↑	REL ↓	$δ_{1}$ ↑	REL ↓	$δ_{1}$ ↑
Teacher	NYU-v2	0.133	0.829	0.133	0.829	0.134	0.824	0.126	0.843	0.111	0.878
Student	NYU-v2	0.133	0.829	0.145	0.802	0.145	0.805	0.137	0.826	0.125	0.843
Random noises	None	0.426	0.193	0.431	0.194	0.517	0.102	0.511	0.112	0.514	0.107
DFAD [Fang2019DataFreeAD]	None	0.285	0.402	0.306	0.329	0.300	0.382	0.341	0.338	0.347	0.278
KD-OOD [Hinton2015DistillingTK]	SceneNet $X_{1}^{'}$	0.164	0.753	0.175	0.712	0.188	0.660	0.175	0.710	0.174	0.695
Ours	SceneNet $X_{1}^{'}$	0.155	0.774	0.168	0.742	0.173	0.701	0.167	0.722	0.156	0.759
KD-OOD [Hinton2015DistillingTK]	SceneNet $X_{2}^{'}$	0.158	0.761	0.165	0.742	0.180	0.676	0.172	0.713	0.161	0.738
Ours	SceneNet $X_{2}^{'}$	0.151	0.789	0.157	0.778	0.165	0.726	0.157	0.760	0.151	0.776

TABLE II: Quantitative results on the NYU-v2 dataset.

	Dataset	Training (scenarios / images)	Test (scenarios / images)
Target domain	NYU-v2	249 / 50688	215 / 654
Target domain	ScanNet	1513 / 50473	100 / 17607
Simulated data	SceneNet $X_{1}^{'}$	1000 / 50K	-
Simulated data	SceneNet $X_{2}^{'}$	1000 / 300K	-

TABLE III: Details of the RGBD datasets used in the experiments.

III-E Data-free Student Learning

We formally describe the distillation framework to enable data-free student learning. The learning objective consists of two loss terms. The first loss term adopts the plain distillation with the initial simulated images to ensure a lower bound performance. The second loss term penalizes depth differences between the teacher and the student models using images obtained from the transformation network.

The optimization objective of depth distillation from the teacher model to the student model is defined by

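$$\min_{N_{s}} \sum_{x^{'} \in X^{'}} L (N_{t} (x^{'}), N_{s} (x^{'})) + \sum_{\hat{x}^{'} \in \hat{X}^{'}} L (N_{t} (G (\hat{x}^{'})), N_{s} (G (\hat{x}^{'}))) \tag{10}$$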

where $L$ is a loss function used for measuring the depth errors. We employ the loss function proposed in [Hu2019RevisitingSI] that penalizes errors of depth, gradient, and surface normal. The details of our method are given in Algorithm 1, where $random\_shuffle$ denotes the operation of randomly permuting the images within a batch.
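The two loss terms described above can be sketched as a single function. The equal weighting of the two branches, the function name `two_branch_loss`, and the toy callables standing in for $N_{t}$ , $N_{s}$ , $G$ , and $L$ are assumptions for illustration; the loss of [Hu2019RevisitingSI] is replaced here by a simple mean absolute error.

```python
def mean_abs_error(a, b):
    """Toy stand-in for the depth loss L."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def two_branch_loss(teacher, student, transform, batch, mixed_batch, loss_fn):
    """Plain distillation on simulated images plus distillation on transformed mixed images."""
    plain = loss_fn(teacher(batch), student(batch))
    transformed = transform(mixed_batch)
    mixed = loss_fn(teacher(transformed), student(transformed))
    return plain + mixed


# Toy "networks" acting element-wise on lists of numbers.
teacher = lambda xs: [2.0 * x for x in xs]
student = lambda xs: [1.0 * x for x in xs]
transform = lambda xs: [x + 1.0 for x in xs]

loss = two_branch_loss(teacher, student, transform, [1.0, 3.0], [0.0, 1.0], mean_abs_error)
```

In the toy example the first branch contributes 2.0 and the second 1.5. Only the student would be updated with this loss, while the transformation network is updated separately with the loss of Eq. (9).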

IV Experimental Results

IV-A Experimental Settings

IV-A1 Implementation Details

Our learning framework includes three networks: (1) the teacher network $N_{t}$ , which is trained on the target domain and kept fixed while training the student model; (2) the student network $N_{s}$ , which we aim to train; and (3) the transformation network $G$ , which is also optimized during training. We train $N_{s}$ and $G$ for 20 epochs using the Adam optimizer with an initial learning rate of 0.0001, and halve it every 5 epochs. The hyper-parameters $α$ and $β$ controlling the data transformation are set to 0.001 for all experiments throughout the paper. We train all models with a batch size of 8 and develop the code base using PyTorch [NEURIPS2019_9015].
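The learning-rate schedule described above can be written as a small step function. The sketch assumes 0-indexed epochs and that the rate is halved at the start of epochs 5, 10, and 15; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.5, step=5):
    """Step schedule: multiply the rate by `decay` after every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Learning rate used in each of the 20 training epochs.
schedule = [learning_rate(epoch) for epoch in range(20)]
```

In PyTorch, the same behavior is available through `torch.optim.lr_scheduler.StepLR` with `step_size=5` and `gamma=0.5`, stepped once per epoch.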

IV-A2 Datasets

NYU-v2 [NYUv2]

The NYU-v2 dataset is the benchmark most commonly used for depth estimation. It is captured by Microsoft Kinect with an original resolution of $640 \times 480$ , and contains 464 indoor scenes. Among them, 249 scenes are chosen for training, and 215 scenes are used for testing. We use the pre-processed data by Hu et al. [Hu2019RevisitingSI, hu2019analysis] with approximately 50,000 unique pairs of an image and a depth map with the resolution of 640 $\times$ 480. Following most previous studies, we resize the images to 320 $\times$ 240 pixels and then crop their central parts of 304 $\times$ 228 pixels as inputs. For testing, we use the official small subset of 654 RGBD pairs.
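For reference, the central crop described above corresponds to the following pixel window. The helper `center_crop_box` is a hypothetical name, and the (left, top, right, bottom) convention with exclusive right and bottom edges is an assumption of this sketch.

```python
def center_crop_box(width, height, crop_w, crop_h):
    """Return (left, top, right, bottom) of a centered crop window."""
    if crop_w > width or crop_h > height:
        raise ValueError("crop size exceeds image size")
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


# Resized NYU-v2 image (320 x 240) cropped to the 304 x 228 network input.
box = center_crop_box(320, 240, 304, 228)
print(box)  # (8, 6, 312, 234)
```

That is, 8 pixels are removed on the left and right and 6 pixels at the top and bottom of the resized image.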

ScanNet [dai2017scannet]

ScanNet is a large-scale RGBD dataset that contains 2.5 million RGBD images. We randomly and uniformly select a subset of approximately 50,000 samples from the training split of 1513 scenes for training, and evaluate the models on the test set of another 100 scenes with approximately 17K RGBD pairs. We apply the same image pre-processing, i.e., image resizing and cropping, as on the NYU-v2 dataset.

SceneNet [mccormac2016scenenet]

SceneNet is a large-scale synthetic dataset that contains 5 million RGBD indoor images from over 15,000 synthetic trajectories. Each trajectory has 300 rendered frames. The original image resolution is 320 $\times$ 240. Thus, we only apply the center crop to yield an image resolution of 304 $\times$ 228.

We sample two subsets from 1000 indoor scenes of the official validation set. The two subsets have 50,000 and 300,000 images, respectively, and are denoted by $X_{1}^{'}$ and $X_{2}^{'}$ in the following texts. The detailed information of the datasets used in the experiments is given in Table III.

Panels: (a) RGB images; (b) ground truth; (c) random noise; (d) DFAD; (e) KD-OOD with $X_{1}^{'}$; (f) our results with $X_{1}^{'}$; (g) KD-OOD with $X_{2}^{'}$; (h) our results with $X_{2}^{'}$.

Fig. 5: Qualitative comparison of depth maps predicted by different methods on the NYU-v2 test set.

IV-A3 Networks

We choose multiple teacher-student combinations to evaluate our method extensively. For the first combination, both the teacher and the student are the ResNet-34 [He2016DeepRL] based network proposed in [hu2021boosting], which lets us investigate the performance without model compression. For the second combination, we use the above ResNet-34 based network as the teacher and the MobileNet-v2 [sandler2018mobilenetv2] based network from [hu2021boosting] as the student. For the next two combinations, the teachers are a ResNet-50 [He2016DeepRL] based encoder-decoder network [laina2016deeper] and a multi-branch depth estimation network [Hu2019RevisitingSI], respectively; the students are obtained from the teacher networks by replacing ResNet-50 with ResNet-18. For the last combination, the teacher is a SENet-154 [hu2018senet] based residual pyramid network [Chen2019structure-aware], and the student is likewise derived from the teacher by replacing the backbone with the smaller ResNet-34.

To implement the network of the transformation model, we use the dilated convolution [Yu2017] based encoder-decoder network modified from the saliency prediction network [hu2019visualization, hu2019analysis] by adding symmetric skip connections between the encoder and the decoder.

IV-A4 Baselines

As discussed in Sec. III-A, most of the previous data-free KD methods cannot be applied to depth regression tasks. Thus, we choose DFAD [Fang2019DataFreeAD] as a baseline, since this method does not apply the inherent constraint for synthesizing images. Overall, we consider the following methods as baselines for comparison.

Teacher: The teacher model trained on the target dataset.

Student: The student model trained on the target dataset.

KD-OOD: For the sake of comparison, we take KD [Hinton2015DistillingTK] using the OOD simulated data as a strong baseline for our method. It corresponds to the first loss term used in our method.

Random noise: The student model is learned via KD with random Gaussian noise. It is also a baseline commonly used for image recognition.

DFAD: The student model is learned with data-free adversarial distillation [Fang2019DataFreeAD] that synthesizes images from random noise with adversarial training.

IV-B Quantitative Comparisons

IV-B1 NYU-v2 Dataset

We first thoroughly evaluate the proposed method on the NYU-v2 dataset. We evaluate the predicted depth maps using the mean relative error (REL) and the $δ_{1}$ accuracy. Table II shows the quantitative results of different methods for various teacher-student combinations, where the performance of the student trained with supervised learning serves as an upper bound that we aim to reach. As seen, distillation with random noise yields the lowest performance, although it has been shown to be effective for image recognition on some toy datasets, e.g., MNIST [LeCun1998GradientbasedLA] and CIFAR-10 [Krizhevsky2009LearningML]. Moreover, DFAD also fails on this task.

Compared to the above methods, KD-OOD demonstrates much better results, showing the advantage of our approach of utilizing OOD simulated images. When using the smaller set $X_{1}^{'}$ , KD-OOD shows a mean performance degradation of $28.1 %$ in REL (an increase) and $14.0 %$ in $δ_{1}$ (a decrease). Most importantly, the proposed method outperforms all baselines and attains consistent performance improvement across all teacher-student combinations: its mean degradation is $19.7 %$ in REL and $9.9 %$ in $δ_{1}$ . Compared to KD-OOD, this amounts to a mean improvement of $8.4 %$ and $4.1 %$ in REL and $δ_{1}$ , respectively.

We then analyze the effect of utilizing the larger set $X_{2}^{'}$ . We observe a performance boost for both KD-OOD and our method in all experiments when using the larger set. Our method again outperforms KD-OOD, by $8.0 %$ and $4.9 %$ in REL and $δ_{1}$ , respectively. Moreover, our method using $X_{1}^{'}$ even outperforms KD-OOD using $X_{2}^{'}$ ; in other words, it reduces the amount of required data to $1 / 6$ .
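For reference, the two metrics can be sketched as follows. This is a minimal illustration in plain Python that assumes the definitions commonly used in the depth estimation literature: REL is the mean of $|d - d^{*}| / d^{*}$ over valid pixels, and $δ_{1}$ is the fraction of valid pixels with $\max(d / d^{*}, d^{*} / d) < 1.25$. The function name, the validity mask (ground truth greater than zero), and the toy values are assumptions made for the example.

```python
def rel_and_delta1(pred, gt, threshold=1.25):
    """Compute REL and delta_1 over pixels with valid ground-truth depth.

    Predictions are assumed to be strictly positive.
    """
    # Pixels without a depth measurement (gt == 0) are skipped.
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(valid)
    rel = sum(abs(p - g) / g for p, g in valid) / n
    delta1 = sum(max(p / g, g / p) < threshold for p, g in valid) / n
    return rel, delta1


# Toy example with flattened depth values in meters.
pred = [1.0, 2.2, 3.0, 5.2, 1.5]
gt = [1.0, 2.0, 4.0, 4.0, 0.0]  # the last pixel has no valid depth
rel, delta1 = rel_and_delta1(pred, gt)
```

Lower REL and higher $δ_{1}$ indicate better depth estimates, which is why a drop in quality appears as an increase in REL and a decrease in $δ_{1}$.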

Panels: (a) $x_{i}^{'}$; (b) $x_{j}^{'}$; (c) $\hat{x}^{'}$; (d) $N_{t} (\hat{x}^{'})$; (e) $G (\hat{x}^{'})$; (f) $N_{t} (G (\hat{x}^{'}))$; (g) $\| \hat{x}^{'} - G (\hat{x}^{'}) \|$; (h) $\| N_{t} (\hat{x}^{'}) - N_{t} (G (\hat{x}^{'})) \|$.

Fig. 6: Visual comparisons of images and depth maps, where (a) and (b) are original images from the simulated set, (c) and (d) are the mixed image and its estimated depth map, (e) and (f) are the transformed version of (c) and its estimated depth map, and (g) and (h) denote the image discrepancy and the depth discrepancy, respectively.

Another observation is that the first two teacher-student combinations outperform the latter three. This agrees well with previous studies [wang2021knowledge], which verified that the performance of the student model degrades when the gap in model capacity between the teacher and the student is significant. This problem can be handled by using an additional assistant model [Mirzadeh2020ImprovedKD], distilling intermediate features [Liu2020StructuredKD], using multiple teacher models [Tarvainen2017MeanTA], or using an ensemble of distributions [Malinin2020EnsembleDD]. Since it is a common challenge in KD, we leave it as future work.

Fig. 5 shows a qualitative comparison of different methods. Random noise produces meaningless predictions, and DFAD estimates only coarse depth maps. A closer look at the maps predicted by KD-OOD and our method shows that our method estimates more accurate depth in local regions. Overall, the quantitative and qualitative results verify the effectiveness of our approach.

IV-B2 ScanNet Dataset

To fully evaluate our method, we also test the methods on the ScanNet dataset. We use the teacher and student models proposed in [hu2021boosting]. The results are given in Table IV and are highly consistent with those obtained on NYU-v2. Both random noise and DFAD show extremely low accuracy. The proposed method outperforms KD-OOD even when using the smaller set. For $X_{1}^{'}$ and $X_{2}^{'}$ , we obtain $13.7 %$ and $9.8 %$ improvement in $δ_{1}$ , and $17.0 %$ and $9.1 %$ improvement in REL, respectively. The performance improvement obtained on ScanNet is larger than that obtained on NYU-v2.

Method	Data	REL $↓$	$δ_{1}$ $↑$
Teacher Model [hu2021boosting]	ScanNet	0.150	0.790
Student Model [hu2021boosting]	ScanNet	0.165	0.764
Random noise	None	0.539	0.079
DFAD [Fang2019DataFreeAD]	None	0.335	0.368
KD-OOD [Hinton2015DistillingTK]	SceneNet $X_{1}^{'}$	0.224	0.541
Ours	SceneNet $X_{1}^{'}$	0.196	0.646
KD-OOD [Hinton2015DistillingTK]	SceneNet $X_{2}^{'}$	0.200	0.618
Ours	SceneNet $X_{2}^{'}$	0.185	0.693

TABLE IV: The results provided by the models on the ScanNet dataset.
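The percentages quoted for ScanNet can be recomputed from Table IV. The sketch below assumes that each gain is the gap between KD-OOD and our method, normalized by the score of the supervised student; this normalization is an inference from the fact that it reproduces the four reported numbers, and it is not stated explicitly in the text. All variable and function names are hypothetical.

```python
# Scores copied from Table IV (ScanNet).
student = {"rel": 0.165, "delta1": 0.764}
kd_ood = {"X1": {"rel": 0.224, "delta1": 0.541},
          "X2": {"rel": 0.200, "delta1": 0.618}}
ours = {"X1": {"rel": 0.196, "delta1": 0.646},
        "X2": {"rel": 0.185, "delta1": 0.693}}


def gain_percent(baseline, method, reference, lower_is_better):
    """Gap between baseline and method as a percentage of a reference score."""
    gap = baseline - method if lower_is_better else method - baseline
    return 100.0 * gap / reference


# Gains of our method over KD-OOD for each simulated subset and metric.
gains = {
    subset: {
        "rel": gain_percent(kd_ood[subset]["rel"], ours[subset]["rel"],
                            student["rel"], lower_is_better=True),
        "delta1": gain_percent(kd_ood[subset]["delta1"], ours[subset]["delta1"],
                               student["delta1"], lower_is_better=False),
    }
    for subset in ("X1", "X2")
}
```

Rounded to one decimal place, the values in `gains` are 17.0 and 13.7 (REL and $δ_{1}$) for $X_{1}^{'}$, and 9.1 and 9.8 for $X_{2}^{'}$.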

IV-C Analyses of the Transformation Network

Fig. 6 shows some examples of the input and output images of the transformation network as well as their corresponding predictions. In the figure, $x_{i}^{'}$ and $x_{j}^{'}$ denote two images randomly selected from the simulated set, and $\hat{x}^{'}$ is the image generated by applying object-wise mixing between $x_{i}^{'}$ and $x_{j}^{'}$ . $G (\hat{x}^{'})$ denotes the transformed image, i.e., the output of $G$ . By visually comparing $\hat{x}^{'}$ and $G (\hat{x}^{'})$ , we observe that $G$ tends to reduce artifacts around object boundaries and thus produces more realistic images. This is confirmed by $\| \hat{x}^{'} - G (\hat{x}^{'}) \|$ (Fig. 6 (g)), where differences at object boundaries are highlighted. Furthermore, Fig. 6 (d) and (f) show the predicted depth maps for $\hat{x}^{'}$ and $G (\hat{x}^{'})$ , respectively. They differ clearly, as observed in Fig. 6 (h). We quantify these differences over the whole set $X_{1}^{'}$ : the $ℓ_{1}$ -norms of the image difference and the depth difference are 0.156 and 0.227, respectively.

Method	REL $↓$	$δ_{1}$ $↑$
Original	0.168	0.742
Without using $l_{rec}$	0.172	0.735
Without using $G$	0.175	0.722
Without using image mixing	0.171	0.724
With $G (x^{'})$	0.168	0.748

TABLE V: Results for ablation studies.

IV-D Ablation Studies

We conduct several ablation studies to analyze our approach and provide additional results on the NYU-v2 dataset. Table V gives the results. Specifically, we perform several experiments as follows:

Without using $l_{rec}$ : In our original method, we impose reconstruction consistency between $\hat{x}^{'}$ and $G (\hat{x}^{'})$ to suppress undesirable noise while training the transformation model. When we remove this constraint, the REL and $δ_{1}$ degrade to 0.172 and 0.735, respectively.

Without using $G$ : We also test the performance when the transformation network is removed from the pipeline. We directly perform distillation using $x^{'}$ and the mixed images $\hat{x}^{'}$ . As a result, the REL and $δ_{1}$ degrade to 0.175 and 0.722, respectively.

Without using image mixing: We evaluate the effect of removing the object-wise image mixing. We feed the images $x^{'}$ to $G$ and apply distillation with both $x^{'}$ and $G (x^{'})$ . We find that the REL and $δ_{1}$ degrade to 0.171 and 0.724, respectively.

With $G (x^{'})$ : Our method performs KD with the initial data $x^{'}$ and the transformed data $G (\hat{x}^{'})$ . We also conduct an experiment to investigate the effect of applying $G$ to $x^{'}$ , performing KD with both $G (x^{'})$ and $G (\hat{x}^{'})$ . We gain a slight performance boost: $δ_{1}$ improves from 0.742 to 0.748, while the REL remains at 0.168.

IV-E Invalidating KD by Adversarial Perturbation

We argue that the scene structure of the alternative OOD data is critical for successfully applying KD. To verify this, we conduct additional experiments in which we add adversarial perturbations to the simulated images to undermine the data distribution. Note that even though these adversarial perturbations are imperceptible to human vision, they generate non-robust features [Ilyas2019AdversarialEA] and cause a depth estimator to malfunction. We generate a set of adversarial images by applying adversarial attacks and predict depth maps from them. We then perform KD with these newly generated RGB and depth pairs. Following [hu2019analysis], we apply the IFGSM attack [Kurakin2017AdversarialEI] to the ResNet-34 based teacher model [hu2021boosting] with different perturbation bounds $ϵ$ . The results provided by the student model are given in Fig. 7, where $ϵ = 0$ denotes the result without attack. The accuracy of depth distillation gradually deteriorates as $ϵ$ increases, which indicates that KD is vulnerable to adversarial attacks.

Fig. 7: Results of applying IFGSM attack to KD for monocular depth estimation.
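To make the role of the perturbation bound $ϵ$ concrete, the following is a toy sketch of an iterative sign-gradient attack in plain Python. It does not use the ResNet-34 teacher from the experiments: the "depth estimator" here is a hypothetical linear model $f(x) = w \cdot x$ with a squared-error loss, chosen only because its gradient can be written analytically. The function names, step size, and input values are all assumptions made for illustration.

```python
def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def predict(x, w):
    """Toy linear estimator f(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def ifgsm_linear(x, w, target, eps, step, iters):
    """Iteratively perturb x to increase the loss (f(x) - target)^2."""
    x_adv = list(x)
    for _ in range(iters):
        # Analytic gradient of the squared error with respect to the input.
        residual = predict(x_adv, w) - target
        grad = [2.0 * residual * wi for wi in w]
        # Take a small step along the sign of the gradient.
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back into the L-infinity ball of radius eps around x.
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
        # Keep pixel values in the valid range [0, 1].
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv


x = [0.2, 0.5, 0.8]   # toy "image" with three pixels
w = [1.0, -2.0, 0.5]  # weights of the toy estimator
eps = 0.03
x_adv = ifgsm_linear(x, w, target=0.0, eps=eps, step=0.01, iters=5)

loss_clean = (predict(x, w) - 0.0) ** 2
loss_adv = (predict(x_adv, w) - 0.0) ** 2
```

Each pixel moves by at most `eps`, yet the loss of the toy estimator grows; a larger `eps` allows a larger change in the prediction, which mirrors the trend reported in Fig. 7.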

V Summary and Conclusion

We have studied knowledge distillation for monocular depth estimation in data-free scenarios. By analyzing the challenges of the task, we showed that a promising approach to address the challenges is to utilize out-of-distribution images as an alternative solution. We then empirically verified that i) high scene similarity, ii) large-scale dataset, and iii) small domain gap contribute to the performance boost of depth distillation through experiments with different OOD data. Given the difficulty of data collection in practice, we proposed to utilize simulated images to strengthen the applicability of KD.

In this paper, for the first time, we presented a novel framework to perform data-free knowledge distillation for monocular depth estimation. We noted that the major challenges are a lack of prior information on the scene structure and a significant domain shift between the simulated and target distributions. To remedy the first difficulty, we proposed to apply object-wise image mixing to cover the unknown distributed patterns in the target domain. To handle the second challenge, we proposed to leverage a transformation network that efficiently learns to adjust image distributions.

As a practical solution to the task, we have evaluated the effectiveness of the proposed approach for various depth estimation models and two real-world benchmark datasets. We hope our method can further inspire future explorations, shedding some light on this unexplored problem.
