LiteDepth: Digging into Fast and Accurate
Depth Estimation on Mobile Devices

Zhenyu Li ¹, Zehui Chen ², Jialei Xu ¹, Xianming Liu ¹, Junjun Jiang ¹ (corresponding author, jiangjunjun@hit.edu.cn)

¹ Harbin Institute of Technology
² University of Science and Technology of China

Email: {zhenyuli17, csxm, jiangjunjun}@hit.edu.cn, lovesnow@mail.ustc.edu.cn, 21B903029@stu.hit.edu.cn

Abstract

Monocular depth estimation is an essential task in the computer vision community. While many successful methods have obtained excellent results, most of them are computationally expensive and not applicable to real-time on-device inference. In this paper, we aim to address more practical applications of monocular depth estimation, where the solution should consider not only precision but also inference time on mobile devices. To this end, we first develop an end-to-end learning-based model with a tiny weight size (1.4MB) and a short inference time (27 FPS on Raspberry Pi 4). Then, we propose a simple yet effective data augmentation strategy, called R $^{2}$ crop, to boost the model performance. Moreover, we observe that a simple lightweight model trained with a single loss term suffers from a performance bottleneck. To alleviate this issue, we adopt multiple loss terms to provide sufficient constraints during the training stage. Furthermore, with a simple dynamic re-weight strategy, we can avoid the time-consuming hyper-parameter selection for the loss terms. Finally, we adopt structure-aware distillation to further improve the model performance. Our solution, named LiteDepth, ranks 2nd in the MAI&AIM2022 Monocular Depth Estimation Challenge, with a si-RMSE of 0.311, an RMSE of 3.79, and an inference time of 37 ms on the Raspberry Pi 4. Notably, it is the fastest solution in the challenge. Codes and models will be released at https://github.com/zhyever/LiteDepth.

Keywords:

Monocular Depth Estimation, Lightweight Network, Data Augmentation, Multiple Loss

1 Introduction

Monocular depth estimation plays a vital role in the computer vision community: a wide range of depth-dependent tasks in autonomous driving [chen2022autoalign, wang2022monocular, chen2022graph, reading2021categorical, chen2022autoalignv2, weng2019monocular, li2022unsupervised, wang2022detr3d], virtual reality [armbruster2008depth, gerig2018missing], and scene understanding [zhu2020edge, wang2018depth, hazirbas2016fusenet, vu2019dada] create strong demand for fast and accurate monocular depth estimation methods that run on portable low-power hardware. Therefore, research on accelerating depth estimation on mobile devices while limiting the loss in quality has drawn increasing attention [ignatov2021fast, wang2021knowledge].

As a classic ill-posed problem, estimating accurate depth from a single image is challenging. However, with the fast development of deep learning techniques, neural networks have demonstrated groundbreaking improvements and produce plausible depth estimation results [eigen2014depth, lee2019bts, bhat2021adabins, yang2021transdepth, li2022depthformer, li2022binsformer]. While engaging results have been presented, most of these state-of-the-art (SoTA) models are optimized only for high-fidelity results and do not take computational efficiency and mobile-related constraints into account. Their reliance on powerful high-end GPUs and gigabytes of RAM makes it difficult to deploy these models on resource-constrained mobile hardware [ignatov2021fast, huawei2018, snapdragon2018].

In this paper, we aim to address the more practical application problem of monocular depth estimation on mobile devices, where the solution should consider not only precision but also inference time [ignatov2021fast]. We first investigate a suitable network design. Typically, a depth estimation network follows the UNet paradigm [ronneberger2015u], consisting of an encoder and a decoder with skip connections. For the encoder, we choose a variant of MobileNet-v3 [howard2017mobilenets] as a trade-off between performance and inference time, where we remove the last convolution layer to speed up inference and reduce the model size. Moreover, we observe that the commonly used image normalization pre-processing of input images is also time-consuming (19 ms on Raspberry Pi 4). To solve this issue, we propose to merge the normalization into the first convolution layer as a post-processing step after training, so that the redundant overhead is eliminated without bells and whistles. Following [ignatov2021fast], we adopt the fast downsampling strategy, which quickly downsamples the resolution of input images from 480 $\times$ 640 to 4 $\times$ 6. A light decoder, consisting of a few convolutional layers and upsampling layers, is introduced to recover the spatial details.

After determining the model structure, we propose several effective training strategies to boost the fidelity of the lightweight model. (1) We adopt an effective augmentation strategy called R $^{2}$ crop. It not only crops patches at Random locations on images but also Randomly changes the size of the crop patches. This strategy increases the diversity of the scenes and effectively avoids overfitting the training set. (2) We introduce a multiple-loss training strategy to provide sufficient supervision during the training stage, where we propose a gradience loss that can handle invalid holes in training samples and adopt three other loss terms proposed in previous works. Moreover, we employ a dynamic re-weighting strategy that avoids the time-consuming weight selection for the loss terms. (3) We highlight that our work focuses on model training strategies, unlike previous solutions that adopt various distillation methods [ignatov2021fast, wang2021knowledge]. However, model distillation can also be an effective way to boost the model fidelity without any inference overhead. Therefore, we adopt structure-aware distillation [liu2020structured] in a fine-tuning manner.

We evaluate our method on the Mobile AI (MAI2022) dataset, and the results demonstrate that each strategy improves the accuracy of the lightweight network. With a short inference time (37 ms per image) on Raspberry Pi 4 and a lightweight model design (1.4MB in total), our solution named LiteDepth achieves a si-RMSE of 0.311 and ranks second in the MAI&AIM 2022 Monocular Depth Estimation Challenge [ignatov2022depth].

In summary, our main contributions are:

We design a lightweight depth estimation model that achieves fast inference on mobile hardware, where an image normalization merging strategy is proposed to reduce the redundant overhead.
We adopt an effective augmentation strategy called R $^{2}$ crop, which crops patches of randomly chosen size at random locations on images.
We design a gradience loss that can handle invalid holes in training samples and propose to apply multiple loss terms to provide sufficient supervision during the training stage.
We evaluate our method on the MAI2022 dataset and rank second in the MAI&AIM2022 Monocular Depth Estimation Challenge [ignatov2022depth].

| Rank | Username | si-RMSE | RMSE | log$_{10}$ | REL | Runtime | Score |
|---|---|---|---|---|---|---|---|
| 1 | TCL | 0.277 | 3.47 | 0.110 | 0.299 | 46 ms | 297.79 |
| **2** | **Zhenyu Li** | **0.311** | **3.79** | **0.124** | **0.342** | **37 ms** | **232.04** |
| 3 | ChaoMI | 0.299 | 3.89 | 0.134 | 0.380 | 54 ms | 187.77 |
| 4 | parkzyzhang | 0.303 | 3.80 | 12.189 | 0.301 | 68 ms | 141.07 |
| 5 | RocheL | 0.329 | 4.06 | 0.137 | 0.366 | 65 ms | 102.07 |
| 6 | mvc | 0.349 | 4.46 | 0.140 | 0.340 | 139 ms | 36.07 |
| 7 | Byung Hyun Lee | 0.338 | 6.73 | 0.332 | 0.507 | 142 ms | 41.58 |

Table 1: Ranking results in the MAI&AIM2022 Monocular Depth Estimation Challenge, which are evaluated on the online test server. We highlight our results in bold.

2 Related Work

Monocular depth estimation is an ill-posed problem [eigen2014depth]. Lack of cues, scale ambiguities, and translucent or reflective materials all lead to ambiguous cases where the spatial structure cannot be inferred from appearance [li2022depthformer]. With the rapid development of deep learning, neural networks have become the primary workhorse for producing reasonable depth maps from a single RGB input [yin2019enforcing, lee2019bts, bhat2021adabins, li2022depthformer, li2022binsformer].

Eigen et al. [eigen2014depth] were the first to propose a multi-scale deep network, consisting of a global network and a local network that predict the coarse depth and refine the predictions, respectively. Subsequent works focus on various aspects to boost depth estimation, for instance, problem formulation [fu2018deep, bhat2021adabins, li2022binsformer], network architecture [lee2019bts, li2022depthformer, kim2022global], supervision design [barron2019general, yin2019enforcing, patil2022p3depth], interpretable methods [you2021towards], pre-training strategies [li2021simipu, park2021pseudo], unsupervised training [godard2019digging, zhou2017unsupervised], etc. Though achieving engaging fidelity, these methods neglect the limitations of resource-constrained hardware and can be hard to deploy on portable devices or embedded systems.

Notably, there are also some methods that take the inference time and model complexity into account, which makes them applicable on mobile devices [ignatov2021fast]. FastDepth [wofk2019fastdepth] deploys a real-time depth estimation method on embedded systems by designing an efficient model architecture and a pruning strategy to further reduce the model complexity. In our paper, we follow FastDepth [wofk2019fastdepth] to choose MobileNet-v3 [howard2017mobilenets] as our encoder and design an even more lightweight decoder (only consisting of four convolution layers) to achieve a trade-off between fidelity and inference speed.

3 Method

In this section, we first present our network design in Sec. 3.1, where many details must be considered to achieve the best trade-off between fidelity and inference speed. Then, we introduce our proposed R $^{2}$ crop in Sec. 3.2 and the multiple loss training strategy in Sec. 3.3. Subsequently, we describe how we apply the structure-aware distillation strategy in Sec. 3.4.

Figure 1: Illustration of our proposed network architecture, which follows the prevalent UNet [ronneberger2015u] design and consists of a MobileNet-V3 [howard2017mobilenets] encoder and a lightweight decoder with skip connections.

3.1 Network Design

As shown in Fig. 1, our proposed network consists of an encoder and a lightweight decoder with skip connections. We sequentially introduce each component and design detail.

The encoder plays a crucial role in extracting features from input images for depth estimation. To achieve a trade-off between fidelity and inference speed, we choose MobileNet-v3 [howard2017mobilenets] as our encoder. It is worth noting that MobileNet contains a dimension-increasing layer (a 1 $\times$ 1 convolution with an input dimension of 96 and an output dimension of 960) to facilitate training for a classification task. We remove this layer to improve the inference speed and reduce the number of model parameters. Following [ignatov2021fast], we adopt the Fast Downsampling Strategy, in which a resize layer is inserted at the beginning of the encoder to resize the high-resolution input image from 480 × 640 to 128 × 160. As a result, the encoder quickly reduces the resolution of the feature maps, significantly shortening the inference time. Typically, input images are normalized to align with the pre-training setting. We find that the vanilla image normalization is time-consuming on the target device, i.e., Raspberry Pi 4 (19 ms for the image normalization vs. 37 ms for the whole model). Therefore, we propose to merge the image normalization into the first convolution layer as a post-processing step after training, so that we avoid the redundant overhead without bells and whistles. Consider the image normalization and the first convolution layer:

I_{n} = \frac{I_{r} - m}{s},

(1)

f = W * I_{n} + b,

(2)

where $I_{n}$ and $I_{r} \in R^{3 \times H \times W}$ are the normalized and raw input images. $m \in R^{3}$ and $s \in R^{3}$ are the mean and standard deviation used in the image normalization. $f \in R^{C \times H_{f} \times W_{f}}$ is the output feature map with $C$ channels of the first convolution in our network. $W \in R^{3 \times C \times k^{2}}$ and $b \in R^{C}$ are the trained weight and bias of the first $k \times k$ convolution. $*$ denotes the convolution operation. Given a trained model with parameters $W$ and $b$ of the first convolution, we update them based on the mean and standard deviation used in the image normalization during the training stage:

W^{'} = \frac{W}{s},

(3)

b_{i}^{'} = b_{i} - \sum_{d}^{3} \left( \frac{m_{d}}{s_{d}} \times \sum_{j}^{k \times k} W_{d i j} \right), \quad i \in \{1, 2, \dots, C\},

(4)

b^{'} = \mathrm{Concat}([b_{1}^{'}, b_{2}^{'}, \dots, b_{C}^{'}]),

(5)

where $W^{'}$ and $b^{'}$ are the updated weight and bias of the first $k \times k$ convolution, $\mathrm{Concat}$ denotes element-wise concatenation, and $d$ is the index over the RGB channels (the division in Eq. 3 is applied per input channel $d$). Consequently, we discard the image normalization and apply the first convolution directly to the input images:

f = W^{'} * I_{r} + b^{'} .

(6)

As a result, the trained network can directly receive the raw input images without the time-consuming image normalization.
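The merging in Eqs. 3 to 6 can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names (`conv2d_valid`, `fold_normalization`), the weight layout `[C][3][k][k]`, the ImageNet-style mean and standard deviation, and the use of an unpadded ("valid") convolution are all assumptions made for illustration.

```python
import random

def conv2d_valid(img, weight, bias):
    """Unpadded cross-correlation. img: [3][H][W]; weight: [C][3][k][k]; bias: [C]."""
    k = len(weight[0][0])
    H, W = len(img[0]), len(img[0][0])
    out = []
    for i, w_i in enumerate(weight):
        fmap = []
        for y in range(H - k + 1):
            row = []
            for x in range(W - k + 1):
                acc = bias[i]
                for d in range(3):
                    for dy in range(k):
                        for dx in range(k):
                            acc += w_i[d][dy][dx] * img[d][y + dy][x + dx]
                row.append(acc)
            fmap.append(row)
        out.append(fmap)
    return out

def fold_normalization(weight, bias, mean, std):
    """Eqs. 3-5: scale each input channel of W by 1/s_d and shift each bias b_i."""
    new_w = [[[[v / std[d] for v in row] for row in w_i[d]] for d in range(3)]
             for w_i in weight]
    new_b = [
        b_i - sum(mean[d] / std[d] * sum(v for row in w_i[d] for v in row)
                  for d in range(3))
        for w_i, b_i in zip(weight, bias)
    ]
    return new_w, new_b

random.seed(0)
C, k, H, W = 2, 3, 5, 6
mean = [123.675, 116.28, 103.53]   # assumed normalization statistics
std = [58.395, 57.12, 57.375]
raw = [[[random.uniform(0, 255) for _ in range(W)] for _ in range(H)] for _ in range(3)]
weight = [[[[random.uniform(-1, 1) for _ in range(k)] for _ in range(k)]
           for _ in range(3)] for _ in range(C)]
bias = [random.uniform(-1, 1) for _ in range(C)]

# Reference path: normalize first (Eq. 1), then convolve (Eq. 2).
normalized = [[[(v - mean[d]) / std[d] for v in row] for row in raw[d]] for d in range(3)]
reference = conv2d_valid(normalized, weight, bias)

# Merged path: convolve the raw image with the folded parameters (Eq. 6).
folded_w, folded_b = fold_normalization(weight, bias, mean, std)
merged = conv2d_valid(raw, folded_w, folded_b)

max_diff = max(abs(a - b)
               for fa, fb in zip(reference, merged)
               for ra, rb in zip(fa, fb)
               for a, b in zip(ra, rb))
print(max_diff)  # agreement up to floating-point rounding
```

One caveat of this sketch: with zero padding, a padded zero in the raw image corresponds to $-m/s$ rather than 0 in normalized space, so border outputs of the two paths would differ unless the raw image is padded with the mean. The source does not say how padding is handled.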

The decoder is adopted to recover the spatial details by fusing the multi-level deep and shallow features. Unlike previous works [wofk2019fastdepth, ignatov2021fast, wang2021knowledge] that utilize a symmetrical encoder and decoder, we remove the last decoder layer to further accelerate the model inference. Hence, the resolution of outputs is 4 $\times$ downsampled (i.e., 32 $\times$ 64). At each decoding stage, we apply a simple feature fusion module to aggregate the decoded and skip-connected features, which consists of a concatenation operation and a convolution layer (with ReLU as the activation function). To achieve the best trade-off between fidelity and speed, we utilize 1 $\times$ 1 and 3 $\times$ 3 convolutions for deep and shallow features, respectively. The final feature map is projected to the predicted depth map via a 1 $\times$ 1 convolution, which is then passed through a ReLU function to suppress negative predictions. Finally, we insert a resize block at the end of the decoder to upsample the predicted depth map to the raw resolution 480 $\times$ 640. We highlight the lightweight design of the decoder, which only consists of five convolution layers but achieves satisfactory fidelity.

Figure 2: Comparisons among different crop augmentations. As for R $^{2}$ crop, we utilize different colors to indicate that we adopt randomly selected size of crop patches.

Figure 3: Illustration of invalid depth GT pixels in the dataset. These pixels appear not only in the sky areas but also at close range, where the sensor cannot provide reliable GT values. We highlight a training sample that is used in Fig. 4 to illustrate the valid mask of our gradience loss.

3.2 R $^{2}$ Crop

Data augmentation is crucial to training models with better performance. Typically, the sequence of data augmentation for monocular depth estimation includes random rotation, random flip, random crop, and random color enhancement [lidepthtoolbox2022]. We propose a more effective crop strategy, R $^{2}$ crop, in which we randomly select both the size of the crop patches and the cropped locations. We highlight the differences from other commonly used crop methods in Fig. 2. R $^{2}$ crop increases the diversity of the scenes and effectively avoids overfitting the training set.
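As a rough illustration of the idea, the sketch below draws a crop size from a candidate list and then a crop location, applying the same window to the image and its depth map. The function name `r2_crop`, the uniform sampling over sizes, and the nested-list image representation are assumptions; the default size list is the one reported in the ablation study, and whether crops are later resized is not stated in the text.

```python
import random

def r2_crop(image, depth, sizes=((240, 384), (384, 512), (480, 640)), rng=random):
    """Randomly pick a crop size (h, w), then a random location, and crop both maps."""
    H, W = len(image), len(image[0])
    candidates = [(h, w) for h, w in sizes if h <= H and w <= W]
    ch, cw = rng.choice(candidates)          # Random size
    top = rng.randint(0, H - ch)             # Random location (inclusive bounds)
    left = rng.randint(0, W - cw)

    def crop(m):
        return [row[left:left + cw] for row in m[top:top + ch]]

    return crop(image), crop(depth), (top, left, ch, cw)

# Toy 480 x 640 inputs whose values encode their own coordinates.
random.seed(1)
image = [[(y, x) for x in range(640)] for y in range(480)]
depth = [[y * 640 + x for x in range(640)] for y in range(480)]
crop_img, crop_dep, (top, left, ch, cw) = r2_crop(image, depth)
print((ch, cw), (top, left))
```

Because the full-resolution size (480, 640) is among the candidates, some training samples remain uncropped, which matches the motivation given later that the model should also see full-area images.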

3.3 Multiple Loss Training

Previous depth estimation methods [lee2019bts, bhat2021adabins, li2022depthformer, li2022binsformer] only adopt the silog loss to train the neural network:

L_{silog} = α \sqrt{\frac{1}{N} \sum_{i}^{N} e_{i}^{2} - \frac{λ}{N^{2}} \left( \sum_{i}^{N} e_{i} \right)^{2}},

(7)

where $e_{i} = \log \hat{d}_{i} - \log d_{i}$ with the ground truth depth $d_{i}$ and predicted depth $\hat{d}_{i}$. $N$ denotes the number of pixels having valid ground truth values. Since we find that the lightweight model supervised by this single loss lacks representation capability and easily gets stuck in local optima, we adopt diverse loss terms to provide various targets for sufficient model training.
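A small sketch of Eq. 7 on flat lists of depths may make the role of $λ$ clearer. The values `alpha=10.0` and `lam=0.85` are assumptions borrowed from common practice (the text does not report them), as is the convention that a ground truth value of 0 marks an invalid pixel; predictions are assumed to be strictly positive.

```python
import math

def silog_loss(pred, gt, alpha=10.0, lam=0.85):
    """Eq. 7 over pixels with valid (positive) ground truth."""
    errs = [math.log(p) - math.log(g) for p, g in zip(pred, gt) if g > 0]
    n = len(errs)
    mean_sq = sum(e * e for e in errs) / n      # (1/N) * sum e_i^2
    sq_mean = (sum(errs) / n) ** 2              # (1/N^2) * (sum e_i)^2
    # Clamp at zero to guard against tiny negative values from rounding.
    return alpha * math.sqrt(max(mean_sq - lam * sq_mean, 0.0))

gt = [1.0, 2.0, 4.0, 0.0]            # last pixel is invalid and ignored
pred_scaled = [2.0, 4.0, 8.0, 3.0]   # exactly 2x the valid ground truth

fully_invariant = silog_loss(pred_scaled, gt, lam=1.0)
partially_invariant = silog_loss(pred_scaled, gt, lam=0.85)
print(fully_invariant, partially_invariant)
```

With `lam=1.0` a global scale error costs (almost) nothing, which is the scale-invariant case; with `lam<1` part of the scale error is still penalized.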

Motivated by [sitzmann2020implicit], we first propose a gradience loss $L_{g r a d}$ formulated as:

L_{grad} = \frac{1}{N} \sum_{i} \left( M_{x_{i}} \times \| \nabla_{x} \hat{d}_{i} - \nabla_{x} d_{i} \|_{1} + M_{y_{i}} \times \| \nabla_{y} \hat{d}_{i} - \nabla_{y} d_{i} \|_{1} \right),

(8)

where $\nabla$ is the gradience calculation operation. The gradience is computed by shifted (dislocation) subtraction, and the dataset contains a large number of invalid depth GT pixels, as shown in Fig. 3. As presented in Fig. 4, naively applying the gradience calculation therefore mixes invalid pixels into their valid neighbors, destroying the information about which pixels are invalid and introducing outlier values into the loss term. Hence, it is necessary to carefully design a strategy to calculate the masks $M$ that filter these invalid pixels in $L_{grad}$. To solve this issue, we first replace the invalid values with NaN and then calculate the GT gradience. Thanks to the numerical properties of NaN and Inf, which propagate through arithmetic, the invalidity information is preserved. Consequently, we can filter out the NaN and Inf entries when calculating the gradience loss.
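The NaN trick can be illustrated on a tiny depth map. This is a sketch under assumptions rather than the authors' code: invalid ground truth is taken to be encoded as 0, gradients are forward differences, and the loss is averaged over the valid gradient entries (the normalization $N$ in Eq. 8 may be defined differently).

```python
import math

def diff_x(d):
    """Shifted subtraction along x: d[y][x+1] - d[y][x]."""
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in d]

def diff_y(d):
    """Shifted subtraction along y: d[y+1][x] - d[y][x]."""
    return [[d[y + 1][x] - d[y][x] for x in range(len(d[0]))] for y in range(len(d) - 1)]

def grad_loss(pred, gt, invalid=0.0):
    # Replace invalid GT with NaN so every difference touching it becomes NaN.
    gt_nan = [[math.nan if v == invalid else v for v in row] for row in gt]
    total, count = 0.0, 0
    for diff in (diff_x, diff_y):
        for pred_row, gt_row in zip(diff(pred), diff(gt_nan)):
            for gp, gg in zip(pred_row, gt_row):
                if math.isfinite(gg):        # mask M: keep only valid gradients
                    total += abs(gp - gg)
                    count += 1
    return total / max(count, 1)

gt = [[1.0, 2.0, 0.0],       # top-right pixel has no measurement
      [1.0, 2.0, 3.0]]
pred = [[1.0, 2.0, 9.0],     # prediction is perfect wherever GT is valid
        [1.0, 2.0, 3.0]]

masked = grad_loss(pred, gt)                # invalid pixel filtered out
naive = grad_loss(pred, gt, invalid=None)   # no masking: the hole acts as depth 0
print(masked, naive)
```

Without the mask, the hole is treated as a depth of 0 and produces large spurious gradients next to it, even though the prediction matches every valid pixel.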

Moreover, we also adopt the virtual norm loss $L_{v n l}$ [yin2019enforcing], and robust loss $L_{r o b u s t}$ [barron2019general]. We formulate them as follows:

L_{vnl} = \frac{1}{N} \sum_{i}^{N} \| \hat{n}_{i} - n_{i} \|_{1},

(9)

where $n$ is the virtual norm. We refer readers to the original paper [yin2019enforcing] for more details. Unlike the original implementation, we sample points from the reconstructed point clouds and apply the constraints for filtering invalid samples to the predictions instead of the ground truth. This helps the model converge at the beginning of training.

Figure 4: Illustration of the valid mask calculation for the gradience loss (x direction). First row: vanilla calculation of the gradience loss. Second row: we propose to first replace invalid values with NaN and then compute a reliable valid mask for the gradience loss.

L_{robust} = \frac{1}{N} \sum_{i}^{N} \frac{| α - 2 |}{α} \left( \left( \frac{(e_{i} / c)^{2}}{| α - 2 |} \right)^{α / 2} - 1 \right),

(10)

where $e_{i} = \hat{d}_{i} - d_{i}$. We experimentally set $α = 1$ and $c = 2$. In effect, the loss reduces to a simple $L_{2}$ loss, which nevertheless proves more effective in our task than the adaptive version proposed in the original paper. Further experiments could be conducted to find better choices of $α$ and $c$.
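For reference, the sketch below evaluates the general robust loss in the form given by Barron [barron2019general], which carries a "+ 1" inside the power so that a zero residual costs zero; that term is taken from the cited work and is an assumption relative to Eq. 10 as printed here. The function names and the rule that 0 marks invalid ground truth are also illustrative.

```python
import math

def general_robust(e, alpha=1.0, c=2.0):
    """Barron's general loss for one residual e (alpha must not be 0 or 2 here)."""
    a = abs(alpha - 2.0)
    return a / alpha * (((e / c) ** 2 / a + 1.0) ** (alpha / 2.0) - 1.0)

def robust_loss(pred, gt, alpha=1.0, c=2.0):
    """Mean robust loss over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(general_robust(p - g, alpha, c) for p, g in pairs) / len(pairs)

zero_cost = general_robust(0.0)
small = general_robust(0.01)          # close to 0.01**2 / (2 * c**2)
large = general_robust(200.0)         # grows roughly like |e| / c
batch = robust_loss([1.0, 2.5, 7.0], [1.0, 2.0, 0.0])
print(zero_cost, small, large, batch)
```

In this form, with $α = 1$ the loss behaves quadratically for residuals that are small relative to $c$ and linearly for large ones, so outliers are penalized less harshly than under a pure squared error.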

Finally, we adopt a combination of these loss terms to train our network. The total depth loss is

L_{d e p t h} = w_{1} L_{s i l o g} + w_{2} L_{g r a d} + w_{3} L_{v n l} + w_{4} L_{r o b u s t} .

(11)

We set $w_{1} = 1$ , $w_{2} = 0.25$ , $w_{3} = 2.5$ , and $w_{4} = 0.6$ based on extensive experiments. We also apply a dynamic re-weight strategy in which the loss weights $w$ are set as model parameters and are automatically tuned during the training stage. Experimental results indicate that this strategy achieves results similar to tuning the weights by hand.

Figure 5: Illustration of the teacher network.

3.4 Structure-Aware Distillation

We apply the structure-aware distillation strategy [liu2020structured, wang2021knowledge] to further boost model performance. For the teacher model, we choose Swin Transformer [liu2021swin] as the encoder and adopt a similar lightweight decoder to recover the feature resolution and predict depth maps. We present the network architecture in Fig. 5. The teacher model is trained under the supervision of $L_{depth}$ and is then frozen when distilling the student model. During the distillation, multi-level distillation losses are adopted to provide supervision on intermediate features, as shown in Fig. 6. The distillation loss is formulated as

L_{distill} = \sum_{l}^{L} \left( \frac{1}{H \times W} \sum_{i}^{H} \sum_{j}^{W} \| a_{i j}^{s} - a_{i j}^{t} \|_{1} \right),

(12)

where $a$ is the affinity map calculated via the inner product of $L_{2}$-normalized features. We refer to [liu2020structured, wang2021knowledge] for more details. $s$ and $t$ indicate that the features come from the student and teacher model, respectively. We choose three levels ( $L = 3$ ) of features for distillation, which are DeFeat2, DeFeat3, and DeFeat4 in Fig. 1.
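The affinity-based loss can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: each feature level is given as a list of per-pixel feature vectors, the affinity is computed between every pair of spatial positions, and each level is averaged over the number of pairs (the normalization in Eq. 12 is written as $H \times W$).

```python
import math

def affinity(feat):
    """feat: list of per-pixel feature vectors. Returns pairwise cosine similarities."""
    normed = []
    for v in feat:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0   # guard all-zero vectors
        normed.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]

def distill_loss(student_levels, teacher_levels):
    """Sum over levels of the mean absolute difference between affinity maps."""
    total = 0.0
    for fs, ft in zip(student_levels, teacher_levels):
        a_s, a_t = affinity(fs), affinity(ft)
        n_pairs = len(a_s) * len(a_s)
        total += sum(abs(x - y)
                     for row_s, row_t in zip(a_s, a_t)
                     for x, y in zip(row_s, row_t)) / n_pairs
    return total

# Two pixels per level; the student has 2 channels, the teacher 3.
student = [[[1.0, 0.0], [0.0, 1.0]]]
teacher_same_structure = [[[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]]
teacher_other_structure = [[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]

loss_same = distill_loss(student, teacher_same_structure)
loss_other = distill_loss(student, teacher_other_structure)
print(loss_same, loss_other)
```

Since the affinity map only depends on the number of spatial positions, the student and teacher features can have different channel widths without any projection layer, which is convenient when pairing a small student with a much larger teacher.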

Figure 6: Illustration of our multi-scale distillation strategy.

Consequently, the student model is trained via the total loss $L$ :

L = L_{d e p t h} + w_{d} L_{d i s t i l l},

(13)

where $w_{d} = 10$ in our experiments. Notably, unlike previous work [liu2020structured, wang2021knowledge], we adopt a two-stage training paradigm. During the first stage, the student model is only trained via $L_{d e p t h}$ . In the second stage, we adopt the teacher model and utilize the total loss $L$ to further boost the performance of the student model.

4 Experiments

In this section, we introduce our experiments to evaluate the effectiveness of our solution. We first elaborate the dataset and define the evaluation metrics. Then the detailed implementation and ablation studies are presented. We also report the inference time on target devices (i.e., Raspberry Pi 4) to show that our method can not only produce reasonable depth estimation but also achieve real-time inference on resource-constrained hardware.

4.1 Setup

4.1.1 Dataset

We utilize the dataset provided by the MAI&AIM2022 challenge to conduct experiments, which contains 7385 pairs of RGB and grayscale depth images. The pixel values of the depth maps are in uint16 format ranging from 0 to 40000, which represent depth values from 0 to 40 meters. We use 6869 pairs for training and the remaining 516 pairs as the local validation set.
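A short sketch of how such depth codes map to meters: 40000 codes over 40 meters gives 1000 codes per meter. Treating a code of 0 as "no measurement" is an assumption made here for illustration, and `decode_depth` is a hypothetical helper name.

```python
def decode_depth(raw_codes, codes_per_meter=1000.0):
    """Convert uint16 depth codes to meters; 0 is treated as invalid (assumption)."""
    return [v / codes_per_meter if v > 0 else None for v in raw_codes]

decoded = decode_depth([0, 1500, 40000])
train_pairs, val_pairs = 6869, 516
print(decoded, train_pairs + val_pairs)
```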

4.1.2 Evaluation Metrics

In the MAI&AIM2022 challenge [ignatov2022depth], two metrics are considered for each submitted solution: 1) The quality of the depth estimation, measured by the scale-invariant root mean squared error (si-RMSE). 2) The runtime of the model on the target platform (i.e., Raspberry Pi 4). The scoring formulation is provided below:

Score (si-RMSE, runtime) = \frac{2^{- 20 \cdot si-RMSE}}{C \cdot runtime},

(14)

where $C = 0.01$ on the online validation benchmark.
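As a quick check, the score can be computed for the rows of the architecture ablation (Tab. 3), reading the formula as $2^{-20 \cdot \text{si-RMSE}} / (C \cdot \text{runtime})$. Expressing the runtime in seconds is an assumption; it is the unit under which $C = 0.01$ reproduces the reported scores. The function name is illustrative.

```python
def challenge_score(si_rmse, runtime_s, c=0.01):
    """Score = 2^(-20 * si-RMSE) / (C * runtime), with runtime in seconds (assumed)."""
    return 2.0 ** (-20.0 * si_rmse) / (c * runtime_s)

# (si-RMSE, runtime in ms, reported score) from the architecture ablation table.
rows = [
    (0.295, 62, 27.01),
    (0.308, 53, 26.38),
    (0.301, 56, 27.51),
    (0.301, 37, 41.64),
]
computed = [challenge_score(s, ms / 1000.0) for s, ms, _ in rows]
for (s, ms, reported), value in zip(rows, computed):
    print(s, ms, reported, round(value, 2))
```

The exponential term makes accuracy count heavily, yet the last two rows show that cutting the runtime from 56 ms to 37 ms at equal si-RMSE raises the score by roughly half.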

4.2 Implementation Details

We implement the proposed model with the monocular depth estimation toolbox [lidepthtoolbox2022], which is based on the open-source machine learning library PyTorch. The model is converted to TFLite [lite2019deploy] after training. We use the Adam optimizer with betas = (0.9, 0.999) and eps = 1e-3. A poly schedule is adopted, where the base learning rate is 4e-3 and the power is 0.9. The total number of epochs is 600 with a batch size of 32 on two RTX3090 GPUs, which takes around 4 hours to train a model. The encoder of our network is pretrained on ImageNet, and the decoder is trained from scratch.
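The poly schedule is commonly defined as the base learning rate multiplied by $(1 - t/T)^{power}$. The sketch below uses that standard form; the per-step update granularity and a final learning rate of 0 (`min_lr`) are assumptions, since the text only gives the base rate and the power.

```python
def poly_lr(step, total_steps, base_lr=4e-3, power=0.9, min_lr=0.0):
    """Polynomial decay from base_lr to min_lr over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return (base_lr - min_lr) * (1.0 - frac) ** power + min_lr

total = 600  # e.g. one update per epoch (assumed granularity)
schedule = [poly_lr(t, total) for t in (0, 150, 300, 450, 600)]
print(schedule)
```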

Figure 7: The visualization results of our proposed methods. One can observe that there is noise in ground truth labels which we highlight with a red circle.

4.3 Quantitative Results

As shown in Tab. 1, our proposed method achieves a score of 232.04 on the challenge test set and ranks second. Our solution achieves a si-RMSE of 0.311 with a runtime of 37 ms on the Raspberry Pi 4. Notably, our runtime is the lowest among all entries while the accuracy remains comparable.

4.4 Qualitative Results

We visualize the prediction results of our proposed methods as shown in Fig. 7, which demonstrates that our methods can achieve reasonable depth estimation results. However, the predicted depth maps are very rough around the edges due to the excessive down-sampling.

4.5 Inference Time

In this section, we verify that our method can achieve high-throughput monocular depth estimation on mobile devices. We convert our model to TensorFlow-Lite and test the inference time on various mobile devices, including smartphones with Kirin 980 and Snapdragon 7 Gen 1. We test the model using AI Benchmark [ignatov2018ai, ignatov2019ai]. Following the challenge requirements, the resolution of input and output images is 640 $\times$ 480. The data type is set to float (32 bit). As presented in Tab. 2, our network obtains extremely high-throughput inference. It achieves 162 FPS on smartphones with the Snapdragon 7 Gen 1 processor. Interestingly, we observe that the model is CPU-friendly, with even faster inference on the CPU than on the GPU of mobile devices.

| SoC | Device | Average/ms | STD/ms |
|---|---|---|---|
| Kirin 980 | CPU | 6.85 | 0.77 |
| Kirin 980 | GPU Delegate | 9.84 | 0.66 |
| Snapdragon 7 Gen 1 | CPU | 6.16 | 1.71 |
| Snapdragon 7 Gen 1 | GPU Delegate | 7.17 | 1.00 |

Table 2: Inference time of our network (AI Benchmark).
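The throughput figures quoted in the text follow directly from the average latencies, since frames per second is 1000 divided by the latency in milliseconds. A small sketch with the measured averages (the dictionary layout and function name are illustrative):

```python
def fps(latency_ms):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms

latencies_ms = {
    ("Kirin 980", "CPU"): 6.85,
    ("Kirin 980", "GPU Delegate"): 9.84,
    ("Snapdragon 7 Gen 1", "CPU"): 6.16,
    ("Snapdragon 7 Gen 1", "GPU Delegate"): 7.17,
    ("Raspberry Pi 4", "CPU"): 37.0,   # challenge target device
}
throughput = {key: fps(ms) for key, ms in latencies_ms.items()}
for key, value in throughput.items():
    print(key, round(value, 1))
```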

4.6 Ablation Studies

4.6.1 Effectiveness of Network Design

Encoder selection is crucial to the trade-off between fidelity and runtime. We recommend referring to [ignatov2021fast, wang2021knowledge] for more comparisons among various encoders. Following these works, we choose MobileNet-v3 as the default encoder. We then present comparisons among different decoder designs, as shown in Tab. 3. Typically, previous methods [ignatov2021fast, wang2021knowledge, lee2019bts, bhat2021adabins, li2022depthformer] utilize 3 $\times$ 3 convolutions to fuse features. While the quantitative results are good, the runtime is longer. However, when we replace all the 3 $\times$ 3 convolutions with 1 $\times$ 1 convolutions, the model performance drops drastically while the runtime gets shorter. Hence, we adopt a mixed version as presented in Sec. 3.1 and Fig. 1. We utilize a 3 $\times$ 3 convolution at the highest resolution and adopt 1 $\times$ 1 convolutions elsewhere, which gives the best trade-off between fidelity and runtime and the highest score on the benchmark. We then show the importance of merging the image normalization. It significantly reduces the runtime without any performance drop.

| Architecture | MIN | si-RMSE | Runtime/ms | Score |
|---|---|---|---|---|
| Full 3 $\times$ 3 @ Dec. | | 0.295 | 62 | 27.01 |
| Full 1 $\times$ 1 @ Dec. | | 0.308 | 53 | 26.38 |
| Mix Convs @ Dec. | | 0.301 | 56 | 27.51 |
| Mix Convs @ Dec. | ✓ | 0.301 | 37 | 41.64 |

Table 3: Ablation study of the network architecture design. Dec. and MIN are short for decoder and merged image normalization, respectively.

4.6.2 Effectiveness of R $^{2}$ Crop

We present the ablation study of various crop strategies. In these experiments, we only adopt the single silog loss (Eq. 7) for simplicity. As shown in Tab. 4, our proposed R $^{2}$ crop yields a clear improvement in performance compared with the baseline methods. When we adopt the vanilla random crop, the model never sees full-area images during training. However, the model infers on full-area images during the validation stage. This discrepancy leads to significant performance degradation. If we do not apply any crop strategy, the diversity of training samples is limited, which also limits performance. When we adopt our proposed R $^{2}$ crop, the model sees full-area images during training while the diversity of training samples is also ensured. Increasing the variety of crop sizes further improves the model performance. However, patches that are too small bring no performance gain and instead lead to a slight degradation (e.g., the (144, 256) patches in our ablation study). We infer that small patches do not contain sufficient structural information to facilitate model training. As a result, we adopt patches with sizes of [(240, 384), (384, 512), (480, 640)] in our solution.

| Method | si-RMSE | RMSE |
|---|---|---|
| w/o crop | 0.335 | 4.25 |
| random crop with (384, 512) | 0.377 | 4.62 |
| R $^{2}$ crop with [(384, 512), (480, 640)] | 0.327 | 4.15 |
| R $^{2}$ crop with [(240, 384), (384, 512), (480, 640)] | 0.323 | 4.11 |
| R $^{2}$ crop with [(144, 256), (240, 384), (384, 512), (480, 640)] | 0.325 | 4.13 |

Table 4: Ablation study of crop strategies. (h, w) represents the size of crop patches.

4.6.3 Effectiveness of Multiple-Loss Training

This section evaluates the effectiveness of each loss term used in our solution. The results are presented in Tab. 5. Each loss term can bring performance gains for the model. We also highlight that if we do not apply the invalid mask in gradience loss, the model convergence will be hurt as described in Sec. 3.3. Moreover, our dynamic weight strategy can also achieve satisfactory results without fine-tuning loss weights by hand. We utilize the handcrafted weights as a default setting to achieve a better score in the challenge.

| Sig Loss (Eq. 7) | Grad Loss (Eq. 8) | VNL Loss (Eq. 9) | Robust Loss (Eq. 10) | Dynamic Weight | si-RMSE |
|---|---|---|---|---|---|
| ✓ | | | | | 0.323 |
| ✓ | ✓ | | | | 0.316 |
| ✓ | ✓ | ✓ | | | 0.309 |
| ✓ | ✓ | ✓ | ✓ | | 0.303 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 0.306 |

Table 5: Ablation study of the multiple loss strategy.

4.6.4 Effectiveness of Distillation

We first present the results of the teacher model. As shown in Tab. 6, the teacher model achieves much better fidelity than the student model. This indicates that there is room for the student model to improve by learning from the teacher model via distillation. We also present qualitative results in Fig. 7 for intuitive comparisons. As can be observed from the predicted depth maps, the teacher model provides more reasonable and sharper depth estimation results.

We then evaluate different distillation strategies in this section. Motivated by previous work, we try L2 distillation [ignatov2021fast], structure-aware distillation [liu2020structured, wang2021knowledge], and channel-wise distillation [shu2021channel]. Interestingly, none of these strategies works well out of the box for our lightweight student model, as presented in Tab. 6. One possible reason is that we adopt multiple loss terms, which makes it difficult to balance the loss weights. However, we also conduct experiments in which we adopt only the single silog loss together with the distillation strategies. The results are similar, with no improvement in model performance. Moreover, some distillation strategies conflict with the two-stage fine-tuning, leading to convergence issues. These experimental results indicate that more effective distillation strategies should be designed for monocular depth estimation. In this solution, we adopt structure-aware distillation. It brings a slight improvement in the si-RMSE of our lightweight student model but a degradation in RMSE, indicating that there is still considerable room to improve the distillation strategy.

| Method | Two-Stage | si-RMSE | RMSE |
|---|---|---|---|
| Teacher Model | | 0.228 | 3.025 |
| Baseline Student Model | | 0.303 | 3.785 |
| L2 Distillation | | 0.307 | 3.978 |
| L2 Distillation | ✓ | $\emptyset$ | $\emptyset$ |
| Channel-Wise Distillation | | 0.311 | 4.045 |
| Channel-Wise Distillation | ✓ | $\emptyset$ | $\emptyset$ |
| Structure-Aware Distillation | | 0.306 | 3.994 |
| Structure-Aware Distillation | ✓ | 0.301 | 3.839 |

Table 6: Ablation study of distillation strategies. Two-Stage indicates applying the distillation in a fine-tuning manner. $\emptyset$ denotes that the fine-tuning process does not converge.

5 Conclusion

We have introduced our solution for fast and accurate depth estimation on mobile devices. Specifically, we design an extremely lightweight model for depth estimation. Then, we propose R $^{2}$ crop to enrich the diversity of training samples. To facilitate the model training, we design a gradience loss and adopt multiple-loss items. We also investigate various distillation strategies. Extensive experiments indicate the effectiveness of our proposed solution.

6 Acknowledgments

The research was supported by the National Natural Science Foundation of China (61971165, 61922027) and by the Fundamental Research Funds for the Central Universities.
