Rethinking Skip Connections in Encoder-decoder Networks for Monocular Depth Estimation

Zhitong Lai Haichao Sun ciomp˙shc@163.com Rui Tian Nannan Ding Zhiguo Wu Yanjie Wang Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China University of the Chinese Academy of Sciences, Beijing 100049, China
Abstract

Skip connections are fundamental units in encoder-decoder networks that improve feature propagation through the network. However, most methods with skip connections only connect features of the same resolution in the encoder and the decoder, which ignores the information loss that accumulates in the encoder as the layers go deeper. To compensate for this loss by exploiting the features in the shallower layers of the encoder, we propose a full skip connection network (FSCN) for the monocular depth estimation task. In addition, to fuse features within skip connections more closely, we present an adaptive concatenation module (ACM). Furthermore, we conduct extensive experiments on outdoor and indoor datasets (i.e., the KITTI dataset and the NYU Depth V2 dataset), on which FSCN achieves state-of-the-art results.

keywords:
monocular depth estimation, encoder-decoder network, skip connections
journal: NeuroComputing

1 Introduction

In convolutional neural networks (CNNs), the convolution operation is lossy. As the network goes deeper, more information from the input is lost, which degrades performance. Skip connections, besides mitigating exploding and vanishing gradients, are able to alleviate this degradation.

Skip connections were first adopted in ResNet [ResNet], where they are also called residual connections. Since then, skip connections have become a popular component in many neural networks, such as WideResNet [WideResNet], DenseNet [DenseNet] and ResNeXt [ResNeXt]. U-Net [U-Net] is another type of architecture with skip connections. Different from ResNet, the skip connections in U-Net are made between the encoder and the decoder of an encoder-decoder architecture. Nevertheless, in this paper, we refer to both as skip connections. For encoder-decoder networks, skip connections between the encoder and the decoder are almost indispensable due to the long distance between the input and the output. However, in most encoder-decoder networks, skip connections between the encoder and the decoder only link features with the same spatial resolution, which ignores the information loss of features in the encoder. As the layers go deeper in the encoder, the information loss of the features to be connected with the decoder continues to increase. On the other hand, features at different levels of the encoder contain different information, and fusing multiple types of information can also be helpful for the final prediction. We therefore connect features of all spatial levels in the encoder with features of every spatial level in the decoder to preserve more feature information from the encoder.

To avoid a large loss of feature information over a CNN, many networks have utilized a dense connection mechanism. However, most of them apply dense connections only within the encoder [Chen2018; Chen2019; Xu2018] or the decoder [Bilinski2018; DCPNet; Wu2021], instead of between the encoder and the decoder. Although encoder-decoder networks contain a feature compression procedure and a feature restoration procedure, the whole encoder-decoder procedure can be regarded as a continuous loss of feature information. In that case, dense skip connections between the encoder and the decoder may make a difference.

In this paper, we call dense skip connections between the encoder and the decoder full skip connections. We propose a full skip connection network (FSCN) for monocular depth estimation, where the features in the encoder are connected with the features in the decoder in a dense fashion. Monocular depth estimation based on deep neural networks (DNNs) is a pixel-wise task where the input is a single color image and the output is a depth image at the same resolution as the input. The dense connection mechanism is a popular way to alleviate information degradation and improve the feature representation ability of a CNN. For pixel-wise tasks like monocular depth estimation, propagating as much feature information as possible is vitally important for more accurate predictions.

In FSCN, an adaptive concatenation module (ACM) is also presented, which performs better than normal concatenation in fusing features from the encoder. We conducted various experiments on the monocular depth estimation task on the KITTI dataset [Geiger2013] and the NYU Depth V2 dataset [Silberman2012], which yielded state-of-the-art results. The main contributions of our work are summarized as follows:

  • We propose a novel encoder-decoder network, FSCN, for monocular depth estimation, which employs a dense connection mechanism between the encoder and the decoder.

  • To fuse features from the encoder with features in the decoder effectively, we present an adaptive concatenation module (ACM), which is more effective than normal concatenation.

  • We conduct extensive experiments on the KITTI dataset and the NYU Depth V2 dataset, which are an outdoor dataset and an indoor dataset, respectively. The results are state-of-the-art on both datasets.

The rest of this paper is organized as follows. In Section 2, we present a brief review of related work. In Section 3, we explain our proposed method. Extensive experiments and an ablation study are shown in Section 4. In Section 5, we draw conclusions from our work.

2 Related Work

2.1 Monocular Depth Estimation

Early works on monocular depth estimation mainly focused on exploiting hand-crafted features to learn geometric or optical priors from a color image [Karsch2014; Liu2010; Saxena2008]. With the development of deep learning, various DNN-based models have been proposed. The methods for monocular depth estimation can be sorted into supervised [Eigen2014; DORN; DenseDepth; BTS], unsupervised [Godard2017; Yang2020; Johnston2020; Wong2019] and semi-supervised [Qi2018; Yue2020; Kuznietsov2017; Ji2019] approaches. At this stage, the prediction accuracy of unsupervised and semi-supervised approaches still lags behind that of supervised approaches. Supervised approaches require RGB-D datasets to learn a mapping function that generates the depth map corresponding to a single color image. Eigen et al. [Eigen2014] proposed the first DNN model for monocular depth estimation, which contains a coarse prediction stage and a refined prediction stage. Fu et al. [DORN] treated monocular depth estimation as a deep ordinal regression problem and introduced a discretization strategy. Alhashim et al. [DenseDepth] introduced a model with transfer learning. Lee et al. [BTS] proposed a local planar guidance module to link internal features with the final output effectively, which brought a great improvement. Attracted by the great success of the attention mechanism in capturing long-range context information, some attention-based models have been presented recently. For instance, Huynh et al. [Huynh2020] proposed a depth-attention volume (DAV) to capture more context information during feature propagation to benefit monocular depth estimation. Yang et al. [TransDepth] adopted the Transformer [Vaswani2017] and presented an attention gate module for monocular depth estimation. In our work, we also adopted an attention module, SENet [CBAM], which helped our method achieve better performance.

2.2 Skip Connections

Skip connections were first proposed in ResNet [ResNet] to solve the problem of vanishing/exploding gradients, as well as to enhance gradient propagation in deep networks, and have become one of the most fundamental elements of deep architectures. Inspired by ResNet, DenseNet [DenseNet] and ResNeXt [ResNeXt] were then proposed and improved parameter efficiency and feature propagation. The three architectures are usually used as backbone networks in encoder-decoder architectures. Nevertheless, as the success of U-Net [U-Net] shows, skip connections between the encoder and the decoder can also be helpful for parameter efficiency and feature propagation. For instance, Collin et al. [Collin2020] proposed an autoencoder network with skip connections to benefit anomaly detection. Bulat et al. [Bulat2020] proposed a hybrid network combining the HourGlass and U-Net architectures, in which soft-gated skip connections were presented and made a great difference for human pose estimation. Wang et al. [Wang2019] proposed a fully convolutional neural network with long and short skip connections for monocular depth estimation, which performed well.

To further utilize the advantages of skip connections, besides some backbones like DenseNet, many works adopted a dense fashion for skip connections. For example, Shang et al. [Shang2020] proposed a novel CNN for SAR image classification, in which dense connections were used to reuse feature maps and strengthen information transmission. Dai et al. [Dai2021] proposed a dense scale network for crowd counting, which is an encoder-decoder architecture whose decoder contains a dense skip connection mechanism. Bao et al. [Bao2020] proposed a multi-scale residual dense network that employs dense connections repeatedly for image denoising.

For encoder-decoder architectures, skip connections act as a highway that passes details of earlier feature maps in the encoder to the decoder. However, most skip connections only pass feature maps in the encoder to features with the same resolution in the decoder, which ignores the details of encoder features at other resolutions. We therefore construct highways from all spatial levels of features in the encoder to every spatial level of features in the decoder, which forms a dense fashion.

3 Method

In this section, we first introduce our proposed network FSCN, then describe the details of the adaptive concatenation module (ACM). Finally, we introduce the loss function used in the training procedure.

3.1 Network Architecture

The overall architecture of our proposed network is shown in Figure 1. FSCN is an encoder-decoder architecture. Except for the densest feature extracted by the encoder, the features at all spatial levels of the encoder are preserved to be concatenated with features in the decoder. Before the concatenation, the features in the encoder are first scaled by sampling operations to the spatial resolutions of the corresponding features in the decoder. Then the scaled features are concatenated with the features in the decoder by adaptive concatenation modules (i.e., ACM in the figure), which are described in Section 3.2. After the concatenation, the features are sent to an SENet [CBAM] module and then upscaled to the next spatial level by an upscale operation with a ratio of 2. Finally, the network outputs a depth map with the same resolution as the input image.

Figure 1: The overall architecture of FSCN, in which $E_i$ and $D_j$ indicate features with different spatial resolutions in the encoder and the decoder, respectively. Note that when $i = j$, the shapes of $E_i$ and $D_j$ are the same.

The upscale module is a sequence of two elements: an upsampling operation followed by a convolution operation. The former upscales the spatial resolution of a feature by a ratio of 2, while the latter changes the number of channels to that of the next level.
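The resizing of encoder features described in this section can be made concrete with a small sketch. The snippet below is an illustration only, not the authors' implementation: the helper name `full_skip_plan` and the stride values are assumptions chosen for the example. For every decoder level, it lists the factor by which each preserved encoder feature would have to be resampled so that the resolutions match before concatenation.

```python
from fractions import Fraction


def full_skip_plan(encoder_strides, decoder_strides):
    """Map each decoder level to the resize factor needed by every encoder feature.

    A stride s means the feature map is 1/s of the input resolution.
    A factor above 1 means upsampling, below 1 means downsampling,
    and exactly 1 means the feature can be used as it is.
    """
    plan = {}
    for d in decoder_strides:
        # An encoder map of size H/e must become H/d, so the factor is e/d.
        plan[d] = [(e, Fraction(e, d)) for e in encoder_strides]
    return plan


# Hypothetical strides (assumption): the encoder keeps features at 1/2 .. 1/16
# of the input, and the decoder rebuilds the same levels from coarse to fine.
encoder_strides = [2, 4, 8, 16]
decoder_strides = [16, 8, 4, 2]

plan = full_skip_plan(encoder_strides, decoder_strides)
for d, sources in plan.items():
    desc = ", ".join(f"1/{e} x{float(f):g}" for e, f in sources)
    print(f"decoder 1/{d}: {desc}")
```

With same-resolution skip connections only the entries with factor 1 would be used; the full skip connection scheme uses every entry of the plan.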

3.2 Adaptive Concatenation Module

Figure 2: The overall architecture of adaptive concatenation module (ACM)

As shown in Figure 2, we design an adaptive concatenation module to fuse features from the encoder with features in the decoder adaptively. In Figure 2, $D_j$ denotes the features at a specific spatial level of the decoder, and $\hat{E}_1, \dots, \hat{E}_n$ are features obtained from the encoder features $E_1, \dots, E_n$ by sampling operations, so that their spatial resolution is the same as that of $D_j$. $w_1, \dots, w_n$ are learnable parameters, each initialized with a random number in the interval [0, 1), which decide the importance of each feature block from the encoder when it is concatenated with the features in the decoder. We call $w_1, \dots, w_n$ concatenation weights in this paper. In Figure 2, $\oplus$ indicates the concatenation operation, and the procedure can be presented as equation 1.

$$F_{cat} = D_j \oplus w_1 \hat{E}_1 \oplus w_2 \hat{E}_2 \oplus \cdots \oplus w_n \hat{E}_n \tag{1}$$

Although a weight is applied to each feature block before the concatenation, the weights are not channel-wise. As the number of channels of the concatenated feature increases sharply, we employ an SENet [CBAM] module on the concatenated feature block. SENet is a channel attention module that decides the weight of each channel of a feature, which can improve the representation ability of the feature.

After the SENet module, a convolution operation is used to fuse the concatenated feature, which gives the output $F_{out}$. Note that the shape of $F_{out}$ is the same as that of feature $D_j$. This procedure can be presented as equation 2.

$$F_{out} = \mathrm{Conv}\left(\mathrm{SE}\left(F_{cat}\right)\right) \tag{2}$$

Since the concatenation weights and the channel attention module SENet are used in equation 1 and equation 2, respectively, we call the procedure consisting of equation 1 and equation 2 the adaptive concatenation module. In Section 4, several experiments are conducted to evaluate the effectiveness of our proposed adaptive concatenation module.
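The three steps of the module (weighted concatenation, channel attention, fusing convolution) can be sketched in plain Python on nested lists. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the attention gate follows the usual squeeze-and-excitation recipe without biases, and the fusing convolution is assumed to be 1x1 because the text does not give its kernel size. A feature is a list of channels, each channel a list of rows.

```python
import math
import random


def weighted_concat(decoder_feat, encoder_feats, weights):
    """Concatenate along channels; each encoder block is scaled by its own weight."""
    fused = [[row[:] for row in ch] for ch in decoder_feat]
    for feat, w in zip(encoder_feats, weights):
        fused.extend([[w * v for v in row] for row in ch] for ch in feat)
    return fused


def channel_gate(feature, w1, w2):
    """SE-style gate: global average pool -> FC + ReLU -> FC + sigmoid -> rescale."""
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    hidden = [max(0.0, sum(a * p for a, p in zip(row, pooled))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(a * h for a, h in zip(row, hidden))))
             for row in w2]
    return [[[g * v for v in row] for row in ch] for ch, g in zip(feature, gates)]


def pointwise_conv(feature, kernel):
    """1x1 convolution (assumption): each output channel mixes the input channels."""
    h, w = len(feature[0]), len(feature[0][0])
    return [[[sum(k * ch[y][x] for k, ch in zip(krow, feature)) for x in range(w)]
             for y in range(h)] for krow in kernel]


def acm(decoder_feat, encoder_feats, weights, w1, w2, kernel):
    """Weighted concatenation, then channel attention, then the fusing convolution."""
    fused = weighted_concat(decoder_feat, encoder_feats, weights)
    return pointwise_conv(channel_gate(fused, w1, w2), kernel)


rng = random.Random(0)
decoder_feat = [[[1.0, 2.0], [3.0, 4.0]]]            # 1 channel, 2 x 2
encoder_feats = [[[[0.5, 0.5], [0.5, 0.5]]],         # already resized to 2 x 2
                 [[[2.0, 0.0], [0.0, 2.0]]]]
weights = [rng.random() for _ in encoder_feats]      # initial values in [0, 1)
w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 3 -> 2 units
w2 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]  # 2 -> 3 gates
kernel = [[rng.uniform(-1, 1) for _ in range(3)]]    # 3 -> 1 output channel
out = acm(decoder_feat, encoder_feats, weights, w1, w2, kernel)
print(len(out), len(out[0]), len(out[0][0]))
```

In a real network the concatenation weights, the gate parameters and the kernel would all be trained jointly; here they are random numbers that only demonstrate the data flow and the fact that the output has the same shape as the decoder feature.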

3.3 Loss Function

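We adopt the improved scale-invariant loss function introduced in BTS [BTS] for the training phase. The scale-invariant loss was proposed by Eigen et al. [Eigen2014] and is defined as:

$$D(g) = \frac{1}{T}\sum_{i} g_i^2 - \frac{\lambda}{T^2}\left(\sum_{i} g_i\right)^2 \tag{3}$$

in which $g_i = \log \tilde{d}_i - \log d_i$, and $d_i$ and $\tilde{d}_i$ denote the ground truth and the predicted depth at pixel $i$, respectively. $T$ denotes the total number of pixels of a depth map. The improved scale-invariant loss is:

$$L = \alpha \sqrt{\frac{1}{T}\sum_{i} g_i^2 - \frac{\lambda}{T^2}\left(\sum_{i} g_i\right)^2} \tag{4}$$

The hyper-parameter $\alpha$ is set to 10 and $\lambda$ is set to 0.85, which are the same as the ones in [21].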
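As a worked illustration of this loss, the sketch below evaluates it on flat lists of depth values. It assumes the form used in BTS, namely 10 times the square root of the mean squared log difference minus 0.85 times the squared mean log difference; the function name `silog_loss`, the parameter names `alpha` and `lam`, and the rule of skipping non-positive ground truth values are assumptions made for the example.

```python
import math


def silog_loss(pred, target, alpha=10.0, lam=0.85):
    """alpha * sqrt(mean(g^2) - lam * mean(g)^2) with g = log(pred) - log(target).

    Pixels whose ground truth is not positive are treated as invalid and skipped.
    """
    g = [math.log(p) - math.log(t) for p, t in zip(pred, target) if t > 0]
    n = len(g)
    mean_sq = sum(v * v for v in g) / n
    mean = sum(g) / n
    # Clamp at zero so rounding error cannot produce a negative radicand.
    return alpha * math.sqrt(max(mean_sq - lam * mean * mean, 0.0))


target = [1.0, 2.0, 4.0]
doubled = [2.0 * t for t in target]        # prediction off by a global scale

print(silog_loss(target, target))          # perfect prediction
print(silog_loss(doubled, target))         # a global scale error is still penalised
print(silog_loss(doubled, target, lam=1.0))  # fully scale-invariant variant
```

Setting the balance factor to 1 makes the loss ignore a global scale error entirely, while a value of 0.85 keeps a reduced penalty on it, so the loss interpolates between a plain squared log error and a fully scale-invariant one.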

4 Experiments

Multiple evaluation experiments are conducted on two benchmark datasets, i.e., the KITTI dataset and the NYU Depth V2 dataset. Both quantitative and qualitative results are provided. Moreover, we compare our method with other representative monocular depth estimation methods, i.e., [Eigen2014; Liu2015; DenseDepth; DORN; Yin2019; BTS; Godard2019; TransDepth; Liu2021; Ye2021; Hu2019; Chen_arx2019; Xu2021].

4.1 Datasets

The KITTI dataset [Geiger2013] is a large-scale outdoor dataset captured by multiple sensors mounted on a driving car, which was created for autonomous driving research. The dataset contains a number of color images and corresponding depth maps with a resolution of 375 × 1242 pixels. For the experiments, we adopted the data split strategy proposed by Eigen et al. [Eigen2014], in which the training set contains 23,488 images from 32 scenes and the test set contains 697 images from the remaining 29 scenes. The images are randomly cropped to 352 × 704 pixels in the experiments.

The NYU Depth V2 dataset [Silberman2012] is an indoor dataset containing 120K RGB images and paired depth maps from 464 indoor scenes. The resolution of the color images and depth maps is 480 × 640 pixels. For the experiments, we randomly crop the images to 416 × 544 pixels. The training set and test set are also split with the strategy proposed by Eigen et al. [Eigen2014], in which the training set contains 36,253 pairs from 249 scenes and the test set contains 654 pairs from 251 scenes.

4.2 Evaluation Metrics

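To evaluate the performance of our method, we adopted the standard evaluation metrics used in previous works [Eigen2014; BTS; TransDepth]:

  • Mean relative error (Abs Rel): $\frac{1}{N}\sum_{i}\frac{|d_i - \tilde{d}_i|}{d_i}$;

  • Squared relative error (Sq Rel): $\frac{1}{N}\sum_{i}\frac{(d_i - \tilde{d}_i)^2}{d_i}$;

  • Root mean squared error (RMSE): $\sqrt{\frac{1}{N}\sum_{i}(d_i - \tilde{d}_i)^2}$;

  • Mean log10 error (log10): $\frac{1}{N}\sum_{i}|\log_{10} d_i - \log_{10}\tilde{d}_i|$;

  • Root mean squared log10 error (RMSE log): $\sqrt{\frac{1}{N}\sum_{i}(\log_{10} d_i - \log_{10}\tilde{d}_i)^2}$;

  • Accuracy with threshold $thr$, i.e., the percentage (%) of $\tilde{d}_i$ subject to $\max\left(\frac{d_i}{\tilde{d}_i}, \frac{\tilde{d}_i}{d_i}\right) = \delta < thr$, where $thr \in \{1.25, 1.25^2, 1.25^3\}$.

Here $N$ indicates the total number of valid pixels in the ground truth, and $d_i$ and $\tilde{d}_i$ denote the ground truth and the predicted depth value at pixel $i$, respectively.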
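The metrics listed above can be computed with a few lines of Python. The sketch below is an illustration with assumed names (`depth_metrics` and its dictionary keys are not taken from the paper) and works on flat lists of depth values, treating pixels with non-positive ground truth as invalid. The logarithmic RMSE follows the log10 naming used in the list; some public evaluation scripts use the natural logarithm for this metric instead.

```python
import math


def depth_metrics(gt, pred):
    """Standard depth-estimation error and accuracy metrics over valid pixels."""
    pairs = [(g, p) for g, p in zip(gt, pred) if g > 0]  # keep valid pixels only
    n = len(pairs)
    log_diffs = [math.log10(g) - math.log10(p) for g, p in pairs]
    ratios = [max(g / p, p / g) for g, p in pairs]
    metrics = {
        "abs_rel": sum(abs(g - p) / g for g, p in pairs) / n,
        "sq_rel": sum((g - p) ** 2 / g for g, p in pairs) / n,
        "rmse": math.sqrt(sum((g - p) ** 2 for g, p in pairs) / n),
        "log10": sum(abs(v) for v in log_diffs) / n,
        "rmse_log": math.sqrt(sum(v * v for v in log_diffs) / n),
    }
    # Threshold accuracy: share of pixels whose ratio is below 1.25, 1.25^2, 1.25^3.
    for k in (1, 2, 3):
        metrics[f"delta{k}"] = sum(r < 1.25 ** k for r in ratios) / n
    return metrics


example = depth_metrics(gt=[2.0, 4.0, 0.0], pred=[2.0, 2.0, 1.0])
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

The third pixel in the example has no ground truth and is ignored, which mirrors how sparse ground truth maps are usually handled during evaluation.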

4.3 Implementation Details

We implemented all the experiments with the open source deep learning framework PyTorch. Two NVIDIA 3090 GPUs were used for all training runs. For training, we employed the AdamW optimizer [Glorot2010] with , and . The number of epochs was set to 50 and the batch size was set to 8. We initialized the weights with Xavier initialization [Deng2009]. The initial learning rate was set to and decayed with the strategy proposed in BTS [BTS].

We chose the DenseNet161 [DenseNet] backbone pretrained on ImageNet [Simonyan2014] for the encoder part of our network. Moreover, we conducted experiments for an ablation study of three concatenation methods. We also explored the influence of discarding specific skip connections from specific features in the encoder.

We employed data augmentations to improve training performance and avoid overfitting. The augmentations include random horizontal flipping, random contrast and random color adjustment, each with a chance of 50%. Random rotation was also used, with angles in the range of [-1, 1] for the KITTI dataset and [-2.5, 2.5] for the NYU Depth V2 dataset.
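For geometric augmentations such as flipping and cropping, the color image and its depth map have to be transformed identically, otherwise the pixel-wise supervision no longer lines up. The sketch below illustrates this for random horizontal flipping and random cropping on 2-D maps stored as lists of rows. It is not the authors' pipeline: the function names are hypothetical, and the photometric augmentations (contrast and color) and the rotation are left out.

```python
import random


def hflip(image):
    """Mirror a 2-D map (list of rows) left to right."""
    return [row[::-1] for row in image]


def random_paired_flip(rgb, depth, rng, p=0.5):
    """Flip the image and its depth map together with probability p."""
    if rng.random() < p:
        return hflip(rgb), hflip(depth)
    return rgb, depth


def random_paired_crop(rgb, depth, size, rng):
    """Cut the same randomly placed window out of the image and the depth map."""
    crop_h, crop_w = size
    height, width = len(rgb), len(rgb[0])
    top = rng.randint(0, height - crop_h)    # randint includes both end points
    left = rng.randint(0, width - crop_w)

    def crop(m):
        return [row[left:left + crop_w] for row in m[top:top + crop_h]]

    return crop(rgb), crop(depth)


rng = random.Random(0)
# Toy 4 x 6 "image" whose values encode the pixel position; the depth map reuses
# the same values so that any misalignment would be visible immediately.
rgb = [[10 * y + x for x in range(6)] for y in range(4)]
depth = [row[:] for row in rgb]

rgb_aug, depth_aug = random_paired_flip(rgb, depth, rng)
rgb_aug, depth_aug = random_paired_crop(rgb_aug, depth_aug, (2, 3), rng)
print(rgb_aug == depth_aug)
```

Drawing the random decision once and applying it to both inputs is the design choice that matters here; sampling separately for the image and the depth map would break their correspondence.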

4.4 Results on the KITTI Dataset

Table 1 shows the quantitative results of our proposed method on the KITTI Eigen split. Note that [Eigen2014; BTS; Ye2021; Godard2019; TransDepth] employed the same split strategy as our method.

| Method | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|
| Eigen et al. [Eigen2014] | 0.190 | 1.515 | 7.156 | 0.270 | 0.692 | 0.899 | 0.967 |
| Liu et al. [Liu2015] | 0.217 | - | 7.046 | - | 0.656 | 0.881 | 0.958 |
| DenseDepth [DenseDepth] | 0.093 | 0.589 | 4.170 | - | 0.886 | 0.965 | 0.986 |
| DORN [DORN] | 0.072 | 0.307 | 2.727 | 0.120 | 0.932 | 0.984 | 0.994 |
| Yin et al. [Yin2019] | 0.072 | - | 3.258 | 0.117 | 0.938 | 0.990 | 0.998 |
| BTS [BTS] | 0.060 | 0.249 | 2.798 | 0.096 | 0.955 | 0.993 | 0.998 |
| Godard et al. [Godard2019] | 0.106 | 0.806 | 4.530 | 0.193 | 0.876 | 0.958 | 0.980 |
| TransDepth [TransDepth] | 0.064 | 0.252 | 2.755 | 0.098 | 0.956 | 0.994 | 0.999 |
| Liu et al. [Liu2021] | 0.111 | - | 3.514 | - | 0.878 | 0.977 | 0.994 |
| DPNet [Ye2021] | 0.112 | - | 4.978 | 0.210 | 0.842 | 0.947 | 0.973 |
| WaveletMonodepth [ramamonjisoa2021single] | 0.097 | 0.718 | 4.387 | 0.184 | 0.891 | 0.962 | 0.982 |
| FSCN | 0.062 | 0.248 | 2.739 | 0.097 | 0.955 | 0.993 | 0.999 |

Table 1: Experimental results on the KITTI Eigen split. The values in bold type are the best results of every metric among these works, while the values underlined are the second best results. We set the depth range to 0-80m. Metrics marked by ↓: lower is better; metrics marked by ↑: higher is better.

From Table 1 we can see that our method achieves results competitive with the current leading algorithms. On the RMSE metric, our method is slightly worse than DORN (2.739 versus 2.727) but better than all the other algorithms. On every other metric, our method outperforms DORN by a clear margin.

Figure 3: Qualitative examples on the KITTI Eigen test split. (a) RGB image; (b) ground truth; (c) BTS [BTS]; (d) DenseDepth [DenseDepth]; (e) our proposed FSCN. The ground truth depth maps are filled based on sparse point clouds utilizing tools provided by the NYU Depth V2 dataset. For better visualization, the values of all the depth maps are logarithmic. Note that the encoders of BTS [BTS] and FSCN are both DenseNet161.

Figure 3 shows the qualitative results of FSCN on the KITTI Eigen validation set, compared with two leading algorithms, BTS [BTS] and DenseDepth [DenseDepth]. From this figure, we can observe that FSCN shows more details than the other two methods in content such as car structures, traffic signs and human outlines. This suggests that our method, with its full skip connection mechanism, is able to preserve and propagate more feature information along the deep network.

4.5 Results on the NYU Depth V2 Dataset

| Method | Abs Rel ↓ | log10 ↓ | RMSE ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|
| Eigen et al. [Eigen2014] | 0.215 | - | 0.907 | 0.611 | 0.887 | 0.971 |
| Liu et al. [Liu2015] | 0.213 | 0.087 | 0.759 | 0.650 | 0.906 | 0.976 |
| Fu et al. [DORN] | 0.115 | 0.051 | 0.509 | 0.828 | 0.965 | 0.992 |
| Hu et al. [Hu2019] | 0.123 | 0.053 | 0.544 | 0.855 | 0.972 | 0.993 |
| Yin et al. [Yin2019] | 0.108 | 0.048 | 0.416 | 0.875 | 0.976 | 0.994 |
| Chen et al. [Chen_arx2019] | 0.111 | 0.048 | 0.514 | 0.878 | 0.977 | 0.994 |
| Liu et al. [Liu2021] | 0.113 | 0.049 | 0.525 | 0.872 | 0.974 | 0.993 |
| Ye et al. [Ye2021] | - | 0.063 | 0.474 | 0.784 | 0.948 | 0.986 |
| Xu et al. [Xu2021] | 0.101 | 0.054 | 0.456 | 0.823 | 0.962 | 0.994 |
| WaveletMonodepth [ramamonjisoa2021single] | 0.126 | 0.054 | 0.552 | 0.845 | 0.968 | 0.992 |
| FSCN | 0.111 | 0.047 | 0.395 | 0.884 | 0.981 | 0.995 |

Table 2: Experimental results on the NYU Depth V2 Eigen split. The values in bold type are the best results of every metric among these works, while the values underlined are the second best results. We set the depth range to 0-10m. Metrics marked by ↓: lower is better; metrics marked by ↑: higher is better.

Table 2 shows the quantitative results of FSCN on the NYU Depth V2 dataset. Compared with the other methods in this table, FSCN performs best on every metric except Abs Rel. In particular, it reaches the lowest RMSE (0.395 versus 0.416 for the next-best method) and the highest accuracy under the strictest threshold δ < 1.25 (0.884 versus 0.878).

Figure 4: Qualitative examples on the NYU Depth V2 Eigen test split. (a) RGB image; (b) ground truth; (c) Hu et al. [Hu2019]; (d) Chen et al. [Chen_arx2019]; (e) Ours. From top to bottom, we select five RGB images from five scenes, i.e., bedroom, bookstore, dining room, home office and kitchen, respectively. Note that the encoder of Ours is DenseNet161.

Figure 4 shows the qualitative results of FSCN on the NYU Depth V2 dataset, from which we can observe that FSCN performs very well in predicting details like shelves and chair legs. The comparisons in Figure 4 support the effectiveness of our proposed FSCN.

4.6 Ablation Study

4.6.1 Effect of Full Skip Connections

We conducted an ablation study to explore the effect of the full skip connections used in our method. We compare three setups: no skip connections (indicated as "no-skip"), skip connections within the same spatial level between the encoder and the decoder (indicated as "same-skip"), and the full skip connections introduced in this paper (indicated as "full-skip"). We run the experiments on both the KITTI dataset and the NYU Depth V2 dataset. For a fair comparison, we keep the adaptive concatenation module in the "same-skip" counterpart.

| Method | #params | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|
| no-skip | 35.04M | 0.063 | 0.252 | 2.793 | 0.099 | 0.953 | 0.993 | 0.998 |
| same-skip | 38.77M | 0.062 | 0.246 | 2.787 | 0.098 | 0.954 | 0.993 | 0.999 |
| FSCN (full-skip) | 42.62M | 0.062 | 0.248 | 2.739 | 0.097 | 0.955 | 0.993 | 0.999 |

Table 3: Experimental results on the KITTI Eigen split for different skip-connection mechanisms. We set the depth range to 0-80m. Metric #params means the total number of parameters of specific experimental setups. Metrics marked by ↓: lower is better; metrics marked by ↑: higher is better.

| Method | #params | Abs Rel ↓ | log10 ↓ | RMSE ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|
| no-skip | 35.04M | 0.113 | 0.049 | 0.404 | 0.876 | 0.981 | 0.996 |
| same-skip | 38.77M | 0.112 | 0.048 | 0.397 | 0.878 | 0.980 | 0.996 |
| Ours (full-skip) | 42.62M | 0.111 | 0.047 | 0.395 | 0.884 | 0.981 | 0.995 |

Table 4: Experimental results on the NYU Depth V2 Eigen split for different skip-connection mechanisms. The values in bold type are the best results of every metric among these works, while the values underlined are the second best results. We set the depth range to 0-10m. Metric #params means the total number of parameters of specific experimental setups. Metrics marked by ↓: lower is better; metrics marked by ↑: higher is better.

The experimental results on the KITTI dataset and the NYU Depth V2 dataset are shown in Table 3 and Table 4, respectively. From the two tables we can observe that our method with full skip connections performs better than the "same-skip" and "no-skip" counterparts on most metrics, which supports the effectiveness of the full skip connection mechanism used in FSCN. Moreover, the "same-skip" setup works better than the "no-skip" setup on most metrics, which shows the advantage of the skip connection mechanism in CNNs. Interestingly, the "no-skip" setup performs better than many methods listed in Table 1 and Table 2, which offers us a direction for future research.

4.6.2 Effect of Adaptive Concatenation Module

The adaptive concatenation module is an important part of FSCN. It is called "adaptive" because of two items within it, i.e., the concatenation weights (CW) and the channel attention module SENet (SE). In this section, we set up several experiments to evaluate the effect of ACM: counterparts discarding the concatenation weights (CW), counterparts discarding SENet (SE), and counterparts discarding both items. All the experiments are run on both the KITTI dataset and the NYU Depth V2 dataset.

| Method | #params | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|
| w/o CW | 42.62M | 0.062 | 0.249 | 2.749 | 0.097 | 0.954 | 0.993 | 0.999 |
| w/o SE | 42.15M | 0.062 | 0.251 | 2.818 | 0.099 | 0.953 | 0.993 | 0.998 |
| w/o CW&SE | 42.15M | 0.064 | 0.257 | 2.837 | 0.100 | 0.952 | 0.993 | 0.998 |
| FSCN | 42.62M | 0.062 | 0.248 | 2.739 | 0.097 | 0.955 | 0.993 | 0.999 |

Table 5: Experimental results on the KITTI Eigen split for different setups of ACM. We set the depth range to 0-80m. Metric #params means the total number of parameters of specific experimental setups. Metrics marked by ↓: lower is better; metrics marked by ↑: higher is better.

| Method | #params | Abs Rel ↓ | log10 ↓ | RMSE ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|
| w/o CW | 42.62M | 0.112 | 0.048 | 0.397 | 0.881 | 0.981 | 0.995 |
| w/o SE | 42.15M | 0.114 | 0.048 | 0.403 | 0.874 | 0.977 | 0.994 |
| w/o CW&SE | 42.15M | 0.114 | 0.049 | 0.408 | 0.875 | 0.978 | 0.995 |
| FSCN | 42.62M | 0.111 | 0.047 | 0.395 | 0.884 | 0.981 | 0.995 |

Table 6: Experimental results on the NYU Depth V2 Eigen split for different setups of ACM. The values in bold type are the best results of every metric among these works, while the values underlined are the second best results. We set the depth range to 0-10m. Metric #params means the total number of parameters of specific experimental setups. Metrics marked by ↓: lower is better; metrics marked by ↑: higher is better.

The experimental results in Table 5 and Table 6 show that the performance of the network drops on most metrics when the concatenation weights or SENet are discarded, which supports the effectiveness of our proposed adaptive concatenation module. The comparison between "w/o CW" and "w/o SE" shows that SENet plays a more important role in FSCN than the concatenation weights. Moreover, the idea of ACM can be transferred to other networks.

5 Conclusions

In this work, we proposed a full skip connection based encoder-decoder network for monocular depth estimation. Compared with the traditional skip connections in normal encoder-decoder networks, the full skip connections presented in our work mitigate the information loss of feature propagation in deep networks. Moreover, we presented an adaptive concatenation module to fuse the features to be connected. Our proposed method achieved state-of-the-art results on the KITTI dataset and the NYU Depth V2 dataset, which demonstrates its effectiveness for the monocular depth estimation task. The ablation study supported the effectiveness of the presented full skip connection mechanism and adaptive concatenation module, which offers a new way to utilize skip connections within CNNs.

References

  • (1) K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, IEEE Conference on Computer Vision and Pattern Recognition (2016) 770-778.
  • (2) S. Zagoruyko, N. Komodakis, Wide residual networks, 2016. arXiv preprint arXiv:1605.07146.
  • (3) G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely Connected Convolutional Networks, IEEE Conference on Computer Vision and Pattern Recognition (2017) 4700-4708.
  • (4) S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated Residual Transformations for Deep Neural Networks, IEEE Conference on Computer Vision and Pattern Recognition (2017) 1492-1500.
  • (5) O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, International Conference on Medical Image Computing and Computer-assisted Intervention (2015) 234-241.
  • (6) S. Chen, M. Tang, J. Kan, Encoder–decoder with densely convolutional networks for monocular depth estimation, J Opt Soc Am A. 36(10)(2019), 1709-1718.
  • (7) C. Chen, F. Qi, Single Image Super-Resolution Using Deep CNN with Dense Skip Connections and Inception-ResNet, IEEE International Conference on Information Technology in Medicine and Education (2018) 999-1003.
  • (8) J. Xu, Y. Chae, B. Stenger, A. Datta, Dense bynet: Residual dense network for image super resolution, IEEE International Conference on Image Processing (2018) 71-75.
  • (9) P. Bilinski, V. Prisacariu, Dense Decoder Shortcut Connections for Single-Pass Semantic Segmentation, IEEE Conference on Computer Vision and Pattern Recognition (2018) 6596-6605.
  • (10) Z. Lai, R. Tian, Z. Wu, N. Ding, L. Sun, Y. Wang, DCPNet: A Densely Connected Pyramid Network for Monocular Depth Estimation, Sensors. 21(20)(2021), 6780.
  • (11) Y. -H. Wu, Y. Liu, L. Zhang, W. Gao and M. -M. Cheng, Regularized densely-connected pyramid network for salient instance segmentation, IEEE Trans Image Process, 30(2021), 3897-3907.
  • (12) V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, A. Gaidon, 3d packing for self-supervised monocular depth estimation, IEEE Conference on Computer Vision and Pattern Recognition (2020) 2485-2494.
  • (13) R. Mendes, E. Ribeiro, N. Rosa, V. Grassi, On deep learning techniques to boost monocular depth estimation for autonomous navigation, Rob Auton Syst. 136(2021), 103701.
  • (14) A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: The kitti dataset, Int J Rob Res. 32(11)(2013), 1231-1237.
  • (15) N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from rgbd images, European Conference on Computer Vision (2012) 746-760.
  • (16) K. Karsch, C. Liu, S. Kang, Depth transfer: Depth extraction from video using non-parametric sampling, IEEE Trans. Pattern Anal. Mach. Intell. 36(11)(2014), 2144-2158.
  • (17) B. Liu, S. Gould, D. Koller, Single image depth estimation from predicted semantic labels, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2010) 1253-1260.
  • (18) A. Saxena, M. Sun, A. Ng, Make3d: Learning 3d scene structure from a single still image, IEEE Trans. Pattern Anal. Mach. Intell. 31(5)(2008), 824-840.
  • (19) D. Eigen, C. Puhrsch, R. Fergus, Depth Map Prediction from a Single Image using a Multi-Scale Deep Network, 2014. arXiv preprint arXiv:1406.2283.
  • (20) H. Fu, M. Gong, C. Wang, K. Batmanghelich, D. Tao, Deep ordinal regression network for monocular depth estimation, IEEE Conference on Computer Vision and Pattern Recognition (2018) 2002-2011.
  • (21) I. Alhashim, P. Wonka, High Quality Monocular Depth Estimation via Transfer Learning, 2018. arXiv preprint arXiv:1812.11941.
  • (22) J. Lee, M. Han, D. Ko, I. Suh, From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation, 2019. arXiv preprint arXiv:1907.10326.
  • (23) C. Godard, O. Aodha, G. Brostow, Unsupervised monocular depth estimation with left-right consistency, IEEE Conference on Computer Vision and Pattern Recognition(2017) 270-279.
  • (24) Yang D, Zhong X, Gu D, et al. Unsupervised framework for depth estimation and camera motion prediction from video, Neurocomputing. 385(2020), 169-185.
  • (25) A. Johnston, G. Carneiro, Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume, IEEE Conference on Computer Vision and Pattern Recognition(2020) 4756-4765.
  • (26) A. Wong, S. Soatto, Bilateral cyclic constraint and adaptive regularization for unsupervised monocular depth prediction, IEEE Conference on Computer Vision and Pattern Recognition(2019) 5644-5653.
  • (27) X. Qi, R. Liao, Z. Liu, R. Urtasun, J. Jia, Geonet: Geometric neural network for joint depth and surface normal estimation, IEEE Conference on Computer Vision and Pattern Recognition(2018) 283-291.
  • (28) M. Yue, G. Fu, M. Wu, H. Wang, Semi-Supervised Monocular Depth Estimation Based on Semantic Supervision, Int. J. Intell. Syst.. 100(2020), 455-463.
  • (29) Y. Kuznietsov, J. Stuckler, B. Leibe, Semi-supervised deep learning for monocular depth map prediction, IEEE Conference on Computer Vision and Pattern Recognition(2017) 6647-6655.
  • (30) R. Ji, K. Li, Y. Wang, X. Sun, F. Guo, X. Guo, Y. Wu, F. Huang, L. Luo, Semi-supervised adversarial monocular depth estimation, IEEE Trans. Pattern Anal. Mach. Intell. 42(10)(2019), 2410-2422.
  • (31) L. Huynh, P. Nguyen-Ha, J. Matas, E. Rahtu, J. Heikkilä, Guiding monocular depth estimation using depth-attention volume, European Conference on Computer Vision (2020) 581-597.
  • (32) G. Yang, H. Tang, M. Ding, N. Sebe, E. Ricci, Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction, IEEE International Conference on Computer Vision (2021) 16269-16279.
  • (33) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems (2017) 5998-6008.
  • (34) J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, IEEE Conference on Computer Vision and Pattern Recognition (2018).
  • (35) A. Collin, C. De Vleeschouwer, Improved anomaly detection by training an autoencoder with skip connections on images corrupted with Stain-shaped noise, International Conference on Pattern Recognition (2020) 7915-7922.
  • (36) A. Bulat, J. Kossaifi, G. Tzimiropoulos, M. Pantic, Toward fast and accurate human pose estimation via soft-gated skip connections, IEEE International Conference on Automatic Face and Gesture Recognition (2020) 8-15.
  • (37) Z. Wang, L. Xiao, R. Xu, S. Su, S. Li, Y. Song, Deeper Monocular Depth Prediction via Long and Short Skip Connection, IEEE International Joint Conference on Neural Networks (2019) 1-7.
  • (38) R. Shang, J. He, J. Wang, K. Xu, L. Jiao, R. Stolkin, Dense connection and depthwise separable convolution based CNN for polarimetric SAR image classification, Knowl Based Syst. 194(2020), 105542.
  • (39) F. Dai, H. Liu, Y. Ma, X. Zhang, Q. Zhao, Dense scale network for crowd counting, International Conference on Multimedia Retrieval (2021) 64-72.
  • (40) L. Bao, Z. Yang, S. Wang, D. Bai, J. Lee, Real image denoising based on multi-scale residual dense block and cascaded U-Net with block-connection, IEEE Conference on Computer Vision and Pattern Recognition Workshops (2020) 448-449.
  • (41) I. Loshchilov, F. Hutter, Decoupled weight decay regularization, 2017. arXiv preprint arXiv:1711.05101.
  • (42) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, International Conference on Artificial Intelligence and Statistics (2010) 249-256.
  • (43) J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, IEEE Conference on Computer Vision and Pattern Recognition (2009) 248-255.
  • (44) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014. arXiv preprint arXiv:1409.1556.
  • (45) K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014. arXiv preprint arXiv:1409.1556.
  • (46) F. Liu, C. Shen, G. Lin, I. Reid, Learning depth from single monocular images using deep convolutional neural fields, IEEE Trans. Pattern Anal. Mach. Intell. 38(2015), 2024-2039.
  • (47) W. Yin, Y. Liu, C. Shen, Y. Yan, Enforcing geometric constraints of virtual normal for depth prediction, IEEE International Conference on Computer Vision (2019) 5683–5692.
  • (48) C. Godard, O. Aodha, M. Firman, G. Brostow, Digging into self-supervised monocular depth estimation, IEEE International Conference on Computer Vision (2019) 3828-3838.
  • (49) J. Liu, X. Zhang, Z. Li, T. Mao, Multi-Scale Residual Pyramid Attention Network for Monocular Depth Estimation, International Conference on Pattern Recognition (2021) 5137–5144.
  • (50) X. Ye, S. Chen, R. Xu, DPNet: Detail-preserving network for high quality monocular depth estimation, Pattern Recognit. 109(2021), 107578.
  • (51) J. Hu, M. Ozay, Y. Zhang, T. Okatani, Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries, IEEE Winter Conference on Applications of Computer Vision (2019) 1043–1051.
  • (52) X. Chen, X. Chen, Z. Zha, Structure-aware residual pyramid network for monocular depth estimation, 2019. arXiv preprint arXiv:1907.06023.
  • (53) X. Xu, Z. Chen, F. Yin, Monocular Depth Estimation With Multi-Scale Feature Fusion, IEEE Signal Process. Lett. 28(2021), 678–682.
  • (54) M. Ramamonjisoa, M. Firman, J. Watson, V. Lepetit, D. Turmukhambetov, Single Image Depth Prediction with Wavelet Decomposition, IEEE Conference on Computer Vision and Pattern Recognition (2021) 11089–11098.