Uncertainty-Guided Depth Fusion for Spike Camera

Jianing Li
Nanjing University
jnli2021@gmail.com

Jiaming Liu
Peking University
liujiaming@bupt.edu.cn

Xiaobao Wei
Beihang University
weixiaobao@buaa.edu.cn

Jiyuan Zhang
Peking University
jyzhang@stu.pku.edu.cn

Ming Lu
Intel Labs
lu199192@gmail.com

Lei Ma
Peking University
lei.ma@pku.edu.com

Li Du
Nanjing University
ldu@nju.edu.cn

Tiejun Huang
Peking University
tjhuang@pku.edu.cn

Shanghang Zhang
Peking University
shzhang.pku@gmail.com

Abstract

Depth estimation is essential for various important real-world applications such as autonomous driving. However, it suffers from severe performance degradation in high-velocity scenarios, since traditional cameras can only capture blurred images. To deal with this problem, the spike camera is designed to capture the pixel-wise luminance intensity at a high frame rate. However, depth estimation with a spike camera remains very challenging using traditional monocular or stereo depth estimation algorithms, which are based on photometric consistency. In this paper, we propose a novel Uncertainty-Guided Depth Fusion (UGDF) framework to fuse the predictions of monocular and stereo depth estimation networks for the spike camera. Our framework is motivated by the fact that stereo spike depth estimation achieves better results at close range while monocular spike depth estimation obtains better results at long range. Therefore, we introduce a dual-task depth estimation architecture with a joint training strategy and estimate the distributed uncertainty to fuse the monocular and stereo results. In order to demonstrate the advantage of spike depth estimation over traditional camera depth estimation, we contribute a spike-depth dataset named CitySpike20K, which contains 20K paired samples, for spike depth estimation. UGDF achieves state-of-the-art results on CitySpike20K, surpassing all monocular and stereo spike depth estimation baselines. We conduct extensive experiments to evaluate the effectiveness and generalization of our method on CitySpike20K. To the best of our knowledge, our framework is the first dual-task fusion framework for spike camera depth estimation. Code and dataset will be released.

1 Introduction

Depth estimation has shown great significance in many real-world applications, including robotic manipulation [40], augmented reality [38, 27], and autonomous driving [26, 42]. However, it hits a bottleneck under high-velocity motion, hindered by the blurred images produced by traditional low-frame-rate cameras [19]. To deal with high-velocity motion, spike cameras are designed to capture images at a high frame rate [10, 57]. Since spike cameras can capture the pixel-wise luminance intensity at a high frame rate, spike depth estimation is an ideal solution to depth estimation under high-velocity motion [58].

Figure 1: We demonstrate the advantage of spike camera when dealing with fast-moving objects for driving depth estimation. The first row indicates the motion blur (yellow dotted circle) from traditional RGB camera. The second row indicates such motion blur brings inaccurate depth estimation for such high-speed objects. The third row demonstrates the performance decrease for RGB depth estimation, compared with spike depth estimation. Therefore, we introduce spike camera and our proposed UGDF to solve this problem.

Figure 2: We use the binocular spike data as the input and train the monocular and stereo depth estimation models separately. By analyzing the predictions of monocular and stereo models, we find they have different accuracies at different depth ranges. This motivates us to fuse the predictions in a dual-task depth estimation architecture.

Although there are plenty of traditional works on monocular depth estimation [28, 21, 22, 5, 16] and stereo depth estimation [43, 30, 23, 31, 12, 14, 24], it is still very challenging to apply them to spike depth estimation, since spike data lacks reliable photometric consistency. To address this problem, we first analyze the pros and cons of monocular and stereo depth estimation. On the one hand, monocular depth estimation is inherently ill-posed and mainly depends on the semantic knowledge of features. Therefore, it is robust to disparity error and achieves better results at long range. On the other hand, stereo depth estimation compares local patch pairs to obtain the optimal disparity. Therefore, it obtains better results at close range and performs worse at long range. As shown in Figure 2, our analysis of monocular and stereo spike depth estimation confirms this complementary behavior. This motivates us to fuse the monocular and stereo predictions for spike depth estimation, alleviating the problem of lacking reliable photometric consistency.

In this paper, we propose a novel Uncertainty-Guided Depth Fusion (UGDF) framework to fuse the predictions of monocular and stereo spike depth estimation. Instead of training the monocular and stereo models separately, UGDF introduces a dual-task depth estimation architecture with a joint training strategy. This architecture includes two components. The first component is a shared encoder, which learns a feature representation used both to build the stereo cost volume and to regress monocular depth. The second component consists of two parallel branches for the monocular and stereo depth estimation tasks. For the monocular branch, the decoder consists of three upsampling blocks. For the stereo branch, we utilize 3D hourglass-shaped convolution to aggregate the disparity-dimension features of the 4D cost volume [6]. To fuse the predictions of both branches, instead of naive linear fusion, we introduce a novel adaptive uncertainty-guided fusion approach. Different from occlusion-aware fusion [9], which only exploits the knowledge from the stereo branch, we adopt regression uncertainty formulations [53] to measure the performance of the monocular and stereo branches. Guided by the uncertainty maps, we fuse the reliable predictions of the monocular and stereo branches, taking advantage of both tasks for the final estimation.

In addition, we contribute a spike-depth dataset named CitySpike20K, which consists of 20K paired samples, for spike depth estimation. We demonstrate the great advantages of spike camera for high-velocity depth estimation on CitySpike20K. Extensive experiments are conducted to demonstrate the good performance of our framework compared with state-of-the-art monocular and stereo baselines.

Our contributions can be summarized as follows:

We propose a novel Uncertainty-Guided Depth Fusion framework to fuse the predictions of monocular and stereo spike depth estimation, alleviating the problem of lacking reliable photometric consistency for spike data.
We introduce a dual-task depth estimation architecture along with a joint training strategy. To the best of our knowledge, we are the first to fuse dual tasks for spike depth estimation.
We contribute a spike dataset named CitySpike20K, which contains 20K spike-depth pairs, to demonstrate the advantages of spike camera over traditional cameras on high-velocity depth estimation.
We conduct extensive experiments to evaluate the advantages of our method against existing monocular and stereo baselines.

2 Related Work

In this section, we review recent works related to ours, covering frame-based and event-based vision for depth estimation.

2.1 Monocular and Stereo Depth Estimation

Monocular and stereo methods are two parallel mainstreams in the development of depth estimation algorithms. One of the earliest works that inspired recent trends in monocular depth estimation was introduced by Eigen et al. [11]. This work proposed a novel architecture with two steps, coarse-scale and fine-scale, defining depth estimation as a pixel-wise regression problem. Similar to the semantic segmentation task, one popular design for monocular depth estimation is the encoder-decoder structure with CNNs [43, 30, 23, 31, 12, 14] or transformers [33, 44]. In the encoding stage, the encoder captures context information and learns a global representation. In the decoding stage, the network establishes a coupling between context texture and depth information under ground-truth full supervision or self-supervision [14, 15, 25]. Innovation has also been made in the regression style [12, 4, 35] for a more efficient representation of depth information. Recent studies show great potential in using monocular depth estimation as an auxiliary task for semantic segmentation [20, 18].

Stereo depth estimation follows quite a different design strategy from monocular methods. Early works concentrate mainly on stereo matching between left and right stereo pairs [1, 3]. Since deep learning was applied to this task, a stereo depth estimation pipeline contains three main steps: (1) feature extraction, (2) cost aggregation, and (3) disparity/depth regression. Thanks to 3D convolution and the proposal of the 3D cost volume [28, 21], the whole pipeline can be constructed end-to-end [22, 5]. PSMNet [6] concatenates left-right features into a cost volume and performs hourglass-shaped 3D convolution for aggregation. GwcNet [16] uses a correlation formulation to divide the cost volume into groups, which decreases computation while improving prediction results. In addition, self-supervised methods also achieve competitive performance [52].

Recent studies show a new trend of rethinking the connection between monocular and stereo depth estimation [39, 9]. Besides, left-right consistency has been the main cue for unsupervised monocular depth estimation [14].

2.2 Depth Estimation for Event-based Vision

Compared to standard frame-based cameras, biologically-inspired event-based sensors capture visual information with low latency and minimal redundancy. The Dynamic Vision Sensor (DVS) is a representative event-based camera. Compared with frame-based cameras, DVS is capable of capturing moving objects. Zhu et al. [55] provide a 100Hz DVS dataset containing depth ground truth. They also propose time-synchronized event disparity volumes in [54] to handle DVS data for stereo matching. Similar research [56] uses a discretized event volume to supervise monocular optical flow and depth estimation without labels. Another work [32] adopts a spiking neural network to estimate event depth.

2.3 Spike Camera and Its Visual Application

The spike camera is also a novel bio-inspired event camera. Distinct from frame-based cameras and dynamic vision sensors, the spike camera mimics the retina to record natural scenes by continuous-time spikes [45, 57]. [48] develops a new image reconstruction approach for the retina-inspired spike camera to recover high-speed motion scenes. [50, 49, 51, 59] use spiking or regular convolutional neural networks to reconstruct high-quality, high-speed images from spike streams, similar to human vision. Spike vision shows obvious advantages in capturing high-speed moving objects or scenes, so it provides new solutions for some long-standing problems in the field of machine vision.

Figure 3: Illustration of the network architecture. The network consists of three major modules. Processed spike data pairs are sent into spike encoders, which contain 3 downsampling layers, for initial representation (a). Monocular and stereo branches deal with these features, and output depth and disparity respectively (b1, b2). A final uncertainty-guided fusion is performed to aggregate monocular and stereo results (c).

2.4 Spike Camera

Different from the RGB cameras and dynamic vision sensors, spike camera mimics the retina to record natural scenes by continuous-time spikes [45, 57]. [48] develop a new image reconstruction approach for the spike camera to recover high-speed motion scenes. [50, 49, 51, 59] use spike or regular CNNs to reconstruct high quality and high-speed images from spike streams. Spike vision shows obvious advantages in capturing high-speed moving objects or scenes, so it provides new solutions to some long-standing problems in the field of computer vision. In this paper, we propose a novel method for high-quality spike depth estimation by fusing monocular and stereo depth estimation.

3 Proposed Method

In this section, we present our uncertainty-guided depth fusion framework (UGDF), which combines the complementary strengths of the stereo and monocular tasks on spike data. The whole framework is illustrated in Fig. 3 and consists of four components.

3.1 Spike Data Analysis

For a spike camera, incoming light is captured by photoreceptors and converted to a voltage that is integrated over time $t$. Once the voltage at a certain sensing unit reaches a threshold $Θ$, a one-bit spike is fired and the voltage is reset to zero at the same time [48].

$$S(i, j, t) = \begin{cases} 1, & \int_{t0_{i, j}^{pre}}^{t} I(i, j) \, dt \geq \Theta \\ 0, & \int_{t0_{i, j}^{pre}}^{t} I(i, j) \, dt < \Theta \end{cases} \tag{1}$$

The above formula describes the basic working mechanism of the spike camera, where $I (i, j)$ represents the luminance of pixel $(i, j)$, and $t0_{i, j}^{pre}$ represents the time at which the last spike was fired at pixel $(i, j)$. An ideal analog-to-digital conversion (ADC) process would operate in continuous time, but this is not achievable due to the inherent limitations of digital circuits. Even so, the spike camera is still able to generate much denser frame streams than RGB cameras, at a maximum frequency of 40000Hz [47, 58, 49]. Given an $H \times W$ receptive field, the camera outputs an $H \times W$ binary spike frame at each sampling moment, and as time goes on, high-frequency spike frames are produced. However, directly performing depth estimation on spike frames remains challenging. On the one hand, the high contrast of 1-bit spike data makes it more difficult to distinguish local context information. On the other hand, different light intensities in the scene cause different spike firing frequencies. So in practice, we stack the spike frames within a fixed-size time window into a multi-channel tensor. For example, we take the spike frames from 100 consecutive time steps and concatenate them along the time dimension as a $100 \times H \times W$ voxel, which then becomes the input to our network.
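The firing rule of Eq. (1) and the voxel construction can be illustrated with a small NumPy sketch. It is not the simulator used for the dataset: it assumes a constant per-pixel luminance, replaces the continuous integral with a per-step accumulation, and uses a hypothetical threshold of 1.0. The function names are illustrative only.

```python
import numpy as np

def simulate_spikes(intensity, threshold, steps):
    """Integrate-and-fire in the spirit of Eq. (1): accumulate luminance,
    fire a 1-bit spike once the accumulator reaches the threshold, reset."""
    acc = np.zeros(intensity.shape, dtype=np.float64)
    frames = np.zeros((steps,) + intensity.shape, dtype=np.uint8)
    for t in range(steps):
        acc += intensity          # discrete stand-in for the integral of I(i, j) dt
        fired = acc >= threshold
        frames[t] = fired
        acc[fired] = 0.0          # voltage is reset to zero after a spike
    return frames

def to_voxel(frames, window=100):
    """Stack `window` consecutive binary frames into a (window, H, W) tensor."""
    return frames[:window].astype(np.float32)

# Hypothetical 2x2 luminance map; brighter pixels fire more often.
intensity = np.array([[0.5, 0.25], [0.125, 0.0]])
spikes = simulate_spikes(intensity, threshold=1.0, steps=100)
voxel = to_voxel(spikes)
spike_counts = spikes.sum(axis=0)
```

In this toy setting the spike count per pixel is proportional to its luminance, which is the intensity-dependent firing frequency mentioned in the text.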

3.2 UGDF Framework

We propose a simple yet efficient network that includes a shared spike-encoder and a spike-decoder with two branches. First, we build a neuromorphic encoding module to extract spiking features in both the time domain and the frequency domain. The spike voxel $V \in \mathbb{Z}^{100 \times H \times W}$ is split into spike sequences $V_{s} = \{v_{1}, v_{2}, ..., v_{s}\}$ by a time window of fixed length $n$, where $s = | 100 / n |_{\mathbb{Z}}$ is the integer number of windows and each sequence $v_{k} \in \mathbb{Z}^{n \times H \times W}$. Then, the spike sequences are fed into a ConvRNN to extract temporal connections. Meanwhile, an FFT operation is performed on the spike voxel $V$ to extract global information in the frequency domain.
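The windowing and the frequency-domain step can be sketched as follows. This is only an illustration under stated assumptions: frames that do not fill a whole window are dropped, the FFT is taken along the time axis, and the ConvRNN itself is omitted. `split_windows` and `temporal_spectrum` are hypothetical helper names.

```python
import numpy as np

def split_windows(voxel, n):
    """Split a (T, H, W) spike voxel into s = T // n sequences of n frames.
    Trailing frames that do not fill a window are dropped (an assumption)."""
    t, h, w = voxel.shape
    s = t // n
    return voxel[: s * n].reshape(s, n, h, w)

def temporal_spectrum(voxel):
    """Magnitude of the real FFT along the time axis, one spectrum per pixel."""
    return np.abs(np.fft.rfft(voxel, axis=0))

rng = np.random.default_rng(0)
voxel = (rng.random((100, 4, 6)) < 0.3).astype(np.float32)  # synthetic spikes
sequences = split_windows(voxel, n=24)   # would be fed to the ConvRNN
spectrum = temporal_spectrum(voxel)      # global frequency-domain features
```

With a window of 24, a 100-step voxel yields 4 sequences; the zero-frequency bin of the spectrum equals the per-pixel spike count.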

Next, we use a shared deep encoder to learn a representation used both to build the stereo cost volume and to regress monocular depth, as shown in part (a) of Figure 3. We adopt MobileNetV3 [17] as our encoder to trade off computation cost against model performance. The encoder contains 3 downsampling stages, and the final encoded feature map has size $B \times 256 \times \frac{H}{8} \times \frac{W}{8}$.

As shown in parts (b1) and (b2) of Figure 3, two parallel branches for monocular and stereo depth estimation branch off from the encoder and serve as the spike-decoder. Different from previous works, we fuse monocular and stereo depth estimation in one workflow at a multi-task level. The monocular and stereo tasks share a common ground in learning a global context representation. So in the stereo branch, we take advantage of the shared encoder and concatenate the unary features (the encodings obtained from the spike encoders) to build a 4D cost volume ($256 \times \text{Max-Disp.} \times \frac{H}{8} \times \frac{W}{8}$) for stereo disparity regression, where Max-Disp. represents the maximum disparity level to regress. Then, inspired by [6], we perform 3D hourglass-shaped convolution to aggregate the disparity-dimension features of the 4D cost volume. We stack three 3D hourglasses, each of which contains two blocks of $3 \times 3 \times 3$ 3D convolution with stride 2 and two blocks of $3 \times 3 \times 3$ 3D transposed convolution with stride 2. The disparity map is regressed at the final 3D convolution stage via a soft-argmin operation [21]:

$$\text{soft-argmin} := \sum_{d=0}^{Disp_{max}^{*}} d \times \gamma(-c_{d}) \tag{2}$$

where $γ (\cdot)$ represents the soft-max operation along the disparity dimension, $c_{d}$ is the predicted cost for disparity $d$, and $Disp_{max}^{*}$ is the length of the disparity dimension of the output features. The final disparity is thus a sum of candidate disparities weighted by their normalized probabilities. In the training phase, the disparity outputs of all three 3D hourglasses are involved in the loss function, while only the output of the last hourglass is used for evaluation.
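A minimal NumPy sketch of the soft-argmin regression in Eq. (2) is given below. It assumes a cost tensor of shape (D, H, W) with disparity candidates 0 to D-1; the max-subtraction is a standard numerical-stability step added here, not something stated in the text.

```python
import numpy as np

def soft_argmin(cost):
    """cost: (D, H, W) matching costs; returns the (H, W) expected disparity."""
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)  # stabilize the softmax
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)              # softmax over disparity
    d = np.arange(cost.shape[0], dtype=cost.dtype).reshape(-1, 1, 1)
    return (d * prob).sum(axis=0)                        # probability-weighted sum

cost = np.full((8, 1, 2), 50.0)
cost[3, 0, 0] = 0.0   # pixel 0: one clear minimum at d = 3
cost[5, 0, 1] = 0.0   # pixel 1: two equal minima at d = 5 and d = 6
cost[6, 0, 1] = 0.0
disp = soft_argmin(cost)
```

Unlike a hard argmin, the output is differentiable and can fall between integer disparities, as the second pixel shows.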

In the monocular branch, the right unary features (encodings) are sent to a decoding block for depth estimation. The decoder consists of three upsampling blocks, each of which contains one bilinear interpolation module and two convolutional layers with batch normalization and Mish activation. The output layer is a $1 \times 1$ convolution which squeezes the features to two channels, one used for depth estimation and the other for uncertainty estimation, as described in the next subsection.

3.3 Uncertainty Guided Fusion

Inspired by SUB-Depth [53], we assume the distribution over the output of either branch can be modeled as an exponential-family distribution such as the Laplace distribution or the Gaussian distribution. The stereo branch and the monocular branch adopt the same approach, so we use the monocular branch as an illustrative example. Given a dataset with left and right spike frames and the corresponding depth ground truth $(x_{l}, x_{r}, y_{l}, y_{r})$, we let our monocular branch output the mean $\hat{y}$ (written $\hat{y}_{r}$ for the right view) and variance $σ$ of the posterior probability distribution $p (y_{r} | \hat{y}_{r}, x_{r})$. We use the Laplace distribution:

$$p(y_{r} | \hat{y}_{r}, x_{r}) = \frac{1}{2 \sigma} \exp\left(\frac{-|\hat{y}_{r} - y_{r}|}{\sigma}\right) \tag{3}$$

We can convert the above distribution to a log-likelihood formula like:

$$\log(p(y_{r} | \hat{y}_{r}, x_{r})) = -\log(\sigma) + \frac{-|\hat{y}_{r} - y_{r}|}{\sigma} + const. \tag{4}$$

So according to max-posterior probability estimation, an uncertainty loss can be formulated in the form of:

$$loss_{unc.} = \log(\sigma) + \frac{|\hat{y}_{r} - y_{r}|}{\sigma} \tag{5}$$

We can minimize this loss function to obtain a max-posterior probability distribution over the estimated monocular depth $\hat{y}$. The uncertainty coefficient $σ_{m}$ and the predicted depth $\hat{y}$, which are the outputs of part (b1), are regressed from the same decoder at the same time, so $σ_{m}$ can be seen as the prediction uncertainty of the monocular depth estimation task. Similar to the monocular branch, a lightweight CNN is added after the probability map of the stereo branch and regresses the uncertainty coefficient $σ_{s}$ after a sigmoid activation.
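The uncertainty loss of Eq. (5) can be sketched in NumPy as below. The averaging over pixels and the small `eps` clamp on σ are assumptions added for the illustration; the text does not specify either.

```python
import numpy as np

def laplace_uncertainty_loss(pred, target, sigma, eps=1e-6):
    """Mean over pixels of log(sigma) + |pred - target| / sigma, as in Eq. (5)."""
    sigma = np.maximum(sigma, eps)  # eps is an added safeguard, not from the paper
    return float(np.mean(np.log(sigma) + np.abs(pred - target) / sigma))

pred = np.array([0.2, 0.5, 0.9])
target = np.array([0.25, 0.5, 0.5])
err = np.abs(pred - target)

# A constant sigma versus a sigma that tracks the actual per-pixel error.
loss_fixed = laplace_uncertainty_loss(pred, target, np.full(3, 0.2))
loss_matched = laplace_uncertainty_loss(pred, target, np.maximum(err, 1e-3))
```

For a fixed error, the per-pixel term is minimized when σ equals that error, so a σ that tracks the true error yields a lower loss than a constant one. This is why the regressed σ can be read as a per-pixel uncertainty estimate.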

We notice that the monocular branch outperforms the stereo branch in farther regions, while the stereo branch is better at predicting closer regions. We therefore want the fusion to combine the advantages of both branches, and define a distance threshold:

$$\sigma_{dis.} = D_{max} \frac{e^{2 (\sigma_{m} - \sigma_{s})}}{1 + e^{2 (\sigma_{m} - \sigma_{s})}} \tag{6}$$

Figure 4: Neuromorphic encoding module. We use Conv-RNN to extract local information, and meanwhile a Fast-Fourier-Transform is applied through the whole spike voxel to extract global information.

With estimated uncertainty, an uncertainty-guided fusion mask $F$ can be defined as:

$$F_{i} = \begin{cases} 0, & \hat{D}_{mono.} \leq \sigma_{dis.} \\ 1, & \hat{D}_{mono.} > \sigma_{dis.} \end{cases} \tag{7}$$

where $i$ indexes the $i$-th element of the uncertainty map, and $σ_{dis.}$ is the uncertainty-derived threshold used for fusion. To take advantage of both branches, we fuse the monocular depth prediction $\hat{D}_{mono.}$ and the stereo depth prediction $\hat{D}_{ster.}$, exploiting the complementary strengths of the monocular and stereo models. Instead of directly performing linear addition of the two outputs, we fuse them in an uncertainty-guided way. The uncertainty-guided fusion is given by:

$$\hat{D}_{f} = F \odot \hat{D}_{mono.} + (1 - F) \odot \hat{D}_{ster.} \tag{8}$$
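Eqs. (6) to (8) can be put together in a short NumPy sketch. It assumes depths normalized so that $D_{max} = 1$ and per-pixel uncertainty maps of the same shape as the depth maps; the function name and the toy values are hypothetical. The logistic term of Eq. (6) is written through the identity $\frac{e^{2x}}{1 + e^{2x}} = \frac{1}{2}(1 + \tanh x)$ to avoid overflow.

```python
import numpy as np

def uncertainty_guided_fusion(d_mono, d_ster, sigma_m, sigma_s, d_max):
    """Per-pixel distance threshold (Eq. 6), binary mask (Eq. 7), blend (Eq. 8)."""
    # Eq. (6): D_max * e^{2x} / (1 + e^{2x}) with x = sigma_m - sigma_s
    sigma_dis = d_max * 0.5 * (1.0 + np.tanh(sigma_m - sigma_s))
    mask = (d_mono > sigma_dis).astype(d_mono.dtype)   # Eq. (7): 1 selects monocular
    fused = mask * d_mono + (1.0 - mask) * d_ster      # Eq. (8)
    return fused, mask, sigma_dis

d_mono = np.array([0.1, 0.8])     # normalized monocular depths (near, far)
d_ster = np.array([0.12, 0.7])    # normalized stereo depths
equal = np.zeros(2)               # equal uncertainties -> threshold at D_max / 2
fused, mask, sigma_dis = uncertainty_guided_fusion(d_mono, d_ster, equal, equal, d_max=1.0)
```

With equal uncertainties the near pixel is taken from the stereo branch and the far pixel from the monocular branch. When the monocular uncertainty grows relative to the stereo one, the threshold moves toward $D_{max}$ and more pixels fall back to the stereo prediction.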

3.4 UGDF Loss Functions

We present training strategies for the baseline network without fusion and for UGDF with fusion. The training loss of the baseline network consists of the monocular depth estimation loss $loss_{depth.}$ and the stereo disparity regression loss $loss_{disp.}$, which use smooth-L1 loss during the training phase under the supervision of depth ground truth and generated disparity labels. The baseline loss $loss_{base.}$ is:

$$loss_{base.} = loss_{disp.} + loss_{depth.} \tag{9}$$

in which $loss_{disp.}$ and $loss_{depth.}$ are defined as:

$$loss_{disp.}(d^{*}, \hat{d}) = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{N} \alpha_{i} \cdot \text{smooth}_{L1}(d^{*}, \hat{d}) \tag{10}$$

$$loss_{depth.}(D, \hat{D}) = \frac{1}{n} \sum_{i} c_{i}^{2} - \frac{1}{n^{2}} \left(\sum_{i} c_{i}\right)^{2} + \eta \tag{11}$$

Here $η = 0.1$ and $c_{i} = \log (D_{i}) - \log (\hat{D}_{i})$. $N$ is the size of the data, $M = 3$ is the number of stacked 3D hourglasses, and $\{α_{1}, α_{2}, α_{3}\} = \{0.5, 0.7, 1.0\}$. $\hat{D}$ and $D$ represent the predicted depth and the depth ground truth, respectively. Similarly, $\hat{d}$ is the predicted disparity and $d^{*}$ is the generated disparity ground truth.
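The two base losses can be written out as a NumPy sketch. The smooth-L1 form with a transition point at 1.0 is an assumption (the text does not give its exact definition), and the function names are illustrative.

```python
import numpy as np

def smooth_l1(x, y):
    """Element-wise smooth-L1 with the transition at |x - y| = 1 (assumed)."""
    diff = np.abs(x - y)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)

def disparity_loss(d_gt, d_preds, alphas=(0.5, 0.7, 1.0)):
    """Eq. (10): weighted smooth-L1 over the M stacked hourglass outputs."""
    n = d_gt.size
    total = sum(a * smooth_l1(d_gt, p).sum() for a, p in zip(alphas, d_preds))
    return float(total / n)

def depth_loss(d_gt, d_pred, eta=0.1):
    """Eq. (11): variance of the log-depth error plus the constant eta."""
    c = np.log(d_gt) - np.log(d_pred)
    n = c.size
    return float((c ** 2).sum() / n - c.sum() ** 2 / n ** 2 + eta)

gt = np.array([1.0, 2.0, 4.0])
loss_scaled = depth_loss(gt, 2.0 * gt)             # prediction off by a global scale
loss_disp = disparity_loss(gt, [gt + 2.0] * 3)     # every hourglass off by 2 px
```

The first two terms of Eq. (11) are the variance of the log error, so a prediction that is wrong only by a global scale factor leaves just the constant $η$.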

The training loss of UGDF consists of five losses: the monocular depth estimation loss $loss_{depth.}$, the stereo disparity regression loss $loss_{disp.}$, the monocular branch uncertainty loss $loss_{mono\_unc.}$, the stereo branch uncertainty loss $loss_{ster\_unc.}$, and the fusion loss $loss_{fu.}$. The whole UGDF loss $loss_{ugdf.}$ is:

$$loss_{ugdf.} = loss_{base.} + loss_{ster\_unc.} + loss_{mono\_unc.} \tag{12}$$

where $loss_{mono\_unc.}$ and $loss_{ster\_unc.}$ follow Eq. 5, while $\hat{D}_{f}$ denotes the fused depth predicted from the two branches. Note that the depth of the stereo branch is converted from disparity using the intrinsic parameters of the camera. The training details of the baseline and UGDF are the same. Further training details are presented in Section 5.

4 Spike-Depth Dataset: CitySpike20K

This section introduces different aspects of the dataset we propose. The dataset includes RGB scenes, and their corresponding spike frames and depth maps. All these data are generated by a simulated spike camera in Unity3D virtual environment. The dataset describes 11 sequences of city street scenes containing 6 day scenes and 5 night scenes.

Our proposed CitySpike20K dataset provides a depth estimation benchmark for spike data. The scenes are created via a simulated spike camera, recording a fast-moving car in the street scene at a frequency of 1000Hz. The resolution of the recorded data is $1024 \times 768$. We also build a depth field for these scenes and store it as 24-bit depth maps. The ground truth covers 0.3-1000m absolute depth. In addition, we provide the focal length $f$ and baseline length $base_{len}$ of the stereo camera in the supplement. We also convert depth $D$ to disparity $disp.$ with the function $disp. = \frac{f \cdot base_{len}}{D}$. A visualization of our dataset is shown in the supplementary file. Besides, in terms of sensor collaboration, we provide 842 pairs of RGB images from regular stereo cameras, dense spike frames from a stereo Vidar, and depth maps from stereo depth cameras. The three kinds of data are organized in one-to-one correspondence. We further provide a demo sequence of 40000Hz spike data, recording a car driving at 91km/h in the city street. This demo is for evaluating depth estimation algorithms on high-frequency spike data.
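The depth-to-disparity conversion and its inverse can be sketched as follows. The focal length and baseline below are hypothetical placeholders (the actual values are given in the supplement), with the focal length assumed to be in pixels and the baseline in meters; the `eps` clamp is an added safeguard against division by zero.

```python
import numpy as np

def depth_to_disparity(depth, focal, baseline):
    """disp = f * base_len / D."""
    return focal * baseline / depth

def disparity_to_depth(disp, focal, baseline, eps=1e-6):
    """Inverse mapping, used to turn stereo disparity back into depth."""
    return focal * baseline / np.maximum(disp, eps)

focal, baseline = 1000.0, 0.5             # hypothetical camera parameters
depth = np.array([0.3, 10.0, 1000.0])     # spans the 0.3-1000m ground-truth range
disp = depth_to_disparity(depth, focal, baseline)
depth_back = disparity_to_depth(disp, focal, baseline)
```

Because disparity is inversely proportional to depth, distant points map to very small disparities, which is consistent with the stereo branch being less reliable at long range.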

5 Experiments

In this section, we conduct extensive experiments to show the advantages of UGDF. Then, we extensively evaluate UGDF by comparing it with the state-of-the-art and classic depth estimation methods which have shown great performance on RGB depth datasets such as KITTI[41] and NYUD-V2[29]. We also conduct comprehensive ablation studies to evaluate the contribution of each component in the last subsection. Due to space limitations, some details of experiments and results are provided in the supplementary materials.

Dataset	Method	Approach	Modality	Abs_Rel ↓	RMSE ↓	Sq_Rel ↓	RMSE_log ↓	a1 ↑	a2 ↑	a3 ↑
CS20K	PSMNet	Ster.	RGB	0.4564	15.484	12.990	0.734	0.469	0.668	0.743
CS20K	GwcNet	Ster.	RGB	0.419	19.724	9.753	0.632	0.469	0.685	0.767
CS20K	CFNet	Ster.	RGB	0.4038	14.928	8.870	0.437	0.593	0.677	0.786
CS20K	UGDF(Ours)	Fusion	Spike	0.2282	11.075	4.699	0.305	0.754	0.879	0.942

Table 1: Quantitative results on the CitySpike20K (denoted as CS20K below) validation set. Evaluation metrics are as described in section 3. We make comparison with GwcNet[16], CFNet[36], PSMNet[6]. The evaluation metrics are as introduced in subsection 4.2. We also consider model parameter size to be one of the compared targets.

Dataset	Method	Approach	Modality	Abs_Rel ↓	RMSE ↓	Sq_Rel ↓	RMSE_log ↓	a1 ↑	a2 ↑	a3 ↑
CS20K	UNet	Mono.	RGB	0.3612	19.217	6.981	0.502	0.569	0.765	0.893
CS20K	DPT	Mono.	RGB	0.249	13.641	4.349	0.379	0.632	0.817	0.925
CS20K	PSMNet	Ster.	RGB	0.4341	16.294	9.247	0.840	0.411	0.626	0.712
CS20K	GwcNet	Ster.	RGB	0.3931	18.680	8.745	0.577	0.492	0.704	0.787
CS20K	CFNet	Ster.	RGB	0.3825	13.794	7.925	0.496	0.467	0.723	0.836
CS20K	UGDF(Ours)	Fusion	Spike	0.1997	10.953	4.879	0.412	0.790	0.888	0.945

Table 2: Quantitative results on CitySpike20K test set. We add two monocular algorithms as baselines which are DPT[33] and UNet[34].
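For readers reproducing the columns of Tables 1 and 2, the sketch below computes the metrics under the assumption that they follow the standard definitions commonly used for depth benchmarks (absolute and squared relative error, RMSE, log RMSE, and threshold accuracies at $1.25$, $1.25^{2}$, $1.25^{3}$). The exact evaluation protocol of the paper, such as valid-pixel masking or depth capping, is not reproduced here.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard depth metrics over valid (strictly positive) pixels."""
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": float(np.mean(np.abs(gt - pred) / gt)),
        "sq_rel": float(np.mean((gt - pred) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((gt - pred) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))),
        "a1": float(np.mean(ratio < 1.25)),
        "a2": float(np.mean(ratio < 1.25 ** 2)),
        "a3": float(np.mean(ratio < 1.25 ** 3)),
    }

gt = np.array([10.0, 20.0, 40.0, 80.0])      # toy ground-truth depths in meters
pred = np.array([11.0, 20.0, 30.0, 160.0])   # toy predictions
metrics = depth_metrics(gt, pred)
```

Lower is better for the four error metrics and higher is better for a1 to a3, matching the arrows in the table headers.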

5.1 Implementation Details

We train our proposed UGDF network on spike-depth pairs, including stereo spike frames and right depth-ground-truth. The whole training phase contains 200 epochs and takes about 16 hours with the batch size of 4 on two NVIDIA-Tesla P100 GPUs, for $256 \times 512$ resolution spiking frames.

We utilize 24-bit 0-1000m absolute depth ground-truth to supervise training for the monocular branch. We normalize depth ground truth $D$ to $D^{*} \in (0, 1)$ , with the function $D^{*} = D / 1000$ . Meanwhile, the disparity is transformed from depth with camera intrinsics.

As for optimization, we use the Adam optimizer with $(β_{1}, β_{2}) = (0.9, 0.999)$. We set an initial learning rate of 1e-3 and decay it to 0.33e-3 at epoch 35 for a smoother optimization process.

In this section, we compare UGDF against the state-of-the-art and classic depth estimation methods on CitySpike20K dataset.

Data processing Our proposed dataset contains 20K frames of spike-depth pairs. We use 7 of the 10 total sequences for training, 2 sequences for testing and 1 sequence for validation. All the data used for training and validation is sampled every 100 time-stamps to form a spike voxel. We thus obtain 140 training pieces, 40 testing pieces and 20 validation pieces for our framework. To emphasize the advantage of spike data, we use blurred 30fps RGB frames in our dataset to train the RGB-based baseline methods, giving 571 training pieces, 142 testing pieces and 111 validation pieces for the baseline methods.

Baseline methods To demonstrate the effectiveness of UGDF, we compare it on our proposed CitySpike20K dataset with some state-of-the-art and classic depth estimation methods which have shown remarkable performance. For monocular methods, we choose the classical UNet [34] and DPT [33]. UNet has proven to be a successful design for semantic segmentation [34, 2] and image reconstruction [8, 7]. We adopt its proposed structure and evaluate it on our spike-depth estimation task. In DPT, we use Vit-b16 as the backbone and 224x224 as the input resolution. For stereo methods, PSMNet [6] uses subtraction and concatenation to build a 3D cost volume. GwcNet [16] proposes group-wise correlation to reduce computation while conducting 3D convolution. CFNet [36] employs a variance-based uncertainty estimation to adaptively search the disparity space.

Figure 5: Visualization of depth estimation on CitySpike20K. Panel (a) is 30Hz RGB data, and (b) is one spike frame in a spike voxel. Panels (c)-(e) are the outputs of our method, and (f) is the UNet output for RGB depth estimation.

Main Results and Analysis for CS20K Table 1 and Table 2 show quantitative comparisons with the RGB methods. We experiment with classic monocular depth estimation works as well as stereo depth estimation methods. Under uncertainty-guided fusion, our method achieves the best overall performance among all the methods. Compared with the best monocular method (DPT) on the test set, UGDF reduces Abs_Rel by 0.0493 (4.93 percentage points) and RMSE by 2.688. Compared with the stereo methods, we also gain improvements on all metrics. We also show the qualitative comparison in Figure 5. As can be seen, our method achieves better depth estimation than the blurred RGB-based methods. Other visualization results are in our supplement.

Dataset	Method	Approach	Modality	Abs_Rel ↓	RMSE ↓	Sq_Rel ↓	RMSE_log ↓	a1 ↑	a2 ↑	a3 ↑
Real	PSMNet	Ster.	Image	0.3743	2.228	0.413	0.843	0.451	0.703	0.838
Real	UGDF(Ours)	Ster.	Spike	0.2722	1.264	0.376	0.348	0.581	0.819	0.906
Real	UGDF(Ours)	Mono.	Spike	0.4037	1.552	1.017	0.382	0.528	0.796	0.889
Real	UGDF(Ours)	Fusion	Spike	0.2693	1.237	0.413	0.374	0.533	0.795	0.899

Table 3: Quantitative results on the Spike-Real test set. Our UGDF fusion still improves over both of its branches in Abs_Rel and RMSE.

Split	Branch	Abs_Rel	Sq_Rel	a1
Valid	Mono.	0.3302 (0.102↓)	12.759	0.738
Valid	Ster.	0.2543 (0.026↓)	3.995	0.613
Valid	E. Fusion	0.2652 (0.037↓)	5.712	0.706
Valid	U. Fusion	0.2282	4.699	0.754
Test	Mono.	0.2944 (0.095↓)	12.508	0.779
Test	Ster.	0.2118 (0.012↓)	3.780	0.753
Test	E. Fusion	0.2347 (0.035↓)	4.018	0.761
Test	U. Fusion	0.1997	4.897	0.791

Table 4: Ablation of the fusion effectiveness on CitySpike20K. Our uncertainty-guided fusion (U. Fusion) is effective and gains improvements over both branches. Values in parentheses give the Abs_Rel reduction achieved by U. Fusion.

Width	Time (ms)	Abs_Rel	Sq_Rel	a1	a2	a3
8	5.3	0.2301	5.146	0.756	0.877	0.942
16	2.8	0.2143	4.699	0.764	0.892	0.946
24	2.1	0.1997	4.879	0.790	0.888	0.945
32	1.6	0.2552	5.612	0.726	0.876	0.934

Table 5: Ablation results on test split of CS20K for window-width of neuromorphic encoding. The runtime statistics are made on RTX 2080ti GPU for a single forward pass of the network with the batch-size of 1

Evaluation on Spike-Real Set We also train and evaluate our network on a dataset captured by a real-world stereo Vidar in a series of outdoor scenes. The dataset contains 40 sequences of outdoor scenes, and we use 33 sequences for training and 7 sequences for testing. Table 3 and Figure 6 show the results evaluated on its test set.

5.2 Ablation Study

We carry out ablation experiments from two aspects. The first of those is to explore the effect of different choices of time-window widths, and the other is to verify the effectiveness of uncertainty-guided fusion design.

Effectiveness of Uncertainty Guided Fusion We conduct experiments to verify the effectiveness of fusion jointly guided by monocular and stereo uncertainty. To demonstrate the benefits of joint-guided fusion, we first compare it with a linear additive ensemble fusion. To be specific, we make a uniform linear addition of the monocular and stereo estimation results, denoted as E. Fusion in Table 4. As can be seen, the linear additive fusion is inferior to our uncertainty-guided fusion. In addition, we visualize the improvement gap between the fusion results and the two individual branches. We can see the advantages of our UGDF framework. Firstly, it combines the advantages of both stereo and monocular estimation. Secondly, it brings substantial improvements, rather than the compromise produced by ensemble-style fusion.

Time window Width of Neuromorphic Encoding In our framework, we apply a neuromorphic encoding method to extract features from spike data effectively. As described in Section 3.2, we chunk the spike voxel into spike sequences with a time window of 24 to obtain better local representations. The sequences are then sent to the Conv-RNN to extract temporal connections between them. In theory, a smaller time window helps extract local connections between spike sequences, but it increases convergence and inference time on hardware. We therefore vary the width of the encoding time window and train the whole network for 200 epochs in each setting, evaluating on the test set of CitySpike20K. The results are shown in Table 5: a width of 24 offers the best trade-off between inference time and precision.
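The chunking step can be sketched as follows. This is a minimal illustration that treats the spike voxel as a list of T binary frames; the function name, the toy voxel size, and the choice to drop a trailing remainder are assumptions of the sketch rather than details taken from the framework.

```python
def chunk_spike_voxel(voxel, width):
    """Split a spike voxel (list of T frames) into consecutive windows.

    Each window holds `width` frames; a trailing remainder shorter than
    `width` is dropped (an assumption made for this sketch).
    """
    n_windows = len(voxel) // width
    return [voxel[i * width:(i + 1) * width] for i in range(n_windows)]


# Toy voxel: T = 96 binary frames of size 2 x 2 (sizes chosen for illustration).
T, H, W = 96, 2, 2
voxel = [[[(t + y + x) % 2 for x in range(W)] for y in range(H)] for t in range(T)]

for width in (8, 16, 24, 32):
    windows = chunk_spike_voxel(voxel, width)
    print(width, len(windows))  # number of sequences passed on to the recurrent stage
```

A smaller width produces more sequences for the recurrent stage to process, which is in line with the longer runtimes reported for small widths in Table 5.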

6 Conclusion

In this paper, we propose an uncertainty-guided depth fusion framework (UGDF) for spike data, consisting of four modules: a neuromorphic encoding module, a spike encoder, spike decoders for the monocular and stereo tasks, and an uncertainty-guided fusion part. Our main motivation is that monocular and stereo depth estimation models show different advantages when making predictions from spike data, so it is critical to explore an effective fusion method that leverages the advantages of both tasks. Different from previous works, we fuse monocular and stereo depth predictions according to individual adaptive uncertainty estimates. We also generate a spike dataset for depth estimation, CitySpike20K, which contains 20K paired spike-depth samples, and describe its technical details and evaluation metrics. We demonstrate the benefit of spike data in fast-moving scenarios. Extensive experiments validate the effectiveness of the proposed UGDF. We hope this paper can inspire future work on depth estimation from spike data.

Appendix I: Proposed Dataset: CitySpike20K

Introduction and Visualization

We propose CitySpike20K, a spike-depth dataset to help explore depth estimation algorithms for the spike camera. The dataset is generated with Unity3D and contains 10 sequences, 5 of which are day scenes and the other 5 night scenes. The spike data and the corresponding depth ground truth are sampled at 1000Hz. In addition, we supply 30Hz RGB images for each scene as well as 1000Hz RGB images that are aligned with the spike data.

To simulate city environments realistically, we add moving automobiles and dynamic traffic lights. Each scene contains 5-10 moving automobiles, including buses, cars, vans and trucks. Figure 6 gives a visualization of CitySpike20K, showing RGB frames, spike data and depth maps. We use scene03 and scene07 for testing, scene09 for validation and the remaining scenes for training.

Figure 6: A visualization of our proposed CitySpike20K dataset. We generate it with the Unity3D engine and simulate a vivid city environment along with dense depth maps and spike data.

Figure 7: More prediction results on the CitySpike20K dataset. As can be seen, our framework fuses the stereo and monocular estimation results effectively.

Figure 8: A visualization of the Spike-Real dataset and prediction results on its test set.

Dataset	Method	Approach	Modality	Abs_Rel↓	RMSE ↓	Sq_Rel ↓	RMSE_log ↓	a1 ↑	a2 ↑	a3 ↑
Spike-Kitti	UGDF(Ours)	Ster.	Spike	0.1250	4.283	0.717	0.188	0.830	0.957	0.986
Spike-Kitti	UGDF(Ours)	Mono.	Spike	0.1706	5.067	1.127	0.242	0.753	0.910	0.968
Spike-Kitti	UGDF(Ours)	Fusion	Spike	0.1247	4.281	0.721	0.189	0.829	0.957	0.985

Table 6: Quantitative results on the Spike-Kitti test set. The fused result of our UGDF framework improves Abs_Rel and RMSE over both individual branches.

Dataset	Method	Approach	Abs_Rel↓	RMSE ↓	Sq_Rel ↓	RMSE_log ↓	a1 ↑	a2 ↑	a3 ↑
demo	UNet[34]	Mono.	0.2518	23.993	9.008	0.357	0.68	0.896	0.932
demo	DORN[12]	Mono.	0.3857	25.258	10.691	0.438	0.409	0.841	0.917
demo	Eigen[11]	Mono.	0.4262	25.154	20.363	0.459	0.542	0.800	0.893
demo	GC-Net[21]	Ster.	0.2350	37.158	12.743	0.401	0.614	0.809	0.868
demo	GwcNet[16]	Ster.	0.1880	24.152	7.469	0.304	0.757	0.895	0.953
demo	CFNet[36]	Ster.	0.2281	25.905	5.557	0.397	0.610	0.847	0.926
demo	StereoNet[22]	Ster.	0.2890	50.765	19.772	0.690	0.563	0.727	0.823
demo	PSMNet[6]	Ster.	0.1886	28.496	7.354	0.340	0.723	0.887	0.941
demo	GANet-1[46]	Ster.	0.3270	49.068	19.505	0.865	0.586	0.764	0.851
demo	GANet[46]	Ster.	0.2963	47.202	17.598	0.714	0.576	0.771	0.857
demo	Ours	Fusion	0.1715	22.793	11.217	0.306	0.791	0.928	0.961

Table 7: Quantitative results on CitySpike20K-demo. We compare with UNet[34], DORN[12], Eigen[11], GC-Net[21], GwcNet[16], CFNet[36], StereoNet[22], PSMNet[6], and GANet[46]. The evaluation metrics are as introduced in subsection 4.2. We also consider model parameter size as one of the comparison targets.

Figure 9: Visualization of prediction results on the Spike-Kitti dataset. As can be seen, the monocular branch still provides smoother and more accurate results in farther regions, while the stereo branch makes sharper and cleaner predictions for closer regions.

Evaluation Metric

We evaluate the effectiveness of supervised depth estimation models on CitySpike20K. Our evaluation metrics for depth estimation are described as follows:

Given an estimated depth map $\hat{D}$ , its corresponding ground truth $D$ , and $N = H \times W$ , $Abs\_Rel$ is quantified as:

$$Abs\_Rel = \frac{1}{N} \sum_{i=1}^{N} \frac{| D_{i} - \hat{D}_{i} |}{D_{i}} \tag{13}$$

and RMSE is defined as:

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \| D_{i} - \hat{D}_{i} \|^{2}} \tag{14}$$

We also introduce the $RMSE\_log$ metric:

$$RMSE\_log = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \| \log(\hat{D}_{i}) - \log(D_{i}) \|^{2}} \tag{15}$$

and the $Sq\_Rel$ metric:

$$Sq\_Rel = \frac{1}{N} \sum_{i=1}^{N} \frac{\| D_{i} - \hat{D}_{i} \|^{2}}{D_{i}} \tag{16}$$

The metrics above measure output error from different statistical aspects by weighting the distance between predictions and ground-truth labels; lower values mean better model performance. The metrics below evaluate whether predictions fall within a certain range of the ground truth; higher values mean better performance. Note that $j \in \{1, 2, 3\}$ .

$$a_{j}\ \text{accuracy}: \%\ \text{of}\ D_{i}\ \ \text{s.t.}\ \ \max\left(\frac{\hat{D}_{i}}{D_{i}}, \frac{D_{i}}{\hat{D}_{i}}\right) = \delta < T = 1.25^{j} \tag{17}$$
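As a worked illustration, the metrics in Eq. (13)-(17) can be computed as in the sketch below. It operates on flattened sequences of per-pixel depths and assumes all values are strictly positive; the function name and the toy inputs are hypothetical, and the accuracies are returned as fractions rather than percentages.

```python
import math


def depth_metrics(gt, pred):
    """Compute Eq. (13)-(17) over flattened per-pixel depths (all > 0)."""
    n = len(gt)
    abs_rel = sum(abs(d - p) / d for d, p in zip(gt, pred)) / n
    rmse = math.sqrt(sum((d - p) ** 2 for d, p in zip(gt, pred)) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(d)) ** 2 for d, p in zip(gt, pred)) / n
    )
    sq_rel = sum((d - p) ** 2 / d for d, p in zip(gt, pred)) / n
    # delta = max(pred / gt, gt / pred); a_j is the fraction with delta < 1.25 ** j.
    ratios = [max(p / d, d / p) for d, p in zip(gt, pred)]
    acc = {f"a{j}": sum(r < 1.25 ** j for r in ratios) / n for j in (1, 2, 3)}
    return {"abs_rel": abs_rel, "rmse": rmse, "rmse_log": rmse_log,
            "sq_rel": sq_rel, **acc}


# Toy depths in arbitrary units (illustrative values only).
gt = [10.0, 20.0, 40.0, 80.0]
pred = [11.0, 20.0, 30.0, 160.0]
metrics = depth_metrics(gt, pred)
for name, value in metrics.items():
    print(name, round(value, 4))
```

The last toy pixel overestimates depth by a factor of 2, so it dominates RMSE and Sq_Rel and fails all three accuracy thresholds, while the exact and near-exact pixels count toward a1.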

Figure 10: Accuracy statistics on CitySpike20K test set. The green lines and blue lines represent the monocular and stereo accuracies respectively.

Figure 11: Accuracy statistics on CitySpike20K validation set.

Appendix II: Performance on other datasets

Real-Dataset

As described in our submitted paper, we also evaluate our framework on a real dataset recorded by a spike camera. The dataset contains 40 sequences, each of which includes 3-6 spike voxels of size $[400 \times 250 \times 400]$ in the format $[T \times H \times W]$ . We split 33 sequences for training and 7 for testing.

Kitti

To demonstrate that our UGDF framework still works in real-world scenes, we carry out an experiment on a Spike-Kitti dataset. To convert Kitti[13] from the RGB modality to the spike modality, we first interpolate frames by a factor of 128 using XVFI[37]. We then use a Simulated-Vidar script to generate spike data from the RGB Kitti images, forming spike voxels of shape $(128 \times 375 \times 1242)$ , where 128 is the time dimension and $(375 \times 1242)$ is the original size of the Kitti RGB images. We apply the same neuromorphic encoding as designed for the CitySpike20K dataset in our submitted paper. As mentioned above, this experiment further explores the effectiveness of our fusion strategy. We train our framework for 50 epochs on 4 RTX-2080Ti GPUs. Figure 9 visualizes a group of our results on the validation set.
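The Simulated-Vidar script itself is not reproduced here. As a rough illustration of how a sequence of interpolated intensity frames could be turned into a binary spike voxel, the sketch below assumes a simple integrate-and-fire pixel model: each pixel accumulates intensity, emits a spike when the accumulator reaches a threshold, and then subtracts the threshold. The function name, the threshold, and the intensity normalization are assumptions and may differ from the script actually used.

```python
def frames_to_spikes(frames, threshold=1.0):
    """Convert T intensity frames (values in [0, 1]) to a T x H x W spike voxel.

    Assumed integrate-and-fire model: accumulate intensity per pixel, fire
    when the accumulator reaches `threshold`, then subtract the threshold.
    """
    height, width = len(frames[0]), len(frames[0][0])
    accumulator = [[0.0] * width for _ in range(height)]
    voxel = []
    for frame in frames:
        spikes = [[0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                accumulator[y][x] += frame[y][x]
                if accumulator[y][x] >= threshold:
                    spikes[y][x] = 1
                    accumulator[y][x] -= threshold
        voxel.append(spikes)
    return voxel


# Toy input: 128 identical 1 x 2 frames with one dark and one bright pixel.
frames = [[[0.125, 0.5]] for _ in range(128)]
voxel = frames_to_spikes(frames)
dark_count = sum(v[0][0] for v in voxel)
bright_count = sum(v[0][1] for v in voxel)
print(len(voxel), dark_count, bright_count)
```

Under this model the spike rate of a pixel is proportional to its intensity, so the brighter toy pixel fires four times as often as the darker one over the 128 time steps.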

CitySpike20K-demo

In addition to the 10 sequences of 1000Hz spike data provided in the CitySpike20K dataset, we also supply a 40000Hz demo that simulates real spike data as closely as possible. The demo contains 60K paired samples and records a 1.5-second video of a fast-driving car on a city street. Unlike in our submitted paper, we use this demo to evaluate the performance of existing models fed with spike data. Since existing methods for monocular or stereo depth estimation are mostly based on 3-channel RGB data, we change the number of input channels of these models to the time-window width of the applied spike sequences, i.e., 32 in our setting. We use the first half of the demo for training and the second half for testing.

Appendix III: Statistics to Support Our Motivation

Two observations motivate our work. The first is that the spike camera has unique advantages in fast-moving scenarios for the depth estimation task. The second is that the monocular and stereo strategies each have distinct advantages for depth estimation from spike data. We supply statistical results to support the second observation. On the CitySpike20K dataset, we compute the a1, a2, a3 accuracies in different depth intervals, defined by the depth ground truth, while evaluating our network. We transform the stereo disparities into depths and compute the a1, a2, a3 accuracies for the two branches separately under the same metrics. We then plot them in the same coordinate system. Figure 10 shows the statistics on the test set and Figure 11 shows those on the validation set. As can be seen, the accuracy of the stereo branch drops sharply in far regions, while the monocular branch remains reasonably reliable. Conversely, the stereo branch is more stable and accurate than the monocular branch in closer regions.
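The per-interval statistics described above can be sketched as follows. The disparity-to-depth conversion uses the standard rectified-stereo relation depth = focal length × baseline / disparity; the function names, camera parameters, interval edges, and toy values are all assumptions chosen for illustration and are not taken from the dataset.

```python
def disparity_to_depth(disparity, focal_length, baseline):
    """Standard stereo relation: depth = focal_length * baseline / disparity."""
    return [focal_length * baseline / d for d in disparity]


def binned_accuracy(gt, pred, edges, j=1):
    """Return the a_j accuracy of `pred` within each GT depth interval.

    `edges` are interval boundaries; interval k covers [edges[k], edges[k+1]).
    Intervals without pixels yield None.
    """
    threshold = 1.25 ** j
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        pairs = [(d, p) for d, p in zip(gt, pred) if lo <= d < hi]
        if not pairs:
            results.append(None)
            continue
        hits = sum(max(p / d, d / p) < threshold for d, p in pairs)
        results.append(hits / len(pairs))
    return results


# Toy data (all numbers are illustrative, not measurements).
gt = [5.0, 8.0, 60.0, 90.0]
mono_depth = [7.0, 8.5, 62.0, 85.0]
stereo_disp = [20.0, 12.5, 2.0, 0.8]
stereo_depth = disparity_to_depth(stereo_disp, focal_length=100.0, baseline=1.0)

edges = [0.0, 50.0, 100.0]  # one near interval and one far interval
mono_a1 = binned_accuracy(gt, mono_depth, edges)
stereo_a1 = binned_accuracy(gt, stereo_depth, edges)
print(mono_a1, stereo_a1)
```

Plotting such per-interval values for both branches against the interval centers gives curves of the kind shown in Figure 10 and Figure 11.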

References

[1] K. Ambrosch, W. Kubinger, M. Humenberger, and A. Steininger (2007) Hardware implementation of an sad based stereo vision algorithm. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–6. Cited by: §2.1.
[2] B. Baheti, S. Innani, S. Gajre, and S. Talbar (2020) Eff-unet: a novel architecture for semantic segmentation in unstructured environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 358–359. Cited by: §5.1.
[3] K. Berger, R. Voorhies, and L. H. Matthies (2017) Depth from stereo polarization in specular scenes for urban robotics. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1966–1973. Cited by: §2.1.
[4] S. F. Bhat, I. Alhashim, and P. Wonka (2021) Adabins: depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4009–4018. Cited by: §2.1.
[5] R. Chabra, J. Straub, C. Sweeney, R. Newcombe, and H. Fuchs (2019-06) StereoDRNet: dilated residual stereonet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1.
[6] J. Chang and Y. Chen (2018) Pyramid stereo matching network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5410–5418. Cited by: §1, §2.1, §3.2, §5.1, Table 1, Table 7.
[7] L. Chen, X. Lu, J. Zhang, X. Chu, and C. Chen (2021) HINet: half instance normalization network for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 182–192. Cited by: §5.1.
[8] X. Chen, Y. Liu, Z. Zhang, Y. Qiao, and C. Dong (2021-06) HDRUNet: single image hdr reconstruction with denoising and dequantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 354–363. Cited by: §5.1.
[9] Z. Chen, X. Ye, W. Yang, Z. Xu, X. Tan, Z. Zou, E. Ding, X. Zhang, and L. Huang (2021) Revealing the reciprocal relations between self-supervised stereo and monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15529–15538. Cited by: §1, §2.1.
[10] S. Dong, L. Zhu, D. Xu, Y. Tian, and T. Huang (2019) An efficient coding method for spike camera using inter-spike intervals. arXiv preprint arXiv:1912.09669. Cited by: §1.
[11] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems 27. Cited by: §2.1, Table 7.
[12] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2002–2011. Cited by: §1, §2.1, Table 7.
[13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: Kitti.
[14] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 270–279. Cited by: §1, §2.1, §2.1.
[15] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon (2020) 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2485–2494. Cited by: §2.1.
[16] X. Guo, K. Yang, W. Yang, X. Wang, and H. Li (2019) Group-wise correlation stereo network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3273–3282. Cited by: §1, §2.1, §5.1, Table 1, Table 7.
[17] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1314–1324. Cited by: §3.2.
[18] L. Hoyer, D. Dai, Y. Chen, A. Koring, S. Saha, and L. Van Gool (2021) Three ways to improve semantic segmentation with self-supervised depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11130–11140. Cited by: §2.1.
[19] L. Hu, R. Zhao, Z. Ding, L. Ma, B. Shi, R. Xiong, and T. Huang (2021) Optical flow estimation for spiking camera. arXiv preprint arXiv:2110.03916. Cited by: §1.
[20] J. Jiao, Y. Cao, Y. Song, and R. Lau (2018-09) Look deeper into depth: monocular depth estimation with semantic booster and attention-driven loss. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.1.
[21] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry (2017) End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE international conference on computer vision, pp. 66–75. Cited by: §1, §2.1, §3.2, Table 7.
[22] S. Khamis, S. Fanello, C. Rhemann, A. Kowdle, J. Valentin, and S. Izadi (2018-09) StereoNet: guided hierarchical refinement for real-time edge-aware depth prediction. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §2.1, Table 7.
[23] J. Lee and C. Kim (2019) Monocular depth estimation using relative depth maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1, §2.1.
[24] B. Liu, H. Yu, and Y. Long (2022) Local similarity pattern and cost self-reassembling for deep stereo matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, pp. 1647–1655. Cited by: §1.
[25] X. Lyu, L. Liu, M. Wang, X. Kong, L. Liu, Y. Liu, X. Chen, and Y. Yuan (2020) Hr-depth: high resolution self-supervised monocular depth estimation. arXiv preprint arXiv:2012.07356. Cited by: §2.1.
[26] F. Manhardt, W. Kehl, and A. Gaidon (2019) Roi-10d: monocular lifting of 2d detection to 6d pose and metric shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2069–2078. Cited by: §1.
[27] E. Marchand, H. Uchiyama, and F. Spindler (2015) Pose estimation for augmented reality: a hands-on survey. IEEE transactions on visualization and computer graphics 22 (12), pp. 2633–2651. Cited by: §1.
[28] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4040–4048. Cited by: §1, §2.1.
[29] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In ECCV, Cited by: §5.
[30] M. Ramamonjisoa, Y. Du, and V. Lepetit (2020-06) Predicting sharp and accurate occlusion boundaries in monocular depth estimation using displacement fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1.
[31] M. Ramamonjisoa and V. Lepetit (2019-10) SharpNet: fast and accurate recovery of occluding contours in monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Cited by: §1, §2.1.
[32] U. Rançon, J. Cuadrado-Anibarro, B. R. Cottereau, and T. Masquelier (2021) StereoSpike: depth learning with a spiking neural network. arXiv preprint arXiv:2109.13751. Cited by: §2.2.
[33] R. Ranftl, A. Bochkovskiy, and V. Koltun (2021) Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188. Cited by: §2.1, §5.1, Table 2.
[34] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §5.1, Table 2, Table 7.
[35] A. Roy and S. Todorovic (2016) Monocular depth estimation using neural regression forest. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5506–5514. Cited by: §2.1.
[36] Z. Shen, Y. Dai, and Z. Rao (2021) Cfnet: cascade and fused cost volume for robust stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13906–13915. Cited by: §5.1, Table 1, Table 7.
[37] H. Sim, J. Oh, and M. Kim (2021) Xvfi: extreme video frame interpolation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14489–14498. Cited by: Kitti.
[38] F. Tang, Y. Wu, X. Hou, and H. Ling (2019) 3d mapping and 6d pose computation for real time augmented reality on cylindrical objects. IEEE Transactions on Circuits and Systems for Video Technology 30 (9), pp. 2887–2899. Cited by: §1.
[39] F. Tosi, F. Aleotti, M. Poggi, and S. Mattoccia (2019-06) Learning monocular depth estimation infusing traditional stereo knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
[40] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield (2018) Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790. Cited by: §1.
[41] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger (2017) Sparsity invariant cnns. In International Conference on 3D Vision (3DV), Cited by: §5.
[42] D. Wu, Z. Zhuang, C. Xiang, W. Zou, and X. Li (2019) 6d-vnet: end-to-end 6-dof vehicle pose estimation from monocular rgb images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §1.
[43] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci (2018-06) Structured attention guided convolutional neural fields for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1.
[44] G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci (2021) Transformer-based attention networks for continuous pixel-wise prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16269–16279. Cited by: §2.1.
[45] Z. Yu, J. K. Liu, S. Jia, Y. Zhang, Y. Zheng, Y. Tian, and T. Huang (2020) Toward the next generation of retinal neuroprosthesis: visual computation with spikes. Engineering 6 (4), pp. 449–461. Cited by: §2.3, §2.4.
[46] F. Zhang, V. Prisacariu, R. Yang, and P. H.S. Torr (2019-06) GA-net: guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 7.
[47] J. Zhao, J. Xie, R. Xiong, J. Zhang, Z. Yu, and T. Huang (2021-10) Super resolve dynamic scene from continuous spike streams. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2533–2542. Cited by: §3.1.
[48] J. Zhao, R. Xiong, and T. Huang (2020) High-speed motion scene reconstruction for spike camera via motion aligned filtering. In 2020 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. Cited by: §2.3, §2.4, §3.1.
[49] J. Zhao, R. Xiong, H. Liu, J. Zhang, and T. Huang (2021-06) Spk2ImgNet: learning to reconstruct dynamic scene from continuous spike stream. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11996–12005. Cited by: §2.3, §2.4, §3.1.
[50] Y. Zheng, L. Zheng, Z. Yu, B. Shi, Y. Tian, and T. Huang (2021-06) High-speed image reconstruction through short-term plasticity for spiking cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6358–6367. Cited by: §2.3, §2.4.
[51] Y. Zheng, L. Zheng, Z. Yu, B. Shi, Y. Tian, and T. Huang (2021) High-speed image reconstruction through short-term plasticity for spiking cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6358–6367. Cited by: §2.3, §2.4.
[52] C. Zhou, H. Zhang, X. Shen, and J. Jia (2017-10) Unsupervised learning of stereo matching. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
[53] H. Zhou, S. Taylor, and D. Greenwood (2021) SUB-depth: self-distillation and uncertainty boosting self-supervised monocular depth estimation. arXiv preprint arXiv:2111.09692. Cited by: §1, §3.3.
[54] A. Z. Zhu, Y. Chen, and K. Daniilidis (2018) Realtime time synchronized event-based stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 433–447. Cited by: §2.2.
[55] A. Z. Zhu, D. Thakur, T. Özaslan, B. Pfrommer, V. Kumar, and K. Daniilidis (2018) The multivehicle stereo event camera dataset: an event camera dataset for 3d perception. IEEE Robotics and Automation Letters 3 (3), pp. 2032–2039. Cited by: §2.2.
[56] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis (2019) Unsupervised event-based learning of optical flow, depth, and egomotion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 989–997. Cited by: §2.2.
[57] L. Zhu, S. Dong, T. Huang, and Y. Tian (2019) A retina-inspired sampling method for visual texture reconstruction. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 1432–1437. Cited by: §1, §2.3, §2.4.
[58] L. Zhu, S. Dong, J. Li, T. Huang, and Y. Tian (2020) Retina-like visual image reconstruction via spiking neural model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1438–1446. Cited by: §1, §3.1.
[59] L. Zhu, J. Li, X. Wang, T. Huang, and Y. Tian (2021) NeuSpike-net: high speed video reconstruction via bio-inspired neuromorphic cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2400–2409. Cited by: §2.3, §2.4.