Monocular Depth Estimation (MDE) is a fundamental problem in computer vision with numerous applications. Recently, LIDAR-supervised methods have achieved remarkable per-pixel depth accuracy in outdoor scenes. However, significant errors are typically found in the proximity of depth discontinuities, i.e., depth edges, which often hinder the performance of depth-dependent applications that are sensitive to such inaccuracies, e.g., novel view synthesis and augmented reality. Since direct supervision for the location of depth edges is typically unavailable in sparse LIDAR-based scenes, encouraging the MDE model to produce correct depth edges is not straightforward. In this work we propose to learn to detect the location of depth edges from densely-supervised synthetic data, and use it to generate supervision for the depth edges in the MDE training. %Despite the 'domain gap' between synthetic and real data, we show that depth edges that are estimated directly are significantly more accurate than the ones that emerge indirectly from the MDE training. To quantitatively evaluate our approach, and due to the lack of depth edges ground truth in LIDAR-based scenes, we manually annotated subsets of the KITTI and the DDAD datasets with depth edges ground truth. We demonstrate significant gains in the accuracy of the depth edges with comparable per-pixel depth accuracy on several challenging datasets.
translated by 谷歌翻译
建立新型观点综合的最近进展后,我们提出了改善单眼深度估计的应用。特别是,我们提出了一种在三个主要步骤中分开的新颖训练方法。首先,单眼深度网络的预测结果被扭转到额外的视点。其次,我们应用一个额外的图像综合网络,其纠正并提高了翘曲的RGB图像的质量。通过最小化像素-WISE RGB重建误差,该网络的输出需要尽可能类似地查看地面真实性视图。第三,我们将相同的单眼深度估计重新应用于合成的第二视图点,并确保深度预测与相关的地面真理深度一致。实验结果证明,我们的方法在Kitti和Nyu-Deaft-V2数据集上实现了最先进的或可比性,具有轻量级和简单的香草U-Net架构。
translated by 谷歌翻译
不同的环境对长期自主驾驶的户外强大的视觉感知构成了巨大挑战,以及对不同环境影响的学习算法的概括仍然是一个公开问题。虽然最近单眼深度预测得到了很好的研究,但很少有很多工作,专注于不同环境的强大的基于学习的深度预测,例如,由于缺乏如此多环境的现实世界数据集和基准测试,不断变化照明和季节。为此,基于CMU Visual Location DataSet建立了第一个跨赛季单眼深度预测数据集和基准赛季。为了基准不同环境下的深度估计性能,我们使用几个新配制的指标调查来自Kitti基准的代表性和最近的最先进的开源监督,自我监督和域适应深度预测方法。通过对所提出的数据集进行广泛的实验评估,定性和定量分析了多种环境对性能和鲁棒性的影响,表明即使微调,长期单眼深度预测也仍然具有挑战性。我们进一步提供了承诺的途径,即自我监督的培训和立体声几何约束有助于提高改变环境的鲁棒性。数据集可在https://seasondepth.github.io上找到,并且在https://github.com/seasondepth/seasondepth上提供基准工具包。
translated by 谷歌翻译
本文提出了一个开放而全面的框架,以系统地评估对自我监督单眼估计的最新贡献。这包括训练,骨干,建筑设计选择和损失功能。该领域的许多论文在建筑设计或损失配方中宣称新颖性。但是,简单地更新历史系统的骨干会导致25%的相对改善,从而使其胜过大多数现有系统。对该领域论文的系统评估并不直接。在以前的论文中比较类似于类似的需要,这意味着评估协议中的长期错误在现场无处不在。许多论文可能不仅针对特定数据集进行了优化,而且还针对数据和评估标准的错误。为了帮助该领域的未来研究,我们发布了模块化代码库,可以轻松评估针对校正的数据和评估标准的替代设计决策。我们重新实施,验证和重新评估16个最先进的贡献,并引入一个新的数据集(SYNS-Patches),其中包含各种自然和城市场景中的密集室外深度地图。这允许计算复杂区域(例如深度边界)的信息指标。
translated by 谷歌翻译
Monocular depth estimation is a challenging problem on which deep neural networks have demonstrated great potential. However, depth maps predicted by existing deep models usually lack fine-grained details due to the convolution operations and the down-samplings in networks. We find that increasing input resolution is helpful to preserve more local details while the estimation at low resolution is more accurate globally. Therefore, we propose a novel depth map fusion module to combine the advantages of estimations with multi-resolution inputs. Instead of merging the low- and high-resolution estimations equally, we adopt the core idea of Poisson fusion, trying to implant the gradient domain of high-resolution depth into the low-resolution depth. While classic Poisson fusion requires a fusion mask as supervision, we propose a self-supervised framework based on guided image filtering. We demonstrate that this gradient-based composition performs much better at noisy immunity, compared with the state-of-the-art depth map fusion method. Our lightweight depth fusion is one-shot and runs in real-time, making our method 80X faster than a state-of-the-art depth fusion method. Quantitative evaluations demonstrate that the proposed method can be integrated into many fully convolutional monocular depth estimation backbones with a significant performance boost, leading to state-of-the-art results of detail enhancement on depth maps.
translated by 谷歌翻译
具有高质量注释的大规模培训数据对于训练语义和实例分割模型至关重要。不幸的是,像素的注释是劳动密集型且昂贵的,从而提高了对更有效的标签策略的需求。在这项工作中,我们提出了一种新颖的3D到2D标签传输方法,即Panoptic Nerf,该方法旨在从易于体现的粗3D边界原始基原始素中获取每个像素2D语义和实例标签。我们的方法利用NERF作为可区分的工具来统一从现有数据集中传输的粗3D注释和2D语义提示。我们证明,这种组合允许通过语义信息指导的几何形状,从而使跨多个视图的准确语义图渲染。此外,这种融合过程解决了粗3D注释的标签歧义,并过滤了2D预测中的噪声。通过推断3D空间并渲染到2D标签,我们的2D语义和实例标签是按设计一致的多视图。实验结果表明,在挑战Kitti-360数据集的挑战性城市场景方面,Pastic Nerf的表现优于现有标签传输方法。
translated by 谷歌翻译
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods.Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
translated by 谷歌翻译
深度完成旨在预测从深度传感器(例如Lidars)中捕获的极稀疏图的密集像素深度。它在各种应用中起着至关重要的作用,例如自动驾驶,3D重建,增强现实和机器人导航。基于深度学习的解决方案已经证明了这项任务的最新成功。在本文中,我们首次提供了全面的文献综述,可帮助读者更好地掌握研究趋势并清楚地了解当前的进步。我们通过通过对现有方法进行分类的新型分类法提出建议,研究网络体系结构,损失功能,基准数据集和学习策略的设计方面的相关研究。此外,我们在包括室内和室外数据集(包括室内和室外数据集)上进行了三个广泛使用基准测试的模型性能进行定量比较。最后,我们讨论了先前作品的挑战,并为读者提供一些有关未来研究方向的见解。
translated by 谷歌翻译
我们提出了一个新颖的高分辨率和具有挑战性的立体声数据集框架室内场景,并以致密而准确的地面真相差异注释。我们数据集的特殊是存在几个镜面和透明表面的存在,即最先进的立体声网络失败的主要原因。我们的采集管道利用了一个新颖的深度时空立体声框架,该框架可以轻松准确地使用子像素精度进行标记。我们总共发布了419个样本,这些样本在64个不同的场景中收集,并以致密的地面差异注释。每个样本包括高分辨率对(12 MPX)以及一个不平衡对(左:12 MPX,右:1.1 MPX)。此外,我们提供手动注释的材料分割面具和15K未标记的样品。我们根据我们的数据集评估了最新的深层网络,强调了它们在解决立体声方面的开放挑战方面的局限性,并绘制了未来研究的提示。
translated by 谷歌翻译
在接受高质量的地面真相(如LiDAR数据)培训时,监督的学习深度估计方法可以实现良好的性能。但是,LIDAR只能生成稀疏的3D地图,从而导致信息丢失。每个像素获得高质量的地面深度数据很难获取。为了克服这一限制,我们提出了一种新颖的方法,将有前途的平面和视差几何管道与深度信息与U-NET监督学习网络相结合的结构信息结合在一起,与现有的基于流行的学习方法相比,这会导致定量和定性的改进。特别是,该模型在两个大规模且具有挑战性的数据集上进行了评估:Kitti Vision Benchmark和CityScapes数据集,并在相对错误方面取得了最佳性能。与纯深度监督模型相比,我们的模型在薄物体和边缘的深度预测上具有令人印象深刻的性能,并且与结构预测基线相比,我们的模型的性能更加强大。
translated by 谷歌翻译
3D object detection is an essential task in autonomous driving. Recent techniques excel with highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery data have, until now, resulted in drastically lower accuracies -a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that it is not the quality of the data but its representation that accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations -essentially mimicking the LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state-of-the-art in image-based performance -raising the detection accuracy of objects within the 30m range from the previous state-of-the-art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo-image-based approaches. Our code is publicly available at https: //github.com/mileyan/pseudo_lidar.
translated by 谷歌翻译
尽管在过去几年中取得了重大进展,但使用单眼图像进行深度估计仍然存在挑战。首先,训练度量深度预测模型的训练是不算气的,该预测模型可以很好地推广到主要由于训练数据有限的不同场景。因此,研究人员建立了大规模的相对深度数据集,这些数据集更容易收集。但是,由于使用相对深度数据训练引起的深度转移,现有的相对深度估计模型通常无法恢复准确的3D场景形状。我们在此处解决此问题,并尝试通过对大规模相对深度数据进行训练并估算深度转移来估计现场形状。为此,我们提出了一个两阶段的框架,该框架首先将深度预测到未知量表并从单眼图像转移,然后利用3D点云数据来预测深度​​移位和相机的焦距,使我们能够恢复恢复3D场景形状。由于两个模块是单独训练的,因此我们不需要严格配对的培训数据。此外,我们提出了图像级的归一化回归损失和基于正常的几何损失,以通过相对深度注释来改善训练。我们在九个看不见的数据集上测试我们的深度模型,并在零拍摄评估上实现最先进的性能。代码可用:https://git.io/depth
translated by 谷歌翻译
作为许多自主驾驶和机器人活动的基本组成部分,如自我运动估计,障碍避免和场景理解,单眼深度估计(MDE)引起了计算机视觉和机器人社区的极大关注。在过去的几十年中,已经开发了大量方法。然而,据我们所知,对MDE没有全面调查。本文旨在通过审查1970年至2021年之间发布的197个相关条款来弥补这一差距。特别是,我们为涵盖各种方法的MDE提供了全面的调查,介绍了流行的绩效评估指标并汇总公开的数据集。我们还总结了一些代表方法的可用开源实现,并比较了他们的表演。此外,我们在一些重要的机器人任务中审查了MDE的应用。最后,我们通过展示一些有希望的未来研究方向来结束本文。预计本调查有助于读者浏览该研究领域。
translated by 谷歌翻译
对于单眼深度估计,获取真实数据的地面真相并不容易,因此通常使用监督的合成数据采用域适应方法。但是,由于缺乏实际数据的监督,这仍然可能会导致较大的域间隙。在本文中,我们通过从真实数据中生成可靠的伪基础真理来开发一个域适应框架,以提供直接的监督。具体而言,我们提出了两种用于伪标记的机制:1)通过测量图像具有相同内容但不同样式的深度预测的一致性,通过测量深度预测的一致性; 2)通过点云完成网络的3D感知伪标记,该网络学会完成3D空间中的深度值,从而在场景中提供更多的结构信息,以完善并生成更可靠的伪标签。在实验中,我们表明我们的伪标记方法改善了各种环境中的深度估计,包括在训练过程中使用立体声对。此外,该提出的方法对现实世界数据集中的几种最新无监督域的适应方法表现出色。
translated by 谷歌翻译
Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage.We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.
translated by 谷歌翻译
现有的深度完成方法通常以特定的稀疏深度类型为目标,并且在任务域之间概括较差。我们提出了一种方法,可以通过各种范围传感器(包括现代手机中的范围传感器或多视图重建算法)获得稀疏/半密度,嘈杂和潜在的低分辨率深度图。我们的方法利用了在大规模数据集中训练的单个图像深度预测网络的形式的数据驱动的先验,其输出被用作我们模型的输入。我们提出了一个有效的培训计划,我们在典型的任务域中模拟各种稀疏模式。此外,我们设计了两个新的基准测试,以评估深度完成方法的普遍性和鲁棒性。我们的简单方法显示了针对最先进的深度完成方法的优越的跨域泛化能力,从而引入了一种实用的解决方案,以在移动设备上捕获高质量的深度捕获。代码可在以下网址获得:https://github.com/yvanyin/filldepth。
translated by 谷歌翻译
Our long term goal is to use image-based depth completion to quickly create 3D models from sparse point clouds, e.g. from SfM or SLAM. Much progress has been made in depth completion. However, most current works assume well distributed samples of known depth, e.g. Lidar or random uniform sampling, and perform poorly on uneven samples, such as from keypoints, due to the large unsampled regions. To address this problem, we extend CSPN with multiscale prediction and a dilated kernel, leading to much better completion of keypoint-sampled depth. We also show that a model trained on NYUv2 creates surprisingly good point clouds on ETH3D by completing sparse SfM points.
translated by 谷歌翻译
Self-supervised monocular depth estimation has shown impressive results in static scenes. It relies on the multi-view consistency assumption for training networks, however, that is violated in dynamic object regions and occlusions. Consequently, existing methods show poor accuracy in dynamic scenes, and the estimated depth map is blurred at object boundaries because they are usually occluded in other training views. In this paper, we propose SC-DepthV3 for addressing the challenges. Specifically, we introduce an external pretrained monocular depth estimation model for generating single-image depth prior, namely pseudo-depth, based on which we propose novel losses to boost self-supervised training. As a result, our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes. We demonstrate the significantly superior performance of our method over previous methods on six challenging datasets, and we provide detailed ablation studies for the proposed terms. Source code and data will be released at https://github.com/JiawangBian/sc_depth_pl
translated by 谷歌翻译
Single-view depth prediction is a fundamental problem in computer vision. Recently, deep learning methods have led to significant progress, but such methods are limited by the available training data. Current datasets based on 3D sensors have key limitations, including indoor-only images (NYU), small numbers of training examples (Make3D), and sparse sampling (KITTI). We propose to use multi-view Internet photo collections, a virtually unlimited data source, to generate training data via modern structure-from-motion and multi-view stereo (MVS) methods, and present a large depth dataset called MegaDepth based on this idea. Data derived from MVS comes with its own challenges, including noise and unreconstructable objects. We address these challenges with new data cleaning methods, as well as automatically augmenting our data with ordinal depth relations generated using semantic segmentation. We validate the use of large amounts of Internet data by showing that models trained on MegaDepth exhibit strong generalization-not only to novel scenes, but also to other diverse datasets including Make3D, KITTI, and DIW, even when no images from those datasets are seen during training. 1
translated by 谷歌翻译
A significant weakness of most current deep Convolutional Neural Networks is the need to train them using vast amounts of manually labelled data. In this work we propose a unsupervised framework to learn a deep convolutional neural network for single view depth prediction, without requiring a pre-training stage or annotated ground-truth depths. We achieve this by training the network in a manner analogous to an autoencoder. At training time we consider a pair of images, source and target, with small, known camera motion between the two such as a stereo pair. We train the convolutional encoder for the task of predicting the depth map for the source image. To do so, we explicitly generate an inverse warp of the target image using the predicted depth and known inter-view displacement, to reconstruct the source image; the photometric error in the reconstruction is the reconstruction loss for the encoder. The acquisition of this training data is considerably simpler than for equivalent systems, requiring no manual annotation, nor calibration of depth sensor to camera. We show that our network trained on less than half of the KITTI dataset gives comparable performance to that of the state-of-the-art supervised methods for single view depth estimation. 1 1 Find the model and other imformation on the project github page: https://github. com/Ravi-Garg/Unsupervised_Depth_Estimation
translated by 谷歌翻译