We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences. In common with recent work [10,14,16], we use an end-to-end learning approach with view synthesis as the supervisory signal. In contrast to the previous work, our method is completely unsupervised, requiring only monocular video sequences for training. Our method uses single-view depth and multiview pose networks, with a loss based on warping nearby views to the target using the computed depth and pose. The networks are thus coupled by the loss during training, but can be applied independently at test time. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: 1) monocular depth performs comparably with supervised methods that use either ground-truth pose or depth for training, and 2) pose estimation performs favorably compared to established SLAM systems under comparable input settings.
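The warping step behind such a view-synthesis loss can be illustrated with a minimal sketch. It assumes a pinhole camera with intrinsics fx, fy, cx, cy and a relative pose given as a 3x3 rotation plus a translation; the function names and parameters are hypothetical and are not taken from the paper.

```python
def matvec(matrix, vector):
    """Multiply a 3x3 matrix (nested lists) by a length-3 vector."""
    return [sum(matrix[i][j] * vector[j] for j in range(3)) for i in range(3)]


def project_to_source(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Map a target-view pixel (u, v) with predicted depth into a nearby source view.

    Assumptions (illustrative only): pinhole intrinsics shared by both views,
    and a target-to-source pose expressed as rotation (3x3) and translation (3,).
    """
    # Back-project the pixel to a 3D point in the target camera frame.
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Move the point into the source camera frame using the predicted pose.
    moved = [r + t for r, t in zip(matvec(rotation, point), translation)]
    # Project back to pixel coordinates with the same pinhole model.
    return fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy
```

Sampling the source image at the returned coordinates and comparing the result with the target pixel would give a photometric error that depends on both the depth and the pose predictions, which is how the two networks become coupled during training.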
We propose GeoNet, a jointly unsupervised learning framework for monocular depth, optical flow and egomotion estimation from videos. The three components are coupled by the nature of 3D scene geometry, jointly learned by our framework in an end-to-end manner. Specifically, geometric relationships are extracted over the predictions of individual modules and then combined as an image reconstruction loss, reasoning about static and dynamic scene parts separately. Furthermore, we propose an adaptive geometric consistency loss to increase robustness towards outliers and non-Lambertian regions, which resolves occlusions and texture ambiguities effectively. Experimentation on the KITTI driving dataset reveals that our scheme achieves state-of-the-art results in all three tasks, performing better than previous unsupervised methods and comparably with supervised ones.
We address the problem of depth and ego-motion estimation from image sequences. Recent advances in the domain propose to train a deep learning model for both tasks using image reconstruction in a self-supervised manner. We revise the assumptions and the limitations of the current approaches and propose two improvements to boost the performance of depth and ego-motion estimation. We first use Lie group properties to enforce geometric consistency between images in the sequence and their reconstructions. We then propose a mechanism to pay attention to image regions where the image reconstruction gets corrupted. We show how to integrate the attention mechanism in the form of attention gates in the pipeline and use attention coefficients as a mask. We evaluate the new architecture on the KITTI datasets and compare it to previous techniques. We show that our approach improves the state-of-the-art results for ego-motion estimation and achieves comparable results for depth estimation.
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
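Components (i) and (iii) can be sketched on flattened per-pixel error lists. This is a minimal illustration and makes two assumptions the abstract does not state: the minimum is taken across source frames for each pixel, and a pixel is masked when the unwarped source frame already matches the target at least as well as the warped one. The function and argument names are hypothetical.

```python
def min_reprojection_loss(warped_errors, identity_errors):
    """Average the per-pixel minimum reprojection error over unmasked pixels.

    warped_errors:   one list per source frame, holding the photometric error
                     of each pixel after warping that frame into the target.
    identity_errors: same layout, but computed with the unwarped source frame.
    Returns (loss, number_of_pixels_kept).
    """
    num_pixels = len(warped_errors[0])
    total, kept = 0.0, 0
    for p in range(num_pixels):
        # (i) Take the best-matching source frame, so a pixel occluded in
        # one frame is not penalised if it is visible in another.
        best_warped = min(errors[p] for errors in warped_errors)
        best_identity = min(errors[p] for errors in identity_errors)
        # (iii) Assumed masking rule: skip pixels that look static relative
        # to the camera, where warping does not reduce the error.
        if best_warped < best_identity:
            total += best_warped
            kept += 1
    return (total / kept if kept else 0.0), kept
```

With two source frames and three pixels, a pixel whose warped error is higher than its identity error in both frames is excluded from the average.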
Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.
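The left-right consistency idea can be illustrated on a single rectified image row. The sketch below assumes that a left-image pixel at column x with disparity d corresponds to column x - d in the right image, and it uses nearest-pixel lookup rather than the differentiable sampling a trainable loss would need. The function name is hypothetical.

```python
def left_right_consistency(disp_left, disp_right):
    """Mean absolute disagreement between left and right disparities on one row.

    disp_left[x]  : disparity predicted for column x of the left image.
    disp_right[x] : disparity predicted for column x of the right image.
    Columns whose match falls outside the row are skipped.
    """
    width = len(disp_left)
    total, count = 0.0, 0
    for x, d in enumerate(disp_left):
        # Column in the right image that the left pixel maps to (assumed sign).
        xr = int(round(x - d))
        if 0 <= xr < width:
            # A consistent pair of maps predicts the same disparity at both ends.
            total += abs(d - disp_right[xr])
            count += 1
    return total / count if count else 0.0
```

The value is zero when the two disparity maps agree at every matched pixel and grows with their disagreement.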
Photometric differences are widely used as supervision signals to train neural networks for estimating depth and camera pose from unlabeled monocular videos. However, this approach is detrimental to model optimization because occlusions and moving objects in a scene violate the underlying static-scene assumption. In addition, pixels in textureless regions or less discriminative pixels hinder model training. To solve these problems, in this paper, we first deal with moving objects and occlusions by utilizing the difference of the flow fields and depth structure generated by affine transformation and view synthesis, respectively. Second, we mitigate the effect of textureless regions on model optimization by measuring differences between features with more semantic and contextual information without adding networks. In addition, although the bidirectionality component is used in each sub-objective function, a pair of images is reasoned about only once, which helps reduce overhead. Extensive experiments and visual analysis demonstrate the effectiveness of the proposed method, which outperforms existing state-of-the-art self-supervised methods under the same conditions and without introducing additional auxiliary information.
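Modern computer vision has moved beyond the realm of Internet photo collections and into the physical world, guiding camera-equipped robots and autonomous cars through unstructured environments. To enable these embodied agents to interact with real-world objects, cameras are increasingly used as depth sensors, reconstructing the environment for a variety of downstream reasoning tasks. Machine-learning-aided depth perception, or depth estimation, predicts the distance of each pixel in an image. Although impressive progress has been made in depth estimation, significant challenges remain: (1) ground-truth depth labels are difficult to collect at scale, (2) camera information is typically assumed to be known but is often unreliable, and (3) restrictive camera assumptions are common, even though a wide variety of camera types and lenses are used in practice. In this thesis, we focus on relaxing these assumptions and describe contributions toward the ultimate goal of turning cameras into truly generic depth sensors.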
Although cameras are ubiquitous, robotic platforms typically rely on active sensors like LiDAR for direct 3D perception. In this work, we propose a novel self-supervised monocular depth estimation method combining geometry with a new deep network, PackNet, learned only from unlabeled monocular videos. Our architecture leverages novel symmetrical packing and unpacking blocks to jointly learn to compress and decompress detail-preserving representations using 3D convolutions. Although self-supervised, our method outperforms other self-, semi-, and fully supervised methods on the KITTI benchmark. The 3D inductive bias in PackNet enables it to scale with input resolution and number of parameters without overfitting, generalizing better on out-of-domain data such as the NuScenes dataset. Furthermore, it does not require large-scale supervised pretraining on ImageNet and can run in real-time. Finally, we release DDAD (Dense Depth for Automated Driving), a new urban driving dataset with more challenging and accurate depth evaluation, thanks to longer-range and denser ground-truth depth generated from high-density LiDARs mounted on a fleet of self-driving cars operating world-wide.
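In this paper, we propose a novel self-supervised method for predicting depth in future, unobserved real-world scenes. This work is the first to explore self-supervised learning for estimating the monocular depth of future, unobserved video frames. Existing works rely on a large number of annotated samples to generate probabilistic predictions of depth for unseen frames. However, the need for large numbers of annotated video samples makes this unrealistic. In addition, the probabilistic nature of the problem, where one past can have multiple future outcomes, often leads to incorrect depth estimates. Unlike previous methods, we model depth estimation of unobserved frames as a view-synthesis problem, which treats depth estimation of the unseen video frame as an auxiliary task while synthesizing the views back using a learned pose. This approach is not only cost-effective, since we do not use any ground-truth depth for training (and is hence practical), but also deterministic (a sequence of past frames maps to the immediate future). To address this task, we first develop a novel depth forecasting network, DEFNET, which estimates the depth of the unobserved future by forecasting latent features. Second, we develop a channel-attention-based pose estimation network that estimates the pose of the unobserved frame. Using this learned pose, the estimated depth map is reconstructed back into the image domain, thus forming a self-supervised solution. Compared with state-of-the-art alternatives, our proposed approach shows significant improvements in the Abs Rel metric in both short- and mid-term forecasting settings, benchmarked on KITTI and Cityscapes. Code is available at https://github.com/sauradip/depthforecasting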
Monocular depth estimation has been actively studied in fields such as robot vision, autonomous driving, and 3D scene understanding. Given a sequence of color images, unsupervised learning methods based on the framework of Structure-From-Motion (SfM) simultaneously predict depth and camera relative pose. However, dynamically moving objects in the scene violate the static world assumption, resulting in inaccurate depths of dynamic objects. In this work, we propose a new method to address such dynamic object movements through monocular 3D object detection. Specifically, we first detect 3D objects in the images and build the per-pixel correspondence of the dynamic pixels with the detected object pose while leaving the static pixels corresponding to the rigid background to be modeled with camera motion. In this way, the depth of every pixel can be learned via a meaningful geometry model. In addition, objects are detected as cuboids with absolute scale, which is used to eliminate the scale ambiguity problem inherent in monocular vision. Experiments on the KITTI depth dataset show that our method achieves state-of-the-art performance for depth estimation. Furthermore, joint training of depth, camera motion and object pose also improves monocular 3D object detection performance. To the best of our knowledge, this is the first work that allows a monocular 3D object detection network to be fine-tuned in a self-supervised manner.
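As a fundamental component of many autonomous driving and robotics tasks, such as ego-motion estimation, obstacle avoidance, and scene understanding, monocular depth estimation (MDE) has attracted great attention from the computer vision and robotics communities. Over the past decades, a large number of methods have been developed. To the best of our knowledge, however, there is no comprehensive survey of MDE. This paper aims to bridge this gap by reviewing 197 relevant articles published between 1970 and 2021. In particular, we provide a comprehensive survey of MDE covering various methods, introduce the popular performance evaluation metrics, and summarize publicly available datasets. We also summarize the available open-source implementations of some representative methods and compare their performance. Furthermore, we review the application of MDE in some important robotic tasks. Finally, we conclude the paper by presenting some promising directions for future research. This survey is expected to help readers navigate this research field.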
We present a novel approach for unsupervised learning of depth and ego-motion from monocular video. Unsupervised learning removes the need for separate supervisory signals (depth or ego-motion ground truth, or multi-view video). Prior work in unsupervised depth learning uses pixel-wise or gradient-based losses, which only consider pixels in small local neighborhoods. Our main contribution is to explicitly consider the inferred 3D geometry of the whole scene, and enforce consistency of the estimated 3D point clouds and ego-motion across consecutive frames. This is a challenging task and is solved by a novel (approximate) backpropagation algorithm for aligning 3D structures. We combine this novel 3D-based loss with 2D losses based on photometric quality of frame reconstructions using estimated depth and ego-motion from adjacent frames. We also incorporate validity masks to avoid penalizing areas in which no useful information exists. We test our algorithm on the KITTI dataset and on a video dataset captured on an uncalibrated mobile phone camera. Our proposed approach consistently improves depth estimates on both datasets, and outperforms the state-of-the-art for both depth and ego-motion. Because we only require a simple video, learning depth and ego-motion on large and varied datasets becomes possible. We demonstrate this by training on the low quality uncalibrated video dataset and evaluating on KITTI, ranking among top performing prior methods which are trained on KITTI itself.
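Depth and ego-motion estimation are essential for the localization and navigation of autonomous robots and autonomous driving. Recent studies make it possible to learn per-pixel depth and ego-motion from unlabeled monocular video. A novel unsupervised training framework is proposed with 3D hierarchical refinement and augmentation using explicit 3D geometry. In this framework, the depth and pose estimations are hierarchically and mutually coupled to refine the estimated pose layer by layer. An intermediate view image is proposed and synthesized by warping the pixels in an image with the estimated depth and coarse pose. The residual pose transformation can then be estimated from the new view image and the image of the adjacent frame to refine the coarse pose. The iterative refinement is implemented in a differentiable manner, so that the whole framework is optimized uniformly. Meanwhile, a new image augmentation method is proposed for pose estimation by synthesizing a new view image, which creatively augments the pose in 3D space while obtaining a new augmented 2D image. Experiments on KITTI demonstrate that our depth estimation achieves state-of-the-art performance and even surpasses recent approaches that utilize other auxiliary tasks. Our visual odometry outperforms all recent unsupervised monocular learning-based methods and achieves performance competitive with the geometry-based method ORB-SLAM2 with back-end optimization.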
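In this paper, the fundamental problem of robust visual odometry (VO) is addressed by incorporating geometry-based methods into a deep learning architecture in a self-supervised manner. Generally, pure geometry-based algorithms are less robust than deep learning in feature-point extraction and matching, but perform well in ego-motion estimation thanks to their well-established geometric theory. In this work, a novel optical flow network (PANET) built on a position-aware mechanism is first proposed. Then, a novel system that jointly estimates depth, optical flow, and ego-motion without a typical network to learn ego-motion is proposed. The key component of the proposed system is an improved bundle adjustment module containing multiple sampling, initialization of ego-motion, dynamic damping factor adjustment, and Jacobi matrix weighting. In addition, a novel relative photometric loss function is proposed to improve the depth estimation accuracy. The experiments show that the proposed system not only outperforms other state-of-the-art learning-based methods in depth, flow, and VO estimation, but also significantly improves robustness compared with geometry-based, learning-based, and hybrid VO systems. Further experiments show that our model achieves outstanding generalization ability and performance in challenging indoor (TMU-RGBD) and outdoor (KAIST) scenes.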
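Remarkable progress has been made in self-supervised monocular depth estimation (SS-MDE) by exploring cross-view consistency, e.g., photometric consistency and 3D point cloud consistency. However, these are very vulnerable to illumination variance, occlusions, texture-less regions, and moving objects, making them not robust enough to deal with various scenes. To address this challenge, we study two kinds of robust cross-view consistency in this paper. First, the spatial offset field between adjacent frames is obtained by reconstructing the reference frame from its neighbors via deformable alignment, which is used to align the temporal depth features via a Depth Feature Alignment (DFA) loss. Second, the 3D point clouds of each reference frame and its nearby frames are calculated and transformed into voxel space, where the point density in each voxel is calculated and aligned via a Voxel Density Alignment (VDA) loss. In this way, we exploit the temporal coherence in both depth feature space and 3D voxel space for SS-MDE, shifting the "point-to-point" alignment paradigm to a "region-to-region" one. Compared with the photometric consistency loss and the rigid point cloud alignment loss, the proposed DFA and VDA losses are more robust owing to the strong representation power of deep features and the high tolerance of voxel density to the aforementioned challenges. Experimental results on several outdoor benchmarks show that our method outperforms current state-of-the-art techniques. Extensive ablation studies and analysis validate the effectiveness of the proposed losses, especially in challenging scenes. The code and models are available at https://github.com/sunnyhelen/rcvc-depth.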
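The voxel-density comparison can be sketched with a hard voxel grid. This is a simplified illustration only: it assumes non-empty point clouds already expressed in a common coordinate frame, it normalises counts by the number of points, and its hard counting is not differentiable, so it should not be read as the paper's VDA loss. All names are hypothetical.

```python
import math
from collections import Counter


def voxel_density(points, voxel_size):
    """Fraction of points falling into each occupied voxel.

    points: non-empty list of (x, y, z) tuples; voxel_size: edge length.
    """
    counts = Counter(
        tuple(math.floor(coord / voxel_size) for coord in point)
        for point in points
    )
    return {voxel: n / len(points) for voxel, n in counts.items()}


def density_alignment_loss(points_a, points_b, voxel_size):
    """Sum of absolute per-voxel density differences between two clouds."""
    density_a = voxel_density(points_a, voxel_size)
    density_b = voxel_density(points_b, voxel_size)
    voxels = set(density_a) | set(density_b)
    return sum(
        abs(density_a.get(v, 0.0) - density_b.get(v, 0.0)) for v in voxels
    )
```

Because points are compared per voxel rather than one by one, small displacements inside a voxel leave the value unchanged, which is the region-to-region tolerance the abstract describes.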
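Self-supervised monocular depth estimation has seen significant progress in recent years, especially in outdoor environments. However, depth prediction results are not satisfying in indoor scenes, where most of the existing data are captured with hand-held devices. Compared with outdoor environments, estimating the depth of monocular videos for indoor environments using self-supervised methods raises two additional challenges: (i) the depth range of indoor video sequences varies a lot across different frames, making it difficult for the depth network to induce consistent depth cues for training; (ii) indoor sequences recorded with hand-held devices often contain much more rotational motion, which makes it difficult for the pose network to predict accurate relative camera poses. In this work, we propose a novel framework, MonoIndoor++, by giving special consideration to these challenges and consolidating a set of good practices for improving the performance of self-supervised monocular depth estimation for indoor environments. First, a depth factorization module with a transformer-based scale regression network is proposed to estimate a global depth scale factor explicitly, and the predicted scale factor can indicate the maximum depth value. Second, rather than using a single-stage pose estimation strategy as in previous methods, we propose to utilize a residual pose estimation module to estimate relative camera poses across consecutive frames iteratively. Third, to incorporate extensive coordinate guidance for our residual pose estimation module, we propose to perform coordinate convolutional encoding directly over the inputs to the pose networks. The proposed method is validated on a variety of benchmark indoor datasets, i.e., EuRoC MAV, NYUv2, ScanNet, and 7-Scenes, demonstrating state-of-the-art performance.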
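Coordinate encoding of a network input is commonly done by appending channels that hold each pixel's normalised position. The sketch below shows that general idea on nested lists. The normalisation to the range [-1, 1] and the function name are assumptions, and the abstract does not say this is how the pose network inputs are encoded.

```python
def add_coord_channels(image):
    """Append normalised x and y coordinate channels to an image.

    image: list of channels, each a height x width nested list.
    Returns a new list with two extra channels (x first, then y), where
    coordinates run from -1.0 at the first pixel to 1.0 at the last.
    """
    height = len(image[0])
    width = len(image[0][0])

    def normalise(index, size):
        # A single row or column has no extent, so place it at the centre.
        return 2.0 * index / (size - 1) - 1.0 if size > 1 else 0.0

    x_channel = [[normalise(x, width) for x in range(width)] for _ in range(height)]
    y_channel = [[normalise(y, height) for _ in range(width)] for y in range(height)]
    return image + [x_channel, y_channel]
```

A convolution applied to the extended input can then condition on where in the image a feature lies, which plain convolutions cannot do because they are translation-equivariant.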
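Single-view depth estimation using CNNs trained from unlabelled videos has shown significant promise. However, excellent results have mostly been obtained in street-scene driving scenarios, and such methods often fail in other settings, particularly indoor videos taken by handheld devices. In this work, we establish that the complex ego-motions exhibited in handheld settings are a critical obstacle to learning depth. Our fundamental analysis suggests that rotation behaves as noise during training, as opposed to translation (baseline), which provides supervision signals. To address the challenge, we propose a data pre-processing method that rectifies training images by removing their relative rotations for effective learning. The significantly improved performance validates our motivation. Towards end-to-end learning without requiring pre-processing, we propose an Auto-Rectify Network with novel loss functions, which can automatically learn to rectify images during training. Consequently, our results outperform the previous unsupervised SOTA method by a large margin on the challenging NYUv2 dataset. We also demonstrate the generalization of our trained model on ScanNet and Make3D, and the universality of our proposed learning method on the 7-Scenes and KITTI datasets.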
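Supervised learning depth estimation methods can achieve good performance when trained on high-quality ground truth, such as LiDAR data. However, LiDAR can only generate sparse 3D maps, which causes information loss. High-quality per-pixel ground-truth depth data is difficult to acquire. To overcome this limitation, we propose a novel approach that combines structure information from a promising Plane and Parallax geometry pipeline with depth information in a U-Net supervised learning network, which results in quantitative and qualitative improvement compared with existing popular learning-based methods. In particular, the model is evaluated on two large-scale and challenging datasets, the KITTI Vision Benchmark and the Cityscapes dataset, and achieves the best performance in terms of relative error. Compared with pure depth supervision models, our model performs impressively on depth prediction of thin objects and edges, and compared with the structure prediction baseline, our model performs more robustly.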
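Recently, self-supervised learning techniques have been applied to compute depth and ego-motion from monocular videos, achieving remarkable performance in autonomous driving scenarios. One widely adopted assumption of self-supervised depth and ego-motion learning is that the image brightness remains constant within nearby frames. Unfortunately, endoscopic scenes do not meet this assumption, because illumination variations, non-Lambertian reflections, and interreflections during data collection induce severe brightness fluctuations, and these fluctuations inevitably degrade the depth and ego-motion estimation accuracy. In this work, we introduce a novel concept referred to as appearance flow to address the brightness inconsistency problem. The appearance flow takes into consideration any variations in the brightness pattern and enables us to develop a generalized dynamic image constraint. Furthermore, we build a unified self-supervised framework to estimate monocular depth and ego-motion simultaneously in endoscopic scenes, which comprises a structure module, a motion module, an appearance module, and a correspondence module, to accurately reconstruct the appearance and calibrate the image brightness. Extensive experiments are conducted on the SCARED dataset and the EndoSLAM dataset, and the proposed unified framework exceeds other self-supervised approaches by a large margin. To validate our framework's generalization ability on different patients and cameras, we train our model on SCARED but test it on the SERV-CT and Hamlyn datasets without any fine-tuning, and the superior results reveal its strong generalization ability. Code will be available at: https://github.com/shuweishao/af-sfmlearner.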
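One of the key elements of reconstructing a 3D mesh from a monocular video is generating a depth map for every frame. However, in the application of colonoscopy video reconstruction, producing good-quality depth estimation is challenging. Neural networks can be easily fooled by photometric distractions or fail to capture the complex shape of the colon surface, predicting defective shapes that result in broken meshes. Aiming to fundamentally improve the depth estimation quality for colonoscopy 3D reconstruction, in this work we design a set of training losses to deal with the special challenges of colonoscopy data. For better training, a set of geometric consistency objectives is developed using both depth and surface normal information. In addition, the classic photometric loss is extended with feature matching to compensate for illumination noise. With sufficiently powerful training losses, our self-supervised framework, named COLLE, is able to produce better depth maps of colonoscopy data than previous work that utilizes prior depth knowledge. Used in reconstruction, our network is able to reconstruct high-quality colon meshes in real time without any post-processing, making it the first to be clinically applicable.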