We propose DeepV2D, an end-to-end differentiable deep learning architecture for predicting depth from video sequences. We incorporate elements of classical structure from motion into an end-to-end trainable pipeline by designing a set of differentiable geometric modules. Our full system alternates between predicting depth and refining camera pose. We estimate depth by building a cost volume over learned features and apply a multi-scale 3D convolutional network for stereo matching. The predicted depth is then sent to a motion module, which performs iterative pose updates by mapping optical flow to camera motion updates. We evaluate our proposed system on the NYU, KITTI, and SUN3D datasets and show better results than monocular baselines as well as deep and classical stereo reconstruction.
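As a rough, hedged illustration of the cost-volume step described above (this is not the authors' code; the rectified-stereo simplification, PyTorch, the tensor shapes, and the function name are all assumptions), a volume of feature differences over disparity hypotheses could be built like this:

```python
import torch

def stereo_cost_volume(feat_ref, feat_src, num_disp):
    """Toy plane-sweep cost volume over learned features (rectified stereo).

    feat_ref, feat_src: [B, C, H, W] feature maps of the two views.
    Returns a [B, C, num_disp, H, W] volume of feature differences,
    the kind of input a multi-scale 3D convolutional network can score.
    """
    b, c, h, w = feat_ref.shape
    volume = feat_ref.new_zeros(b, c, num_disp, h, w)
    for d in range(num_disp):
        if d == 0:
            volume[:, :, d] = feat_ref - feat_src
        else:
            # shift the source features by d pixels (disparity hypothesis d)
            volume[:, :, d, :, d:] = feat_ref[..., d:] - feat_src[..., :-d]
    return volume
```

DeepV2D itself, as we read the abstract, sweeps over depth hypotheses with full camera reprojection between video frames rather than 1D disparity shifts; the sketch only conveys the general idea.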
We propose a novel direct sparse visual odometry formulation. It combines a fully direct probabilistic model (minimizing a photometric error) with consistent, joint optimization of all model parameters, including the geometry, represented as inverse depth in a reference frame, and the camera motion. This is achieved in real time by omitting the smoothness prior used in other direct methods and instead sampling pixels evenly throughout the images. Since our method does not depend on keypoint detectors or descriptors, it can naturally sample pixels from all image regions that have intensity gradient, including edges or smooth intensity variations on mostly white walls. The proposed model integrates a full photometric calibration, accounting for exposure time, lens vignetting, and non-linear response functions. We thoroughly evaluate our method on three different datasets comprising several hours of video. The experiments show that, in terms of both tracking accuracy and robustness, the proposed approach significantly outperforms state-of-the-art direct and indirect methods in a variety of real-world settings.
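For orientation, the photometric energy minimized by a direct sparse method of this kind has roughly the following shape (a sketch in our own notation, not copied from the paper):

$$E = \sum_{i}\sum_{\mathbf{p}\in P_i}\sum_{j\in\mathrm{obs}(\mathbf{p})} \omega_{\mathbf{p}} \left\| \big(I_j[\mathbf{p}'] - b_j\big) - \frac{t_j e^{a_j}}{t_i e^{a_i}} \big(I_i[\mathbf{p}] - b_i\big) \right\|_{\gamma}, \qquad \mathbf{p}' = \Pi\big(\mathbf{R}\,\Pi^{-1}(\mathbf{p}, d_{\mathbf{p}}) + \mathbf{t}\big),$$

where $t_i, t_j$ are exposure times, $a_i, b_i$ affine brightness parameters, $d_{\mathbf{p}}$ the inverse depth of pixel $\mathbf{p}$, $\Pi$ the camera projection, and $\|\cdot\|_{\gamma}$ a robust (Huber) norm; geometry and camera motion enter the energy jointly and are optimized together.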
Direct image-to-image alignment that relies on the optimization of photometric error metrics suffers from a limited convergence range and sensitivity to lighting conditions. Deep learning approaches have been applied to address this problem by learning better feature representations with convolutional neural networks, yet they still require a good initialization. In this paper, we demonstrate that the inaccurate numerical Jacobian limits the convergence range, which can be improved greatly using learned approaches. Based on this observation, we propose a novel end-to-end network, RegNet, that learns to optimize image-to-image pose registration. By jointly learning a feature representation for each pixel and the partial derivatives that replace handcrafted components (e.g., numerical differentiation) in the optimization step, the neural network facilitates end-to-end optimization. The energy landscape is constrained by both the feature representation and the learned Jacobian, providing more flexibility for the optimization and thereby leading to more robust and faster convergence. In a series of experiments, including a broad ablation study, we demonstrate that RegNet is able to converge for large-baseline image pairs with fewer iterations.
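To make the role of the learned Jacobian concrete, the underlying update can be sketched as follows (our notation and an assumed Levenberg-style damping $\lambda$, not necessarily the paper's exact formulation):

$$\mathbf{r}(\boldsymbol{\xi}) = F_t\big(W(\mathbf{x};\boldsymbol{\xi})\big) - F_s(\mathbf{x}), \qquad \Delta\boldsymbol{\xi} = -\big(\mathbf{J}^\top \mathbf{J} + \lambda\mathbf{I}\big)^{-1}\mathbf{J}^\top\mathbf{r}(\boldsymbol{\xi}),$$

where $F_s, F_t$ are learned per-pixel feature maps of the two images, $W$ warps pixels $\mathbf{x}$ according to the pose $\boldsymbol{\xi}$, and $\mathbf{J} \approx \partial\mathbf{r}/\partial\boldsymbol{\xi}$ is predicted by the network rather than approximated by numerical differentiation.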
We propose a novel Large-Scale Direct SLAM algorithm for stereo cameras (Stereo LSD-SLAM) that runs in real-time at high frame rate on standard CPUs. In contrast to sparse interest-point based methods, our approach aligns images directly based on the photoconsistency of all high-contrast pixels, including corners, edges and high texture areas. It concurrently estimates the depth at these pixels from two types of stereo cues: static stereo through the fixed-baseline stereo camera setup as well as temporal multi-view stereo exploiting the camera motion. By incorporating both disparity sources, our algorithm can even estimate depth of pixels that are under-constrained when only using fixed-baseline stereo. Using a fixed baseline, on the other hand, avoids the scale-drift that typically occurs in pure monocular SLAM. We furthermore propose a robust approach to enforce illumination invariance, capable of handling aggressive brightness changes between frames, greatly improving the performance in realistic settings. In experiments, we demonstrate state-of-the-art results on stereo SLAM benchmarks such as KITTI as well as challenging datasets from the EuRoC Challenge 3 for micro aerial vehicles.
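The illumination invariance mentioned above is commonly enforced with an affine brightness model estimated alongside the camera motion; a hedged sketch of such a residual (our notation, not necessarily the paper's exact formulation):

$$r(\mathbf{p}) = I_{\mathrm{ref}}(\mathbf{p}) - \big(a\,I_{\mathrm{new}}(\mathbf{p}') + b\big),$$

where the gain $a$ and bias $b$ are optimized jointly with the pose, so that aggressive global brightness changes between frames do not corrupt the photoconsistency term.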
Robust estimation of correspondences between image pixels is an important problem in robotics, with applications in tracking, mapping, and recognition of objects, environments, and other agents. Correspondence estimation has long been the domain of hand-engineered features, but more recently deep learning techniques have provided powerful tools for learning features from raw data. The drawback of the latter approach is that a vast amount of (typically labelled) training data is required for learning. This paper advocates a new approach to learning visual descriptors for dense correspondence estimation in which we harness the power of a strong 3D generative model to automatically label correspondences in RGB-D video data. A fully-convolutional network is trained using a contrastive loss to produce viewpoint- and lighting-invariant descriptors. As a proof of concept, we collected two datasets: the first depicts the upper torso and head of the same person in widely varied settings, and the second depicts an office as seen on multiple days with objects rearranged within. Our datasets focus on re-visitation of the same objects and environments, and we show that by training the CNN only from local tracking data, our learned visual descriptor generalizes towards identifying non-labelled correspondences across videos. We furthermore show that our approach to descriptor learning can be used to achieve state-of-the-art single-frame localization results on the MSR 7-scenes dataset without using any labels identifying correspondences between separate videos of the same scenes at training time.
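The contrastive loss mentioned above is standard; a minimal sketch in PyTorch (the function name, margin value, and sampling of pixel pairs are our assumptions):

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(desc_a, desc_b, is_match, margin=0.5):
    """Contrastive loss over paired pixel descriptors.

    desc_a, desc_b: [N, D] descriptors sampled at corresponding
    (or deliberately non-corresponding) pixels of two frames.
    is_match: [N] float tensor, 1 for true correspondences, 0 otherwise.
    """
    dist = torch.norm(desc_a - desc_b, dim=1)            # Euclidean distance
    pos = is_match * dist.pow(2)                          # pull matches together
    neg = (1 - is_match) * F.relu(margin - dist).pow(2)   # push non-matches apart
    return (pos + neg).mean()
```

Training a fully-convolutional network with such a loss on automatically labelled RGB-D correspondences is what drives the descriptors toward viewpoint and lighting invariance.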
We propose Stereo Direct Sparse Odometry (Stereo DSO) as a novel method for highly accurate real-time visual odometry estimation of large-scale environments from stereo cameras. It jointly optimizes for all the model parameters within the active window, including the intrinsic/extrinsic camera parameters of all keyframes and the depth values of all selected pixels. In particular, we propose a novel approach to integrate constraints from static stereo into the bundle adjustment pipeline of temporal multi-view stereo. Real-time optimization is realized by sampling pixels uniformly from image regions with sufficient intensity gradient. Fixed-baseline stereo resolves scale drift. It also reduces the sensitivities to large optical flow and to rolling shutter effect which are known shortcomings of direct image alignment methods. Quantitative evaluation demonstrates that the proposed Stereo DSO outperforms existing state-of-the-art visual odometry methods both in terms of tracking accuracy and robustness. Moreover, our method delivers a more precise metric 3D reconstruction than previous dense/semi-dense direct approaches while providing a higher reconstruction density than feature-based methods.
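As we read the abstract, integrating static stereo into the temporal bundle adjustment amounts to a coupled energy of roughly this shape (a sketch; the coupling weight $\lambda$ and the notation are our assumptions):

$$E = \sum_{i}\sum_{\mathbf{p}\in P_i}\Big(\sum_{j\in\mathrm{obs}(\mathbf{p})} E^{\mathrm{temporal}}_{\mathbf{p}j} + \lambda\, E^{\mathrm{static}}_{\mathbf{p}}\Big),$$

where $E^{\mathrm{temporal}}_{\mathbf{p}j}$ are photometric residuals of point $\mathbf{p}$ across keyframes, $E^{\mathrm{static}}_{\mathbf{p}}$ is the fixed-baseline residual between the left and right images, and $\lambda$ balances the two disparity sources; the fixed-baseline terms anchor the metric scale.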
Recovering structure and motion parameters given an image pair or a sequence of images is a well-studied problem in computer vision. It is usually solved by employing structure-from-motion (SfM) or simultaneous localization and mapping (SLAM) algorithms, depending on real-time requirements. Recently, with the advent of convolutional neural networks (CNNs), researchers have explored the possibility of using machine learning techniques to reconstruct the 3D structure of a scene and jointly predict the camera pose. In this work, we present a framework that achieves state-of-the-art performance on single-image depth prediction for both indoor and outdoor scenes. The depth prediction system is then extended to predict optical flow, and eventually the camera pose, and is trained end-to-end. Our motion estimation framework outperforms previous motion prediction systems, and we also demonstrate that state-of-the-art metric depth estimation can be improved further using this knowledge.
RGB-D cameras have a limited working range and have difficulty accurately measuring depth information at long distances. Moreover, RGB-D cameras are easily affected by strong illumination and other external factors, which leads to poor accuracy of the acquired environmental depth information. Recently, deep learning techniques have achieved great success in the field of visual SLAM; they can learn high-level features directly from visual input and improve the accuracy of estimated depth information. Deep learning techniques therefore have the potential to extend the sources of depth information. However, existing deep-learning-based methods are mainly supervised and require large amounts of ground-truth depth data, which are difficult to acquire due to practical constraints. In this paper, we first propose an unsupervised learning framework that not only uses image reconstruction for supervision but also exploits pose estimation to enhance the supervision signal, adding training constraints for the monocular camera motion estimation task. Furthermore, we successfully exploit our unsupervised learning framework to assist the traditional ORB-SLAM system when its initialization module fails to match enough features. Qualitative and quantitative experiments show that our unsupervised learning framework performs the depth estimation task comparably to supervised methods and outperforms the previous state-of-the-art method by $13.5\%$ on the KITTI dataset. Moreover, our unsupervised learning framework can significantly speed up the initialization process of the ORB-SLAM system and effectively improve the accuracy of environment mapping in strong-illumination and weak-texture scenes.
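A minimal sketch of the image-reconstruction supervision such unsupervised frameworks build on (PyTorch; all names, shapes, and the L1 penalty are our assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def view_synthesis_loss(tgt_img, src_img, depth, pose, K):
    """Photometric loss for unsupervised depth and pose training.

    tgt_img, src_img: [B, 3, H, W] target and source frames.
    depth: [B, 1, H, W] predicted depth for the target view.
    pose: [B, 3, 4] predicted relative transform (target -> source).
    K: [B, 3, 3] camera intrinsics.
    """
    b, _, h, w = tgt_img.shape
    # Homogeneous pixel grid of the target view
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)
    pix = pix.float().view(1, 3, -1)                       # [1, 3, H*W]
    # Back-project to 3D, move into the source frame, re-project
    cam = (torch.inverse(K) @ pix) * depth.view(b, 1, -1)  # [B, 3, H*W]
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)
    src_pix = K @ (pose @ cam_h)
    src_pix = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)
    # Normalize to [-1, 1] and warp the source image onto the target view
    grid = torch.stack([2 * src_pix[:, 0] / (w - 1) - 1,
                        2 * src_pix[:, 1] / (h - 1) - 1], dim=2)
    warped = F.grid_sample(src_img, grid.view(b, h, w, 2), align_corners=True)
    return (warped - tgt_img).abs().mean()                 # L1 photometric error
```

Minimizing this warping error jointly constrains the predicted depth and the predicted camera motion, which is what allows training without ground-truth depth.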
In this paper, we formulate structure from motion as a learning problem. We train a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs. The architecture is composed of multiple stacked encoder-decoder networks, the core part being an iterative network that is able to improve its own predictions. The network estimates not only depth and motion, but additionally surface normals, optical flow between the images, and confidence of the matching. A crucial component of the approach is a training loss based on spatial relative differences. Compared to traditional two-frame structure-from-motion methods, the results are more accurate and more robust. In contrast to popular single-image depth networks, DeMoN learns the concept of matching and therefore generalizes better to structures not seen during training.
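The loss "based on spatial relative differences" can be sketched as a scale-invariant gradient penalty (a hedged reconstruction from the abstract; the exact spacings $h$ and the normalization are assumptions):

$$g_h[f](i,j) = \left(\frac{f(i{+}h,j)-f(i,j)}{|f(i{+}h,j)|+|f(i,j)|},\ \frac{f(i,j{+}h)-f(i,j)}{|f(i,j{+}h)|+|f(i,j)|}\right)^{\!\top}, \qquad L = \sum_{h}\sum_{i,j}\big\| g_h[\hat{d}](i,j) - g_h[d](i,j) \big\|_2,$$

comparing discrete gradients of the predicted depth $\hat{d}$ and the ground truth $d$ at several spacings $h$, which penalizes relative rather than absolute depth differences.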
We present a system for keyframe-based dense camera tracking and depth map estimation that is entirely learned. For tracking, we estimate small pose increments between the current camera image and a synthetic viewpoint. This significantly simplifies the learning problem and alleviates the dataset bias for camera motions. Furthermore, we show that generating a large number of pose hypotheses leads to more accurate predictions. For mapping, we accumulate information in a cost volume centered on the current depth estimate. The mapping network then combines the cost volume and the keyframe image to update the depth prediction, thereby effectively exploiting depth measurements and image-based priors. Our approach yields state-of-the-art results with few images and is robust with respect to noisy camera poses. We demonstrate that the performance of our 6-DOF tracking is competitive with RGB-D tracking algorithms, and that we compare favorably against strong classic and deep-learning-powered dense depth algorithms.
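The "cost volume centered on the current depth estimate" can be pictured as a narrow band of hypotheses; a toy sketch (PyTorch; names and the multiplicative band are our assumptions):

```python
import torch

def narrow_band_hypotheses(depth_est, num_hyp=32, band_width=0.5):
    """Depth hypotheses centered on the current estimate.

    depth_est: [B, 1, H, W] current depth map. Returns [B, num_hyp, H, W]
    candidate depths spanning +/- band_width (relative) around the estimate,
    so the cost volume refines the estimate instead of searching the whole
    depth range from scratch.
    """
    offsets = torch.linspace(-band_width, band_width, num_hyp)
    return depth_est * (1.0 + offsets.view(1, -1, 1, 1))
```

Matching costs accumulated at these hypotheses over several frames, together with the keyframe image, are what the mapping network consumes.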
RGB-D cameras that can provide rich 2D visual and 3D depth information are well suited to the motion estimation of indoor mobile robots. In recent years, several RGB-D visual odometry methods that process data from the sensor in different ways have been proposed. This paper first presents a brief review of recently proposed RGB-D visual odometry methods, and then presents a detailed analysis and comparison of eight state-of-the-art real-time 6DOF motion estimation methods in a variety of challenging scenarios, with a special emphasis on the trade-off between accuracy, robustness and computation speed. An experimental comparison is conducted using publicly available benchmark datasets and author-collected datasets in various scenarios, including long corridors, illumination changing environments and fast motion scenarios. Experimental results present both quantitative and qualitative differences between these methods and provide some guidelines on how to choose the right algorithm for an indoor mobile robot according to the quality of the RGB-D data and environmental characteristics.
Visual SLAM (Simultaneous Localization and Mapping) methods typically rely on handcrafted visual features or raw RGB values for establishing correspondences between images. These features, while suitable for sparse mapping, often lead to ambiguous matches in texture-less regions when performing dense reconstruction due to the aperture problem. In this work, we explore the use of learned features for the matching task in dense monocular reconstruction. We propose a novel convolutional neural network (CNN) architecture along with a deeply supervised feature learning scheme for pixel-wise regression of visual descriptors from an image which are best suited for dense monocular SLAM. In particular, our learning scheme minimizes a multi-view matching cost-volume loss with respect to the regressed features at multiple stages within the network, for explicitly learning contextual features that are suitable for dense matching between images captured by a moving monocular camera along the epipolar line. We integrate the learned features from our model for depth estimation inside a real-time dense monocular SLAM framework, where photometric error is replaced by our learned descriptor error. Our extensive evaluation on several challenging indoor datasets demonstrates greatly improved accuracy in dense reconstructions of well-celebrated dense SLAM systems like DTAM, without compromising their real-time performance.
In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach in which a single convolutional neural network plays a dual role: it is simultaneously a dense feature descriptor and a feature detector. By postponing the detection to a later stage, the obtained keypoints are more stable than their traditional counterparts based on early detection of low-level structures. We show that this model can be trained using pixel correspondences extracted from readily available, large-scale SfM reconstructions, without any further annotation. The proposed method obtains state-of-the-art performance on the difficult Aachen Day-Night localization dataset and the InLoc indoor localization benchmark, as well as competitive performance on other benchmarks for image matching and 3D reconstruction.
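A hedged sketch of what "detection postponed to a later stage" can look like on a dense feature map (PyTorch; the 3x3 window, the ratio-to-max, and all names are assumptions; the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def soft_detection_score(feat):
    """Describe-then-detect keypoint score from a dense feature map.

    feat: [B, C, H, W]. Combines a local spatial softness (how much an
    activation stands out in its 3x3 neighborhood) with a channel
    ratio-to-max, then normalizes the result into a detection map.
    """
    exp = feat.exp()
    alpha = exp / (F.avg_pool2d(exp, 3, stride=1, padding=1) * 9)  # spatial
    beta = feat / feat.amax(dim=1, keepdim=True).clamp(min=1e-6)   # channel
    score = (alpha * beta).amax(dim=1)                             # [B, H, W]
    return score / score.sum(dim=(1, 2), keepdim=True).clamp(min=1e-6)
```

Because the score is a differentiable function of the descriptor map, detector and descriptor can be trained together from SfM correspondences.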
This paper addresses the challenge of dense pixel correspondence estimation between two images. The problem is closely related to the task of optical flow estimation, where ConvNets (CNNs) have recently achieved significant progress. While optical flow methods produce very accurate results for small pixel displacements and limited appearance variation, they can hardly deal with the strong geometric transformations that we consider in this work. In this paper, we propose a coarse-to-fine CNN-based framework that can leverage the advantages of optical flow approaches and extend them to the case of large transformations, providing dense and subpixel-accurate estimates. It is trained on synthetic transformations and demonstrates very good performance on unseen, realistic data. Furthermore, we apply our method to the problem of relative camera pose estimation and demonstrate that the model outperforms existing dense approaches.
In this paper, we provide a modern synthesis of the classic inverse compositional algorithm for dense image alignment. We first discuss the assumptions made by this well-established technique, and subsequently propose to relax these assumptions by incorporating data-driven priors into the model. More specifically, we unroll a robust version of the inverse compositional algorithm and replace multiple components of this algorithm with more expressive models, which are trained end-to-end from data. Our experiments on several challenging 3D rigid motion estimation tasks demonstrate the advantages of combining optimization with learning-based techniques, outperforming the classic inverse compositional algorithm as well as data-driven image-to-pose regression approaches.
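For reference, the classic inverse compositional update that the paper unrolls looks roughly like this (textbook form, in our notation):

$$\Delta\boldsymbol{\xi} = \big(\mathbf{J}^\top \mathbf{W} \mathbf{J}\big)^{-1} \mathbf{J}^\top \mathbf{W}\,\mathbf{r}(\boldsymbol{\xi}), \qquad W(\mathbf{x};\boldsymbol{\xi}) \leftarrow W(\mathbf{x};\boldsymbol{\xi}) \circ W(\mathbf{x};\Delta\boldsymbol{\xi})^{-1},$$

where the Jacobian $\mathbf{J}$ is computed once on the template at the identity warp (the efficiency trick of the inverse compositional formulation) and $\mathbf{W}$ robustly weights the residuals $\mathbf{r}$. As we read the abstract, the data-driven version replaces components such as the feature representation and the robust weighting with learned modules trained end-to-end.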
We present a self-supervised learning framework that uses unlabeled monocular video sequences to generate large-scale supervision for training a Visual Odometry (VO) frontend, a network that computes pointwise data associations across images. Our self-improving approach enables a VO frontend to learn over time, unlike other VO and SLAM systems, which require time-consuming hand-tuning or expensive data collection to adapt to new environments. Our proposed frontend operates on monocular images and consists of a single multi-task convolutional neural network that outputs 2D keypoint locations, keypoint descriptors, and a novel point stability score. We use the output of VO to create a self-supervised dataset of point correspondences to retrain the frontend. When trained at scale for VO using 2.5 million monocular images from ScanNet, the stability classifier automatically discovers a ranking of keypoints that are not likely to be helpful for VO, such as T-junctions across depth discontinuities, features on shadows and highlights, and dynamic objects such as people. The resulting frontend outperforms both traditional methods (SIFT, ORB, AKAZE) and deep learning methods (SuperPoint and LF-Net) on the 3D-to-2D pose estimation task on ScanNet.
Large-scale point clouds generated from 3D sensors are more accurate than their image-based counterparts. However, they are seldom used for visual pose estimation because of the difficulty of obtaining 2D-3D image-to-point-cloud correspondences. In this paper, we propose 2D3D-MatchNet, an end-to-end deep network architecture that jointly learns descriptors for 2D keypoints from images and 3D keypoints from point clouds. As a result, we are able to directly match the established 2D-3D correspondences between a query image and a 3D point cloud reference map for visual pose estimation. We create the Oxford 2D-3D Patches dataset from the Oxford RobotCar dataset, with ground-truth camera poses and 2D-3D image-to-point-cloud correspondences, for training and testing the deep network. Experimental results verify the feasibility of our approach.
We propose DeepMapping, a novel registration framework that uses deep neural networks (DNNs) as auxiliary functions to align multiple point clouds from scratch into a globally consistent frame. We use DNNs to model the highly non-convex mapping process, which traditionally involves hand-crafted data association, sensor pose initialization, and global refinement. Our key novelty is that properly defining unsupervised losses to "train" these DNNs through back-propagation is equivalent to solving the underlying registration problem, while depending less on a good initialization than ICP does. Our framework contains two DNNs: a localization network that estimates the poses of the input point clouds, and a map network that models the scene structure by estimating the occupancy status of global coordinates. This allows us to convert the registration problem into a binary occupancy classification problem, which can be solved efficiently using gradient-based optimization. We further show that DeepMapping can be readily extended to address the LiDAR SLAM problem by imposing geometric constraints between consecutive point clouds. Experiments are conducted on both simulated and real datasets. Qualitative and quantitative comparisons demonstrate that DeepMapping often enables more robust and accurate global registration of multiple point clouds than existing techniques. Our code is available at http://ai4ce.github.io/DeepMapping/.
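A minimal sketch of the occupancy-classification loss idea (PyTorch; all names, the point sampling, and the logit interface of the map network are our assumptions):

```python
import torch
import torch.nn.functional as F

def occupancy_registration_loss(map_net, points_global, free_space_global):
    """Unsupervised registration loss via binary occupancy classification.

    map_net: network mapping [N, 3] global coordinates to occupancy logits.
    points_global: sensor hits after applying the estimated poses (label 1).
    free_space_global: points sampled along the rays before the hits (label 0).
    Only globally consistent poses make the two classes separable, so
    minimizing this loss also trains the localization network.
    """
    logits_occ = map_net(points_global)
    logits_free = map_net(free_space_global)
    return (F.binary_cross_entropy_with_logits(
                logits_occ, torch.ones_like(logits_occ))
            + F.binary_cross_entropy_with_logits(
                logits_free, torch.zeros_like(logits_free)))
```

Gradients flow through the occupancy predictions into the estimated poses, which is how back-propagation "solves" the registration.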
This paper proposes a data-driven approach for image alignment. Our main contribution is a novel network architecture that combines the strengths of convolutional neural networks (CNNs) and the Lucas-Kanade algorithm. The main component of this architecture is a Lucas-Kanade layer that performs the inverse compositional algorithm on convolutional feature maps. To train our network, we develop a cascaded feature learning method that incorporates the coarse-to-fine strategy into the training process. This method learns a pyramid representation of convolutional features in a cascaded manner and yields a cascaded network that performs coarse-to-fine alignment on the feature pyramids. We apply our model to the task of homography estimation, and perform training and evaluation on a large labeled dataset generated from the MS-COCO dataset. Experimental results show that the proposed approach significantly outperforms the other methods.
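A toy version of one such Lucas-Kanade step on feature maps, restricted to a pure translation warp for brevity (PyTorch; the integer shifting, forward-difference gradients, and all names are our simplifying assumptions; the paper's layer handles homographies with bilinear sampling):

```python
import torch

def lk_translation_step(feat_tpl, feat_img, uv):
    """One inverse compositional Lucas-Kanade step, translation-only warp.

    feat_tpl, feat_img: [C, H, W] template and image feature maps.
    uv: current (u, v) translation estimate, a length-2 float tensor.
    """
    u, v = int(uv[0].round()), int(uv[1].round())
    shifted = torch.roll(feat_img, shifts=(-v, -u), dims=(1, 2))
    r = (shifted - feat_tpl).reshape(-1, 1)              # residuals [CHW, 1]
    # Template gradients: in the inverse compositional scheme the Jacobian
    # is computed once on the template, not re-evaluated every iteration.
    gx = torch.zeros_like(feat_tpl)
    gx[:, :, 1:] = feat_tpl[:, :, 1:] - feat_tpl[:, :, :-1]
    gy = torch.zeros_like(feat_tpl)
    gy[:, 1:, :] = feat_tpl[:, 1:, :] - feat_tpl[:, :-1, :]
    J = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=1)   # [CHW, 2]
    H = J.T @ J + 1e-6 * torch.eye(2)                    # Gauss-Newton system
    delta = torch.linalg.solve(H, J.T @ r).squeeze(1)
    return uv - delta   # inverse composition: subtract the increment
```

Cascading such steps over a feature pyramid gives the coarse-to-fine alignment the paper trains end-to-end.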
We present an algorithm for real-time 6DOF pose tracking of rigid 3D objects using a monocular RGB camera. The key idea is to derive a region-based cost function using temporally consistent local color histograms. While such region-based cost functions are commonly optimized with first-order gradient descent techniques, we systematically derive a Gauss-Newton optimization scheme, which yields drastically faster convergence together with highly accurate and robust tracking performance. We also propose a novel, challenging dataset for the task of monocular object pose tracking and make it publicly available to the community. To the best of our knowledge, it is the first to address the common and important scenario in which both the camera and the objects move simultaneously in cluttered scenes. In numerous experiments, including on our own proposed dataset, we demonstrate that the proposed Gauss-Newton approach outperforms existing approaches, in particular in the presence of cluttered backgrounds, heterogeneous objects, and partial occlusions.