We propose a direct (featureless) monocular SLAM algorithm which, in contrast to current state-of-the-art direct methods, allows building large-scale, consistent maps of the environment. Along with highly accurate pose estimation based on direct image alignment, the 3D environment is reconstructed in real time as a pose graph of keyframes with associated semi-dense depth maps. These are obtained by filtering over a large number of pixelwise small-baseline stereo comparisons. The explicitly scale-drift-aware formulation allows the approach to operate on challenging sequences, including large variations in scene scale. Major enablers are two key novelties: (1) a novel direct tracking method which operates on sim(3), thereby explicitly detecting scale drift, and (2) an elegant probabilistic solution to include the effect of noisy depth values into tracking. The resulting direct monocular SLAM system runs in real time on a CPU.
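To make the sim(3) parameterization concrete, the following is a minimal numpy sketch (not LSD-SLAM's actual implementation) of a similarity transform (R, t, s) acting on 3D points and of composing two such transforms; scale drift between keyframes shows up as the accumulated scale factor s deviating from 1.

```python
import numpy as np

def sim3_apply(R, t, s, points):
    """Apply a Sim(3) transform (rotation R, translation t, scale s) to Nx3 points."""
    return s * points @ R.T + t

def sim3_compose(Ra, ta, sa, Rb, tb, sb):
    """Compose two Sim(3) transforms: first (Rb, tb, sb), then (Ra, ta, sa)."""
    return Ra @ Rb, sa * (Ra @ tb) + ta, sa * sb

# Toy example: a pure scale change of 1.05 between keyframes is exactly the
# kind of drift a sim(3) tracker can detect, while an SE(3) tracker cannot.
R = np.eye(3)
t = np.zeros(3)
pts = np.random.rand(5, 3)
print(sim3_apply(R, t, 1.05, pts) / pts)  # every coordinate scaled by 1.05
```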
(a) Stereo input: trajectory and sparse reconstruction of an urban environment with multiple loop closures. (b) RGB-D input: keyframes and dense point cloud of a room scene with one loop closure. The point cloud is rendered by backprojecting the sensor depth maps from estimated keyframe poses. No fusion is performed.
This paper presents ORB-SLAM, a feature-based monocular SLAM system that operates in real time, in small and large, indoor and outdoor environments. The system is robust to severe motion clutter, allows wide-baseline loop closing and relocalization, and includes fully automatic initialization. Building on excellent algorithms of recent years, we designed from scratch a novel system that uses the same features for all SLAM tasks: tracking, mapping, relocalization, and loop closing. A survival-of-the-fittest strategy that selects the points and keyframes of the reconstruction leads to excellent robustness and generates a compact and trackable map that only grows if the scene content changes, allowing lifelong operation. We present an exhaustive evaluation on 27 sequences from the most popular datasets. ORB-SLAM achieves unprecedented performance with respect to other state-of-the-art monocular SLAM approaches. For the benefit of the community, we make the source code public.
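As a hedged illustration of the shared-feature idea, the snippet below uses OpenCV's ORB detector and a brute-force Hamming matcher with a ratio test; the synthetic frames are placeholders, and this is not ORB-SLAM's own pipeline, which additionally maintains keyframes, a covisibility graph, and a bag-of-words database.

```python
import cv2
import numpy as np

# Placeholder frames; in practice these would be consecutive video frames.
frame1 = cv2.GaussianBlur(np.random.randint(0, 255, (480, 640), np.uint8), (3, 3), 0)
frame2 = np.roll(frame1, 5, axis=1)  # simulate a small horizontal camera motion

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(frame1, None)
kp2, des2 = orb.detectAndCompute(frame2, None)

# Brute-force Hamming matching with Lowe's ratio test to reject ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative ORB correspondences")
```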
This paper presents ORB-SLAM3, the first system able to perform visual, visual-inertial and multi-map SLAM with monocular, stereo and RGB-D cameras, using pin-hole and fisheye lens models. The first main novelty is a feature-based tightly-integrated visual-inertial SLAM system that fully relies on Maximum-a-Posteriori (MAP) estimation, even during the IMU initialization phase. The result is a system that operates robustly in real time, in small and large, indoor and outdoor environments, and is two to ten times more accurate than previous approaches. The second main novelty is a multiple-map system that relies on a new place-recognition method with improved recall. Thanks to it, ORB-SLAM3 is able to survive long periods of poor visual information: when it gets lost, it starts a new map that is seamlessly merged with previous maps when revisiting mapped areas. Compared with visual odometry systems that only use information from the last few seconds, ORB-SLAM3 is the first system able to reuse all previous information in all the algorithm stages. This makes it possible to include in bundle adjustment co-visible keyframes that provide high-parallax observations and boost accuracy, even if they are widely separated in time or come from a previous mapping session. Our experiments show that, in all sensor configurations, ORB-SLAM3 is as robust as the best systems available in the literature, and significantly more accurate. Notably, our stereo-inertial SLAM achieves an average accuracy of 3.5 cm on the EuRoC drone dataset and 9 mm under quick hand-held motions in the rooms of the TUM-VI dataset, a setting representative of AR/VR scenarios. For the benefit of the community we make the source code public.
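The MAP-based visual-inertial fusion itself is involved, but the IMU kinematics that get preintegrated between keyframes are simple; below is a minimal numpy sketch of dead-reckoning with gyroscope and accelerometer samples under assumed biases and gravity (illustrative only, not ORB-SLAM3's preintegration code).

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: map a rotation vector w to a rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def integrate_imu(R, v, p, gyro, accel, dt, bg, ba, g=np.array([0, 0, -9.81])):
    """Propagate orientation R, velocity v, position p by one IMU sample."""
    R_new = R @ so3_exp((gyro - bg) * dt)
    a_world = R @ (accel - ba) + g
    v_new = v + a_world * dt
    p_new = p + v * dt + 0.5 * a_world * dt**2
    return R_new, v_new, p_new

# Example: a stationary IMU measuring only gravity stays (nearly) still.
R, v, p = np.eye(3), np.zeros(3), np.zeros(3)
for _ in range(200):  # 1 s of samples at 200 Hz
    R, v, p = integrate_imu(R, v, p, np.zeros(3), np.array([0, 0, 9.81]),
                            0.005, np.zeros(3), np.zeros(3))
print(p)  # ~[0, 0, 0]
```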
We propose a generic framework for scale-aware direct monocular odometry based on depth prediction from a deep neural network. In contrast to previous methods, where depth information is only partially exploited, we formulate a novel depth-prediction residual that allows us to incorporate multi-view depth information. In addition, we propose the use of a truncated robust cost function to prevent inconsistent depth estimates from being taken into account. The photometric and depth-prediction measurements are integrated into a tightly coupled optimization, leading to a scale-aware monocular system that does not accumulate scale drift. Our proposal is not tied to any specific neural network and is able to work with the vast majority of existing depth-prediction solutions. We evaluate the effectiveness and generality of the proposal on the KITTI odometry dataset using two publicly available neural networks, and compare it with similar approaches as well as with state-of-the-art monocular and stereo SLAM methods. Experiments show that our proposal largely outperforms classical monocular SLAM, being 5 to 9 times more accurate, beats similar methods, and reaches an accuracy closer to that of stereo systems.
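A concrete way to read the "truncated robust cost" is as a per-pixel cost that caps the influence of depth residuals beyond a threshold; the following is a small hedged numpy sketch (the threshold and the squared-error form are illustrative assumptions, not the paper's exact formulation).

```python
import numpy as np

def truncated_cost(residual, delta):
    """Squared error that saturates at delta**2, so gross outliers stop
    contributing extra cost (and gradient) to the optimization."""
    return np.minimum(residual**2, delta**2)

# Depth residuals: estimated depth vs. network-predicted depth (metres).
d_est = np.array([2.0, 2.1, 1.9, 8.0])   # last value is an inconsistent estimate
d_pred = np.array([2.05, 2.0, 2.0, 2.1])
r = d_est - d_pred
print(truncated_cost(r, delta=0.5))       # the outlier's cost is capped at 0.25
```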
In this paper, we present TANDEM, a real-time monocular tracking and dense mapping framework. For pose estimation, TANDEM performs photometric bundle adjustment based on a sliding window of keyframes. To increase robustness, we propose a novel tracking front-end that performs dense direct image alignment using depth maps rendered from a global model, which is built incrementally from dense depth predictions. To predict the dense depth maps, we propose Cascade View-Aggregation MVSNet (CVA-MVSNet), which exploits the entire active keyframe window by hierarchically constructing 3D cost volumes with adaptive view aggregation to balance the different stereo baselines between the keyframes. Finally, the predicted depth maps are fused into a consistent global map represented as a truncated signed distance function (TSDF) voxel grid. Our experimental results show that TANDEM outperforms other state-of-the-art traditional and learning-based monocular visual odometry (VO) methods in terms of camera tracking. Moreover, TANDEM shows state-of-the-art real-time 3D reconstruction performance.
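To illustrate the TSDF map representation (not TANDEM's actual implementation), here is a hedged numpy sketch of integrating one depth map into a voxel grid with the standard weighted running average; the camera intrinsics, truncation distance, and grid layout are assumptions made for the example.

```python
import numpy as np

def integrate_depth(tsdf, weights, depth, K, T_wc, origin, voxel_size, trunc=0.1):
    """Fuse one depth map (H x W, in metres) into a TSDF voxel grid using the
    standard per-voxel weighted running average (weight 1 per observation)."""
    nx, ny, nz = tsdf.shape
    grid = np.stack(np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz),
                                indexing="ij"), axis=-1).reshape(-1, 3)
    pts_w = origin + voxel_size * grid                        # voxel centres in world frame
    T_cw = np.linalg.inv(T_wc)                                # world -> camera
    pts_c = pts_w @ T_cw[:3, :3].T + T_cw[:3, 3]
    z = pts_c[:, 2]
    u = np.round(K[0, 0] * pts_c[:, 0] / np.maximum(z, 1e-6) + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_c[:, 1] / np.maximum(z, 1e-6) + K[1, 2]).astype(int)
    h, w = depth.shape
    inside = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.where(inside, depth[np.clip(v, 0, h - 1), np.clip(u, 0, w - 1)], 0.0)
    sdf = d - z                                               # signed distance along the ray
    valid = inside & (d > 0) & (sdf > -trunc)
    sdf = np.clip(sdf / trunc, -1.0, 1.0)

    t, wgt = tsdf.reshape(-1), weights.reshape(-1)
    new_w = wgt + valid
    t[valid] = (t[valid] * wgt[valid] + sdf[valid]) / new_w[valid]
    wgt[:] = new_w
    return tsdf, weights
```

A caller would allocate `tsdf = np.zeros((nx, ny, nz))` and `weights = np.zeros((nx, ny, nz))` once and invoke this routine for each predicted keyframe depth map.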
Figure 1: Example output from our system, generated in real-time with a handheld Kinect depth camera and no other sensing infrastructure. Normal maps (colour) and Phong-shaded renderings (greyscale) from our dense reconstruction system are shown. On the left for comparison is an example of the live, incomplete, and noisy data from the Kinect sensor (used as input to our system).
A monocular visual-inertial system (VINS), consisting of a camera and a low-cost inertial measurement unit (IMU), forms the minimum sensor suite for metric six-degrees-of-freedom (DOF) state estimation. However, the lack of direct distance measurement poses significant challenges in terms of IMU processing, estimator initialization, extrinsic calibration, and nonlinear optimization. In this work, we present VINS-Mono: a robust and versatile monocular visual-inertial state estimator. Our approach starts with a robust procedure for estimator initialization and failure recovery. A tightly-coupled, nonlinear optimization-based method is used to obtain high-accuracy visual-inertial odometry by fusing pre-integrated IMU measurements and feature observations. A loop detection module, in combination with our tightly-coupled formulation, enables relocalization with minimum computation overhead. We additionally perform four-degrees-of-freedom pose graph optimization to enforce global consistency. We validate the performance of our system on public datasets and real-world experiments and compare against other state-of-the-art algorithms. We also perform onboard closed-loop autonomous flight on the MAV platform and port the algorithm to an iOS-based demonstration. We highlight that the proposed work is a reliable, complete, and versatile system that is applicable for different applications that require high-accuracy localization. We open-source our implementations for both PCs and iOS mobile devices.
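The pose graph is four-DOF because, with an IMU, roll and pitch are observable while yaw and translation drift, so only those four degrees of freedom need global correction. Below is a hedged sketch of one such relative-pose residual (names and the simplified yaw-only rotation are illustrative, not VINS-Mono's code).

```python
import numpy as np

def yaw_rotation(yaw):
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def pose_graph_residual_4dof(p_i, yaw_i, p_j, yaw_j, rel_p_meas, rel_yaw_meas):
    """Residual of one sequential/loop edge in 4-DOF pose-graph optimization:
    compare the current relative translation (expressed in frame i) and relative
    yaw against the measured ones; roll/pitch are held fixed by the IMU."""
    rel_p = yaw_rotation(yaw_i).T @ (p_j - p_i)   # translation of j w.r.t. i
    rel_yaw = yaw_j - yaw_i
    return np.concatenate([rel_p - rel_p_meas, [rel_yaw - rel_yaw_meas]])

# Example: poses consistent with the measurement give a zero residual.
print(pose_graph_residual_4dof(np.zeros(3), 0.0, np.array([1.0, 0.0, 0.0]), 0.1,
                               np.array([1.0, 0.0, 0.0]), 0.1))
```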
While dense visual SLAM methods are able to estimate a dense reconstruction of the environment, their tracking step lacks robustness, especially when the optimization is poorly initialized. Sparse visual SLAM systems have attained high levels of accuracy and robustness by including inertial measurements in a tightly coupled fusion. Inspired by this performance, we propose the first tightly coupled dense RGB-D-inertial SLAM system. Our system is capable of real-time operation when running on a GPU. It jointly optimizes the camera pose, velocity, IMU biases, and gravity direction while building a globally consistent, fully dense surface-based 3D reconstruction of the environment. Through a series of experiments on synthetic and real-world datasets, we show that our dense visual-inertial SLAM system is more robust to fast motion and to periods of low texture and low geometric variation than related RGB-D-only SLAM systems.
Indirect methods for visual SLAM are popular due to their robustness to environmental variations. ORB-SLAM2 is a benchmark method in this domain; however, it spends considerable time computing descriptors that are never reused unless a frame is selected as a keyframe. To overcome this, we propose FastORB-SLAM, which is lightweight and efficient because it tracks keypoints between adjacent frames without computing descriptors. To this end, a two-stage coarse-to-fine descriptor-independent keypoint matching method is proposed based on sparse optical flow. In the first stage, we predict initial keypoint correspondences via a simple but effective motion model and then robustly establish the correspondences via pyramid-based sparse optical flow tracking. In the second stage, we leverage motion smoothness and epipolar geometry constraints to refine the correspondences. In particular, our method computes descriptors only for keyframes. We test FastORB-SLAM on the TUM and ICL-NUIM RGB-D datasets and compare its accuracy and efficiency against nine existing RGB-D SLAM methods. Qualitative and quantitative results show that our method achieves state-of-the-art accuracy and is about twice as fast as ORB-SLAM2.
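As a hedged sketch of descriptor-free keypoint tracking (not the authors' exact two-stage pipeline), the snippet below detects corners once and tracks them into the next frame with OpenCV's pyramidal Lucas-Kanade optical flow; a real system would add the motion-model prediction and the epipolar/smoothness outlier rejection described in the abstract.

```python
import cv2
import numpy as np

# Placeholder consecutive frames; real input would come from an RGB-D stream.
prev = cv2.GaussianBlur(np.random.randint(0, 255, (480, 640), np.uint8), (3, 3), 0)
curr = np.roll(prev, 4, axis=1)  # simulate a small horizontal motion

# Detect corners only once (no descriptors), then track them by sparse optical flow.
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=7)
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(21, 21), maxLevel=3)
tracked = status.ravel() == 1
print(f"tracked {tracked.sum()} / {len(p0)} keypoints without computing descriptors")
```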
Modern visual-inertial navigation systems (VINS) face a critical challenge in real-world deployment: they need to operate reliably and robustly in highly dynamic environments. Current best solutions merely filter dynamic objects as outliers based on the semantics of the object category. Such an approach does not scale, as it requires the semantic classifier to encompass all possibly-moving object classes; this is hard to define, let alone deploy. On the other hand, many real-world environments exhibit strong structural regularities in the form of planes such as walls and ground surfaces, and these are moreover static. We present RP-VIO, a monocular visual-inertial odometry system that leverages the simple geometry of these planes for improved robustness and accuracy in challenging dynamic environments. Since existing datasets have a limited number of dynamic elements, we also provide a highly dynamic, photorealistic synthetic dataset for a more effective evaluation of the capabilities of modern VINS systems. We evaluate our approach on this dataset and on three diverse sequences from standard datasets, including two real-world dynamic sequences, and show a significant improvement in robustness and accuracy over a state-of-the-art monocular visual-inertial odometry system. We also show, in simulation, an improvement over a simple dynamic-feature masking approach. Our code and dataset are publicly available.
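Planar structure is useful because points on a plane relate between two views by a single homography, H = K (R − t nᵀ / d) K⁻¹ (with the plane written as nᵀX + d = 0 in the first view). Below is a small hedged numpy check of this relation with made-up camera parameters; it illustrates the geometry only, not RP-VIO's estimator.

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])  # assumed intrinsics
R = np.eye(3)                      # relative rotation (X2 = R @ X1 + t)
t = np.array([0.1, 0.0, 0.0])      # relative translation
n = np.array([0.0, 0.0, -1.0])     # plane normal, convention n.X + d = 0 in view 1
d = 2.0                            # so this is the plane z = 2 in view 1

H = K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)   # plane-induced homography

# A 3D point on the plane projects consistently through H.
X1 = np.array([0.3, -0.2, 2.0])
x1 = K @ X1 / X1[2]
X2 = R @ X1 + t
x2 = K @ X2 / X2[2]
x2_from_H = H @ x1
print(np.allclose(x2, x2_from_H / x2_from_H[2]))       # True
```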
This paper presents a method of estimating camera pose in an unknown scene. While this has previously been attempted by adapting SLAM algorithms developed for robotic exploration, we propose a system specifically designed to track a hand-held camera in a small AR workspace. We propose to split tracking and mapping into two separate tasks, processed in parallel threads on a dual-core computer: one thread deals with the task of robustly tracking erratic hand-held motion, while the other produces a 3D map of point features from previously observed video frames. This allows the use of computationally expensive batch optimisation techniques not usually associated with real-time operation: The result is a system that produces detailed maps with thousands of landmarks which can be tracked at frame-rate, with an accuracy and robustness rivalling that of state-of-the-art model-based systems.
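A hedged skeleton of the two-thread split (illustrative structure only, not PTAM's code): the tracking thread pushes candidate keyframes onto a queue while a background mapping thread consumes them and runs the expensive batch optimization.

```python
import queue
import threading
import time

keyframe_queue = queue.Queue()
stop = threading.Event()

def tracking_thread():
    """Fast loop: track the camera every frame, occasionally promote a keyframe."""
    for frame_id in range(100):
        # ... per-frame pose tracking would happen here ...
        if frame_id % 20 == 0:                  # toy keyframe-selection rule
            keyframe_queue.put(frame_id)
        time.sleep(0.005)                       # toy frame rate
    stop.set()

def mapping_thread():
    """Slow loop: batch-optimize the map whenever new keyframes arrive."""
    while not (stop.is_set() and keyframe_queue.empty()):
        try:
            kf = keyframe_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        # ... expensive bundle adjustment over all keyframes would happen here ...
        print(f"mapping: optimized map after adding keyframe {kf}")

t1 = threading.Thread(target=tracking_thread)
t2 = threading.Thread(target=mapping_thread)
t1.start(); t2.start(); t1.join(); t2.join()
```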
Combining simultaneous localization and mapping (SLAM) estimation with dynamic scene modeling can greatly benefit robot autonomy in dynamic environments. Robot path planning and obstacle avoidance rely on accurate estimates of the motion of dynamic objects in the scene. This paper presents VDO-SLAM, a robust visual dynamic object-aware SLAM system that exploits semantic information to enable accurate motion estimation and tracking of dynamic rigid objects in the scene without any prior knowledge of the objects' shapes or geometric models. The proposed approach identifies and tracks the dynamic objects and the static structure in the environment and integrates this information into a unified SLAM framework. This results in highly accurate estimates of the robot trajectory and the full SE(3) motion of the objects, as well as a spatiotemporal map of the environment. The system is able to extract linear velocity estimates from the objects' SE(3) motion, providing an important capability for navigation in complex dynamic environments. We demonstrate the performance of the proposed system on a number of real indoor and outdoor datasets, and the results show consistent and substantial improvements over state-of-the-art algorithms. An open-source version of the source code is available.
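The velocity extraction mentioned above can be made concrete: given an object's estimated SE(3) motion between two frames and the frame interval, the linear velocity is approximately the translational part of that motion divided by the time step. The toy sketch below is illustrative only (frame convention and names are assumptions, not VDO-SLAM's code).

```python
import numpy as np

def linear_velocity_from_se3(T_motion, dt):
    """Approximate linear velocity (m/s) of an object from its estimated SE(3)
    motion over one frame interval: translational component divided by dt."""
    return T_motion[:3, 3] / dt

# Toy example: an object that moved 0.5 m forward between frames 1/30 s apart.
T_motion = np.eye(4)
T_motion[:3, 3] = [0.5, 0.0, 0.0]
print(linear_velocity_from_se3(T_motion, dt=1.0 / 30.0))  # ~[15, 0, 0] m/s
```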
Various datasets have been proposed for simultaneous localization and mapping (SLAM) and related problems. Existing datasets often include small environments, have incomplete ground truth, or lack important sensor data, such as depth and infrared images. We propose an easy-to-use framework for acquiring building-scale 3D reconstruction using a consumer depth camera. Unlike complex and expensive acquisition setups, our system enables crowd-sourcing, which can greatly benefit data-hungry algorithms. Compared to similar systems, we utilize raw depth maps for odometry computation and loop closure refinement which results in better reconstructions. We acquire a building-scale 3D dataset (BS3D) and demonstrate its value by training an improved monocular depth estimation model. As a unique experiment, we benchmark visual-inertial odometry methods using both color and active infrared images.
In this paper, we present a tightly coupled visual-inertial object-level multi-instance dynamic SLAM system. Even in extremely dynamic scenes, it can robustly optimize the camera pose, velocity, and IMU biases while building a densely reconstructed object-level map. Thanks to robust sensor and object tracking, our system can reliably track and reconstruct the geometry, semantics, and motion of arbitrary objects by incrementally fusing associated color, depth, semantic, and foreground object probabilities. Furthermore, when an object is lost or moves outside the camera's field of view, our system can reliably recover its pose upon re-observation. We demonstrate the robustness and accuracy of our method by quantitative and qualitative testing on real-world data sequences.
Minimal solutions for relative rotation and translation estimation have been explored in different scenarios, typically relying on the so-called co-visibility graph. However, how to establish a direct rotation relationship between two frames without overlap is still an open topic which, if solved, could greatly improve the accuracy of visual odometry. In this paper, a new minimal solution is proposed to solve relative rotation estimation between two images without overlapping regions by exploiting a new graph structure, which we call the Extensibility Graph (E-Graph). Unlike the co-visibility graph, high-level landmarks, including vanishing directions and plane normals, are stored in our E-Graph, and these are geometrically extensible. Based on the E-Graph, the rotation estimation problem becomes simpler and more elegant, as it can handle pure rotational motion and requires fewer assumptions, such as a Manhattan/Atlanta world or planar/vertical motion. Finally, we embed our rotation estimation strategy into a complete camera tracking and mapping system that obtains 6-DoF camera poses and a dense 3D mesh model. Extensive experiments on public benchmarks demonstrate that the proposed method achieves state-of-the-art tracking performance.
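One way to see why shared directions (vanishing directions, plane normals) suffice for relative rotation without any point overlap is that aligning two or more corresponding unit directions already determines R. The hedged sketch below uses the standard SVD-based (Kabsch) alignment as a generic stand-in; it is not the paper's specific minimal solver.

```python
import numpy as np

def rotation_from_directions(dirs_a, dirs_b):
    """Find R minimizing ||dirs_b - dirs_a @ R.T|| over corresponding unit
    directions (rows), via the SVD-based Kabsch/Procrustes solution."""
    H = dirs_a.T @ dirs_b
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T

# Example: two shared landmark directions (e.g. a vanishing direction and a
# plane normal) observed in both frames recover the relative rotation.
R_true, _ = np.linalg.qr(np.random.randn(3, 3))
R_true *= np.sign(np.linalg.det(R_true))          # ensure a proper rotation
dirs_a = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
dirs_b = dirs_a @ R_true.T
print(np.allclose(rotation_from_directions(dirs_a, dirs_b), R_true))  # True
```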
In general, the problem of non-rigid registration is to match two different scans of a dynamic object taken at two different points in time. These scans can undergo both rigid motions and non-rigid deformations. Since new parts of the model may come into view while other parts become occluded between the two scans, the region of overlap is a subset of both scans. In the most general setting, no prior template shape is given, and no markers or explicit feature point correspondences are available. This case is therefore a partial matching problem, under the assumption that consecutive scans are taken with a significant amount of overlapping area [28]. The problem addressed in this paper is to simultaneously map a deforming object in the environment and localize the camera.
While the keypoint-based maps created by sparse monocular simultaneous localization and mapping (SLAM) systems are useful for camera tracking, dense 3D reconstructions may be desired for many robotic tasks. Solutions involving depth cameras are limited in range and to indoor spaces, and dense reconstruction systems based on minimizing the photometric error between frames are typically poorly constrained and suffer from scale ambiguity. To address these issues, we propose a 3D reconstruction system that leverages the output of a convolutional neural network (CNN) to produce fully dense depth maps for keyframes that include metric scale. Our system, DeepFusion, is capable of producing real-time dense reconstructions on a GPU. It fuses the output of a semi-dense multi-view stereo algorithm with the depth and gradient predictions of a CNN in a probabilistic fashion, using learned uncertainties produced by the network. While the network only needs to be run once per keyframe, we are able to optimize the depth map with each new frame so as to constantly make use of new geometric constraints. Based on its performance on synthetic and real-world datasets, we demonstrate that DeepFusion is capable of performing at least as well as other comparable systems.
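The probabilistic fusion can be pictured, in its simplest per-pixel form, as an inverse-variance (precision-weighted) combination of the semi-dense stereo depth and the CNN's predicted depth using the learned uncertainties; the hedged numpy sketch below shows that idea only and omits the gradient terms and per-frame optimization described in the abstract.

```python
import numpy as np

def fuse_depths(d_stereo, var_stereo, d_cnn, var_cnn):
    """Per-pixel inverse-variance fusion of two noisy depth estimates.
    Pixels where the stereo depth is missing (NaN) fall back to the CNN prediction."""
    w_s = np.where(np.isnan(d_stereo), 0.0, 1.0 / var_stereo)
    w_c = 1.0 / var_cnn
    d_s = np.nan_to_num(d_stereo)
    fused = (w_s * d_s + w_c * d_cnn) / (w_s + w_c)
    fused_var = 1.0 / (w_s + w_c)
    return fused, fused_var

# Toy example: one confident stereo pixel, one pixel with no stereo depth at all.
d_stereo = np.array([2.00, np.nan])
var_stereo = np.array([0.01, 1.0])
d_cnn = np.array([2.30, 2.30])
var_cnn = np.array([0.09, 0.09])
print(fuse_depths(d_stereo, var_stereo, d_cnn, var_cnn))
# first pixel is pulled toward the stereo depth, second takes the CNN depth
```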
We propose Deep Patch Visual Odometry (DPVO), a new deep-learning system for monocular visual odometry (VO). DPVO is accurate and robust while running at 2x-5x real-time speed using only 4GB of memory on a single RTX-3090 GPU. We perform evaluation on standard benchmarks and outperform all prior work (classical or learned) in both accuracy and speed. Code is available at https://github.com/princeton-vl/dpvo.
We present a multi-camera visual-inertial odometry system based on factor-graph optimization which estimates motion by using all cameras simultaneously while retaining a fixed overall feature budget. We focus on motion tracking in challenging environments, such as narrow corridors and dark spaces with aggressive motions and abrupt lighting changes. These scenarios cause traditional monocular or stereo odometry to fail. While tracking motion with additional cameras should help in theory, it brings additional complexity and computational burden. To overcome these challenges, we introduce two novel methods to improve multi-camera feature tracking. First, instead of tracking features separately in each camera, we track features continuously as they move from one camera to another. This increases accuracy and yields a more compact factor-graph representation. Second, we select a fixed budget of tracked features across the cameras to reduce back-end optimization time. We find that using a smaller set of informative features can maintain the same tracking accuracy. Our proposed method was extensively tested using a hardware-synchronized rig consisting of an IMU and four cameras (a front stereo pair and two lateral cameras) in scenarios including an underground mine, large open spaces, and building interiors with narrow stairs and corridors. Compared to stereo-only state-of-the-art visual-inertial odometry methods, our approach reduces the drift rate (relative pose error) by up to 80% in translation and 39% in rotation.