High-precision camera relocalization within a pre-built 3D environment map is fundamental to many tasks, such as augmented reality, robotics, and autonomous driving. Point-based visual relocalization approaches have matured over recent decades, yet they still fall short in some challenging scenarios. In this paper, we design a complete pipeline for camera pose refinement with points and lines, which contains an innovatively designed line extraction CNN named VLSE, a line matching method, and a pose optimization approach. We adopt a novel line representation and customize a hybrid convolution block based on the stacked hourglass network to detect accurate and stable line features on images. We then adopt a geometry-based strategy to obtain precise 2D-3D line correspondences using epipolar-constraint and reprojection filtering. A point-line joint cost function is then constructed to refine the camera pose, starting from the initial coarse pose obtained by purely point-based localization. Sufficient experiments are conducted on open datasets, i.e., line extraction on the Wireframe dataset and localization performance on InLoc DUC1 and DUC2, to confirm the effectiveness of our point-line joint pose optimization method.
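The abstract does not spell out the form of the point-line joint cost, but a common formulation combines point reprojection errors with point-to-line distances of the projected 3D line endpoints. The sketch below is only an illustration of that idea; all function names, parameterizations, and weights are assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation
from scipy.optimize import least_squares

def project(K, R, t, X):
    """Project Nx3 world points into the image with intrinsics K and pose (R, t)."""
    x = (K @ (R @ X.T + t.reshape(3, 1))).T
    return x[:, :2] / x[:, 2:3]

def point_line_residuals(pose6, K, X_pts, x_pts, X_line_ends, lines2d):
    """pose6 = [rotation vector, translation]. Residuals: point reprojection errors plus
    distances of projected 3D line endpoints to matched 2D lines (a*x + b*y + c = 0)."""
    R = Rotation.from_rotvec(pose6[:3]).as_matrix()
    t = pose6[3:]
    r_pts = (project(K, R, t, X_pts) - x_pts).ravel()
    ends = project(K, R, t, X_line_ends.reshape(-1, 3))        # (2L, 2) endpoints
    ends_h = np.hstack([ends, np.ones((len(ends), 1))])        # homogeneous coords
    l = np.repeat(lines2d, 2, axis=0)                          # one 2D line per endpoint
    r_lines = np.sum(ends_h * l, axis=1) / np.linalg.norm(l[:, :2], axis=1)
    return np.concatenate([r_pts, r_lines])

# refine a coarse pose, e.g.:
# result = least_squares(point_line_residuals, pose0,
#                        args=(K, X_pts, x_pts, X_line_ends, lines2d))
```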
Visual odometry is crucial for many robotic tasks such as autonomous exploration and path planning. Despite much progress, existing methods are still not robust enough in environments with dynamic illumination. In this paper, we present AirVO, an illumination-robust and accurate stereo visual odometry system based on point and line features. To be robust to illumination variation, we introduce learning-based feature extraction and matching methods and design a novel VO pipeline, including feature tracking, triangulation, key-frame selection, and graph optimization. We also employ long line features in the environment to improve the accuracy of the system. Different from the traditional line processing pipelines in visual odometry systems, we propose an illumination-robust line tracking method, where point feature tracking and the distribution of point and line features are utilized to match lines. In the experiments, the proposed system is extensively evaluated in environments with dynamic illumination, and the results show that it achieves superior performance to the state-of-the-art algorithms.
Local image feature matching, which aims to identify and correspond similar regions across image pairs, is an important concept in computer vision. Most existing image matching methods follow a one-to-one assignment principle and employ mutual nearest neighbors to guarantee unique correspondences between local features across images. However, images captured under different conditions may exhibit large scale variations or viewpoint diversity, so that one-to-one assignment can lead to ambiguous or missing representations in dense matching. In this paper, we introduce AdaMatcher, a novel detector-free local feature matching method, which first correlates dense features through a lightweight feature interaction module and estimates the co-visible area of the paired images, then performs patch-level many-to-one assignment to predict match proposals, and finally refines them with a one-to-one refinement module. Extensive experiments show that AdaMatcher outperforms solid baselines and achieves state-of-the-art results on many downstream tasks. Moreover, the many-to-one assignment and one-to-one refinement modules can be used as a refinement network for other matching methods, such as SuperGlue, to further boost their performance. Code will be made available upon publication.
Visual odometry algorithms tend to degrade when facing low-textured scenes, frequently found in man-made environments, where it is often difficult to find a sufficient number of point features. Alternative geometric visual cues, such as the lines that can also be found in these scenes, can be particularly useful. Moreover, these scenes typically present structural regularities, such as parallelism or orthogonality, and hold the Manhattan world assumption. Under these premises, in this work we introduce MSC-VO, an RGB-D-based visual odometry approach that combines point and line features and leverages, if they exist, those structural regularities and the Manhattan axes of the scene. In our approach, the structural constraints are initially used to estimate the 3D positions of the extracted lines more accurately. These constraints are also combined with the estimated Manhattan axes and the reprojection errors of points and lines to refine the camera pose via local map optimization. This combination enables our approach to operate even in the absence of the aforementioned constraints, allowing the method to be used in a wider range of scenarios. Furthermore, we propose a novel multi-view Manhattan axes estimation procedure that relies mainly on line features. MSC-VO is evaluated on several public datasets, outperforming other state-of-the-art solutions and comparing favorably even with some SLAM methods.
Local feature matching is a computationally intensive task at the sub-pixel level. While detector-based methods with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline cannot exploit the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. At each stage of the hierarchical encoder, we interleave self-attention for feature extraction with cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a matching-aware encoder relieves the overloaded decoder and makes the model highly efficient. Furthermore, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, especially in low-texture indoor scenes or with less outdoor training data. Thanks to this strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method for indoor pose estimation, our lite MatchFormer has only 45% GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state of the art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).
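As a rough sketch of what interleaving self-attention (extract) and cross-attention (match) inside one encoder stage can look like, the snippet below is a generic illustration with assumed dimensions, not the MatchFormer architecture itself.

```python
import torch
import torch.nn as nn

class ExtractAndMatchStage(nn.Module):
    """One encoder stage: self-attention within each image (extract) followed by
    cross-attention between the two images (match). Illustrative sketch only."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, N, dim) flattened feature maps of the two images
        feat_a = self.norm1(feat_a + self.self_attn(feat_a, feat_a, feat_a)[0])
        feat_b = self.norm1(feat_b + self.self_attn(feat_b, feat_b, feat_b)[0])
        new_a = self.norm2(feat_a + self.cross_attn(feat_a, feat_b, feat_b)[0])
        new_b = self.norm2(feat_b + self.cross_attn(feat_b, feat_a, feat_a)[0])
        return new_a, new_b
```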
Visual relocalization has been widely discussed in 3D vision: given a pre-constructed 3D visual map, estimate the 6-DoF (degrees of freedom) pose of a query image. Relocalization in large-scale indoor environments enables attractive applications such as augmented reality and robot navigation. However, appearance changes rapidly in such environments as the camera moves, which is challenging for a relocalization system. To address this problem, we propose a virtual-view-synthesis-based approach, RenderNet, to enrich the database and refine poses for this particular scenario. Instead of rendering real images, which requires high-quality 3D models, we opt to directly render the necessary global and local features of virtual viewpoints and apply them in the subsequent image retrieval and feature matching operations, respectively. The proposed method can largely improve performance in large-scale indoor environments, e.g., achieving improvements of 7.1% and 12.2% on the InLoc dataset.
In this study, we propose a novel visual localization approach to accurately estimate the six-degree-of-freedom (6-DoF) pose of a robot within a 3D lidar map based on visual data from an RGB camera. The 3D map is obtained with an advanced lidar-based simultaneous localization and mapping (SLAM) algorithm capable of collecting a precise sparse map. Features extracted from the camera images are compared against the points of the 3D map, and a geometric optimization problem is then solved to achieve precise visual localization. Our approach allows a reconnaissance robot equipped with an expensive lidar to be used only once, for mapping the environment, while multiple operational robots equipped only with RGB cameras perform mission tasks with localization accuracy higher than common camera-based solutions. The method was tested on a custom dataset collected at the Skolkovo Institute of Science and Technology (Skoltech). During the evaluation of localization accuracy, we managed to achieve centimeter-level accuracy, with a median translation error of up to 1.3 cm. The precise localization achieved using only a camera makes it possible to employ autonomous mobile robots for the most complex tasks that require high localization accuracy.
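Once image features are associated with 3D map points, the "geometric optimization problem" mentioned above reduces, in essence, to a perspective-n-point (PnP) problem. The snippet below is a hedged sketch of that step using standard OpenCV calls; it is not the system described in the paper, and the function and argument names are illustrative.

```python
import numpy as np
import cv2

def localize_in_lidar_map(pts2d, pts3d, K, dist=None):
    """Estimate the camera pose from 2D-3D correspondences between image features
    and lidar-map points: RANSAC PnP followed by LM refinement. Illustrative only."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, dist,
        reprojectionError=3.0, iterationsCount=1000)
    if not ok:
        return None
    # non-linear refinement (Levenberg-Marquardt) on the inlier set
    rvec, tvec = cv2.solvePnPRefineLM(
        pts3d[inliers[:, 0]], pts2d[inliers[:, 0]], K, dist, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec   # map-to-camera rotation and translation
```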
Accurate localization ability is fundamental in autonomous driving. Traditional visual localization frameworks approach the semantic map-matching problem with geometric models, which rely on complex parameter tuning and thus hinder large-scale deployment. In this paper, we propose BEV-Locator: an end-to-end visual semantic localization neural network using multi-view camera images. Specifically, a visual BEV (Bird's-Eye-View) encoder extracts and flattens the multi-view images into BEV space, while the semantic map features are structurally embedded as a sequence of map queries. A cross-modal transformer then associates the BEV features and semantic map queries, and the localization information of the ego-car is recursively queried out by cross-attention modules. Finally, the ego pose can be inferred by decoding the transformer outputs. We evaluate the proposed method on the large-scale nuScenes and Qcraft datasets. The experimental results show that BEV-Locator is capable of estimating the vehicle pose under versatile scenarios, effectively associating the cross-modal information from multi-view images and global semantic maps. The experiments report satisfactory accuracy, with mean absolute errors of 0.052 m, 0.135 m, and 0.251$^\circ$ in lateral translation, longitudinal translation, and heading angle, respectively.
This paper presents a visual SLAM system that utilizes point and line clouds, along with an embedded piece-wise planar reconstruction (PPR) module, to jointly provide a structural map. Building a consistent, scaled map in parallel with tracking, e.g., with a single camera, poses the challenge of reconstructing geometric primitives with scale ambiguity and further introduces difficulties in graph optimization with bundle adjustment (BA). We address these problems by proposing several run-time optimizations on the reconstructed lines and planes. The system is then extended to depth and stereo sensors based on the design of the monocular framework. The results show that our proposed SLAM tightly incorporates semantic features to boost both frontend tracking and backend optimization. We evaluate our system exhaustively on various datasets and open-source the code for the community (https://github.com/peterfws/structure-plp-slam).
This paper addresses the problem of vehicle-mounted camera localization by matching a ground-level image against an overhead-view satellite map. Existing methods often treat this problem as cross-view image retrieval and use learned deep features to match the ground query image with partitions (e.g., small patches) of the satellite map. With these methods, the localization accuracy is limited by the partitioning density of the satellite map (typically on the order of several meters). Departing from the conventional wisdom of image retrieval, this paper presents a novel solution that can achieve highly accurate localization. The key idea is to formulate the task as pose estimation and solve it by neural-network-based optimization. Specifically, we design a two-branch CNN to extract robust features from the ground and satellite images, respectively. To bridge the vast cross-view domain gap, we resort to a geometry projection module that projects features from the satellite map to the ground view based on a relative camera pose. To minimize the differences between the projected features and the observed features, we employ a differentiable Levenberg-Marquardt (LM) module to iteratively search for the optimal camera pose. The entire pipeline is differentiable and runs end-to-end. Extensive experiments on standard autonomous-vehicle localization datasets confirm the superiority of the proposed method. Notably, starting from a coarse estimate of the camera location within a wide 40 m x 40 m region, our method quickly reduces the lateral location error to within 5 m on the new KITTI cross-view dataset.
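As a conceptual illustration of the Levenberg-Marquardt update that such a differentiable pose-optimization loop is built around (the paper's actual residual, parameterization, and damping schedule are not reproduced here; all names are placeholders):

```python
import torch

def lm_step(residual_fn, pose, lam=1e-2):
    """One Levenberg-Marquardt update: pose is an (n,) tensor of pose parameters and
    residual_fn maps it to an (m,) residual vector (e.g. projected-minus-observed
    feature differences). Returns the damped Gauss-Newton update. Illustrative only."""
    J = torch.autograd.functional.jacobian(residual_fn, pose)   # (m, n) Jacobian
    r = residual_fn(pose)                                       # (m,) residuals
    H = J.T @ J + lam * torch.eye(pose.numel())                 # damped normal equations
    delta = torch.linalg.solve(H, -J.T @ r)
    return pose + delta
```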
Sparse local feature extraction is usually believed to be of great significance in typical vision tasks such as simultaneous localization and mapping, image matching, and 3D reconstruction. At present, it still has some deficiencies needing further improvement, mainly including the discrimination power of extracted local descriptors, the localization accuracy of detected keypoints, and the efficiency of local feature learning. This paper focuses on promoting the currently popular sparse local feature learning with camera pose supervision. To that end, it proposes a Shared Coupling-bridge scheme with four light-weight yet effective improvements for weakly-supervised local feature (SCFeat) learning. It mainly contains: i) the \emph{Feature-Fusion-ResUNet Backbone} (F2R-Backbone) for local descriptor learning, ii) a shared coupling-bridge normalization to improve the decoupled training of the description network and the detection network, iii) an improved detection network with a peakiness measurement to detect keypoints, and iv) the fundamental matrix error as a reward factor to further optimize feature detection training. Extensive experiments prove that our SCFeat improvement is effective. It often achieves state-of-the-art performance in classic image matching and visual localization, and it still achieves competitive results in 3D reconstruction. For sharing and communication, our source code is available at https://github.com/sunjiayuanro/SCFeat.git.
We present a novel method for local image feature matching. Instead of performing image feature detection, description, and matching sequentially, we propose to first establish pixel-wise dense matches at a coarse level and later refine the good matches at a fine level. In contrast to dense methods that use a cost volume to search correspondences, we use self and cross attention layers in Transformer to obtain feature descriptors that are conditioned on both images. The global receptive field provided by Transformer enables our method to produce dense matches in low-texture areas, where feature detectors usually struggle to produce repeatable interest points. The experiments on indoor and outdoor datasets show that LoFTR outperforms state-of-the-art methods by a large margin. LoFTR also ranks first on two public benchmarks of visual localization among the published methods. Code is available at our project page: https://zju3dv.github.io/loftr/.
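As a hedged sketch of how coarse-level dense matches can be read off transformer-conditioned features — mutual nearest neighbors on a dual-softmax confidence matrix — the snippet below illustrates the general idea only; the thresholds and shapes are assumptions rather than the paper's exact matching layer.

```python
import torch

def coarse_matches(feat_a, feat_b, temperature=0.1, threshold=0.2):
    """feat_a: (Na, d), feat_b: (Nb, d) coarse features conditioned on both images.
    Returns index pairs that are mutual nearest neighbours with high confidence."""
    sim = feat_a @ feat_b.T / temperature
    conf = sim.softmax(dim=0) * sim.softmax(dim=1)              # dual-softmax confidence
    mask = conf > threshold
    mask &= (conf == conf.max(dim=1, keepdim=True).values)      # row-wise best
    mask &= (conf == conf.max(dim=0, keepdim=True).values)      # column-wise best
    return mask.nonzero(as_tuple=False)                         # (K, 2) index pairs
```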
In this paper, we present a visual localization pipeline, namely MegLoc, for robust and accurate 6-DoF pose estimation under varying scenarios, including indoor and outdoor scenes, different times of day, different seasons across years, and even across years. MegLoc achieves state-of-the-art results on a range of challenging datasets, including winning the Outdoor and Indoor Visual Localization Challenge of the ICCV 2021 workshop on long-term visual localization under changing conditions, as well as the Re-localization Challenge for Autonomous Driving of the ICCV 2021 workshop on map-based localization for autonomous driving.
In this paper, we consider problems encountered in practical applications of visual simultaneous localization and mapping (SLAM). With the popularization and application of the technology across a wide range of fields, the practicability of SLAM systems has become a major concern after accuracy and robustness, e.g., how to keep the system stable and achieve accurate pose estimation in low-texture and dynamic environments, and how to improve the generality and real-time performance of the system in real-world scenes. To this end, we present PLD-SLAM, a system designed to mitigate the influence of dynamic objects in highly dynamic environments. We also propose a novel global gray similarity (GGS) algorithm to achieve reasonable keyframe selection and efficient loop closure detection (LCD). Benefiting from the GGS, PLD-SLAM can achieve real-time and accurate pose estimation in most real-world scenes without pre-training and loading a huge feature dictionary model. To verify the performance of the proposed system, we compare it with existing state-of-the-art (SOTA) methods on the public datasets KITTI and EuRoC MAV, as well as an indoor stereo dataset provided by us. The experiments show that PLD-SLAM ensures stability and accuracy in most scenarios while offering better real-time performance. Moreover, by analyzing the experimental results of the GGS, we find that it performs excellently in keyframe selection and LCD.
Since lines provide additional constraints, leveraging line features can help to improve the localization accuracy of point-based monocular visual-inertial odometry (VIO) systems. Moreover, in man-made environments, some straight lines are parallel to each other. In this paper, we design a VIO system based on points and straight lines, which divides straight lines into structural lines (i.e., lines that are parallel to each other) and non-structural lines. In addition, unlike the orthonormal representation that uses four parameters to represent a 3D straight line, we use only two parameters to minimize the representation of both structural and non-structural lines. Furthermore, we design a straight-line matching strategy based on sampled points, which improves the efficiency and success rate of line matching. The effectiveness of our method is verified on the public datasets of the EuRoC and TUM VI benchmarks and compared with other state-of-the-art algorithms.
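The abstract only names the sampled-point line-matching strategy. One plausible (assumed, not the paper's) realization is to sample points along a line, track them into the next frame, and vote for the closest candidate line, as sketched below.

```python
import numpy as np
import cv2

def match_line_by_sampled_points(prev_img, cur_img, line, cur_lines, n_samples=10):
    """line: (x1, y1, x2, y2) in prev_img; cur_lines: candidate lines in cur_img.
    Sample points along the line, track them with LK optical flow, and pick the
    candidate line closest to the tracked points. Illustrative sketch only."""
    p1, p2 = np.array(line[:2], np.float32), np.array(line[2:], np.float32)
    ts = np.linspace(0.0, 1.0, n_samples)[:, None]
    samples = ((1 - ts) * p1 + ts * p2).astype(np.float32)        # (n, 2) sample points
    tracked, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_img, cur_img, samples.reshape(-1, 1, 2), None)
    tracked = tracked.reshape(-1, 2)[status.ravel() == 1]
    if len(tracked) < n_samples // 2:
        return None                                               # tracking failed

    def dist_to(l):                                               # mean point-to-line distance
        a, b = np.array(l[:2], float), np.array(l[2:], float)
        d = b - a
        n = np.array([-d[1], d[0]]) / np.linalg.norm(d)
        return np.abs((tracked - a) @ n).mean()

    dists = [dist_to(l) for l in cur_lines]
    best = int(np.argmin(dists))
    return best if dists[best] < 3.0 else None                    # pixel threshold (assumed)
```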
In this paper, we propose an end-to-end framework that jointly learns keypoint detection, descriptor representation and cross-frame matching for the task of image-based 3D localization. Prior art has tackled each of these components individually, purportedly aiming to alleviate difficulties in effectively training a holistic network. We design a self-supervised image warping correspondence loss for both feature detection and matching, a weakly-supervised epipolar constraints loss on relative camera pose learning, and a directional matching scheme that detects key-point features in a source image and performs coarse-to-fine correspondence search on the target image. We leverage this framework to enforce cycle consistency in our matching module. In addition, we propose a new loss to robustly handle both definite inlier/outlier matches and less-certain matches. The integration of these learning mechanisms enables end-to-end training of a single network performing all three localization components. Benchmarking our approach on public datasets exemplifies how such an end-to-end framework is able to yield more accurate localization, outperforming both traditional methods and state-of-the-art weakly supervised methods.
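As a hedged sketch of what a weakly-supervised epipolar constraint loss can look like — penalizing the algebraic epipolar error of predicted correspondences under the relative pose — the snippet below is illustrative; the symbols, normalization, and weighting are assumptions rather than the paper's exact loss.

```python
import torch

def epipolar_loss(x1, x2, R, t, K1, K2):
    """x1, x2: (N, 2) matched pixel coordinates in the two views; (R, t): relative pose
    with x2 ~ R x1 + t; K1, K2: intrinsics. Returns the mean squared algebraic
    epipolar error x2^T F x1 over all correspondences. Illustrative sketch only."""
    def to_h(x):                                     # homogeneous pixel coordinates
        return torch.cat([x, torch.ones_like(x[:, :1])], dim=1)
    zero = torch.zeros((), dtype=R.dtype)
    t_x = torch.stack([torch.stack([zero, -t[2], t[1]]),
                       torch.stack([t[2], zero, -t[0]]),
                       torch.stack([-t[1], t[0], zero])])
    E = t_x @ R                                      # essential matrix [t]_x R
    F = torch.inverse(K2).T @ E @ torch.inverse(K1)  # fundamental matrix
    err = torch.sum(to_h(x2) * (to_h(x1) @ F.T), dim=1)
    return (err ** 2).mean()
```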
Visual perception plays an important role in autonomous driving. One of the primary tasks is object detection and identification. Since the vision sensor is rich in color and texture information, it can quickly and accurately identify various road information. The commonly used technique is based on extracting and calculating various features of the image. The recent development of deep learning-based methods offers better reliability and processing speed and has a greater advantage in recognizing complex elements. For depth estimation, vision sensors are also used for ranging due to their small size and low cost. A monocular camera uses image data from a single viewpoint as input to estimate object depth. In contrast, stereo vision is based on parallax and matching feature points of different views, and the application of deep learning further improves the accuracy. In addition, Simultaneous Localization and Mapping (SLAM) can establish a model of the road environment, thus helping the vehicle perceive the surrounding environment and complete its tasks. In this paper, we introduce and compare various methods of object detection and identification, then explain the development of depth estimation and compare various methods based on monocular, stereo, and RGB-D sensors, next review and compare various methods of SLAM, and finally summarize the current problems and present the future development trends of vision technologies.
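For the stereo case mentioned above, the parallax-to-depth relation of a rectified pinhole pair is the standard formula Z = f * B / d. A tiny worked example (the numbers are made up for illustration):

```python
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth from stereo parallax: Z = f * B / d (rectified pinhole stereo pair)."""
    return focal_px * baseline_m / disparity_px

# e.g. a 720 px focal length, 0.12 m baseline and 9 px disparity give 9.6 m depth
print(stereo_depth(720.0, 0.12, 9.0))   # 9.6
```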
Most state-of-the-art localization algorithms rely on robust relative pose estimation and geometric verification to obtain a moving-object-agnostic camera pose in complex indoor environments. However, such approaches are prone to mistakes if the scene contains repetitive structures, e.g., desks, tables, boxes, or moving people. We show that movable objects introduce non-negligible localization errors and present a new straightforward method to predict the six-degree-of-freedom (6DoF) pose more robustly. We equip the localization pipeline InLoc with the instance segmentation network YOLACT++. Masks of dynamic objects are used in the relative pose estimation step and in the final classification of camera pose proposals. First, we filter out matches placed on masks of dynamic objects. Second, we skip the comparison of query and synthetic images over regions related to moving objects. This procedure leads to more robust localization. Lastly, we describe and improve errors caused by the gradient-based comparison between synthetic and query images, and we publish a new pipeline for simulating environments with movable objects from Matterport scans. All the code is available at github.com/dubenma/d-inlocpp.
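A minimal hedged sketch of the first filtering step described above — discarding matches whose keypoints fall on an instance mask of a dynamic object (the array names and shapes are assumptions):

```python
import numpy as np

def filter_matches_by_mask(kpts_q, kpts_db, matches, dynamic_mask_q):
    """kpts_q, kpts_db: (N, 2) keypoints (x, y); matches: (M, 2) index pairs (query, db);
    dynamic_mask_q: (H, W) boolean mask of dynamic objects in the query image.
    Keeps only matches whose query keypoint lies outside the dynamic mask."""
    q_xy = np.round(kpts_q[matches[:, 0]]).astype(int)
    h, w = dynamic_mask_q.shape
    inside = (q_xy[:, 0] >= 0) & (q_xy[:, 0] < w) & (q_xy[:, 1] >= 0) & (q_xy[:, 1] < h)
    on_dynamic = np.zeros(len(matches), dtype=bool)
    on_dynamic[inside] = dynamic_mask_q[q_xy[inside, 1], q_xy[inside, 0]]
    return matches[~on_dynamic]
```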
Finding correspondences across images is an important task in many visual applications. Recent state-of-the-art methods focus on end-to-end learning-based architectures designed in a coarse-to-fine manner. They use very deep CNNs or multi-block transformers to learn robust representations, which require high computational power. Moreover, these methods learn features without understanding the objects and shapes inside the images, and thus lack interpretability. In this paper, we propose an architecture for image matching that is efficient, robust, and interpretable. More specifically, we introduce a novel feature matching module called TopicFM, which can roughly organize the spatial structures across images into a set of topics and then augment the features inside each topic for accurate matching. To infer the topics, we first learn global embeddings of the topics and then use a latent-variable model to detect and assign image structures to topics. Our method can perform matching only in the co-visible regions to reduce computation. Extensive experiments on both outdoor and indoor datasets show that our method outperforms state-of-the-art methods in terms of matching performance and computational efficiency. The code is available at https://github.com/truongkhang/topicfm.
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at github.com/magicleap/SuperGluePretrainedNetwork.
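A hedged sketch of the differentiable optimal-transport step at the heart of this kind of matcher: entropic-regularized Sinkhorn iterations in log space over a score matrix augmented with a dustbin row and column. This follows the general recipe rather than SuperGlue's exact implementation; the marginals and the dustbin score below are simplified assumptions.

```python
import torch

def log_sinkhorn(scores, num_iters=50, dustbin_score=1.0):
    """scores: (M, N) pairwise matching scores from a graph neural network.
    Returns an (M+1, N+1) log assignment matrix; the extra row/column absorbs
    unmatched points. Illustrative sketch only."""
    m, n = scores.shape
    bin_row = torch.full((1, n), dustbin_score, dtype=scores.dtype)
    bin_col = torch.full((m + 1, 1), dustbin_score, dtype=scores.dtype)
    Z = torch.cat([torch.cat([scores, bin_row], dim=0), bin_col], dim=1)  # (M+1, N+1)
    log_mu = torch.zeros(m + 1, dtype=scores.dtype)   # uniform (unnormalized) marginals
    log_nu = torch.zeros(n + 1, dtype=scores.dtype)
    u, v = torch.zeros_like(log_mu), torch.zeros_like(log_nu)
    for _ in range(num_iters):
        u = log_mu - torch.logsumexp(Z + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(Z + u[:, None], dim=0)
    return Z + u[:, None] + v[None, :]   # log P with approximately matched marginals
```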