Line segments are ubiquitous in our human-made world and are increasingly used in vision tasks. They are complementary to feature points thanks to their spatial extent and the structural information they provide. Traditional line detectors based on the image gradient are extremely fast and accurate, but lack robustness in noisy images and challenging conditions. Their learned counterparts are more repeatable and can handle challenging images, but at the cost of a lower accuracy and a bias towards wireframe lines. We propose to combine traditional and learned approaches to get the best of both worlds: an accurate and robust line detector that can be trained in the wild without ground truth lines. Our new line segment detector, DeepLSD, processes images with a deep network to generate a line attraction field, before converting it to a surrogate image gradient magnitude and angle, which is then fed to any existing handcrafted line detector. Additionally, we propose a new optimization tool to refine line segments based on the attraction field and vanishing points. This refinement improves the accuracy of current deep detectors by a large margin. We demonstrate the performance of our method on low-level line detection metrics, as well as on several downstream tasks using multiple challenging datasets. The source code and models are available at https://github.com/cvg/DeepLSD.
translated by 谷歌翻译
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at github.com/magicleap/SuperGluePretrainedNetwork.
translated by 谷歌翻译
在本文中,我们建议超越建立的基于视觉的本地化方法,该方法依赖于查询图像和3D点云之间的视觉描述符匹配。尽管通过视觉描述符匹配关键点使本地化高度准确,但它具有重大的存储需求,提出了隐私问题,并需要长期对描述符进行更新。为了优雅地应对大规模定位的实用挑战,我们提出了Gomatch,这是基于视觉的匹配的替代方法,仅依靠几何信息来匹配图像键点与地图的匹配,这是轴承矢量集。我们的新型轴承矢量表示3D点,可显着缓解基于几何的匹配中的跨模式挑战,这阻止了先前的工作在现实环境中解决本地化。凭借额外的仔细建筑设计,Gomatch在先前的基于几何的匹配工作中改善了(1067m,95.7升)和(1.43m,34.7摄氏度),平均中位数姿势错误,同时需要7个尺寸,同时需要7片。与最佳基于视觉的匹配方法相比,几乎1.5/1.7%的存储容量。这证实了其对现实世界本地化的潜力和可行性,并为不需要存储视觉描述符的城市规模的视觉定位方法打开了未来努力的大门。
translated by 谷歌翻译
This paper presents a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems in computer vision. As opposed to patch-based neural networks, our fully-convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors in one forward pass. We introduce Homographic Adaptation, a multi-scale, multihomography approach for boosting interest point detection repeatability and performing cross-domain adaptation (e.g., synthetic-to-real). Our model, when trained on the MS-COCO generic image dataset using Homographic Adaptation, is able to repeatedly detect a much richer set of interest points than the initial pre-adapted deep model and any other traditional corner detector. The final system gives rise to state-of-the-art homography estimation results on HPatches when compared to LIFT, SIFT and ORB.
translated by 谷歌翻译
相机的估计与一组图像相关联的估计通常取决于图像之间的特征匹配。相比之下,我们是第一个通过使用对象区域来指导姿势估计问题而不是显式语义对象检测来应对这一挑战的人。我们提出了姿势炼油机网络(PosErnet),一个轻量级的图形神经网络,以完善近似的成对相对摄像头姿势。posernet利用对象区域之间的关联(简洁地表示为边界框),跨越了多个视图到全球完善的稀疏连接的视图图。我们在不同尺寸的图表上评估了7个尺寸的数据集,并展示了该过程如何有益于基于优化的运动平均算法,从而相对于基于边界框获得的初始估计,将旋转的中值误差提高了62度。代码和数据可在https://github.com/iit-pavis/posernet上找到。
translated by 谷歌翻译
检测局部特征,例如角落,段或斑点,是许多计算机视觉应用中的第一步。它的速度对于实时应用至关重要。在本文中,我们在文献中呈现elsed,最快的线段探测器。其效率的关键是局部段生长算法,其在存在小不连续性的情况下连接梯度对齐的像素。所提出的算法不仅在具有非常低端硬件的设备中运行,而且还可以参数化以促进短期或更长的段的检测,具体取决于手头的任务。我们还介绍了新的指标,以评估段探测器的准确性和重复性。在我们的实验中,我们证明我们的方法账户最高的重复性,它在文献中最有效。在实验中,我们量化了此类收益所交易的准确性。
translated by 谷歌翻译
Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
translated by 谷歌翻译
大多数实时人类姿势估计方法都基于检测接头位置。使用检测到的关节位置,可以计算偏差和肢体的俯仰。然而,由于这种旋转轴仍然不观察,因此不能计算沿着肢体沿着肢体至关重要的曲折,这对于诸如体育分析和计算机动画至关重要。在本文中,我们引入了方向关键点,一种用于估计骨骼关节的全位置和旋转的新方法,仅使用单帧RGB图像。灵感来自Motion-Capture Systems如何使用一组点标记来估计全骨骼旋转,我们的方法使用虚拟标记来生成足够的信息,以便准确地推断使用简单的后处理。旋转预测改善了接头角度最佳报告的平均误差48%,并且在15个骨骼旋转中实现了93%的精度。该方法还通过MPJPE在原理数据集上测量,通过MPJPE测量,该方法还改善了当前的最新结果14%,并概括为野外数据集。
translated by 谷歌翻译
在本文中,我们解决了估算图像之间尺度因子的问题。我们制定规模估计问题作为对尺度因素的概率分布的预测。我们设计了一种新的架构,ScaleNet,它利用扩张的卷积以及自我和互相关层来预测图像之间的比例。我们展示了具有估计尺度的整流图像导致各种任务和方法的显着性能改进。具体而言,我们展示了ScaleNet如何与稀疏的本地特征和密集的通信网络组合,以改善不同的基准和数据集中的相机姿势估计,3D重建或密集的几何匹配。我们对多项任务提供了广泛的评估,并分析了标准齿的计算开销。代码,评估协议和培训的型号在https://github.com/axelbarroso/scalenet上公开提供。
translated by 谷歌翻译
Erroneous feature matches have severe impact on subsequent camera pose estimation and often require additional, time-costly measures, like RANSAC, for outlier rejection. Our method tackles this challenge by addressing feature matching and pose optimization jointly. To this end, we propose a graph attention network to predict image correspondences along with confidence weights. The resulting matches serve as weighted constraints in a differentiable pose estimation. Training feature matching with gradients from pose optimization naturally learns to down-weight outliers and boosts pose estimation on image pairs compared to SuperGlue by 6.7% on ScanNet. At the same time, it reduces the pose estimation time by over 50% and renders RANSAC iterations unnecessary. Moreover, we integrate information from multiple views by spanning the graph across multiple frames to predict the matches all at once. Multi-view matching combined with end-to-end training improves the pose estimation metrics on Matterport3D by 18.8% compared to SuperGlue.
translated by 谷歌翻译
地理定位的概念是指确定地球上的某些“实体”的位置的过程,通常使用全球定位系统(GPS)坐标。感兴趣的实体可以是图像,图像序列,视频,卫星图像,甚至图像中可见的物体。由于GPS标记媒体的大规模数据集由于智能手机和互联网而迅速变得可用,而深入学习已经上升以提高机器学习模型的性能能力,因此由于其显着影响而出现了视觉和对象地理定位的领域广泛的应用,如增强现实,机器人,自驾驶车辆,道路维护和3D重建。本文提供了对涉及图像的地理定位的全面调查,其涉及从捕获图像(图像地理定位)或图像内的地理定位对象(对象地理定位)的地理定位的综合调查。我们将提供深入的研究,包括流行算法的摘要,对所提出的数据集的描述以及性能结果的分析来说明每个字段的当前状态。
translated by 谷歌翻译
We present a one-stage Fully Convolutional Line Parsing network (F-Clip) that detects line segments from images. The proposed network is very simple and flexible with variations that gracefully trade off between speed and accuracy for different applications. F-Clip detects line segments in an end-to-end fashion by predicting each line's center position, length, and angle. We further customize the design of convolution kernels of our fully convolutional network to effectively exploit the statistical priors of the distribution of line angles in real image datasets. We conduct extensive experiments and show that our method achieves a significantly better trade-off between efficiency and accuracy, resulting in a real-time line detector at up to 73 FPS on a single GPU. Such inference speed makes our method readily applicable to real-time tasks without compromising any accuracy of previous methods. Moreover, when equipped with a performance-improving backbone network, F-Clip is able to significantly outperform all state-of-the-art line detectors on accuracy at a similar or even higher frame rate. In other word, under same inference speed, F-Clip always achieving best accuracy compare with other methods. Source code https://github.com/Delay-Xili/F-Clip.
translated by 谷歌翻译
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over stateof-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
translated by 谷歌翻译
我们提出了一种适用于许多场景中的新方法,理解了适应Monte Carlo Tree Search(MCTS)算法的问题,该算法最初旨在学习玩高州复杂性的游戏。从生成的建议库中,我们的方法共同选择并优化了最小化目标项的建议。在我们的第一个从点云中进行平面图重建的应用程序中,我们的方法通过优化将深度网络预测的适应性组合到房间形状上的目标函数,选择并改进了以2D多边形为模型的房间建议。我们还引入了一种新型的可区分方法来渲染这些建议的多边形形状。我们对最近且具有挑战性的结构3D和Floor SP数据集的评估对最先进的表现有了显着改进,而没有对平面图配置施加硬性约束也没有假设。在我们的第二个应用程序中,我们扩展了从颜色图像重建一般3D房间布局并获得准确的房间布局的方法。我们还表明,可以轻松扩展我们的可区分渲染器,以渲染3D平面多边形和多边形嵌入。我们的方法在MatterPort3D-Layout数据集上显示了高性能,而无需在房间布局配置上引入硬性约束。
translated by 谷歌翻译
We present a novel method for local image feature matching. Instead of performing image feature detection, description, and matching sequentially, we propose to first establish pixel-wise dense matches at a coarse level and later refine the good matches at a fine level. In contrast to dense methods that use a cost volume to search correspondences, we use self and cross attention layers in Transformer to obtain feature descriptors that are conditioned on both images. The global receptive field provided by Transformer enables our method to produce dense matches in low-texture areas, where feature detectors usually struggle to produce repeatable interest points. The experiments on indoor and outdoor datasets show that LoFTR outperforms state-of-the-art methods by a large margin. LoFTR also ranks first on two public benchmarks of visual localization among the published methods. Code is available at our project page: https://zju3dv.github.io/loftr/.
translated by 谷歌翻译
We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed Cosy-Pose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets. Code and pre-trained models are available on the project webpage. 5
translated by 谷歌翻译
Sparse local feature extraction is usually believed to be of important significance in typical vision tasks such as simultaneous localization and mapping, image matching and 3D reconstruction. At present, it still has some deficiencies needing further improvement, mainly including the discrimination power of extracted local descriptors, the localization accuracy of detected keypoints, and the efficiency of local feature learning. This paper focuses on promoting the currently popular sparse local feature learning with camera pose supervision. Therefore, it pertinently proposes a Shared Coupling-bridge scheme with four light-weight yet effective improvements for weakly-supervised local feature (SCFeat) learning. It mainly contains: i) the \emph{Feature-Fusion-ResUNet Backbone} (F2R-Backbone) for local descriptors learning, ii) a shared coupling-bridge normalization to improve the decoupling training of description network and detection network, iii) an improved detection network with peakiness measurement to detect keypoints and iv) the fundamental matrix error as a reward factor to further optimize feature detection training. Extensive experiments prove that our SCFeat improvement is effective. It could often obtain a state-of-the-art performance on classic image matching and visual localization. In terms of 3D reconstruction, it could still achieve competitive results. For sharing and communication, our source codes are available at https://github.com/sunjiayuanro/SCFeat.git.
translated by 谷歌翻译
我们提出了一种称为DPODV2(密集姿势对象检测器)的三个阶段6 DOF对象检测方法,该方法依赖于致密的对应关系。我们将2D对象检测器与密集的对应关系网络和多视图姿势细化方法相结合,以估计完整的6 DOF姿势。与通常仅限于单眼RGB图像的其他深度学习方法不同,我们提出了一个统一的深度学习网络,允许使用不同的成像方式(RGB或DEPTH)。此外,我们提出了一种基于可区分渲染的新型姿势改进方法。主要概念是在多个视图中比较预测并渲染对应关系,以获得与所有视图中预测的对应关系一致的姿势。我们提出的方法对受控设置中的不同数据方式和培训数据类型进行了严格的评估。主要结论是,RGB在对应性估计中表现出色,而如果有良好的3D-3D对应关系,则深度有助于姿势精度。自然,他们的组合可以实现总体最佳性能。我们进行广泛的评估和消融研究,以分析和验证几个具有挑战性的数据集的结果。 DPODV2在所有这些方面都取得了出色的成果,同时仍然保持快速和可扩展性,独立于使用的数据模式和培训数据的类型
translated by 谷歌翻译
在许多计算机视觉管道中,在图像之间建立一组稀疏的关键点相关性是一项基本任务。通常,这转化为一个计算昂贵的最近邻居搜索,必须将一个图像的每个键盘描述符与其他图像的所有描述符进行比较。为了降低匹配阶段的计算成本,我们提出了一个能够检测到每个图像处的互补关键集的深度提取网络。由于仅需要在不同图像上比较同一组中的描述符,因此匹配相计算复杂度随集合数量而降低。我们训练我们的网络以预测关键点并共同计算相应的描述符。特别是,为了学习互补的关键点集,我们引入了一种新颖的无监督损失,对不同集合之间的交叉点进行了惩罚。此外,我们提出了一种基于描述符的新型加权方案,旨在惩罚使用非歧视性描述符的关键点的检测。通过广泛的实验,我们表明,我们的功能提取网络仅在合成的扭曲图像和完全无监督的方式进行训练,以降低匹配的复杂性,在3D重建和重新定位任务上取得了竞争成果。
translated by 谷歌翻译
We propose a single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses. Unlike a recently proposed single-shot technique for this task [11] that only predicts an approximate 6D pose that must then be refined, ours is accurate enough not to require additional post-processing. As a result, it is much faster -50 fps on a Titan X (Pascal) GPU -and more suitable for real-time processing. The key component of our method is a new CNN architecture inspired by [28,29] that directly predicts the 2D image locations of the projected vertices of the object's 3D bounding box. The object's 6D pose is then estimated using a PnP algorithm.For single object and multiple object pose estimation on the LINEMOD and OCCLUSION datasets, our approach substantially outperforms other recent 26] when they are all used without postprocessing. During post-processing, a pose refinement step can be used to boost the accuracy of these two methods, but at 10 fps or less, they are much slower than our method.
translated by 谷歌翻译