In recent years, deep graph matching methods have made great progress on the task of matching semantic features by learning discriminative representations with graph neural network (GNN) models. However, these methods usually rely on heuristically generated graph patterns, which may introduce unreliable relationships that harm matching performance. In this paper, we propose a joint graph learning and matching network, named GLAM, to explore reliable graph structures for boosting graph matching. GLAM adopts a pure attention-based framework for both graph learning and graph matching. Specifically, it employs two types of attention mechanisms, self-attention and cross-attention, for the two tasks: self-attention discovers the relationships between features and further updates the feature representations over the learned structures, while cross-attention computes cross-graph correlations between the two feature sets to be matched for feature reconstruction. Moreover, the final matching solution is derived directly from the output of the cross-attention layer, without employing a specific matching decision module. The proposed method is evaluated on three popular visual matching benchmarks (Pascal VOC, Willow Object, and SPair-71k), and it outperforms previous state-of-the-art graph matching methods by significant margins on all benchmarks. Furthermore, the graph patterns learned by our model are validated to be able to remarkably enhance previous deep graph matching methods by replacing their handcrafted graph structures with the learned ones.
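As a rough illustration of the attention-only design described above, the sketch below (PyTorch; all names and sizes are assumed, and this is not the authors' code) shows self-attention updating node features within each graph and the cross-attention score matrix itself being read off as the soft matching, with no separate decision module:

```python
# A minimal sketch (not GLAM itself) of attention-based graph learning and
# matching: self-attention updates features over a learned structure, and the
# cross-graph correlation doubles as the soft assignment.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Update node features from learned pairwise relations (one head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # learned graph structure
    return x + attn @ v                                     # residual feature update

d = 64
torch.manual_seed(0)
w_q, w_k, w_v = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
src = self_attention(torch.randn(10, d), w_q, w_k, w_v)  # graph 1: 10 keypoints
tgt = self_attention(torch.randn(12, d), w_q, w_k, w_v)  # graph 2: 12 keypoints

# Cross-graph correlation read directly as the matching: no decision module.
scores = src @ tgt.T / d ** 0.5
soft_assignment = F.softmax(scores, dim=-1)   # row-wise match distribution
matches = soft_assignment.argmax(dim=-1)      # hard matches for evaluation
```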
Most previous learning-based graph matching algorithms solve the quadratic assignment problem (QAP) by dropping one or more matching constraints and adopting a relaxed assignment solver to obtain sub-optimal correspondences. Such relaxation may actually weaken the original graph matching problem and, in turn, hurt matching performance. In this paper, we propose a deep learning-based graph matching framework that works for the original QAP without compromising the matching constraints. In particular, we design an affinity-assignment prediction network to jointly learn the pairwise affinities and estimate the node assignments, and we then develop a differentiable solver inspired by the probabilistic perspective of the pairwise affinities. Aiming to obtain better matching results, the probabilistic solver refines the estimated assignments in an iterative manner to impose both discrete and one-to-one matching constraints. The proposed method is evaluated on three popularly tested benchmarks (Pascal VOC, Willow Object, and SPair-71k), and it outperforms all previous state-of-the-art methods on all benchmarks.
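The iterative probabilistic solver can be pictured with the following hedged sketch: power-iteration-style updates on Lawler's QAP objective x^T K x, interleaved with alternating row/column normalization that pushes the soft assignment toward a doubly-stochastic, one-to-one matrix. This is a generic stand-in under assumed shapes, not the paper's exact solver:

```python
# Illustrative only: refine a soft assignment toward a discrete one-to-one
# matching for Lawler's QAP (maximize x^T K x over assignment vectors x).
import numpy as np

def refine_assignment(K, n1, n2, iters=50, sinkhorn_iters=10):
    x = np.full(n1 * n2, 1.0 / (n1 * n2))         # uniform soft assignment
    for _ in range(iters):
        x = K @ x                                  # favor high-affinity assignments
        X = x.reshape(n1, n2)
        for _ in range(sinkhorn_iters):            # alternating normalization pushes
            X = X / X.sum(axis=1, keepdims=True)   # toward a doubly-stochastic,
            X = X / X.sum(axis=0, keepdims=True)   # i.e. soft one-to-one, matrix
        x = X.reshape(-1)
    return x.reshape(n1, n2)

rng = np.random.default_rng(0)
n = 5
K = rng.random((n * n, n * n))
K = (K + K.T) / 2                                  # symmetric pairwise affinities
X = refine_assignment(K, n, n)
matching = X.argmax(axis=1)                        # discrete matching at the end
```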
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at github.com/magicleap/SuperGluePretrainedNetwork.
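The differentiable optimal transport step is the one concrete algorithmic component named here; a compact log-space Sinkhorn sketch with a SuperGlue-style dustbin row/column (simplified to uniform marginals, with random tensors standing in for the learned scores) might look like:

```python
# A hedged sketch of differentiable optimal transport for matching: Sinkhorn
# iterations in log-space over a score matrix augmented with a "dustbin" so
# points may remain unmatched. Marginals are simplified relative to SuperGlue.
import torch

def log_sinkhorn(scores, dustbin, iters=20):
    m, n = scores.shape
    S = torch.full((m + 1, n + 1), dustbin.item())  # dustbin row/col for non-matches
    S[:m, :n] = scores
    log_mu = torch.zeros(m + 1)   # simplified uniform (unnormalized) marginals
    log_nu = torch.zeros(n + 1)
    u, v = torch.zeros(m + 1), torch.zeros(n + 1)
    for _ in range(iters):        # alternating marginal projections in log-space
        u = log_mu - torch.logsumexp(S + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(S + u[:, None], dim=0)
    return S + u[:, None] + v[None, :]  # log of the transport plan

scores = torch.randn(8, 10)                   # stand-in for GNN-predicted scores
P = log_sinkhorn(scores, torch.tensor(0.5)).exp()
matches = P[:8, :10].argmax(dim=1)            # drop dustbin, read assignments
```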
In this paper, we study a novel and widely existing problem in graph matching (GM), namely, Bi-level Noisy Correspondence (BNC), which refers to node-level noisy correspondence (NNC) and edge-level noisy correspondence (ENC). In brief, on the one hand, due to the poor recognizability and viewpoint differences between images, it is inevitable to inaccurately annotate some keypoints with offset and confusion, leading to the mismatch between two associated nodes, i.e., NNC. On the other hand, the noisy node-to-node correspondence will further contaminate the edge-to-edge correspondence, thus leading to ENC. For the BNC challenge, we propose a novel method termed Contrastive Matching with Momentum Distillation. Specifically, the proposed method is equipped with a robust quadratic contrastive loss which enjoys the following merits: i) better exploring the node-to-node and edge-to-edge correlations through a GM customized quadratic contrastive learning paradigm; ii) adaptively penalizing the noisy assignments based on the confidence estimated by the momentum teacher. Extensive experiments on three real-world datasets show the robustness of our model compared with 12 competitive baselines.
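A speculative sketch of the momentum-distillation ingredient: the teacher is an exponential moving average (EMA) of the student, and its confidence on each annotated match down-weights potentially noisy assignments in the student's loss. Everything below (network, loss shape, weighting) is an assumed minimal form, not the paper's implementation:

```python
# Illustration of momentum distillation for noisy correspondences: an EMA
# teacher scores matches, and low teacher confidence shrinks the loss on
# possibly mis-annotated node pairs.
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, m=0.999):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.data = m * pt.data + (1 - m) * ps.data  # momentum teacher update

student = torch.nn.Linear(32, 32)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

x1, x2 = torch.randn(6, 32), torch.randn(6, 32)      # node features of two graphs
logits_s = student(x1) @ student(x2).T               # student similarity
with torch.no_grad():                                # teacher confidence per pair
    conf = F.softmax(teacher(x1) @ teacher(x2).T, dim=1).diag()
target = torch.arange(6)                             # annotated (possibly noisy) matches
loss = (conf * F.cross_entropy(logits_s, target, reduction="none")).mean()
loss.backward()
ema_update(teacher, student)
```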
Graph matching (GM) has been a building block in many areas, including computer vision and pattern recognition. Despite recent impressive progress, existing deep GM methods often have difficulty in handling outliers in both graphs, which are ubiquitous in practice. We propose a deep reinforcement learning (RL) based approach, RGM, for weighted graph matching, whose sequential node matching scheme naturally fits the strategy of selectively matching inliers in the presence of outliers. A revocable action scheme is designed to improve the agent's flexibility on complex constrained matching tasks. Moreover, we propose a quadratic approximation technique to regularize the affinity matrix in the presence of outliers. As such, the RL agent can finish the matching in time when the objective score stops growing; otherwise, an extra hyperparameter, namely the number of common inliers, would be needed to avoid matching outliers. In this paper, we focus on learning a back-end solver for the most general form of GM, Lawler's QAP, whose input is the affinity matrix. Our approach can also boost other solvers that take the affinity matrix as input. Experimental results on both synthetic and real-world datasets showcase its superior performance in terms of matching accuracy and robustness.
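To make the sequential matching scheme concrete, here is a simplified, non-learned stand-in: node pairs are matched one at a time to greedily grow the Lawler QAP score x^T K x, and matching stops once the score no longer increases, which is how outliers can be left unmatched without knowing the inlier count. The RL policy and the revocable actions are omitted; greedy selection is used purely for illustration:

```python
import numpy as np

def sequential_match(K, n1, n2):
    """Greedily add node pairs while the QAP score x^T K x keeps growing."""
    x, used_i, used_j, score = np.zeros(n1 * n2), set(), set(), 0.0
    while True:
        best_gain, best_pair = 0.0, None
        for i in range(n1):
            for j in range(n2):
                if i in used_i or j in used_j:
                    continue
                x_try = x.copy()
                x_try[i * n2 + j] = 1.0
                gain = x_try @ K @ x_try - score
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is None:       # score stopped growing: stop matching early
            return x.reshape(n1, n2)
        i, j = best_pair
        x[i * n2 + j] = 1.0
        used_i.add(i); used_j.add(j)
        score += best_gain

rng = np.random.default_rng(0)
n1, n2 = 3, 4
K = rng.random((n1 * n2, n1 * n2)) - 0.6   # negative entries mimic outlier pairs
K = (K + K.T) / 2
X = sequential_match(K, n1, n2)            # all-zero rows/cols remain unmatched
```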
In recent years, differentiable solvers for the linear assignment problem (LAP) have attracted much research attention and are usually embedded into learning frameworks as components. However, previous algorithms, with or without learning strategies, usually suffer from degraded optimality as the problem size increases. In this paper, we propose a learnable linear assignment solver based on deep graph networks. Specifically, we first transform the cost matrix into a bipartite graph and convert the assignment task into the problem of selecting reliable edges from the constructed graph. Subsequently, a deep graph network is developed to aggregate and update the features of nodes and edges. Finally, the network predicts a label for each edge that indicates the assignment relationship. Experimental results on a synthetic dataset reveal that our method outperforms state-of-the-art baselines and achieves consistently high accuracy as the problem size increases. Furthermore, we also embed the proposed solver, in comparison with state-of-the-art baseline solvers, into a popular multi-object tracking (MOT) framework to train the tracker in an end-to-end manner. The experimental results on MOT benchmarks illustrate that the proposed LAP solver improves the tracker by the largest margin.
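A toy sketch of the construction described above, under assumed sizes: the LAP cost matrix becomes a bipartite graph whose edges carry costs as features, and the task turns into labeling edges. The single hand-rolled aggregation step below merely stands in for the learned deep graph network:

```python
import numpy as np

cost = np.random.default_rng(0).random((4, 4))  # LAP cost matrix (workers x jobs)

# Bipartite graph: one edge per (row, col) pair, with the cost as edge feature.
edges = [(i, j, cost[i, j]) for i in range(4) for j in range(4)]

# One hand-rolled aggregation step: compare each edge's cost to the mean cost
# incident to its two endpoints (a crude stand-in for learned message passing).
row_mean, col_mean = cost.mean(axis=1), cost.mean(axis=0)
edge_logit = {(i, j): row_mean[i] + col_mean[j] - 2 * c for i, j, c in edges}

# Label edges: per row, keep the most promising incident edge. (The learned
# network predicts such labels jointly; this greedy readout may violate
# one-to-one and is for illustration only.)
assignment = {i: max(range(4), key=lambda j: edge_logit[(i, j)]) for i in range(4)}
```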
Many challenges from the natural world can be formulated as a graph matching problem. Previous deep learning-based methods mainly consider a full two-graph matching setting. In this work, we study the more general partial matching problem with multi-graph cycle consistency guarantees. Building on recent progress in deep learning on graphs, we propose a novel data-driven method (URL) for partial multi-graph matching, which uses an object-to-universe formulation and learns latent representations of abstract universe points. The proposed approach advances the state of the art on the semantic keypoint matching problem, evaluated on the Pascal VOC, CUB, and Willow datasets. Moreover, a set of controlled experiments on a synthetic graph matching dataset demonstrates the scalability of our method to graphs with a large number of nodes and its robustness to high partiality.
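The object-to-universe formulation admits a short self-contained illustration: nodes of every graph are assigned to shared latent universe points, any pairwise (partial) matching is X_A X_B^T, and cycle consistency across graphs holds by construction. The sizes and the hard nearest-universe assignment below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
universe = rng.standard_normal((10, 16))     # 10 latent universe points

def to_universe(feats):
    sim = feats @ universe.T
    return (sim == sim.max(axis=1, keepdims=True)).astype(float)  # hard assignment

X_a = to_universe(rng.standard_normal((6, 16)))   # graph A: 6 nodes
X_b = to_universe(rng.standard_normal((8, 16)))   # graph B: 8 nodes
X_c = to_universe(rng.standard_normal((5, 16)))   # graph C: 5 nodes

P_ab = X_a @ X_b.T     # partial matching A<->B; empty rows/cols mean "unmatched"

# Any correspondence implied by composing A->B with B->C is also present in the
# direct A->C matching, so multi-graph cycle consistency holds by construction.
assert np.all((X_a @ X_b.T @ X_b @ X_c.T > 0) <= (X_a @ X_c.T > 0))
```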
Keypoint matching is a crucial component of multiple image-related applications, such as image stitching, visual simultaneous localization and mapping (SLAM), and so on. Both handcrafted and recently emerged deep learning-based keypoint matching methods rely only on keypoints and local features, while losing sight of other available sensors, such as the inertial measurement unit (IMU), in the above applications. In this paper, we demonstrate that the motion estimation from IMU integration can be used to exploit a prior on the spatial distribution of keypoints between images. To this end, a probabilistic perspective of the attention formulation is proposed to naturally integrate the spatial distribution prior into an attentional graph neural network. With the assistance of the spatial distribution prior, the effort of the network for modeling the hidden features can be reduced. Furthermore, we propose a projection loss for the keypoint matching network, which provides a smooth boundary between matched and unmatched keypoints. Experiments on image matching on visual SLAM datasets indicate the effectiveness and efficiency of the presented method.
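One way to picture the probabilistic attention formulation, under an assumed Gaussian uncertainty model: IMU-integrated motion projects keypoints from frame A into frame B, a Gaussian around each projection gives a spatial prior over candidate matches, and its log is added to the attention logits:

```python
# A minimal sketch of prior-conditioned attention; the Gaussian prior, the
# shift-only motion model, and all sizes are assumptions for illustration.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
kpts_a = rng.random((5, 2)) * 100          # keypoint pixel coords in frame A
kpts_b = rng.random((7, 2)) * 100          # keypoint pixel coords in frame B
projected = kpts_a + np.array([3.0, -2.0]) # IMU-integrated motion, here a shift

# Spatial prior: matches are likely near the IMU-projected location.
d2 = ((projected[:, None, :] - kpts_b[None, :, :]) ** 2).sum(-1)
log_prior = -d2 / (2 * 15.0 ** 2)          # Gaussian with 15 px uncertainty

feat_a, feat_b = rng.standard_normal((5, 32)), rng.standard_normal((7, 32))
logits = feat_a @ feat_b.T / np.sqrt(32)
attn = softmax(logits + log_prior)         # prior-conditioned cross attention
```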
In this work, we present a novel learning-based framework that combines the local accuracy of contrastive learning with the global consistency of geometric approaches for robust non-rigid matching. We first observe that, although contrastive learning can lead to powerful point features, the learned correspondences commonly lack smoothness and consistency, owing to the purely combinatorial nature of the standard contrastive losses. To overcome this limitation, we propose to boost contrastive learning with two types of smoothness regularization that inject geometric information into correspondence learning. With this novel combination, the resulting features are both highly discriminative across individual points and lead to robust and consistent correspondences through simple proximity queries. Our framework is general and is applicable to local feature learning in both the 3D and 2D domains. We demonstrate the superiority of our approach through extensive experiments on a wide range of challenging matching benchmarks, including 3D non-rigid shape correspondence and 2D image keypoint matching.
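A hedged sketch of the combination the abstract motivates: an InfoNCE-style contrastive term that makes corresponding points discriminative, plus a smoothness regularizer asking spatially neighboring points to receive similar features. The specific neighbor-based regularizer below is an assumption, not necessarily one of the paper's two:

```python
import torch
import torch.nn.functional as F

n, d = 32, 64
torch.manual_seed(0)
xyz = torch.rand(n, 3)                       # source point coordinates
f_src = torch.randn(n, d, requires_grad=True)
f_tgt = torch.randn(n, d)                    # features of the true matches

# Contrastive term: each source point should match its own target feature.
logits = F.normalize(f_src, dim=1) @ F.normalize(f_tgt, dim=1).T / 0.07
loss_nce = F.cross_entropy(logits, torch.arange(n))

# Smoothness term: k-nearest spatial neighbors should get similar features.
knn = torch.cdist(xyz, xyz).topk(4, largest=False).indices[:, 1:]  # 3 neighbors
loss_smooth = (f_src[:, None, :] - f_src[knn]).pow(2).mean()

loss = loss_nce + 0.1 * loss_smooth          # assumed weighting
loss.backward()
```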
We propose a novel cost aggregation network, called Cost Aggregation Transformers (CATs), to find dense correspondences between semantically similar images, with additional challenges posed by large intra-class appearance and geometric variations. Cost aggregation is a highly important process in matching tasks, and the matching accuracy depends on the quality of its output. Compared with handcrafted or CNN-based methods addressing cost aggregation, which either lack robustness to severe deformations or inherit the limitation of CNNs that fail to discriminate incorrect matches due to limited receptive fields, CATs explore global consensus among initial correlation maps with the help of some architectural designs that allow us to fully leverage self-attention mechanisms. Specifically, we include appearance affinity modeling to aid the cost aggregation process in order to disambiguate the noisy initial correlation maps, and propose multi-level aggregation to efficiently capture different semantics from hierarchical feature representations. We then combine these with swapping self-attention techniques and residual connections, not only to enforce consistent matching but also to ease the learning process, and we find that these result in an apparent performance boost. We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods and provide extensive ablation studies. Code and trained models are available at https://github.com/sunghwanhong/cats.
3D point cloud registration is a fundamental problem in computer vision and robotics. Recently, learning-based point cloud registration methods have made great progress. However, these methods are sensitive to outliers, which lead to more incorrect correspondences. In this paper, we propose a novel deep graph matching-based framework for point cloud registration. Specifically, we first transform point clouds into graphs and extract deep features for each point. Then, we develop a module based on deep graph matching to calculate a soft correspondence matrix. By using graph matching, not only the local geometry of each point but also its structure and topology in a larger range are considered in establishing correspondences, so that more correct correspondences are found. We train the network with a loss directly defined on the correspondences, and in the test stage the soft correspondences are transformed into hard one-to-one correspondences so that registration can be performed by a correspondence-based solver. Furthermore, we introduce a transformer-based method to generate edges for graph construction, which further improves the quality of the correspondences. Extensive experiments on object-level and scene-level benchmark datasets show that the proposed method achieves state-of-the-art performance. The code is available at: \href{https://github.com/fukexue/RGM}{https://github.com/fukexue/RGM}.
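The test-time pipeline outlined above (soft correspondences discretized into hard one-to-one matches, then a correspondence-based solver) can be sketched end to end. Here the soft matrix is random instead of network-predicted, Hungarian assignment provides the discretization, and Kabsch/SVD recovers the rigid transform:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
src = rng.random((20, 3))
R_true = np.linalg.qr(rng.standard_normal((3, 3)))[0]
t_true = rng.random(3)
tgt = src @ R_true.T + t_true                # ground-truth rigid motion

soft = rng.random((20, 20))                  # stand-in for the predicted soft matrix
rows, cols = linear_sum_assignment(-soft)    # hard one-to-one correspondences

# Kabsch/SVD: closed-form least-squares rigid transform from matched pairs.
p, q = src[rows], tgt[cols]
p0, q0 = p - p.mean(0), q - q.mean(0)
U, _, Vt = np.linalg.svd(p0.T @ q0)
S = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
R = Vt.T @ S @ U.T
t = q.mean(0) - R @ p.mean(0)
```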
Deep learning techniques have led to remarkable breakthroughs in the field of generic object detection and have spawned a lot of scene-understanding tasks in recent years. Scene graphs have been the focus of research because of their powerful semantic representation and applications to scene understanding. Scene Graph Generation (SGG) refers to the task of automatically mapping an image into a semantic structural scene graph, which requires correctly labeling detected objects and their relationships. Although this is a challenging task, the community has proposed a lot of SGG approaches and achieved good results. In this paper, we provide a comprehensive survey of the recent achievements in this field brought about by deep learning techniques. We review 138 representative works covering different input modalities, and systematically summarize existing image-based SGG methods from the perspective of feature extraction and fusion. We attempt to connect and systematize existing visual relationship detection methods in a comprehensive way, summarizing and interpreting the mechanisms and strategies of SGG. Finally, we conclude this survey with an in-depth discussion of the current issues and future research directions. This survey will help readers better understand the current research status and ideas.
Point cloud registration is a fundamental task for many applications such as localization, mapping, tracking, and reconstruction. Successful registration relies on extracting robust and discriminative geometric features. Existing learning-based methods require high computing capacity to process a large number of raw points at the same time. Although these approaches achieve convincing results, they are hard to apply in real-world situations due to high computational costs. In this paper, we introduce a framework that efficiently and economically extracts dense features using a graph attention network for point cloud matching and registration (DFGAT). The detector of DFGAT is responsible for finding highly reliable keypoints in large raw datasets. The descriptor of DFGAT takes these keypoints, combined with their neighbors, to extract invariant dense features in preparation for matching. The graph attention network uses an attention mechanism that enriches the relationships between point clouds. Finally, we consider this as an optimal transport problem and use the Sinkhorn algorithm to find positive and negative matches. We perform thorough tests on the KITTI dataset and evaluate the effectiveness of this approach. The results show that, with efficient and compact keypoint selection and description, this method can achieve the best performance on matching metrics and reaches the highest registration success rate of 99.88% compared with other state-of-the-art approaches.
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
Establishing dense correspondences between a pair of images is a fundamental computer vision problem, which is typically tackled by matching local feature descriptors. However, without global awareness, such local features are often insufficient for disambiguating similar regions. And computing the pairwise feature correlation across images is both computationally expensive and memory-intensive. To make local features aware of the global context and improve their matching accuracy, we introduce DenseGAP, a new solution for efficient Dense correspondence learning with a Graph-structured neural network conditioned on Anchor Points. Specifically, we first propose a graph structure that utilizes anchor points to provide sparse but reliable priors on inter- and intra-image context, and propagates them to all image points via directed edges. We also design a graph-structured network to broadcast multi-level contexts via lightweight message-passing layers and generate high-resolution feature maps at low memory cost. Finally, based on the predicted feature maps, we introduce a coarse-to-fine framework for accurate correspondence prediction using cycle consistency. Our feature descriptors capture both local and global information, thus enabling a continuous feature field for querying arbitrary points at high resolution. Through comprehensive ablation experiments and evaluations on large-scale indoor and outdoor datasets, we demonstrate that our method advances the state of the art of correspondence learning on most benchmarks.
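The cycle-consistency ingredient of the coarse-to-fine framework reduces, in its simplest form, to a mutual nearest-neighbor check: match A→B and B→A in feature space and keep only the points whose round trip returns to where it started. The sketch below uses random descriptors as stand-ins for the network's feature maps:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_a = rng.standard_normal((100, 32))   # per-point descriptors, image A
feat_b = rng.standard_normal((120, 32))   # per-point descriptors, image B

sim = feat_a @ feat_b.T
a_to_b = sim.argmax(axis=1)               # forward matches
b_to_a = sim.argmax(axis=0)               # backward matches

cycle_ok = b_to_a[a_to_b] == np.arange(100)   # round trip returns to start
matches = np.stack([np.nonzero(cycle_ok)[0], a_to_b[cycle_ok]], axis=1)
```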
A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene. As computer vision technology continues to develop, people are no longer satisfied with simply detecting and recognizing objects in images; instead, they look forward to a higher level of understanding and reasoning about visual scenes. For example, given an image, we want to not only detect and recognize the objects in it, but also know the relationships between those objects (visual relationship detection) and generate a text description based on the image content (image captioning). Or we might want the machine to tell us what the little girl in the image is doing (visual question answering (VQA)), or even remove the dog from the image and find similar images (image editing and retrieval), etc. These tasks require a higher level of understanding and reasoning for image vision tasks, and the scene graph is exactly such a powerful tool for scene understanding. Therefore, scene graphs have attracted the attention of a large number of researchers, and the related research is often cross-modal, complex, and rapidly developing. However, no relatively systematic survey of scene graphs exists at present. To this end, this survey conducts a comprehensive investigation of current scene graph research. More specifically, we first summarize the general definition of the scene graph, then provide a comprehensive and systematic discussion of scene graph generation (SGG) methods and of SGG with the aid of prior knowledge. We then investigate the main applications of scene graphs and summarize the most commonly used datasets. Finally, we offer some insights into the future development of scene graphs. We believe this will be a very helpful foundation for future research on scene graphs.
Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this survey, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art graph neural networks into four categories, namely recurrent graph neural networks, convolutional graph neural networks, graph autoencoders, and spatial-temporal graph neural networks. We further discuss the applications of graph neural networks across various domains and summarize the open source codes, benchmark data sets, and model evaluation of graph neural networks. Finally, we propose potential research directions in this rapidly growing field.
Answering semantically complicated questions according to an image is challenging in the visual question answering (VQA) task. Although the image can be well represented by deep learning, the question is always simply embedded and cannot well indicate its meaning. Besides, the visual and textual features have a gap across different modalities, and it is hard to align and utilize the cross-modality information. In this paper, we focus on these two problems and propose a Graph Matching Attention (GMA) network. Firstly, it not only builds a graph for the image, but also constructs a graph for the question in terms of both syntactic and embedding information. Next, we explore the intra-modality relationships with a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question. The updated cross-modality features are then sent into the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. Ablation studies verify the effectiveness of each module in the GMA network.
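A minimal picture of bilateral cross-modality attention (all shapes and the fusion choice are assumed): image nodes attend to question nodes and vice versa, so each modality's node features are updated with evidence from the other before answer prediction:

```python
import torch
import torch.nn.functional as F

def cross_attend(query_nodes, context_nodes, scale):
    """Residual cross-modal update: queries gather evidence from context."""
    attn = F.softmax(query_nodes @ context_nodes.T / scale, dim=-1)
    return query_nodes + attn @ context_nodes

d = 128
torch.manual_seed(0)
img_nodes = torch.randn(36, d)    # e.g. detected object regions
qst_nodes = torch.randn(12, d)    # e.g. question words / syntax nodes

img_updated = cross_attend(img_nodes, qst_nodes, d ** 0.5)  # image <- question
qst_updated = cross_attend(qst_nodes, img_nodes, d ** 0.5)  # question <- image
fused = torch.cat([img_updated.mean(0), qst_updated.mean(0)])  # to answer head
```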
Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on the features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks, and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them into channel attention, spatial attention, temporal attention, and branch attention. A related repository, https://github.com/MenghaoGuo/Awesome-Vision-Attentions, is dedicated to collecting related work. We also suggest future directions for attention mechanism research.
Human parsing aims to partition humans in image or video into multiple pixel-level semantic parts. In the last decade, it has gained significantly increased interest in the computer vision community and has been utilized in a broad range of practical applications, from security monitoring, to social media, to visual special effects, just to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions are still confusing. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, by introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page, to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.