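Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to this strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer has only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art results on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).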
Translated by Google Translate
We present a novel method for local image feature matching. Instead of performing image feature detection, description, and matching sequentially, we propose to first establish pixel-wise dense matches at a coarse level and later refine the good matches at a fine level. In contrast to dense methods that use a cost volume to search correspondences, we use self and cross attention layers in Transformer to obtain feature descriptors that are conditioned on both images. The global receptive field provided by Transformer enables our method to produce dense matches in low-texture areas, where feature detectors usually struggle to produce repeatable interest points. The experiments on indoor and outdoor datasets show that LoFTR outperforms state-of-the-art methods by a large margin. LoFTR also ranks first on two public benchmarks of visual localization among the published methods. Code is available at our project page: https://zju3dv.github.io/loftr/.
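The idea of descriptors "conditioned on both images" can be illustrated with a minimal sketch, assuming plain scaled dot-product attention with residual connections and no learned projections. The function names `attention` and `self_then_cross` are hypothetical and are not taken from the LoFTR code base; the real model uses learned layers and positional encodings that are omitted here.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention over lists of feature vectors.
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def add(x, y):
    # Element-wise sum of two equally shaped feature lists (residual connection).
    return [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)]

def self_then_cross(feat_a, feat_b):
    # Self-attention: each image attends to its own features.
    a = add(feat_a, attention(feat_a, feat_a, feat_a))
    b = add(feat_b, attention(feat_b, feat_b, feat_b))
    # Cross-attention: each image queries the other image's features,
    # so the output descriptors depend on both inputs.
    return add(a, attention(a, b, b)), add(b, attention(b, a, a))

if __name__ == "__main__":
    feat_a = [[1.0, 0.0], [0.0, 1.0]]               # 2 features, dimension 2
    feat_b = [[1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]  # 3 features, dimension 2
    out_a, out_b = self_then_cross(feat_a, feat_b)
    print(out_a)
    print(out_b)
```

Because the cross-attention step reads the other image's features, changing `feat_b` changes the descriptors returned for `feat_a`, which is the property the abstract relies on.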
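Finding correspondences across images is an important task in many visual applications. Recent state-of-the-art methods focus on end-to-end learning-based architectures designed in a coarse-to-fine manner. They use a very deep CNN or a multi-block Transformer to learn robust representations, which requires high computational power. Moreover, these methods learn features without reasoning about the objects and shapes inside the images, and thus lack interpretability. In this paper, we propose an architecture for image matching that is efficient, robust, and interpretable. More specifically, we introduce a novel feature matching module called TopicFM, which roughly organizes the same spatial structure across images into a topic and then augments the features inside each topic for accurate matching. To infer topics, we first learn global embeddings of topics and then use a latent-variable model to detect image structures and assign them to topics. Our method performs matching only in co-visible regions to reduce computation. Extensive experiments on both outdoor and indoor datasets show that our method outperforms recent methods in terms of matching performance and computational efficiency. The code is available at https://github.com/truongkhang/topicfm.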
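Generating robust and reliable correspondences across images is a fundamental task for a wide range of applications. To capture context at both global and local granularity, we propose ASpanFormer, a Transformer-based detector-free matcher built on a hierarchical attention structure, adopting a novel attention operation that adjusts the attention span in a self-adaptive manner. To achieve this, flow maps are first regressed in each cross-attention phase to locate the center of the search region. Next, a sampling grid is generated around the center, whose size, instead of being empirically fixed, is adaptively computed from a pixel uncertainty estimated along with the flow map. Finally, attention is computed across the two images within the derived regions, referred to as the attention span. By these means, we are able not only to maintain long-range dependencies, but also to enable fine-grained attention among pixels of high relevance, which compensates for the essential locality and piece-wise smoothness in matching tasks. State-of-the-art accuracy on a wide range of evaluation benchmarks validates the strong matching capability of our method.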
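Local image feature matching, which aims to identify and match similar regions across an image pair, is an essential concept in computer vision. Most existing image matching approaches follow a one-to-one assignment principle and employ mutual nearest neighbors to guarantee unique correspondences between local features across images. However, images taken under different conditions may exhibit large scale variations or viewpoint changes, so that one-to-one assignment may cause ambiguous or missing representations in dense matching. In this paper, we introduce AdaMatcher, a novel detector-free local feature matching method, which first correlates dense features with a lightweight feature interaction module and estimates the co-visible area of the paired images, then performs a patch-level many-to-one assignment to predict match proposals, and finally refines them with a one-to-one refinement module. Extensive experiments show that AdaMatcher outperforms solid baselines and achieves state-of-the-art results on many downstream tasks. Additionally, the many-to-one assignment and one-to-one refinement modules can be used as a refinement network for other matching methods, such as SuperGlue, to further boost their performance. Code will be available upon publication.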
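The one-to-one baseline that this abstract contrasts itself with, mutual nearest-neighbor matching, can be sketched as follows. This is a minimal illustration assuming a precomputed similarity matrix; the function name `mutual_nearest_neighbors` and the example values are hypothetical, and the sketch shows the conventional baseline, not AdaMatcher's many-to-one assignment.

```python
def mutual_nearest_neighbors(sim):
    # sim[i][j] is the similarity between feature i of image A
    # and feature j of image B (higher means more similar).
    n_a, n_b = len(sim), len(sim[0])
    # Best partner in B for every feature in A.
    best_b_for_a = [max(range(n_b), key=lambda j: sim[i][j]) for i in range(n_a)]
    # Best partner in A for every feature in B.
    best_a_for_b = [max(range(n_a), key=lambda i: sim[i][j]) for j in range(n_b)]
    # Keep a pair only if each feature is the other's best choice.
    return [(i, j) for i, j in enumerate(best_b_for_a) if best_a_for_b[j] == i]

if __name__ == "__main__":
    sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],  # also prefers column 0, but loses to row 0
        [0.1, 0.2, 0.7],
    ]
    print(mutual_nearest_neighbors(sim))  # [(0, 0), (2, 2)]
```

In the example, feature 1 of image A also prefers feature 0 of image B but is discarded, which is the kind of missing correspondence under scale or viewpoint change that motivates a many-to-one assignment.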
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at github.com/magicleap/SuperGluePretrainedNetwork.
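Estimating assignments through a differentiable optimal transport problem is commonly done with Sinkhorn iterations. The following is a minimal sketch under simplifying assumptions: uniform marginals, a plain (not log-domain) update, and no handling of non-matchable points. The function name `sinkhorn` and the score values are hypothetical; in the paper the scores come from a graph neural network.

```python
import math

def sinkhorn(scores, n_iters=100):
    # scores[i][j]: matching score between feature i and feature j
    # (higher is better). Returns a soft assignment matrix whose rows
    # sum to about 1/m and whose columns sum to 1/n.
    m, n = len(scores), len(scores[0])
    kernel = [[math.exp(s) for s in row] for row in scores]
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(n_iters):
        # Rescale rows, then columns, to match the target marginals.
        u = [(1.0 / m) / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [(1.0 / n) / sum(kernel[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n)] for i in range(m)]

if __name__ == "__main__":
    scores = [
        [2.0, 0.5, 0.0],
        [0.3, 1.5, 0.2],
        [0.0, 0.4, 1.8],
    ]
    for row in sinkhorn(scores):
        print([round(p, 4) for p in row])
```

Every step is a composition of exponentials, sums, and divisions, so gradients can flow from the assignment matrix back to the scores, which is what makes end-to-end training possible.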
We introduce a lightweight network to improve descriptors of keypoints within the same image. The network takes the original descriptors and the geometric properties of keypoints as the input, and uses an MLP-based self-boosting stage and a Transformer-based cross-boosting stage to enhance the descriptors. The enhanced descriptors can be either real-valued or binary ones. We use the proposed network to boost both hand-crafted (ORB, SIFT) and the state-of-the-art learning-based descriptors (SuperPoint, ALIKE) and evaluate them on image matching, visual localization, and structure-from-motion tasks. The results show that our method significantly improves the performance of each task, particularly in challenging cases such as large illumination changes or repetitive patterns. Our method requires only 3.2ms on desktop GPU and 27ms on embedded GPU to process 2000 features, which is fast enough to be applied to a practical system.
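The success of the transformer architecture in natural language processing has recently attracted attention in the computer vision field. Owing to its ability to learn long-range dependencies, the transformer has been used as a replacement for the widely used convolution operators. This replacement has proven successful in numerous tasks, in which several state-of-the-art methods rely on transformers for better learning. In computer vision, the 3D field has also witnessed an increase in the use of transformers to augment 3D convolutional neural networks and multi-layer perceptron networks. Although a number of surveys have focused on transformers in vision, 3D vision requires special attention because of the differences in data representation and processing compared to 2D vision. In this work, we present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks, including classification, segmentation, detection, completion, pose estimation, and others. We discuss transformer design in 3D vision, which allows it to process data with various 3D representations. For each application, we highlight the key properties and contributions of the transformer-based methods. To assess the competitiveness of these methods, we compare their performance to common non-transformer methods on 12 3D benchmarks. We conclude the survey by discussing open directions and challenges for transformers in 3D vision. In addition to the presented papers, we aim to frequently update the latest relevant papers along with their corresponding implementations at: https://github.com/lahoud/3d-vision-transformers.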
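In recent years, significant progress has been made in crowd counting research. However, because of the challenging scale variations and complex scenes found in crowds, neither traditional convolutional networks nor recent Transformer architectures with fixed-size attention handle the task well. To address this problem, this paper proposes a scene-adaptive attention network, termed SAANet. First, we design a Transformer backbone with built-in deformable attention, which learns adaptive feature representations with deformable sampling locations and dynamic attention weights. Then, we propose multi-level feature fusion and count-attentive feature enhancement modules to strengthen the feature representation under the global image context. The learned representations attend to the foreground and adapt to different crowd scales. We conduct extensive experiments on four challenging crowd counting benchmarks, demonstrating that our method achieves state-of-the-art performance. In particular, our method currently ranks first on the public leaderboard of the NWPU-Crowd benchmark. We hope our method can be a strong baseline to support future research in crowd counting. The source code will be released to the community.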
Advanced visual localization techniques encompass image retrieval challenges and 6 Degree-of-Freedom (DoF) camera pose estimation, such as hierarchical localization. Thus, they must extract global and local features from input images. Previous methods have achieved this through resource-intensive or accuracy-reducing means, such as combinatorial pipelines or multi-task distillation. In this study, we present a novel method called SuperGF, which effectively unifies local and global features for visual localization, leading to a higher trade-off between localization accuracy and computational efficiency. Specifically, SuperGF is a transformer-based aggregation model that operates directly on image-matching-specific local features and generates global features for retrieval. We conduct experimental evaluations of our method in terms of both accuracy and efficiency, demonstrating its advantages over other methods. We also provide implementations of SuperGF using various types of local features, including dense and sparse learning-based or hand-crafted descriptors.
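We tackle the essential task of finding dense visual correspondences between a pair of images. This is a challenging problem due to various factors such as poor texture, repetitive patterns, illumination variation, and motion blur. In contrast to methods that use dense correspondence ground truth as direct supervision for local feature matching training, we train 3DG-STFM: a multi-modal matching model (teacher) that enforces depth consistency under 3D dense correspondence supervision and transfers the knowledge to a 2D unimodal matching model (student). Both the teacher and student models consist of two transformer-based matching modules that obtain dense correspondences in a coarse-to-fine manner. The teacher model guides the student model to learn RGB-induced depth information for matching on both the coarse and fine branches. We also evaluate 3DG-STFM on a model compression task. To the best of our knowledge, 3DG-STFM is the first student-teacher learning method for the local feature matching task. The experiments show that our method outperforms state-of-the-art methods on indoor and outdoor camera pose estimation and on homography estimation problems. Code is available at: https://github.com/ryan-prime/3dg-stfm.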
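Most existing RGB-D salient object detection methods rely on convolution operations and construct complex interwoven fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation imposes a ceiling on the performance of convolution-based methods. In this work, we rethink this task from the perspective of global information alignment and transformation. Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path (TIPP). TransCMD treats multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on the transformer. In addition, considering the quadratic complexity w.r.t. the number of input tokens, we design a patch-wise token re-embedding strategy (PTRE) with acceptable computational cost. Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass state-of-the-art CNN-based methods when it is equipped with the TIPP.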
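Modeling sparse and dense image matching within a unified functional correspondence model has recently attracted increasing research interest. However, existing efforts mainly focus on improving matching accuracy while ignoring efficiency, which is crucial for real-world applications. In this paper, we propose an efficient structure that finds correspondences in a coarse-to-fine manner, which significantly improves the efficiency of the functional correspondence model. To achieve this, multiple transformer blocks are connected stage-wise to gradually refine the predicted coordinates on top of a shared multi-scale feature extraction network. Given a pair of images and arbitrary query coordinates, all correspondences are predicted within a single feed-forward pass. We further propose an adaptive query-clustering strategy and an uncertainty-based outlier detection module that cooperate with the proposed framework for faster and better predictions. Experiments on various sparse and dense matching tasks demonstrate the superiority of our method over existing state-of-the-art work in both efficiency and effectiveness.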
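Human pose estimation aims to locate the keypoints of all people in different scenes. Despite promising results, current approaches still face some challenges. Existing top-down methods process each person individually, without modeling the interaction between different people and the scene they are in. Consequently, the performance of human detection degrades when severe occlusion occurs. On the other hand, existing bottom-up methods consider all people at the same time and capture the global knowledge of the entire image. However, they are less accurate than top-down methods because of scale variation. To address these problems, we propose a novel Dual-Pipeline Integrated Transformer (DPIT) that integrates the top-down and bottom-up pipelines to explore the visual clues of different receptive fields and achieve their complementarity. Specifically, DPIT consists of two branches: the bottom-up branch processes the whole image to capture global visual information, while the top-down branch extracts local feature representations from single-human bounding boxes. The feature representations extracted from the bottom-up and top-down branches are then fed into a transformer encoder to interactively fuse the global and local knowledge. Moreover, we define keypoint queries to explore both full-scene and single-human posture visual clues and realize the mutual complementarity of the two pipelines. To the best of our knowledge, this is one of the first works to integrate the bottom-up and top-down pipelines with transformers for human pose estimation. Extensive experiments on the COCO and MPII datasets demonstrate that DPIT achieves performance comparable to state-of-the-art methods.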
Camouflaged objects blend seamlessly into their surroundings, which makes detecting them a challenging task in computer vision. Optimizing a convolutional neural network (CNN) for camouflaged object detection (COD) tends to activate local discriminative regions while ignoring the complete object extent, causing a partial activation issue that inevitably leads to missing or redundant object regions. In this paper, we argue that partial activation is caused by the intrinsic characteristics of CNNs, where convolution operations produce local receptive fields and have difficulty capturing long-range feature dependencies among image regions. To obtain feature maps that activate the full object extent while keeping the segmentation results from being overwhelmed by noisy features, we propose a novel framework termed Cross-Model Detail Querying network (DQnet). It reasons about the relations between long-range-aware representations and multi-scale local details so that the enhanced representation fully highlights object regions and suppresses noise in non-object regions. Specifically, a vanilla ViT pretrained with self-supervised learning (SSL) is employed to model long-range dependencies among image regions. A ResNet is employed to learn fine-grained spatial local details at multiple scales. Then, to effectively retrieve object-related details, a Relation-Based Querying (RBQ) module is proposed to explore window-based interactions between the global representations and the multi-scale local details. Extensive experiments on widely used COD datasets show that DQnet outperforms the current state of the art.
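In this paper, we address panoramic semantic segmentation, which provides a full-view and dense-pixel understanding of the surroundings in a holistic way. Panoramic segmentation is under-explored due to two critical challenges: (1) image distortions and object deformations on panoramas; (2) the lack of annotations for training panoramic segmenters. To tackle these problems, we propose a Transformer for Panoramic Semantic Segmentation (Trans4PASS) architecture. First, to enhance distortion awareness, Trans4PASS is equipped with Deformable Patch Embedding (DPE) and Deformable MLP (DMLP) modules and is, by design, capable of handling object deformations and image distortions whenever (before or after adaptation) and wherever (at shallow or deep levels) they occur. We further introduce the upgraded Trans4PASS+ model, featuring DMLPv2 with parallel token mixing to improve flexibility and generalizability in modeling discriminative cues. Second, we propose a Mutual Prototypical Adaptation (MPA) strategy for unsupervised domain adaptation. Third, aside from Pinhole-to-Panoramic (Pin2Pan) adaptation, we create a new dataset (SynPASS) with 9,080 panoramic images to explore a Synthetic-to-Real (Syn2Real) adaptation scheme in 360° imagery. Extensive experiments cover indoor and outdoor scenarios, each investigated under the Pin2Pan and Syn2Real regimens. Trans4PASS+ achieves state-of-the-art performance on four domain-adaptive panoramic semantic segmentation benchmarks. Code is available at https://github.com/jamycheung/trans4pass.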
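We introduce a Siamese-like dual-branch network based solely on Transformers for tracking. Given a template and a search image, we divide them into non-overlapping patches and extract a feature vector for each patch based on its matching results with the other patches within an attention window. For each token, we estimate whether it contains the target object and the corresponding size. The advantage of this approach is that the features are learned from matching and, ultimately, for matching, so they are aligned with the object tracking task. The method achieves better or comparable results relative to the best-performing methods, which first use a CNN to extract features and then use a Transformer to fuse them. It outperforms state-of-the-art methods on the GOT-10k and VOT2020 benchmarks. In addition, the method achieves real-time inference speed (about 40 fps) on one GPU. The code and models will be released.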
Sparse local feature extraction is widely regarded as important for typical vision tasks such as simultaneous localization and mapping, image matching, and 3D reconstruction. It still has deficiencies that need improvement, mainly the discriminative power of the extracted local descriptors, the localization accuracy of the detected keypoints, and the efficiency of local feature learning. This paper focuses on improving the currently popular sparse local feature learning with camera pose supervision. To that end, it proposes a Shared Coupling-bridge scheme with four lightweight yet effective improvements for weakly supervised local feature (SCFeat) learning. It mainly contains: i) the Feature-Fusion-ResUNet Backbone (F2R-Backbone) for local descriptor learning, ii) a shared coupling-bridge normalization to improve the decoupled training of the description network and the detection network, iii) an improved detection network with peakiness measurement to detect keypoints, and iv) the fundamental matrix error as a reward factor to further optimize feature detection training. Extensive experiments show that the SCFeat improvements are effective: SCFeat often obtains state-of-the-art performance on classic image matching and visual localization, and still achieves competitive results on 3D reconstruction. Our source code is available at https://github.com/sunjiayuanro/SCFeat.git.
Erroneous feature matches have severe impact on subsequent camera pose estimation and often require additional, time-costly measures, like RANSAC, for outlier rejection. Our method tackles this challenge by addressing feature matching and pose optimization jointly. To this end, we propose a graph attention network to predict image correspondences along with confidence weights. The resulting matches serve as weighted constraints in a differentiable pose estimation. Training feature matching with gradients from pose optimization naturally learns to down-weight outliers and boosts pose estimation on image pairs compared to SuperGlue by 6.7% on ScanNet. At the same time, it reduces the pose estimation time by over 50% and renders RANSAC iterations unnecessary. Moreover, we integrate information from multiple views by spanning the graph across multiple frames to predict the matches all at once. Multi-view matching combined with end-to-end training improves the pose estimation metrics on Matterport3D by 18.8% compared to SuperGlue.
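In this paper, we address the problem of estimating scale factors between images. We formulate the scale estimation problem as the prediction of a probability distribution over scale factors. We design a new architecture, ScaleNet, that exploits dilated convolutions as well as self- and cross-correlation layers to predict the scale between images. We demonstrate that rectifying images with the estimated scales leads to significant performance improvements for various tasks and methods. Specifically, we show how ScaleNet can be combined with sparse local features and dense correspondence networks to improve camera pose estimation, 3D reconstruction, or dense geometric matching on different benchmarks and datasets. We provide an extensive evaluation on several tasks and analyze the computational overhead of ScaleNet. The code, evaluation protocols, and trained models are publicly available at https://github.com/axelbarroso/scalenet.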
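Treating scale estimation as a distribution over scale factors can be illustrated with a small sketch. It assumes, purely for illustration, that a network outputs one logit per discrete bin, that the bins are spaced in log2 of the scale factor, and that the point estimate is the expectation taken in log space. The function name `expected_scale` and the bin layout are hypothetical and are not taken from the ScaleNet code.

```python
import math

def expected_scale(logits, log2_bins):
    # Softmax turns the logits into a probability distribution over bins.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Expectation in log2 space, then back to a multiplicative scale factor.
    log2_scale = sum(p * b for p, b in zip(probs, log2_bins))
    return 2.0 ** log2_scale, probs

if __name__ == "__main__":
    log2_bins = [-2.0, -1.0, 0.0, 1.0, 2.0]  # scale factors 1/4, 1/2, 1, 2, 4
    logits = [0.1, 0.3, 1.0, 2.5, 0.4]       # made-up network output
    scale, probs = expected_scale(logits, log2_bins)
    print(scale, probs)
```

Averaging in log space treats a factor of 2 and a factor of 1/2 symmetrically, which is one reason a log-spaced parameterization is a natural choice for scale. The returned factor could then be used to resize one image before matching.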