Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility), and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test-time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle, due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval, without needing pose information at test-time.
translated by 谷歌翻译
现代的3D计算机视觉利用学习来增强几何推理,将图像数据映射到经典结构,例如成本量或外观限制,以改善匹配。这些体系结构根据特定问题进行了专门化,因此需要进行大量任务的调整,通常会导致域的泛化性能差。最近,通才变压器架构通过编码几何学先验作为输入而不是执行约束,在诸如光流和深度估计等任务中取得了令人印象深刻的结果。在本文中,我们扩展了这一想法,并建议学习一个隐式,多视图一致的场景表示,并在增加视图多样性之前引入了一系列3D数据增强技术作为几何感应。我们还表明,引入视图合成作为辅助任务进一步改善了深度估计。我们的深度磁场网络(定义)实现了最新的目的,可以实现立体声和视频深度估计,而无需明确的几何约束,并通过广泛的边距改善了零局部域的概括。
translated by 谷歌翻译
我们提出了一个简单的基线,用于直接估计两个图像之间的相对姿势(旋转和翻译,包括比例)。深度方法最近显示出很强的进步,但通常需要复杂或多阶段的体系结构。我们表明,可以将少数修改应用于视觉变压器(VIT),以使其计算接近八点算法。这种归纳偏见使一种简单的方法在多种环境中具有竞争力,通常在有限的数据制度中具有强劲的性能增长,从而实质上有所改善。
translated by 谷歌翻译
最近的3D Vision进步已经受到包括几何诱导偏见的专业架构的推动。在本文中,我们使用域名可靠架构来解决3D重建,并将与模型的额外输入直接注入相同类型的电感偏差。这种方法使得可以应用现有的一般模型,例如感知的富域,而无需架构更改,同时保持定制模型的数据效率。特别是我们研究如何将摄像机,投影射线发射和末端几何形式作为模型输入进行编码,并在多个基准上展示竞争性的多视图深度估计性能。
translated by 谷歌翻译
现代计算机视觉已超越了互联网照片集的领域,并进入了物理世界,通过非结构化的环境引导配备摄像头的机器人和自动驾驶汽车。为了使这些体现的代理与现实世界对象相互作用,相机越来越多地用作深度传感器,重建了各种下游推理任务的环境。机器学习辅助的深度感知或深度估计会预测图像中每个像素的距离。尽管已经在深入估算中取得了令人印象深刻的进步,但仍然存在重大挑战:(1)地面真相深度标签很难大规模收集,(2)通常认为相机信息是已知的,但通常是不可靠的,并且(3)限制性摄像机假设很常见,即使在实践中使用了各种各样的相机类型和镜头。在本论文中,我们专注于放松这些假设,并描述将相机变成真正通用深度传感器的最终目标的贡献。
translated by 谷歌翻译
我们提出了一种从有限重叠的图像中对场景进行平面表面重建的方法。此重构任务是具有挑战性的,因为它需要共同推理单个图像3D重建,图像之间的对应关系以及图像之间的相对摄像头姿势。过去的工作提出了基于优化的方法。我们引入了一种更简单的方法,即平面形式,该方法使用应用于3D感知平面令牌的变压器执行3D推理。我们的实验表明,我们的方法比以前的工作更有效,并且几项3D特定的设计决策对于成功的成功至关重要。
translated by 谷歌翻译
A hallmark of the deep learning era for computer vision is the successful use of large-scale labeled datasets to train feature representations for tasks ranging from object recognition and semantic segmentation to optical flow estimation and novel view synthesis of 3D scenes. In this work, we aim to learn dense discriminative object representations for low-shot category recognition without requiring any category labels. To this end, we propose Deep Object Patch Encodings (DOPE), which can be trained from multiple views of object instances without any category or semantic object part labels. To train DOPE, we assume access to sparse depths, foreground masks and known cameras, to obtain pixel-level correspondences between views of an object, and use this to formulate a self-supervised learning task to learn discriminative object patches. We find that DOPE can directly be used for low-shot classification of novel categories using local-part matching, and is competitive with and outperforms supervised and self-supervised learning baselines. Code and data available at https://github.com/rehg-lab/dope_selfsup.
translated by 谷歌翻译
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at github.com/magicleap/SuperGluePretrainedNetwork.
translated by 谷歌翻译
自从神经辐射场(NERF)出现以来,神经渲染引起了极大的关注,并且已经大大推动了新型视图合成的最新作品。最近的重点是在模型上过度适合单个场景,以及学习模型的一些尝试,这些模型可以综合看不见的场景的新型视图,主要包括将深度卷积特征与类似NERF的模型组合在一起。我们提出了一个不同的范式,不需要深层特征,也不需要类似NERF的体积渲染。我们的方法能够直接从现场采样的贴片集中直接预测目标射线的颜色。我们首先利用表现几何形状沿着每个参考视图的异性线提取斑块。每个贴片线性地投影到1D特征向量和一系列变压器处理集合中。对于位置编码,我们像在光场表示中一样对射线进行参数化,并且至关重要的差异是坐标是相对于目标射线的规范化的,这使我们的方法与参考帧无关并改善了概括。我们表明,即使接受比先前的工作要少得多的数据训练,我们的方法在新颖的综合综合方面都超出了最新的视图综合。
translated by 谷歌翻译
Advanced visual localization techniques encompass image retrieval challenges and 6 Degree-of-Freedom (DoF) camera pose estimation, such as hierarchical localization. Thus, they must extract global and local features from input images. Previous methods have achieved this through resource-intensive or accuracy-reducing means, such as combinatorial pipelines or multi-task distillation. In this study, we present a novel method called SuperGF, which effectively unifies local and global features for visual localization, leading to a higher trade-off between localization accuracy and computational efficiency. Specifically, SuperGF is a transformer-based aggregation model that operates directly on image-matching-specific local features and generates global features for retrieval. We conduct experimental evaluations of our method in terms of both accuracy and efficiency, demonstrating its advantages over other methods. We also provide implementations of SuperGF using various types of local features, including dense and sparse learning-based or hand-crafted descriptors.
translated by 谷歌翻译
我们为RGB视频提供了基于变压器的神经网络体系结构,用于多对象3D重建。它依赖于表示知识的两种替代方法:作为特征的全局3D网格和一系列特定的2D网格。我们通过专用双向注意机制在两者之间逐步交换信息。我们利用有关图像形成过程的知识,以显着稀疏注意力重量矩阵,从而使我们的体系结构在记忆和计算方面可行。我们在3D特征网格的顶部附上一个detr风格的头,以检测场景中的对象并预测其3D姿势和3D形状。与以前的方法相比,我们的体系结构是单阶段,端到端可训练,并且可以从整体上考虑来自多个视频帧的场景,而无需脆弱的跟踪步骤。我们在挑战性的SCAN2CAD数据集上评估了我们的方法,在该数据集中,我们的表现要优于RGB视频的3D对象姿势估算的最新最新方法; (2)将多视图立体声与RGB-D CAD对齐结合的强大替代方法。我们计划发布我们的源代码。
translated by 谷歌翻译
While object reconstruction has made great strides in recent years, current methods typically require densely captured images and/or known camera poses, and generalize poorly to novel object categories. To step toward object reconstruction in the wild, this work explores reconstructing general real-world objects from a few images without known camera poses or object categories. The crux of our work is solving two fundamental 3D vision problems -- shape reconstruction and pose estimation -- in a unified approach. Our approach captures the synergies of these two problems: reliable camera pose estimation gives rise to accurate shape reconstruction, and the accurate reconstruction, in turn, induces robust correspondence between different views and facilitates pose estimation. Our method FORGE predicts 3D features from each view and leverages them in conjunction with the input images to establish cross-view correspondence for estimating relative camera poses. The 3D features are then transformed by the estimated poses into a shared space and are fused into a neural radiance field. The reconstruction results are rendered by volume rendering techniques, enabling us to train the model without 3D shape ground-truth. Our experiments show that FORGE reliably reconstructs objects from five views. Our pose estimation method outperforms existing ones by a large margin. The reconstruction results under predicted poses are comparable to the ones using ground-truth poses. The performance on novel testing categories matches the results on categories seen during training. Project page: https://ut-austin-rpl.github.io/FORGE/
translated by 谷歌翻译
估计每个视图中的2D人类姿势通常是校准多视图3D姿势估计的第一步。但是,2D姿势探测器的性能遭受挑战性的情况,例如闭塞和斜视角。为了解决这些挑战,以前的作品从eMipolar几何中的不同视图之间导出点对点对应关系,并利用对应关系来合并预测热插拔或特征表示。除了后预测合并/校准之外,我们引入了用于多视图3D姿势估计的变压器框架,其目的地通过将来自不同视图的信息集成信息来直接改善单个2D预测器。灵感来自先前的多模态变压器,我们设计一个统一的变压器体系结构,命名为输送,从当前视图和邻近视图中保险。此外,我们提出了eMipolar字段的概念来将3D位置信息编码到变压器模型中。由Epipolar字段引导的3D位置编码提供了一种有效的方式来编码不同视图的像素之间的对应关系。人类3.6M和滑雪姿势的实验表明,与其他融合方法相比,我们的方法更有效,并且具有一致的改进。具体而言,我们在256 x 256分辨率上只有5米参数达到人类3.6米的25.8毫米MPJPE。
translated by 谷歌翻译
This paper proposes a generalizable, end-to-end deep learning-based method for relative pose regression between two images. Given two images of the same scene captured from different viewpoints, our algorithm predicts the relative rotation and translation between the two respective cameras. Despite recent progress in the field, current deep-based methods exhibit only limited generalization to scenes not seen in training. Our approach introduces a network architecture that extracts a grid of coarse features for each input image using the pre-trained LoFTR network. It subsequently relates corresponding features in the two images, and finally uses a convolutional network to recover the relative rotation and translation between the respective cameras. Our experiments indicate that the proposed architecture can generalize to novel scenes, obtaining higher accuracy than existing deep-learning-based methods in various settings and datasets, in particular with limited training data.
translated by 谷歌翻译
新型视图合成是一个长期存在的问题。在这项工作中,我们考虑了一个问题的变体,在这种变体中,只有几个上下文视图稀疏地涵盖了场景或对象。目的是预测现场的新观点,这需要学习先验。当前的艺术状态基于神经辐射场(NERF),在获得令人印象深刻的结果的同时,这些方法遭受了较长的训练时间,因为它们需要通过每个图像来评估数百万个3D点样品。我们提出了一种仅限2D方法,该方法将多个上下文视图映射,并在神经网络的单个通过中映射到新图像。我们的模型使用由密码簿和变压器模型组成的两阶段体系结构。该密码手册用于将单个图像嵌入较小的潜在空间中,而变压器在此更紧凑的空间中求解了视图综合任务。为了有效地训练我们的模型,我们引入了一种新颖的分支注意机制,该机制使我们不仅可以将相同的模型用于神经渲染,还可以用于摄像头姿势估计。现实世界场景的实验结果表明,与基于NERF的方法相比,我们的方法具有竞争力,而在3D中没有明确推理,并且训练速度更快。
translated by 谷歌翻译
This work addresses cross-view camera pose estimation, i.e., determining the 3-DoF camera pose of a given ground-level image w.r.t. an aerial image of the local area. We propose SliceMatch, which consists of ground and aerial feature extractors, feature aggregators, and a pose predictor. The feature extractors extract dense features from the ground and aerial images. Given a set of candidate camera poses, the feature aggregators construct a single ground descriptor and a set of rotational equivariant pose-dependent aerial descriptors. Notably, our novel aerial feature aggregator has a cross-view attention module for ground-view guided aerial feature selection, and utilizes the geometric projection of the ground camera's viewing frustum on the aerial image to pool features. The efficient construction of aerial descriptors is achieved by using precomputed masks and by re-assembling the aerial descriptors for rotated poses. SliceMatch is trained using contrastive learning and pose estimation is formulated as a similarity comparison between the ground descriptor and the aerial descriptors. SliceMatch outperforms the state-of-the-art by 19% and 62% in median localization error on the VIGOR and KITTI datasets, with 3x FPS of the fastest baseline.
translated by 谷歌翻译
Understanding the 3D world without supervision is currently a major challenge in computer vision as the annotations required to supervise deep networks for tasks in this domain are expensive to obtain on a large scale. In this paper, we address the problem of unsupervised viewpoint estimation. We formulate this as a self-supervised learning task, where image reconstruction provides the supervision needed to predict the camera viewpoint. Specifically, we make use of pairs of images of the same object at training time, from unknown viewpoints, to self-supervise training by combining the viewpoint information from one image with the appearance information from the other. We demonstrate that using a perspective spatial transformer allows efficient viewpoint learning, outperforming existing unsupervised approaches on synthetic data, and obtains competitive results on the challenging PASCAL3D+ dataset.
translated by 谷歌翻译
我们介绍了Amazon Berkeley对象(ABO),这是一个新的大型数据集,旨在帮助弥合真实和虚拟3D世界之间的差距。ABO包含产品目录图像,元数据和艺术家创建的3D模型,具有复杂的几何形状和与真实的家用物体相对应的物理基础材料。我们得出了具有挑战性的基准,这些基准利用ABO的独特属性,并测量最先进的对象在三个开放问题上的最新限制,以了解实际3D对象:单视3D 3D重建,材料估计和跨域多视图对象检索。
translated by 谷歌翻译
在图像之间生成健壮和可靠的对应关系是多种应用程序的基本任务。为了在全球和局部粒度上捕获上下文,我们提出了Aspanformer,这是一种基于变压器的无探测器匹配器,建立在层次的注意力结构上,采用了一种新颖的注意操作,能够以自适应方式调整注意力跨度。为了实现这一目标,首先,在每个跨注意阶段都会回归流图,以定位搜索区域的中心。接下来,在中心周围生成一个采样网格,其大小不是根据固定的经验配置为固定的,而是根据与流图一起估计的像素不确定性的自适应计算。最后,在派生区域内的两个图像上计算注意力,称为注意跨度。通过这些方式,我们不仅能够维持长期依赖性,而且能够在高相关性的像素之间获得细粒度的注意,从而补偿基本位置和匹配任务中的零件平滑度。在广泛的评估基准上的最新准确性验证了我们方法的强匹配能力。
translated by 谷歌翻译
人类在理解视觉皮层引起的观点变化方面非常灵活,从而支持3D结构的感知。相反,大多数从2D图像池学习视觉表示的计算机视觉模型通常无法概括新颖的相机观点。最近,视觉体系结构已转向无卷积的架构,视觉变压器,该构造在从图像贴片中得出的令牌上运行。但是,这些变压器和2D卷积网络都没有执行明确的操作来学习视图 - 不合稳定表示以进行视觉理解。为此,我们提出了一个3D令牌表示层(3DTRL),该层估计了视觉令牌的3D位置信息,并利用它来学习视图点 - 不可能的表示。 3DTRL的关键元素包括伪深度估计器和学习的相机矩阵,以对令牌施加几何变换。这些使3DTRL能够从2D贴片中恢复令牌的3D位置信息。实际上,3DTRL很容易插入变压器。我们的实验证明了3DTRL在许多视觉任务中的有效性,包括图像分类,多视频视频对准和动作识别。带有3DTRL的模型在所有任务中都超过了骨干变压器,并以最小的添加计算。我们的项目页面位于https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html
translated by 谷歌翻译