We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects. The shape and coordinate system of the novel object are provided as inputs to the network by rendering multiple synthetic views of the object's CAD model. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner. Third, we introduce a large-scale synthetic dataset of photorealistic images of thousands of objects with diverse visual and shape properties and show that this diversity is crucial to obtain good generalization performance on novel objects. We train our approach on this large synthetic dataset and apply it without retraining to hundreds of novel objects in real images from several pose estimation benchmarks. Our approach achieves state-of-the-art performance on the ModelNet and YCB-Video datasets. An extensive evaluation on the 7 core datasets of the BOP challenge demonstrates that our approach achieves performance competitive with existing approaches that require access to the target objects during training. Code, dataset and trained models are available on the project page: https://megapose6d.github.io/.
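A minimal sketch of the coarse-then-refine pipeline outlined in this abstract, assuming placeholder callables for the CAD renderer, the coarse classifier, and the render&compare refiner; none of these names come from the MegaPose code base.

```python
# Hypothetical sketch: score rendered pose hypotheses with a classifier that
# predicts whether the refiner could correct them, keep the best one, then
# refine it iteratively by render&compare. All callables are assumptions.

def estimate_pose(image_crop, cad_model, candidate_poses,
                  render, refinable_score, refine, n_refine_iters=5):
    # Coarse stage: rate each rendered hypothesis by how likely the refiner
    # is to be able to correct its pose error.
    scores = [refinable_score(image_crop, render(cad_model, pose))
              for pose in candidate_poses]
    best_pose = candidate_poses[max(range(len(scores)), key=scores.__getitem__)]

    # Refinement stage: compare the observed crop with synthetic views rendered
    # around the current estimate and update the pose.
    for _ in range(n_refine_iters):
        synthetic_views = [render(cad_model, best_pose)]  # multiple views in practice
        best_pose = refine(image_crop, synthetic_views, best_pose)
    return best_pose
```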
Roof-mounted rotating LiDAR sensors are widely used by autonomous vehicles, driving the need for real-time processing of 3D point sequences. However, most LiDAR semantic segmentation datasets and algorithms split these acquisitions into 360° frames, resulting in an acquisition latency that is incompatible with realistic real-time applications and evaluations. We address this issue with two key contributions. First, we introduce HelixNet, a 10-billion-point dataset with fine-grained labels, timestamps, and sensor rotation information needed to accurately assess the real-time readiness of segmentation algorithms. Second, we propose Helix4D, a compact and efficient spatio-temporal transformer architecture specifically designed for rotating LiDAR point sequences. Helix4D operates on acquisition slices corresponding to a fraction of a full sensor rotation, significantly reducing the total latency. We present an extensive benchmark of the performance and real-time readiness of several state-of-the-art models on HelixNet and SemanticKITTI. Helix4D matches the accuracy of the best segmentation algorithms while reducing latency by 5× and model size by 50×. Code and data are available at: https://romainloiseau.fr/helixnet
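An illustrative sketch (not from the Helix4D code base) of how a full sensor rotation can be split into angular slices that are processed as soon as they are acquired, instead of waiting for the complete 360° frame; function and parameter names are assumptions.

```python
import numpy as np

def split_rotation_into_slices(points, n_slices=8):
    """points: (N, 3) array of x, y, z coordinates from one full rotation.

    Returns a list of n_slices point subsets ordered by azimuth, so a model
    can consume each slice with roughly 1/n_slices of the frame latency.
    """
    azimuth = np.arctan2(points[:, 1], points[:, 0])           # in [-pi, pi]
    slice_id = ((azimuth + np.pi) / (2 * np.pi) * n_slices).astype(int)
    slice_id = np.clip(slice_id, 0, n_slices - 1)
    return [points[slice_id == s] for s in range(n_slices)]

# Example: 1000 random points split into 8 slices of roughly 125 points each.
demo = split_rotation_into_slices(np.random.randn(1000, 3), n_slices=8)
```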
This paper introduces new techniques for learning atlas-like representations of 3D surfaces, i.e., homeomorphic transformations from a 2D domain to the surface. Compared to prior work, we make two main contributions. First, instead of mapping a fixed 2D domain (e.g., a set of square patches) onto the surface, we learn a continuous 2D domain with arbitrary topology by optimizing it as a mixture of Gaussians. Second, we learn consistent mappings in both directions: charts, from the 3D surface to the 2D domain, and parametrizations, their inverses. We show that this improves the quality of the learned surface representation as well as its consistency across a collection of related shapes. It thus leads to improvements in applications such as correspondence estimation, texture transfer, and consistent UV mapping. As an additional technical contribution, we show that, although incorporating normal consistency has clear benefits, it causes issues in the optimization, and that these issues can be mitigated with a simple repulsive regularization. We demonstrate that our contributions provide better surface representations than existing baselines.
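A hedged sketch of a forward/backward consistency objective in the spirit of the bidirectional mappings described above: a chart maps 3D surface points to the 2D domain and a parametrization maps 2D points back to the surface. The function names and the exact loss are illustrative, not the paper's code.

```python
import numpy as np

def cycle_consistency_loss(surface_points, chart, param):
    """surface_points: (N, 3) samples on the surface.
    chart: callable (N, 3) -> (N, 2); param: callable (N, 2) -> (N, 3)."""
    uv = chart(surface_points)        # 3D -> 2D chart
    reconstructed = param(uv)         # 2D -> 3D parametrization
    # Penalize points that do not come back to where they started.
    return np.mean(np.sum((reconstructed - surface_points) ** 2, axis=1))
```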
Approaches to single-view reconstruction typically rely on viewpoint annotations, silhouettes, the absence of background, multiple views of the same instance, template shapes, or symmetry. We avoid all such supervision and assumptions by explicitly leveraging the consistency between images of different object instances. As a result, our method can learn from large collections of unlabelled images depicting the same object category. Our main contributions are two ways of leveraging cross-instance consistency: (i) progressive conditioning, a training strategy that gradually specializes the model from the category to individual instances in a curriculum-learning fashion; and (ii) neighbor reconstruction, a loss between instances with similar shape or texture. Also critical to the success of our method are: our structured autoencoding architecture, which decomposes an image into explicit shape, texture, pose, and background; an adapted formulation of differentiable rendering; and a new optimization scheme alternating between 3D and pose learning. We compare our method, UNICORN, on the diverse synthetic ShapeNet dataset (the classical benchmark for methods that require multiple views as supervision) as well as on standard real-image benchmarks (Pascal3D+ Car, CUB), for which most methods require known templates and silhouette annotations. We also show applicability to more challenging real-world collections (CompCars, LSUN), where silhouettes are not available and images are not cropped around the object.
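An illustrative sketch (assumed names and formulation, not the UNICORN implementation) of the neighbor-reconstruction idea: an image should still be reconstructed well when its texture code is replaced by that of a visually similar instance.

```python
import numpy as np

def neighbor_reconstruction_loss(images, shape_codes, texture_codes, render_fn):
    """images: (B, H, W, 3); shape_codes, texture_codes: (B, D);
    render_fn: callable decoding (shape_codes, texture_codes) into images."""
    # Nearest other sample in texture space defines each sample's neighbor.
    dists = np.linalg.norm(texture_codes[:, None] - texture_codes[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)           # exclude the sample itself
    neighbors = dists.argmin(axis=1)
    # Reconstruct each image with its own shape but its neighbor's texture.
    swapped = render_fn(shape_codes, texture_codes[neighbors])
    return np.mean((swapped - images) ** 2)
```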
Neural implicit surfaces have become an important technique for multi-view 3D reconstruction, but their accuracy remains limited. In this paper, we argue that this stems from the difficulty of learning and rendering high-frequency textures with neural networks. We therefore propose to add a direct photo-consistency term across the different views to the standard neural rendering optimization. Intuitively, we optimize the implicit geometry so that it warps the views onto each other in a consistent way. We demonstrate that two elements are key to the success of this approach: (i) warping entire patches, using the predicted occupancy and normals of the 3D points along each ray, and measuring their similarity with a robust structural similarity measure; (ii) handling visibility and occlusions in such a way that incorrect warps are not given too much importance, while encouraging the reconstruction to be as complete as possible. We evaluate our method, called NeuralWarp, on the standard DTU and EPFL benchmarks and show that it outperforms state-of-the-art unsupervised implicit surface reconstructions by over 20% on both datasets.
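A minimal sketch of a patch photo-consistency term of the kind described above: patches from a source view are warped into the reference view using the current geometry estimate, compared with SSIM, and masked where the warp is unreliable. The warp itself is assumed to be computed elsewhere; the names and constants are placeholders, not NeuralWarp's actual code.

```python
import numpy as np

def patch_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Structural similarity between two same-size patches with values in [0, 1]."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def photo_consistency_loss(ref_patches, warped_patches, valid_mask):
    """ref_patches, warped_patches: (N, P, P) grayscale patches; valid_mask: (N,) booleans."""
    losses = np.array([1.0 - patch_ssim(r, w)
                       for r, w in zip(ref_patches, warped_patches)])
    # Ignore patches whose warp is occluded or out of view instead of penalizing them.
    return losses[valid_mask].mean() if valid_mask.any() else 0.0
```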
In this paper, we revisit the classical representation of 3D point clouds as linear shape models. Our key insight is to leverage deep learning to represent a collection of shapes as affine transformations of low-dimensional linear shape models. Each linear model is characterized by a shape prototype, a low-dimensional shape basis, and two neural networks. The networks take a point cloud as input and predict the coordinates of the shape in the linear basis and the affine transformation that best approximates the input. Both the linear models and the neural networks are learned end-to-end using a single reconstruction loss. The main advantage of our approach is that, in contrast to many recent deep methods that learn complex feature-based shape representations, our model is explicit and every operation occurs in 3D space. As a result, our linear shape models can easily be visualized and annotated, and failure cases can be understood visually. While our main goal is to introduce a compact and interpretable representation of shape collections, we show that it also leads to state-of-the-art results for few-shot segmentation.
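A toy sketch of the explicit decoding step implied above: a shape is reconstructed as an affine transformation of the prototype displaced along a low-dimensional linear basis. All tensors are plain 3D point arrays, so intermediate results can be visualized directly; the names are illustrative.

```python
import numpy as np

def decode_shape(prototype, basis, coords, A, t):
    """prototype: (N, 3) prototype point cloud;
    basis: (K, N, 3) linear shape basis; coords: (K,) predicted coordinates;
    A: (3, 3) linear part and t: (3,) translation of the affine transform."""
    deformed = prototype + np.tensordot(coords, basis, axes=1)  # (N, 3)
    return deformed @ A.T + t                                   # affine transform

# Example with random values: 100 points and a 4-dimensional shape basis.
rng = np.random.default_rng(0)
shape = decode_shape(rng.normal(size=(100, 3)), rng.normal(size=(4, 100, 3)),
                     rng.normal(size=4), np.eye(3), np.zeros(3))
```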
Deep multi-view stereo (MVS) methods have been developed and extensively compared on simple datasets, where they now outperform classical approaches. In this paper, we ask whether the conclusions reached in these controlled scenarios remain valid when working with Internet photo collections. We propose an evaluation methodology that explores the influence of three aspects of deep MVS methods: network architecture, training data, and supervision. We make several key observations, which we validate extensively both quantitatively and qualitatively, for depth prediction as well as for complete 3D reconstruction. First, complex unsupervised approaches cannot be trained on in-the-wild data. Our new approach makes this possible thanks to three key elements: upsampling the output, softmin-based aggregation, and a single reconstruction loss. Second, supervised deep depth-map-based MVS methods are the state of the art for reconstruction from a few Internet images. Finally, our evaluation yields results that are very different from the usual ones, which shows that evaluating new architectures in uncontrolled scenarios is important.
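A hedged sketch of softmin-based aggregation as mentioned above: per-view reprojection errors are combined with weights that favor the views with the lowest error, so occluded or unreliable views contribute little. Variable names and the temperature value are illustrative only.

```python
import numpy as np

def softmin_aggregate(per_view_errors, temperature=1.0):
    """per_view_errors: (V, H, W) photometric error for each source view.
    Returns an (H, W) aggregated error map."""
    weights = np.exp(-per_view_errors / temperature)
    weights = weights / (weights.sum(axis=0, keepdims=True) + 1e-8)
    return (weights * per_view_errors).sum(axis=0)
```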
We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed CosyPose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets. Code and pre-trained models are available on the project webpage.
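An illustrative sketch (not the CosyPose implementation) of an object-level reprojection residual of the kind minimized by the scene refinement step: 3D points attached to an object are mapped through the object and camera poses, projected with the intrinsics, and compared to their 2D observations; a least-squares solver would then update the poses.

```python
import numpy as np

def reprojection_residuals(points_obj, T_world_obj, T_world_cam, K, observed_2d):
    """points_obj: (N, 3) points in the object frame; T_world_obj, T_world_cam:
    (4, 4) object and camera poses; K: (3, 3) intrinsics; observed_2d: (N, 2)."""
    pts_h = np.hstack([points_obj, np.ones((len(points_obj), 1))])    # (N, 4) homogeneous
    pts_cam = (np.linalg.inv(T_world_cam) @ T_world_obj @ pts_h.T).T[:, :3]
    proj = (K @ pts_cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]                                 # perspective division
    # Residuals are stacked over all objects and views and fed to the optimizer.
    return (proj - observed_2d).ravel()
```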
For long-term simultaneous planning, localization and mapping (SPLAM), a robot should be able to continuously update its map according to the dynamic changes of the environment and the new areas explored. With limited onboard computation capabilities, a robot should also be able to limit the size of the map used for online localization and mapping. This paper addresses these challenges using a memory management mechanism, which distinguishes locations that should remain in a Working Memory (WM) for online processing from locations that should be transferred to a Long-Term Memory (LTM). When the robot revisits previously mapped areas that are in LTM, the mechanism can retrieve these locations and place them back in WM for online SPLAM. The approach is tested on a robot equipped with a short-range laser rangefinder and an RGB-D camera, autonomously patrolling 10.5 km in an indoor environment over 11 sessions during which it encountered 139 people.
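A toy sketch of the working-memory / long-term-memory bookkeeping described above (illustrative only, not the paper's implementation): when the working memory exceeds its budget, the least recently observed locations are moved to long-term memory, and a re-observed location is brought back from LTM into WM.

```python
class MapMemory:
    def __init__(self, wm_budget=100):
        self.wm = {}           # location id -> (data, last observed time), used online
        self.ltm = {}          # locations transferred out of the online map
        self.wm_budget = wm_budget

    def observe(self, loc_id, data, time):
        # A re-observed location that was in LTM is retrieved into WM.
        if loc_id in self.ltm:
            data = self.ltm.pop(loc_id)[0]
        self.wm[loc_id] = (data, time)
        self._enforce_budget()

    def _enforce_budget(self):
        # Transfer the least recently observed locations to LTM.
        while len(self.wm) > self.wm_budget:
            oldest = min(self.wm, key=lambda k: self.wm[k][1])
            self.ltm[oldest] = self.wm.pop(oldest)
```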
Vision transformers have emerged as powerful tools for many computer vision tasks. It has been shown that their features and class tokens can be used for salient object segmentation. However, the properties of segmentation transformers remain largely unstudied. In this work we conduct an in-depth study of the spatial attentions of different backbone layers of semantic segmentation transformers and uncover interesting properties. The spatial attentions of a patch intersecting with an object tend to concentrate within the object, whereas the attentions of larger, more uniform image areas rather follow a diffusive behavior. In other words, vision transformers trained to segment a fixed set of object classes generalize to objects well beyond this set. We exploit this by extracting heatmaps that can be used to segment unknown objects within diverse backgrounds, such as obstacles in traffic scenes. Our method is training-free and its computational overhead negligible. We use off-the-shelf transformers trained for street-scene segmentation to process other scene types.
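A hedged sketch of turning patch-to-patch attention into an object heatmap, in the spirit of the observation above that attention rows of patches inside an object concentrate on that object. The attention matrix is assumed to come from one backbone layer, averaged over heads; shapes and names are illustrative, not the paper's code.

```python
import numpy as np

def attention_heatmap(attn, seed_patch, grid_h, grid_w):
    """attn: (P, P) patch-to-patch attention with P = grid_h * grid_w;
    seed_patch: index of a patch believed to lie on the unknown object.
    Returns a (grid_h, grid_w) heatmap normalized to [0, 1]."""
    heat = attn[seed_patch].reshape(grid_h, grid_w)
    heat = heat - heat.min()
    return heat / (heat.max() + 1e-8)

# Example: a random 196x196 attention map from a 14x14 patch grid.
demo = attention_heatmap(np.random.rand(196, 196), seed_patch=100, grid_h=14, grid_w=14)
```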