We introduce an approach to proposal generation for 3D point clouds. Existing techniques typically regress proposals directly in a single feed-forward pass, which leads to inaccurate estimates. We show that this is a critical bottleneck, and propose a method based on iterative bilateral filtering. Following the spirit of bilateral filtering, we consider both a deep embedding of each point and its location in 3D space. We show via synthetic experiments that our approach brings significant improvements when generating instance proposals for a given point of interest. We further validate our method on the challenging ScanNet benchmark, achieving the best instance segmentation performance in the subcategory of top-down methods.
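The abstract gives no equations; the idea can be sketched as a bilateral-filter update on per-point embeddings, where spatial and feature-space distances jointly shape the weights. All names, shapes, and bandwidth values below are illustrative, not taken from the paper.

```python
import numpy as np

def bilateral_step(xyz, feat, sigma_s=0.5, sigma_f=1.0):
    """One iteration: each point's feature becomes a weighted average of all
    features, weighted jointly by spatial and feature-space proximity."""
    d_s = np.linalg.norm(xyz[:, None] - xyz[None], axis=-1) ** 2   # (N, N)
    d_f = np.linalg.norm(feat[:, None] - feat[None], axis=-1) ** 2
    w = np.exp(-d_s / (2 * sigma_s**2)) * np.exp(-d_f / (2 * sigma_f**2))
    w /= w.sum(axis=1, keepdims=True)
    return w @ feat

# Two well-separated clusters: iterating sharpens the embeddings within each
# cluster while the spatial term prevents mixing across clusters.
xyz = np.array([[0, 0, 0], [0.1, 0, 0], [5, 5, 5], [5.1, 5, 5]], float)
feat = np.array([[1.0], [0.9], [-1.0], [-0.9]])
for _ in range(3):
    feat = bilateral_step(xyz, feat)
```

After a few iterations the two points of each cluster carry nearly identical embeddings, while the two clusters stay apart, which is the behavior the abstract attributes to the iterative filter.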
We introduce Similarity Group Proposal Network (SGPN), a simple and intuitive deep learning framework for 3D object instance segmentation on point clouds. SGPN uses a single network to predict point grouping proposals and a corresponding semantic class for each proposal, from which we can directly extract instance segmentation results. Important to the effectiveness of SGPN is its novel representation of 3D instance segmentation results in the form of a similarity matrix that indicates the similarity between each pair of points in embedded feature space, thus producing an accurate grouping proposal for each point. Experimental results on various 3D scenes show the effectiveness of our method on 3D instance segmentation, and we also evaluate the capability of SGPN to improve 3D object detection and semantic segmentation results. We also demonstrate its flexibility by seamlessly incorporating 2D CNN features into the framework to boost performance.
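The similarity-matrix idea can be sketched in a few lines: each row of the pairwise-distance matrix over embedded features, thresholded, is the grouping proposal seeded at that point. The threshold and toy embeddings below are illustrative, not SGPN's learned values.

```python
import numpy as np

def group_proposals(embed, thresh=0.5):
    """Row i of the pairwise-distance matrix in embedded feature space,
    thresholded, is the grouping proposal seeded at point i."""
    dist = np.linalg.norm(embed[:, None] - embed[None], axis=-1)  # (N, N)
    return dist < thresh  # boolean proposal mask per point

embed = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])
props = group_proposals(embed)
```

Here points 0 and 1 fall in each other's proposals while point 2 forms its own, which is how per-point proposals are read off the matrix.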
We introduce a 3D instance representation, termed instance kernels, in which instances are represented by one-dimensional vectors that encode the semantic, positional, and shape information of 3D instances. We show that instance kernels enable simple mask inference by simply scanning the kernels over the whole scene, avoiding the heavy reliance on proposals or heuristic clustering algorithms found in standard 3D instance segmentation pipelines. The idea of instance kernels is inspired by recent successes of dynamic convolution in 2D/3D instance segmentation. However, we find it non-trivial to represent 3D instances due to the disordered and unstructured nature of point cloud data; for example, poor instance localization can significantly degrade the instance representation. To remedy this, we construct a novel 3D instance encoding paradigm. First, potential instance centroids are localized as candidates. Then, a candidate merging scheme is devised to simultaneously aggregate duplicated candidates and collect context around the merged centroids to form the instance kernels. Once instance kernels are available, instance masks can be reconstructed via dynamic convolutions conditioned on the instance kernels. The whole pipeline is instantiated as the dynamic kernel network (DKNet). Results show that DKNet outperforms the state of the art on both the ScanNetV2 and S3DIS datasets with better instance localization. Code is available at: https://github.com/w1zheng/dknet.
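The mask-decoding step can be sketched as a kernel-conditioned 1x1 convolution: a dot product of each point feature with the instance kernel, followed by a sigmoid. The feature dimensions and the way the kernel is obtained below are hypothetical stand-ins, not DKNet's actual candidate-merging pipeline.

```python
import numpy as np

def decode_mask(point_feats, kernel, bias=0.0):
    """'Scan' one instance kernel across the scene: a kernel-conditioned
    1x1 convolution over per-point features, followed by a sigmoid."""
    logits = point_feats @ kernel + bias          # (N,)
    return 1.0 / (1.0 + np.exp(-logits))          # per-point mask probability

rng = np.random.default_rng(0)
point_feats = rng.normal(size=(6, 4))   # toy per-point features
kernel = point_feats[2]                 # a kernel aligned with point 2
mask = decode_mask(point_feats, kernel)
```

One kernel per instance reconstructs one mask per instance; point 2, whose feature matches the kernel, receives a high mask probability.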
Current 3D segmentation methods rely heavily on large-scale point-wise annotated datasets, which are notoriously laborious to label. Few attempts have been made to circumvent the need for per-point annotations. In this work, we investigate weakly supervised 3D semantic instance segmentation. The key idea is to leverage 3D bounding-box labels, which are easier and faster to annotate. Indeed, we show that it is possible to train dense segmentation models using only bounding-box labels. At the core of our method, Box2Mask, is a deep model, inspired by classical Hough voting, that directly votes for bounding-box parameters, together with a clustering method specifically tailored to bounding-box votes. This goes beyond the commonly used center votes, which do not fully exploit the bounding-box annotations. On the ScanNet test set, our weakly supervised model attains leading performance among weakly supervised approaches (+18 mAP@50). Remarkably, it also reaches 97% of the mAP@50 score of current fully supervised models. To further illustrate the practicality of our work, we train Box2Mask on the recently released ARKitScenes dataset, which is annotated with 3D bounding boxes only, and show, for the first time, compelling 3D instance segmentation masks.
This paper introduces a new problem in 3D point clouds: few-shot instance segmentation. Given a few annotated point clouds exemplifying a target class, our goal is to segment all instances of that target class in a query point cloud. This problem has a wide range of practical applications where point-wise instance segmentation annotations are prohibitively expensive to collect. To address it, we propose Geodesic-Former, the first geodesic-guided transformer for 3D point cloud instance segmentation. The key idea is to leverage geodesic distance to tackle the density imbalance of LiDAR 3D point clouds. LiDAR 3D point clouds are dense near object surfaces and sparse or empty elsewhere, making Euclidean distance less effective at distinguishing different objects. Geodesic distance, on the other hand, is more suitable, since it encodes the geometry of the scene and can serve as a guiding signal for the attention mechanism in a transformer decoder to generate kernels representing distinct features of instances. These kernels are then used in dynamic convolutions to obtain the final instance masks. To evaluate Geodesic-Former on the new task, we propose new splits of two common 3D point cloud instance segmentation datasets: ScanNetV2 and S3DIS. Geodesic-Former consistently outperforms strong baselines adapted from state-of-the-art 3D point cloud instance segmentation methods by significant margins. Code is available at https://github.com/vinairesearch/geoformer.
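The contrast between Euclidean and geodesic distance is easy to demonstrate: on a radius graph over the points, shortest-path (Dijkstra) distances must follow the sampled geometry, while Euclidean distance can cut across empty space. The radius and toy L-shaped layout below are illustrative, not from the paper.

```python
import heapq
import numpy as np

def geodesic_dists(xyz, src, radius=1.2):
    """Dijkstra over a radius graph: edge lengths are Euclidean, but a path
    must hop between nearby points, approximating geodesic distance."""
    n = len(xyz)
    d = np.full(n, np.inf)
    d[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > d[u]:
            continue  # stale queue entry
        for v in range(n):
            w = np.linalg.norm(xyz[u] - xyz[v])
            if v != u and w <= radius and du + w < d[v]:
                d[v] = du + w
                heapq.heappush(heap, (d[v], v))
    return d

# An L-shaped chain of points: Euclidean distance shortcuts the corner,
# the geodesic path has to walk around it.
xyz = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [2, 1, 0], [2, 2, 0]], float)
geo = geodesic_dists(xyz, 0)
euc = np.linalg.norm(xyz - xyz[0], axis=1)
```

Here the geodesic distance from the first to the last point is 4 (walking the chain), while the Euclidean distance is only about 2.83, illustrating why geodesic distance separates objects that are Euclidean-close.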
We propose an approach to instance segmentation of 3D point clouds based on dynamic convolution. This enables it to adapt, at inference time, to varying feature and object scales. Doing so avoids some pitfalls of bottom-up approaches, including a dependence on hyperparameter tuning and heuristic post-processing pipelines to compensate for the unavoidable variability of object sizes, even within a single scene. The representation capability of the network is greatly improved by gathering homogeneous points that share the same semantic category and vote closely for the geometric centroid. Instances are then decoded by several simple convolutional layers whose parameters are generated conditioned on the input. The proposed approach is proposal-free, and instead exploits a convolution process that adapts to the spatial and semantic characteristics of each instance. A lightweight transformer, built on a bottleneck layer, allows the model to capture long-range dependencies with limited computational overhead. The result is a simple, efficient, and robust approach that yields strong performance on a variety of datasets: ScanNetV2, S3DIS, and PartNet. Consistent improvements on both voxel- and point-based architectures imply the effectiveness of the proposed method. Code is available at: https://git.io/dyco3d
Semantic scene reconstruction from point clouds is an essential task for 3D scene understanding. This task requires not only recognizing each instance in a scene, but also recovering its geometry from the partially observed point cloud. Existing methods usually attempt to directly predict the occupancy values of complete objects from the incomplete point cloud proposals of a detection-based backbone. However, this framework often fails to reconstruct high-fidelity meshes, due to the obstruction of various false-positive object proposals and the ambiguity of incomplete point observations when learning occupancy values for complete objects. To circumvent these hurdles, we propose a Disentangled Instance Mesh Reconstruction (DIMR) framework for effective point-scene understanding. A segmentation-based backbone is adopted to reduce false-positive object proposals, which further benefits our exploration of the relationship between recognition and reconstruction. Based on the accurate proposals, we leverage a mesh-aware latent code space to disentangle the processes of shape completion and mesh generation, relieving the ambiguity caused by incomplete point observations. Furthermore, with access to a CAD model pool at test time, our model can also improve reconstruction quality by performing mesh retrieval without extra training. We thoroughly evaluate the reconstructed mesh quality with multiple metrics, and demonstrate the superiority of our method on the challenging ScanNet dataset. Code is available at \url{https://github.com/ashawkey/dimr}.
Current 3D object detection methods are heavily influenced by 2D detectors. In order to leverage architectures in 2D detectors, they often convert 3D point clouds to regular grids (i.e., to voxel grids or to bird's eye view images), or rely on detection in 2D images to propose 3D boxes. Few works have attempted to directly detect objects in point clouds. In this work, we return to first principles to construct a 3D detection pipeline for point cloud data that is as generic as possible. However, due to the sparse nature of the data (samples from 2D manifolds in 3D space), we face a major challenge when directly predicting bounding box parameters from scene points: a 3D object centroid can be far from any surface point and thus hard to regress accurately in one step. To address the challenge, we propose VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting. Our model achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D, with a simple design, compact model size and high efficiency. Remarkably, VoteNet outperforms previous methods by using purely geometric information without relying on color images.
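The voting idea can be sketched as follows: each point adds a predicted offset to cast a vote near its object's centroid, and the votes are then grouped by proximity. The greedy grouping and radius below are simple stand-ins for VoteNet's learned vote aggregation, and the "perfect" offsets are toy data.

```python
import numpy as np

def vote_and_group(xyz, offsets, radius=0.5):
    """Points vote for object centroids by adding predicted offsets;
    votes are then grouped greedily by proximity to existing centers."""
    votes = xyz + offsets
    centers, labels = [], np.full(len(votes), -1)
    for i, v in enumerate(votes):
        for k, c in enumerate(centers):
            if np.linalg.norm(v - c) < radius:
                labels[i] = k
                break
        else:  # no nearby center: the vote seeds a new object hypothesis
            centers.append(v)
            labels[i] = len(centers) - 1
    return np.array(centers), labels

# Surface points of two objects, with ideal offsets toward their centroids.
xyz = np.array([[0, 0, 1], [1, 0, 1], [5, 5, 1], [6, 5, 1]], float)
centroids = np.array([[0.5, 0, 0], [0.5, 0, 0], [5.5, 5, 0], [5.5, 5, 0]])
centers, labels = vote_and_group(xyz, centroids - xyz)
```

Votes from the same object land near each other even though the surface points themselves are far from the centroid, which is exactly the difficulty the abstract describes with direct one-step regression.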
Most existing methods realize 3D instance segmentation by extending models used for 3D object detection or 3D semantic segmentation. However, these non-straightforward methods suffer from two drawbacks: 1) imprecise bounding boxes or unsatisfactory semantic predictions limit the performance of the overall 3D instance segmentation framework; 2) existing methods require a time-consuming intermediate aggregation step. To address these issues, this paper proposes a novel end-to-end 3D instance segmentation method based on a Superpoint Transformer, named SPFormer. It groups potential features from point clouds into superpoints, and directly predicts instances through query vectors without relying on the results of object detection or semantic segmentation. The key step in this framework is a novel query decoder with transformers that can capture instance information through a superpoint cross-attention mechanism and generate the superpoint masks of the instances. Through bipartite matching based on superpoint masks, SPFormer can train the network without the intermediate aggregation step, which accelerates the network. Extensive experiments on the ScanNetv2 and S3DIS benchmarks verify that our method is concise yet efficient. Notably, SPFormer exceeds the compared state-of-the-art methods by 4.3% mAP on the ScanNetv2 hidden test set while simultaneously keeping a fast inference speed (247ms per frame). Code is available at https://github.com/sunjiahao1999/SPFormer.
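Bipartite matching between predicted and ground-truth instances can be illustrated with a brute-force minimum-cost assignment over permutations. Practical implementations use the Hungarian algorithm instead; the cost values below are made up (think 1 − IoU between a predicted mask and a ground-truth mask).

```python
from itertools import permutations

def bipartite_match(cost):
    """Exhaustive minimum-cost one-to-one assignment of queries (rows)
    to ground-truth instances (columns); fine for tiny matrices."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best:
            best, best_perm = c, perm
    return best, best_perm

# cost[i][j]: e.g. 1 - IoU between predicted mask i and ground-truth mask j
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.9, 0.3]]
total, assign = bipartite_match(cost)
```

Each query is matched to at most one ground-truth instance, which is what lets the network be trained directly on mask predictions without an intermediate aggregation step.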
Point cloud instance segmentation has achieved huge progress with the emergence of deep learning. However, these methods are usually data-hungry, requiring expensive and time-consuming dense point cloud annotations. To alleviate the annotation cost, unlabeled or weakly labeled data remain underexplored for this task. In this paper, we introduce the first semi-supervised point cloud instance segmentation framework (SPIB) using both labeled and unlabeled bounding boxes as supervision. Specifically, our SPIB architecture involves a two-stage learning procedure. In stage one, a bounding-box proposal generation network is trained under a semi-supervised setting with perturbation consistency regularization (SPCR). The regularization provides self-supervision for network learning by enforcing invariance of the bounding-box predictions over different perturbations applied to the input point clouds. In stage two, the bounding-box proposals from SPCR are grouped into subsets, and instance masks are mined inside each subset with a novel semantic propagation module and a property consistency graph module. Moreover, we introduce a novel occupancy-ratio-guided refinement module to refine the instance masks. Extensive experiments on the challenging ScanNetV2 dataset demonstrate that our method achieves competitive performance compared with recent fully supervised methods.
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefiting from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on the KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability. (* Majority of the work was done as an intern at Nuro, Inc. Pipeline figure: depth → point cloud; 2D region (from CNN) → 3D frustum; 3D box (from PointNet).)
In this paper, we propose PointRCNN for 3D object detection from raw point cloud. The whole framework is composed of two stages: stage-1 for the bottom-up 3D proposal generation and stage-2 for refining proposals in the canonical coordinates to obtain the final detection results. Instead of generating proposals from RGB image or projecting point cloud to bird's view or voxels as previous methods do, our stage-1 sub-network directly generates a small number of high-quality 3D proposals from point cloud in a bottom-up manner via segmenting the point cloud of the whole scene into foreground points and background. The stage-2 sub-network transforms the pooled points of each proposal to canonical coordinates to learn better local spatial features, which is combined with global semantic features of each point learned in stage-1 for accurate box refinement and confidence prediction. Extensive experiments on the 3D detection benchmark of KITTI dataset show that our proposed architecture outperforms state-of-the-art methods with remarkable margins by using only point cloud as input. The code is available at https://github.com/sshaoshuai/PointRCNN.
Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominating technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods being proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.
Current state-of-the-art methods for 3D instance segmentation typically involve a clustering step, despite relying on heuristics, greedy algorithms, and sensitivity to changes in data statistics. By contrast, we propose a 3D point cloud instance segmentation method that works in a per-point prediction fashion. In doing so, it avoids a challenge faced by clustering-based methods: the introduction of dependencies among the different tasks of the model. We find the key to its success is assigning a suitable target to each sampled point. We propose using an optimal transport approach to optimally assign target masks to the sampled points according to a dynamic matching cost. Our approach achieves promising results on both the ScanNet and S3DIS benchmarks. The proposed approach removes inter-task dependencies and thus represents a simpler and more flexible 3D instance segmentation framework than other competing methods, while achieving improved segmentation accuracy.
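An entropy-regularized optimal transport (Sinkhorn) solver illustrates the assignment idea: given a cost between sampled points and target masks, alternating normalizations produce a soft matching with fixed marginals. This is a generic sketch with uniform marginals and a toy cost matrix, not the paper's dynamic matching cost.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, iters=200):
    """Entropy-regularized optimal transport with uniform marginals:
    alternately rescale rows and columns of exp(-cost/eps) until both
    marginals match, yielding a soft assignment plan."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    u = np.ones(n)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan P

# toy cost: point 0 matches target 0 cheaply, point 1 matches target 1
cost = np.array([[0.0, 1.0], [1.0, 0.0]])
P = sinkhorn(cost)
```

With a small regularization the plan concentrates mass on the cheap pairs, i.e. each sampled point is effectively assigned the target mask it matches best.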
3D object detection from LiDAR point cloud is a challenging problem in 3D scene understanding and has many practical applications. In this paper, we extend our preliminary work PointRCNN to a novel and strong point-cloud-based 3D object detection framework, the part-aware and aggregation neural network (Part-A^2 net). The whole framework consists of the part-aware stage and the part-aggregation stage. Firstly, the part-aware stage for the first time fully utilizes free-of-charge part supervisions derived from 3D ground-truth boxes to simultaneously predict high quality 3D proposals and accurate intra-object part locations. The predicted intra-object part locations within the same proposal are grouped by our new-designed RoI-aware point cloud pooling module, which results in an effective representation to encode the geometry-specific features of each 3D proposal. Then the part-aggregation stage learns to re-score the box and refine the box location by exploring the spatial relationship of the pooled intra-object part locations. Extensive experiments are conducted to demonstrate the performance improvements from each component of our proposed framework. Our Part-A^2 net outperforms all existing 3D detection methods and achieves new state-of-the-art on KITTI 3D object detection dataset by utilizing only the LiDAR point cloud data. Code is available at https://github.com/sshaoshuai/PointCloudDet3D.
We present a new, embarrassingly simple approach to instance segmentation. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that has made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the "detect-then-segment" strategy (e.g., Mask R-CNN), or predict embedding vectors first and then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location and size, thus nicely converting instance segmentation into a single-shot classification-solvable problem. We demonstrate a much simpler and more flexible instance segmentation framework with strong performance, achieving on-par accuracy with Mask R-CNN and outperforming recent single-shot instance segmenters in accuracy. We hope that this simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation. Code is available at https://git.io/AdelaiDet
In contrast to fully supervised methods using pixel-wise mask labels, box-supervised instance segmentation takes advantage of simple box annotations, which has recently attracted increasing research attention. This paper presents a novel single-shot instance segmentation approach, namely Box2Mask, which integrates the classical level-set evolution model into deep neural network learning to achieve accurate mask prediction with only bounding box supervision. Specifically, both the input image and its deep features are employed to evolve the level-set curves implicitly, and a local consistency module based on a pixel affinity kernel is used to mine the local context and spatial relations. Two types of single-stage frameworks, i.e., CNN-based and transformer-based frameworks, are developed to empower the level-set evolution for box-supervised instance segmentation, and each framework consists of three essential components: instance-aware decoder, box-level matching assignment and level-set evolution. By minimizing the level-set energy function, the mask map of each instance can be iteratively optimized within its bounding box annotation. The experimental results on five challenging testbeds, covering general scenes, remote sensing, medical and scene text images, demonstrate the outstanding performance of our proposed Box2Mask approach for box-supervised instance segmentation. In particular, with the Swin-Transformer large backbone, our Box2Mask obtains 42.4% mask AP on COCO, which is on par with the recently developed fully mask-supervised methods. The code is available at: https://github.com/LiWentomng/boxlevelset.
Existing state-of-the-art 3D point cloud instance segmentation methods rely on a grouping-based approach that groups points to obtain object instances. Despite improvements in producing accurate segmentation results, these methods lack scalability and commonly require dividing large inputs into multiple parts. To process a scene with millions of points, the existing fastest method, SoftGroup \cite{vu2022softgroup}, requires tens of seconds, which is unsatisfactory. Our finding is that $k$-nearest neighbor ($k$-NN), the prerequisite of grouping, is a computational bottleneck. This bottleneck severely worsens the inference time on large scenes. This paper proposes SoftGroup++ to address this computational bottleneck and further optimize the inference speed of the whole network. SoftGroup++ is built upon SoftGroup and differs in three important aspects: (1) it performs octree $k$-NN instead of vanilla $k$-NN to reduce the time complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$; (2) it performs pyramid scaling, which adaptively downsamples the backbone outputs to reduce the search space for $k$-NN and grouping; and (3) it performs late devoxelization, delaying the conversion from voxels to points toward the end of the model so that intermediate components operate at low computational cost. Extensive experiments on various indoor and outdoor datasets demonstrate the efficacy of the proposed SoftGroup++. Notably, SoftGroup++ processes large scenes of millions of points in a single forward pass without dividing the input into multiple parts, thus enriching contextual information. In particular, SoftGroup++ achieves a 2.4-point AP$_{50}$ improvement while being $6\times$ faster than the existing fastest method. The code and trained models will be made publicly available.
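The effect of a spatial index on $k$-NN can be illustrated with a simple voxel-grid hash: each query scans only the 3×3×3 cells around its own voxel instead of all N points. This is a stand-in for the octree $k$-NN described above; the cell size and data are illustrative.

```python
import numpy as np

def grid_knn(xyz, k, cell=1.0):
    """Approximate k-NN via a voxel-grid hash: search only the 3x3x3
    neighborhood of each query's cell instead of all N points, which is
    the same idea (in spirit) as replacing vanilla k-NN with a spatial
    index to cut the quadratic cost."""
    keys = np.floor(xyz / cell).astype(int)
    buckets = {}
    for i, key in enumerate(map(tuple, keys)):
        buckets.setdefault(key, []).append(i)
    out = []
    for i, key in enumerate(map(tuple, keys)):
        cand = []
        for off in np.ndindex(3, 3, 3):  # the 27 surrounding cells
            cand += buckets.get(tuple(np.array(key) + np.array(off) - 1), [])
        d = np.linalg.norm(xyz[cand] - xyz[i], axis=1)
        out.append([cand[j] for j in np.argsort(d)[:k]])
    return out

rng = np.random.default_rng(1)
xyz = rng.uniform(0, 3, size=(40, 3))
nn = grid_knn(xyz, k=1)
```

Each query now touches only the points hashed into nearby cells; with k=1 every point's nearest neighbor is itself, which the test below checks as a sanity condition.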
3D scene understanding from point clouds plays a vital role in various robot applications. Unfortunately, current state-of-the-art methods use separate neural networks for different tasks such as object detection and room layout estimation. Such a scheme has two limitations: (1) storing and running several networks for different tasks is expensive for typical robotic platforms; (2) the intrinsic structure shared by the separate outputs is ignored and potentially violated. To this end, we propose the first transformer architecture that simultaneously predicts 3D objects and layouts from point cloud inputs. Unlike existing methods that estimate layout keypoints or edges, we parameterize the room layout as a set of quads. The proposed architecture is therefore termed P(oint)Q(uad)-Transformer. Along with the novel quad representation, we propose a tailored physical-constraint loss function that discourages object-layout interference. Quantitative and qualitative evaluations on the public benchmark ScanNet show that the proposed PQ-Transformer succeeds in jointly parsing 3D objects and layouts, running at a quasi-real-time (8.91 FPS) rate without efficiency-oriented optimization. Moreover, the new physical-constraint loss improves strong baselines, and the F1 score of the room layout is significantly promoted from 37.9% to 57.9%.
Instance segmentation on point clouds is crucially important for 3D scene understanding. Distance clustering is commonly used in state-of-the-art (SOTA) methods; it is typically effective but performs poorly on adjacent objects with the same semantic label, especially when they share neighboring points. Due to the uneven distribution of offset points, these existing methods can hardly cluster all instance points. To this end, we design a novel divide-and-conquer strategy and propose an end-to-end network named PBNet that binarizes each point and clusters them separately to segment instances. PBNet divides offset instance points into two categories, high-density points and low-density points (HPs vs. LPs), which are then conquered separately. Adjacent objects can be clearly separated by removing LPs, and then completed and refined by assigning LPs via a neighbor-voting method. To further reduce clustering errors, we develop an iterative merging algorithm based on mean size to aggregate fragmented instances. Experiments on the ScanNetV2 and S3DIS datasets demonstrate the superiority of our model. In particular, PBNet achieves the best AP50 and AP25 to date on the ScanNetV2 official benchmark challenge (validation set), while also demonstrating high efficiency.
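The binarization step can be sketched as a neighbor-count test on the offset-shifted points: dense vote clusters become high-density points (HPs), strays become low-density points (LPs), and LPs would later be assigned by neighbor voting. The radius and threshold below are hypothetical, not PBNet's actual density criterion.

```python
import numpy as np

def binarize_by_density(shifted, radius=0.6, min_neighbors=2):
    """Split offset-shifted points into high-density (HPs, True) and
    low-density (LPs, False) points by counting neighbors within a radius."""
    d = np.linalg.norm(shifted[:, None] - shifted[None], axis=-1)
    counts = (d < radius).sum(axis=1) - 1   # exclude the point itself
    return counts >= min_neighbors

# Two tight vote clusters (two instances) plus one stray offset point.
shifted = np.array([[0, 0, 0], [0.1, 0, 0], [0.2, 0, 0],
                    [4, 4, 0], [4.1, 4, 0], [4.2, 4, 0],
                    [2, 2, 0]], float)
is_hp = binarize_by_density(shifted)
```

Clustering only the HPs keeps the two instances cleanly separated; the stray LP is set aside, to be attached afterwards by voting among its labeled neighbors.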