在本文中,我们专注于3D形式抽象和语义分析的两个任务。这与目前的方法形成对比,仅关注3D形状抽象或语义分析。此外,以前的方法难以产生实例级语义结果,其限制了它们的应用。我们提出了一种用于联合估计3D形式抽象和语义分析的新方法。我们的方法首先为3D形状产生许多3D语义候选区域;然后,我们采用这些候选者直接预测语义类别,并使用深卷积神经网络同时细化候选地区的参数。最后,我们设计一种融合预测结果并获得最终语义抽象的算法,该抽象被显示为对标准非最大抑制的改进。实验结果表明,我们的方法可以产生最先进的结果。此外,我们还发现我们的结果可以很容易地应用于实例级语义部分割和形状匹配。
translated by 谷歌翻译
Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominating technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has become even thriving, with numerous methods being proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.
translated by 谷歌翻译
Current 3D object detection methods are heavily influenced by 2D detectors. In order to leverage architectures in 2D detectors, they often convert 3D point clouds to regular grids (i.e., to voxel grids or to bird's eye view images), or rely on detection in 2D images to propose 3D boxes. Few works have attempted to directly detect objects in point clouds. In this work, we return to first principles to construct a 3D detection pipeline for point cloud data and as generic as possible. However, due to the sparse nature of the data -samples from 2D manifolds in 3D space -we face a major challenge when directly predicting bounding box parameters from scene points: a 3D object centroid can be far from any surface point thus hard to regress accurately in one step. To address the challenge, we propose VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting. Our model achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D with a simple design, compact model size and high efficiency. Remarkably, VoteNet outperforms previous methods by using purely geometric information without relying on color images.
translated by 谷歌翻译
Figure 1. Shapes from the ShapeNet [8] database, fit to a structured implicit template, and arranged by template parameters using t-SNE [52]. Similar shape classes, such as airplanes, cars, and chairs, naturally cluster by template parameters. 1
translated by 谷歌翻译
We introduce Similarity Group Proposal Network (SGPN), a simple and intuitive deep learning framework for 3D object instance segmentation on point clouds. SGPN uses a single network to predict point grouping proposals and a corresponding semantic class for each proposal, from which we can directly extract instance segmentation results. Important to the effectiveness of SGPN is its novel representation of 3D instance segmentation results in the form of a similarity matrix that indicates the similarity between each pair of points in embedded feature space, thus producing an accurate grouping proposal for each point. Experimental results on various 3D scenes show the effectiveness of our method on 3D instance segmentation, and we also evaluate the capability of SGPN to improve 3D object detection and semantic segmentation results. We also demonstrate its flexibility by seamlessly incorporating 2D CNN features into the framework to boost performance.
translated by 谷歌翻译
Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles which combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy and optimization function, etc. In this paper, we provide a review on deep learning based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely Convolutional Neural Network (CNN). Then we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network based learning systems.
translated by 谷歌翻译
We focus on the task of amodal 3D object detection in RGB-D images, which aims to produce a 3D bounding box of an object in metric form at its full extent. We introduce Deep Sliding Shapes, a 3D ConvNet formulation that takes a 3D volumetric scene from a RGB-D image as input and outputs 3D object bounding boxes. In our approach, we propose the first 3D Region Proposal Network (RPN) to learn objectness from geometric shapes and the first joint Object Recognition Network (ORN) to extract geometric features in 3D and color features in 2D. In particular, we handle objects of various sizes by training an amodal RPN at two different scales and an ORN to regress 3D bounding boxes. Experiments show that our algorithm outperforms the state-of-the-art by 13.8 in mAP and is 200× faster than the original Sliding Shapes. Source code and pre-trained models are available.
translated by 谷歌翻译
Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.
translated by 谷歌翻译
点云的语义场景重建是3D场景理解的必不可少的任务。此任务不仅需要识别场景中的每个实例,而且还需要根据部分观察到的点云恢复其几何形状。现有方法通常尝试基于基于检测的主链的不完整点云建议直接预测完整对象的占用值。但是,由于妨碍了各种检测到的假阳性对象建议以及对完整对象学习占用值的不完整点观察的歧义,因此该框架始终无法重建高保真网格。为了绕开障碍,我们提出了一个分离的实例网格重建(DIMR)框架,以了解有效的点场景。采用基于分割的主链来减少假阳性对象建议,这进一步使我们对识别与重建之间关系的探索有益。根据准确的建议,我们利用网状意识的潜在代码空间来解开形状完成和网格生成的过程,从而缓解了由不完整的点观测引起的歧义。此外,通过在测试时间访问CAD型号池,我们的模型也可以通过在没有额外训练的情况下执行网格检索来改善重建质量。我们用多个指标彻底评估了重建的网格质量,并证明了我们在具有挑战性的扫描仪数据集上的优越性。代码可在\ url {https://github.com/ashawkey/dimr}上获得。
translated by 谷歌翻译
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefited from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability. * Majority of the work done as an intern at Nuro, Inc. depth to point cloud 2D region (from CNN) to 3D frustum 3D box (from PointNet)
translated by 谷歌翻译
Three-dimensional objects are commonly represented as 3D boxes in a point-cloud. This representation mimics the well-studied image-based 2D bounding-box detection but comes with additional challenges. Objects in a 3D world do not follow any particular orientation, and box-based detectors have difficulties enumerating all orientations or fitting an axis-aligned bounding box to rotated objects. In this paper, we instead propose to represent, detect, and track 3D objects as points. Our framework, CenterPoint, first detects centers of objects using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation, and velocity. In a second stage, it refines these estimates using additional point features on the object. In CenterPoint, 3D object tracking simplifies to greedy closest-point matching. The resulting detection and tracking algorithm is simple, efficient, and effective. CenterPoint achieved state-of-theart performance on the nuScenes benchmark for both 3D detection and tracking, with 65.5 NDS and 63.8 AMOTA for a single model. On the Waymo Open Dataset, Center-Point outperforms all previous single model methods by a large margin and ranks first among all Lidar-only submissions. The code and pretrained models are available at https://github.com/tianweiy/CenterPoint.
translated by 谷歌翻译
我们为来自多视图立体声(MVS)城市场景的3D建筑物的实例分割了一部小说框架。与关注城市场景的语义分割的现有作品不同,即使它们安装在大型和不精确的3D表面模型中,这项工作的重点是检测和分割3D构建实例。通过添加高度图,首先将多视图RGB图像增强到RGBH图像,并且被分段以使用微调的2D实例分割神经网络获得所有屋顶实例。然后将来自不同的多视图图像的屋顶实例掩码被聚集到全局掩码中。我们的面具聚类占空间闭塞和重叠,可以消除多视图图像之间的分割歧义。基于这些全局掩码,3D屋顶实例由掩码背部投影分割,并通过Markov随机字段(MRF)优化扩展到整个建筑实例。定量评估和消融研究表明了该方法的所有主要步骤的有效性。提供了一种用于评估3D建筑模型的实例分割的数据集。据我们所知,它是一个在实例分割级别的3D城市建筑的第一个数据集。
translated by 谷歌翻译
Intelligent mesh generation (IMG) refers to a technique to generate mesh by machine learning, which is a relatively new and promising research field. Within its short life span, IMG has greatly expanded the generalizability and practicality of mesh generation techniques and brought many breakthroughs and potential possibilities for mesh generation. However, there is a lack of surveys focusing on IMG methods covering recent works. In this paper, we are committed to a systematic and comprehensive survey describing the contemporary IMG landscape. Focusing on 110 preliminary IMG methods, we conducted an in-depth analysis and evaluation from multiple perspectives, including the core technique and application scope of the algorithm, agent learning goals, data types, targeting challenges, advantages and limitations. With the aim of literature collection and classification based on content extraction, we propose three different taxonomies from three views of key technique, output mesh unit element, and applicable input data types. Finally, we highlight some promising future research directions and challenges in IMG. To maximize the convenience of readers, a project page of IMG is provided at \url{https://github.com/xzb030/IMG_Survey}.
translated by 谷歌翻译
The objective of this paper is to learn dense 3D shape correspondence for topology-varying generic objects in an unsupervised manner. Conventional implicit functions estimate the occupancy of a 3D point given a shape latent code. Instead, our novel implicit function produces a probabilistic embedding to represent each 3D point in a part embedding space. Assuming the corresponding points are similar in the embedding space, we implement dense correspondence through an inverse function mapping from the part embedding vector to a corresponded 3D point. Both functions are jointly learned with several effective and uncertainty-aware loss functions to realize our assumption, together with the encoder generating the shape latent code. During inference, if a user selects an arbitrary point on the source shape, our algorithm can automatically generate a confidence score indicating whether there is a correspondence on the target shape, as well as the corresponding semantic point if there is one. Such a mechanism inherently benefits man-made objects with different part constitutions. The effectiveness of our approach is demonstrated through unsupervised 3D semantic correspondence and shape segmentation.
translated by 谷歌翻译
3D object detection from LiDAR point cloud is a challenging problem in 3D scene understanding and has many practical applications. In this paper, we extend our preliminary work PointRCNN to a novel and strong point-cloud-based 3D object detection framework, the part-aware and aggregation neural network (Part-A 2 net). The whole framework consists of the part-aware stage and the part-aggregation stage. Firstly, the part-aware stage for the first time fully utilizes free-of-charge part supervisions derived from 3D ground-truth boxes to simultaneously predict high quality 3D proposals and accurate intra-object part locations. The predicted intra-object part locations within the same proposal are grouped by our new-designed RoI-aware point cloud pooling module, which results in an effective representation to encode the geometry-specific features of each 3D proposal. Then the part-aggregation stage learns to re-score the box and refine the box location by exploring the spatial relationship of the pooled intra-object part locations. Extensive experiments are conducted to demonstrate the performance improvements from each component of our proposed framework. Our Part-A 2 net outperforms all existing 3D detection methods and achieves new state-of-the-art on KITTI 3D object detection dataset by utilizing only the LiDAR point cloud data. Code is available at https://github.com/sshaoshuai/PointCloudDet3D.
translated by 谷歌翻译
我们介绍了一种方法,例如针对3D点云的提案生成。现有技术通常直接在单个进料前进的步骤中回归建议,从而导致估计不准确。我们表明,这是一个关键的瓶颈,并提出了一种基于迭代双边滤波的方法。遵循双边滤波的精神,我们考虑了每个点的深度嵌入以及它们在3D空间中的位置。我们通过合成实验表明,在为给定的兴趣点生成实例建议时,我们的方法会带来巨大的改进。我们进一步验证了我们在挑战性扫描基准测试中的方法,从而在自上而下的方法的子类别中实现了最佳实例分割性能。
translated by 谷歌翻译
我们介绍了一种名为RobustAbnet的新表检测和结构识别方法,以检测表的边界并从异质文档图像中重建每个表的细胞结构。为了进行表检测,我们建议将Cornernet用作新的区域建议网络来生成更高质量的表建议,以更快的R-CNN,这显着提高了更快的R-CNN的定位准确性以进行表检测。因此,我们的表检测方法仅使用轻巧的RESNET-18骨干网络,在三个公共表检测基准(即CTDAR TRACKA,PUBLAYNET和IIIT-AR-13K)上实现最新性能。此外,我们提出了一种新的基于分裂和合并的表结构识别方法,其中提出了一个新型的基于CNN的新空间CNN分离线预测模块将每个检测到的表分为单元格,并且基于网格CNN的CNN合并模块是应用用于恢复生成细胞。由于空间CNN模块可以有效地在整个表图像上传播上下文信息,因此我们的表结构识别器可以坚固地识别具有较大的空白空间和几何扭曲(甚至弯曲)表的表。得益于这两种技术,我们的表结构识别方法在包括SCITSR,PubTabnet和CTDAR TrackB2-Modern在内的三个公共基准上实现了最先进的性能。此外,我们进一步证明了我们方法在识别具有复杂结构,大空间以及几何扭曲甚至弯曲形状的表上的表格上的优势。
translated by 谷歌翻译
从卷积神经网络的快速发展中受益,汽车牌照检测和识别的性能得到了很大的改善。但是,大多数现有方法分别解决了检测和识别问题,并专注于特定方案,这阻碍了现实世界应用的部署。为了克服这些挑战,我们提出了一个有效而准确的框架,以同时解决车牌检测和识别任务。这是一个轻巧且统一的深神经网络,可以实时优化端到端。具体而言,对于不受约束的场景,采用了无锚方法来有效检测车牌的边界框和四个角,这些框用于提取和纠正目标区域特征。然后,新型的卷积神经网络分支旨在进一步提取角色的特征而不分割。最后,将识别任务视为序列标记问题,这些问题通过连接派时间分类(CTC)解决。选择了几个公共数据集,包括在各种条件下从不同方案中收集的图像进行评估。实验结果表明,所提出的方法在速度和精度上都显着优于先前的最新方法。
translated by 谷歌翻译
点云实例分割在深度学习的出现方面取得了巨大进展。然而,这些方法通常是具有昂贵且耗时的密度云注释的数据饥饿。为了减轻注释成本,在任务中仍申请未标记或弱标记的数据。在本文中,我们使用标记和未标记的边界框作为监控,介绍第一个半监控点云实例分段框架(SPIB)。具体而言,我们的SPIB架构涉及两级学习程序。对于阶段,在具有扰动一致性正则化(SPCR)的半监控设置下培训边界框提案生成网络。正规化通过强制执行对应用于输入点云的不同扰动的边界框预测的不变性,为网络学习提供自我监督。对于阶段,使用SPCR的边界框提案被分组为某些子集,并且使用新颖的语义传播模块和属性一致性图模块中的每个子集中挖掘实例掩码。此外,我们介绍了一种新型占用比导改进模块,以优化实例掩码。对挑战队的攻击v2数据集进行了广泛的实验,证明了我们的方法可以实现与最近的完全监督方法相比的竞争性能。
translated by 谷歌翻译
由于动态环境中LIDAR点的稀缺性,3D对象跟踪仍然是一个具有挑战性的问题。在这项工作中,我们提出了一个暹罗体素到BEV跟踪器,可以显着提高稀疏3D点云中的跟踪性能。具体地,它包括暹罗形状感知特征学习网络和体素到BEV目标本地化网络。暹罗形式感知特征学习网络可以捕获对象的3D形状信息以学习对象的辨别特征,使得可以识别来自稀疏点云中的背景的潜在目标。为此,我们首先执行模板特征嵌入以将模板的特征嵌入到电位目标中,然后生成密集的3D形状以表征潜在目标的形状信息。为了本地化跟踪目标,Voxel-to-BeV目标本地化网络以无密集的鸟瞰图(BEV)特征图,将目标的2D中心和$ Z $ -Axis中心以无锚的方式回归。具体地,我们通过MAX池沿Z $ -axis压缩了Voxelized Point云,以获得密集的BEV特征图,其中可以更有效地执行2D中心和$ Z $ -Axis中心的回归。对基蒂和NUSCENES数据集的广泛评估表明,我们的方法通过大边距显着优于当前最先进的方法。
translated by 谷歌翻译