3D pose estimation is a challenging problem in computer vision. Most existing neural-network-based approaches process color or depth images with convolutional neural networks (CNNs). In this paper, we study the task of 3D human pose estimation from depth images. Unlike existing CNN-based human pose estimation methods, we propose a deep human pose network for 3D pose estimation that takes point cloud data as input to model the surface of complex human structures. We first cast 3D human pose estimation from 2D depth images to 3D point clouds and directly predict the 3D joint positions. Our experiments on two public datasets show that our approach achieves higher accuracy than previous state-of-the-art methods. The reported results on both the ITOP and EVAL datasets demonstrate the effectiveness of our method on the targeted tasks.
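The abstract leaves the network unspecified; below is a minimal, hypothetical sketch of a point-cloud-to-joints regressor of the kind described. All layer sizes and the class name are assumptions, not the authors' architecture; the joint count of 15 matches ITOP.

```python
import torch
import torch.nn as nn

class PointPoseNet(nn.Module):
    """Minimal PointNet-style regressor: point cloud (B, N, 3) -> J joints (B, J, 3)."""
    def __init__(self, num_joints=15):
        super().__init__()
        # Shared per-point MLP (weights shared across points).
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1024), nn.ReLU(),
        )
        # Global shape feature -> joint coordinates.
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, num_joints * 3),
        )
        self.num_joints = num_joints

    def forward(self, pts):                      # pts: (B, N, 3)
        feat = self.point_mlp(pts)               # (B, N, 1024)
        global_feat = feat.max(dim=1).values     # order-invariant pooling, (B, 1024)
        joints = self.head(global_feat)          # (B, J*3)
        return joints.view(-1, self.num_joints, 3)

# Smoke test: 2 clouds of 1024 points, e.g. back-projected from depth maps.
net = PointPoseNet(num_joints=15)
pred = net(torch.randn(2, 1024, 3))
print(pred.shape)  # torch.Size([2, 15, 3])
```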
Human pose estimation has been widely applied in various industries. While recent decades have witnessed the introduction of many advanced two-dimensional (2D) human pose estimation solutions, three-dimensional (3D) human pose estimation is still an active research field in computer vision. Generally speaking, 3D human pose estimation methods can be divided into two categories: single-stage and two-stage. In this paper, we focus on the 2D-to-3D lifting process in two-stage methods and propose a more advanced baseline model for 3D human pose estimation, based on existing solutions. Our improvements include an optimization of the machine learning model and multiple parameters, as well as the introduction of a weighted loss to the training model. Finally, we used the Human3.6M benchmark to test the final performance, and it produced satisfactory results.
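As a small illustration of the weighted-loss idea mentioned above, here is one plausible form: a per-joint weighted MPJPE. The weight values and joint indices below are made up for the example, not taken from the paper.

```python
import torch

def weighted_mpjpe(pred, target, w):
    """Weighted MPJPE: per-joint Euclidean errors scaled by per-joint weights.
    pred, target: (B, J, 3); w: (J,), e.g. larger on hard, fast-moving joints."""
    per_joint = torch.norm(pred - target, dim=-1)        # (B, J)
    return (per_joint * w).sum(dim=-1).div(w.sum()).mean()

pred, gt = torch.randn(4, 17, 3), torch.randn(4, 17, 3)  # 17 joints as in Human3.6M
w = torch.ones(17)
w[[3, 6, 13, 16]] = 2.0        # hypothetical end-effector indices, upweighted
print(weighted_mpjpe(pred, gt, w).item())
```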
Skeleton-based recognition systems are gaining popularity, and machine learning models that attend to points or joints in a skeleton have proven to be computationally efficient and applicable in many fields, such as robotics. Points are easy to track, and thus the spatial and temporal information that plays an important role in abstracting the required information is preserved, making classification an easy task. In this paper, we aim to study these points, but using a cloud mechanism, where we define a cloud as a collection of points. When we add temporal information, however, it may not be possible to retrieve the coordinates of a point in every frame; instead of focusing on a single point, we can use its k-neighbors to retrieve the state of the point in question. Our focus is on gathering such information using weight sharing, while making sure that we do not carry noise when retrieving information from the neighbors. LSTMs have long-term modeling capability and can carry both temporal and spatial information. Finally, this paper attempts to summarize graph-based gesture recognition methods.
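The abstract describes k-neighbor retrieval with weight sharing followed by an LSTM; the sketch below shows one plausible way to wire these pieces together. All names, sizes, and the max-pooling choice (used here to suppress noisy neighbors) are assumptions.

```python
import torch
import torch.nn as nn

class CloudGestureNet(nn.Module):
    """Per frame: gather the k nearest neighbors of a query joint, encode them
    with a shared MLP (weight sharing), max-pool to damp noisy neighbors; an
    LSTM then models the temporal evolution of the per-frame feature."""
    def __init__(self, k=8, feat=64, classes=10):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(3, feat), nn.ReLU(),
                                 nn.Linear(feat, feat), nn.ReLU())
        self.lstm = nn.LSTM(feat, 128, batch_first=True)
        self.cls = nn.Linear(128, classes)

    def forward(self, clouds, queries):
        # clouds: (B, T, N, 3) point clouds; queries: (B, T, 3) joint of interest.
        d = torch.cdist(queries.unsqueeze(2), clouds).squeeze(2)  # (B, T, N)
        idx = d.topk(self.k, largest=False).indices               # (B, T, k)
        nbrs = torch.gather(clouds, 2, idx.unsqueeze(-1).expand(-1, -1, -1, 3))
        rel = nbrs - queries.unsqueeze(2)       # neighbors relative to the query
        f = self.mlp(rel).max(dim=2).values     # (B, T, feat)
        out, _ = self.lstm(f)
        return self.cls(out[:, -1])             # gesture logits, (B, classes)

net = CloudGestureNet()
logits = net(torch.randn(2, 16, 256, 3), torch.randn(2, 16, 3))
print(logits.shape)  # torch.Size([2, 10])
```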
Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominating technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks: 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.
This paper proposes a novel application system for the generation of three-dimensional (3D) character animation driven by markerless human body motion capture. The entire pipeline of the system consists of five stages: 1) the capture of motion data using multiple cameras, 2) detection of the two-dimensional (2D) human body joints, 3) estimation of the 3D joints, 4) calculation of bone transformation matrices, and 5) generation of character animation. The main objective of this study is to generate a 3D skeleton and animation for 3D characters using multi-view images captured by ordinary cameras. The computational complexity of the 3D skeleton reconstruction based on 3D vision has been reduced as needed to achieve frame-by-frame motion capture. The experimental results reveal that our system can effectively and efficiently capture human actions and use them to animate 3D cartoon characters in real time.
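For stage 4, one standard way to compute a bone transformation from estimated joints is to find the rotation that aligns the rest-pose bone direction with the captured one (Rodrigues' formula). The sketch below illustrates that idea; it is not necessarily the system's exact method.

```python
import numpy as np

def bone_rotation(rest_dir, cur_dir):
    """Rotation matrix aligning the rest-pose bone direction to the captured
    one; such per-bone rotations can drive a rigged character."""
    a = rest_dir / np.linalg.norm(rest_dir)
    b = cur_dir / np.linalg.norm(cur_dir)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):
        # Opposite vectors: rotate 180 degrees about any axis normal to a.
        axis = np.linalg.svd(a[None])[2][-1]
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    K = np.array([[0, -v[2], v[1]],
                  [v[2], 0, -v[0]],
                  [-v[1], v[0], 0]])
    return np.eye(3) + K + K @ K / (1.0 + c)   # Rodrigues' alignment formula

# Example: a shoulder->elbow bone rotated from the T-pose x-axis to the y-axis.
R = bone_rotation(np.array([1.0, 0, 0]), np.array([0.0, 1.0, 0]))
print(np.round(R @ np.array([1.0, 0, 0]), 3))  # -> [0. 1. 0.]
```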
Millimeter-wave (mmWave) radar works in adverse environments, such as smoke, rain, snow, and poor lighting. Prior work has explored the possibility of reconstructing 3D skeletons or meshes from noisy and sparse mmWave radar signals. However, it is unclear how accurately we can reconstruct 3D bodies from mmWave signals across scenes, and how the performance compares with cameras; these are important aspects to consider when mmWave radars are used alone or combined with cameras. To answer these questions, we first design and build a platform with multiple sensors to collect a large-scale dataset. The dataset consists of synchronized and calibrated mmWave radar point clouds and RGB(D) images in different scenes, together with skeleton/mesh annotations for the humans in the scenes. Using this dataset, we train state-of-the-art methods with inputs from different sensors and test them in various scenarios. The results demonstrate that 1) despite the noise and sparsity of the generated point clouds, mmWave radar achieves better reconstruction accuracy than an RGB camera, though worse than a depth camera; and 2) reconstruction from mmWave radar is affected by adverse weather conditions, whereas RGB(D) cameras are severely affected. Furthermore, analysis of the dataset and the results provides insights for improving reconstruction from mmWave radar as well as for combining the signals from different sensors.
Human pose estimation (HPE) based on RGB images has seen rapid development thanks to deep learning. However, event-based HPE has not been studied thoroughly, even though it holds great potential for applications in extreme scenes and under efficiency-critical conditions. In this paper, we are the first to estimate 2D human pose directly from 3D event point clouds. We propose a novel event representation, the rasterized event point cloud, which aggregates events at the same positions within a small time slice. It maintains the 3D features from multiple statistical cues while significantly reducing memory consumption and computational complexity, which proves effective in our work. We then leverage the rasterized event point cloud with three different backbones, PointNet, DGCNN, and Point Transformer, and use a decoder of two linear layers to predict the locations of human keypoints. We find that, with our method, PointNet achieves promising results at a much faster speed, whereas Point Transformer reaches much higher accuracy, even close to that of previous event-frame-based methods. A comprehensive set of results demonstrates that our proposed method is consistently effective across these 3D backbone models for event-driven human pose estimation. Our PointNet-based method with a 2048-point input achieves 82.46 mm MPJPE3D on the DHP19 dataset, with a latency of only 12.29 ms on an NVIDIA Jetson Xavier NX edge computing platform, making it ideally suited for real-time detection with event cameras. The code will be made publicly available at https://github.com/masterhow/eventpointpose.
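A rough sketch of what rasterizing an event stream into a point cloud could look like: events sharing a pixel within a time slice collapse into a single point carrying simple statistics (event count, mean timestamp). The exact statistics used by the paper may differ; the 346x260 sensor size is the DAVIS346 resolution used by DHP19.

```python
import numpy as np

def rasterize_events(x, y, t, num_slices, H, W):
    """Aggregate events at the same pixel within a time slice into one point."""
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    s = np.minimum((t_norm * num_slices).astype(int), num_slices - 1)
    key = (s * H + y) * W + x                      # unique id per (slice, pixel)
    uniq, inv, cnt = np.unique(key, return_inverse=True, return_counts=True)
    mean_t = np.bincount(inv, weights=t_norm) / cnt
    ys, xs = (uniq // W) % H, uniq % W
    # One 3D point per occupied (pixel, slice): (x, y, mean_t) plus a count feature.
    return np.stack([xs, ys, mean_t], axis=1), cnt

events = np.random.rand(5000, 3)                   # toy (x, y, t) stream
x = (events[:, 0] * 346).astype(int)
y = (events[:, 1] * 260).astype(int)
pts, counts = rasterize_events(x, y, events[:, 2], num_slices=4, H=260, W=346)
print(pts.shape, counts.shape)
```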
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
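The render-match-refine loop can be sketched as below. The stand-in render and predict functions and the simple composition rule are illustrative only; they do not reproduce DeepIM's exact disentangled pose parameterization.

```python
import numpy as np

def refine_pose(R, t, observed, render_fn, predict_fn, iters=4):
    """Iterative matching: render at the current estimate, predict a relative
    correction from (rendered, observed), compose, repeat."""
    for _ in range(iters):
        rendered = render_fn(R, t)
        dR, dt = predict_fn(rendered, observed)  # relative rotation + translation
        R, t = dR @ R, t + dt                    # rotate about the object, then shift
    return R, t

# Toy stand-ins so the loop runs: the "network" nudges toward a known target pose.
R_target, t_target = np.eye(3), np.array([0.0, 0.0, 1.0])
render = lambda R, t: (R, t)
predict = lambda rend, obs: (R_target @ rend[0].T, 0.5 * (t_target - rend[1]))
R, t = refine_pose(np.eye(3), np.array([0.2, -0.1, 0.8]), None, render, predict)
print(np.round(t, 3))   # converges toward t_target over the iterations
```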
Despite recent advances, full 3D estimation of human pose from a single image remains a challenging task. In this paper, we explore the hypothesis that strong prior information about scene geometry can be used to improve pose estimation accuracy. To tackle this question empirically, we have assembled a novel Geometric Pose Affordance dataset, consisting of multi-view imagery of people interacting with a variety of rich 3D environments. We utilized a commercial motion capture system to collect gold-standard estimates of pose and constructed accurate geometric 3D CAD models of the scenes themselves. To inject prior knowledge of scene constraints into existing frameworks for pose estimation from images, we introduce a novel, view-based representation of scene geometry, the multi-layer depth map, which employs multi-hit ray tracing to concisely encode multiple surface entry and exit points along each camera view ray. We propose two different mechanisms for integrating multi-layer depth information into pose estimation: as encoded ray features input when lifting 2D pose to 3D, and secondly as a differentiable loss that encourages learned models to favor geometrically consistent pose estimates. We demonstrate experimentally that these techniques can improve the accuracy of 3D pose estimation, particularly in the presence of occlusion and complex scene geometry.
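A toy illustration of a multi-layer depth map, using an analytic sphere as a stand-in for the scene CAD model: each ray stores its first entry and exit depths rather than only the nearest surface. The paper's representation is built from full multi-hit ray tracing against the scene geometry.

```python
import numpy as np

def multilayer_depth_sphere(origin, dirs, center, radius):
    """For each unit ray direction, store the first two surface intersection
    depths (entry, exit) of a sphere; misses are encoded as inf."""
    oc = origin - center
    b = dirs @ oc                          # per-ray linear coefficient
    c = oc @ oc - radius**2
    disc = b**2 - c
    layers = np.full((dirs.shape[0], 2), np.inf)
    hit = disc >= 0
    sq = np.sqrt(disc[hit])
    layers[hit, 0] = -b[hit] - sq          # entry depth along the ray
    layers[hit, 1] = -b[hit] + sq          # exit depth along the ray
    return layers

# A 4-ray "camera" looking toward a unit sphere 3 m away along +z.
dirs = np.array([[0, 0, 1.0], [0.1, 0, 1], [0.5, 0, 1], [2, 0, 1.0]])
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(multilayer_depth_sphere(np.zeros(3), dirs, np.array([0, 0, 3.0]), 1.0))
```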
We present a novel approach to category-level 6D object pose and size estimation. To tackle intra-class shape variation, we learn a canonical shape space (CASS), a unified representation for a large variety of instances of a certain object category. In particular, CASS is modeled as the latent space of a deep generative model of canonical 3D shapes with normalized pose. We train a variational autoencoder (VAE) to generate 3D point clouds in the canonical space from an RGBD image. The VAE is trained in a cross-category fashion, exploiting publicly available large 3D shape repositories. Since the 3D point cloud is generated in a normalized pose (with actual size), the encoder of the VAE learns a view-factorized RGBD embedding: it maps an RGBD image taken from an arbitrary view into a pose-independent 3D shape representation. Object pose is then estimated by contrasting this with a pose-dependent feature of the input RGBD image extracted with a separate deep neural network. We integrate the learning of CASS with pose and size estimation into an end-to-end trainable network, achieving state-of-the-art performance.
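A compact sketch of a point cloud VAE of the kind described. Note that CASS encodes RGBD crops, whereas this self-contained toy encodes point clouds directly; all layer and latent sizes are assumptions.

```python
import torch
import torch.nn as nn

class PointCloudVAE(nn.Module):
    """Minimal VAE over point clouds: permutation-invariant encoder to
    (mu, logvar); decoder from latent z to a fixed-size canonical cloud."""
    def __init__(self, n_points=1024, zdim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                 nn.Linear(128, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, zdim)
        self.to_logvar = nn.Linear(256, zdim)
        self.dec = nn.Sequential(nn.Linear(zdim, 512), nn.ReLU(),
                                 nn.Linear(512, n_points * 3))
        self.n_points = n_points

    def forward(self, pts):                         # (B, N, 3)
        h = self.enc(pts).max(dim=1).values         # global shape feature
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.dec(z).view(-1, self.n_points, 3)
        return recon, mu, logvar

def kl_term(mu, logvar):
    # Standard Gaussian KL divergence, averaged over the batch.
    return -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()

vae = PointCloudVAE()
recon, mu, logvar = vae(torch.randn(2, 1024, 3))
print(recon.shape, kl_term(mu, logvar).item())
```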
A key technical challenge in performing 6D object pose estimation from RGB-D images is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performance in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating the 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embeddings, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches on two datasets, YCB-Video and LineMOD. We also deploy our proposed method to a real robot to grasp and manipulate objects based on the estimated pose. Our code and video are available at https://sites.google.com/view/densefusion/.
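The pixel-wise fusion idea can be sketched as follows: each point keeps its own color and geometry embeddings, concatenated with a global feature, so pose votes can come from unoccluded pixels. The linear layers below are stand-ins for the paper's CNN and PointNet branches; sizes are assumptions.

```python
import torch
import torch.nn as nn

class DenseFuse(nn.Module):
    """Pixel-wise fusion in the DenseFusion spirit: per-point color and
    geometry embeddings concatenated with a pooled global feature."""
    def __init__(self, c_rgb=32, c_geo=64, c_glob=256):
        super().__init__()
        self.rgb = nn.Linear(3, c_rgb)     # stand-in for a CNN color branch
        self.geo = nn.Linear(3, c_geo)     # stand-in for a PointNet branch
        self.glob = nn.Sequential(nn.Linear(c_rgb + c_geo, c_glob), nn.ReLU())

    def forward(self, colors, points):     # both (B, N, 3)
        f = torch.cat([self.rgb(colors), self.geo(points)], dim=-1)
        g = self.glob(f).max(dim=1, keepdim=True).values          # (B, 1, c_glob)
        dense = torch.cat([f, g.expand(-1, f.shape[1], -1)], dim=-1)
        return dense                        # per-point feature, (B, N, 352)

fuse = DenseFuse()
feat = fuse(torch.rand(2, 500, 3), torch.randn(2, 500, 3))
print(feat.shape)  # torch.Size([2, 500, 352])
```

A per-point pose head can then vote from each dense feature, with the most confident prediction (or a weighted combination) selected downstream.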
Estimating the 6D pose of objects is an essential computer vision task. However, most conventional approaches rely on camera data from a single perspective and therefore suffer from occlusions. We overcome this issue with our novel multi-view 6D pose estimation method called MV6D, which accurately predicts the 6D poses of all objects in a cluttered scene from RGB-D images taken from multiple perspectives. We base our method on the PVN3D network, which uses a single RGB-D image to predict keypoints of the target objects. We extend this approach by using a combined point cloud from multiple views and fusing the images from each view with a DenseFusion layer. In contrast to current multi-view detection networks such as CosyPose, our MV6D can learn the fusion of multiple perspectives in an end-to-end manner and does not require multiple prediction stages or subsequent fine-tuning of the predictions. Furthermore, we present three novel photorealistic datasets of cluttered scenes with heavy occlusion. All of them contain RGB-D images from multiple perspectives, together with annotations such as semantic segmentation and 6D poses. MV6D significantly outperforms the state of the art in multi-view 6D pose estimation, even when the camera poses are known inaccurately. Furthermore, we show that our approach is robust to dynamic camera setups and that its accuracy increases incrementally with an increasing number of perspectives.
This paper addresses the problems of 3D point cloud reconstruction and 3D pose estimation of the human hand from a single RGB image. To this end, we present a novel pipeline for local and global point cloud reconstruction using a 3D hand template, while learning a latent representation for pose estimation. To demonstrate our method, we introduce a new multi-view hand pose dataset that provides complete 3D point clouds of hands in the real world. Experiments on our newly proposed dataset and on four public benchmarks demonstrate the merits of our model: our method outperforms competitors in 3D pose estimation while reconstructing realistic-looking, complete 3D hand point clouds.
This paper first presents an efficient 3D point cloud learning architecture, named PWCLO-Net, for LiDAR odometry. In this architecture, a projection-aware representation of 3D point clouds is proposed to organize raw point clouds into an ordered data form for efficiency. A Pyramid, Warping, and Cost volume (PWC) structure for the LiDAR odometry task is built to estimate and refine the pose in a hierarchical, efficient coarse-to-fine approach. A projection-aware attentive cost volume is built to directly associate two discrete point clouds and obtain embedded motion patterns. A trainable embedding mask is then proposed to weight the local motion patterns for regressing the overall pose and filtering outlier points. A trainable pose warp-refinement module is used iteratively with the embedding mask, optimized hierarchically, to make pose estimation more robust to outliers. The entire architecture is optimized end to end, achieving adaptive learning of the cost volumes and masks, and all operations involving point cloud sampling and grouping are accelerated by the projection-aware 3D feature learning method. The superior performance and effectiveness of our LiDAR odometry architecture are demonstrated on the KITTI odometry dataset. Our method outperforms all recent learning-based methods and even the geometry-based approach LOAM with mapping optimization on most sequences of the KITTI odometry dataset.
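The embedding-mask idea, weighting per-point motion features so outliers contribute little to the regressed pose, might look roughly like this. The head and feature sizes are placeholders, not the paper's design.

```python
import torch

def masked_pose_vote(flow_embed, mask_logits, pose_head):
    """Weight per-point motion embeddings by a trainable mask (softmax over
    points), then regress one frame-to-frame pose from the pooled feature."""
    w = torch.softmax(mask_logits, dim=1)                  # (B, N, 1)
    pooled = (w * flow_embed).sum(dim=1)                   # weighted sum, (B, C)
    q_t = pose_head(pooled)                                # (B, 7): quaternion + translation
    q = q_t[:, :4] / q_t[:, :4].norm(dim=1, keepdim=True)  # unit quaternion
    return q, q_t[:, 4:]

B, N, C = 2, 4096, 64
head = torch.nn.Linear(C, 7)
q, t = masked_pose_vote(torch.randn(B, N, C), torch.randn(B, N, 1), head)
print(q.shape, t.shape, q.norm(dim=1))  # quaternions are unit-norm
```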
Point pair features (PPF) are widely used for 6D pose estimation. In this paper, we propose an efficient 6D pose estimation method based on the PPF framework. We introduce a well-targeted downsampling strategy that focuses more on edge areas, so that complex geometry is extracted efficiently. A pose hypothesis verification method is proposed that resolves symmetry ambiguity by computing an edge matching degree. We evaluate on two challenging datasets and one real-world collected dataset, which demonstrates the superiority of our method for pose estimation of geometrically complex, occluded, symmetric objects. We further validate our method by applying it to a simulated puncture task.
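One plausible reading of an edge-focused downsampling strategy: score each point with a local PCA curvature proxy and keep a larger share of samples from edge-like regions. The paper's actual strategy may differ in its details; the dense distance matrix below is only workable at toy scale.

```python
import numpy as np

def edge_aware_downsample(pts, keep, k=16, edge_frac=0.7):
    """Sample more densely where the local covariance spectrum indicates
    edges (surface variation is ~0 on planes, larger on edges/corners)."""
    n = len(pts)
    d = np.linalg.norm(pts[:, None] - pts[None], axis=-1)   # (n, n), toy scale
    idx = np.argsort(d, axis=1)[:, 1:k + 1]                 # k nearest neighbors
    scores = np.empty(n)
    for i in range(n):
        nb = pts[idx[i]] - pts[idx[i]].mean(0)
        ev = np.linalg.eigvalsh(nb.T @ nb / k)              # ascending eigenvalues
        scores[i] = ev[0] / max(ev.sum(), 1e-12)            # surface variation
    order = np.argsort(-scores)
    n_edge = int(keep * edge_frac)
    chosen = np.concatenate([
        order[:n_edge],                                     # strongest edge responses
        np.random.choice(order[n_edge:], keep - n_edge, replace=False),
    ])
    return pts[chosen]

pts = np.random.rand(500, 3)
print(edge_aware_downsample(pts, keep=100).shape)  # (100, 3)
```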
Thanks to the availability of affordable wearable cameras and large annotated datasets, applications of egocentric vision (a.k.a. first-person vision, FPV) have thrived over the past few years. The position of the wearable camera, typically mounted on the head, allows recording exactly what the camera wearer has in front of them, in particular their hands and manipulated objects. This intrinsic advantage enables the study of hands from multiple perspectives: localizing hands and their parts within images; understanding what actions and activities the hands are involved in; and developing human-computer interfaces that rely on hand gestures. In this survey, we review the literature on hands in egocentric vision, categorizing existing approaches into: localization (where are the hands or their parts?); interpretation (what are the hands doing?); and applications (e.g., systems that solve a specific problem using egocentric hand cues). In addition, a list of the most prominent datasets with hand-based annotations is provided.
Accurate rail location is a crucial component of railway driving assistance systems for safety monitoring. LiDAR can obtain point clouds carrying 3D information about the railway environment, especially in darkness and adverse weather conditions. In this paper, a real-time rail recognition method based on 3D point clouds is proposed to tackle challenges such as the disorder, uneven density, and sheer volume of the point clouds. A voxel down-sampling method is first presented to balance the density of the railway point clouds, and a pyramid partition is designed to divide the 3D scanning area into voxels of different volumes. A feature encoding module is then developed to find the nearest neighbor points and aggregate their local geometric features. Finally, a multi-scale neural network is proposed to produce prediction results for each voxel and the rail location. Experiments are conducted on nine sequences of railway 3D point cloud data. The results show that the proposed method performs well in detecting straight, curved, and other complex-topology rails.
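A minimal voxel down-sampling routine of the kind the pipeline starts from: all points sharing a voxel are replaced by their centroid. The pyramid partition described above would apply this with a different voxel size per region of the scanning area.

```python
import numpy as np

def voxel_downsample(pts, voxel_size):
    """Replace all points falling in the same voxel by their centroid,
    balancing the very uneven density of LiDAR sweeps."""
    keys = np.floor(pts / voxel_size).astype(np.int64)
    # Group points by their integer 3D voxel index.
    _, inv, cnt = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    inv = inv.reshape(-1)                       # guard against NumPy version quirks
    out = np.zeros((len(cnt), 3))
    np.add.at(out, inv, pts)                    # sum points per voxel
    return out / cnt[:, None]                   # centroid per occupied voxel

cloud = np.random.rand(10000, 3) * np.array([50.0, 10.0, 3.0])  # elongated scene
print(voxel_downsample(cloud, voxel_size=0.5).shape)
```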
Knowing the exact 3D locations of workers and robots in a collaborative environment enables several real applications, such as the detection of unsafe situations or the study of mutual interactions for statistical and social purposes. In this paper, we propose a non-invasive, light-invariant framework based on depth devices and deep neural networks to estimate the 3D pose of robots from an external camera. The method can be applied to any robot without requiring hardware access to its internal states. We introduce a novel representation of the predicted pose, namely Semi-Perspective Decoupled Heatmaps (SPDH), to accurately compute 3D joint locations in world coordinates while adapting efficient deep networks designed for 2D human pose estimation. The proposed approach, which takes as input a depth representation based on XYZ coordinates, can be trained on synthetic depth data and applied to real-world settings without domain adaptation techniques. To this end, we introduce the SimBa dataset, consisting of both synthetic and real depth images, and use it for the experimental evaluation. The results show that the proposed approach, made of a specific depth map representation and of SPDH, overcomes the current state of the art.
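A guess at how SPDH-style training targets could be constructed: each 3D joint is decoupled into a perspective uv heatmap plus a (u, depth-bin) heatmap, so a 2D network never regresses metric depth directly. The paper's exact construction may differ; all intrinsics and sizes here are placeholders.

```python
import numpy as np

def spdh_targets(joints_xyz, fx, fy, cx, cy, H, W, z_max=5.0, sigma=2.0):
    """Build two decoupled heatmap sets per joint: uv (perspective projection)
    and uz (image column vs. linearly quantized metric depth)."""
    J = len(joints_xyz)
    uv_maps = np.zeros((J, H, W))
    uz_maps = np.zeros((J, H, W))                 # rows reused as depth bins
    ys, xs = np.mgrid[0:H, 0:W]
    for j, (X, Y, Z) in enumerate(joints_xyz):
        u, v = fx * X / Z + cx, fy * Y / Z + cy   # perspective projection
        zbin = Z / z_max * (H - 1)                # depth quantized over H bins
        uv_maps[j] = np.exp(-((xs - u)**2 + (ys - v)**2) / (2 * sigma**2))
        uz_maps[j] = np.exp(-((xs - u)**2 + (ys - zbin)**2) / (2 * sigma**2))
    return uv_maps, uz_maps

uv, uz = spdh_targets(np.array([[0.1, -0.2, 1.5], [0.3, 0.0, 2.0]]),
                      fx=500, fy=500, cx=160, cy=120, H=240, W=320)
print(uv.shape, uz.shape)
```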
3D hand pose estimation methods have made significant progress recently. However, the estimation accuracy is often far from sufficient for specific real-world applications, so there remains considerable room for improvement. This paper proposes TriHorn-Net, a novel model that uses specific innovations to improve hand pose estimation accuracy on depth images. The first innovation is the decomposition of 3D hand pose estimation into the estimation of 2D joint locations in the depth image space (UV) and the estimation of their corresponding depths, aided by two complementary attention maps. This decomposition prevents depth estimation, which is the more difficult task, from interfering with the UV estimation at both the prediction level and the feature level. The second innovation is PixDropout, which is, to the best of our knowledge, the first appearance-based data augmentation method for hand depth images. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods on three public benchmark datasets.
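The abstract does not define PixDropout; as a hypothetical sketch of an appearance-based augmentation in that spirit, one could randomly invalidate foreground depth pixels, mimicking sensor holes:

```python
import numpy as np

def pix_dropout(depth, drop_prob=0.1, fill=0.0):
    """Randomly zero out a fraction of valid (foreground) depth pixels so the
    network cannot rely on any single pixel's depth value."""
    out = depth.copy()
    fg = out > 0                                   # valid, non-background pixels
    drop = np.random.rand(*out.shape) < drop_prob
    out[fg & drop] = fill
    return out

depth = np.random.rand(128, 128).astype(np.float32)
aug = pix_dropout(depth, drop_prob=0.15)
print((aug == 0).mean())   # roughly drop_prob of the pixels are invalidated
```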
We propose a CNN-based approach for 3D human body pose estimation from single RGB images that addresses the issue of limited generalizability of models trained solely on the starkly limited publicly available 3D pose data. Using only the existing 3D pose data and 2D pose data, we show state-of-the-art performance on established benchmarks through transfer of learned features, while also generalizing to in-the-wild scenes. We further introduce a new training set for human body pose estimation from monocular images of real humans that has the ground truth captured with a multi-camera marker-less motion capture system. It complements existing corpora with greater diversity in pose, human appearance, clothing, occlusion, and viewpoints, and enables an increased scope of augmentation. We also contribute a new benchmark that covers outdoor and indoor scenes, and demonstrate that our 3D pose dataset shows better in-the-wild performance than existing annotated data, which is further improved in conjunction with transfer learning from 2D pose data. All in all, we argue that the use of transfer learning of representations in tandem with algorithmic and data contributions is crucial for general 3D body pose estimation.