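Conventional 3D human pose estimation relies on first detecting 2D body keypoints and then solving the 2D-to-3D correspondence problem. Despite promising results, this learning paradigm depends heavily on the quality of the 2D keypoint detector, which is inevitably fragile under occlusion and out-of-image absence. In this paper, we propose a novel Pose Orientation Net (PONet) that estimates 3D pose by learning orientations only, thereby bypassing the error-prone keypoint detector when image evidence is missing. For images with partially invisible limbs, PONet estimates the 3D orientation of these limbs from local image evidence to recover the 3D pose. PONet can even infer full 3D poses from images with completely invisible limbs, by exploiting the orientation correlation between visible limbs to complement the estimated pose, which further improves the robustness of 3D pose estimation. We evaluate our method on multiple datasets, including Human3.6M, MPII, MPI-INF-3DHP, and 3DPW. Our method achieves results on par with state-of-the-art techniques in ideal settings, while eliminating the dependency on keypoint detectors and the corresponding computational burden. In highly challenging scenarios such as truncation and erasing, our method performs robustly and compares favorably with the state of the art, demonstrating its potential for real-world applications.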
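To make the orientation-only idea concrete, here is a minimal sketch of how per-limb 3D directions can be chained into joint positions. It is not PONet itself: the toy skeleton, the joint names, the bone lengths, and the function `pose_from_orientations` are all hypothetical, and the sketch assumes bone lengths are known in advance, which the abstract does not state.

```python
import math

# Hypothetical toy leg chain; joint names and bone lengths are illustrative only.
PARENT = {"hip": None, "knee": "hip", "ankle": "knee"}  # parents listed before children
BONE_LENGTH = {"knee": 0.45, "ankle": 0.40}  # metres; bone ending at each joint (assumed known)


def pose_from_orientations(orientations, root=(0.0, 0.0, 0.0)):
    """Chain per-limb direction vectors into 3D joint positions."""
    positions = {}
    for joint, parent in PARENT.items():
        if parent is None:
            positions[joint] = root
            continue
        d = orientations[joint]
        norm = math.sqrt(sum(c * c for c in d))  # normalise so any scale is accepted
        length = BONE_LENGTH[joint]
        positions[joint] = tuple(
            p + length * c / norm for p, c in zip(positions[parent], d)
        )
    return positions


# Both limbs point straight down (negative y); the second vector is not unit length.
pose = pose_from_orientations({"knee": (0.0, -1.0, 0.0), "ankle": (0.0, -2.0, 0.0)})
```

In this sketch no 2D keypoint is ever needed: once each limb has a direction, the positions follow from the kinematic chain.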
translated by Google Translate
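A key challenge in human pose and shape estimation is occlusion, including self-occlusion, object-human occlusion, and inter-person occlusion. The lack of diverse and accurate pose and shape training data is a major bottleneck, especially for occluded scenes in the wild. In this paper, we focus on estimating human pose and shape under inter-person occlusion, while also handling object-human occlusion and self-occlusion. We propose a novel framework that synthesizes occlusion-aware silhouette and 2D keypoint data and regresses directly to the SMPL pose and shape parameters. A neural 3D mesh renderer is used to enable silhouette supervision, which contributes to a large improvement in shape estimation. In addition, keypoint-and-silhouette-driven training data are synthesized from panoramic viewpoints to compensate for the lack of viewpoint diversity in existing datasets. Experimental results show that our method is state of the art on the 3DPW and 3DPW-Crowd datasets in terms of pose estimation accuracy. The proposed method significantly outperforms the rank-1 method in terms of shape estimation. Top performance is also achieved on SSP-3D in terms of shape prediction accuracy.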
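Occlusion poses a great threat to monocular multi-person 3D human pose estimation because occluders vary widely in shape, appearance, and position. Although existing methods try to handle occlusion with pose priors or constraints, data augmentation, or implicit reasoning, they still fail to generalize to unseen poses or occlusion cases and may make large mistakes when multiple people are present. Inspired by the remarkable human ability to infer occluded joints from visible cues, we develop a method that models this process explicitly and significantly improves bottom-up multi-person pose estimation with or without occlusion. First, we split the task into two subtasks, visible keypoint detection and occluded keypoint reasoning, and propose a Deeply Supervised Encoder Distillation (DSED) network to solve the second. To train our model, we propose a Skeleton-guided human Shape Fitting (SSF) approach that generates pseudo occlusion labels on existing datasets, enabling explicit occlusion reasoning. Experiments show that explicitly learning from occlusion improves human pose estimation. Moreover, exploiting feature-level information from visible joints lets us reason about occluded joints more accurately. Our method outperforms state-of-the-art top-down and bottom-up methods on several benchmarks.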
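In this paper, we consider the challenging task of simultaneously locating and recovering multiple hands from a single 2D image. Previous studies either focus on single-hand reconstruction or solve this problem in a multi-stage manner. In particular, the conventional two-stage pipeline first detects hand regions and then estimates the 3D hand pose from each cropped patch. To reduce the computational redundancy in preprocessing and feature extraction, we propose a concise but efficient single-stage pipeline. Specifically, we design a multi-head auto-encoder structure for multi-hand reconstruction, in which the head networks share the same feature map and output the hand center, pose, and texture, respectively. In addition, we adopt a weakly supervised scheme to alleviate the burden of expensive 3D real-world data annotation. To this end, we propose a series of losses optimized by a stage-wise training scheme, in which a multi-hand dataset with 2D annotations is generated from publicly available single-hand datasets. To further improve the accuracy of the weakly supervised model, we adopt several feature consistency constraints in both single-hand and multi-hand settings. Specifically, the keypoints of each hand estimated from local features should be consistent with the re-projected points predicted from global features. Extensive experiments on public benchmarks, including FreiHAND, HO3D, InterHand2.6M, and RHD, demonstrate that our method outperforms state-of-the-art model-based methods in both weakly supervised and fully supervised settings. Code and models are available at https://github.com/zijinxuxu/smhr.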
This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and use these vectors to vote for keypoint locations using RANSAC. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code will be available at https://zju-3dv.github.io/pvnet/.
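The voting step described above can be illustrated with a small, simplified sketch. It is not the PVNet implementation: the helper names (`unit_towards`, `intersect`, `vote_keypoint`), the cosine threshold, and the number of hypotheses are assumptions chosen for illustration, and the network that predicts the vectors is replaced by hand-made data. Pairs of pixels are sampled, their rays are intersected to form a keypoint hypothesis, and each hypothesis is scored by how many pixel vectors point towards it.

```python
import math
import random


def unit_towards(p, k):
    """Unit vector from pixel p to keypoint k (stands in for a network prediction)."""
    dx, dy = k[0] - p[0], k[1] - p[1]
    n = math.hypot(dx, dy)
    return (dx / n, dy / n)


def intersect(p1, v1, p2, v2, eps=1e-9):
    """Intersection of lines p1 + t*v1 and p2 + s*v2, or None if near-parallel."""
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    if abs(cross) < eps:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t = (dx * v2[1] - dy * v2[0]) / cross
    return (p1[0] + t * v1[0], p1[1] + t * v1[1])


def vote_keypoint(pixels, vectors, n_hyp=64, cos_thresh=0.99, seed=0):
    """RANSAC-style voting: return the best hypothesis and its inlier count."""
    rng = random.Random(seed)
    best, best_score = None, -1
    for _ in range(n_hyp):
        i, j = rng.sample(range(len(pixels)), 2)
        h = intersect(pixels[i], vectors[i], pixels[j], vectors[j])
        if h is None:
            continue
        score = 0
        for p, v in zip(pixels, vectors):
            dx, dy = h[0] - p[0], h[1] - p[1]
            n = math.hypot(dx, dy)
            # A pixel votes for h if its vector points (almost) straight at h.
            if n > 0 and (dx * v[0] + dy * v[1]) / n >= cos_thresh:
                score += 1
        if score > best_score:
            best, best_score = h, score
    return best, best_score


keypoint = (5.0, 3.0)
inliers = [(0, 0), (10, 0), (0, 8), (10, 8), (2, 7), (8, 1)]
pixels = inliers + [(4, 0), (9, 8)]  # the last two carry wrong directions
vectors = [unit_towards(p, keypoint) for p in inliers] + [(1.0, 0.0), (0.0, 1.0)]
estimate, votes = vote_keypoint(pixels, vectors)
```

Because every pixel only needs to know a direction, the keypoint itself may lie in an occluded region or outside the image and can still be located from the visible pixels.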
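Most real-time human pose estimation approaches are based on detecting joint positions. From the detected joint positions, the yaw and pitch of the limbs can be computed. However, the roll (twist) along the limb, which is critical for applications such as sports analysis and computer animation, cannot be computed because this axis of rotation remains unobserved. In this paper, we introduce orientation keypoints, a novel approach for estimating the full position and rotation of skeletal joints using only single-frame RGB images. Inspired by how motion-capture systems use a set of point markers to estimate full bone rotations, our method uses virtual markers to generate sufficient information to infer rotations accurately with simple post-processing. The rotation predictions improve on the best reported mean error for joint angles by 48% and achieve 93% accuracy across 15 bone rotations. The method also improves the current state-of-the-art results for joint positions by 14%, as measured by MPJPE on the principal dataset, and generalizes to in-the-wild datasets.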
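The following sketch illustrates why an extra marker point resolves the roll that two joint positions alone cannot. It is only a geometric illustration under assumed conventions, not the paper's post-processing: the function `bone_frame` and the marker placement are hypothetical. The bone axis comes from the two joints, and the marker, offset from the bone, fixes the remaining rotation about that axis.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)


def bone_frame(parent, child, marker):
    """Orthonormal frame (x along the bone, y towards the marker, z = x cross y)."""
    x = normalize(sub(child, parent))
    m = sub(marker, parent)
    m_perp = sub(m, tuple(dot(m, x) * c for c in x))  # drop the along-bone component
    y = normalize(m_perp)
    z = cross(x, y)
    return x, y, z


parent, child = (0.0, 0.0, 0.0), (0.0, -1.0, 0.0)
# Same two joints, but the virtual marker is rotated 90 degrees about the bone.
frame_a = bone_frame(parent, child, (0.1, -0.5, 0.0))
frame_b = bone_frame(parent, child, (0.0, -0.5, 0.1))
```

Both frames share the same bone axis, yet their second and third axes differ, which is exactly the roll information that is lost when only joint positions are detected.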
Figure 1: Human Mesh Recovery (HMR): End-to-end adversarial learning of human pose and shape. We describe a real-time framework for recovering the 3D joint angles and shape of the body from a single RGB image. The first two rows show results from our model trained with some 2D-to-3D supervision; the bottom row shows results from a model that is trained in a fully weakly-supervised manner without using any paired 2D-to-3D supervision. We infer the full 3D body even in case of occlusions and truncations. Note that we capture head and limb orientations. (Panel titles: Input, Reconstruction, Side and top-down view, Part Segmentation.)
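Fully supervised human mesh recovery methods are data-hungry and generalize poorly because of the limited availability and diversity of 3D-annotated benchmark datasets. Recent progress has been made with synthetic-data-driven training paradigms, in which the model is trained on synthetic pairs of 2D representations (e.g., 2D keypoints and segmentation masks) and 3D meshes. However, synthetic dense correspondence maps (i.e., IUV) have rarely been explored, because the domain gap between synthetic training data and real test data is hard to address for dense 2D representations. To alleviate this domain gap on IUV, we propose cross-representation alignment, which uses complementary information from the robust but sparse representation (2D keypoints). Specifically, the alignment errors between the initial mesh estimate and both 2D representations are forwarded to the regressor and dynamically corrected in the subsequent mesh regression. This adaptive cross-representation alignment learns explicitly from the deviations and captures complementary information: robustness from the sparse representation and richness from the dense one. We conduct extensive experiments on multiple standard benchmark datasets and demonstrate competitive results, helping to reduce the annotation effort needed to produce state-of-the-art models for human mesh estimation.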
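Training state-of-the-art models for human pose estimation in videos requires datasets with annotations that are very hard to obtain. Although transformers have recently been used for body pose sequence modeling, related methods rely on pseudo-ground truth to augment the currently limited training data available for learning such models. In this paper, we introduce PoseBERT, a transformer module that is trained entirely on 3D motion capture (MoCap) data via masked modeling. It is simple, generic, and versatile, as it can be plugged on top of any image-based model to turn it into a video-based model that uses temporal information. We showcase variants of PoseBERT with different inputs, ranging from 3D skeleton keypoints to the rotations of a 3D parametric model for either the full body or just the hands (MANO). Since PoseBERT training is task-agnostic, the model can be applied to several tasks such as pose refinement, future pose prediction, or motion completion. Our experimental results validate that adding PoseBERT on top of various state-of-the-art pose estimation methods consistently improves their performance, while its low computational cost allows us to use it in a real-time demo that animates a robotic hand via a webcam. Test code and models are available at https://github.com/naver/posebert.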
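We consider the problem of recovering a single person's 3D human mesh from in-the-wild crowded scenes. Although much progress has been made in 3D human mesh estimation, existing methods struggle when the test input contains crowded scenes. The first reason for the failure is a domain gap between training and testing data. Motion capture datasets, which provide accurate 3D labels for training, lack crowd data and prevent a network from learning image features of a target person that are robust to crowded scenes. The second reason is feature processing that spatially averages the feature map of a localized bounding box containing multiple people. Averaging the whole feature map makes the target person's features indistinguishable from those of others. We present 3DCrowdNet, the first method to explicitly target in-the-wild crowded scenes, which estimates a robust 3D human mesh by addressing the above issues. First, we leverage 2D human pose estimation, which does not require a motion capture dataset with 3D labels for training and does not suffer from the domain gap. Second, we propose a joint-based regressor that distinguishes the target person's features from those of others. Our joint-based regressor preserves the spatial activation of the target by sampling features at the target's joint locations and regresses human model parameters. As a result, 3DCrowdNet learns target-focused features and effectively excludes the irrelevant features of nearby people. We conduct experiments on various benchmarks and demonstrate the robustness of 3DCrowdNet to in-the-wild crowded scenes both quantitatively and qualitatively. The code is available at https://github.com/hongsukchoi/3dcrowdnet_release.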
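The contrast between averaging a whole feature map and sampling it at joint locations can be shown with a toy example. This is a minimal sketch, not the 3DCrowdNet regressor: the single-channel 4x4 map, the activation values, and the helpers `bilinear`, `global_average`, and `joint_features` are all hypothetical.

```python
import math


def bilinear(fmap, x, y):
    """Bilinearly sample a 2D map at (x, y), with 0 <= x <= W-1 and 0 <= y <= H-1."""
    h, w = len(fmap), len(fmap[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * fmap[y0][x0] + ax * fmap[y0][x1]
    bottom = (1 - ax) * fmap[y1][x0] + ax * fmap[y1][x1]
    return (1 - ay) * top + ay * bottom


def global_average(fmap):
    """Spatial average over the whole map: mixes everyone inside the box."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


def joint_features(fmap, joints):
    """One feature per target joint, sampled at the joint's (x, y) location."""
    return [bilinear(fmap, x, y) for x, y in joints]


# Toy map: the target person activates (x=1, y=1); a nearby person activates (x=3, y=2).
fmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
    [0.0, 0.0, 0.0, 0.0],
]
pooled = global_average(fmap)                   # influenced by both people
sampled = joint_features(fmap, [(1.0, 1.0)])    # only the target's activation
```

The pooled value depends on the neighbour's strong activation, while the sampled feature does not, which is the intuition behind sampling at the target's joints.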
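Although significant progress has been made in monocular marker-less human motion capture in recent years, state-of-the-art methods still struggle to obtain satisfactory results in occluded scenes. There are two main reasons. First, occluded motion capture is inherently ambiguous, because various 3D poses can map to the same 2D observations, which leads to unreliable estimates. Second, there is not enough occluded human data available to train a robust model. To address these obstacles, our key idea is to use non-occluded human data to learn a joint-level spatial-temporal motion prior for occluded humans with a self-supervised strategy. To further reduce the gap between synthetic and real occlusion data, we build the first 3D occluded motion dataset (OcMotion), which can be used for both training and testing. We encode the motions in 2D maps and synthesize occlusions on non-occluded data for self-supervised training. A spatial-temporal layer is then designed to learn joint-level correlations. The learned prior reduces the ambiguity of occlusion and is robust to diverse occlusion types, and it is then adopted to assist occluded human motion capture. Experimental results show that our method generates accurate and coherent human motions from occluded videos with good generalization ability and runtime efficiency. The dataset and code are publicly available at https://github.com/boycehbz/chomp.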
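The ability to perceive 3D human bodies from a single image has many applications, ranging from entertainment and robotics to neuroscience and healthcare. A fundamental challenge in human mesh recovery is collecting the ground-truth 3D mesh targets required for training, which requires burdensome motion capture systems and is often limited to indoor laboratories. As a result, although progress has been made on benchmark datasets collected in these restrictive settings, models fail to generalize to real-world "in-the-wild" scenarios because of distribution shift. We propose Domain Adaptive 3D Pose Augmentation (DAPA), a data augmentation method that enhances a model's generalization ability in in-the-wild scenarios. DAPA combines the strength of methods based on synthetic datasets, by obtaining direct supervision from the synthesized meshes, with the use of ground-truth 2D keypoints from the target dataset. We show quantitatively that fine-tuning with DAPA effectively improves results on the 3DPW and AGORA benchmarks. We further demonstrate the utility of DAPA on a challenging dataset curated from videos of real-world parent-child interaction.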
3D models provide a common ground for different representations of human bodies. In turn, robust 2D estimation has proven to be a powerful tool to obtain 3D fits "in-the-wild". However, depending on the level of detail, it can be hard to impossible to acquire labeled data for training 2D estimators on large scale. We propose a hybrid approach to this problem: with an extended version of the recently introduced SMPLify method, we obtain high quality 3D body model fits for multiple human pose datasets. Human annotators solely sort good and bad fits. This procedure leads to an initial dataset, UP-3D, with rich annotations. With a comprehensive set of experiments, we show how this data can be used to train discriminative models that produce results with an unprecedented level of detail: our models predict 31 segments and 91 landmark locations on the body. Using the 91 landmark pose estimator, we present state-of-the-art results for 3D human pose and shape estimation using an order of magnitude less training data and without assumptions about gender or pose in the fitting procedure. We show that UP-3D can be enhanced with these improved fits to grow in quantity and quality, which makes the system deployable on large scale. The data, code and models are available for research purposes. (This work was performed while J. Romero and F. Bogo were with the MPI-IS; P. V. Gehler with the BCCN and MPI-IS.)
Figure 1: Given challenging in-the-wild videos, a recent state-of-the-art video-pose-estimation approach [31] (top) fails to produce accurate 3D body poses. To address this, we exploit a large-scale motion-capture dataset to train a motion discriminator using an adversarial approach. Our model (VIBE) (bottom) is able to produce realistic and accurate pose and shape, outperforming previous work on standard benchmarks.
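Human performance capture is a highly important computer vision problem with many applications in movie production and virtual/augmented reality. Many previous performance capture approaches either required expensive multi-view setups or did not recover dense space-time coherent geometry with frame-to-frame correspondences. We propose a novel deep learning approach for monocular dense human performance capture. Our method is trained in a weakly supervised manner based on multi-view supervision, completely removing the need for training data with 3D ground-truth annotations. The network architecture consists of two separate networks that disentangle the task into a pose estimation step and a non-rigid surface deformation step. Extensive qualitative and quantitative evaluations show that our approach outperforms the state of the art in terms of quality and robustness. This work is an extended version of DeepCap in which we provide more detailed explanations, comparisons, and results, as well as applications.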
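Regression-based methods can estimate body, hand, and even full-body models from monocular images by directly mapping raw pixels to model parameters in a feed-forward manner. However, minor deviations in the parameters may lead to noticeable misalignment between the estimated mesh and the input image, especially in the context of full-body mesh recovery. To address this issue, we propose a Pyramidal Mesh Alignment Feedback (PyMAF) loop in our regression network for well-aligned human mesh recovery, and extend it to PyMAF-X for the recovery of expressive full-body models. The core idea of PyMAF is to leverage a feature pyramid and rectify the predicted parameters explicitly based on the mesh-image alignment status. Specifically, given the currently predicted parameters, mesh-aligned evidence is extracted from finer-resolution features and fed back for parameter rectification. To enhance alignment perception, auxiliary dense supervision is employed to provide mesh-image correspondence guidance, while spatial alignment attention is introduced to make the network aware of the global context. When extending PyMAF to full-body mesh recovery, an adaptive integration strategy is proposed in PyMAF-X to adjust the elbow-twist rotations, which produces natural wrist poses while maintaining the well-aligned performance of the part-specific estimates. The efficacy of our approach is validated on several benchmark datasets for body and full-body mesh recovery, where PyMAF and PyMAF-X effectively improve mesh-image alignment and achieve new state-of-the-art results. The project page with code and video results can be found at https://www.liuyebin.com/pymaf-x.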
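Top-down methods dominate the field of 3D human pose and shape estimation because they are decoupled from human detection and allow researchers to focus on the core problem. However, cropping, their first step, discards location information from the very beginning, which makes them unable to accurately predict the global rotation in the original camera coordinate system. To address this problem, we propose to Carry Location Information in Full Frames (CLIFF) into this task. Specifically, we feed more holistic features to CLIFF by concatenating the cropped-image feature with its bounding-box information. We compute the 2D reprojection loss with a broader view of the full frame, using a projection process similar to the one by which the person is projected in the image. Fed and supervised by global-location-aware information, CLIFF directly predicts the global rotation along with more accurate articulated poses. In addition, we propose a pseudo-ground-truth annotator based on CLIFF, which provides high-quality 3D annotations for in-the-wild 2D datasets and offers crucial full supervision for regression-based methods. Extensive experiments on popular benchmarks show that CLIFF outperforms prior art and reaches first place on the AGORA leaderboard (the SMPL-Algorithms track). The code and data are available at https://github.com/huawei-noah/noah-research/tree/master/cliff.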
Model-based human pose estimation is currently approached through two different paradigms. Optimization-based methods fit a parametric body model to 2D observations in an iterative manner, leading to accurate image-model alignments, but are often slow and sensitive to the initialization. In contrast, regression-based methods, that use a deep network to directly estimate the model parameters from pixels, tend to provide reasonable, but not pixel accurate, results while requiring huge amounts of supervision. In this work, instead of investigating which approach is better, our key insight is that the two paradigms can form a strong collaboration. A reasonable, directly regressed estimate from the network can initialize the iterative optimization making the fitting faster and more accurate. Similarly, a pixel accurate fit from iterative optimization can act as strong supervision for the network. This is the core of our proposed approach SPIN (SMPL oPtimization IN the loop). The deep network initializes an iterative optimization routine that fits the body model to 2D joints within the training loop, and the fitted estimate is subsequently used to supervise the network. Our approach is self-improving by nature, since better network estimates can lead the optimization to better solutions, while more accurate optimization fits provide better supervision for the network. We demonstrate the effectiveness of our approach in different settings, where 3D ground truth is scarce, or not available, and we consistently outperform the state-of-the-art model-based pose estimation approaches by significant margins. The project website with videos, results, and code can be found at https://seas.upenn.edu/~nkolot/projects/spin.
Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
This work addresses the problem of estimating the full body 3D human pose and shape from a single color image. This is a task where iterative optimization-based solutions have typically prevailed, while Convolutional Networks (ConvNets) have suffered because of the lack of training data and their low resolution 3D predictions. Our work aims to bridge this gap and proposes an efficient and effective direct prediction method based on ConvNets. A central part of our approach is the incorporation of a parametric statistical body shape model (SMPL) within our end-to-end framework. This allows us to get very detailed 3D mesh results, while requiring estimation only of a small number of parameters, making it friendly for direct network prediction. Interestingly, we demonstrate that these parameters can be predicted reliably only from 2D keypoints and masks. These are typical outputs of generic 2D human analysis ConvNets, allowing us to relax the massive requirement that images with 3D shape ground truth are available for training. Simultaneously, by maintaining differentiability, at training time we generate the 3D mesh from the estimated parameters and optimize explicitly for the surface using a 3D per-vertex loss. Finally, a differentiable renderer is employed to project the 3D mesh to the image, which enables further refinement of the network, by optimizing for the consistency of the projection with 2D annotations (i.e., 2D keypoints or masks). The proposed approach outperforms previous baselines on this task and offers an attractive solution for direct prediction of 3D shape from a single color image.