Training accurate 3D human pose estimators requires large amounts of 3D ground-truth data, which is costly to collect. Due to this lack of 3D data, various weakly or self-supervised estimation methods have been proposed. However, in addition to 2D ground truth, these methods require additional supervision in various forms (e.g., unpaired 3D ground-truth data, a small subset of labels). To address these problems, we present EpipolarPose, a self-supervised learning method for 3D human pose estimation that does not require any 3D ground-truth data or camera extrinsics. During training, EpipolarPose estimates 2D poses from multi-view images and then utilizes epipolar geometry to obtain 3D poses and camera geometry, which are subsequently used to train a 3D pose estimator. We demonstrate the effectiveness of our approach on standard benchmark datasets, namely Human3.6M and MPI-INF-3DHP, where we set the new state of the art among weakly/self-supervised methods. Furthermore, we propose a new performance measure, Pose Structure Score (PSS), a scale-invariant, structure-aware metric for evaluating the structural plausibility of a pose with respect to its ground truth. Code and pretrained models are available at https://github.com/mkocabas/EpipolarPose
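The core geometric step this abstract describes, recovering 3D joints from multi-view 2D detections, boils down to triangulation. Below is a minimal numpy sketch of direct linear transform (DLT) triangulation for one joint from two calibrated views; note that EpipolarPose itself recovers the camera geometry from epipolar constraints rather than assuming it, so the given projection matrices here are an assumption of this sketch.

```python
import numpy as np

def triangulate_joint(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one joint from two views.

    P1, P2   : (3, 4) camera projection matrices (assumed known here)
    uv1, uv2 : (2,) pixel coordinates of the same joint in each view
    Returns the 3D point in the common world frame.
    """
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize
```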
Fig. 1. We recover the full global 3D skeleton pose in real time from a single RGB camera; wireless capture is even possible by streaming from a smartphone (left). This enables applications such as controlling a game character, embodied VR, sport motion analysis, and reconstruction of community video (right). Community videos (CC BY) courtesy of Real Madrid C.F. [2016] and RUSFENCING-TV [2017].

We present the first real-time method to capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera. Our method combines a new convolutional neural network (CNN) based pose regressor with kinematic skeleton fitting. Our novel fully convolutional pose formulation regresses 2D and 3D joint positions jointly in real time and does not require tightly cropped input frames. A real-time kinematic skeleton fitting method uses the CNN output to yield temporally stable 3D global pose reconstructions on the basis of a coherent kinematic skeleton. This makes our approach the first monocular RGB method usable in real-time applications such as 3D character control; thus far, the only monocular methods usable for such applications employed specialized RGB-D cameras. Our method's accuracy is quantitatively on par with the best offline 3D monocular RGB pose estimation methods. Our results are qualitatively comparable to, and sometimes better than, results from monocular RGB-D approaches such as the Kinect. However, we show that our approach is more broadly applicable than RGB-D solutions, i.e., it works for outdoor scenes, community videos, and low-quality commodity RGB cameras.
We present a CNN-based approach for multi-camera markerless motion capture of the human body. Unlike existing methods that first perform pose estimation on individual cameras and generate 3D models as post-processing, our approach makes use of 3D reasoning throughout a multi-stage approach. This novelty allows us to use provisional 3D models of human pose to rethink where the joints should be located in the image and to recover from past mistakes. Our principled refinement of 3D human poses lets us make use of image cues, even from images where we previously misdetected joints, to refine our estimates as part of an end-to-end approach. Finally, we demonstrate how the high-quality output of our multi-camera setup can be used as an additional training source to improve the accuracy of existing single-camera models.
We present the first method to capture the 3D total motion of a target person from a monocular view input. Given an image or a monocular video, our method reconstructs the motion from body, face, and fingers represented by a 3D deformable mesh model. We use an efficient representation called 3D Part Orientation Fields (POFs) to encode the 3D orientations of all body parts in the common 2D image space. POFs are predicted by a Fully Convolutional Network (FCN), along with joint confidence maps. To train our network, we collect a new 3D human motion dataset capturing diverse total body motion of 40 subjects in a multiview system. We leverage a 3D deformable human model to reconstruct total body pose from the CNN outputs by exploiting the pose and shape prior in the model. We also present a texture-based tracking method to obtain temporally coherent motion capture output. We perform thorough quantitative evaluations, including comparison with existing body-specific and hand-specific methods, and performance analysis on camera viewpoint and human pose changes. Finally, we demonstrate the results of our total body motion capture on various challenging in-the-wild videos. Our code and newly collected human motion dataset will be publicly shared.
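To make the POF idea more concrete, here is a hedged numpy sketch of how a limb's 3D orientation could be read out of a predicted Part Orientation Field by averaging the per-pixel 3D vectors sampled along the limb's 2D projection. The function name and sampling scheme are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def read_part_orientation(pof, joint_a_2d, joint_b_2d, n_samples=20):
    """Read a limb's 3D orientation from a Part Orientation Field.

    pof                    : (H, W, 3) predicted per-pixel 3D orientation
                             vectors for this limb
    joint_a_2d, joint_b_2d : (2,) 2D (x, y) endpoints of the limb in the image
    """
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = (1 - ts)[:, None] * joint_a_2d + ts[:, None] * joint_b_2d
    rows = np.clip(pts[:, 1].astype(int), 0, pof.shape[0] - 1)
    cols = np.clip(pts[:, 0].astype(int), 0, pof.shape[1] - 1)
    samples = pof[rows, cols]                 # (n_samples, 3)
    mean = samples.mean(axis=0)
    return mean / (np.linalg.norm(mean) + 1e-8)  # unit 3D direction
```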
This paper addresses the challenge of 3D human pose estimation from a single color image. Despite the general success of the end-to-end learning paradigm, top performing approaches employ a two-step solution consisting of a Convolutional Network (ConvNet) for 2D joint localization and a subsequent optimization step to recover 3D pose. In this paper, we identify the representation of 3D pose as a critical issue with current ConvNet approaches and make two important contributions towards validating the value of end-to-end learning for this task. First, we propose a fine discretization of the 3D space around the subject and train a ConvNet to predict per voxel likelihoods for each joint. This creates a natural representation for 3D pose and greatly improves performance over the direct regression of joint coordinates. Second, to further improve upon initial estimates, we employ a coarse-to-fine prediction scheme. This step addresses the large dimensionality increase and enables iterative refinement and repeated processing of the image features. The proposed approach outperforms all state-of-the-art methods on standard benchmarks achieving a relative error reduction greater than 30% on average. Additionally, we investigate using our volumetric representation in a related architecture which is suboptimal compared to our end-to-end approach, but is of practical interest, since it enables training when no image with corresponding 3D groundtruth is available, and allows us to present compelling results for in-the-wild images.
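The per-voxel likelihood representation is easy to state in code. The sketch below shows only the readout step, mapping the argmax voxel of one joint's predicted volume back to metric coordinates inside the discretized cube around the subject; the coarse-to-fine prediction scheme and training losses from the paper are not reproduced here.

```python
import numpy as np

def voxel_argmax(volume, grid_min, grid_max):
    """Read a 3D joint location out of a per-voxel likelihood volume.

    volume             : (D, H, W) likelihoods for one joint over a
                         discretized cube around the subject
    grid_min, grid_max : (3,) metric extent of the cube, given in the
                         same (z, y, x) axis order as the volume (assumed)
    """
    idx = np.unravel_index(np.argmax(volume), volume.shape)
    frac = np.array(idx, dtype=float) / (np.array(volume.shape) - 1)
    return grid_min + frac * (grid_max - grid_min)
```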
Monocular 3D human pose estimation methods based on convolutional neural networks usually require a large amount of training images with 3D pose annotations. While it is feasible to provide 2D joint annotations for large corpora of in-the-wild images of humans, providing accurate 3D annotations for in-the-wild corpora is hardly feasible in practice. Most existing 3D-labeled datasets are either synthetically created or captured in indoor studios. 3D pose estimation algorithms trained on such data often have limited ability to generalize to the diversity of real-world scenes. We therefore propose a new deep-learning-based method for monocular 3D human pose estimation that exhibits high accuracy and generalizes better to in-the-wild scenes. Its network architecture comprises a new disentangled hidden-space encoding of explicit 2D and 3D features, and it uses supervision by a new learned projection model from the predicted 3D pose. Our algorithm can be jointly trained on image data with 3D labels and image data with only 2D labels. It achieves state-of-the-art accuracy on challenging in-the-wild data.
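A learned projection model is what lets 2D-only labels supervise 3D predictions. As a stand-in for the paper's learned projection, the following PyTorch sketch uses a weak-perspective camera (predicted scale and image translation) to project 3D joints and penalize disagreement with 2D annotations; the weak-perspective choice is an assumption of this sketch, not the paper's model.

```python
import torch

def reprojection_loss(joints3d, joints2d, scale, trans):
    """Penalize disagreement between predicted 3D joints and 2D labels
    under a weak-perspective camera (scale s, image translation t).

    joints3d : (J, 3) predicted root-relative 3D joints
    joints2d : (J, 2) 2D annotations
    scale    : () predicted scalar
    trans    : (2,) predicted image-plane translation
    """
    proj = scale * joints3d[:, :2] + trans  # drop depth, then scale + shift
    return torch.mean(torch.norm(proj - joints2d, dim=-1))
```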
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses, and finally back-project to the input 2D keypoints. In the supervised setting, our fully convolutional model outperforms the previous best result from the literature by 6 mm mean per-joint position error on Human3.6M, corresponding to an error reduction of 11%, and the model also shows significant improvements on HumanEva-I. Moreover, experiments with back-projection show that it comfortably outperforms previous state-of-the-art results in semi-supervised settings where labeled data is scarce. Code and models are available at https://github.com/facebookresearch/VideoPose3D
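The back-projection idea can be summarized in a few lines: 3D poses estimated from unlabeled video are projected back into the image and compared against the 2D keypoints the model started from. The PyTorch sketch below assumes known pinhole intrinsics and omits the trajectory and bone-length terms the full method also uses.

```python
import torch

def backprojection_loss(pred3d, input2d, f, c):
    """Semi-supervised loss in the spirit of back-projection: 3D poses
    estimated from unlabeled video are projected back to the image and
    compared against the 2D keypoints the model started from.

    pred3d  : (T, J, 3) predicted camera-space joints over T frames
    input2d : (T, J, 2) detected 2D keypoints
    f, c    : focal length (scalar) and principal point ((2,) tensor),
              assumed known intrinsics
    """
    z = pred3d[..., 2:].clamp(min=1e-3)  # keep depths positive
    proj = f * pred3d[..., :2] / z + c   # pinhole projection
    return torch.mean(torch.norm(proj - input2d, dim=-1))
```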
Action recognition and human pose estimation are closely related, but both problems are generally handled as distinct tasks in the literature. In this work, we propose a multitask framework for joint 2D and 3D pose estimation from still images and human action recognition from video sequences. We show that a single architecture can be used to solve the two problems in an efficient way and still achieve state-of-the-art results. Additionally, we demonstrate that end-to-end optimization leads to significantly higher accuracy than separated learning. The proposed architecture can be trained with data from different categories simultaneously in a seamless way. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU) demonstrate the effectiveness of our method on the targeted tasks.
We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation encoding skeletal joint positions while simultaneously learning a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of $4\times$, whilst recovering a 3D estimate of joint positions with accuracy equal to or greater than the state of the art. Inference runs in real time (25 fps) and has potential for passive human behaviour monitoring, where high-fidelity estimation of human body shape and pose is required.
Human shape estimation is an important task for video editing, animation, and the fashion industry. However, predicting 3D human shape from natural images is highly challenging due to factors such as variation in human bodies, clothing, and viewpoint. Existing methods addressing this problem typically attempt to fit parametric body models with certain priors on pose and shape. In this work, we argue for an alternative representation and propose BodyNet, a neural network for direct inference of volumetric body shape from a single image. BodyNet is an end-to-end trainable network that benefits from (i) a volumetric 3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose. As demonstrated by our experiments, each of them improves performance. To evaluate the method, we fit the SMPL model to our network output and show state-of-the-art results on the SURREAL and Unite the People datasets, outperforming recent approaches. Besides achieving state-of-the-art performance, our method also enables volumetric body-part segmentation.
Following the success of deep convolutional networks, state-of-the-art methods for 3D human pose estimation have focused on deep end-to-end systems that predict 3D joint locations given raw image pixels. Despite their excellent performance, it is often not easy to understand whether their remaining error stems from limited 2D pose (visual) understanding, or from a failure to map 2D poses into 3-dimensional positions. With the goal of understanding the sources of error, we set out to build a system that predicts 3D positions given 2D joint locations. Much to our surprise, we found that, with current techniques, "lifting" ground-truth 2D joint locations to 3D space is a task that can be solved with a remarkably low error rate: a relatively simple deep feed-forward network outperforms the best reported result by about 30% on Human3.6M, currently the largest publicly available 3D pose estimation benchmark. Furthermore, training our system on the outputs of an existing 2D detector (i.e., using images as input) yields state-of-the-art results; this includes an array of systems trained end-to-end specifically for this task. Our results indicate that a large portion of the error of modern deep 3D pose estimation systems stems from their visual analysis, and suggest directions to further advance the state of the art in 3D human pose estimation.
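The "relatively simple deep feed-forward network" the abstract refers to is essentially a residual MLP over flattened 2D joint coordinates. The PyTorch sketch below follows that description (linear layers with batch norm, ReLU, dropout, and residual connections); widths and block counts are plausible assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LiftingBlock(nn.Module):
    """One residual block of a simple 2D-to-3D lifting network."""
    def __init__(self, width=1024, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(width, width), nn.BatchNorm1d(width),
            nn.ReLU(), nn.Dropout(p),
            nn.Linear(width, width), nn.BatchNorm1d(width),
            nn.ReLU(), nn.Dropout(p),
        )

    def forward(self, x):
        return x + self.net(x)  # residual connection

class Lifter(nn.Module):
    """Regress root-relative 3D joints from flattened 2D joints."""
    def __init__(self, n_joints=17, width=1024, n_blocks=2):
        super().__init__()
        self.inp = nn.Linear(2 * n_joints, width)
        self.blocks = nn.Sequential(*[LiftingBlock(width) for _ in range(n_blocks)])
        self.out = nn.Linear(width, 3 * n_joints)

    def forward(self, joints2d):  # joints2d: (B, J, 2)
        x = self.inp(joints2d.flatten(1))
        return self.out(self.blocks(x)).view(-1, joints2d.shape[1], 3)
```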
Automatically determining three-dimensional human pose from monocular RGB image data is a challenging problem. The two-dimensional nature of the input results in intrinsic ambiguities which make inferring depth particularly difficult. Recently, researchers have demonstrated that the flexible statistical modelling capabilities of deep neural networks are sufficient to make such inferences with reasonable accuracy. However, many of these models use coordinate output techniques which are memory-intensive, not differentiable, and/or do not spatially generalise well. We propose improvements to 3D coordinate prediction which avoid the aforementioned undesirable traits by predicting 2D marginal heatmaps under an augmented soft-argmax scheme. Our resulting model, MargiPose, produces visually coherent heatmaps whilst maintaining differentiability. We are also able to achieve state-of-the-art accuracy on publicly available 3D human pose estimation data.
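The soft-argmax readout that keeps coordinate prediction differentiable can be sketched directly. The function below computes expected coordinates from 2D heatmaps by marginalizing softmax-normalized maps over each axis; it illustrates plain soft-argmax over marginal heatmaps, not the paper's specific augmented scheme.

```python
import torch

def soft_argmax_2d(heatmaps):
    """Differentiable coordinate readout from 2D heatmaps via soft-argmax.

    heatmaps : (B, J, H, W) unnormalized scores
    Returns (B, J, 2) expected (x, y) in normalized [0, 1] coordinates.
    """
    B, J, H, W = heatmaps.shape
    probs = torch.softmax(heatmaps.flatten(2), dim=-1).view(B, J, H, W)
    ys = torch.linspace(0, 1, H, device=heatmaps.device)
    xs = torch.linspace(0, 1, W, device=heatmaps.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)  # marginalize over rows, expect x
    y = (probs.sum(dim=3) * ys).sum(dim=-1)  # marginalize over cols, expect y
    return torch.stack([x, y], dim=-1)
```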
We propose a CNN-based approach for 3D human body pose estimation from single RGB images that addresses the issue of limited generalizability of models trained solely on the starkly limited publicly available 3D pose data. Using only the existing 3D pose data and 2D pose data, we show state-of-the-art performance on established benchmarks through transfer of learned features, while also generalizing to in-the-wild scenes. We further introduce a new training set for human body pose estimation from monocular images of real humans that has the ground truth captured with a multi-camera marker-less motion capture system. It complements existing corpora with greater diversity in pose, human appearance, clothing, occlusion, and viewpoints, and enables an increased scope of augmentation. We also contribute a new benchmark that covers outdoor and indoor scenes, and demonstrate that our 3D pose dataset shows better in-the-wild performance than existing annotated data, which is further improved in conjunction with transfer learning from 2D pose data. All in all, we argue that the use of transfer learning of representations in tandem with algorithmic and data contributions is crucial for general 3D body pose estimation.
Estimating the 3D pose of a hand is an essential part of human-computer interaction. Estimating 3D pose using depth or multi-view sensors has become easier with recent advances in computer vision, however, regressing pose from a single RGB image is much less straightforward. The main difficulty arises from the fact that 3D pose requires some form of depth estimates, which are ambiguous given only an RGB image. In this paper we propose a new method for 3D hand pose estimation from a monocular image through a novel 2.5D pose representation. Our new representation estimates pose up to a scaling factor, which can be estimated additionally if a prior of the hand size is given. We implicitly learn depth maps and heatmap distributions with a novel CNN architecture. Our system achieves the state-of-the-art estimation of 2D and 3D hand pose on several challenging datasets in presence of severe occlusions.
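Recovering the missing scale from a 2.5D representation reduces to solving a quadratic: if each joint's depth is the unknown root depth plus a predicted relative offset, requiring one bone to have a known length pins the root depth down. The numpy sketch below follows that style of derivation under stated assumptions; the exact parameterization in the paper may differ, and degenerate bones (both endpoints projecting to the same pixel) are not handled.

```python
import numpy as np

def recover_root_depth(xn, xm, dn, dm, bone_len=1.0):
    """Solve for the absolute root depth z in a 2.5D representation.

    xn, xm   : (2,) normalized image coords (intrinsics removed) of a
               bone's two endpoint joints
    dn, dm   : predicted root-relative depths of the same joints
    bone_len : assumed metric length of this bone (the scale prior)

    Each 3D joint is (x*(z+d), y*(z+d), z+d), so the constraint
    ||P_n - P_m|| = bone_len is quadratic in the root depth z.
    """
    dxy = xn - xm            # difference of ray directions
    e = xn * dn - xm * dm    # depth-weighted cross terms
    a = dxy @ dxy
    b = 2.0 * (dxy @ e)
    c = e @ e + (dn - dm) ** 2 - bone_len ** 2
    disc = max(b * b - 4.0 * a * c, 0.0)
    # take the larger root, i.e. the solution in front of the camera
    return (-b + np.sqrt(disc)) / (2.0 * a)
```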
We propose a new single-shot method for multi-person 3D pose estimation in general scenes from a monocular RGB camera. Our approach uses novel occlusion-robust pose-maps (ORPM), which enable full-body pose inference even under strong partial occlusions by other people and objects in the scene. ORPM outputs a fixed number of maps which encode the 3D joint locations of all people in the scene. Body-part associations allow us to infer 3D poses for an arbitrary number of people without explicit bounding box prediction. To train our approach, we introduce MuCo-3DHP, the first large-scale training dataset showing real images of sophisticated multi-person interactions and occlusions. We synthesize a large corpus of multi-person images by compositing images of individual people (with ground truth from multi-view performance capture). We evaluate our method on our new challenging 3D-annotated multi-person test set, MuPoTs-3D, where we achieve state-of-the-art performance. To further stimulate research in multi-person 3D pose estimation, we will make our new dataset and associated code publicly available for research purposes.
In this paper we present our winning entry of the 2018 ECCV PoseTrack Challenge on 3D human pose estimation. Using a fully convolutional backbone architecture, we obtain volumetric heatmaps per body joint, which we convert to coordinates using soft-argmax. Absolute person-center depth is estimated by a 1D heatmap prediction head. The coordinates are back-projected to 3D camera space, where we minimize the L1 loss. A key ingredient for achieving our strong results is training-data augmentation with randomly placed occluders from the Pascal VOC dataset. In addition to reaching first place in the Challenge, our method also surpasses the state of the art on the full Human3.6M benchmark among methods that use no additional pose datasets in training. Code for applying synthetic occlusions is available at https://github.com/isarandi/synthetic-occlusion
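The occlusion augmentation the abstract credits is simple to reproduce in spirit: alpha-blend a randomly placed segmented object cut-out over each training image. A minimal numpy sketch, assuming the occluder is an RGBA crop no larger than the image (see the authors' repository above for their actual implementation):

```python
import numpy as np

def augment_with_occluder(image, occluder, rng):
    """Paste a random occluder (e.g. a segmented Pascal VOC object) onto a
    training image, in the spirit of synthetic-occlusion augmentation.

    image    : (H, W, 3) uint8 training image (modified in place)
    occluder : (h, w, 4) uint8 RGBA cut-out, assumed smaller than the image
    rng      : np.random.Generator
    """
    H, W = image.shape[:2]
    h, w = occluder.shape[:2]
    y = rng.integers(0, max(H - h, 1))
    x = rng.integers(0, max(W - w, 1))
    alpha = occluder[..., 3:4].astype(float) / 255.0
    region = image[y:y + h, x:x + w].astype(float)
    blended = alpha * occluder[..., :3] + (1.0 - alpha) * region
    image[y:y + h, x:x + w] = blended.astype(np.uint8)
    return image
```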
Fig. 1. Our real-time mobile 3D pose estimation system is based on a single monocular fisheye camera attached to a standard baseball cap. The setup is lightweight and enables 3D pose estimation in everyday situations.

We propose the first real-time system for the egocentric estimation of 3D human body pose in a wide range of unconstrained everyday activities. This setting has a unique set of challenges, such as mobility of the hardware setup and robustness to long capture sessions with fast recovery from tracking failures. We tackle these challenges based on a novel lightweight setup that converts a standard baseball cap into a device for high-quality pose estimation based on a single cap-mounted fisheye camera. From the captured egocentric live stream, our CNN-based 3D pose estimation approach runs at 60 Hz on a consumer-level GPU. In addition to the lightweight hardware setup, our other main contributions are: 1) a large ground-truth training corpus of top-down fisheye images, and 2) a disentangled 3D pose estimation approach that takes the unique properties of the egocentric viewpoint into account. As shown by our evaluation, we achieve lower 3D joint error as well as better 2D overlay than the existing baselines.
Recent advances with Convolutional Networks (ConvNets) have shifted the bottleneck for many computer vision tasks to annotated data collection. In this paper, we present a geometry-driven approach to automatically collect annotations for human pose prediction tasks. Starting from a generic ConvNet for 2D human pose, and assuming a multi-view setup, we describe an automatic way to collect accurate 3D human pose annotations. We capitalize on constraints offered by the 3D geometry of the camera setup and the 3D structure of the human body to probabilistically combine per view 2D ConvNet predictions into a globally optimal 3D pose. This 3D pose is used as the basis for harvesting annotations. The benefit of the annotations produced automatically with our approach is demonstrated in two challenging settings: (i) fine-tuning a generic ConvNet-based 2D pose predictor to capture the discriminative aspects of a subject's appearance (i.e.,"personalization"), and (ii) training a ConvNet from scratch for single view 3D human pose prediction without leveraging 3D pose groundtruth. The proposed multi-view pose estimator achieves state-of-the-art results on standard benchmarks, demonstrating the effectiveness of our method in exploiting the available multi-view information.
A common approach in 3D human pose estimation is to predict the body joint coordinates relative to the hip. This works well for a single person but is insufficient in the case of multiple interacting people. Methods that predict absolute coordinates first estimate a root-relative pose and then compute the translation via a secondary optimization task. We propose a neural network that predicts joints in a camera-centered coordinate system instead of a root-relative one. Unlike previous methods, our network works in a single step without any post-processing. Our network outperforms previous methods on the MuPoTS-3D dataset and achieves state-of-the-art results.
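Predicting camera-centered joints implies that 2D detections plus absolute depths can be lifted to 3D using the camera intrinsics alone. A small numpy sketch of that standard back-projection, assuming known intrinsics K (this is textbook geometry, not the paper's specific network head):

```python
import numpy as np

def backproject(uv, z, K):
    """Lift pixel detections with predicted depths to camera-centered 3D.

    uv : (J, 2) pixel coordinates
    z  : (J,) predicted absolute depths
    K  : (3, 3) camera intrinsics
    """
    ones = np.ones((uv.shape[0], 1))
    rays = np.hstack([uv, ones]) @ np.linalg.inv(K).T  # ray direction per joint
    return rays * z[:, None]                           # scale rays by depth
```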
Our ability to train end-to-end systems for 3D human pose estimation from single images is currently constrained by the limited availability of 3D annotations for natural images. Most datasets are captured using Motion Capture (MoCap) systems in a studio setting and it is difficult to reach the variability of 2D human pose datasets, like MPII or LSP. To alleviate the need for accurate 3D ground truth, we propose to use a weaker supervision signal provided by the ordinal depths of human joints. This information can be acquired by human annotators for a wide range of images and poses. We showcase the effectiveness and flexibility of training Convolutional Networks (ConvNets) with these ordinal relations in different settings, always achieving competitive performance with ConvNets trained with accurate 3D joint coordinates. Additionally, to demonstrate the potential of the approach, we augment the popular LSP and MPII datasets with ordinal depth annotations. This extension allows us to present quantitative and qualitative evaluation in non-studio conditions. Simultaneously, these ordinal annotations can be easily incorporated in the training procedure of typical ConvNets for 3D human pose. Through this inclusion we achieve new state-of-the-art performance for the relevant benchmarks and validate the effectiveness of ordinal depth supervision for 3D human pose.
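An ordinal depth annotation for a pair of joints only says which one is closer (or that the two are about equally deep), which naturally fits a ranking loss. The PyTorch sketch below follows the usual form of such losses (a soft ranking penalty for ordered pairs, a squared penalty for ties); it is an illustration consistent with the abstract, not the paper's verbatim objective.

```python
import torch
import torch.nn.functional as F

def ordinal_depth_loss(pred_z, pairs, relations):
    """Ranking loss over ordinal depth annotations.

    pred_z    : (J,) predicted joint depths (camera z; smaller = closer)
    pairs     : (P, 2) long tensor of joint index pairs (i, j)
    relations : (P,) float tensor: +1 if joint i is annotated closer than j,
                -1 if farther, 0 if at roughly the same depth
    """
    zi = pred_z[pairs[:, 0]]
    zj = pred_z[pairs[:, 1]]
    diff = zi - zj
    ordered = relations != 0
    # softplus(r * (zi - zj)) is small when the predicted order matches r
    rank_loss = F.softplus(relations[ordered] * diff[ordered]).sum()
    tie_loss = (diff[~ordered] ** 2).sum()  # pull "same depth" pairs together
    return rank_loss + tie_loss
```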