Recent state-of-the-art performance on human-body pose estimation has been achieved with Deep Convolutional Networks (ConvNets). Traditional ConvNet architectures include pooling and sub-sampling layers which reduce computational requirements, introduce invariance and prevent over-training. These benefits of pooling come at the cost of reduced localization accuracy. We introduce a novel architecture which includes an efficient 'position refinement' model that is trained to estimate the joint offset location within a small region of the image. This refinement model is jointly trained in cascade with a state-of-the-art ConvNet model [21] to achieve improved accuracy in human joint location estimation. We show that the variance of our detector approaches the variance of human annotations on the FLIC [20] dataset and outperforms all existing approaches on the MPII-human-pose dataset [1].
translated by 谷歌翻译
Figure 1. Besides extreme variability in articulations, many of the joints are barely visible. We can guess the location of the right arm in the left image only because we see the rest of the pose and anticipate the motion or activity of the person. Similarly, the left body half of the person on the right is not visible at all. These are examples of the need for holistic reasoning. We believe that DNNs can naturally provide such type of reasoning.
translated by 谷歌翻译
This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
translated by 谷歌翻译
State-of-the-art human pose estimation methods are based on heat map representation. In spite of the good performance, the representation has a few issues in nature, such as non-differentiable postprocessing and quantization error. This work shows that a simple integral operation relates and unifies the heat map representation and joint regression, thus avoiding the above issues. It is differentiable, efficient, and compatible with any heat map based methods. Its effectiveness is convincingly validated via comprehensive ablation experiments under various settings, specifically on 3D pose estimation, for the first time.
translated by 谷歌翻译
Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles which combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy and optimization function, etc. In this paper, we provide a review on deep learning based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely Convolutional Neural Network (CNN). Then we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network based learning systems.
translated by 谷歌翻译
Accurate whole-body multi-person pose estimation and tracking is an important yet challenging topic in computer vision. To capture the subtle actions of humans for complex behavior analysis, whole-body pose estimation including the face, body, hand and foot is essential over conventional body-only pose estimation. In this paper, we present AlphaPose, a system that can perform accurate whole-body pose estimation and tracking jointly while running in realtime. To this end, we propose several new techniques: Symmetric Integral Keypoint Regression (SIKR) for fast and fine localization, Parametric Pose Non-Maximum-Suppression (P-NMS) for eliminating redundant human detections and Pose Aware Identity Embedding for jointly pose estimation and tracking. During training, we resort to Part-Guided Proposal Generator (PGPG) and multi-domain knowledge distillation to further improve the accuracy. Our method is able to localize whole-body keypoints accurately and tracks humans simultaneously given inaccurate bounding boxes and redundant detections. We show a significant improvement over current state-of-the-art methods in both speed and accuracy on COCO-wholebody, COCO, PoseTrack, and our proposed Halpe-FullBody pose estimation dataset. Our model, source codes and dataset are made publicly available at https://github.com/MVIG-SJTU/AlphaPose.
translated by 谷歌翻译
我们提出Bapose,一种新颖的自下而上的方法,实现了多人姿态估计的最先进结果。我们的最终培训框架利用了解开的多尺度瀑布架构,并将自适应卷曲融合在拥挤的场景中更准确地推断出闭塞的关键点。由BAPOSE中的解开瀑布模块获得的多尺度表示,利用级联架构中进行逐行滤波的效率,同时保持与空间金字塔配置的多尺度视图相当。我们对挑战性的Coco和Crowdose数据集的结果表明,Bapose是多人姿态估计的高效且稳健的框架,实现了最先进的准确性的显着改善。
translated by 谷歌翻译
We propose a CNN-based approach for 3D human body pose estimation from single RGB images that addresses the issue of limited generalizability of models trained solely on the starkly limited publicly available 3D pose data. Using only the existing 3D pose data and 2D pose data, we show state-of-the-art performance on established benchmarks through transfer of learned features, while also generalizing to in-the-wild scenes. We further introduce a new training set for human body pose estimation from monocular images of real humans that has the ground truth captured with a multi-camera marker-less motion capture system. It complements existing corpora with greater diversity in pose, human appearance, clothing, occlusion, and viewpoints, and enables an increased scope of augmentation. We also contribute a new benchmark that covers outdoor and indoor scenes, and demonstrate that our 3D pose dataset shows better in-the-wild performance than existing annotated data, which is further improved in conjunction with transfer learning from 2D pose data. All in all, we argue that the use of transfer learning of representations in tandem with algorithmic and data contributions is crucial for general 3D body pose estimation.
translated by 谷歌翻译
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learned simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classifications tasks. In post-competition work, we establish a new state of the art for the detection task. Finally, we release a feature extractor from our best model called OverFeat.
translated by 谷歌翻译
大多数实时人类姿势估计方法都基于检测接头位置。使用检测到的关节位置,可以计算偏差和肢体的俯仰。然而,由于这种旋转轴仍然不观察,因此不能计算沿着肢体沿着肢体至关重要的曲折,这对于诸如体育分析和计算机动画至关重要。在本文中,我们引入了方向关键点,一种用于估计骨骼关节的全位置和旋转的新方法,仅使用单帧RGB图像。灵感来自Motion-Capture Systems如何使用一组点标记来估计全骨骼旋转,我们的方法使用虚拟标记来生成足够的信息,以便准确地推断使用简单的后处理。旋转预测改善了接头角度最佳报告的平均误差48%,并且在15个骨骼旋转中实现了93%的精度。该方法还通过MPJPE在原理数据集上测量,通过MPJPE测量,该方法还改善了当前的最新结果14%,并概括为野外数据集。
translated by 谷歌翻译
Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation. The contribution of this paper is to implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation. We achieve this by designing a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference. Our approach addresses the characteristic difficulty of vanishing gradients during training by providing a natural learning objective function that enforces intermediate supervision, thereby replenishing back-propagated gradients and conditioning the learning procedure. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks including the MPII, LSP, and FLIC datasets.
translated by 谷歌翻译
在本文中,我们考虑了同时找到和从单个2D图像中恢复多手的具有挑战性的任务。先前的研究要么关注单手重建,要么以多阶段的方式解决此问题。此外,常规的两阶段管道首先检测到手部区域,然后估计每个裁剪贴片的3D手姿势。为了减少预处理和特征提取中的计算冗余,我们提出了一条简洁但有效的单阶段管道。具体而言,我们为多手重建设计了多头自动编码器结构,每个HEAD网络分别共享相同的功能图并分别输出手动中心,姿势和纹理。此外,我们采用了一个弱监督的计划来减轻昂贵的3D现实世界数据注释的负担。为此,我们提出了一系列通过舞台训练方案优化的损失,其中根据公开可用的单手数据集生成具有2D注释的多手数据集。为了进一步提高弱监督模型的准确性,我们在单手和多个手设置中采用了几个功能一致性约束。具体而言,从本地功能估算的每只手的关键点应与全局功能预测的重新投影点一致。在包括Freihand,HO3D,Interhand 2.6M和RHD在内的公共基准测试的广泛实验表明,我们的方法在弱监督和完全监督的举止中优于基于最先进的模型方法。代码和模型可在{\ url {https://github.com/zijinxuxu/smhr}}上获得。
translated by 谷歌翻译
Following the success of deep convolutional networks, state-of-the-art methods for 3d human pose estimation have focused on deep end-to-end systems that predict 3d joint locations given raw image pixels. Despite their excellent performance, it is often not easy to understand whether their remaining error stems from a limited 2d pose (visual) understanding, or from a failure to map 2d poses into 3dimensional positions.With the goal of understanding these sources of error, we set out to build a system that given 2d joint locations predicts 3d positions. Much to our surprise, we have found that, with current technology, "lifting" ground truth 2d joint locations to 3d space is a task that can be solved with a remarkably low error rate: a relatively simple deep feedforward network outperforms the best reported result by about 30% on Human3.6M, the largest publicly available 3d pose estimation benchmark. Furthermore, training our system on the output of an off-the-shelf state-of-the-art 2d detector (i.e., using images as input) yields state of the art results -this includes an array of systems that have been trained end-to-end specifically for this task. Our results indicate that a large portion of the error of modern deep 3d pose estimation systems stems from their visual analysis, and suggests directions to further advance the state of the art in 3d human pose estimation.
translated by 谷歌翻译
在自然谈话和互动中,我们的手经常重叠或彼此接触。由于双手的均匀外观,这使得估计从图像互动的3D姿势困难。在本文中,我们证明了自我相似性,以及将像素观测分配给各自的手和它们的部分的产生的歧义是最终3D姿势错误的主要原因。通过这种洞察力,我们提出了数字,一种估计来自单眼图像的两个交互手的3D姿势的新方法。该方法包括两个交织分支,该分支处理输入图像到每个像素语义部分分段掩模和视觉特征卷。与事先工作相比,我们不会从姿势估计阶段解耦分割,而是直接利用每个像素概率直接在下游姿势估计任务中。为此,零件概率与视觉功能合并并通过全卷积层处理。我们通过实验表明,该方法在Interhand2.6M数据集中实现了新的最先进的性能。我们提供详细的消融研究,以证明我们方法的功效,并提供对像素所有权建模如何影响3D手姿势估计的见解。
translated by 谷歌翻译
Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -and building on other recent work for finding simple network structures -we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the "deconvolution approach" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.
translated by 谷歌翻译
Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. We show that the use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-art performance on several benchmarks, and for a number of classes of transformations.
translated by 谷歌翻译
In this paper we address three different computer vision tasks using a single multiscale convolutional network architecture: depth prediction, surface normal estimation, and semantic labeling. The network that we develop is able to adapt naturally to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.
translated by 谷歌翻译
本文调查了2D全身人类姿势估计的任务,该任务旨在将整个人体(包括身体,脚,脸部和手)局部定位在整个人体上。我们提出了一种称为Zoomnet的单网络方法,以考虑到完整人体的层次结构,并解决不同身体部位的规模变化。我们进一步提出了一个称为Zoomnas的神经体系结构搜索框架,以促进全身姿势估计的准确性和效率。Zoomnas共同搜索模型体系结构和不同子模块之间的连接,并自动为搜索的子模块分配计算复杂性。为了训练和评估Zoomnas,我们介绍了第一个大型2D人类全身数据集,即可可叶全体V1.0,它注释了133个用于野外图像的关键点。广泛的实验证明了Zoomnas的有效性和可可叶v1.0的重要性。
translated by 谷歌翻译
Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as a feature representation. However, the information in this layer may be too coarse spatially to allow precise localization. On the contrary, earlier layers may be precise in localization but will not capture semantics. To get the best of both worlds, we define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. Using hypercolumns as pixel descriptors, we show results on three fine-grained localization tasks: simultaneous detection and segmentation [22], where we improve state-of-the-art from 49.7 mean AP r [22] to 60.0, keypoint localization, where we get a 3.3 point boost over [20], and part labeling, where we show a 6.6 point gain over a strong baseline.
translated by 谷歌翻译
尽管最近的进步,但是,尽管最近的进展,但是从单个图像中的人类姿势的全3D估计仍然是一个具有挑战性的任务。在本文中,我们探讨了关于场景几何体的强先前信息的假设可用于提高姿态估计精度。为了主弱地解决这个问题,我们已经组装了一种新的$ \ textbf {几何姿势提供} $ DataSet,包括与各种丰富的3D环境交互的人员的多视图图像。我们利用商业运动捕获系统来收集场景本身的姿势和构造精确的几何3D CAD模型的金标估计。要将对现有框架的现有框架注入图像的现有框架,我们介绍了一种新颖的,基于视图的场景几何形状,一个$ \ textbf {多层深度图} $,它采用了多次射线跟踪到简明地编码沿着每种相机视图光线方向的多个表面入口和退出点。我们提出了两种不同的机制,用于集成多层深度信息姿势估计:输入作为升降2D姿势的编码光线特征,其次是促进学习模型以支持几何一致姿态估计的可差异损失。我们通过实验展示这些技术可以提高3D姿势估计的准确性,特别是在遮挡和复杂场景几何形状的存在中。
translated by 谷歌翻译