State-of-the-art human pose estimation methods are based on the heat map representation. Despite its good performance, the representation has a few inherent drawbacks, such as non-differentiable post-processing and quantization error. This work shows that a simple integral operation relates and unifies the heat map representation and joint regression, thereby avoiding the above issues. It is differentiable, efficient, and compatible with any heat map based method. Its effectiveness is convincingly validated via comprehensive ablation experiments under various settings, notably, for the first time, on 3D pose estimation.
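The integral operation described here is essentially a soft-argmax: the heat map is normalized into a probability distribution and the joint location is read out as the expected pixel coordinate. A minimal PyTorch sketch under assumed tensor shapes (not the authors' exact code):

```python
import torch
import torch.nn.functional as F

def integral_regression(heatmaps):
    """heatmaps: (B, K, H, W) raw network outputs for K joints.
    Returns (B, K, 2) continuous (x, y) coordinates in pixel units."""
    b, k, h, w = heatmaps.shape
    # Normalize each heat map into a probability distribution.
    probs = F.softmax(heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
    # Expected value over the pixel grid acts as a differentiable "argmax".
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)  # marginal over rows, then E[x]
    y = (probs.sum(dim=3) * ys).sum(dim=-1)  # marginal over cols, then E[y]
    return torch.stack([x, y], dim=-1)
```

Because every step is differentiable, a coordinate loss can be backpropagated through the heat map, which is what removes the non-differentiable argmax post-processing and the quantization error it introduces.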
There has been significant progress on pose estimation and increasing interest in pose tracking in recent years. At the same time, overall algorithm and system complexity has increased as well, making algorithm analysis and comparison more difficult. This work provides simple and effective baseline methods. They are helpful for inspiring and evaluating new ideas for the field. State-of-the-art results are achieved on challenging benchmarks. The code will be available at https://github.com/leoxiaobin/pose.pytorch.
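Baselines in this line of work typically attach a small upsampling head to a standard backbone and predict one heat map per joint. The sketch below shows one plausible shape of such a head; layer counts and channel widths are illustrative assumptions, not the paper's exact configuration:

```python
import torch.nn as nn

def make_simple_head(in_channels=2048, num_joints=17, num_deconv=3, width=256):
    """A few transposed-conv layers on backbone features, then a 1x1 conv
    that predicts one heat map per joint."""
    layers = []
    c = in_channels
    for _ in range(num_deconv):
        layers += [
            nn.ConvTranspose2d(c, width, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
        ]
        c = width
    layers.append(nn.Conv2d(width, num_joints, kernel_size=1))
    return nn.Sequential(*layers)
```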
We propose a direct, regression-based approach to 2D human pose estimation from a single image. We formulate the problem as a sequence prediction task, which we solve with a Transformer network. The network directly learns a regression mapping from images to keypoint coordinates, without resorting to intermediate representations such as heat maps. This approach avoids much of the complexity associated with heat map based methods. To overcome the feature misalignment issue of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features most relevant to the target keypoints, considerably improving accuracy. Importantly, our framework is end-to-end differentiable and naturally learns to exploit the dependencies between keypoints. Experiments on the two major pose estimation datasets, MS-COCO and MPII, show that our method significantly improves upon the state of the art in regression-based pose estimation. More notably, ours is the first regression-based method to perform favorably against the best heat map based pose estimation methods.
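A rough sketch of the direct-regression pattern this abstract describes: learned per-keypoint queries cross-attend to flattened image features in a transformer decoder, and a linear layer maps each query to an (x, y) coordinate. All module sizes and the coordinate normalization are illustrative assumptions:

```python
import torch
import torch.nn as nn

class KeypointRegressor(nn.Module):
    def __init__(self, dim=256, num_keypoints=17, heads=8, layers=6):
        super().__init__()
        self.queries = nn.Embedding(num_keypoints, dim)  # one query per joint
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.to_xy = nn.Linear(dim, 2)  # regress normalized coordinates

    def forward(self, feats):             # feats: (B, HW, dim) flattened CNN features
        b = feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        out = self.decoder(q, feats)      # cross-attention to image features
        return self.to_xy(out).sigmoid()  # (B, K, 2) in [0, 1]
```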
Accurate whole-body multi-person pose estimation and tracking is an important yet challenging topic in computer vision. To capture the subtle actions of humans for complex behavior analysis, whole-body pose estimation including the face, body, hand and foot is essential over conventional body-only pose estimation. In this paper, we present AlphaPose, a system that can perform accurate whole-body pose estimation and tracking jointly while running in real time. To this end, we propose several new techniques: Symmetric Integral Keypoint Regression (SIKR) for fast and fine localization, Parametric Pose Non-Maximum-Suppression (P-NMS) for eliminating redundant human detections, and Pose Aware Identity Embedding for joint pose estimation and tracking. During training, we resort to Part-Guided Proposal Generator (PGPG) and multi-domain knowledge distillation to further improve the accuracy. Our method is able to accurately localize whole-body keypoints and simultaneously track humans given inaccurate bounding boxes and redundant detections. We show a significant improvement over current state-of-the-art methods in both speed and accuracy on COCO-wholebody, COCO, PoseTrack, and our proposed Halpe-FullBody pose estimation dataset. Our model, source codes and dataset are made publicly available at https://github.com/MVIG-SJTU/AlphaPose.
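Pose-level NMS of the kind P-NMS performs keeps the highest-scoring pose and suppresses candidates that are too similar to it. The sketch below is a simplified, hedged stand-in: the OKS-style similarity and the greedy loop illustrate the mechanism, not AlphaPose's exact parametric criterion:

```python
import numpy as np

def pose_similarity(p, q, sigma=0.1):
    """p, q: (K, 2) keypoint arrays; returns a similarity in [0, 1]."""
    d2 = ((p - q) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).mean()

def pose_nms(poses, scores, thresh=0.7):
    """Greedily keep the best-scoring pose, drop near-duplicates."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        sims = np.array([pose_similarity(poses[i], poses[j]) for j in rest])
        order = rest[sims < thresh]
    return keep
```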
In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from the other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through superior pose estimation results on two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. In addition, we show the superiority of our network in pose tracking on the PoseTrack dataset. The code and models have been made publicly available at https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.
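A minimal two-branch sketch of the repeated multi-scale fusion idea: the high-resolution branch receives an upsampled projection of the low-resolution branch and vice versa, and the exchanged maps are summed so each branch keeps its own resolution. Channel counts and the nearest-neighbor upsampling are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        self.low_to_high = nn.Conv2d(c_low, c_high, kernel_size=1)   # then upsample
        self.high_to_low = nn.Conv2d(c_high, c_low, kernel_size=3,
                                     stride=2, padding=1)            # downsample

    def forward(self, x_high, x_low):
        up = F.interpolate(self.low_to_high(x_low),
                           size=x_high.shape[-2:], mode='nearest')
        down = self.high_to_low(x_high)
        return x_high + up, x_low + down   # each branch keeps its resolution
```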
Bottom-up human pose estimation methods have difficulties in predicting the correct pose for small persons due to challenges in scale variation. In this paper, we present HigherHRNet: a novel bottom-up human pose estimation method for learning scale-aware representations using high-resolution feature pyramids. Equipped with multi-resolution supervision for training and multi-resolution aggregation for inference, the proposed approach is able to solve the scale variation challenge in bottom-up multi-person pose estimation and to localize keypoints more precisely, especially for small persons. The feature pyramid in HigherHRNet consists of feature map outputs from HRNet and upsampled higher-resolution outputs through a transposed convolution. HigherHRNet outperforms the previous best bottom-up method by 2.5% AP for medium persons on COCO test-dev, showing its effectiveness in handling scale variation. Furthermore, HigherHRNet achieves a new state-of-the-art result on COCO test-dev (70.5% AP) without using refinement or other post-processing techniques, surpassing all existing bottom-up methods. HigherHRNet even surpasses all top-down methods on CrowdPose test (67.6% AP), suggesting its robustness in crowded scenes. The code and models are available at https://github.com/HRNet/Higher-HRNet-Human-Pose-Estimation.
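A hedged sketch of the two pieces mentioned above: a transposed convolution produces a higher-resolution heat map alongside the base prediction, and at inference all heat maps are resized to the highest resolution and averaged. Shapes, and feeding the deconvolution with raw features only, are simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HigherHead(nn.Module):
    def __init__(self, channels=32, num_joints=17):
        super().__init__()
        self.base = nn.Conv2d(channels, num_joints, 1)
        self.up = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.fine = nn.Conv2d(channels, num_joints, 1)

    def forward(self, feats):
        h1 = self.base(feats)            # lower-resolution heat map
        h2 = self.fine(self.up(feats))   # 2x higher-resolution heat map
        return [h1, h2]

def aggregate(heatmaps):
    """Average all heat maps after resizing to the largest resolution."""
    size = heatmaps[-1].shape[-2:]
    resized = [F.interpolate(h, size=size, mode='bilinear',
                             align_corners=False) for h in heatmaps]
    return torch.stack(resized).mean(dim=0)
```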
The topic of multi-person pose estimation has seen large improvements recently, especially with the development of convolutional neural networks. However, there still exist a lot of challenging cases, such as occluded keypoints, invisible keypoints and complex backgrounds, which are not yet well addressed. In this paper, we present a novel network structure called Cascaded Pyramid Network (CPN) which aims to relieve the problem of these "hard" keypoints. More specifically, our algorithm includes two stages: GlobalNet and RefineNet. GlobalNet is a feature pyramid network which can successfully localize the "simple" keypoints like eyes and hands but may fail to precisely recognize the occluded or invisible keypoints. Our RefineNet explicitly handles the "hard" keypoints by integrating all levels of feature representations from the GlobalNet together with an online hard keypoint mining loss. In general, to address the multi-person pose estimation problem, a top-down pipeline is adopted to first generate a set of human bounding boxes with a detector, followed by our CPN for keypoint localization in each human bounding box. Based on the proposed algorithm, we achieve state-of-the-art results on the COCO keypoint benchmark, with average precision at 73.0 on the COCO test-dev dataset and 72.1 on the COCO test-challenge dataset, a 19% relative improvement over the 60.5 of the COCO 2016 keypoint challenge. Code and the detection results are publicly available for further research.
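The online hard keypoint mining loss mentioned here computes a per-joint loss and backpropagates only the hardest joints. A minimal sketch; the L2 form and keeping 8 of 17 joints mirror common practice but are assumptions here:

```python
import torch

def ohkm_loss(pred, target, topk=8):
    """pred, target: (B, K, H, W) heat maps. Keep only the top-k hardest
    joints per sample when averaging the loss."""
    per_joint = ((pred - target) ** 2).mean(dim=(2, 3))  # (B, K)
    hard, _ = per_joint.topk(topk, dim=1)                # hardest joints only
    return hard.mean()
```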
Thanks to their high performance, 2D heat map based methods have dominated human pose estimation (HPE) for years. However, the long-standing quantization error problem of 2D heat map based methods leads to several well-known drawbacks: 1) performance is limited for low-resolution inputs; 2) multiple costly upsampling layers are required to improve the feature map resolution for higher localization precision; 3) extra post-processing is adopted to reduce the quantization error. To address these issues, we explore a brand new scheme called SimCC, which reformulates HPE as two classification tasks, one for the horizontal and one for the vertical coordinate. SimCC uniformly divides each pixel into several bins, thus achieving sub-pixel localization precision and low quantization error. Benefiting from this, SimCC can omit the extra refinement post-processing under certain settings and permits a simpler and more effective HPE pipeline. Extensive experiments on the COCO, CrowdPose, and MPII datasets show that SimCC outperforms heat map based counterparts by a large margin, especially in low-resolution settings.
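A minimal sketch of the SimCC idea: per-joint feature vectors are classified independently over x-bins and y-bins, where the number of bins is the image width/height times a splitting ratio (sub-pixel bins). The way per-joint vectors are produced here is a simplification of the paper's head; sizes are assumptions:

```python
import torch.nn as nn

class SimCCHead(nn.Module):
    def __init__(self, feat_dim, num_joints=17, img_w=192, img_h=256,
                 split_ratio=2.0):
        super().__init__()
        self.num_joints = num_joints
        self.to_joint = nn.Linear(feat_dim, num_joints * 256)  # per-joint vectors
        self.cls_x = nn.Linear(256, int(img_w * split_ratio))  # x-bin logits
        self.cls_y = nn.Linear(256, int(img_h * split_ratio))  # y-bin logits

    def forward(self, feats):                 # feats: (B, feat_dim)
        v = self.to_joint(feats).view(-1, self.num_joints, 256)
        return self.cls_x(v), self.cls_y(v)   # (B, K, x_bins), (B, K, y_bins)
```

Training then reduces to a cross-entropy (or label-smoothed) loss over the ground-truth bin index for each coordinate, with no heat map decoding step at all.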
This paper investigates the task of 2D whole-body human pose estimation, which aims to localize dense landmarks on the entire human body, including the body, feet, face, and hands. We propose a single-network approach, termed ZoomNet, that takes into account the hierarchical structure of the full human body and handles the scale variation of different body parts. We further propose a neural architecture search framework, termed ZoomNAS, to promote both the accuracy and efficiency of whole-body pose estimation. ZoomNAS jointly searches the model architecture and the connections between different sub-modules, and automatically allocates computational complexity to the searched sub-modules. To train and evaluate ZoomNAS, we introduce the first large-scale 2D human whole-body dataset, COCO-WholeBody V1.0, which annotates 133 keypoints for in-the-wild images. Extensive experiments demonstrate the effectiveness of ZoomNAS and the significance of COCO-WholeBody V1.0.
Most real-time human pose estimation methods are based on detecting joint positions. From the detected joint positions, the yaw and pitch of the limbs can be computed. However, the roll along the limb cannot be computed, because this axis of rotation remains unobserved, even though it is critical for applications such as sports analysis and computer animation. In this paper, we introduce orientation keypoints, a novel approach for estimating the full position and rotation of skeletal joints using only single-frame RGB images. Inspired by how motion-capture systems use a set of point markers to estimate full bone rotations, our method uses virtual markers to generate sufficient information to accurately infer rotations with simple post-processing. The rotation predictions improve upon the best reported mean error for joint angles by 48% and achieve 93% accuracy across 15 bone rotations. The method also improves upon the current state-of-the-art results for joint position by 14%, as measured by MPJPE on the principal datasets, and generalizes well to in-the-wild datasets.
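An illustrative sketch of the post-processing idea: given three predicted virtual markers rigidly attached to a bone (an origin, an "up" marker, and a "forward" marker), the bone's full rotation, including roll, can be recovered by orthonormalization. The three-marker layout is an assumption for illustration, not the paper's exact marker set:

```python
import numpy as np

def rotation_from_markers(origin, up, forward):
    """Each argument is a 3D point (shape (3,)); returns a 3x3 rotation
    matrix whose columns are the bone's local x, y, z axes."""
    z = forward - origin
    z /= np.linalg.norm(z)
    y = up - origin
    y -= y.dot(z) * z            # Gram-Schmidt: remove the z component
    y /= np.linalg.norm(y)
    x = np.cross(y, z)           # right-handed third axis
    return np.stack([x, y, z], axis=1)
```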
Regression-based methods can estimate body, hand, and even whole-body models from monocular images by directly mapping raw pixels to model parameters in a feed-forward manner. However, minor deviations in the parameters may lead to noticeable misalignment between the estimated meshes and the input images, especially in the context of full-body mesh recovery. To address this issue, we propose a Pyramidal Mesh Alignment Feedback (PyMAF) loop in our regression network for well-aligned human mesh recovery, and extend it as PyMAF-X for the recovery of expressive full-body models. The core idea of PyMAF is to leverage a feature pyramid and rectify the predicted parameters explicitly based on the mesh-image alignment status. Specifically, given the currently predicted parameters, mesh-aligned evidence is extracted from finer-resolution features accordingly and fed back for parameter rectification. To enhance the alignment perception, an auxiliary dense supervision is employed to provide mesh-image correspondence guidance, while spatial alignment attention is introduced to make our network aware of the global contexts. When extending PyMAF for full-body mesh recovery, an adaptive integration strategy is proposed in PyMAF-X to adjust the elbow-twist rotations, which produces natural wrist poses while maintaining the well-aligned performance of the part-specific estimations. The efficacy of our approach is validated on several benchmark datasets for body and full-body mesh recovery, where PyMAF and PyMAF-X effectively improve mesh-image alignment and achieve new state-of-the-art results. The project page with code and video results can be found at https://www.liuyebin.com/pymaf-x.
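A hedged sketch of the mesh-aligned evidence extraction step: project (downsampled) mesh vertices into the image plane, bilinearly sample the feature map at those locations, and feed the pooled per-vertex features back into the parameter regressor. The projection and shapes are simplified assumptions, not the paper's exact pipeline:

```python
import torch
import torch.nn.functional as F

def mesh_aligned_features(feat_map, vertices_2d):
    """feat_map: (B, C, H, W); vertices_2d: (B, N, 2), already projected
    and normalized to [-1, 1]. Returns (B, N, C) per-vertex features."""
    grid = vertices_2d.unsqueeze(2)                 # (B, N, 1, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=False)  # (B, C, N, 1)
    return sampled.squeeze(-1).permute(0, 2, 1)     # (B, N, C)
```

The point of sampling at the currently predicted mesh is that the features carry an explicit signal about where the mesh and the image disagree, which is what the feedback loop corrects.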
We propose an end-to-end trainable approach for multi-instance pose estimation, called POET (POse Estimation Transformer). Combining a convolutional neural network with a transformer encoder-decoder architecture, we formulate multi-instance pose estimation from images as a direct set prediction problem. Our model is able to directly regress the poses of all individuals, using a bipartite matching scheme. POET is trained with a set-based global loss that consists of a keypoint loss, a visibility loss, and a class loss. POET reasons about the relations between multiple detected individuals and the full image context to directly predict their poses in parallel. We show that POET achieves high accuracy on the COCO keypoint detection task while having fewer parameters and higher inference speed than other bottom-up and top-down approaches. Moreover, we demonstrate successful transfer learning when applying POET to animal pose estimation. To the best of our knowledge, this is the first end-to-end trainable multi-instance pose estimation method, and we hope that it will serve as a simple and promising alternative.
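A minimal sketch of the set-based matching step: build a cost matrix between predicted and ground-truth poses and solve the assignment with the Hungarian algorithm; the loss is then computed only over matched pairs. The cost definition here (mean keypoint distance) is a simplified assumption in place of the full keypoint/visibility/class cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_poses(pred, gt):
    """pred: (P, K, 2) predicted poses, gt: (G, K, 2) ground-truth poses.
    Returns the list of (pred_idx, gt_idx) matched pairs."""
    cost = np.linalg.norm(pred[:, None] - gt[None, :], axis=-1).mean(-1)  # (P, G)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```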
Recent state-of-the-art performance on human-body pose estimation has been achieved with Deep Convolutional Networks (ConvNets). Traditional ConvNet architectures include pooling and sub-sampling layers which reduce computational requirements, introduce invariance and prevent over-training. These benefits of pooling come at the cost of reduced localization accuracy. We introduce a novel architecture which includes an efficient 'position refinement' model that is trained to estimate the joint offset location within a small region of the image. This refinement model is jointly trained in cascade with a state-of-the-art ConvNet model [21] to achieve improved accuracy in human joint location estimation. We show that the variance of our detector approaches the variance of human annotations on the FLIC [20] dataset and outperforms all existing approaches on the MPII-human-pose dataset [1].
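A hedged sketch of cascaded position refinement as described above: take the coarse heat map's argmax, crop a small window of features around it, and let a second model regress the residual offset inside that window. The window size and the refiner interface are illustrative assumptions:

```python
import torch

def refine_joint(heatmap, feats, refiner, win=9):
    """heatmap: (H, W); feats: (C, H, W); refiner maps a (1, C, win, win)
    crop to a (1, 2) offset (dy, dx). Returns the refined (y, x) position."""
    idx = torch.argmax(heatmap)
    y, x = idx // heatmap.size(1), idx % heatmap.size(1)
    r = win // 2
    # Clamp so the crop stays inside the feature map.
    y0 = int(y.clamp(r, heatmap.size(0) - r - 1))
    x0 = int(x.clamp(r, heatmap.size(1) - r - 1))
    crop = feats[:, y0 - r:y0 + r + 1, x0 - r:x0 + r + 1]
    dy, dx = refiner(crop.unsqueeze(0))[0]
    return y0 + dy.item(), x0 + dx.item()
```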
We present BAPose, a novel bottom-up approach that achieves state-of-the-art results for multi-person pose estimation. Our end-to-end trainable framework leverages a disentangled multi-scale waterfall architecture and incorporates adaptive convolutions to more accurately infer occluded keypoints in crowded scenes. The multi-scale representations obtained by the disentangled waterfall module in BAPose leverage the efficiency of progressive filtering in a cascade architecture, while maintaining multi-scale fields of view comparable to spatial pyramid configurations. Our results on the challenging COCO and CrowdPose datasets demonstrate that BAPose is an efficient and robust framework for multi-person pose estimation, achieving significant improvements in state-of-the-art accuracy.
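A hedged sketch of a waterfall-style atrous module of the kind referenced above: a cascade of dilated convolutions with growing rates, where each stage filters the previous stage's output and all stage outputs are concatenated, approximating a spatial-pyramid field of view with progressive filtering. The dilation rates and widths are illustrative assumptions, not BAPose's exact module:

```python
import torch
import torch.nn as nn

class WaterfallModule(nn.Module):
    def __init__(self, channels=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = torch.relu(stage(x))  # cascade: each stage feeds the next
            outs.append(x)
        return self.project(torch.cat(outs, dim=1))
```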
Multi-person 3D pose estimation is a challenging task because of occlusion and depth ambiguity, especially in crowded scenes. To solve these problems, most existing methods explore modeling body context cues by enhancing feature representations with graph neural networks or adding structural constraints. However, these methods are not robust, owing to their single-root formulation, which decodes 3D poses from a root node with a pre-defined graph. In this paper, we propose GR-M3D, which models Multi-person 3D pose estimation with dynamic Graph Reasoning. The decoding graph in GR-M3D is predicted instead of pre-defined. Specifically, it first generates several data maps and enhances them with a scale and depth aware refinement module (SDAR). Multiple root keypoints and dense decoding paths for each person are then estimated from these data maps. Based on them, a dynamic decoding graph is built by assigning path weights to the decoding paths, where the path weights are inferred from the enhanced data maps. This process is named dynamic graph reasoning (DGR). Finally, the 3D pose of each detected person is decoded according to its dynamic decoding graph. By adopting soft path weights according to the input data, GR-M3D can adjust the structure of the decoding graph, which makes the decoding graph best adapted to different input persons and more capable of handling occlusion and depth ambiguity than previous methods. We empirically show that the proposed bottom-up approach even outperforms top-down methods and achieves state-of-the-art results on three 3D pose datasets.
We present a conceptually simple, flexible and universal visual perception head for variant visual tasks, e.g., classification, object detection, instance segmentation and pose estimation, and for different frameworks, such as one-stage or two-stage pipelines. Our approach effectively identifies objects in an image while simultaneously generating high-quality bounding boxes, contour-based segmentation masks, or sets of keypoints. The method, called UniHead, views different visual perception tasks as dispersible points learning via a transformer encoder architecture. Given a fixed spatial coordinate, UniHead adaptively scatters it to different spatial points and reasons about the relations among them. It directly outputs the final set of predictions in the form of multiple points, allowing us to perform different visual tasks in different frameworks with the same head design. We show extensive evaluations on ImageNet classification and on all three tracks of the COCO suite of challenges, including object detection, instance segmentation and pose estimation. Without bells and whistles, UniHead can unify these visual tasks via a single visual head design and achieve performance comparable to expert models developed for each task. We hope our simple and universal UniHead will serve as a solid baseline and help promote universal visual perception research. Code and models are available at https://github.com/sense-x/unihead.
Mask R-CNN
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code will be made available.
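A hedged sketch of how this framework extends to pose estimation: each keypoint is predicted as its own one-hot "mask" over an RoI grid, trained with a softmax (cross-entropy) over spatial locations. The head below is a simplified stand-in for the paper's conv stack; channel counts and grid sizes are assumptions:

```python
import torch.nn as nn

class KeypointHead(nn.Module):
    def __init__(self, in_channels=256, num_keypoints=17):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Upsample the RoI grid, then one output channel per keypoint.
        self.deconv = nn.ConvTranspose2d(512, num_keypoints, 4, stride=2, padding=1)

    def forward(self, roi_feats):                    # (N_roi, C, 14, 14) from RoIAlign
        logits = self.deconv(self.convs(roi_feats))  # (N_roi, K, 28, 28)
        n, k, h, w = logits.shape
        return logits.view(n, k, h * w)  # cross-entropy over spatial positions
```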
In this paper, we consider the challenging task of simultaneously locating and recovering multiple hands from a single 2D image. Previous studies either focus on single-hand reconstruction or solve this problem in a multi-stage way. Moreover, the conventional two-stage pipeline first detects hand regions and then estimates the 3D hand pose from each cropped patch. To reduce the computational redundancy in preprocessing and feature extraction, we propose a concise but efficient single-stage pipeline. Specifically, we design a multi-head auto-encoder structure for multi-hand reconstruction, where each head network shares the same feature map and separately outputs the hand center, pose, and texture. Besides, we adopt a weakly-supervised scheme to alleviate the burden of expensive 3D real-world data annotation. To this end, we propose a series of losses optimized by a stage-wise training scheme, where a multi-hand dataset with 2D annotations is generated based on publicly available single-hand datasets. To further improve the accuracy of the weakly-supervised model, we adopt several feature consistency constraints in both single- and multi-hand settings. Specifically, the keypoints of each hand estimated from local features should be consistent with the re-projected points predicted from global features. Extensive experiments on public benchmarks including FreiHAND, HO3D, InterHand2.6M, and RHD demonstrate that our method outperforms state-of-the-art model-based methods in both the weakly-supervised and the fully-supervised manner. Code and models are available at https://github.com/zijinxuxu/smhr.
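A minimal sketch of the keypoint consistency constraint described above: keypoints estimated from local (per-hand) features should agree with the re-projected keypoints predicted from global features. The L1 form and the visibility masking are assumptions for illustration:

```python
import torch

def consistency_loss(local_kpts, global_kpts_reproj, visible):
    """local_kpts, global_kpts_reproj: (B, N_hands, K, 2) 2D keypoints;
    visible: (B, N_hands, K) mask of annotated/visible keypoints."""
    diff = (local_kpts - global_kpts_reproj).abs().sum(dim=-1)  # (B, N, K)
    return (diff * visible).sum() / visible.sum().clamp(min=1)
```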
Although it is well believed for years that modeling relations between objects would help object recognition, there has not been evidence that the idea is working in the deep learning era. All state-of-the-art object detection systems still rely on recognizing object instances individually, without exploiting their relations during learning. This work proposes an object relation module. It processes a set of objects simultaneously through interaction between their appearance feature and geometry, thus allowing modeling of their relations. It is lightweight and in-place. It does not require additional supervision and is easy to embed in existing networks. It is shown effective on improving object recognition and duplicate removal steps in the modern object detection pipeline. It verifies the efficacy of modeling object relations in CNN based detection. It gives rise to the first fully end-to-end object detector. Code is available at https://github.com/msracver/Relation-Networks-for-Object-Detection.
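A simplified sketch of such a relation module: attention weights combine an appearance term (scaled dot-product between object features) with a geometry term computed from pairwise box relations. Reducing the geometry embedding to a small MLP over raw box deltas is an illustrative simplification of the paper's design:

```python
import math
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    def __init__(self, dim=1024, key_dim=64):
        super().__init__()
        self.q = nn.Linear(dim, key_dim)
        self.k = nn.Linear(dim, key_dim)
        self.v = nn.Linear(dim, dim)
        self.geo = nn.Sequential(nn.Linear(4, key_dim), nn.ReLU(),
                                 nn.Linear(key_dim, 1))

    def forward(self, feats, boxes):
        """feats: (N, dim) appearance features; boxes: (N, 4) as (x, y, w, h)."""
        app = self.q(feats) @ self.k(feats).t() / math.sqrt(self.q.out_features)
        delta = boxes[:, None, :] - boxes[None, :, :]  # (N, N, 4) pairwise geometry
        geo = self.geo(delta).squeeze(-1)              # (N, N) geometry logits
        w = torch.softmax(app + geo, dim=-1)
        return feats + w @ self.v(feats)               # residual relation feature
```

Because the module outputs a feature of the same shape as its input, it can be dropped "in-place" between existing layers without any other change to the detector, which is what makes it easy to embed.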
This paper presents HoughNet, a one-stage, anchor-free, voting-based, bottom-up object detection method. Inspired by the Generalized Hough Transform, HoughNet determines the presence of an object at a certain location by the sum of the votes cast on that location. Votes are collected from both near and long-distance locations based on a log-polar vote field. Thanks to this voting mechanism, HoughNet is able to integrate both near and long-range, class-conditional evidence for visual recognition, thereby generalizing and enhancing current object detection methodology, which typically relies only on local evidence. On the COCO dataset, HoughNet's best model achieves 46.4 AP (and 65.1 AP50), performing on par with the state of the art in bottom-up object detection and outperforming most major one-stage and two-stage methods. We further validate the effectiveness of our proposal in other visual detection tasks, namely video object detection, instance segmentation, 3D object detection, and keypoint detection for human pose estimation, as well as an additional "labels to photo" image generation task, where the integration of our voting module consistently improves performance in all cases. Code is available at https://github.com/nerminsamet/houghnet.
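A heavily simplified sketch of the voting idea: each spatial location casts its evidence onto a set of displaced target locations (standing in for the log-polar vote field), and an object's presence score at a location is the sum of the votes it receives. The integer offsets and uniform weights are illustrative assumptions, and the wrap-around at borders from torch.roll is a simplification:

```python
import torch

def accumulate_votes(evidence, offsets):
    """evidence: (H, W) per-location class-conditional evidence;
    offsets: list of (dy, dx) integer vote displacements.
    Returns an (H, W) map of accumulated votes."""
    votes = torch.zeros_like(evidence)
    for dy, dx in offsets:
        shifted = torch.roll(evidence, shifts=(dy, dx), dims=(0, 1))
        votes += shifted  # evidence at (y, x) votes for (y + dy, x + dx)
    return votes
```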