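This paper presents a new method that uses a Transformer to solve keypoint detection and instance association. Bottom-up multi-person pose estimation models need to detect keypoints and learn the association information between them. We argue that both problems can be solved entirely by a Transformer. Specifically, the self-attention in a Transformer measures the dependency between any pair of locations, which can provide association information for keypoint grouping. However, naive attention patterns are not explicitly controlled, so there is no guarantee that keypoints will always attend to the instances they belong to. To address this, we propose a novel approach that supervises self-attention for multi-person keypoint detection and instance association. By using instance masks to supervise self-attention to be instance-aware, we can assign detected keypoints to their corresponding instances based on pairwise attention scores, without the predefined offset vector fields or embeddings used by CNN-based bottom-up models. An additional benefit of our method is that instance segmentation results for any number of people can be obtained directly from the supervised attention matrix, which simplifies the pixel assignment pipeline. Experiments on the COCO multi-person keypoint detection challenge and the person instance segmentation task demonstrate the effectiveness and simplicity of the proposed method and show a promising way to control self-attention behavior for specific purposes.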
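The grouping step described in this abstract can be illustrated with a toy sketch. It assumes a hypothetical layout in which `attention[k][p]` is the attention weight from detected keypoint `k` to token `p`, and `masks[i]` is the set of token indices covered by instance `i`. The function name and the mean-attention scoring rule are illustrative assumptions, not the paper's implementation.

```python
def assign_keypoints(attention, masks):
    """Assign each keypoint to the instance whose mask receives the
    highest mean attention from that keypoint (hypothetical scoring rule).

    attention: list of rows; attention[k][p] is the weight from keypoint k to token p.
    masks:     list of sets; masks[i] holds the token indices of instance i.
    """
    assignments = []
    for row in attention:
        # Mean attention that this keypoint pays to each instance mask.
        scores = [sum(row[p] for p in mask) / len(mask) for mask in masks]
        assignments.append(max(range(len(masks)), key=scores.__getitem__))
    return assignments


# Toy example: 6 tokens, two instances covering tokens {0, 1, 2} and {3, 4, 5}.
masks = [{0, 1, 2}, {3, 4, 5}]
attention = [
    [0.30, 0.30, 0.20, 0.10, 0.05, 0.05],  # keypoint 0 attends mostly to instance 0
    [0.05, 0.05, 0.10, 0.30, 0.30, 0.20],  # keypoint 1 attends mostly to instance 1
]
groups = assign_keypoints(attention, masks)
```

In this toy setting, no offset field or embedding is needed: the attention rows alone decide the grouping.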
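We propose an end-to-end trainable approach for multi-instance pose estimation, called POET (POse Estimation Transformer). Combining a convolutional neural network with a transformer encoder-decoder architecture, we formulate multi-instance pose estimation from images as a direct set prediction problem. Our model directly regresses the poses of all individuals using a bipartite matching scheme. POET is trained with a set-based global loss that consists of a keypoint loss, a visibility loss, and a class loss. POET reasons about the relations between multiple detected individuals and the full image context to directly predict their poses in parallel. We show that POET achieves high accuracy on the COCO keypoint detection task while having fewer parameters and higher inference speed than other bottom-up and top-down approaches. Moreover, we show successful transfer learning when applying POET to animal pose estimation. To the best of our knowledge, this model is the first end-to-end trainable multi-instance pose estimation method, and we hope it will serve as a simple and promising alternative.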
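To make the bipartite matching idea concrete, here is a minimal sketch that finds the optimal one-to-one assignment between ground-truth poses and predicted poses. The cost (mean L1 keypoint distance) and the exhaustive search over permutations are simplifying assumptions chosen to keep the example dependency-free; they are not taken from the paper.

```python
from itertools import permutations
import math


def pose_cost(pred, gt):
    """Mean L1 distance between corresponding keypoints (assumed cost)."""
    return sum(abs(px - gx) + abs(py - gy)
               for (px, py), (gx, gy) in zip(pred, gt)) / len(gt)


def match_predictions(preds, gts):
    """Return (assignment, cost), where assignment[j] is the index of the
    prediction matched to ground truth j. Exhaustive search, so only
    suitable for a handful of poses; requires len(preds) >= len(gts)."""
    best_cost, best = math.inf, ()
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pose_cost(preds[i], gt) for i, gt in zip(perm, gts))
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best), best_cost


# Three predicted poses (two keypoints each) and two ground-truth poses.
preds = [[(10, 10), (12, 14)], [(0, 0), (1, 1)], [(50, 50), (52, 54)]]
gts = [[(0, 1), (1, 1)], [(11, 10), (12, 15)]]
assignment, total_cost = match_predictions(preds, gts)
```

For realistic set sizes, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would normally replace the exhaustive search.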
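Human pose estimation aims to locate the keypoints of all people in different scenes. Despite promising results, current approaches still face some challenges. Existing top-down methods process each person individually, without modeling the interaction between different people and the scene they are in. Consequently, the performance of human detection degrades when severe occlusion occurs. Existing bottom-up methods, on the other hand, consider all people at the same time and capture global knowledge of the entire image. However, they are less accurate than top-down methods because of scale variation. To address these problems, we propose a novel Dual-Pipeline Integrated Transformer (DPIT) that integrates the top-down and bottom-up pipelines to explore visual clues from different receptive fields and exploit their complementarity. Specifically, DPIT consists of two branches: the bottom-up branch processes the whole image to capture global visual information, while the top-down branch extracts local feature representations from single-human bounding boxes. The feature representations extracted by the two branches are then fed into a transformer encoder to fuse global and local knowledge interactively. Moreover, we define keypoint queries to explore both full-scene and single-human pose visual clues, so that the two pipelines complement each other. To the best of our knowledge, this is one of the first works to integrate the bottom-up and top-down pipelines with transformers for human pose estimation. Extensive experiments on the COCO and MPII datasets show that DPIT achieves performance comparable to state-of-the-art methods.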
Bottom-up human pose estimation methods have difficulties in predicting the correct pose for small persons due to challenges in scale variation. In this paper, we present HigherHRNet: a novel bottom-up human pose estimation method for learning scale-aware representations using high-resolution feature pyramids. Equipped with multi-resolution supervision for training and multi-resolution aggregation for inference, the proposed approach is able to solve the scale variation challenge in bottom-up multi-person pose estimation and localize keypoints more precisely, especially for small persons. The feature pyramid in HigherHRNet consists of feature map outputs from HRNet and upsampled higher-resolution outputs through a transposed convolution. HigherHRNet outperforms the previous best bottom-up method by 2.5% AP for medium persons on COCO test-dev, showing its effectiveness in handling scale variation. Furthermore, HigherHRNet achieves a new state-of-the-art result on COCO test-dev (70.5% AP) without using refinement or other post-processing techniques, surpassing all existing bottom-up methods. HigherHRNet even surpasses all top-down methods on CrowdPose test (67.6% AP), suggesting its robustness in crowded scenes. The code and models are available at https://github.com/HRNet/HigherHRNet-Human-Pose-Estimation.
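The multi-resolution aggregation mentioned above can be sketched as upsampling a lower-resolution heatmap to the higher resolution and averaging the two. Nearest-neighbour upsampling, plain nested lists, and the function names are assumptions made to keep the sketch short; this is not the released implementation.

```python
def upsample_nearest(heatmap, factor):
    """Repeat every value `factor` times along both axes."""
    return [[v for v in row for _ in range(factor)]
            for row in heatmap for _ in range(factor)]


def aggregate(low_res, high_res):
    """Average a low-resolution heatmap with a higher-resolution one.
    Assumes the high-resolution size is an integer multiple of the low one."""
    factor = len(high_res) // len(low_res)
    upsampled = upsample_nearest(low_res, factor)
    return [[(a + b) / 2 for a, b in zip(row_up, row_hi)]
            for row_up, row_hi in zip(upsampled, high_res)]


low = [[0.2, 0.8]]                    # 1 x 2 heatmap
high = [[0.0, 0.2, 0.6, 1.0],
        [0.0, 0.2, 1.0, 0.6]]         # 2 x 4 heatmap
fused = aggregate(low, high)
```

The fused map keeps the fine localization of the high-resolution output while still using the evidence from the coarser scale.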
In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. In addition, we show the superiority of our network in pose tracking on the PoseTrack dataset. The code and models are publicly available at https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.
Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
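As a rough illustration of how a Part Affinity Field can score a candidate limb, the sketch below samples the field along the segment between two candidate keypoints and averages its alignment with the segment direction. Representing the field as a Python callable, the sample count, and the function names are assumptions for this example only.

```python
import math


def paf_score(paf, a, b, num_samples=10):
    """Average dot product between the field and the unit vector a -> b.

    paf: callable (x, y) -> (vx, vy), a stand-in for a predicted field.
    a, b: (x, y) candidate keypoint locations.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return 0.0
    ux, uy = dx / norm, dy / norm
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)          # evenly spaced points from a to b
        vx, vy = paf(a[0] + t * dx, a[1] + t * dy)
        total += vx * ux + vy * uy
    return total / num_samples


def horizontal_field(x, y):
    """Toy field: a limb pointing in the +x direction everywhere."""
    return (1.0, 0.0)


aligned = paf_score(horizontal_field, (0, 0), (10, 0))
perpendicular = paf_score(horizontal_field, (0, 0), (0, 10))
```

A candidate pair that follows the field direction scores high, while a pair that crosses it scores near zero, which is what lets the scores be used to associate body parts with individuals.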
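We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based methods. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features most relevant to the target keypoints, considerably improving accuracy. Importantly, our framework is end-to-end differentiable and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose estimation datasets, demonstrate that our method significantly improves upon the state of the art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.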
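Recently, the vision transformer and its variants have played an increasingly important role in both monocular and multi-view human pose estimation. Treating image patches as tokens, transformers can model global dependencies within the entire image or across images from other views. However, global attention is computationally expensive. As a consequence, it is difficult to scale these transformer-based methods to high-resolution features and many views. In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which locates a rough human mask and performs self-attention only within the selected tokens. Furthermore, we extend PPT to multi-view human pose estimation. Building on PPT, we propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates. Experimental results on COCO and MPII demonstrate that PPT can match the accuracy of previous pose transformer methods while reducing computation. Moreover, experiments on Human3.6M and Ski-Pose demonstrate that our multi-view PPT can efficiently fuse cues from multiple views and achieve new state-of-the-art results.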
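This paper presents a unified framework to (i) locate the ball, (ii) predict the pose, and (iii) segment the instance masks of players in team sports scenes. These problems are of high interest for automated sports analytics, production, and broadcasting. A common practice is to solve each problem individually by exploiting generic state-of-the-art models, e.g., Panoptic-DeepLab for player segmentation. In addition to the increased complexity that results from multiplying single-task models, the use of off-the-shelf models also impedes performance because of the complexity and specificity of team sports scenes, such as strong occlusion and motion blur. To circumvent these limitations, our paper proposes to train a single model that simultaneously predicts the ball and the player masks and poses by combining the part intensity fields and spatial embeddings principles. Part intensity fields provide the ball and player locations, as well as the player joint locations. Spatial embeddings are then exploited to associate player instance pixels with their respective player centers, and also to group player joints into skeletons. We demonstrate the effectiveness of the proposed model on the DeepSport basketball dataset, achieving performance comparable to the state-of-the-art models that address each task separately.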
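The spatial-embedding grouping described above can be sketched as follows: each pixel or joint predicts an offset that should point to its player's center, and it is assigned to the nearest detected center. The data layout and the function name are hypothetical, and the nearest-center rule is a simplification chosen for illustration.

```python
def group_by_embedding(points, offsets, centers):
    """Assign each point to the center closest to (point + predicted offset).

    points:  list of (x, y) pixel or joint locations.
    offsets: list of (dx, dy) predicted spatial embeddings, one per point.
    centers: list of (x, y) detected player centers.
    """
    labels = []
    for (x, y), (ox, oy) in zip(points, offsets):
        ex, ey = x + ox, y + oy  # where this point "votes" for its center
        labels.append(min(
            range(len(centers)),
            key=lambda c: (ex - centers[c][0]) ** 2 + (ey - centers[c][1]) ** 2,
        ))
    return labels


centers = [(10, 10), (40, 10)]
points = [(8, 4), (12, 16), (38, 5), (43, 15)]
offsets = [(2, 6), (-2, -6), (2, 5), (-3, -5)]
labels = group_by_embedding(points, offsets, centers)
```

The same rule can serve both purposes named in the abstract: labelling mask pixels per player and collecting joints into one skeleton per center.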
Accurate whole-body multi-person pose estimation and tracking is an important yet challenging topic in computer vision. To capture the subtle actions of humans for complex behavior analysis, whole-body pose estimation including the face, body, hand and foot is essential over conventional body-only pose estimation. In this paper, we present AlphaPose, a system that can perform accurate whole-body pose estimation and tracking jointly while running in realtime. To this end, we propose several new techniques: Symmetric Integral Keypoint Regression (SIKR) for fast and fine localization, Parametric Pose Non-Maximum-Suppression (P-NMS) for eliminating redundant human detections and Pose Aware Identity Embedding for joint pose estimation and tracking. During training, we resort to Part-Guided Proposal Generator (PGPG) and multi-domain knowledge distillation to further improve the accuracy. Our method is able to localize whole-body keypoints accurately and track humans simultaneously given inaccurate bounding boxes and redundant detections. We show a significant improvement over current state-of-the-art methods in both speed and accuracy on COCO-wholebody, COCO, PoseTrack, and our proposed Halpe-FullBody pose estimation dataset. Our model, source code and dataset are publicly available at https://github.com/MVIG-SJTU/AlphaPose.
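For readers unfamiliar with integral keypoint regression, the sketch below shows the plain (non-symmetric) form: a softmax over the heatmap followed by the expected coordinate, often called soft-argmax. It is a generic illustration under that assumption and does not reproduce the symmetric variant (SIKR) proposed in the paper.

```python
import math


def soft_argmax(heatmap):
    """Return the expected (x, y) location under a softmax of the heatmap.

    heatmap: list of rows of unnormalized scores (logits).
    """
    peak = max(max(row) for row in heatmap)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y


# Two equally strong neighbouring cells: the estimate falls between them.
heatmap = [
    [0.0, 0.0, 0.0],
    [0.0, 5.0, 5.0],
    [0.0, 0.0, 0.0],
]
x, y = soft_argmax(heatmap)
```

Unlike a hard argmax, the result is a continuous coordinate and the operation is differentiable, which is why it suits fine localization.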
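In this paper, we present the Intra- and Inter-Human Relation Network (I^2R-Net) for multi-person pose estimation. It involves two basic modules. First, the Intra-Human Relation Module operates on a single person and aims to capture intra-human dependencies. Second, the Inter-Human Relation Module considers the relations between multiple instances and focuses on capturing inter-human interactions. The Inter-Human Relation Module can be made very lightweight by reducing the resolution of the feature map, yet it learns useful relation information that significantly boosts the performance of the Intra-Human Relation Module. Even without bells and whistles, our method can match or outperform current competition winners. We conduct extensive experiments on the COCO, CrowdPose, and OCHuman datasets. The results demonstrate that the proposed model surpasses all state-of-the-art methods. Concretely, the proposed method achieves 77.4% AP on the CrowdPose dataset and 67.8% AP on the OCHuman dataset, outperforming existing methods by a large margin. Additionally, the ablation study and visualization analysis further demonstrate the effectiveness of our model.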
Multi-person pose estimation has improved greatly in recent years, especially with the development of convolutional neural networks. However, there are still many challenging cases, such as occluded keypoints, invisible keypoints and complex backgrounds, which cannot be well addressed. In this paper, we present a novel network structure called Cascaded Pyramid Network (CPN), which aims to relieve the problem posed by these "hard" keypoints. More specifically, our algorithm includes two stages: GlobalNet and RefineNet. GlobalNet is a feature pyramid network which can successfully localize the "simple" keypoints like eyes and hands but may fail to precisely recognize the occluded or invisible keypoints. Our RefineNet explicitly handles the "hard" keypoints by integrating all levels of feature representations from the GlobalNet together with an online hard keypoint mining loss. In general, to address the multi-person pose estimation problem, a top-down pipeline is adopted to first generate a set of human bounding boxes based on a detector, followed by our CPN for keypoint localization in each human bounding box. Based on the proposed algorithm, we achieve state-of-the-art results on the COCO keypoint benchmark, with an average precision of 73.0 on the COCO test-dev dataset and 72.1 on the COCO test-challenge dataset, which is a 19% relative improvement compared with 60.5 from the COCO 2016 keypoint challenge. Code and the detection results are publicly available for further research.
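The idea behind an online hard keypoint mining loss can be sketched in a few lines: compute a loss per keypoint, keep only the largest ones, and average them. The function name, the choice of `top_k`, and the plain-list representation are assumptions for illustration, not the paper's exact formulation.

```python
def ohkm_loss(per_keypoint_losses, top_k):
    """Average the `top_k` largest per-keypoint losses (the "hard" keypoints).

    per_keypoint_losses: one scalar loss per keypoint for a single person.
    top_k: number of hardest keypoints to keep; must be at least 1.
    """
    hardest = sorted(per_keypoint_losses, reverse=True)[:top_k]
    return sum(hardest) / len(hardest)


# Six keypoints: three are easy (small loss) and three are hard.
losses = [0.01, 0.50, 0.02, 0.90, 0.03, 0.40]
loss = ohkm_loss(losses, top_k=3)
```

Because the easy keypoints are dropped before averaging, the gradient is concentrated on the occluded or otherwise difficult ones.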
We propose a sparse end-to-end multi-person pose regression framework, termed QueryPose, which can directly predict multi-person keypoint sequences from the input image. The existing end-to-end methods rely on dense representations to preserve the spatial detail and structure for precise keypoint localization. However, the dense paradigm introduces complex and redundant post-processes during inference. In our framework, each human instance is encoded by several learnable spatial-aware part-level queries associated with an instance-level query. First, we propose the Spatial Part Embedding Generation Module (SPEGM) that considers the local spatial attention mechanism to generate several spatial-sensitive part embeddings, which contain spatial details and structural information for enhancing the part-level queries. Second, we introduce the Selective Iteration Module (SIM) to adaptively update the sparse part-level queries via the generated spatial-sensitive part embeddings stage-by-stage. Based on the two proposed modules, the part-level queries are able to fully encode the spatial details and structural information for precise keypoint regression. With the bipartite matching, QueryPose avoids the hand-designed post-processes and surpasses the existing dense end-to-end methods with 73.6 AP on MS COCO mini-val set and 72.7 AP on CrowdPose test set. Code is available at https://github.com/buptxyb666/QueryPose.
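In keypoint estimation tasks such as human pose estimation, heatmap-based regression is the dominant approach despite notable drawbacks: heatmaps intrinsically suffer from quantization error and require excessive computation to generate and post-process. Motivated to find a more efficient solution, we propose a new heatmap-free keypoint estimation method in which individual keypoints and sets of spatially related keypoints (i.e., poses) are modeled as objects within a dense single-stage anchor-based detection framework. Hence, we call our method KAPAO (pronounced "Ka-Pow!"), for Keypoints And Poses As Objects. We apply KAPAO to the problem of single-stage multi-person human pose estimation by simultaneously detecting human pose objects and keypoint objects and fusing the detections to exploit the strengths of both object representations. In experiments, we observe that KAPAO is significantly faster and more accurate than previous methods, which suffer greatly from heatmap post-processing. Moreover, the accuracy-speed trade-off is especially favorable when test-time augmentation is not used. Our large model, KAPAO-L, achieves an AP of 70.6 on the Microsoft COCO Keypoints validation set without test-time augmentation, which is 4.0 AP more accurate than the next best single-stage model. Furthermore, KAPAO excels in the presence of heavy occlusion. On the CrowdPose test set, KAPAO-L achieves new state-of-the-art accuracy for a single-stage method, with an AP of 68.9.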
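The quantization error attributed to heatmaps can be seen with simple arithmetic. The sketch below assumes a heatmap whose cells are `stride` pixels apart and a decoder that takes the nearest cell without sub-pixel refinement; both are simplifying assumptions for illustration.

```python
def quantization_error(coord, stride):
    """Absolute error (in pixels) after snapping a coordinate to the
    nearest heatmap cell and decoding it back to image space."""
    cell = round(coord / stride)
    return abs(cell * stride - coord)


# A keypoint at x = 101.7 px on a stride-4 heatmap decodes to x = 100 px.
error = quantization_error(101.7, stride=4)

# Over a sweep of positions, the error never exceeds half the stride.
worst = max(quantization_error(x / 10, stride=4) for x in range(0, 1000))
```

Under these assumptions the error is bounded by half the stride, so a coarser heatmap loses more precision unless extra refinement is added in post-processing.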
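We present a conceptually simple, flexible, and universal visual perception head for various visual tasks, such as classification, object detection, instance segmentation, and pose estimation, and for different frameworks, such as one-stage or two-stage pipelines. Our approach effectively identifies an object in an image while simultaneously generating a high-quality bounding box, a contour-based segmentation mask, or a set of keypoints. The method, called UniHead, views different visual perception tasks as dispersible point learning via the transformer encoder architecture. Given a fixed spatial coordinate, UniHead adaptively scatters it to different spatial points and reasons about their relations. It directly outputs the final set of predictions in the form of multiple points, allowing us to perform different visual tasks in different frameworks with the same head design. We show extensive evaluations on ImageNet classification and on all three tracks of the COCO suite of challenges: object detection, instance segmentation, and pose estimation. Without bells and whistles, UniHead unifies these visual tasks with a single head design and achieves performance comparable to expert models developed for each task. We hope our simple and universal UniHead will serve as a solid baseline and help promote research on universal visual perception. Code and models are available at https://github.com/sense-x/unihead.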
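Occlusion poses a great threat to monocular multi-person 3D human pose estimation because of the large variability in the shape, appearance, and position of occluders. While existing methods try to handle occlusion with pose priors or constraints, data augmentation, or implicit reasoning, they still fail to generalize to unseen poses or occlusion cases and may make large mistakes when multiple people are present. Inspired by the remarkable human ability to infer occluded joints from visible cues, we develop a method that explicitly models this process and significantly improves bottom-up multi-person human pose estimation, with or without occlusions. First, we split the task into two subtasks, visible keypoint detection and occluded keypoint reasoning, and propose a Deeply Supervised Encoder Distillation (DSED) network to solve the second one. To train our model, we propose a Skeleton-guided human Shape Fitting (SSF) approach that generates pseudo occlusion labels on existing datasets, enabling explicit occlusion reasoning. Experiments show that explicitly learning from occlusions improves human pose estimation. In addition, exploiting feature-level information from visible joints allows us to reason about occluded joints more accurately. Our method outperforms state-of-the-art top-down and bottom-up methods on several benchmarks.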
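We propose BAPose, a novel bottom-up approach that achieves state-of-the-art results for multi-person pose estimation. Our end-to-end trainable framework leverages a disentangled multi-scale waterfall architecture and incorporates adaptive convolutions to infer occluded keypoints more accurately in crowded scenes. The multi-scale representations obtained by the disentangled waterfall module in BAPose leverage the efficiency of progressive filtering in a cascade architecture while maintaining multi-scale fields of view comparable to spatial pyramid configurations. Our results on the challenging COCO and CrowdPose datasets demonstrate that BAPose is an efficient and robust framework for multi-person pose estimation, achieving significant improvements in state-of-the-art accuracy.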
Recent work on human pose estimation mainly focuses on designing more effective deep network structures as human feature extractors, and most of these feature extraction networks use only the position of each anatomical keypoint to guide their training. However, we found that some human anatomical keypoints keep an invariant topology, which can help to localize them more accurately when detecting keypoints on the feature map. To the best of our knowledge, no prior work has specifically studied this. Thus, in this paper, we present a novel 2D human pose estimation method with explicit anatomical keypoint structure constraints, which introduces into the loss objective a topology constraint term consisting of the differences between the keypoint-to-keypoint distances and directions and their ground truth. More importantly, our proposed model can be plugged into most existing bottom-up or top-down human pose estimation methods and improve their performance. Extensive experiments on the COCO keypoint benchmark show that our method performs favorably against most existing bottom-up and top-down human pose estimation methods. In particular, when our model is plugged into Lite-HRNet, its AP scores rise by 2.9% and 3.3% on the COCO val2017 and test-dev2017 datasets, respectively.
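One possible reading of the topology constraint term is sketched below: for each connected keypoint pair, penalize the difference in length and in direction between the predicted and ground-truth segments. The equal weighting of the two parts, the use of angles in radians, and the function name are assumptions; the abstract does not give the exact formula.

```python
import math


def topology_loss(pred, gt, edges):
    """Mean over edges of |length difference| + |direction difference|.

    pred, gt: lists of (x, y) keypoints in the same order.
    edges:    list of (i, j) index pairs that define the skeleton.
    """
    loss = 0.0
    for i, j in edges:
        pdx, pdy = pred[j][0] - pred[i][0], pred[j][1] - pred[i][1]
        gdx, gdy = gt[j][0] - gt[i][0], gt[j][1] - gt[i][1]
        dist_term = abs(math.hypot(pdx, pdy) - math.hypot(gdx, gdy))
        angle = math.atan2(pdy, pdx) - math.atan2(gdy, gdx)
        # Wrap the angle difference into [-pi, pi] before taking its magnitude.
        angle = math.atan2(math.sin(angle), math.cos(angle))
        loss += dist_term + abs(angle)
    return loss / len(edges)


gt = [(0, 0), (10, 0)]
rotated = [(0, 0), (0, 10)]          # same limb length, rotated by 90 degrees
loss_rotated = topology_loss(rotated, gt, edges=[(0, 1)])
loss_perfect = topology_loss(gt, gt, edges=[(0, 1)])
```

A term of this kind depends only on relative geometry between keypoints, which is why it can be added to the loss of an existing bottom-up or top-down model.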
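Off-the-shelf single-stage multi-person pose regression methods generally use the instance score (i.e., the confidence of the instance localization) to indicate pose quality when selecting pose candidates. We argue that there are two gaps in the existing paradigm: 1) the instance score is not well correlated with the pose regression quality; 2) the instance feature representation, which is used to predict the instance score, does not explicitly encode structural pose information and therefore cannot predict a reasonable score that represents pose regression quality. To address these issues, we propose to learn a pose regression quality-aware representation. Concretely, for the first gap, instead of using the previous instance confidence labels (e.g., discrete {1,0} or a Gaussian representation) to denote the position and confidence of a human instance, we first introduce the Consistent Instance Representation (CIR), which unifies the pose regression quality score of the instance and the confidence of the background into a pixel-wise score map to calibrate the inconsistency between the instance score and the pose regression quality. To fill the second gap, we further propose the Query Encoding Module (QEM), which includes the Keypoint Query Encoding (KQE) to encode the positional and semantic information of each keypoint, and the Pose Query Encoding (PQE), which explicitly encodes the predicted structural pose information to better fit the CIR. With the proposed components, we significantly alleviate the above gaps. Our method outperforms previous single-stage regression-based methods and even bottom-up methods, achieving a state-of-the-art result of 71.7 AP on the MS COCO test-dev set.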
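Multi-person pose estimation methods generally follow the top-down or bottom-up paradigm, both of which can be considered two-stage approaches and therefore lead to high computation cost and low efficiency. Towards a compact and efficient pipeline for multi-person pose estimation, in this paper we propose to represent human parts as points and present a novel body representation, which uses an adaptive point set consisting of the human center and seven human-part-related points to represent the human instance in a more fine-grained manner. The novel representation is better able to capture various pose deformations and adaptively factorizes the long-range center-to-joint displacement, which yields a single-stage differentiable network that regresses multi-person poses more accurately, termed AdaptivePose. For inference, our proposed network eliminates grouping as well as refinement and needs only a single-step disentangling process to form multi-person poses. Without any bells and whistles, we achieve the best speed-accuracy trade-offs of 67.4% AP / 29.4 FPS with DLA-34 and 71.3% AP / 9.1 FPS on the COCO test-dev dataset.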
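The factorized center-to-joint displacement can be illustrated with two short hops instead of one long one: the center first points to a part-related point, which then points to the joint. The function name and the example offsets are hypothetical, and the sketch ignores how the network actually predicts the offsets.

```python
def decode_joint(center, center_to_part, part_to_joint):
    """Decode a joint through an intermediate part point.

    center:         (x, y) of the human center.
    center_to_part: (dx, dy) offset from the center to a part-related point.
    part_to_joint:  (dx, dy) offset from that part point to the joint.
    Returns (part_point, joint).
    """
    part = (center[0] + center_to_part[0], center[1] + center_to_part[1])
    joint = (part[0] + part_to_joint[0], part[1] + part_to_joint[1])
    return part, joint


# A long displacement to an ankle, split into center -> leg point -> ankle.
part, joint = decode_joint(center=(100, 80),
                           center_to_part=(-20, 30),
                           part_to_joint=(-5, 25))
```

Each hop covers a shorter range than a direct center-to-joint offset, which is the intuition behind factorizing the long-range displacement.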