由于高性能,基于2D热图的方法多年来一直占据了人类姿势估计(HPE)。但是,基于2D热图的方法中长期存在的量化错误问题导致了几个众所周知的缺点:1)低分辨率输入的性能受到限制; 2)为了改善特征图分辨率以提高本地化精度,需要多个昂贵的UP采样层; 3)采用额外的后处理以减少量化误差。为了解决这些问题,我们旨在探索一种称为\ textit {SIMCC}的全新方案,该方案将HPE重新定义为水平和垂直坐标的两个分类任务。提出的SIMCC均匀地将每个像素分为几个箱,从而实现\ emph {subpixel}本地化精度和低量化误差。从中受益,SIMCC可以在某些设置下省略其他细化后处理,并排除更简单和有效的HPE管道。通过可可,人群和MPII数据集进行的广泛实验表明,SIMCC优于基于热图的同行,尤其是在低分辨率设置中,较大的边距。
translated by 谷歌翻译
我们提出了一种直接的,基于回归的方法,以从单个图像中估计2D人姿势。我们将问题提出为序列预测任务,我们使用变压器网络解决了问题。该网络直接学习了从图像到关键点坐标的回归映射,而无需诉诸中间表示(例如热图)。这种方法避免了与基于热图的方法相关的许多复杂性。为了克服以前基于回归的方法的特征错位问题,我们提出了一种注意机制,该机制适应与目标关键最相关的功能,从而大大提高了准确性。重要的是,我们的框架是端到端的可区分,并且自然学会利用关键点之间的依赖关系。两个主要的姿势估计数据集在MS-Coco和MPII上进行的实验表明,我们的方法在基于回归的姿势估计中的最新方法显着改善。更值得注意的是,与最佳的基于热图的姿势估计方法相比,我们的第一种基于回归的方法是有利的。
translated by 谷歌翻译
Bottom-up human pose estimation methods have difficulties in predicting the correct pose for small persons due to challenges in scale variation. In this paper, we present HigherHRNet: a novel bottom-up human pose estimation method for learning scale-aware representations using high-resolution feature pyramids. Equipped with multi-resolution supervision for training and multiresolution aggregation for inference, the proposed approach is able to solve the scale variation challenge in bottom-up multi-person pose estimation and localize keypoints more precisely, especially for small person. The feature pyramid in HigherHRNet consists of feature map outputs from HRNet and upsampled higher-resolution outputs through a transposed convolution. HigherHR-Net outperforms the previous best bottom-up method by 2.5% AP for medium person on COCO test-dev, showing its effectiveness in handling scale variation. Furthermore, HigherHRNet achieves new state-of-the-art result on COCO test-dev (70.5% AP) without using refinement or other post-processing techniques, surpassing all existing bottom-up methods. HigherHRNet even surpasses all topdown methods on CrowdPose test (67.6% AP), suggesting its robustness in crowded scene. The code and models are available at https://github.com/HRNet/ Higher-HRNet-Human-Pose-Estimation.
translated by 谷歌翻译
In this paper, we are interested in the human pose estimation problem with a focus on learning reliable highresolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process.We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the mutliresolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich highresolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. In addition, we show the superiority of our network in pose tracking on the PoseTrack dataset. The code and models have been publicly available at https://github.com/leoxiaobin/ deep-high-resolution-net.pytorch.
translated by 谷歌翻译
Accurate whole-body multi-person pose estimation and tracking is an important yet challenging topic in computer vision. To capture the subtle actions of humans for complex behavior analysis, whole-body pose estimation including the face, body, hand and foot is essential over conventional body-only pose estimation. In this paper, we present AlphaPose, a system that can perform accurate whole-body pose estimation and tracking jointly while running in realtime. To this end, we propose several new techniques: Symmetric Integral Keypoint Regression (SIKR) for fast and fine localization, Parametric Pose Non-Maximum-Suppression (P-NMS) for eliminating redundant human detections and Pose Aware Identity Embedding for jointly pose estimation and tracking. During training, we resort to Part-Guided Proposal Generator (PGPG) and multi-domain knowledge distillation to further improve the accuracy. Our method is able to localize whole-body keypoints accurately and tracks humans simultaneously given inaccurate bounding boxes and redundant detections. We show a significant improvement over current state-of-the-art methods in both speed and accuracy on COCO-wholebody, COCO, PoseTrack, and our proposed Halpe-FullBody pose estimation dataset. Our model, source codes and dataset are made publicly available at https://github.com/MVIG-SJTU/AlphaPose.
translated by 谷歌翻译
人类的姿势估计旨在弄清不同场景中所有人的关键。尽管结果有希望,但目前的方法仍然面临一些挑战。现有的自上而下的方法单独处理一个人,而没有不同的人与所在的场景之间的相互作用。因此,当发生严重闭塞时,人类检测的表现会降低。另一方面,现有的自下而上方法同时考虑所有人,并捕获整个图像的全局知识。但是,由于尺度变化,它们的准确性不如自上而下的方法。为了解决这些问题,我们通过整合自上而下和自下而上的管道来探索不同接受场的视觉线索并实现其互补性,提出了一种新颖的双皮线整合变压器(DPIT)。具体而言,DPIT由两个分支组成,自下而上的分支介绍了整个图像以捕获全局视觉信息,而自上而下的分支则从单人类边界框中提取本地视觉的特征表示。然后,从自下而上和自上而下的分支中提取的特征表示形式被馈入变压器编码器,以交互融合全局和本地知识。此外,我们定义了关键点查询,以探索全景和单人类姿势视觉线索,以实现两个管道的相互互补性。据我们所知,这是将自下而上和自上而下管道与变压器与人类姿势估计的变压器相结合的最早作品之一。关于可可和MPII数据集的广泛实验表明,我们的DPIT与最先进的方法相当。
translated by 谷歌翻译
本文调查了2D全身人类姿势估计的任务,该任务旨在将整个人体(包括身体,脚,脸部和手)局部定位在整个人体上。我们提出了一种称为Zoomnet的单网络方法,以考虑到完整人体的层次结构,并解决不同身体部位的规模变化。我们进一步提出了一个称为Zoomnas的神经体系结构搜索框架,以促进全身姿势估计的准确性和效率。Zoomnas共同搜索模型体系结构和不同子模块之间的连接,并自动为搜索的子模块分配计算复杂性。为了训练和评估Zoomnas,我们介绍了第一个大型2D人类全身数据集,即可可叶全体V1.0,它注释了133个用于野外图像的关键点。广泛的实验证明了Zoomnas的有效性和可可叶v1.0的重要性。
translated by 谷歌翻译
我们提出Bapose,一种新颖的自下而上的方法,实现了多人姿态估计的最先进结果。我们的最终培训框架利用了解开的多尺度瀑布架构,并将自适应卷曲融合在拥挤的场景中更准确地推断出闭塞的关键点。由BAPOSE中的解开瀑布模块获得的多尺度表示,利用级联架构中进行逐行滤波的效率,同时保持与空间金字塔配置的多尺度视图相当。我们对挑战性的Coco和Crowdose数据集的结果表明,Bapose是多人姿态估计的高效且稳健的框架,实现了最先进的准确性的显着改善。
translated by 谷歌翻译
We propose a sparse end-to-end multi-person pose regression framework, termed QueryPose, which can directly predict multi-person keypoint sequences from the input image. The existing end-to-end methods rely on dense representations to preserve the spatial detail and structure for precise keypoint localization. However, the dense paradigm introduces complex and redundant post-processes during inference. In our framework, each human instance is encoded by several learnable spatial-aware part-level queries associated with an instance-level query. First, we propose the Spatial Part Embedding Generation Module (SPEGM) that considers the local spatial attention mechanism to generate several spatial-sensitive part embeddings, which contain spatial details and structural information for enhancing the part-level queries. Second, we introduce the Selective Iteration Module (SIM) to adaptively update the sparse part-level queries via the generated spatial-sensitive part embeddings stage-by-stage. Based on the two proposed modules, the part-level queries are able to fully encode the spatial details and structural information for precise keypoint regression. With the bipartite matching, QueryPose avoids the hand-designed post-processes and surpasses the existing dense end-to-end methods with 73.6 AP on MS COCO mini-val set and 72.7 AP on CrowdPose test set. Code is available at https://github.com/buptxyb666/QueryPose.
translated by 谷歌翻译
本文介绍了使用变压器解决关键点检测和实例关联的新方法。对于自下而上的多人姿势估计模型,他们需要检测关键点并在关键点之间学习关联信息。我们认为这些问题可以完全由变压器解决。具体而言,变压器中的自我关注测量任何一对位置之间的依赖性,这可以为关键点分组提供关联信息。但是,天真的注意力模式仍然没有主观控制,因此无法保证关键点始终会参加它们所属的实例。为了解决它,我们提出了一种监督多人关键点检测和实例关联的自我关注的新方法。通过使用实例掩码来监督自我关注的实例感知,我们可以基于成对引人注定分数为其对应的实例分配检测到的关键字,而无需使用预定义的偏移量字段或嵌入像基于CNN的自下而上模型。我们方法的另一个好处是可以从监督的注意矩阵直接获得任何数量的人的实例分段结果,从而简化了像素分配管道。对Coco多人关键点检测挑战和人实例分割任务的实验证明了所提出的方法的有效性和简单性,并显示出于针对特定目的控制自我关注行为的有希望的方法。
translated by 谷歌翻译
在诸如人类姿态估计的关键点估计任务中,尽管具有显着缺点,但基于热线的回归是主要的方法:Heatmaps本质上遭受量化误差,并且需要过多的计算来产生和后处理。有动力寻找更有效的解决方案,我们提出了一种新的热映射无关声点估计方法,其中各个关键点和空间相关的关键点(即,姿势)被建模为基于密集的单级锚的检测框架内的对象。因此,我们将我们的方法Kapao(发音为“KA-Pow!”)对于关键点并作为对象构成。我们通过同时检测人姿势对象和关键点对象并融合检测来利用两个对象表示的强度来将Kapao应用于单阶段多人人类姿势估算问题。在实验中,我们观察到Kapao明显比以前的方法更快,更准确,这极大地来自热爱处理后处理。此外,在不使用测试时间增强时,精度速度折衷特别有利。我们的大型型号Kapao-L在Microsoft Coco Keypoints验证集上实现了70.6的AP,而无需测试时增强,其比下一个最佳单级模型更准确,4.0 AP更准确。此外,Kapao在重闭塞的存在下擅长。在繁荣试验套上,Kapao-L为一个单级方法实现新的最先进的准确性,AP为68.9。
translated by 谷歌翻译
Recently, human pose estimation mainly focuses on how to design a more effective and better deep network structure as human features extractor, and most designed feature extraction networks only introduce the position of each anatomical keypoint to guide their training process. However, we found that some human anatomical keypoints kept their topology invariance, which can help to localize them more accurately when detecting the keypoints on the feature map. But to the best of our knowledge, there is no literature that has specifically studied it. Thus, in this paper, we present a novel 2D human pose estimation method with explicit anatomical keypoints structure constraints, which introduces the topology constraint term that consisting of the differences between the distance and direction of the keypoint-to-keypoint and their groundtruth in the loss object. More importantly, our proposed model can be plugged in the most existing bottom-up or top-down human pose estimation methods and improve their performance. The extensive experiments on the benchmark dataset: COCO keypoint dataset, show that our methods perform favorably against the most existing bottom-up and top-down human pose estimation methods, especially for Lite-HRNet, when our model is plugged into it, its AP scores separately raise by 2.9\% and 3.3\% on COCO val2017 and test-dev2017 datasets.
translated by 谷歌翻译
姿势估计在以人为本的视力应用中起关键作用。但是,由于高计算成本(每帧超过150 GMAC),很难在资源受限的边缘设备上部署最新的基于HRNET的姿势估计模型。在本文中,我们研究了在边缘实时多人姿势估计的有效体系结构设计。我们透露,通过我们的逐渐收缩实验,HRNET的高分辨率分支对于低计量区域的模型是多余的。删除它们可以提高效率和性能。受这一发现的启发,我们设计了LitePose,这是一种有效的单分支架构,用于姿势估计,并引入了两种简单的方法来增强LitePose的能力,包括Fusion Deconv Head和大型内核Corvs。 Fusion deconv头部删除了高分辨率分支中的冗余,从而使尺度感知的特征融合且开销低。大型内核会大大提高模型的能力和接受场,同时保持低计算成本。只有25%的计算增量,7x7内核的实现+14.0地图优于人群数据集上的3x3内核。在移动平台上,LitePose与先前最新的有效姿势估计模型相比,LitePose将潜伏期最高可达5.0倍,而无需牺牲性能,从而推动了实时多人姿势估计的边界。我们的代码和预培训模型在https://github.com/mit-han-lab/litepose上发布。
translated by 谷歌翻译
我们观察到,由于不同身体部位的生物学约束,人类的姿势表现出强大的群体结构相关性和空间耦合。可以探索这种群体结构相关性,以提高人类姿势估计的准确性和鲁棒性。在这项工作中,我们开发了一个自我控制的预测验证网络,以表征和学习训练过程中关键点之间的结构相关性。在推理阶段,来自验证网络的反馈信息使我们能够进一步优化姿势预测,从而显着提高了人类姿势估计的性能。具体而言,我们根据人体的生物结构将关键点分组分组。在每个组中,关键点进一步分为两个子集,高信心基础关键点和低信心终端关键点。我们开发一个自我约束的预测验证网络,以在这些关键点子集之间执行前向和向后的预测。姿势估计以及通用预测任务中的一个基本挑战是,由于无法获得地面真相,因此我们没有机制可以验证获得的姿势估计或预测结果是否准确。一旦成功学习,验证网络将用作前向姿势预测的准确性验证模块。在推理阶段,它可用于指导低保持信心关键点的姿势估计结果的局部优化,而高信心关键点的自我约束损失是目标函数。我们对基准MS可可和人群数据集的广泛实验结果表明,所提出的方法可以显着改善姿势估计结果。
translated by 谷歌翻译
The topic of multi-person pose estimation has been largely improved recently, especially with the development of convolutional neural network. However, there still exist a lot of challenging cases, such as occluded keypoints, invisible keypoints and complex background, which cannot be well addressed. In this paper, we present a novel network structure called Cascaded Pyramid Network (CPN) which targets to relieve the problem from these "hard" keypoints. More specifically, our algorithm includes two stages: Glob-alNet and RefineNet. GlobalNet is a feature pyramid network which can successfully localize the "simple" keypoints like eyes and hands but may fail to precisely recognize the occluded or invisible keypoints. Our RefineNet tries explicitly handling the "hard" keypoints by integrating all levels of feature representations from the Global-Net together with an online hard keypoint mining loss. In general, to address the multi-person pose estimation problem, a top-down pipeline is adopted to first generate a set of human bounding boxes based on a detector, followed by our CPN for keypoint localization in each human bounding box. Based on the proposed algorithm, we achieve stateof-art results on the COCO keypoint benchmark, with average precision at 73.0 on the COCO test-dev dataset and 72.1 on the COCO test-challenge dataset, which is a 19% relative improvement compared with 60.5 from the COCO 2016 keypoint challenge. Code 1 and the detection results are publicly available for further research.
translated by 谷歌翻译
In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on the flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories in different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark at both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.
translated by 谷歌翻译
对新生儿的运动和姿势评估使经验丰富的儿科医生可以预测神经发育障碍,从而可以早期干预相关疾病。但是,大多数用于人类姿势估计方法的最新AI方法都集中在成年人上,缺乏公开基准的婴儿姿势估计。在本文中,我们通过提出婴儿姿势数据集和深度聚合视觉变压器来填补这一空白,以进行人姿势估计,该姿势估计引入了一个快速训练的完整变压器框架,而无需使用卷积操作在早期阶段提取功能。它将变压器 + MLP概括为特征图内的高分辨率深层聚集,从而在不同视力级别之间实现信息融合。我们在可可姿势数据集上预先训练,并将其应用于新发布的大规模婴儿姿势估计数据集。结果表明,凝集可以有效地学习不同分辨率之间的多尺度特征,并显着提高婴儿姿势估计的性能。我们表明,在婴儿姿势估计数据集中,凝集优于混合模型hrformer和tokenpose。此外,在可可瓣姿势估计上,我们的凝集表现优于0.8 AP。我们的代码可在github.com/szar-lab/aggpose上获得。
translated by 谷歌翻译
本文提出了一种名为定位变压器(LOTR)的新型变压器的面部地标定位网络。所提出的框架是一种直接坐标回归方法,利用变压器网络以更好地利用特征图中的空间信息。 LOTR模型由三个主要模块组成:1)将输入图像转换为特征图的视觉骨干板,2)改进Visual Backone的特征表示,以及3)直接预测的地标预测头部的变压器模块来自变压器的代表的地标坐标。给定裁剪和对齐的面部图像,所提出的LOTR可以训练结束到底,而无需任何后处理步骤。本文还介绍了光滑翼损失功能,它解决了机翼损耗的梯度不连续性,导致比L1,L2和机翼损耗等标准损耗功能更好地收敛。通过106点面部地标定位的第一个大挑战提供的JD地标数据集的实验结果表明了LOTR在排行榜上的现有方法和最近基于热爱的方法的优势。在WFLW DataSet上,所提出的Lotr框架与若干最先进的方法相比,展示了有希望的结果。此外,我们在使用我们提出的LOTRS面向对齐时,我们报告了最先进的面部识别性能的提高。
translated by 谷歌翻译
现成的单阶段多人姿势回归方法通常利用实例得分(即,实例定位的置信度)来指示用于选择姿势候选的姿势质量。我们认为现有范式中有两个差距:〜1)实例分数与姿势回归质量不充分相互关联。〜2)实例特征表示,用于预测实例分数,不会明确地编码结构构成信息预测代表姿势回归质量的合理分数。为了解决上述问题,我们建议学习姿势回归质量感知的表现。具体地,对于第一间隙,而不是使用前一个实例置信度标签(例如,离散{1,0}或高斯表示)来表示人类实例的位置和置信度,我们首先介绍一个统一的实例表示(cir)构成回归质量分数的实例和背景到像素明智的评分映射的置信度,以校准实例分数与姿势回归质量之间的不一致。为了填充第二间隙,我们进一步提出了包括KeyPoint查询编码(KQE)的查询编码模块(QEM)来对每个键盘的位置和语义信息和姿态查询编码(PQE)进行编码,该姿势查询编码(PQE)明确地编码预测的结构姿势信息为了更好地拟合一致的实例表示(CIR)。通过使用拟议的组件,我们显着减轻了上述空白。我们的方法优于以前的基于单级回归的甚至自下而上的方法,实现了71.7 AP在MS Coco Test-Dev集上的最先进结果。
translated by 谷歌翻译
State-of-the-art human pose estimation methods are based on heat map representation. In spite of the good performance, the representation has a few issues in nature, such as non-differentiable postprocessing and quantization error. This work shows that a simple integral operation relates and unifies the heat map representation and joint regression, thus avoiding the above issues. It is differentiable, efficient, and compatible with any heat map based methods. Its effectiveness is convincingly validated via comprehensive ablation experiments under various settings, specifically on 3D pose estimation, for the first time.
translated by 谷歌翻译