Most recent head pose estimation (HPE) methods are dominated by the Euler angle representation. To avoid its inherent ambiguity in rotation labels, alternative quaternion-based and vector-based representations have been introduced. However, neither of them is visually intuitive, and both are often derived from equivocal Euler angle labels. In this paper, we present a novel single-stage keypoint-based method via an {\it intuitive} and {\it unconstrained} 2D cube representation for joint head detection and pose estimation. The 2D cube is an orthogonal projection of a 3D regular hexahedron label roughly surrounding one head, and it encodes the head location itself. It reflects the head orientation straightforwardly and unambiguously at any rotation angle. Unlike general 6-DoF object pose estimation, our 2D cube ignores the 3-DoF of head size but retains the 3-DoF of head pose. Based on the prior of equal side lengths, we can effortlessly obtain a closed-form solution for the Euler angles from the predicted 2D head cube instead of applying the error-prone PnP algorithm. In experiments, our proposed method achieves results comparable to other representative methods on the public AFLW2000 and BIWI datasets. Besides, a novel test on the CMU Panoptic dataset shows that our method can be seamlessly adapted to the unconstrained full-view HPE task without modification.
With many practical applications in human life, including surveillance cameras and the analysis of customer behavior, many researchers have paid attention to face detection and head pose estimation from digital images. Numerous deep learning models with state-of-the-art accuracy have been proposed, such as YOLO, SSD and MTCNN for the face detection problem, and HopeNet and FSA-Net for the head pose estimation problem. In many state-of-the-art methods, the pipeline for this task consists of two parts, from face detection to head pose estimation. These two steps are completely independent and do not share information, which makes the models clean in their setup but fails to exploit most of the features extracted by each model. In this paper, we propose a Multitask-Net model with the motivation of reusing the features extracted by the face detection model, sharing them with the head pose estimation branch to improve accuracy. In addition, given the variety of data, the Euler angle domain representing faces is large, and our model can predict results over the full 360-degree Euler angle domain. Applying a multitask learning approach, the Multitask-Net model can simultaneously predict the position and orientation of a human head. To improve the model's ability to predict head orientation, we change the representation of the human face from Euler angles to vectors of the rotation matrix.
In this paper, we present a method for unconstrained end-to-end head pose estimation. We address the problem of ambiguous rotation labels by introducing the rotation matrix formalism for our ground truth data and propose a continuous 6D rotation matrix representation for efficient and robust direct regression. This way, our method can learn the full rotation appearance, in contrast to previous approaches that restrict pose prediction to a narrow angle range in order to obtain satisfactory results. In addition, we propose a geodesic distance-based loss to penalize our network with respect to the SO(3) manifold geometry. Experiments on the public AFLW2000 and BIWI datasets demonstrate that our proposed method significantly outperforms other state-of-the-art methods by up to 20\%. We open-source our training and testing code along with our pre-trained models: https://github.com/thohemp/6DRepNet.
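The continuous 6D representation and the geodesic loss described above can be sketched as follows. This is a minimal NumPy illustration of the general technique (Gram-Schmidt orthogonalization of two predicted 3D vectors, plus the rotation angle of the relative rotation as distance), not the authors' released implementation; the function names are our own.

```python
import numpy as np

def rotation_from_6d(x):
    """Map a 6D vector to a rotation matrix via Gram-Schmidt orthogonalization."""
    a1, a2 = x[:3], x[3:]
    b1 = a1 / np.linalg.norm(a1)              # first orthonormal row
    b2 = a2 - np.dot(b1, a2) * b1             # remove the component along b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                     # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=0)

def geodesic_distance(R1, R2):
    """Geodesic distance on SO(3): the angle of the relative rotation R1^T R2."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

Unlike Euler angles or quaternions, this mapping is continuous, which is what makes direct regression well behaved.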
In this paper, we introduce a novel method for estimating the head pose of people in single images, starting from a small set of head keypoints. To this end, we propose a regression model that exploits keypoints computed automatically by a 2D pose estimation algorithm and outputs the head pose represented by yaw, pitch and roll. Our model is simple to implement and more efficient than the state of the art, faster in inference and smaller in memory footprint, with comparable accuracy. Our method also provides a measure of the heteroscedastic uncertainty associated with the three angles through an appropriately designed loss function; we show a correlation between error and uncertainty values, so this additional source of information can be used in subsequent computational steps. As an example application, we address social interaction analysis in images: we propose an algorithm to quantitatively estimate the level of interaction between people, starting from their head poses and reasoning about their mutual positions. The code is available at https://github.com/cantarinigiorgio/HHP-Net.
Head pose estimation is a challenging task that aims to predict a three-dimensional orientation vector, serving many applications in human-robot interaction and customer behavior analysis. Previous studies have proposed precise methods for collecting head pose data, but these methods require expensive equipment such as depth cameras or complex laboratory setups. In this study, we introduce a new approach, with cost-effective and easy-to-set-up equipment, to collect head pose images, namely the UET-Headpose dataset, with top-view head pose data. This method uses an absolute orientation sensor instead of a depth camera, enabling quick setup while still ensuring good results. Through experiments, our dataset has been shown to differ in distribution from available datasets such as the CMU Panoptic dataset \cite{CMU}. Besides using the UET-Headpose dataset and other head pose datasets, we also introduce a full-range model based on FSA-Net, which significantly improves head pose estimation results on the UET-Headpose dataset, especially on top-view images. Moreover, the model is very lightweight and takes small-size images as input.
Monocular 3D object detection is an important task in autonomous driving. It can easily become intractable when the ego-car pose changes with respect to the ground plane, which is common due to slight fluctuations in road smoothness and slope. Lacking insight from industrial applications, existing methods on open datasets neglect camera pose information, which inevitably makes detectors vulnerable to camera extrinsic parameters. The perturbation of objects is very common in most autonomous driving cases for industrial products. To this end, we propose a novel method to capture camera pose so as to formulate detectors free from extrinsic perturbation. Specifically, the proposed framework predicts camera extrinsic parameters by detecting the vanishing point and horizon change. A converter is designed to rectify the perturbed features in the latent space. By doing so, our 3D detector works independently of extrinsic parameter variations and produces accurate results in realistic cases, e.g., potholed and uneven roads, which almost all existing monocular detectors fail to handle. Experiments demonstrate that our method outperforms other state-of-the-art approaches by a large margin on both the KITTI 3D and nuScenes datasets.
Most real-time human pose estimation approaches are based on detecting joint positions. Using the detected joint positions, the yaw and pitch of the limbs can be computed. However, the roll along the limb axis cannot be computed because this rotation axis remains unobserved, yet it is critical for applications such as sports analysis and computer animation. In this paper, we introduce orientation keypoints, a novel approach for estimating the full position and rotation of skeletal joints using only single-frame RGB images. Inspired by how motion-capture systems use a set of point markers to estimate full skeletal rotations, our method uses virtual markers to generate sufficient information to accurately infer rotations with simple post-processing. The rotation predictions improve upon the best reported mean error for joint angles by 48% and achieve 93% accuracy across 15 bone rotations. The method also improves upon the current state-of-the-art position results by 14% as measured by MPJPE on the principal dataset, and generalizes well to in-the-wild datasets.
This paper investigates the task of 2D whole-body human pose estimation, which aims to localize dense landmarks on the entire human body, including the body, feet, face, and hands. We propose a single-network approach, termed ZoomNet, that takes into account the hierarchical structure of the full human body and addresses the scale variation of different body parts. We further propose a neural architecture search framework, termed ZoomNAS, to promote both the accuracy and efficiency of whole-body pose estimation. ZoomNAS jointly searches the model architecture and the connections between different sub-modules, and automatically allocates computational complexity to the searched sub-modules. To train and evaluate ZoomNAS, we introduce the first large-scale 2D human whole-body dataset, namely COCO-WholeBody V1.0, which annotates 133 keypoints for in-the-wild images. Extensive experiments demonstrate the effectiveness of ZoomNAS and the significance of COCO-WholeBody V1.0.
Face pose estimation refers to the task of predicting face orientation from a single RGB image. It is an important research topic with a wide range of applications in computer vision. Label distribution learning (LDL) based methods have recently been proposed for face pose estimation and have achieved promising results. However, existing LDL methods have two major problems. First, the expectation of the label distribution is biased, leading to biased pose estimation. Second, fixed distribution parameters are used for all learning samples, severely limiting the model capability. In this paper, we propose an anisotropic spherical Gaussian (ASG) based LDL approach for face pose estimation. In particular, our method adopts a spherical Gaussian distribution on the unit sphere, which consistently yields an unbiased expectation. Meanwhile, we introduce a new loss function that allows the network to learn the distribution parameters of each learning sample flexibly. Extensive experimental results show that our method sets new state-of-the-art records on the AFLW2000 and BIWI datasets.
Multi-person 3D pose estimation is a challenging task because of occlusion and depth ambiguity, especially in crowded scenes. To address these issues, most existing methods explore modeling body context cues by enhancing feature representations with graph neural networks or adding structural constraints. However, these methods are not robust because of their single-root formulation, which decodes 3D poses from a root node with a predefined graph. In this paper, we propose GR-M3D, which models \textbf{M}ulti-person \textbf{3D} pose estimation with dynamic \textbf{G}raph \textbf{R}easoning. The decoding graph in GR-M3D is predicted instead of predefined. In particular, it first generates several data maps and enhances them with a scale and depth aware refinement module (SDAR). Then multiple root keypoints and dense decoding paths for each person are estimated from these data maps. Based on them, a dynamic decoding graph is built by assigning path weights to the decoding paths, where the path weights are inferred from those enhanced data maps. This process is named dynamic graph reasoning (DGR). Finally, the 3D pose of each detected person is decoded according to its dynamic decoding graph. GR-M3D can adjust the structure of the decoding graph implicitly by adopting soft path weights according to the input data, which makes the decoding graphs best adapted to different input persons and more capable of handling occlusion and depth ambiguity than previous methods. We empirically show that the proposed bottom-up approach even outperforms top-down methods and achieves state-of-the-art results on three 3D pose datasets.
Accurate whole-body multi-person pose estimation and tracking is an important yet challenging topic in computer vision. To capture the subtle actions of humans for complex behavior analysis, whole-body pose estimation including the face, body, hand and foot is essential beyond conventional body-only pose estimation. In this paper, we present AlphaPose, a system that can perform accurate whole-body pose estimation and tracking jointly while running in realtime. To this end, we propose several new techniques: Symmetric Integral Keypoint Regression (SIKR) for fast and fine localization, Parametric Pose Non-Maximum-Suppression (P-NMS) for eliminating redundant human detections and Pose Aware Identity Embedding for joint pose estimation and tracking. During training, we resort to Part-Guided Proposal Generator (PGPG) and multi-domain knowledge distillation to further improve the accuracy. Our method is able to localize whole-body keypoints accurately and track humans simultaneously given inaccurate bounding boxes and redundant detections. We show a significant improvement over current state-of-the-art methods in both speed and accuracy on COCO-wholebody, COCO, PoseTrack, and our proposed Halpe-FullBody pose estimation dataset. Our model, source codes and dataset are made publicly available at https://github.com/MVIG-SJTU/AlphaPose.
Monocular 3D object detection is of great significance for autonomous driving but remains challenging. The core challenge is to predict the distance of objects in the absence of explicit depth information. Unlike regressing the distance as a single variable as in most existing methods, we propose a novel geometry-based distance decomposition to recover the distance from its factors. The decomposition factorizes the distance of an object into its most representative and stable variables, i.e., the physical height and the projected visual height in the image plane. Moreover, the decomposition maintains the self-consistency between the two heights, leading to robust distance prediction even when both predicted heights are inaccurate. The decomposition also enables us to trace the causes of distance uncertainty in different scenarios. Such decomposition makes the distance prediction interpretable, accurate, and robust. Our method directly predicts 3D bounding boxes from RGB images with a compact architecture, making the training and inference simple and efficient. Experimental results show that our method achieves state-of-the-art performance on the monocular 3D object detection and bird's-eye-view tasks of the KITTI dataset, and can generalize to images with different camera intrinsics.
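The decomposition above rests on the pinhole relation between an object's physical height and its projected height: distance d = f * H / h, where f is the focal length in pixels, H the physical height in metres, and h the projected height in pixels. A minimal sketch with hypothetical numbers (not the paper's network, just the underlying geometry):

```python
def distance_from_heights(f: float, H: float, h: float) -> float:
    """Recover object distance from physical and projected height (pinhole model).

    f: focal length in pixels, H: physical height in metres,
    h: projected visual height in pixels.
    """
    return f * H / h

# Illustrative values: a 1.5 m tall object spanning 36 px at f = 720 px.
d = distance_from_heights(f=720.0, H=1.5, h=36.0)  # -> 30.0 metres
```

Because both H and h are regressed separately, errors in either factor can be traced, which is what makes the distance uncertainty in this formulation interpretable.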
With the emergence of service robots and surveillance cameras, dynamic face recognition (DFR) in the wild has received much attention in recent years. Face detection and head pose estimation are two important steps for DFR. Very often, the pose is estimated after face detection; however, this sequential computation leads to higher latency. In this paper, we propose a low-latency and lightweight network for simultaneous face detection, landmark localization, and head pose estimation. Inspired by the observation that it is more challenging to locate the facial landmarks of faces at large angles, a pose loss is proposed to constrain the learning. Moreover, we also propose an uncertainty multi-task loss to learn the weights of the individual tasks automatically. Another challenge is that robots often use low-computation units such as ARM-based computing cores, so we often need to use lightweight networks instead of heavy ones, which leads to performance degradation, especially for small and hard faces. In this paper, we propose online feedback sampling to augment the training samples across different scales, which automatically increases the diversity of the training data. Through validation on the commonly used WIDER FACE, AFLW, and AFLW2000 datasets, the results show that the proposed method achieves state-of-the-art performance with low computational resources. Code and data will be available at https://github.com/lyp-deeplearning/MOS-Multi-Task-Face-Detect.
We present a method for 3D object detection and pose estimation from a single image. In contrast to current techniques that only regress the 3D orientation of an object, our method first regresses relatively stable 3D object properties using a deep convolutional neural network and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D bounding box. The first network output estimates the 3D object orientation using a novel hybrid discrete-continuous loss, which significantly outperforms the L2 loss. The second output regresses the 3D object dimensions, which have relatively little variance compared to alternatives and can often be predicted for many object types. These estimates, combined with the geometric constraints on translation imposed by the 2D bounding box, enable us to recover a stable and accurate 3D object pose. We evaluate our method on the challenging KITTI object detection benchmark [2] both on the official metric of 3D orientation estimation and also on the accuracy of the obtained 3D bounding boxes. Although conceptually simple, our method outperforms more complex and computationally expensive approaches that leverage semantic segmentation, instance level segmentation and flat ground priors [4] and sub-category detection [23][24]. Our discrete-continuous loss also produces state of the art results for 3D viewpoint estimation on the Pascal 3D+ dataset[26].
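The hybrid discrete-continuous orientation formulation above classifies the angle into one of several overlapping bins and regresses a residual within the chosen bin. A minimal decoding sketch of this general MultiBin-style idea (bin layout, names, and shapes are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def decode_multibin(bin_logits, residuals, bin_centers):
    """Decode an orientation angle from per-bin confidences and residuals.

    bin_logits: confidence score per discrete bin.
    residuals: regressed angular offset per bin (radians).
    bin_centers: fixed center angle of each bin (radians).
    """
    k = int(np.argmax(bin_logits))            # discrete step: pick the best bin
    theta = bin_centers[k] + residuals[k]     # continuous step: add its residual
    return (theta + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
```

Splitting the regression this way avoids the discontinuity of directly regressing a periodic angle, which is why the paper reports it outperforming a plain L2 loss.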
In this paper, we consider the challenging task of simultaneously locating and recovering multiple hands from a single 2D image. Previous studies either focus on single-hand reconstruction or solve this problem in a multi-stage way. Moreover, the conventional two-stage pipeline first detects hand regions and then estimates the 3D hand pose from each cropped patch. To reduce the computational redundancy in pre-processing and feature extraction, we propose a concise but efficient single-stage pipeline. Specifically, we design a multi-head auto-encoder structure for multi-hand reconstruction, where each head network shares the same feature map and outputs the hand center, pose and texture, respectively. Besides, we adopt a weakly-supervised scheme to alleviate the burden of expensive 3D real-world data annotations. To this end, we propose a series of losses optimized by a stage-wise training scheme, where a multi-hand dataset with 2D annotations is generated based on the publicly available single-hand datasets. To further improve the accuracy of the weakly-supervised model, we adopt several feature consistency constraints in both single- and multi-hand settings. Specifically, the keypoints of each hand estimated from local features should be consistent with the re-projected points predicted from global features. Extensive experiments on public benchmarks including FreiHAND, HO3D, InterHand2.6M and RHD demonstrate that our method outperforms state-of-the-art model-based methods in both the weakly-supervised and fully-supervised manners. Code and models are available at {\url{https://github.com/zijinxuxu/SMHR}}.
Although monocular 3D pose estimation seems to have achieved very accurate results on public datasets, its generalization ability is largely overlooked. In this work, we perform a systematic evaluation of existing methods and find that they get notably larger errors when tested on different cameras, human poses and appearances. To address the problem, we introduce VirtualPose, a two-stage learning framework to exploit the hidden "free lunch" specific to this task, i.e., generating an infinite number of poses and cameras for training models at no cost. To this end, the first stage transforms images into abstract geometry representations (AGR), and then the second stage maps them to 3D poses. It addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearances; (2) the second stage can be trained on diverse AGRs synthesized from a large number of virtual cameras and poses. It outperforms SOTA methods without using any paired images and 3D poses, which paves the way for practical applications. Code is available at https://github.com/wkom/VirtualPose.
The detection of the human body and its related parts (e.g., face, head or hands) has been intensively studied and greatly improved since the breakthrough of deep CNNs. However, most of these detectors are trained independently, making it a challenging task to associate detected body parts with people. This paper focuses on the problem of joint detection of the human body and its corresponding parts. Specifically, we propose a novel extended object representation that integrates the center location offsets of the body or its parts, and construct a dense single-stage anchor-based Body-Part Joint Detector (BPJDet). Body-part associations in BPJDet are embedded into the unified representation, which contains both semantic and geometric information. Therefore, BPJDet does not suffer from error-prone association post-matching, and has a better accuracy-speed trade-off. Furthermore, BPJDet can be seamlessly generalized to jointly detect any body part. To verify the effectiveness and superiority of our method, we conduct extensive experiments on the CityPersons, CrowdHuman and BodyHands datasets. The proposed BPJDet detector achieves state-of-the-art association performance on these three benchmarks while maintaining high detection accuracy. Code will be released to facilitate further studies.
Recently, human pose estimation has mainly focused on designing more effective deep network structures as human feature extractors, and most designed feature extraction networks only introduce the position of each anatomical keypoint to guide the training process. However, we found that some human anatomical keypoints keep their topological invariance, which can help to localize them more accurately when detecting the keypoints on the feature map. But to the best of our knowledge, no literature has specifically studied this. Thus, in this paper, we present a novel 2D human pose estimation method with explicit anatomical keypoint structure constraints, which introduces a topology constraint term into the loss objective, consisting of the differences between the keypoint-to-keypoint distances and directions and their ground truth. More importantly, our proposed model can be plugged into most existing bottom-up or top-down human pose estimation methods to improve their performance. Extensive experiments on the benchmark COCO keypoint dataset show that our method performs favorably against most existing bottom-up and top-down human pose estimation methods; in particular for Lite-HRNet, when our model is plugged into it, its AP scores rise by 2.9\% and 3.3\% on the COCO val2017 and test-dev2017 datasets, respectively.
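A topology constraint of the kind described, penalizing deviations in both the distance and the direction of keypoint-to-keypoint vectors from their ground truth, can be sketched as follows. This is our own simplified formulation for illustration, not the paper's exact loss:

```python
import numpy as np

def topology_loss(pred, gt, pairs):
    """Penalize bone length and direction deviations between keypoint pairs.

    pred, gt: (K, 2) arrays of 2D keypoint coordinates.
    pairs: list of (i, j) index pairs defining keypoint-to-keypoint edges.
    """
    loss = 0.0
    for i, j in pairs:
        vp, vg = pred[j] - pred[i], gt[j] - gt[i]
        # Distance term: difference of edge lengths.
        loss += abs(np.linalg.norm(vp) - np.linalg.norm(vg))
        # Direction term: 1 - cosine similarity of the two edge vectors.
        denom = np.linalg.norm(vp) * np.linalg.norm(vg) + 1e-8
        loss += 1.0 - np.dot(vp, vg) / denom
    return loss / len(pairs)
```

Such a term would typically be added to the standard keypoint heatmap or coordinate loss with a weighting factor.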
The cost of head viewpoint labels is the main obstacle to improving fine-grained head pose estimation algorithms. One solution to the lack of massive labels is self-supervised learning (SSL). SSL can extract good features from unlabeled data for downstream tasks. Hence, this paper tries to show the differences between SSL approaches for head pose estimation. Typically, there are two main approaches to using SSL: (1) using it to pre-train weights, and (2) using SSL as an auxiliary task alongside supervised learning (SL) during a single training run. In this paper, we evaluate both approaches by designing a hybrid multi-task learning (HMTL) architecture and using two SSL pretext tasks, rotation and puzzling. The results show that the combination of the two approaches, in which rotation is used for pre-training and puzzling is used for the auxiliary head, is superior. The error rate is reduced by 23.1% compared with the baseline, which is comparable to the current SOTA methods. Finally, we compare the influence of initial weights on HMTL and SL. With HMTL, the error is reduced for all kinds of initial weights: random, ImageNet and SSL.
This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and use these vectors to vote for keypoint locations using RANSAC. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code will be available at https://zju-3dv.github.io/pvnet/.
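The voting step described above hypothesizes a keypoint location by intersecting the rays defined by two pixels and their regressed unit vectors; RANSAC then repeatedly samples such pairs and scores each hypothesis by its inliers. A sketch of the two-ray intersection at the core of hypothesis generation (our own illustration, not the released PVNet code):

```python
import numpy as np

def intersect_votes(p1, d1, p2, d2):
    """Intersect two voting rays p + t*d to hypothesize a keypoint location.

    p1, p2: 2D pixel coordinates; d1, d2: unit direction vectors at those pixels.
    Solves p1 + t1*d1 = p2 + t2*d2 for t1, t2 (assumes non-parallel rays).
    """
    A = np.stack([d1, -d2], axis=1)       # 2x2 system in (t1, t2)
    t = np.linalg.solve(A, p2 - p1)
    return p1 + t[0] * d1
```

Averaging over many such RANSAC hypotheses also yields a spatial covariance per keypoint, which is the uncertainty the abstract says the PnP solver can exploit.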