Traditionally, monocular 3D human pose estimation employs a machine learning model to predict the most likely 3D pose for a given input image. However, a single image can be highly ambiguous and induces multiple plausible solutions for the 2D-3D lifting step, which results in overly confident 3D pose predictors. To this end, we propose \emph{DiffPose}, a conditional diffusion model that predicts multiple hypotheses for a given input image. In comparison to similar approaches, our diffusion model is straightforward and avoids intensive hyperparameter tuning, complex network structures, mode collapse, and unstable training. Moreover, we tackle a problem of the common two-step approach that first estimates a distribution of 2D joint locations via joint-wise heatmaps and subsequently approximates them based on first- or second-moment statistics. Since such a simplification of the heatmaps removes valid information about possibly correct, though labeled unlikely, joint locations, we propose to represent the heatmaps as a set of 2D joint candidate samples. To extract information about the original distribution from these samples, we introduce our \emph{embedding transformer}, which conditions the diffusion model. Experimentally, we show that DiffPose slightly improves upon the state of the art for multi-hypothesis pose estimation for simple poses and outperforms it by a large margin for highly ambiguous poses.
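The idea of representing a heatmap as a set of 2D joint candidate samples, rather than collapsing it to a single peak or mean, can be illustrated with a minimal sketch. This is not the DiffPose implementation: the function name `sample_joint_candidates`, the toy heatmap, and the choice to sample pixels with probability proportional to the heatmap value are all assumptions made for illustration.

```python
import random


def sample_joint_candidates(heatmap, num_samples, seed=0):
    """Draw (x, y) pixel candidates with probability proportional to the
    heatmap value (hypothetical helper, not taken from the paper)."""
    rng = random.Random(seed)
    height, width = len(heatmap), len(heatmap[0])
    # Flatten row by row; clip negatives so values can serve as weights.
    weights = [max(value, 0.0) for row in heatmap for value in row]
    if sum(weights) <= 0.0:
        raise ValueError("heatmap has no positive mass to sample from")
    flat_indices = rng.choices(range(height * width), weights=weights, k=num_samples)
    # Convert each flat index back to (x, y) = (column, row).
    return [(index % width, index // width) for index in flat_indices]


# Toy heatmap with two modes: a stronger one at (x=1, y=1) and a weaker
# one at (x=3, y=2). An argmax would keep only the first.
heatmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4],
]
candidates = sample_joint_candidates(heatmap, num_samples=200)
modes_seen = set(candidates)
```

In this toy case the sample set retains the weaker mode that a first-moment summary would discard, which is the kind of information the abstract argues should be preserved.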
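Structured representations such as keypoints are widely used in pose transfer, conditional image generation, animation, and 3D reconstruction. However, their supervised learning requires expensive annotation for each target domain. We propose a self-supervised method that learns to disentangle object structure from appearance using a graph of 2D keypoints linked by straight edges. Both the keypoint locations and their pairwise edge weights are learned, given only a collection of images depicting the same object class. The graph is interpretable: for example, AutoLink recovers the human skeleton topology when applied to images showing people. Our key ingredients are i) an encoder that predicts keypoint locations in an input image, ii) a shared graph as a latent variable that links the same pairs of keypoints in every image, iii) an intermediate edge map that combines the latent graph edge weights and keypoint locations in a soft, differentiable manner, and iv) an inpainting objective on randomly masked images. Although simpler, AutoLink outperforms existing self-supervised methods on the established keypoint and pose estimation benchmarks and paves the way for structure-conditioned generative models on more diverse datasets.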
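Injury analysis may be one of the most beneficial applications of deep learning-based human pose estimation. To facilitate further research on this topic, we provide an injury-specific 2D dataset for alpine skiing, comprising 533 images in total. We further propose a post-processing routine that combines rotational information with a simple kinematic model. With it, we improve detection results in fall situations by up to 21% with respect to the PCK@0.2 metric.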
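For readers unfamiliar with PCK@0.2 (percentage of correct keypoints at a threshold of 0.2), the following is a minimal sketch of how such a score is commonly computed. The reference length used for normalization (torso size, head size, or bounding-box size) differs between benchmarks and is not specified in the abstract, so `ref_length` is an assumption here, as are the function name and the toy coordinates.

```python
import math


def pck(pred, gt, ref_length, alpha=0.2):
    """Fraction of keypoints whose Euclidean error is at most
    alpha * ref_length. `ref_length` is an assumed normalization length."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    threshold = alpha * ref_length
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return correct / len(gt)


# Toy example: four 2D keypoints and a reference length of 10 pixels,
# so a prediction counts as correct within 2 pixels of the ground truth.
gt = [(0, 0), (10, 0), (10, 10), (0, 10)]
pred = [(1, 0), (10, 3), (10, 10), (5, 10)]
score = pck(pred, gt, ref_length=10)  # errors are 1, 3, 0, 5
```

Here two of the four keypoints fall within the threshold, so the score is 0.5.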
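This paper addresses the problem of cross-dataset generalization of 3D human pose estimation models. Testing a pre-trained 3D pose estimator on a new dataset results in a major performance drop. Previous methods have mainly addressed this problem by improving the diversity of the training data. We argue that diversity alone is not sufficient and that the characteristics of the training data need to be adapted to those of the new dataset, such as camera viewpoint, position, human actions, and body size. To this end, we propose an end-to-end framework that generates synthetic 3D human motions from a source dataset and uses them to fine-tune a 3D pose estimator. The adaptation follows an adversarial training scheme. From a source 3D pose, the generator produces a sequence of 3D poses and a camera orientation that is used to project the generated poses to a novel view. Without any 3D labels or camera information, the framework successfully learns to create synthetic 3D poses resembling the target dataset while being trained only on 2D poses. In experiments on the Human3.6M, MPI-INF-3DHP, 3DPW, and Ski-Pose datasets, our method outperforms previous work in cross-dataset evaluation by 14% and previous semi-supervised learning methods that use partial 3D annotations by 16%.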
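Human pose estimation from single images is a challenging problem that is typically solved by supervised learning. Unfortunately, labeled training data does not yet exist for many human activities, since 3D annotation requires dedicated motion capture systems. Therefore, we propose an unsupervised approach that learns to predict a 3D human pose from a single image while being trained only with 2D pose data, which can be crowd-sourced and is already widely available. To this end, we estimate the 3D pose that is most likely over random projections, with the likelihood estimated using normalizing flows on 2D poses. While previous work requires strong priors on camera rotations in the training dataset, we learn the distribution of camera angles, which significantly improves performance. Another part of our contribution is to stabilize the training of normalizing flows on high-dimensional 3D pose data by first projecting the 2D poses to a linear subspace. We outperform state-of-the-art unsupervised human pose estimation methods on the benchmark datasets Human3.6M and MPI-INF-3DHP in many metrics.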
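Segmenting an image into its parts is a frequent preprocessing step for high-level vision tasks such as image editing. However, annotating masks for supervised training is expensive. Weakly supervised and unsupervised methods exist, but they depend on comparing pairs of images, such as multiple views, video frames, or transformed versions of a single image, which limits their applicability. To address this, we propose a GAN-based approach that generates images conditioned on latent masks, thereby removing the need for the full or weak annotations required by previous approaches. We show that such mask-conditioned image generation can be learned faithfully when the masks are conditioned in a hierarchical manner on latent keypoints that explicitly define the positions of parts. Without requiring supervision of masks or points, this strategy increases robustness to changes in viewpoint and object position. It also lets us generate image-mask pairs for training a segmentation network, which outperforms state-of-the-art unsupervised segmentation methods on established benchmarks.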
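Generative adversarial networks (GANs) have attained photo-realistic quality in image generation. However, how best to control the image content remains an open challenge. We introduce LatentKeypointGAN, a two-stage GAN that is trained on the classical GAN objective with internal conditioning on a set of spatial keypoints. These keypoints have associated appearance embeddings that respectively control the position and style of the generated objects and their parts. A major difficulty that we address with suitable network architectures and training schemes is disentangling the image into spatial and appearance factors without domain knowledge or supervision signals. We demonstrate that LatentKeypointGAN provides an interpretable latent space that can be used to rearrange the generated images by repositioning and exchanging keypoint embeddings, for example generating portraits by combining the eyes, nose, and mouth from different images. In addition, the explicit generation of keypoints and matching images enables a new, GAN-based method for unsupervised keypoint detection.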
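A long-standing goal in the field of sensory substitution is to enable sound perception for deaf people by visualizing audio content. Unlike existing models that translate between speech and text or between text and images, we target immediate, low-level audio-to-video translation that applies to generic environmental sounds as well as human speech. Since such a substitution is artificial and has no labels for supervised learning, our core contribution is to build a mapping from audio to video via high-level constraints. For speech, we additionally disentangle content (phones) from style (gender and dialect) by mapping them to a common disentangled latent space. Qualitative and quantitative results, including a user study, demonstrate that our unpaired translation approach preserves important audio features in the generated video, and that videos of faces and numbers are well suited for visualizing high-dimensional audio features that humans can parse to match and distinguish sounds, words, and speakers.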
The release of ChatGPT, a language model capable of generating text that appears human-like and authentic, has gained significant attention beyond the research community. We expect that the convincing performance of ChatGPT incentivizes users to apply it to a variety of downstream tasks, including prompting the model to simplify their own medical reports. To investigate this phenomenon, we conducted an exploratory case study. In a questionnaire, we asked 15 radiologists to assess the quality of radiology reports simplified by ChatGPT. Most radiologists agreed that the simplified reports were factually correct, complete, and not potentially harmful to the patient. Nevertheless, instances of incorrect statements, missed key medical findings, and potentially harmful passages were reported. While further studies are needed, the initial insights of this study indicate a great potential in using large language models like ChatGPT to improve patient-centered care in radiology and other medical domains.
Deep learning-based 3D human pose estimation performs best when trained on large amounts of labeled data, making combined learning from many datasets an important research direction. One obstacle to this endeavor is that datasets provide different skeleton formats, i.e., they do not label the same set of anatomical landmarks. There is little prior research on how best to supervise one model with such discrepant labels. We show that simply using separate output heads for different skeletons results in inconsistent depth estimates and insufficient information sharing across skeletons. As a remedy, we propose a novel affine-combining autoencoder (ACAE) method to perform dimensionality reduction on the number of landmarks. The discovered latent 3D points capture the redundancy among skeletons, enabling enhanced information sharing when used for consistency regularization. Our approach scales to an extreme multi-dataset regime, where we use 28 3D human pose datasets to supervise one model, which outperforms prior work on a range of benchmarks, including the challenging 3D Poses in the Wild (3DPW) dataset. Our code and models are available for research purposes.
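The term "affine-combining" can be made concrete with a small sketch of the combination step alone. Each latent 3D point is computed as a weighted sum of the input landmarks with weights that sum to 1, which makes the mapping commute with translation. This is an illustration under stated assumptions, not the ACAE implementation: the function name `affine_combine`, the hand-picked weights (a real model would learn them), and the toy landmarks are all hypothetical.

```python
def affine_combine(points, weights):
    """Map J input 3D landmarks to L latent 3D points. Each weight row
    must sum to 1 (affine combination); negative weights are allowed."""
    for row in weights:
        if len(row) != len(points):
            raise ValueError("each weight row needs one weight per landmark")
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each weight row must sum to 1")
    return [
        tuple(sum(w * p[axis] for w, p in zip(row, points)) for axis in range(3))
        for row in weights
    ]


# Three toy landmarks reduced to two latent points (weights are hand-picked).
landmarks = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
weights = [
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
]
latent = affine_combine(landmarks, weights)

# Translating every landmark by the same offset translates every latent
# point by that offset, because each weight row sums to 1.
offset = (1.0, 2.0, 3.0)
shifted = [tuple(c + o for c, o in zip(p, offset)) for p in landmarks]
latent_shifted = affine_combine(shifted, weights)
```

The sum-to-one constraint is what distinguishes this from an arbitrary linear layer: the latent points stay attached to the body regardless of where the skeleton sits in space.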