The combination of artist-curated scans and deep implicit functions (IF) is enabling the creation of detailed, clothed 3D humans from images. However, existing methods are far from perfect. IF-based methods recover free-form geometry but produce disembodied limbs or degenerate shapes for unseen poses or clothes. To increase robustness in these cases, existing work uses an explicit parametric body model to constrain surface reconstruction, but this limits the recovery of free-form surfaces such as loose clothing that deviates from the body. What we want is a method that combines the best properties of implicit and explicit methods. To this end, we make two key observations: (1) current networks are better at inferring detailed 2D maps than full 3D surfaces, and (2) a parametric model can be seen as a "canvas" for stitching together detailed surface patches. ECON infers high-fidelity 3D humans even in loose clothes and challenging poses, with realistic faces and fingers. This goes beyond previous methods. Quantitative evaluation on the CAPE and Renderpeople datasets shows that ECON is more accurate than the state of the art. Perceptual studies also show that ECON's perceived realism is better by a large margin. Code and models are available for research purposes at https://xiuyuliang.cn/econ
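Current methods for learning realistic and animatable 3D clothed avatars need either posed 3D scans or 2D images with carefully controlled user poses. In contrast, our goal is to learn an avatar from only 2D images of people in unconstrained poses. Given a set of images, our method estimates a detailed 3D surface from each image and then combines these into an animatable avatar. Implicit functions are well suited to the first task, as they can capture details like hair and clothes. Current methods, however, are not robust to varied human poses and often produce 3D surfaces with broken or disembodied limbs, missing details, or non-human shapes. The problem is that these methods use global feature encoders that are sensitive to global pose. To address this, we propose ICON ("Implicit Clothed humans Obtained from Normals"), which uses local features instead. ICON has two main modules, both of which exploit the SMPL(-X) body model. First, ICON infers detailed clothed-human normals (front/back) conditioned on the SMPL(-X) normals. Second, a visibility-aware implicit surface regressor produces an iso-surface of a human occupancy field. Importantly, at inference time, a feedback loop alternates between refining the SMPL(-X) mesh using the inferred clothed normals and then refining the normals. Given multiple reconstructed frames of a subject in varied poses, we use SCANimate to produce an animatable avatar from them. Evaluation on the AGORA and CAPE datasets shows that ICON outperforms the state of the art in reconstruction, even with heavily limited training data. Additionally, it is much more robust to out-of-distribution samples, e.g., in-the-wild poses/images and out-of-frame cropping. ICON takes a step towards robust 3D clothed human reconstruction from in-the-wild images. This enables creating avatars directly from video with personalized and natural pose-dependent cloth deformation.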
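3D human body reconstruction from monocular images is an interesting and ill-posed problem in computer vision with wide applications in multiple domains. In this paper, we propose a novel end-to-end trainable network that accurately recovers the detailed geometry and appearance of 3D people from a monocular image. We propose a sparse and efficient fusion of a parametric body prior with a non-parametric peeled depth map representation of clothed models. The parametric body prior constrains our model in two ways: first, the network retains geometrically consistent body parts that are not occluded by clothing, and second, it provides a body shape context that improves the prediction of the peeled depth maps. This makes it possible to recover fine-grained 3D geometric details with only L1 losses on the 2D maps, given an input image. We evaluate SHARP on the publicly available Cloth3D and THuman datasets and report superior performance to state-of-the-art approaches.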
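While 3D human reconstruction methods using the pixel-aligned implicit function (PIFu) are developing fast, we observe that the quality of reconstructed details is still unsatisfactory. Flat facial surfaces frequently occur in PIFu-based reconstruction results. To this end, we propose a two-scale PIFu representation to enhance the quality of the reconstructed facial details. Specifically, we use two MLPs to separately represent the PIFus for the face and the human body. An MLP dedicated to 3D face reconstruction increases the network capacity and reduces the difficulty of reconstructing facial details compared with the previous one-scale PIFu representation. To remedy topology errors, we use 3 RGBD sensors to capture multi-view RGBD data as the input to the network, which is a sparse, lightweight capture setting. Since depth noise severely affects the reconstruction results, we design a depth refinement module that reduces the noise of the raw depths under the guidance of the input RGB images. We also propose an adaptive fusion scheme that fuses the predicted occupancy field of the body with the predicted occupancy field of the face to eliminate discontinuity artifacts at their boundaries. Experiments demonstrate the effectiveness of our approach in reconstructing vivid facial details and deforming body shapes, and verify its superiority over state-of-the-art methods.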
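To make 3D human avatars widely available, we must be able to generate a variety of 3D virtual humans with varied identities and shapes in arbitrary poses. This task is challenging due to the diversity of clothed body shapes, their complex articulations, and the resulting rich, yet stochastic, geometric detail in clothing. Hence, current methods for representing 3D people do not provide a full generative model of people in clothing. In this paper, we propose a novel method that learns to generate detailed 3D shapes of people in a variety of garments with corresponding skinning weights. Specifically, we devise a multi-subject forward skinning module that is learned from only a few posed, un-rigged scans per subject. To capture the stochastic nature of high-frequency details in garments, we use an adversarial loss formulation that encourages the model to capture the underlying statistics. We provide empirical evidence that this leads to realistic generation of local details such as wrinkles. We show that our model is able to generate natural human avatars wearing diverse and detailed clothing. Furthermore, we show that our method can be used for the task of fitting human models to raw scans, outperforming the previous state of the art.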
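To address the ill-posed problem caused by partial observations in monocular human volumetric capture, we present AvatarCap, a novel framework that introduces animatable avatars into the capture pipeline for high-fidelity reconstruction in both visible and invisible regions. Our method first creates an animatable avatar for the subject from a small number (~20) of 3D scans as a prior. Then, given a monocular RGB video of this subject, our method integrates information from both the image observations and the avatar prior, and accordingly reconstructs high-fidelity 3D textured models with dynamic details regardless of visibility. To learn an effective avatar for volumetric capture from only a few samples, we propose GeoTexAvatar, which uses both geometry and texture supervision to constrain the pose-dependent dynamics in a decomposed implicit manner. We further propose an avatar-conditioned volumetric capture method, involving canonical normal fusion and a reconstruction network, that integrates image observations and avatar dynamics for high-fidelity reconstruction in both observed and invisible regions. Overall, our method enables monocular human volumetric capture with detailed and pose-dependent dynamics, and experiments show that it outperforms the state of the art. Code is available at https://github.com/lizhe00/avatarcap.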
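Existing approaches to 3D garment reconstruction either assume a predefined template for the garment geometry (restricting them to fixed clothing styles) or yield vertex-colored meshes (lacking high-frequency texture details). Our novel framework co-learns geometric and semantic information from the input monocular image for template-free textured 3D garment digitization. More specifically, we propose to extend the peeled representation to predict pixel-aligned, layered depth and semantic maps from which 3D garments are extracted. The layered representation is further exploited to UV-parametrize the arbitrary surface of the extracted garment, without any human intervention, to form a UV atlas. The texture is then imparted on the UV atlas in a hybrid fashion: pixels from the input image are first projected to UV space for the visible region, and the occluded regions are then inpainted. Thus, we are able to digitize arbitrarily loose clothing styles while retaining high-frequency texture details from a monocular image. We achieve high-fidelity 3D garment reconstruction results on three publicly available datasets and show generalization to internet images.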
In this paper, we propose ARCH (Animatable Reconstruction of Clothed Humans), a novel end-to-end framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image. Existing approaches to digitize 3D humans struggle to handle pose variations and recover details. Also, they do not produce models that are animation ready. In contrast, ARCH is a learned pose-aware model that produces detailed 3D rigged full-body human avatars from a single unconstrained RGB image. A Semantic Space and a Semantic Deformation Field are created using a parametric 3D body estimator. They allow the transformation of 2D/3D clothed humans into a canonical space, reducing ambiguities in geometry caused by pose variations and occlusions in training data. Detailed surface geometry and appearance are learned using an implicit function representation with spatial local features. Furthermore, we propose additional per-pixel supervision on the 3D reconstruction using opacity-aware differentiable rendering. Our experiments indicate that ARCH increases the fidelity of the reconstructed humans. We obtain more than 50% lower reconstruction errors for standard metrics compared to state-of-the-art methods on public datasets. We also show numerous qualitative examples of animated, high-quality reconstructed avatars unseen in the literature so far.
Recent advances in image-based 3D human shape estimation have been driven by the significant improvement in representation power afforded by deep neural networks. Although current approaches have demonstrated their potential in real-world settings, they still fail to produce reconstructions with the level of detail often present in the input images. We argue that this limitation stems primarily from two conflicting requirements: accurate predictions require large context, but precise predictions require high resolution. Due to memory limitations in current hardware, previous approaches tend to take low-resolution images as input to cover large spatial context, and produce less precise (or low-resolution) 3D estimates as a result. We address this limitation by formulating a multi-level architecture that is end-to-end trainable. A coarse level observes the whole image at lower resolution and focuses on holistic reasoning. This provides context to a fine level, which estimates highly detailed geometry by observing higher-resolution images. We demonstrate that our approach significantly outperforms existing state-of-the-art techniques on single-image human shape reconstruction by fully leveraging 1k-resolution input images.
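Reconstructing accurate and intricate human geometry, shaped by varied poses and garments, from a single image is very challenging. Recently, works based on the pixel-aligned implicit function (PIFu) have taken a step forward and achieved state-of-the-art fidelity in image-based 3D human digitization. However, the training of PIFu relies heavily on expensive and limited 3D ground-truth data (i.e., synthetic data), which hinders its generalization to more diverse real-world images. In this work, we propose an end-to-end self-supervised network named SelfPIFu that exploits abundant and diverse in-the-wild images, resulting in largely improved reconstructions when tested on unconstrained in-the-wild images. At the core of SelfPIFu is depth-guided volume-/surface-aware signed distance field (SDF) learning, which enables self-supervised learning of a PIFu without access to ground-truth (GT) meshes. The whole framework consists of a normal estimator, a depth estimator, and an SDF-based PIFu, and makes better use of extra depth GT during training. Extensive experiments demonstrate the effectiveness of our self-supervised framework and the superiority of using depth as input. On synthetic data, our intersection-over-union (IoU) reaches 93.5%, 18% higher than PIFuHD. For in-the-wild images, we conduct user studies on the reconstructed results; our results are selected more than 68% of the time compared with other state-of-the-art methods.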
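The IoU figure quoted above compares a predicted occupancy against a ground-truth occupancy. As a rough illustration only, and not the authors' evaluation code, the sketch below computes IoU over a flat list of occupancy samples; the function name `occupancy_iou`, the 0.5 threshold, and the flat-list layout are all assumptions made for this example.

```python
def occupancy_iou(pred, gt, threshold=0.5):
    """Intersection-over-union of two occupancy samplings.

    pred, gt: equal-length sequences of occupancy values in [0, 1],
    sampled at the same 3D locations (hypothetical layout).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of samples")
    intersection = 0
    union = 0
    for p, g in zip(pred, gt):
        p_inside = p >= threshold  # sample predicted to lie inside the surface
        g_inside = g >= threshold  # sample truly inside the surface
        intersection += int(p_inside and g_inside)
        union += int(p_inside or g_inside)
    # Two empty shapes are treated as identical by convention here.
    return intersection / union if union else 1.0


# Four samples: two agree as inside, one agrees as outside, one false positive.
example_iou = occupancy_iou([0.9, 0.8, 0.2, 0.6], [1, 1, 0, 0])
```

In this toy case the intersection holds two samples and the union three, so `example_iou` is 2/3.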
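We propose a novel optimization-based paradigm for fitting 3D human models to images and scans. In contrast to existing approaches that directly regress the parameters of a low-dimensional statistical body model (e.g., SMPL) from input images, we train an ensemble of per-vertex neural field networks. The network predicts, in a distributed manner, the vertex descent direction based on neural features extracted at the current vertex projection. At inference, we employ this network, dubbed LVD, within a gradient-descent optimization pipeline until convergence, which typically occurs in a fraction of a second even when all vertices are initialized at a single point. An exhaustive evaluation demonstrates that our approach is able to capture the bodies of clothed people with very different body shapes, achieving a significant improvement over the state of the art. LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement over the SOTA with a much simpler and faster method.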
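The inference loop described above can be pictured as repeatedly moving each vertex by a predicted step until the steps become negligible. The sketch below is a minimal illustration under stated assumptions, not the LVD implementation: `descend_vertices` is a hypothetical helper, and `toy_predictor` is a hand-written stand-in for the trained network that simply steps halfway toward known target positions.

```python
import math


def descend_vertices(vertices, predict_step, max_iters=100, tol=1e-6):
    """Move every vertex by the step predicted at its current position.

    predict_step(index, position) returns a 3D displacement; in the paper
    this role is played by a learned network (replaced by a toy function here).
    """
    verts = [tuple(v) for v in vertices]
    if not verts:
        return verts
    for _ in range(max_iters):
        steps = [predict_step(i, v) for i, v in enumerate(verts)]
        verts = [tuple(c + s for c, s in zip(v, st)) for v, st in zip(verts, steps)]
        largest = max(math.sqrt(sum(s * s for s in st)) for st in steps)
        if largest < tol:  # every predicted step is negligible: converged
            break
    return verts


# Hypothetical targets standing in for the true vertex positions.
targets = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]


def toy_predictor(index, position):
    """Stand-in for the network: step halfway toward the known target."""
    return tuple(0.5 * (t - c) for t, c in zip(targets[index], position))


# All vertices start at the same point, mirroring the initialization in the text.
fitted = descend_vertices([(0.5, 0.5, 0.5)] * 3, toy_predictor)
```

Because the stand-in halves the remaining error at every iteration, the loop stops after a few dozen iterations with each vertex close to its target; a learned predictor would offer no such guarantee.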
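We present CrossHuman, a novel method that learns cross-guidance from a parametric human model and multi-frame RGB images to achieve high-quality 3D human reconstruction. To recover geometric details and texture even in invisible regions, we design a reconstruction pipeline that combines tracking-based and tracking-free methods. Given a monocular RGB sequence, we track the parametric human model over the whole sequence, and the points (voxels) corresponding to the target frame are warped to the reference frames by the parametric body motion. Guided by the geometric priors of the parametric body and spatially aligned features from the RGB sequence, a robust implicit surface is fused. Moreover, a multi-frame transformer (MFT) and a self-supervised warp refinement module are integrated into the framework to relax the requirements on the parametric body and to help handle very loose clothing. Compared with previous works, CrossHuman delivers high-fidelity geometric details and texture in both visible and invisible regions and improves the accuracy of human reconstruction even when the estimated parametric human model is inaccurate. Experiments demonstrate that our method achieves state-of-the-art (SOTA) performance.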
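Recent progress in 4D implicit representations focuses on globally controlling shape and motion with low-dimensional latent vectors, which is prone to missing surface details and accumulating tracking errors. While many deep local representations have shown promising results for 3D shape modeling, their 4D counterpart does not yet exist. In this paper, we fill this gap by proposing a novel local 4D implicit representation for dynamic clothed humans, named LoRD, which has the merits of both 4D human modeling and local representation, and enables high-fidelity reconstruction with detailed surface deformations, such as clothing wrinkles. In particular, our key insight is to encourage the network to learn latent codes of a local part-level representation that can explain the local geometry and temporal deformations. For inference at test time, we first estimate the inner body skeleton motion to track local parts at each time step, and then optimize the latent code for each part via auto-decoding based on different types of observed data. Extensive experiments demonstrate that the proposed method has a strong capability for representing 4D humans and outperforms state-of-the-art methods on practical applications, including 4D reconstruction from sparse points and non-rigid depth fusion, both qualitatively and quantitatively.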
To facilitate the analysis of human actions, interactions and emotions, we compute a 3D model of human body pose, hand pose, and facial expression from a single monocular image. To achieve this, we use thousands of 3D scans to train a new, unified, 3D model of the human body, SMPL-X, that extends SMPL with fully articulated hands and an expressive face. Learning to regress the parameters of SMPL-X directly from images is challenging without paired images and 3D ground truth. Consequently, we follow the approach of SMPLify, which estimates 2D features and then optimizes model parameters to fit the features. We improve on SMPLify in several significant ways: (1) we detect 2D features corresponding to the face, hands, and feet and fit the full SMPL-X model to these; (2) we train a new neural network pose prior using a large MoCap dataset; (3) we define a new interpenetration penalty that is both fast and accurate; (4) we automatically detect gender and the appropriate body models (male, female, or neutral); (5) our PyTorch implementation achieves a speedup of more than 8× over Chumpy. We use the new method, SMPLify-X, to fit SMPL-X to both controlled images and images in the wild. We evaluate 3D accuracy on a new curated dataset comprising 100 images with pseudo ground-truth. This is a step towards automatic expressive human capture from monocular RGB data. The models, code, and data are available for research purposes at https://smpl-x.is.tue.mpg.de.
Single-image 3D human reconstruction aims to reconstruct the 3D textured surface of the human body given a single image. While implicit function-based methods recently achieved reasonable reconstruction performance, they still bear limitations showing degraded quality in both surface geometry and texture from an unobserved view. In response, to generate a realistic textured surface, we propose ReFu, a coarse-to-fine approach that refines the projected backside view image and fuses the refined image to predict the final human body. To suppress the diffused occupancy that causes noise in projection images and reconstructed meshes, we propose to train occupancy probability by simultaneously utilizing 2D and 3D supervisions with occupancy-based volume rendering. We also introduce a refinement architecture that generates detail-preserving backside-view images with front-to-back warping. Extensive experiments demonstrate that our method achieves state-of-the-art performance in 3D human reconstruction from a single image, showing enhanced geometry and texture quality from an unobserved view.
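Parametric 3D body models like SMPL only represent minimally clothed people and are hard to extend to clothing because they have a fixed mesh topology and resolution. To address these limitations, recent work uses implicit surfaces or point clouds to model clothed bodies. While not limited by topology, such methods still struggle to model clothing that deviates significantly from the body, such as skirts and dresses. This is because they rely on the body to canonicalize the clothed surface by reposing it to a reference shape. Unfortunately, this process is poorly defined when clothing is far from the body. Additionally, they use linear blend skinning to pose the body, and the skinning weights are tied to the underlying body parts. In contrast, we model the clothing deformation in a local coordinate space without canonicalization. We also relax the skinning weights to let multiple body parts influence the surface. Specifically, we extend point-based methods with a coarse stage that replaces canonicalization with a learned pose-independent "coarse shape" that can capture the rough surface geometry of clothing like skirts. We then refine this using a network that infers the linear blend skinning weights and pose-dependent displacements from the coarse representation. The approach works well for garments that both conform to, and deviate from, the body. We demonstrate the usefulness of our approach by learning person-specific avatars from examples and then showing how they can be animated in new poses and motions. We also show that the method can learn directly from raw scans with missing data, greatly simplifying the process of creating realistic avatars. Code is available for research purposes at https://qianlim.github.io/skirt.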
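Linear blend skinning, mentioned above, poses a surface point as a weighted average of the positions it would take under each body part's rigid transform. The following is a minimal sketch of that standard formula, not the paper's code; the function name `skin_point` and the representation of transforms as `(rotation, translation)` pairs are assumptions made for illustration.

```python
def skin_point(point, weights, transforms):
    """Linear blend skinning of a single 3D point.

    weights: one skinning weight per body part, expected to sum to 1.
    transforms: one (R, t) pair per body part, where R is a 3x3 rotation
    given as nested lists and t is a length-3 translation (assumed layout).
    """
    if len(weights) != len(transforms):
        raise ValueError("need exactly one weight per transform")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("skinning weights must sum to 1")
    posed = [0.0, 0.0, 0.0]
    for w, (rotation, translation) in zip(weights, transforms):
        for row in range(3):
            # Position of the point under this part's rigid transform.
            moved = sum(rotation[row][col] * point[col] for col in range(3))
            moved += translation[row]
            posed[row] += w * moved  # blend by the part's weight
    return tuple(posed)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A point influenced equally by a static part and a part shifted 2 units in x.
blended = skin_point(
    (1.0, 1.0, 1.0),
    [0.5, 0.5],
    [(identity, (0.0, 0.0, 0.0)), (identity, (2.0, 0.0, 0.0))],
)
```

With equal weights the point lands halfway between its two per-part positions, at (2, 1, 1). Letting several parts carry non-zero weight, as in this toy case, is the kind of multi-part influence the abstract refers to when it relaxes the skinning weights.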
We introduce Structured 3D Features, a model based on a novel implicit 3D representation that pools pixel-aligned image features onto dense 3D points sampled from a parametric, statistical human mesh surface. The 3D points have associated semantics and can move freely in 3D space. This allows for optimal coverage of the person of interest, beyond just the body shape, which in turn, additionally helps modeling accessories, hair, and loose clothing. Owing to this, we present a complete 3D transformer-based attention framework which, given a single image of a person in an unconstrained pose, generates an animatable 3D reconstruction with albedo and illumination decomposition, as a result of a single end-to-end model, trained semi-supervised, and with no additional postprocessing. We show that our S3F model surpasses the previous state-of-the-art on various tasks, including monocular 3D reconstruction, as well as albedo and shading estimation. Moreover, we show that the proposed methodology allows novel view synthesis, relighting, and re-posing the reconstruction, and can naturally be extended to handle multiple input images (e.g. different views of a person, or the same view, in different poses, in video). Finally, we demonstrate the editing capabilities of our model for 3D virtual try-on applications.
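Human performance capture is a highly important computer vision problem with many applications in movie production and virtual/augmented reality. Many previous performance capture approaches either required expensive multi-view setups or did not recover dense, space-time coherent geometry with frame-to-frame correspondences. We propose a novel deep learning approach for dense human performance capture. Our method is trained in a weakly supervised manner based on multi-view supervision, completely removing the need for training data with 3D ground-truth annotations. The network architecture is based on two separate networks that disentangle the task into a pose estimation step and a non-rigid surface deformation step. Extensive qualitative and quantitative evaluations show that our approach outperforms the state of the art in terms of quality and robustness. This work is an extended version of DeepCap, in which we provide more detailed explanations, comparisons, and results, as well as applications.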
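Capturing the dynamically deforming 3D shape of clothed humans is essential for numerous applications, including VR/AR, autonomous driving, and human-computer interaction. Existing methods either require a highly specialized capture setup, such as expensive multi-view imaging systems, or lack robustness to challenging body poses. In this work, we propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses, without any additional input. We first build a 3D template human model of the subject based on a learned regression model. We then track this template model's deformation under challenging body articulations based on 2D image observations. Our method outperforms state-of-the-art methods on 3DPW, an in-the-wild human video dataset. Moreover, we demonstrate its robustness and generalizability on videos from the IPS dataset.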