Many recent works have leveraged the latent space of pretrained GANs for facial image editing. However, few attempts have been made to apply them directly to videos, because 1) they do not guarantee temporal consistency, 2) their application is limited by their processing speed on videos, and 3) they cannot accurately encode fine details of facial motion and expression. To this end, we propose a novel network that encodes face videos into the latent space of StyleGAN for semantic face video manipulation. Based on a vision transformer, our network reuses the high-resolution portion of the latent vector to enforce temporal consistency. To capture subtle facial motions and expressions, we design novel losses involving sparse facial landmarks and dense 3D face meshes. We have thoroughly evaluated our method and successfully demonstrated its application to various face video manipulations. In particular, we propose a novel network for pose/expression control in a 3D coordinate system. Both qualitative and quantitative results show that our method can significantly outperform existing single-image approaches while achieving real-time (66 fps) speed.
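To make the loss design above concrete, the following is a minimal PyTorch sketch of how a sparse-landmark term and a dense 3D-mesh term could be combined; the tensor shapes, weights, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def geometry_losses(pred_landmarks, gt_landmarks, pred_mesh, gt_mesh,
                    w_lmk=1.0, w_mesh=1.0):
    """Hypothetical combination of a sparse-landmark loss and a dense
    3D-mesh loss of the kind described in the abstract above.

    pred_landmarks, gt_landmarks: (B, 68, 2) 2D facial landmarks
    pred_mesh, gt_mesh:           (B, V, 3) 3D face-mesh vertices
    """
    loss_lmk = F.l1_loss(pred_landmarks, gt_landmarks)   # subtle expression cues
    loss_mesh = F.l1_loss(pred_mesh, gt_mesh)            # dense geometry / motion
    return w_lmk * loss_lmk + w_mesh * loss_mesh
```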
In contrast to the traditional avatar creation pipeline, which is a costly process, contemporary generative approaches directly learn the data distribution from photographs, and the state of the art can now yield highly photorealistic images. While many works attempt to extend unconditional generative models and achieve some level of controllability, it is still challenging to ensure multi-view consistency, especially under large poses. In this work, we propose a 3D portrait generation network that produces 3D-consistent portraits while being controllable through semantic parameters for pose, identity, expression, and illumination. The generative network models portraits in 3D with a neural scene representation, and its generation is guided by a parametric face model that supports explicit control. Although the latent disentanglement can be further enhanced by contrasting images with partially different attributes, noticeable inconsistency remains in non-face regions, e.g., when animating expressions. We solve this by proposing a volume blending strategy, in which we form a composite output by blending dynamic and static radiance fields, with the two parts segmented from a jointly learned semantic field. Our method outperforms prior art in extensive experiments, producing realistic portraits under natural illumination when viewed from free viewpoints. The proposed method also generalizes to real images as well as out-of-domain cartoon faces, showing great promise for real applications. Additional video results and code will be available on the project webpage.
Generative adversarial networks (GANs) have demonstrated impressive image generation quality and semantic editing capabilities on real images, e.g., changing object classes, modifying attributes, or transferring styles. However, applying these GAN-based editing techniques to videos frame by frame inevitably leads to temporally flickering artifacts. We present a simple yet effective method to facilitate temporally coherent video editing. Our core idea is to minimize the temporal photometric inconsistency by optimizing both the latent codes and the pretrained generator. We evaluate editing quality across different domains and GAN inversion techniques and show favorable results against the baselines.
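As an illustration of the core idea of minimizing temporal photometric inconsistency by optimizing both the latent codes and the generator, here is a hedged PyTorch sketch; the generator interface and the plain frame-difference loss are simplifying assumptions, not the paper's exact formulation (which may, for instance, compare motion-compensated frames).

```python
import torch
import torch.nn.functional as F

def temporal_photometric_loss(frames):
    """Penalize photometric differences between consecutive generated frames.
    `frames` is a (T, C, H, W) tensor, e.g. frames[t] = G(w_t).
    This sketch uses the raw frame difference for brevity."""
    return F.l1_loss(frames[1:], frames[:-1])

def reduce_flicker(generator, latents, steps=200, lr=1e-3):
    """Hypothetical optimization loop: jointly refine the per-frame latent
    codes and the generator weights to reduce temporal flicker.
    The call `generator(latents)` is an assumed interface."""
    latents = latents.clone().requires_grad_(True)
    opt = torch.optim.Adam([latents, *generator.parameters()], lr=lr)
    for _ in range(steps):
        frames = generator(latents)              # (T, C, H, W)
        loss = temporal_photometric_loss(frames)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator, latents.detach()
```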
Recently, audio-driven talking face video generation has attracted considerable attention. However, few studies address the problem of emotion editing for these talking face videos with continuously controllable expressions, which is strongly demanded in industry. The challenge is that speech-related expressions and emotion-related expressions are often highly coupled. Moreover, expressions are coupled with other attributes such as head pose, i.e., translating the character's expression in each frame may simultaneously change the head pose because of the bias of the training data distribution, so traditional image-to-image translation methods do not work well in our application. In this paper, we propose a high-quality facial expression editing method for talking face videos that allows the user to continuously control the target emotion in the edited video. We offer a new perspective on this task as a special case of motion-information editing, where a 3DMM captures the major facial movements and an associated texture map modeled by StyleGAN captures the appearance details. Both representations (the 3DMM and the texture map) contain emotional information, can be continuously modified by neural networks, and are easily smoothed by averaging in the coefficient/latent spaces, making our method simple yet effective. We also introduce a mouth shape preservation loss to control the trade-off between lip synchronization and the degree of exaggeration of the edited expression. Extensive experiments and a user study show that our method achieves state-of-the-art performance across various evaluation criteria.
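The mouth shape preservation loss described above can be illustrated with a simple landmark-based term; the 68-point layout and index range used here are assumptions for the sketch, not necessarily what the paper uses.

```python
import torch
import torch.nn.functional as F

def mouth_shape_preservation_loss(edited_landmarks, source_landmarks,
                                  mouth_idx, weight=1.0):
    """Keep the mouth landmarks of the emotion-edited frame close to those of
    the source frame so lip sync is preserved; a smaller `weight` allows more
    exaggerated expressions.

    edited_landmarks, source_landmarks: (B, N, 2) facial landmarks.
    mouth_idx: indices of the mouth region.
    """
    return weight * F.l1_loss(edited_landmarks[:, mouth_idx],
                              source_landmarks[:, mouth_idx])

# Example (assumed 68-point landmark layout, mouth region 48-67):
# mouth_idx = torch.arange(48, 68)
```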
Although manipulating facial attributes with generative adversarial networks (GANs) has recently been remarkably successful, explicit control of features such as pose, expression, and lighting remains challenging. Recent methods achieve explicit control over 2D images by combining a 2D generative model with a 3DMM. However, because the 3DMM lacks realism and sharpness in texture reconstruction, there is a domain gap between synthesized images and the rendered images of the 3DMM. Since a rendered 3DMM image contains only the facial region, directly computing a loss between these two domains is not ideal and the trained model would be biased. In this study, we propose to explicitly edit the latent space of a pretrained StyleGAN by controlling the parameters of the 3DMM. To address the domain-gap problem, we propose a new network called "Map and Edit" together with a simple but effective attribute editing method that avoids direct loss computation between rendered and synthesized images. Furthermore, our model can accurately generate multi-view face images while keeping the identity unchanged. As a by-product, combined with visibility masks, our model can also generate texture-rich, high-resolution UV facial textures. Our model relies on a pretrained StyleGAN and is trained in a self-supervised manner without any manual annotations or extra training data.
Due to its semantically meaningful understanding and user-friendly controllability, face image manipulation via 3D guidance has been widely applied in various interactive scenarios. However, existing 3D-morphable-model-based manipulation methods are not directly applicable to out-of-domain faces, such as non-photorealistic paintings, cartoon portraits, or even animals, mainly because of the formidable difficulty of building the model for each specific face domain. To overcome this challenge, we propose, to the best of our knowledge, the first method to manipulate faces in arbitrary domains using a human 3DMM. This is achieved through two major steps: 1) a disentangled mapping from 3DMM parameters to the latent space embedding of a pretrained StyleGAN2, which ensures disentangled and precise control of each semantic attribute; and 2) bridging the domain gap and making the human 3DMM applicable to out-of-domain faces by enforcing a consistent latent space embedding. Experiments and comparisons demonstrate the superiority of our high-quality semantic manipulation method across various face domains, with all major 3D facial attributes controllable: pose, expression, shape, albedo, and illumination. Moreover, we develop an intuitive editing interface to support user-friendly control and instant feedback. Our project page is https://cassiepython.github.io/cddfm3d/index.html
Real-world image manipulation has achieved fantastic progress in recent years thanks to the exploration and exploitation of GAN latent spaces. GAN inversion is the first step in this pipeline, which aims to map a real image to a latent code faithfully. Unfortunately, most existing GAN inversion methods fail to meet at least one of the following three requirements: high reconstruction quality, editability, and fast inference. In this work, we present a novel two-phase strategy that satisfies all three requirements at the same time. In the first phase, we train an encoder to map the input image into the StyleGAN2 $\mathcal{W}$-space, which has been shown to have excellent editability but lower reconstruction quality. In the second phase, we supplement the reconstruction ability of the first phase by leveraging a series of hypernetworks to recover the missing information during inversion. These two phases complement each other, yielding high reconstruction quality thanks to the hypernetwork branch and excellent editability due to the inversion being done in the $\mathcal{W}$-space. Our method is entirely encoder-based, resulting in extremely fast inference. Extensive experiments on two challenging datasets demonstrate the superiority of our method.
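A schematic of the two-phase design (a $\mathcal{W}$-space encoder followed by hypernetwork-based detail recovery) is sketched below in PyTorch; the module interfaces and the `weight_deltas` argument are placeholder assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

class TwoPhaseInverter(nn.Module):
    """Hypothetical sketch of the two-phase inversion idea described above:
    an encoder predicts a W-space code (good editability), and a hypernetwork
    predicts per-layer weight offsets for the generator to recover details the
    low-rate code cannot encode. All module names and shapes are assumptions."""

    def __init__(self, encoder, hypernet, generator):
        super().__init__()
        self.encoder = encoder      # image -> w in W-space
        self.hypernet = hypernet    # (image, coarse reconstruction) -> weight deltas
        self.generator = generator  # StyleGAN-like generator accepting weight deltas

    def forward(self, image):
        w = self.encoder(image)                              # phase 1: coarse inversion
        coarse = self.generator(w)                           # initial reconstruction
        deltas = self.hypernet(torch.cat([image, coarse], dim=1))
        refined = self.generator(w, weight_deltas=deltas)    # phase 2: detail recovery
        return refined, w, deltas
```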
We propose a novel method for generating high-resolution video from speech audio and a single 'identity' image. Our method is based on a convolutional neural network model that incorporates a pretrained StyleGAN generator. We model each frame as a point in the StyleGAN latent space, so that a video corresponds to a trajectory through the latent space. Training the network proceeds in two stages. The first stage conditions trajectories in the latent space on speech utterances. To this end, we use an existing encoder to invert the generator, mapping each video frame into the latent space. We train a recurrent neural network to map from speech utterances to displacements in the latent space of the image generator. These displacements are predicted relative to the back-projection into the latent space of an identity image chosen from the individuals depicted in the training dataset. In the second stage, we improve the visual quality of the generated videos by tuning the image generator on a single image or a short video of any chosen identity. We evaluate the model on standard metrics (PSNR, SSIM, FID and LMD) and show that it significantly outperforms recent state-of-the-art methods on one of two commonly used datasets and gives comparable performance on the other. Finally, we report ablation experiments that validate the components of the model. Code and videos of the experiments can be found at https://mohammedalghamdi.github.io/talking-heads-acm-mm
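The first training stage, mapping speech features to per-frame displacements around an identity latent, could look roughly like the following sketch; the GRU architecture and feature dimensions are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class Audio2LatentRNN(nn.Module):
    """Illustrative sketch of stage one described above: a recurrent network
    maps per-frame speech features to displacements in the generator's latent
    space, applied relative to the inverted latent of an identity image."""

    def __init__(self, audio_dim=80, hidden=256, latent_dim=512):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, latent_dim)

    def forward(self, audio_feats, identity_latent):
        # audio_feats: (B, T, audio_dim); identity_latent: (B, latent_dim)
        h, _ = self.rnn(audio_feats)
        displacements = self.head(h)                          # (B, T, latent_dim)
        return identity_latent.unsqueeze(1) + displacements   # per-frame latents
```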
We present a novel image inversion framework and a training pipeline to achieve high-fidelity image inversion with high-quality attribute editing. Inverting real images into StyleGAN's latent space is an extensively studied problem, yet the trade-off between the image reconstruction fidelity and image editing quality remains an open challenge. The low-rate latent spaces are limited in their expressiveness power for high-fidelity reconstruction. On the other hand, high-rate latent spaces result in degradation in editing quality. In this work, to achieve high-fidelity inversion, we learn residual features in higher latent codes that lower latent codes were not able to encode. This enables preserving image details in reconstruction. To achieve high-quality editing, we learn how to transform the residual features for adapting to manipulations in latent codes. We train the framework to extract residual features and transform them via a novel architecture pipeline and cycle consistency losses. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements. Code: https://github.com/hamzapehlivan/StyleRes
Our goal with this survey is to provide an overview of the state of the art deep learning technologies for face generation and editing. We will cover popular latest architectures and discuss key ideas that make them work, such as inversion, latent representation, loss functions, training procedures, editing methods, and cross domain style transfer. We particularly focus on GAN-based architectures that have culminated in the StyleGAN approaches, which allow generation of high-quality face images and offer rich interfaces for controllable semantics editing and preserving photo quality. We aim to provide an entry point into the field for readers that have basic knowledge about the field of deep learning and are looking for an accessible introduction and overview.
In this work, we propose an ID-preserving talking head generation framework, which advances previous methods in two aspects. First, as opposed to interpolating from sparse flow, we claim that dense landmarks are crucial to achieving accurate geometry-aware flow fields. Second, inspired by face-swapping methods, we adaptively fuse the source identity during synthesis, so that the network better preserves the key characteristics of the image portrait. Although the proposed model surpasses prior generation fidelity on established benchmarks, personalized fine-tuning is usually needed to make talking head generation qualified for real-world use. However, this process is computationally demanding and unaffordable to standard users. To solve this, we propose a fast adaptation model using a meta-learning approach. The learned model can be adapted to a high-quality personalized model in as little as 30 seconds. Last but not least, a spatial-temporal enhancement module is proposed to improve the fine details while ensuring temporal coherency. Extensive experiments prove the significant superiority of our approach over the state of the art in both one-shot and personalized settings.
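The abstract does not specify which meta-learning algorithm is used; as one plausible illustration of how fast per-identity adaptation can be set up, here is a generic Reptile-style sketch in PyTorch. All names, batch structures, and hyper-parameters are assumptions, not the paper's method.

```python
import copy
import torch

def reptile_meta_update(model, per_identity_batches, loss_fn,
                        inner_steps=5, inner_lr=1e-4, meta_lr=1e-3):
    """Generic Reptile-style meta-learning pass: adapt briefly to each
    identity, then move the shared weights toward the adapted weights, so a
    few fine-tuning steps later suffice for a new person.

    per_identity_batches: list of (inputs, targets) pairs, one per identity.
    """
    meta_weights = copy.deepcopy(model.state_dict())
    for inputs, targets in per_identity_batches:
        model.load_state_dict(meta_weights)
        opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                   # fast per-identity adaptation
            loss = loss_fn(model(inputs), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
        adapted = model.state_dict()
        for k in meta_weights:                         # meta update toward adapted weights
            if meta_weights[k].is_floating_point():
                meta_weights[k] += meta_lr * (adapted[k] - meta_weights[k])
    model.load_state_dict(meta_weights)
    return model
```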
StyleGAN has achieved great progress in 2D face reconstruction and semantic editing via image inversion and latent editing. While studies over extending 2D StyleGAN to 3D faces have emerged, a corresponding generic 3D GAN inversion framework is still missing, limiting the applications of 3D face reconstruction and semantic editing. In this paper, we study the challenging problem of 3D GAN inversion where a latent code is predicted given a single face image to faithfully recover its 3D shapes and detailed textures. The problem is ill-posed: innumerable compositions of shape and texture could be rendered to the current image. Furthermore, with the limited capacity of a global latent code, 2D inversion methods cannot preserve faithful shape and texture at the same time when applied to 3D models. To solve this problem, we devise an effective self-training scheme to constrain the learning of inversion. The learning is done efficiently without any real-world 2D-3D training pairs but proxy samples generated from a 3D GAN. In addition, apart from a global latent code that captures the coarse shape and texture information, we augment the generation network with a local branch, where pixel-aligned features are added to faithfully reconstruct face details. We further consider a new pipeline to perform 3D view-consistent editing. Extensive experiments show that our method outperforms state-of-the-art inversion methods in both shape and texture reconstruction quality. Code and data will be released.
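The self-training scheme with proxy samples could be set up roughly as follows; the `gan_3d` interface (`latent_dim`, `sample_camera`, `render`) is a hypothetical stand-in, since the abstract does not expose an API, and the plain latent-regression loss is a simplification.

```python
import torch
import torch.nn.functional as F

def make_proxy_batch(gan_3d, batch_size, device="cuda"):
    """Illustrative proxy-sample generation for self-training: sample latent
    codes and viewpoints, render images with the 3D GAN, and use the
    (image, latent) pairs as pseudo ground truth for the inversion encoder."""
    with torch.no_grad():
        z = torch.randn(batch_size, gan_3d.latent_dim, device=device)
        cam = gan_3d.sample_camera(batch_size)   # random viewpoints
        images = gan_3d.render(z, cam)
    return images, z, cam

def self_training_step(encoder, gan_3d, opt, batch_size=8):
    images, z_gt, _ = make_proxy_batch(gan_3d, batch_size)
    z_pred = encoder(images)
    loss = F.mse_loss(z_pred, z_gt)              # latent supervision from proxies
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```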
Face Restoration (FR) aims to restore High-Quality (HQ) faces from Low-Quality (LQ) input images, which is a domain-specific image restoration problem in the low-level computer vision area. The early face restoration methods mainly use statistical priors and degradation models, which make it difficult to meet the requirements of real-world applications in practice. In recent years, face restoration has witnessed great progress after stepping into the deep learning era. However, there are few works that study deep learning-based face restoration methods systematically. Thus, this paper comprehensively surveys recent advances in deep learning techniques for face restoration. Specifically, we first summarize different problem formulations and analyze the characteristics of the face image. Second, we discuss the challenges of face restoration. Concerning these challenges, we present a comprehensive review of existing FR methods, including prior-based methods and deep learning-based methods. Then, we explore developed techniques in the task of FR covering network architectures, loss functions, and benchmark datasets. We also conduct a systematic benchmark evaluation on representative methods. Finally, we discuss future directions, including network designs, metrics, benchmark datasets, applications, etc. We also provide an open-source repository for all the discussed methods, which is available at https://github.com/TaoWangzj/Awesome-Face-Restoration.
The introduction of high-quality image generation models, particularly the StyleGAN family, provides a powerful tool to synthesize and manipulate images. However, existing models are built upon high-quality (HQ) data as desired outputs, making them unfit for in-the-wild low-quality (LQ) images, which are common inputs for manipulation. In this work, we bridge this gap by proposing a novel GAN structure that allows for generating images with controllable quality. The network can synthesize various image degradation and restore the sharp image via a quality control code. Our proposed QC-StyleGAN can directly edit LQ images without altering their quality by applying GAN inversion and manipulation techniques. It also provides for free an image restoration solution that can handle various degradations, including noise, blur, compression artifacts, and their mixtures. Finally, we demonstrate numerous other applications such as image degradation synthesis, transfer, and interpolation.
We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflect the given audio. This is made possible by several newly devised components: 1) a contrastive lip-sync discriminator for accurate lip synchronization; 2) a conditional sequential variational autoencoder that learns a latent motion space disentangled from the lip movements, so that we can independently manipulate the motions and lip movements while preserving identity; and 3) an auto-regressive prior augmented with normalizing flows to learn a complex audio-to-motion multi-modal latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way, when another motion source video is given, but also in a completely audio-driven manner by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model synthesizes talking head videos of impressive perceptual quality that are accurately lip-synced with the input audio, largely outperforming state-of-the-art baselines.
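As an illustration of what a contrastive lip-sync objective can look like, here is a generic InfoNCE-style sketch over paired lip and audio embeddings; it is not claimed to be StyleTalker's exact discriminator loss, and the embedding networks are assumed to exist elsewhere.

```python
import torch
import torch.nn.functional as F

def contrastive_sync_loss(lip_emb, audio_emb, temperature=0.07):
    """Generic contrastive lip-sync objective: lip/audio windows from the same
    time step are pulled together, while mismatched pairs within the batch are
    pushed apart.

    lip_emb, audio_emb: (B, D) embeddings of temporally aligned windows.
    """
    lip_emb = F.normalize(lip_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = lip_emb @ audio_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(lip_emb.size(0), device=lip_emb.device)
    return F.cross_entropy(logits, targets)                 # diagonal pairs are positives
```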
Production-level workflows for producing convincing 3D dynamic human faces have long relied on a variety of labor-intensive tools for geometry and texture generation, motion capture and rigging, and expression synthesis. Recent neural approaches automate individual components, but the corresponding latent representations do not give artists the explicit controls that conventional tools provide. In this paper, we present a new learning-based, video-driven approach for generating dynamic facial geometry with high-quality, physically-based assets. For data collection, we construct a hybrid multi-view photometric capture stage coupled with ultra-fast video cameras to obtain raw 3D facial assets. We then model the facial expression, geometry, and physically-based textures using separate VAEs, imposing a global MLP-based expression mapping across the latent spaces of the individual networks to preserve the characteristics of the respective attributes. We also model the delta information as wrinkle maps for the physically-based textures, achieving high-quality 4K dynamic textures. We demonstrate our approach in high-fidelity performer-specific facial capture and cross-identity facial motion retargeting. In addition, our multi-VAE-based neural asset, together with fast adaptation schemes, can also be deployed to handle in-the-wild videos. Furthermore, we motivate the utility of our explicit facial disentangling strategy by providing various promising physically-based editing results with high realism. Comprehensive experiments show that our technique provides higher accuracy and visual fidelity than previous video-driven facial reconstruction and animation methods.
3D-controllable portrait synthesis has advanced significantly thanks to breakthroughs in generative adversarial networks (GANs). However, manipulating existing face images with precise 3D control remains challenging. While concatenating GAN inversion with a 3D-aware, noise-to-image GAN is a straightforward solution, it is inefficient and may lead to a noticeable drop in editing quality. To fill this gap, we propose 3D-FM GAN, a novel conditional GAN framework designed specifically for 3D-controllable face manipulation, which requires no tuning after the end-to-end learning phase. By carefully encoding both the input face image and a physically-based rendering of the 3D edits into our image generator, our method delivers high-quality, identity-preserving, 3D-controllable face manipulation. To effectively learn this novel framework, we develop two essential training strategies and a novel multiplicative co-modulation architecture that improves significantly over naive schemes. With extensive evaluations, we show that our method outperforms prior art on various tasks, with better editability, stronger identity preservation, and higher photorealism. In addition, we demonstrate the better generalizability of our design on large-pose editing and out-of-domain images.
Recent 3D-aware GANs rely on volumetric rendering techniques to disentangle the pose and appearance of objects, de facto generating entire 3D volumes rather than single-view 2D images from a latent code. Complex image editing tasks can be performed in standard 2D-based GANs (e.g., StyleGAN models) as manipulation of latent dimensions. However, to the best of our knowledge, similar properties have only been partially explored for 3D-aware GAN models. This work aims to fill this gap by showing the limitations of existing methods and proposing LatentSwap3D, a model-agnostic approach designed to enable attribute editing in the latent space of pre-trained 3D-aware GANs. We first identify the most relevant dimensions in the latent space of the model controlling the targeted attribute by relying on the feature importance ranking of a random forest classifier. Then, to apply the transformation, we swap the top-K most relevant latent dimensions of the image being edited with an image exhibiting the desired attribute. Despite its simplicity, LatentSwap3D provides remarkable semantic edits in a disentangled manner and outperforms alternative approaches both qualitatively and quantitatively. We demonstrate our semantic edit approach on various 3D-aware generative models such as pi-GAN, GIRAFFE, StyleSDF, MVCGAN, EG3D and VolumeGAN, and on diverse datasets, such as FFHQ, AFHQ, Cats, MetFaces, and CompCars. The project page can be found: \url{https://enisimsar.github.io/latentswap3d/}.
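Since LatentSwap3D is described quite concretely (random-forest feature importances followed by a top-K latent dimension swap), the following short sketch using scikit-learn and NumPy illustrates the idea; variable names, hyper-parameters, and the source of the attribute labels are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_k_dims(latents, attribute_labels, k=20):
    """Rank latent dimensions by how well they predict the target attribute,
    using random-forest feature importances, as described above.

    latents: (N, D) array of latent codes; attribute_labels: (N,) binary labels.
    """
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(latents, attribute_labels)
    return np.argsort(clf.feature_importances_)[::-1][:k]

def latent_swap(source_code, reference_code, dims):
    """Copy the top-K attribute-relevant dimensions from a reference latent
    (exhibiting the desired attribute) into the code being edited."""
    edited = source_code.copy()
    edited[dims] = reference_code[dims]
    return edited

# Hypothetical usage with pre-collected latents and pseudo-labels:
# dims = top_k_dims(latents, smiling_labels, k=20)
# edited = latent_swap(w_source, w_smiling_reference, dims)
```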
Figure 1. The proposed pixel2style2pixel framework can be used to solve a wide variety of image-to-image translation tasks. Here we show results of pSp on StyleGAN inversion, multi-modal conditional image synthesis, facial frontalization, inpainting and super-resolution.
Inspired by the impressive performance of recent face image editing methods, several studies have been naturally proposed to extend these methods to the face video editing task. One of the main challenges here is temporal consistency among edited frames, which is still unresolved. To this end, we propose a novel face video editing framework based on diffusion autoencoders that can successfully extract the decomposed features - for the first time as a face video editing model - of identity and motion from a given video. This modeling allows us to edit the video by simply manipulating the temporally invariant feature to the desired direction for the consistency. Another unique strength of our model is that, since our model is based on diffusion models, it can satisfy both reconstruction and edit capabilities at the same time, and is robust to corner cases in wild face videos (e.g. occluded faces) unlike the existing GAN-based methods.