可控的人图像合成任务可以通过对身体姿势和外观的明确控制来实现广泛的应用。在本文中,我们提出了一个基于跨注意的样式分布模块,该模块在源语义样式和目标姿势转移的目标姿势之间计算。该模块故意选择每个语义表示的样式,并根据目标姿势分配它们。交叉注意的注意力矩阵表达了目标姿势与所有语义的源样式之间的动态相似性。因此,可以利用它来从源图像路由颜色和纹理,并受到目标解析图的进一步限制,以实现更清晰的目标。同时,为了准确编码源外观,还添加了不同语义样式之间的自我注意力。我们的模型的有效性在姿势转移和虚拟的尝试任务上进行了定量和质量验证。
translated by 谷歌翻译
人物图像的旨在在源图像上执行非刚性变形,这通常需要未对准数据对进行培训。最近,自我监督的方法通过合并自我重建的解除印章表达来表达这项任务的巨大前景。然而,这些方法未能利用解除戒断功能之间的空间相关性。在本文中,我们提出了一种自我监督的相关挖掘网络(SCM-NET)来重新排列特征空间中的源图像,其中两种协作模块是集成的,分解的样式编码器(DSE)和相关挖掘模块(CMM)。具体地,DSE首先在特征级别创建未对齐的对。然后,CMM建立用于特征重新排列的空间相关领域。最终,翻译模块将重新排列的功能转换为逼真的结果。同时,为了提高跨尺度姿态变换的保真度,我们提出了一种基于曲线图的体结构保持损失(BSR损耗),以保持半体上的合理的身体结构到全身。与Deepfashion DataSet进行的广泛实验表明了与其他监督和无监督和无监督的方法相比的方法的优势。此外,对面部的令人满意的结果显示了我们在其他变形任务中的方法的多功能性。
translated by 谷歌翻译
基于图像的虚拟试验是以人为中心的现实潜力,是以人为中心的图像生成的最有希望的应用之一。在这项工作中,我们迈出了一步,探索多功能的虚拟尝试解决方案,我们认为这应该具有三个主要属性,即,它们应支持无监督的培训,任意服装类别和可控的服装编辑。为此,我们提出了一个特征性的端到端网络,即用空间自适应的斑点适应性GAN ++(Pasta-gan ++),以实现用于高分辨率不合规的虚拟试验的多功能系统。具体而言,我们的意大利面++由一个创新的贴布贴片的拆卸模块组成,可以将完整的服装切换为归一化贴剂,该贴片能够保留服装样式信息,同时消除服装空间信息,从而减轻在未受监督训练期间过度适应的问题。此外,面食++引入了基于贴片的服装表示和一个贴片引导的解析合成块,使其可以处理任意服装类别并支持本地服装编辑。最后,为了获得具有逼真的纹理细节的尝试结果,面食gan ++结合了一种新型的空间自适应残留模块,以将粗翘曲的服装功能注入发电机。对我们新收集的未配对的虚拟试验(UPT)数据集进行了广泛的实验,证明了面食gan ++比现有SOTA的优越性及其可控服装编辑的能力。
translated by 谷歌翻译
人类姿势转移旨在将源人的外观转移到目标姿势。利用基于流量的非刚性人类图像的翘曲的现有方法取得了巨大的成功。然而,由于源和目标之间的空间相关性未充分利用,它们未能保留合成图像中的外观细节。为此,我们提出了基于流动的双重关注GaN(FDA-GaN),以应用于更高的发电质量的遮挡和变形感知功能融合。具体而言,可变形的局部注意力和流量相似性关注,构成双重关注机制,可以分别导出负责可变形和遮挡感知融合的输出特征。此外,为了维持传输的姿势和全球位置一致性,我们设计了一种姿势归一化网络,用于从目标姿势到源人员学习自适应标准化。定性和定量结果都表明,我们的方法在公共IPer和Deepfashion数据集中优于最先进的模型。
translated by 谷歌翻译
基于图像的虚拟试图是由于其巨大的真实潜力,以人为本的图像生成最有希望的应用之一。然而,由于大多数预先接近店内服装到目标人物,他们需要对成对的训练数据集进行费力和限制性的结构,严重限制了它们的可扩展性。虽然最近的一些作品试图直接从一个人转移服装,但减轻了收集配对数据集的需要,它们的表现受缺乏配对(监督)信息影响。特别地,衣服的解开样式和空间信息成为一个挑战,通过需要辅助数据或广泛的在线优化程序来解决任何方法,从而仍抑制其可扩展性。实现A \ EMPH {可扩展}虚拟试样系统,可以以无监督的方式在源和目标人物之间传输任意服装,因此我们提出了一种纹理保留的端到端网络,该包装空间 - 适应甘(意大利面),促进了现实世界的未配对虚拟试验。具体而言,要解开每位服装的风格和空间信息,意大利面甘包括一个创新的补丁路由解剖模块,用于成功挡住衣服纹理和形状特性。由源人关键点引导,修补程序路由的解剖学模块首先将衣服脱发到标准化的贴片中,从而消除了衣服的固有空间信息,然后将归一化贴片重建到符合目标人员姿势的翘曲衣服。鉴于翘曲的衣服,Pasta-GaN进一步推出了一种新型空间适应性的残余块,指导发电机合成更现实的服装细节。
translated by 谷歌翻译
人类视频运动转移(HVMT)的目的是鉴于源头的形象,生成了模仿驾驶人员运动的视频。 HVMT的现有方法主要利用生成对抗网络(GAN),以根据根据源人员图像和每个驾驶视频框架估计的流量来执行翘曲操作。但是,由于源头,量表和驾驶人员之间的巨大差异,这些方法始终会产生明显的人工制品。为了克服这些挑战,本文提出了基于gan的新型人类运动转移(远程移动)框架。为了产生逼真的动作,远遥采用了渐进的一代范式:它首先在没有基于流动的翘曲的情况下生成每个身体的零件,然后将所有零件变成驾驶运动的完整人。此外,为了保留自然的全球外观,我们设计了一个全球对齐模块,以根据其布局与驾驶员的规模和位置保持一致。此外,我们提出了一个纹理对准模块,以使人的每个部分都根据纹理的相似性对齐。最后,通过广泛的定量和定性实验,我们的远及以两个公共基准取得了最先进的结果。
translated by 谷歌翻译
虚拟试验旨在在店内服装和参考人员图像的情况下产生光真实的拟合结果。现有的方法通常建立多阶段框架来分别处理衣服翘曲和身体混合,或严重依赖基于中间解析器的标签,这些标签可能嘈杂甚至不准确。为了解决上述挑战,我们通过开发一种新型的变形注意流(DAFLOF)提出了一个单阶段的尝试框架,该框架将可变形的注意方案应用于多流量估计。仅将姿势关键点作为指导,分别为参考人员和服装图像估计了自我和跨跨性别的注意力流。通过对多个流场进行采样,通过注意机制同时提取并合并了来自不同语义区域的特征级和像素级信息。它使衣服翘曲和身体合成,同时以端到端的方式导致照片真实的结果。在两个尝试数据集上进行的广泛实验表明,我们提出的方法在定性和定量上都能达到最先进的性能。此外,其他两个图像编辑任务上的其他实验说明了我们用于多视图合成和图像动画方法的多功能性。
translated by 谷歌翻译
在本文中,我们专注于人物图像的生成,即在各种条件下产生人物图像,例如腐败的纹理或不同的姿势。在此任务中解决纹理遮挡和大构成错位,以前的作品只使用相应的区域的风格来推断遮挡区域并依靠点明智的对齐来重新组织上下文纹理信息,缺乏全局关联地区的能力代码并保留源的局部结构。为了解决这些问题,我们提出了一种Glocal框架,通过全球推理不同语义区域之间的样式相互关系来改善遮挡感知纹理估计,这也可以用于恢复纹理染色中的损坏图像。对于本地结构信息保存,我们进一步提取了源图像的本地结构,并通过本地结构传输在所生成的图像中重新获得。我们基准测试我们的方法,以充分表征其对Deepfashion DataSet的性能,并显示出突出我们方法的新颖性的广泛消融研究。
translated by 谷歌翻译
The vision community has explored numerous pose guided human editing methods due to their extensive practical applications. Most of these methods still use an image-to-image formulation in which a single image is given as input to produce an edited image as output. However, the problem is ill-defined in cases when the target pose is significantly different from the input pose. Existing methods then resort to in-painting or style transfer to handle occlusions and preserve content. In this paper, we explore the utilization of multiple views to minimize the issue of missing information and generate an accurate representation of the underlying human model. To fuse the knowledge from multiple viewpoints, we design a selector network that takes the pose keypoints and texture from images and generates an interpretable per-pixel selection map. After that, the encodings from a separate network (trained on a single image human reposing task) are merged in the latent space. This enables us to generate accurate, precise, and visually coherent images for different editing tasks. We show the application of our network on 2 newly proposed tasks - Multi-view human reposing, and Mix-and-match human image generation. Additionally, we study the limitations of single-view editing and scenarios in which multi-view provides a much better alternative.
translated by 谷歌翻译
基于图像的虚拟试验努力将服装的外观转移到目标人的图像上。先前的工作主要集中在上身衣服(例如T恤,衬衫和上衣)上,并忽略了全身或低身物品。这种缺点来自一个主要因素:用于基于图像的虚拟试验的当前公开可用数据集并不解释此品种,从而限制了该领域的进度。为了解决这种缺陷,我们介绍着着装代码,其中包含多类服装的图像。着装代码比基于图像的虚拟试验的公共可用数据集大于3倍以上,并且具有前视图,全身参考模型的高分辨率配对图像(1024x768)。为了生成具有高视觉质量且细节丰富的高清尝试图像,我们建议学习细粒度的区分功能。具体而言,我们利用一种语义意识歧视器,该歧视器在像素级而不是图像级或贴片级上进行预测。广泛的实验评估表明,所提出的方法在视觉质量和定量结果方面超过了基线和最先进的竞争者。着装码数据集可在https://github.com/aimagelab/dress-code上公开获得。
translated by 谷歌翻译
Image and video synthesis has become a blooming topic in computer vision and machine learning communities along with the developments of deep generative models, due to its great academic and application value. Many researchers have been devoted to synthesizing high-fidelity human images as one of the most commonly seen object categories in daily lives, where a large number of studies are performed based on various deep generative models, task settings and applications. Thus, it is necessary to give a comprehensive overview on these variant methods on human image generation. In this paper, we divide human image generation techniques into three paradigms, i.e., data-driven methods, knowledge-guided methods and hybrid methods. For each route, the most representative models and the corresponding variants are presented, where the advantages and characteristics of different methods are summarized in terms of model architectures and input/output requirements. Besides, the main public human image datasets and evaluation metrics in the literature are also summarized. Furthermore, due to the wide application potentials, two typical downstream usages of synthesized human images are covered, i.e., data augmentation for person recognition tasks and virtual try-on for fashion customers. Finally, we discuss the challenges and potential directions of human image generation to shed light on future research.
translated by 谷歌翻译
Automatic font generation without human experts is a practical and significant problem, especially for some languages that consist of a large number of characters. Existing methods for font generation are often in supervised learning. They require a large number of paired data, which are labor-intensive and expensive to collect. In contrast, common unsupervised image-to-image translation methods are not applicable to font generation, as they often define style as the set of textures and colors. In this work, we propose a robust deformable generative network for unsupervised font generation (abbreviated as DGFont++). We introduce a feature deformation skip connection (FDSC) to learn local patterns and geometric transformations between fonts. The FDSC predicts pairs of displacement maps and employs the predicted maps to apply deformable convolution to the low-level content feature maps. The outputs of FDSC are fed into a mixer to generate final results. Moreover, we introduce contrastive self-supervised learning to learn a robust style representation for fonts by understanding the similarity and dissimilarities of fonts. To distinguish different styles, we train our model with a multi-task discriminator, which ensures that each style can be discriminated independently. In addition to adversarial loss, another two reconstruction losses are adopted to constrain the domain-invariant characteristics between generated images and content images. Taking advantage of FDSC and the adopted loss functions, our model is able to maintain spatial information and generates high-quality character images in an unsupervised manner. Experiments demonstrate that our model is able to generate character images of higher quality than state-of-the-art methods.
translated by 谷歌翻译
最近已经示出了从2D图像中提取隐式3D表示的生成神经辐射场(GNERF)模型,以产生代表刚性物体的现实图像,例如人面或汽车。然而,他们通常难以产生代表非刚性物体的高质量图像,例如人体,这对许多计算机图形应用具有很大的兴趣。本文提出了一种用于人类图像综合的3D感知语义导向生成模型(3D-SAGGA),其集成了GNERF和纹理发生器。前者学习人体的隐式3D表示,并输出一组2D语义分段掩模。后者将这些语义面部掩模转化为真实的图像,为人类的外观添加了逼真的纹理。如果不需要额外的3D信息,我们的模型可以使用照片现实可控生成学习3D人类表示。我们在Deepfashion DataSet上的实验表明,3D-SAGGAN显着优于最近的基线。
translated by 谷歌翻译
发型转移是将源发型修改为目标的任务。尽管最近的发型转移模型可以反映发型的精致特征,但它们仍然有两个主要局限性。首先,当源和目标图像具有不同的姿势(例如,查看方向或面部尺寸)时,现有方法无法转移发型,这在现实世界中很普遍。同样,当源图像中有非平凡的区域被其原始头发遮住时,先前的模型会产生不切实际的图像。当将长发修改为短发时,肩膀或背景被长发遮住了。为了解决这些问题,我们为姿势不变的发型转移,发型提出了一个新颖的框架。我们的模型包括两个阶段:1)基于流动的头发对齐和2)头发合成。在头发对齐阶段,我们利用基于关键点的光流估计器将目标发型与源姿势对齐。然后,我们基于语义区域感知的嵌入面膜(SIM)估计器在头发合成阶段生成最终的发型转移图像。我们的SIM估计器将源图像中的封闭区域划分为不同的语义区域,以反映其在涂料过程中的独特特征。为了证明我们的模型的有效性,我们使用多视图数据集(K-Hairstyle和Voxceleb)进行定量和定性评估。结果表明,发型通过在不同姿势的图像之间成功地转移发型来实现最先进的表现,而这是以前从未实现的。
translated by 谷歌翻译
Automatic image colorization is a particularly challenging problem. Due to the high illness of the problem and multi-modal uncertainty, directly training a deep neural network usually leads to incorrect semantic colors and low color richness. Existing transformer-based methods can deliver better results but highly depend on hand-crafted dataset-level empirical distribution priors. In this work, we propose DDColor, a new end-to-end method with dual decoders, for image colorization. More specifically, we design a multi-scale image decoder and a transformer-based color decoder. The former manages to restore the spatial resolution of the image, while the latter establishes the correlation between semantic representations and color queries via cross-attention. The two decoders incorporate to learn semantic-aware color embedding by leveraging the multi-scale visual features. With the help of these two decoders, our method succeeds in producing semantically consistent and visually plausible colorization results without any additional priors. In addition, a simple but effective colorfulness loss is introduced to further improve the color richness of generated results. Our extensive experiments demonstrate that the proposed DDColor achieves significantly superior performance to existing state-of-the-art works both quantitatively and qualitatively. Codes will be made publicly available.
translated by 谷歌翻译
Automatic colorization of anime line drawing has attracted much attention in recent years since it can substantially benefit the animation industry. User-hint based methods are the mainstream approach for line drawing colorization, while reference-based methods offer a more intuitive approach. Nevertheless, although reference-based methods can improve feature aggregation of the reference image and the line drawing, the colorization results are not compelling in terms of color consistency or semantic correspondence. In this paper, we introduce an attention-based model for anime line drawing colorization, in which a channel-wise and spatial-wise Convolutional Attention module is used to improve the ability of the encoder for feature extraction and key area perception, and a Stop-Gradient Attention module with cross-attention and self-attention is used to tackle the cross-domain long-range dependency problem. Extensive experiments show that our method outperforms other SOTA methods, with more accurate line structure and semantic color information.
translated by 谷歌翻译
基于图像的虚拟试验旨在综合一个穿给定服装的人的图像。为了解决任务,现有的方法会经过衣物项目,以适合该人的身体并生成穿着该物品的人的分割图,然后再将物品与人融合。但是,当扭曲和分割生成阶段在没有信息交换的情况下单独运行时,扭曲的衣服和分割图之间的未对准发生了,从而导致最终图像中的工件。信息断开还会导致在身体部位遮住的衣服区域附近过度翘曲,所谓的像素 - 刺式伪像。为了解决这些问题,我们提出了一个新颖的尝试条件发生器,作为两个阶段的统一模块(即扭曲和分割生成阶段)。条件生成器中新提出的特征融合块实现了信息交换,并且条件生成器不会造成任何未对准或像素 - 平方形工件。我们还介绍了歧视者的拒绝,从而滤除了不正确的细分图预测并确保虚拟试验框架的性能。高分辨率数据集上的实验表明,我们的模型成功处理了未对准和遮挡,并显着优于基线。代码可从https://github.com/sangyun884/hr-viton获得。
translated by 谷歌翻译
近年来,由于其在图像生成过程中的可控性,有条件的图像合成引起了不断的关注。虽然最近的作品取得了现实的结果,但大多数都没有处理细微细节的细粒度风格。为了解决这个问题,提出了一种名为DRAN的新型归一化模块。它学会了细粒度的风格表示,同时保持普通风格的稳健性。具体来说,我们首先引入多级结构,空间感知金字塔汇集,以指导模型学习粗略的功能。然后,为了自适应地保险熔断不同的款式,我们提出动态门控,使得可以根据不同的空间区域选择不同的样式。为了评估DRAN的有效性和泛化能力,我们对化妆和语义图像合成进行了一组实验。定量和定性实验表明,配备了DRAN,基线模型能够实现复杂风格转移和纹理细节重建的显着改善。
translated by 谷歌翻译
强大的模拟器高度降低了在培训和评估自动车辆时对真实测试的需求。数据驱动的模拟器蓬勃发展,最近有条件生成对冲网络(CGANS)的进步,提供高保真图像。主要挑战是在施加约束之后的同时合成光量造型图像。在这项工作中,我们建议通过重新思考鉴别者架构来提高所生成的图像的质量。重点是在给定对语义输入生成图像的问题类上,例如场景分段图或人体姿势。我们建立成功的CGAN模型,提出了一种新的语义感知鉴别器,更好地指导发电机。我们的目标是学习一个共享的潜在表示,编码足够的信息,共同进行语义分割,内容重建以及粗糙的粒度的对抗性推理。实现的改进是通用的,并且可以应用于任何条件图像合成的任何架构。我们展示了我们在场景,建筑和人类综合任务上的方法,跨越三个不同的数据集。代码可在https://github.com/vita-epfl/semdisc上获得。
translated by 谷歌翻译
Image-based virtual try-on techniques have shown great promise for enhancing the user-experience and improving customer satisfaction on fashion-oriented e-commerce platforms. However, existing techniques are currently still limited in the quality of the try-on results they are able to produce from input images of diverse characteristics. In this work, we propose a Context-Driven Virtual Try-On Network (C-VTON) that addresses these limitations and convincingly transfers selected clothing items to the target subjects even under challenging pose configurations and in the presence of self-occlusions. At the core of the C-VTON pipeline are: (i) a geometric matching procedure that efficiently aligns the target clothing with the pose of the person in the input images, and (ii) a powerful image generator that utilizes various types of contextual information when synthesizing the final try-on result. C-VTON is evaluated in rigorous experiments on the VITON and MPV datasets and in comparison to state-of-the-art techniques from the literature. Experimental results show that the proposed approach is able to produce photo-realistic and visually convincing results and significantly improves on the existing state-of-the-art.
translated by 谷歌翻译