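We introduce Gaudi, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera. We tackle this challenging problem with a scalable yet powerful approach, where we first optimize a latent representation that disentangles radiance fields and camera poses. This latent representation is then used to learn a generative model that enables both unconditional and conditional generation of 3D scenes. Our model generalizes previous works that focus on single objects by removing the assumption that the camera pose distribution can be shared across samples. We show that Gaudi obtains state-of-the-art performance in the unconditional generative setting across multiple datasets and allows for conditional generation of 3D scenes given conditioning variables such as sparse image observations or text that describes the scene.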
We introduce DiffRF, a novel approach for 3D radiance field synthesis based on denoising diffusion probabilistic models. While existing diffusion-based methods operate on images, latent codes, or point cloud data, we are the first to directly generate volumetric radiance fields. To this end, we propose a 3D denoising model which directly operates on an explicit voxel grid representation. However, as radiance fields generated from a set of posed images can be ambiguous and contain artifacts, obtaining ground truth radiance field samples is non-trivial. We address this challenge by pairing the denoising formulation with a rendering loss, enabling our model to learn a deviated prior that favours good image quality instead of trying to replicate fitting errors like floating artifacts. In contrast to 2D-diffusion models, our model learns multi-view consistent priors, enabling free-view synthesis and accurate shape generation. Compared to 3D GANs, our diffusion-based approach naturally enables conditional generation such as masked completion or single-view 3D synthesis at inference time.
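As a rough illustration of the denoising-diffusion setup this abstract refers to, the sketch below applies the standard DDPM forward (noising) step to a flattened voxel grid. The linear beta schedule, its endpoint values, the number of steps, and the toy 4x4x4 grid with four channels are assumptions taken from common DDPM practice, not details stated in the abstract; the function names are hypothetical.

```python
import math
import random

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_voxels(x0, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    signal, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * v + sigma * e for v, e in zip(x0, eps)]
    return xt, eps

# Toy explicit grid: 4x4x4 voxels, 4 channels each (density + RGB), flattened.
rng = random.Random(0)
x0 = [rng.random() for _ in range(4 * 4 * 4 * 4)]
alpha_bars = linear_alpha_bars(1000)
xt, eps = noise_voxels(x0, alpha_bars[500], rng)
```

A denoising network would be trained to predict `eps` (or `x0`) from `xt` and the step index; the rendering loss mentioned in the abstract would be an additional term on top of that objective.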
Generative models have shown great promise in synthesizing photorealistic 3D objects, but they require large amounts of training data. We introduce SinGRAF, a 3D-aware generative model that is trained with a few input images of a single scene. Once trained, SinGRAF generates different realizations of this 3D scene that preserve the appearance of the input while varying scene layout. For this purpose, we build on recent progress in 3D GAN architectures and introduce a novel progressive-scale patch discrimination approach during training. With several experiments, we demonstrate that the results produced by SinGRAF outperform the closest related works in both quality and diversity by a large margin.
This paper presents a 3D generative model that uses diffusion models to automatically generate 3D digital avatars represented as neural radiance fields. A significant challenge in generating such avatars is that the memory and processing costs in 3D are prohibitive for producing the rich details required for high-quality avatars. To tackle this problem we propose the roll-out diffusion network (Rodin), which represents a neural radiance field as multiple 2D feature maps and rolls out these maps into a single 2D feature plane within which we perform 3D-aware diffusion. The Rodin model brings the much-needed computational efficiency while preserving the integrity of diffusion in 3D by using 3D-aware convolution that attends to projected features in the 2D feature plane according to their original relationship in 3D. We also use latent conditioning to orchestrate the feature generation for global coherence, leading to high-fidelity avatars and enabling their semantic editing based on text prompts. Finally, we use hierarchical synthesis to further enhance details. The 3D avatars generated by our model compare favorably with those produced by existing generative techniques. We can generate highly detailed avatars with realistic hairstyles and facial hair like beards. We also demonstrate 3D avatar generation from image or text as well as text-guided editability.
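As several industries move towards modeling massive 3D virtual worlds, the need for content creation tools that can scale in terms of the quantity, quality, and diversity of 3D content is becoming evident. In our work, we aim to train performant 3D generative models that synthesize textured meshes which can be directly consumed by 3D rendering engines and are thus immediately usable in downstream applications. Prior works on 3D generative modeling either lack geometric details, are limited in the mesh topology they can produce, typically do not support textures, or use neural renderers in the synthesis process, which makes their use in common 3D software non-trivial. In this work, we introduce GET3D, a generative model that directly generates explicit textured 3D meshes with complex topology, rich geometric details, and high-fidelity textures. We bridge recent successes in differentiable surface modeling, differentiable rendering, and 2D generative adversarial networks to train our model from 2D image collections. GET3D is able to generate high-quality 3D textured meshes, ranging from cars, chairs, animals, motorbikes, and human characters to buildings, achieving significant improvements over previous methods.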
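A classical problem in computer vision is to infer a 3D scene representation from a few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesizes novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error. We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.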
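We study the problem of novel view synthesis from sparse source observations of a scene comprised of 3D objects. We propose a simple yet effective approach that is neither continuous nor implicit, challenging recent trends in view synthesis. Our approach explicitly encodes observations into a volumetric representation that enables amortized rendering. We demonstrate that, although continuous, implicit representations have attracted attention for their expressive power, our simple approach obtains comparable or even better novel view reconstruction quality than state-of-the-art baselines while increasing rendering speed by over 400x. Our model is trained in a category-agnostic manner and does not require scene-specific optimization. Therefore, it is able to generalize novel view synthesis to object categories not seen during training. In addition, we show that with our simple formulation, we can use view synthesis as a self-supervision signal for efficient learning of 3D geometry without explicit 3D supervision.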
Diffusion models have emerged as the state-of-the-art for image generation, among other tasks. Here, we present an efficient diffusion-based model for 3D-aware generation of neural fields. Our approach pre-processes training data, such as ShapeNet meshes, by converting them to continuous occupancy fields and factoring them into a set of axis-aligned triplane feature representations. Thus, our 3D training scenes are all represented by 2D feature planes, and we can directly train existing 2D diffusion models on these representations to generate 3D neural fields with high quality and diversity, outperforming alternative approaches to 3D-aware generation. Our approach requires essential modifications to existing triplane factorization pipelines to make the resulting features easy to learn for the diffusion model. We demonstrate state-of-the-art results on 3D generation on several object classes from ShapeNet.
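To make the triplane idea concrete, here is a minimal sketch of how a feature for a 3D point can be read from three axis-aligned 2D feature planes. Bilinear sampling and aggregation by summation are assumptions (the abstract does not specify either), and `bilinear` and `triplane_features` are hypothetical helper names. In a full pipeline the aggregated feature would be passed to a small decoder that predicts occupancy.

```python
def bilinear(plane, u, v):
    """Bilinearly sample plane[row][col] (a feature list) at (u, v) in [0, 1]."""
    h, w = len(plane), len(plane[0])
    x, y = u * (w - 1), v * (h - 1)
    x0, y0 = min(int(x), w - 2), min(int(y), h - 2)
    fx, fy = x - x0, y - y0
    out = []
    for c in range(len(plane[0][0])):
        top = (1 - fx) * plane[y0][x0][c] + fx * plane[y0][x0 + 1][c]
        bot = (1 - fx) * plane[y0 + 1][x0][c] + fx * plane[y0 + 1][x0 + 1][c]
        out.append((1 - fy) * top + fy * bot)
    return out

def triplane_features(planes, point):
    """Project a point in [0, 1]^3 onto the XY, XZ, and YZ planes and sum the features."""
    x, y, z = point
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]

# Toy 2x2 planes with one channel: XY encodes x, XZ encodes z, YZ is constant 1.
planes = {
    "xy": [[[0.0], [1.0]], [[0.0], [1.0]]],
    "xz": [[[0.0], [0.0]], [[1.0], [1.0]]],
    "yz": [[[1.0], [1.0]], [[1.0], [1.0]]],
}
feature = triplane_features(planes, (0.25, 0.5, 0.75))
```

Because every scene is stored as three 2D arrays, the planes can be stacked channel-wise and treated like an image, which is what lets an off-the-shelf 2D diffusion model be trained on them.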
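Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits the quality and resolution of the generated images, and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without relying on these approximations. To this end, we introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, not only synthesizes high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQ Cats, among other experiments.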
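The advancement of generative radiance fields has pushed the boundary of 3D-aware image synthesis. Motivated by the observation that a 3D object should look realistic from multiple viewpoints, these methods introduce a multi-view constraint as regularization to learn valid 3D radiance fields from 2D images. Despite this progress, they often fall short of capturing accurate 3D shapes due to the shape-color ambiguity, which limits their applicability in downstream tasks. In this work, we address this ambiguity by proposing a novel shading-guided generative implicit model that is able to learn a substantially improved shape representation. Our key insight is that an accurate 3D shape should also yield a realistic rendering under different lighting conditions. This multi-lighting constraint is realized by modeling illumination explicitly and performing shading under various lighting conditions. Gradients are derived by feeding the synthesized images to a discriminator. To compensate for the additional computational burden of calculating surface normals, we further devise an efficient volume rendering strategy via surface tracking, reducing training and inference time by 24% and 48%, respectively. Our experiments on multiple datasets show that the proposed approach achieves photorealistic 3D-aware image synthesis while capturing accurate underlying 3D shapes. We demonstrate improved performance of our approach on 3D shape reconstruction against existing methods, and show its applicability to image relighting. Our code will be released at https://github.com/xingangpan/shadegan.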
Generative models, an important family of statistical models, aim to learn the observed data distribution by generating new instances. Along with the rise of neural networks, deep generative models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), have made tremendous progress in 2D image synthesis. Recently, researchers have shifted their attention from 2D space to 3D space, considering that 3D data better aligns with our physical world and hence has great potential in practice. However, unlike a 2D image, which has an efficient representation (i.e., a pixel grid) by nature, representing 3D data poses far more challenges. Concretely, we would expect an ideal 3D representation to be capable enough to model shapes and appearances in detail, and efficient enough to model high-resolution data at high speed and low memory cost. However, existing 3D representations, such as point clouds, meshes, and recent neural fields, usually fail to meet these requirements simultaneously. In this survey, we make a thorough review of the development of 3D generation, including 3D shape generation and 3D-aware image synthesis, from the perspectives of both algorithms and, more importantly, representations. We hope that our discussion helps the community track the evolution of this field and sparks innovative ideas to advance this challenging task.
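We propose an unsupervised method for 3D geometry-aware representation learning of articulated objects, in which no image-pose pairs or foreground masks are used for training. Though photorealistic images of articulated objects can be rendered with explicit pose control through existing 3D neural representations, these methods require ground truth 3D pose and foreground masks for training, which are expensive to obtain. We obviate this need by learning the representations with GAN training. The generator is trained to produce realistic images of articulated objects from random poses and latent vectors by adversarial training. To avoid the high computational cost of GAN training, we propose an efficient neural representation for articulated objects based on tri-planes and then present a GAN-based framework for its unsupervised training. Experiments demonstrate the efficiency of our method and show that GAN-based training enables the learning of controllable 3D representations without paired supervision.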
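We present NeSF, a method for producing 3D semantic fields from posed RGB images alone. In place of classical 3D representations, our method builds on recent work in implicit neural scene representations, wherein 3D structure is captured by point-wise functions. We leverage this methodology to recover 3D density fields, upon which we then train a 3D semantic segmentation model supervised by posed 2D semantic maps. Despite being trained on 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points. Notably, NeSF is compatible with any method producing a density field, and its accuracy improves as the quality of the density field improves. Our empirical analysis demonstrates quality comparable to competitive 2D and 3D semantic segmentation baselines on complex, realistically rendered synthetic scenes. Our method is the first to offer truly dense 3D scene segmentations requiring only 2D supervision for training, and it does not require any semantic input for inference on novel scenes. We encourage readers to visit the project website.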
We introduce 3inGAN, an unconditional 3D generative model trained from 2D images of a single self-similar 3D scene. Such a model can be used to produce 3D "remixes" of a given scene, by mapping spatial latent codes into a 3D volumetric representation, which can subsequently be rendered from arbitrary views using physically based volume rendering. By construction, the generated scenes remain view-consistent across arbitrary camera configurations, without any flickering or spatio-temporal artifacts. During training, we employ a combination of 2D, obtained through differentiable volume tracing, and 3D Generative Adversarial Network (GAN) losses, across multiple scales, enforcing realism on both its 3D structure and the 2D renderings. We show results on semi-stochastic scenes of varying scale and complexity, obtained from real and synthetic sources. We demonstrate, for the first time, the feasibility of learning plausible view-consistent 3D scene variations from a single exemplar scene and provide qualitative and quantitative comparisons against recent related methods.
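The volume rendering step mentioned in this abstract can be sketched with the standard emission-absorption quadrature along a single ray. This is the formulation commonly used with radiance fields; whether 3inGAN uses exactly this discretization is an assumption, and `composite_ray` is a hypothetical name.

```python
import math

def composite_ray(densities, colors, deltas):
    """Composite samples front to back along one ray.

    densities: non-negative volume density per sample
    colors:    RGB triple per sample
    deltas:    distance covered by each sample
    Returns (rgb, accumulated opacity).
    """
    transmittance = 1.0  # fraction of light still reaching the camera
    rgb = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        transmittance *= 1.0 - alpha
    return rgb, 1.0 - transmittance

# Empty space in front of a nearly opaque red sample.
rgb, opacity = composite_ray(
    densities=[0.0, 1e9],
    colors=[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    deltas=[0.1, 0.1],
)
```

Every operation here is differentiable with respect to the densities and colors, which is what allows 2D image losses to supervise a 3D volume.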
It is common practice in deep learning to represent a measurement of the world on a discrete grid, e.g. a 2D grid of pixels. However, the underlying signal represented by these measurements is often continuous, e.g. the scene depicted in an image. A powerful continuous alternative is then to represent these measurements using an implicit neural representation, a neural function trained to output the appropriate measurement value for any input spatial location. In this paper, we take this idea to its next level: what would it take to perform deep learning on these functions instead, treating them as data? In this context we refer to the data as functa, and propose a framework for deep learning on functa. This view presents a number of challenges around efficient conversion from data to functa, compact representation of functa, and effectively solving downstream tasks on functa. We outline a recipe to overcome these challenges and apply it to a wide range of data modalities including images, 3D shapes, neural radiance fields (NeRF) and data on manifolds. We demonstrate that this approach has various compelling properties across data modalities, in particular on the canonical tasks of generative modeling, data imputation, novel view synthesis and classification. Code: https://github.com/deepmind/functa
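We introduce a method for learning a generative 3D model based on neural radiance fields, trained solely from data with a single view of each object. While generating realistic images is no longer a difficult task, producing the corresponding 3D structure such that it can be rendered from different views is non-trivial. We show that, unlike existing methods, one does not need multi-view data to achieve this goal. Specifically, we show that by reconstructing many images aligned to an approximate canonical pose with a single network conditioned on a shared latent space, one can learn a space of radiance fields that models shape and appearance for a class of objects. We demonstrate this by training models to reconstruct object categories using datasets that contain only one view of each subject, without depth or geometry information. Our experiments show that we achieve state-of-the-art results in novel view synthesis and competitive results for monocular depth prediction.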
StyleGAN has achieved great progress in 2D face reconstruction and semantic editing via image inversion and latent editing. While studies over extending 2D StyleGAN to 3D faces have emerged, a corresponding generic 3D GAN inversion framework is still missing, limiting the applications of 3D face reconstruction and semantic editing. In this paper, we study the challenging problem of 3D GAN inversion where a latent code is predicted given a single face image to faithfully recover its 3D shapes and detailed textures. The problem is ill-posed: innumerable compositions of shape and texture could be rendered to the current image. Furthermore, with the limited capacity of a global latent code, 2D inversion methods cannot preserve faithful shape and texture at the same time when applied to 3D models. To solve this problem, we devise an effective self-training scheme to constrain the learning of inversion. The learning is done efficiently without any real-world 2D-3D training pairs but proxy samples generated from a 3D GAN. In addition, apart from a global latent code that captures the coarse shape and texture information, we augment the generation network with a local branch, where pixel-aligned features are added to faithfully reconstruct face details. We further consider a new pipeline to perform 3D view-consistent editing. Extensive experiments show that our method outperforms state-of-the-art inversion methods in both shape and texture reconstruction quality. Code and data will be released.
Unsupervised learning with generative models has the potential of discovering rich representations of 3D scenes. While geometric deep learning has explored 3D-structure-aware representations of scene geometry, these models typically require explicit 3D supervision. Emerging neural scene representations can be trained only with posed 2D images, but existing methods ignore the three-dimensional structure of scenes. We propose Scene Representation Networks (SRNs), a continuous, 3D-structure-aware scene representation that encodes both geometry and appearance. SRNs represent scenes as continuous functions that map world coordinates to a feature representation of local scene properties. By formulating the image formation as a differentiable ray-marching algorithm, SRNs can be trained end-to-end from only 2D images and their camera poses, without access to depth or shape. This formulation naturally generalizes across scenes, learning powerful geometry and appearance priors in the process. We demonstrate the potential of SRNs by evaluating them for novel view synthesis, few-shot reconstruction, joint shape and appearance interpolation, and unsupervised discovery of a non-rigid face model.
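Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanying textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related, and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, now often referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects...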
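By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days, and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows, for the first time, reaching a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes, and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/compvis/lattent-diffusion.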