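We present a technique for zero-shot generation of a 3D model using only a target text prompt. Without any 3D supervision, our method deforms the control shape of a limit subdivision surface, along with its texture map and normal map, to obtain a 3D asset that corresponds to the input text prompt and can be easily deployed in games or modeling applications. We rely only on a pre-trained CLIP model that compares the input text prompt with differentiably rendered images of our 3D model. While previous works have focused on stylization or required training of generative models, we optimize the mesh parameters directly to generate shape, texture, or both. To constrain the optimization to produce plausible meshes and textures, we introduce a number of techniques based on image augmentations, and we use a pre-trained prior that generates CLIP image embeddings given a text embedding.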
Translated by Google Translate
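We combine neural rendering with multi-modal image and text representations to synthesize diverse 3D objects solely from natural language descriptions. Our method, Dream Fields, can generate the geometry and color of a wide range of objects without 3D supervision. Due to the scarcity of diverse, captioned 3D data, prior methods only generate objects from a handful of categories, such as those in ShapeNet. Instead, we guide generation with image-text models pre-trained on large datasets of captioned images from the web. Our method optimizes a neural radiance field from many camera views so that rendered images score highly with a target caption according to a pre-trained CLIP model. To improve fidelity and visual quality, we introduce simple geometric priors, including sparsity-inducing transmittance regularization, scene bounds, and new MLP architectures. In experiments, Dream Fields produces realistic, multi-view consistent object geometry and color from a variety of natural language captions.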
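The objective described in this abstract (a CLIP score between rendered views and the caption, plus a sparsity-inducing transmittance term) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings are plain lists standing in for CLIP outputs, the clamped form of the transmittance term is an assumption, and `target` and `weight` are illustrative placeholder values.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def caption_guided_loss(image_emb, text_emb, transmittances,
                        target=0.88, weight=0.5):
    """Hypothetical loss: lower is better.

    image_emb / text_emb: stand-ins for CLIP image and text embeddings.
    transmittances: per-ray transmittance of the rendered view, in [0, 1].
    target, weight: illustrative hyperparameters (assumptions).
    """
    # Reward renders whose embedding aligns with the caption embedding.
    clip_term = -cosine_similarity(image_emb, text_emb)
    # Reward mostly empty space, but only up to the target average
    # transmittance (assumed clamped form of the regularizer).
    mean_transmittance = sum(transmittances) / len(transmittances)
    sparsity_term = -min(target, mean_transmittance)
    return clip_term + weight * sparsity_term
```

In a real pipeline the loss would be averaged over many randomly sampled camera views and backpropagated through a differentiable renderer; the clamp means the scene is not pushed toward being entirely empty once the target is reached.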
Recent CLIP-guided 3D optimization methods, e.g., DreamFields and PureCLIPNeRF, achieve great success in zero-shot text-guided 3D synthesis. However, because they train from scratch with random initialization and no prior knowledge, these methods usually fail to generate accurate and faithful 3D structures that conform to the corresponding text. In this paper, we make the first attempt to introduce an explicit 3D shape prior into CLIP-guided 3D optimization methods. Specifically, we first generate a high-quality 3D shape from the input text in a text-to-shape stage and use it as the 3D shape prior. We then use it to initialize a neural radiance field and optimize that field with the full prompt. For text-to-shape generation, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between images synthesized by the text-to-image model and the shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, Dream3D, is capable of generating imaginative 3D content with better visual quality and shape accuracy than state-of-the-art methods.
While recent work on text-conditional 3D object generation has shown promising results, the state-of-the-art methods typically require multiple GPU-hours to produce a single sample. This is in stark contrast to state-of-the-art generative image models, which produce samples in a number of seconds or minutes. In this paper, we explore an alternative method for 3D object generation which produces 3D models in only 1-2 minutes on a single GPU. Our method first generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model which conditions on the generated image. While our method still falls short of the state-of-the-art in terms of sample quality, it is one to two orders of magnitude faster to sample from, offering a practical trade-off for some use cases. We release our pre-trained point cloud diffusion models, as well as evaluation code and models, at https://github.com/openai/point-e.
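In this work, we develop intuitive controls for editing the style of 3D objects. Our framework, Text2Mesh, stylizes a 3D mesh by predicting color and local geometric details that conform to a target text prompt. We consider a disentangled representation of a 3D object that uses a fixed mesh input (content) coupled with a learned neural network, which we term a neural style field network. To modify style, we obtain a similarity score between a text prompt (describing the style) and a stylized mesh by harnessing the representational power of CLIP. Text2Mesh requires neither a pre-trained generative model nor a specialized 3D mesh dataset. It can handle low-quality meshes (non-manifold, with boundaries, etc.) of arbitrary genus, and it does not require UV parameterization. We demonstrate the ability of our technique to synthesize a myriad of styles over a wide variety of 3D meshes.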
We propose ClipFace, a novel self-supervised approach for text-guided editing of textured 3D morphable model of faces. Specifically, we employ user-friendly language prompts to enable control of the expressions as well as appearance of 3D faces. We leverage the geometric expressiveness of 3D morphable models, which inherently possess limited controllability and texture expressivity, and develop a self-supervised generative model to jointly synthesize expressive, textured, and articulated faces in 3D. We enable high-quality texture generation for 3D faces by adversarial self-supervised training, guided by differentiable rendering against collections of real RGB images. Controllable editing and manipulation are given by language prompts to adapt texture and expression of the 3D morphable model. To this end, we propose a neural network that predicts both texture and expression latent codes of the morphable model. Our model is trained in a self-supervised fashion by exploiting differentiable rendering and losses based on a pre-trained CLIP model. Once trained, our model jointly predicts face textures in UV-space, along with expression parameters to capture both geometry and texture changes in facial expressions in a single forward pass. We further show the applicability of our method to generate temporally changing textures for a given animation sequence.
We present 3D Highlighter, a technique for localizing semantic regions on a mesh using text as input. A key feature of our system is the ability to interpret "out-of-domain" localizations. Our system demonstrates the ability to reason about where to place non-obviously related concepts on an input 3D shape, such as adding clothing to a bare 3D animal model. Our method contextualizes the text description using a neural field and colors the corresponding region of the shape using a probability-weighted blend. Our neural optimization is guided by a pre-trained CLIP encoder, which bypasses the need for any 3D datasets or 3D annotations. Thus, 3D Highlighter is highly flexible, general, and capable of producing localizations on a myriad of input shapes. Our code is publicly available at https://github.com/threedle/3DHighlighter.
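We propose CLIP-Actor, a text-driven motion recommendation and neural mesh stylization system for human mesh animation. CLIP-Actor animates a 3D human mesh to conform to a text prompt by recommending a motion sequence and learning mesh style attributes. Prior work fails to generate plausible results when the artist-designed mesh content does not conform to the text from the outset. Instead, we build a text-driven human motion recommendation system by leveraging a large-scale human motion dataset with language labels. Given a natural language prompt, CLIP-Actor first suggests, in a coarse-to-fine manner, a human motion that conforms to the prompt. We then propose a synthesize-through-optimization method that adds detail and texture to the recommended mesh sequence in a manner disentangled from the pose of each frame. This allows the style attributes to conform to the prompt in a temporally consistent and pose-agnostic manner. The decoupled neural optimization also enables spatio-temporal view augmentation of the human motion. We further propose mask-weighted embedding attention, which stabilizes the optimization process by rejecting distracting renders that contain few foreground pixels. We demonstrate that CLIP-Actor produces plausible, human-recognizable stylized 3D human meshes in motion, with detailed geometry and texture, from a natural language prompt.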
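The idea of down-weighting renders with few foreground pixels can be sketched as below. This is only an illustration of the general principle under stated assumptions, not the mechanism from the paper: the function name, the linear weighting by foreground fraction, and the `min_foreground` cutoff are all hypothetical.

```python
def mask_weighted_embedding(embeddings, masks, min_foreground=0.05):
    """Aggregate per-render embeddings, weighted by foreground coverage.

    embeddings: one vector (list of floats) per rendered view.
    masks: one binary 2D mask (list of rows of 0/1) per rendered view.
    min_foreground: hypothetical cutoff; renders whose foreground
        fraction falls below it are rejected (weight 0).
    """
    weights = []
    for mask in masks:
        pixels = [p for row in mask for p in row]
        fraction = sum(pixels) / len(pixels)
        weights.append(fraction if fraction >= min_foreground else 0.0)

    total = sum(weights)
    if total == 0:
        raise ValueError("no render has enough foreground pixels")

    dim = len(embeddings[0])
    # Weighted mean of the embeddings, component by component.
    return [
        sum(w * emb[i] for w, emb in zip(weights, embeddings)) / total
        for i in range(dim)
    ]
```

The aggregated embedding would then be compared against the text embedding, so that views showing almost none of the body contribute little or nothing to the gradient.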
Text-guided 3D object generation aims to generate 3D objects described by user-defined captions, which offers a flexible way to visualize what we imagine. Although some works have been devoted to solving this challenging task, these works either utilize some explicit 3D representations (e.g., mesh), which lack texture and require post-processing for rendering photo-realistic views; or require individual time-consuming optimization for every single case. Here, we make the first attempt to achieve generic text-guided cross-category 3D object generation via a new 3D-TOGO model, which integrates a text-to-views generation module and a views-to-3D generation module. The text-to-views generation module is designed to generate different views of the target 3D object given an input caption. Prior-guidance, caption-guidance and view contrastive learning are proposed for achieving better view-consistency and caption similarity. Meanwhile, a pixelNeRF model is adopted for the views-to-3D generation module to obtain the implicit 3D neural representation from the previously-generated views. Our 3D-TOGO model generates 3D objects in the form of the neural radiance field with good texture and requires no time-consuming optimization for every single caption. Besides, 3D-TOGO can control the category, color and shape of generated 3D objects with the input caption. Extensive experiments on the largest 3D object dataset (i.e., ABO) are conducted to verify that 3D-TOGO can better generate high-quality 3D objects according to the input captions across 98 different categories, in terms of PSNR, SSIM, LPIPS and CLIP-score, compared with text-NeRF and Dreamfields.
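Creating 3D content generally requires highly specialized skills to design and generate models of objects and other assets. We address this problem by retrieving high-quality 3D assets from multi-modal inputs, including 2D sketches, images, and text. We use CLIP because it provides a bridge to higher-level latent features. We use these features to perform multi-modal fusion, addressing the lack of artistic control that affects common data-driven approaches. By using combinations of the input latent embeddings, our approach enables multi-modal, conditional, feature-driven retrieval from a 3D asset database. We explore the effects of different combinations of feature embeddings across different input types and weighting methods.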
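A weighted combination of input embeddings followed by nearest-neighbor retrieval can be sketched as below. This is a minimal sketch under stated assumptions: the toy 3-dimensional vectors stand in for CLIP embeddings, and the function names, the linear fusion rule, and the asset database are hypothetical rather than taken from the paper.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def fuse(embeddings, weights):
    """Weighted sum of unit-normalized embeddings, renormalized.

    embeddings: one vector per input modality (sketch, image, text, ...).
    weights: one non-negative weight per modality (assumed fusion rule).
    """
    dim = len(embeddings[0])
    fused = [0.0] * dim
    for embedding, weight in zip(embeddings, weights):
        unit = normalize(embedding)
        for i in range(dim):
            fused[i] += weight * unit[i]
    return normalize(fused)

def retrieve(query, database, top_k=1):
    """Return the names of the top_k assets by cosine similarity."""
    scored = [
        (sum(q * d for q, d in zip(query, normalize(vector))), name)
        for name, vector in database.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Toy usage: a sketch that looks like a chair, a caption about a table.
database = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0], "lamp": [0.0, 0.0, 1.0]}
sketch_emb = [0.9, 0.1, 0.0]
text_emb = [0.1, 0.9, 0.0]
sketch_led = retrieve(fuse([sketch_emb, text_emb], [0.8, 0.2]), database)
text_led = retrieve(fuse([sketch_emb, text_emb], [0.2, 0.8]), database)
```

Shifting weight between modalities changes which asset ranks first, which is the kind of artistic control the abstract refers to.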
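As several industries move toward modeling massive 3D virtual worlds, the need for content creation tools that can scale in terms of the quantity, quality, and diversity of 3D content is becoming evident. In our work, we aim to train performant 3D generative models that synthesize textured meshes that can be directly consumed by 3D rendering engines and are thus immediately usable in downstream applications. Prior works on 3D generative modeling either lack geometric details, are limited in the mesh topology they can produce, typically do not support textures, or use neural renderers in the synthesis process, which makes their use in common 3D software non-trivial. In this work, we introduce GET3D, a generative model that directly generates explicit textured 3D meshes with complex topology, rich geometric details, and high-fidelity textures. We bridge recent successes in differentiable surface modeling, differentiable rendering, and 2D generative adversarial networks to train our model from 2D image collections. GET3D is able to generate high-quality 3D textured meshes, ranging from cars, chairs, animals, motorbikes, and human characters to buildings, achieving significant improvements over previous methods.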
As a powerful representation of 3D scenes, the neural radiance field (NeRF) enables high-quality novel view synthesis from multi-view images. Stylizing NeRF, however, remains challenging, especially on simulating a text-guided style with both the appearance and the geometry altered simultaneously. In this paper, we present NeRF-Art, a text-guided NeRF stylization approach that manipulates the style of a pre-trained NeRF model with a simple text prompt. Unlike previous approaches that either lack sufficient geometry deformations and texture details or require meshes to guide the stylization, our method can shift a 3D scene to the target style characterized by desired geometry and appearance variations without any mesh guidance. This is achieved by introducing a novel global-local contrastive learning strategy, combined with the directional constraint to simultaneously control both the trajectory and the strength of the target style. Moreover, we adopt a weight regularization method to effectively suppress cloudy artifacts and geometry noises which arise easily when the density field is transformed during geometry stylization. Through extensive experiments on various styles, we demonstrate that our method is effective and robust regarding both single-view stylization quality and cross-view consistency. The code and more results can be found in our project page: https://cassiepython.github.io/nerfart/.
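The directional constraint mentioned in this abstract is commonly formulated as aligning the change in image embedding with the change in text embedding. The sketch below shows that common formulation only; it assumes plain lists in place of CLIP embeddings, the function names are hypothetical, and it does not cover the global-local contrastive terms or the weight regularization.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def directional_loss(source_image, target_image, source_text, target_text):
    """1 - cos(image edit direction, text edit direction).

    source_image / target_image: embeddings of the original and the
        stylized render (stand-ins for CLIP image embeddings).
    source_text / target_text: embeddings of a neutral prompt and the
        style prompt (stand-ins for CLIP text embeddings).
    """
    image_direction = [t - s for s, t in zip(source_image, target_image)]
    text_direction = [t - s for s, t in zip(source_text, target_text)]
    return 1.0 - cosine_similarity(image_direction, text_direction)
```

The loss is near 0 when the stylized render moves in the same embedding direction as the prompt change and near 2 when it moves the opposite way, so it constrains the trajectory of the edit rather than the absolute similarity to the target prompt.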
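We present an efficient method for jointly optimizing topology, materials, and lighting from multi-view image observations. Unlike recent multi-view reconstruction approaches, which typically produce entangled 3D representations encoded in neural networks, we output triangle meshes with spatially varying materials and environment lighting that can be deployed unmodified in any traditional graphics engine. We leverage recent work in differentiable rendering, coordinate-based networks to compactly represent volumetric texturing, and differentiable marching tetrahedra to enable gradient-based optimization directly on the surface mesh. Finally, we introduce a differentiable formulation of the split-sum approximation of environment lighting to efficiently recover all-frequency lighting. Experiments show our extracted models used in advanced scene editing, material decomposition, and high-quality view interpolation, all running at interactive rates in triangle-based renderers (rasterizers and path tracers).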
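For readers unfamiliar with the split-sum approximation of environment lighting, its commonly stated form is given below. The notation is an assumption for illustration (the abstract defines no symbols): $L_i$ is incoming radiance, $f$ the BSDF, $D$ the microfacet normal distribution term, $\mathbf{n}$ the surface normal, and $\omega_i$, $\omega_o$ the incoming and outgoing directions.

```latex
L(\omega_o)
  = \int_{\Omega} L_i(\omega_i)\, f(\omega_i, \omega_o)\, (\omega_i \cdot \mathbf{n})\, d\omega_i
  \approx
  \left( \int_{\Omega} f(\omega_i, \omega_o)\, (\omega_i \cdot \mathbf{n})\, d\omega_i \right)
  \left( \int_{\Omega} L_i(\omega_i)\, D(\omega_i, \omega_o)\, (\omega_i \cdot \mathbf{n})\, d\omega_i \right)
```

The first factor depends only on the material and viewing geometry and can be precomputed, while the second is a pre-filtered version of the environment map; making both factors differentiable is what allows the lighting to be optimized by gradient descent.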
Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale specific 3D human datasets such that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can easily reconstruct the body geometry and infer the full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT introduces the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with CLIP-based pre-trained models. Both priors are used to jointly guide the optimization for creating plausible content in the invisible areas. In order to further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT outperforms current state-of-the-art avatar creation methods when only a single image is available. Code will be made public for research purposes at https://elicit3d.github.io .
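Text-guided 3D shape generation remains challenging due to the absence of large paired text-shape datasets, the substantial semantic gap between the two modalities, and the structural complexity of 3D shapes. This paper presents a new framework called Image as Stepping Stone (ISS), which introduces 2D images as a stepping stone to connect the two modalities and to eliminate the need for paired text-shape data. Our key contribution is a two-stage feature-space alignment approach that maps CLIP features to shapes by harnessing a pre-trained single-view reconstruction (SVR) model with multi-view supervision: it first maps the CLIP image feature to the detail-rich shape space of the SVR model, then maps the CLIP text feature to the shape space and optimizes the mapping by encouraging CLIP consistency between the input text and the rendered images. Further, we formulate a text-guided shape stylization module to dress up the output shapes with novel textures. Beyond existing works on 3D shape generation from text, our new approach is general, creating shapes without requiring paired text-shape data. Experimental results show that our approach outperforms the state of the art and our baselines in terms of fidelity and consistency with the text. Further, our approach can stylize the generated shapes with both realistic and fantasy structures and textures.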
2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving this problem due to their prior knowledge of the 3D world developed over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representations by minimizing a diffusion loss on its arbitrary view renderings with a pretrained image diffusion model under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning inputs to the diffusion model. This is essentially helpful for improving multiview content coherence as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views with higher quality even compared to existing methods trained on this dataset. We also demonstrate our generalizability in zero-shot NeRF synthesis for in-the-wild images.
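We present CLIP-NeRF, a multi-modal 3D object manipulation method for neural radiance fields (NeRF). By leveraging the joint language-image embedding space of the recent Contrastive Language-Image Pre-Training (CLIP) model, we propose a unified framework that allows manipulating NeRF in a user-friendly way, using either a short text prompt or an exemplar image. Specifically, to combine the novel view synthesis capability of NeRF with the controllable manipulation ability of latent representations from generative models, we introduce a disentangled conditional NeRF architecture that allows individual control over both shape and appearance. This is achieved by performing shape conditioning via a learned deformation field applied to the positional encoding, and by deferring color conditioning to the volumetric rendering stage. To bridge this disentangled latent representation to the CLIP embedding, we design two code mappers that take a CLIP embedding as input and update the latent codes to reflect the targeted edit. The mappers are trained with a CLIP-based matching loss to ensure manipulation accuracy. Furthermore, we propose an inverse optimization method that accurately projects an input image to the latent codes for manipulation, enabling editing of real images. We evaluate our approach through extensive experiments on a variety of text prompts and exemplar images, and we also provide an intuitive interface for interactive editing. Our implementation is available at https://cassiepython.github.io/clipnerf/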
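We address the new problem of language-guided semantic style transfer for 3D indoor scenes. The input is a 3D indoor scene mesh and several phrases that describe the target scene. First, 3D vertex coordinates are mapped to RGB residuals by a multi-layer perceptron. Second, the colored 3D meshes are differentiably rendered into 2D images via a viewpoint sampling strategy tailored to indoor scenes. Third, the rendered 2D images are compared with the phrases via pre-trained vision-language models. Finally, the errors are back-propagated to the multi-layer perceptron to update the vertex colors corresponding to certain semantic categories. We conducted large-scale qualitative analyses and A/B user tests on the public ScanNet and SceneNN datasets. We demonstrate that: (1) the results are visually pleasing and potentially useful for multimedia applications; (2) rendering 3D indoor scenes from viewpoints consistent with human priors is important; (3) incorporating semantics significantly improves style transfer quality; (4) an HSV regularization term leads to results that are more consistent with the inputs and are generally rated better. Code and the user study toolbox are available at https://github.com/air-discover/lasst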
This paper presents a 3D generative model that uses diffusion models to automatically generate 3D digital avatars represented as neural radiance fields. A significant challenge in generating such avatars is that the memory and processing costs in 3D are prohibitive for producing the rich details required for high-quality avatars. To tackle this problem we propose the roll-out diffusion network (Rodin), which represents a neural radiance field as multiple 2D feature maps and rolls out these maps into a single 2D feature plane within which we perform 3D-aware diffusion. The Rodin model brings the much-needed computational efficiency while preserving the integrity of diffusion in 3D by using 3D-aware convolution that attends to projected features in the 2D feature plane according to their original relationship in 3D. We also use latent conditioning to orchestrate the feature generation for global coherence, leading to high-fidelity avatars and enabling their semantic editing based on text prompts. Finally, we use hierarchical synthesis to further enhance details. The 3D avatars generated by our model compare favorably with those produced by existing generative techniques. We can generate highly detailed avatars with realistic hairstyles and facial hair like beards. We also demonstrate 3D avatar generation from image or text as well as text-guided editability.
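Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and they are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanying textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related: it combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward toward the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, now often referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects...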