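Text-guided 3D shape generation remains challenging because of the absence of large paired text-shape datasets, the substantial semantic gap between the two modalities, and the structural complexity of 3D shapes. This paper presents a new framework called Image as Stepping Stone (ISS), which introduces 2D images as a stepping stone to connect the two modalities and to eliminate the need for paired text-shape data. Our key contribution is a two-stage feature-space alignment approach that maps CLIP features to shapes by harnessing a pre-trained single-view reconstruction (SVR) model with multi-view supervision: we first map the CLIP image feature to the detail-rich shape space of the SVR model, then map the CLIP text feature to the shape space and optimize the mapping by encouraging CLIP consistency between the input text and the rendered images. Further, we formulate a text-guided shape stylization module to dress up the output shapes with novel textures. Beyond existing works on 3D shape generation from text, our new approach is general for creating shapes without requiring paired text-shape data. Experimental results show that our approach outperforms the state of the art and our baselines in terms of fidelity and consistency with text. Further, our approach can stylize the generated shapes with both realistic and fantasy structures and textures.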
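The CLIP consistency term mentioned above can be illustrated with a minimal sketch. It assumes the text feature and the features of the rendered views have already been extracted as plain vectors; the function names and the choice of "one minus cosine similarity, averaged over views" are illustrative assumptions, not the formulation confirmed by the abstract.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_consistency_loss(text_feature, rendered_view_features):
    """Average (1 - cosine similarity) between the text feature and each view.

    Assumption: features are precomputed embeddings; no CLIP model is
    loaded here, so this only shows the shape of the objective.
    """
    losses = [1.0 - cosine_similarity(text_feature, view)
              for view in rendered_view_features]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    text = [1.0, 0.0, 0.0]
    views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(clip_consistency_loss(text, views))
```

A lower value means the rendered views agree more closely with the text in the shared embedding space.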
translated by 谷歌翻译
Recent CLIP-guided 3D optimization methods, e.g., DreamFields and PureCLIPNeRF, achieve great success in zero-shot text-guided 3D synthesis. However, because they train from scratch with random initialization and no prior knowledge, these methods usually fail to generate accurate and faithful 3D structures that conform to the corresponding text. In this paper, we make the first attempt to introduce an explicit 3D shape prior into CLIP-guided 3D optimization methods. Specifically, we first generate a high-quality 3D shape from the input text in the text-to-shape stage as the 3D shape prior. We then use it to initialize a neural radiance field, which we optimize with the full prompt. For the text-to-shape generation, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between images synthesized by the text-to-image model and the shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, Dream3D, is capable of generating imaginative 3D content with better visual quality and shape accuracy than state-of-the-art methods.
Text-guided 3D object generation aims to generate 3D objects described by user-defined captions, which offers a flexible way to visualize what we imagine. Although some works have been devoted to this challenging task, they either use explicit 3D representations (e.g., mesh), which lack texture and require post-processing for rendering photo-realistic views, or require time-consuming optimization for every single case. Here, we make the first attempt to achieve generic text-guided cross-category 3D object generation via a new 3D-TOGO model, which integrates a text-to-views generation module and a views-to-3D generation module. The text-to-views generation module is designed to generate different views of the target 3D object given an input caption. Prior-guidance, caption-guidance and view contrastive learning are proposed to achieve better view consistency and caption similarity. Meanwhile, a pixelNeRF model is adopted for the views-to-3D generation module to obtain the implicit 3D neural representation from the previously generated views. Our 3D-TOGO model generates 3D objects in the form of a neural radiance field with good texture and requires no time-consuming optimization for each caption. Besides, 3D-TOGO can control the category, color and shape of generated 3D objects with the input caption. Extensive experiments on the largest 3D object dataset (i.e., ABO) verify that 3D-TOGO generates higher-quality 3D objects according to the input captions across 98 different categories, in terms of PSNR, SSIM, LPIPS and CLIP-score, compared with text-NeRF and Dreamfields.
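Of the metrics listed above, PSNR has a simple closed form, PSNR = 10 log10(MAX^2 / MSE). The sketch below computes it for two images given as flat lists of pixel values; the flat-list layout and the default peak value of 1.0 are assumptions made for illustration.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists.

    Assumption: both images are flattened to equal-length lists of
    floats in the range [0, max_value].
    """
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - t) ** 2 for r, t in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(psnr([0.0, 0.0], [0.1, 0.1]))
```

Higher PSNR indicates a rendered view closer to the reference view.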
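We combine neural rendering with multi-modal image and text representations to synthesize diverse 3D objects solely from natural language descriptions. Our method, Dream Fields, can generate the geometry and color of a wide range of objects without 3D supervision. Due to the scarcity of diverse, captioned 3D data, prior methods only generate objects from a handful of categories, such as those in ShapeNet. Instead, we guide generation with image-text models pre-trained on large datasets of captioned images from the web. Our method optimizes a neural radiance field from many camera views so that rendered images score highly with a target caption according to a pre-trained CLIP model. To improve fidelity and visual quality, we introduce simple geometric priors, including sparsity-inducing transmittance regularization, scene bounds, and new MLP architectures. In experiments, Dream Fields produces realistic, multi-view consistent object geometry and color from a variety of natural language captions.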
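One way to picture the sparsity-inducing transmittance regularization is as a reward for average ray transmittance, capped at a target value. The sketch below assumes this capped-mean form; the weight of 0.5 and target of 0.88 are hypothetical illustration values, and the CLIP similarity is passed in as a precomputed number.

```python
def transmittance_regularizer(ray_transmittances, target=0.88):
    """Negative of the mean transmittance, capped at `target`.

    Assumption: each entry is the fraction of light that passes through
    the scene along one ray (1.0 means the ray hits nothing). Capping
    stops the optimizer from emptying the scene entirely.
    """
    mean_t = sum(ray_transmittances) / len(ray_transmittances)
    return -min(target, mean_t)


def total_loss(clip_similarity, ray_transmittances, weight=0.5, target=0.88):
    """CLIP term (negated similarity) plus the weighted regularizer."""
    return -clip_similarity + weight * transmittance_regularizer(
        ray_transmittances, target)


if __name__ == "__main__":
    print(total_loss(0.3, [1.0, 1.0]))
```

Minimizing this quantity pushes rendered images toward the caption while encouraging mostly empty space around the object.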
In this work, we address text-guided image generation and propose a novel framework, CLIP2GAN, that leverages the CLIP model and StyleGAN. The key idea of CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN, which is realized by introducing a mapping network. In the training stage, we encode an image with CLIP and map the output feature to a latent code, which is then used to reconstruct the image. In this way, the mapping network is optimized in a self-supervised manner. In the inference stage, since CLIP can embed both image and text into a shared feature embedding space, we replace the CLIP image encoder in the training architecture with the CLIP text encoder, while keeping the subsequent mapping network and the StyleGAN model. As a result, we can flexibly input a text description to generate an image. Moreover, by simply adding the mapped text features of an attribute to a mapped CLIP image feature, we can effectively edit that attribute in the image. Extensive experiments demonstrate the superior performance of our proposed CLIP2GAN compared to previous methods.
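The attribute-editing step described above amounts to vector addition in the latent space. The sketch below stands in a single linear layer for the mapping network, which is a simplifying assumption; `map_to_latent`, `edit_latent` and the `strength` parameter are hypothetical names used only for illustration.

```python
def map_to_latent(feature, weights):
    """Hypothetical linear mapping network: one output per weight row."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def edit_latent(image_feature, attribute_text_feature, weights, strength=1.0):
    """Add the mapped attribute text feature to the mapped image feature.

    Assumption: both features live in the shared CLIP embedding space,
    so the same mapping applies to each of them.
    """
    image_latent = map_to_latent(image_feature, weights)
    attribute_latent = map_to_latent(attribute_text_feature, weights)
    return [z + strength * a for z, a in zip(image_latent, attribute_latent)]


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 2.0]]
    print(edit_latent([1.0, 1.0], [0.5, 0.0], weights, strength=2.0))
```

The edited latent code would then be passed to the generator in place of the original one.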
While recent work on text-conditional 3D object generation has shown promising results, the state-of-the-art methods typically require multiple GPU-hours to produce a single sample. This is in stark contrast to state-of-the-art generative image models, which produce samples in a number of seconds or minutes. In this paper, we explore an alternative method for 3D object generation which produces 3D models in only 1-2 minutes on a single GPU. Our method first generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model which conditions on the generated image. While our method still falls short of the state-of-the-art in terms of sample quality, it is one to two orders of magnitude faster to sample from, offering a practical trade-off for some use cases. We release our pre-trained point cloud diffusion models, as well as evaluation code and models, at https://github.com/openai/point-e.
3D shapes have complementary abstractions from low-level geometry to part-based hierarchies to languages, which convey different levels of information. This paper presents a unified framework to translate between pairs of shape abstractions: $\textit{Text}$ $\Longleftrightarrow$ $\textit{Point Cloud}$ $\Longleftrightarrow$ $\textit{Program}$. We propose $\textbf{Neural Shape Compiler}$ to model the abstraction transformation as a conditional generation process. It converts 3D shapes of three abstract types into unified discrete shape code, transforms each shape code into code of other abstract types through the proposed $\textit{ShapeCode Transformer}$, and decodes them to output the target shape abstraction. Point Cloud code is obtained in a class-agnostic way by the proposed $\textit{Point}$VQVAE. On Text2Shape, ShapeGlot, ABO, Genre, and Program Synthetic datasets, Neural Shape Compiler shows strengths in $\textit{Text}$ $\Longrightarrow$ $\textit{Point Cloud}$, $\textit{Point Cloud}$ $\Longrightarrow$ $\textit{Text}$, $\textit{Point Cloud}$ $\Longrightarrow$ $\textit{Program}$, and Point Cloud Completion tasks. Additionally, Neural Shape Compiler benefits from jointly training on all heterogeneous data and tasks.
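The conversion into a discrete shape code can be pictured as a vector-quantization lookup, the core operation of VQ-VAE-style models. The sketch below assumes a small fixed codebook and nearest-neighbour assignment under squared Euclidean distance; it illustrates the general technique and is not the paper's PointVQVAE.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def dequantize(indices, codebook):
    """Recover the codebook vectors for a sequence of discrete codes."""
    return [codebook[i] for i in indices]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    codes = quantize([[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]], codebook)
    print(codes, dequantize(codes, codebook))
```

The resulting index sequence is the kind of discrete code a transformer can translate into codes of another abstraction type.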
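We present a technique for zero-shot generation of a 3D model using only a target text prompt. Without any 3D supervision, our method deforms the control shape of a limit subdivision surface along with its texture map and normal map to obtain a 3D asset that corresponds to the input text prompt and can be easily deployed in games or modeling applications. We rely only on a pre-trained CLIP model that compares the input text prompt with differentiably rendered images of our 3D model. While previous works have focused on stylization or required training of generative models, we optimize mesh parameters directly to generate shape, texture or both. To constrain the optimization to produce plausible meshes and textures, we introduce a number of techniques using image augmentations and a pre-trained prior that generates CLIP image embeddings given a text embedding.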
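As several industries move towards modeling massive 3D virtual worlds, the need for content creation tools that can scale in terms of the quantity, quality, and diversity of 3D content is becoming evident. In our work, we aim to train performant 3D generative models that synthesize textured meshes that can be directly consumed by 3D rendering engines and are thus immediately usable in downstream applications. Prior works on 3D generative modeling either lack geometric details, are limited in the mesh topology they can produce, typically do not support textures, or use neural renderers in the synthesis process, which makes them hard to use in common 3D software. In this work, we introduce GET3D, a generative model that directly generates explicit textured 3D meshes with complex topology, rich geometric details, and high-fidelity textures. We bridge recent successes in differentiable surface modeling, differentiable rendering, and 2D generative adversarial networks to train our model from 2D image collections. GET3D is able to generate high-quality 3D textured meshes, ranging from cars, chairs, animals, motorbikes and human characters to buildings, achieving significant improvements over previous methods.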
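Existing neural style transfer methods require reference style images to transfer the texture information of style images to content images. However, in many practical situations, users may not have reference style images but may still be interested in transferring styles by just imagining them. To deal with such applications, we propose a new framework that enables style transfer 'without' a style image, using only a text description of the desired style. Using the pre-trained text-image embedding model CLIP, we demonstrate the modulation of the style of content images with only a single text condition. Specifically, we propose a patch-wise text-image matching loss with multi-view augmentations for realistic texture transfer. Extensive experimental results confirm successful image style transfer with realistic textures that reflect the semantics of the query text.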
As a powerful representation of 3D scenes, the neural radiance field (NeRF) enables high-quality novel view synthesis from multi-view images. Stylizing NeRF, however, remains challenging, especially when simulating a text-guided style in which both the appearance and the geometry are altered simultaneously. In this paper, we present NeRF-Art, a text-guided NeRF stylization approach that manipulates the style of a pre-trained NeRF model with a simple text prompt. Unlike previous approaches that either lack sufficient geometry deformations and texture details or require meshes to guide the stylization, our method can shift a 3D scene to the target style characterized by desired geometry and appearance variations without any mesh guidance. This is achieved by introducing a novel global-local contrastive learning strategy, combined with the directional constraint to simultaneously control both the trajectory and the strength of the target style. Moreover, we adopt a weight regularization method to effectively suppress the cloudy artifacts and geometry noise that arise easily when the density field is transformed during geometry stylization. Through extensive experiments on various styles, we demonstrate that our method is effective and robust regarding both single-view stylization quality and cross-view consistency. The code and more results can be found on our project page: https://cassiepython.github.io/nerfart/.
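A directional constraint of the kind mentioned above is commonly written as one minus the cosine similarity between the image-embedding shift and the text-embedding shift. The sketch below assumes that common form and precomputed embeddings; it is an illustration, not NeRF-Art's exact loss.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def directional_loss(src_image, styled_image, src_text, target_text):
    """1 - cos(image shift, text shift) in a shared embedding space.

    Assumption: all four arguments are precomputed embeddings; the loss
    is zero when the rendering moves in the same direction as the text.
    """
    image_shift = [s - o for s, o in zip(styled_image, src_image)]
    text_shift = [t - o for t, o in zip(target_text, src_text)]
    return 1.0 - cosine_similarity(image_shift, text_shift)


if __name__ == "__main__":
    print(directional_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [2.0, 0.0]))
```

Because only the direction of the shift matters here, a separate term is needed to control how strongly the style is applied.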
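As information exists in various modalities in the real world, effective interaction and fusion among multimodal information plays a key role in the creation and perception of multimodal data in computer vision and deep learning research. With its superb power in modeling the interaction among multimodal information, multimodal image synthesis and editing has become a hot research topic in recent years. Unlike traditional visual guidance, which provides explicit cues, multimodal guidance offers intuitive and flexible means for image synthesis and editing. On the other hand, this field also faces several challenges, including the alignment of features with inherent modality gaps, the synthesis of high-resolution images, and faithful evaluation metrics. In this survey, we comprehensively contextualize recent advances in multimodal image synthesis and editing and formulate taxonomies according to data modality and model architecture. We start with an introduction to different types of guidance modalities in image synthesis and editing. We then describe multimodal image synthesis and editing approaches with detailed frameworks, including generative adversarial networks (GANs), GAN inversion, Transformers, and other methods such as NeRF and diffusion models. This is followed by a comprehensive description of the benchmark datasets and corresponding evaluation metrics widely adopted in multimodal image synthesis and editing, as well as detailed comparisons of different synthesis methods with an analysis of their respective advantages and limitations. Finally, we provide insights into current research challenges and future research directions. A project associated with this survey is available at https://github.com/fnzhan/mise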
Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale specific 3D human datasets such that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can easily reconstruct the body geometry and infer the full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT introduces the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with CLIP-based pre-trained models. Both priors are used to jointly guide the optimization for creating plausible content in the invisible areas. To further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT outperforms current state-of-the-art avatar creation methods when only a single image is available. Code will be made public for research purposes at https://elicit3d.github.io .
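Recovering a textured 3D mesh from a monocular image is highly challenging, particularly for in-the-wild objects that lack 3D ground truth. In this work, we present MeshInversion, a novel framework that improves reconstruction by exploiting the generative prior of a 3D GAN pre-trained for 3D textured mesh synthesis. Reconstruction is achieved by searching the latent space of the 3D GAN for the latent code that best resembles the target mesh. Since the pre-trained GAN encapsulates rich 3D semantics in terms of mesh geometry and texture, searching within the GAN manifold naturally regularizes the realness and fidelity of the reconstruction. Importantly, this regularization is applied directly in 3D space, providing crucial guidance for mesh parts that are unobserved in 2D space. Experiments on standard benchmarks show that our framework obtains faithful 3D reconstructions with consistent geometry and texture across both observed and unobserved parts. Moreover, it generalizes well to less common meshes, such as the extended articulation of deformable objects. Code is released at https://github.com/junzhezhang/mesh-inversion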
2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving this problem due to their prior knowledge of the 3D world developed over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representations by minimizing a diffusion loss on its arbitrary view renderings with a pretrained image diffusion model under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning inputs to the diffusion model. This is essentially helpful for improving multiview content coherence as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views with higher quality even compared to existing methods trained on this dataset. We also demonstrate our generalizability in zero-shot NeRF synthesis for in-the-wild images.
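We propose CLIP-Actor, a text-driven motion recommendation and neural mesh stylization system for human mesh animation. CLIP-Actor animates a 3D human mesh to conform to a text prompt by recommending a motion sequence and learning mesh style attributes. Prior work fails to generate plausible results when the artist-designed mesh content does not conform to the text from the beginning. Instead, we build a text-driven human motion recommendation system by leveraging a large-scale human motion dataset with language labels. Given a natural language prompt, CLIP-Actor first suggests a human motion that conforms to the prompt in a coarse-to-fine manner. Then, we propose a synthesize-through-optimization method that adds detail and texture to the recommended mesh sequence in a manner disentangled from the pose of each frame. This allows the style attributes to conform to the prompt in a temporally consistent and pose-agnostic manner. The decoupled neural optimization also enables spatio-temporal view augmentation from human motion. We further propose mask-weighted embedding attention, which stabilizes the optimization process by rejecting distracting renders that contain scarce foreground pixels. We demonstrate that CLIP-Actor produces plausible and human-recognizable stylized 3D human meshes in motion, with detailed geometry and texture, from a natural language prompt.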
Text-driven person image generation is an emerging and challenging task in cross-modality image generation. Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on. However, previous methods mostly employ single-modality information as the prior condition (e.g. pose-guided person image generation), or utilize the preset words for text-driven human synthesis. Introducing a sentence composed of free words with an editable semantic pose map to describe person appearance is a more user-friendly way. In this paper, we propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation. Specifically, two collaborative modules are proposed, the Stylized Memory Retrieval (SMR) module for fine-grained feature distillation in data processing and the Multi-scale Cross-modality Alignment (MCA) module for coarse-to-fine feature alignment in diffusion. These two modules guarantee the alignment quality of the text and image, from image-level to feature-level, from low-resolution to high-resolution. As a result, HumanDiffusion realizes open-vocabulary person image generation with desired semantic poses. Extensive experiments conducted on DeepFashion demonstrate the superiority of our method compared with previous approaches. Moreover, better results could be obtained for complicated person images with various details and uncommon poses.
In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: a StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module maps real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space. The instance-level optimization is for identity preservation in manipulation. Our model can produce diverse and high-quality images with an unprecedented resolution of $1024^2$. Using a control mechanism based on style-mixing, TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels, with or without instance guidance. To facilitate text-guided multi-modal synthesis, we propose Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
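Style-mixing control can be sketched as a per-layer choice between two sets of latent codes. The example below assumes one latent vector per generator layer and a hypothetical `guided_layers` argument naming the layers taken from the guiding input; the layer count and the function name are illustrative and not taken from the TediGAN code.

```python
def style_mix(base_codes, guide_codes, guided_layers):
    """Return per-layer codes, taking the listed layers from guide_codes.

    Assumption: each argument holds one latent vector per generator
    layer; layers not listed in guided_layers keep the base code.
    """
    if len(base_codes) != len(guide_codes):
        raise ValueError("both inputs need one code per layer")
    chosen = set(guided_layers)
    return [
        guide_codes[i] if i in chosen else base_codes[i]
        for i in range(len(base_codes))
    ]


if __name__ == "__main__":
    base = [[0.0], [0.0], [0.0], [0.0]]
    guide = [[1.0], [1.0], [1.0], [1.0]]
    print(style_mix(base, guide, guided_layers=range(2, 4)))
```

Choosing which layers follow the guiding input is what lets one input govern coarse structure while another governs finer appearance.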
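High-quality HDRIs (high dynamic range images), typically HDR panoramas, are one of the most popular ways to create lighting and reflections for 3D scenes in graphics. Given the difficulty of capturing HDRIs, a versatile and controllable generative model is highly desired, one in which lay users can intuitively control the generation process. However, existing state-of-the-art methods still struggle to synthesize high-quality panoramas for complex scenes. In this work, we propose a zero-shot text-driven framework, Text2Light, to generate 4K+ resolution HDRIs without paired training data. Given free-form text as the description of the scene, we synthesize the corresponding HDRI in two dedicated steps: 1) text-driven panorama generation in low dynamic range (LDR) and low resolution, and 2) super-resolution inverse tone mapping to scale up the LDR panorama in both resolution and dynamic range. Specifically, to achieve zero-shot text-driven panorama generation, we first build dual codebooks as the discrete representation of diverse environmental textures. Then, driven by the pre-trained CLIP model, a text-conditioned global sampler learns to sample holistic semantics from the global codebook according to the input text. Furthermore, a structure-aware local sampler learns to synthesize LDR panoramas patch by patch, guided by the holistic semantics. To achieve super-resolution inverse tone mapping, we derive a continuous representation of 360-degree imaging from the LDR panorama as a set of structured latent codes anchored to the sphere. This continuous representation enables a versatile module to upscale the resolution and dynamic range simultaneously. Extensive experiments demonstrate the superior capability of Text2Light in generating high-quality HDR panoramas. In addition, we show the feasibility of our work in realistic rendering and immersive VR.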
Language is one of the primary means by which we describe the 3D world around us. While rapid progress has been made in text-to-2D-image synthesis, similar progress in text-to-3D-shape synthesis has been hindered by the lack of paired (text, shape) data. Moreover, extant methods for text-to-shape generation have limited shape diversity and fidelity. We introduce TextCraft, a method to address these limitations by producing high-fidelity and diverse 3D shapes without the need for (text, shape) pairs for training. TextCraft achieves this by using CLIP and using a multi-resolution approach by first generating in a low-dimensional latent space and then upscaling to a higher resolution, improving the fidelity of the generated shape. To improve shape diversity, we use a discrete latent space which is modelled using a bidirectional transformer conditioned on the interchangeable image-text embedding space induced by CLIP. Moreover, we present a novel variant of classifier-free guidance, which further improves the accuracy-diversity trade-off. Finally, we perform extensive experiments that demonstrate that TextCraft outperforms state-of-the-art baselines.
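For context, standard classifier-free guidance (the baseline technique, not the novel variant the abstract refers to) extrapolates from the unconditional prediction towards the conditional one. The sketch below applies it to plain lists of logits; the function name and the `scale` parameter are illustrative.

```python
def classifier_free_guidance(unconditional, conditional, scale):
    """Standard guidance: uncond + scale * (cond - uncond), per element.

    scale = 0 ignores the condition, scale = 1 reproduces the
    conditional prediction, and scale > 1 emphasizes the condition.
    """
    return [u + scale * (c - u) for u, c in zip(unconditional, conditional)]


if __name__ == "__main__":
    print(classifier_free_guidance([0.0, 1.0], [1.0, 1.0], scale=2.0))
```

Larger scales generally favor agreement with the text at the cost of sample diversity, which is the accuracy-diversity trade-off mentioned above.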