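The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only a description of the underlying objects and the relations among them. While there has been significant work on designing deep neural networks that compose individual objects, less work has been done on composing the individual relations between objects. A principal difficulty is that while the placement of objects is mutually independent, their relations are entangled and depend on each other. To circumvent this issue, existing works primarily compose relations with a holistic encoder, in the form of text or graphs. In this work, we instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes with multiple sets of relations more faithfully. We further show that the decomposition enables our model to effectively understand the underlying relational scene structure. Project page: https://composevisualrelations.github.io/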
Translated by Google Translate
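The factorized composition described in the abstract above can be illustrated with a toy sketch. It assumes that a product of unnormalized densities exp(-E_k) corresponds to a sum of the energies E_k, which is the standard energy-based-model identity. Everything else here is hypothetical and chosen for illustration only: the 1-D object positions, the `left_of` energy, and the use of plain gradient descent in place of a stochastic sampler such as Langevin dynamics. It is not the authors' implementation.

```python
# Toy sketch: each relation is an energy over 1-D object positions.
# Composing relations = summing energies (multiplying unnormalized densities).

def left_of(i, j, margin=1.0):
    """Energy that is zero when x[i] + margin <= x[j], positive otherwise."""
    def energy(x):
        gap = x[i] + margin - x[j]
        return max(0.0, gap) ** 2
    return energy

def compose(*energies):
    """The product of densities exp(-E_k) has energy sum_k E_k."""
    return lambda x: sum(e(x) for e in energies)

def minimize(energy, x, steps=500, lr=0.1, eps=1e-4):
    """Plain gradient descent using central finite differences (no noise term)."""
    x = list(x)
    for _ in range(steps):
        grad = []
        for k in range(len(x)):
            hi = x[:k] + [x[k] + eps] + x[k + 1:]
            lo = x[:k] + [x[k] - eps] + x[k + 1:]
            grad.append((energy(hi) - energy(lo)) / (2 * eps))
        x = [xk - lr * gk for xk, gk in zip(x, grad)]
    return x

# "object 0 left of object 1" AND "object 1 left of object 2"
scene = compose(left_of(0, 1), left_of(1, 2))
positions = minimize(scene, [0.0, 0.0, 0.0])
```

Because each relation is a separate energy term, adding or removing a relation only changes the arguments to `compose`; no joint encoder over the whole relation set has to be retrained in this toy setting.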
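Large text-guided diffusion models, such as DALLE-2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, for example by confusing the attributes of different objects or the relations between objects. In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, each of which models a certain component of the image. To do this, we interpret diffusion models as energy-based models, in which the data distributions defined by the energy functions may be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, and human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that has been shown to be difficult for DALLE-2. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation. Project page: https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/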
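As a rough sketch of what composing a set of diffusion models can look like at a single denoising step, the function below combines per-concept noise predictions around an unconditional prediction, in the style of classifier-free guidance extended to several conditions. The abstract does not spell out the formula, so the exact form, the function name `compose_noise`, and the weights are assumptions for illustration. The predictions are plain lists standing in for model outputs.

```python
def compose_noise(eps_uncond, eps_conds, weights):
    """Combine per-concept noise predictions around an unconditional one.

    eps_uncond: list of floats, unconditional prediction at one timestep
    eps_conds:  list of lists, one conditional prediction per concept
    weights:    one guidance weight per concept (hypothetical values)

    Returns eps_uncond + sum_i weights[i] * (eps_conds[i] - eps_uncond).
    """
    if len(eps_conds) != len(weights):
        raise ValueError("need one weight per concept")
    combined = list(eps_uncond)
    for eps_c, w in zip(eps_conds, weights):
        for k, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            combined[k] += w * (c - u)
    return combined

# Two concepts, two "pixels": stand-in numbers, not real model outputs.
mixed = compose_noise([0.0, 1.0], [[1.0, 1.0], [0.0, 3.0]], [1.0, 0.5])
```

With a single concept and weight 1.0 the result reduces to that concept's conditional prediction, which is a quick sanity check on the formula.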
Large-scale models combining text and images have made incredible progress in recent years. However, they can still fail at tasks requiring compositional knowledge, such as correctly picking out a red cube from a picture of multiple shapes. We examine the ability of CLIP (Radford et al., 2021) to caption images requiring compositional knowledge. We implement five compositional language models to probe the kinds of structure that CLIP may be using, and develop a novel training algorithm, Compositional Skipgram for Images (CoSI), to train these models. We look at performance in attribute-based tasks, requiring the identification of a particular combination of attribute and object (such as "red cube"), and in relational settings, where the spatial relation between two shapes (such as "cube behind sphere") must be identified. We find that in some conditions, CLIP is able to learn attribute-object labellings, and to generalize to unseen attribute-object combinations. However, we also see evidence that CLIP is not able to bind features together reliably. Moreover, CLIP is not able to reliably learn relations between objects, whereas some compositional models are able to learn these perfectly. Of the five models we developed, none were able to generalize to unseen relations.
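Sequential manipulation tasks require a robot to perceive the state of an environment and plan a sequence of actions leading to a desired goal state, where the ability to reason about spatial relationships among object entities from raw sensor inputs is crucial. Prior works that rely on explicit state estimation or end-to-end learning struggle with novel objects or new tasks. In this work, we propose SORNet (Spatial Object-Centric Representation Network), which extracts object-centric representations from RGB images conditioned on canonical views of the objects of interest. We show that the object embeddings learned by SORNet generalize zero-shot to unseen object entities on three spatial reasoning tasks: spatial relationship classification, skill precondition classification, and relative direction regression, significantly outperforming baselines. Further, we present real-world robotic experiments demonstrating the use of the learned object embeddings in task planning for sequential manipulation.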
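Humans have the remarkable ability to recognize and acquire novel visual concepts in a zero-shot manner. Given a high-level, symbolic description of a novel concept in terms of previously learned visual concepts and their relations, humans can recognize the novel concept without seeing any examples. Moreover, they can acquire new concepts by parsing and communicating symbolic structures using learned visual concepts and relations. Endowing machines with these capabilities is pivotal for improving their generalization at inference time. In this work, we introduce Zero-shot Concept Recognition and Acquisition (ZeroC), a neuro-symbolic architecture that can recognize and acquire novel concepts in a zero-shot way. ZeroC represents concepts as graphs of constituent concept models (as nodes) and their relations (as edges). To allow inference-time composition, we employ energy-based models (EBMs) to model concepts and relations. We design the ZeroC architecture so that it allows a one-to-one mapping between the symbolic graph structure of a concept and its corresponding EBM, which for the first time allows acquiring new concepts, communicating their graph structure, and applying them to classification and detection tasks (even across domains) at inference time. We introduce algorithms for learning and inference with ZeroC. We evaluate ZeroC on a challenging grid-world dataset designed to probe zero-shot concept recognition and acquisition, and demonstrate its capability.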
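In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision communities. The input to conditional image generation has evolved from image-only to multimodal. In this paper, we study a setting that allows users to edit an image containing multiple objects using complex text instructions to add, remove, or change objects. The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes the desired modifications to the image. We propose a GAN-based method to tackle this problem. The key idea is to treat text as neural operators that locally modify the image features. We show that the proposed model performs favorably against recent strong baselines on three public datasets. Specifically, it generates images of higher fidelity and semantic relevance and, when used as an image query, leads to better retrieval performance.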
Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256×256 photo-realistic images conditioned on text descriptions. We decompose the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details with the refinement process. To improve the diversity of the synthesized images and stabilize the training of the conditional GAN, we introduce a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold. Extensive experiments and comparisons with state-of-the-art methods on benchmark datasets demonstrate that the proposed method achieves significant improvements in generating photo-realistic images conditioned on text descriptions.
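A minimal sketch of how a conditioning-augmentation step of this kind could work, assuming the conditioning vector is sampled from a Gaussian whose mean and log-variance are computed from the text embedding, with a KL penalty toward a standard normal encouraging smoothness. The abstract does not give these details, so the function names and the diagonal-Gaussian form are assumptions, and `mu` and `log_var` are plain lists standing in for network outputs.

```python
import math
import random

def conditioning_augmentation(mu, log_var, rng=random):
    """Sample c = mu + sigma * eps element-wise, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

# Stand-in values; in practice these would come from a text encoder.
rng = random.Random(0)
c_hat = conditioning_augmentation([0.5, -0.5], [0.0, 0.0], rng)
penalty = kl_to_standard_normal([0.5, -0.5], [0.0, 0.0])
```

Sampling a fresh `c_hat` for the same caption on every training step yields many slightly different conditioning vectors per text, which is one plausible reading of how the technique adds diversity.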
We tackle the problem of target-free text-guided image manipulation, which requires one to modify the input reference image based on the given text instruction, while no ground truth target image is observed during training. To address this challenging task, we propose a Cyclic-Manipulation GAN (cManiGAN) in this paper, which is able to realize where and how to edit the image regions of interest. Specifically, the image editor in cManiGAN learns to identify and complete the input image, while cross-modal interpreter and reasoner are deployed to verify the semantic correctness of the output image based on the input instruction. While the former utilizes factual/counterfactual description learning for authenticating the image semantics, the latter predicts the "undo" instruction and provides pixel-level supervision for the training of cManiGAN. With such operational cycle-consistency, our cManiGAN can be trained in the above weakly supervised setting. We conduct extensive experiments on the datasets of CLEVR and COCO, and the effectiveness and generalizability of our proposed method can be successfully verified. Project page: https://sites.google.com/view/wancyuanfan/projects/cmanigan.
'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform 'Reasoning about Actions & Change' (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. CLEVR_HYP (Sampat et al., 2021) is one such testbed for hypothetical vision-language reasoning with actions as the key focus. In this work, we propose a novel learning strategy that can improve reasoning about the effects of actions. We implement an encoder-decoder architecture to learn the representation of actions as vectors. We combine this encoder-decoder architecture with existing modality parsers and a scene graph question answering model to evaluate our proposed system on the CLEVR_HYP dataset. We conduct thorough experiments to demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
In this work, we focus on text-guided image generation and propose a novel framework, CLIP2GAN, which leverages the CLIP model and StyleGAN. The key idea of CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN, which is realized by introducing a mapping network. In the training stage, we encode an image with CLIP and map the output feature to a latent code, which is then used to reconstruct the image. In this way, the mapping network is optimized in a self-supervised manner. In the inference stage, since CLIP can embed both image and text into a shared feature embedding space, we replace the CLIP image encoder in the training architecture with the CLIP text encoder, while keeping the subsequent mapping network and the StyleGAN model. As a result, we can flexibly input a text description to generate an image. Moreover, by simply adding the mapped text features of an attribute to a mapped CLIP image feature, we can effectively edit that attribute in the image. Extensive experiments demonstrate the superior performance of our proposed CLIP2GAN compared to previous methods.
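We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and a fine-tuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.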
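We address the problem of retrieving images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that the two input modalities complement each other in a manner that cannot be achieved easily by either one alone. TASK-former follows the late-fusion dual-encoder approach, similar to CLIP, which allows efficient and scalable retrieval since the retrieval set can be indexed independently of the queries. We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval. To evaluate our approach, we collect 5,000 hand-drawn sketches for images in the test set of the COCO dataset. The collected sketches are available at https://janesjanes.github.io/tsbir/.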
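The point about indexing the retrieval set independently of the queries can be sketched as follows. The embeddings are hand-written toy vectors rather than encoder outputs, and fusing text and sketch by adding their embeddings is a hypothetical choice made for illustration; the abstract does not specify the fusion rule.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def build_index(image_embeddings):
    """Precompute unit-norm image vectors once, independently of any query."""
    return [normalize(v) for v in image_embeddings]

def fuse_query(text_emb, sketch_emb):
    """Hypothetical late fusion: add the two query embeddings, then normalize."""
    return normalize([t + s for t, s in zip(text_emb, sketch_emb)])

def retrieve(index, query, k=1):
    """Return indices of the k images with the highest cosine similarity."""
    scores = [sum(a * b for a, b in zip(img, query)) for img in index]
    return sorted(range(len(index)), key=lambda i: scores[i], reverse=True)[:k]

# Toy 2-D embeddings: image 2 matches both the text and the sketch direction.
index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = fuse_query([1.0, 0.0], [0.0, 1.0])
best = retrieve(index, query, k=1)
```

Because `build_index` never sees a query, the image side can be embedded offline and reused for every search, which is the scalability argument made for dual encoders.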
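The fundamental goal of artificial intelligence (AI) is to mimic the core cognitive activities of humans. Despite tremendous success in AI research, most existing methods have only a single cognitive ability. To overcome this limitation and take a solid step towards artificial general intelligence (AGI), we develop a foundation model pre-trained on huge multimodal data, which can be quickly adapted to various downstream cognitive tasks. To achieve this goal, we propose to pre-train our foundation model by self-supervised learning on semantically correlated data crawled from the Internet, and show that promising results can be obtained on a wide range of downstream tasks. In particular, using the model-interpretability tools we developed, we demonstrate that our foundation model now possesses strong imagination ability. We believe that our work makes a transformative stride towards AGI, from the common practice of "weak or narrow AI" to that of "strong or generalized AI".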
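Text-guided 3D shape generation remains challenging due to the absence of large paired text-shape data, the substantial semantic gap between the two modalities, and the structural complexity of 3D shapes. This paper presents a new framework called Image as Stepping Stone (ISS) for the task, introducing 2D images as a stepping stone to connect the two modalities and eliminate the need for paired text-shape data. Our key contribution is a two-stage feature-space alignment approach that maps CLIP features to shapes by harnessing a pre-trained single-view reconstruction (SVR) model with multi-view supervision: first map the CLIP image feature to the detail-rich shape space of the SVR model, then map the CLIP text feature to the shape space and optimize the mapping by encouraging CLIP consistency between the input text and the rendered images. Further, we formulate a text-guided shape stylization module to dress up the output shapes with novel textures. Beyond existing works on 3D shape generation from text, our new approach is general, creating shapes in a broad range of categories without requiring paired text-shape data. Experimental results show that our approach outperforms the state of the art and our baselines in terms of fidelity and consistency with the text. Further, our approach can stylize the generated shapes with both realistic and fantasy structures and textures.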
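A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene. As computer vision technology continues to develop, people are no longer satisfied with simply detecting and recognizing objects in images; instead, they look forward to a higher level of understanding of, and reasoning about, visual scenes. For example, given an image, we want not only to detect and recognize the objects in it, but also to know the relationships between objects (visual relationship detection) and to generate a text description based on the image content (image captioning). Alternatively, we might want the machine to tell us what the little girl in the image is doing (Visual Question Answering (VQA)), or even to remove the dog from the image and find similar images (image editing and retrieval). These tasks require a higher level of understanding and reasoning in image vision tasks. The scene graph is just such a powerful tool for scene understanding. Scene graphs have therefore attracted the attention of a large number of researchers, and related research is often cross-modal, complex, and rapidly developing. However, no relatively systematic survey of scene graphs currently exists. To this end, this survey conducts a comprehensive investigation of current scene graph research. More specifically, we first summarize the general definition of the scene graph, then conduct a comprehensive and systematic discussion of scene graph generation (SGG) methods and of SGG with the aid of prior knowledge. We then investigate the main applications of scene graphs and summarize the most commonly used datasets. Finally, we provide some insights into the future development of scene graphs. We believe this will be a very helpful foundation for future research on scene graphs.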
Text-guided 3D object generation aims to generate 3D objects described by user-defined captions, which paves a flexible way to visualize what we imagine. Although some works have been devoted to solving this challenging task, these works either utilize explicit 3D representations (e.g., mesh), which lack texture and require post-processing for rendering photo-realistic views, or require individual time-consuming optimization for every single case. Here, we make the first attempt to achieve generic text-guided cross-category 3D object generation via a new 3D-TOGO model, which integrates a text-to-views generation module and a views-to-3D generation module. The text-to-views generation module is designed to generate different views of the target 3D object given an input caption. Prior-guidance, caption-guidance and view contrastive learning are proposed for achieving better view-consistency and caption similarity. Meanwhile, a pixelNeRF model is adopted for the views-to-3D generation module to obtain the implicit 3D neural representation from the previously generated views. Our 3D-TOGO model generates 3D objects in the form of a neural radiance field with good texture and requires no time-consuming optimization for every single caption. Besides, 3D-TOGO can control the category, color and shape of generated 3D objects with the input caption. Extensive experiments on the largest 3D object dataset (i.e., ABO) are conducted to verify that 3D-TOGO can better generate high-quality 3D objects according to the input captions across 98 different categories, in terms of PSNR, SSIM, LPIPS and CLIP-score, compared with text-NeRF and Dreamfields.
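As a challenging task, text-to-image generation aims to generate photo-realistic and semantically consistent images from given text descriptions. Existing methods mainly extract the text information from a single sentence to represent an image, and the text representation strongly affects the quality of the generated image. However, directly using the limited information in one sentence misses some key attribute descriptions, which are crucial for describing an image accurately. To alleviate this problem, we propose an effective text representation method that is complemented with attribute information. First, we construct an attribute memory to jointly control text-to-image generation together with the sentence input. Second, we explore two update mechanisms, the sample-aware and sample-joint mechanisms, to dynamically optimize a generalized attribute memory. Furthermore, we design an attribute-sentence-joint conditional generator learning scheme to align the feature embeddings among multiple representations, which promotes cross-modal network training. Experimental results show that the proposed method obtains substantial performance improvements on both the CUB (FID from 14.81 to 8.57) and COCO (FID from 21.42 to 12.39) datasets.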
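We propose Fast text2StyleGAN, a natural language interface that adapts pre-trained GANs for text-guided human face synthesis. Leveraging recent advances in Contrastive Language-Image Pre-training (CLIP), no text data is required during training. Fast text2StyleGAN is formulated as a conditional variational autoencoder (CVAE) that provides extra control and diversity for the generated images at test time. Our model does not require re-training or fine-tuning of the GANs or CLIP when encountering new text prompts. In contrast to prior work, we do not rely on optimization at test time, making our method orders of magnitude faster than prior work. Empirically, on the FFHQ dataset, our method offers faster and more accurate generation of images from natural language descriptions with varying levels of detail compared to prior work.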
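Text-guided image manipulation tasks have recently gained attention in the vision-and-language community. While most prior studies focused on single-turn manipulation, our goal in this paper is to address the more challenging multi-turn image manipulation (MTIM) task. Previous models for this task successfully generate images iteratively, given a sequence of instructions and a previously generated image. However, this approach suffers from under-generation and from poor quality of the generated objects described in the instructions, which degrades the overall performance. To overcome these problems, we present a novel architecture called the Visually Guided Language Attention GAN (LatteGAN). Here, we address the limitations of previous approaches by introducing a Visually Guided Language Attention (Latte) module, which extracts fine-grained text representations for the generator, and a Text-Conditioned U-Net discriminator architecture, which discriminates both the global and local representations of fake or real images. Extensive experiments on two distinct MTIM datasets, CoDraw and i-CLEVR, demonstrate the state-of-the-art performance of the proposed model.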
Automatic synthesis of realistic images from text would be interesting and useful, but current AI systems are still far from this goal. However, in recent years generic and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations. Meanwhile, deep convolutional generative adversarial networks (GANs) have begun to generate highly compelling images of specific categories, such as faces, album covers, and room interiors. In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image modeling, translating visual concepts from characters to pixels. We demonstrate the capability of our model to generate plausible images of birds and flowers from detailed text descriptions.