文本指导的图像生成模型,例如DALL-E 2和稳定的扩散,最近受到了学术界和公众的关注。这些模型提供了文本描述,能够生成描绘各种概念和样式的高质量图像。但是,此类模型接受了大量公共数据的培训,并从其培训数据中隐含地学习关系,这些数据并不明显。我们证明,可以通过简单地用视觉上类似的非拉丁字符替换文本描述中的单个字符来触发并注入生成的图像中的常见多模型模型,这些偏见可以被触发并注入生成的图像。这些所谓的同符文更换使恶意用户或服务提供商能够诱导偏见到生成的图像中,甚至使整个一代流程变得无用。我们实际上说明了对DALL-E 2和稳定扩散的这种攻击,例如文本引导的图像生成模型,并进一步表明夹子的行为也相似。我们的结果进一步表明,经过多语言数据训练的文本编码器提供了一种减轻同符替代效果的方法。
translated by 谷歌翻译
While text-to-image synthesis currently enjoys great popularity among researchers and the general public, the security of these models has been neglected so far. Many text-guided image generation models rely on pre-trained text encoders from external sources, and their users trust that the retrieved models will behave as promised. Unfortunately, this might not be the case. We introduce backdoor attacks against text-guided generative models and demonstrate that their text encoders pose a major tampering risk. Our attacks only slightly alter an encoder so that no suspicious model behavior is apparent for image generations with clean prompts. By then inserting a single non-Latin character into the prompt, the adversary can trigger the model to either generate images with pre-defined attributes or images following a hidden, potentially malicious description. We empirically demonstrate the high effectiveness of our attacks on Stable Diffusion and highlight that the injection process of a single backdoor takes less than two minutes. Besides phrasing our approach solely as an attack, it can also force an encoder to forget phrases related to certain concepts, such as nudity or violence, and help to make image generation safer.
translated by 谷歌翻译
由于现在在许多现实世界应用中使用了深度学习,因此研究越来越集中于深度学习模型的隐私以及如何防止攻击者获得有关培训数据的敏感信息。但是,在隐私攻击的背景下,尚未对诸如剪辑之类的图像文本模型进行研究。虽然会员推理攻击旨在判断是否使用特定数据点进行培训,但我们引入了一种新型的隐私攻击,该隐私攻击名为“身份推理攻击”(IDIA),该攻击(IDIA)是为CLIP等多模式图像文本模型而设计的。使用IDIA,攻击者可以通过以黑盒方式查询模型,并以同一个人的不同图像来揭示特定人是否是培训数据的一部分。让模型从各种可能的文本标签中进行选择,攻击者可以探究该模型是否识别该人,因此可以用于培训。通过剪辑上的几个实验,我们表明攻击者可以以非常高的精度识别用于培训的个人,并且该模型学会了将名称与被描绘的人联系起来。我们的实验表明,多模式图像文本模型确实泄漏了有关其训练数据的敏感信息,因此应谨慎处理。
translated by 谷歌翻译
文本对图像模型提供了前所未有的自由,可以通过自然语言指导创作。然而,尚不清楚如何行使这种自由以生成特定独特概念,修改其外观或以新角色和新颖场景构成它们的图像。换句话说,我们问:我们如何使用语言指导的模型将猫变成绘画,或者想象基于我们喜欢的玩具的新产品?在这里,我们提出了一种简单的方法,可以允许这种创造性自由。我们仅使用3-5个用户提供的概念(例如对象或样式)的图像,我们学会通过在冷冻文本到图像模型的嵌入空间中通过新的“单词”表示它。这些“单词”可以组成自然语言句子,以直观的方式指导个性化的创作。值得注意的是,我们发现有证据表明单词嵌入足以捕获独特而多样的概念。我们将我们的方法比较了各种基线,并证明它可以更忠实地描绘出一系列应用程序和任务的概念。我们的代码,数据和新单词将在以下网址提供:https://textual-inversion.github.io
translated by 谷歌翻译
Text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. In turn, they may even reinforce such biases. To help combat these undesired side effects, we present safe latent diffusion (SLD). Specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed-inappropriate image prompts (I2P)-containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence. As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.
translated by 谷歌翻译
Nine language-vision AI models trained on web scrapes with the Contrastive Language-Image Pretraining (CLIP) objective are evaluated for evidence of a bias studied by psychologists: the sexual objectification of girls and women, which occurs when a person's human characteristics are disregarded and the person is treated as a body or a collection of body parts. A first experiment uses standardized images of women from the Sexual OBjectification and EMotion Database, and finds that, commensurate with prior research in psychology, human characteristics are disassociated from images of objectified women: the model's recognition of emotional state is mediated by whether the subject is fully or partially clothed. Embedding association tests (EATs) return significant effect sizes for both anger (d >.8) and sadness (d >.5). A second experiment measures the effect in a representative application: an automatic image captioner (Antarctic Captions) includes words denoting emotion less than 50% as often for images of partially clothed women than for images of fully clothed women. A third experiment finds that images of female professionals (scientists, doctors, executives) are likely to be associated with sexual descriptions relative to images of male professionals. A fourth experiment shows that a prompt of "a [age] year old girl" generates sexualized images (as determined by an NSFW classifier) up to 73% of the time for VQGAN-CLIP (age 17), and up to 40% of the time for Stable Diffusion (ages 14 and 18); the corresponding rate for boys never surpasses 9%. The evidence indicates that language-vision AI models trained on automatically collected web scrapes learn biases of sexual objectification, which propagate to downstream applications.
translated by 谷歌翻译
我们介绍了自回归文本到图像(Parti)模型的途径,该模型生成高保真的影像图像并支持涉及复杂组成和世界知识的内容丰富的合成。 Parti将文本对图像生成视为类似于机器翻译的序列到序列建模问题,图像令牌的序列是目标输出,而不是其他语言的文本令牌。这种策略自然可以利用大型语言模型的先前工作,通过扩展数据和模型尺寸,能力和性能的持续进展。我们的方法很简单:首先,Parti使用基于变压器的图像令牌VIT-VQGAN将图像编码为离散令牌的序列。其次,我们通过将编码器二次变压器模型缩放到20B参数来实现一致的质量改进,其新的最新零弹药FID得分为7.23,而MS-Coco的FIDED得分为3.22。我们对本地化叙述以及党的详细分析(P2),这是1600多个英语提示的新的整体基准,证明了Parti在各种类别和难度方面的有效性。我们还探索并突出了我们的模型的局限性,以定义和体现关注重点领域以进一步改进。有关高分辨率图像,请参见https://parti.research.google/。
translated by 谷歌翻译
利用深度学习的最新进展,文本到图像生成模型目前具有吸引公众关注的优点。其中两个模型Dall-E 2和Imagen已经证明,可以从图像的简单文本描述中生成高度逼真的图像。基于一种称为扩散模型的新型图像生成方法,文本对图像模型可以生产许多不同类型的高分辨率图像,其中人类想象力是唯一的极限。但是,这些模型需要大量的计算资源来训练,并处理从互联网收集的大量数据集。此外,代码库和模型均未发布。因此,它可以防止AI社区尝试这些尖端模型,从而使其结果复制变得复杂,即使不是不可能。在本文中,我们的目标是首先回顾这些模型使用的不同方法和技术,然后提出我们自己的文本模型模型实施。高度基于DALL-E 2,我们引入了一些轻微的修改,以应对所引起的高计算成本。因此,我们有机会进行实验,以了解这些模型的能力,尤其是在低资源制度中。特别是,我们提供了比Dall-e 2的作者(包括消融研究)更深入的分析。此外,扩散模型使用所谓的指导方法来帮助生成过程。我们引入了一种新的指导方法,该方法可以与其他指导方法一起使用,以提高图像质量。最后,我们的模型产生的图像质量相当好,而不必维持最先进的文本对图像模型的重大培训成本。
translated by 谷歌翻译
当前机器学习的大部分基础的大型数据集提出了有关不适当内容的严重问题,例如冒犯,侮辱,威胁或可能引起焦虑。这要求增加数据集文档,例如使用数据表。它们除其他主题外,还鼓励反思数据集的组成。到目前为止,该文档是手动完成的,因此可能是乏味且容易出错的,尤其是对于大型图像数据集。在这里,我们询问了机器是否可以帮助我们反思不适当的内容的“循环”问题,回答了数据表中的问题16。为此,我们建议使用存储在预训练的变压器模型中的信息来协助我们进行文档过程。具体而言,基于社会 - 道德价值数据集的及时调整引导剪辑以识别潜在的不适当的内容,从而减少了人工的劳动。然后,我们根据使用视觉模型生成的字幕来记录使用单词云找到的不适当图像。两个流行的大规模计算机视觉数据集的文档(ImageNet和OpenImages)以这种方式产生,这表明机器确实可以帮助数据集创建者回答有关不适当图像内容的问题16。
translated by 谷歌翻译
可以提示文本指导的图像生成模型,使用具有对手设计的非CLE单词生成图像,以牢固地唤起特定的视觉概念。引入了这一一代的两种方法:杏仁核提示,其中涉及通过从不同语言串联子字单元来设计隐秘的混合单词;和令人回味的提示,其中涉及设计其广泛形态特征与现有单词相似的非CE单词,以触发强大的视觉关联。这两种方法也可以合并以生成与更具体的视觉概念相关联的图像。讨论了这些技术对汇总现有内容适度方法的含义,尤其是对冒犯性或有害图像的产生。
translated by 谷歌翻译
Current image generation models struggle to reliably produce well-formed visual text. In this paper, we investigate a key contributing factor: popular text-to-image models lack character-level input features, making it much harder to predict a word's visual makeup as a series of glyphs. To quantify the extent of this effect, we conduct a series of controlled experiments comparing character-aware vs. character-blind text encoders. In the text-only domain, we find that character-aware models provide large gains on a novel spelling task (WikiSpell). Transferring these learnings onto the visual domain, we train a suite of image generation models, and show that character-aware variants outperform their character-blind counterparts across a range of novel text rendering tasks (our DrawText benchmark). Our models set a much higher state-of-the-art on visual spelling, with 30+ point accuracy gains over competitors on rare words, despite training on far fewer examples.
translated by 谷歌翻译
从开放式域文本提示中生成和编辑图像是迄今为止需要昂贵且经过特殊训练的型号的一项挑战性的任务。我们为这两个任务展示了一种新颖的方法,该方法能够通过使用多模式编码器来指导图像世代,从而从具有显着语义复杂性的文本提示中产生高视觉质量的图像。我们在各种任务上说明了如何使用夹[37]引导VQGAN [11]产生的视觉质量输出比先前的较不灵活的方法,例如DALL-E [38],Glide [33]和Open-Edit [24],尽管没有接受培训的任务。我们的代码在公共存储库中可用。
translated by 谷歌翻译
几年的研究表明,在理论和实践中,机器学习系统容易受到对抗的例子。到目前为止,这种攻击主要有针对性的视觉模型,利用人类和机器感知之间的差距。虽然基于文本的模型也被对抗例子遭到攻击,但这种攻击努力保持语义意义和无法区分。在本文中,我们探讨了大类的对抗示例,可用于在黑盒设置中攻击基于文本的模型,而不会对输入进行任何人类可知的视觉修改。我们使用对人眼不可察觉的编码特异性扰动来操纵从神经计算机翻译管道到网络搜索引擎的各种自然语言处理(NLP)系统的输出。我们发现,通过单一的难以察觉的编码注射 - 代表一个无形的字符,同型角色,重新排序或删除 - 攻击者可以显着降低易受伤害的模型的性能,并且三次注射大多数型号可以在功能上打破。除了由Facebook,IBM和HuggingFace发布的开源模型之外,我们攻击目前部署的商业系统这一新颖的一系列攻击对许多语言处理系统提供了重大威胁:攻击者可以以目标方式影响系统而没有任何关于底层模型的假设。我们得出结论,基于文本的NLP系统需要仔细的输入消毒,就像传统应用程序一样,并且考虑到这样的系统现在正在迅速地部署,需要建筑师和运营商的紧急注意。
translated by 谷歌翻译
数字艺术合成在多媒体社区中受到越来越多的关注,因为有效地与公众参与了艺术。当前的数字艺术合成方法通常使用单模式输入作为指导,从而限制了模型的表现力和生成结果的多样性。为了解决这个问题,我们提出了多模式引导的艺术品扩散(MGAD)模型,该模型是一种基于扩散的数字艺术品生成方法,它利用多模式提示作为控制无分类器扩散模型的指导。此外,对比度语言图像预处理(剪辑)模型用于统一文本和图像模式。关于生成的数字艺术绘画质量和数量的广泛实验结果证实了扩散模型和多模式指导的组合有效性。代码可从https://github.com/haha-lisa/mgad-multimodal-guided-artwork-diffusion获得。
translated by 谷歌翻译
评估了三种最先进的语言和图像AI模型,即剪辑,滑移和BLIP,以证明以前在社会和实验心理学中观察到的偏见:将美国身份等同于白人。使用芝加哥面部数据库(CFD)的自我识别的亚洲,黑人,拉丁裔和白人的标准化图像的嵌入关联测试(eats)表明,白人与集体内词相比,比亚洲,黑色更相关,或拉丁裔/o个人。在评估社会心理学家报道的美国身份的三个核心方面时,单类饮食表明,白人个体的图像与爱国主义和出生在美国更相关,但与心理学的先前发现一致,白人个人是相关的不太可能平等对待所有种族和背景的人。三个下游机器学习任务表明了与白人相关联的偏见。在使用BLIP的视觉问题回答任务中,有97%的白人被确定为美国人,而仅3%的亚洲人。当被问及个人所描绘的生活状态时,该模型在亚洲人中有53%的时间回应中国,但始终具有美国对白人个人的国家。在图像字幕的任务中,Blip评论了亚洲人的种族多达36%的时间,但从未对白人人士进行比赛。最后,使用基于文本的剪辑指导的综合图像发生器(VQGAN)提供了CFD和文本“ American Person”的初始化图像,从而减轻了所有种族个体的肤色(黑人个人的肤色为35% ,基于像素亮度)。结果表明,语言和图像AI将其等同于美国身份与白人的偏见,并传播到此类模型的下游应用。
translated by 谷歌翻译
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiff-I/
translated by 谷歌翻译
We introduce M-VADER: a diffusion model (DM) for image generation where the output can be specified using arbitrary combinations of images and text. We show how M-VADER enables the generation of images specified using combinations of image and text, and combinations of multiple images. Previously, a number of successful DM image generation algorithms have been introduced that make it possible to specify the output image using a text prompt. Inspired by the success of those models, and led by the notion that language was already developed to describe the elements of visual contexts that humans find most important, we introduce an embedding model closely related to a vision-language model. Specifically, we introduce the embedding model S-MAGMA: a 13 billion parameter multimodal decoder combining components from an autoregressive vision-language model MAGMA and biases finetuned for semantic search.
translated by 谷歌翻译
The text-to-image model Stable Diffusion has recently become very popular. Only weeks after its open source release, millions are experimenting with image generation. This is due to its ease of use, since all it takes is a brief description of the desired image to "prompt" the generative model. Rarely do the images generated for a new prompt immediately meet the user's expectations. Usually, an iterative refinement of the prompt ("prompt engineering") is necessary for satisfying images. As a new perspective, we recast image prompt engineering as interactive image retrieval - on an "infinite index". Thereby, a prompt corresponds to a query and prompt engineering to query refinement. Selected image-prompt pairs allow direct relevance feedback, as the model can modify an image for the refined prompt. This is a form of one-sided interactive retrieval, where the initiative is on the user side, whereas the server side remains stateless. In light of an extensive literature review, we develop these parallels in detail and apply the findings to a case study of a creative search task on such a model. We note that the uncertainty in searching an infinite index is virtually never-ending. We also discuss future research opportunities related to retrieval models specialized for generative models and interactive generative image retrieval. The application of IR technology, such as query reformulation and relevance feedback, will contribute to improved workflows when using generative models, while the notion of an infinite index raises new challenges in IR research.
translated by 谷歌翻译
我们提出了快速的文本2stylegan,这是一种自然语言界面,可适应预先训练的甘体,以实现文本引导的人脸合成。利用对比性语言图像预训练(剪辑)的最新进展,在培训过程中不需要文本数据。Fast Text2Stylegan被配制为条件变异自动编码器(CVAE),可在测试时为生成的图像提供额外的控制和多样性。我们的模型在遇到新的文本提示时不需要重新训练或微调剂或剪辑。与先前的工作相反,我们不依赖于测试时间的优化,这使我们的方法数量级比先前的工作快。从经验上讲,在FFHQ数据集上,我们的方法提供了与先前的工作相比,自然语言描述中具有不同详细程度的自然语言描述中的图像。
translated by 谷歌翻译
视觉语言预训练(VLP)模型在各种下游任务上表现出色。他们的成功在很大程度上取决于预训练的跨模式数据集的规模。但是,中文中缺乏大规模数据集和基准阻碍了中国VLP模型和更广泛的多语言应用程序的发展。在这项工作中,我们发布了一个名为Wukong的大型中国跨模式数据集,其中包含从网络收集的1亿个中文图像文本对。 Wukong旨在基准基准不同的多模式预训练方法,以促进VLP研究和社区发展。此外,我们发布了一组模型,预先训练了各种图像编码器(vit-b/vit-l/swint),还将高级预训练技术应用于VLP,例如锁定图像文本调整,相对于代币的相似性学习和减少互动。还提供了广泛的实验和不同下游任务的基准测试,包括新的最大人验证的图像文本测试数据集。实验表明,Wukong可以作为不同的跨模式学习方法的有前途的中国预培训数据集和基准。对于10个数据集上的零摄像图像分类任务,$ Wukong_ {vit-l} $达到的平均准确度为73.03%。对于图像文本检索任务,它在AIC-ICC上的平均召回率为71.6%,比Wenlan 2.0高12.9%。此外,我们的Wukong模型在下游任务上进行了基准测试,例如多个数据集上的其他变体,例如Flickr8k-CN,Flickr-30K-CN,Coco-CN,Coco-CN等。更多信息可以参考:https://wukong-dataset.github.io/wukong-dataset/。
translated by 谷歌翻译