Recent diffusion-based AI art platforms are able to create impressive images from simple text descriptions. This makes them powerful tools for concept design in any discipline that requires creativity in visual design tasks. This is also true for the early stages of architectural design, which involve multiple rounds of ideation, sketching and modelling. In this paper, we investigate how applicable diffusion-based models already are to these tasks. We assess the applicability of the platforms Midjourney, DALL-E 2 and Stable Diffusion to a series of common use cases in architectural design to determine which are already solvable or might soon be. We also examine how these platforms are already being used, analyzing a data set of 40 million Midjourney queries with NLP methods to extract common usage patterns. From these insights, we derive a workflow for interior and exterior design that combines the strengths of the individual platforms.
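As an illustration of the kind of prompt-log mining described above, the following Python sketch counts frequent terms and bigrams in a plain-text dump of prompts. The file name, tokenization and stopword list are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch: mining frequent terms and bigrams from a log of text-to-image
# prompts (one prompt per line). File name and preprocessing are assumptions.
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "in", "on", "with", "and", "by", "for", "to"}

def tokenize(prompt: str) -> list[str]:
    return [t for t in re.findall(r"[a-z]+", prompt.lower()) if t not in STOPWORDS]

def top_patterns(prompts: list[str], k: int = 20):
    unigrams, bigrams = Counter(), Counter()
    for p in prompts:
        tokens = tokenize(p)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams.most_common(k), bigrams.most_common(k)

if __name__ == "__main__":
    # hypothetical export of prompt queries, one per line
    with open("midjourney_prompts.txt", encoding="utf-8") as f:
        prompts = [line.strip() for line in f if line.strip()]
    words, pairs = top_patterns(prompts)
    print(words)
    print(pairs)
```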
The text-to-image model Stable Diffusion has recently become very popular. Only weeks after its open source release, millions are experimenting with image generation. This is due to its ease of use, since all it takes is a brief description of the desired image to "prompt" the generative model. Rarely do the images generated for a new prompt immediately meet the user's expectations. Usually, an iterative refinement of the prompt ("prompt engineering") is necessary to obtain satisfying images. As a new perspective, we recast image prompt engineering as interactive image retrieval - on an "infinite index". Thereby, a prompt corresponds to a query and prompt engineering to query refinement. Selected image-prompt pairs allow direct relevance feedback, as the model can modify an image for the refined prompt. This is a form of one-sided interactive retrieval, where the initiative is on the user side, whereas the server side remains stateless. In light of an extensive literature review, we develop these parallels in detail and apply the findings to a case study of a creative search task on such a model. We note that the uncertainty in searching an infinite index is virtually never-ending. We also discuss future research opportunities related to retrieval models specialized for generative models and interactive generative image retrieval. The application of IR technology, such as query reformulation and relevance feedback, will contribute to improved workflows when using generative models, while the notion of an infinite index raises new challenges in IR research.
Generating and editing images from open-domain text prompts is a challenging task that heretofore has required expensive and specially trained models. We demonstrate a novel methodology for both tasks which is capable of producing images of high visual quality from text prompts of significant semantic complexity, by using a multimodal encoder to guide image generation. We demonstrate on a variety of tasks how using CLIP [37] to guide VQGAN [11] produces higher visual quality outputs than prior, less flexible approaches such as DALL-E [38], GLIDE [33] and Open-Edit [24], despite not being trained for the tasks presented. Our code is available in a public repository.
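The guidance loop behind this approach can be pictured as an optimization of VQGAN latents against a CLIP similarity objective. The sketch below assumes generic `vqgan` and `clip_model` objects with `decode`, `encode_text` and `encode_image` methods; it is a conceptual outline, not the authors' released code.

```python
# Conceptual sketch of CLIP-guided VQGAN generation: optimize VQGAN latents so
# that the decoded image's CLIP embedding matches the prompt embedding.
# `vqgan`, `clip_model` and `preprocess` stand in for pretrained components.
import torch
import torch.nn.functional as F

def clip_guided_generation(vqgan, clip_model, preprocess, prompt_tokens,
                           steps: int = 300, lr: float = 0.1,
                           latent_shape=(1, 256, 16, 16)):
    device = next(clip_model.parameters()).device
    z = torch.randn(latent_shape, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    with torch.no_grad():
        text_emb = F.normalize(clip_model.encode_text(prompt_tokens), dim=-1)

    for _ in range(steps):
        optimizer.zero_grad()
        image = vqgan.decode(z)  # assumed: latents -> RGB image
        image_emb = F.normalize(clip_model.encode_image(preprocess(image)), dim=-1)
        loss = 1.0 - (image_emb * text_emb).sum(dim=-1).mean()  # cosine distance
        loss.backward()
        optimizer.step()
    return vqgan.decode(z).detach()
```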
Text-to-image generation has attracted broad attention since 2021. Today, beautiful and intriguing digital images and artworks can be synthesized from textual inputs ("prompts") by deep generative models. Online communities around text-to-image generation and AI-generated art have quickly emerged. Based on a three-month ethnographic study, this paper identifies six types of prompt modifiers used by practitioners in such an online community. The novel taxonomy of prompt modifiers provides researchers a conceptual starting point for investigating the practice of text-to-image generation, but may also help practitioners of AI-generated art improve their images. We further outline how prompt modifiers are applied in the practice of "prompt engineering." We discuss research opportunities of this novel creative practice in the field of Human-Computer Interaction (HCI). The paper concludes with a discussion of broader implications of prompt engineering from the perspective of Human-AI Interaction (HAI) in future applications, beyond the use cases of text-to-image generation and AI-generated art.
We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, analogous to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into prior work on large language models, which have seen continued advances in capability and performance through scaling of data and model sizes. Our approach is simple: first, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and a fine-tuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
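Conceptually, the two-stage pipeline can be summarized as tokenize-then-translate: a ViT-VQGAN turns images into discrete tokens, and an encoder-decoder Transformer predicts those tokens from text. The sketch below uses hypothetical interfaces purely for illustration.

```python
# High-level sketch of a Parti-style pipeline. `text_tokenizer`, `image_tokenizer`
# and `seq2seq` are assumed interfaces, not the actual Parti API.
def generate_image(text, text_tokenizer, image_tokenizer, seq2seq, max_len=1024):
    text_ids = text_tokenizer.encode(text)
    image_ids = []
    for _ in range(max_len):
        # predict the next image token given the text and the tokens emitted so far
        next_id = seq2seq.predict_next(text_ids, image_ids)
        image_ids.append(next_id)
    return image_tokenizer.decode(image_ids)  # discrete tokens -> RGB image
```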
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 2nd International Workshop on Reading Music Systems, held in Delft on the 2nd of November 2019.
Large, text-conditioned generative diffusion models have recently gained a lot of attention for their impressive performance in generating high-fidelity images from text alone. However, achieving high-quality results is almost infeasible in a one-shot fashion. Instead, text-guided image generation involves the user making many slight changes to the input in order to iteratively carve out the envisioned image. However, slight changes to the input prompt often lead to entirely different images being generated, and thus the control of the artist is limited in its granularity. To provide flexibility, we present the Stable Artist, an image editing approach enabling fine-grained control of the image generation process. The main component is semantic guidance (SEGA) which steers the diffusion process along variable numbers of semantic directions. This allows for subtle edits to images, changes in composition and style, as well as optimization of the overall artistic conception. Furthermore, SEGA enables probing of latent spaces to gain insights into the representation of concepts learned by the model, even complex ones such as 'carbon emission'. We demonstrate the Stable Artist on several tasks, showcasing high-quality image editing and composition.
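A simplified reading of semantic guidance is that the denoiser's noise estimate is nudged along directions given by additional "edit" prompts, restricted to the strongest components of each direction. The sketch below is a rough approximation of that idea, not the exact SEGA formulation.

```python
# Simplified sketch of semantic guidance (SEGA)-style steering: the guided noise
# estimate is shifted along per-edit semantic directions. The percentile masking
# here is a simplification of the paper's scheme.
import torch

def sega_noise_estimate(eps_uncond, eps_text, eps_edits, guidance_scale=7.5,
                        edit_scales=None, percentile=0.95):
    eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)
    edit_scales = edit_scales or [5.0] * len(eps_edits)
    for eps_edit, scale in zip(eps_edits, edit_scales):
        direction = eps_edit - eps_uncond
        # steer only along the strongest components of the semantic direction
        threshold = torch.quantile(direction.abs().flatten(), percentile)
        mask = (direction.abs() >= threshold).float()
        eps = eps + scale * mask * direction
    return eps
```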
Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique that trades diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and it often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators over those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release its code and weights at https://github.com/openai/glide-text2im.
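Classifier-free guidance, the strategy preferred in this comparison, combines a conditional and an unconditional noise estimate from the same model. A minimal sketch, with `model` standing in for a text-conditional denoiser:

```python
# Classifier-free guidance for a text-conditional diffusion denoiser: evaluate
# the model with and without the text conditioning and extrapolate between the
# two noise estimates. `model` and the embedding arguments are stand-ins.
def classifier_free_guidance(model, x_t, t, text_emb, null_emb, guidance_scale=3.0):
    eps_cond = model(x_t, t, text_emb)    # conditioned on the prompt
    eps_uncond = model(x_t, t, null_emb)  # conditioned on an empty prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```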
Text-guided image generation models, such as DALL-E 2 and Stable Diffusion, have recently received much attention from academia and the general public. Provided with a textual description, these models are capable of generating high-quality images depicting various concepts and styles. However, such models are trained on large amounts of public data and implicitly learn relationships from their training data that are not immediately obvious. We demonstrate that common multimodal models implicitly learned cultural biases that can be triggered and injected into the generated images by simply replacing single characters in the textual description with visually similar non-Latin characters. These so-called homoglyph replacements enable malicious users or service providers to induce biases in the generated images and even render the whole generation process useless. We practically illustrate such attacks on DALL-E 2 and Stable Diffusion as text-guided image generation models, and further show that CLIP behaves similarly. Our results further indicate that text encoders trained on multilingual data provide a way to mitigate the effects of homoglyph replacements.
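The attack itself requires nothing more than a character substitution. The snippet below shows an illustrative homoglyph replacement with a small hand-picked mapping; the actual characters studied in the paper may differ.

```python
# Illustrative homoglyph replacement: a single Latin character in the prompt is
# swapped for a visually similar non-Latin character. The mapping is a small
# hand-picked example, not the paper's full set.
HOMOGLYPHS = {
    "o": "о",  # Cyrillic small letter o (U+043E)
    "a": "а",  # Cyrillic small letter a (U+0430)
    "e": "е",  # Cyrillic small letter ie (U+0435)
}

def inject_homoglyph(prompt: str, char: str) -> str:
    """Replace the first occurrence of `char` with its homoglyph, if one is known."""
    return prompt.replace(char, HOMOGLYPHS.get(char, char), 1)

print(inject_homoglyph("a photo of an actor", "o"))
# Looks identical to the original prompt, but the text encoder may behave differently.
```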
Large language models trained for code generation can be applied to speaking virtual worlds into existence (creating virtual worlds). In this work we show that prompt-based methods can both accelerate in-VR level editing and become part of gameplay, rather than just part of game development. As an example, we present Codex VR Pong, which demonstrates non-deterministic game mechanics that use generative processes to create not only static content but also non-trivial interactions between 3D objects. This demonstration naturally leads to an integral discussion on how one would evaluate and benchmark experiences created by generative models, as there are no qualitative or quantitative metrics that apply in these scenarios. We conclude by discussing the impending challenges of AI-assisted co-creation in VR.
Large language models, such as OpenAI's Codex and DeepMind's AlphaCode, can generate code to solve a variety of problems expressed in natural language. This technology has already been commercialized in at least one widely used programming editor extension: GitHub Copilot. In this paper, we explore how programming with large language models (LLM-assisted programming) is similar to, and differs from, prior conceptualizations of programmer assistance. We draw on publicly available experience reports of LLM-assisted programming, as well as prior usability and design studies. We find that while LLM-assisted programming shares some properties of compilation, pair programming, and programming via search and reuse, there are fundamental differences both in the technical possibilities and in the practical experience. Thus, LLM-assisted programming ought to be viewed as a new way of programming with its own distinct properties and challenges. Finally, we draw on observations from a user study in which non-expert end-user programmers used LLM-assisted tools to solve data tasks in spreadsheets. We discuss the issues that may arise, and the open research challenges, in applying large language models to end-user programming, particularly for users with little programming expertise.
Recently, large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capability of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans, who are used to verbally describing their intent. It is therefore natural to extend text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in text-based models even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the user to provide a spatial mask to localize the edit, thereby ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. With this observation, we present several applications which monitor the image synthesis by editing the textual prompt only. This includes localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
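The core mechanism can be sketched as recording the cross-attention maps produced for the source prompt and re-injecting them during the early denoising steps of the edited prompt, so the spatial layout carries over. The code below is a conceptual sketch with assumed shapes and interfaces, not the authors' implementation.

```python
# Conceptual sketch of prompt-to-prompt style attention control: cache the
# cross-attention probabilities from a first pass (source prompt) and re-use
# them for the early steps of a second pass (edited prompt).
import torch

class AttentionStore:
    def __init__(self):
        self.maps = {}  # (layer_name, step) -> attention probabilities

    def save(self, layer, step, attn):
        self.maps[(layer, step)] = attn.detach()

    def load(self, layer, step):
        return self.maps.get((layer, step))

def controlled_cross_attention(q, k, v, layer, step, store, record=False,
                               inject_until=0.4, total_steps=50):
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    if record:                               # pass 1: source prompt, cache maps
        store.save(layer, step, attn)
    elif step < inject_until * total_steps:  # pass 2: edited prompt, early steps
        cached = store.load(layer, step)
        if cached is not None and cached.shape == attn.shape:
            attn = cached                    # keep the source layout
    return attn @ v
```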
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select a word in the input text and paint it on a canvas to control the output, which is very handy for crafting the image they have in mind. The project page is available at https://deepimagination.cc/eDiff-I/
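At inference time, the ensemble idea reduces to routing each denoising step to a stage-specialized expert based on the current timestep. A minimal sketch, assuming a simple two-expert split (the actual number of experts and stage boundaries in eDiff-I differ):

```python
# Sketch of stage-specialized ensemble routing: different denoisers handle
# different ranges of the reverse diffusion process, chosen by the timestep.
# The two-expert split and the boundary value are illustrative assumptions.
def ensemble_denoiser(experts, x_t, t, cond, total_steps=1000, boundary=0.7):
    """experts = {"early": text_alignment_expert, "late": refinement_expert}."""
    # Early in sampling (high noise) the text prompt matters most; later steps
    # mostly refine visual detail, so a separate specialist takes over.
    expert = experts["early"] if t > boundary * total_steps else experts["late"]
    return expert(x_t, t, cond)
```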
Leveraging recent advances in deep learning, text-to-image generative models currently attract considerable public attention. Two of these models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description of an image. Based on a novel approach to image generation called diffusion models, text-to-image models can produce many different kinds of high-resolution images, where human imagination is the only limit. However, these models require exceptional amounts of computational resources to train, as well as handling huge datasets collected from the internet. In addition, neither the codebases nor the models have been released. Consequently, this prevents the AI community from experimenting with these cutting-edge models, making the reproduction of their results complicated, if not impossible. In this paper, we aim first to review the different approaches and techniques used by these models, and then to propose our own implementation of a text-to-image model. Highly based on DALL-E 2, we introduce several slight modifications to tackle the high computational cost induced. We thus have the opportunity to conduct experiments to understand the capabilities of such models, especially in a low-resource regime. In particular, we provide a deeper analysis than the authors of DALL-E 2, including ablation studies. Furthermore, diffusion models use so-called guidance methods to help the generation process. We introduce a new guidance method that can be used in conjunction with other guidance methods to improve image quality. Finally, the images generated by our model are of reasonably good quality, without having to sustain the significant training costs of state-of-the-art text-to-image models.
Sketching is a natural and effective visual communication medium commonly used in creative processes. Recent developments in deep learning models have dramatically improved machines' ability to understand and generate visual content. An exciting area of development explores deep learning approaches to modeling human sketches, opening opportunities for creative applications. This chapter describes three fundamental steps in developing deep-learning-driven creativity support tools that consume and generate sketches: 1) a data collection effort that generated a new paired dataset between sketches and mobile user interfaces; 2) a sketch-based user interface retrieval system adapted from state-of-the-art computer vision techniques; and 3) a conversational sketching system that supports a novel interaction of a natural-language-based sketching/critiquing authoring process. In this chapter, we survey related prior work in both the deep learning and human-computer interaction communities, document the data collection process and the systems' architectures in detail, present qualitative and quantitative results, and paint the landscape of several future research directions in this exciting area.
Natural language offers a highly intuitive interface for image editing. In this paper, we introduce the first solution for performing local (region-based) edits in generic natural images, based on a natural language description together with an ROI mask. We achieve our goal by leveraging and combining a pretrained language-image model (CLIP), to steer the edit towards a user-provided text prompt, with a denoising diffusion probabilistic model (DDPM) to generate natural-looking results. To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent at a progression of noise levels. In addition, we show that adding augmentations to the diffusion process mitigates adversarial results. We compare against several baselines and related methods, both qualitatively and quantitatively, and show that our method outperforms these solutions in terms of overall realism, ability to preserve the background, and matching the text. Finally, we show several text-driven editing applications, including adding a new object to an image, removing/replacing/altering existing objects, background replacement, and image extrapolation.
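The blending at progressing noise levels can be summarized in a single per-step operation: combine the text-guided latent inside the mask with a correspondingly noised copy of the input image outside it. The helper names below (`guided_step`, `q_sample`) are assumptions for illustration, not the paper's actual API.

```python
# Sketch of the per-step blending in text-driven local editing: the CLIP-guided
# latent inside the mask is fused with a noised copy of the original image
# outside the mask, at each noise level of the reverse process.
def blended_step(x_t, t, source_image, mask, guided_step, q_sample):
    x_fg = guided_step(x_t, t)        # assumed: text-guided denoising update
    x_bg = q_sample(source_image, t)  # assumed: original image noised to level t
    return mask * x_fg + (1.0 - mask) * x_bg
```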
This survey draws a broad panoramic picture of the state of the art (SOTA) of research on generative methods for the analysis of social media data. It fills a gap, as existing survey articles are either much narrower in their scope or are dated. We include two important aspects that are currently gaining importance in mining and modeling social media: dynamics and networks. Social dynamics are important for understanding the spreading of influence or diseases, the formation of friendships, and so on; networks, on the other hand, can capture various complex relationships, providing additional insight and identifying important patterns that would otherwise go unnoticed.
The metaverse, an enormous virtual-physical cyberspace, has brought unprecedented opportunities for artists to blend every corner of our physical surroundings with digital creativity. This article conducts a comprehensive survey of computational arts, covering seven critical topics relevant to the metaverse and describing novel artworks in blended virtual-physical realities. The topics first cover the building elements of the metaverse, e.g., virtual scenes and characters, and auditory and textual elements. Next, several remarkable types of novel creation in the expanded horizons of metaverse cyberspace are reflected upon, such as immersive arts, robotic arts, and other user-centric approaches. Finally, we propose several research agendas: democratizing computational arts, digital privacy and safety for metaverse artists, ownership recognition for digital artworks, technological challenges, and so on. The survey also serves as introductory material for artists and metaverse technologists to begin creating in the realm of surreal cyberspace.
Using computational notebooks (e.g., Jupyter Notebook), data scientists rationalize their exploratory data analysis (EDA) based on their prior experience and external knowledge such as online examples. For novices or data scientists who lack specific knowledge about the dataset or problem at hand, effectively obtaining and understanding external information is critical to carrying out EDA. This paper presents EDAssistant, a JupyterLab extension that supports EDA with in-situ search of example notebooks and recommendation of useful APIs, powered by novel interactive visualizations of the search results. The code search and recommendation are enabled by state-of-the-art machine learning models, trained on a large corpus of EDA notebooks collected online. A user study was conducted to investigate both EDAssistant and data scientists' current practice (i.e., using external search engines). The results demonstrate the effectiveness and usefulness of EDAssistant, and participants appreciated its smooth and in-context support for EDA. We also report several design implications regarding code recommendation tools.
Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts, while consistent with input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
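The object-masking idea used during training can be approximated as follows: an off-the-shelf object detector proposes regions, and those regions become the inpainting mask for a training example. The detector interface and its output format below are assumptions for illustration.

```python
# Sketch of detector-proposed inpainting masks: detected object regions are
# turned into a binary mask for training a text-guided inpainting model.
# `detector` is a stand-in returning [x0, y0, x1, y1, score] boxes.
import numpy as np

def object_mask(image: np.ndarray, detector, min_confidence: float = 0.5) -> np.ndarray:
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.float32)
    for x0, y0, x1, y1, score in detector(image):
        if score >= min_confidence:
            mask[int(y0):int(y1), int(x0):int(x1)] = 1.0
    return mask  # 1 inside proposed object regions, 0 elsewhere
```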