We address the task of action-effect prediction: given an image depicting the initial state of the world and an action expressed in text, predict an image depicting the state of the world after the action. The prediction should preserve the scene context of the input image. We explore the use of the recently proposed GLIDE model for this task. GLIDE is a generative neural network that can synthesize (inpaint) masked regions of an image, conditioned on a short text prompt. Our idea is to mask the region of the input image where the effect of the action is expected, and then use GLIDE to inpaint the masked region conditioned on the desired action. In this way, the resulting image retains the background context of the input image while being updated to show the effect of the action. We present qualitative results of experiments on the EPIC dataset of egocentric videos annotated with actions.
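A minimal sketch of this mask-then-inpaint recipe (GLIDE's own code is at https://github.com/openai/glide-text2im; here the Hugging Face diffusers Stable Diffusion inpainting pipeline stands in as an analogous text-conditioned inpainter, and the file names and action prompt are purely illustrative):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# Hypothetical inputs: an image of the initial world state and a mask over
# the region where the action's effect is expected (both 512x512).
initial_state = Image.open("state_before.png").convert("RGB")
effect_mask = Image.open("effect_region.png").convert("L")

# Inpaint the masked region conditioned on the action text; the unmasked
# background is carried over from the input image.
after_state = pipe(
    prompt="the onion has been chopped into small pieces",
    image=initial_state,
    mask_image=effect_mask,
).images[0]
after_state.save("state_after.png")
```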
Generic image inpainting aims to complete a corrupted image by borrowing surrounding information, which barely generates novel content. By contrast, multi-modal inpainting provides more flexible and useful controls on the inpainted content, e.g., a text prompt can be used to describe an object with richer attributes, and a mask can be used to constrain the shape of the inpainted object rather than being only considered as a missing area. We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape guidance. While previous work such as DALLE-2 and Stable Diffusion can do text-guided inpainting, they do not support shape guidance and tend to modify background texture surrounding the generated object. Our model incorporates both text and shape guidance with precision control. To preserve the background better, we propose a novel training and sampling strategy by augmenting the diffusion U-net with object-mask prediction. Lastly, we introduce a multi-task training strategy by jointly training inpainting with text-to-image generation to leverage more training data. We conduct extensive experiments showing that our model outperforms all baselines in terms of visual quality, mask controllability, and background preservation.
Natural language offers a highly intuitive interface for image editing. In this paper, we introduce the first solution for performing local (region-based) edits in generic natural images, based on a natural language description together with an ROI mask. We achieve our goal by leveraging and combining a pretrained language-image model (CLIP), to steer the edit toward a user-provided text prompt, with a denoising diffusion probabilistic model (DDPM) to generate natural-looking results. To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent at a progression of noise levels. In addition, we show that adding augmentations to the diffusion process mitigates adversarial results. We compare against several baselines and related methods, both qualitatively and quantitatively, and show that our method outperforms these solutions in terms of overall realism, the ability to preserve the background, and matching the text. Finally, we show several text-driven editing applications, including adding a new object to an image, removing/replacing/altering existing objects, background replacement, and image extrapolation.
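The spatial blending step can be summarized in a few lines. The following is a sketch under assumed names and the standard DDPM forward process, not the authors' exact code:

```python
import torch

def blended_step(x_t, x_src, mask, t, alphas_cumprod):
    """Blend the text-guided latent with a noised copy of the input at the
    same noise level, so the region outside the ROI mask stays anchored to
    the source image (mask is 1 inside the edited region)."""
    a_t = alphas_cumprod[t]
    # forward process q(x_t | x_0): noise the source image to level t
    x_src_t = a_t.sqrt() * x_src + (1.0 - a_t).sqrt() * torch.randn_like(x_src)
    return mask * x_t + (1.0 - mask) * x_src_t
```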
Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts, while consistent with input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
Curating datasets for object segmentation is a difficult task. With the advent of large-scale pre-trained generative models, conditional image generation has been given a significant boost in result quality and ease of use. In this paper, we present a novel method that enables the generation of general foreground-background segmentation models from simple textual descriptions, without requiring segmentation labels. We leverage and explore pre-trained latent diffusion models to automatically generate weak segmentation masks for concepts and objects. The masks are then used to fine-tune the diffusion model on an inpainting task, which enables fine-grained removal of the object while, at the same time, providing a synthetic foreground and background dataset. We demonstrate that this method outperforms previous methods in both discriminative and generative performance and closes the gap with fully supervised training while requiring no pixel-wise object labels. We show results on the task of segmenting four different objects (humans, dogs, cars, birds).
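The weak-mask extraction is not spelled out above; one plausible reading is that a concept token's cross-attention maps are averaged and thresholded. The helper below is a hypothetical sketch of that idea, not the paper's implementation:

```python
import torch

def weak_mask_from_attention(cross_attn, token_idx, thresh=0.5):
    """Hypothetical helper: average a concept token's cross-attention maps
    and threshold them into a weak foreground mask.
    cross_attn: [maps, H, W, tokens] attention collected across layers/heads."""
    m = cross_attn[..., token_idx].mean(dim=0)      # [H, W]
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)  # normalize to [0, 1]
    return m > thresh                               # boolean weak mask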
We introduce Action-GPT, a plug and play framework for incorporating Large Language Models (LLMs) into text-based action generation models. Action phrases in current motion capture datasets contain minimal and to-the-point information. By carefully crafting prompts for LLMs, we generate richer and fine-grained descriptions of the action. We show that utilizing these detailed descriptions instead of the original action phrases leads to better alignment of text and motion spaces. Our experiments show qualitative and quantitative improvement in the quality of synthesized motions produced by recent text-to-motion models. Code, pretrained models and sample videos will be made available at https://actiongpt.github.io
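A hypothetical prompt template in this spirit (the exact prompts are the paper's own; this sketch only illustrates the idea of expanding terse action labels via an LLM):

```python
def action_prompt(phrase: str) -> str:
    """Hypothetical prompt template in the spirit of Action-GPT: ask an LLM
    to expand a terse motion-capture action label into a rich description."""
    return (
        "Describe in detail, step by step, how a person performs the action "
        f"'{phrase}'. Mention the body parts involved and how they move."
    )

# e.g. action_prompt("jumping jacks") is fed to an LLM, whose output replaces
# the original phrase as conditioning text for the text-to-motion model.
```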
How does one adapt a pre-trained visual model to novel downstream tasks without task-specific finetuning or any model modification? Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples. We show that posing this problem as simple image inpainting -- literally just filling in a hole in a concatenated visual prompt image -- turns out to be surprisingly effective, provided that the inpainting algorithm has been trained on the right data. We train masked autoencoders on a new dataset that we curated -- 88k unlabeled figures from academic paper sources on arXiv. We apply visual prompting to these pretrained models and demonstrate results on various downstream image-to-image tasks, including foreground segmentation, single object detection, colorization, edge detection, and more.
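The "concatenated visual prompt image" can be assembled as a 2x2 grid; a minimal sketch, assuming same-shaped HxWxC arrays:

```python
import numpy as np

def make_visual_prompt(example_in, example_out, query_in, hole_value=0):
    """Assemble the 2x2 grid [example_in | example_out ; query_in | HOLE];
    a trained inpainter then fills the hole with the task output for the
    query. All inputs are assumed to be HxWxC arrays of the same shape."""
    h, w, c = example_in.shape
    grid = np.full((2 * h, 2 * w, c), hole_value, dtype=example_in.dtype)
    grid[:h, :w] = example_in
    grid[:h, w:] = example_out
    grid[h:, :w] = query_in
    hole = (slice(h, 2 * h), slice(w, 2 * w))  # region for the model to fill
    return grid, hole
```

For instance, (example_in, example_out) could be an image and its foreground mask; the inpainter's fill of the hole is then read out as the predicted mask for query_in.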
Shape can specify key object constraints, yet existing text-to-image diffusion models ignore this cue and synthesize objects that are incorrectly scaled, cut off, or replaced with background content. We propose a training-free method, Shape-Guided Diffusion, which uses a novel Inside-Outside Attention mechanism to constrain the cross-attention (and self-attention) maps such that prompt tokens (and pixels) referring to the inside of the shape cannot attend outside the shape, and vice versa. To demonstrate the efficacy of our method, we propose a new image editing task where the model must replace an object specified by its mask and a text prompt. We curate a new ShapePrompts benchmark based on MS-COCO and achieve SOTA results in shape faithfulness, text alignment, and realism according to both quantitative metrics and human preferences. Our data and code will be made available at https://shape-guided-diffusion.github.io.
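A sketch of the attention constraint, assuming boolean masks over pixels and prompt tokens (names and tensor layout are illustrative, not the released code):

```python
import torch

def inside_outside_attention(scores, pixel_inside, token_inside):
    """Sketch of the Inside-Outside Attention constraint: tokens referring
    to the object may only attend to pixels inside the shape, all other
    tokens only outside (the paper applies an analogous rule to
    self-attention). scores: [pixels, tokens] attention logits."""
    blocked = (pixel_inside[:, None] & ~token_inside[None, :]) | \
              (~pixel_inside[:, None] & token_inside[None, :])
    return scores.masked_fill(blocked, torch.finfo(scores.dtype).min)
```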
Recently, large-scale text-driven synthesis models have attracted much attention thanks to their remarkable ability to generate highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans, who are used to verbally describing their intent. It is therefore natural to extend text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in text-based models even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring users to provide a spatial mask to localize the edit, thereby ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. With this observation, we present several applications that monitor the image synthesis by editing the textual prompt only. This includes localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
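Two of the described controls reduce to small operations on cross-attention maps. A sketch, with the switch-over step tau and the tensor layout as assumptions:

```python
def injected_attention(attn_source, attn_edited, t, tau):
    """Word-swap rule (sketch): early in sampling (high t), reuse the source
    prompt's cross-attention maps so the spatial layout is preserved; later,
    let the edited prompt's own maps take over."""
    return attn_source if t > tau else attn_edited

def reweighted_attention(attn, token_idx, scale):
    """Re-weighting rule (sketch): scale one word's attention map (last axis
    of a torch tensor indexes prompt tokens) to control how strongly that
    word is reflected in the image."""
    attn = attn.clone()
    attn[..., token_idx] *= scale
    return attn
```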
Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators over those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.
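Classifier-free guidance itself is a one-liner; eps_cond and eps_uncond are the model's noise predictions with and without the caption:

```python
def classifier_free_guidance(eps_cond, eps_uncond, s):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditional one,
        eps_hat = eps(x_t | "") + s * (eps(x_t | caption) - eps(x_t | "")),
    where s > 1 trades diversity for fidelity to the caption."""
    return eps_uncond + s * (eps_cond - eps_uncond)
```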
Videos often capture objects, their visible properties, their motion, and the interactions between different objects. Objects also have physical properties such as mass, which the imaging pipeline is unable to directly capture. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. In this paper, we introduce CRIPP-VQA, a new video question answering dataset for reasoning about the implicit physical properties of objects in a scene. CRIPP-VQA contains videos of objects in motion, annotated with questions that involve counterfactual reasoning about the effect of actions, questions about planning in order to reach a goal, and descriptive questions about visible properties of objects. The CRIPP-VQA test set enables evaluation under several out-of-distribution settings -- videos with objects with masses, coefficients of friction, and initial velocities that are not observed in the training distribution. Our experiments reveal a surprising and significant performance gap in terms of answering questions about implicit properties (the focus of this paper) and explicit properties of objects (the focus of prior work).
"Actions" play a vital role in how humans interact with the world and enable them to achieve desired goals. As a result, most common sense (CS) knowledge for humans revolves around actions. While "Reasoning about Actions & Change" (RAC) has been widely studied in the Knowledge Representation community, it has recently piqued the interest of NLP and computer vision researchers. This paper surveys existing tasks, benchmark datasets, various techniques and models, and their respective performance concerning advancements in RAC in the vision and language domain. Towards the end, we summarize our key takeaways, discuss the present challenges facing this research area, and outline potential directions for future research.
Object cut-and-paste has become a promising approach to efficiently generate large sets of labeled training data. It involves compositing foreground object masks onto background images. The background images, when congruent with the objects, provide helpful context information for training object recognition models. While the approach can easily generate large labeled data, finding congruent context images for downstream tasks has remained an elusive problem. In this work, we propose a new paradigm for automatic context image generation at scale. At the core of our approach lies the interplay between context and language-driven image generation. Language descriptions of a context are provided by applying an image captioning method on a small set of images representing the context. These language descriptions are then used to generate diverse sets of context images via the language-based DALL-E image generation framework. These are then composited with objects to provide an augmented training set for a classifier. We demonstrate the advantages of our approach over prior context image generation methods on four object detection datasets. Furthermore, we also highlight the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios.
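The final compositing step is plain alpha blending; a hypothetical NumPy helper (names and the RGBA layout are assumptions):

```python
import numpy as np

def paste_object(background, fg_rgba, x, y):
    """Alpha-composite a cut-out foreground object (RGBA, HxWx4) onto a
    generated context image at position (x, y)."""
    h, w = fg_rgba.shape[:2]
    alpha = fg_rgba[..., 3:4].astype(np.float32) / 255.0
    region = background[y:y + h, x:x + w].astype(np.float32)
    blended = alpha * fg_rgba[..., :3].astype(np.float32) + (1.0 - alpha) * region
    background[y:y + h, x:x + w] = blended.astype(background.dtype)
    return background
```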
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning. However, such a paradigm is computationally expensive. Humans have the amazing ability to learn new visual concepts from just one single exemplar. We hereby study a new T2V generation problem -- One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt the T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models are able to generate images that align well with the verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via an efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally-coherent videos for various applications such as change of subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of our method.
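A sketch of the sparse-causal key/value selection, under an assumed per-frame latent layout:

```python
import torch

def sparse_causal_kv(frame_latents, i):
    """Sketch of a Sparse-Causal Attention pattern: when denoising frame i,
    keys/values are drawn only from the first frame and the previous frame,
    which keeps content consistent over time at low cost.
    frame_latents: list of [tokens, dim] tensors, one per frame (assumed)."""
    prev = max(i - 1, 0)
    return torch.cat([frame_latents[0], frame_latents[prev]], dim=0)
```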
We introduce layered controllable video generation, where, without any supervision, we decompose the initial frame of a video into foreground and background layers, with which the user can control the video generation process by simply manipulating the foreground mask. The key challenges are the unsupervised foreground-background separation, which is ambiguous, and the ability to anticipate user manipulations with access to only raw video sequences. We address these challenges by proposing a two-stage learning procedure. In the first stage, with a rich set of losses and a dynamic foreground size prior, we learn how to separate frames into foreground and background layers and, conditioned on these layers, how to generate the next frame using a VQ-VAE generator. In the second stage, we fine-tune this network to anticipate edits to the mask by fitting (parameterized) control to the mask from a future frame. We demonstrate the effectiveness of this learning scheme and a more granular control mechanism, while illustrating state-of-the-art performance on two benchmark datasets. We provide a video abstract as well as video results at https://gabriel-….github.io/layered_controllable_video_generation
As information exists in various modalities in the real world, effective interaction and fusion among multimodal information play a key role in the creation and perception of multimodal data in computer vision and deep learning research. With its superb power in modeling the interaction among multimodal information, multimodal image synthesis and editing has become a hot research topic in recent years. Unlike traditional visual guidance, which provides explicit cues, multimodal guidance offers intuitive and flexible means for image synthesis and editing. On the other hand, this field also faces several challenges characterized by inherent modality gaps, the synthesis of high-resolution images, faithful evaluation metrics, etc. In this survey, we comprehensively contextualize the recent advances in multimodal image synthesis and editing and formulate taxonomies according to data modalities and model architectures. We start with an introduction to the different types of guidance modalities in image synthesis and editing. We then describe multimodal image synthesis and editing approaches with detailed frameworks, including Generative Adversarial Networks (GANs), GAN inversion, Transformers, and other methods such as NeRF and diffusion models. This is followed by a comprehensive description of the benchmark datasets and evaluation metrics widely adopted in multimodal image synthesis and editing, as well as detailed comparisons of different synthesis methods with analysis of their respective advantages and limitations. Finally, we provide insights into the current research challenges and possible future research directions. A project associated with this survey is available at https://github.com/fnzhan/mise
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality, etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
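Parallel decoding can be sketched as iterative confidence-based unmasking; logits_fn, the linear schedule, and the tensor shapes below are assumptions, not Muse's released code:

```python
import torch

def parallel_decode(logits_fn, seq_len, mask_id, steps=12):
    """Sketch of Muse-style parallel decoding: start from all-[MASK] tokens
    and, over a fixed number of steps, commit the most confident predictions
    while re-masking the rest. logits_fn: assumed [seq]->[seq, vocab]
    token predictor."""
    codes = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(1, steps + 1):
        probs = logits_fn(codes).softmax(-1)            # [seq_len, vocab]
        conf, pred = probs.max(-1)
        still_masked = codes == mask_id
        codes = torch.where(still_masked, pred, codes)  # tentatively fill all
        n_remask = int(seq_len * (1 - step / steps))    # linear unmask schedule
        if n_remask > 0:
            # never re-mask tokens committed in earlier steps
            conf = torch.where(still_masked, conf,
                               torch.full_like(conf, float("inf")))
            codes[conf.topk(n_remask, largest=False).indices] = mask_id
    return codes
```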
Video generation is a relatively new but popular topic in machine learning, due to its many potential applications and its numerous challenges. Current methods in video generation provide users with little or no control over the precise specification of how the objects in a generated video are moved and located at each frame, i.e., the user cannot explicitly control how each object in the video should move. In this paper, we propose a novel method that allows the user to move any number of objects of a single initial frame simply by drawing bounding boxes over those objects and then moving the boxes along the desired path. Our model utilizes two autoencoders to fully decompose the motion and content information in a video, and achieves results comparable to well-known baselines and existing methods.
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select a word in the input text and paint it on a canvas to control the output, which is very handy for crafting the desired image one has in mind. The project page is available at https://deepimagination.cc/eDiff-I/
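The stage-specialized ensemble amounts to routing by noise level at inference time; a sketch under an assumed boundary convention:

```python
def ediff_denoise(x_t, t, text_emb, experts, interval_starts):
    """Sketch of eDiff-I's expert routing: each denoiser in `experts` is
    specialized for one stage of sampling; `interval_starts` holds the stage
    boundaries in t, sorted from high noise to low (assumed convention)."""
    for start_t, expert in zip(interval_starts, experts):
        if t >= start_t:
            return expert(x_t, t, text_emb)
    return experts[-1](x_t, t, text_emb)
```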
We propose "factor matting", an alternative formulation of the video matting problem in terms of counterfactual video synthesis that is better suited for re-composition tasks. The goal of factor matting is to separate the contents of video into independent components, each visualizing a counterfactual version of the scene where contents of other components have been removed. We show that factor matting maps well to a more general Bayesian framing of the matting problem that accounts for complex conditional interactions between layers. Based on this observation, we present a method for solving the factor matting problem that produces useful decompositions even for video with complex cross-layer interactions like splashes, shadows, and reflections. Our method is trained per-video and requires neither pre-training on external large datasets, nor knowledge about the 3D structure of the scene. We conduct extensive experiments, and show that our method not only can disentangle scenes with complex interactions, but also outperforms top methods on existing tasks such as classical video matting and background subtraction. In addition, we demonstrate the benefits of our approach on a range of downstream tasks. Please refer to our project webpage for more details: https://factormatte.github.io