Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts, while remaining consistent with input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high-resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images, exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
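To make the object-masking idea concrete, below is a minimal sketch of how detector output might be turned into inpainting masks during training. The random-box fallback and all sizes are assumptions, and `detections` stands in for the output of any off-the-shelf object detector (the upstream call is not shown):

```python
import numpy as np

def object_inpainting_mask(image_hw, detections, rng=None):
    """Build a binary inpainting mask from object-detector output.

    image_hw:   (height, width) of the training image.
    detections: list of (x0, y0, x1, y1) boxes from any off-the-shelf
                detector (the upstream call is assumed, not shown).
    Returns a float32 mask with 1.0 over the region to be inpainted.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = image_hw
    mask = np.zeros((h, w), dtype=np.float32)
    if detections:
        # Mask a randomly chosen detected object so the model must
        # reconstruct a semantically meaningful region described by text.
        x0, y0, x1, y1 = detections[rng.integers(len(detections))]
        mask[int(y0):int(y1), int(x0):int(x1)] = 1.0
    else:
        # Fallback (an assumption): mask a random box when nothing is detected.
        bh, bw = rng.integers(h // 4, h // 2), rng.integers(w // 4, w // 2)
        y0, x0 = rng.integers(0, h - bh), rng.integers(0, w - bw)
        mask[y0:y0 + bh, x0:x0 + bw] = 1.0
    return mask
```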
Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI generates text from visual and textual inputs, and uses this interface to perform many vision, language, and multimodal tasks in many languages. To train PaLI, we leverage large encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and on the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits of even larger-capacity vision models. To train PaLI, we create a large multilingual mix based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.
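A rough sketch of this kind of vision-plus-text interface follows; the dimensions, vocabulary size, and the linear projection are illustrative assumptions, not PaLI's actual architecture:

```python
import torch
import torch.nn as nn

class VisualTextInterface(nn.Module):
    """Minimal sketch of a vision-plus-text interface: ViT patch features
    are projected into the language model's embedding space and prepended
    to the embedded text tokens; an encoder-decoder LM (not shown) would
    then consume the fused sequence and generate the output text.
    Dimensions and vocabulary size here are illustrative assumptions."""

    def __init__(self, vit_dim=1024, lm_dim=768, vocab_size=32000):
        super().__init__()
        self.project = nn.Linear(vit_dim, lm_dim)      # visual -> LM space
        self.embed = nn.Embedding(vocab_size, lm_dim)  # text token embeddings

    def forward(self, patch_feats, text_ids):
        # patch_feats: (batch, num_patches, vit_dim) from a ViT encoder
        # text_ids:    (batch, text_len) token ids of the textual input
        visual = self.project(patch_feats)
        textual = self.embed(text_ids)
        return torch.cat([visual, textual], dim=1)
```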
The ability to read text in images is often lacking in vision-and-language (V&L) models. How do we learn V&L models that exhibit strong scene-text understanding (STU)? In this paper, we propose PreSTU, a simple pre-training recipe designed specifically for scene-text understanding. PreSTU combines a simple OCR-aware pre-training objective with a large image-text dataset carrying off-the-shelf OCR signals. We empirically demonstrate the superiority of this pre-training objective on TextVQA, TextCaps, ST-VQA, and VizWiz-VQA. We also study which factors affect STU performance, where we highlight the importance of image resolution and dataset scale during pre-training.
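As a sketch of what an OCR-aware pre-training objective can look like (an assumed variant: the prompt format and split rule below are illustrative, not necessarily the paper's recipe), the OCR tokens read off an image are split at a random point, with the prefix given as input and the suffix used as the generation target:

```python
import random

def make_ocr_pretraining_example(ocr_tokens, split_ratio=None):
    """ocr_tokens: scene-text tokens from an off-the-shelf OCR system,
    in reading order, e.g. ["OPEN", "7", "DAYS", "A", "WEEK"]."""
    if split_ratio is None:
        split_ratio = random.random()
    cut = int(len(ocr_tokens) * split_ratio)
    # Hypothetical prompt string; the real recipe may phrase this differently.
    prompt = "complete the scene text: " + " ".join(ocr_tokens[:cut])
    target = " ".join(ocr_tokens[cut:])
    return prompt, target
```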
Visual question answering (VQA) has been studied primarily through an English lens. However, tackling VQA in other languages in the same manner would require considerable resources. In this paper, we propose scalable solutions to multilingual visual question answering (mVQA) on both the data and modeling fronts. We first propose a translation-based framework for mVQA data generation that requires far less human annotation effort than the conventional approach of directly collecting questions and answers. We then apply the framework to the multilingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create Maverics-XM3600 (MaXM), a test-only VQA benchmark in 7 diverse languages. Finally, we propose an approach to unified, scalable, open-ended, and end-to-end mVQA modeling and demonstrate strong performance in 13 languages.
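The sketch below illustrates one way a translation-based generation step with a simple consistency filter could look; the filtering rule and the `translate` / `back_translate` callables (any MT system) are assumptions, not the paper's exact protocol:

```python
def generate_mvqa_pairs(english_pairs, target_lang, translate, back_translate):
    """Sketch of a translation-based data-generation step.

    english_pairs: list of (question, answer) strings in English.
    translate / back_translate: callables wrapping any MT system.
    """
    generated = []
    for question, answer in english_pairs:
        q_t = translate(question, target_lang)
        a_t = translate(answer, target_lang)
        # Keep a pair only if the round-trip translation of the question
        # comes back (nearly) unchanged -- a cheap quality filter.
        if back_translate(q_t, target_lang).strip().lower() == question.strip().lower():
            generated.append((q_t, a_t))
    return generated
```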
Dense video captioning aims to identify the events of interest in an input video and generate a descriptive caption for each event. Prior approaches usually follow a two-stage generative process that first proposes a segment for each event, then renders a caption for each identified segment. Recent advances in large-scale sequence-generation pretraining have seen great success in unifying task formulations across a variety of tasks, but so far, more complex tasks such as dense video captioning have been unable to fully utilize this powerful paradigm. In this work, we show how to model the two subtasks of dense video captioning jointly as one sequence-generation task, predicting the events and the corresponding descriptions simultaneously. Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of training complex tasks such as end-to-end dense video captioning integrated into large-scale pretrained models.
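A minimal sketch of such a single-sequence formulation follows; the time-token names, the number of bins, and the separator token are illustrative assumptions:

```python
def serialize_dense_captions(events, duration, num_time_bins=100):
    """Sketch of the single-sequence formulation: each event's start and
    end times are quantized into special time tokens and interleaved with
    its caption, so one generated sequence carries both the event
    proposals and their descriptions.

    events:   list of (start_sec, end_sec, caption) tuples.
    duration: total video length in seconds.
    """
    pieces = []
    for start, end, caption in sorted(events):
        s_bin = int(num_time_bins * start / duration)
        e_bin = int(num_time_bins * end / duration)
        pieces.append(f"<time_{s_bin}> <time_{e_bin}> {caption}")
    return " <event_sep> ".join(pieces)

# Two events in a 60-second cooking clip:
print(serialize_dense_captions(
    [(0.0, 12.5, "crack two eggs into a bowl"),
     (12.5, 40.0, "whisk the eggs with a pinch of salt")], duration=60.0))
```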
In recent years, with the growing abundance of pretrained models, the problem of selecting the best pretrained checkpoint for a particular downstream classification task has received increasing attention. Although several methods have recently been proposed to tackle the selection problem (e.g., LEEP, H-score), they resort to heuristics that are not well motivated by learning theory. In this paper, we present PACTran, a theoretically grounded family of metrics for pretrained model selection and transferability measurement. We first show how PACTran metrics can be derived from the optimal PAC-Bayesian bound under a transfer learning setting. We then empirically evaluate three metric instantiations of PACTran on a number of vision tasks (VTAB) as well as a language-and-vision (OKVQA) task. An analysis of the results shows that PACTran is a more consistent and effective transferability measure than existing selection methods.
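As a heavily simplified, PAC-Bayesian-flavored sketch (not the paper's exact PACTran estimators): fit an L2-regularized linear head on frozen checkpoint features and combine its empirical risk with the regularization penalty, with lower scores suggesting better transferability:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pac_style_transfer_score(features, labels, reg=1.0):
    """Simplified transferability score in a PAC-Bayesian spirit.

    features: (n, d) array of features from the candidate checkpoint.
    labels:   length-n array of integer class labels in {0, ..., K-1}.
    """
    n = len(labels)
    head = LogisticRegression(C=1.0 / reg, max_iter=1000).fit(features, labels)
    probs = head.predict_proba(features)
    # Empirical risk (negative log-likelihood of the true classes) ...
    nll = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
    # ... plus a complexity penalty on the linear head's weights.
    penalty = (reg / (2.0 * n)) * np.sum(head.coef_ ** 2)
    return nll + penalty
```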
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overly restrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [70] and introduce Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks, with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
Image captioning models generally lack the ability to take user interest into account, and usually default to global descriptions that try to balance readability, informativeness, and information overload. VQA models, on the other hand, generally lack the ability to provide long descriptive answers, while expecting the textual question to be quite precise. We introduce a method to control the concepts that an image caption should focus on, using an additional input called the guiding text, which refers to either groundable or ungroundable concepts in the image. Our model consists of a Transformer-based multimodal encoder that uses the guiding text together with global and object-level image features to derive early-fusion representations used to generate the guided caption. While models trained on Visual Genome data have an in-domain advantage when guided with automatic object labels, we find that guided captioning models trained on Conceptual Captions generalize better to out-of-domain images and guiding texts. Our human-evaluation results indicate that attempting guided image captioning in the wild requires access to large, unrestricted-domain training datasets, and that increased style diversity (even without an increase in the number of unique tokens) is a key factor for improved performance.
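A small sketch of such an early-fusion input is given below; the shared dimension, the use of type embeddings, and the feature shapes are assumptions rather than the paper's exact encoder:

```python
import torch
import torch.nn as nn

class GuidedCaptionFusion(nn.Module):
    """Sketch of early fusion for guided captioning: guiding-text tokens,
    a global image feature, and object-level region features are mapped
    into one shared space, tagged with a type embedding, and concatenated
    into a single sequence for a Transformer encoder (not shown)."""

    def __init__(self, d_model=512, vocab_size=30000, img_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(img_dim, d_model)
        self.type_embed = nn.Embedding(3, d_model)  # 0=text, 1=global, 2=object

    def forward(self, guide_ids, global_feat, object_feats):
        # guide_ids: (B, T); global_feat: (B, 1, img_dim); object_feats: (B, R, img_dim)
        text = self.text_embed(guide_ids) + self.type_embed.weight[0]
        glob = self.img_proj(global_feat) + self.type_embed.weight[1]
        objs = self.img_proj(object_feats) + self.type_embed.weight[2]
        return torch.cat([text, glob, objs], dim=1)
```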
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.
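One of ALBERT's parameter-reduction techniques is a factorized embedding parameterization; a minimal sketch (sizes are illustrative) shows how a V x H embedding table is replaced by a V x E table plus an E x H projection:

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Sketch of factorized embedding parameterization: tokens are first
    embedded into a small space of size E and then projected up to the
    hidden size H, so the embedding costs V*E + E*H parameters instead
    of V*H."""

    def __init__(self, vocab_size=30000, embed_size=128, hidden_size=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)   # V x E
        self.project = nn.Linear(embed_size, hidden_size)   # E x H

    def forward(self, token_ids):
        return self.project(self.embed(token_ids))
```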
Despite its importance for federated learning, continuous learning and many other applications, on-device training remains an open problem for EdgeAI. The problem stems from the large number of operations (e.g., floating point multiplications and additions) and memory consumption required during training by the back-propagation algorithm. Consequently, in this paper, we propose a new gradient filtering approach which enables on-device DNN model training. More precisely, our approach creates a special structure with fewer unique elements in the gradient map, thus significantly reducing the computational complexity and memory consumption of back propagation during training. Extensive experiments on image classification and semantic segmentation with multiple DNN models (e.g., MobileNet, DeepLabV3, UPerNet) and devices (e.g., Raspberry Pi and Jetson Nano) demonstrate the effectiveness and wide applicability of our approach. For example, compared to SOTA, we achieve up to 19$\times$ speedup and 77.1% memory savings on ImageNet classification with only 0.1% accuracy loss. Finally, our method is easy to implement and deploy; over 20$\times$ speedup and 90% energy savings have been observed compared to highly optimized baselines in MKLDNN and CUDNN on NVIDIA Jetson Nano. Consequently, our approach opens up a new direction of research with a huge potential for on-device training.
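The core idea can be sketched as patch-wise averaging of the gradient map so that it contains far fewer unique elements; the patch size and the exact structure below are illustrative assumptions, not the paper's precise formulation:

```python
import numpy as np

def filter_gradient(grad_map, patch=4):
    """Sketch of the gradient-filtering idea: replace each spatial patch
    of a gradient map with its mean so the map has far fewer unique
    elements, which makes the backward computations much cheaper.

    grad_map: array of shape (channels, height, width); height and width
              are assumed divisible by `patch` for simplicity.
    """
    c, h, w = grad_map.shape
    blocks = grad_map.reshape(c, h // patch, patch, w // patch, patch)
    means = blocks.mean(axis=(2, 4), keepdims=True)   # one value per patch
    return np.broadcast_to(means, blocks.shape).reshape(c, h, w)
```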