In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoders/decoders) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost model performance. Without bells and whistles, our GIT establishes new state of the art on 12 challenging benchmarks. For example, our model surpasses human performance on TextCaps for the first time (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
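To make the "one image encoder plus one text decoder under a single language-modeling task" design concrete, here is a minimal PyTorch sketch of a generative captioner in that spirit. All module names, sizes, and the toy patch embedding are illustrative assumptions, not the released GIT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCaptioner(nn.Module):
    """Minimal generative image-to-text model: image encoder + causal text decoder.

    Hypothetical sizes and modules; not the released GIT architecture.
    """
    def __init__(self, vocab_size=30522, d_model=256, n_layers=2, n_heads=4, max_len=64):
        super().__init__()
        # "Image encoder": here just a patch projection; GIT uses a pretrained ViT-style encoder.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)  # self-attention-only decoder
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # Image tokens attend bidirectionally; text tokens attend causally and to the image.
        img = self.patch_embed(images).flatten(2).transpose(1, 2)          # (B, Ni, D)
        pos = torch.arange(captions.size(1), device=captions.device)
        txt = self.tok_embed(captions) + self.pos_embed(pos)               # (B, Nt, D)
        x = torch.cat([img, txt], dim=1)
        Ni, Nt = img.size(1), txt.size(1)
        mask = torch.zeros(Ni + Nt, Ni + Nt, dtype=torch.bool, device=x.device)
        mask[Ni:, Ni:] = torch.triu(torch.ones(Nt, Nt, dtype=torch.bool, device=x.device), 1)
        mask[:Ni, Ni:] = True   # image tokens never look at text
        h = self.decoder(x, mask=mask)
        logits = self.lm_head(h[:, Ni:])
        # Language-modeling loss: position t predicts caption token t+1.
        return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                               captions[:, 1:].reshape(-1))

# Smoke test with random data.
model = ToyCaptioner()
loss = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 16)))
```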
Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI generates text from visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and to leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing language Transformers are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits of even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pre-training tasks based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.
Automated visual understanding of our diverse and open world demands computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, trained on diverse, large-scale datasets and adaptable to a wide range of downstream tasks, are critical for solving real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and text to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, which expands the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted to various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer, and zero-shot transfer for novel images and objects. All of these properties are critical for a vision foundation model serving general-purpose vision tasks. Florence achieves new state-of-the-art results on 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with a top-1 accuracy of 83.74 and a top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
In this paper, we propose a single unified transformer (UFO) that is capable of processing either unimodal inputs (e.g., an image or language) or multimodal inputs (e.g., the concatenation of an image and a question) for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the network architecture, we use a single transformer network and enforce multi-task learning during VL pre-training, which includes an image-text contrastive loss, an image-text matching loss, and masked language modeling losses based on bidirectional and seq2seq attention masks. The same transformer network is used as the image encoder, the text encoder, or the fusion network for the different pre-training tasks. Empirically, we observe less conflict among the different tasks and achieve new state of the art on visual question answering, COCO image captioning (cross-entropy optimization), and nocaps (in SPICE). On other downstream tasks, e.g., image-text retrieval, we also achieve competitive performance.
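One of the pre-training objectives named above, the image-text contrastive loss, can be sketched as a standard symmetric InfoNCE over pooled image and text embeddings. The function below is a generic illustration under that assumption, not UFO's exact implementation; the pooling strategy and temperature handling may differ.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs on the diagonal are positives.

    img_emb, txt_emb: (B, D) pooled embeddings from the shared transformer
    (generic sketch; exact details in UFO may differ).
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

# Example: a batch of 8 pooled image/text embeddings of dimension 256.
loss = image_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```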
This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR [21], and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. Code, models and pre-extracted features are released at https://github.com/pzzhang/VinVL.
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pretraining. However, these datasets are often collected with overrestrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pretraining data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [70] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
In this paper, we propose UNICORN, a vision-language (VL) model that unifies text generation and bounding-box prediction in a single architecture. Specifically, we quantize each box into four discrete box tokens and serialize them as a sequence that can be integrated with text tokens. We formulate all VL problems as a generation task, where the target sequence consists of the integrated text and box tokens. We then train a transformer encoder-decoder to predict the targets in an auto-regressive manner. With such a unified framework and input-output format, UNICORN achieves performance comparable to task-specific state of the art on 7 VL benchmarks, covering visual grounding, grounded captioning, visual question answering, and image captioning. When trained with multi-task fine-tuning, UNICORN can approach different VL tasks with a single set of parameters, thus crossing downstream task boundaries. We show that having a single model not only saves parameters but can also further boost model performance on certain tasks. Finally, UNICORN shows the capability to generalize to new tasks such as ImageNet object localization.
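The box-as-tokens idea can be illustrated in a few lines of code: each coordinate is normalized by the image size, quantized into one of N bins, and mapped to a dedicated token id appended after the text vocabulary. The bin count and token-id layout below are assumptions for illustration, not the paper's exact configuration.

```python
def box_to_tokens(box, img_w, img_h, vocab_size=30522, num_bins=1000):
    """Quantize a box (x1, y1, x2, y2) in pixels into four discrete box-token ids.

    Box tokens occupy ids [vocab_size, vocab_size + num_bins); the bin count
    and id range are illustrative assumptions.
    """
    x1, y1, x2, y2 = box
    coords = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)   # normalize to [0, 1]
    bins = [min(int(c * num_bins), num_bins - 1) for c in coords]
    return [vocab_size + b for b in bins]

def tokens_to_box(tokens, img_w, img_h, vocab_size=30522, num_bins=1000):
    """Inverse mapping used when decoding a generated sequence back into a box."""
    x1, y1, x2, y2 = [(t - vocab_size) / num_bins for t in tokens]
    return (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)

# A grounded-captioning target could then be a single sequence: the text ids for
# "a dog" followed by box_to_tokens((48, 32, 215, 180), 640, 480).
```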
Despite the remarkable progress of image captioning, existing captioners typically lack the controllable capability to generate desired image captions, e.g., describing the image in a rough or detailed manner, in a factual or emotional view, etc. In this paper, we show that a unified model is qualified to perform well in diverse domains and freely switch among multiple styles. Such a controllable capability is achieved by embedding the prompt learning into the image captioning framework. To be specific, we design a set of prompts to fine-tune the pre-trained image captioner. These prompts allow the model to absorb stylized data from different domains for joint training, without performance degradation in each domain. Furthermore, we optimize the prompts with learnable vectors in the continuous word embedding space, avoiding the heuristic prompt engineering and meanwhile exhibiting superior performance. In the inference stage, our model is able to generate desired stylized captions by choosing the corresponding prompts. Extensive experiments verify the controllable capability of the proposed method. Notably, we achieve outstanding performance on two diverse image captioning benchmarks including COCO Karpathy split and TextCaps using a unified model.
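The continuous-prompt mechanism described above (learnable vectors in the word-embedding space, selected per style or domain) can be sketched as follows. The class name, style set, and prompt length are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class StylePrompts(nn.Module):
    """Learnable continuous prompts, one bank per caption style (illustrative sketch)."""

    def __init__(self, styles=("factual", "detailed", "romantic"), prompt_len=8, d_model=512):
        super().__init__()
        # One set of prompt_len learnable vectors per style, living in the same
        # space as the captioner's word embeddings.
        self.prompts = nn.ParameterDict({
            s: nn.Parameter(torch.randn(prompt_len, d_model) * 0.02) for s in styles
        })

    def forward(self, token_embeds, style):
        # token_embeds: (B, T, D) word embeddings of the caption prefix.
        B = token_embeds.size(0)
        prompt = self.prompts[style].unsqueeze(0).expand(B, -1, -1)  # (B, P, D)
        return torch.cat([prompt, token_embeds], dim=1)              # prompts are prepended

# Choosing style="romantic" at inference switches the generated caption style;
# during training the prompt vectors are optimized jointly with (or instead of)
# the captioner's own parameters.
prompts = StylePrompts()
out = prompts(torch.randn(2, 10, 512), style="factual")  # (2, 18, 512)
```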
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use selfattention to learn image-text semantic alignments in a brute force manner, in this paper, we propose a new learning method Oscar, which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected, and are often mentioned in the paired text. We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-arts on six well-established vision-language understanding and generation tasks.
In recent years, we have witnessed significant performance gains on the image captioning task based on vision-language pre-training (VLP). Scale is believed to be an important factor in this advance. However, most existing work only focuses on pre-training transformers of moderate size (e.g., 12 or 24 layers) on roughly 4 million images. In this paper, we present LEMON, a large-scale image captioner, and provide the first empirical study of the scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL model as our reference model, which consists of an image feature extractor and a transformer model, and scale the transformer both up and down, with model sizes ranging from 13 to 675 million parameters. On the data side, we experiment with up to 200 million image-text pairs automatically collected from the web based on the alt attribute of images (dubbed ALT200M). Extensive analysis helps characterize the performance trend as the model size and pre-training data size increase. We also compare different training recipes, especially for training on large-scale noisy data. As a result, LEMON achieves new state of the art on several major image captioning benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also show that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.
This paper presents OmniVL, a new foundation model designed to support both image-language and video-language tasks with one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and can therefore perform joint image-language and video-language pre-training. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language to help video-language). To this end, we propose decoupled joint pre-training of image-language and video-language to effectively decompose vision-language modeling into spatial and temporal dimensions, obtaining performance gains on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pre-training data are exploited as much as possible. Without extra task-specific adaptors, OmniVL can simultaneously support vision-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multimodal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results at similar model size and data scale.
The Visual Question Answering (VQA) task uses visual images and language analysis to answer textual questions about an image. It has been a popular research topic with an increasing number of real-world applications over the last decade. This paper describes our recent research on AliceMind-MMU (Alibaba's encoder-decoders from the Machine Intelligence Lab of DAMO Academy, for MultiMedia Understanding), which obtains similar or even slightly better results than humans on VQA. This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representations; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge-mining framework with specialized expert modules for the complex VQA task. Treating different types of visual questions with the corresponding expertise plays an important role in boosting the performance of our VQA architecture to human level. Extensive experiments and analysis are conducted to demonstrate the effectiveness of the new research work.
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be finetuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
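The key mechanism in the VLP abstract above, switching between bidirectional and seq2seq masked prediction purely through the self-attention mask of the shared transformer, can be sketched as below. The segment layout (visual tokens followed by text tokens) and the True-means-blocked boolean convention are assumptions for illustration that follow PyTorch's mask semantics.

```python
import torch

def build_attention_mask(num_visual, num_text, mode):
    """Boolean self-attention mask over [visual tokens | text tokens].

    True marks positions that may NOT be attended to (PyTorch convention).
    mode="bidirectional": every token sees every token (BERT-style masked prediction).
    mode="seq2seq": visual tokens see only visual tokens; text tokens see all
    visual tokens plus earlier text tokens (prefix-LM style generation).
    """
    n = num_visual + num_text
    mask = torch.zeros(n, n, dtype=torch.bool)
    if mode == "seq2seq":
        # The visual block must not look at text.
        mask[:num_visual, num_visual:] = True
        # The text block is causal: no attending to future text tokens.
        causal = torch.triu(torch.ones(num_text, num_text, dtype=torch.bool), diagonal=1)
        mask[num_visual:, num_visual:] = causal
    return mask

bi = build_attention_mask(num_visual=4, num_text=3, mode="bidirectional")  # all False
s2s = build_attention_mask(num_visual=4, num_text=3, mode="seq2seq")
# Both masks can be fed to the same shared transformer; only the mask decides
# whether a pre-training step is a bidirectional or a seq2seq prediction task.
```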
Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Though there is a big convergence in terms of architecture, most pre-trained models are still typically developed for specific tasks or modalities. In this work, we propose to use language models as a general-purpose interface to various foundation models. A collection of pre-trained encoders perceive diverse modalities (such as vision and language), and they dock with a language model that plays the role of a universal task layer. We propose a semi-causal language modeling objective to jointly pre-train the interface and the modular encoders. We subsume the advantages and capabilities of both causal and non-causal modeling, thereby combining the best of both worlds. Specifically, the proposed method not only inherits the capabilities of in-context learning and open-ended generation from causal language modeling, but is also conducive to fine-tuning because of the bidirectional encoders. More importantly, our approach seamlessly unlocks combinations of the above capabilities, e.g., enabling in-context learning or instruction following with fine-tuned encoders. Experimental results on various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on fine-tuning, zero-shot generalization, and few-shot learning.
Unified vision-language frameworks have advanced greatly in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence generation. However, existing video-language (VidL) models still require task-specific designs in model architecture and training objectives for each task. In this work, we explore a unified VidL framework, LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, rather than a decoder with far more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval, and video captioning. Extensive analysis further demonstrates the advantages of LAVENDER over existing VidL methods: (i) supporting all downstream tasks with just a single set of parameter values when multi-task fine-tuned; (ii) few-shot generalization on various downstream tasks; and (iii) enabling zero-shot evaluation on video question answering tasks. Code is available at https://github.com/microsoft/lavender.
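As a rough illustration of using one MLM head for every task, the sketch below answers a video question by appending a mask token to the question, running the multimodal encoder, and reading the answer word out of the MLM head's vocabulary distribution; retrieval and captioning can be routed through the same interface. The module names, the BERT-style tokenizer API, and the single-token-answer assumption are hypothetical, not LAVENDER's released code.

```python
import torch

@torch.no_grad()
def answer_video_question(encoder, mlm_head, tokenizer, video_feats, question):
    """Video QA through an MLM interface (hypothetical sketch).

    encoder  : multimodal transformer taking (video_feats, token_ids) -> (1, T, D)
    mlm_head : linear head mapping hidden states to vocabulary logits
    tokenizer: any BERT-style tokenizer exposing mask_token and convert_ids_to_tokens
    The answer is assumed to fit in a single vocabulary token for simplicity.
    """
    text = question + " " + tokenizer.mask_token            # e.g. "what is the man doing? [MASK]"
    enc = tokenizer(text, return_tensors="pt")
    hidden = encoder(video_feats, enc["input_ids"])          # fused video-text representation
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    logits = mlm_head(hidden[0, mask_pos])                   # vocabulary distribution at the mask
    return tokenizer.convert_ids_to_tokens(int(logits.argmax()))
```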
Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like objectives (masked language modeling and image-text matching) during pre-training. While they perform well on many understanding downstream tasks, e.g., visual question answering, image-text retrieval, and visual entailment, they do not possess the ability to generate. To tackle this problem, we propose Unified multimodal pre-training for both Vision-Language understanding and generation (UniVL). The proposed UniVL is capable of handling both understanding tasks and generation tasks. We augment the existing pre-training paradigm, which uses only random masks, with causal masks, i.e., triangular masks that mask out future tokens, so that the pre-trained models can have autoregressive generation ability by design. We formulate several previous understanding tasks as text generation tasks and propose to use a prompt-based method for fine-tuning on different downstream tasks. Our experiments show that there is a trade-off between understanding tasks and generation tasks when using the same model, and that a feasible way to improve both is to use more data. Our UniVL framework attains performance comparable to recent vision-language pre-training methods on both understanding and generation tasks. We also demonstrate that prompt-based fine-tuning is more data-efficient: it outperforms discriminative methods in few-shot scenarios.
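A concrete, purely hypothetical example of recasting understanding tasks as prompted text generation: a VQA or image-text-matching instance becomes a (prompt, target) text pair, so the same autoregressive decoder and prompt-based fine-tuning can be reused across tasks. The templates below are illustrative, not the paper's exact prompts.

```python
def vqa_as_generation(question, answer):
    """Turn a VQA example into a (prompt, target) pair for a text generator."""
    prompt = f"question: {question} answer:"
    return prompt, f" {answer}"

def itm_as_generation(caption, is_match):
    """Turn image-text matching into generation of the word 'yes' or 'no'."""
    prompt = f'does the image match the caption "{caption}"?'
    return prompt, " yes" if is_match else " no"

# During fine-tuning the model maximizes the likelihood of the target given the
# image and the prompt; at inference the generated word is mapped back to a label.
print(vqa_as_generation("what color is the bus", "red"))
```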
Vision-Language Transformers can be learned without human labels (e.g. class labels, bounding boxes, etc). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Vision-Language from Captions (VLC) built on top of Masked Auto-Encoders that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer which is pretrained with supervised object classification, and our model, VLC, we find that our approach 1. outperforms ViLT on standard benchmarks, 2. provides more interpretable and intuitive patch visualizations, and 3. is competitive with many larger models that utilize ROIs trained on annotated bounding-boxes.
The ability to read text in images is often lacking in vision-and-language (V&L) models. How can we learn V&L models that exhibit strong scene-text understanding (STU)? In this paper, we propose PreSTU, a simple pre-training recipe designed specifically for scene-text understanding. PreSTU combines a simple OCR-aware pre-training objective with a large-scale image-text dataset with off-the-shelf OCR signals. We empirically demonstrate the superiority of this pre-training objective on TextVQA, TextCaps, ST-VQA, and VizWiz-VQA. We also study which factors affect STU performance, where we highlight the importance of image resolution and dataset scale during pre-training.
Recent years have witnessed the rapid progress of image captioning. However, the demands for large memory storage and heavy computational burden prevent these captioning models from being deployed on mobile devices. The main obstacles lie in the heavyweight visual feature extractors (i.e., object detectors) and complicated cross-modal fusion networks. To this end, we propose LightCap, a lightweight image captioner for resource-limited devices. The core design is built on the recent CLIP model for efficient image captioning. To be specific, on the one hand, we leverage the CLIP model to extract the compact grid features without relying on the time-consuming object detectors. On the other hand, we transfer the image-text retrieval design of CLIP to image captioning scenarios by devising a novel visual concept extractor and a cross-modal modulator. We further optimize the cross-modal fusion model and parallel prediction heads via sequential and ensemble distillations. With the carefully designed architecture, our model merely contains 40M parameters, saving the model size by more than 75% and the FLOPs by more than 98% in comparison with the current state-of-the-art methods. In spite of the low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on COCO Karpathy test split. Testing on the smartphone with only a single CPU, the proposed LightCap exhibits a fast inference speed of 188ms per image, which is ready for practical applications.