Tremendous progress has been made in recent years on developing better image captioning models, yet most of them rely on a separate object detector to extract regional features. Recent vision-language studies are shifting towards detector-free trends by leveraging grid representations for more flexible model training and faster inference speed. However, such development is primarily focused on image understanding tasks, and remains less investigated for the caption generation task. In this paper, we are concerned with a better detector-free image captioning model and propose a pure vision-transformer-based image captioning model, dubbed ViTCap, in which grid representations are used without extracting regional features. For improved performance, we introduce a novel Concept Token Network (CTN) to predict semantic concepts and then incorporate them into end-to-end captioning. In particular, the CTN is built on a vision transformer and is designed to predict concept tokens through a classification task, whose rich semantic information greatly benefits the captioning task. Compared with previous detector-based models, ViTCap drastically simplifies the architecture while achieving competitive performance on various challenging image captioning datasets. In particular, ViTCap reaches 138.1 CIDEr on the COCO-Caption Karpathy split, and 93.8 and 108.6 CIDEr on nocaps and the Google-CC captioning dataset, respectively.
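To make the concept-token idea concrete, here is a minimal PyTorch sketch of a concept head on top of ViT grid features: pooled patch features are classified against a concept vocabulary, and the top-scoring concept embeddings are appended to the visual tokens that a caption decoder would attend to. The layer layout, vocabulary size, and top-k selection are illustrative assumptions, not ViTCap's exact design.

```python
import torch
import torch.nn as nn

class ConceptTokenNetwork(nn.Module):
    """Sketch of a concept-token head on ViT patch features (hypothetical sizes)."""
    def __init__(self, dim=768, concept_vocab=1000, top_k=20):
        super().__init__()
        self.classifier = nn.Linear(dim, concept_vocab)        # multi-label concept classifier
        self.concept_embed = nn.Embedding(concept_vocab, dim)  # embeddings of predicted concepts
        self.top_k = top_k

    def forward(self, patch_feats):                   # (B, N, D) grid features from a ViT
        pooled = patch_feats.mean(dim=1)              # simple mean pooling over patches
        logits = self.classifier(pooled)              # (B, concept_vocab) concept scores
        top_ids = logits.topk(self.top_k, dim=-1).indices
        concept_tokens = self.concept_embed(top_ids)  # (B, top_k, D) semantic tokens
        # a caption decoder would attend to both the grid features and the concept tokens
        memory = torch.cat([patch_feats, concept_tokens], dim=1)
        return logits, memory

ctn = ConceptTokenNetwork()
logits, memory = ctn(torch.randn(2, 196, 768))        # e.g. 14x14 ViT patches
print(logits.shape, memory.shape)                     # (2, 1000) and (2, 216, 768)
```

In such a setup the concept classifier would be trained with a multi-label loss against semantic tags, while the concatenated memory feeds the captioning head end-to-end.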
Recent years have witnessed the rapid progress of image captioning. However, the demands for large memory storage and heavy computational burden prevent these captioning models from being deployed on mobile devices. The main obstacles lie in the heavyweight visual feature extractors (i.e., object detectors) and complicated cross-modal fusion networks. To this end, we propose LightCap, a lightweight image captioner for resource-limited devices. The core design is built on the recent CLIP model for efficient image captioning. To be specific, on the one hand, we leverage the CLIP model to extract the compact grid features without relying on the time-consuming object detectors. On the other hand, we transfer the image-text retrieval design of CLIP to image captioning scenarios by devising a novel visual concept extractor and a cross-modal modulator. We further optimize the cross-modal fusion model and parallel prediction heads via sequential and ensemble distillations. With the carefully designed architecture, our model merely contains 40M parameters, saving the model size by more than 75% and the FLOPs by more than 98% in comparison with the current state-of-the-art methods. In spite of the low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on COCO Karpathy test split. Testing on the smartphone with only a single CPU, the proposed LightCap exhibits a fast inference speed of 188ms per image, which is ready for practical applications.
This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR [21], and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. Code, models and pre-extracted features are released at https://github.com/pzzhang/VinVL.
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use self-attention to learn image-text semantic alignments in a brute force manner, in this paper, we propose a new learning method Oscar, which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected, and are often mentioned in the paired text. We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-arts on six well-established vision-language understanding and generation tasks.
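The anchor-point idea boils down to feeding three aligned segments (caption words, detected object tags, region features) into one transformer. A hedged sketch of that input construction is shown below; the embedding sizes are typical BERT/Faster R-CNN dimensions and the random tensors stand in for a real tokenizer and detector.

```python
import torch
import torch.nn as nn

def build_oscar_style_input(caption_ids, tag_ids, region_feats, txt_embed, vis_proj):
    """caption_ids: (Lc,) word ids; tag_ids: (Lt,) ids of detected object tags;
    region_feats: (R, Dv) detector features. Returns a (Lc+Lt+R, D) input sequence."""
    word_emb = txt_embed(caption_ids)     # caption words
    tag_emb = txt_embed(tag_ids)          # object tags share the text embedding space (anchor points)
    region_emb = vis_proj(region_feats)   # project region features to the transformer width
    return torch.cat([word_emb, tag_emb, region_emb], dim=0)

# toy usage with stand-in embeddings and features
txt_embed = nn.Embedding(30522, 768)      # BERT-sized vocabulary (assumption)
vis_proj = nn.Linear(2048, 768)           # Faster R-CNN feature width (assumption)
seq = build_oscar_style_input(torch.randint(0, 30522, (12,)),
                              torch.randint(0, 30522, (5,)),
                              torch.randn(36, 2048), txt_embed, vis_proj)
print(seq.shape)                          # torch.Size([53, 768])
```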
Vision-Language Transformers can be learned without human labels (e.g. class labels, bounding boxes, etc). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Vision-Language from Captions (VLC) built on top of Masked Auto-Encoders that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer which is pretrained with supervised object classification, and our model, VLC, we find that our approach 1. outperforms ViLT on standard benchmarks, 2. provides more interpretable and intuitive patch visualizations, and 3. is competitive with many larger models that utilize ROIs trained on annotated bounding-boxes.
In recent years, we have witnessed significant performance boosts on the image captioning task based on vision-language pre-training (VLP). Scale is believed to be an important factor for this advance. However, most existing work only focuses on pre-training transformers of moderate size (e.g., 12 or 24 layers) on roughly 4 million images. In this paper, we present LEMON, a large-scale image captioner, and provide the first empirical study on the scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL model as our reference model, which consists of an image feature extractor and a transformer model, and scale the transformer both up and down, with model sizes ranging from 13 to 675 million parameters. On the data side, we experiment with up to 200 million image-text pairs automatically collected from the web based on the alt attribute of images (dubbed ALT200M). Extensive analysis helps to characterize the performance trend as the model size and the pre-training data size increase. We also compare different training recipes, especially for training on large-scale noisy data. As a result, LEMON achieves new state of the art on several major image captioning benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also show that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.
Connecting vision and language plays an essential role in generative intelligence. For this reason, large research efforts have been devoted to image captioning, i.e., describing images with syntactically and semantically meaningful sentences. Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. Over these years, both components have evolved considerably through the exploitation of object regions and attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results, research in image captioning has not yet reached a conclusive answer. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting future directions in a research area where computer vision and natural language processing can find an optimal synergy.
While captioning models have obtained compelling results in describing natural images, they still do not cover the entire long-tail distribution of real-world concepts. In this paper, we address the task of generating human-like descriptions with in-the-wild concepts by training on web-scale, automatically collected datasets. To this end, we propose a model that can exploit noisy image-caption pairs while maintaining the descriptive style of traditional human-annotated datasets such as COCO. Our model separates content from style through the use of keywords and stylistic tokens, employs a single objective of prompted language modeling, and is simpler than other recent proposals. Experimentally, our model consistently outperforms existing methods in terms of caption quality and capability of describing long-tail concepts, also in a zero-shot setting. According to the CIDEr metric, we obtain a new state of the art on both COCO and nocaps when using external data.
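A schematic of the content/style separation described above: content keywords and a style token prompt the language model before the target caption. The token format and helper name are assumptions for illustration only.

```python
def build_prompted_caption(keywords, style_token, caption):
    """Prepend content keywords and a style token to the target caption
    for prompted language-model training (illustrative format)."""
    return " ".join(keywords) + f" {style_token} " + caption

print(build_prompted_caption(["dog", "frisbee", "grass"], "<coco>",
                             "a dog catches a frisbee on the grass"))
# dog frisbee grass <coco> a dog catches a frisbee on the grass
```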
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER (Multimodal End-to-end TransformER), through which we systematically investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model design along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin Transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments on a wide range of VL tasks and provide insights on how to train a performant VL transformer while maintaining fast inference speed. Notably, METER achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based VinVL model by +1.04%, and outperforming the previous best fully transformer-based ALBEF model by +1.6%.
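One of the fusion choices being compared, co-attention (as opposed to merged attention over the concatenated sequence), can be sketched as follows; the dimensions, residual layout, and missing layer norms are simplifications rather than METER's exact module.

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Each modality self-attends, then cross-attends to the other (schematic)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.txt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):                      # (B, Lt, D), (B, Lv, D)
        txt = txt + self.txt_self(txt, txt, txt)[0]
        img = img + self.img_self(img, img, img)[0]
        txt = txt + self.txt_cross(txt, img, img)[0]  # text queries attend to image keys/values
        img = img + self.img_cross(img, txt, txt)[0]  # image queries attend to text keys/values
        return txt, img
```

Merged attention would instead concatenate the two token sequences and run a single self-attention over them.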
With the burgeoning amount of image-text pair data and the diversity of vision-and-language (V&L) tasks, scholars have introduced an abundance of deep learning models in this research domain. Furthermore, in recent years transfer learning has also shown tremendous success in computer vision for tasks such as image classification and object detection, as well as in natural language processing for question answering, machine translation, and so on. Inheriting the spirit of transfer learning, research in V&L has devised multiple pre-training techniques on large-scale datasets in order to enhance the performance of downstream tasks. The aim of this article is to provide a comprehensive revision of contemporary V&L pre-trained models. In particular, we categorize and delineate pre-training approaches, along with a summary of state-of-the-art vision-and-language pre-trained models. Moreover, a list of training datasets and downstream tasks is provided to further polish the perspective on V&L pre-training. Lastly, we take a further step to discuss numerous directions for future research.
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pretraining. However, these datasets are often collected with over-restrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pretraining data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [70] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
The transformer architecture has brought a fundamental change to the field of computational linguistics, which had been dominated by recurrent neural networks for many years. Its success has also implied drastic changes in cross-modal tasks involving language and vision, and many researchers have already tackled the issue. In this paper, we review some of the most critical milestones in the field, as well as the overall trend of how the transformer architecture has been incorporated into vision-language cross-modal tasks. Furthermore, we discuss the current limitations and speculate on some prospects that we find imminent.
Existing vision-language pre-training (VLP) methods primarily rely on paired image-text datasets, which are either annotated with enormous human labor or crawled from the internet and then cleaned with elaborate data-cleaning techniques. To reduce the dependency on well-aligned image-text pairs, it is promising to directly leverage large-scale text-only and image-only corpora. This paper proposes a data augmentation method, namely cross-modal CutMix (CMC), for implicit cross-modal alignment learning in unpaired VLP. Specifically, CMC transforms natural sentences from the textual view into a multi-modal view, in which visually-grounded words in a sentence are randomly replaced by diverse image patches with similar semantics. The proposed CMC has several appealing properties. First, it enhances data diversity while keeping the semantic meaning intact, tackling the problem of aligned-data scarcity; second, by attaching cross-modal noise to uni-modal data, it guides the model to learn token-level interactions across modalities for better denoising. Furthermore, we present a new unpaired VLP method, dubbed VLMixer, that integrates CMC with contrastive learning to pull the uni-modal and multi-modal views together for better instance-level alignment across modalities. Extensive experiments on five downstream tasks show that VLMixer can surpass previous state-of-the-art unpaired VLP methods.
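A minimal sketch of the cross-modal CutMix augmentation: visually-grounded words are randomly swapped for semantically matching image-patch embeddings, yielding a multi-modal view of the sentence. The function name, patch bank, and replacement probability are illustrative assumptions.

```python
import random
import torch

def cross_modal_cutmix(token_ids, token_embed, patch_bank, grounded, p=0.3):
    """token_ids: list[int]; patch_bank: dict mapping a token id to a (K, D) tensor of
    image-patch embeddings with similar semantics; grounded: set of visually-grounded ids."""
    embs = []
    for tid in token_ids:
        if tid in grounded and tid in patch_bank and random.random() < p:
            patches = patch_bank[tid]
            embs.append(patches[random.randrange(len(patches))])   # word -> image patch
        else:
            embs.append(token_embed(torch.tensor(tid)))            # keep the word embedding
    return torch.stack(embs)                                       # (L, D) multi-modal view

embed = torch.nn.Embedding(1000, 16)
bank = {7: torch.randn(4, 16)}                  # patches matching token id 7 (toy data)
view = cross_modal_cutmix([3, 7, 9], embed, bank, grounded={7}, p=1.0)
print(view.shape)                               # torch.Size([3, 16])
```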
Previous vision-language pre-training models mainly construct multi-modal inputs with tokens and objects (pixels), and then perform cross-modality interaction between them. We argue that inputs of only tokens and objects limit high-level semantic alignment such as phrase-to-region grounding. Meanwhile, multi-level alignments are inherently consistent and able to facilitate representation learning synergistically. Therefore, in this paper, we propose to learn multi-level semantic alignment for vision-language pre-training (MVPTR). In MVPTR, we follow the nested structure of both modalities to introduce concepts as high-level semantics. To ease the learning from multi-modal, multi-level inputs, our framework is split into two stages: the first stage focuses on intra-modality multi-level representation learning, and the second stage enforces cross-modal interactions via coarse-grained and fine-grained semantic alignment tasks. In addition to the commonly used image-text matching and masked language modeling tasks, we introduce a masked concept recovering task in the first stage to enhance concept representation learning, and two further tasks in the second stage to explicitly encourage multi-level alignments across modalities. Our code is available at https://github.com/junction4nako/mvp_pytorch.
Current state-of-the-art methods for image captioning employ region-based features, as they provide object-level information that is essential for describing the content of images; they are usually extracted by an object detector such as Faster R-CNN. However, they have several issues, such as the lack of contextual information, the risk of inaccurate detection, and high computational cost. The first two could be resolved by additionally using grid-based features. However, how to extract and fuse these two types of features remains uncharted. This paper proposes a Transformer-only neural architecture, dubbed GRIT (Grid- and Region-based Image captioning Transformer), which effectively utilizes the two visual features to generate better captions. GRIT replaces the CNN-based detector used in previous methods with a DETR-based one, making it computationally faster. Moreover, its overall design consists solely of Transformers, allowing the model to be trained end-to-end. This innovative design and the integration of the dual visual features bring about significant performance improvements. Experimental results on several image captioning benchmarks show that GRIT outperforms previous methods in both inference accuracy and speed.
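The dual-feature idea can be pictured as a caption-decoder layer that cross-attends to the grid features and the region features in turn. The sketch below is an assumed layout for illustration and omits normalization and other details of GRIT.

```python
import torch.nn as nn

class DualCrossAttentionLayer(nn.Module):
    """Caption words attend to grid features (context) and region features (objects)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.grid_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.region_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, words, grid_feats, region_feats, causal_mask=None):
        words = words + self.self_attn(words, words, words, attn_mask=causal_mask)[0]
        words = words + self.grid_attn(words, grid_feats, grid_feats)[0]        # contextual info
        words = words + self.region_attn(words, region_feats, region_feats)[0]  # object-level info
        return words + self.ffn(words)
```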
Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into $N\times N$ blocks, and identifies the objects in each block through the widely used object detector in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g. filling "P" or "O" in a PTP "The block P has a O". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for ViLT \cite{vilt} baseline, and COCO Captioning (+5.3 in CIDEr) for SOTA BLIP \cite{blip} baseline. Moreover, PTP achieves comparable results with object-detector based methods, and much faster inference speed since PTP discards its object detector for inference while the latter cannot. Our code and pre-trained weight will be released at \url{https://github.com/sail-sg/ptp}.
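A small sketch of how such position-guided prompts could be generated from detector outputs, assuming boxes in pixel coordinates and a 3x3 grid; the prompt template follows the example quoted above, while the function name and input format are assumptions.

```python
def ptp_prompts(detections, img_w, img_h, n=3):
    """detections: [(label, (x1, y1, x2, y2)), ...]; assign each object to one of the
    n x n blocks by its box center and emit a fill-in-the-blank style sentence."""
    prompts = []
    for label, (x1, y1, x2, y2) in detections:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        col = min(int(cx / img_w * n), n - 1)
        row = min(int(cy / img_h * n), n - 1)
        prompts.append(f"The block {row * n + col} has a {label}.")
    return prompts

print(ptp_prompts([("dog", (10, 20, 120, 200)), ("ball", (300, 310, 360, 380))], 400, 400))
# ['The block 0 has a dog.', 'The block 8 has a ball.']
```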
Vision-and-language pre-training has become the prevalent approach for tackling multimodal downstream tasks. The current trend is to move towards ever larger models and pre-training datasets. In the long run this computational headlong rush does not seem reasonable, moving away from sustainable solutions and de facto excluding academic laboratories with limited resources. In this work, we propose a new framework, dubbed ViCHA, that efficiently exploits the input data to boost learning by: (a) …, (b) …, and (c) leveraging image-level annotations, called Visual Concepts, obtained with existing foundation models such as CLIP, to improve the performance of the image encoder. Although pre-trained on four times less data, our ViCHA strategy outperforms other approaches on several downstream tasks such as image-text retrieval, VQA, visual reasoning, visual entailment, and visual grounding. The code will be made publicly available here: https://github.com/mshukor/vicha
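Mining image-level Visual Concepts with CLIP, as mentioned above, could look like the following sketch. It assumes the open-source OpenAI `clip` package; the image path and the tiny concept vocabulary are placeholders.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

concepts = ["a dog", "a cat", "a bicycle", "a beach", "a person"]    # toy vocabulary
text_tokens = clip.tokenize(concepts).to(device)
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text_tokens)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.T).squeeze(0)        # cosine similarity per concept

top = scores.topk(3).indices.tolist()
print([concepts[i] for i in top])                    # concepts kept as weak image-level labels
```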
In this paper, we propose a single UniFied transfOrmer (UFO) that is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of an image and a question) for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the network architecture, we use a single transformer network and enforce multi-task learning during VL pre-training, which includes an image-text contrastive loss, an image-text matching loss, and masked language modeling losses based on the bidirectional and the seq2seq attention masks. The same transformer network is used as the image encoder, the text encoder, or the fusion network in different pre-training tasks. Empirically, we study the conflicts among different tasks and achieve new state-of-the-art results on visual question answering, COCO image captioning (cross-entropy optimization), and nocaps (in SPICE). On other downstream tasks, e.g., image-text retrieval, we also achieve competitive performance.
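The bidirectional-plus-seq2seq masking mentioned above can be illustrated with a simple mask builder: image tokens attend among themselves, while caption tokens attend to all image tokens and causally to earlier text. This layout is a common convention and an assumption here, not necessarily UFO's exact mask.

```python
import torch

def seq2seq_attention_mask(num_img, num_txt):
    """Boolean (L, L) matrix, True = attention allowed."""
    L = num_img + num_txt
    mask = torch.zeros(L, L, dtype=torch.bool)
    mask[:num_img, :num_img] = True                    # image <-> image (bidirectional)
    mask[num_img:, :num_img] = True                    # text -> image
    mask[num_img:, num_img:] = torch.tril(torch.ones(num_txt, num_txt)).bool()  # causal text
    return mask

print(seq2seq_attention_mask(2, 3).int())
```

For the bidirectional (understanding) objectives the text block would simply be fully visible instead of lower-triangular.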
We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT) which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages. In this paper, we propose SOHO to "See Out of tHe bOx" that takes a whole image as input, and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks by following standard VLPT settings. In particular, SOHO achieves absolute gains of 2.0% R@1 score on MSCOCO text retrieval 5k test split, 1.5% accuracy on NLVR2 test-P split, 6.7% accuracy on SNLI-VE test split, respectively.
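A schematic of an on-the-fly visual dictionary in the spirit of the VD described above: grid features are quantized to their nearest codebook entry, and the matched entries are updated with a moving average. The codebook size and update rule are assumptions, not the paper's exact recipe.

```python
import torch

class VisualDictionary(torch.nn.Module):
    """Nearest-neighbour quantization of grid features with a momentum codebook update."""
    def __init__(self, num_codes=2048, dim=768, momentum=0.99):
        super().__init__()
        self.register_buffer("codes", torch.randn(num_codes, dim))
        self.momentum = momentum

    @torch.no_grad()
    def forward(self, grid_feats):                         # (B, N, D)
        flat = grid_feats.reshape(-1, grid_feats.size(-1))
        idx = torch.cdist(flat, self.codes).argmin(dim=1)  # index of the nearest visual "word"
        for i, f in zip(idx.tolist(), flat):               # momentum update of matched codes
            self.codes[i].mul_(self.momentum).add_((1 - self.momentum) * f)
        quantized = self.codes[idx].reshape_as(grid_feats)
        return quantized, idx.reshape(grid_feats.shape[:2])
```

The returned indices are what a masked-visual-modeling objective would predict for masked grid positions.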
Self-supervised vision-and-language pre-training (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performance on a broad range of vision-language tasks after fine-tuning. Previous mainstream VLP approaches typically adopt a two-step strategy that relies on external object detectors to encode images in a multi-modal Transformer framework, which suffers from a restrictive object concept space, limited image context, and inefficient computation. In this paper, we propose an object-aware end-to-end VLP framework that directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly. More importantly, we propose to perform object knowledge distillation to facilitate learning cross-modal alignment at different semantic levels. To achieve this, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision: 1) an object-guided masked vision modeling task focuses on enforcing object-aware representation learning in the multi-modal Transformer; 2) a phrase-region alignment task aims to improve cross-modal alignment by utilizing the similarities between noun phrases and object labels in the linguistic space. Extensive experiments on a variety of vision-language tasks demonstrate the efficacy of our proposed framework, and we achieve competitive or superior performance over existing pre-training strategies.
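The object-guided masked vision modeling task can be pictured as choosing which grid patches to mask based on the detected boxes. The sketch below marks every patch overlapped by a box, under an assumed 14x14 grid and pixel-coordinate boxes; it is an illustration, not the paper's sampling scheme.

```python
import torch

def object_guided_patch_mask(boxes, img_w, img_h, grid=14):
    """boxes: [(x1, y1, x2, y2), ...] in pixels; returns a (grid, grid) bool mask
    marking the patches that overlap any detected object."""
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    cw, ch = img_w / grid, img_h / grid
    for x1, y1, x2, y2 in boxes:
        c1, c2 = int(x1 // cw), min(int(x2 // cw), grid - 1)
        r1, r2 = int(y1 // ch), min(int(y2 // ch), grid - 1)
        mask[r1:r2 + 1, c1:c2 + 1] = True
    return mask

m = object_guided_patch_mask([(50, 60, 120, 180)], img_w=224, img_h=224)
print(m.sum().item())   # number of object-covered patches to sample masks from
```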