We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.
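A minimal PyTorch sketch of the core idea of perturbing embeddings rather than pixels/tokens, with a KL term tying clean and adversarial predictions together. The single-step perturbation, `eps`, `adv_lr`, and the `model(txt_emb, img_emb)` interface are illustrative assumptions, not VILLA's released "free" training loop.

```python
import torch
import torch.nn.functional as F

def villa_style_step(model, txt_emb, img_emb, labels, eps=1e-3, adv_lr=1e-2):
    # Clean forward pass on unperturbed embeddings.
    clean_logits = model(txt_emb, img_emb)
    loss = F.cross_entropy(clean_logits, labels)

    # Random initial perturbation on the text embeddings (the image side is analogous).
    delta = torch.zeros_like(txt_emb).uniform_(-eps, eps).requires_grad_(True)
    adv_loss = F.cross_entropy(model(txt_emb + delta, img_emb), labels)

    # One ascent step on the perturbation; multi-step / "free" variants reuse gradients.
    grad, = torch.autograd.grad(adv_loss, delta)
    delta = (delta + adv_lr * grad.sign()).clamp(-eps, eps).detach()

    # Adversarial loss plus a KL term that keeps clean and perturbed predictions close.
    adv_logits = model(txt_emb + delta, img_emb)
    kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                  F.softmax(clean_logits.detach(), dim=-1),
                  reduction="batchmean")
    return loss + F.cross_entropy(adv_logits, labels) + kl
```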
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2.
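A minimal sketch of what conditional masking means in practice on the MLM side: only text tokens are corrupted while the paired image regions stay fully observed (MRM mirrors this with the roles swapped). The mask token id and masking probability are illustrative assumptions.

```python
import torch

def conditional_mlm_mask(text_ids, mask_token_id=103, p=0.15):
    """Corrupt text tokens for MLM; the paired image regions are left fully observed."""
    masked = torch.bernoulli(torch.full(text_ids.shape, p)).bool()
    labels = text_ids.clone()
    labels[~masked] = -100                    # only masked positions contribute to the loss
    corrupted = text_ids.clone()
    corrupted[masked] = mask_token_id
    return corrupted, labels                  # region features are passed through unchanged
```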
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use self-attention to learn image-text semantic alignments in a brute-force manner, in this paper, we propose a new learning method, Oscar, which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected, and are often mentioned in the paired text. We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state of the art on six well-established vision-language understanding and generation tasks.
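A minimal sketch of the word-tag-region input the abstract describes: detected object tags are serialized as extra text so they can act as anchor points between the caption and the region features. The HuggingFace-style tokenizer call and the downstream projection step are illustrative assumptions.

```python
import torch

def build_oscar_input(tokenizer, caption, object_tags, region_feats):
    # Text segment: the caption.
    text = tokenizer(caption, return_tensors="pt", add_special_tokens=True)
    # Tag segment: detected object names serialized as text ("dog person frisbee ...").
    tags = tokenizer(" ".join(object_tags), return_tensors="pt", add_special_tokens=False)
    token_ids = torch.cat([text["input_ids"], tags["input_ids"]], dim=1)
    # region_feats: one detector feature vector per region. A VL transformer would embed
    # token_ids, project region_feats to the same hidden size, and concatenate along the
    # sequence dimension so self-attention sees the (words, tags, regions) triple.
    return token_ids, region_feats
```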
We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages. In this paper, we propose SOHO to "See Out of tHe bOx", which takes a whole image as input and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks by following standard VLPT settings. In particular, SOHO achieves absolute gains of 2.0% R@1 score on the MSCOCO text retrieval 5k test split, 1.5% accuracy on the NLVR2 test-P split, and 6.7% accuracy on the SNLI-VE test split.
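A minimal sketch of a visual dictionary of the kind described above: CNN grid features are snapped to their nearest dictionary entry, entries are refreshed on-the-fly with a moving average, and a straight-through estimator keeps the pipeline end-to-end trainable. Dictionary size, momentum, and the exact update rule are illustrative assumptions rather than SOHO's implementation.

```python
import torch

class VisualDictionary(torch.nn.Module):
    def __init__(self, num_entries=2048, dim=768, momentum=0.99):
        super().__init__()
        self.register_buffer("embed", torch.randn(num_entries, dim))
        self.momentum = momentum

    def forward(self, grid_feats):                       # grid_feats: (B, N, dim) CNN grid features
        flat = grid_feats.reshape(-1, grid_feats.size(-1))
        idx = torch.cdist(flat, self.embed).argmin(dim=-1)   # nearest dictionary entry per feature
        quantized = self.embed[idx].view_as(grid_feats)
        if self.training:
            with torch.no_grad():                        # on-the-fly moving-average update
                for i in idx.unique():
                    mean_feat = flat[idx == i].mean(dim=0)
                    self.embed[i] = self.momentum * self.embed[i] + (1 - self.momentum) * mean_feat
        # Straight-through estimator: forward uses the quantized value, gradients flow to the CNN.
        return grid_feats + (quantized - grid_feats).detach(), idx.view(grid_feats.shape[:-1])
```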
Self-supervised vision-and-language pre-training (VLP) aims to learn transferable multimodal representations from large-scale image-text data and to achieve strong performance on a broad range of vision-language tasks after fine-tuning. Previous mainstream VLP approaches typically adopt a two-step strategy that relies on external object detectors to encode images in a multimodal Transformer framework, which suffers from a restrictive object concept space, limited image context, and inefficient computation. In this paper, we propose an object-aware end-to-end VLP framework that directly feeds image grid features from CNNs into the Transformer and learns the multimodal representations jointly. More importantly, we propose to perform object knowledge distillation to facilitate learning cross-modal alignment at different semantic levels. To achieve this, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision: 1) an object-guided masked vision modeling task that focuses on enforcing object-aware representation learning in the multimodal Transformer; 2) a phrase-region alignment task that aims to improve cross-modal alignment by utilizing the similarities between noun phrases and object labels in the linguistic space. Extensive experiments on a variety of vision-language tasks demonstrate the efficacy of our proposed framework, and we achieve competitive or superior performance over existing pre-training strategies.
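A minimal sketch of how soft phrase-to-region alignment targets could be derived from label similarity in the language space, in the spirit of the second pretext task. The shared text embedder and the temperature are assumptions; the paper's exact formulation may differ.

```python
import torch.nn.functional as F

def phrase_region_targets(phrase_emb, region_label_emb, temp=0.1):
    # phrase_emb: (P, D) embeddings of noun phrases from the caption; region_label_emb: (R, D)
    # embeddings of each region's detected class name, produced by the same text embedder.
    sim = F.normalize(phrase_emb, dim=-1) @ F.normalize(region_label_emb, dim=-1).t()
    return F.softmax(sim / temp, dim=-1)      # (P, R) soft phrase-to-region alignment targets
```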
With the abundance of image-text pair data and the diversity of vision-and-language (V&L) tasks, scholars have introduced a large number of deep learning models in this research area. Moreover, in recent years transfer learning has shown tremendous success in computer vision (e.g., image classification, object detection) as well as in natural language processing (e.g., question answering, machine translation). Inheriting the spirit of transfer learning, research in V&L has devised multiple pre-training techniques on large-scale datasets to enhance the performance of downstream tasks. The aim of this paper is to provide a comprehensive review of contemporary V&L pre-trained models. In particular, we categorize and describe the pre-training approaches, along with a summary of state-of-the-art vision-and-language pre-trained models. In addition, a list of training datasets and downstream tasks is supplied to further broaden the perspective on V&L pre-training. Finally, we take a further step and discuss numerous directions for future research.
Large-scale pre-training has recently revolutionized vision-and-language (VL) research. Models such as LXMERT and UNITER have significantly lifted the state of the art across a wide range of VL tasks. However, the large number of parameters in such models hinders their application in practice. In parallel, work on the lottery ticket hypothesis (LTH) has shown that deep neural networks contain small matching subnetworks that can, when trained in isolation, achieve performance comparable to or better than the dense network. In this work, we perform the first empirical study to assess whether such trainable subnetworks also exist in pre-trained VL models. We use UNITER as the main testbed (also testing LXMERT and ViLT) and consolidate 7 representative VL tasks for experiments, including visual question answering, visual commonsense reasoning, visual entailment, referring expression comprehension, image-text retrieval, GQA, and NLVR2. Through comprehensive analysis, we summarize our main findings as follows. (i) It is difficult to find subnetworks that strictly match the performance of the full model; however, we can find "relaxed" winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy. (ii) Subnetworks found by task-specific pruning transfer reasonably well to other tasks, while those found on the pre-training task transfer universally, matching 98%/96% of the full accuracy on average across all tasks. (iii) Besides UNITER, other models such as LXMERT and ViLT can also play lottery tickets; however, the highest sparsity we can achieve for ViLT is far lower than for LXMERT and UNITER (30% vs. 70%). (iv) LTH also remains relevant when other training methods (e.g., adversarial training) are used.
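A minimal sketch of the basic operation behind these experiments: global magnitude pruning to extract a sparse "ticket" that is then rewound and retrained. The sparsity level and the restriction to matrix-shaped parameters are illustrative assumptions.

```python
import torch

def magnitude_prune_masks(model, sparsity=0.6):
    """Return {name: 0/1 mask} keeping the largest-magnitude (1 - sparsity) fraction of weights."""
    weights = {n: p.detach().abs() for n, p in model.named_parameters() if p.dim() > 1}
    scores = torch.cat([w.flatten() for w in weights.values()])
    k = max(int(scores.numel() * sparsity), 1)
    threshold = scores.kthvalue(k).values              # k-th smallest magnitude = global cutoff
    return {n: (w > threshold).float() for n, w in weights.items()}

def apply_masks(model, masks):
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in masks:
                p.mul_(masks[n])                       # zero out pruned weights (rewind, then retrain)
```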
In this paper, we propose a single UniFied transfOrmer (UFO) that is capable of processing either unimodal inputs (e.g., an image or language) or multimodal inputs (e.g., the concatenation of an image and a question) for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the network architecture, we use a single transformer network and enforce multi-task learning during VL pre-training, which includes an image-text contrastive loss, an image-text matching loss, and a masked language modeling loss based on bidirectional and seq2seq attention masks. The same transformer network is used as the image encoder, the text encoder, or the fusion network in different pre-training tasks. Empirically, we observe less conflict among the different tasks and achieve new state of the art on visual question answering, COCO image captioning (cross-entropy optimization), and nocaps (in SPICE). On other downstream tasks, e.g., image-text retrieval, we also achieve competitive performance.
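A minimal sketch of the two attention masks the abstract mentions: a bidirectional mask for understanding-style objectives, and a seq2seq mask in which a fully visible prefix (e.g., the image and prompt) is followed by causally masked text for generation. The sequence layout is an illustrative assumption.

```python
import torch

def bidirectional_mask(seq_len):
    return torch.ones(seq_len, seq_len, dtype=torch.bool)      # True = position may attend

def seq2seq_mask(prefix_len, text_len):
    n = prefix_len + text_len
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :prefix_len] = True                                 # every position sees the full prefix
    mask[prefix_len:, prefix_len:] = torch.tril(
        torch.ones(text_len, text_len, dtype=torch.bool))       # text attends to itself left-to-right
    return mask
```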
Vision-Language Transformers can be learned without human labels (e.g., class labels, bounding boxes, etc.). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model, Vision-Language from Captions (VLC), built on top of Masked Auto-Encoders that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer which is pretrained with supervised object classification, and our model, VLC, we find that our approach 1. outperforms ViLT on standard benchmarks, 2. provides more interpretable and intuitive patch visualizations, and 3. is competitive with many larger models that utilize ROIs trained on annotated bounding boxes.
With the development of Transformers, pre-trained models have advanced at a breathtaking pace in recent years. They dominate the mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to vision-and-language (V-L) learning and improve downstream task performance has become a focus of multimodal learning. In this paper, we review recent progress in vision-language pre-trained models (VL-PTMs). As the core content, we first briefly introduce several ways of encoding raw images and text into single-modal embeddings before pre-training. We then dive into the mainstream architectures of VL-PTMs for modeling the interaction between text and image representations. We further present widely used pre-training tasks, and then introduce some common downstream tasks. We finally conclude the paper and propose some promising research directions. Our survey aims to provide researchers with a synthesis and pointers to related studies.
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER (Multimodal End-to-end TransformER), through which we systematically investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin Transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments on a wide range of VL tasks and provide insights on how to train a performant VL transformer while maintaining fast inference speed. Notably, METER achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based VinVL model by +1.04%, and outperforming the previous best fully transformer-based ALBEF model by +1.6%.
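A minimal sketch contrasting the two fusion designs dissected in METER: merged attention over one concatenated sequence versus co-attention with a separate cross-attention block per modality. The hidden size, head count, and the omission of residual/FFN sublayers are illustrative simplifications.

```python
import torch
import torch.nn as nn

dim, heads = 768, 12
self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
txt_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
img_cross = nn.MultiheadAttention(dim, heads, batch_first=True)

def merged_attention(txt, img):
    fused = torch.cat([txt, img], dim=1)            # one joint sequence, one self-attention
    out, _ = self_attn(fused, fused, fused)
    return out[:, : txt.size(1)], out[:, txt.size(1):]

def co_attention(txt, img):
    txt_out, _ = txt_cross(txt, img, img)           # text queries attend to image keys/values
    img_out, _ = img_cross(img, txt, txt)           # image queries attend to text keys/values
    return txt_out, img_out
```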
Previous vision-language pre-training models mainly construct multimodal inputs with tokens and objects (pixels) and then perform cross-modality interaction between them. We argue that inputs of only tokens and objects limit high-level semantic alignment such as phrase-to-region grounding. Meanwhile, multi-level alignments are inherently consistent and able to facilitate representation learning synergistically. Therefore, in this paper, we propose to learn Multi-level semantic alignment for Vision-language Pre-TRaining (MVPTR). In MVPTR, we follow the nested structure of both modalities to introduce concepts as high-level semantics. To ease the learning from multimodal multi-level inputs, our framework is split into two stages: the first stage focuses on intra-modality multi-level representation learning, and the second stage enforces cross-modality interactions through coarse-grained and fine-grained semantic alignment tasks. In addition to the commonly used image-text matching and masked language modeling tasks, we introduce a masked concept recovering task in the first stage to enhance concept representation learning, and two further tasks in the second stage to explicitly encourage multi-level alignment across modalities. Our code is available at https://github.com/junction4nako/mvp_pytorch.
Vision-Language Pre-training (VLP) aims to learn multimodal representations from image-text pairs and serves downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN and then aligns images and text with a Transformer. Visual relations between visual contents play an important role in image understanding and are fundamental to inter-modal alignment learning. However, CNNs have limitations in visual relation learning due to the weakness of local receptive fields in modeling long-range dependencies. As a result, the two objectives of learning visual relations and inter-modal alignment are encapsulated in the same Transformer network. Such a design might restrict the inter-modal alignment learning in the Transformer by ignoring the specialized characteristics of each objective. To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relations and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between the vision and language modalities (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in the Transformer to further promote inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of Transformers for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including image-text retrieval, visual question answering (VQA), visual entailment, and visual reasoning. Our approach not only outperforms state-of-the-art VLP performance but also shows benefits on the IMF metric.
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks (visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval) by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models, achieving state of the art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA), and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of stacking dedicated fusion transformer layers on top of the unimodal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either trained only on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods that use magnitudes more data. Code is available at https://github.com/microsoft/fiber.
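A minimal sketch of fusion in the backbone: an existing unimodal transformer block is wrapped with a gated cross-attention sublayer so it can optionally attend to the other modality, instead of relying on dedicated fusion layers stacked on top. The zero-initialized gate and the wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BlockWithCrossAttention(nn.Module):
    def __init__(self, block, dim=768, heads=12):
        super().__init__()
        self.block = block                                   # original unimodal transformer block
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))             # starts as identity (no fusion)

    def forward(self, x, other=None):
        x = self.block(x)                                    # unchanged unimodal computation
        if other is not None:                                # fuse only when the other modality is given
            x = x + self.gate * self.cross(self.norm(x), other, other)[0]
        return x
```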
The language modality in vision-language pre-training frameworks is innately discretized, endowing each word in the language vocabulary with a semantic meaning. In contrast, the visual modality is inherently continuous and high-dimensional, which potentially prohibits the alignment as well as the fusion between the vision and language modalities. We therefore propose to "discretize" the visual representation by jointly learning a codebook that endows each visual token with a semantic. We then utilize these discretized visual semantics as self-supervised ground truth to build our masked image modeling objective, a counterpart of masked language modeling that has proven successful for language models. To optimize the codebook, we extend the formulation of VQ-VAE, which gives a theoretical guarantee. Experiments validate the effectiveness of our approach on common vision benchmarks.
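A minimal sketch of using discretized visual semantics as masked-image-modeling targets: patch features are assigned to their nearest codebook entry, and a classifier head predicts the index of each masked patch. The codebook lookup, masking ratio, and head are illustrative assumptions; a full implementation would replace masked patches with a mask embedding before encoding.

```python
import torch
import torch.nn.functional as F

def masked_image_modeling_loss(patch_feats, codebook, logits_head, mask_ratio=0.4):
    # patch_feats: (B, N, D) continuous patch features; codebook: (K, D) learned visual semantics.
    with torch.no_grad():
        flat = patch_feats.reshape(-1, patch_feats.size(-1))
        targets = torch.cdist(flat, codebook).argmin(-1).view(patch_feats.shape[:-1])  # (B, N) ids
    mask = torch.rand(targets.shape, device=targets.device) < mask_ratio               # patches to predict
    logits = logits_head(patch_feats)                                                  # (B, N, K)
    return F.cross_entropy(logits[mask], targets[mask])
```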
Unified vision-language frameworks have advanced greatly in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence generation. However, existing video-language (VidL) models still require task-specific designs in model architecture and training objectives for each task. In this work, we explore a unified VidL framework, LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, instead of a decoder with many more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval, and video captioning. Extensive analyses further demonstrate the advantage of LAVENDER over existing VidL methods in: (i) supporting all downstream tasks with just a single set of parameter values when multi-task finetuned; (ii) few-shot generalization on various downstream tasks; and (iii) enabling zero-shot evaluation on video question answering tasks. Code is available at https://github.com/microsoft/lavender.
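A minimal sketch of MLM as the common interface: a video QA example is cast as predicting a single [MASK] token appended to the question, so the same lightweight MLM head can serve QA alongside the other tasks. The prompt template, the HuggingFace-style tokenizer, and the single-token-answer assumption are illustrative.

```python
def videoqa_as_mlm(tokenizer, question, answer=None):
    text = f"{question} answer: {tokenizer.mask_token}"
    enc = tokenizer(text, return_tensors="pt")
    if answer is None:
        return enc, None                                # inference: read off the masked position
    labels = enc["input_ids"].clone().fill_(-100)       # ignore every position except the mask
    mask_pos = enc["input_ids"] == tokenizer.mask_token_id
    labels[mask_pos] = tokenizer.convert_tokens_to_ids(answer)   # single-token answer assumed
    return enc, labels
```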
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method does not require bounding box annotations nor high-resolution images. To improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR2, ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-of-the-art, while enjoying faster inference speed. Code and models are available at https://github.com/salesforce/ALBEF.
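A minimal sketch of the align-before-fuse contrastive loss with momentum distillation: unimodal embeddings are matched with an InfoNCE-style objective whose targets mix the one-hot in-batch pairing with soft pseudo-targets from a momentum copy of the encoders. The temperature, mixing weight, and in-batch-negatives-only setup are illustrative simplifications.

```python
import torch
import torch.nn.functional as F

def albef_style_itc(img_emb, txt_emb, img_emb_m, txt_emb_m, temp=0.07, alpha=0.4):
    img_emb, txt_emb = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    img_emb_m, txt_emb_m = F.normalize(img_emb_m, dim=-1), F.normalize(txt_emb_m, dim=-1)

    sim_i2t = img_emb @ txt_emb.t() / temp                     # (B, B) image-to-text logits
    sim_t2i = txt_emb @ img_emb.t() / temp

    hard = torch.eye(img_emb.size(0), device=img_emb.device)   # one-hot: diagonal pairs match
    with torch.no_grad():                                      # soft targets from the momentum model
        soft_i2t = F.softmax(img_emb_m @ txt_emb_m.t() / temp, dim=-1)
        soft_t2i = F.softmax(txt_emb_m @ img_emb_m.t() / temp, dim=-1)
    tgt_i2t = alpha * soft_i2t + (1 - alpha) * hard
    tgt_t2i = alpha * soft_t2i + (1 - alpha) * hard

    loss_i2t = -(F.log_softmax(sim_i2t, dim=-1) * tgt_i2t).sum(-1).mean()
    loss_t2i = -(F.log_softmax(sim_t2i, dim=-1) * tgt_t2i).sum(-1).mean()
    return (loss_i2t + loss_t2i) / 2
```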
We present GLIPv2, a grounded VL understanding model, that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (with all model weights shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks, and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/glip.
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce a Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of MoME, a pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data in addition to image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA and NLVR2. The code and pretrained models are available at https://aka.ms/vlmo.
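A minimal sketch of a Mixture-of-Modality-Experts block: self-attention is shared across modalities, while the feed-forward sublayer is routed to a vision, language, or vision-language expert depending on the input. Layer sizes and the routing convention are illustrative assumptions.

```python
import torch.nn as nn

class MoMEBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for m in ["vision", "language", "vl"]                        # modality-specific FFN experts
        })

    def forward(self, x, modality):                                      # modality in {"vision","language","vl"}
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.experts[modality](self.norm2(x))
```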