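Previous vision-language pre-training models mainly construct multi-modal inputs from tokens and objects (pixels) and then perform cross-modal interaction between them. We argue that an input consisting only of tokens and objects limits high-level semantic alignment such as phrase-to-region grounding. Meanwhile, multi-level alignments are inherently consistent and can facilitate representation learning synergistically. Therefore, in this paper, we propose to learn Multi-level semantic alignment for Vision-language Pre-TRaining (MVPTR). In MVPTR, we follow the nested structure of both modalities to introduce concepts as high-level semantics. To ease learning from multi-modal, multi-level inputs, our framework is split into two stages: the first stage focuses on intra-modality multi-level representation learning, and the second enforces interactions across modalities via both coarse-grained and fine-grained semantic alignment tasks. In addition to the commonly used image-text matching and masked language modeling tasks, we introduce a masked concept recovering task in the first stage to enhance concept representation learning, and two more tasks in the second stage to explicitly encourage multi-level alignment across modalities. Our code is available at https://github.com/junction4nako/mvp_pytorch.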
Translated by Google Translate
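Existing vision-language pre-training (VLP) methods primarily rely on paired image-text datasets, which are either annotated with enormous human labor or crawled from the internet and then processed with elaborate data cleaning techniques. To reduce the dependency on well-aligned image-text pairs, it is promising to directly leverage large-scale text-only and image-only corpora. This paper proposes a data augmentation method, namely cross-modal CutMix (CMC), for implicit cross-modal alignment learning in unpaired VLP. Specifically, CMC transforms natural sentences from the textual view into a multi-modal view, in which visually grounded words in a sentence are randomly replaced by diverse image patches with similar semantics. The proposed CMC has several appealing properties. First, it enhances data diversity while keeping the semantic meaning intact, which addresses the scarcity of aligned data. Second, by attaching cross-modal noise to uni-modal data, it guides models to learn token-level interactions across modalities for better denoising. Furthermore, we present a new unpaired VLP method, dubbed VLMixer, that integrates CMC with contrastive learning to pull together the uni-modal and multi-modal views for better instance-level alignment among different modalities. Extensive experiments on five downstream tasks show that VLMixer can surpass previous state-of-the-art unpaired VLP methods.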
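The word-to-patch replacement described for cross-modal CutMix can be illustrated with a minimal sketch. It is not the authors' implementation: the `patch_gallery` lookup, the patch identifiers, and the `replace_prob` parameter are hypothetical names introduced here, and real patches would be image features rather than strings.

```python
import random


def cross_modal_cutmix(tokens, patch_gallery, replace_prob=0.5, rng=None):
    """Turn a text-only sentence into a mixed text/patch sequence.

    tokens: list of word strings.
    patch_gallery: hypothetical dict mapping a visually grounded word to a
        list of identifiers of image patches with similar semantics.
    replace_prob: assumed probability of swapping a grounded word for a patch.
    Returns a list of ("text", word) or ("patch", patch_id) pairs.
    """
    rng = rng or random.Random()
    mixed = []
    for word in tokens:
        candidates = patch_gallery.get(word.lower())
        # Only words that have entries in the gallery can be replaced.
        if candidates and rng.random() < replace_prob:
            mixed.append(("patch", rng.choice(candidates)))
        else:
            mixed.append(("text", word))
    return mixed


gallery = {"dog": ["dog_patch_0", "dog_patch_1"], "ball": ["ball_patch_0"]}
sentence = "A dog chases a red ball".split()
mixed_view = cross_modal_cutmix(sentence, gallery, replace_prob=1.0,
                                rng=random.Random(0))
```

With `replace_prob=1.0` every word found in the gallery is replaced, while the remaining words stay as text, so the sequence length and word order are preserved.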
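Vision-and-language pretraining has become the prevalent approach for tackling multimodal downstream tasks. The current trend is to move toward ever larger models and pretraining datasets. This computational headlong rush does not seem reasonable in the long term if the goal is sustainable solutions, and it de facto excludes academic laboratories with limited resources. In this work, we propose a new framework, dubbed ViCHA, that efficiently exploits the input data to boost learning by: [...], (c) leveraging image-level annotations, called Visual Concepts, obtained with existing foundation models such as CLIP, to boost the performance of the image encoder. Although pretrained on four times less data, our ViCHA strategy outperforms other approaches on several downstream tasks such as image-text retrieval, VQA, visual reasoning, visual entailment, and visual grounding. The code will be made publicly available here: https://github.com/mshukor/vicha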
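Most existing methods in vision-language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. We argue that the use of object detection may not be suitable for vision-language pre-training. Instead, we point out that the task should be performed so that the regions of the "visual concepts" mentioned in the texts are located in the images, and at the same time the alignments between texts and visual concepts are identified, where the alignments are at multiple granularities. This paper proposes a new method called X-VLM to perform "multi-grained vision-language pre-training". Experimental results show that X-VLM consistently outperforms state-of-the-art methods in many downstream vision-language tasks.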
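Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but it ignores fine-grained alignment because a dual-stream architecture aligns image and text representations only at a global level. Earlier supervised, non-contrastive methods were capable of finer-grained alignment, but they required dense annotations that were not scalable. We propose a single-stream architecture that aligns images and language at multiple levels (global, fine-grained patch-token, and conceptual/semantic) using two novel tasks: symmetric cross-modality reconstruction (XMM) and pseudo-labeled keyword prediction (PSL). In XMM, we mask input tokens from one modality and use cross-modal information to reconstruct the masked tokens, thus improving fine-grained alignment between the two modalities. In PSL, we use attention to select keywords in a caption, use a momentum encoder to recommend other important keywords that are missing from the caption but represented in the image, and then train the visual encoder to predict the presence of those keywords, helping it learn semantic concepts that are essential for grounding a textual token to an image region. We demonstrate competitive performance and improved data efficiency on image-text retrieval, grounding, and visual question answering/reasoning against larger models and models trained on more data. Code and models are available at Zaidkhan.me/simla.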
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use self-attention to learn image-text semantic alignments in a brute force manner, in this paper, we propose a new learning method Oscar, which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected, and are often mentioned in the paired text. We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-arts on six well-established vision-language understanding and generation tasks.
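Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performance on a broad range of vision-language tasks after fine-tuning. Previous mainstream VLP approaches typically adopt a two-step strategy that relies on external object detectors to encode images in a multi-modal Transformer framework, which suffers from a restrictive object concept space, limited image context, and inefficient computation. In this paper, we propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly. More importantly, we propose to perform object knowledge distillation to facilitate learning cross-modal alignment at different semantic levels. To achieve this, we design two novel pretext tasks that take object features and their semantic labels from external detectors as supervision: 1.) an object-guided masked vision modeling task, which focuses on enforcing object-aware representation learning in the multi-modal Transformer; 2.) a phrase-region alignment task, which aims to improve cross-modal alignment by utilizing the similarities between noun phrases and object labels in the linguistic space. Extensive experiments on a wide range of vision-language tasks demonstrate the efficacy of our proposed framework, and we achieve competitive or superior performance over existing pretraining strategies.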
We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT) which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages. In this paper, we propose SOHO to "See Out of tHe bOx" that takes a whole image as input, and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks by following standard VLPT settings. In particular, SOHO achieves absolute gains of 2.0% R@1 score on MSCOCO text retrieval 5k test split, 1.5% accuracy on NLVR2 test-P split, 6.7% accuracy on SNLI-VE test split, respectively.
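A visual dictionary that is "updated on-the-fly" can be pictured as a nearest-neighbour lookup followed by a moving-average update. The sketch below is only an illustration under assumptions: the squared Euclidean distance, the `momentum` value, and the function names are choices made here and are not taken from the abstract.

```python
def nearest_index(feature, dictionary):
    """Return the index of the dictionary entry closest to `feature`."""
    best, best_dist = 0, float("inf")
    for idx, entry in enumerate(dictionary):
        dist = sum((f - e) ** 2 for f, e in zip(feature, entry))
        if dist < best_dist:
            best, best_dist = idx, dist
    return best


def quantize_and_update(features, dictionary, momentum=0.9):
    """Assign each feature to its nearest entry, then update entries in place.

    Each entry that received features moves toward the mean of those
    features with an exponential moving average (assumed update rule).
    """
    assignments = [nearest_index(f, dictionary) for f in features]
    for idx in set(assignments):
        members = [f for f, a in zip(features, assignments) if a == idx]
        mean = [sum(col) / len(members) for col in zip(*members)]
        dictionary[idx] = [momentum * e + (1.0 - momentum) * m
                           for e, m in zip(dictionary[idx], mean)]
    return assignments


visual_dictionary = [[0.0, 0.0], [1.0, 1.0]]
grid_features = [[0.1, 0.0], [0.9, 1.1], [1.1, 0.9]]
indices = quantize_and_update(grid_features, visual_dictionary)
```

The returned indices play the role of discrete visual tokens, which is what makes a masked-prediction task over image features possible.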
Learning fine-grained interplay between vision and language allows a more accurate understanding of vision-language tasks. However, it remains challenging to extract key image regions according to the texts for semantic alignments. Most existing works are either limited by text-agnostic and redundant regions obtained with frozen detectors, or fail to scale further due to their heavy reliance on scarce grounding (gold) data to pre-train detectors. To solve these problems, we propose Self-Locator Aided Network (SLAN) for cross-modal understanding tasks without any extra gold data. SLAN consists of a region filter and a region adaptor to localize regions of interest conditioned on different texts. By aggregating cross-modal information, the region filter selects key regions and the region adaptor updates their coordinates with text guidance. With detailed region-word alignments, SLAN can be easily generalized to many downstream tasks. It achieves fairly competitive results on five cross-modal understanding tasks (e.g., 85.7% and 69.2% on COCO image-to-text and text-to-image retrieval, surpassing previous SOTA methods). SLAN also demonstrates strong zero-shot and fine-tuned transferability to two localization tasks.
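Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model cross-modal alignment through the similarity of the global representations of images and texts, or through advanced cross-modal attention over image and text features. However, because only global image-text alignment information is available, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases. In this paper, we introduce LOUPE, a fine-grained semantically aligned vision-language pre-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To compute the game-theoretic interactions efficiently, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art results on image-text retrieval benchmarks. Without any object-level human annotations or fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding. More importantly, LOUPE opens a new direction of learning fine-grained semantics from large-scale raw image-text pairs.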
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question
Vision-Language Transformers can be learned without human labels (e.g. class labels, bounding boxes, etc). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Vision-Language from Captions (VLC) built on top of Masked Auto-Encoders that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer which is pretrained with supervised object classification, and our model, VLC, we find that our approach 1. outperforms ViLT on standard benchmarks, 2. provides more interpretable and intuitive patch visualizations, and 3. is competitive with many larger models that utilize ROIs trained on annotated bounding-boxes.
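With the evolution of the Transformer, pre-trained models have advanced at a breakneck pace in recent years. They have come to dominate the mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to vision-and-language (V-L) learning and improve downstream task performance has become a focus of multimodal learning. In this paper, we review the recent progress in vision-language pre-trained models (VL-PTMs). As the core content, we first briefly introduce several ways to encode raw images and texts into single-modal embeddings before pre-training. Then, we dive into the mainstream architectures of VL-PTMs for modeling the interaction between text and image representations. We further present widely used pre-training tasks, and then introduce some common downstream tasks. We finally conclude the paper and present some promising research directions. Our survey aims to provide researchers with a synthesis of, and pointers to, related research.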
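Image and language modeling is of crucial importance for vision-language pre-training (VLP), which aims to learn multi-modal representations from large-scale paired image-text data. However, we observe that most existing VLP methods focus on modeling the interactions between image and text features while neglecting the information disparity between image and text, and thus suffer from focal bias. To address this problem, we propose a vision-language masked autoencoder framework (VLMAE). VLMAE employs visual generative learning, which helps the model acquire fine-grained and unbiased features. Unlike previous works, VLMAE pays attention to almost all critical patches in an image, providing a more comprehensive understanding. Extensive experiments demonstrate that VLMAE achieves better performance on various vision-language downstream tasks, including visual question answering, image-text retrieval, and visual grounding, even with a pre-training speedup of up to 20%.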
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method does not require bounding box annotations nor high-resolution images. To improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR2, ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-of-the-art, while enjoying faster inference speed. Code and models are available at https://github.com/salesforce/ALBEF.
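The idea of aligning image and text representations with a contrastive loss before fusion can be sketched as a symmetric InfoNCE objective over a batch of paired embeddings. This is a simplified illustration, not the ALBEF implementation: it omits momentum distillation and any queue of negatives, and the `temperature` value and function names are assumptions made here.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target):
    """Numerically stable cross-entropy for one row of logits."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[target]


def image_text_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images matches pair k of texts."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # Cosine similarity scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature
            for txt in texts] for img in images]
    image_to_text = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(cross_entropy([sim[r][k] for r in range(n)], k)
                        for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


aligned = image_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[1.0, 0.0], [0.0, 1.0]])
mismatched = image_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each image embedding points in the same direction as its paired text embedding and grows when the pairs are swapped, which is the signal that pulls matched pairs together before the cross-modal encoder sees them.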
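Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN and then aligns images and text with a Transformer. Visual relationships between visual contents play an important role in image understanding and are the basis for inter-modal alignment learning. However, CNNs have limitations in visual relation learning because local receptive fields are weak at modeling long-range dependencies. As a result, the two objectives of learning visual relations and inter-modal alignment are encapsulated in the same Transformer network. Such a design might restrict inter-modal alignment learning in the Transformer by ignoring the specialized characteristics of each objective. To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relations and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between the vision and language modalities (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in the Transformer to further promote inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of the Transformer for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including image-text retrieval, visual question answering (VQA), visual entailment, and visual reasoning. Our approach not only outperforms state-of-the-art VLP performance, but also shows benefits on the IMF metric.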
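With the burgeoning amount of image-text pair data and the diversity of vision-and-language (V&L) tasks, scholars have introduced an abundance of deep learning models in this research domain. Furthermore, in recent years, transfer learning has shown tremendous success in computer vision for tasks such as image classification and object detection, and in natural language processing for tasks such as question answering and machine translation. Inheriting the spirit of transfer learning, research works in V&L have devised multiple pretraining techniques on large-scale datasets in order to enhance the performance of downstream tasks. The aim of this article is to provide a comprehensive review of contemporary V&L pretraining models. In particular, we categorize and delineate pretraining approaches, along with a summary of state-of-the-art vision-and-language pretrained models. Moreover, a list of training datasets and downstream tasks is supplied to further sharpen the perspective on V&L pretraining. Lastly, we take a further step and discuss numerous directions for future research.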
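In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) into a new era. Substantial work has shown that they benefit downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this question and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey on VLP. We hope that this survey can shed light on future research in the VLP field.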
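We propose a margin-based loss for vision-language model pretraining that encourages gradient-based explanations that are consistent with region-level annotations. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding performance compared to models that instead rely on region-level annotations to explicitly train an object detector such as Faster R-CNN. AMC works by encouraging gradient-based explanation masks whose attention scores are concentrated mostly within the annotated regions of interest, for images that contain such annotations. In particular, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.59% on the Flickr30k visual grounding benchmark, an absolute improvement of 5.48% over the best previous model. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension and, by design, offers the added benefit of gradient-based explanations that better align with human annotations.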
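A margin-based consistency objective of this kind can be illustrated with a hinge on the gap between attention inside and outside an annotated region. The sketch below is one plausible form written for illustration only; the abstract does not give the exact AMC formulation, and the use of mean scores, the `margin` value, and the function name are assumptions.

```python
def attention_margin_loss(attention, region_mask, margin=0.1):
    """Hinge loss that is zero once attention inside the annotated region
    exceeds attention outside it by at least `margin`.

    attention: flat list of explanation scores, one per image location.
    region_mask: flat list of 0/1 flags marking the annotated region.
    """
    inside = [a for a, m in zip(attention, region_mask) if m]
    outside = [a for a, m in zip(attention, region_mask) if not m]
    if not inside or not outside:
        # Without both groups there is no gap to enforce.
        return 0.0
    mean_inside = sum(inside) / len(inside)
    mean_outside = sum(outside) / len(outside)
    return max(0.0, margin + mean_outside - mean_inside)


scores = [0.8, 0.7, 0.1, 0.0]
focused = attention_margin_loss(scores, [1, 1, 0, 0])
misplaced = attention_margin_loss(scores, [0, 0, 1, 1])
```

In a real training setup the scores would come from a differentiable explanation map and this term would be added to the standard pretraining objectives, so that gradients push attention into the annotated region.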
Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into N × N blocks, and identifies the objects in each block through the object detector widely used in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g. filling "P" or "O" in a PTP "The block P has a O". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT baseline, and COCO Captioning (+5.3 in CIDEr) for the SOTA BLIP baseline. Moreover, PTP achieves comparable results with object-detector based methods, and much faster inference speed, since PTP discards its object detector for inference while the latter cannot. Our code and pre-trained weights will be released at https://github.com/sail-sg/ptp.
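The block-based prompt construction can be sketched in a few lines. The prompt template "The block P has a O" comes from the abstract; everything else here is an assumption for illustration, including assigning an object to the block that contains its box center, numbering blocks from 0 in row-major order, and the function names.

```python
def block_index(center_x, center_y, width, height, n):
    """Map a point to the row-major index of its block in an n x n grid."""
    col = min(int(center_x * n / width), n - 1)
    row = min(int(center_y * n / height), n - 1)
    return row * n + col


def position_prompts(detections, width, height, n=3):
    """Build one position-guided prompt per detected object.

    detections: list of (label, (x0, y0, x1, y1)) pairs in pixel coordinates,
        as a detector might return them (hypothetical input format).
    """
    prompts = []
    for label, (x0, y0, x1, y1) in detections:
        center_x, center_y = (x0 + x1) / 2, (y0 + y1) / 2
        block = block_index(center_x, center_y, width, height, n)
        prompts.append(f"The block {block} has a {label}.")
    return prompts


example_detections = [("dog", (10, 10, 90, 90)), ("ball", (210, 210, 290, 290))]
prompts = position_prompts(example_detections, width=300, height=300, n=3)
```

Masking either the block index or the object name in such a prompt yields the fill-in-the-blank task, and because the prompts are only needed during pre-training, the detector can be dropped at inference time.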