Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether this makes it possible to learn those skills from text data and then use them to complete vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study a variety of strategies to mitigate this concern. We produce models using only text training data on three tasks: image captioning, visual entailment and visual question answering, and evaluate them on standard benchmarks using images. We find that this kind of transfer is possible and results in only a small drop in performance relative to models trained on images. We also showcase a variety of stylistic image captioning models that were trained using no image data and no human-curated language data, but instead text data from books, the web, or language models.
translated by 谷歌翻译
Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval. However, such zero-shot performance of CLIP-based models does not realize in tasks that require a finer-grained correspondence between vision and language, such as Visual Question Answering (VQA). We investigate why this is the case, and report an interesting phenomenon of CLIP, which we call the Concept Association Bias (CAB), as a potential cause of the difficulty of applying CLIP to VQA and similar tasks. CAB is especially apparent when two concepts are present in the given image while a text prompt only contains a single concept. In such a case, we find that CLIP tends to treat input as a bag of concepts and attempts to fill in the other missing concept crossmodally, leading to an unexpected zero-shot prediction. For example, when asked for the color of a lemon in an image, CLIP predicts ``purple'' if the image contains a lemon and an eggplant. We demonstrate the Concept Association Bias of CLIP by showing that CLIP's zero-shot classification performance greatly suffers when there is a strong concept association between an object (e.g. lemon) and an attribute (e.g. its color). On the other hand, when the association between object and attribute is weak, we do not see this phenomenon. Furthermore, we show that CAB is significantly mitigated when we enable CLIP to learn deeper structure across image and text embeddings by adding an additional Transformer on top of CLIP and fine-tuning it on VQA. We find that across such fine-tuned variants of CLIP, the strength of CAB in a model predicts how well it performs on VQA.
translated by 谷歌翻译
最近的文本到图像匹配模型对大型图像和句子的大公司进行了对比学习。虽然这些模型可以提供用于匹配和随后的零拍任务的强大分数,但它们不能给出给定图像的标题。在这项工作中,我们重新利用这些模型来生成在推理时间的图像时生成描述性文本,而无需进一步的训练或调整步骤。这是通过将具有大语言模型的视觉语义模型组合,从两种网络级模型中的知识中获益。由受监督标题方法获得的标题的限制性较小。此外,作为零射击学习方法,它非常灵活,我们展示了执行图像算法的能力,其中输入可以是图像或文本,输出是句子。这使得新颖的高级视觉能力,例如比较两个图像或解决视觉类比测试。
translated by 谷歌翻译
虽然标题模型已经获得了引人注目的结果,但在描述自然图像时,它们仍然不会涵盖现实世界概念的整个长尾分布。在本文中,我们通过在Web级自动收集的数据集上培训来解决与野外概念生成人类描述的任务。为此,我们提出了一种模型,该模型可以利用嘈杂的图像标题对,同时维持像Coco这样的传统人类注释数据集的描述性风格。我们的模型通过使用关键字和风格标记将内容从风格分开,使用单一目标是提示语言建模和比其他最近提出的更简单。在实验上,我们的模型在零拍摄设置中始终如一地占据了说明性质量和能力的现有方法。根据苹果酒公制,我们在使用外部数据时在Coco和Nocaps上获得新的最新状态。
translated by 谷歌翻译
We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as, visual question answering, visual entailment, visual reasoning. In contrast to previous works using contrastive or generative captioning training, we propose a novel and simple recipe to pre-train a vision-language joint model, which is multi-task as well. The pre-training uses only noisy image captioning data, and is formulated to use the entire architecture end-to-end with both a strong language encoder and decoder. Our results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results across a variety of question answering tasks. Our multi-task mixture training learns from tasks of various question intents and thus generalizes better, including on zero-shot vision-language tasks. We conduct experiments in the challenging multi-task and open-vocabulary settings and across a variety of datasets and tasks, such as VQA2.0, SNLI-VE, NLVR2, GQA. We observe that the proposed approach is able to generalize to unseen tasks and that more diverse mixtures lead to higher accuracy in both known and novel tasks.
translated by 谷歌翻译
通用视觉(GPV)系统是旨在解决各种视觉任务的模型,而无需进行架构更改。如今,GPV主要从大型完全监督的数据集中学习技能和概念。通过获取数据以迅速学习每个技能的每个概念,将GPV扩展到数万个概念都变得令人望而却步。这项工作提出了一种有效且廉价的替代方法:从监督数据集中学习技能,从Web图像搜索中学习概念,并利用GPV的关键特征:跨技能传递视觉知识的能力。我们使用跨越10K+视觉概念的1M+图像的数据集来演示3个基准上的两个现有GPV(GPV-1和VL-T5)的Webly Supumented概念扩展:5个基于可可的数据集(80个主要概念),这是一个新的策划系列,这是一个新的策划系列。基于OpenImages和VisualGenome存储库(〜500个概念)以及Web衍生的数据集(10K+概念)的5个数据集。我们还提出了一种新的体系结构GPV-2,该架构支持各种任务 - 从分类和本地化等视觉任务到Qu Viewer+语言任务,例如QA和字幕,再到更多的利基市场,例如人类对象互动检测。 GPV-2从Web数据中受益匪浅,并且在这些基准测试中胜过GPV-1和VL-T5。我们的数据,代码和Web演示可在https://prior.allenai.org/projects/gpv2上获得。
translated by 谷歌翻译
大规模预制速度迅速成为视觉语言(VL)建模中的规范。然而,普遍的VL方法受标记数据的要求和复杂的多步预介质目标的要求受限。我们呈现Magma - 使用基于适配器的FineTuning使用额外的方式增强生成语言模型的简单方法。在冻结的情况下,我们培训一系列VL模型,从视觉和文本输入的任意组合自动生成文本。使用单一语言建模目的,预先预测完全结束于结束,与先前的方法相比,简化优化。重要的是,在培训期间,语言模型权重保持不变,允许从语言预磨练转移百科全书知识和内心的学习能力。 Magma在开放式生成任务上冻结的岩浆,实现了最先进的状态,结果在Okvqa基准和竞争结果上的一系列其他流行的VL基准测试中,同时预先训练用于培训SIMVLM的样本数量的0.2%。
translated by 谷歌翻译
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use selfattention to learn image-text semantic alignments in a brute force manner, in this paper, we propose a new learning method Oscar 1 , which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected, and are often mentioned in the paired text. We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-arts on six well-established vision-language understanding and generation tasks. 2
translated by 谷歌翻译
有效的缩放和灵活的任务接口使大型语言模型能够在许多任务中表现出色。帕利(Pali)根据视觉和文本输入生成文本,并使用该界面以许多语言执行许多视觉,语言和多模式任务。为了训练帕利,我们利用了大型的编码器语言模型和视觉变压器(VITS)。这使我们能够利用其现有能力,并利用培训它们的大量成本。我们发现,视觉和语言组成部分的联合缩放很重要。由于现有的语言变压器比其视觉对应物要大得多,因此我们训练迄今为止最大的VIT(VIT-E),以量化甚至大容量视觉模型的好处。为了训练Pali,我们基于一个新的图像文本训练集,其中包含10B图像和文本,以100多种语言来创建大型的多语言组合。帕利(Pali)在多个视觉和语言任务(例如字幕,视觉问题,索方式,场景文本理解)中实现了最新的,同时保留了简单,模块化和可扩展的设计。
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
translated by 谷歌翻译
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pretraining. However, these datasets are often collected with overrestrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pretraining data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [70] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for visionand-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks. 1
translated by 谷歌翻译
我们介绍了一种零拍的视频字幕方法,该方法采用了两个冷冻网络:GPT-2语言模型和剪辑图像文本匹配模型。匹配分数用于引导语言模型生成一个句子,该句子的平均匹配分数高于视频帧的一个子集。与零拍图像字幕方法不同,我们的工作立即考虑整个句子。这是通过在生成过程中优化从头开始的一部分,通过在提示中修改所有其他令牌的表示,并通过迭代重复该过程,逐渐提高生成句子的特殊性和全面性来实现。我们的实验表明,生成的字幕是连贯的,并显示了广泛的现实知识。我们的代码可在以下网址找到:https://github.com/yoadtew/zero-shot-video-to-text
translated by 谷歌翻译
本文介绍了Omnivl,这是一种新的基础模型,旨在使用一种通用体系结构来支持图像语言和视频语言任务。它为图像和视频输入采用了统一的基于变压器的视觉编码器,因此可以执行联合图像语言和视频语言预处理。我们首次证明了这样的范式受益于图像和视频任务,而不是传统的单向传输(例如,使用图像语言来帮助视频语言)。为此,我们提出了对图像语言和视频语言的脱钩关节预处理,以有效地将视觉模型分解为空间和时间维度,并在图像和视频任务上获得性能提升。此外,我们引入了一种新颖的统一视觉对比度(UNIVLC)损失,以利用图像文本,视频文本,图像标签(例如,图像分类),视频标签(例如,视频动作识别)在一起受到监督和吵闹的监督预处理数据都尽可能多地利用。无需额外的任务适配器,Omnivl可以同时支持仅视觉任务(例如,图像分类,视频操作识别),跨模式对齐任务(例如,图像/视频 - 文本检索)和多模式理解和生成任务(例如,图像/视频问答,字幕)。我们在各种下游任务上评估Omnivl,并以相似的模型大小和数据量表获得最新的或竞争结果。
translated by 谷歌翻译
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific modelsachieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.Preprint. Under review.
translated by 谷歌翻译
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between LLM and VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose \emph{Img2Prompt}, a plug-and-play module that provides the prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. In order to provide such prompts, we further employ LLM-agnostic models to provide prompts that can describe image content and self-constructed question-answer pairs, which can effectively guide LLM to perform zero-shot VQA tasks. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2)~Without the needing of end-to-end training, it significantly reduces the cost of deploying LLM for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo~\cite{Deepmind:Flamingo2022} by 5.6\% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20\%.
translated by 谷歌翻译
Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without
translated by 谷歌翻译
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of offline videos and textual query sentences. However, in real scenarios, online videos are frequently accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This inspires us to generate associated captions from offline videos to help with existing text-video retrieval methods. To do so, we propose to use the zero-shot video captioner with knowledge of pre-trained web-scale models (e.g., CLIP and GPT-2) to generate captions for offline videos without any training. Given the captions, one question naturally arises: what can auxiliary captions do for text-video retrieval? In this paper, we present a novel framework Cap4Video, which makes use of captions from three aspects: i) Input data: The video and captions can form new video-caption pairs as data augmentation for training. ii) Feature interaction: We perform feature interaction between video and caption to yield enhanced video representations. iii) Output score: The Query-Caption matching branch can be complementary to the original Query-Video matching branch for text-video retrieval. We conduct thorough ablation studies to demonstrate the effectiveness of our method. Without any post-processing, our Cap4Video achieves state-of-the-art performance on MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%).
translated by 谷歌翻译
视觉语言预训练(VLP)模型在各种下游任务上表现出色。他们的成功在很大程度上取决于预训练的跨模式数据集的规模。但是,中文中缺乏大规模数据集和基准阻碍了中国VLP模型和更广泛的多语言应用程序的发展。在这项工作中,我们发布了一个名为Wukong的大型中国跨模式数据集,其中包含从网络收集的1亿个中文图像文本对。 Wukong旨在基准基准不同的多模式预训练方法,以促进VLP研究和社区发展。此外,我们发布了一组模型,预先训练了各种图像编码器(vit-b/vit-l/swint),还将高级预训练技术应用于VLP,例如锁定图像文本调整,相对于代币的相似性学习和减少互动。还提供了广泛的实验和不同下游任务的基准测试,包括新的最大人验证的图像文本测试数据集。实验表明,Wukong可以作为不同的跨模式学习方法的有前途的中国预培训数据集和基准。对于10个数据集上的零摄像图像分类任务,$ Wukong_ {vit-l} $达到的平均准确度为73.03%。对于图像文本检索任务,它在AIC-ICC上的平均召回率为71.6%,比Wenlan 2.0高12.9%。此外,我们的Wukong模型在下游任务上进行了基准测试,例如多个数据集上的其他变体,例如Flickr8k-CN,Flickr-30K-CN,Coco-CN,Coco-CN等。更多信息可以参考:https://wukong-dataset.github.io/wukong-dataset/。
translated by 谷歌翻译
剪辑的发展[Radford等,2021]引发了关于语言监督是否可以导致与传统仅图像方法更可转移表示的视觉模型的争论。我们的工作通过对两种方法的学习能力进行了对下游分类任务的学习能力进行仔细控制的比较来研究这个问题。我们发现,当预训练数据集符合某些标准时 - 它足够大,并且包含具有较低变异性的描述性字幕 - 仅图像的方法也与剪辑的传输性能不匹配,即使它们接受了更多图像数据的培训。但是,与人们期望的相反,在某些情况下,没有满足这些标准,其中通过标题增加的监督实际上是有害的。在我们的发现的激励下,我们设计了简单的处方,以使剪辑能够更好地利用现有预训练数据集中存在的语言信息。
translated by 谷歌翻译
探索大规模预处理的基础模型对计算机视觉具有重大兴趣,因为这些模型可以快速转移到许多下游任务中。本文介绍了对比字幕(COCA),这是一种极简主义的设计,旨在为图像文本编码器编码器基础模型预算与对比度损失和字幕损失,从而从剪辑和诸如simvlm之类的生成方法之类的对比方法中包含模型能力。与所有解码器层都参与编码器输出的标准编码器 - 模块变压器相反,可口可乐省略了解码器层的上半部分的交叉注意,以编码单峰文本表示,并串联到剩余的解码器层,这些解码器与图像编码器相交的解码器层多模式图像文本表示。除了对多模态解码器输出的字幕损失外,我们还应用了单峰图像和文本嵌入之间的对比损失,该输出可以预测文本令牌自动加压。通过共享相同的计算图,可以用最小的开销有效地计算两个培训目标。可口可乐是端到端和从头开始的网络尺度alt-text数据和带注释的图像,通过将所有标签视为文本,无缝地统一自然语言监督以进行表示。从经验上讲,可口可乐通过零拍传输或在广泛的下游任务上进行零摄像转移或最少的特定任务适应,跨越视觉识别(Imagenet,Kinetics-400/600/700,瞬间, ),交叉模式检索(MSCOCO,FLICKR30K,MSR-VTT),多模式理解(VQA,SNLI-VE,NLVR2)和图像字幕(MSCOCO,NOCAPS)。值得注意的是,在Imagenet分类方面,COCA获得了86.3%的TOP-1准确性,带有冷冻编码器和学习的分类头90.6%,以及带有填充编码器的Imagenet上的新最先进的91.0%Top-1 Top-1精度。
translated by 谷歌翻译