Image captioning is a fundamental task in vision-language understanding, where the model predicts a textual informative caption for a given input image. In this paper, we present a simple approach to address this task. We use a CLIP encoding as a prefix to the caption, by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features which were trained with textual context, making it well suited for vision-language perception. Our key idea is that, together with a pre-trained language model (GPT-2), we obtain a broad understanding of both visual and textual data. Hence, our approach only requires rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while it is simpler, faster, and lighter. Our code is available at https://github.com/rmokady/clip_prefix_caption.
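A minimal PyTorch sketch of the prefix-mapping idea this abstract describes: a small network maps a CLIP image embedding to a sequence of prefix embeddings that are prepended to the caption tokens of a (possibly frozen) GPT-2. The MLP shape, prefix length of 10, and dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Map a CLIP image embedding to a sequence of GPT-2 prefix embeddings."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        # A small MLP; the paper also reports a transformer variant of the mapper.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_embedding):          # (batch, clip_dim)
        prefix = self.mlp(clip_embedding)       # (batch, prefix_len * gpt_dim)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

# Usage: the prefix is concatenated with the caption token embeddings and fed to the
# language model with a standard next-token loss; when only the mapper is trained,
# CLIP and GPT-2 stay frozen.
mapper = PrefixMapper()
dummy_clip = torch.randn(4, 512)
prefix_embeds = mapper(dummy_clip)              # (4, 10, 768), prepend to GPT-2 inputs_embeds
print(prefix_embeds.shape)
```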
Connecting vision and language plays an essential role in generative intelligence. For this reason, large research efforts have been devoted to image captioning, i.e., describing images with syntactically and semantically meaningful sentences. Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. Over these years, both components have evolved considerably through the exploitation of object regions and attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results, research in image captioning has not yet reached a conclusive answer. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting future directions for a research area in which computer vision and natural language processing can find an optimal synergy.
Recent text-to-image matching models apply contrastive learning to large corpora of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate a descriptive text for a given image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests.
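A small, hedged sketch of the "image arithmetic" idea mentioned above, using the OpenAI `clip` package: CLIP embeddings of images are combined arithmetically and the result is used as a target. For brevity this sketch only ranks a few candidate sentences against the target, whereas the paper steers a language model token by token; file names and candidates are made up.

```python
import torch
import clip   # OpenAI CLIP package, assumed installed
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_image(path):
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        e = model.encode_image(image)
    return e / e.norm(dim=-1, keepdim=True)

# "Image arithmetic": combine embeddings, then use the result as the visual target.
target = embed_image("king.jpg") - embed_image("man.jpg") + embed_image("woman.jpg")

candidates = ["a queen on a throne", "a man riding a horse", "a bowl of fruit"]
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(candidates).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (target @ text_emb.T).squeeze(0)       # CLIP matching scores against the target
print(candidates[scores.argmax().item()])
```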
While captioning models have obtained compelling results in describing natural images, they still do not cover the entire long-tail distribution of real-world concepts. In this paper, we address the task of generating human-like descriptions with in-the-wild concepts by training on a web-scale automatically collected dataset. To this end, we propose a model that can exploit noisy image-caption pairs while maintaining the descriptive style of traditional human-annotated datasets such as COCO. Our model separates content from style through the use of keywords and stylistic tokens, employs a single objective of prompted language modeling, and is simpler than other recently proposed approaches. Experimentally, our model consistently outperforms existing methods in terms of caption quality and descriptive capability in the zero-shot setting. According to the CIDEr metric, we obtain a new state of the art on both COCO and nocaps when using external data.
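To make the content/style separation above concrete, here is a tiny illustrative sketch of how a prompted language-modeling example might be composed from keywords plus a style token; the token format and style names are assumptions, not the paper's exact prompt design.

```python
def build_prompt(keywords, style_token, caption=None):
    """Compose a prompted LM example that separates content (keywords) from style
    (a style token). The textual format here is purely illustrative."""
    prompt = f"{style_token} keywords: {', '.join(keywords)} caption:"
    return f"{prompt} {caption}" if caption is not None else prompt

# A noisy web pair is tagged with a web-style token, a curated pair with a COCO-style token:
print(build_prompt(["dog", "frisbee", "park"], "<web>", "best frisbee dog ever!!!"))
print(build_prompt(["dog", "frisbee", "park"], "<coco>", "a dog catching a frisbee in a park."))
# At inference, conditioning on "<coco>" requests the descriptive, human-annotated style.
```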
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use self-attention to learn image-text semantic alignments in a brute-force manner, in this paper we propose a new learning method, Oscar, which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected, and are often mentioned in the paired text. We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-art results on six well-established vision-language understanding and generation tasks.
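A minimal PyTorch sketch of the (word tokens, object tags, region features) triple that the Oscar abstract describes feeding to a BERT-style transformer. The embedding modules, vocabulary size, and feature dimensions are placeholders chosen for illustration.

```python
import torch

def build_oscar_input(caption_ids, tag_ids, region_features, text_embed, feat_proj):
    """Assemble the word / object-tag / region-feature sequence; tags act as anchor points
    shared between the text and image sides."""
    word_embeds = text_embed(caption_ids)        # (n_words, hidden)
    tag_embeds = text_embed(tag_ids)             # (n_tags, hidden)
    region_embeds = feat_proj(region_features)   # (n_regions, hidden)
    return torch.cat([word_embeds, tag_embeds, region_embeds], dim=0)

hidden = 768
text_embed = torch.nn.Embedding(30522, hidden)   # BERT-sized vocabulary, illustrative
feat_proj = torch.nn.Linear(2054, hidden)        # region feature + box geometry, illustrative
seq = build_oscar_input(
    torch.tensor([101, 1037, 3899, 102]),        # e.g. "[CLS] a dog [SEP]"
    torch.tensor([3899, 7929]),                  # detected tag ids, e.g. "dog", "grass"
    torch.randn(10, 2054),
    text_embed, feat_proj,
)
print(seq.shape)                                 # (4 + 2 + 10, 768), input to the shared transformer
```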
Despite the remarkable progress of image captioning, existing captioners typically lack the controllable capability to generate desired image captions, e.g., describing the image in a rough or detailed manner, in a factual or emotional view, etc. In this paper, we show that a unified model is qualified to perform well in diverse domains and freely switch among multiple styles. Such a controllable capability is achieved by embedding the prompt learning into the image captioning framework. To be specific, we design a set of prompts to fine-tune the pre-trained image captioner. These prompts allow the model to absorb stylized data from different domains for joint training, without performance degradation in each domain. Furthermore, we optimize the prompts with learnable vectors in the continuous word embedding space, avoiding the heuristic prompt engineering and meanwhile exhibiting superior performance. In the inference stage, our model is able to generate desired stylized captions by choosing the corresponding prompts. Extensive experiments verify the controllable capability of the proposed method. Notably, we achieve outstanding performance on two diverse image captioning benchmarks including COCO Karpathy split and TextCaps using a unified model.
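The abstract above optimizes prompts as learnable vectors in the continuous word-embedding space. Below is a hedged PyTorch sketch of such soft prompts: one learnable prompt per style is prepended to the caption embeddings. The number of styles, prompt length, and hidden size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """A bank of learnable prompt vectors, one prompt per caption style/domain."""
    def __init__(self, num_styles=2, prompt_len=8, hidden=768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_styles, prompt_len, hidden) * 0.02)

    def forward(self, style_id, token_embeds):
        # Prepend the chosen style prompt to the embedded caption tokens.
        prompt = self.prompts[style_id].unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

soft_prompt = SoftPrompt()
tokens = torch.randn(4, 20, 768)                          # embedded caption tokens (batch, seq, hidden)
inputs = soft_prompt(style_id=1, token_embeds=tokens)     # pick e.g. the "detailed" style
print(inputs.shape)                                       # (4, 8 + 20, 768)
```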
We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model. The matching score is used to steer the language model toward generating a sentence that has a high average matching score with a subset of the video frames. Unlike zero-shot image captioning methods, our work considers the entire sentence at once. This is achieved by optimizing, during the generation process, part of the prompt from scratch, by modifying the representation of all other tokens in the prompt, and by repeating the process iteratively, gradually improving the specificity and comprehensiveness of the generated sentence. Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge. Our code is available at: https://github.com/yoadtew/zero-shot-video-to-text
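A short sketch of the frame-averaged CLIP matching score that the abstract above uses as a steering signal. In the actual method this signal is back-propagated into prompt representations during generation; here it is only computed for a candidate sentence. Frame file names and the subset size are illustrative.

```python
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def frame_averaged_score(frame_paths, sentence, subset_size=4):
    """Average CLIP matching score between a candidate sentence and a subset of video frames."""
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths[:subset_size]]).to(device)
    with torch.no_grad():
        img = model.encode_image(frames)
        txt = model.encode_text(clip.tokenize([sentence]).to(device))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Example (hypothetical frames):
# score = frame_averaged_score(["f0.jpg", "f1.jpg", "f2.jpg", "f3.jpg"],
#                              "a chef slices vegetables on a cutting board")
```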
Generating short stories from images is a demanding task. Unlike image captioning, story generation from images poses multiple challenges: maintaining story coherence, appropriately assessing the quality of the story, steering the generated story toward a certain style, and coping with the scarcity of reference datasets of image-story pairs, which limits supervision during training. In this work, we introduce Plug-and-Play Story Teller (PPST) and improve image-to-story generation by: 1) alleviating the data scarcity problem by incorporating large pre-trained models, namely CLIP and GPT-2, to facilitate fluent image-to-text generation with minimal supervision, and 2) enabling more style-relevant generation by incorporating stylistic adapters to control the story generation. We conduct image-to-story generation experiments with PPST in non-stylistic, romance, and action styles, and compare our generated stories with those of previous work in three aspects, namely story coherence, image-story relevance, and style fitness, using both automatic and human evaluation. The results show that PPST improves story coherence and achieves better image-story relevance, but has not yet reached sufficient style fitness.
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoders/decoders) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost model performance. Without bells and whistles, our GIT establishes new state of the art on 12 challenging benchmarks. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
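A hedged structural sketch of the "one image encoder + one text decoder under a language modeling task" design described above. All modules are generic PyTorch stand-ins chosen so the snippet runs; this is not the released GIT architecture, which interleaves image and text tokens differently.

```python
import torch
import torch.nn as nn

class MinimalEncoderDecoderCaptioner(nn.Module):
    """One image encoder + one text decoder trained with a next-token LM loss."""
    def __init__(self, vocab=30522, hidden=768):
        super().__init__()
        self.image_encoder = nn.Sequential(          # placeholder for a ViT-style encoder
            nn.Conv2d(3, hidden, kernel_size=16, stride=16),
            nn.Flatten(2),
        )
        self.token_embed = nn.Embedding(vocab, hidden)
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, images, caption_ids):
        visual = self.image_encoder(images).transpose(1, 2)     # (batch, patches, hidden)
        tokens = self.token_embed(caption_ids)
        L = caption_ids.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)  # left-to-right mask
        h = self.decoder(tokens, visual, tgt_mask=causal)
        return self.lm_head(h)                                  # next-token logits for the LM loss

model = MinimalEncoderDecoderCaptioner()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 12)))
print(logits.shape)   # (2, 12, 30522)
```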
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overly restrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [70] and introduce Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
Large-scale pre-training is rapidly becoming the norm in vision-language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and by complex multi-step pre-training objectives. We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based fine-tuning. Building on frozen components, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pre-training is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing the transfer of encyclopedic knowledge and in-context learning abilities from language pre-training. MAGMA outperforms Frozen on open-ended generative tasks, achieving state-of-the-art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pre-training on 0.2% of the number of samples used to train SimVLM.
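The adapter-based fine-tuning mentioned above keeps the language model frozen and only trains small inserted modules. Below is a minimal sketch of a standard bottleneck adapter with a residual connection; the hidden and bottleneck sizes are illustrative, and the exact placement of adapters inside the transformer blocks follows the paper rather than this snippet.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection, inserted into a frozen LM."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual keeps the frozen path intact

# Only the adapters (and the visual prefix encoder) receive gradients; the LM stays frozen.
adapter = Adapter()
frozen_block_output = torch.randn(2, 16, 768)
adapted = adapter(frozen_block_output)
trainable = sum(p.numel() for p in adapter.parameters())
print(adapted.shape, trainable)   # (2, 16, 768), roughly 100k trainable parameters per adapter
```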
In recent years, we have witnessed significant performance boosts on the image captioning task based on vision-language pre-training (VLP). Scale is believed to be an important factor in this advance. However, most existing work only focuses on pre-training transformers of moderate size (e.g., 12 or 24 layers) on roughly 4 million images. In this paper, we present LEMON, a large-scale image captioner, and provide the first empirical study on the scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL model as our reference model, which consists of an image feature extractor and a transformer model, and scale the transformer both up and down, with model sizes ranging from 13 to 675 million parameters. On the data side, we experiment with up to 200 million image-text pairs automatically collected from the web based on the alt attribute of images (dubbed ALT200M). Extensive analysis helps to characterize the performance trend as the model size and the pre-training data size increase. We also compare different training recipes, especially for training on large-scale noisy data. As a result, LEMON achieves new state of the art on several major image captioning benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also show that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.
Recent years have witnessed the rapid progress of image captioning. However, the demands for large memory storage and heavy computational burden prevent these captioning models from being deployed on mobile devices. The main obstacles lie in the heavyweight visual feature extractors (i.e., object detectors) and complicated cross-modal fusion networks. To this end, we propose LightCap, a lightweight image captioner for resource-limited devices. The core design is built on the recent CLIP model for efficient image captioning. To be specific, on the one hand, we leverage the CLIP model to extract the compact grid features without relying on the time-consuming object detectors. On the other hand, we transfer the image-text retrieval design of CLIP to image captioning scenarios by devising a novel visual concept extractor and a cross-modal modulator. We further optimize the cross-modal fusion model and parallel prediction heads via sequential and ensemble distillations. With the carefully designed architecture, our model merely contains 40M parameters, saving the model size by more than 75% and the FLOPs by more than 98% in comparison with the current state-of-the-art methods. In spite of the low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on COCO Karpathy test split. Testing on the smartphone with only a single CPU, the proposed LightCap exhibits a fast inference speed of 188ms per image, which is ready for practical applications.
Novel object captioning (NOC) aims to describe images containing objects without observing their ground-truth captions during training. Due to the absence of caption annotations, captioning models cannot be directly optimized via sequence-to-sequence training or CIDEr optimization. As a result, we present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which optimizes the output captions via paraphrasing. With P2C, the captioning model first learns paraphrasing from a language model pre-trained on text-only corpora, expanding the word bank to improve linguistic fluency. To further enforce that the output captions sufficiently describe the visual content of the input image, we perform self-paraphrasing on the captioning model with the introduced fidelity and adequacy objectives. Since no ground-truth captions are available for novel object images during training, our P2C leverages cross-modality (image-text) association modules to ensure that the above caption characteristics can be properly preserved. In the experiments, we not only show that our P2C achieves state-of-the-art performance on the nocaps and COCO Caption datasets, but also verify the effectiveness and flexibility of our learning framework by replacing the language and cross-modality association models for NOC. Implementation details and code are available in the supplementary materials.
Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether this makes it possible to learn those skills from text data and then use them to complete vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study a variety of strategies to mitigate this concern. We produce models using only text training data on three tasks: image captioning, visual entailment and visual question answering, and evaluate them on standard benchmarks using images. We find that this kind of transfer is possible and results in only a small drop in performance relative to models trained on images. We also showcase a variety of stylistic image captioning models that were trained using no image data and no human-curated language data, but instead text data from books, the web, or language models.
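The abstract above notes systematic differences between the image and text embedding spaces of contrastive models and studies mitigation strategies. As a hedged illustration, the sketch below shows one commonly used mitigation for this "modality gap" when training a decoder on text embeddings only (adding noise and/or removing a mean offset); the specific strategies analyzed in the paper may differ.

```python
import torch

def text_embedding_with_gap_mitigation(text_emb, noise_std=0.04, mean_shift=None):
    """Prepare CLIP *text* embeddings for a decoder that must accept CLIP *image*
    embeddings at test time. Noise injection and mean-shift removal are common,
    illustrative mitigations for the modality gap."""
    emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    if mean_shift is not None:                  # optional shift toward the image-embedding region
        emb = emb - mean_shift
    emb = emb + noise_std * torch.randn_like(emb)
    return emb / emb.norm(dim=-1, keepdim=True)

text_emb = torch.randn(8, 512)                  # CLIP text embeddings of training captions
robust_emb = text_embedding_with_gap_mitigation(text_emb)
print(robust_emb.shape)                         # (8, 512), fed to the decoder in place of image features
```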
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be finetuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
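The unified VLP model above controls its two pre-training tasks purely through self-attention masks on the shared transformer. A small sketch of the two mask patterns is below; the token ordering ([image tokens; text tokens]) and boolean convention are illustrative, with details following the paper.

```python
import torch

def vlp_attention_mask(n_img, n_txt, seq2seq):
    """Self-attention mask over [image tokens; text tokens]; True = attention allowed.
    Bidirectional: all tokens attend to all tokens.
    Seq2seq: image tokens attend only to image tokens; text tokens attend to all image
    tokens and to the current and previous text tokens (causal)."""
    n = n_img + n_txt
    if not seq2seq:
        return torch.ones(n, n, dtype=torch.bool)
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_img, :n_img] = True                                                    # image -> image
    mask[n_img:, :n_img] = True                                                    # text  -> image
    mask[n_img:, n_img:] = torch.tril(torch.ones(n_txt, n_txt, dtype=torch.bool))  # causal text
    return mask

print(vlp_attention_mask(n_img=3, n_txt=4, seq2seq=True).int())
```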
Great progress has been made in recent years in developing better image captioning models, yet most of them rely on a separate object detector to extract regional features. Recent vision-language studies are shifting toward a detector-free trend by leveraging grid representations for more flexible model training and faster inference speed. However, such development has primarily focused on image understanding tasks and remains less investigated for the caption generation task. In this paper, we are concerned with a better detector-free image captioning model and propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features. To improve performance, we introduce a novel Concept Token Network (CTN) to predict semantic concepts and then incorporate them into the end-to-end captioning. In particular, the CTN is built on a vision transformer and is designed to predict concept tokens through a classification task, whose rich semantic information greatly benefits the captioning task. Compared with previous detector-based models, ViTCAP drastically simplifies the architecture while achieving competitive performance on various challenging image captioning datasets. In particular, ViTCAP reaches a 138.1 CIDEr score on the COCO-Caption Karpathy split, and 93.8 and 108.6 CIDEr scores on nocaps and the Google-CC captioning dataset, respectively.
Humans exploit prior knowledge to describe images, and are able to adapt their explanation to specific contextual information, even to the extent of inventing plausible explanations when contextual information and images do not match. In this work, we propose the novel task of captioning Wikipedia images by integrating contextual knowledge. Specifically, we produce models that jointly reason over Wikipedia articles, Wikimedia images, and their associated descriptions to produce contextualized captions. In particular, a similar Wikimedia image can be used to illustrate different articles, and the produced caption needs to be adapted to the specific context, therefore allowing us to explore the limits of a model's ability to adjust captions to different contextual information. A particularly challenging issue in this domain is dealing with out-of-dictionary words and named entities. To address this, we propose a pre-training objective, Masked Named Entity Modeling (MNEM), and show that this pretext task yields an improvement compared to baseline models. Furthermore, we verify that a model pre-trained with the MNEM objective on Wikipedia generalizes well to a news captioning dataset. Additionally, we define two different test splits according to the difficulty of the captioning task. We offer insights on the role and the importance of each modality and highlight the limitations of our model. The code, models, and data splits will be made publicly available upon acceptance.
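A hedged sketch of a Masked Named Entity Modeling-style corruption step as described above: tokens inside named-entity spans are preferentially masked, and the model is trained to recover them. The NER spans are assumed to come from an external tagger, and the masking probability is illustrative.

```python
import random

def mask_named_entities(tokens, entity_spans, mask_token="[MASK]", mask_prob=0.8):
    """Preferentially mask tokens belonging to named-entity spans and return the
    corrupted sequence plus the recovery targets."""
    corrupted, targets = list(tokens), [None] * len(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            if random.random() < mask_prob:
                targets[i] = tokens[i]
                corrupted[i] = mask_token
    return corrupted, targets

tokens = "the cathedral of seville was completed in 1507".split()
entity_spans = [(3, 4), (7, 8)]          # spans for "seville" and "1507", e.g. from an NER tagger
print(mask_named_entities(tokens, entity_spans))
```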
For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to help encode the video content, and a task-oriented network is trained from scratch to cope with caption generation. Comparing INP with the recently proposed CLIP (Contrastive Language-Image Pre-training), this paper investigates the potential deficiencies of INP for video captioning and explores what is key to generating accurate descriptions. Specifically, our empirical study of INP vs. CLIP shows that INP makes it hard for video captioning models to capture the semantics of attributes and leaves them sensitive to irrelevant background information. By contrast, CLIP's significant boost in caption quality highlights the importance of attribute-aware representation learning. We are thus motivated to introduce Dual Attribute Prediction, an auxiliary task requiring a video captioning model to learn the correspondence between video content and attributes as well as the co-occurrence relations between attributes. Extensive experiments on benchmark datasets demonstrate that our approach enables better learning of attribute-aware representations, bringing consistent improvements to models with different architectures and decoding algorithms.
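To make the auxiliary task above more concrete, here is a minimal sketch with two prediction heads: one maps the video representation to attribute logits (content-attribute correspondence) and one maps observed attributes to co-occurring attributes. The layer shapes, attribute vocabulary size, and loss are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class DualAttributePrediction(nn.Module):
    """Illustrative dual heads: video -> attributes and attributes -> co-occurring attributes."""
    def __init__(self, video_dim=512, num_attributes=400):
        super().__init__()
        self.video_to_attr = nn.Linear(video_dim, num_attributes)      # content -> attributes
        self.attr_to_attr = nn.Linear(num_attributes, num_attributes)  # attributes -> co-occurring attributes

    def forward(self, video_feat, observed_attr):
        return self.video_to_attr(video_feat), self.attr_to_attr(observed_attr)

head = DualAttributePrediction()
video_feat = torch.randn(2, 512)                         # pooled features of a video clip
observed = torch.zeros(2, 400)
observed[:, [3, 57]] = 1.0                               # multi-hot ground-truth attributes
logits_v, logits_a = head(video_feat, observed)
loss = nn.functional.binary_cross_entropy_with_logits(logits_v, observed)   # multi-label objective
print(logits_v.shape, logits_a.shape, loss.item())
```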
This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR [21], and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. Code, models and pre-extracted features are released at https://github.com/pzzhang/VinVL.