作为人类,我们通过我们所有的感官来驾驭世界,使用每个人从每个人纠正其他人。我们介绍了Merlot Reserve,一个模型,该模型是联合随着时间的推移而表示视频的模型 - 通过从音频,字幕和视频帧学习的新培训目标。给出了一个视频,我们用掩模令牌替换文本和音频的片段;该模型通过选择正确的蒙版片段来学习。我们的目标比替代方面更快地学习,并在规模上表现良好:我们预先逼近2000万YouTube视频。经验结果表明,Merlot Reserve学会通过所有组成模式的视频的强烈陈述。在FineTuned时,它在VCR和TVQA上为VCR和TVQA进行了新的最先进,优先于前勤工作分别为5%和7%。消融表明,两个任务都受益于音频预制 - 甚至录像机,围绕图像中心的QA任务(没有声音)。此外,我们的客观使开箱即用的预测,揭示了强大的多式联合致辞理解。在一个完全零拍摄的环境中,我们的模型在四个视频理解任务中获得竞争结果,甚至优于最近提出的定位推理(星)基准的监督方法。我们分析为什么包含音频导致更好的视觉语言表示,这表明未来研究的重要机会。我们通过讨论多式联运预测的道德和社会影响来得出结论。
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
translated by 谷歌翻译
为了使AI安全地在医院,学校和工作场所等现实世界中安全部署,它必须能够坚定地理解物理世界。这种推理的基础是物理常识:了解可用对象的物理特性和提供的能力,如何被操纵以及它们如何与其他对象进行交互。物理常识性推理从根本上是一项多感官任务,因为物理特性是通过多种模式表现出来的,其中两个是视觉和声学。我们的论文通过贡献PACS来朝着现实世界中的物理常识推理:第一个用于物理常识属性注释的视听基准。 PACS包含13,400对答案对,涉及1,377个独特的物理常识性问题和1,526个视频。我们的数据集提供了新的机会来通过将音频作为此多模式问题的核心组成部分来推进物理推理的研究领域。使用PACS,我们在我们的新挑战性任务上评估了多种最先进的模型。尽管某些模型显示出令人鼓舞的结果(精度为70%),但它们都没有人类的绩效(精度为95%)。我们通过证明多模式推理的重要性并为未来的研究提供了可能的途径来结束本文。
translated by 谷歌翻译
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and highquality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%).To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
translated by 谷歌翻译
在这项工作中,我们介绍了无文本视觉语言变压器(TVLT),其中均匀的变压器块使用最小的模态设计进行视觉和语言表示的原始视觉和音频输入,并且不使用特定于文本的模块,例如作为令牌化或自动语音识别(ASR)。 TVLT通过重建连续视频帧和音频谱图(蒙版自动编码)和对比度建模以使视频和音频对比度建模进行训练。 TVLT在各种多模式任务上的性能与其基于文本的对应物相当,例如视觉询问,图像检索,视频检索和多模式情感分析,具有28倍的推理速度和仅1/3参数。我们的发现表明,从低级视觉和音频信号中学习紧凑,有效的视觉语言表示的可能性,而无需假设文本的先前存在。我们的代码和检查点可在以下网址找到:https://github.com/zinengtang/tvlt
translated by 谷歌翻译
视频问题回答(VideoQA)是一项复杂的任务,需要多种模式数据进行培训。但是,对视频的问题和答案的手动注释是乏味的,禁止可扩展性。为了解决这个问题,最近的方法考虑了零拍设置,而无需手动注释视觉问题。特别是,一种有前途的方法调整了在网络级文本数据中预测的冻结自回归语言模型,以适应多模式输入。相比之下,我们在这里建立在冷冻双向语言模型(BILM)的基础上,并表明这种方法为零拍出的VideoQA提供了更强大,更便宜的替代方案。特别是(i)我们使用轻型训练模块将视觉输入与冷冻的BILM结合在一起,(ii)我们使用Web-Scrafe Multi-Mododal数据训练此类模块,最后(iii)我们通过掩盖语言执行零声录像带推断建模,其中蒙版文本是给定问题的答案。我们提出的方法Frozenbilm在零摄影的视频中的表现优于最高的,包括LSMDC-FIB,包括LSMDC-FIB,IVQA,MSRVTT-QA,MSVD-QA,ActivityNet-QA,TGIF-FRAMEQA,TGIF-FRAMEQA,,TGIF-FRAMEQA,,TGIF-FRAMEQA,,,MSRVTT-QA,MSRVTT-QA,MSRVTT-QA,MSRVTT-QA,MSRVTT-QA,,均优于最新技术。 How2QA和TVQA。它还在几次且完全监督的环境中展示了竞争性能。我们的代码和模型将在https://antoyang.github.io/frozenbilm.html上公开提供。
translated by 谷歌翻译
视频语言(VIDL)建模的巨大挑战在于从图像/视频理解模型和下游Vidl数据中提取的固定视频表示之间的断开。最近的研究试图通过端到端培训来减轻这种断开连接。为了使其进行计算可行,先前的作品倾向于“想象”视频输入,即,将一些稀疏的采样帧馈送到2D CNN中,然后是简单的均值汇集或连接以获得整体视频表示。虽然实现了有希望的结果,但这种简单的方法可能会失去对于执行下游VIDL任务至关重要的时间信息。在这项工作中,我们呈现紫罗兰色,全新的视频语言变压器,采用视频变压器,明确地模拟视频输入的时间动态。此外,与以前的研究不同,发现视频输入上的预训练任务(例如,屏蔽帧建模)不是非常有效的,我们设计了一个新的预训练任务,屏蔽了视觉令牌建模(MVM),以获得更好的视频建模。具体地,原始视频帧修补程序将“令牌化”转换为离散的视觉令牌,目标是基于蒙面的贴片恢复原始的视觉令牌。综合分析展示了通过视频变压器和MVM显式时间建模的有效性。因此,紫罗兰在5个视频问题的回答任务和4个文本到视频检索任务中实现了新的最先进的性能。
translated by 谷歌翻译
我们研究了联合视频和语言(VL)预培训,以实现跨模型学习和益处丰富的下游VL任务。现有的作品要么提取低质量的视频特征或学习有限的文本嵌入,但忽略了高分辨率视频和多样化的语义可以显着提高跨模型学习。在本文中,我们提出了一种新的高分辨率和多样化的视频 - 语言预训练模型(HD-VILA),用于许多可视任务。特别是,我们收集具有两个不同属性的大型数据集:1)第一个高分辨率数据集包括371.5k小时的720p视频,2)最多样化的数据集涵盖15个流行的YouTube类别。为了启用VL预培训,我们通过学习丰富的时空特征的混合变压器联合优化HD-VILA模型,以及多峰变压器,用于强制学习视频功能与多样化文本的交互。我们的预训练模式实现了新的最先进的导致10 VL了解任务和2个新颖的文本到视觉生成任务。例如,我们以零拍摄MSR-VTT文本到视频检索任务的相对增加38.5%R @ 1的相对增长,高分辨率数据集LSMDC为53.6%。学习的VL嵌入也有效地在文本到视觉操纵和超分辨率任务中产生视觉上令人愉悦和语义相关结果。
translated by 谷歌翻译
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on EGTEA classification and 5.9% Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.
translated by 谷歌翻译
探索大规模预处理的基础模型对计算机视觉具有重大兴趣,因为这些模型可以快速转移到许多下游任务中。本文介绍了对比字幕(COCA),这是一种极简主义的设计,旨在为图像文本编码器编码器基础模型预算与对比度损失和字幕损失,从而从剪辑和诸如simvlm之类的生成方法之类的对比方法中包含模型能力。与所有解码器层都参与编码器输出的标准编码器 - 模块变压器相反,可口可乐省略了解码器层的上半部分的交叉注意,以编码单峰文本表示,并串联到剩余的解码器层,这些解码器与图像编码器相交的解码器层多模式图像文本表示。除了对多模态解码器输出的字幕损失外,我们还应用了单峰图像和文本嵌入之间的对比损失,该输出可以预测文本令牌自动加压。通过共享相同的计算图,可以用最小的开销有效地计算两个培训目标。可口可乐是端到端和从头开始的网络尺度alt-text数据和带注释的图像,通过将所有标签视为文本,无缝地统一自然语言监督以进行表示。从经验上讲,可口可乐通过零拍传输或在广泛的下游任务上进行零摄像转移或最少的特定任务适应,跨越视觉识别(Imagenet,Kinetics-400/600/700,瞬间, ),交叉模式检索(MSCOCO,FLICKR30K,MSR-VTT),多模式理解(VQA,SNLI-VE,NLVR2)和图像字幕(MSCOCO,NOCAPS)。值得注意的是,在Imagenet分类方面,COCA获得了86.3%的TOP-1准确性,带有冷冻编码器和学习的分类头90.6%,以及带有填充编码器的Imagenet上的新最先进的91.0%Top-1 Top-1精度。
translated by 谷歌翻译
随着图像文本对的大量数据以及视觉和语言(V&L)任务的多样性,学者在该研究领域引入了大量的深度学习模型。此外,近年来,转移学习还显示出在计算机愿景中的巨大成功,例如图像分类,对象检测等以及在自然语言处理中以进行问答,机器翻译等的自然语言处理。继承转移学习的精神, V&L的研究工作已经在大规模数据集上设计了多种预训练技术,以增强下游任务的性能。本文的目的是提供当代V&L预审前模型的全面修订。特别是,我们对预处理的方法进行了分类和描述,以及最先进的视觉和语言预训练模型的摘要。此外,还提供了培训数据集和下游任务的列表,以进一步提高V&L预处理的观点。最后,我们决定采取进一步的一步,讨论众多未来研究的方向。
translated by 谷歌翻译
近年来,统一的视觉语言框架已经大大提高,其中大多数采用编码器架构将图像文本任务统一为序列到序列的生成。但是,现有的视频语言(VIDL)模型仍需要在每个任务的模型体系结构和培训目标中进行特定于任务的设计。在这项工作中,我们探索了一个统一的VIDL框架薰衣草,其中蒙版语言建模(MLM)用作所有前训练和下游任务的常见接口。这样的统一导致了简化的模型体系结构,在多模式编码器之上,只需要一个轻巧的MLM头,而不是具有更多参数的解码器。令人惊讶的是,实验结果表明,这个统一的框架在14个VIDL基准测试中实现了竞争性能,涵盖了视频问答,文本到视频检索和视频字幕。广泛的分析进一步证明了薰衣草比现有VIDL方法的优势:(i)在多任务列出时仅使用一组参数值支持所有下游任务; (ii)对各种下游任务的几乎没有概括; (iii)在视频问题回答任务上启用零射门评估。代码可从https://github.com/microsoft/lavender获得。
translated by 谷歌翻译
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
translated by 谷歌翻译
连接视觉和语言在生成智能中起着重要作用。因此,已经致力于图像标题的大型研究工作,即用句法和语义有意义的句子描述图像。从2015年开始,该任务通常通过由Visual Encoder组成的管道和文本生成的语言模型来解决任务。在这些年来,两种组件通过对象区域,属性,介绍多模态连接,完全关注方法和伯特早期融合策略的利用而显着发展。但是,无论令人印象深刻的结果,图像标题的研究还没有达到结论性答案。这项工作旨在提供图像标题方法的全面概述,从视觉编码和文本生成到培训策略,数据集和评估度量。在这方面,我们量化地比较了许多相关的最先进的方法来确定架构和培训策略中最有影响力的技术创新。此外,讨论了问题的许多变体及其开放挑战。这项工作的最终目标是作为理解现有文献的工具,并突出显示计算机视觉和自然语言处理的研究领域的未来方向可以找到最佳的协同作用。
translated by 谷歌翻译
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
translated by 谷歌翻译
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use VideoBERT in numerous tasks, including action classification and video captioning. We show that it can be applied directly to openvocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state-of-theart on video captioning, and quantitative results verify that the model learns high-level semantic features.
translated by 谷歌翻译
可以代表和描述环境声音的机器具有实际潜力,例如,用于音频标记和标题系统。普遍的学习范式已经依赖于并行音频文本数据,但是,Web上几乎没有可用。我们提出了vip-ant,它在不使用任何并行音频文本数据的情况下诱导\ textbf {a} udio- \ textBF {t} EXT对齐。我们的主要思想是在双模形图像文本表示和双模态图像 - 音频表示之间共享图像模型;图像模态用作枢轴,并将音频和文本连接在三模态嵌入空间中。在没有配对的音频文本数据的困难零拍设置中,我们的模型在ESC50和US8K音频分类任务上展示了最先进的零点性能,甚至超过了披肩标题的领域的监督状态检索(带音频查询)2.2 \%R @ 1。我们进一步调查了最小音频监控的情况,发现,例如,只有几百个监督的音频文本对将零拍音频分类精度提高8 \%US8K。然而,为了匹配人类奇偶校验,我们的经验缩放实验表明我们需要大约2米$ 2 ^ {21} \约2M $监督的音频标题对。我们的工作开辟了新的途径,用于学习音频文本连接,几乎没有并行音频文本数据。
translated by 谷歌翻译
Timeyou have a little pressure you are cutting the wood readjusting the table saw I am using a roller sure you applied glue Figure 1: We describe an efficient approach to learn visual representations from misaligned and noisy narrations (bottom) automatically extracted from instructional videos (top). Our video representations are learnt from scratch without relying on any manually annotated visual dataset yet outperform all self-supervised and many fully-supervised methods on several video recognition benchmarks.
translated by 谷歌翻译
本文提出了一种对比调整,这是一种简单的方法,采用对比训练来对准图像和文本模型,同时仍然利用他们的预训练。在我们的实证研究中,我们发现,锁定的预训练图像模型与解锁文本模型最佳。我们调用这种对比调整“锁定图像文本调整”(LIT TOONING)的实例,该实例仅教导文本模型,从预先训练的图像模型中读出了良好的表示新任务。亮度调谐模型将零拍摄传输到新视觉任务的能力提高,例如图像分类或检索。建议的亮度调整是广泛适用的;它可以使用三种不同的图像文本数据集可靠地使用多种预训练方法(监督和无监督)和多种架构(Reset,Vision变换器和MLP-MILLER)。利用基于变压器的预训练VIT-G / 14型号,LIT调谐模型在想象网测试集中实现了84.5%的零射频传输精度,并且在充满挑战的分发ObjectNet测试集中实现了81.1%。
translated by 谷歌翻译
The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the use of human-generated annotations leads to models with biased learning and poor domain generalization and robustness. As an alternative, self-supervised learning provides a way for representation learning which does not require annotations and has shown promise in both image and video domains. Different from the image domain, learning video representations are more challenging due to the temporal dimension, bringing in motion and other environmental dynamics. This also provides opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domain. In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain. We summarize these methods into four different categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and the potential future directions in this area.
translated by 谷歌翻译