Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains $\unicode{x2014}$ Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework to allow for the evaluation of arbitrary visual representation on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbone generally outperforms CNN-based backbone on G-VUE, (2) visual representations from vision-language pre-training are superior to those with vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems via obtaining more general-purpose visual representations.
translated by 谷歌翻译
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
translated by 谷歌翻译
在过去的几年中,训练前模型的出现将单峰领域(例如计算机视觉(CV)和自然语言处理(NLP))带到了一个新时代。实质性的作品表明它们对下游大学任务有益,并避免从头开始训练新的模型。那么,此类预训练的模型可以应用于多模式任务吗?研究人员探索了这个问题并取得了重大进展。本文调查了视觉预训练(VLP)的最新进展和新的前沿,包括图像文本和视频文本预训练。为了使读者更好地掌握VLP,我们首先从五个方面回顾了其最新进展:功能提取,模型体系结构,培训预训练目标,预训练数据集和下游任务。然后,我们详细概述了特定的VLP模型。最后,我们讨论了VLP中的新边界。据我们所知,这是对VLP的首次调查。我们希望这项调查能够阐明VLP领域的未来研究。
translated by 谷歌翻译
The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adpation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
translated by 谷歌翻译
Despite the superior performance brought by vision-and-language pretraining, it remains unclear whether learning with multi-modal data can help understand each individual modality. In this work, we investigate how language can help with visual representation learning from a probing perspective. Specifically, we compare vision-and-language and vision-only models by probing their visual representations on a broad range of tasks, in order to assess the quality of the learned representations in a fine-grained manner. Interestingly, our probing results suggest that vision-and-language models are better at label prediction tasks like object and attribute prediction, while vision-only models are stronger at dense prediction tasks that require more localized information. With further analysis using detailed metrics, our study suggests that language helps vision models learn better semantics, but not localization. Code is released at https://github.com/Lizw14/visual_probing.
translated by 谷歌翻译
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
translated by 谷歌翻译
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decodert takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
translated by 谷歌翻译
视觉问题应答(VQA)任务利用视觉图像和语言分析来回回答图像的文本问题。它是一个流行的研究课题,在过去十年中越来越多的现实应用。本文介绍了我们最近对AliceMind-MMU的研究(阿里巴巴的编码器 - 解码器来自Damo Academy - 多媒体理解的机器智能实验室),其比人类在VQA上获得相似甚至略微更好的结果。这是通过系统地改善VQA流水线来实现的,包括:(1)具有全面的视觉和文本特征表示的预培训; (2)与学习参加的有效跨模型互动; (3)一个新颖的知识挖掘框架,具有专门的专业专家模块,适用于复杂的VQA任务。处理不同类型的视觉问题,需要具有相应的专业知识在提高我们的VQA架构的表现方面发挥着重要作用,这取决于人力水平。进行了广泛的实验和分析,以证明新的研究工作的有效性。
translated by 谷歌翻译
我们提出了Unified-io,该模型执行了跨越经典计算机视觉任务的各种AI任务,包括姿势估计,对象检测,深度估计和图像生成,视觉和语言任务,例如区域字幕和引用表达理解,并引用表达理解,进行自然语言处理任务,例如回答和释义。由于与每个任务有关的异质输入和输出,包括RGB图像,每个像素映射,二进制掩码,边界框和语言,开发一个统一模型引起了独特的挑战。我们通过将每个受支持的输入和输出均匀地均匀地统一到一系列离散的词汇令牌来实现这一统一。在所有任务中,这种共同的表示使我们能够在视觉和语言字段中的80多个不同数据集上培训单个基于变压器的体系结构。 Unified-io是第一个能够在砂砾基准上执行所有7个任务的模型,并在NYUV2-DEPTH,Imagenet,VQA2.0,OK-VQA,SWIG,SWIG,VIZWIZ,BOOLQ,BOOLQ和SCITAIL,带有NYUV2-DEPTH,Imagenet,VQA2.0,诸如NYUV2-DEPTH,ImageNet,vqa2.0等16个不同的基准中产生强大的结果。没有任务或基准特定的微调。 unified-io的演示可在https://unified-io.allenai.org上获得。
translated by 谷歌翻译
Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into $N\times N$ blocks, and identifies the objects in each block through the widely used object detector in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g. filling `P" or ``O" in aPTP ``The block P has a O". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for ViLT \cite{vilt} baseline, and COCO Captioning (+5.3 in CIDEr) for SOTA BLIP \cite{blip} baseline. Moreover, PTP achieves comparable results with object-detector based methods, and much faster inference speed since PTP discards its object detector for inference while the later cannot. Our code and pre-trained weight will be released at \url{https://github.com/sail-sg/ptp}.
translated by 谷歌翻译
随着图像文本对的大量数据以及视觉和语言(V&L)任务的多样性,学者在该研究领域引入了大量的深度学习模型。此外,近年来,转移学习还显示出在计算机愿景中的巨大成功,例如图像分类,对象检测等以及在自然语言处理中以进行问答,机器翻译等的自然语言处理。继承转移学习的精神, V&L的研究工作已经在大规模数据集上设计了多种预训练技术,以增强下游任务的性能。本文的目的是提供当代V&L预审前模型的全面修订。特别是,我们对预处理的方法进行了分类和描述,以及最先进的视觉和语言预训练模型的摘要。此外,还提供了培训数据集和下游任务的列表,以进一步提高V&L预处理的观点。最后,我们决定采取进一步的一步,讨论众多未来研究的方向。
translated by 谷歌翻译
变压器是一种基于关注的编码器解码器架构,彻底改变了自然语言处理领域。灵感来自这一重大成就,最近在将变形式架构调整到计算机视觉(CV)领域的一些开创性作品,这已经证明了他们对各种简历任务的有效性。依靠竞争力的建模能力,与现代卷积神经网络相比在本文中,我们已经为三百不同的视觉变压器进行了全面的审查,用于三个基本的CV任务(分类,检测和分割),提出了根据其动机,结构和使用情况组织这些方法的分类。 。由于培训设置和面向任务的差异,我们还在不同的配置上进行了评估了这些方法,以便于易于和直观的比较而不是各种基准。此外,我们已经揭示了一系列必不可少的,但可能使变压器能够从众多架构中脱颖而出,例如松弛的高级语义嵌入,以弥合视觉和顺序变压器之间的差距。最后,提出了三个未来的未来研究方向进行进一步投资。
translated by 谷歌翻译
本文介绍了Omnivl,这是一种新的基础模型,旨在使用一种通用体系结构来支持图像语言和视频语言任务。它为图像和视频输入采用了统一的基于变压器的视觉编码器,因此可以执行联合图像语言和视频语言预处理。我们首次证明了这样的范式受益于图像和视频任务,而不是传统的单向传输(例如,使用图像语言来帮助视频语言)。为此,我们提出了对图像语言和视频语言的脱钩关节预处理,以有效地将视觉模型分解为空间和时间维度,并在图像和视频任务上获得性能提升。此外,我们引入了一种新颖的统一视觉对比度(UNIVLC)损失,以利用图像文本,视频文本,图像标签(例如,图像分类),视频标签(例如,视频动作识别)在一起受到监督和吵闹的监督预处理数据都尽可能多地利用。无需额外的任务适配器,Omnivl可以同时支持仅视觉任务(例如,图像分类,视频操作识别),跨模式对齐任务(例如,图像/视频 - 文本检索)和多模式理解和生成任务(例如,图像/视频问答,字幕)。我们在各种下游任务上评估Omnivl,并以相似的模型大小和数据量表获得最新的或竞争结果。
translated by 谷歌翻译
Vision-Language Transformers can be learned without human labels (e.g. class labels, bounding boxes, etc). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Vision-Language from Captions (VLC) built on top of Masked Auto-Encoders that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer which is pretrained with supervised object classification, and our model, VLC, we find that our approach 1. outperforms ViLT on standard benchmarks, 2. provides more interpretable and intuitive patch visualizations, and 3. is competitive with many larger models that utilize ROIs trained on annotated bounding-boxes.
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
translated by 谷歌翻译
We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT) which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing visionlanguage models to fully understand the semantics from paired natural languages. In this paper, we propose SOHO to "See Out of tHe bOx" that takes a whole image as input, and learns vision-language representation in an endto-end manner. SOHO does not require bounding box annotations which enables inference 10 times faster than regionbased approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks by following standard VLPT settings. In particular, SOHO achieves absolute gains of 2.0% R@1 score on MSCOCO text retrieval 5k test split, 1.5% accuracy on NLVR 2 test-P split, 6.7% accuracy on SNLI-VE test split, respectively.
translated by 谷歌翻译
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific modelsachieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.Preprint. Under review.
translated by 谷歌翻译
最近的进展表明,使用对比图像文本对的大规模预训练可以是从自然语言监督的高质量视觉表演学习的有前途的替代方案。从更广泛的监督来源受益,这种新的范例展示了对下游分类任务和数据集的令人印象深刻的可转移性。然而,从图像文本对中学习的知识转移到更复杂的密集预测任务的问题几乎没有访问过。在这项工作中,我们通过隐式和明确地利用来自剪辑的预先训练的知识来提出了一种新的密集预测框架。具体地,我们将剪辑中的原始图像文本匹配问题转换为像素文本匹配问题,并使用像素文本分数图来指导致密预测模型的学习。通过进一步使用图像中的上下文信息来提示语言模型,我们能够促进我们的模型来更好地利用预先接受训练的知识。我们的方法是模型 - 不可行的,它可以应用于任意密集的预测系统和各种预先训练的视觉底座,包括夹模型和想象成预先训练的模型。广泛的实验证明了我们对语义分割,对象检测和实例分段任务的方法的卓越性能。代码可在https://github.com/raoyongming/denseclip获得
translated by 谷歌翻译
我们提出了GLIPV2,这是一个接地的VL理解模型,该模型既服务于本地化任务(例如,对象检测,实例分割)和视觉语言(VL)理解任务(例如VQA,图像字幕)。 GLIPV2优雅地将本地化预训练和视觉语言预训练(VLP)具有三个预训练任务:短语接地作为对检测任务的VL重新重新制定,区域词对比度学习作为新型的区域词对比度对比度对比学习任务,以及蒙面的语言建模。这种统一不仅简化了先前的多阶段VLP程序,而且还可以在本地化和理解任务之间实现相互利益。实验结果表明,在各种本地化和理解任务上,单个GLIPV2模型(所有模型权重)在SOTA性能附近实现。该模型还显示了(1)在开放式摄制对象检测任务上进行的强零射击和很少的自适应性能,以及(2)VL理解任务上的卓越接地能力。代码将在https://github.com/microsoft/glip上发布。
translated by 谷歌翻译
探索大规模预处理的基础模型对计算机视觉具有重大兴趣,因为这些模型可以快速转移到许多下游任务中。本文介绍了对比字幕(COCA),这是一种极简主义的设计,旨在为图像文本编码器编码器基础模型预算与对比度损失和字幕损失,从而从剪辑和诸如simvlm之类的生成方法之类的对比方法中包含模型能力。与所有解码器层都参与编码器输出的标准编码器 - 模块变压器相反,可口可乐省略了解码器层的上半部分的交叉注意,以编码单峰文本表示,并串联到剩余的解码器层,这些解码器与图像编码器相交的解码器层多模式图像文本表示。除了对多模态解码器输出的字幕损失外,我们还应用了单峰图像和文本嵌入之间的对比损失,该输出可以预测文本令牌自动加压。通过共享相同的计算图,可以用最小的开销有效地计算两个培训目标。可口可乐是端到端和从头开始的网络尺度alt-text数据和带注释的图像,通过将所有标签视为文本,无缝地统一自然语言监督以进行表示。从经验上讲,可口可乐通过零拍传输或在广泛的下游任务上进行零摄像转移或最少的特定任务适应,跨越视觉识别(Imagenet,Kinetics-400/600/700,瞬间, ),交叉模式检索(MSCOCO,FLICKR30K,MSR-VTT),多模式理解(VQA,SNLI-VE,NLVR2)和图像字幕(MSCOCO,NOCAPS)。值得注意的是,在Imagenet分类方面,COCA获得了86.3%的TOP-1准确性,带有冷冻编码器和学习的分类头90.6%,以及带有填充编码器的Imagenet上的新最先进的91.0%Top-1 Top-1精度。
translated by 谷歌翻译