In-context learning, as a new paradigm in NLP, allows a model to rapidly adapt to various tasks with only a handful of prompts and examples. In computer vision, however, in-context learning is difficult because tasks vary significantly in their output representations, so it is unclear how to define general-purpose task prompts that a vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution, that is, to redefine the output of core vision tasks as images and to specify task prompts as images as well. With this idea, our training process is extremely simple: it performs standard masked image modeling on the stitch of input and output image pairs. This makes the model capable of performing tasks conditioned on visible image patches. Thus, during inference, we can adopt a pair of input and output images from the same task as the input condition, to indicate which task to perform. Without bells and whistles, our generalist Painter achieves competitive performance compared to well-established task-specific models on seven representative vision tasks ranging from high-level visual understanding to low-level image processing. Painter significantly outperforms recent generalist models on several challenging tasks. Surprisingly, our model shows the capability to complete out-of-domain tasks that do not exist in the training data, such as open-category keypoint detection and object segmentation, validating the powerful task transferability of in-context learning.
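As a rough illustration of the stitching idea in the Painter abstract above, the following sketch builds one canvas from a prompt pair and a query image and marks the region the model would have to paint. The 2x2 layout, the `stitch_canvas` name, and the nested-list image type are assumptions made for this example, not details confirmed by the abstract; all images are assumed to share the same size.

```python
from typing import List, Tuple

Image = List[List[float]]  # illustrative H x W single-channel image


def stitch_canvas(prompt_in: Image, prompt_out: Image,
                  query_in: Image, fill: float = 0.0) -> Tuple[Image, List[List[int]]]:
    """Stitch a prompt pair and a query into one canvas (hypothetical 2x2 layout).

    Top row:    prompt input | prompt output   (both visible)
    Bottom row: query input  | masked region   (to be predicted)
    Returns the canvas and a mask with 1 where pixels are hidden.
    """
    h, w = len(query_in), len(query_in[0])
    # Top row: the example pair tells the model which task to perform.
    top = [row_in + row_out for row_in, row_out in zip(prompt_in, prompt_out)]
    # Bottom row: the query input followed by a placeholder for its output.
    bottom = [row + [fill] * w for row in query_in]
    canvas = top + bottom
    mask = [[0] * (2 * w) for _ in range(h)] + [[0] * w + [1] * w for _ in range(h)]
    return canvas, mask


prompt_in = [[0.1, 0.2], [0.3, 0.4]]
prompt_out = [[1.0, 0.0], [0.0, 1.0]]
query_in = [[0.5, 0.6], [0.7, 0.8]]
canvas, mask = stitch_canvas(prompt_in, prompt_out, query_in)
```

During training, a masked-image-modeling loss would be computed on masked pixels; at inference, the prediction for the masked quadrant is read off as the task output.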
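While language tasks are naturally expressed in a single, unified modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work, we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite that, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as the task description, and the sequence output adapts to the prompt so that it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.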
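To make the "output as a sequence of discrete tokens" idea above concrete, here is a minimal sketch that turns one bounding box and its class into integer tokens. The bin count, the coordinate order, the way class tokens are offset past the coordinate bins, and the function names are all assumptions chosen for illustration, not the published tokenization.

```python
from typing import List, Sequence


def quantize(value: float, size: float, num_bins: int = 1000) -> int:
    """Map a coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value / size * (num_bins - 1) + 0.5)  # round to nearest bin
    return max(0, min(num_bins - 1, bin_index))


def box_to_tokens(box: Sequence[float], image_w: float, image_h: float,
                  class_id: int, num_bins: int = 1000) -> List[int]:
    """Encode (ymin, xmin, ymax, xmax, class) as five discrete tokens.

    Class tokens are placed after the coordinate bins (hypothetical vocabulary layout).
    """
    ymin, xmin, ymax, xmax = box
    coords = [
        quantize(ymin, image_h, num_bins),
        quantize(xmin, image_w, num_bins),
        quantize(ymax, image_h, num_bins),
        quantize(xmax, image_w, num_bins),
    ]
    return coords + [num_bins + class_id]


tokens = box_to_tokens((0, 0, 100, 200), image_w=200, image_h=100, class_id=3)
```

With a shared vocabulary of this kind, other outputs such as keypoints or polygon vertices could be serialized the same way, which is what lets a single sequence decoder and loss cover several tasks.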
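We propose Unified-IO, a model that performs a large variety of AI tasks, spanning classical computer vision tasks (including pose estimation, object detection, depth estimation, and image generation), vision-and-language tasks (such as region captioning and referring expression comprehension), and natural language processing tasks (such as question answering and paraphrasing). Developing a single unified model for such a variety of tasks poses unique challenges because of the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture jointly on over 80 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark, and it produces strong results across 16 diverse benchmarks such as NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWiz, BoolQ, and SciTail, with no task-specific or benchmark-specific fine-tuning. Demos for Unified-IO are available at https://unified-io.allenai.org.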
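In this paper, we tackle a new computer vision task, open-vocabulary panoptic segmentation, which aims to perform panoptic segmentation (background semantic labeling + foreground instance segmentation) for arbitrary categories given by text-based descriptions. We first build a baseline method that uses the knowledge in an existing CLIP model without fine-tuning or distillation. We then develop a new method, MaskCLIP, a Transformer-based approach that uses mask queries with a ViT-based CLIP backbone to perform semantic segmentation and object instance segmentation. Here we design a Relative Mask Attention (RMA) module to account for segmentations as additional tokens to the ViT CLIP model. MaskCLIP learns to use pre-trained dense/local CLIP features efficiently and effectively by avoiding the time-consuming operation of cropping image patches and computing features with an external CLIP image model. We obtain encouraging results for open-vocabulary panoptic segmentation and report state-of-the-art results. We show qualitative illustrations of MaskCLIP with custom categories.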
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences, as compared to recurrent networks, e.g., long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text, and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of Transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of Transformer models in computer vision.
In this work, we introduce Panoptic-DeepLab, a simple, strong, and fast system for panoptic segmentation, aiming to establish a solid baseline for bottom-up methods that can achieve performance comparable to two-stage methods while yielding fast inference speed. In particular, Panoptic-DeepLab adopts dual-ASPP and dual-decoder structures specific to semantic and instance segmentation, respectively. The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression. As a result, our single Panoptic-DeepLab simultaneously ranks first on all three Cityscapes benchmarks, setting the new state of the art of 84.2% mIoU, 39.0% AP, and 65.5% PQ on the test set. Additionally, equipped with MobileNetV3, Panoptic-DeepLab runs nearly in real time with a single 1025 × 2049 image (15.8 frames per second), while achieving competitive performance on Cityscapes (54.1% PQ on the test set). On the Mapillary Vistas test set, our ensemble of six models attains 42.7% PQ, outperforming the challenge winner in 2018 by a healthy margin of 1.5%. Finally, our Panoptic-DeepLab also performs on par with several top-down approaches on the challenging COCO dataset. For the first time, we demonstrate that a bottom-up approach can deliver state-of-the-art results on panoptic segmentation.
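The class-agnostic "instance center regression" mentioned above can be illustrated with a small grouping sketch: each foreground pixel predicts an offset to its instance center and is assigned to the nearest detected center. The data layout (a dict of per-pixel offsets), the `group_pixels` name, and the toy values are assumptions for this example; the abstract itself does not spell out the grouping step.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]
Point = Tuple[float, float]


def group_pixels(centers: List[Point], offsets: Dict[Pixel, Point]) -> Dict[Pixel, int]:
    """Assign each pixel to the center closest to (pixel + predicted offset).

    centers: detected instance centers as (y, x).
    offsets: maps a pixel (y, x) to its predicted (dy, dx) toward its center.
    Returns a map from pixel to instance id (index into `centers`).
    """
    assignment = {}
    for (y, x), (dy, dx) in offsets.items():
        py, px = y + dy, x + dx  # where this pixel believes its center is
        assignment[(y, x)] = min(
            range(len(centers)),
            key=lambda k: (centers[k][0] - py) ** 2 + (centers[k][1] - px) ** 2,
        )
    return assignment


centers = [(1.0, 1.0), (5.0, 5.0)]
offsets = {(0, 0): (1.0, 1.0), (6, 6): (-1.0, -1.0), (2, 1): (-0.8, 0.1)}
instance_ids = group_pixels(centers, offsets)
```

Because the grouping never looks at class labels, the semantic branch can supply the category separately, for example by a vote over the pixels of each instance.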
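This paper presents SimMIM, a simple framework for masked image modeling. We simplify recently proposed related approaches by dropping special designs such as block-wise masking and tokenization via discrete VAE or clustering. To study what lets the masked image modeling task learn good representations, we systematically study the major components of our framework and find that a simple design for each component yields very strong representation learning performance: 1) random masking of the input image with a moderately large masked patch size (e.g., 32) makes a strong pretext task; 2) predicting raw pixels of RGB values by direct regression performs no worse than patch classification approaches with complex designs; 3) the prediction head can be as light as a linear layer, with no worse performance than heavier ones. Using ViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K by pre-training on this same dataset, surpassing the previous best approach by +0.6%. When applied to a larger model of about 650 million parameters, SwinV2-H, it achieves 87.1% top-1 accuracy on ImageNet-1K using only ImageNet-1K data. We also leverage this approach to facilitate the training of a 3B model (SwinV2-G), which, with 40× less data than in previous practice, achieves the state of the art on four representative vision benchmarks. The code and models will be publicly available at https://github.com/microsoft/simmim.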
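The first two findings above (random masking with fairly large patches, and direct regression of raw pixel values) can be sketched in a few lines. The 192-pixel input size, the 0.6 mask ratio, the L1 form of the regression loss, and the function names are assumptions for illustration; the abstract only states the patch size example of 32.

```python
import random
from typing import List


def random_patch_mask(image_size: int, patch_size: int,
                      mask_ratio: float, seed: int = 0) -> List[List[int]]:
    """Return a grid of 0/1 flags, one per patch, with 1 meaning 'masked'."""
    rng = random.Random(seed)
    grid = image_size // patch_size
    num_patches = grid * grid
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[1 if r * grid + c in masked else 0 for c in range(grid)]
            for r in range(grid)]


def masked_l1_loss(pred: List[float], target: List[float], mask: List[int]) -> float:
    """Mean absolute error computed only over masked positions."""
    errors = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    return sum(errors) / max(len(errors), 1)


patch_mask = random_patch_mask(image_size=192, patch_size=32, mask_ratio=0.6)
loss = masked_l1_loss([1.0, 2.0, 3.0], [1.5, 2.0, 5.0], [1, 0, 1])
```

In a real pipeline the patch mask would be upsampled to pixel resolution and the loss evaluated on image tensors; flat lists are used here only to keep the sketch dependency-free.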
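The recently proposed Depth-aware Video Panoptic Segmentation (DVPS) aims to predict panoptic segmentation results and depth maps in a video, which is a challenging scene understanding problem. In this paper, we present PolyphonicFormer, a vision transformer that unifies all the sub-tasks under the DVPS task. Our method explores the relationship between depth estimation and panoptic segmentation via query-based learning. In particular, we design three different queries: thing queries, stuff queries, and depth queries. We then propose to learn the correlations among these queries via gated fusion. Our experiments demonstrate the benefits of this design for both depth estimation and panoptic segmentation. Since each thing query also encodes instance-wise information, it is natural to perform tracking by cropping instance mask features with appearance learning. Our method ranks first on the ICCV-2021 BMTT Challenge video + depth track. Ablation studies are reported to show how we improve the performance. Code will be available at https://github.com/harboryuan/polyphonicformer.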
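We introduce MGNet, a multi-task framework for monocular geometric scene understanding. We define monocular geometric scene understanding as the combination of two known tasks: panoptic segmentation and self-supervised monocular depth estimation. Panoptic segmentation captures the full scene not only semantically but also on an instance basis. Self-supervised monocular depth estimation uses geometric constraints derived from the camera measurement model in order to measure depth from monocular video sequences only. To the best of our knowledge, we are the first to propose the combination of these two tasks in a single model. Our model is designed with a focus on low latency to provide fast inference in real time on a single consumer-grade GPU. During deployment, our model produces dense 3D point clouds with instance-aware semantic labels from single high-resolution camera images. We evaluate our model on two popular autonomous driving benchmarks, i.e., Cityscapes and KITTI, and show competitive performance among other real-time capable methods. Source code is available at https://github.com/markusschoen/mgnet.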
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance compared to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
Frozen pretrained models have become a viable alternative to the pretraining-then-finetuning paradigm for transfer learning. However, with frozen models there are relatively few parameters available for adapting to downstream tasks, which is problematic in computer vision where tasks vary significantly in input/output format and the type of information that is of value. In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to bring greater attention to this promising path of freezing pretrained image models.
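We introduce a new image segmentation task, called Entity Segmentation (ES), which aims to segment all visual entities (objects and stuff) in an image without predicting their semantic labels. By removing the need for class label prediction, models trained for this task can focus more on improving segmentation quality. The task has many practical applications, such as image manipulation and editing, where the quality of segmentation masks is crucial but class labels are less important. We conduct the first study to investigate the feasibility of a convolutional center-based representation for segmenting things and stuff in a unified manner, and show that such a representation fits exceptionally well in the context of ES. More specifically, we propose a CondInst-like fully convolutional architecture with two novel modules specifically designed to exploit the class-agnostic and non-overlapping requirements of ES. Experiments show that models designed and trained for ES significantly outperform popular class-specific panoptic segmentation models in terms of segmentation quality. Moreover, an ES model can be easily trained on a combination of multiple datasets without the need to resolve label conflicts in dataset merging, and a model trained for ES on one or more datasets can generalize to other test datasets from unseen domains. The code has been released at https://github.com/dvlab-research/entity.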
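Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance, or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by a factor of three, it outperforms the best specialized architectures on four popular datasets. Most notably, Mask2Former sets a new state of the art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO), and semantic segmentation (57.7 mIoU on ADE20K).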
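The masked attention described above, cross-attention constrained to a query's predicted mask region, can be sketched for a single query as a softmax that ignores locations outside the mask. The function name, the list-based inputs, and the fallback to full attention when the mask is empty are assumptions made for this illustration.

```python
import math
from typing import List


def masked_attention(scores: List[float], mask: List[int]) -> List[float]:
    """Softmax over attention logits restricted to positions where mask == 1.

    scores: one query's attention logits over N image locations.
    mask:   1 inside the predicted mask region for this query, else 0.
    Positions outside the mask receive exactly zero attention weight.
    """
    if not any(mask):
        # Assumed safeguard: an empty mask would leave nothing to attend to.
        mask = [1] * len(scores)
    peak = max(s for s, keep in zip(scores, mask) if keep)  # for numerical stability
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


weights = masked_attention([0.0, 0.0, 5.0], [1, 1, 0])
```

Here the third location has the largest logit but lies outside the mask, so all attention is shared by the first two locations.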
In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. Code is available at https://github.com/IDEACVR/MaskDINO.
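The mask branch described above, a dot product between each query embedding and a pixel embedding map, can be illustrated with a dependency-free sketch. The function name, the nested-list tensors, and thresholding the logit at zero (equivalent to a sigmoid probability above 0.5) are assumptions for this example rather than details taken from the abstract.

```python
from typing import List

Vector = List[float]


def predict_masks(queries: List[Vector],
                  pixel_embeddings: List[List[Vector]],
                  threshold: float = 0.0) -> List[List[List[int]]]:
    """Return one binary H x W mask per query.

    queries:          Q vectors of dimension D.
    pixel_embeddings: H x W grid of D-dimensional vectors.
    A pixel belongs to a query's mask when their dot product exceeds `threshold`.
    """
    masks = []
    for query in queries:
        mask = [
            [1 if sum(q * p for q, p in zip(query, pixel)) > threshold else 0
             for pixel in row]
            for row in pixel_embeddings
        ]
        masks.append(mask)
    return masks


queries = [[1.0, 0.0], [0.0, 1.0]]
pixel_embeddings = [[[2.0, -1.0], [-1.0, 3.0]]]  # a 1 x 2 "image"
masks = predict_masks(queries, pixel_embeddings)
```

Each query thus acts as a learned filter over the pixel embedding map, so the number of predicted masks equals the number of queries.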
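Panoptic segmentation involves a combination of joint semantic segmentation and instance segmentation, where image contents are divided into two types: things and stuff. We present Panoptic SegFormer, a general framework for panoptic segmentation with transformers. It contains three innovative components: an efficient deeply-supervised mask decoder, a query decoupling strategy, and an improved post-processing method. We also use Deformable DETR, a fast and efficient version of DETR, to process multi-scale features efficiently. Specifically, we supervise the attention modules in the mask decoder in a layer-wise manner. This deep supervision strategy lets the attention modules quickly focus on meaningful semantic regions. Compared to Deformable DETR, it improves performance and halves the number of required training epochs. Our query decoupling strategy decouples the responsibilities of the query set and avoids mutual interference between things and stuff. In addition, our post-processing strategy resolves conflicting mask overlaps at no extra cost by jointly considering classification and segmentation quality. Our approach improves on the baseline DETR model by 6.2% PQ. Panoptic SegFormer achieves state-of-the-art results with 56.2% PQ. It also shows stronger zero-shot robustness than existing methods. The code is released at https://github.com/zhiqi-li/panoptic-segformer.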
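How does one adapt a pre-trained visual model to novel downstream tasks without task-specific fine-tuning or any model modification? Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples. We show that posing this problem as simple image inpainting (literally just filling in a hole in a concatenated visual prompt image) turns out to be surprisingly effective, provided that the inpainting algorithm has been trained on the right data. We train masked auto-encoders on a new dataset that we curated: 88k unlabeled figures from academic paper sources on arXiv. We apply visual prompting to these pretrained models and demonstrate results on various downstream image-to-image tasks, including foreground segmentation, single object detection, colorization, edge detection, etc.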
Despite the superior performance brought by vision-and-language pretraining, it remains unclear whether learning with multi-modal data can help understand each individual modality. In this work, we investigate how language can help with visual representation learning from a probing perspective. Specifically, we compare vision-and-language and vision-only models by probing their visual representations on a broad range of tasks, in order to assess the quality of the learned representations in a fine-grained manner. Interestingly, our probing results suggest that vision-and-language models are better at label prediction tasks like object and attribute prediction, while vision-only models are stronger at dense prediction tasks that require more localized information. With further analysis using detailed metrics, our study suggests that language helps vision models learn better semantics, but not localization. Code is released at https://github.com/Lizw14/visual_probing.
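The Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently adapted Transformer-like architectures to the computer vision (CV) field and have demonstrated their effectiveness on various CV tasks. Relying on their competitive modeling capability, visual Transformers compare favorably with modern convolutional neural networks. In this paper, we provide a comprehensive review of three hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), and propose a taxonomy that organizes these methods according to their motivations, structures, and usage scenarios. Because of the differences in training settings and target tasks, we also evaluate these methods under different configurations, rather than only on various benchmarks, for easy and intuitive comparison. Furthermore, we reveal a series of essential but unexploited aspects that may enable Transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investigation.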
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works use separate approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, named Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we make the following contributions. First, we design a meta-architecture that decouples the part feature from the things/stuff feature. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model Panoptic-PartFormer. Second, we propose a new metric, Part-Whole Quality (PWQ), to better measure this task from both pixel-region and part-whole perspectives. It can also decouple the error for part segmentation and panoptic segmentation. Third, inspired by Mask2Former and building on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross-attention scheme, a part-whole interaction method using masked cross attention, to further boost part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost drop of 70% in GFlops and 50% in parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative downstream vision tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset with over a thousand categories and the COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.