Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation in the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, CityScapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even more performance improvement. We believe OneFormer is a significant step towards making image segmentation more universal and accessible. To support further research, we open-source our code and models at https://github.com/SHI-Labs/OneFormer
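Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. Although only the semantics differ between tasks, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance, or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention to within predicted mask regions. In addition to reducing the research effort by a factor of three, it outperforms the best specialized architectures on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO), and semantic segmentation (57.7 mIoU on ADE20K).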
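The masked attention described above can be illustrated with a minimal single-query sketch. This is not the authors' implementation: the function name `masked_attention`, the list-based tensors, and the fallback to full attention when the predicted mask is empty are all assumptions made for illustration.

```python
import math

def masked_attention(query, keys, values, mask):
    """Cross-attention for one query, restricted to locations where mask is True.

    query:  list of D floats
    keys:   list of L lists of D floats (one per pixel location)
    values: list of L lists of floats
    mask:   list of L booleans (the previous layer's predicted mask, assumed binary)
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    if any(mask):
        # Locations outside the predicted mask get a score of -inf,
        # so they receive exactly zero attention weight after softmax.
        scores = [dot(query, k) if keep else -math.inf
                  for k, keep in zip(keys, mask)]
    else:
        # Assumption of this sketch: an empty mask falls back to full attention.
        scores = [dot(query, k) for k in keys]

    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights
```

In this sketch a location with a large raw score still contributes nothing if it lies outside the mask, which is the localization effect the abstract attributes to masked attention.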
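Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model that predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms the current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.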
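To make the idea of "a set of binary masks, each with a single global class prediction" concrete, the sketch below shows one plausible way to turn such predictions into a semantic map: weight each mask by its class probabilities and take the per-pixel argmax. The function name `semantic_inference` and the flat pixel layout are hypothetical, and the combination rule is an assumption of this sketch rather than something stated in the abstract.

```python
def semantic_inference(class_probs, mask_probs):
    """Combine per-mask class probabilities with per-pixel mask probabilities.

    class_probs[i][c]: probability that mask i belongs to class c
    mask_probs[i][p]:  probability that pixel p belongs to mask i
    Returns one class index per pixel.
    """
    num_classes = len(class_probs[0])
    num_pixels = len(mask_probs[0])
    labels = []
    for p in range(num_pixels):
        # Score of class c at pixel p: sum over masks of p(class) * p(pixel in mask).
        scores = [
            sum(cp[c] * mp[p] for cp, mp in zip(class_probs, mask_probs))
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels


# Two masks, two classes, four pixels.
class_probs = [[0.9, 0.1], [0.2, 0.8]]
mask_probs = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.3, 0.9, 0.7]]
print(semantic_inference(class_probs, mask_probs))  # [0, 0, 1, 1]
```

Note that the number of masks is independent of the number of classes, which is one reason this formulation can scale to large label sets.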
In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. Code is available at \url{https://github.com/IDEACVR/MaskDINO}.
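The mask branch described above, where query embeddings are dot-multiplied with a pixel embedding map, can be sketched in a few lines. The function name `predict_masks`, the nested-list layout, and the sigmoid threshold of 0.5 are assumptions for illustration, not details taken from the Mask DINO code.

```python
import math

def predict_masks(queries, pixel_embeddings, threshold=0.5):
    """Predict one binary mask per query via a dot product with pixel embeddings.

    queries:          N lists of D floats (query embeddings)
    pixel_embeddings: H x W x D nested lists (pixel embedding map)
    Returns N masks, each an H x W nested list of booleans.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    masks = []
    for q in queries:
        mask = [
            # Dot product of the query with each pixel embedding, squashed to (0, 1).
            [sigmoid(sum(a * b for a, b in zip(q, px))) > threshold for px in row]
            for row in pixel_embeddings
        ]
        masks.append(mask)
    return masks
```

Each query therefore acts as a dynamic classifier over pixel embeddings, so adding a query adds a candidate mask without changing the pixel branch.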
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works use separate approaches to handle thing, stuff, and part predictions, without shared computation or task association. We aim to unify these tasks at the architectural level by designing the first end-to-end unified framework, named Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we make the following contributions. Firstly, we design a meta-architecture that decouples part features from thing/stuff features. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. Secondly, we propose a new metric, Part-Whole Quality (PWQ), to better measure this task from both pixel-region and part-whole perspectives. It can also decouple the errors of part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former and building on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole interaction scheme based on masked cross-attention to further boost part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and a 5% PartPQ improvement on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost drop of 70% in GFLOPs and 50% in parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
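The rise of transformers in vision tasks not only advances network backbone designs, but also opens a brand-new page for end-to-end image recognition (e.g., object detection and panoptic segmentation). Originating from Natural Language Processing (NLP), transformer architectures, consisting of self-attention and cross-attention, effectively learn long-range interactions between elements in a sequence. However, we observe that most existing transformer-based vision models simply borrow the idea from NLP, neglecting the crucial difference between languages and images, particularly the extremely large sequence length of spatially flattened pixel features. This in turn impedes the learning of cross-attention between pixel features and object queries. In this paper, we rethink the relationship between pixels and object queries and propose to reformulate cross-attention learning as a clustering process. Inspired by the traditional k-means clustering algorithm, we develop a k-means Mask Xformer (kMaX-DeepLab) for segmentation tasks, which not only improves the state-of-the-art but also enjoys a simple and elegant design. As a result, our kMaX-DeepLab achieves new state-of-the-art performance on the COCO val set with 58.0% PQ, and on the Cityscapes val set with 68.4% PQ, 44.0% AP, and 83.5% mIoU, without test-time augmentation or external datasets. We hope our work can shed some light on designing transformers tailored for vision tasks. Code and models are available at https://github.com/google-research/deeplab2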
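As a rough illustration of treating cross-attention as clustering, the sketch below performs one k-means-style step: each pixel is hard-assigned to the object query (cluster center) with the highest dot-product affinity, and each center is then recomputed from its assigned pixels. The function name `kmeans_attention_step` is hypothetical, and replacing a center by the plain mean of its members is a simplifying assumption of this sketch, not necessarily the update used in kMaX-DeepLab.

```python
def kmeans_attention_step(centers, pixels):
    """One clustering step between object queries (centers) and pixel features.

    centers: K lists of D floats
    pixels:  L lists of D floats
    Returns (assignment, new_centers), where assignment[p] is the index of the
    center chosen by pixel p.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Assignment step: argmax over the cluster axis (instead of a softmax over pixels).
    assignment = [
        max(range(len(centers)), key=lambda k: dot(centers[k], p))
        for p in pixels
    ]

    # Update step: mean of the pixel features assigned to each center.
    new_centers = []
    for k, center in enumerate(centers):
        members = [p for p, a in zip(pixels, assignment) if a == k]
        if members:
            dim = len(center)
            new_centers.append(
                [sum(m[d] for m in members) / len(members) for d in range(dim)]
            )
        else:
            new_centers.append(list(center))  # keep centers that attracted no pixels
    return assignment, new_centers
```

Because the competition is among the few cluster centers rather than among the many flattened pixels, the long pixel sequence no longer dilutes the attention weights, which is the difficulty the abstract points to.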
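Panoptic Part Segmentation (PPS) aims to unify panoptic segmentation and part segmentation into one task. Previous work mainly uses separate approaches to handle thing, stuff, and part predictions, without performing any shared computation or task association. In this work, we aim to unify these tasks at the architectural level, designing the first end-to-end unified method, named Panoptic-PartFormer. In particular, motivated by recent progress in Vision Transformers, we model things, stuff, and parts as object queries and directly learn to optimize all three predictions as a unified mask prediction and classification problem. We design a decoupled decoder to generate part features and thing/stuff features separately. We then propose to use all the queries and their corresponding features to perform reasoning jointly. The final masks are obtained via the inner product between queries and the corresponding features. Extensive ablation studies and analysis demonstrate the effectiveness of our framework. Our Panoptic-PartFormer achieves new state-of-the-art results on both the Cityscapes PPS and Pascal Context PPS datasets, with reductions of at least 70% in GFLOPs and 50% in parameters. In particular, on the Pascal Context PPS dataset we obtain a 3.4% relative improvement with a ResNet50 backbone and a 10% improvement after adopting Swin Transformer. To the best of our knowledge, we are the first to solve the PPS problem via a unified and end-to-end transformer model. Given its effectiveness and conceptual simplicity, we hope our Panoptic-PartFormer can serve as a good baseline and aid future unified research on PPS. Our code and models are available at https://github.com/lxtgh/panoptic-partformer.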
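Finetuning a pretrained backbone in the encoder part of an image transformer network has been the traditional approach for the semantic segmentation task. However, such an approach leaves out the semantic context that an image provides during the encoding stage. This paper argues that incorporating semantic information of the image into pretrained hierarchical transformer-based backbones while finetuning improves performance considerably. To achieve this, we propose SeMask, a simple and effective framework that incorporates semantic information into the encoder with the help of a semantic attention operation. In addition, we use a lightweight semantic decoder during training to provide supervision to the intermediate semantic prior maps at every stage. Our experiments demonstrate that incorporating semantic priors enhances the performance of established hierarchical encoders with only a slight increase in the number of FLOPs. We provide empirical proof by integrating SeMask into each variant of the Swin Transformer as our encoder, paired with different decoders. Our framework achieves a new state-of-the-art of 58.22% mIoU on the ADE20K dataset and improvements of over 3% in the mIoU metric on the Cityscapes dataset. The code and checkpoints are publicly available at https://github.com/picsart-ai-research/semask-egation.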
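We propose a hierarchical clustering-based image segmentation scheme for deep neural networks, called HCFormer. We interpret image segmentation, including semantic, instance, and panoptic segmentation, as a pixel clustering problem, and accomplish it through bottom-up, hierarchical clustering with deep neural networks. Our hierarchical clustering removes the pixel decoder that precedes the segmentation head and simplifies the segmentation pipeline, resulting in improved segmentation accuracy and interpretability. HCFormer can address semantic, instance, and panoptic segmentation with the same architecture because pixel clustering is a common approach across the various image segmentation tasks. In experiments, HCFormer achieves comparable or superior segmentation accuracy on semantic segmentation (55.5 mIoU on ADE20K), instance segmentation (47.1 AP on COCO), and panoptic segmentation (55.7 PQ on COCO).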
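We propose Clustering Mask Transformer (CMT-DeepLab), a transformer-based framework for panoptic segmentation designed around clustering. It rethinks the existing transformer architectures used in segmentation and detection; CMT-DeepLab considers the object queries as cluster centers, which take on the role of grouping pixels when applied to segmentation. The clustering is computed with an alternating procedure, first assigning pixels to clusters by their feature affinity and then updating the cluster centers and pixel features. Together, these operations comprise the Clustering Mask Transformer (CMT) layer, which produces cross-attention that is denser and more consistent with the final segmentation task. CMT-DeepLab significantly improves on the performance of prior art, achieving a new state-of-the-art of 55.7% PQ on the COCO test-dev set.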
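Panoptic segmentation involves a combination of joint semantic segmentation and instance segmentation, where image contents are divided into two types: things and stuff. We present Panoptic SegFormer, a general framework for panoptic segmentation with transformers. It contains three innovative components: an efficient deeply-supervised mask decoder, a query decoupling strategy, and an improved post-processing method. We also use Deformable DETR, a fast and efficient version of DETR, to process multi-scale features efficiently. Specifically, we supervise the attention modules in the mask decoder in a layer-wise manner. This deep supervision strategy lets the attention modules quickly focus on meaningful semantic regions. Compared with Deformable DETR, it improves performance and halves the number of required training epochs. Our query decoupling strategy decouples the responsibilities of the query set and avoids mutual interference between things and stuff. In addition, our post-processing strategy improves performance at no additional cost by jointly considering classification and segmentation quality to resolve conflicting mask overlaps. Our approach increases accuracy by 6.2% PQ over the baseline DETR model. Panoptic SegFormer achieves state-of-the-art results with 56.2% PQ. It also shows stronger zero-shot robustness than existing methods. The code is released at \url{https://github.com/zhiqi-li/panoptic-segformer}.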
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance compared with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
In this work, we introduce Panoptic-DeepLab, a simple, strong, and fast system for panoptic segmentation, aiming to establish a solid baseline for bottom-up methods that can achieve performance comparable to two-stage methods while yielding fast inference speed. In particular, Panoptic-DeepLab adopts dual-ASPP and dual-decoder structures specific to semantic and instance segmentation, respectively. The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression. As a result, our single Panoptic-DeepLab simultaneously ranks first on all three Cityscapes benchmarks, setting the new state-of-the-art of 84.2% mIoU, 39.0% AP, and 65.5% PQ on the test set. Additionally, equipped with MobileNetV3, Panoptic-DeepLab runs in nearly real time on a single 1025 × 2049 image (15.8 frames per second), while achieving competitive performance on Cityscapes (54.1% PQ on the test set). On the Mapillary Vistas test set, our ensemble of six models attains 42.7% PQ, outperforming the 2018 challenge winner by a healthy margin of 1.5%. Finally, our Panoptic-DeepLab also performs on par with several top-down approaches on the challenging COCO dataset. For the first time, we demonstrate that a bottom-up approach can deliver state-of-the-art results on panoptic segmentation.
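The abstract only says that the instance branch is class-agnostic and uses "a simple instance center regression". One common way such a branch can be turned into instance IDs is sketched below, under the assumption that the network outputs a list of detected centers plus a per-pixel offset pointing to the pixel's own center; the function name `group_pixels` and this exact grouping rule are hypothetical illustrations, not a description of the released system.

```python
def group_pixels(pixel_coords, offsets, centers):
    """Assign each pixel to the instance center closest to its regressed location.

    pixel_coords: list of (y, x) pixel positions
    offsets:      list of (dy, dx) predicted offsets, one per pixel
    centers:      list of (y, x) predicted instance centers
    Returns one center index (instance ID) per pixel.
    """
    ids = []
    for (y, x), (dy, dx) in zip(pixel_coords, offsets):
        py, px = y + dy, x + dx  # where this pixel believes its center is
        ids.append(
            min(
                range(len(centers)),
                key=lambda k: (centers[k][0] - py) ** 2 + (centers[k][1] - px) ** 2,
            )
        )
    return ids


centers = [(0, 0), (10, 10)]
pixels = [(1, 1), (9, 8), (4, 4)]
offsets = [(-1, -1), (1, 2), (5, 5)]
print(group_pixels(pixels, offsets, centers))  # [0, 1, 1]
```

The grouping uses no class information, which matches the class-agnostic design; class labels would come from the separate semantic branch.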
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
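Semantic, instance, and panoptic segmentation have been addressed using different and specialized frameworks despite their underlying connections. This paper presents a unified, simple, and effective framework for these essentially similar tasks. The framework, named K-Net, segments both instances and semantic categories consistently by a group of learnable kernels, where each kernel is responsible for generating a mask for either a potential instance or a stuff class. To remedy the difficulty of distinguishing various instances, we propose a kernel update strategy that makes each kernel dynamic and conditional on its meaningful group in the input image. K-Net can be trained in an end-to-end manner with bipartite matching, and its training and inference are naturally NMS-free and box-free. Without bells and whistles, K-Net surpasses all previously published state-of-the-art single-model results for panoptic segmentation on the MS COCO test-dev split and semantic segmentation on the ADE20K val split, with 55.2% PQ and 54.3% mIoU, respectively. Its instance segmentation performance is also on par with Cascade Mask R-CNN on MS COCO, with 60%-90% faster inference speeds. Code and models will be released at https://github.com/zwwwayne/k-net/.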
In dense image segmentation tasks (e.g., semantic, panoptic), existing methods can hardly generalize well to unseen image domains, predefined classes, and image resolution & quality variations. Motivated by these observations, we construct a large-scale entity segmentation dataset to explore fine-grained entity segmentation, with a strong focus on open-world and high-quality dense segmentation. The dataset contains images spanning diverse image domains and resolutions, along with high-quality mask annotations for training and testing. Given the high-quality and -resolution nature of the dataset, we propose CropFormer for high-quality segmentation, which can improve mask prediction using high-res image crops that provide more fine-grained image details than the full image. CropFormer is the first query-based Transformer architecture that can effectively ensemble mask predictions from multiple image crops, by learning queries that can associate the same entities across the full image and its crop. With CropFormer, we achieve a significant AP gain of $1.9$ on the challenging fine-grained entity segmentation task. The dataset and code will be released at http://luqi.info/entityv2.github.io/.
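Ensembles of predictions are known to perform better than individual predictions taken separately. However, for tasks that require heavy computational resources, e.g., semantic segmentation, creating an ensemble of learners that need to be trained separately is hardly tractable. In this work, we propose to leverage the performance boost offered by ensemble methods to enhance semantic segmentation, while avoiding the traditional heavy training cost of an ensemble. Our self-ensemble framework takes advantage of the multi-scale features produced by feature pyramid network methods to feed independent decoders, thus creating an ensemble within a single model. As in an ensemble, the final prediction is the aggregation of the predictions made by each learner. In contrast to previous works, our model can be trained end-to-end, alleviating the traditional cumbersome multi-stage training of ensembles. Our self-ensemble framework outperforms the current state-of-the-art on the benchmark datasets ADE20K, Pascal Context, and COCO-Stuff-10K for semantic segmentation and is competitive on Cityscapes. Code will be available at github.com/walbouss/senformer.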
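The aggregation step can be sketched as follows. The abstract does not say how the learners' predictions are combined, so averaging class probabilities across decoders is an assumption of this sketch, and the function name `ensemble_predict` is hypothetical.

```python
def ensemble_predict(decoder_probs):
    """Aggregate per-pixel class probabilities from several decoders.

    decoder_probs[d][p][c]: probability of class c at pixel p from decoder d
    (one decoder per feature-pyramid level in this illustration).
    Returns one class index per pixel after averaging over decoders.
    """
    num_decoders = len(decoder_probs)
    labels = []
    for per_pixel in zip(*decoder_probs):  # all decoders' outputs for one pixel
        avg = [sum(cs) / num_decoders for cs in zip(*per_pixel)]
        labels.append(max(range(len(avg)), key=avg.__getitem__))
    return labels


probs = [
    [[0.9, 0.1], [0.3, 0.7]],  # decoder on scale 1
    [[0.4, 0.6], [0.4, 0.6]],  # decoder on scale 2
    [[0.4, 0.6], [0.1, 0.9]],  # decoder on scale 3
]
print(ensemble_predict(probs))  # [0, 1]
```

In the first pixel a single confident decoder outweighs two mildly disagreeing ones, which is how probability averaging differs from majority voting.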
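Segmentation tasks are traditionally formulated as a complete-label pixel classification task that predicts a class for each pixel from a fixed number of predefined semantic categories shared by all images or videos. Yet, following this formulation, standard architectures will inevitably encounter various challenges under more realistic settings where the scope of categories scales up (e.g., beyond the level of 1k). On the other hand, in a typical image or video, only a few categories, i.e., a small subset of the complete label set, are present. In this paper, we propose to decompose segmentation into two sub-problems: (i) image-level or video-level multi-label classification and (ii) pixel-level rank-adaptive selected-label classification. Given an input image or video, our framework first conducts multi-label classification over the complete label set, then sorts the labels and selects a small subset according to their class confidence scores. We then use a rank-adaptive pixel classifier to perform pixel-wise classification over only the selected labels, using a set of rank-oriented learnable temperature parameters to adjust the pixel classification scores. Our approach is conceptually general and can be used to improve various existing segmentation frameworks by simply adding a lightweight multi-label classification head and a rank-adaptive pixel classifier. We demonstrate the effectiveness of our framework with competitive experimental results across four tasks: image semantic segmentation, image panoptic segmentation, video instance segmentation, and video semantic segmentation. In particular, with our RankSeg, Mask2Former gains +0.8%/+0.7%/+0.7% on the ADE20K panoptic segmentation/YouTubeVIS 2019 video instance segmentation/VSPW video semantic segmentation benchmarks, respectively.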
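The two-stage decomposition can be sketched as below: pick the top-k labels by image-level confidence, then classify each pixel among only those labels, with one temperature per rank. The function name `rank_adaptive_classify`, the use of `len(temperatures)` as k, and dividing logits by the temperature are all assumptions of this sketch; the abstract states only that rank-oriented learnable temperatures adjust the pixel classification scores.

```python
def rank_adaptive_classify(label_scores, pixel_logits, temperatures):
    """Select top-k labels, then classify pixels over the selected labels only.

    label_scores[c]:    image-level confidence for class c (multi-label head)
    pixel_logits[p][c]: pixel-level logit of class c at pixel p
    temperatures[r]:    temperature applied to the label ranked r-th (k = len)
    Returns (selected_labels, per_pixel_predictions).
    """
    k = len(temperatures)
    # Sort all labels by confidence and keep the k most confident ones.
    selected = sorted(
        range(len(label_scores)), key=lambda c: label_scores[c], reverse=True
    )[:k]

    preds = []
    for logits in pixel_logits:
        # Rank r uses its own temperature (division is an assumption here).
        scaled = [logits[c] / temperatures[r] for r, c in enumerate(selected)]
        preds.append(selected[max(range(k), key=scaled.__getitem__)])
    return selected, preds


label_scores = [0.1, 0.9, 0.7, 0.2]
pixel_logits = [[5.0, 1.0, 2.0, 0.0], [0.0, 3.0, 1.0, 9.0]]
print(rank_adaptive_classify(label_scores, pixel_logits, [1.0, 0.5]))
# ([1, 2], [2, 1])
```

Classes 0 and 3 have the largest raw pixel logits in this toy input but are never predicted, because the image-level stage ruled them out.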
In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation to enable prediction of an extra unknown class, which helps better resolve the conflicts between semantic and instance segmentation. Additionally, it handles the challenge caused by the varying number of instances and permits backpropagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that our UPSNet achieves state-of-the-art performance with much faster inference. Code has been made available at: https://github.com/uber-research/UPSNet. * Equal contribution. † This work was done when Hengshuang Zhao was an intern at Uber ATG.
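The recently proposed MaskFormer \cite{MaskFormer} gives a refreshed perspective on the task of semantic segmentation: it shifts from the popular pixel-level classification paradigm to a mask-level classification method. In essence, it generates paired probabilities and masks corresponding to category segments and combines them during inference to produce the segmentation maps. The segmentation quality therefore relies on how well the queries capture the semantic information of categories and their spatial locations. In our study, we find that a per-mask classification decoder on top of a single-scale feature is not effective enough to extract reliable probabilities or masks. To mine the rich semantic information across the feature pyramid, we propose a transformer-based Pyramid Fusion Transformer (PFT) for per-mask semantic segmentation on top of multi-scale features. To efficiently utilize image features of different resolutions without incurring too much computational overhead, PFT uses a multi-scale transformer decoder with cross-scale inter-query attention to exchange complementary information. Extensive experimental evaluations and ablations demonstrate the efficacy of our framework. In particular, we achieve a 3.2 mIoU improvement with ResNet-101c compared to MaskFormer. Moreover, on the ADE20K validation set, our result with a Swin-B backbone matches that of MaskFormer with a much larger Swin-L backbone in both single-scale and multi-scale inference, achieving 54.1 mIoU and 55.3 mIoU, respectively. Using a Swin-L backbone, we achieve a 56.0 mIoU single-scale result and a 57.2 mIoU multi-scale result on the ADE20K validation set, obtaining state-of-the-art performance on the dataset.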