We present MaX-DeepLab, the first end-to-end model for panoptic segmentation. Our approach simplifies the current pipeline, which depends heavily on surrogate sub-tasks and hand-designed components such as box detection, non-maximum suppression, and thing-stuff merging. Although these sub-tasks are tackled by area experts, they fail to comprehensively solve the target task. By contrast, MaX-DeepLab directly predicts class-labeled masks with a mask transformer, and is trained with a panoptic-quality-inspired loss via bipartite matching. Our mask transformer employs a dual-path architecture that introduces a global memory path in addition to a CNN path, allowing direct communication with any CNN layer. As a result, MaX-DeepLab shows a significant 7.1% PQ gain in the box-free regime on the challenging COCO dataset, closing the gap between box-based and box-free methods for the first time. A small variant of MaX-DeepLab improves on DETR by 3.0% PQ with similar parameters and M-Adds. Furthermore, MaX-DeepLab, without test-time augmentation, achieves a new state-of-the-art 51.3% PQ on the COCO test-dev set.
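Since PQ is the metric quoted throughout these abstracts, a minimal sketch of how it is computed for one class may help. It assumes the standard definition (sum of matched IoUs divided by TP + 0.5 FP + 0.5 FN, with a match requiring IoU > 0.5); the function names and the set-of-pixel-ids representation are illustrative choices, not taken from any of the papers.

```python
def iou(a: set, b: set) -> float:
    """Intersection-over-union of two segments given as sets of pixel ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(pred_segments, gt_segments):
    """PQ for one class: sum of matched IoUs / (TP + 0.5*FP + 0.5*FN).

    A prediction matches a ground-truth segment when their IoU exceeds 0.5.
    Segments inside one panoptic map do not overlap, so a match is unique.
    """
    matched_gt = set()
    iou_sum = 0.0
    tp = 0
    for p in pred_segments:
        for gi, g in enumerate(gt_segments):
            if gi in matched_gt:
                continue
            score = iou(p, g)
            if score > 0.5:
                matched_gt.add(gi)
                iou_sum += score
                tp += 1
                break
    fp = len(pred_segments) - tp  # predictions without a match
    fn = len(gt_segments) - tp    # ground-truth segments without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


# Toy example: one good match (IoU 0.75), one miss on each side.
gt = [{0, 1, 2, 3}, {4, 5, 6, 7}]
pred = [{0, 1, 2}, {6, 7, 8, 9}]
pq = panoptic_quality(pred, gt)
print(pq)
```

In the toy example the single matched pair contributes IoU 0.75, and the unmatched prediction and unmatched ground-truth segment each add 0.5 to the denominator.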
translated by 谷歌翻译
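We propose Clustering Mask Transformer (CMT-DeepLab), a transformer-based framework for panoptic segmentation designed around clustering. It rethinks the existing transformer architectures used in segmentation and detection; CMT-DeepLab considers the object queries as cluster centers, which fill the role of grouping the pixels when applied to segmentation. The clustering is computed with an alternating procedure, by first assigning pixels to the clusters by their feature affinity, and then updating the cluster centers and pixel features. Together, these operations comprise the Clustering Mask Transformer (CMT) layer, which produces cross-attention that is denser and more consistent with the final segmentation task. CMT-DeepLab achieves a new state-of-the-art of 55.7% PQ on the COCO test-dev set, significantly improving on prior art.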
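Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model that predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask-classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask-classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.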
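The rise of transformers in vision tasks not only advances network backbone designs, but also opens a brand-new page for end-to-end image recognition (e.g., object detection and panoptic segmentation). Originating in natural language processing (NLP), transformer architectures, consisting of self-attention and cross-attention, effectively learn long-range interactions between elements in a sequence. However, we observe that most existing transformer-based vision models simply borrow the idea from NLP, neglecting the crucial difference between languages and images, particularly the extremely large sequence length of spatially flattened pixel features. This subsequently impedes the learning of cross-attention between pixel features and object queries. In this paper, we rethink the relationship between pixels and object queries and propose to reformulate cross-attention learning as a clustering process. Inspired by the traditional k-means clustering algorithm, we develop a k-means Mask Xformer (kMaX-DeepLab) for segmentation tasks, which not only improves the state of the art, but also enjoys a simple and elegant design. As a result, our kMaX-DeepLab achieves new state-of-the-art performance on the COCO val set with 58.0% PQ, and on the Cityscapes val set with 68.4% PQ, 44.0% AP, and 83.5% mIoU, without test-time augmentation or external datasets. We hope our work can shed some light on designing transformers tailored for vision tasks. Code and models are available at https://github.com/google-research/deeplab2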
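To make the clustering analogy concrete, here is a minimal sketch of one iteration of classical k-means over pixel features: a hard assignment of every pixel to its nearest center, followed by a center update. This is the textbook algorithm that the abstract cites as inspiration, not the kMaX-DeepLab layer itself; the function name `kmeans_step` and the use of squared Euclidean distance are assumptions made for illustration.

```python
def kmeans_step(pixels, centers):
    """One k-means iteration: hard-assign pixels, then recompute centers."""

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Assignment step: each pixel picks its nearest cluster center.
    assign = [
        min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
        for p in pixels
    ]

    # Update step: each center becomes the mean of its assigned pixels.
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, a in zip(pixels, assign) if a == k]
        if members:
            new_centers.append(
                [sum(m[d] for m in members) / len(members) for d in range(len(old))]
            )
        else:
            new_centers.append(list(old))  # keep an empty cluster where it was
    return assign, new_centers


# Four "pixels" with 2-D features and two initial centers (the "queries").
pixels = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers = [[1.0, 1.0], [9.0, 9.0]]
assign, centers = kmeans_step(pixels, centers)
print(assign, centers)
```

Read through the abstract's lens, the centers play the role of object queries and the assignment plays the role of a mask over pixels.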
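Panoptic segmentation combines semantic segmentation and instance segmentation, where image contents are divided into two types: things and stuff. We present Panoptic SegFormer, a general framework for panoptic segmentation with transformers. It contains three innovative components: an efficient deeply-supervised mask decoder, a query decoupling strategy, and an improved post-processing method. We also use Deformable DETR, a fast and efficient version of DETR, to process multi-scale features efficiently. Specifically, we supervise the attention modules in the mask decoder in a layer-wise manner. This deep supervision strategy lets the attention modules quickly focus on meaningful semantic regions. Compared with Deformable DETR, it improves performance and halves the number of required training epochs. Our query decoupling strategy separates the responsibilities of the query set and avoids mutual interference between things and stuff. In addition, our post-processing strategy resolves conflicting mask overlaps at no extra cost by jointly considering classification and segmentation quality. Our approach improves on the baseline DETR model by 6.2% PQ. Panoptic SegFormer achieves state-of-the-art results with 56.2% PQ. It also shows stronger zero-shot robustness than existing methods. The code is released at \url{https://github.com/zhiqi-li/panoptic-segformer}.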
In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. Code is available at \url{https://github.com/IDEACVR/MaskDINO}.
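The dot-product mask prediction described above can be sketched in a few lines. The sketch assumes each query is a length-C vector and the pixel embedding map is an H x W grid of length-C vectors; the sigmoid and the 0.5 threshold are common choices added here for illustration and are not stated in the abstract.

```python
import math


def predict_masks(queries, pixel_embeddings, threshold=0.5):
    """Binary masks from dot products of query and pixel embeddings.

    queries: list of N vectors of length C.
    pixel_embeddings: H x W grid (nested lists) of vectors of length C.
    """
    masks = []
    for q in queries:
        mask = []
        for row in pixel_embeddings:
            mask_row = []
            for feat in row:
                logit = sum(a * b for a, b in zip(q, feat))  # dot product
                prob = 1.0 / (1.0 + math.exp(-logit))        # assumed sigmoid
                mask_row.append(1 if prob > threshold else 0)
            mask.append(mask_row)
        masks.append(mask)
    return masks


# 2 x 2 embedding map: top row aligns with query 0, bottom row with query 1.
pixel_embeddings = [
    [[2.0, -2.0], [2.0, -2.0]],
    [[-2.0, 2.0], [-2.0, 2.0]],
]
queries = [[1.0, 0.0], [0.0, 1.0]]
masks = predict_masks(queries, pixel_embeddings)
print(masks)
```

Each query yields one mask, so the number of predicted masks equals the number of queries regardless of image size.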
In this work, we introduce Panoptic-DeepLab, a simple, strong, and fast system for panoptic segmentation, aiming to establish a solid baseline for bottom-up methods that can achieve performance comparable to two-stage methods while yielding fast inference speed. In particular, Panoptic-DeepLab adopts dual-ASPP and dual-decoder structures specific to semantic and instance segmentation, respectively. The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression. As a result, our single Panoptic-DeepLab simultaneously ranks first on all three Cityscapes benchmarks, setting a new state of the art of 84.2% mIoU, 39.0% AP, and 65.5% PQ on the test set. Additionally, equipped with MobileNetV3, Panoptic-DeepLab runs nearly in real time on a single 1025 × 2049 image (15.8 frames per second), while achieving competitive performance on Cityscapes (54.1% PQ on the test set). On the Mapillary Vistas test set, our ensemble of six models attains 42.7% PQ, outperforming the 2018 challenge winner by a healthy margin of 1.5%. Finally, our Panoptic-DeepLab also performs on par with several top-down approaches on the challenging COCO dataset. For the first time, we demonstrate that a bottom-up approach can deliver state-of-the-art results on panoptic segmentation.
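As a rough illustration of grouping by instance center regression, the sketch below assumes each foreground pixel predicts an offset to its instance center and is then assigned to the nearest detected center. The data layout, the function name `group_pixels`, and the nearest-center rule are assumptions for illustration; the abstract itself only states that the instance branch involves a simple center regression.

```python
def group_pixels(offsets, centers):
    """Assign each pixel to the center closest to (pixel + predicted offset).

    offsets: dict mapping (y, x) -> (dy, dx), the predicted offset to a center.
    centers: list of (y, x) instance centers.
    Returns a dict mapping (y, x) -> index into `centers`.
    """
    ids = {}
    for (y, x), (dy, dx) in offsets.items():
        vy, vx = y + dy, x + dx  # location this pixel "votes" for
        ids[(y, x)] = min(
            range(len(centers)),
            key=lambda k: (vy - centers[k][0]) ** 2 + (vx - centers[k][1]) ** 2,
        )
    return ids


centers = [(1, 1), (1, 5)]
offsets = {
    (0, 0): (1.0, 1.0),    # votes for (1, 1)
    (2, 2): (-1.0, -1.0),  # votes for (1, 1)
    (1, 3): (0.0, 2.1),    # votes for (1, 5.1), slightly off the true center
    (0, 6): (1.0, -1.0),   # votes for (1, 5)
}
ids = group_pixels(offsets, centers)
print(ids)
```

Because the grouping is class-agnostic, a semantic label would still have to come from a separate branch, as the abstract describes.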
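Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance, or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort threefold, it outperforms the best specialized architectures on four popular datasets. Most notably, Mask2Former sets a new state of the art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO), and semantic segmentation (57.7 mIoU on ADE20K).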
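A minimal sketch of attention restricted to a mask region, for a single query, might look as follows. It assumes the restriction is implemented by setting the attention logits of positions outside the mask to negative infinity before the softmax; the function name and the list-based layout are illustrative, and the real model operates on batched multi-head tensors.

```python
import math


def masked_attention(query, keys, values, mask):
    """Single-query attention that ignores positions where mask is 0."""
    logits = [
        sum(q * k for q, k in zip(query, key)) if m else float("-inf")
        for key, m in zip(keys, mask)
    ]
    peak = max(logits)
    if peak == float("-inf"):
        raise ValueError("mask selects no position")
    # Softmax with the usual max-subtraction; exp(-inf) evaluates to 0.0.
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
values = [[1.0, 0.0], [3.0, 0.0], [100.0, 0.0]]
mask = [1, 1, 0]  # the third position lies outside the predicted mask
out, weights = masked_attention(query, keys, values, mask)
print(out, weights)
```

Without the mask, the third key would dominate the softmax; with it, the output is an average over the two in-mask positions only.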
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works utilize separated approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework named Panoptic-PartFormer. Moreover, we find the previous metric PartPQ biases to PQ. To handle both issues, we make the following contributions: Firstly, we design a meta-architecture that decouples part feature and things/stuff feature, respectively. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model as Panoptic-PartFormer. Secondly, we propose a new metric Part-Whole Quality (PWQ) to better measure such task from both pixel-region and part-whole perspectives. It can also decouple the error for part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former, based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross attention scheme to further boost part segmentation qualities. We design a new part-whole interaction method using masked cross attention. Finally, the extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost drop of 70% on GFlops and 50% on parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
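In this paper, we propose a simple attention mechanism that we call Box-Attention. It enables spatial interaction between grid features, as sampled from boxes of interest, and improves the learning capability of transformers for several vision tasks. Specifically, we present BoxeR, short for Box Transformer, which attends to a set of boxes by predicting their transformation from a reference window on an input feature map. BoxeR computes attention weights on these boxes by considering their grid structure. Notably, BoxeR-2D naturally reasons about box information within its attention module, making it suitable for end-to-end instance detection and segmentation tasks. By learning invariance to rotation in the box-attention module, BoxeR-3D is capable of generating discriminative information from a bird's-eye-view plane for 3D end-to-end object detection. Our experiments demonstrate that the proposed BoxeR-2D achieves better results on COCO detection, and reaches performance comparable to the well-established and highly optimized Mask R-CNN on COCO instance segmentation. BoxeR-3D already obtains compelling performance for the vehicle category of Waymo Open, without any class-specific optimization. The code will be released.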
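Panoptic Part Segmentation (PPS) aims to unify panoptic segmentation and part segmentation into one task. Previous work mainly uses separate approaches to handle thing, stuff, and part predictions, without performing any shared computation or task association. In this work, we aim to unify these tasks at the architectural level, designing the first end-to-end unified method, named Panoptic-PartFormer. In particular, motivated by recent progress in vision transformers, we model things, stuff, and parts as object queries and directly learn to optimize all three predictions as a unified mask prediction and classification problem. We design a decoupled decoder to generate part features and thing/stuff features separately. We then propose to use all the queries and the corresponding features to perform reasoning jointly. The final mask is obtained via the inner product between the queries and the corresponding features. Extensive ablation studies and analysis demonstrate the effectiveness of our framework. Our Panoptic-PartFormer achieves new state-of-the-art results on both the Cityscapes PPS and Pascal Context PPS datasets, with reductions of at least 70% in GFLOPs and 50% in parameters. In particular, on the Pascal Context PPS dataset we obtain a 3.4% relative improvement with a ResNet50 backbone and a 10% improvement after adopting Swin Transformer. To the best of our knowledge, we are the first to solve the PPS problem via a unified and end-to-end transformer model. Given its effectiveness and conceptual simplicity, we hope our Panoptic-PartFormer can serve as a good baseline and aid future unified research for PPS. Our code and models are available at https://github.com/lxtgh/panoptic-partformer.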
In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation for enabling prediction of an extra unknown class which helps better resolve the conflicts between semantic and instance segmentation. Additionally, it handles the challenge caused by the varying number of instances and permits back propagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that our UPSNet achieves state-of-the-art performance with much faster inference. Code has been made available at: https://github.com/uber-research/UPSNet. * Equal contribution. † This work was done when Hengshuang Zhao was an intern at Uber ATG.
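This paper presents an end-to-end instance segmentation framework, termed SOIT, that Segments Objects with Instance-aware Transformers. Inspired by DETR~\cite{carion2020end}, our method views instance segmentation as a direct set prediction problem and effectively removes the need for many hand-crafted components such as RoI cropping, one-to-many label assignment, and non-maximum suppression (NMS). In SOIT, multiple queries are learned to directly reason a set of object embeddings of semantic category, bounding-box location, and pixel-wise mask in parallel under the global image context. The class and bounding box can be easily embedded by a fixed-length vector. The pixel-wise mask, in particular, is embedded by a group of parameters to construct a lightweight instance-aware transformer. Afterward, the instance-aware transformer produces a full-resolution mask without involving any RoI-based operation. Overall, SOIT introduces a simple single-stage instance segmentation framework that is both RoI-free and NMS-free. Experimental results on the MS COCO dataset demonstrate that SOIT outperforms state-of-the-art instance segmentation approaches by a significant margin. Moreover, the joint learning of multiple tasks in a unified query embedding can also substantially improve detection performance. Code is available at \url{https://github.com/yuxiaodonghri/soit}.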
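Semantic, instance, and panoptic segmentation have been addressed using different and specialized frameworks despite their underlying connections. This paper presents a unified, simple, and effective framework for these essentially similar tasks. The framework, named K-Net, segments both instances and semantic categories consistently by a group of learnable kernels, where each kernel is responsible for generating a mask for either a potential instance or a stuff class. To remedy the difficulty of distinguishing various instances, we propose a kernel update strategy that makes each kernel dynamic and conditional on its meaningful group in the input image. K-Net can be trained in an end-to-end manner with bipartite matching, and its training and inference are naturally NMS-free and box-free. Without bells and whistles, K-Net surpasses all previously published state-of-the-art single-model results for panoptic segmentation on the MS COCO test-dev split and semantic segmentation on the ADE20K val split, with 55.2% PQ and 54.3% mIoU, respectively. Its instance segmentation performance is also on par with Cascade Mask R-CNN on MS COCO, with 60%-90% faster inference speeds. Code and models will be released at https://github.com/zwwwayne/k-net/.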
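Recently, increasing attention has been drawn to one-stage panoptic segmentation methods, which aim to segment instances and stuff jointly and efficiently within a fully convolutional pipeline. However, most existing works feed the backbone features directly to the various segmentation heads, ignoring that the demands of semantic and instance segmentation differ: the former needs semantic-level discriminative features, while the latter requires features that are distinguishable across instances. To alleviate this, we propose to first predict semantic-level and instance-level correlations among different locations, which are used to enhance the backbone features, and then feed the improved discriminative features into the corresponding segmentation heads. Specifically, we organize the correlations between a given location and all locations as a continuous sequence and predict it as a whole. Considering that such a sequence can be extremely complicated, we adopt the Discrete Fourier Transform (DFT), a tool that can approximate an arbitrary sequence parameterized by amplitudes and phases. For different tasks, we generate these parameters from the backbone features in a fully convolutional way, optimized implicitly by the corresponding tasks. As a result, these accurate and consistent correlations help produce plausible discriminative features that meet the requirements of the complicated panoptic segmentation task. To verify the effectiveness of our method, we conduct experiments on several challenging panoptic segmentation datasets and achieve state-of-the-art performance with 45.1% PQ, and 32.6% PQ on ADE20K.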
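The claim that a sequence can be parameterized by amplitudes and phases can be checked with a short standard-library sketch. It computes a plain DFT, converts each coefficient to an (amplitude, phase) pair, and rebuilds the sequence from those pairs. The function names are illustrative, and the example reconstructs from all components; keeping only a few of them would give an approximation rather than an exact round trip. How the paper predicts these parameters from backbone features is not shown here.

```python
import cmath
import math


def dft(seq):
    """Discrete Fourier Transform of a real sequence (O(n^2), for clarity)."""
    n = len(seq)
    return [
        sum(seq[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def amplitude_phase(coeffs):
    """Express each complex coefficient as an (amplitude, phase) pair."""
    return [(abs(c), cmath.phase(c)) for c in coeffs]


def reconstruct(amp_phase):
    """Inverse DFT written directly in terms of amplitudes and phases."""
    n = len(amp_phase)
    out = []
    for t in range(n):
        total = sum(
            a * cmath.exp(1j * (2 * math.pi * k * t / n + ph))
            for k, (a, ph) in enumerate(amp_phase)
        )
        out.append((total / n).real)
    return out


# A toy "correlation sequence" between one location and eight locations.
seq = [0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 2.0]
params = amplitude_phase(dft(seq))
restored = reconstruct(params)
print(restored)
```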
We present a new, embarrassingly simple approach to instance segmentation. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that have made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the "detect-then-segment" strategy (e.g., Mask R-CNN), or predict embedding vectors first then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location and size, thus nicely converting instance segmentation into a single-shot classification-solvable problem. We demonstrate a much simpler and flexible instance segmentation framework with strong performance, achieving on par accuracy with Mask R-CNN and outperforming recent single-shot instance segmenters in accuracy. We hope that this simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation. Code is available at https://git.io/AdelaiDet
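We propose a hierarchical clustering-based image segmentation scheme for deep neural networks, called HCFormer. We interpret image segmentation, including semantic, instance, and panoptic segmentation, as a pixel clustering problem, and accomplish it by bottom-up, hierarchical clustering with deep neural networks. Our hierarchical clustering removes the pixel decoder that precedes the segmentation head and simplifies the segmentation pipeline, resulting in improved segmentation accuracy and interpretability. HCFormer can address semantic, instance, and panoptic segmentation with the same architecture, because pixel clustering is a common approach across image segmentation tasks. In experiments, HCFormer achieves comparable or superior segmentation accuracy on semantic segmentation (55.5 mIoU on ADE20K), instance segmentation (47.1 AP on COCO), and panoptic segmentation (55.7 PQ on COCO).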
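Existing instance segmentation methods have achieved impressive performance but still suffer from a common dilemma: redundant representations (e.g., multiple boxes, grids, and anchor points) are inferred for one instance, which leads to multiple duplicated predictions. Thus, mainstream methods usually rely on a hand-designed non-maximum suppression (NMS) post-processing step to select the optimal prediction, which hinders end-to-end training. To address this issue, we propose a box-free and NMS-free end-to-end instance segmentation framework, termed UniInst, that yields only one unique representation for each instance. Specifically, we design an instance-aware one-to-one assignment scheme, namely Only Yield One Representation (OYOR), which dynamically assigns one unique representation to each instance according to the matching quality between predictions and ground truths. Then, a novel prediction re-ranking strategy is integrated into the framework to address the misalignment between the classification score and the mask quality, making the learned representation more discriminative. With these techniques, our UniInst, the first FCN-based box-free and NMS-free instance segmentation framework, achieves competitive performance with ResNet-50-FPN, and 40.2 mask AP with ResNet-101-FPN, against mainstream methods on COCO test-dev. Moreover, the proposed instance-aware method is robust to occluded scenes, outperforming common baselines by a remarkable mask AP margin on the heavily occluded OCHuman benchmark. Our code will be available upon publication.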
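The recently proposed MaskFormer \cite{MaskFormer} gives a refreshed perspective on the task of semantic segmentation: it shifts from the popular pixel-level classification paradigm to a mask-level classification method. In essence, it generates paired probabilities and masks corresponding to category segments and combines them during inference to produce the segmentation maps. The segmentation quality thus relies on how well the queries capture the semantic information of the categories and their spatial locations. In our study, we find that a per-mask classification decoder on top of a single-scale feature is not effective enough to extract reliable probabilities or masks. To mine the rich semantic information across the feature pyramid, we propose a transformer-based Pyramid Fusion Transformer (PFT) for per-mask semantic segmentation on top of multi-scale features. To use image features of different resolutions efficiently without incurring too much computational overhead, PFT uses a multi-scale transformer decoder with cross-scale inter-query attention to exchange complementary information. Extensive experimental evaluations and ablations demonstrate the efficacy of our framework. In particular, we achieve a 3.2 mIoU improvement with ResNet-101c compared to MaskFormer. Moreover, on the ADE20K validation set, our result with a Swin-B backbone matches that of MaskFormer with a much larger Swin-L backbone in both single-scale and multi-scale inference, achieving 54.1 mIoU and 55.3 mIoU respectively. Using a Swin-L backbone, we achieve a 56.0 mIoU single-scale result and a 57.2 multi-scale result on the ADE20K validation set, obtaining state-of-the-art performance on the dataset.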
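One common way to combine paired class probabilities and masks into a semantic map is to score each class at each pixel by summing, over queries, the class probability times the mask probability, and then take the per-pixel argmax. The sketch below assumes that rule; the function name and the flattened-pixel layout are illustrative, and it is meant as a picture of mask-level classification rather than a reproduction of any specific codebase.

```python
def semantic_inference(class_probs, mask_probs):
    """Per-pixel labels from per-query class and mask probabilities.

    class_probs: Q x K list, probability of each of K classes for each query.
    mask_probs:  Q x P list, probability that each of P pixels belongs to
                 the query's mask.
    """
    num_classes = len(class_probs[0])
    num_pixels = len(mask_probs[0])
    labels = []
    for p in range(num_pixels):
        # Score of class c at pixel p: sum over queries of p(c) * m(p).
        scores = [
            sum(cp[c] * mp[p] for cp, mp in zip(class_probs, mask_probs))
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels


# Two queries, two classes, three pixels.
class_probs = [[0.9, 0.1], [0.2, 0.8]]
mask_probs = [[0.9, 0.8, 0.1], [0.1, 0.3, 0.9]]
labels = semantic_inference(class_probs, mask_probs)
print(labels)
```

The example makes the dependency noted in the abstract visible: a pixel is labeled correctly only if some query has both a confident class prediction and a mask that covers that pixel.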