我们介绍了一个新的图像分段任务,称为实体分段(ES),该任务旨在在不预测其语义标签的情况下划分图像中的所有视觉实体(对象和填充)。通过删除类标签预测的需要,对此类任务培训的模型可以更多地关注提高分割质量。它具有许多实际应用,例如图像操纵和编辑,其中分割掩模的质量至关重要,但类标签不太重要。我们通过统一的方式调查第一次研究,以调查卷大中心的代表对分割事物和东西的可行性,并显示这种代表在es的背景下非常好。更具体地说,我们提出了一种类似的完全卷积的架构,具有两种新颖的模块,专门设计用于利用es的类无话和非重叠要求。实验表明,在分割质量方面设计和培训的模型显着优于流行的专用Panoptic分段模型。此外,可以在多个数据集的组合中容易地培训ES模型,而无需解决数据集合并中的标签冲突,并且在一个或多个数据集中培训的模型可以概括到未经看管域的其他测试数据集。代码已在https://github.com/dvlab-research/entity发布。
translated by 谷歌翻译
为了提高实例级别检测/分割性能,现有的自我监督和半监督方法从未标记的数据提取非常任务 - 无关或非常任务特定的训练信号。我们认为这两种方法在任务特异性频谱的两端是任务性能的次优。利用太少的任务特定的培训信号导致底下地区任务的地面真理标签导致磨损,而相反的原因会在地面真理标签上过度装修。为此,我们提出了一种新的类别无关的半监督预测(CASP)框架,在提取来自未标记数据的训练信号中实现更有利的任务特异性平衡。与半监督学习相比,CASP通过忽略伪标签中的类信息并具有仅使用任务 - 不相关的未标记数据的单独预先预订阶段来减少训练信号的任务特异性。另一方面,CASP通过利用盒子/面具级伪标签来保留适量的任务特异性。因此,我们的预磨模模型可以更好地避免在下游任务上的FineTuned时避免在地面真理标签上抵抗/过度拟合。使用3.6M未标记的数据,我们在对象检测上实现了4.7%的显着性能增益。我们的预制模型还展示了对其他检测和分割任务/框架的优异可转移性。
translated by 谷歌翻译
In dense image segmentation tasks (e.g., semantic, panoptic), existing methods can hardly generalize well to unseen image domains, predefined classes, and image resolution & quality variations. Motivated by these observations, we construct a large-scale entity segmentation dataset to explore fine-grained entity segmentation, with a strong focus on open-world and high-quality dense segmentation. The dataset contains images spanning diverse image domains and resolutions, along with high-quality mask annotations for training and testing. Given the high-quality and -resolution nature of the dataset, we propose CropFormer for high-quality segmentation, which can improve mask prediction using high-res image crops that provide more fine-grained image details than the full image. CropFormer is the first query-based Transformer architecture that can effectively ensemble mask predictions from multiple image crops, by learning queries that can associate the same entities across the full image and its crop. With CropFormer, we achieve a significant AP gain of $1.9$ on the challenging fine-grained entity segmentation task. The dataset and code will be released at http://luqi.info/entityv2.github.io/.
translated by 谷歌翻译
尽管有不同的相关框架,已经通过不同和专门的框架解决了语义,实例和Panoptic分段。本文为这些基本相似的任务提供了统一,简单,有效的框架。该框架,名为K-Net,段段由一组被学习内核持续一致,其中每个内核负责为潜在实例或填充类生成掩码。要解决区分各种实例的困难,我们提出了一个内核更新策略,使每个内核动态和条件在输入图像中的有意义的组上。 K-NET可以以结尾的方式培训,具有二分匹配,其培训和推论是自然的NMS和无框。没有钟声和口哨,K-Net超越了先前发表的全面的全面的单一模型,在ADE20K Val上的MS Coco Test-Dev分割和语义分割上分别与55.2%PQ和54.3%Miou分裂。其实例分割性能也与MS COCO上的级联掩模R-CNN相同,具有60%-90%的推理速度。代码和模型将在https://github.com/zwwwayne/k-net/发布。
translated by 谷歌翻译
Panoptic semonation涉及联合语义分割和实例分割的组合,其中图像内容分为两种类型:事物和东西。我们展示了Panoptic SegFormer,是与变压器的Panoptic Semonation的一般框架。它包含三个创新组件:高效的深度监督掩模解码器,查询解耦策略以及改进的后处理方法。我们还使用可变形的DETR来有效地处理多尺度功能,这是一种快速高效的DETR版本。具体而言,我们以层式方式监督掩模解码器中的注意模块。这种深度监督策略让注意模块快速关注有意义的语义区域。与可变形的DETR相比,它可以提高性能并将所需培训纪元的数量减少一半。我们的查询解耦策略对查询集的职责解耦并避免了事物和东西之间的相互干扰。此外,我们的后处理策略通过联合考虑分类和分割质量来解决突出的面具重叠而没有额外成本的情况。我们的方法会在基线DETR模型上增加6.2 \%PQ。 Panoptic SegFormer通过56.2 \%PQ实现最先进的结果。它还显示出对现有方法的更强大的零射鲁布利。代码释放\ url {https://github.com/zhiqi-li/panoptic-segformer}。
translated by 谷歌翻译
图像分割是关于使用不同语义的分组像素,例如类别或实例成员身份,其中每个语义选择定义任务。虽然只有每个任务的语义不同,但目前的研究侧重于为每项任务设计专业架构。我们提出了蒙面关注掩模变压器(Mask2Former),这是一种能够寻址任何图像分段任务(Panoptic,实例或语义)的新架构。其关键部件包括屏蔽注意,通过限制预测掩模区域内的横向提取局部特征。除了将研究工作减少三次之外,它还优于四个流行的数据集中的最佳专业架构。最值得注意的是,Mask2Former为Panoptic semonation(Coco 57.8 PQ)设置了新的最先进的,实例分段(Coco上50.1 AP)和语义分割(ADE20K上的57.7 miou)。
translated by 谷歌翻译
In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. Code is available at \url{https://github.com/IDEACVR/MaskDINO}.
translated by 谷歌翻译
In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation for enabling prediction of an extra unknown class which helps better resolve the conflicts between semantic and instance segmentation. Additionally, it handles the challenge caused by the varying number of instances and permits back propagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that our UPSNet achieves stateof-the-art performance with much faster inference. Code has been made available at: https://github.com/ uber-research/UPSNet. * Equal contribution.† This work was done when Hengshuang Zhao was an intern at Uber ATG.
translated by 谷歌翻译
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works utilize separated approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework named Panoptic-PartFormer. Moreover, we find the previous metric PartPQ biases to PQ. To handle both issues, we make the following contributions: Firstly, we design a meta-architecture that decouples part feature and things/stuff feature, respectively. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model as Panoptic-PartFormer. Secondly, we propose a new metric Part-Whole Quality (PWQ) to better measure such task from both pixel-region and part-whole perspectives. It can also decouple the error for part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former, based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross attention scheme to further boost part segmentation qualities. We design a new part-whole interaction method using masked cross attention. Finally, the extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost drop of 70% on GFlops and 50% on parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
translated by 谷歌翻译
视频分析的图像分割在不同的研究领域起着重要作用,例如智能城市,医疗保健,计算机视觉和地球科学以及遥感应用。在这方面,最近致力于发展新的细分策略;最新的杰出成就之一是Panoptic细分。后者是由语义和实例分割的融合引起的。明确地,目前正在研究Panoptic细分,以帮助获得更多对视频监控,人群计数,自主驾驶,医学图像分析的图像场景的更细致的知识,以及一般对场景更深入的了解。为此,我们介绍了本文的首次全面审查现有的Panoptic分段方法,以获得作者的知识。因此,基于所采用的算法,应用场景和主要目标的性质,执行现有的Panoptic技术的明确定义分类。此外,讨论了使用伪标签注释新数据集的Panoptic分割。继续前进,进行消融研究,以了解不同观点的Panoptic方法。此外,讨论了适合于Panoptic分割的评估度量,并提供了现有解决方案性能的比较,以告知最先进的并识别其局限性和优势。最后,目前对主题技术面临的挑战和吸引不久的将来吸引相当兴趣的未来趋势,可以成为即将到来的研究研究的起点。提供代码的文件可用于:https://github.com/elharroussomar/awesome-panoptic-egation
translated by 谷歌翻译
现代方法通常将语义分割标记为每个像素分类任务,而使用替代掩码分类处理实例级分割。我们的主要洞察力:掩码分类是足够的一般,可以使用完全相同的模型,丢失和培训过程来解决语义和实例级分段任务。在此观察之后,我们提出了一个简单的掩模分类模型,该模型预测了一组二进制掩码,每个模型与单个全局类标签预测相关联。总的来说,所提出的基于掩模分类的方法简化了语义和Panoptic分割任务的有效方法的景观,并显示出优异的经验结果。特别是,当类的数量大时,我们观察到掩码形成器优于每个像素分类基线。我们的面具基于分类的方法优于当前最先进的语义(ADE20K上的55.6 miou)和Panoptic Seation(Coco)模型的Panoptic Seationation(52.7 PQ)。
translated by 谷歌翻译
分割高度重叠的图像对象是具有挑战性的,因为图像上的真实对象轮廓和遮挡边界之间通常没有区别。与先前的实例分割方法不同,我们将图像形成模拟为两个重叠层的组成,并提出了双层卷积网络(BCNET),其中顶层检测到遮挡对象(遮挡器),而底层则渗透到部分闭塞实例(胶囊)。遮挡关系与双层结构的显式建模自然地将遮挡和遮挡实例的边界解散,并在掩模回归过程中考虑了它们之间的相互作用。我们使用两种流行的卷积网络设计(即完全卷积网络(FCN)和图形卷积网络(GCN))研究了双层结构的功效。此外,我们通过将图像中的实例表示为单独的可学习封闭器和封闭者查询,从而使用视觉变压器(VIT)制定双层解耦。使用一个/两个阶段和基于查询的对象探测器具有各种骨架和网络层选择验证双层解耦合的概括能力,如图像实例分段基准(可可,亲戚,可可)和视频所示实例分割基准(YTVIS,OVIS,BDD100K MOTS),特别是对于重闭塞病例。代码和数据可在https://github.com/lkeab/bcnet上找到。
translated by 谷歌翻译
We propose and study a task we name panoptic segmentation (PS). Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). The proposed task requires generating a coherent scene segmentation that is rich and complete, an important step toward real-world vision systems. While early work in computer vision addressed related image/scene parsing tasks, these are not currently popular, possibly due to lack of appropriate metrics or associated recognition challenges. To address this, we propose a novel panoptic quality (PQ) metric that captures performance for all classes (stuff and things) in an interpretable and unified manner. Using the proposed metric, we perform a rigorous study of both human and machine performance for PS on three existing datasets, revealing interesting insights about the task. The aim of our work is to revive the interest of the community in a more unified view of image segmentation.
translated by 谷歌翻译
In contrast to fully supervised methods using pixel-wise mask labels, box-supervised instance segmentation takes advantage of simple box annotations, which has recently attracted increasing research attention. This paper presents a novel single-shot instance segmentation approach, namely Box2Mask, which integrates the classical level-set evolution model into deep neural network learning to achieve accurate mask prediction with only bounding box supervision. Specifically, both the input image and its deep features are employed to evolve the level-set curves implicitly, and a local consistency module based on a pixel affinity kernel is used to mine the local context and spatial relations. Two types of single-stage frameworks, i.e., CNN-based and transformer-based frameworks, are developed to empower the level-set evolution for box-supervised instance segmentation, and each framework consists of three essential components: instance-aware decoder, box-level matching assignment and level-set evolution. By minimizing the level-set energy function, the mask map of each instance can be iteratively optimized within its bounding box annotation. The experimental results on five challenging testbeds, covering general scenes, remote sensing, medical and scene text images, demonstrate the outstanding performance of our proposed Box2Mask approach for box-supervised instance segmentation. In particular, with the Swin-Transformer large backbone, our Box2Mask obtains 42.4% mask AP on COCO, which is on par with the recently developed fully mask-supervised methods. The code is available at: https://github.com/LiWentomng/boxlevelset.
translated by 谷歌翻译
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-ofthe-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, topperforming method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
translated by 谷歌翻译
Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation in the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, CityScapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even more performance improvement. We believe OneFormer is a significant step towards making image segmentation more universal and accessible. To support further research, we open-source our code and models at https://github.com/SHI-Labs/OneFormer
translated by 谷歌翻译
In this work, we introduce Panoptic-DeepLab, a simple, strong, and fast system for panoptic segmentation, aiming to establish a solid baseline for bottom-up methods that can achieve comparable performance of two-stage methods while yielding fast inference speed. In particular, Panoptic-DeepLab adopts the dual-ASPP and dual-decoder structures specific to semantic, and instance segmentation, respectively. The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression. As a result, our single Panoptic-DeepLab simultaneously ranks first at all three Cityscapes benchmarks, setting the new state-of-art of 84.2% mIoU, 39.0% AP, and 65.5% PQ on test set. Additionally, equipped with MobileNetV3, Panoptic-DeepLab runs nearly in real-time with a single 1025 × 2049 image (15.8 frames per second), while achieving a competitive performance on Cityscapes (54.1 PQ% on test set). On Mapillary Vistas test set, our ensemble of six models attains 42.7% PQ, outperforming the challenge winner in 2018 by a healthy margin of 1.5%. Finally, our Panoptic-DeepLab also performs on par with several topdown approaches on the challenging COCO dataset. For the first time, we demonstrate a bottom-up approach could deliver state-of-the-art results on panoptic segmentation.
translated by 谷歌翻译
本文提出了一种用于对象和场景的高质量图像分割的新方法。灵感来自于形态学图像处理技术中的扩张和侵蚀操作,像素级图像分割问题被视为挤压对象边界。从这个角度来看,提出了一种新颖且有效的\ textBF {边界挤压}模块。该模块用于从内侧和外侧方向挤压对象边界,这有助于精确掩模表示。提出了双向基于流的翘曲过程来产生这种挤压特征表示,并且设计了两个特定的损耗信号以监控挤压过程。边界挤压模块可以通过构建一些现有方法构建作为即插即用模块,可以轻松应用于实例和语义分段任务。此外,所提出的模块是重量的,因此具有实际使用的潜力。实验结果表明,我们简单但有效的设计可以在几个不同的数据集中产生高质量的结果。此外,边界上的其他几个指标用于证明我们对以前的工作中的方法的有效性。我们的方法对实例和语义分割的具有利于Coco和CityCapes数据集来产生重大改进,并且在相同的设置下以前的最先进的速度优于先前的最先进的速度。代码和模型将在\ url {https:/github.com/lxtgh/bsseg}发布。
translated by 谷歌翻译
在这项工作中,我们将全景景观分割介绍为最整体的场景理解,无论是在视野(FOV)和图像级别的理解方面,用于基于标准摄像机的输入。完整的围绕理解为移动代理提供了最大的信息,这对于任何智能车辆至关重要,以便在安全至关重要的动态环境(例如现实世界流量)中做出明智的决定。为了克服缺乏带注释的全景图像,我们提出了一个框架,该框架允许在标准针孔图像上进行模型训练,并以成本限制的方式将学习的功能传输到不同的域。使用我们提出的方法和密集的对比度学习,我们设法对非适应方法实现了重大改进。根据有效的综合分割体系结构,我们可以在我们已建立的野生全景泛滥分割(WILDPPS)数据集中,以圆锥体质量(PQ)测量的3.5-6.5%提高3.5-6.5%。此外,我们的有效框架不需要访问目标域的图像,使其成为适合有限硬件设置的可行域概括方法。作为其他贡献,我们发布了WILDPPS:第一个全景全景图像数据集,以促进周围感知的进展,并探索一种结合受监督和对比度培训的新型培训程序。
translated by 谷歌翻译
全景部分分割(PPS)旨在将泛型分割和部分分割统一为一个任务。先前的工作主要利用分离的方法来处理事物,物品和部分预测,而无需执行任何共享的计算和任务关联。在这项工作中,我们旨在将这些任务统一在架构层面上,设计第一个名为Panoptic-Partformer的端到端统一方法。特别是,由于视觉变压器的最新进展,我们将事物,内容和部分建模为对象查询,并直接学会优化所有三个预测作为统一掩码的预测和分类问题。我们设计了一个脱钩的解码器,以分别生成零件功能和事物/东西功能。然后,我们建议利用所有查询和相应的特征共同执行推理。最终掩码可以通过查询和相应特征之间的内部产品获得。广泛的消融研究和分析证明了我们框架的有效性。我们的全景局势群体在CityScapes PPS和Pascal Context PPS数据集上实现了新的最新结果,至少有70%的GFLOPS和50%的参数降低。特别是,在Pascal上下文PPS数据集上采用SWIN Transformer后,我们可以通过RESNET50骨干链和10%的改进获得3.4%的相对改进。据我们所知,我们是第一个通过\ textit {统一和端到端变压器模型来解决PPS问题的人。鉴于其有效性和概念上的简单性,我们希望我们的全景贡献者能够充当良好的基准,并帮助未来的PPS统一研究。我们的代码和型号可在https://github.com/lxtgh/panoptic-partformer上找到。
translated by 谷歌翻译