Research into few-shot semantic segmentation (FSS) has attracted great attention; the goal is to segment objects of a target class in a query image, given only a few support images annotated for that class. The key to this challenging task is to fully utilize the information in the support images by exploiting fine-grained correlations between the query and support images. However, most existing approaches either compress the support information into a few class-wise prototypes, or use only partial support information (e.g., the foreground alone) at the pixel level, causing non-negligible information loss. In this paper, we propose Dense pixel-wise Cross-query-and-support Attention weighted Mask Aggregation (DCAMA), where both foreground and background support information are fully exploited via multi-level pixel-wise correlations between paired query and support features. Implemented with the scaled dot-product attention of the Transformer architecture, DCAMA treats every query pixel as a token, computes its similarities with all support pixels, and predicts its segmentation label as an additive aggregation of all support pixels' labels, weighted by the similarities. Based on this unique formulation of DCAMA, we further propose an efficient and effective one-pass inference for n-shot segmentation, where pixels of all support images are collected for mask aggregation at once. Experiments show that our DCAMA significantly advances the state of the art, surpassing the previous best records on the standard FSS benchmarks of PASCAL-5i, COCO-20i, and FSS-1000. Ablation studies also verify the design of DCAMA.
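The core computation lends itself to a compact illustration. Below is a minimal sketch of attention-weighted mask aggregation under the stated formulation, assuming pre-flattened per-pixel features; the function and variable names are illustrative, not the authors' implementation.

```python
import torch

def mask_aggregation(query_feat, support_feat, support_mask):
    """query_feat: (Nq, d); support_feat: (Ns, d); support_mask: (Ns,) in {0, 1}."""
    d = query_feat.size(-1)
    # Scaled dot-product similarity of every query pixel (token) with every support pixel.
    attn = torch.softmax(query_feat @ support_feat.t() / d ** 0.5, dim=-1)  # (Nq, Ns)
    # A query pixel's foreground score is the similarity-weighted sum of the support
    # pixels' mask labels, so background pixels (label 0) also inform the prediction.
    return attn @ support_mask.float()  # (Nq,) soft foreground scores
```

Under this formulation, n-shot inference reduces to concatenating the pixels of all n support images along the Ns dimension before the same one-pass aggregation.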
In visual recognition tasks, few-shot learning requires the ability to learn object categories from only a few support examples. Given the development of deep learning, its resurgence has mostly concerned image classification. This work focuses on few-shot semantic segmentation, which remains a largely unexplored field. Several recent advances are typically restricted to single-class few-shot segmentation. In this paper, we first present a novel multi-way (class) encoding and decoding architecture that effectively fuses multi-scale query information and multi-class support information into a single query-support embedding, from which multi-class segmentation is directly decoded. For better feature fusion, a multi-level attention mechanism is proposed within the architecture, comprising attention for the support-feature modulation and attention for the multi-scale combination. Finally, to enhance learning of the embedding space, an additional pixel-wise metric learning module is introduced, with a triplet loss defined on pixel-level embeddings of the input images. Extensive experiments on the standard benchmarks PASCAL-5i and COCO-20i show clear benefits of our method over the state of the art.
Few-shot segmentation aims to train a segmentation model that can quickly adapt to novel classes with only a few examples. The conventional training paradigm is to learn to make predictions on query images conditioned on the features of support images. Previous methods only utilize semantic-level prototypes of support images as conditional information. These methods cannot exploit all pixel-wise support information for query prediction, which is critical for the segmentation task. In this paper, we focus on utilizing pixel-wise relationships between support and query images to facilitate the few-shot segmentation task. We design a novel Cycle-Consistent Transformer (CyCTR) module to aggregate pixel-wise support features into the query. CyCTR performs cross-attention between features from different images, i.e., the support and query images. We observe that there may exist unexpected irrelevant pixel-level support features; directly performing cross-attention may aggregate these features from the support to the query and bias the query features. We therefore propose a novel cycle-consistent attention mechanism to filter out possibly harmful support features and encourage query features to attend to the most informative pixels of the support images. Experiments on all few-shot segmentation benchmarks demonstrate that our proposed CyCTR yields remarkable improvements over previous state-of-the-art methods. Specifically, on the PASCAL-$5^i$ and COCO-$20^i$ datasets, we achieve 66.6% and 45.6% mIoU for 5-shot segmentation, outperforming the previous state of the art by 4.6% and 7.1%, respectively.
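The gating behind cycle-consistent attention can be sketched as follows, assuming precomputed support-to-query similarities; the names and the hard argmax round trip are simplifications of the module described above.

```python
import torch

def cycle_consistent_keep_mask(sim, support_labels):
    """sim: (Ns, Nq) support-to-query similarities; support_labels: (Ns,) in {0, 1}."""
    s2q = sim.argmax(dim=1)        # each support pixel -> its most similar query pixel
    q2s = sim.argmax(dim=0)        # each query pixel -> its most similar support pixel
    round_trip = q2s[s2q]          # support -> query -> support
    # A support pixel is kept for cross-attention only if the round trip lands on a
    # support pixel carrying the same mask label; others are treated as harmful.
    return support_labels[round_trip] == support_labels  # (Ns,) boolean keep-mask
```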
Few-shot segmentation aims to segment objects of unseen classes given only a few labeled samples. Prototype learning, in which the support features yield a single prototype by averaging global and local object information, has been widely used in FSS. However, utilizing only a prototype vector may be insufficient to represent the features of all the training data. To extract abundant features and make more precise predictions, we propose a Multi-Similarity and Attention Network (MSANet) comprising two novel modules: a multi-similarity module and an attention module. The multi-similarity module exploits multiple feature maps of the support and query images to estimate accurate semantic relationships. The attention module instructs the network to concentrate on class-relevant information. The network is tested on the standard FSS settings PASCAL-5i 1-shot, PASCAL-5i 5-shot, COCO-20i 1-shot, and COCO-20i 5-shot. MSANet with a ResNet-101 backbone achieves state-of-the-art performance on all four benchmarks, with mean intersection-over-union (mIoU) of 69.13%, 73.99%, 51.09%, and 56.80%, respectively. Code is available at https://github.com/aivresearch/msanet
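As a generic illustration of multi-similarity, the sketch below computes one prototype-based similarity map per backbone level; MSANet's actual module differs in its details, and all names here are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_level_similarity(query_feats, support_feats, support_mask):
    """query_feats, support_feats: lists of (C_l, H_l, W_l); support_mask: (H, W)."""
    sims = []
    for q, s in zip(query_feats, support_feats):
        m = F.interpolate(support_mask[None, None].float(),
                          size=s.shape[-2:], mode="nearest")[0, 0]
        proto = (s * m).sum(dim=(1, 2)) / m.sum().clamp(min=1.0)  # masked average pooling
        sims.append(F.cosine_similarity(q, proto[:, None, None], dim=0))  # (H_l, W_l)
    return sims  # one semantic-relation map per feature level
```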
We introduce a novel cost aggregation network, dubbed Volumetric Aggregation with Transformers (VAT), which tackles the few-shot segmentation task by using both convolutions and transformers to efficiently handle the high-dimensional correlation maps between the query and the support. Specifically, we propose an encoder composed of a volume embedding module, which not only transforms the correlation maps into a more tractable form but also injects some convolutional inductive bias, and volumetric transformer modules for cost aggregation. Our encoder has a pyramidal structure that lets coarser-level aggregation guide finer levels and enforces complementary matching scores. We then feed the output into our affinity-aware decoder, along with projected feature maps, to guide the segmentation process. Combining these components, we conduct experiments demonstrating the effectiveness of the proposed method, and our method sets a new state of the art on all standard benchmarks for the few-shot segmentation task. Furthermore, we find that the proposed method attains state-of-the-art performance even on standard benchmarks for the semantic correspondence task, although it was not designed specifically for that task. We also provide an extensive ablation study to validate our architectural choices. The trained weights and code are available at: https://seokju-cho.github.io/vat/.
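The raw object that such cost aggregation operates on is the high-dimensional correlation map. A hedged sketch of constructing a 4D correlation volume of cosine similarities is given below; shapes and names are illustrative only.

```python
import torch
import torch.nn.functional as F

def correlation_volume(query_feat, support_feat):
    """query_feat: (C, Hq, Wq); support_feat: (C, Hs, Ws) -> (Hq, Wq, Hs, Ws)."""
    q = F.normalize(query_feat.flatten(1), dim=0)    # (C, Hq*Wq), unit-norm channels
    s = F.normalize(support_feat.flatten(1), dim=0)  # (C, Hs*Ws)
    corr = q.t() @ s                                 # cosine similarities, (Hq*Wq, Hs*Ws)
    return corr.view(*query_feat.shape[1:], *support_feat.shape[1:])
```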
Existing few-shot segmentation methods have achieved great progress based on the support-query matching framework, but they still suffer from the limited coverage of the few provided support samples. Motivated by the simple Gestalt principle that pixels belonging to the same object are more similar than pixels belonging to different objects of the same class, we propose a novel self-support matching strategy to alleviate this problem, which uses query prototypes to match query features, where the query prototypes are collected from high-confidence query predictions. This strategy can effectively capture the consistent underlying characteristics of the query objects, and thus fittingly match the query features. We also propose an adaptive self-support background prototype generation module and a self-support loss to further facilitate the self-support matching procedure. Our self-support network substantially improves prototype quality, benefits more from stronger backbones and more supports, and achieves SOTA on multiple datasets. Code is at \url{https://github.com/fanq15/ssp}.
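A hedged sketch of the self-support idea follows, assuming an initial foreground probability map for the query is available; the confidence threshold and names are illustrative, not the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def self_support_match(query_feat, fg_prob, tau=0.7):
    """query_feat: (C, H, W); fg_prob: (H, W) initial foreground probabilities."""
    conf = (fg_prob > tau).float()  # keep only high-confidence foreground pixels
    # Pool the confident query pixels into a self-support (query) prototype.
    proto = (query_feat * conf).sum(dim=(1, 2)) / conf.sum().clamp(min=1.0)  # (C,)
    # Match the query against its own prototype (self-support matching).
    return F.cosine_similarity(query_feat, proto[:, None, None], dim=0)  # (H, W)
```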
Few-shot semantic segmentation aims to recognize the object regions of unseen categories with only a few annotated examples as supervision. The key to few-shot segmentation is to establish robust semantic relationships between the support and query images and to prevent overfitting. In this paper, we propose an efficient Multi-Similarity Hyper-correlation Network (MSHNet) to tackle the few-shot semantic segmentation problem. In MSHNet, we propose a new Generative Prototype Similarity (GPS) which, together with cosine similarity, can establish robust semantic relationships between the support and query images. The locally generated prototype similarity based on global features is logically complementary to the global cosine similarity based on local features, and the relationship between the query image and the support images can be expressed more comprehensively by using the two similarities simultaneously. In addition, we propose a Symmetric Merging Block (SMB) in MSHNet to efficiently merge multi-layer, multi-shot, and multi-similarity hyper-correlation features. MSHNet is built on similarities rather than class-specific features, which achieves more general unity and effectively reduces overfitting. On the two benchmark semantic segmentation datasets PASCAL-5i and COCO-20i, MSHNet achieves new state-of-the-art performance on the 1-shot and 5-shot semantic segmentation tasks.
Training semantic segmentation models with few annotated samples has great potential in various real-world applications. For the few-shot segmentation task, the main challenge is how to accurately measure the semantic correspondence between the support and query samples with limited training data. To address this problem, we propose to aggregate learnable covariance matrices with a deformable 4D transformer to effectively predict the segmentation map. Specifically, in this work we first devise a novel hard-example mining mechanism to learn covariance kernels for a Gaussian process. The learned covariance kernel functions have great advantages over existing cosine-similarity-based methods in correspondence measurement. Based on the learned covariance kernels, an efficient doubly deformable 4D transformer module is designed to adaptively aggregate feature similarity maps into segmentation results. Combining these two designs, the proposed method not only sets new state-of-the-art performance on public benchmarks but also converges faster than existing methods. Experiments on three public datasets demonstrate the effectiveness of our method.
This paper presents a novel cost aggregation network, called Volumetric Aggregation with Transformers (VAT), for few-shot segmentation. The use of transformers can benefit correlation-map aggregation through self-attention over a global receptive field. However, the tokenization of a correlation map for transformer processing can be harmful, because the discontinuity at token boundaries reduces the local context available near the token edges and decreases inductive bias. To address this problem, we propose a 4D convolutional Swin transformer, in which a high-dimensional Swin transformer is preceded by a series of small-kernel convolutions that impart local context to all pixels and introduce convolutional inductive bias. We additionally boost aggregation performance by applying transformers within a pyramidal structure, where aggregation at a coarser level guides aggregation at a finer level. The noise in the transformer output is then filtered in the subsequent decoder with the help of the query's appearance embedding. With this model, a new state of the art is set for all standard benchmarks in few-shot segmentation. It is shown that VAT also attains state-of-the-art performance for semantic correspondence, where cost aggregation likewise plays a central role.
Few-shot semantic segmentation aims to learn to segment new object classes with only a few annotated examples, which has a wide range of real-world applications. Most existing methods either focus on the restrictive setting of one-way few-shot segmentation or suffer from incomplete coverage of object regions. In this paper, we propose a novel few-shot semantic segmentation framework based on the prototype representation. Our key idea is to decompose the holistic class representation into a set of part-aware prototypes, capable of capturing diverse and fine-grained object features. In addition, we propose to leverage unlabeled data to enrich our part-aware prototypes, resulting in better modeling of intra-class variations of semantic objects. We develop a novel graph neural network model to generate and enhance the proposed part-aware prototypes based on labeled and unlabeled images. Extensive experimental evaluations on two benchmarks show that our method outperforms the prior art by a sizable margin.
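One simple way to realize the decomposition is to cluster the masked support features in embedding space, as sketched below under the assumption of a plain k-means step; the graph-neural-network generation and enhancement of the prototypes, and the use of unlabeled data, are omitted here.

```python
import torch

def part_prototypes(support_feat, support_mask, k=5, iters=10):
    """support_feat: (C, H, W); support_mask: (H, W) -> (k, C) part-aware prototypes.
    Assumes the mask contains at least k foreground pixels."""
    fg = support_feat.flatten(1).t()[support_mask.flatten() > 0]  # (Nfg, C) foreground pixels
    protos = fg[torch.randperm(fg.size(0))[:k]]                   # initialize from random pixels
    for _ in range(iters):  # plain k-means in feature space
        assign = torch.cdist(fg, protos).argmin(dim=1)            # nearest prototype per pixel
        protos = torch.stack([
            fg[assign == j].mean(dim=0) if (assign == j).any() else protos[j]
            for j in range(k)
        ])
    return protos
```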
Few-shot semantic segmentation aims to segment novel-class objects given only a few labeled support images. Most advanced solutions exploit a metric learning framework that performs segmentation by matching each query feature to a learned class-specific prototype. However, this framework suffers from biased classification due to incomplete feature comparison. To address this issue, we present an adaptive prototype representation by introducing class-specific and class-agnostic prototypes, thereby constructing complete sample pairs for learning semantic alignment with query features. This complementary feature-learning manner effectively enriches feature comparison and helps yield an unbiased segmentation model in the few-shot setting. It is implemented with a two-branch end-to-end network (i.e., a class-specific branch and a class-agnostic branch), which generates prototypes and then combines them with query features to perform comparisons. Moreover, the proposed class-agnostic branch is simple yet effective: in practice, it can adaptively generate multiple class-agnostic prototypes for query images and learn feature alignment in a self-contrastive manner. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ demonstrate the superiority of our method. At no expense of inference efficiency, our model achieves new state-of-the-art results in both the 1-shot and 5-shot settings for few-shot semantic segmentation.
We study few-shot semantic segmentation, which aims to segment target objects from a query image when provided with a few annotated support images of the target class. Several recent methods resort to a feature masking (FM) technique to discard irrelevant feature activations, which eventually facilitates the reliable prediction of segmentation masks. A fundamental limitation of FM is its inability to preserve the fine-grained spatial details that affect the accuracy of the segmentation mask, especially for small target objects. In this paper, we develop a simple, effective, and efficient approach to enhance feature masking, which we dub hybrid masking (HM). Specifically, we compensate for the loss of fine-grained spatial details in the FM technique by investigating and leveraging a complementary basic input masking method. Experiments have been conducted on three publicly available benchmarks with strong few-shot segmentation (FSS) baselines. We empirically show improved performance against the current state-of-the-art methods by visible margins across the different benchmarks. Our code and trained models are available at: https://github.com/moonsh/hm-hybrid-masking
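The two masking routes being combined can be sketched as follows; the fusion is deliberately simplified to a sum, and the backbone interface is an assumption.

```python
import torch
import torch.nn.functional as F

def hybrid_masking(backbone, support_img, support_mask):
    """support_img: (3, H, W); support_mask: (H, W); backbone: (1, 3, H, W) -> (1, C, h, w)."""
    # Feature masking (FM): mask the activations after the backbone.
    feat = backbone(support_img[None])[0]                               # (C, h, w)
    m = F.interpolate(support_mask[None, None].float(),
                      size=feat.shape[-2:], mode="nearest")[0, 0]
    fm = feat * m
    # Input masking (IM): mask the image itself, preserving fine spatial detail.
    im = backbone((support_img * support_mask[None].float())[None])[0]  # (C, h, w)
    return fm + im  # one simple way to combine the complementary cues
```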
Few-shot segmentation aims to devise a generalizing model that segments query images from classes unseen during training, with the guidance of a few support images whose classes tally with the class of the query. There exist two domain-specific problems mentioned in previous works, namely spatial inconsistency and bias towards seen classes. To account for the former problem, our method compares the support feature map with the query feature map at multiple scales to become scale-agnostic. As a solution to the latter problem, a supervised model, called the base learner, is trained on the available classes to accurately identify pixels belonging to seen classes. Hence, the subsequent meta learner has a chance to discard areas belonging to seen classes with the help of an ensemble learning model that coordinates the meta learner with the base learner. We simultaneously address these two vital problems for the first time and achieve state-of-the-art performance on both the PASCAL-5i and COCO-20i datasets.
Few-shot segmentation aims to learn a segmentation model that can generalize to novel classes with only a few training images. In this paper, we propose a Cross-Reference and Local-Global Conditional Network (CRCNet) for few-shot segmentation. Unlike previous works that only predict the mask of the query image, our proposed model concurrently makes predictions for both the support image and the query image. With a cross-reference mechanism, our network can better find the co-occurring objects in the two images, thus helping the few-shot segmentation task. To further improve feature comparison, we develop a local-global conditional module to capture both global and local relations. We also develop a mask refinement module to recurrently refine the prediction of the foreground regions. Experiments on the PASCAL VOC 2012, MS COCO, and FSS-1000 datasets show that our network achieves new state-of-the-art performance.
Despite the remarkable success of existing methods for few-shot segmentation, there remain two crucial challenges. First, the feature learning for novel classes is suppressed during training on base classes, in that the novel classes are always treated as background; thus, the semantics of the novel classes are not well learned. Second, most existing methods fail to consider the underlying semantic gap between the support and the query resulting from the representation bias of the scarce support samples. To circumvent these two challenges, we propose to explicitly activate the discriminability of novel classes in both the feature encoding stage and the prediction stage of segmentation. In the feature encoding stage, we design the Semantic-Preserving Feature Learning module (SPFL) to first exploit and then retain the latent semantics contained in the whole input image, especially those in the background that belong to novel classes. In the prediction stage, we learn a Self-Refined Online Foreground-Background classifier (SROFB), which is able to refine itself using the high-confidence pixels of the query image to facilitate its adaptation to the query image and bridge the support-query semantic gap. Extensive experiments on the PASCAL-5$^i$ and COCO-20$^i$ datasets demonstrate the advantages of these two novel designs, both quantitatively and qualitatively.
Few-shot segmentation (FSS) aims to segment a target class with a small number of labeled images (the support set). To extract information relevant to the target class, a dominant approach in the best-performing FSS baselines removes background features using the support mask. We observe that this support mask presents an information bottleneck in several challenging FSS cases, e.g., for small targets and/or inaccurate target boundaries. To this end, we present a novel method (MSI), which maximizes the support-set information by exploiting two complementary sources of features in generating super correlation maps. We validate the effectiveness of our approach by instantiating it into three recent and strong FSS baselines. Experimental results on several publicly available FSS benchmarks show that our proposed method consistently improves performance by visible margins and allows faster convergence. Our codes and models will be publicly released.
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose SegViT. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Differently, we make use of the fundamental component, the attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on the COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40\%$ of the computation while maintaining competitive performance.
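The ATM idea can be condensed to reading masks directly off token similarities, as in the sketch below; the real module's projections, multi-head attention, and training losses are omitted, and all names are illustrative.

```python
import torch

def attention_to_mask(class_tokens, spatial_tokens, hw):
    """class_tokens: (K, d); spatial_tokens: (N, d); hw: (H, W) with H * W == N."""
    d = class_tokens.size(-1)
    sim = class_tokens @ spatial_tokens.t() / d ** 0.5  # (K, N) similarity maps
    return torch.sigmoid(sim).view(-1, *hw)             # (K, H, W) per-class masks
```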
Despite the great progress made by deep CNNs in image semantic segmentation, they typically require a large number of densely-annotated images for training and are difficult to generalize to unseen object categories. Few-shot segmentation has thus been developed to learn to perform segmentation from only a few annotated examples. In this paper, we tackle the challenging few-shot segmentation problem from a metric learning perspective and present PANet, a novel prototype alignment network to better utilize the information of the support set. Our PANet learns class-specific prototype representations from a few support images within an embedding space and then performs segmentation over the query images through matching each pixel to the learned prototypes. With non-parametric metric learning, PANet offers high-quality prototypes that are representative for each semantic class and meanwhile discriminative for different classes. Moreover, PANet introduces a prototype alignment regularization between support and query. With this, PANet fully exploits knowledge from the support and provides better generalization on few-shot segmentation. Significantly, our model achieves mIoU scores of 48.1% and 55.7% on PASCAL-5i for the 1-shot and 5-shot settings respectively, surpassing the state-of-the-art method by 1.8% and 8.6%.
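The non-parametric matching step can be sketched compactly: masked average pooling yields one prototype per class, and each query pixel is assigned to its most similar prototype. The scaling factor and names below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_segment(query_feat, support_feat, support_mask, alpha=20.0):
    """query_feat, support_feat: (C, H, W); support_mask: (H, W) in {0, 1}."""
    protos = []
    for m in (support_mask.float(), 1.0 - support_mask.float()):  # foreground, background
        protos.append((support_feat * m).sum(dim=(1, 2)) / m.sum().clamp(min=1.0))
    protos = torch.stack(protos)                                  # (2, C)
    # Cosine similarity of every query pixel to each prototype, softmaxed over classes.
    sim = F.cosine_similarity(query_feat[None], protos[:, :, None, None], dim=1)
    return torch.softmax(alpha * sim, dim=0)                      # (2, H, W) probabilities
```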
Few-shot segmentation (FSS) aims to segment unseen classes using a few annotated samples. Typically, a prototype representing the foreground class is extracted from annotated support image(s) and is matched to features representing each pixel in the query image. However, models learnt in this way are insufficiently discriminatory, and often produce false positives: misclassifying background pixels as foreground. Some FSS methods try to address this issue by using the background in the support image(s) to help identify the background in the query image. However, the backgrounds of these images are often quite distinct, and hence the support image background information is uninformative. This article proposes a method, QSR, that extracts the background from the query image itself, and as a result is better able to discriminate between foreground and background features in the query image. This is achieved by modifying the training process to associate prototypes with class labels, including known classes from the training data and latent classes representing unknown background objects. This class information is then used to extract a background prototype from the query image. To successfully associate prototypes with class labels and extract a background prototype that is capable of predicting a mask for the background regions of the image, the machinery for extracting and using foreground prototypes is induced to become more discriminative between different classes. Experiments for both 1-shot and 5-shot FSS on both the PASCAL-5i and COCO-20i datasets demonstrate that the proposed method results in a significant improvement in performance for the baseline methods it is applied to. As QSR operates only during training, these improved results are produced with no extra computational complexity during testing.
Few-shot segmentation is a challenging dense prediction task, in which a novel query image must be segmented given only a small annotated support set. The key problem is thus to design a method that aggregates detailed information from the support set while being robust to large variations in appearance and context. To this end, we propose a few-shot segmentation method based on dense Gaussian process (GP) regression. Given the support set, our dense GP learns the mapping from local deep image features to mask values, and is capable of capturing complex appearance distributions. Furthermore, it provides a principled means of capturing uncertainty, which serves as another powerful cue for the final segmentation obtained by the CNN decoder. We further exploit the end-to-end learning capabilities of our approach to learn a high-dimensional output space for the GP. Our approach sets a new state of the art on the PASCAL-5$^i$ and COCO-20$^i$ benchmarks, achieving an absolute gain of +8.4 mIoU in the COCO-20$^i$ 5-shot setting. Furthermore, the segmentation quality of our method scales gracefully when increasing the support set size, while achieving robust cross-dataset transfer. Code and trained models are available at \url{https://github.com/joakimjohnander/dgpnet}.
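A bare-bones sketch of dense GP regression from support features to mask values is shown below, using a squared-exponential kernel; the end-to-end learned features and the high-dimensional output space described above are simplified away, and the hyperparameters are assumptions.

```python
import torch

def gp_posterior_mean(support_feat, support_mask, query_feat, ls=1.0, noise=1e-2):
    """support_feat: (Ns, d); support_mask: (Ns,); query_feat: (Nq, d) -> (Nq,)."""
    def rbf(a, b):  # squared-exponential covariance kernel
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * ls ** 2))
    K = rbf(support_feat, support_feat) + noise * torch.eye(support_feat.size(0))
    k_star = rbf(query_feat, support_feat)                        # (Nq, Ns)
    # GP posterior mean of the mask values at the query pixels.
    return k_star @ torch.linalg.solve(K, support_mask.float())
```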