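In this paper, we propose a simple but effective message passing method to improve the boundary quality of semantic segmentation results. Inspired by the sharp edges that superpixel blocks produce, we use superpixels to guide the information passing within the feature map. At the same time, the sharp boundaries of the blocks restrict the range of message passing. Specifically, we average the features covered by each superpixel block within the feature map and add the result back to each feature vector. In addition, to obtain sharper edges and longer-range spatial dependencies, we develop a multiscale superpixel module (MSP) by cascading superpixel blocks of different scales. Our method can serve as a plug-and-play module and can easily be inserted into any segmentation network without introducing new parameters. Extensive experiments are conducted on three strong baselines, namely PSPNet, DeepLabv3 and DeepLabv3+, and on four challenging scene parsing datasets: ADE20K, Cityscapes, PASCAL VOC and PASCAL Context. The experimental results verify its effectiveness and generalizability.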
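The averaging step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the flattened list layout, and the toy superpixel labels are assumptions made for illustration, and a real pipeline would obtain the labels from a superpixel algorithm and operate on tensors.

```python
from collections import defaultdict


def superpixel_message_passing(features, labels):
    """Average the features inside each superpixel and add the mean back.

    features: list of per-pixel feature vectors (flattened H*W positions).
    labels:   list of superpixel ids, one per pixel (hypothetical input,
              e.g. produced by a superpixel algorithm such as SLIC).
    """
    sums = {}
    counts = defaultdict(int)
    for feat, sp in zip(features, labels):
        if sp not in sums:
            sums[sp] = [0.0] * len(feat)
        sums[sp] = [s + f for s, f in zip(sums[sp], feat)]
        counts[sp] += 1
    means = {sp: [s / counts[sp] for s in total] for sp, total in sums.items()}
    # Each pixel receives the mean feature of its own block only, so no
    # information crosses a superpixel boundary.
    return [[f + m for f, m in zip(feat, means[sp])]
            for feat, sp in zip(features, labels)]


features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
labels = [0, 0, 1, 1]  # two superpixels with two pixels each
out = superpixel_message_passing(features, labels)
print(out)
```

The function has no learnable weights, which is consistent with the abstract's claim that the module adds no new parameters.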
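Modern semantic segmentation methods devote much attention to adjusting feature representations to improve segmentation performance in various ways, such as metric learning and architecture design. However, almost all of these methods neglect the particularity of boundary pixels. Because receptive fields in CNNs expand continuously, these pixels tend to receive confusing features from both sides of the boundary. As a result, they mislead the direction of model optimization and make the class weights of categories that share many adjacent pixels less discriminative, which damages overall performance. In this work, we examine this problem in depth and propose a new method named Embedded Superpixel CRF (ES-CRF) to address it. ES-CRF involves two main aspects. On the one hand, ES-CRF fuses the CRF mechanism into the CNN as an organic whole for more effective end-to-end optimization. It uses the CRF to guide message passing between pixels in the high-level features, purifying the feature representation of boundary pixels with the help of inner pixels that belong to the same object. On the other hand, superpixels are integrated into ES-CRF to exploit the local object prior for more reliable message passing. Finally, our proposed method sets new records on two challenging benchmarks, Cityscapes and ADE20K. Moreover, we provide a detailed theoretical analysis to verify the superiority of ES-CRF.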
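Recent non-local self-attention methods have proven effective at capturing long-range dependencies for semantic segmentation. These methods usually form a similarity map in R^{C×C} (by compressing the spatial dimensions) or in R^{HW×HW} (by compressing the channels) to describe feature relations along either the channel or the spatial dimensions, where C is the number of channels and H and W are the spatial dimensions of the input feature map. However, this practice tends to condense the feature dependencies along the other dimensions and therefore causes attention missing, which can lead to inferior results for small or thin categories or to inconsistent segmentation inside large objects. To address this problem, we propose a new approach, the Fully Attentional Network (FLANet), which encodes both spatial and channel attention in a single similarity map while maintaining high computational efficiency. Specifically, for each channel map, FLANet can harvest feature responses from all other channel maps, and from the associated spatial positions as well, through a novel fully attentional module. Our method achieves state-of-the-art performance on three challenging semantic segmentation datasets: 83.6%, 46.99% and 88.5% on the Cityscapes test set, the ADE20K validation set and the PASCAL VOC test set, respectively.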
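The two similarity-map shapes mentioned in this abstract can be made concrete with a small sketch. It only illustrates the background on conventional channel and spatial attention maps, not FLANet's fully attentional module; the helper names and toy values are assumptions, and the softmax normalization that attention modules usually apply is omitted.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


C, H, W = 2, 2, 3
# Feature map flattened to C rows of H*W values (toy numbers).
X = [[1.0, 0.0, 2.0, 1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0, 2.0, 1.0]]

channel_sim = matmul(X, transpose(X))  # C x C: spatial positions are summed out
spatial_sim = matmul(transpose(X), X)  # HW x HW: channels are summed out

print(len(channel_sim), len(channel_sim[0]))  # 2 2
print(len(spatial_sim), len(spatial_sim[0]))  # 6 6
```

Each map sums over the dimension it does not index, which is the condensation of dependencies that the abstract identifies as the cause of attention missing.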
Spatial pyramid pooling modules and encoder-decoder structures are used in deep neural networks for the semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages of both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. We further explore the Xception model and apply depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, achieving test set performance of 89.0% and 82.1% without any post-processing. Our paper is accompanied by a publicly available reference implementation of the proposed models in TensorFlow at https://github.com/tensorflow/models/tree/master/research/deeplab.
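Semantic segmentation is one of the key tasks in computer vision: assigning a category label to each pixel in an image. Despite significant recent progress, most existing methods still face two challenging issues: 1) the sizes of objects and stuff in an image can be very diverse, which requires incorporating multi-scale features into fully convolutional networks (FCNs); 2) pixels close to or at the boundaries of objects/stuff are hard to classify because of the intrinsic weakness of convolutional networks. To address the first issue, we propose a new Multi-Receptive Field Module (MRFM) that explicitly takes multi-scale features into account. For the second issue, we design an edge-aware loss that is effective in distinguishing the boundaries of objects/stuff. With these two designs, our Multi Receptive Field Network achieves new state-of-the-art results on two widely used semantic segmentation benchmark datasets. Specifically, we achieve a mean IoU of 83.0 on the Cityscapes dataset and 88.4 on the Pascal VOC2012 dataset.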
In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the feature at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve feature representation, which contributes to more precise segmentation results. We achieve new state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff. In particular, a Mean IoU score of 81.5% on the Cityscapes test set is achieved without using coarse data.
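The weighted sum performed by a position attention module can be sketched as follows. This is a simplified illustration under stated assumptions, not DANet's module: the raw features act as queries, keys and values at once (real modules usually apply learned projections first), and the function name is hypothetical.

```python
import math


def position_attention(features):
    """Replace each position's feature by a softmax-weighted sum over all positions.

    features: list of N feature vectors, one per spatial position.
    Assumption: the similarity is a plain dot product of the raw features.
    """
    out = []
    for query in features:
        scores = [sum(a * b for a, b in zip(query, key)) for key in features]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * value[d] for w, value in zip(weights, features))
                    for d in range(len(query))])
    return out


same = position_attention([[2.0, 1.0]] * 3)
mixed = position_attention([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(mixed)
```

Positions with identical features receive identical outputs regardless of where they sit in the list, which mirrors the abstract's remark that similar features are related regardless of distance.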
Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.
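For readers unfamiliar with the mIoU figures quoted in these abstracts, the metric can be sketched in a few lines. This is a toy version under stated assumptions: the function name is hypothetical, labels are flat lists, and intersections and unions are counted on a single example, whereas benchmark implementations typically accumulate them over the whole dataset before dividing.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, 3)
print(round(score, 3))  # per-class IoUs are 1/3, 2/3 and 1/2
```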
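Co-occurring visual patterns make context aggregation an essential paradigm for semantic segmentation. Existing studies focus on modeling the contexts within an image while neglecting the valuable semantics of the corresponding categories beyond the image. To this end, we propose a novel paradigm for softly mining contextual information beyond the image, named MCIBI++, to further boost the pixel-level representations. Specifically, we first set up a dynamically updated memory module to store the dataset-level distribution information of the various categories, and then leverage this information to yield dataset-level category representations during the network forward pass. After that, we generate a class probability distribution for each pixel representation and conduct dataset-level context aggregation with the class probability distribution as weights. Finally, the original pixel representations are augmented with the aggregated dataset-level contextual information and the conventional image-level contextual information. Moreover, in the inference phase, we design a coarse-to-fine iterative inference strategy to further improve the segmentation results. MCIBI++ can be easily incorporated into existing segmentation frameworks and brings consistent performance improvements. MCIBI++ can also be extended to video semantic segmentation frameworks, with considerable improvements over the baseline. Equipped with MCIBI++, we achieve state-of-the-art performance on seven challenging image or video semantic segmentation benchmarks.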
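The probability-weighted aggregation step in this abstract can be sketched as below. This is a minimal illustration, not the MCIBI++ implementation: the function name and the contents of the memory are hypothetical, and the dynamic memory update and the fusion with image-level context are left out.

```python
def aggregate_dataset_context(class_probs, class_reprs):
    """Weight dataset-level class representations by per-pixel class probabilities.

    class_probs: one probability distribution over K classes per pixel.
    class_reprs: K dataset-level category representations (hypothetical
                 contents of the memory module).
    """
    dim = len(class_reprs[0])
    return [[sum(p * rep[d] for p, rep in zip(probs, class_reprs))
             for d in range(dim)]
            for probs in class_probs]


memory = [[1.0, 0.0], [0.0, 1.0]]    # two classes, 2-D representations
probs = [[1.0, 0.0], [0.25, 0.75]]   # two pixels
context = aggregate_dataset_context(probs, memory)
print(context)
```

A pixel that is confidently assigned to one class receives that class's representation nearly unchanged, while an ambiguous pixel receives a blend.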
Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet has the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11× less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention reduces FLOPs by about 85% relative to the non-local block. 3) State-of-the-art performance. We conduct extensive experiments on semantic segmentation benchmarks including Cityscapes and ADE20K, the human parsing benchmark LIP, the instance segmentation benchmark COCO, and the video segmentation benchmark CamVid. In particular, our CCNet achieves mIoU scores of 81.9%, 45.76% and 55.47% on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are new state-of-the-art results. The source codes are available at https://github.com/speedinghzl/CCNet.
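Why a second (recurrent) pass is enough to reach the full image can be checked with a small reachability sketch. It tracks only which positions can exchange information, not the attention weights themselves, and the function names are hypothetical.

```python
def criss_cross_path(pos, height, width):
    """All positions in the same row or the same column as pos (pos included)."""
    r, c = pos
    return {(r, j) for j in range(width)} | {(i, c) for i in range(height)}


def reachable(pos, height, width, loops):
    """Positions whose information can reach pos after `loops` attention passes."""
    seen = {pos}
    for _ in range(loops):
        seen = set().union(*(criss_cross_path(p, height, width) for p in seen))
    return seen


H, W = 4, 5
one = reachable((0, 0), H, W, loops=1)
two = reachable((0, 0), H, W, loops=2)
print(len(one), len(two))  # 8 20
```

A single pass touches H + W - 1 positions per pixel instead of H * W, which is where the memory and FLOPs savings over a non-local block come from; the second pass then covers every position.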
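Glass-like objects such as windows, bottles and mirrors are widespread in the real world. Sensing these objects has many applications, including robot navigation and grasping. However, the task is very challenging because of the arbitrary scenes behind glass-like objects. This paper aims to solve the glass-like object segmentation problem through enhanced boundary learning. In particular, we first propose a novel refined differential module that outputs finer boundary cues. We then introduce an edge-aware point-based graph convolution network module to model the global shape along the boundary. We use these two modules to design a decoder that produces accurate and clean segmentation results, especially on object contours. Both modules are lightweight and effective, and they can be embedded into various segmentation models. In extensive experiments on three recent glass-like object segmentation datasets, Trans10K, MSD and GDD, our method establishes new state-of-the-art results. We also illustrate the strong generalization properties of our method on three generic segmentation datasets: Cityscapes, BDD and COCO Stuff. Code and models are available at https://github.com/hehao13/ebrnet.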
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
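The claim that atrous convolution enlarges the field of view without adding parameters can be illustrated in one dimension. This is a sketch, not DeepLab code: the function name is hypothetical, and the operation is written as cross-correlation without padding, as is common in deep learning frameworks.

```python
def atrous_conv1d(signal, kernel, rate):
    """1-D 'valid' cross-correlation whose kernel taps are spaced `rate` apart."""
    span = (len(kernel) - 1) * rate + 1  # effective field of view
    return [sum(k * signal[i + j * rate] for j, k in enumerate(kernel))
            for i in range(len(signal) - span + 1)]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]                             # the same three weights throughout
dense = atrous_conv1d(signal, kernel, rate=1)   # field of view: 3 samples
atrous = atrous_conv1d(signal, kernel, rate=2)  # field of view: 5 samples
print(dense)   # [-2, -2, -2, -2, -2]
print(atrous)  # [-4, -4, -4]
```

With rate 2 each output compares samples four positions apart, using the same three kernel weights as the dense case.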
Australian Centre for Robotic Vision {guosheng.lin;anton.milan;chunhua.shen;
We notice that information flow in convolutional neural networks is restricted inside local neighborhood regions due to the physical design of convolutional filters, which limits the overall understanding of complex scenes. In this paper, we propose the point-wise spatial attention network (PSANet) to relax the local neighborhood constraint. Each position on the feature map is connected to all the other ones through a self-adaptively learned attention mask. Moreover, bi-directional information propagation is enabled for scene parsing: information at other positions can be collected to help the prediction of the current position and, vice versa, information at the current position can be distributed to assist the prediction of other ones. Our proposed approach achieves top performance on various competitive scene parsing datasets, including ADE20K, PASCAL VOC 2012 and Cityscapes, demonstrating its effectiveness and generality.
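Image segmentation in urban scenes has recently attracted much attention because of its success in autonomous driving systems. However, poor performance on the foreground targets of interest, such as traffic lights and poles, still limits its further practical application. In urban scenes, foreground targets are always concealed in the surrounding stuff because of the special camera position and 3D perspective projection. Worse still, the continuous expansion of the receptive field exacerbates the imbalance between foreground and background classes in high-level features. We call this feature camouflage. In this paper, we present a novel add-on module, named Feature Balance Network (FBNet), to eliminate feature camouflage in urban-scene segmentation. FBNet consists of two key components: Block-wise BCE (BwBCE) and the Dual Feature Modulator (DFM). BwBCE serves as an auxiliary loss that ensures uniform gradients for foreground classes and their surroundings during backpropagation. Meanwhile, DFM is intended to strengthen the deep representation of foreground classes in high-level features under the supervision of BwBCE. The two modules reinforce each other to ease feature camouflage effectively. Our proposed method achieves new state-of-the-art segmentation performance on two challenging urban-scene benchmarks, Cityscapes and BDD100K. Code will be released for reproduction.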
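Integrating high-level context information with low-level details is of central importance in semantic segmentation. To this end, most existing segmentation models apply bilinear up-sampling and convolutions to feature maps of different scales and then align them at the same resolution. However, bilinear up-sampling blurs the precise information learned in these feature maps, and convolutions incur extra computation costs. To address these issues, we propose the Implicit Feature Alignment function (IFA). Our method is inspired by the rapidly expanding topic of implicit neural representations, in which coordinate-based neural networks are used to designate fields of signals. In IFA, feature vectors are viewed as representing a 2D field of information. Given a query coordinate, nearby feature vectors with their relative coordinates are taken from the multi-level feature maps and then fed into an MLP to generate the corresponding output. As such, IFA implicitly aligns the feature maps at different levels and is capable of producing segmentation maps at arbitrary resolutions. We demonstrate the efficacy of IFA on multiple datasets, including Cityscapes, PASCAL Context and ADE20K. Our method can be combined with improvements to various architectures, and it achieves a state-of-the-art computation-accuracy trade-off on common benchmarks. Code will be made available at https://github.com/hzhupku/ifa.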
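The query step described in this abstract (look up a nearby feature vector at each level and pair it with a relative coordinate) can be sketched in one dimension. Everything here is an illustrative assumption rather than the IFA implementation: the function names, the 1-D grid over [0, 1), the nearest-cell lookup, and the toy feature values. The MLP that would consume the assembled vector is omitted.

```python
def nearest_feature_with_offset(level, x):
    """Return the feature of the cell containing x plus x's offset from its center.

    level: list of feature vectors on a uniform 1-D grid over [0, 1).
    x:     query coordinate in [0, 1).
    """
    n = len(level)
    i = min(int(x * n), n - 1)      # index of the cell that contains x
    center = (i + 0.5) / n          # coordinate of that cell's center
    return level[i] + [x - center]  # feature vector followed by relative coordinate


def build_mlp_input(levels, x):
    """Concatenate the per-level lookups into a single input vector."""
    vector = []
    for level in levels:
        vector += nearest_feature_with_offset(level, x)
    return vector


coarse = [[1.0], [2.0]]                  # 2 cells, 1-D features
fine = [[10.0], [20.0], [30.0], [40.0]]  # 4 cells, 1-D features
mlp_input = build_mlp_input([coarse, fine], 0.3)
print(mlp_input)
```

Because the query coordinate is continuous, the same lookup works for any output resolution, and no feature map is explicitly up-sampled.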
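The major progress of convolutional neural networks in addressing pixel-level prediction tasks, such as semantic segmentation, depth estimation and surface normal prediction, stems from their powerful capabilities in visual representation learning. Typically, state-of-the-art models integrate attention mechanisms for improved deep feature representations. Recently, some works have demonstrated the importance of learning and combining both spatial-wise and channel-wise attention for deep feature refinement. In this paper, we aim at effectively boosting previous approaches and propose a unified deep framework that jointly learns spatial attention maps and channel attention vectors in a principled manner, so as to structure the resulting attention tensors and model the interactions between these two types of attention. Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework, leading to Variational Structured Attention Networks (VISTA-Net). We implement the inference rules within the neural network, thus allowing end-to-end learning of the probabilistic and the CNN front-end parameters. As demonstrated by our extensive empirical evaluation on six large-scale datasets for dense visual prediction, VISTA-Net outperforms the state of the art in multiple continuous and discrete prediction tasks, confirming the benefit of the proposed approach to joint structured spatial-channel attention estimation for deep representation learning. The code is available at https://github.com/ygjwd12345/vista-ner.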
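Glass is very common in our daily life. Existing computer vision systems neglect it, which may have severe consequences; for example, a robot may crash into a glass wall. However, sensing the presence of glass is not straightforward. The key challenge is that arbitrary objects and scenes can appear behind the glass. In this paper, we pose the important problem of detecting glass surfaces from a single RGB image. To address this problem, we construct the first large-scale glass detection dataset (GDD) and propose a novel glass detection network, called GDNet-B, which explores abundant contextual cues in a large field of view via a novel large-field contextual feature integration (LCFI) module and integrates both high-level and low-level boundary features with a boundary feature enhancement (BFE) module. Extensive experiments demonstrate that GDNet-B achieves satisfying glass detection results on images within and beyond the GDD test set. We further validate the effectiveness and generalization capability of GDNet-B by applying it to other vision tasks, including mirror segmentation and salient object detection. Finally, we show potential applications of glass detection and discuss possible future research directions.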
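Most existing semantic segmentation approaches that use image-level class labels as supervision rely heavily on the initial class activation map (CAM) generated from a standard classification network. In this paper, a novel "Progressive Patch Learning" approach is proposed to improve the extraction of local details by the classifier, producing CAMs that better cover the whole object rather than only the most discriminative regions, as in CAMs obtained from conventional classification models. "Patch Learning" destructs the feature maps into patches and processes each local patch independently and in parallel before the final aggregation. Such a mechanism forces the network to find weak information in the scattered discriminative local parts, which enhances its sensitivity to local details. "Progressive Patch Learning" further extends the feature destruction and patch learning to multiple levels of granularity in a progressive manner. In cooperation with a multi-stage optimization strategy, this mechanism implicitly provides the model with feature extraction ability across different locality granularities. As an alternative to the implicit multi-granularity progressive fusion approach, we also propose an explicit method that simultaneously fuses features of different granularities in a single model, further enhancing the quality of the CAM in terms of full object coverage. Our proposed method achieves outstanding performance on the PASCAL VOC 2012 dataset (e.g., 69.6% mIoU on the test set), surpassing most existing weakly supervised semantic segmentation methods. Code will be made publicly available at https://github.com/tyroneli/ppl_wsss.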
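The "destruct into patches, process each patch independently, then aggregate" pattern can be sketched on a toy single-channel map. This is only an illustration under stated assumptions: the function name is hypothetical, the per-patch maximum is a stand-in for whatever learned processing the method applies to each patch, and the mean is a stand-in for its aggregation step.

```python
def split_into_patches(fmap, n):
    """Split an H x W map (list of rows) into an n x n grid of patches.

    Assumes H and W are divisible by n.
    """
    height, width = len(fmap), len(fmap[0])
    ph, pw = height // n, width // n
    return [[row[c * pw:(c + 1) * pw] for row in fmap[r * ph:(r + 1) * ph]]
            for r in range(n) for c in range(n)]


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]

patches = split_into_patches(fmap, 2)
# Stand-in per-patch operation: each patch is scored without seeing the others.
patch_scores = [max(max(row) for row in patch) for patch in patches]
aggregated = sum(patch_scores) / len(patch_scores)
print(patch_scores, aggregated)  # [6, 8, 14, 16] 11.0
```

Because each patch is scored in isolation, a strongly activated region in one patch cannot suppress weaker evidence in another. Repeating the split with different values of n would correspond to the multiple granularities mentioned in the abstract.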
Image segmentation is a key topic in image processing and computer vision with applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among many others. Various algorithms for image segmentation have been developed in the literature. Recently, due to the success of deep learning models in a wide range of vision applications, there has been a substantial amount of works aimed at developing image segmentation approaches using deep learning models. In this survey, we provide a comprehensive review of the literature at the time of this writing, covering a broad spectrum of pioneering works for semantic and instance-level segmentation, including fully convolutional pixel-labeling networks, encoder-decoder architectures, multi-scale and pyramid based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the similarity, strengths and challenges of these deep learning models, examine the most widely used datasets, report performances, and discuss promising future research directions in this area.
Deep Convolutional Neural Networks (DCNNs) have recently shown state of the art performance in high level vision tasks, such as image classification and object detection. This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification (also called "semantic image segmentation"). We show that responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs good for high level tasks. We overcome this poor localization property of deep networks by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF). Qualitatively, our "DeepLab" system is able to localize segment boundaries at a level of accuracy which is beyond previous methods. Quantitatively, our method sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 71.6% IOU accuracy in the test set. We show how these results can be obtained efficiently: Careful network re-purposing and a novel application of the 'hole' algorithm from the wavelet community allow dense computation of neural net responses at 8 frames per second on a modern GPU.