语义分割是计算机视觉中的关键任务之一,它是为图像中的每个像素分配类别标签。尽管最近取得了重大进展,但大多数现有方法仍然遇到两个具有挑战性的问题:1)图像中的物体和东西的大小可能非常多样化,要求将多规模特征纳入完全卷积网络(FCN); 2)由于卷积网络的固有弱点,很难分类靠近物体/物体的边界的像素。为了解决第一个问题,我们提出了一个新的多受感受性现场模块(MRFM),明确考虑了多尺度功能。对于第二期,我们设计了一个边缘感知损失,可有效区分对象/物体的边界。通过这两种设计,我们的多种接收场网络在两个广泛使用的语义分割基准数据集上实现了新的最先进的结果。具体来说,我们在CityScapes数据集上实现了83.0的平均值,在Pascal VOC2012数据集中达到了88.4的平均值。
translated by 谷歌翻译
Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-regionbased context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixellevel prediction. The proposed approach achieves state-ofthe-art performance on various datasets. It came first in Im-ageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.
translated by 谷歌翻译
Semantic segmentation requires both rich spatial information and sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve the spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain sufficient receptive field. On top of the two paths, we introduce a new Feature Fusion Module to combine features efficiently. The proposed architecture makes a right balance between the speed and segmentation performance on Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048×1024 input, we achieve 68.4% Mean IOU on the Cityscapes test dataset with speed of 105 FPS on one NVIDIA Titan XP card, which is significantly faster than the existing methods with comparable performance.
translated by 谷歌翻译
语义分割是自主车辆了解周围场景的关键技术。当代模型的吸引力表现通常以牺牲重计算和冗长的推理时间为代价,这对于自行车来说是无法忍受的。在低分辨率图像上使用轻量级架构(编码器 - 解码器或双路)或推理,最近的方法实现了非常快的场景解析,即使在单个1080TI GPU上以100多件FPS运行。然而,这些实时方法与基于扩张骨架的模型之间的性能仍有显着差距。为了解决这个问题,我们提出了一家专门为实时语义细分设计的高效底座。所提出的深层双分辨率网络(DDRNET)由两个深部分支组成,之间进行多个双边融合。此外,我们设计了一个名为Deep聚合金字塔池(DAPPM)的新上下文信息提取器,以基于低分辨率特征映射放大有效的接收字段和熔丝多尺度上下文。我们的方法在城市景观和Camvid数据集上的准确性和速度之间实现了新的最先进的权衡。特别是,在单一的2080Ti GPU上,DDRNET-23-Slim在Camvid测试组上的Citycapes试验组102 FPS上的102 FPS,74.7%Miou。通过广泛使用的测试增强,我们的方法优于最先进的模型,需要计算得多。 CODES和培训的型号在线提供。
translated by 谷歌翻译
Real-time semantic segmentation has played an important role in intelligent vehicle scenarios. Recently, numerous networks have incorporated information from multi-size receptive fields to facilitate feature extraction in real-time semantic segmentation tasks. However, these methods preferentially adopt massive receptive fields to elicit more contextual information, which may result in inefficient feature extraction. We believe that the elaborated receptive fields are crucial, considering the demand for efficient feature extraction in real-time tasks. Therefore, we propose an effective and efficient architecture termed Dilation-wise Residual segmentation (DWRSeg), which possesses different sets of receptive field sizes within different stages. The architecture involves (i) a Dilation-wise Residual (DWR) module for extracting features based on different scales of receptive fields in the high level of the network; (ii) a Simple Inverted Residual (SIR) module that uses an inverted bottleneck structure to extract features from the low stage; and (iii) a simple fully convolutional network (FCN)-like decoder for aggregating multiscale feature maps to generate the prediction. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of our method by achieving a state-of-the-art trade-off between accuracy and inference speed, in addition to being lighter weight. Without using pretraining or resorting to any training trick, we achieve 72.7% mIoU on the Cityscapes test set at a speed of 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, which is significantly faster than existing methods. The code and trained models are publicly available.
translated by 谷歌翻译
Semantic segmentation is a classic computer vision problem dedicated to labeling each pixel with its corresponding category. As a basic task for advanced tasks such as industrial quality inspection, remote sensing information extraction, medical diagnostic aid, and autonomous driving, semantic segmentation has been developed for a long time in combination with deep learning, and a lot of works have been accumulated. However, neither the classic FCN-based works nor the popular Transformer-based works have attained fine-grained localization of pixel labels, which remains the main challenge in this field. Recently, with the popularity of autonomous driving, the segmentation of road scenes has received increasing attention. Based on the cross-task consistency theory, we incorporate edge priors into semantic segmentation tasks to obtain better results. The main contribution is that we provide a model-agnostic method that improves the accuracy of semantic segmentation models with zero extra inference runtime overhead, verified on the datasets of road and non-road scenes. From our experimental results, our method can effectively improve semantic segmentation accuracy.
translated by 谷歌翻译
跨不同层的特征的聚合信息是密集预测模型的基本操作。尽管表现力有限,但功能级联占主导地位聚合运营的选择。在本文中,我们引入了细分特征聚合(AFA),以融合不同的网络层,具有更具表现力的非线性操作。 AFA利用空间和渠道注意,以计算层激活的加权平均值。灵感来自神经体积渲染,我们将AFA扩展到规模空间渲染(SSR),以执行多尺度预测的后期融合。 AFA适用于各种现有网络设计。我们的实验表明了对挑战性的语义细分基准,包括城市景观,BDD100K和Mapillary Vistas的一致而显着的改进,可忽略不计的计算和参数开销。特别是,AFA改善了深层聚集(DLA)模型在城市景观上的近6%Miou的性能。我们的实验分析表明,AFA学会逐步改进分割地图并改善边界细节,导致新的最先进结果对BSDS500和NYUDV2上的边界检测基准。在http://vis.xyz/pub/dla-afa上提供代码和视频资源。
translated by 谷歌翻译
在本文中,我们专注于探索有效的方法,以更快,准确和域的不可知性语义分割。受到相邻视频帧之间运动对齐的光流的启发,我们提出了一个流对齐模块(FAM),以了解相邻级别的特征映射之间的\ textit {语义流},并将高级特征广播到高分辨率特征有效地,有效地有效。 。此外,将我们的FAM与共同特征的金字塔结构集成在一起,甚至在轻量重量骨干网络(例如Resnet-18和DFNET)上也表现出优于其他实时方法的性能。然后,为了进一步加快推理过程,我们还提出了一个新型的封闭式双流对齐模块,以直接对齐高分辨率特征图和低分辨率特征图,在该图中我们将改进版本网络称为SFNET-LITE。广泛的实验是在几个具有挑战性的数据集上进行的,结果显示了SFNET和SFNET-LITE的有效性。特别是,建议的SFNET-LITE系列在使用RESNET-18主链和78.8 MIOU以120 fps运行的情况下,使用RTX-3090上的STDC主链在120 fps运行时,在60 fps运行时达到80.1 miou。此外,我们将四个具有挑战性的驾驶数据集(即CityScapes,Mapillary,IDD和BDD)统一到一个大数据集中,我们将其命名为Unified Drive细分(UDS)数据集。它包含不同的域和样式信息。我们基准了UDS上的几项代表性作品。 SFNET和SFNET-LITE仍然可以在UDS上取得最佳的速度和准确性权衡,这在如此新的挑战性环境中是强大的基准。所有代码和模型均可在https://github.com/lxtgh/sfsegnets上公开获得。
translated by 谷歌翻译
现代的高性能语义分割方法采用沉重的主链和扩张的卷积来提取相关特征。尽管使用上下文和语义信息提取功能对于分割任务至关重要,但它为实时应用程序带来了内存足迹和高计算成本。本文提出了一种新模型,以实现实时道路场景语义细分的准确性/速度之间的权衡。具体来说,我们提出了一个名为“比例吸引的条带引导特征金字塔网络”(s \ textsuperscript {2} -fpn)的轻巧模型。我们的网络由三个主要模块组成:注意金字塔融合(APF)模块,比例吸引条带注意模块(SSAM)和全局特征Upsample(GFU)模块。 APF采用了注意力机制来学习判别性多尺度特征,并有助于缩小不同级别之间的语义差距。 APF使用量表感知的关注来用垂直剥离操作编码全局上下文,并建模长期依赖性,这有助于将像素与类似的语义标签相关联。此外,APF还采用频道重新加权块(CRB)来强调频道功能。最后,S \ TextSuperScript {2} -fpn的解码器然后采用GFU,该GFU用于融合APF和编码器的功能。已经对两个具有挑战性的语义分割基准进行了广泛的实验,这表明我们的方法通过不同的模型设置实现了更好的准确性/速度权衡。提出的模型已在CityScapes Dataset上实现了76.2 \%miou/87.3fps,77.4 \%miou/67fps和77.8 \%miou/30.5fps,以及69.6 \%miou,71.0 miou,71.0 \%miou,和74.2 \%\%\%\%\%\%。 miou在Camvid数据集上。这项工作的代码将在\ url {https://github.com/mohamedac29/s2-fpn提供。
translated by 谷歌翻译
Spatial pyramid pooling module or encode-decoder structure are used in deep neural networks for semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on PASCAL VOC 2012 and Cityscapes datasets, achieving the test set performance of 89.0% and 82.1% without any post-processing. Our paper is accompanied with a publicly available reference implementation of the proposed models in Tensorflow at https: //github.com/tensorflow/models/tree/master/research/deeplab.
translated by 谷歌翻译
One of recent trends [31,32,14] in network architecture design is stacking small filters (e.g., 1x1 or 3x3) in the entire network because the stacked small filters is more efficient than a large kernel, given the same computational complexity. However, in the field of semantic segmentation, where we need to perform dense per-pixel prediction, we find that the large kernel (and effective receptive field) plays an important role when we have to perform the classification and localization tasks simultaneously. Following our design principle, we propose a Global Convolutional Network to address both the classification and localization issues for the semantic segmentation. We also suggest a residual-based boundary refinement to further refine the object boundaries. Our approach achieves state-of-art performance on two public benchmarks and significantly outperforms previous results, 82.2% (vs 80.2%) on PASCAL VOC 2012 dataset and 76.9% (vs 71.8%) on Cityscapes dataset.
translated by 谷歌翻译
Australian Centre for Robotic Vision {guosheng.lin;anton.milan;chunhua.shen;
translated by 谷歌翻译
Semantic image segmentation is a basic street scene understanding task in autonomous driving, where each pixel in a high resolution image is categorized into a set of semantic labels. Unlike other scenarios, objects in autonomous driving scene exhibit very large scale changes, which poses great challenges for high-level feature representation in a sense that multi-scale information must be correctly encoded. To remedy this problem, atrous convolution [14] was introduced to generate features with larger receptive fields without sacrificing spatial resolution. Built upon atrous convolution, Atrous Spatial Pyramid Pooling (ASPP) [2] was proposed to concatenate multiple atrous-convolved features using different dilation rates into a final feature representation. Although ASPP is able to generate multi-scale features, we argue the feature resolution in the scale-axis is not dense enough for the autonomous driving scenario. To this end, we propose Densely connected Atrous Spatial Pyramid Pooling (DenseASPP), which connects a set of atrous convolutional layers in a dense way, such that it generates multi-scale features that not only cover a larger scale range, but also cover that scale range densely, without significantly increasing the model size. We evaluate DenseASPP on the street scene benchmark Cityscapes [4] and achieve state-of-the-art performance.
translated by 谷歌翻译
共同出现的视觉模式使上下文聚集成为语义分割的重要范式。现有的研究重点是建模图像中的上下文,同时忽略图像以下相应类别的有价值的语义。为此,我们提出了一个新颖的软采矿上下文信息,超出了名为McIbi ++的图像范式,以进一步提高像素级表示。具体来说,我们首先设置了动态更新的内存模块,以存储各种类别的数据集级别的分布信息,然后利用信息在网络转发过程中产生数据集级别类别表示。之后,我们为每个像素表示形式生成一个类概率分布,并以类概率分布作为权重进行数据集级上下文聚合。最后,使用汇总的数据集级别和传统的图像级上下文信息来增强原始像素表示。此外,在推论阶段,我们还设计了一种粗到最新的迭代推理策略,以进一步提高分割结果。 MCIBI ++可以轻松地纳入现有的分割框架中,并带来一致的性能改进。此外,MCIBI ++可以扩展到视频语义分割框架中,比基线进行了大量改进。配备MCIBI ++,我们在七个具有挑战性的图像或视频语义分段基准测试中实现了最先进的性能。
translated by 谷歌翻译
Recent work has made significant progress in improving spatial resolution for pixelwise labeling with Fully Convolutional Network (FCN) framework by employing Dilated/Atrous convolution, utilizing multi-scale features and refining boundaries. In this paper, we explore the impact of global contextual information in semantic segmentation by introducing the Context Encoding Module, which captures the semantic context of scenes and selectively highlights class-dependent featuremaps. The proposed Context Encoding Module significantly improves semantic segmentation results with only marginal extra computation cost over FCN. Our approach has achieved new state-of-theart results 51.7% mIoU on PASCAL-Context, 85.9% mIoU on PASCAL VOC 2012. Our single model achieves a final score of 0.5567 on ADE20K test set, which surpasses the winning entry of COCO-Place Challenge 2017. In addition, we also explore how the Context Encoding Module can improve the feature representation of relatively shallow networks for the image classification on CIFAR-10 dataset. Our 14 layer network has achieved an error rate of 3.45%, which is comparable with state-of-the-art approaches with over 10× more layers. The source code for the complete system are publicly available 1 .
translated by 谷歌翻译
In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the feature at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve feature representation which contributes to more precise segmentation results. We achieve new state-of-theart segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff dataset. In particular, a Mean IoU score of 81.5% on Cityscapes test set is achieved without using coarse data. 1 .
translated by 谷歌翻译
人们普遍认为,对于准确的语义细分,必须使用昂贵的操作(例如,非常卷积)结合使用昂贵的操作(例如非常卷积),从而导致缓慢的速度和大量的内存使用。在本文中,我们质疑这种信念,并证明既不需要高度的内部决议也不是必需的卷积。我们的直觉是,尽管分割是一个每像素的密集预测任务,但每个像素的语义通常都取决于附近的邻居和遥远的环境。因此,更强大的多尺度功能融合网络起着至关重要的作用。在此直觉之后,我们重新访问常规的多尺度特征空间(通常限制为P5),并将其扩展到更丰富的空间,最小的P9,其中最小的功能仅为输入大小的1/512,因此具有很大的功能接受场。为了处理如此丰富的功能空间,我们利用最近的BIFPN融合了多尺度功能。基于这些见解,我们开发了一个简化的分割模型,称为ESEG,该模型既没有内部分辨率高,也没有昂贵的严重卷积。也许令人惊讶的是,与多个数据集相比,我们的简单方法可以以比以前的艺术更快地实现更高的准确性。在实时设置中,ESEG-Lite-S在189 fps的CityScapes [12]上达到76.0%MIOU,表现优于更快的[9](73.1%MIOU时为170 fps)。我们的ESEG-LITE-L以79 fps的速度运行,达到80.1%MIOU,在很大程度上缩小了实时和高性能分割模型之间的差距。
translated by 谷歌翻译
在语义细分中,将高级上下文信息与低级详细信息集成至关重要。为此,大多数现有的分割模型都采用双线性启动采样和卷积来具有不同尺度的地图,然后以相同的分辨率对齐。但是,双线性启动采样模糊了这些特征地图和卷积中所学到的精确信息,这会产生额外的计算成本。为了解决这些问题,我们提出了隐式特征对齐函数(IFA)。我们的方法的灵感来自隐式神经表示的快速扩展的主题,在该主题中,基于坐标的神经网络用于指定信号字段。在IFA中,特征向量被视为表示2D信息字段。给定查询坐标,附近的具有相对坐标的特征向量是从多级特征图中获取的,然后馈入MLP以生成相应的输出。因此,IFA隐含地将特征图在不同级别对齐,并能够在任意分辨率中产生分割图。我们证明了IFA在多个数据集上的功效,包括CityScapes,Pascal环境和ADE20K。我们的方法可以与各种体系结构的改进结合使用,并在共同基准上实现最新的计算准确性权衡。代码将在https://github.com/hzhupku/ifa上提供。
translated by 谷歌翻译
Semantic segmentation works on the computer vision algorithm for assigning each pixel of an image into a class. The task of semantic segmentation should be performed with both accuracy and efficiency. Most of the existing deep FCNs yield to heavy computations and these networks are very power hungry, unsuitable for real-time applications on portable devices. This project analyzes current semantic segmentation models to explore the feasibility of applying these models for emergency response during catastrophic events. We compare the performance of real-time semantic segmentation models with non-real-time counterparts constrained by aerial images under oppositional settings. Furthermore, we train several models on the Flood-Net dataset, containing UAV images captured after Hurricane Harvey, and benchmark their execution on special classes such as flooded buildings vs. non-flooded buildings or flooded roads vs. non-flooded roads. In this project, we developed a real-time UNet based model and deployed that network on Jetson AGX Xavier module.
translated by 谷歌翻译
我们提出Segnext,这是一种简单的卷积网络体系结构,用于语义分割。由于自我注意力在编码空间信息中的效率,基于变压器的最新模型已主导语义分割领域。在本文中,我们表明卷积注意是一种比变形金刚中的自我注意机制更有效的编码上下文信息的方法。通过重新检查成功分割模型所拥有的特征,我们发现了几个关键组件,从而导致分割模型的性能提高。这促使我们设计了一个新型的卷积注意网络,该网络使用廉价的卷积操作。没有铃铛和哨子,我们的Segnext显着提高了先前最先进的方法对流行基准测试的性能,包括ADE20K,CityScapes,Coco-stuff,Pascal VOC,Pascal Context和ISAID。值得注意的是,segnext优于w/ nas-fpn的效率超过lavenet-l2,在帕斯卡VOC 2012测试排行榜上仅使用1/10参数,在Pascal VOC 2012测试排行榜上达到90.6%。平均而言,与具有相同或更少计算的ADE20K数据集上的最新方法相比,Segnext的改进约为2.0%。代码可在https://github.com/uyzhang/jseg(jittor)和https://github.com/visual-cratch-network/segnext(pytorch)获得。
translated by 谷歌翻译