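Although state-of-the-art architectures for semantic segmentation, such as HRNet, demonstrate impressive accuracy, the complexity arising from their salient design choices hinders a range of model acceleration tools, and they further rely on operations that are inefficient on current hardware. This paper shows that a simple encoder-decoder architecture with a ResNet-like backbone and a small multi-scale head performs on par with or better than complex semantic segmentation architectures such as HRNet, FANet and DDRNets. Naively applying deep backbones designed for image classification to semantic segmentation leads to sub-par results, because these backbones have a much smaller effective receptive field. Implicit among the various design choices put forward in works such as HRNet, DDRNet and FANet are networks with a large effective receptive field. It is natural to ask whether a simple encoder-decoder architecture would compare favorably if it were built from backbones with a larger effective receptive field, though without inefficient operations such as dilated convolutions. We show that, with minor and inexpensive modifications to ResNets, very simple and competitive baselines can be created for semantic segmentation. We present a family of such simple architectures for desktop as well as mobile targets, which match or exceed the performance of complex models on the Cityscapes dataset. We hope that our work provides simple yet effective baselines for practitioners developing efficient semantic segmentation models.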
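To make the receptive-field argument concrete, the sketch below computes the theoretical receptive field of a stack of convolution layers using the standard recurrence. It is only an illustration: the abstract refers to the effective receptive field, which is measured empirically and is generally smaller than this theoretical bound, and the three layer stacks are hypothetical toy examples, not the architectures from the paper.

```python
def receptive_field(layers):
    """Theoretical receptive field of a stack of conv/pool layers.

    Each layer is a (kernel_size, stride, dilation) tuple.
    """
    rf, jump = 1, 1  # receptive field and cumulative stride, in input pixels
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical toy stacks (assumptions for illustration only).
plain = [(3, 2, 1), (3, 2, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 2, 1), (3, 2, 1), (3, 1, 2), (3, 1, 4)]
strided = [(3, 2, 1), (3, 2, 1), (3, 2, 1), (3, 1, 1)]

print(receptive_field(plain))    # 23
print(receptive_field(dilated))  # 55
print(receptive_field(strided))  # 31
```

Dilation enlarges the receptive field without adding layers, while extra downsampling achieves a similar effect with ordinary convolutions at the cost of resolution.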
translated by 谷歌翻译
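It is commonly believed that accurate semantic segmentation requires high internal resolution combined with expensive operations (e.g., atrous convolutions), resulting in slow speed and large memory usage. In this paper, we question this belief and demonstrate that neither high internal resolution nor atrous convolutions are necessary. Our intuition is that, although segmentation is a dense per-pixel prediction task, the semantics of each pixel often depend on both nearby neighbors and far-away context; a more powerful multi-scale feature fusion network therefore plays a critical role. Following this intuition, we revisit the conventional multi-scale feature space (typically capped at P5) and extend it to a much richer space, up to P9, where the smallest features are only 1/512 of the input size and thus have very large receptive fields. To process such a rich feature space, we leverage the recent BiFPN to fuse the multi-scale features. Based on these insights, we develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions. Perhaps surprisingly, our simple method achieves better accuracy at faster speed than prior art across multiple datasets. In real-time settings, ESeg-Lite-S achieves 76.0% mIoU on CityScapes [12] at 189 FPS, outperforming FasterSeg [9] (73.1% mIoU at 170 FPS). Our ESeg-Lite-L runs at 79 FPS and achieves 80.1% mIoU, largely closing the gap between real-time and high-performance segmentation models.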
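The "1/512 of the input size" figure follows from level k of a feature pyramid having stride 2^k. A minimal sketch of that arithmetic is shown below; the 1024 x 2048 input size and the starting level P2 are assumptions chosen for illustration, and the rounding-up convention is one common choice rather than something stated in the abstract.

```python
def pyramid_shapes(height, width, min_level=2, max_level=9):
    """Return {level: (h, w)}, where level k has stride 2**k (ceil division)."""
    shapes = {}
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        shapes[level] = (-(-height // stride), -(-width // stride))
    return shapes


# Assumed input resolution of 1024 x 2048 pixels.
for level, (h, w) in pyramid_shapes(1024, 2048).items():
    print(f"P{level}: stride {2 ** level:>3}, feature map {h} x {w}")
```

At P9 the feature map for this input is only 2 x 4, so each location summarizes a very large image region.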
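Semantic segmentation is a key technology for autonomous vehicles to understand the surrounding scene. The appealing performance of contemporary models usually comes at the expense of heavy computation and lengthy inference time, which is intolerable for self-driving. Using lightweight architectures (encoder-decoder or two-pathway) or inference on low-resolution images, recent methods achieve very fast scene parsing, even running at more than 100 FPS on a single 1080Ti GPU. However, there is still a significant performance gap between these real-time methods and models based on dilation backbones. To address this problem, we propose a family of efficient backbones specially designed for real-time semantic segmentation. The proposed deep dual-resolution networks (DDRNets) consist of two deep branches between which multiple bilateral fusions are performed. In addition, we design a new contextual information extractor named Deep Aggregation Pyramid Pooling Module (DAPPM) to enlarge effective receptive fields and fuse multi-scale context based on low-resolution feature maps. Our method achieves a new state-of-the-art trade-off between accuracy and speed on both the Cityscapes and CamVid datasets. In particular, on a single 2080Ti GPU, DDRNet-23-slim runs at 102 FPS on the Cityscapes test set and reaches 74.7% mIoU on the CamVid test set. With widely used test augmentation, our method outperforms most state-of-the-art models while requiring much less computation. Code and trained models are available online.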
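Semantic segmentation is the problem of assigning a class label to every pixel in an image, and it is an important component of an autonomous vehicle's vision stack, facilitating scene understanding and object detection. However, many of the top-performing semantic segmentation models are extremely complex and cumbersome, and are therefore unsuited to deployment onboard autonomous vehicle platforms, where computational resources are limited and low-latency operation is required. In this survey, we take a thorough look at works that aim to address this misalignment with more compact and efficient models that can be deployed on low-memory embedded systems while meeting the constraint of real-time inference. We discuss the most prominent works in the field, place them within a taxonomy based on their major contributions, and finally evaluate the inference speed of the discussed models under consistent hardware and software setups that represent both a typical research environment with a high-end GPU and a realistic deployment scenario using low-memory embedded GPU hardware. Our experimental results demonstrate that many works are capable of real-time performance on resource-constrained hardware, while illustrating the consistent trade-off between latency and accuracy.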
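Since segmentation assigns one class label per pixel, the accuracy figures quoted throughout these abstracts are mean intersection-over-union (mIoU) scores. The sketch below shows the basic computation on flattened label arrays. It is a simplified illustration under stated assumptions: benchmark protocols typically accumulate counts over the whole evaluation set and may ignore unlabeled pixels, neither of which is handled here.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target.

    pred and target are flat sequences of integer class ids of equal length.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            intersection[p] += 1
            union[p] += 1
        else:
            # A mismatch adds one pixel to the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(round(mean_iou(pred, target, num_classes=3), 3))  # 0.5
```

Here the per-class IoUs are 1/3, 2/3 and 1/2, which average to 0.5.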
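We present a next-generation neural network architecture, MOSAIC, for efficient and accurate semantic image segmentation on mobile devices. MOSAIC is designed using neural operations that are commonly supported by diverse mobile hardware platforms, allowing flexible deployment across them. With a simple asymmetric encoder-decoder structure, consisting of an efficient multi-scale context encoder and a lightweight hybrid decoder that recovers spatial details from aggregated information, MOSAIC achieves new state-of-the-art performance while balancing accuracy and computational cost. Deployed on top of a tailored feature extraction backbone based on a searched classification network, MOSAIC achieves a 5% absolute accuracy gain over the current industry-standard MLPerf models and state-of-the-art architectures.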
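In this paper, we focus on exploring effective methods for faster, accurate, and domain-agnostic semantic segmentation. Inspired by optical flow for motion alignment between adjacent video frames, we propose a Flow Alignment Module (FAM) to learn semantic flow between feature maps of adjacent levels and to broadcast high-level features to high-resolution features effectively and efficiently. Furthermore, integrating our FAM into a common feature pyramid structure yields superior performance over other real-time methods, even on lightweight backbone networks such as ResNet-18 and DFNet. To further speed up inference, we also present a novel Gated Dual Flow Alignment Module that directly aligns high-resolution and low-resolution feature maps; we term the improved network SFNet-Lite. Extensive experiments are conducted on several challenging datasets, and the results show the effectiveness of both SFNet and SFNet-Lite. In particular, the proposed SFNet-Lite series achieves 80.1 mIoU while running at 60 FPS using a ResNet-18 backbone, and 78.8 mIoU while running at 120 FPS using an STDC backbone, on an RTX-3090. Moreover, we unify four challenging driving datasets (Cityscapes, Mapillary, IDD and BDD) into one large dataset, which we name the Unified Driving Segmentation (UDS) dataset. It contains diverse domain and style information. We benchmark several representative works on UDS. Both SFNet and SFNet-Lite still achieve the best speed and accuracy trade-off on UDS, serving as strong baselines in this new, challenging setting. All code and models are publicly available at https://github.com/lxtgh/sfsegnets.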
We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 20% compared to MobileNetV2. MobileNetV3-Small is 6.6% more accurate compared to a MobileNetV2 model with comparable latency. MobileNetV3-Large detection is over 25% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 34% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.
Real-time semantic segmentation has played an important role in intelligent vehicle scenarios. Recently, numerous networks have incorporated information from multi-size receptive fields to facilitate feature extraction in real-time semantic segmentation tasks. However, these methods preferentially adopt massive receptive fields to elicit more contextual information, which may result in inefficient feature extraction. We believe that the elaborated receptive fields are crucial, considering the demand for efficient feature extraction in real-time tasks. Therefore, we propose an effective and efficient architecture termed Dilation-wise Residual segmentation (DWRSeg), which possesses different sets of receptive field sizes within different stages. The architecture involves (i) a Dilation-wise Residual (DWR) module for extracting features based on different scales of receptive fields in the high level of the network; (ii) a Simple Inverted Residual (SIR) module that uses an inverted bottleneck structure to extract features from the low stage; and (iii) a simple fully convolutional network (FCN)-like decoder for aggregating multiscale feature maps to generate the prediction. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of our method by achieving a state-of-the-art trade-off between accuracy and inference speed, in addition to being lighter weight. Without using pretraining or resorting to any training trick, we achieve 72.7% mIoU on the Cityscapes test set at a speed of 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, which is significantly faster than existing methods. The code and trained models are publicly available.
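Vision Transformers (ViTs) have attracted much attention for their superior performance on computer vision tasks. To address the limitations of their single-scale, low-resolution representations, prior work adapted ViTs to high-resolution dense prediction tasks with hierarchical architectures that generate pyramid features. However, multi-scale representation learning is still under-explored on ViTs, given their classification-like sequential topology. To give ViTs more capacity to learn semantically rich and spatially precise multi-scale representations, in this work we present an efficient integration of high-resolution multi-branch architectures with vision transformers, dubbed HRViT, pushing the Pareto front of dense prediction tasks to a new level. We explore heterogeneous branch design, reduce the redundancy in linear layers, and augment the model nonlinearity to balance model performance and hardware efficiency. The proposed HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes for semantic segmentation, surpassing the state-of-the-art MiT and CSWin with an average improvement of +1.78 mIoU, 28% fewer parameters and 21% fewer FLOPs, demonstrating the potential of HRViT as a strong vision backbone.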
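Modern high-performance semantic segmentation methods employ a heavy backbone and dilated convolutions to extract relevant features. Although extracting features with both contextual and semantic information is critical for segmentation, it brings a large memory footprint and high computational cost for real-time applications. This paper presents a new model that achieves a trade-off between accuracy and speed for real-time road-scene semantic segmentation. Specifically, we propose a lightweight model named Scale-aware Strip Attention Guided Feature Pyramid Network (S²-FPN). Our network consists of three main modules: the Attention Pyramid Fusion (APF) module, the Scale-aware Strip Attention Module (SSAM), and the Global Feature Upsample (GFU) module. APF adopts an attention mechanism to learn discriminative multi-scale features and helps close the semantic gap between different levels. APF uses scale-aware attention to encode global context with a vertical stripping operation and to model long-range dependencies, which helps relate pixels with similar semantic labels. In addition, APF employs a channel-wise reweighting block (CRB) to emphasize channel features. Finally, the decoder of S²-FPN adopts GFU, which fuses features from APF and the encoder. Extensive experiments on two challenging semantic segmentation benchmarks demonstrate that our approach achieves a better accuracy/speed trade-off under different model settings. The proposed models achieve 76.2% mIoU at 87.3 FPS, 77.4% mIoU at 67 FPS, and 77.8% mIoU at 30.5 FPS on the Cityscapes dataset, and 69.6% mIoU, 71.0% mIoU, and 74.2% mIoU on the CamVid dataset. The code for this work will be made available at https://github.com/mohamedac29/s2-fpn.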
Semantic segmentation is the computer vision task of assigning each pixel of an image to a class. It should be performed with both accuracy and efficiency. Most existing deep FCNs require heavy computation and are very power hungry, which makes them unsuitable for real-time applications on portable devices. This project analyzes current semantic segmentation models to explore the feasibility of applying them to emergency response during catastrophic events. We compare the performance of real-time semantic segmentation models with non-real-time counterparts constrained by aerial images under oppositional settings. Furthermore, we train several models on the Flood-Net dataset, containing UAV images captured after Hurricane Harvey, and benchmark their execution on special classes such as flooded buildings vs. non-flooded buildings or flooded roads vs. non-flooded roads. In this project, we developed a real-time UNet-based model and deployed that network on a Jetson AGX Xavier module.
Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.
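Encoding an image as a sequence of patches can be sketched in a few lines. The example below is a simplified illustration, not the SETR implementation: it uses a single-channel image stored as nested lists, and the patch size is a free parameter chosen here for the toy input. A real model would additionally embed each flattened patch before feeding it to the transformer, which is not shown.

```python
def image_to_patch_sequence(image, patch):
    """Split an H x W image (nested lists) into a row-major list of flattened patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            sequence.append([image[y][x]
                             for y in range(top, top + patch)
                             for x in range(left, left + patch)])
    return sequence


# Toy 4 x 8 single-channel "image" whose pixel value encodes its position.
image = [[y * 8 + x for x in range(8)] for y in range(4)]
tokens = image_to_patch_sequence(image, patch=2)
print(len(tokens), len(tokens[0]))  # 8 4
print(tokens[0])                    # [0, 1, 8, 9]
```

The sequence length is (H / patch) * (W / patch), so it grows quadratically as the patch size shrinks.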
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
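In the pursuit of ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. Because of their usefulness in several application areas, building resource-efficient general-purpose networks is of great interest. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture, EdgeNeXt. Specifically, in EdgeNeXt we introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and uses depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection and segmentation tasks reveal the merits of the proposed approach, which outperforms state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% and a 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K. The code and models are publicly available at https://t.ly/_vu9.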
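Multi-scale learning frameworks have been regarded as a capable class of models for boosting semantic segmentation. The problem is nevertheless not trivial, especially for real-world deployments, which often demand low inference latency. In this paper, we thoroughly analyze the design of convolutional blocks (the type of convolutions and the number of channels in them) and the ways of interaction across multiple scales, all from the standpoint of lightweight semantic segmentation. From these in-depth comparisons we derive three principles, and accordingly devise Lightweight and Progressively-Scalable Networks (LPS-Net), which expand network complexity in a greedy manner. Technically, LPS-Net first capitalizes on the principles to build a tiny network. Then, LPS-Net progressively scales the tiny network to larger ones by expanding a single dimension (the number of convolutional blocks, the number of channels, or the input resolution) at a time to reach the best speed/accuracy trade-off. Extensive experiments on three datasets consistently demonstrate the superiority of LPS-Net over several efficient semantic segmentation methods. More remarkably, our LPS-Net achieves 73.4% mIoU on the Cityscapes test set at 413.5 FPS on an NVIDIA GTX 1080Ti, a performance improvement of 1.5% and a 65% speed-up over the state-of-the-art STDC. Code is available at https://github.com/yihengzhang-cv/lps-net.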
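The progressive, one-dimension-at-a-time scaling described above can be sketched as a simple greedy loop. This is a toy illustration under loudly stated assumptions: the `latency` and `accuracy` functions are hypothetical proxies invented for the example (a real procedure would train and benchmark each candidate), the step sizes are arbitrary, and the selection rule of accuracy gain per added latency is an assumed criterion, not necessarily the one used by LPS-Net.

```python
def latency(cfg):
    """Hypothetical latency proxy: grows with depth, width**2 and resolution**2."""
    depth, width, res = cfg
    return depth * width ** 2 * res ** 2 / 1e9


def accuracy(cfg):
    """Hypothetical accuracy proxy with diminishing returns in every dimension."""
    depth, width, res = cfg
    return 100 * (1 - 1 / (1 + 0.05 * depth)) * (1 - 8 / width) * (1 - 64 / res)


def expand(cfg):
    """Candidate configurations that each grow exactly one dimension."""
    depth, width, res = cfg
    return [(depth + 1, width, res), (depth, width + 16, res), (depth, width, res + 128)]


def greedy_scale(cfg, budget):
    """Repeatedly take the single-dimension expansion with the best gain per added latency."""
    path = [cfg]
    while True:
        candidates = [c for c in expand(cfg) if latency(c) <= budget]
        if not candidates:
            return path
        cfg = max(candidates,
                  key=lambda c: (accuracy(c) - accuracy(cfg)) / (latency(c) - latency(cfg)))
        path.append(cfg)


# Start from a tiny (depth, width, resolution) configuration; values are arbitrary.
for step in greedy_scale((4, 32, 256), budget=5.0):
    print(step, f"latency={latency(step):.2f}", f"accuracy={accuracy(step):.1f}")
```

Because every expansion strictly increases the latency proxy, the loop terminates once no candidate fits within the budget.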
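We revisit existing excellent Transformers from the perspective of practical application. Most of them are not even as efficient as the basic ResNet series and deviate from realistic deployment scenarios. This may be because the current criteria for measuring computational efficiency, such as FLOPs or parameter counts, are one-sided, sub-optimal, and hardware-insensitive. This paper therefore directly treats the TensorRT latency on specific hardware as the efficiency metric, which provides more comprehensive feedback involving computational capacity, memory cost, and bandwidth. Based on a series of controlled experiments, this work derives four practical guidelines for TensorRT-oriented and deployment-friendly network design, e.g., early CNN and late Transformer at the stage level, and early Transformer and late CNN at the block level. Accordingly, a family of TensorRT-oriented Transformers is presented, abbreviated as TRT-ViT. Extensive experiments demonstrate that TRT-ViT significantly outperforms existing ConvNets and vision Transformers across diverse visual tasks, e.g., image classification, object detection and semantic segmentation. For example, at 82.7% ImageNet-1k top-1 accuracy, TRT-ViT is 2.7× faster than CSWin and 2.0× faster than Twins. On the MS-COCO object detection task, TRT-ViT achieves performance comparable to Twins while increasing inference speed by 2.8×.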
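Treating measured latency, rather than FLOPs, as the efficiency metric amounts to timing the workload on the target hardware. The sketch below is a generic wall-clock timing harness and is only an assumption-laden stand-in: it does not use TensorRT, `toy_workload` is a hypothetical placeholder for a model forward pass, and timing a GPU model would additionally require synchronizing the device before reading the clock.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=5, runs=20):
    """Median wall-clock latency of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def toy_workload():
    # Stand-in for a model forward pass; hypothetical, not a real network.
    return sum(i * i for i in range(50_000))


latency_ms = measure_latency_ms(toy_workload)
print(f"median latency: {latency_ms:.3f} ms, about {1000 / latency_ms:.1f} calls/s")
```

Warm-up runs and the median are used so that one-off start-up costs and outliers do not dominate the reported figure.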
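Accurate semantic segmentation models typically require significant computational resources, which inhibits their use in practical applications. Recent works rely on well-crafted lightweight models to achieve fast inference. However, these models cannot flexibly adapt to varying accuracy and efficiency requirements. In this paper, we propose a simple but effective slimmable semantic segmentation (SlimSeg) method, which can be executed at different capacities during inference depending on the desired accuracy-efficiency trade-off. More specifically, we employ parametrized channel slimming with stepwise downward knowledge distillation during training. Motivated by the observation that the differences between the segmentation results of the submodels lie mainly near semantic boundaries, we introduce an additional boundary-guided semantic segmentation loss to further improve the performance of each submodel. We show that our proposed SlimSeg, combined with various mainstream networks, produces flexible models that allow dynamic adjustment of computational cost and perform better than independently trained models. Extensive experiments on the semantic segmentation benchmarks Cityscapes and CamVid demonstrate the generalization ability of our framework.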
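Visual representation learning is the key to solving various vision problems. Relying on the seminal grid-structure priors, convolutional neural networks (CNNs) have been the de facto standard architecture of most deep vision models. For instance, classical semantic segmentation methods often adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on increasing the receptive field, through either dilated (i.e., atrous) convolutions or inserted attention modules. However, the FCN-based architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating visual representation learning generally as a sequence-to-sequence prediction task. Specifically, we deploy a pure Transformer to encode an image as a sequence of patches, without local convolution and resolution reduction. With the global context modeled in every layer of the Transformer, stronger visual representations can be learned to better tackle vision tasks. In particular, our segmentation model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, the first position on the test leaderboard on the day of submission) and Pascal Context (55.83% mIoU), and reaches competitive results on Cityscapes. Furthermore, we formulate a family of Hierarchical Local-Global (HLG) Transformers characterized by local attention within windows and global attention across windows in a hierarchical, pyramidal architecture. Extensive experiments show that our method achieves appealing performance on a variety of visual recognition tasks (e.g., image classification, object detection, instance segmentation and semantic segmentation).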
Semantic segmentation is a challenging task that addresses most of the perception needs of Intelligent Vehicles (IV) in a unified way. Deep Neural Networks excel at this task, as they can be trained end-to-end to accurately classify multiple object categories in an image at pixel level. However, a good trade-off between high quality and computational resources is not yet present in state-of-the-art semantic segmentation approaches, limiting their application in real vehicles. In this paper, we propose a deep architecture that is able to run in real-time while providing accurate semantic segmentation. The core of our architecture is a novel layer that uses residual connections and factorized convolutions in order to remain efficient while retaining remarkable accuracy. Our approach is able to run at over 83 FPS in a single Titan X, and 7 FPS in a Jetson TX1 (embedded GPU). A comprehensive set of experiments on the publicly available Cityscapes dataset demonstrates that our system achieves an accuracy that is similar to the state of the art, while being orders of magnitude faster to compute than other architectures that achieve top precision. The resulting trade-off makes our model an ideal approach for scene understanding in IV applications. The code is publicly available at: https://github.com/Eromera/erfnet
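The efficiency benefit of factorized convolutions can be seen from a simple weight count. The sketch below assumes that the factorization replaces one 3 x 3 convolution with a 3 x 1 convolution followed by a 1 x 3 convolution, with equal input and output channels and no bias; the channel width of 64 is a hypothetical value chosen for illustration.

```python
def conv_params(kernel_h, kernel_w, channels):
    """Weights in a conv layer with equal input/output channels (bias ignored)."""
    return kernel_h * kernel_w * channels * channels


channels = 64  # hypothetical layer width
full = conv_params(3, 3, channels)
factorized = conv_params(3, 1, channels) + conv_params(1, 3, channels)
print(full, factorized, f"{1 - factorized / full:.0%} fewer weights")
# 36864 24576 33% fewer weights
```

The ratio is 6/9 regardless of the channel count, so under these assumptions the factorized pair always uses one third fewer weights than the full 3 x 3 layer.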