Fig. 1. Masked images and corresponding inpainted results using our partial-convolution based network.
Abstract. Existing deep learning based image inpainting methods use a standard convolutional network over the corrupted image, using convolutional filter responses conditioned on both valid pixels as well as the substitute values in the masked holes (typically the mean value). This often leads to artifacts such as color discrepancy and blurriness. Post-processing is usually used to reduce such artifacts, but it is expensive and may fail. We propose the use of partial convolutions, where the convolution is masked and renormalized to be conditioned on only valid pixels. We further include a mechanism to automatically generate an updated mask for the next layer as part of the forward pass. Our model outperforms other methods for irregular masks. We show qualitative and quantitative comparisons with other methods to validate our approach.
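A minimal PyTorch sketch of the idea, assuming a single-channel mask broadcast across features (the published implementation differs in details such as per-channel masks):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Conv2d):
    """Sketch of a partial convolution: convolve only over valid (mask == 1)
    pixels, renormalize by the valid-pixel count under each window, and
    return an updated mask for the next layer."""

    def forward(self, x, mask):
        # Count valid pixels under each sliding window (no learned weights).
        with torch.no_grad():
            ones = torch.ones(1, 1, *self.kernel_size, device=x.device)
            valid = F.conv2d(mask, ones, stride=self.stride,
                             padding=self.padding, dilation=self.dilation)
        raw = F.conv2d(x * mask, self.weight, None, self.stride,
                       self.padding, self.dilation, self.groups)
        window = float(self.kernel_size[0] * self.kernel_size[1])
        out = raw * (window / valid.clamp(min=1.0))  # renormalize
        if self.bias is not None:
            out = out + self.bias.view(1, -1, 1, 1)
        new_mask = (valid > 0).float()  # any valid input pixel -> valid output
        return out * new_mask, new_mask

pconv = PartialConv2d(3, 64, kernel_size=3, padding=1)
image = torch.randn(1, 3, 256, 256)
mask = (torch.rand(1, 1, 256, 256) > 0.25).float()  # 1 = valid, 0 = hole
features, mask = pconv(image, mask)
```

Each layer returns both the renormalized features and the updated mask, so the holes shrink as depth increases.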
Recent advances in deep learning, especially deep convolutional neural networks (CNNs), have led to significant improvement over previous semantic segmentation systems. Here we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are of both theoretical and practical value. First, we design dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields (RF) of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a state-of-the-art result of 80.1% mIOU in the test set at the time of submission. We also have achieved state-of-the-art overall on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Our source code can be found at https://github.com/TuSimple/TuSimple-DUC .
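A sketch of the two operations in PyTorch (channel counts and rates are illustrative; the paper's exact configuration is not reproduced here):

```python
import torch
import torch.nn as nn

class DUC(nn.Module):
    """Dense upsampling convolution: predict r*r score maps per class and
    rearrange them with PixelShuffle, instead of bilinear upsampling."""
    def __init__(self, in_ch, num_classes, upscale=8):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, num_classes * upscale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

# HDC: consecutive dilated 3x3 convolutions with rates such as (1, 2, 5)
# whose sampled positions jointly leave no holes (the "gridding issue").
hdc = nn.Sequential(*[
    nn.Conv2d(512, 512, 3, padding=r, dilation=r) for r in (1, 2, 5)
])

duc = DUC(512, num_classes=19, upscale=8)
out = duc(hdc(torch.randn(1, 512, 64, 128)))  # -> (1, 19, 512, 1024)
```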
Semantic image segmentation is an essential component of modern autonomous driving systems, as an accurate understanding of the surrounding scene is crucial to navigation and action planning. Current state-of-the-art approaches in semantic image segmentation rely on pre-trained networks that were initially developed for classifying images as a whole. While these networks exhibit outstanding recognition performance (i.e., what is visible?), they lack localization accuracy (i.e., where precisely is something located?). Therefore, additional processing steps have to be performed in order to obtain pixel-accurate segmentation masks at the full image resolution. To alleviate this problem we propose a novel ResNet-like architecture that exhibits strong localization and recognition performance. We combine multi-scale context with pixel-level accuracy by using two processing streams within our network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals. Without additional processing steps and without pre-training, our approach achieves an intersection-over-union score of 71.8% on the Cityscapes dataset.
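A hedged sketch of how such a two-stream coupling might look in PyTorch (channel counts, pooling scale, and layer composition are assumptions, not the paper's exact unit):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamUnit(nn.Module):
    """Sketch of a full-resolution residual unit: a pooled stream carries
    recognition features; a full-resolution residual stream keeps boundary
    detail. The pooled stream reads from both and writes a residual back."""
    def __init__(self, pooled_ch, res_ch=32, scale=2):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(
            nn.Conv2d(pooled_ch + res_ch, pooled_ch, 3, padding=1),
            nn.BatchNorm2d(pooled_ch), nn.ReLU(inplace=True))
        self.to_res = nn.Conv2d(pooled_ch, res_ch, 1)

    def forward(self, pooled, residual):
        # Pool the residual stream down to the pooled stream's resolution.
        r = F.max_pool2d(residual, self.scale)
        pooled = self.body(torch.cat([pooled, r], dim=1))
        # Upsample and add a residual back to the full-resolution stream.
        residual = residual + F.interpolate(
            self.to_res(pooled), scale_factor=self.scale,
            mode="bilinear", align_corners=False)
        return pooled, residual
```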
In this work we present In-Place Activated Batch Normalization (InPlace-ABN) - a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and by recovering required information during the backward pass through the inversion of stored forward results, with only a minor increase (0.8-2%) in computation time. Also, we demonstrate how frequently used checkpointing approaches can be made computationally as efficient as InPlace-ABN. In our experiments on image classification, we demonstrate on-par results on ImageNet-1k with state-of-the-art approaches. On the memory-demanding task of semantic segmentation, we report results for COCO-Stuff, Cityscapes, and Mapillary Vistas, obtaining new state-of-the-art results on the latter without additional training data but in a single-scale and single-model scenario. Code can be found at http://github.com/mapillary/inplace_abn .
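InPlace-ABN itself requires a custom fused layer, but the memory-for-compute trade-off the abstract compares against can be reproduced with stock PyTorch checkpointing; a small illustration (the model shape is arbitrary):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Checkpointing drops intermediate activations inside each segment during
# the forward pass and recomputes them during backward, trading compute
# time for a much smaller training memory footprint.
model = nn.Sequential(*[
    nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                  nn.BatchNorm2d(64), nn.ReLU())
    for _ in range(16)
])
x = torch.randn(2, 64, 128, 128, requires_grad=True)
out = checkpoint_sequential(model, 4, x)  # 4 segments: only 4 sets of
out.mean().backward()                     # boundary activations are stored
```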
Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that activation units in convolutional layers are spatially correlated, so information can still flow through convolutional networks despite dropout. Thus a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. We found that applying DropBlock in skip connections in addition to the convolution layers increases accuracy. Also, gradually increasing the number of dropped units during training leads to higher accuracy and more robustness to hyperparameter choices. Extensive experiments show that DropBlock works better than dropout in regularizing convolutional networks. On ImageNet classification, the ResNet-50 architecture with DropBlock achieves 78.13% accuracy, which is more than 1.6% improvement over the baseline. On COCO detection, DropBlock improves the Average Precision of RetinaNet from 36.8% to 38.4%.
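A compact PyTorch sketch of the technique (the seed rate `gamma` is simplified from the paper's exact formula, and an odd `block_size` is assumed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropBlock2d(nn.Module):
    """Sketch of DropBlock: zero out contiguous block_size x block_size
    regions of the feature map rather than independent units."""
    def __init__(self, drop_prob=0.1, block_size=7):
        super().__init__()
        self.drop_prob, self.block_size = drop_prob, block_size

    def forward(self, x):
        if not self.training or self.drop_prob == 0.0:
            return x
        # Simplified seed rate; the paper also schedules drop_prob to
        # increase gradually over training.
        gamma = self.drop_prob / (self.block_size ** 2)
        seeds = (torch.rand_like(x) < gamma).float()
        # Expand each seed into a block via max pooling.
        mask = 1.0 - F.max_pool2d(seeds, self.block_size, stride=1,
                                  padding=self.block_size // 2)
        # Renormalize so the expected activation magnitude is unchanged.
        return x * mask * mask.numel() / mask.sum().clamp(min=1.0)
```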
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed `DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
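A sketch of an ASPP head with image-level features in PyTorch (channel counts and rates follow common DeepLabv3 settings but are assumptions here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of atrous spatial pyramid pooling with image-level features:
    parallel atrous branches at several rates plus a global-pooling branch,
    concatenated and fused by a 1x1 convolution."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
             for r in rates])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        g = F.interpolate(self.image_pool(x), size=(h, w),
                          mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))
```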
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or "atrous convolution", as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The combination of max-pooling and downsampling commonly deployed in DCNNs achieves invariance but takes a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state of the art on the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU on the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
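A small PyTorch illustration of the field-of-view property: a dilated 3x3 filter covers a (2r+1)x(2r+1) window with the same nine weights as a dense one:

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation rate r samples the input on a grid with
# gaps of r-1 pixels, so its receptive field grows to (2r+1)x(2r+1) while
# the parameter count stays at 3*3 weights per filter.
dense = nn.Conv2d(1, 1, kernel_size=3, padding=1, dilation=1, bias=False)
atrous = nn.Conv2d(1, 1, kernel_size=3, padding=4, dilation=4, bias=False)

x = torch.randn(1, 1, 65, 65)
print(dense(x).shape, atrous(x).shape)   # same spatial size: 65x65
print(sum(p.numel() for p in dense.parameters()),
      sum(p.numel() for p in atrous.parameters()))  # identical: 9 and 9
```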
Real-time semantic segmentation plays an important role in practical applications such as self-driving cars and robots. Most research on semantic segmentation focuses on accuracy with little consideration of efficiency, while existing studies that emphasize high-speed inference often fail to produce high-accuracy segmentation results. In this paper, we propose a novel convolutional network named Efficient Dense modules with Asymmetric convolution (EDANet), which employs an asymmetric convolution structure and incorporates dilated convolution and dense connectivity to attain high efficiency at low cost in terms of computation, inference time, and model size. Compared to FCN, EDANet is 11 times faster and has 196 times fewer parameters, while achieving a higher mean intersection-over-union (mIoU) score without any additional decoder structure, context module, post-processing scheme, or pretrained model. We evaluate EDANet on the Cityscapes and CamVid datasets to assess its performance and compare it with other state-of-the-art systems. Our network can parse 512x1024 inputs at 108 and 81 frames per second on a single GTX 1080Ti and Titan X, respectively.
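A hedged sketch of one such module in PyTorch (the growth rate, dilation, and activation placement are assumptions rather than the paper's exact block):

```python
import torch
import torch.nn as nn

class EDAModule(nn.Module):
    """Sketch of an EDA-style module: a 1x1 reduction, then 3x3 convolutions
    factorized into 3x1 + 1x3 pairs (the second pair dilated), with the
    module output concatenated onto its input (dense connectivity)."""
    def __init__(self, in_ch, growth=40, dilation=2):
        super().__init__()
        d = dilation
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, growth, 1),
            nn.Conv2d(growth, growth, (3, 1), padding=(1, 0)),
            nn.Conv2d(growth, growth, (1, 3), padding=(0, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(growth, growth, (3, 1), padding=(d, 0), dilation=(d, 1)),
            nn.Conv2d(growth, growth, (1, 3), padding=(0, d), dilation=(1, d)),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)  # dense connection
```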
Spatial pyramid pooling modules or encoder-decoder structures are used in deep neural networks for semantic segmentation tasks. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages of both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, achieving test set performances of 89.0% and 82.1% without any post-processing. Our paper is accompanied by a publicly available reference implementation of the proposed models in Tensorflow at https://github.com/tensorflow/models/tree/master/research/deeplab .
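A sketch of the atrous depthwise separable convolution in PyTorch; replacing a full 3x3 convolution this way cuts the parameter count from 9*Cin*Cout to roughly 9*Cin + Cin*Cout:

```python
import torch.nn as nn

def separable_conv(in_ch, out_ch, dilation=1):
    """Sketch of an atrous depthwise separable convolution: a per-channel
    3x3 (depthwise) convolution followed by a 1x1 (pointwise) convolution
    that mixes channels."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation,
                  groups=in_ch, bias=False),       # depthwise
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),   # pointwise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
```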
Semantic segmentation requires large amounts of pixel-wise annotations to learn accurate models. In this paper, we present a video prediction-based methodology that scales up training sets by synthesizing new training samples, in order to improve the accuracy of semantic segmentation networks. We exploit the ability of video prediction models to predict future frames in order to also predict future labels. A joint propagation strategy is also proposed to alleviate mis-alignments in the synthesized samples. We demonstrate that training segmentation models on datasets augmented with the synthesized samples leads to significant improvements in accuracy. Furthermore, we introduce a novel boundary label relaxation technique that makes training robust to annotation noise and propagation artifacts along object boundaries. Our proposed methods achieve state-of-the-art mIoUs of 83.5% on Cityscapes and 82.9% on CamVid. A single model, without model ensembles, achieves 72.8% mIoU on the KITTI semantic segmentation test set, surpassing the winning entry of the ROB challenge 2018. Our code and videos can be found at https://nv-adlr.github.io/publication/2018-Segmentation .
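A simplified PyTorch sketch of the boundary label relaxation loss (applied at every pixel for brevity; away from boundaries the neighborhood class union reduces to the single hard label, so the loss coincides with cross-entropy there):

```python
import torch
import torch.nn.functional as F

def label_relaxation_loss(logits, one_hot, border=3):
    """Sketch of boundary label relaxation: near object boundaries, maximize
    the summed likelihood of *all* classes present in a small neighborhood
    instead of a single hard label, tolerating boundary annotation noise.
    `one_hot` is a float (N, C, H, W) tensor; `border` is assumed odd."""
    # Union of classes appearing within the neighborhood of each pixel.
    union = F.max_pool2d(one_hot, border, stride=1, padding=border // 2)
    prob = F.softmax(logits, dim=1)
    # Negative log of the total probability mass on admissible classes.
    loss = -torch.log((prob * union).sum(dim=1).clamp(min=1e-8))
    return loss.mean()
```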
Semantic image segmentation is a basic street scene understanding task in autonomous driving, where each pixel in a high resolution image is categorized into a set of semantic labels. Unlike other scenarios, objects in autonomous driving scene exhibit very large scale changes, which poses great challenges for high-level feature representation in a sense that multi-scale information must be correctly encoded. To remedy this problem, atrous convolution[14] was introduced to generate features with larger receptive fields without sacrificing spatial resolution. Built upon atrous convolution, Atrous Spatial Pyramid Pooling (ASPP)[2] was proposed to concatenate multiple atrous-convolved features using different dilation rates into a final feature representation. Although ASPP is able to generate multi-scale features, we argue the feature resolution in the scale-axis is not dense enough for the autonomous driving scenario. To this end, we propose Densely connected Atrous Spatial Pyramid Pooling (DenseASPP), which connects a set of atrous convolutional layers in a dense way, such that it generates multi-scale features that not only cover a larger scale range, but also cover that scale range densely, without significantly increasing the model size. We evaluate DenseASPP on the street scene benchmark Cityscapes[4] and achieve state-of-the-art performance.
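A hedged PyTorch sketch of the dense connectivity pattern (channel widths and rates are illustrative):

```python
import torch
import torch.nn as nn

class DenseASPP(nn.Module):
    """Sketch of DenseASPP: atrous layers with increasing rates are stacked,
    and each layer sees the concatenation of the input and all previous
    layers' outputs, yielding a dense set of effective receptive-field
    sizes without a large increase in model size."""
    def __init__(self, in_ch=512, mid=128, rates=(3, 6, 12, 18, 24)):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for r in rates:
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, mid, 1), nn.ReLU(inplace=True),
                nn.Conv2d(mid, mid, 3, padding=r, dilation=r)))
            ch += mid  # dense connectivity grows the feature depth

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)
```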
This paper presents an efficient module named spatial bottleneck for accelerating the convolutional layers in deep neural networks. The core idea is to decompose convolution into two stages: first reducing the spatial resolution of the feature map, and then restoring it to the desired size. This operation decreases the sampling density in the spatial domain, which is independent of, yet complementary to, network acceleration approaches in the channel domain. With different sampling rates, we can trade off between recognition accuracy and model complexity. As a basic building block, the spatial bottleneck can be used to replace a single convolutional layer or the combination of two convolutional layers. We verify the effectiveness of the spatial bottleneck by applying it to deep residual networks. Spatial bottlenecks achieve 2x and 1.4x speedups on regular and channel-bottlenecked residual blocks, respectively, with accuracy retained in recognizing low-resolution images and even improved in recognizing high-resolution images.
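A minimal PyTorch sketch of the decompose-then-restore idea (layer choices are assumptions; even input sizes are assumed so the transposed convolution restores the exact shape):

```python
import torch.nn as nn

def spatial_bottleneck(in_ch, out_ch, stride=2):
    """Sketch of a spatial bottleneck: a strided convolution first lowers
    the spatial sampling density, and a transposed convolution restores the
    map to its original size; `stride` trades accuracy for computation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.ConvTranspose2d(out_ch, out_ch, 3, stride=stride,
                           padding=1, output_padding=stride - 1))
```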
Semantic segmentation requires both rich spatial information and a sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain a sufficient receptive field. On top of these two paths, we introduce a new Feature Fusion Module to combine the features efficiently. The proposed architecture strikes a proper balance between speed and segmentation performance on the Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048x1024 input, we achieve 68.4% mean IOU on the Cityscapes test dataset at a speed of 105 FPS on one NVIDIA Titan XP card, which is significantly faster than existing methods with comparable performance.
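A sketch of a feature fusion module of this kind in PyTorch, assuming the two paths' outputs were already resized to a common resolution:

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """Sketch of BiSeNet-style feature fusion: concatenate spatial-path and
    context-path features, then reweight channels with a squeeze-and-excite
    style attention vector computed from globally pooled features."""
    def __init__(self, sp_ch, cp_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(sp_ch + cp_ch, out_ch, 1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1), nn.Sigmoid())

    def forward(self, spatial, context):
        x = self.fuse(torch.cat([spatial, context], dim=1))
        return x + x * self.attn(x)  # fused features plus reweighted features
```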
One of recent trends [30, 31, 14] in network architecture design is stacking small filters (e.g., 1x1 or 3x3) in the entire network, because stacked small filters are more efficient than a large kernel, given the same computational complexity. However, in the field of semantic segmentation, where we need to perform dense per-pixel prediction, we find that the large kernel (and effective receptive field) plays an important role when we have to perform the classification and localization tasks simultaneously. Following our design principle, we propose a Global Convolutional Network to address both the classification and localization issues for semantic segmentation. We also suggest a residual-based boundary refinement to further refine the object boundaries. Our approach achieves state-of-the-art performance on two public benchmarks and significantly outperforms previous results, 82.2% (vs 80.2%) on the PASCAL VOC 2012 dataset and 76.9% (vs 71.8%) on the Cityscapes dataset.
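A PyTorch sketch of the large-kernel decomposition (k = 15 as an example; odd k assumed):

```python
import torch.nn as nn

class GlobalConvModule(nn.Module):
    """Sketch of the global convolutional network block: a k x k kernel is
    approximated by two parallel branches, (k x 1 then 1 x k) and
    (1 x k then k x 1), keeping a large effective receptive field with
    O(k) instead of O(k^2) parameters per channel pair."""
    def __init__(self, in_ch, out_ch, k=15):
        super().__init__()
        p = k // 2
        self.left = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (k, 1), padding=(p, 0)),
            nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, p)))
        self.right = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, p)),
            nn.Conv2d(out_ch, out_ch, (k, 1), padding=(p, 0)))

    def forward(self, x):
        return self.left(x) + self.right(x)
```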
Semantic segmentation is a challenging task that addresses most of the perception needs of Intelligent Vehicles (IV) in a unified way. Deep Neural Networks excel at this task, as they can be trained end-to-end to accurately classify multiple object categories in an image at pixel level. However, a good trade-off between high quality and computational resources is yet not present in state-of-the-art semantic segmentation approaches, limiting their application in real vehicles. In this paper, we propose a deep architecture that is able to run in real-time while providing accurate semantic segmentation. The core of our architecture is a novel layer that uses residual connections and factorized convolutions in order to remain efficient while retaining remarkable accuracy. Our approach is able to run at over 83 FPS in a single Titan X, and 7 FPS in a Jetson TX1 (embedded GPU). A comprehensive set of experiments on the publicly available Cityscapes dataset demonstrates that our system achieves an accuracy that is similar to the state of the art, while being orders of magnitude faster to compute than other architectures that achieve top precision. The resulting trade-off makes our model an ideal approach for scene understanding in IV applications. The code is publicly available at: https://github.com/Eromera/erfnet
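A hedged PyTorch sketch of a residual layer with factorized (1D) convolutions in this spirit (dropout and exact normalization placement omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedResidual(nn.Module):
    """Sketch of an ERFNet-style residual layer: each 3x3 convolution is
    factorized into a 3x1 and a 1x3 convolution, cutting parameters and
    compute while a residual connection preserves accuracy."""
    def __init__(self, ch, dilation=1):
        super().__init__()
        d = dilation
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, (3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, (1, 3), padding=(0, 1)),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, (3, 1), padding=(d, 0), dilation=(d, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, (1, 3), padding=(0, d), dilation=(1, d)),
            nn.BatchNorm2d(ch))

    def forward(self, x):
        return F.relu(x + self.body(x))
```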
Long-range dependencies can capture useful contextual information that benefits visual understanding problems. In this work, we propose a Criss-Cross Network (CCNet) for obtaining this important information in a more effective and efficient way. Concretely, for each pixel, our CCNet harvests the contextual information of the pixels on its criss-cross path through a novel criss-cross attention module. By a further recurrent operation, each pixel can finally capture long-range dependencies from all pixels. Overall, our CCNet has the following merits: 1) GPU memory friendly: compared with the non-local block, the recurrent criss-cross attention module requires 11x less GPU memory. 2) High computational efficiency: in computing long-range dependencies, the criss-cross attention reduces FLOPs by about 85% relative to the non-local block. 3) State-of-the-art performance. We conduct extensive experiments on popular semantic segmentation benchmarks, including Cityscapes and ADE20K, and the instance segmentation benchmark COCO. In particular, our CCNet achieves mIoU scores of 81.4 and 45.22 on the Cityscapes test set and the ADE20K validation set, respectively, which are the new state-of-the-art results. We make the code publicly available at https://github.com/speedinghzl/CCNet .
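A simplified single-pass criss-cross attention in PyTorch (unlike the paper, the pixel itself is counted in both the row and the column branch, and the recurrence is obtained simply by applying the module twice; `ch >= 8` is assumed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    """Sketch of criss-cross attention: each position attends only to
    positions in its own row and column (about H + W positions) instead of
    all H*W positions as in a non-local block."""
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 8, 1)
        self.k = nn.Conv2d(ch, ch // 8, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        n, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Energies along the row: query at (i, j) vs keys at (i, j').
        e_row = torch.einsum("ncij,ncik->nijk", q, k)      # (n, h, w, w)
        # Energies along the column: query at (i, j) vs keys at (i', j).
        e_col = torch.einsum("ncij,nckj->nijk", q, k)      # (n, h, w, h)
        attn = F.softmax(torch.cat([e_row, e_col], dim=-1), dim=-1)
        a_row, a_col = attn[..., :w], attn[..., w:]
        out = (torch.einsum("nijk,ncik->ncij", a_row, v) +
               torch.einsum("nijk,nckj->ncij", a_col, v))
        return self.gamma * out + x

cca = CrissCrossAttention(64)
y = cca(cca(torch.randn(1, 64, 32, 32)))  # two passes ~ full-image context
```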
In this paper, we propose the broadcasting convolutional network (BCN), which extracts key object features from the global field of an entire input image and recognizes their relationship with local features. BCN is a simple network module that collects effective spatial features, embeds location information, and broadcasts them to the entire feature map. We further introduce the multi-relational network (multiRN), which improves the existing relation network (RN) by utilizing the BCN module. In pixel-based relational reasoning problems, with the help of BCN, multiRN extends the concept of "pairwise relations" in conventional RNs to "multiwise relations" by relating each object with multiple objects at once. This yields O(n) complexity for n objects, a vast computational gain over RNs, which take O(n^2). Through experiments, multiRN achieves state-of-the-art performance on the CLEVR dataset, which demonstrates the usability of BCN for relational reasoning problems.
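A hedged sketch of a broadcasting module in this spirit (the coordinate embedding and global max-pooling are assumptions about the collection step, not the paper's exact design):

```python
import torch
import torch.nn as nn

class Broadcast(nn.Module):
    """Sketch of a BCN-style broadcasting module: append normalized (y, x)
    coordinates to embed location, squeeze the map into a global feature
    vector, and broadcast it back onto every spatial position."""
    def forward(self, feat):
        n, c, h, w = feat.shape
        ys = torch.linspace(-1, 1, h, device=feat.device)
        xs = torch.linspace(-1, 1, w, device=feat.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([yy, xx]).expand(n, -1, -1, -1)
        located = torch.cat([feat, coords], dim=1)       # embed position
        glob = located.amax(dim=(2, 3), keepdim=True)    # collect key features
        return torch.cat([feat, glob.expand(-1, -1, h, w)], dim=1)
```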
Much of the recent progress in image classification research can be credited to refinements of the training procedure, such as changes in data augmentation and optimization methods. In the literature, however, most refinements are either briefly mentioned as implementation details or only visible in source code. In this paper, we examine a collection of such refinements and empirically evaluate their impact on final model accuracy through ablation studies. We show that by combining these improvements we are able to significantly improve various CNN models. For example, we raise the top-1 validation accuracy of ResNet-50 on ImageNet from 75.3% to 79.29%. We also demonstrate that improvements in image classification accuracy lead to better transfer-learning performance in other application domains such as object detection and semantic segmentation.
We focus on the very challenging task of semantic segmentation for autonomous driving systems: it must deliver decent semantic segmentation results for traffic-critical objects in real time. In this paper, we propose a highly efficient yet powerful deep neural network for driving-scene semantic segmentation, termed the Driving Segmentation Network (DSNet). DSNet achieves a state-of-the-art balance between accuracy and inference speed through efficient units and an architecture design inspired by ShuffleNet V2 and ENet. More importantly, DSNet emphasizes the classes most critical to driving decision making through our novel Driving Importance-weighted Loss. We evaluate DSNet on the Cityscapes dataset: DSNet achieves 71.8% mean Intersection-over-Union (IoU) on the validation set and 69.3% on the test set. Class-wise IoU scores show that the Driving Importance-weighted Loss improves most driving-critical classes by a large margin. Compared with ENet, DSNet is 18.9% more accurate and 1.1 times faster, which implies great potential for autonomous driving applications.
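A minimal sketch of an importance-weighted segmentation loss in PyTorch; the specific classes and the emphasis factor below are assumptions for illustration, not the paper's published weighting:

```python
import torch
import torch.nn as nn

# Standard cross-entropy, but with per-class weights raised for
# driving-critical classes (Cityscapes train IDs assumed below).
num_classes = 19
weights = torch.ones(num_classes)
critical = [11, 12, 17, 18]   # person, rider, motorcycle, bicycle
weights[critical] = 3.0       # assumed emphasis factor
criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=255)

logits = torch.randn(2, num_classes, 64, 128)
target = torch.randint(0, num_classes, (2, 64, 128))
loss = criterion(logits, target)
```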
The most common approaches to instance segmentation are complex and use two-stage networks with object proposals, conditional random fields, template matching, or recurrent neural networks. In this work we present TernausNetV2 - a simple fully convolutional network that allows extracting objects from high-resolution satellite imagery at the instance level. The network has the popular encoder-decoder type of architecture with skip connections, but with a few essential modifications that allow it to be used for semantic as well as instance segmentation tasks. This approach is universal and allows any network that has been successfully applied to semantic segmentation to be extended to perform an instance segmentation task. In addition, we generalize a network encoder that was pre-trained on RGB images to use additional input channels, making it possible to use transfer learning from the visual to a wider spectral range. For the DeepGlobe-CVPR 2018 building detection sub-challenge, based on the public leaderboard score, our approach shows superior performance in comparison to other methods. The source code and corresponding pre-trained weights are publicly available at https://github.com/ternaus/TernausNetV2
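A sketch of widening a pretrained first convolution to accept extra spectral channels (torchvision's VGG16 is used purely for illustration; the paper's encoder is different):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

def widen_first_conv(conv: nn.Conv2d, extra: int) -> nn.Conv2d:
    """Sketch of reusing RGB-pretrained weights with extra input channels:
    copy the pretrained filters and initialize the added channels' weights
    to zero, so the extra bands start as a no-op and are learned later."""
    new = nn.Conv2d(conv.in_channels + extra, conv.out_channels,
                    conv.kernel_size, conv.stride, conv.padding,
                    bias=conv.bias is not None)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :conv.in_channels] = conv.weight
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new

encoder = vgg16(weights="IMAGENET1K_V1").features
encoder[0] = widen_first_conv(encoder[0], extra=8)  # e.g. 8 extra bands
out = encoder(torch.randn(1, 11, 224, 224))
```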