Fig. 1. Masked images and corresponding inpainted results using our partial-convolution based network.

Abstract. Existing deep learning based image inpainting methods use a standard convolutional network over the corrupted image, with convolutional filter responses conditioned on both valid pixels and the substitute values in the masked holes (typically the mean value). This often leads to artifacts such as color discrepancy and blurriness. Post-processing is usually used to reduce such artifacts, but it is expensive and may fail. We propose the use of partial convolutions, where the convolution is masked and renormalized to be conditioned on only valid pixels. We further include a mechanism to automatically generate an updated mask for the next layer as part of the forward pass. Our model outperforms other methods for irregular masks. We show qualitative and quantitative comparisons with other methods to validate our approach.
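A minimal PyTorch sketch of the partial-convolution idea described above (my own illustration, not the authors' released code): the convolution sees only valid pixels, the response is renormalized by the fraction of valid inputs in each window, and the mask is updated for the next layer.

```python
import torch
import torch.nn.functional as F

def partial_conv2d(x, mask, weight, bias=None, padding=1):
    """Sketch of one partial-convolution step.

    x:    (N, C_in, H, W) input features
    mask: (N, 1, H, W) binary mask, 1 = valid pixel, 0 = hole
    """
    # convolve only the valid pixels (holes are zeroed out first)
    out = F.conv2d(x * mask, weight, bias=None, padding=padding)

    # count valid inputs per sliding window and renormalize the responses
    ones = torch.ones(1, 1, *weight.shape[2:], device=x.device)
    valid = F.conv2d(mask, ones, padding=padding)                    # valid pixels per window
    scale = weight.shape[2] * weight.shape[3] / valid.clamp(min=1e-8)
    out = out * scale * (valid > 0).float()                          # zero response where no valid input exists

    if bias is not None:
        out = out + bias.view(1, -1, 1, 1)

    # mask update: a location becomes valid if its window saw at least one valid pixel
    new_mask = (valid > 0).float()
    return out, new_mask
```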
Recent advances in deep learning, especially deep convolutional neural networks (CNNs), have led to significant improvement over previous semantic segmentation systems. Here we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are of both theoretical and practical value. First, we design dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields (RF) of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a state-of-the-art result of 80.1% mIOU in the test set at the time of submission. We also have achieved state-of-the-art overall on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Our source code can be found at https://github.com/TuSimple/TuSimple-DUC .
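A rough PyTorch sketch of the two ideas as I read them (layer widths and dilation rates are illustrative, not the paper's exact configuration): DUC predicts d*d sub-pixel class maps at low resolution and rearranges them into a full-resolution prediction via pixel shuffle, while HDC cycles through different dilation rates so the effective receptive field has no gridding holes.

```python
import torch.nn as nn

class DUCHead(nn.Module):
    """Dense Upsampling Convolution: predict d*d sub-pixel maps per class, then rearrange."""
    def __init__(self, in_channels, num_classes, downsample_factor=8):
        super().__init__()
        d = downsample_factor
        self.conv = nn.Conv2d(in_channels, num_classes * d * d, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(d)   # (N, C*d*d, H, W) -> (N, C, H*d, W*d)

    def forward(self, x):
        return self.shuffle(self.conv(x))

def hdc_block(channels, rates=(1, 2, 5)):
    """Hybrid Dilated Convolution: stack convolutions with varying dilation rates to avoid gridding."""
    layers = []
    for r in rates:
        layers += [nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```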
Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps because activation units in convolutional layers are spatially correlated, so information can still flow through the network despite dropout. Thus a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout where units in a contiguous region of a feature map are dropped together. We found that applying DropBlock in skip connections in addition to the convolutional layers increases accuracy. Also, gradually increasing the number of dropped units during training leads to higher accuracy and more robustness to hyperparameter choices. Extensive experiments show that DropBlock works better than dropout in regularizing convolutional networks. On ImageNet classification, a ResNet-50 architecture with DropBlock achieves 78.13% accuracy, more than a 1.6% improvement over the baseline. On COCO detection, DropBlock improves the Average Precision of RetinaNet from 36.8% to 38.4%.
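A simplified DropBlock sketch (assuming a square block and ignoring the exact edge correction of the drop rate): sample block centers with a Bernoulli rate, expand each center into a block_size x block_size zero region via max pooling, and rescale the surviving activations.

```python
import torch
import torch.nn.functional as F

def drop_block(x, drop_prob=0.1, block_size=7, training=True):
    """Zero out contiguous block_size x block_size regions of the feature map."""
    if not training or drop_prob == 0.0:
        return x
    n, c, h, w = x.shape
    # approximate rate of block centers so the expected dropped fraction is ~drop_prob
    gamma = drop_prob / (block_size ** 2)
    centers = (torch.rand(n, c, h, w, device=x.device) < gamma).float()
    # grow each sampled center into a square block with max pooling
    block_mask = F.max_pool2d(centers, kernel_size=block_size, stride=1, padding=block_size // 2)
    keep_mask = 1.0 - block_mask.clamp(max=1.0)
    # rescale so the expected activation magnitude is preserved
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)
```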
Recent advances in deep image classification models provide great potential for improving state-of-the-art performance in related computer vision tasks. However, the strict memory limitations of current GPUs hamper the transition to semantic segmentation. The extent of feature-map caching required by convolutional backprop poses significant challenges even for moderately sized Pascal images, and careful architectural consideration is needed when the source resolution is in the megapixel range. To address these concerns, we propose a novel ladder-style DenseNet-based architecture that features high modelling power and a very lean upsampling datapath. We also propose to substantially reduce the extent of feature-map caching by exploiting the inherent spatial efficiency of the DenseNet feature extractor. The resulting models deliver high performance with fewer parameters than competing approaches, and allow training at megapixel resolution on commodity hardware. The presented experimental results outperform the state-of-the-art in prediction accuracy and execution speed on the Cityscapes, Pascal VOC 2012, CamVid and ROB 2018 datasets. The source code will be released upon publication.
This paper presents an efficient module, the spatial bottleneck, for accelerating the convolutional layers in deep neural networks. The core idea is to decompose convolution into two stages: first reduce the spatial resolution of the feature map, then restore it to the desired size. This operation lowers the sampling density in the spatial domain, which is independent of, yet complementary to, network acceleration approaches in the channel domain. With different sampling rates, we can trade off between recognition accuracy and model complexity. As a basic building block, the spatial bottleneck can be used to replace a single convolutional layer or the combination of two convolutional layers. We verify its effectiveness by applying it to deep residual networks. The spatial bottleneck achieves 2x and 1.4x speedups on regular and channel-bottlenecked residual blocks, respectively, with accuracy preserved when recognizing low-resolution images and even improved when recognizing high-resolution images.
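A sketch of the spatial-bottleneck decomposition as described above (the exact layer configuration is my guess): a strided convolution reduces the spatial resolution by a factor s, and a transposed convolution restores it, together replacing one dense-resolution convolution.

```python
import torch.nn as nn

class SpatialBottleneck(nn.Module):
    """Replace a 3x3 convolution by a strided 3x3 conv (downsample) plus a 3x3 transposed conv (upsample)."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                                stride=stride, padding=1)              # lower the spatial sampling density
        self.restore = nn.ConvTranspose2d(out_channels, out_channels, kernel_size=3,
                                          stride=stride, padding=1,
                                          output_padding=stride - 1)   # recover the original size
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.restore(self.reduce(x))))
```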
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust a filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context, further boosting performance. We also elaborate on implementation details and share our experience on training our system. The proposed `DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
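A compact sketch of an ASPP-style module in PyTorch (rates and channel widths are illustrative, not the paper's exact configuration): parallel atrous convolutions at several rates plus image-level pooling, concatenated and fused with a 1x1 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_channels, out_channels=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([nn.Conv2d(in_channels, out_channels, 1)])
        for r in rates:
            # atrous convolutions in parallel, each with a different rate
            self.branches.append(nn.Conv2d(in_channels, out_channels, 3, padding=r, dilation=r))
        # image-level features encoding global context
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_channels, out_channels, 1))
        self.project = nn.Conv2d(out_channels * (len(rates) + 2), out_channels, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w), mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```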
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or "atrous convolution", as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but takes a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-the-art on the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU on the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
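For illustration, a dilated (atrous) 3x3 convolution keeps the same nine weights per filter but spreads them over a larger window, so the field of view grows without extra parameters; a small hedged PyTorch check:

```python
import torch.nn as nn

# Both layers have identical parameter counts; only the sampling grid differs.
plain  = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)  # 3x3 field of view
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # 5x5 effective field of view

assert sum(p.numel() for p in plain.parameters()) == sum(p.numel() for p in atrous.parameters())
```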
Semantic image segmentation is an essential component of modern autonomous driving systems, as an accurate understanding of the surrounding scene is crucial to navigation and action planning. Current state-of-the-art approaches in semantic image segmentation rely on pre-trained networks that were initially developed for classifying images as a whole. While these networks exhibit outstanding recognition performance (i.e., what is visible?), they lack localization accuracy (i.e., where precisely is something located?). Therefore, additional processing steps have to be performed in order to obtain pixel-accurate segmentation masks at the full image resolution. To alleviate this problem we propose a novel ResNet-like architecture that exhibits strong localization and recognition performance. We combine multi-scale context with pixel-level accuracy by using two processing streams within our network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals. Without additional processing steps and without pre-training, our approach achieves an intersection-over-union score of 71.8% on the Cityscapes dataset.
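A simplified sketch of the two-stream coupling described above (my reading, not the paper's exact block): one stream stays at full resolution, the other is pooled; each unit fuses a pooled copy of the full-resolution stream into the pooled stream and adds an upsampled residual back.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamUnit(nn.Module):
    """Hypothetical full-resolution residual unit: couples a full-resolution and a pooled stream."""
    def __init__(self, channels, scale=4):
        super().__init__()
        self.scale = scale
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        self.residual_proj = nn.Conv2d(channels, channels, 1)

    def forward(self, y_full, z_pooled):
        # pool the full-resolution stream down to the working resolution and fuse both streams
        y_down = F.max_pool2d(y_full, self.scale)
        z_new = self.fuse(torch.cat([y_down, z_pooled], dim=1))
        # project, upsample, and add a residual back into the full-resolution stream
        y_new = y_full + F.interpolate(self.residual_proj(z_new), scale_factor=self.scale, mode="nearest")
        return y_new, z_new
```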
Spatial pyramid pooling modules and encoder-decoder structures are both used in deep neural networks for semantic segmentation. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages of both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. We further explore the Xception model and apply depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, achieving test set performance of 89.0% and 82.1% without any post-processing. Our paper is accompanied by a publicly available reference implementation of the proposed models in Tensorflow at https://github.com/tensorflow/models/tree/master/research/deeplab.
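A brief sketch of the depthwise separable (optionally atrous) convolution that the abstract says is applied in the ASPP and decoder modules (channel sizes here are placeholders): a per-channel spatial convolution followed by a 1x1 pointwise convolution.

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution: per-channel 3x3 (optionally dilated) + 1x1 pointwise."""
    def __init__(self, in_channels, out_channels, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))
```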
In this work we introduce a novel, CNN-based architecture that can be trained end-to-end to deliver seamless scene segmentation results. Our goal is to predict consistent semantic segmentation and detection results in a panoptic output format, going beyond the simple combination of independently trained segmentation and detection models. The proposed architecture takes advantage of a novel segmentation head that seamlessly integrates multi-scale features generated by a Feature Pyramid Network with contextual information conveyed by a lightweight, DeepLab-like module. As an additional contribution we review the panoptic metric and propose an alternative that overcomes its limitations when evaluating non-instance categories. Our proposed network architecture yields state-of-the-art results on three challenging street-level datasets: Cityscapes, the Indian Driving Dataset, and Mapillary Vistas.
In this work we present In-Place Activated Batch Normalization (InPlace-ABN) - a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and recovering the required information during the backward pass through the inversion of stored forward results, with only a minor increase (0.8-2%) in computation time. Also, we demonstrate how frequently used checkpointing approaches can be made computationally as efficient as InPlace-ABN. In our experiments on image classification, we demonstrate on-par results on ImageNet-1k with state-of-the-art approaches. On the memory-demanding task of semantic segmentation, we report results for COCO-Stuff, Cityscapes and Mapillary Vistas, obtaining new state-of-the-art results on the latter without additional training data but in a single-scale and single-model scenario. Code can be found at http://github.com/mapillary/inplace_abn.
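To illustrate the memory-saving principle only (a much-simplified sketch, not the actual InPlace-ABN layer): if the activation is invertible, the backward pass can be computed from the stored output alone, so the layer's input does not need to be kept.

```python
import torch
import torch.nn.functional as F

class MemorySavingLeakyReLU(torch.autograd.Function):
    """Stores only the activation output; the gradient is recovered from it in backward."""

    @staticmethod
    def forward(ctx, x, negative_slope=0.01):
        y = F.leaky_relu(x, negative_slope)
        ctx.negative_slope = negative_slope
        ctx.save_for_backward(y)          # the input x is not kept
        return y

    @staticmethod
    def backward(ctx, grad_output):
        y, = ctx.saved_tensors
        # the sign of y tells us which branch of the activation was taken
        slope = torch.full_like(y, ctx.negative_slope)
        grad_input = grad_output * torch.where(y < 0, slope, torch.ones_like(y))
        return grad_input, None
```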
Semantic segmentation is a challenging task that addresses most of the perception needs of Intelligent Vehicles (IV) in a unified way. Deep Neural Networks excel at this task, as they can be trained end-to-end to accurately classify multiple object categories in an image at pixel level. However, a good trade-off between high quality and computational resources is not yet present in state-of-the-art semantic segmentation approaches, limiting their application in real vehicles. In this paper, we propose a deep architecture that is able to run in real-time while providing accurate semantic segmentation. The core of our architecture is a novel layer that uses residual connections and factorized convolutions in order to remain efficient while retaining remarkable accuracy. Our approach is able to run at over 83 FPS on a single Titan X, and 7 FPS on a Jetson TX1 (embedded GPU). A comprehensive set of experiments on the publicly available Cityscapes dataset demonstrates that our system achieves an accuracy that is similar to the state of the art, while being orders of magnitude faster to compute than other architectures that achieve top precision. The resulting trade-off makes our model an ideal approach for scene understanding in IV applications. The code is publicly available at: https://github.com/Eromera/erfnet
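A sketch of the kind of residual layer with factorized convolutions the abstract describes (loosely following the "non-bottleneck-1D" idea; dropout and exact ordering are omitted):

```python
import torch.nn as nn
import torch.nn.functional as F

class FactorizedResidualBlock(nn.Module):
    """3x3 convolutions factorized into 3x1 and 1x3 pairs, wrapped in a residual connection."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.conv3x1_1 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.conv1x3_1 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.conv3x1_2 = nn.Conv2d(channels, channels, (3, 1),
                                   padding=(dilation, 0), dilation=(dilation, 1))
        self.conv1x3_2 = nn.Conv2d(channels, channels, (1, 3),
                                   padding=(0, dilation), dilation=(1, dilation))
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1x3_1(F.relu(self.conv3x1_1(x)))))
        out = self.bn2(self.conv1x3_2(F.relu(self.conv3x1_2(out))))
        return F.relu(out + x)   # residual connection keeps the block cheap and trainable
```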
Semantic segmentation requires a large amount of computation. Dilated convolution relieves this burden of complexity by increasing the receptive field without additional parameters. For a more lightweight model, using depthwise separable convolution is one of the practical choices. However, a simple combination of these two methods results in an operation that is too sparse, which may cause severe performance degradation. To resolve this problem, we propose a new Concentrated-Comprehensive Convolution (CCC) block, which takes the advantages of both dilated convolution and depthwise separable convolution. The CCC block consists of an information-concentration stage and a comprehensive-convolution stage. The first stage uses two depthwise asymmetric convolutions to obtain compressed information from neighboring pixels. The second stage increases the receptive field by applying a depthwise separable dilated convolution to the feature map from the first stage. By replacing the conventional ESP module with the proposed CCC module, we can halve the number of parameters and reduce the number of FLOPs by 35% compared to ESPnet, one of the fastest models, without accuracy degradation on the Cityscapes dataset. We further applied CCC to other segmentation models based on dilated convolution, and our method achieved comparable or higher performance with a reduced number of parameters and FLOPs. Finally, experiments on the ImageNet classification task show that CCC can successfully replace dilated convolutions.
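A rough sketch of the two-stage CCC block as the abstract describes it (my reading; sizes and ordering may differ from the paper): depthwise asymmetric convolutions concentrate neighboring information, then a depthwise separable dilated convolution widens the receptive field.

```python
import torch.nn as nn

class CCCBlock(nn.Module):
    def __init__(self, channels, dilation=2):
        super().__init__()
        # stage 1: information concentration with two depthwise asymmetric convolutions
        self.concentrate = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), groups=channels, bias=False),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), groups=channels, bias=False),
        )
        # stage 2: comprehensive convolution - a depthwise separable dilated convolution
        self.comprehensive = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation,
                      groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.comprehensive(self.concentrate(x))
```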
Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers. They have proved effective at guiding the model to attend to less discriminative parts of objects (e.g., the leg as opposed to the head of a person), thereby letting the network generalize better and gain better object localization capabilities. On the other hand, current methods for regional dropout remove informative pixels from training images by overlaying patches of black pixels or random noise. Such removal is undesirable because it causes information loss and inefficiency during training. We therefore propose the CutMix augmentation strategy: patches are cut and pasted among training images, and the ground-truth labels are mixed proportionally to the area of the patches. By making efficient use of training pixels and retaining the regularization effect of regional dropout, CutMix consistently outperforms state-of-the-art augmentation strategies on the CIFAR and ImageNet classification tasks as well as the ImageNet weakly-supervised localization task. Moreover, unlike previous augmentation methods, our CutMix-trained ImageNet classifier, when used as a pre-trained model, yields consistent performance gains on the Pascal detection and MS-COCO image captioning benchmarks. We also show that CutMix improves model robustness against input corruptions and its out-of-distribution detection performance.
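A condensed CutMix sketch (paraphrasing the usual formulation; this is not the authors' code): sample a mixing ratio, cut a proportionally sized box from a shuffled copy of the batch, paste it in, and weight the two labels by the actual pasted area.

```python
import torch

def cutmix(images, labels, alpha=1.0):
    """Returns mixed images and (labels_a, labels_b, lam) for a weighted loss."""
    n, _, h, w = images.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(n, device=images.device)

    # box whose area is roughly (1 - lam) of the image
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    # mix labels proportionally to the area that was actually pasted
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    return mixed, labels, labels[perm], lam

# usage with a standard criterion:
# loss = lam * criterion(out, labels_a) + (1 - lam) * criterion(out, labels_b)
```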
Real-time semantic segmentation plays an important role in practical applications such as self-driving and robotics. Most research on semantic segmentation focuses on accuracy with little consideration for efficiency, while several existing studies that emphasize high-speed inference often fail to produce high-accuracy segmentation results. In this paper, we propose a novel convolutional network named EDANet (Efficient Dense modules with Asymmetric convolution), which employs an asymmetric convolution structure and combines dilated convolution with dense connectivity to attain high efficiency at low computational cost, inference time, and model size. Compared to FCN, EDANet is 11 times faster and has 196 times fewer parameters, while achieving a higher mean intersection-over-union (mIoU) score without any additional decoder structure, context module, post-processing scheme, or pre-trained model. We evaluate EDANet on the Cityscapes and CamVid datasets and compare it with other state-of-the-art systems. Our network can run at 108 and 81 frames per second on a single GTX 1080Ti and Titan X, respectively, when parsing 512x1024 inputs.
Much of the recent progress in image classification research can be credited to training procedure refinements, such as changes in data augmentation and optimization methods. In the literature, however, most refinements are either briefly mentioned as implementation details or only visible in source code. In this paper, we examine a collection of such refinements and empirically evaluate their impact on final model accuracy through ablation studies. We show that, by combining these refinements, we are able to improve various CNN models significantly. For example, we raise ResNet-50's top-1 validation accuracy on ImageNet from 75.3% to 79.29%. We also demonstrate that improvements in image classification accuracy lead to better transfer learning performance in other application domains such as object detection and semantic segmentation.
One of the recent trends [30, 31, 14] in network architecture design is stacking small filters (e.g., 1x1 or 3x3) throughout the entire network, because stacked small filters are more efficient than a large kernel, given the same computational complexity. However, in the field of semantic segmentation, where we need to perform dense per-pixel prediction, we find that the large kernel (and effective receptive field) plays an important role when we have to perform the classification and localization tasks simultaneously. Following our design principle, we propose a Global Convolutional Network to address both the classification and localization issues for semantic segmentation. We also suggest a residual-based boundary refinement to further refine the object boundaries. Our approach achieves state-of-the-art performance on two public benchmarks and significantly outperforms previous results: 82.2% (vs 80.2%) on the PASCAL VOC 2012 dataset and 76.9% (vs 71.8%) on the Cityscapes dataset.
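A sketch of a Global Convolutional Network block as described above (k and channel counts are illustrative): a large k x k kernel is approximated by the sum of a (k x 1 + 1 x k) branch and a (1 x k + k x 1) branch, keeping the parameter count linear in k.

```python
import torch.nn as nn

class GlobalConvBlock(nn.Module):
    """Approximate a large k x k convolution with two separable branches."""
    def __init__(self, in_channels, out_channels, k=15):
        super().__init__()
        p = k // 2
        self.branch_a = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, (k, 1), padding=(p, 0)),
            nn.Conv2d(out_channels, out_channels, (1, k), padding=(0, p)),
        )
        self.branch_b = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, (1, k), padding=(0, p)),
            nn.Conv2d(out_channels, out_channels, (k, 1), padding=(p, 0)),
        )

    def forward(self, x):
        # both branches cover a k x k field of view; their sum approximates a dense large kernel
        return self.branch_a(x) + self.branch_b(x)
```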
In this paper, we propose the Broadcasting Convolutional Network (BCN), which extracts key object features from the global field of an entire input image and recognizes their relationship with local features. BCN is a simple network module that collects effective spatial features, embeds location information, and broadcasts them to the entire feature map. We further introduce the Multi-Relational Network (multiRN), which improves the existing Relation Network (RN) by exploiting the BCN module. In pixel-based relational reasoning problems, with the help of BCN, multiRN extends the concept of "pairwise relations" in conventional RNs to "multiwise relations" by relating each object with multiple objects at once. This yields O(n) complexity for n objects, a substantial computational gain over RNs, which require O(n^2). In experiments, multiRN achieves state-of-the-art performance on the CLEVR dataset, demonstrating the usefulness of BCN for relational reasoning problems.
Convolutional Neural Networks spread through computer vision like a wildfire, impacting almost all visual tasks imaginable. Despite this, few researchers dare to train their models from scratch. Most work builds on one of a handful of ImageNet pre-trained models, and fine-tunes or adapts these for specific tasks. This is in large part due to the difficulty of properly initializing these networks from scratch. A small miscalibration of the initial weights leads to vanishing or exploding gradients, as well as poor convergence properties. In this work we present a fast and simple data-dependent initialization procedure, that sets the weights of a network such that all units in the network train at roughly the same rate, avoiding vanishing or exploding gradients. Our initialization matches the current state-of-the-art unsupervised or self-supervised pre-training methods on standard computer vision tasks, such as image classification and object detection, while reducing the pre-training time by three orders of magnitude. When combined with pre-training methods, our initialization significantly outperforms prior work, narrowing the gap between supervised and unsupervised pre-training.
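A very rough sketch of the data-dependent idea (a simplified LSUV-style rescaling, not the authors' exact procedure): after random initialization, forward a batch of real data and rescale each layer's weights so its outputs have roughly unit standard deviation, which keeps per-layer training rates comparable.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def data_dependent_init(model, data_batch, tol=0.1, max_rounds=5):
    """LSUV-style sketch: rescale each conv/linear layer, one at a time, so its output std is ~1."""
    layers = [m for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]
    for layer in layers:
        captured = {}
        handle = layer.register_forward_hook(lambda m, i, o: captured.update(std=o.std().item()))
        for _ in range(max_rounds):
            model(data_batch)                       # forward a real batch to observe statistics
            std = captured["std"]
            if std == 0 or abs(std - 1.0) < tol:
                break
            layer.weight.mul_(1.0 / std)            # rescale so this layer's outputs have ~unit std
        handle.remove()
```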
Convolutional neural networks are built upon the convolution operation, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. In order to boost the representational power of a network, several recent approaches have shown the benefit of enhancing spatial encoding. In this work, we focus on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We demonstrate that by stacking these blocks together, we can construct SENet architectures that generalise extremely well across challenging datasets. Crucially, we find that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost. SENets formed the foundation of our ILSVRC 2017 classification submission which won first place and significantly reduced the top-5 error to 2.251%, achieving a ∼25% relative improvement over the winning entry of 2016. Code and models are available at https://github.com/hujie-frank/SENet.
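A compact Squeeze-and-Excitation block sketch (a sketch, not the released implementation; the reduction ratio of 16 is the commonly cited choice): global average pooling squeezes spatial information, two small fully connected layers produce per-channel gates, and the feature map is rescaled channel-wise.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # global spatial average per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        s = self.squeeze(x).view(n, c)                    # squeeze: (N, C)
        w = self.excite(s).view(n, c, 1, 1)               # excitation: per-channel gates in [0, 1]
        return x * w                                      # recalibrate channel-wise responses
```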