Fig. 1. Masked images and corresponding inpainted results using our partial-convolution based network. Abstract. Existing deep learning based image inpainting methods use a standard convolutional network over the corrupted image, with convolutional filter responses conditioned on both valid pixels and the substitute values in the masked holes (typically the mean value). This often leads to artifacts such as color discrepancy and blurriness. Post-processing is usually used to reduce such artifacts, but it is expensive and may fail. We propose the use of partial convolutions, where the convolution is masked and renormalized to be conditioned on only valid pixels. We further include a mechanism to automatically generate an updated mask for the next layer as part of the forward pass. Our model outperforms other methods for irregular masks. We show qualitative and quantitative comparisons with other methods to validate our approach.
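To make the mechanism concrete, the following is a minimal sketch of a partial-convolution layer under assumed details (layer sizes and the exact renormalization bookkeeping are illustrative, not the authors' released code): the convolution only sees masked-in pixels, its response is rescaled by the fraction of valid pixels in each window, and an updated mask is produced for the next layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Sketch of a partial convolution: convolve only over valid pixels,
    renormalize by the valid-pixel count per window, and update the mask."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel used only to count valid pixels per window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.window_size = kernel_size * kernel_size
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # mask: (N, 1, H, W) with 1 = valid pixel, 0 = hole.
        with torch.no_grad():
            valid_count = F.conv2d(mask, self.ones,
                                   stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)
        bias = self.conv.bias.view(1, -1, 1, 1)
        # Renormalize by window_size / (#valid pixels); fully-masked windows stay 0.
        scale = self.window_size / valid_count.clamp(min=1.0)
        out = (out - bias) * scale + bias
        new_mask = (valid_count > 0).float()   # mask update for the next layer
        return out * new_mask, new_mask
```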
Deep neural networks often work well when they are heavily over-parameterized and trained with a large amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This limited success of dropout for convolutional layers is likely because activation units in convolutional layers are spatially correlated, so information can still flow through convolutional networks despite dropout. A structured form of dropout is therefore needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout in which units in a contiguous region of a feature map are dropped together. We found that applying DropBlock in skip connections in addition to the convolutional layers improves accuracy. Furthermore, gradually increasing the number of dropped units during training leads to higher accuracy and makes the method more robust to hyperparameter choices. Extensive experiments show that DropBlock works better than ordinary dropout in regularizing convolutional networks. On ImageNet classification, a ResNet-50 architecture with DropBlock achieves 78.13% accuracy, an improvement of more than 1.6% over the baseline. On COCO detection, DropBlock improves the average precision of RetinaNet from 36.8% to 38.4%.
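A minimal sketch of the structured-dropout idea, assuming a functional form and the usual rescaling convention (the exact gamma schedule and masking details in the paper may differ): block centers are sampled, expanded into contiguous square regions, and the surviving activations are rescaled.

```python
import torch
import torch.nn.functional as F

def drop_block(x, drop_prob=0.1, block_size=7):
    """Sketch of DropBlock: zero out contiguous block_size x block_size regions
    of each feature map, then rescale to preserve the expected activation."""
    if drop_prob == 0.0:
        return x
    n, c, h, w = x.shape
    # Per-position probability of becoming a block center, chosen so that
    # roughly drop_prob of all units end up dropped.
    gamma = (drop_prob / block_size ** 2) * (h * w) / (
        (h - block_size + 1) * (w - block_size + 1))
    centers = (torch.rand(n, c, h, w, device=x.device) < gamma).float()
    # Expand each sampled center into a full block via max pooling.
    block_mask = F.max_pool2d(centers, kernel_size=block_size,
                              stride=1, padding=block_size // 2)
    keep_mask = 1.0 - block_mask.clamp(max=1.0)
    # Rescale by the fraction of kept units, as in ordinary dropout.
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)
```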
Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real world. This problem is inherently challenging due to the diversity of object types as well as appearance changes caused by varying illumination and weather conditions. Leveraging complementary modalities can enable learning of semantically richer representations that are more robust to such perturbations. Despite tremendous progress in recent years, most multimodal convolutional neural network approaches directly concatenate feature maps from the individual modality streams, rendering the model incapable of focusing only on the relevant complementary information for fusion. To address this limitation, we propose a multimodal semantic segmentation framework that dynamically adapts the fusion of modality-specific features while being sensitive to object categories, spatial locations, and scene context in a self-supervised manner. Specifically, we propose an architecture consisting of two modality-specific encoder streams that fuse intermediate encoder representations into a single decoder using our proposed self-supervised model adaptation fusion mechanism, which optimally combines complementary features. As the intermediate representations are not aligned across modalities, we introduce an attention scheme for better correlation. In addition, we propose a computationally efficient unimodal segmentation architecture termed AdapNet++, which combines a new encoder with multiscale residual units and an efficient atrous spatial pyramid pooling module that has a larger effective receptive field with more than 10x fewer parameters, complemented by a strong decoder with a multi-resolution supervision scheme that recovers high-resolution details. Comprehensive empirical evaluations on several benchmarks demonstrate that both our unimodal and multimodal architectures achieve state-of-the-art performance.
Recent advances in deep learning, especially deep convolutional neural networks (CNNs), have led to significant improvement over previous semantic segmentation systems. Here we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are of both theoretical and practical value. First, we design dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields (RF) of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a state-of-the-art result of 80.1% mIOU in the test set at the time of submission. We also have achieved state-of-the-art overall on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Our source code can be found at https://github.com/TuSimple/TuSimple-DUC .
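The dense upsampling convolution can be sketched as a learned sub-pixel rearrangement: the low-resolution feature map predicts r*r logits per class, which are then reshaped to full resolution. The layer sizes below are assumptions for illustration, not the released configuration.

```python
import torch
import torch.nn as nn

class DenseUpsamplingConvolution(nn.Module):
    """Sketch of DUC: predict r*r sub-pixel class logits at low resolution and
    rearrange them into a full-resolution prediction instead of bilinear upsampling."""
    def __init__(self, in_channels, num_classes, upscale=8):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_classes * upscale * upscale,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)  # (C*r*r, H, W) -> (C, H*r, W*r)

    def forward(self, x):
        return self.shuffle(self.conv(x))

# Usage sketch: a stride-8 encoder output mapped to pixel-level logits.
logits = DenseUpsamplingConvolution(2048, num_classes=19, upscale=8)(
    torch.randn(1, 2048, 64, 128))   # -> (1, 19, 512, 1024)
```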
In this work we address the task of semantic image segmentation with deep learning and make three main contributions that are experimentally shown to have substantial practical value. First, we highlight convolution with upsampled filters, or "atrous convolution", as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state of the art on the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU on the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
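A minimal sketch of the ASPP idea, with assumed dilation rates and channel widths (the paper's exact branch configuration may differ): the same feature map is probed in parallel by atrous convolutions with different rates, and the branch outputs are fused.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Sketch of atrous spatial pyramid pooling: parallel 3x3 atrous
    convolutions with different dilation rates, fused by a 1x1 projection."""
    def __init__(self, in_channels, out_channels, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_channels * len(rates), out_channels, 1)

    def forward(self, x):
        # Each branch sees the same features with a different effective field-of-view.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```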
This paper presents an efficient module, the spatial bottleneck, for accelerating convolutional layers in deep neural networks. The core idea is to decompose a convolution into two stages: first reduce the spatial resolution of the feature map, and then restore it to the desired size. This operation lowers the sampling density in the spatial domain, which is orthogonal yet complementary to network acceleration approaches in the channel domain. By using different sampling rates, we can trade off between recognition accuracy and model complexity. As a basic building block, the spatial bottleneck can be used to replace a single convolutional layer or a combination of two convolutional layers. We verify its effectiveness by applying it to deep residual networks. The spatial bottleneck achieves 2x and 1.4x speedups on regular and channel-bottleneck residual blocks respectively, preserving accuracy when recognizing low-resolution images and even improving it when recognizing high-resolution images.
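The two-stage decomposition can be sketched as a strided convolution followed by a transposed convolution that restores the original resolution; the specific layer choices below are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SpatialBottleneck(nn.Module):
    """Sketch of a spatial bottleneck: reduce spatial resolution with a strided
    convolution, then restore it, as a cheaper stand-in for a dense 3x3 convolution."""
    def __init__(self, channels, stride=2):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels, kernel_size=3,
                                stride=stride, padding=1)
        self.restore = nn.ConvTranspose2d(channels, channels, kernel_size=3,
                                          stride=stride, padding=1,
                                          output_padding=stride - 1)

    def forward(self, x):
        return self.restore(self.reduce(x))

# Usage sketch: output spatial size matches the input.
y = SpatialBottleneck(64)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```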
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust a filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed `DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
Semantic image segmentation is an essential component of modern autonomous driving systems, as an accurate understanding of the surrounding scene is crucial to navigation and action planning. Current state-of-the-art approaches in semantic image segmentation rely on pre-trained networks that were initially developed for classifying images as a whole. While these networks exhibit outstanding recognition performance (i.e., what is visible?), they lack localization accuracy (i.e., where precisely is something located?). Therefore, additional processing steps have to be performed in order to obtain pixel-accurate segmentation masks at the full image resolution. To alleviate this problem we propose a novel ResNet-like architecture that exhibits strong localization and recognition performance. We combine multi-scale context with pixel-level accuracy by using two processing streams within our network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals. Without additional processing steps and without pre-training, our approach achieves an intersection-over-union score of 71.8% on the Cityscapes dataset.
We introduce a lightweight, power efficient, and general purpose convolutional neural network, ESPNetv2, for modeling visual and sequential data. Our network uses group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters. The performance of our network is evaluated on three different tasks: (1) object classification, (2) semantic segmentation, and (3) language modeling. Experiments on these tasks, including image classification on the ImageNet and language modeling on the PenTree bank dataset, demonstrate the superior performance of our method over the state-of-the-art methods. Our network has better generalization properties than ShuffleNetv2 when tested on the MSCOCO multi-object classification task and the Cityscapes urban scene semantic segmentation task. Our experiments show that ESPNetv2 is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets and MobileNets. Our code is open-source and available at https://github.com/sacmehta/ESPNetv2.
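A minimal sketch of the depth-wise dilated separable convolution building block, with assumed hyper-parameters (this is not the full ESPNetv2 unit, which additionally groups the point-wise convolutions and fuses multiple dilation rates): a per-channel dilated convolution grows the receptive field cheaply, and a 1x1 convolution mixes channels.

```python
import torch
import torch.nn as nn

class DepthwiseDilatedSeparableConv(nn.Module):
    """Sketch: depth-wise dilated 3x3 convolution followed by a point-wise 1x1."""
    def __init__(self, in_channels, out_channels, dilation=2):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=in_channels, bias=False)
        # In ESPNetv2 the point-wise convolution is additionally grouped.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```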
Spatial pyramid pooling modules and encoder-decoder structures are both used in deep neural networks for semantic segmentation tasks. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages of both approaches. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. We further explore the Xception model and apply depthwise separable convolutions to both the Atrous Spatial Pyramid Pooling module and the decoder module, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, achieving test set performance of 89.0% and 82.1% respectively without any post-processing. Our paper is accompanied by a publicly available reference implementation of the proposed models in TensorFlow at https://github.com/tensorflow/models/tree/master/research/deeplab.
Much of the recent progress in image classification research can be credited to training procedure refinements, such as changes in data augmentation and optimization methods. In the literature, however, most refinements are either briefly mentioned as implementation details or only visible in source code. In this paper, we examine a collection of such refinements and empirically evaluate their impact on the final model accuracy through ablation studies. We show that, by combining these improvements together, we are able to improve various CNN models significantly. For example, we raise ResNet-50's top-1 validation accuracy on ImageNet from 75.3% to 79.29%. We also demonstrate that improvements in image classification accuracy lead to better transfer learning performance in other application domains such as object detection and semantic segmentation.
In this paper, we propose the Broadcasting Convolutional Network (BCN), which extracts key object features from the global field of an entire input image and recognizes their relationship with local features. BCN is a simple network module that collects effective spatial features, embeds location information, and broadcasts them to the entire feature map. We further introduce the multi-relational network (multiRN), which improves the existing Relation Network (RN) by utilizing the BCN module. In pixel-based relational reasoning problems, with the help of BCN, multiRN extends the concept of "pairwise relations" in conventional RNs to "multiwise relations" by relating each object with multiple objects at once. This yields O(n) complexity for n objects, a huge computational gain over RNs, which take O(n^2). Through experiments, multiRN achieves state-of-the-art performance on the CLEVR dataset, demonstrating the reliability of BCN for relational reasoning problems.
Semantic segmentation requires large amounts of pixel-wise annotations to learn accurate models. In this paper, we present a video prediction-based methodology that scales up training sets by synthesizing new training samples, in order to improve the accuracy of semantic segmentation networks. We exploit the ability of video prediction models to predict future frames in order to also predict future labels. A joint propagation strategy is also proposed to alleviate mis-alignments in the synthesized samples. We demonstrate that training segmentation models on datasets augmented with the synthesized samples leads to significant improvements in accuracy. Furthermore, we introduce a novel boundary label relaxation technique that makes training robust to annotation noise and propagation artifacts along object boundaries. Our proposed methods achieve state-of-the-art mIoU of 83.5% on Cityscapes and 82.9% on CamVid. Our single model, without model ensembles, achieves 72.8% mIoU on the KITTI semantic segmentation test set, surpassing the winning entry of the ROB challenge 2018. Our code and videos can be found at https://nv-adlr.github.io/publication/2018-Segmentation.
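The boundary label relaxation idea can be sketched as follows: at pixels near object boundaries, the loss only requires the predicted probability mass over all classes present in the local neighborhood to be high, rather than forcing a single (possibly noisy) label. A minimal sketch under the assumption of a precomputed per-pixel mask of locally present classes (names and shapes are illustrative, not the authors' implementation):

```python
import torch

def relaxed_boundary_loss(logits, neighborhood_classes):
    """logits: (N, C, H, W) raw scores.
    neighborhood_classes: (N, C, H, W) binary mask with 1 for every class that
    appears in a small window around each pixel (a single 1 away from boundaries).
    Maximizes the log of the summed probability of locally plausible classes,
    so boundary pixels are not penalized for choosing any of them."""
    probs = torch.softmax(logits, dim=1)
    union = (probs * neighborhood_classes).sum(dim=1).clamp(min=1e-8)
    return -torch.log(union).mean()
```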
Real-time semantic segmentation plays an important role in practical applications such as autonomous driving and robotics. Most research on semantic segmentation has focused on accuracy with little consideration of efficiency, while several existing studies that emphasize high-speed inference often fail to produce high-accuracy segmentation results. In this paper, we propose a novel convolutional network named Efficient Dense modules with Asymmetric convolution (EDANet), which employs an asymmetric convolution structure and incorporates dilated convolution and dense connectivity to attain high efficiency at low cost in terms of both inference time and model size. Compared to FCN, EDANet is 11 times faster and has 196 times fewer parameters, while achieving a higher mean intersection-over-union (mIoU) score without any additional decoder structure, context module, post-processing scheme, or pretrained model. We evaluate EDANet on the Cityscapes and CamVid datasets to assess its performance and compare it with other state-of-the-art systems. Our network can run at 108 and 81 frames per second on a single GTX 1080Ti and a Titan X, respectively, when parsing 512x1024 inputs.
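A minimal sketch of the asymmetric dilated convolution idea with dense connectivity, under assumed block structure and widths (not the exact EDA module): the 3x3 convolution is factorized into 3x1 and 1x3 convolutions, dilation enlarges the receptive field, and the block's output is concatenated with its input.

```python
import torch
import torch.nn as nn

class AsymmetricDilatedBlock(nn.Module):
    """Sketch: factorized (3x1 then 1x3) dilated convolutions with dense connectivity."""
    def __init__(self, in_channels, growth, dilation=2):
        super().__init__()
        self.conv3x1 = nn.Conv2d(in_channels, growth, kernel_size=(3, 1),
                                 padding=(dilation, 0), dilation=(dilation, 1))
        self.conv1x3 = nn.Conv2d(growth, growth, kernel_size=(1, 3),
                                 padding=(0, dilation), dilation=(1, dilation))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1x3(self.relu(self.conv3x1(x))))
        return torch.cat([x, out], dim=1)  # dense connectivity: grow channels
```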
Convolutional neural networks are built upon the convolution operation, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. In order to boost the representational power of a network, several recent approaches have shown the benefit of enhancing spatial encoding. In this work, we focus on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We demonstrate that by stacking these blocks together, we can construct SENet architectures that generalise extremely well across challenging datasets. Crucially, we find that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost. SENets formed the foundation of our ILSVRC 2017 classification submission which won first place and significantly reduced the top-5 error to 2.251%, achieving a ∼25% relative improvement over the winning entry of 2016. Code and models are available at https://github.com/hujie-frank/SENet.
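A minimal sketch of the SE block, with an assumed reduction ratio: spatial information is squeezed by global average pooling, channel interdependencies are modelled by a small bottleneck MLP, and the feature map is recalibrated by channel-wise rescaling.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of a Squeeze-and-Excitation block: squeeze, excite, recalibrate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # squeeze: (N, C, 1, 1)
        self.fc = nn.Sequential(                      # excitation
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        scale = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * scale                              # channel-wise recalibration
```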
Recent advances in deep learning have shown exciting promise in filling large holes and have opened another direction in image inpainting. However, due to insufficient cognitive understanding, existing learning-based methods often produce artifacts and fallacious textures. Previous generative networks are limited to a single receptive-field type and give up pooling for the sake of sharpness, whereas human cognition remains constant regardless of the target's attributes. Since multiple receptive fields improve the ability to abstract image features and pooling keeps features invariant, we adopt deep learning to promote high-level feature representation and enhance the model's learning capacity for local patches. In addition, we introduce an approach for generating diverse mask images and create a random mask dataset. We benchmark our method on ImageNet, Places2, and CelebA-HQ. Experiments on regular, irregular, and custom-region completion are all conducted, and free-style image inpainting is also presented. Quantitative comparisons with previous state-of-the-art methods show that we obtain more natural image completions.
We present an end-to-end trainable deep convolutional neural network (DCNN) for semantic segmentation with built-in awareness of semantically meaningful boundaries. Semantic segmentation is a fundamental remote sensing task, and most state-of-the-art methods rely on DCNNs as their workhorse. A major reason for their success is that deep networks learn to accumulate contextual information over very large receptive fields. However, this success comes at a cost, since the associated loss of effective spatial resolution washes out high-frequency details and leads to blurry object boundaries. Here, we propose to counter this effect by combining semantic segmentation with semantically informed edge detection, thus making class boundaries explicit in the model. First, we construct a comparatively simple, memory-efficient model by adding boundary detection to the SegNet encoder-decoder architecture. Second, we also include boundary detection in FCN-type models and set up a high-end classifier ensemble. We show that boundary detection significantly improves semantic segmentation with CNNs in an end-to-end training scheme. Our best model achieves > 90% overall accuracy on the ISPRS Vaihingen benchmark.
Over the last few years, deep learning techniques have yielded significant improvements in image inpainting. However, many of these techniques fail to reconstruct reasonable structures, as their results are often over-smoothed and/or incomplete. This paper develops a new approach to image inpainting that does a better job of reproducing filled regions exhibiting fine details. We propose a two-stage adversarial model, EdgeConnect, which comprises an edge generator followed by an image completion network. The edge generator hallucinates edges of the missing regions (both regular and irregular) of the image, and the image completion network fills in the missing regions using the hallucinated edges as a prior. We evaluate our model end-to-end on the publicly available datasets CelebA, Places2, and Paris StreetView, and show that it outperforms current state-of-the-art techniques both quantitatively and qualitatively.
Recent CNN based object detectors, either one-stage methods like YOLO, SSD, and RetinaNet, or two-stage detectors like Faster R-CNN, R-FCN and FPN, usually try to directly finetune from ImageNet pre-trained models designed for the task of image classification. However, there has been little work discussing the backbone feature extractor specifically designed for the task of object detection. More importantly, there are several differences between the tasks of image classification and object detection. (i) Recent object detectors like FPN and RetinaNet usually involve extra stages compared with the task of image classification to handle objects with various scales. (ii) Object detection not only needs to recognize the category of the object instances but also spatially locate them. Large downsampling factors bring a large valid receptive field, which is good for image classification but compromises the object localization ability. Due to this gap between image classification and object detection, we propose DetNet in this paper, a novel backbone network specifically designed for object detection. DetNet includes the extra stages compared with traditional backbone networks for image classification, while maintaining high spatial resolution in deeper layers. Without any bells and whistles, state-of-the-art results have been obtained for both object detection and instance segmentation on the MSCOCO benchmark based on our DetNet (4.8G FLOPs) backbone. Code will be released.
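A minimal sketch of how a deeper stage can keep high spatial resolution while still enlarging the receptive field, with assumed widths (not the exact DetNet bottleneck): the 3x3 convolution of a residual bottleneck uses dilation instead of stride, so the feature map is not downsampled.

```python
import torch
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    """Sketch of a dilated residual bottleneck that preserves spatial resolution."""
    def __init__(self, channels, bottleneck=64, dilation=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))  # residual; same spatial size as input
```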
In this work we present In-Place Activated Batch Normalization (InPlace-ABN), a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and recovering the required information during the backward pass through the inversion of stored forward results, with only a minor increase (0.8-2%) in computation time. We also demonstrate how frequently used checkpointing approaches can be made computationally as efficient as InPlace-ABN. In our experiments on image classification, we demonstrate results on par with state-of-the-art approaches on ImageNet-1k. On the memory-demanding task of semantic segmentation, we report results for COCO-Stuff, Cityscapes and Mapillary Vistas, obtaining new state-of-the-art results on the latter without additional training data, in a single-scale and single-model scenario. Code can be found at http://github.com/mapillary/inplace_abn.
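The memory saving rests on the fact that the activation and the affine batch-normalization transform are invertible, so only their final output needs to be stored and the quantities needed during the backward pass can be recomputed from it. A minimal sketch of this inversion for a leaky-ReLU activation on top of an affine BatchNorm; the helper names are hypothetical and this is not the library's actual API or backward implementation.

```python
import torch
import torch.nn.functional as F

def invert_leaky_relu(y, negative_slope=0.01):
    # Leaky ReLU is strictly monotonic, so its input is recoverable from its output.
    return torch.where(y >= 0, y, y / negative_slope)

def invert_affine_bn(z, gamma, beta, mean, var, eps=1e-5):
    # Invert z = gamma * (x - mean) / sqrt(var + eps) + beta (assumes gamma != 0).
    return (z - beta) / gamma * torch.sqrt(var + eps) + mean

# Usage sketch: only y (the block output) would be kept in memory; the BN output
# and its input are recomputed on the fly when gradients are needed.
x = torch.randn(8, 16)
gamma, beta, eps = torch.ones(16), torch.zeros(16), 1e-5
mean, var = x.mean(0), x.var(0, unbiased=False)
z = gamma * (x - mean) / torch.sqrt(var + eps) + beta
y = F.leaky_relu(z, negative_slope=0.01)
x_rec = invert_affine_bn(invert_leaky_relu(y), gamma, beta, mean, var, eps)
assert torch.allclose(x_rec, x, atol=1e-4)
```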