In this work, we propose the "Residual Attention Network", a convolutional neural network using an attention mechanism that can be incorporated into state-of-the-art feed-forward network architectures in an end-to-end training fashion. Our Residual Attention Network is built by stacking Attention Modules, which generate attention-aware features. The attention-aware features from different modules change adaptively as layers go deeper. Inside each Attention Module, a bottom-up top-down feedforward structure is used to unfold the feedforward and feedback attention process into a single feedforward process. Importantly, we propose attention residual learning to train very deep Residual Attention Networks, which can be easily scaled up to hundreds of layers. Extensive analyses are conducted on the CIFAR-10 and CIFAR-100 datasets to verify the effectiveness of every module mentioned above. Our Residual Attention Network achieves state-of-the-art object recognition performance on three benchmark datasets: CIFAR-10 (3.90% error), CIFAR-100 (20.45% error), and ImageNet (4.8% single-model, single-crop top-5 error). Notably, our method achieves a 0.6% top-1 accuracy improvement with 46% of the trunk depth and 69% of the forward FLOPs compared to ResNet-200. The experiments also demonstrate that our network is robust against noisy labels.
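The abstract names attention residual learning but does not give its formula. The sketch below assumes the commonly cited form H(x) = (1 + M(x)) * F(x), where F(x) is the trunk output and M(x) is a soft mask in (0, 1); the function names and the sigmoid mask are illustrative choices, not taken from the text above. It contrasts that form with naive masking M(x) * F(x), which can suppress trunk features when the mask is close to zero.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def naive_attention(trunk, mask_logits):
    """Naive masking: M(x) * F(x). A mask near 0 erases the trunk feature."""
    return [sigmoid(m) * f for f, m in zip(trunk, mask_logits)]


def attention_residual(trunk, mask_logits):
    """Assumed attention residual form: (1 + M(x)) * F(x).

    A mask near 0 leaves the trunk feature unchanged (identity mapping),
    while a mask near 1 roughly doubles it.
    """
    if len(trunk) != len(mask_logits):
        raise ValueError("trunk and mask_logits must have the same length")
    return [(1.0 + sigmoid(m)) * f for f, m in zip(trunk, mask_logits)]


if __name__ == "__main__":
    trunk = [2.0, -1.0, 0.5]          # hypothetical trunk features F(x)
    mask_logits = [-20.0, 0.0, 20.0]  # hypothetical mask logits before the sigmoid
    print("naive   :", naive_attention(trunk, mask_logits))
    print("residual:", attention_residual(trunk, mask_logits))
```

Under this assumption, stacking many modules does not repeatedly shrink feature magnitudes, which is one plausible reading of why the approach scales to hundreds of layers.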
translated by Google Translate
In standard Convolutional Neural Networks (CNNs), the receptive fields of artificial neurons in each layer are designed to share the same size. It is well known in the neuroscience community that the receptive field size of visual cortical neurons is modulated by the stimulus, which has rarely been considered in constructing CNNs. We propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. A building block called the Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer. Multiple SK units are stacked into a deep network termed Selective Kernel Networks (SKNets). On the ImageNet and CIFAR benchmarks, we empirically show that SKNet outperforms the existing state-of-the-art architectures with lower model complexity. Detailed analyses show that the neurons in SKNet can capture target objects at different scales, which verifies the capability of neurons to adaptively adjust their receptive field sizes according to the input. The code and models are available at https://github.com/implus/SKNet.
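The softmax fusion across branches can be illustrated with a minimal sketch. It assumes each branch has already been reduced to one value per channel and that per-branch, per-channel attention logits are given; in the actual unit those logits would be computed from the fused branch information, a step omitted here. All names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_kernels(branches, logits):
    """Fuse branch features with softmax attention taken across branches.

    branches[b][c] is the feature of branch b (one kernel size) at channel c.
    logits[b][c] is the attention logit for that branch and channel.
    For every channel the branch weights sum to 1.
    """
    num_channels = len(branches[0])
    fused = []
    for c in range(num_channels):
        weights = softmax([branch_logits[c] for branch_logits in logits])
        fused.append(sum(w * branch[c] for w, branch in zip(weights, branches)))
    return fused


if __name__ == "__main__":
    small_kernel = [1.0, 4.0]   # hypothetical 3x3-branch features, 2 channels
    large_kernel = [3.0, 8.0]   # hypothetical 5x5-branch features, 2 channels
    logits = [[2.0, -1.0], [0.0, 1.0]]
    print(select_kernels([small_kernel, large_kernel], logits))
```

A channel whose weight leans toward the large-kernel branch behaves as if it had a larger effective receptive field, which is the selection effect the abstract describes.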
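Existing multi-scale solutions risk merely increasing receptive field sizes while neglecting small receptive fields. Effectively constructing adaptive neural networks that recognize objects at various spatial scales is therefore a challenging problem. To tackle this issue, we first introduce a new attention dimension, depth, in addition to existing attention dimensions such as channel, spatial, and branch, and present a novel selective depth attention network to symmetrically handle multi-scale objects in various vision tasks. Specifically, the blocks within each stage of a given neural network, e.g., ResNet, output hierarchical feature maps that share the same resolution but have different receptive field sizes. Based on this structural property, we design a stage-wise building module, namely SDA, which includes a trunk branch and an SE-like attention branch. The block outputs of the trunk branch are fused to guide their depth attention allocation through the attention branch. With the proposed attention mechanism, we can dynamically select different depth features, which helps adaptively adjust the receptive field sizes for variable-sized input objects. In this way, cross-block information interaction leads to long-range dependencies along the depth direction. Compared with other multi-scale approaches, our SDA method combines multiple receptive fields from previous blocks into the stage output, thus offering a wider and richer range of effective receptive fields. Moreover, our method can be used as a pluggable module for other multi-scale networks as well as attention networks, coined SDA-$x$Net. Their combination further extends the range of effective receptive fields, moving toward interpretable neural networks. Our source code is available at https://github.com/qingbeiguo/sda-xnet.git.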
Convolutional neural networks are built upon the convolution operation, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. In order to boost the representational power of a network, several recent approaches have shown the benefit of enhancing spatial encoding. In this work, we focus on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We demonstrate that by stacking these blocks together, we can construct SENet architectures that generalise extremely well across challenging datasets. Crucially, we find that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost. SENets formed the foundation of our ILSVRC 2017 classification submission, which won first place and significantly reduced the top-5 error to 2.251%, achieving a ∼25% relative improvement over the winning entry of 2016.
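The channel recalibration described above can be sketched in a few lines. This is a minimal illustration, assuming the usual three steps (global average pooling as the squeeze, a two-layer bottleneck with ReLU and sigmoid as the excitation, then per-channel scaling); biases are omitted and the weight matrices are hypothetical placeholders rather than trained values.

```python
import math


def se_block(feature_maps, w_reduce, w_expand):
    """Recalibrate channels of a feature map with a squeeze-and-excitation step.

    feature_maps: list of C channels, each a flat list of spatial values.
    w_reduce:     (C // r) x C weight matrix of the first fully connected layer.
    w_expand:     C x (C // r) weight matrix of the second fully connected layer.
    """
    # Squeeze: one descriptor per channel via global average pooling.
    squeezed = [sum(channel) / len(channel) for channel in feature_maps]
    # Excitation, part 1: dimensionality reduction followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    # Excitation, part 2: expansion followed by a sigmoid gate in (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every spatial value of a channel by that channel's gate.
    return [[g * v for v in channel] for g, channel in zip(gates, feature_maps)]


if __name__ == "__main__":
    maps = [[1.0, 3.0], [2.0, 2.0]]   # 2 channels, 2 spatial positions each
    w_reduce = [[0.5, -0.25]]         # reduction ratio r = 2 (hypothetical weights)
    w_expand = [[1.0], [-1.0]]
    print(se_block(maps, w_reduce, w_expand))
```

Because the gates depend on the pooled input, the same block emphasises different channels for different images, which is what "adaptively recalibrates" refers to.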
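In this paper, we propose a general framework for image classification using the attention mechanism and global context, which can be combined with various network architectures to improve their performance. To investigate the capability of the global context, we compare four mathematical models and observe that the global context encoded in a category-disentangled conditional generative model can provide more guidance, because "knowing what is task-irrelevant will also reveal what is relevant". Based on this observation, we define a novel Category Disentangled Global Context (CDGC) and devise a deep network to obtain it. By attending to the CDGC, baseline networks can identify objects of interest more accurately, thereby improving performance. We apply the framework to many different network architectures and compare with state-of-the-art methods on four publicly available datasets. Extensive results validate the effectiveness and superiority of our approach. Code will be made public upon paper acceptance.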
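In this paper, we build upon the weakly supervised generation mechanism of intermediate attention maps in any convolutional neural network and disclose the effectiveness of attention modules more directly in order to fully exploit their potential. Given an existing neural network equipped with arbitrary attention modules, we introduce a meta critic network to evaluate the quality of attention maps in the main network. Because our designed reward is discrete, the proposed learning method is arranged in a reinforcement learning setting, where the attention actors and recurrent critics are alternately optimized to provide instant critique and revision of the temporary attention representation; the method is hence coined Deep REinforced Attention Learning (DREAL). It can be universally applied to network architectures with different types of attention modules and promotes their expressive ability by maximizing the relative gain in final recognition performance arising from each individual attention module, as demonstrated by extensive experiments on both category and instance recognition benchmarks.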
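Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks, and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them into channel attention, spatial attention, temporal attention, and branch attention. A related repository, https://github.com/menghaoguo/awesome-vision-tions, is dedicated to collecting related work. We also suggest future directions for attention mechanism research.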
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
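The reformulation in terms of residual functions can be illustrated as follows. This is a minimal sketch, assuming the block output is y = F(x) + x with an identity shortcut and matching dimensions; `residual_fn` is a hypothetical stand-in for the stacked layers that learn F.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn computes the residual F.

    The layers only need to model the difference between the desired
    output and the input, rather than the full unreferenced mapping.
    """
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("identity shortcut requires matching dimensions")
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """A residual branch whose weights are all zero."""
    return [0.0 for _ in x]


def small_residual(x):
    """A hypothetical learned residual that slightly adjusts the input."""
    return [0.1 * xi for xi in x]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    print(residual_block(x, zero_residual))   # the block reduces to the identity
    print(residual_block(x, small_residual))  # a small perturbation of the input
```

When the residual branch outputs zero, the block passes its input through unchanged, so adding such blocks does not by itself make the identity mapping harder to represent.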
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections (one between each layer and its subsequent layer), our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.
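The connection count can be checked with a short calculation. The sketch below assumes layers are numbered 1 to L and that layer l receives the network input plus the outputs of the l - 1 preceding layers, i.e. l incoming connections; summing over layers gives L(L+1)/2. The function names are illustrative.

```python
def dense_connections(num_layers):
    """Closed-form number of direct connections in a densely connected network."""
    return num_layers * (num_layers + 1) // 2


def count_by_enumeration(num_layers):
    """Count connections explicitly: layer l has l incoming connections."""
    return sum(layer for layer in range(1, num_layers + 1))


def traditional_connections(num_layers):
    """A plain feed-forward chain has one incoming connection per layer."""
    return num_layers


if __name__ == "__main__":
    for depth in (5, 12, 100):
        print(depth, traditional_connections(depth), dense_connections(depth))
```

The quadratic growth is in the number of connections, not parameters; each layer can stay narrow because it reuses earlier feature-maps.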
We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.
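Pedestrian attribute recognition in surveillance scenarios remains a challenging task because specific attributes are localized inaccurately. In this paper, we propose a novel view-attribute localization method based on attention (VALA), which uses view information to guide the recognition process toward specific attributes and an attention mechanism to localize the regions corresponding to those attributes. Specifically, view information is leveraged by a view prediction branch to generate four view weights that represent the confidence for attributes from different views. The view weights are then fed back to compose view-specific attributes, which participate in and supervise deep feature extraction. To explore the spatial location of a view-attribute, regional attention is introduced to aggregate spatial information and encode inter-channel dependencies of the view feature. Subsequently, a fine attribute-specific region is localized, and regional weights for the view-attribute at different spatial locations are obtained through the regional attention. The final view-attribute recognition result is obtained by combining the view weights with the regional weights. Experiments on three widely used datasets (RAP, RAPV2, and PA-100K) demonstrate the effectiveness of our approach compared with state-of-the-art methods.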
Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layerwise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods. The source code and trained models are available on https://mmcheng.net/res2net/.
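Semantic segmentation is a key technology for autonomous vehicles to understand the surrounding scene. The appealing performance of contemporary models usually comes at the expense of heavy computation and lengthy inference time, which is intolerable for self-driving. Using lightweight architectures (encoder-decoder or two-pathway) or inference on low-resolution images, recent methods achieve very fast scene parsing, even running at more than 100 FPS on a single 1080Ti GPU. However, there is still a significant performance gap between these real-time methods and models based on dilation backbones. To tackle this problem, we propose a family of efficient backbones specially designed for real-time semantic segmentation. The proposed deep dual-resolution networks (DDRNets) are composed of two deep branches between which multiple bilateral fusions are performed. Additionally, we design a new contextual information extractor named Deep Aggregation Pyramid Pooling Module (DAPPM) to enlarge effective receptive fields and fuse multi-scale context based on low-resolution feature maps. Our method achieves a new state-of-the-art trade-off between accuracy and speed on both the Cityscapes and CamVid datasets. In particular, on a single 2080Ti GPU, DDRNet-23-slim runs at 102 FPS on the Cityscapes test set and reaches 74.7% mIoU on the CamVid test set. With widely used test augmentation, our method is superior to most state-of-the-art models while requiring much less computation. Code and trained models are available online.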
Deploying convolutional neural networks (CNNs) on embedded devices is difficult due to the limited memory and computation resources. The redundancy in feature maps is an important characteristic of successful CNNs, but has rarely been investigated in neural architecture design. This paper proposes a novel Ghost module to generate more feature maps from cheap operations. Based on a set of intrinsic feature maps, we apply a series of low-cost linear transformations to generate many ghost feature maps that can fully reveal the information underlying the intrinsic features. The proposed Ghost module can be used as a plug-and-play component to upgrade existing convolutional neural networks. Ghost bottlenecks are designed to stack Ghost modules, and the lightweight GhostNet can then be easily established. Experiments conducted on benchmarks demonstrate that the proposed Ghost module is an impressive alternative to convolution layers in baseline models, and our GhostNet can achieve higher recognition performance (e.g. 75.7% top-1 accuracy) than MobileNetV3 with similar computational cost on the ImageNet ILSVRC-2012 classification dataset. Code is available at https://github.com/huawei-noah/ghostnet.
Deep residual networks were shown to be able to scale up to thousands of layers and still improve in performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease the depth and increase the width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior to their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms all previous deep residual networks in accuracy and efficiency, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet. Our code and models are available at https://github.com/szagoruyko/wide-residual-networks.
We propose Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs. We validate our CBAM through extensive experiments on ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show consistent improvements in classification and detection performances with various models, demonstrating the wide applicability of CBAM. The code and models will be publicly available.
Australian Centre for Robotic Vision {guosheng.lin;anton.milan;chunhua.shen;
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
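The claim that atrous convolution enlarges the field of view without adding parameters can be illustrated in one dimension. The sketch below is a simplified illustration (1-D, no padding, no kernel flip, as is conventional in deep learning frameworks), not the DeepLab implementation; the function names and the `rate` parameter name are assumptions made for the example.

```python
def atrous_conv1d(signal, kernel, rate):
    """Valid 1-D convolution whose kernel taps are spaced `rate` samples apart.

    With rate = 1 this is an ordinary convolution; larger rates skip samples,
    so the same number of weights covers a wider span of the input.
    """
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    span = (len(kernel) - 1) * rate  # distance between the first and last tap
    outputs = []
    for start in range(len(signal) - span):
        outputs.append(
            sum(weight * signal[start + i * rate] for i, weight in enumerate(kernel))
        )
    return outputs


def effective_kernel_size(kernel_size, rate):
    """Input span covered by a kernel of `kernel_size` taps at the given rate."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5, 6, 7]
    kernel = [1, 1, 1]  # three weights in every case
    for rate in (1, 2, 3):
        print(rate, effective_kernel_size(len(kernel), rate),
              atrous_conv1d(signal, kernel, rate))
```

Running several such filters at different rates over the same input and combining their outputs is the idea behind the pyramid pooling described in the abstract.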
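In the real world, the degradation of images taken under haze can be quite complex, with the spatial distribution of haze varying from image to image. Recent methods adopt deep neural networks to recover clean scenes directly from hazy images. However, because of the mismatch between the variation of haze in real captures and the fixed degradation parameters of current networks, the generalization ability of recent dehazing methods on real-world hazy images is not ideal. To address the problem of modeling real-world haze degradation, we propose to perceive and model the density of unevenly distributed haze. We propose a novel Separable Hybrid Attention (SHA) module that encodes haze density by capturing features in orthogonal directions. Moreover, a density map is proposed to explicitly model the uneven distribution of haze. The density map generates positional encoding in a semi-supervised way. Such haze density perception and modeling effectively capture the unevenly distributed degradation at the feature level. Through a suitable combination of SHA and the density map, we design a novel dehazing network architecture that achieves a good complexity-performance trade-off. Extensive experiments on two large-scale datasets demonstrate that our method surpasses all state-of-the-art approaches by a large margin, both quantitatively and qualitatively, improving on the best published PSNR of 28.53 dB on the Haze4K test dataset and raising it from 37.17 dB to 38.41 dB on the SOTS indoor test dataset.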
Deep residual networks [1] have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which makes training easier and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at: https://github.com/KaimingHe/resnet-1k-layers.