在每个卷积层中学习一个静态卷积内核是现代卷积神经网络(CNN)的常见训练范式。取而代之的是,动态卷积的最新研究表明,学习$ n $卷积核与输入依赖性注意的线性组合可以显着提高轻重量CNN的准确性,同时保持有效的推断。但是,我们观察到现有的作品endow卷积内核具有通过一个维度(关于卷积内核编号)的动态属性(关于内核空间的卷积内核编号),但其他三个维度(关于空间大小,输入通道号和输出通道编号和输出通道号,每个卷积内核)被忽略。受到这一点的启发,我们提出了Omni维动态卷积(ODCONV),这是一种更普遍而优雅的动态卷积设计,以推进这一研究。 ODCONV利用了一种新型的多维注意机制,采用平行策略来学习沿着任何卷积层的内核空间的所有四个维度学习卷积内核的互补关注。作为定期卷积的倒数替换,可以将ODCONV插入许多CNN架构中。 ImageNet和MS-Coco数据集的广泛实验表明,ODCONV为包括轻量重量和大型的各种盛行的CNN主链带来了可靠的准确性提升,例如3.77%〜5.71%| 1.86%〜3.72%〜3.72%的绝对1个绝对1改进至ImabivLenetV2 | ImageNet数据集上的重新连接家族。有趣的是,由于其功能学习能力的提高,即使具有一个单个内核的ODCONV也可以与具有多个内核的现有动态卷积对应物竞争或超越现有的动态卷积对应物,从而大大降低了额外的参数。此外,ODCONV也优于其他注意模块,用于调节输出特征或卷积重量。
translated by 谷歌翻译
由于存储器和计算资源有限,部署在移动设备上的卷积神经网络(CNNS)是困难的。我们的目标是通过利用特征图中的冗余来设计包括CPU和GPU的异构设备的高效神经网络,这很少在神经结构设计中进行了研究。对于类似CPU的设备,我们提出了一种新颖的CPU高效的Ghost(C-Ghost)模块,以生成从廉价操作的更多特征映射。基于一组内在的特征映射,我们使用廉价的成本应用一系列线性变换,以生成许多幽灵特征图,可以完全揭示内在特征的信息。所提出的C-Ghost模块可以作为即插即用组件,以升级现有的卷积神经网络。 C-Ghost瓶颈旨在堆叠C-Ghost模块,然后可以轻松建立轻量级的C-Ghostnet。我们进一步考虑GPU设备的有效网络。在建筑阶段的情况下,不涉及太多的GPU效率(例如,深度明智的卷积),我们建议利用阶段明智的特征冗余来制定GPU高效的幽灵(G-GHOST)阶段结构。舞台中的特征被分成两个部分,其中使用具有较少输出通道的原始块处理第一部分,用于生成内在特征,另一个通过利用阶段明智的冗余来生成廉价的操作。在基准测试上进行的实验证明了所提出的C-Ghost模块和G-Ghost阶段的有效性。 C-Ghostnet和G-Ghostnet分别可以分别实现CPU和GPU的准确性和延迟的最佳权衡。代码可在https://github.com/huawei-noah/cv-backbones获得。
translated by 谷歌翻译
We propose Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs. We validate our CBAM through extensive experiments on ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show consistent improvements in classification and detection performances with various models, demonstrating the wide applicability of CBAM. The code and models will be publicly available.
translated by 谷歌翻译
Recently, channel attention mechanism has demonstrated to offer great potential in improving the performance of deep convolutional neural networks (CNNs). However, most existing methods dedicate to developing more sophisticated attention modules for achieving better performance, which inevitably increase model complexity.To overcome the paradox of performance and complexity trade-off, this paper proposes an Efficient Channel Attention (ECA) module, which only involves a handful of parameters while bringing clear performance gain. By dissecting the channel attention module in SENet, we empirically show avoiding dimensionality reduction is important for learning channel attention, and appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity. Therefore, we propose a local crosschannel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1D convolution. Furthermore, we develop a method to adaptively select kernel size of 1D convolution, determining coverage of local cross-channel interaction. The proposed ECA module is efficient yet effective, e.g., the parameters and computations of our modules against backbone of ResNet50 are 80 vs. 24.37M and 4.7e-4 GFLOPs vs. 3.86 GFLOPs, respectively, and the performance boost is more than 2% in terms of Top-1 accuracy. We extensively evaluate our ECA module on image classification, object detection and instance segmentation with backbones of ResNets and MobileNetV2. The experimental results show our module is more efficient while performing favorably against its counterparts.
translated by 谷歌翻译
视觉变压器的最新进展在基于点产生自我注意的新空间建模机制驱动的各种任务中取得了巨大成功。在本文中,我们表明,视觉变压器背后的关键要素,即输入自适应,远程和高阶空间相互作用,也可以通过基于卷积的框架有效地实现。我们介绍了递归封闭式卷积($ \ textit {g}^\ textit {n} $ conv),该卷积{n} $ conv)与封闭的卷积和递归设计执行高阶空间交互。新操作是高度灵活和可定制的,它与卷积的各种变体兼容,并将自我注意的两阶相互作用扩展到任意订单,而无需引入大量额外的计算。 $ \ textit {g}^\ textit {n} $ conv可以用作插件模块,以改善各种视觉变压器和基于卷积的模型。根据该操作,我们构建了一个名为Hornet的新型通用视觉骨干家族。关于ImageNet分类,可可对象检测和ADE20K语义分割的广泛实验表明,大黄蜂的表现优于Swin变形金刚,并具有相似的整体体系结构和训练配置的明显边距。大黄蜂还显示出对更多训练数据和更大模型大小的有利可伸缩性。除了在视觉编码器中的有效性外,我们还可以将$ \ textit {g}^\ textit {n} $ conv应用于特定于任务的解码器,并始终通过较少的计算来提高密集的预测性能。我们的结果表明,$ \ textIt {g}^\ textit {n} $ conv可以成为视觉建模的新基本模块,可有效结合视觉变形金刚和CNN的优点。代码可从https://github.com/raoyongming/hornet获得
translated by 谷歌翻译
In standard Convolutional Neural Networks (CNNs), the receptive fields of artificial neurons in each layer are designed to share the same size. It is well-known in the neuroscience community that the receptive field size of visual cortical neurons are modulated by the stimulus, which has been rarely considered in constructing CNNs. We propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. A building block called Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer. Multiple SK units are stacked to a deep network termed Selective Kernel Networks (SKNets). On the ImageNet and CIFAR benchmarks, we empirically show that SKNet outperforms the existing state-of-the-art architectures with lower model complexity. Detailed analyses show that the neurons in SKNet can capture target objects with different scales, which verifies the capability of neurons for adaptively adjusting their receptive field sizes according to the input. The code and models are available at https://github.com/implus/SKNet.
translated by 谷歌翻译
Light-weight convolutional neural networks (CNNs) suffer performance degradation as their low computational budgets constrain both the depth (number of convolution layers) and the width (number of channels) of CNNs, resulting in limited representation capability. To address this issue, we present Dynamic Convolution, a new design that increases model complexity without increasing the network depth or width. Instead of using a single convolution kernel per layer, dynamic convolution aggregates multiple parallel convolution kernels dynamically based upon their attentions, which are input dependent. Assembling multiple kernels is not only computationally efficient due to the small kernel size, but also has more representation power since these kernels are aggregated in a non-linear way via attention. By simply using dynamic convolution for the state-ofthe-art architecture MobileNetV3-Small, the top-1 accuracy of ImageNet classification is boosted by 2.9% with only 4% additional FLOPs and 2.9 AP gain is achieved on COCO keypoint detection.
translated by 谷歌翻译
ous vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.
translated by 谷歌翻译
现有的多尺度解决方案会导致仅增加接受场大小的风险,同时忽略小型接受场。因此,有效构建自适应神经网络以识别各种空间尺度对象是一个具有挑战性的问题。为了解决这个问题,我们首先引入一个新的注意力维度,即除了现有的注意力维度(例如渠道,空间和分支)之外,并提出了一个新颖的选择性深度注意网络,以对称地处理各种视觉中的多尺度对象任务。具体而言,在给定神经网络的每个阶段内的块,即重新连接,输出层次功能映射共享相同的分辨率但具有不同的接收场大小。基于此结构属性,我们设计了一个舞台建筑模块,即SDA,其中包括树干分支和类似SE的注意力分支。躯干分支的块输出融合在一起,以通过注意力分支指导其深度注意力分配。根据提出的注意机制,我们可以动态选择不同的深度特征,这有助于自适应调整可变大小输入对象的接收场大小。这样,跨块信息相互作用会导致沿深度方向的远距离依赖关系。与其他多尺度方法相比,我们的SDA方法结合了从以前的块到舞台输出的多个接受场,从而提供了更广泛,更丰富的有效接收场。此外,我们的方法可以用作其他多尺度网络以及注意力网络的可插入模块,并创造为SDA- $ x $ net。它们的组合进一步扩展了有效的接受场的范围,可以实现可解释的神经网络。我们的源代码可在\ url {https://github.com/qingbeiguo/sda-xnet.git}中获得。
translated by 谷歌翻译
大多数现有的深神经网络都是静态的,这意味着它们只能以固定的复杂性推断。但资源预算可以大幅度不同。即使在一个设备上,实惠预算也可以用不同的场景改变,并且对每个所需预算的反复培训网络是非常昂贵的。因此,在这项工作中,我们提出了一种称为Mutualnet的一般方法,以训练可以以各种资源约束运行的单个网络。我们的方法列举了具有各种网络宽度和输入分辨率的模型配置队列。这种相互学习方案不仅允许模型以不同的宽度分辨率配置运行,而且还可以在这些配置之间传输独特的知识,帮助模型来学习更强大的表示。 Mutualnet是一般的培训方法,可以应用于各种网络结构(例如,2D网络:MobileNets,Reset,3D网络:速度,X3D)和各种任务(例如,图像分类,对象检测,分段和动作识别),并证明了实现各种数据集的一致性改进。由于我们只培训了这一模型,它对独立培训多种型号而言,它也大大降低了培训成本。令人惊讶的是,如果动态资源约束不是一个问题,则可以使用Mutualnet来显着提高单个网络的性能。总之,Mutualnet是静态和自适应,2D和3D网络的统一方法。代码和预先训练的模型可用于\ url {https://github.com/tayang1122/mutualnet}。
translated by 谷歌翻译
As a powerful engine, vanilla convolution has promoted huge breakthroughs in various computer tasks. However, it often suffers from sample and content agnostic problems, which limits the representation capacities of the convolutional neural networks (CNNs). In this paper, we for the first time model the scene features as a combination of the local spatial-adaptive parts owned by the individual and the global shift-invariant parts shared to all individuals, and then propose a novel two-branch dual complementary dynamic convolution (DCDC) operator to flexibly deal with these two types of features. The DCDC operator overcomes the limitations of vanilla convolution and most existing dynamic convolutions who capture only spatial-adaptive features, and thus markedly boosts the representation capacities of CNNs. Experiments show that the DCDC operator based ResNets (DCDC-ResNets) significantly outperform vanilla ResNets and most state-of-the-art dynamic convolutional networks on image classification, as well as downstream tasks including object detection, instance and panoptic segmentation tasks, while with lower FLOPs and parameters.
translated by 谷歌翻译
Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layerwise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods. The source code and trained models are available on https://mmcheng.net/res2net/.
translated by 谷歌翻译
Since the recent success of Vision Transformers (ViTs), explorations toward transformer-style architectures have triggered the resurgence of modern ConvNets. In this work, we explore the representation ability of DNNs through the lens of interaction complexities. We empirically show that interaction complexity is an overlooked but essential indicator for visual recognition. Accordingly, a new family of efficient ConvNets, named MogaNet, is presented to pursue informative context mining in pure ConvNet-based models, with preferable complexity-performance trade-offs. In MogaNet, interactions across multiple complexities are facilitated and contextualized by leveraging two specially designed aggregation blocks in both spatial and channel interaction spaces. Extensive studies are conducted on ImageNet classification, COCO object detection, and ADE20K semantic segmentation tasks. The results demonstrate that our MogaNet establishes new state-of-the-art over other popular methods in mainstream scenarios and all model scales. Typically, the lightweight MogaNet-T achieves 80.0\% top-1 accuracy with only 1.44G FLOPs using a refined training setup on ImageNet-1K, surpassing ParC-Net-S by 1.4\% accuracy but saving 59\% (2.04G) FLOPs.
translated by 谷歌翻译
本文解决了由多头自我注意力(MHSA)中高计算/空间复杂性引起的视觉变压器的低效率缺陷。为此,我们提出了层次MHSA(H-MHSA),其表示以层次方式计算。具体而言,我们首先将输入图像分为通常完成的补丁,每个补丁都被视为令牌。然后,拟议的H-MHSA学习本地贴片中的令牌关系,作为局部关系建模。然后,将小贴片合并为较大的贴片,H-MHSA对少量合并令牌的全局依赖性建模。最后,汇总了本地和全球专注的功能,以获得具有强大表示能力的功能。由于我们仅在每个步骤中计算有限数量的令牌的注意力,因此大大减少了计算负载。因此,H-MHSA可以在不牺牲细粒度信息的情况下有效地模拟令牌之间的全局关系。使用H-MHSA模块合并,我们建立了一个基于层次的变压器网络的家族,即HAT-NET。为了证明在场景理解中HAT-NET的优越性,我们就基本视觉任务进行了广泛的实验,包括图像分类,语义分割,对象检测和实例细分。因此,HAT-NET为视觉变压器提供了新的视角。可以在https://github.com/yun-liu/hat-net上获得代码和预估计的模型。
translated by 谷歌翻译
Convolutional neural networks are built upon the convolution operation, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. In order to boost the representational power of a network, several recent approaches have shown the benefit of enhancing spatial encoding. In this work, we focus on the channel relationship and propose a novel architectural unit, which we term the "Squeezeand-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We demonstrate that by stacking these blocks together, we can construct SENet architectures that generalise extremely well across challenging datasets. Crucially, we find that SE blocks produce significant performance improvements for existing state-ofthe-art deep architectures at minimal additional computational cost. SENets formed the foundation of our ILSVRC 2017 classification submission which won first place and significantly reduced the top-5 error to 2.251%, achieving a ∼25% relative improvement over the winning entry of 2016.
translated by 谷歌翻译
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a queryirrelevant behavior at deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (also head-irrelevant). Second, the attention maps are intrinsically sparse, few tokens dominate the attention weights; introducing the knowledge from ConvNets would largely smooth the attention and enhance the performance. Motivated by above observations, we generalize self-attention formulation to abstract a queryirrelevant global context directly and further integrate the global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), purely consists of convolutional layers and firmly inherits the merits of both attention mechanism and convolutions, including dynamic property, weight sharing, and short- and long-range feature modeling, etc. Experimental results demonstrate the effectiveness of FCViT. With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks, like object detection, instance segmentation, and semantic segmentation. Codes and models are made available at: https://github.com/ma-xu/FCViT.
translated by 谷歌翻译
Deploying convolutional neural networks (CNNs) on embedded devices is difficult due to the limited memory and computation resources. The redundancy in feature maps is an important characteristic of those successful CNNs, but has rarely been investigated in neural architecture design. This paper proposes a novel Ghost module to generate more feature maps from cheap operations. Based on a set of intrinsic feature maps, we apply a series of linear transformations with cheap cost to generate many ghost feature maps that could fully reveal information underlying intrinsic features. The proposed Ghost module can be taken as a plug-and-play component to upgrade existing convolutional neural networks. Ghost bottlenecks are designed to stack Ghost modules, and then the lightweight Ghost-Net can be easily established. Experiments conducted on benchmarks demonstrate that the proposed Ghost module is an impressive alternative of convolution layers in baseline models, and our GhostNet can achieve higher recognition performance (e.g. 75.7% top-1 accuracy) than MobileNetV3 with similar computational cost on the ImageNet ILSVRC-2012 classification dataset. Code is available at https: //github.com/huawei-noah/ghostnet.
translated by 谷歌翻译
诸如对象检测和分割等密集的计算机视觉任务需要有效的多尺度特征表示,用于检测或分类具有不同大小的对象或区域。虽然卷积神经网络(CNNS)是这种任务的主导架构,但最近引入了视觉变压器(VITS)的目标是将它们替换为骨干。类似于CNN,VITS构建一个简单的多级结构(即,细致粗略),用于使用单尺度补丁进行多尺度表示。在这项工作中,通过从现有变压器的不同角度来看,我们探索了多尺度补丁嵌入和多路径结构,构建了多路径视觉变压器(MPVIT)。 MPVIT通过使用重叠的卷积贴片嵌入,将相同尺寸〜(即,序列长度,序列长度,序列长度的序列长度)嵌入不同尺度的斑块。然后,通过多个路径独立地将不同尺度的令牌独立地馈送到变压器编码器,并且可以聚合产生的特征,使得能够在相同特征级别的精细和粗糙的特征表示。由于多样化,多尺寸特征表示,我们的MPVits从微小〜(5m)缩放到基础〜(73米)一直在想象成分,对象检测,实例分段上的最先进的视觉变压器来实现卓越的性能,和语义细分。这些广泛的结果表明,MPVIT可以作为各种视觉任务的多功能骨干网。代码将在\ url {https://git.io/mpvit}上公开可用。
translated by 谷歌翻译
在本文中,我们将多尺度视觉变压器(MVIT)作为图像和视频分类的统一架构,以及对象检测。我们提出了一种改进的MVIT版本,它包含分解的相对位置嵌入和残余汇集连接。我们以五种尺寸实例化此架构,并评估Imagenet分类,COCO检测和动力学视频识别,在此优先效果。我们进一步比较了MVITS的汇集注意力来窗口注意力机制,其中它在准确性/计算中优于后者。如果没有钟声,MVIT在3个域中具有最先进的性能:ImageNet分类的准确性为88.8%,Coco对象检测的56.1盒AP和动力学-400视频分类的86.1%。代码和模型将公开可用。
translated by 谷歌翻译
空间卷积广泛用于许多深度视频模型。它基本上假设了时空不变性,即,使用不同帧中的每个位置的共享权重。这项工作提出了用于视频理解的时间 - 自适应卷积(Tadaconv),这表明沿着时间维度的自适应权重校准是促进在视频中建模复杂的时间动态的有效方法。具体而言,Tadaconv根据其本地和全局时间上下文校准每个帧的卷积权重,使空间卷积具有时间建模能力。与先前的时间建模操作相比,Tadaconv在通过卷积内核上运行而不是特征,其维度是比空间分辨率小的数量级更有效。此外,内核校准还具有增加的模型容量。通过用Tadaconv替换Reset中的空间互联网来构建坦达2D网络,这与多个视频动作识别和定位基准测试的最先进方法相比,导致PAR或更好的性能。我们还表明,作为可忽略的计算开销的容易插入操作,Tadaconv可以有效地改善许多具有令人信服的边距的现有视频模型。 HTTPS://github.com/alibaba-mmai-research/pytorch-video -Undersing提供代码和模型。
translated by 谷歌翻译