Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions, in contrast to parameter-dependent scaling and content-independent interactions of convolutions. Self-attention models have recently been shown to have encouraging improvements on accuracy-parameter trade-offs compared to baseline convolutional models such as ResNet-50. In this work, we aim to develop self-attention models that can outperform not just the canonical baseline models, but even the high-performing convolutional models. We propose two extensions to selfattention that, in conjunction with a more efficient implementation of self-attention, improve the speed, memory usage, and accuracy of these models. We leverage these improvements to develop a new self-attention model family, HaloNets, which reach state-of-the-art accuracies on the parameterlimited setting of the ImageNet classification benchmark. In preliminary transfer learning experiments, we find that HaloNet models outperform much larger models and have better inference performance. On harder tasks such as object detection and instance segmentation, our simple local self-attention and convolutional hybrids show improvements over very strong baselines. These results mark another step in demonstrating the efficacy of self-attention models on settings traditionally dominated by convolutional models.
translated by 谷歌翻译
Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention applied to ResNet model produces a fully self-attentional model that outperforms the baseline on ImageNet classification with 12% fewer FLOPS and 29% fewer parameters. On COCO object detection, a pure self-attention model matches the mAP of a baseline RetinaNet while having 39% fewer FLOPS and 34% fewer parameters. Detailed ablation studies demonstrate that self-attention is especially impactful when used in later layers. These results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox. Code for this project is made available. 1 * Denotes equal contribution. Ordering determined by random shuffle. † Work done as a member of the Google AI Residency Program.
translated by 谷歌翻译
We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework; surpassing the previous best published single model and single scale results of ResNeSt [67] evaluated on the COCO validation set. Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to 1.64x faster in "compute" 1 time than the popular EfficientNet models on TPU-v3 hardware. We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
translated by 谷歌翻译
视觉变压器的最新进展在基于点产生自我注意的新空间建模机制驱动的各种任务中取得了巨大成功。在本文中,我们表明,视觉变压器背后的关键要素,即输入自适应,远程和高阶空间相互作用,也可以通过基于卷积的框架有效地实现。我们介绍了递归封闭式卷积($ \ textit {g}^\ textit {n} $ conv),该卷积{n} $ conv)与封闭的卷积和递归设计执行高阶空间交互。新操作是高度灵活和可定制的,它与卷积的各种变体兼容,并将自我注意的两阶相互作用扩展到任意订单,而无需引入大量额外的计算。 $ \ textit {g}^\ textit {n} $ conv可以用作插件模块,以改善各种视觉变压器和基于卷积的模型。根据该操作,我们构建了一个名为Hornet的新型通用视觉骨干家族。关于ImageNet分类,可可对象检测和ADE20K语义分割的广泛实验表明,大黄蜂的表现优于Swin变形金刚,并具有相似的整体体系结构和训练配置的明显边距。大黄蜂还显示出对更多训练数据和更大模型大小的有利可伸缩性。除了在视觉编码器中的有效性外,我们还可以将$ \ textit {g}^\ textit {n} $ conv应用于特定于任务的解码器,并始终通过较少的计算来提高密集的预测性能。我们的结果表明,$ \ textIt {g}^\ textit {n} $ conv可以成为视觉建模的新基本模块,可有效结合视觉变形金刚和CNN的优点。代码可从https://github.com/raoyongming/hornet获得
translated by 谷歌翻译
视觉变压器(VIT)用作强大的视觉模型。与卷积神经网络不同,在前几年主导视觉研究,视觉变压器享有捕获数据中的远程依赖性的能力。尽管如此,任何变压器架构的组成部分,自我关注机制都存在高延迟和低效的内存利用,使其不太适合高分辨率输入图像。为了缓解这些缺点,分层视觉模型在非交错的窗口上局部使用自我关注。这种放松会降低输入尺寸的复杂性;但是,它限制了横窗相互作用,损害了模型性能。在本文中,我们提出了一种新的班次不变的本地注意层,称为查询和参加(QNA),其以重叠的方式聚集在本地输入,非常类似于卷积。 QNA背后的关键想法是介绍学习的查询,这允许快速高效地实现。我们通过将其纳入分层视觉变压器模型来验证我们的层的有效性。我们展示了速度和内存复杂性的改进,同时实现了与最先进的模型的可比准确性。最后,我们的图层尺寸尤其良好,窗口大小,需要高于X10的内存,而不是比现有方法更快。
translated by 谷歌翻译
Convolutional networks have been the paradigm of choice in many computer vision applications. The convolution operation however has a significant weakness in that it only operates on a local neighborhood, thus missing global information. Self-attention, on the other hand, has emerged as a recent advance to capture long range interactions, but has mostly been applied to sequence modeling and generative modeling tasks. In this paper, we consider the use of self-attention for discriminative visual tasks as an alternative to convolutions. We introduce a novel two-dimensional relative self-attention mechanism that proves competitive in replacing convolutions as a stand-alone computational primitive for image classification. We find in control experiments that the best results are obtained when combining both convolutions and self-attention. We therefore propose to augment convolutional operators with this self-attention mechanism by concatenating convolutional feature maps with a set of feature maps produced via self-attention. Extensive experiments show that Attention Augmentation leads to consistent improvements in image classification on Im-ageNet and object detection on COCO across many different models and scales, including ResNets and a stateof-the art mobile constrained network, while keeping the number of parameters similar. In particular, our method achieves a 1.3% top-1 accuracy improvement on ImageNet classification over a ResNet50 baseline and outperforms other attention mechanisms for images such as . It also achieves an improvement of 1.4 mAP in COCO Object Detection on top of a RetinaNet baseline.
translated by 谷歌翻译
变形金刚迅速成为跨模式,域和任务的最深入学习架构之一。在视觉上,除了对普通变压器的持续努力外,层次变压器还引起了人们的重大关注,这要归功于它们的性能和轻松整合到现有框架中。这些模型通常采用局部注意机制,例如滑动窗口社区的注意力(NA)或Swin Transformer转移的窗户自我关注。尽管有效地降低了自我注意力的二次复杂性,但局部注意力却削弱了自我注意力最理想的两个特性:远距离相互依赖性建模和全球接受场。在本文中,我们引入了扩张的邻里注意力(DINA),这是NA的天然,灵活和有效的扩展,可以捕获更多的全球环境,并以无需额外的成本呈指数级扩展接受场。 NA的本地关注和Dina的稀疏全球关注相互补充,因此我们引入了扩张的邻里注意力变压器(Dinat),这是一种新的分层视觉变压器。 Dinat变体对基于注意的基线(例如NAT和SWIN)以及现代卷积基线Convnext都具有重大改进。我们的大型模型在可可对象检测中以1.5%的盒子AP领先于其在COCO物体检测中,1.3%的掩码AP在可可实例分段中,而ADE20K语义分段中的1.1%MIOU和更快的吞吐量。我们认为,NA和Dina的组合有可能增强本文提出的各种任务的能力。为了支持和鼓励朝着这个方向,远见和超越方向进行研究,我们在以下网址开放我们的项目:https://github.com/shi-labs/neighborhood-cithention-transformer。
translated by 谷歌翻译
我们提出了邻里注意力变压器(NAT),这是一种有效,准确和可扩展的层次变压器,在图像分类和下游视觉任务上都很好地工作。它建立在邻里注意力(NA)的基础上,这是一种简单而灵活的注意机制,将每个查询的接受场都定位到其最近的相邻像素。 NA是自我注意的本地化,并且随着接收场大小的增加而接近它。在拖曳和记忆使用方面,它也等同于Swin Transformer的转移窗口的注意力,而同样的接收场大小,同时受到了较少的约束。此外,NA包括局部电感偏见,从而消除了对像素移位等额外操作的需求。 NAT的实验结果具有竞争力; Nat-tiny在Imagenet上仅具有4.3 GFLOPS和28M参数,在MS-Coco上达到51.4%的MAP和ADE20K上的48.4%MIOU。我们在:https://github.com/shi-labs/neighborhood-cithention-transformer上开放了检查点,代码和CUDA内核。
translated by 谷歌翻译
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. 1
translated by 谷歌翻译
变形金刚最近在计算机视觉社区中引起了极大的关注。然而,缺乏关于图像大小的自我注意力机制的可扩展性限制了它们在最先进的视觉骨架中的广泛采用。在本文中,我们介绍了一种高效且可扩展的注意模型,我们称之为多轴注意,该模型由两个方面组成:阻止局部和扩张的全球关注。这些设计选择允许仅具有线性复杂性的任意输入分辨率上进行全局本地空间相互作用。我们还通过有效地将我们提出的注意模型与卷积混合在一起,提出了一个新的建筑元素,因此,通过简单地在多个阶段重复基本的构建块,提出了一个简单的层次视觉主链,称为Maxvit。值得注意的是,即使在早期的高分辨率阶段,Maxvit也能够在整个网络中“看到”。我们证明了模型在广泛的视觉任务上的有效性。根据图像分类,Maxvit在各种设置下实现最先进的性能:没有额外的数据,Maxvit获得了86.5%的Imagenet-1K Top-1精度;使用Imagenet-21K预训练,我们的模型可实现88.7%的TOP-1精度。对于下游任务,麦克斯维特(Maxvit)作为骨架可在对象检测以及视觉美学评估方面提供有利的性能。我们还表明,我们提出的模型表达了ImageNet上强大的生成建模能力,这表明了Maxvit块作为通用视觉模块的优势潜力。源代码和训练有素的模型将在https://github.com/google-research/maxvit上找到。
translated by 谷歌翻译
视觉识别的“咆哮20S”开始引入视觉变压器(VITS),这将被取代的Cummnets作为最先进的图像分类模型。另一方面,vanilla vit,当应用于一般计算机视觉任务等对象检测和语义分割时面临困难。它是重新引入多个ConvNet Priors的等级变压器(例如,Swin变压器),使变压器实际上可作为通用视觉骨干网,并在各种视觉任务上展示了显着性能。然而,这种混合方法的有效性仍然在很大程度上归功于变压器的内在优越性,而不是卷积的固有感应偏差。在这项工作中,我们重新审视设计空间并测试纯粹的Convnet可以实现的限制。我们逐渐“现代化”标准Reset朝着视觉变压器的设计设计,并发现几个有助于沿途绩效差异的关键组件。此探索的结果是一个纯粹的ConvNet型号被称为ConvNext。完全由标准的Convnet模块构建,ConvNexts在准确性和可扩展性方面与变压器竞争,实现了87.8%的ImageNet Top-1精度和表现优于COCO检测和ADE20K分割的Swin变压器,同时保持了标准Convnet的简单性和效率。
translated by 谷歌翻译
Transformers have attracted increasing interests in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths from both architectures, we present CoAtNets (pronounced "coat" nets), a family of hybrid models built from two key insights:(1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: Without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy; When pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images from JFT-300M while using 23x less data; Notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result.1 The initial projection stage can be seen as an aggressive down-sampling convolutional stem.
translated by 谷歌翻译
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational costs on the tasks of image classification, object detection, and segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretrained on ImageNet-22K in 224 resolution, it attains 86.5% and 87.3% top-1 accuracy when finetuned with resolution 224 and 384, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with 1\times outperforms the Swin counterpart by 2.1 points and already surpasses Swin trained with 3\times schedule (49.0 v.s. 48.5). For semantic segmentation with UPerNet, FocalNet base at single-scale outperforms Swin by 2.4, and beats Swin at multi-scale (50.5 v.s. 49.7). Using large FocalNet and Mask2former, we achieve 58.5 mIoU for ADE20K semantic segmentation, and 57.9 PQ for COCO Panoptic Segmentation. Using huge FocalNet and DINO, we achieved 64.3 and 64.4 mAP on COCO minival and test-dev, respectively, establishing new SoTA on top of much larger attention-based models like Swinv2-G and BEIT-3. Code and checkpoints are available at https://github.com/microsoft/FocalNet.
translated by 谷歌翻译
Since the recent success of Vision Transformers (ViTs), explorations toward transformer-style architectures have triggered the resurgence of modern ConvNets. In this work, we explore the representation ability of DNNs through the lens of interaction complexities. We empirically show that interaction complexity is an overlooked but essential indicator for visual recognition. Accordingly, a new family of efficient ConvNets, named MogaNet, is presented to pursue informative context mining in pure ConvNet-based models, with preferable complexity-performance trade-offs. In MogaNet, interactions across multiple complexities are facilitated and contextualized by leveraging two specially designed aggregation blocks in both spatial and channel interaction spaces. Extensive studies are conducted on ImageNet classification, COCO object detection, and ADE20K semantic segmentation tasks. The results demonstrate that our MogaNet establishes new state-of-the-art over other popular methods in mainstream scenarios and all model scales. Typically, the lightweight MogaNet-T achieves 80.0\% top-1 accuracy with only 1.44G FLOPs using a refined training setup on ImageNet-1K, surpassing ParC-Net-S by 1.4\% accuracy but saving 59\% (2.04G) FLOPs.
translated by 谷歌翻译
在本文中,我们为视觉域提出了一个新的神经体系结构块,该区域称为区域和本地混合(MRL),其目的是有效,有效地混合提供的输入特征。我们将输入特征混合任务分叉为区域和本地规模的混合。为了实现有效的混合,我们利用自我注意力提供的域范围内的接收场,用于局部尺度混合的区域尺度混合和卷积内核。更具体地说,我们提出的方法将与定义区域内的本地特征相关联的区域特征,然后是局部规模的特征,由区域特征增强。实验表明,这种自我注意力和卷积的杂交带来了能力提高,概括(右感应偏见)和效率。在类似的网络设置下,MRL的表现优于其分类,对象检测和细分任务的同等。我们还表明,基于MRL的网络体系结构可实现H&E组织学数据集的最新性能。我们在Kumar,ConSEP和CPM-17数据集中获得了0.843、0.855和0.892的骰子,同时通过合并了MRL框架所提供的多功能性,通过合并诸如小组卷积之类的层来改善数据集特异性通用化。
translated by 谷歌翻译
我们提出了全球环境视觉变压器(GC VIT),这是一种新的结构,可增强参数和计算利用率。我们的方法利用了与本地自我注意的联合的全球自我发项模块,以有效但有效地建模长和短距离的空间相互作用,而无需昂贵的操作,例如计算注意力面罩或移动本地窗户。此外,我们通过建议在我们的体系结构中使用修改后的融合倒置残差块来解决VIT中缺乏归纳偏差的问题。我们提出的GC VIT在图像分类,对象检测和语义分割任务中实现了最新的结果。在用于分类的ImagEnet-1k数据集上,基本,小而微小的GC VIT,$ 28 $ M,$ 51 $ M和$ 90 $ M参数实现$ \ textbf {83.2 \%} $,$ \ textbf {83.9 \%} $和$ \ textbf {84.4 \%} $ top-1的精度,超过了相当大的先前艺术,例如基于CNN的Convnext和基于VIT的Swin Transformer,其优势大大。在对象检测,实例分割和使用MS Coco和ADE20K数据集的下游任务中,预训练的GC VIT主机在对象检测,实例分割和语义分割的任务中始终如一地超过事务,有时是通过大余量。可在https://github.com/nvlabs/gcvit上获得代码。
translated by 谷歌翻译
诸如对象检测和分割等密集的计算机视觉任务需要有效的多尺度特征表示,用于检测或分类具有不同大小的对象或区域。虽然卷积神经网络(CNNS)是这种任务的主导架构,但最近引入了视觉变压器(VITS)的目标是将它们替换为骨干。类似于CNN,VITS构建一个简单的多级结构(即,细致粗略),用于使用单尺度补丁进行多尺度表示。在这项工作中,通过从现有变压器的不同角度来看,我们探索了多尺度补丁嵌入和多路径结构,构建了多路径视觉变压器(MPVIT)。 MPVIT通过使用重叠的卷积贴片嵌入,将相同尺寸〜(即,序列长度,序列长度,序列长度的序列长度)嵌入不同尺度的斑块。然后,通过多个路径独立地将不同尺度的令牌独立地馈送到变压器编码器,并且可以聚合产生的特征,使得能够在相同特征级别的精细和粗糙的特征表示。由于多样化,多尺寸特征表示,我们的MPVits从微小〜(5m)缩放到基础〜(73米)一直在想象成分,对象检测,实例分段上的最先进的视觉变压器来实现卓越的性能,和语义细分。这些广泛的结果表明,MPVIT可以作为各种视觉任务的多功能骨干网。代码将在\ url {https://git.io/mpvit}上公开可用。
translated by 谷歌翻译
卷积和自我关注是表示学习的两个强大的技术,通常被认为是两个与彼此不同的对等方法。在本文中,我们表明它们之间存在强烈的潜在关系,从而在这两个范式的大部分计算实际上以相同的操作完成。具体来说,我们首先表明,具有内核大小k x k的传统卷积可以分解为k ^ 2个单独的1x1卷积,然后是换档和求和操作。然后,我们将自我注意模块中的查询,键和值解释为多个1x1卷积,然后计算注意力权重和值的聚合。因此,两个模块的第一阶段包括类似的操作。更重要的是,第一阶段有助于与第二阶段相比的主导计算复杂性(信道大小的正方形)。这种观察结果自然导致这两个看似独特的范例的优雅集成,即享有自我关注和卷积(ACMIX)的益处的混合模型,同时与纯卷积或自我关注对应相比具有最小的计算开销。广泛的实验表明,我们的模型在图像识别和下游任务上持续改进了竞争基础的结果。代码和预先训练的型号将在https://github.com/panxuran/acmix和https://gitee.com/mindspore/models发布。
translated by 谷歌翻译
这项工作介绍了一个简单的视觉变压器设计,作为对象本地化和实例分段任务的强大基线。变压器最近在图像分类任务中展示了竞争性能。为了采用对象检测和密集的预测任务,许多作品从卷积网络和高度定制的Vit架构继承了多级设计。在这种设计背后,目标是在计算成本和多尺度全球背景的有效聚合之间进行更好的权衡。然而,现有的作品采用多级架构设计作为黑匣子解决方案,无清楚地了解其真正的益处。在本文中,我们全面研究了三个架构设计选择对vit - 空间减少,加倍的频道和多尺度特征 - 并证明了vanilla vit架构可以在没有手动的多尺度特征的情况下实现这一目标,保持原始的Vit设计哲学。我们进一步完成了缩放规则,以优化模型的准确性和计算成本/型号大小的权衡。通过在整个编码器块中利用恒定的特征分辨率和隐藏大小,我们提出了一种称为通用视觉变压器(UVIT)的简单而紧凑的VIT架构,可实现COCO对象检测和实例分段任务的强劲性能。
translated by 谷歌翻译
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers.As a result, we propose LeVIT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https: //github.com/facebookresearch/LeViT.
translated by 谷歌翻译