Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths of both architectures, we present CoAtNets (pronounced "coat" nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy; when pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-Huge pre-trained with 300M images from JFT-300M while using 23x less data; notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result.
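To make the first insight concrete, below is a minimal PyTorch-style sketch of self-attention with a learned relative-position bias added to the logits before the softmax, which is the basic form of relative attention the abstract refers to; class and parameter names are illustrative and are not taken from the paper's code.

```python
# A minimal sketch of attention with a learned relative-position bias,
# in the spirit of the "relative attention" described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention2D(nn.Module):
    def __init__(self, dim, height, width):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        # One learned bias per relative offset: a (2H-1) x (2W-1) table.
        self.bias = nn.Parameter(torch.zeros((2 * height - 1) * (2 * width - 1)))
        # Precompute, for every (query, key) pair, which offset bucket it falls in.
        coords = torch.stack(torch.meshgrid(
            torch.arange(height), torch.arange(width), indexing="ij"), dim=-1).reshape(-1, 2)
        rel = coords[:, None, :] - coords[None, :, :]            # (HW, HW, 2)
        rel[..., 0] += height - 1
        rel[..., 1] += width - 1
        index = rel[..., 0] * (2 * width - 1) + rel[..., 1]      # (HW, HW)
        self.register_buffer("index", index)

    def forward(self, x):                                        # x: (B, HW, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = q @ k.transpose(-2, -1) * self.scale              # (B, HW, HW)
        attn = attn + self.bias[self.index]                      # add w_{i-j} before softmax
        return F.softmax(attn, dim=-1) @ v
```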
Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper, we introduce an efficient and scalable attention model that we call multi-axis attention, which consists of two aspects: blocked local attention and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions and, accordingly, propose a simple hierarchical vision backbone, dubbed MaxViT, obtained by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of the model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
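The blocked local and dilated global attention mentioned above amount to two different ways of partitioning a feature map into small groups of tokens. The sketch below shows the two partitions as plain tensor reshapes, assuming square inputs whose side is divisible by the window/grid size; the attention layer itself is omitted and all names are illustrative.

```python
# A rough sketch of the block ("local") and grid ("dilated global") partitions.
import torch

def block_partition(x, p):
    # (B, H, W, C) -> (B * num_windows, p*p, C): attention runs inside each p x p window.
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, p * p, C)

def grid_partition(x, g):
    # (B, H, W, C) -> (B * num_cells, g*g, C): attention runs across a sparse g x g grid,
    # i.e. tokens that share the same position within each cell attend to each other.
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C).permute(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, g * g, C)
```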
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
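For reference, here is a minimal sketch of the kind of "modernized" ConvNet block such an exploration converges on: a large-kernel depthwise convolution followed by an inverted-bottleneck MLP with LayerNorm and GELU, plus a residual connection. Layer scale and stochastic depth are omitted for brevity, and the exact hyperparameters are illustrative.

```python
# A minimal sketch of a modernized ConvNet block (7x7 depthwise conv,
# LayerNorm, 4x inverted-bottleneck MLP, residual connection).
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # normalizes over channels (channels-last)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv expressed as Linear in channels-last
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                        # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (B, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x
```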
Vision Transformers (ViTs) have recently dominated a range of computer vision tasks, yet they suffer from low training data efficiency and inferior local semantic representation capability without appropriate inductive bias. Convolutional neural networks (CNNs) inherently capture region-aware semantics, inspiring researchers to introduce CNNs back into the architecture of ViTs to provide them with desirable inductive biases. However, is the locality achieved by the micro-level CNNs embedded in ViTs good enough? In this paper, we investigate this question by deeply exploring how the macro architecture of hybrid CNNs/ViTs enhances the performance of hierarchical ViTs. In particular, we study the role of the token embedding layer, alias convolutional embedding (CE), and systematically reveal how CE injects desirable inductive bias into ViTs. In addition, we apply the optimal CE configuration to four recently released state-of-the-art ViTs, effectively boosting their performance. Finally, we release a family of efficient hybrid CNNs/ViTs, named CETNet, that can serve as generic vision backbones. Specifically, CETNet achieves 84.9% top-1 accuracy on ImageNet-1K (trained from scratch), 48.6% box mAP on the COCO benchmark, and 51.6% mIoU on ADE20K, substantially improving the performance of the corresponding state-of-the-art baselines.
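As a rough illustration of what a convolutional embedding (CE) layer looks like, the sketch below produces tokens with an overlapping strided convolution instead of a non-overlapping patch projection; the kernel size, stride, and normalization choice are assumptions for illustration, not the paper's exact configuration.

```python
# A hedged sketch of a convolutional token embedding.
import torch.nn as nn

class ConvEmbedding(nn.Module):
    def __init__(self, in_ch=3, dim=96, kernel_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                         # x: (B, 3, H, W)
        x = self.proj(x)                          # (B, dim, H/stride, W/stride)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)     # (B, H*W, dim) token sequence
        return self.norm(tokens), (H, W)
```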
Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions, in contrast to parameter-dependent scaling and content-independent interactions of convolutions. Self-attention models have recently been shown to have encouraging improvements on accuracy-parameter trade-offs compared to baseline convolutional models such as ResNet-50. In this work, we aim to develop self-attention models that can outperform not just the canonical baseline models, but even the high-performing convolutional models. We propose two extensions to self-attention that, in conjunction with a more efficient implementation of self-attention, improve the speed, memory usage, and accuracy of these models. We leverage these improvements to develop a new self-attention model family, HaloNets, which reach state-of-the-art accuracies on the parameter-limited setting of the ImageNet classification benchmark. In preliminary transfer learning experiments, we find that HaloNet models outperform much larger models and have better inference performance. On harder tasks such as object detection and instance segmentation, our simple local self-attention and convolutional hybrids show improvements over very strong baselines. These results mark another step in demonstrating the efficacy of self-attention models on settings traditionally dominated by convolutional models.
Recently, transformer and multi-layer perceptron (MLP) architectures have achieved impressive results on various vision tasks. However, how to effectively combine these operators to form high-performance hybrid visual architectures remains a challenge. In this work, we study the learnable combination of convolution, transformer, and MLP by proposing a novel unified architecture search approach. Our approach contains two key designs to enable the search for high-performance networks. First, we model the very different searchable operators in a unified form, so that the operators can be characterized with the same set of configuration parameters. In this way, the overall search space size is significantly reduced, and the total search cost becomes affordable. Second, we propose context-aware down-sampling modules (DSMs) to mitigate the gap between the different types of operators. The proposed DSMs are able to better adapt features from different types of operators, which is important for identifying high-performance hybrid architectures. Finally, we integrate the configurable operators and DSMs into a unified search space and search with a reinforcement-learning-based algorithm to fully explore the optimal combination of operators. To this end, we search a baseline network and scale it up to obtain a family of models, named UniNet, which achieve much better accuracy and efficiency than previous ConvNets and Transformers. In particular, our UniNet-B5 achieves 84.9% top-1 accuracy on ImageNet, outperforming EfficientNet-B7 and BoTNet-T7 with 44% and 55% fewer FLOPs respectively. By pretraining on ImageNet-21K, our UniNet-B6 achieves 87.4%, outperforming Swin-L with 51% fewer FLOPs and 41% fewer parameters. Code is available at https://github.com/sense-x/uninet.
Recent progress in vision Transformers has shown great success in various tasks, driven by a new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range, and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the recursive gated convolution ($g^n$Conv), which performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various variants of convolution and extends the second-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $g^n$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on this operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Beyond the effectiveness in visual encoders, we also show that $g^n$Conv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that $g^n$Conv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/hornet
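A simplified sketch of a recursively gated convolution in this spirit is given below: the input is split into channel groups of increasing width, passed through a shared depthwise convolution, and recursively combined by element-wise gating so that each step raises the interaction order. This follows the pattern described in the abstract rather than the exact released implementation, and all names are illustrative.

```python
# A simplified sketch of a recursively gated convolution (order-n spatial interactions).
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    def __init__(self, dim, order=3):
        super().__init__()
        self.order = order
        self.dims = [dim // 2 ** (order - 1 - i) for i in range(order)]   # e.g. [C/4, C/2, C]
        self.proj_in = nn.Conv2d(dim, 2 * dim, 1)
        self.dwconv = nn.Conv2d(sum(self.dims), sum(self.dims), 7,
                                padding=3, groups=sum(self.dims))
        self.pws = nn.ModuleList(
            nn.Conv2d(self.dims[i], self.dims[i + 1], 1) for i in range(order - 1))
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                                    # x: (B, C, H, W)
        p, q = torch.split(self.proj_in(x), [self.dims[0], sum(self.dims)], dim=1)
        q = torch.split(self.dwconv(q), self.dims, dim=1)    # per-order gating features
        y = p * q[0]                                         # first-order interaction
        for i in range(self.order - 1):                      # recursively raise the order
            y = self.pws[i](y) * q[i + 1]
        return self.proj_out(y)
```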
Convolutional neural networks (CNNs) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer- and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, have started to lead new trends, as they show promising results on the ImageNet classification task. In this paper, we conduct an empirical study of these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH that adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose a hybrid model using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPs. It is already on par with SOTA models with sophisticated designs. Code and models are publicly available at https://github.com/microsoft/spach.
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing ConvNets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https://github.com/facebookresearch/LeViT.
Vision Transformers have shown great promise recently for many vision tasks due to their insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a query-irrelevant behavior at deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (and are also head-irrelevant). Second, the attention maps are intrinsically sparse: only a few tokens dominate the attention weights; introducing the knowledge from ConvNets would largely smooth the attention and enhance the performance. Motivated by the above observations, we generalize the self-attention formulation to abstract a query-irrelevant global context directly and further integrate the global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), purely consists of convolutional layers and firmly inherits the merits of both the attention mechanism and convolutions, including dynamic property, weight sharing, and short- and long-range feature modeling. Experimental results demonstrate the effectiveness of FCViT. With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than the previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks, like object detection, instance segmentation, and semantic segmentation. Codes and models are made available at: https://github.com/ma-xu/FCViT.
Vision Transformers have been successfully applied to image recognition tasks due to their capability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between Transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that can outperform not only the canonical Transformers, but also high-performance convolutional models. We propose a new Transformer-based hybrid network by taking advantage of Transformers to capture long-range dependencies and of CNNs to model local features. Furthermore, we scale it to obtain a family of models, called CMT, with much better accuracy and efficiency than previous convolution- and Transformer-based models. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller in FLOPs than the existing DeiT and EfficientNet, respectively. The proposed CMT-S also generalizes well on CIFAR10 (99.2%), CIFAR100 (91.7%), Flowers (98.7%), and other challenging vision datasets such as COCO (44.3% mAP), with considerably lower computational cost.
In this paper, we propose a new neural architectural block for the vision domain, named Mixing Regionally and Locally (MRL), developed with the aim of effectively and efficiently mixing the provided input features. We bifurcate the input feature-mixing task into mixing at the regional and local scales. To achieve efficient mixing, we exploit the domain-wide receptive field provided by self-attention for regional-scale mixing and convolutional kernels for local-scale mixing. More specifically, our proposed approach first mixes regional features associated with the local features within a defined region, after which the local-scale features are augmented with these regional features. Experiments show that this hybridization of self-attention and convolution brings improved capacity, generalization (the right inductive bias), and efficiency. Under similar network settings, MRL outperforms its counterparts on classification, object detection, and segmentation tasks. We also show that our MRL-based network architecture achieves state-of-the-art performance on H&E histology datasets. We achieve Dice scores of 0.843, 0.855, and 0.892 on the Kumar, CoNSeP, and CPM-17 datasets respectively, while improving dataset-specific generalization by incorporating layers such as group convolutions, highlighting the versatility offered by the MRL framework.
We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pretrained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.
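A hedged sketch of the convolutional projection idea follows: tokens are reshaped back into a 2D map and queries, keys, and values are produced with depthwise-separable convolutions (optionally strided for keys/values to shorten the sequence). Kernel sizes, strides, and normalization choices here are assumptions for illustration, not the exact released configuration.

```python
# A hedged sketch of producing Q/K/V with depthwise-separable convolutions on a 2D token map.
import torch
import torch.nn as nn

def conv_proj(dim, stride=1):
    return nn.Sequential(
        nn.Conv2d(dim, dim, 3, stride=stride, padding=1, groups=dim),  # depthwise
        nn.BatchNorm2d(dim),
        nn.Conv2d(dim, dim, 1),                                        # pointwise
    )

class ConvProjectionQKV(nn.Module):
    def __init__(self, dim, kv_stride=2):
        super().__init__()
        self.q = conv_proj(dim, stride=1)
        self.k = conv_proj(dim, stride=kv_stride)   # strided K/V shorten the sequence
        self.v = conv_proj(dim, stride=kv_stride)

    def forward(self, tokens, hw):                     # tokens: (B, H*W, C)
        H, W = hw
        x = tokens.transpose(1, 2).reshape(-1, tokens.shape[-1], H, W)
        flat = lambda t: t.flatten(2).transpose(1, 2)  # back to (B, N, C)
        return flat(self.q(x)), flat(self.k(x)), flat(self.v(x))
```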
Vision Transformers (ViT) serve as powerful vision models. Unlike convolutional neural networks, which dominated vision research in previous years, vision transformers enjoy the ability to capture long-range dependencies in the data. Nonetheless, an integral part of any transformer architecture, the self-attention mechanism, suffers from high latency and inefficient memory utilization, making it less suitable for high-resolution input images. To alleviate these shortcomings, hierarchical vision models employ self-attention locally on non-interleaving windows. This relaxation reduces the complexity to be linear in the input size; however, it limits cross-window interaction, hurting model performance. In this paper, we propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner, much like convolutions. The key idea behind QnA is to introduce learned queries, which allow fast and efficient implementation. We verify the effectiveness of our layer by incorporating it into a hierarchical vision transformer model. We show improvements in speed and memory complexity while achieving accuracy comparable to state-of-the-art models. Finally, our layer scales especially well with window size, requiring up to 10x less memory than existing methods while being faster.
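The core idea of learned queries can be sketched compactly: the query is a trainable parameter shared across positions, while keys and values come from overlapping local windows, so the layer aggregates its input much like a convolution. The single-head, single-query version below is only a rough approximation of the QnA layer, with illustrative names.

```python
# A very rough sketch of a learned query attending over overlapping local windows.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedQueryLocalAttention(nn.Module):
    def __init__(self, dim, window=3):
        super().__init__()
        self.window = window
        self.query = nn.Parameter(torch.randn(dim))   # learned, shared across positions
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                              # x: (B, C, H, W)
        B, C, H, W = x.shape
        k = self.window
        # Gather the overlapping k x k neighborhood around every position.
        patches = F.unfold(x, k, padding=k // 2)       # (B, C*k*k, H*W)
        patches = patches.view(B, C, k * k, H * W).permute(0, 3, 2, 1)   # (B, HW, k*k, C)
        keys = self.key(patches)
        vals = self.value(patches)
        attn = F.softmax(keys @ self.query * self.scale, dim=-1)         # (B, HW, k*k)
        out = (attn.unsqueeze(-2) @ vals).squeeze(-2)  # (B, HW, C)
        return out.transpose(1, 2).view(B, C, H, W)
```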
In the pursuit of ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient general-purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of CNN and Transformer models and propose a new efficient hybrid architecture, EdgeNeXt. In particular, in EdgeNeXt we introduce a split depth-wise transposed attention (SDTA) encoder that splits the input tensor into multiple channel groups and utilizes depth-wise convolutions along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection, and segmentation tasks reveal the merits of the proposed approach, outperforming state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% and a 28% reduction in FLOPs. Furthermore, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K. Code and models are publicly available at https://t.ly/_vu9.
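The transposed-attention part of the SDTA encoder can be illustrated with a channel-wise attention sketch, where the attention map is C x C rather than N x N so the cost grows linearly with the number of tokens; the channel-group splitting and depth-wise convolutions are omitted here, and all names are illustrative rather than taken from the released code.

```python
# A small sketch of "transposed" self-attention across the channel dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransposedAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x):                            # x: (B, N, C) token sequence
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = F.normalize(q.transpose(1, 2), dim=-1)   # (B, C, N), per-channel normalization
        k = F.normalize(k.transpose(1, 2), dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.temperature, dim=-1)  # (B, C, C)
        out = attn @ v.transpose(1, 2)               # (B, C, N)
        return self.proj(out.transpose(1, 2))        # (B, N, C)
```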
Due to their complex attention mechanisms and model designs, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g., TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs while performing as powerfully? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, but the overall performance of these works is far from satisfactory. To this end, we propose a next-generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are developed to capture local and global information, respectively, with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs, and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT improves COCO detection by 5.4 mAP (from 40.4 to 45.8) and ADE20K segmentation by 8.2% mIoU (from 38.8% to 47.0%) under similar latency. Meanwhile, it achieves performance comparable to CSWin, while the inference speed is accelerated by 3.6x. On CoreML, Next-ViT improves COCO detection by 4.6 mAP (from 42.6 to 47.2) and ADE20K segmentation by 3.5% mIoU (from 45.2% to 48.7%) under similar latency. Code will be released soon.
Since the recent success of Vision Transformers (ViTs), explorations toward transformer-style architectures have triggered the resurgence of modern ConvNets. In this work, we explore the representation ability of DNNs through the lens of interaction complexities. We empirically show that interaction complexity is an overlooked but essential indicator for visual recognition. Accordingly, a new family of efficient ConvNets, named MogaNet, is presented to pursue informative context mining in pure ConvNet-based models, with preferable complexity-performance trade-offs. In MogaNet, interactions across multiple complexities are facilitated and contextualized by leveraging two specially designed aggregation blocks in both spatial and channel interaction spaces. Extensive studies are conducted on ImageNet classification, COCO object detection, and ADE20K semantic segmentation tasks. The results demonstrate that our MogaNet establishes new state-of-the-art over other popular methods in mainstream scenarios and all model scales. Typically, the lightweight MogaNet-T achieves 80.0% top-1 accuracy with only 1.44G FLOPs using a refined training setup on ImageNet-1K, surpassing ParC-Net-S by 1.4% accuracy but saving 59% (2.04G) FLOPs.
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10× more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast.
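One way to realize such channel-resolution scale stages is pooling attention, sketched below: queries, keys, and values are spatially pooled (possibly with different strides) before the attention computation, so the sequence length and the output resolution shrink between stages. The pooling operators and strides here are assumptions for illustration, not the paper's exact configuration.

```python
# A condensed sketch of pooling attention: pooled Q reduces output resolution,
# pooled K/V reduce the attention cost.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingAttention(nn.Module):
    def __init__(self, dim, q_stride=2, kv_stride=4):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_pool = nn.MaxPool2d(q_stride)    # reduces output resolution
        self.kv_pool = nn.MaxPool2d(kv_stride)  # reduces attention cost
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        q, k, v = self.qkv(x.flatten(2).transpose(1, 2)).chunk(3, dim=-1)
        to_map = lambda t: t.transpose(1, 2).reshape(B, C, H, W)
        to_seq = lambda t: t.flatten(2).transpose(1, 2)
        q = to_seq(self.q_pool(to_map(q)))      # (B, H*W / q_stride^2, C)
        k = to_seq(self.kv_pool(to_map(k)))
        v = to_seq(self.kv_pool(to_map(v)))
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                         # tokens at the pooled output resolution
```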
The past year has witnessed the rapid development of applying Transformer modules to vision problems. While some researchers have demonstrated that Transformer-based models enjoy a favorable ability to fit data, there is still growing evidence showing that these models suffer from over-fitting, especially when the training data is limited. This paper offers an empirical study by performing step-by-step operations to gradually transition a Transformer-based model to a convolution-based model. The results we obtain during the transition process deliver useful messages for improving visual recognition. Based on these observations, we propose a new architecture named Visformer, abbreviated from "Vision-friendly Transformer". With the same computational complexity, Visformer outperforms both Transformer-based and convolution-based models in terms of ImageNet classification accuracy, and the advantage becomes more significant when the model complexity is lower or the training set is smaller. Code is available at https://github.com/danczs/visformer.
The multi-layer perceptron (MLP), as the first neural network structure to appear, was a big hit. But constrained by hardware computing power and the size of datasets, it once sank for decades. During this period, we have witnessed a paradigm shift from manual feature extraction to CNNs with local receptive fields, and further to Transformers with global receptive fields based on the self-attention mechanism. This year (2021), with the introduction of MLP-Mixer, the MLP has re-entered the limelight and has attracted extensive research from the computer vision community. Compared with the conventional MLP, it becomes deeper, but changes the input from fully flattened to patch-flattened. Given its high performance and low demand for vision-specific inductive bias, the community cannot help but wonder: will the MLP, the simplest structure with a global receptive field but no attention, become a new computer vision paradigm? To answer this question, this survey aims to provide a comprehensive overview of the recent development of vision deep MLP models. Specifically, we review these vision deep MLPs in detail, from subtle sub-module design to global network structure. We compare the receptive field, computational complexity, and other properties of different network designs in order to provide a clear understanding of the development path of MLPs. The survey shows that the resolution sensitivity and computational density of MLPs remain unresolved, and pure MLPs gradually evolve toward being CNN-like. We suggest that the current data volume and computational power are not yet ready to embrace pure MLPs, and that artificial visual guidance remains important. Finally, we provide an analysis of open research directions and possible future works. We hope this effort will ignite further interest in the community and encourage better vision-tailored designs for neural networks at present.