Inspired by biological evolution, we explain the rationality of vision transformers by analogy with the proven, practical evolutionary algorithm (EA), and derive that both of them share a consistent mathematical representation. Analogous to the dynamic local population in EA, we improve the existing transformer structure and propose the more efficient EAT model, and design task-related heads to handle different tasks more flexibly. Moreover, we introduce the space-filling curve into the current vision transformer to sequence image data into a uniform sequential format. Thus, we can design a unified EAT framework to deal with multi-modal tasks, separating the network architecture from the data format adaptively. Compared with recent vision transformer works, our approach achieves state-of-the-art results on the ImageNet classification task with fewer parameters and higher throughput. We further conduct multi-modal tasks to demonstrate the superiority of the unified EAT, e.g., on text-based image retrieval our approach improves the baseline by +3.7 points on the CSS dataset.
Motivated by biological evolution, this paper explains the rationality of the Vision Transformer by analogy with the proven, practical evolutionary algorithm (EA) and derives that both of them have consistent mathematical formulations. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that only contains the proposed \emph{EA-based Transformer} (EAT) block, which consists of three residual parts, \ie, \emph{Multi-Scale Region Aggregation} (MSRA), \emph{Global and Local Interaction} (GLI), and \emph{Feed-Forward Network} (FFN) modules, to model multi-scale, interactive, and individual information, respectively. Moreover, we design a \emph{Task-Related Head} (TRH) docked with the transformer backbone to complete final information fusion more flexibly, and improve a \emph{Modulated Deformable MSA} (MD-MSA) to model positions dynamically. Extensive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach over state-of-the-art (SoTA) methods. \Eg, our Mobile (1.8M), Tiny (6.1M), Small (24.3M), and Base (49.0M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; EATFormer-Tiny/Small/Base armed Mask R-CNN obtain 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP, respectively, with fewer FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU with UperNet, exceeding Swin-T/S by 2.8/1.7. Code will be available at https://github.com/zhangzjn/eatformer.
The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer to combine image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity and these tokens are then fused purely by attention multiple times to complement each other. Furthermore, to reduce computation, we develop a simple yet effective token fusion module based on cross attention, which uses a single token for each branch as a query to exchange information with other branches. Our proposed cross-attention only requires linear time for both computational and memory complexity instead of quadratic time otherwise. Extensive experiments demonstrate that our approach performs better than or on par with several concurrent works on vision transformer, in addition to efficient CNN models. For example, on the ImageNet1K dataset, with some architectural changes, our approach outperforms the recent DeiT by a large margin of 2% with a small to moderate increase in FLOPs and model parameters. Our source codes and models are available at https://github.com/IBM/CrossViT.
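The cross-attention token fusion described above can be pictured with a short sketch. The following is a minimal PyTorch illustration, not the released CrossViT code: the CLS token of one branch queries the patch tokens of the other branch, and equal embedding dimensions plus the residual update are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch: the CLS token of one branch attends to the
    patch tokens of the other branch, so the exchange costs only linear
    time in the number of patch tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cls_a, tokens_b):
        # cls_a: (B, 1, dim) query from branch A
        # tokens_b: (B, N, dim) keys/values from branch B
        fused, _ = self.attn(cls_a, tokens_b, tokens_b)
        return cls_a + fused  # residual update of the CLS token

# toy usage
cls_small = torch.randn(2, 1, 192)
patches_large = torch.randn(2, 49, 192)
print(CrossAttentionFusion(192)(cls_small, patches_large).shape)  # (2, 1, 192)
```

Because only a single query token is involved per branch, the cost of the exchange grows linearly with the number of patch tokens in the other branch.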
This paper focuses on designing efficient models with low parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, trading-off model accuracy and constrained resources still need further improvements. This work rethinks the essential unity of efficient Inverted Residual Block in MobileNetv2 and effective Transformer in ViT, inductively abstracting a general concept of Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance though sharing the same framework. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Massive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 that surpass \textbf{SoTA} CNN-/Transformer-based models, while trading-off the model accuracy and efficiency well.
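As a rough, hedged illustration of how a single block might absorb both behaviors, the sketch below pairs a depthwise convolution (short-distance modeling) with multi-head self-attention (long-distance modeling) inside an inverted-residual shape; the layer ordering and expansion choices are assumptions rather than the paper's exact iRMB.

```python
import torch
import torch.nn as nn

class IRMBSketch(nn.Module):
    """Rough sketch of an inverted-residual mobile block that pairs a
    depthwise convolution (local mixing) with multi-head self-attention
    (global mixing); layer choices are assumptions, not the exact iRMB."""
    def __init__(self, dim, expand=4, num_heads=4):
        super().__init__()
        hidden = dim * expand
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        y = self.act(self.expand(x))
        y = self.act(self.dwconv(y))           # local, CNN-like mixing
        t = y.flatten(2).transpose(1, 2)       # (B, H*W, hidden)
        t, _ = self.attn(t, t, t)              # global, Transformer-like mixing
        y = y + t.transpose(1, 2).reshape(b, -1, h, w)
        return x + self.project(y)             # inverted residual

print(IRMBSketch(32)(torch.randn(1, 32, 14, 14)).shape)  # (1, 32, 14, 14)
```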
We present a neat yet effective recursive operation on vision transformers that improves parameter utilization without involving additional parameters. This is achieved by sharing weights across the depth of the transformer network. The proposed method can obtain a substantial gain (~2%) simply using a naive recursive operation, requires no special or sophisticated knowledge of network design principles, and introduces minimal computational overhead to the training procedure. To reduce the extra computation brought by the recursive operation while maintaining superior accuracy, we introduce an approximation method through multiple sliced group self-attentions across recursive layers, which can reduce the cost by 10~30% with minimal performance loss. We call our model the Sliced Recursive Transformer (SReT), which is compatible with a broad range of other designs for efficient vision transformers. Our best model establishes significant improvements on ImageNet over state-of-the-art methods while containing fewer parameters. The proposed sliced recursive operation allows us to build transformers with more than 100 or even 1000 layers while still keeping a small size (13~15M), avoiding the optimization difficulties that arise when the model size is too large. This flexible scalability shows great potential for scaling up and constructing extremely deep and large-dimensionality vision transformers. Our code and models are available at https://github.com/szq0214/sret.
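The core recursive operation is easy to picture. Below is a minimal sketch, assuming a standard encoder layer, in which one set of weights is looped several times along the depth; the sliced group self-attention approximation is omitted.

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Minimal sketch of the recursive idea: one transformer encoder layer is
    reused several times along the depth, so effective depth grows without
    adding parameters. (The slicing into attention groups is omitted here.)"""
    def __init__(self, dim=192, num_heads=3, recursions=3):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                                batch_first=True)
        self.recursions = recursions

    def forward(self, x):
        for _ in range(self.recursions):   # shared weights across loops
            x = self.layer(x)
        return x

tokens = torch.randn(2, 197, 192)
print(RecursiveBlock()(tokens).shape)      # (2, 197, 192)
```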
Vision Transformers (ViTs) have recently received explosive popularity, but their huge computational cost remains a severe issue. Since the computational complexity of ViT is quadratic with respect to the input sequence length, the mainstream paradigm for computation reduction is to reduce the number of tokens. Existing designs include structured spatial compression, which uses a progressively shrinking pyramid to reduce the computation on large feature maps, and unstructured token pruning, which dynamically drops redundant tokens. However, the limitations of existing token pruning are two-fold: 1) the incomplete spatial structure caused by pruning is incompatible with the structured spatial compression commonly used in modern deep-narrow transformers; 2) it usually requires a time-consuming pre-training procedure. To tackle these limitations and expand the applicable scenarios of token pruning, we present Evo-ViT, a self-motivated slow-fast token evolution approach for vision transformers. Specifically, we carry out unstructured, instance-wise token selection by taking advantage of the simple and effective global class attention that is native to vision transformers. Then we propose to update the selected informative tokens and the uninformative tokens with different computation paths, namely slow-fast updating. Since the slow-fast updating mechanism maintains the spatial structure and information flow, Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that our method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification.
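A hedged sketch of the slow-fast idea follows (not the paper's exact update rule): tokens with the highest class-attention scores take the full "slow" path, while the rest receive only a cheap summary update, so the token count and spatial structure are preserved.

```python
import torch
import torch.nn as nn

def slow_fast_update(tokens, cls_attn, block, keep_ratio=0.5):
    """Illustrative sketch: top-scoring tokens go through a full transformer
    block (slow path); the remaining tokens are only nudged by a cheap global
    summary (fast path). Shapes and the summary rule are assumptions."""
    b, n, d = tokens.shape
    k = int(n * keep_ratio)
    idx = cls_attn.topk(k, dim=1).indices                      # (B, k)
    gather = idx.unsqueeze(-1).expand(-1, -1, d)
    informative = tokens.gather(1, gather)                     # slow-path input
    updated = block(informative)                               # full computation
    summary = updated.mean(dim=1, keepdim=True)                # fast-path update
    out = tokens + summary                                     # cheap broadcast to all tokens
    return out.scatter(1, gather, updated)                     # write back slow tokens

block = nn.TransformerEncoderLayer(64, 4, 128, batch_first=True)
tokens = torch.randn(2, 16, 64)
scores = torch.rand(2, 16)
print(slow_fast_update(tokens, scores, block).shape)           # (2, 16, 64)
```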
Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4×4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost.
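The word/sentence decomposition can be sketched as two nested encoder layers; the projection that folds word features back into their sentence token is an assumed simplification of the actual TNT block.

```python
import torch
import torch.nn as nn

class TNTBlockSketch(nn.Module):
    """Toy sketch of the inner/outer idea: an inner layer mixes the 4x4 "word"
    tokens inside each 16x16 "sentence" patch, then the word features are
    folded back into the sentence token before the outer layer mixes sentences.
    Projections and dimensions are simplifying assumptions."""
    def __init__(self, word_dim=24, sent_dim=192, words_per_sent=16):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(word_dim, 4, word_dim * 4,
                                                batch_first=True)
        self.outer = nn.TransformerEncoderLayer(sent_dim, 3, sent_dim * 4,
                                                batch_first=True)
        self.word2sent = nn.Linear(word_dim * words_per_sent, sent_dim)

    def forward(self, words, sents):
        # words: (B * num_sents, words_per_sent, word_dim)
        # sents: (B, num_sents, sent_dim)
        b, num_sents, _ = sents.shape
        words = self.inner(words)                                   # word-level attention
        folded = self.word2sent(words.reshape(b, num_sents, -1))
        sents = self.outer(sents + folded)                          # sentence-level attention
        return words, sents

words = torch.randn(2 * 196, 16, 24)
sents = torch.randn(2, 196, 192)
w, s = TNTBlockSketch()(words, sents)
print(w.shape, s.shape)                                             # (392, 16, 24) (2, 196, 192)
```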
Transformers have been recently adapted for large scale image classification, achieving high scores shaking up the long supremacy of convolutional neural networks. However the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization of such dedicated transformers. We make two transformers architecture changes that significantly improve the accuracy of deep transformers. This leads us to produce models whose performance does not saturate early with more depth, for instance we obtain 86.5% top-1 accuracy on Imagenet when training with no external data, we thus attain the current SOTA with less FLOPs and parameters. Moreover, our best model establishes the new state of the art on Imagenet with Reassessed labels and Imagenet-V2 / match frequency, in the setting with no additional training data. We share our code and models.
The Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently adapted Transformer-like architectures to the computer vision (CV) field and have demonstrated their effectiveness on various CV tasks. Relying on their competitive modeling capability compared with modern convolutional neural networks, in this paper we conduct a comprehensive review of three hundred different visual transformers for three fundamental CV tasks (classification, detection, and segmentation), and propose a taxonomy that organizes these methods according to their motivations, structures, and usage scenarios. Because of the differences in training settings and oriented tasks, we also evaluate these methods on different configurations for easy and intuitive comparison rather than only on various benchmarks. Furthermore, we reveal a series of essential but unexploited aspects that may empower transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between the visual and sequential transformers. Finally, three promising future research directions are suggested for further investigation.
Transformers have recently attracted significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper, we introduce an efficient and scalable attention model that we call multi-axis attention, which consists of two aspects: blocked local attention and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions and, accordingly, propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in the earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of the MaxViT block as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
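The two attention axes amount to two different ways of grouping tokens before ordinary self-attention. The toy functions below are a sketch with an assumed 7x7 window/grid: the non-overlapping "block" grouping gives local attention, and the dilated "grid" grouping gives image-wide mixing at linear cost.

```python
import torch

def block_partition(x, window):
    """Local 'block' grouping: non-overlapping windows attend internally."""
    b, h, w, c = x.shape
    x = x.reshape(b, h // window, window, w // window, window, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, c)

def grid_partition(x, grid):
    """Dilated 'grid' grouping: a sparse, image-wide grid of tokens attends
    internally, giving global mixing while each group stays small."""
    b, h, w, c = x.shape
    x = x.reshape(b, grid, h // grid, grid, w // grid, c)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, grid * grid, c)

# toy check of the two token groupings
feat = torch.randn(2, 28, 28, 64)
print(block_partition(feat, 7).shape)  # (2*16, 49, 64)
print(grid_partition(feat, 7).shape)   # (2*16, 49, 64)
```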
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that can outperform not only the canonical transformers but also high-performance convolutional models. We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features. Furthermore, we scale it to obtain a family of models, called CMT, with much better accuracy and efficiency than previous convolution- and transformer-based models. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller in FLOPs than the existing DeiT and EfficientNet, respectively. The proposed CMT-S also generalizes well on CIFAR10 (99.2%), CIFAR100 (91.7%), Flowers (98.7%), and other challenging vision datasets such as COCO (44.3% mAP), with considerably less computational cost.
With the growing popularity of transformer architectures in computer vision, the research focus has shifted towards developing computationally efficient designs. Window-based local attention is one of the major techniques adopted in recent works. These methods start with a very small patch size and a small embedding dimension and then perform strided convolution (patch merging) to reduce the feature map size and increase the embedding dimension, thereby forming a pyramidal, convolutional neural network (CNN)-like design. In this work, we investigate local and global information modeling in transformers by presenting a new isotropic architecture that adopts local windows and special tokens, called super tokens, for self-attention. Specifically, a single super token is assigned to each image window, capturing the rich local details of that window. These tokens are then employed for cross-window communication and global representation learning. Hence, most of the learning in the higher layers is independent of the image patches $(N)$ and is based solely on the super tokens $(N/M^2)$, where $M^2$ is the window size. On standard ImageNet-1K image classification, the proposed super-token-based transformer (STT-S25) achieves 83.5% accuracy, which is on par with the Swin transformer (Swin-B) while using about half the parameters (49M) and providing double the inference throughput. The proposed super token transformer offers a lightweight and promising backbone for visual recognition tasks.
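A simplified sketch of the mechanism, with an assumed update rule: each window's super token first attends to the patches of its own window, and the super tokens then attend to one another for cross-window communication.

```python
import torch
import torch.nn as nn

class SuperTokenMixing(nn.Module):
    """Simplified sketch: one super token per window summarizes local detail,
    then super tokens attend to each other for global mixing. The update rule
    is an assumption, not the paper's exact formulation."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, windows, super_tokens):
        # windows: (B * num_windows, M*M, dim); super_tokens: (B, num_windows, dim)
        b, n_win, d = super_tokens.shape
        q = super_tokens.reshape(b * n_win, 1, d)
        local, _ = self.local_attn(q, windows, windows)       # summarize each window
        st = (q + local).reshape(b, n_win, d)
        st, _ = self.global_attn(st, st, st)                  # cross-window communication
        return st

wins = torch.randn(2 * 16, 49, 96)
sup = torch.randn(2, 16, 96)
print(SuperTokenMixing(96)(wins, sup).shape)                  # (2, 16, 96)
```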
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
In this paper, we present a new approach to model acceleration by exploiting spatial sparsity in visual data. We observe that the final prediction in vision transformers is based only on a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input, thereby accelerating vision transformers. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. While the framework is inspired by our observation of sparse attention in vision transformers, we find that the idea of adaptive and asymmetric computation can be a general solution for accelerating various architectures. We extend our method to hierarchical models, including CNNs and hierarchical vision transformers, as well as to more complex dense prediction tasks that require structured feature maps, by formulating a more generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations. By applying lightweight fast paths to less informative features and more expressive slow paths to more important locations, we can maintain the structure of the feature maps while significantly reducing the overall computation. Extensive experiments demonstrate the effectiveness of our framework on various modern architectures and different visual recognition tasks. Our results clearly show that dynamic spatial sparsification offers a new and more effective dimension for model acceleration. Code is available at https://github.com/raoyongming/dynamicvit
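A minimal sketch of such a lightweight prediction module follows, assuming a small MLP scorer and replacing any differentiable training-time sampling with hard top-k selection at inference.

```python
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    """Sketch of a lightweight head that scores each token's importance from
    the current features; the two-layer MLP is an assumed stand-in, and hard
    top-k selection stands in for a differentiable training-time scheme."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim // 4),
                                 nn.GELU(), nn.Linear(dim // 4, 1))

    def forward(self, tokens, keep_ratio=0.7):
        scores = self.mlp(tokens).squeeze(-1)              # (B, N) importance scores
        k = int(tokens.shape[1] * keep_ratio)
        idx = scores.topk(k, dim=1).indices
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return tokens.gather(1, idx)                       # pruned token set

tokens = torch.randn(2, 196, 384)
print(TokenScorer(384)(tokens).shape)                      # (2, 137, 384)
```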
We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pretrained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.
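The convolutional projection can be sketched as follows: tokens are reshaped back to a 2D map and passed through a depthwise convolution before the usual linear Q/K/V mapping. This is a hedged illustration; the exact kernel size, stride, and normalization in CvT may differ.

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Sketch of a convolutional projection: tokens become a 2D map, pass
    through a depthwise convolution to inject local spatial context, then
    go through a linear Q/K/V mapping (details differ from the exact block)."""
    def __init__(self, dim, kernel=3):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.qkv = nn.Linear(dim, dim * 3)

    def forward(self, x, h, w):            # x: (B, H*W, dim)
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        feat = self.dw(feat).flatten(2).transpose(1, 2)
        q, k, v = self.qkv(feat).chunk(3, dim=-1)
        return q, k, v

x = torch.randn(2, 14 * 14, 64)
q, k, v = ConvProjection(64)(x, 14, 14)
print(q.shape)                             # (2, 196, 64)
```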
Attention-based neural networks, such as Transformers, have become ubiquitous in numerous applications, including computer vision, natural language processing, and time-series analysis. In all kinds of attention networks, the attention maps are crucial as they encode semantic dependencies between input tokens. However, most existing attention networks perform modeling or reasoning based on representations, wherein the attention maps of different layers are learned separately without explicit interactions. In this paper, we propose a novel and generic evolving attention mechanism, which directly models the evolution of inter-token relationships through a chain of residual convolutional modules. The major motivations are twofold. On the one hand, the attention maps in different layers share transferable knowledge, thus adding a residual connection can facilitate the information flow of inter-token relationships across layers. On the other hand, there is naturally an evolutionary trend among attention maps at different abstraction levels, so it is beneficial to exploit a dedicated convolution-based module to capture this process. Equipped with the proposed mechanism, the convolution-enhanced evolving attention networks achieve superior performance in various applications, including time-series representation, natural language understanding, machine translation, and image classification. Especially on time-series representation tasks, Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly, achieving an average of 17% improvement compared to the best SOTA. To the best of our knowledge, this is the first work that explicitly models the layer-wise evolution of attention maps. Our implementation is available at https://github.com/pkuyym/EvolvingAttention
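A hedged sketch of the evolving-attention mechanism: the previous layer's attention map is mixed into the current layer's attention logits through a residual convolution over the (heads, N, N) map. The mixing weights and kernel size here are assumptions.

```python
import torch
import torch.nn as nn

class EvolvingAttentionSketch(nn.Module):
    """Minimal sketch: the current layer's raw attention logits are combined
    with the previous layer's attention map via a residual 2D convolution over
    the (heads, N, N) map before softmax. Hyper-parameters are assumptions."""
    def __init__(self, dim, num_heads=4, alpha=0.5):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qk = nn.Linear(dim, dim * 2)
        self.conv = nn.Conv2d(num_heads, num_heads, kernel_size=3, padding=1)
        self.alpha = alpha

    def forward(self, x, prev_attn=None):
        b, n, _ = x.shape
        q, k = self.qk(x).chunk(2, dim=-1)
        q = q.reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (B, heads, N, N)
        if prev_attn is not None:                                  # residual evolution
            logits = self.alpha * logits + (1 - self.alpha) * self.conv(prev_attn)
        return logits.softmax(dim=-1)

x = torch.randn(2, 50, 64)
layer = EvolvingAttentionSketch(64)
a1 = layer(x)
a2 = layer(x, prev_attn=a1)
print(a2.shape)                                                    # (2, 4, 50, 50)
```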
Vision transformers (ViTs) serve as powerful vision models. Unlike the convolutional neural networks that dominated vision research in previous years, vision transformers enjoy the ability to capture long-range dependencies in the data. Nonetheless, an integral part of any transformer architecture, the self-attention mechanism, suffers from high latency and inefficient memory utilization, making it less suitable for high-resolution input images. To alleviate these shortcomings, hierarchical vision models employ self-attention locally on non-overlapping windows. This relaxation reduces the complexity to linear in the input size; however, it limits cross-window interaction, hurting model performance. In this paper, we propose a new shift-invariant local attention layer, called query and attend (QnA), which aggregates the input locally in an overlapping manner, much like convolutions. The key idea behind QnA is to introduce learned queries, which allow a fast and efficient implementation. We verify the effectiveness of our layer by incorporating it into a hierarchical vision transformer model. We show improvements in speed and memory complexity while achieving accuracy comparable to state-of-the-art models. Finally, our layer scales especially well with window size, requiring up to x10 less memory while being faster than existing methods.
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeVIT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https://github.com/facebookresearch/LeViT.
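The attention bias can be illustrated with a small module that adds a learned per-head scalar, indexed by relative position, to the attention logits in place of explicit positional embeddings; the indexing scheme below is a simplified assumption.

```python
import torch
import torch.nn as nn

class AttentionBias(nn.Module):
    """Sketch of an attention bias: a learned per-head scalar for every pair of
    grid positions is added to the attention logits, replacing explicit
    positional embeddings (indexing scheme is a simplified assumption)."""
    def __init__(self, num_heads, resolution):
        super().__init__()
        coords = torch.stack(torch.meshgrid(
            torch.arange(resolution), torch.arange(resolution), indexing="ij"
        )).flatten(1)                                        # (2, N) grid coordinates
        rel = (coords[:, :, None] - coords[:, None, :]).abs()
        idx = rel[0] * resolution + rel[1]                   # (N, N) bias indices
        self.register_buffer("idx", idx)
        self.bias = nn.Parameter(torch.zeros(num_heads, resolution * resolution))

    def forward(self, attn_logits):                          # (B, heads, N, N)
        return attn_logits + self.bias[:, self.idx]

bias = AttentionBias(num_heads=4, resolution=7)
logits = torch.randn(2, 4, 49, 49)
print(bias(logits).shape)                                    # (2, 4, 49, 49)
```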
In this work, we explore a neat yet effective transformer-based framework for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models prone to overfitting specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences by transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance. However, the core fusion transformer in TransVG is stand-alone against the uni-modal encoders and therefore has to be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++ to make two-fold improvements. On the one hand, we upgrade the framework to a purely transformer-based one by leveraging the Vision Transformer (ViT) for visual feature encoding. On the other hand, we devise a language-conditioned vision transformer that removes the external fusion modules and reuses the uni-modal ViT for vision-language fusion at intermediate layers. We conduct extensive experiments on five prevalent datasets and report a series of state-of-the-art records.
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a query-irrelevant behavior at deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (also head-irrelevant). Second, the attention maps are intrinsically sparse: few tokens dominate the attention weights; introducing the knowledge from ConvNets would largely smooth the attention and enhance the performance. Motivated by the above observations, we generalize the self-attention formulation to abstract a query-irrelevant global context directly and further integrate the global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), purely consists of convolutional layers and firmly inherits the merits of both attention mechanism and convolutions, including dynamic property, weight sharing, and short- and long-range feature modeling, etc. Experimental results demonstrate the effectiveness of FCViT. With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks, like object detection, instance segmentation, and semantic segmentation. Codes and models are made available at: https://github.com/ma-xu/FCViT.