Attention-based neural networks, such as Transformers, have become ubiquitous in numerous applications, including computer vision, natural language processing, and time-series analysis. In all kinds of attention networks, the attention maps are crucial as they encode semantic dependencies between input tokens. However, most existing attention networks perform modeling or reasoning based on representations, wherein the attention maps of different layers are learned separately without explicit interactions. In this paper, we propose a novel and generic evolving attention mechanism, which directly models the evolution of inter-token relationships through a chain of residual convolutional modules. The major motivations are twofold. On the one hand, the attention maps in different layers share transferable knowledge, thus adding a residual connection can facilitate the information flow of inter-token relationships across layers. On the other hand, there is naturally an evolutionary trend among attention maps at different abstraction levels, so it is beneficial to exploit a dedicated convolution-based module to capture this process. Equipped with the proposed mechanism, the convolution-enhanced evolving attention networks achieve superior performance in various applications, including time-series representation, natural language understanding, machine translation, and image classification. Especially on time-series representation tasks, Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly, achieving an average of 17% improvement compared to the best SOTA. To the best of our knowledge, this is the first work that explicitly models the layer-wise evolution of attention maps. Our implementation is available at https://github.com/pkuyym/EvolvingAttention
translated by 谷歌翻译
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
translated by 谷歌翻译
变压器是一种基于关注的编码器解码器架构,彻底改变了自然语言处理领域。灵感来自这一重大成就,最近在将变形式架构调整到计算机视觉(CV)领域的一些开创性作品,这已经证明了他们对各种简历任务的有效性。依靠竞争力的建模能力,与现代卷积神经网络相比在本文中,我们已经为三百不同的视觉变压器进行了全面的审查,用于三个基本的CV任务(分类,检测和分割),提出了根据其动机,结构和使用情况组织这些方法的分类。 。由于培训设置和面向任务的差异,我们还在不同的配置上进行了评估了这些方法,以便于易于和直观的比较而不是各种基准。此外,我们已经揭示了一系列必不可少的,但可能使变压器能够从众多架构中脱颖而出,例如松弛的高级语义嵌入,以弥合视觉和顺序变压器之间的差距。最后,提出了三个未来的未来研究方向进行进一步投资。
translated by 谷歌翻译
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. 1
translated by 谷歌翻译
视觉变压器正在成为解决计算机视觉问题的强大工具。最近的技术还证明了超出图像域之外的变压器来解决许多与视频相关的任务的功效。其中,由于其广泛的应用,人类的行动识别是从研究界受到特别关注。本文提供了对动作识别的视觉变压器技术的首次全面调查。我们朝着这个方向分析并总结了现有文献和新兴文献,同时突出了适应变形金刚以进行动作识别的流行趋势。由于其专业应用,我们将这些方法统称为``动作变压器''。我们的文献综述根据其架构,方式和预期目标为动作变压器提供了适当的分类法。在动作变压器的背景下,我们探讨了编码时空数据,降低维度降低,框架贴片和时空立方体构造以及各种表示方法的技术。我们还研究了变压器层中时空注意的优化,以处理更长的序列,通常通过减少单个注意操作中的令牌数量。此外,我们还研究了不同的网络学习策略,例如自我监督和零局学习,以及它们对基于变压器的行动识别的相关损失。这项调查还总结了在具有动作变压器重要基准的评估度量评分方面取得的进步。最后,它提供了有关该研究方向的挑战,前景和未来途径的讨论。
translated by 谷歌翻译
We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pretrained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https: //github.com/leoxiaobin/CvT.
translated by 谷歌翻译
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a queryirrelevant behavior at deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (also head-irrelevant). Second, the attention maps are intrinsically sparse, few tokens dominate the attention weights; introducing the knowledge from ConvNets would largely smooth the attention and enhance the performance. Motivated by above observations, we generalize self-attention formulation to abstract a queryirrelevant global context directly and further integrate the global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), purely consists of convolutional layers and firmly inherits the merits of both attention mechanism and convolutions, including dynamic property, weight sharing, and short- and long-range feature modeling, etc. Experimental results demonstrate the effectiveness of FCViT. With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks, like object detection, instance segmentation, and semantic segmentation. Codes and models are made available at: https://github.com/ma-xu/FCViT.
translated by 谷歌翻译
视觉变压器(VIT)用作强大的视觉模型。与卷积神经网络不同,在前几年主导视觉研究,视觉变压器享有捕获数据中的远程依赖性的能力。尽管如此,任何变压器架构的组成部分,自我关注机制都存在高延迟和低效的内存利用,使其不太适合高分辨率输入图像。为了缓解这些缺点,分层视觉模型在非交错的窗口上局部使用自我关注。这种放松会降低输入尺寸的复杂性;但是,它限制了横窗相互作用,损害了模型性能。在本文中,我们提出了一种新的班次不变的本地注意层,称为查询和参加(QNA),其以重叠的方式聚集在本地输入,非常类似于卷积。 QNA背后的关键想法是介绍学习的查询,这允许快速高效地实现。我们通过将其纳入分层视觉变压器模型来验证我们的层的有效性。我们展示了速度和内存复杂性的改进,同时实现了与最先进的模型的可比准确性。最后,我们的图层尺寸尤其良好,窗口大小,需要高于X10的内存,而不是比现有方法更快。
translated by 谷歌翻译
变压器注意机制的二次计算和内存复杂性限制了对长序列建模的可扩展性。在本文中,我们提出了Luna,一种线性统一嵌套关注机制,使Softmax注意力具有两个嵌套线性关注功能,仅产生线性(与二次)的时间和空间复杂度相反。具体地,通过第一注意功能,LUNA将输入序列包装成固定长度的序列。然后,使用第二关注功能未包装包装序列。与更传统的关注机制相比,LUNA引入具有固定长度的附加序列作为输入和额外的相应输出,允许LUNA线性地进行关注操作,同时还存储足够的上下文信息。我们对三个序列建模任务的基准进行了广泛的评估:长上下文序列建模,神经机平移和大型预磨损的屏蔽语言建模。竞争甚至更好的实验结果表明了Luna的有效性和效率与各种各样相比
translated by 谷歌翻译
视觉变压器由于能够捕获图像中的长期依赖性的能力而成功地应用于图像识别任务。但是,变压器与现有卷积神经网络(CNN)之间的性能和计算成本仍然存在差距。在本文中,我们旨在解决此问题,并开发一个网络,该网络不仅可以超越规范变压器,而且可以超越高性能卷积模型。我们通过利用变压器来捕获长期依赖性和CNN来建模本地特征,从而提出了一个新的基于变压器的混合网络。此外,我们将其扩展为获得一个称为CMT的模型家族,比以前的基于卷积和基于变压器的模型获得了更好的准确性和效率。特别是,我们的CMT-S在ImageNet上获得了83.5%的TOP-1精度,而在拖鞋上的拖曳率分别比现有的DEIT和EficitiveNet小14倍和2倍。拟议的CMT-S还可以很好地概括CIFAR10(99.2%),CIFAR100(91.7%),花(98.7%)以及其他具有挑战性的视觉数据集,例如可可(44.3%地图),计算成本较小。
translated by 谷歌翻译
多尺度特征层次结构已在计算机视觉区域的成功中得到了见证。这进一步激发了研究人员设计自然语言处理的多尺度变压器,主要是基于自我发项机制。例如,限制跨头部的接收场或通过卷积提取局部细粒度特征。但是,大多数现有作品都直接建模了本地功能,但忽略了单词边界信息。这导致了缺乏解释性的多余和模棱两可的注意力分布。在这项工作中,我们在不同的语言单元中定义了这些量表,包括子字,单词和短语。我们通过基于单词边界信息和短语级别的先验知识之间建立量表之间的关系来构建多尺度变压器模型。提出的\ textbf {u} niversal \ textbf {m} ulti \ textbf {s} cale \ textbf {t} ransformer,即在两个序列生成任务上评估。值得注意的是,它在几个测试组上的强大基线上产生了一致的性能,而无需牺牲效率。
translated by 谷歌翻译
由于其二次复杂性,是变压器中的关注模块,其是变压器中的重要组件不能高效地扩展到长序列。许多工作侧重于近似于尺寸的圆点 - 指数的软MAX功能,导致分二次甚至线性复杂性变压器架构。但是,我们表明这些方法不能应用于超出点的指数样式的更强大的注意模块,例如,具有相对位置编码(RPE)的变压器。由于在许多最先进的模型中,相对位置编码被用作默认,设计可以包含RPE的高效变压器是吸引人的。在本文中,我们提出了一种新颖的方法来加速对RPE的转化仪的关注计算在核心化的关注之上。基于观察到相对位置编码形成Toeplitz矩阵,我们数在数学上表明,可以使用快速傅里叶变换(FFT)有效地计算具有RPE的核化注意。使用FFT,我们的方法实现$ \ mathcal {o}(n \ log n)$时间复杂性。有趣的是,我们进一步证明使用相对位置编码适当地可以减轻香草群关注的培训不稳定问题。在广泛的任务上,我们经验证明我们的模型可以从头开始培训,没有任何优化问题。学习模型比许多高效的变压器变体更好地执行,并且在长序列制度中比标准变压器更快。
translated by 谷歌翻译
在这项工作中,我们探索了用于视觉接地的整洁而有效的基于变压器的框架。先前的方法通常解决了视觉接地的核心问题,即具有手动设计的机制,即多模式融合和推理。这样的启发式设计不仅复杂化,而且使模型容易过度拟合特定的数据分布。为了避免这种情况,我们首先提出了TransVG,该TransVG通过变压器建立了多模式的对应关系,并通过直接回归框坐标来定位引用区域。我们从经验上表明,复杂的融合模块可以用具有更高性能的变压器编码层的简单堆栈代替。但是,TransVG中的核心融合变压器是针对Uni-Modal编码器的独立性,因此应在有限的视觉接地数据上从头开始训练,这使得很难优化并导致次优性能。为此,我们进一步介绍了TransVG ++以进行两倍的改进。一方面,我们通过利用Vision Transformer(VIT)进行视觉功能编码来将框架升级到一个纯粹的基于变压器的框架。对于另一个人来说,我们设计了语言有条件的视觉变压器,以去除外部融合模块,并重用Uni-Modal vit进行中间层的视觉融合。我们对五个普遍数据集进行了广泛的实验,并报告一系列最先进的记录。
translated by 谷歌翻译
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understand (NLU) and natural langauge generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa 1 .
translated by 谷歌翻译
变形金刚在自然语言处理方面取得了巨大的成功。由于变压器中自我发挥机制的强大能力,研究人员为各种计算机视觉任务(例如图像识别,对象检测,图像分割,姿势估计和3D重建)开发了视觉变压器。本文介绍了有关视觉变形金刚的不同建筑设计和培训技巧(包括自我监督的学习)文献的全面概述。我们的目标是为开放研究机会提供系统的审查。
translated by 谷歌翻译
Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4×4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost.
translated by 谷歌翻译
注意机制对研究界提出了重大兴趣,因为他们承诺改善神经网络架构的表现。但是,在任何特定的问题中,我们仍然缺乏主要的方法来选择导致保证改进的具体机制和超参数。最近,已经提出了自我关注并广泛用于变压器 - 类似的架构中,导致某些应用中的重大突破。在这项工作中,我们专注于两种形式的注意机制:注意模块和自我关注。注意模块用于重新重量每个层输入张量的特征。不同的模块具有不同的方法,可以在完全连接或卷积层中执行此重复。研究的注意力模型是完全模块化的,在这项工作中,它们将与流行的Reset架构一起使用。自我关注,最初在自然语言处理领域提出,可以将所有项目与输入序列中的所有项目相关联。自我关注在计算机视觉中越来越受欢迎,其中有时与卷积层相结合,尽管最近的一些架构与卷曲完全消失。在这项工作中,我们研究并执行了在特定计算机视觉任务中许多不同关注机制的客观的比较,在广泛使用的皮肤癌MNIST数据集中的样本分类。结果表明,关注模块有时会改善卷积神经网络架构的性能,也是这种改进虽然明显且统计学意义,但在不同的环境中并不一致。另一方面,通过自我关注机制获得的结果表明了一致和显着的改进,即使在具有减少数量的参数的架构中,也可以实现最佳结果。
translated by 谷歌翻译
人类自然有效地在复杂的场景中找到突出区域。通过这种观察的动机,引入了计算机视觉中的注意力机制,目的是模仿人类视觉系统的这一方面。这种注意机制可以基于输入图像的特征被视为动态权重调整过程。注意机制在许多视觉任务中取得了巨大的成功,包括图像分类,对象检测,语义分割,视频理解,图像生成,3D视觉,多模态任务和自我监督的学习。在本调查中,我们对计算机愿景中的各种关注机制进行了全面的审查,并根据渠道注意,空间关注,暂时关注和分支注意力进行分类。相关的存储库https://github.com/menghaoguo/awesome-vision-tions致力于收集相关的工作。我们还建议了未来的注意机制研究方向。
translated by 谷歌翻译
Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than pure visual classification. However, how to effectively model the linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting. Firstly, the autonomous suggests enforcing explicitly language modeling by decoupling the recognizer into vision model and language model and blocking gradient flow between both models. Secondly, a novel bidirectional cloze network (BCN) as the language model is proposed based on bidirectional feature representation. Thirdly, we propose an execution manner of iterative correction for the language model which can effectively alleviate the impact of noise input. Finally, to polish ABINet++ in long text recognition, we propose to aggregate horizontal features by embedding Transformer units inside a U-Net, and design a position and content attention module which integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, which consistently demonstrates the superiority of our method in various environments especially on low-quality images. Besides, extensive experiments including in English and Chinese also prove that, a text spotter that incorporates our language modeling method can significantly improve its performance both in accuracy and speed compared with commonly used attention-based recognizers.
translated by 谷歌翻译
表面缺陷检测是确保工业产品质量的极其至关重要的步骤。如今,基于编码器架构的卷积神经网络(CNN)在各种缺陷检测任务中取得了巨大的成功。然而,由于卷积的内在局部性,它们通常在明确建模长距离相互作用时表现出限制,这对于复杂情况下的像素缺陷检测至关重要,例如杂乱的背景和难以辨认的伪缺陷。最近的变压器尤其擅长学习全球图像依赖性,但对于详细的缺陷位置所需的本地结构信息有限。为了克服上述局限性,我们提出了一个有效的混合变压器体系结构,称为缺陷变压器(faft),用于表面缺陷检测,该检测将CNN和Transferaler纳入统一模型,以协作捕获本地和非本地关系。具体而言,在编码器模块中,首先采用卷积茎块来保留更详细的空间信息。然后,贴片聚合块用于生成具有四个层次结构的多尺度表示形式,每个层次结构之后分别是一系列的feft块,该块分别包括用于本地位置编码的本地位置块,一个轻巧的多功能自我自我 - 注意与良好的计算效率建模多尺度的全球上下文关系,以及用于功能转换和进一步位置信息学习的卷积馈送网络。最后,提出了一个简单但有效的解码器模块,以从编码器中的跳过连接中逐渐恢复空间细节。与其他基于CNN的网络相比,三个数据集上的广泛实验证明了我们方法的优势和效率。
translated by 谷歌翻译