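While the Vision Transformer (VT) architecture is becoming increasingly popular in computer vision, pure VT models perform poorly on tiny datasets. To address this issue, this paper proposes locality guidance for improving the performance of VTs on tiny datasets. We first analyze that local information, which is of great importance for understanding images, is hard to learn from limited data because of the high flexibility and intrinsic globality of the self-attention mechanism in VTs. To facilitate local information, we realize locality guidance for VTs by imitating the features of an already trained convolutional neural network (CNN), inspired by the built-in local-to-global hierarchy of CNNs. Under our dual-task learning paradigm, the locality guidance provided by a lightweight CNN trained on low-resolution images is adequate to accelerate convergence and improve the performance of VTs to a large extent. Our locality guidance approach is therefore simple and efficient, and can serve as a basic performance enhancement method for VTs on tiny datasets. Extensive experiments demonstrate that our method significantly improves VTs when training from scratch on tiny datasets and is compatible with different kinds of VTs and datasets. For example, our method boosts the performance of various VTs on tiny datasets (e.g., by 13.07% for DeiT, 8.98% for T2T, and 7.85% for PVT), and improves the even stronger baseline PVTv2 by 1.86% to 79.30%, showing the potential of VTs on tiny datasets. The code is available at https://github.com/lkhl/tiny-transformers.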
An extreme performance gap still remains between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is attributed to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, namely spatial relevance and diverse channel representation. First, on the spatial aspect, objects are locally compact and relevant, so fine-grained features need to be extracted from a token and its neighbors, yet the lack of data hinders ViTs from attending to spatial relevance. Second, on the channel aspect, representations exhibit diversity across channels, but scarce data cannot enable ViTs to learn representations strong enough for accurate recognition. To this end, we propose the Dynamic Hybrid Vision Transformer (DHVT) to enhance the two inductive biases. On the spatial aspect, we adopt a hybrid structure in which convolution is integrated into the patch embedding and multi-layer perceptron (MLP) modules, forcing the model to capture token features as well as their neighboring features. On the channel aspect, we introduce a dynamic feature aggregation module in the MLP and a brand new "head token" design in the multi-head self-attention module to help re-calibrate channel representations and make different channel-group representations interact with each other. The fusion of weak channel representations forms a representation strong enough for classification. With this design, we successfully eliminate the performance gap between CNNs and ViTs, and our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters. Code is available at https://github.com/ArieSeirack/DHVT.
Pure transformers have recently shown great potential for vision tasks. However, their accuracy on small or medium datasets is not satisfactory. Although some existing methods introduce a CNN as a teacher to guide the training process by distillation, the gap between the teacher and student networks can lead to sub-optimal performance. In this work, we propose a new One-shot Vision transformer search framework with Online distillation, namely OVO. OVO samples sub-nets for both the teacher and student networks for better distillation results. Benefiting from online distillation, thousands of subnets in the supernet are well trained without extra fine-tuning or retraining. In experiments, OVO-Ti achieves 73.32% top-1 accuracy on ImageNet and 75.2% on CIFAR-100.
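With the rise of Transformers as the standard for language processing, and their advancements in computer vision, there has been a corresponding growth in parameter size and amounts of training data. Many have come to believe that, because of this, Transformers are not suitable for small sets of data. This trend leads to concerns such as the limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we aim to present an approach for small-scale learning by introducing Compact Transformers. We show for the first time that, with the right size and convolutional tokenization, Transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets. Our models are flexible in terms of model size and can have as few as 0.28M parameters while achieving competitive results. Our best model reaches 98% accuracy when trained from scratch on CIFAR-10 with only 3.7M parameters, which is a significant improvement in data efficiency over previous Transformer-based models: it is over 10x smaller than other Transformers and 15% the size of ResNet50 while achieving similar performance. CCT also outperforms many modern CNN-based approaches, and even some NAS-based approaches. Additionally, we obtain a new SOTA result on Flowers-102 with 99.76% top-1 accuracy, and improve upon the existing baseline on ImageNet (82.71% accuracy with 29% as many parameters as ViT), as well as on NLP tasks. Our simple and compact design makes Transformers more feasible to study for those with limited computing resources and/or small datasets, while extending existing research efforts in data-efficient Transformers. Our code and pre-trained models are publicly available at https://github.com/shi-labs/compact-transformers.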
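Vision Transformers (ViTs) have become popular structures and have outperformed convolutional neural networks (CNNs) on various vision tasks. However, such powerful Transformers bring a huge computational burden, and the essential barrier behind this is the exhaustive token-to-token comparison. To alleviate this, we delve deeply into the model properties of ViT and observe that ViTs exhibit sparse attention with high token similarity. This intuitively suggests a feasible, structure-agnostic dimension, the number of tokens, along which to reduce the computational cost. Based on this exploration, we propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT. Specifically, we first design a novel Token Slimming Module (TSM), which boosts the inference efficiency of ViTs through dynamic token aggregation. Unlike hard token dropping, our TSM softly integrates redundant tokens into fewer informative ones, which can dynamically zoom visual attention without cutting off discriminative token relations in the image. Furthermore, we introduce a concise Dense Knowledge Distillation (DKD) framework, which densely transfers unorganized token information in a flexible auto-encoder manner. Because the teacher and student share a similar structure, our framework can effectively leverage structural knowledge for better convergence. Finally, we conduct extensive experiments to evaluate SiT. They show that our method can speed up ViTs by 1.7x with a negligible accuracy drop, and even by 3.6x while maintaining 97% of their performance. Surprisingly, by simply equipping LV-ViT with SiT, we achieve new state-of-the-art performance on ImageNet, surpassing all CNNs and ViTs in the recent literature.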
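In this paper, we present a new approach for model acceleration that exploits spatial sparsity in visual data. We observe that the final prediction in vision Transformers is based only on a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically, conditioned on the input, to accelerate vision Transformers. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. While the framework is inspired by our observation of sparse attention in vision Transformers, we find that the idea of adaptive and asymmetric computation can be a general solution for accelerating various architectures. We extend our method to hierarchical models, including CNNs and hierarchical vision Transformers, as well as to more complex dense prediction tasks that require structured feature maps, by formulating a more generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations. By applying lightweight fast paths to less informative features and more expressive slow paths to more important locations, we can maintain the structure of feature maps while significantly reducing the overall computation. Extensive experiments demonstrate the effectiveness of our framework on various modern architectures and different visual recognition tasks. Our results clearly show that dynamic spatial sparsification offers a new and more effective dimension for model acceleration. Code is available at https://github.com/raoyongming/dynamicvit.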
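The hierarchical pruning step described in this abstract can be illustrated with a minimal sketch. It assumes that each pruning stage simply keeps the top-scoring fraction of tokens at inference time; the function names, the `keep_ratio` parameter, and the magnitude-based stand-in for the learned prediction module are all hypothetical and are not taken from the paper or its repository.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    num_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by importance, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:num_keep])  # restore the original token order
    return [tokens[i] for i in kept], kept


def hierarchical_prune(tokens, score_fn, keep_ratio, num_stages):
    """Apply pruning at several stages, rescoring the survivors each time."""
    indices = list(range(len(tokens)))
    for _ in range(num_stages):
        scores = score_fn(tokens)  # stand-in for a learned prediction module
        tokens, kept = prune_tokens(tokens, scores, keep_ratio)
        indices = [indices[i] for i in kept]
    return tokens, indices


# Toy usage: tokens are scalars and the stand-in score is their magnitude.
tokens = [0.1, -2.0, 0.3, 1.5, -0.2, 0.9, 0.05, -1.1]
kept_tokens, kept_indices = hierarchical_prune(
    tokens, score_fn=lambda ts: [abs(t) for t in ts], keep_ratio=0.5, num_stages=2
)
```

Returning the surviving indices alongside the tokens keeps track of which spatial positions remain, which matters once the pruned sequence has to be mapped back onto a feature map.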
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications benefit only marginally, if at all, from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, input, network regularization, and sequential distillation, revealing that: 1) distilling token relations is more effective than CLS-token-based and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student does not match that of the teacher; 3) weak regularization is preferred. With these findings, we achieve significant fine-tuning accuracy improvements over from-scratch MIM pre-training on ImageNet-1K classification for the ViT-Tiny, ViT-Small, and ViT-Base models, with gains of +4.2%/+2.4%/+1.4%, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models: exploring better training methods rather than introducing inductive biases into architectures, as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
The Position Embedding (PE) is critical for Vision Transformers (VTs) because the self-attention operation is permutation-invariant. By analyzing the input and output of each encoder layer in VTs using reparameterization and visualization, we find that the default PE joining method (simply adding the PE and the patch embedding together) applies the same affine transformation to the token embedding and the PE, which limits the expressiveness of the PE and hence constrains the performance of VTs. To overcome this limitation, we propose a simple, effective, and robust method. Specifically, we provide two independent layer normalizations for the token embeddings and the PE in each layer, and add the results together as the input of that layer's Multi-Head Self-Attention module. Since the method allows the model to adaptively adjust the PE information for different layers, we name it Layer-adaptive Position Embedding, abbreviated as LaPE. Extensive experiments demonstrate that LaPE can improve various VTs with different types of PE and make VTs robust to the PE type. For example, LaPE improves accuracy by 0.94% for ViT-Lite on CIFAR-10, 0.98% for CCT on CIFAR-100, and 1.72% for DeiT on ImageNet-1K, which is remarkable considering the negligible extra parameters, memory, and computational cost it introduces. The code is publicly available at https://github.com/Ingrid725/LaPE.
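As a rough illustration of the idea of normalizing the token embedding and the PE independently before adding them, the sketch below works on a single token in plain Python. The helper names and the toy gain and bias values are assumptions made for illustration only; it is not the authors' implementation.

```python
import math


def layer_norm(vec, gain, bias, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(vec, gain, bias)]


def lape_layer_input(token, pos, token_ln, pos_ln):
    """Normalize the token embedding and the PE separately, then add them."""
    normed_token = layer_norm(token, *token_ln)
    normed_pos = layer_norm(pos, *pos_ln)
    return [t + p for t, p in zip(normed_token, normed_pos)]


dim = 4
token = [1.0, 2.0, 3.0, 4.0]
pos = [0.5, -0.5, 0.5, -0.5]
# Hypothetical per-layer parameters: (gain, bias) for each normalization.
token_ln = ([1.0] * dim, [0.0] * dim)
pos_ln = ([0.0] * dim, [0.0] * dim)  # a zero gain silences the PE in this layer
attn_input = lape_layer_input(token, pos, token_ln, pos_ln)
```

Because the PE has its own gain and bias in every layer, a layer can amplify or suppress positional information without affecting how the token embedding is scaled, which a single shared normalization over the sum cannot do.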
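Visual Transformers (VTs) are emerging as an architectural paradigm alternative to convolutional networks (CNNs). Unlike CNNs, VTs can capture global relations between image elements, and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs. In fact, some local properties of the visual domain that are embedded in the CNN architectural design must be learned from samples in VTs. In this paper, we empirically analyze different VTs, comparing their robustness in a small-training-set regime, and we show that, despite having comparable accuracy when trained on ImageNet, their performance on smaller datasets can differ widely. Moreover, we propose a self-supervised task that can extract additional information from images with only negligible computational overhead. This task encourages VTs to learn spatial relations within an image and makes VT training much more robust when training data are scarce. Our task is used jointly with standard (supervised) training and does not depend on specific architectural choices, so it can easily be plugged into existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of VTs. Our code is available at: https://github.com/yhlleo/vts-droc.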
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These high-performing vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers trained on ImageNet only, using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks. We will share our code and models.
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These high-performing vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers by training on ImageNet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets both on ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
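One way a distillation token can be trained is with a hard-label objective, in which the class-token head is supervised by the ground-truth label and the distillation-token head by the teacher's predicted label. The sketch below illustrates that formulation for a single example. The abstract does not spell out the loss, so the equal weighting and all function names here are assumptions for illustration.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class label."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Average the ground-truth loss (class token) and the teacher loss (distillation token)."""
    teacher_label = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    loss_cls = cross_entropy(cls_logits, label)
    loss_dist = cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_cls + 0.5 * loss_dist


loss = hard_distillation_loss(
    cls_logits=[2.0, 0.5, -1.0],     # student head on the class token
    dist_logits=[0.1, 1.8, -0.3],    # student head on the distillation token
    teacher_logits=[0.2, 3.0, 0.1],  # e.g. produced by a convnet teacher
    label=0,
)
```

Keeping two separate heads lets the student fit the teacher's decision and the true label at the same time, even when the two disagree, as they do in this toy example.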
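The Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently adapted Transformer-like architectures to the computer vision (CV) field and have demonstrated their effectiveness on various CV tasks. Relying on their competitive modeling capability, visual Transformers have achieved impressive performance compared with modern convolutional neural networks (CNNs). In this paper, we provide a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), and propose a taxonomy that organizes these methods according to their motivations, structures, and usage scenarios. Because of the differences in training settings and target tasks, we also evaluate these methods under different configurations for easy and intuitive comparison, rather than only on various benchmarks. Furthermore, we reveal a series of essential but unexploited aspects that may enable the Transformer to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investment.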
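Transformers have achieved great success in natural language processing. Owing to the powerful capability of the self-attention mechanism in Transformers, researchers have developed vision Transformers for a variety of computer vision tasks, such as image recognition, object detection, image segmentation, pose estimation, and 3D reconstruction. This paper presents a comprehensive overview of the literature on different architecture designs and training tricks (including self-supervised learning) for vision Transformers. Our goal is to provide a systematic review together with open research opportunities.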
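Transformers, which are popular for language modeling, have recently been explored for solving vision tasks, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relations for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset such as ImageNet. We find that this is because: 1) the simple tokenization of input images fails to model important local structure, such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness under fixed computation budgets and limited training samples. To overcome these limitations, we propose a new Tokens-to-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; 2) an efficient backbone with a deep-narrow structure for the vision Transformer, motivated by CNN architecture design after an empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves performance comparable to MobileNets when trained directly on ImageNet. For example, T2T-ViT with a size comparable to ResNet50 (21.5M parameters) achieves 83.3% top-1 accuracy at an image resolution of 384×384 on ImageNet. (Code: https://github.com/yitu-opensource/t2t-vit)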
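To make the claim that aggregating neighboring tokens reduces the token length concrete, the following sketch counts tokens after successive overlapping splits using the standard sliding-window output-size formula. The (kernel, stride, padding) values and the 224-pixel input are assumed for illustration and are not stated in the abstract.

```python
def tokens_after_soft_split(height, width, kernel, stride, padding):
    """Grid of overlapping patches (tokens) produced by one soft split."""
    rows = (height + 2 * padding - kernel) // stride + 1
    cols = (width + 2 * padding - kernel) // stride + 1
    return rows, cols


def t2t_token_schedule(size, stages):
    """Track the token count through successive soft splits."""
    height = width = size
    schedule = []
    for kernel, stride, padding in stages:
        height, width = tokens_after_soft_split(height, width, kernel, stride, padding)
        schedule.append(height * width)
    return schedule


# Assumed (kernel, stride, padding) per stage; illustrative values only.
stages = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]
schedule = t2t_token_schedule(224, stages)  # token count after each stage
```

Under these assumed settings the sequence shrinks from 3136 to 784 to 196 tokens, while each surviving token summarizes an overlapping neighborhood of the previous grid.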
The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer to combine image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity and these tokens are then fused purely by attention multiple times to complement each other. Furthermore, to reduce computation, we develop a simple yet effective token fusion module based on cross attention, which uses a single token for each branch as a query to exchange information with other branches. Our proposed cross-attention only requires linear time for both computational and memory complexity instead of quadratic time otherwise. Extensive experiments demonstrate that our approach performs better than or on par with several concurrent works on vision transformer, in addition to efficient CNN models. For example, on the ImageNet1K dataset, with some architectural changes, our approach outperforms the recent DeiT by a large margin of 2% with a small to moderate increase in FLOPs and model parameters. Our source codes and models are available at https://github.com/IBM/CrossViT.
Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pretrained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias. We initialize the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT (Touvron et al., 2020) on ImageNet, while offering a much improved sample efficiency. We further investigate the role of locality in learning by first quantifying how it is encouraged in vanilla self-attention layers, then analyzing how it is escaped in GPSA layers. We conclude by presenting various ablations to better understand the success of the ConViT. Our code and models are released publicly at https://github.com/facebookresearch/convit.
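The gating idea can be sketched for a single query token as a convex combination of a content-based attention distribution and a position-based one, weighted by the sigmoid of a learnable gating parameter. This is a simplified illustration under that assumption; the names and toy scores are hypothetical and the per-head details of the actual model are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention_row(content_scores, position_scores, gate_param):
    """Blend content and positional attention for one query token."""
    gate = 1.0 / (1.0 + math.exp(-gate_param))  # sigmoid, in (0, 1)
    content = softmax(content_scores)
    position = softmax(position_scores)
    return [(1.0 - gate) * c + gate * p for c, p in zip(content, position)]


# Positional scores that favor the nearest neighbors of the query (index 2).
position_scores = [-4.0, -1.0, 0.0, -1.0, -4.0]
content_scores = [3.0, 0.0, 0.0, 0.0, 0.0]  # content prefers a distant token

local_row = gated_attention_row(content_scores, position_scores, gate_param=4.0)
free_row = gated_attention_row(content_scores, position_scores, gate_param=-4.0)
```

With a large gating parameter the head behaves locally, much like a convolution centered on the query. Lowering the parameter lets the same head attend to distant tokens selected by content, which corresponds to the "escape from locality" mentioned in the abstract.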
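This paper studies the potential of distilling knowledge from pre-trained models, especially Masked Autoencoders. Our approach is simple: in addition to optimizing the pixel reconstruction loss on masked inputs, we minimize the distance between the intermediate feature map of the teacher model and that of the student model. This design leads to a computationally efficient knowledge distillation framework, given that 1) only a small visible subset of patches is used, and 2) the (cumbersome) teacher model only needs to be partially executed, i.e., inputs are forward-propagated through the first few layers only, to obtain the intermediate feature maps. Compared with directly distilling fine-tuned models, distilling pre-trained models substantially improves downstream performance. For example, by distilling the knowledge from an MAE pre-trained ViT-L into a ViT-B, our method achieves 84.0% ImageNet top-1 accuracy, outperforming the baseline of directly distilling a fine-tuned ViT-L by 1.2%. More intriguingly, our method can robustly distill knowledge from teacher models even with extremely high masking ratios: for example, with a 95% masking ratio, where only ten patches are visible during distillation, our ViT-B attains a competitive top-1 ImageNet accuracy of 83.6%; surprisingly, it still secures 82.4% top-1 ImageNet accuracy when aggressively trained with just four visible patches (98% masking ratio). The code and models are publicly available at https://github.com/ucsc-vlaa/dmae.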
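Recently, the Vision Transformer (ViT), which applies the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of ViT results from pre-training on a large dataset such as JFT-300M, and its dependence on large datasets is interpreted as being due to its low locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively address the lack of locality inductive bias and enable ViT to learn from scratch even on small datasets. Moreover, SPT and LSA are generic and effective add-on modules that are easily applicable to various ViTs. Experimental results show that when both SPT and LSA are applied to ViTs, performance improves by an average of 2.96% on Tiny-ImageNet, a representative small dataset. In particular, Swin Transformer achieves an overwhelming performance improvement of 4.08% thanks to the proposed SPT and LSA.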
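Large pre-trained transformers are at the top of contemporary semantic segmentation benchmarks, but come with high computational cost and lengthy training. To lift this constraint, we look at efficient semantic segmentation from the perspective of comprehensive knowledge distillation and aim to bridge the gap between multi-source knowledge extraction and transformer-specific patch embeddings. We put forward the Transformer-based Knowledge Distillation (TransKD) framework, which learns compact student transformers by distilling both the feature maps and the patch embeddings of large teacher transformers, bypassing the long pre-training process and reducing FLOPs by >85.0%. Specifically, we propose two fundamental and two optimization modules: (1) Cross Selective Fusion (CSF) enables knowledge transfer between cross-stage features via channel attention and feature map distillation within hierarchical transformers; (2) Patch Embedding Alignment (PEA) performs dimensional transformation within the patchifying process to facilitate patch embedding distillation; (3) Global-Local Context Mixer (GL-Mixer) extracts both global and local information of a representative embedding; (4) Embedding Assistant (EA) acts as an embedding method that seamlessly bridges the teacher and student models using the teacher's number of channels. Experiments on the Cityscapes, ACDC, and NYUv2 datasets show that TransKD outperforms state-of-the-art distillation frameworks and rivals the time-consuming pre-training method. Code is available at https://github.com/ruipingl/transkd.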
We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pretrained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.