We introduce submodel co-training, a regularization method related to co-training, self-distillation and stochastic depth. Given a neural network to be trained, for each sample we implicitly instantiate two altered networks, ``submodels'', with stochastic depth: we activate only a subset of the layers. Each network serves as a soft teacher to the other, by providing a loss that complements the regular loss provided by the one-hot label. Our approach, dubbed cosub, uses a single set of weights, and does not involve a pre-trained external model or temporal averaging. Experimentally, we show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation. Our approach is compatible with multiple architectures, including RegNet, ViT, PiT, XCiT, Swin and ConvNext. Our training strategy improves their results in comparable settings. For instance, a ViT-B pretrained with cosub on ImageNet-21k obtains 87.4% top-1 acc. @448 on ImageNet-val.
translated by 谷歌翻译
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These highperforming vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption.In this work, we produce competitive convolution-free transformers by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data.More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
translated by 谷歌翻译
我们展示了如何通过基于关注的全球地图扩充任何卷积网络,以实现非本地推理。我们通过基于关注的聚合层替换为单个变压器块的最终平均池,重量贴片如何参与分类决策。我们使用2个参数(宽度和深度)使用简单的补丁卷积网络,使用简单的补丁的卷积网络插入学习的聚合层。与金字塔设计相比,该架构系列在所有层上维护输入补丁分辨率。它在准确性和复杂性之间产生了令人惊讶的竞争权衡,特别是在记忆消耗方面,如我们在各种计算机视觉任务所示:对象分类,图像分割和检测的实验所示。
translated by 谷歌翻译
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These highperforming vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption.In this work, we produce competitive convolutionfree transformers trained on ImageNet only using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data.We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks. We will share our code and models.
translated by 谷歌翻译
Transformers have been recently adapted for large scale image classification, achieving high scores shaking up the long supremacy of convolutional neural networks. However the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization of such dedicated transformers. We make two transformers architecture changes that significantly improve the accuracy of deep transformers. This leads us to produce models whose performance does not saturate early with more depth, for instance we obtain 86.5% top-1 accuracy on Imagenet when training with no external data, we thus attain the current SOTA with less FLOPs and parameters. Moreover, our best model establishes the new state of the art on Imagenet with Reassessed labels and Imagenet-V2 / match frequency, in the setting with no additional training data. We share our code and models 1 .
translated by 谷歌翻译
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc, revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) An intermediate layer of the teacher network as target perform better than that using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over the scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in AE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
translated by 谷歌翻译
Vision Transformer(VIT)最近由于其出色的模型功能而引起了计算机视觉的极大关注。但是,大多数流行的VIT模型都有大量参数,从而限制了其在资源有限的设备上的适用性。为了减轻这个问题,我们提出了Tinyvit,这是一个新的小型,有效的小型视觉变压器,并通过我们提议的快速蒸馏框架在大型数据集上预处理。核心思想是将知识从大型模型转移到小型模型,同时使小型模型能够获得大量预处理数据的股息。更具体地说,我们在预训练期间应用蒸馏进行知识转移。大型教师模型的徽标被稀疏并提前存储在磁盘中,以节省内存成本和计算开销。微小的学生变形金刚自动从具有计算和参数约束的大型审计模型中缩小。全面的实验证明了TinyVit的功效。它仅具有21m参数的Imagenet-1k上的前1个精度为84.8%,与在Imagenet-21K上预读的SWIN-B相当,而使用较少的参数则使用了4.2倍。此外,增加图像分辨率,TinyVit可以达到86.5%的精度,仅使用11%参数,比SWIN-L略好。最后但并非最不重要的一点是,我们在各种下游任务上展示了TinyVit的良好转移能力。代码和型号可在https://github.com/microsoft/cream/tree/main/tinyvit上找到。
translated by 谷歌翻译
Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
translated by 谷歌翻译
Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p = 16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3×3 convolutions. While the vast majority of computation in the two ViT designs is identical, we find that this small change in early visual processing results in markedly different training behavior in terms of the sensitivity to optimization settings as well as the final model accuracy. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance (by ∼1-2% top-1 accuracy on ImageNet-1k), while maintaining flops and runtime. The improvement can be observed across the wide spectrum of model complexities (from 1G to 36G flops) and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us to recommend using a standard, lightweight convolutional stem for ViT models in this regime as a more robust architectural choice compared to the original ViT model design.
translated by 谷歌翻译
视觉变压器(VIT)已被证明可以在广泛的视觉应用中获得高度竞争性的性能,例如图像分类,对象检测和语义图像分割。与卷积神经网络相比,通常发现视觉变压器的较弱的电感偏差会在较小的培训数据集上培训时,会增加对模型正则化或数据增强的依赖(简称为“ AUGREG”)。我们进行了一项系统的实证研究,以便更好地了解培训数据,AUGREG,模型大小和计算预算之间的相互作用。作为这项研究的一个结果,我们发现增加的计算和AUGREG的组合可以产生与在数量级上训练的模型相同的训练数据的模型:我们在公共Imagenet-21K数据集中培训各种尺寸的VIT模型在较大的JFT-300M数据集上匹配或超越其对手的培训。
translated by 谷歌翻译
Large deep learning models have achieved remarkable success in many scenarios. However, training large models is usually challenging, e.g., due to the high computational cost, the unstable and painfully slow optimization procedure, and the vulnerability to overfitting. To alleviate these problems, this work studies a divide-and-conquer strategy, i.e., dividing a large model into smaller modules, training them independently, and reassembling the trained modules to obtain the target model. This approach is promising since it avoids directly training large models from scratch. Nevertheless, implementing this idea is non-trivial, as it is difficult to ensure the compatibility of the independently trained modules. In this paper, we present an elegant solution to address this issue, i.e., we introduce a global, shared meta model to implicitly link all the modules together. This enables us to train highly compatible modules that collaborate effectively when they are assembled together. We further propose a module incubation mechanism that enables the meta model to be designed as an extremely shallow network. As a result, the additional overhead introduced by the meta model is minimalized. Though conceptually simple, our method significantly outperforms end-to-end (E2E) training in terms of both final accuracy and training efficiency. For example, on top of ViT-Huge, it improves the accuracy by 2.7% compared to the E2E baseline on ImageNet-1K, while saving the training cost by 43% in the meantime. Code is available at https://github.com/LeapLabTHU/Model-Assembling.
translated by 谷歌翻译
变压器模型在处理各种视觉任务方面表现出了有希望的有效性。但是,与训练卷积神经网络(CNN)模型相比,训练视觉变压器(VIT)模型更加困难,并且依赖于大规模训练集。为了解释这一观察结果,我们做出了一个假设,即\ textit {vit模型在捕获图像的高频组件方面的有效性较小,而不是CNN模型},并通过频率分析对其进行验证。受这一发现的启发,我们首先研究了现有技术从新的频率角度改进VIT模型的影响,并发现某些技术(例如,randaugment)的成功可以归因于高频组件的更好使用。然后,为了补偿这种不足的VIT模型能力,我们提出了HAT,该HAT可以通过对抗训练直接增强图像的高频组成部分。我们表明,HAT可以始终如一地提高各种VIT模型的性能(例如VIT-B的 +1.2%,Swin-B的 +0.5%),尤其是提高了仅使用Imagenet-的高级模型Volo-D5至87.3% 1K数据,并且优势也可以维持在分发数据的数据上,并转移到下游任务。该代码可在以下网址获得:https://github.com/jiawangbai/hat。
translated by 谷歌翻译
视觉识别的“咆哮20S”开始引入视觉变压器(VITS),这将被取代的Cummnets作为最先进的图像分类模型。另一方面,vanilla vit,当应用于一般计算机视觉任务等对象检测和语义分割时面临困难。它是重新引入多个ConvNet Priors的等级变压器(例如,Swin变压器),使变压器实际上可作为通用视觉骨干网,并在各种视觉任务上展示了显着性能。然而,这种混合方法的有效性仍然在很大程度上归功于变压器的内在优越性,而不是卷积的固有感应偏差。在这项工作中,我们重新审视设计空间并测试纯粹的Convnet可以实现的限制。我们逐渐“现代化”标准Reset朝着视觉变压器的设计设计,并发现几个有助于沿途绩效差异的关键组件。此探索的结果是一个纯粹的ConvNet型号被称为ConvNext。完全由标准的Convnet模块构建,ConvNexts在准确性和可扩展性方面与变压器竞争,实现了87.8%的ImageNet Top-1精度和表现优于COCO检测和ADE20K分割的Swin变压器,同时保持了标准Convnet的简单性和效率。
translated by 谷歌翻译
我们在视觉变压器上呈现整洁但有效的递归操作,可以提高参数利用而不涉及额外参数。这是通过在变压器网络的深度分享权重来实现的。所提出的方法可以只使用NA \“IVE递归操作来获得大量增益(〜2%),不需要对设计网络原理的特殊或复杂的知识,并引入训练程序的最小计算开销。减少额外的计算通过递归操作,同时保持卓越的准确性,我们通过递归层的多个切片组自行引入近似方法,这可以通过最小的性能损失将成本消耗降低10〜30%。我们称我们的模型切片递归变压器(SRET) ,这与高效视觉变压器的广泛的其他设计兼容。我们最好的模型在含有较少参数的同时,在最先进的方法中对Imagenet建立了重大改进。建议的切片递归操作使我们能够建立一个变压器超过100甚至1000层,仍然仍然小尺寸(13〜15米),以避免困难当模型尺寸太大时,IES在优化中。灵活的可扩展性显示出缩放和构建极深和大维视觉变压器的巨大潜力。我们的代码和模型可在https://github.com/szq0214/sret中找到。
translated by 谷歌翻译
本文研究了从预先训练的模型,尤其是蒙面自动编码器中提取知识的潜力。我们的方法很简单:除了优化掩盖输入的像素重建损失外,我们还将教师模型的中间特征图与学生模型的中间特征图之间的距离最小化。此设计导致一个计算高效的知识蒸馏框架,给定1)仅使用一个少量可见的补丁子集,2)(笨拙的)教师模型仅需要部分执行,\ ie,\ ie,在前几个中,向前传播输入层,用于获得中间特征图。与直接蒸馏微型模型相比,提炼预训练的模型显着改善了下游性能。例如,通过将知识从MAE预先训练的VIT-L提炼为VIT-B,我们的方法可实现84.0%的Imagenet Top-1精度,表现优于直接将微型VIT-L蒸馏的基线,降低1.2%。更有趣的是,我们的方法即使具有极高的掩盖率也可以从教师模型中进行鲁棒性蒸馏:例如,在蒸馏过程中仅可见十个斑块,我们的VIT-B具有竞争力的前1个Imagenet精度为83.6%,在95%的掩盖率中,只有十个斑块。 ;令人惊讶的是,它仍然可以通过仅四个可见斑(98%的掩盖率)积极训练来确保82.4%的Top-1 Imagenet精度。代码和模型可在https://github.com/ucsc-vlaa/dmae上公开获得。
translated by 谷歌翻译
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers.As a result, we propose LeVIT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https: //github.com/facebookresearch/LeViT.
translated by 谷歌翻译
本文研究了两种技术,用于开发有效的自我监督视觉变压器(ESVIT)进行视觉表示学习。首先,我们通过一项全面的实证研究表明,具有稀疏自我生产的多阶段体系结构可以显着降低建模的复杂性,但具有失去捕获图像区域之间细粒度对应关系的能力的成本。其次,我们提出了一项新的区域匹配训练任务,该任务使模型可以捕获细粒的区域依赖性,因此显着提高了学习视觉表示的质量。我们的结果表明,ESVIT在ImageNet线性探针评估上结合两种技术,在ImageNet线性探针评估中获得了81.3%的TOP-1,优于先前的艺术,其较高吞吐量的顺序幅度约为较高。当转移到下游线性分类任务时,ESVIT在18个数据集中的17个中优于其受监督的对方。代码和模型可公开可用:https://github.com/microsoft/esvit
translated by 谷歌翻译
在本文中,我们通过利用视觉数据中的空间稀疏性提出了一种新的模型加速方法。我们观察到,视觉变压器中的最终预测仅基于最有用的令牌的子集,这足以使图像识别。基于此观察,我们提出了一个动态的令牌稀疏框架,以根据加速视觉变压器的输入逐渐和动态地修剪冗余令牌。具体而言,我们设计了一个轻量级预测模块,以估计给定当前功能的每个令牌的重要性得分。该模块被添加到不同的层中以层次修剪冗余令牌。尽管该框架的启发是我们观察到视觉变压器中稀疏注意力的启发,但我们发现自适应和不对称计算的想法可能是加速各种体系结构的一般解决方案。我们将我们的方法扩展到包括CNN和分层视觉变压器在内的层次模型,以及更复杂的密集预测任务,这些任务需要通过制定更通用的动态空间稀疏框架,并具有渐进性的稀疏性和非对称性计算,用于不同空间位置。通过将轻质快速路径应用于少量的特征,并使用更具表现力的慢速路径到更重要的位置,我们可以维护特征地图的结构,同时大大减少整体计算。广泛的实验证明了我们框架对各种现代体系结构和不同视觉识别任务的有效性。我们的结果清楚地表明,动态空间稀疏为模型加速提供了一个新的,更有效的维度。代码可从https://github.com/raoyongming/dynamicvit获得
translated by 谷歌翻译
对象检测是用于测试预先训练的网络参数的中央下游任务是否达到益处,例如提高准确度或训练速度。当新架构(如视觉变压器(VIT)模型到达时,物体检测方法的复杂性可以使该基准是非微不足道的。这些困难(例如,架构不相容,慢训练,高记忆消耗,未知的培训公式等)已经阻止了最近通过标准VIT模型进行了基准测试转移学习的研究。在本文中,我们提出了克服这些挑战的培训技术,使得使用标准的VT模型作为面膜R-CNN的骨干。这些工具促进了我们研究的主要目标:我们比较五种Vit初始化,包括最近的最先进的自我监督的学习方法,监督初始化和强大的随机初始化基线。我们的研究结果表明,最近基于掩蔽的无监督学习方法可能是在COCO的令人信服的转移学习改进,将箱子AP增加到4%(绝对)的监督和先前自我监督的预训练方法。此外,基于掩蔽的初始化比例更好,随着模型尺寸的增加而增长的提高。
translated by 谷歌翻译