深度学习体系结构的令人印象深刻的性能与模型复杂性的大量增加有关。需要对数百万个参数进行调整,并相应地进行训练和推理时间扩展。但是需要进行大规模的微调吗?在本文中,专注于图像分类,我们考虑了一种简单的转移学习方法利用预卷积特征作为快速内核方法的输入。我们将这种方法称为最佳调整,因为只有内核分类器经过培训。通过执行2500多个培训过程,我们表明这种最佳调整方法提供了可比的精度W.R.T.进行微调,训练时间较小在一个和两个数量级之间。这些结果表明,顶级调整为中小型数据集中的微调提供了有用的替代方法,尤其是在训练效率至关重要的情况下。
translated by 谷歌翻译
视觉变压器(VIT)已被证明可以在广泛的视觉应用中获得高度竞争性的性能,例如图像分类,对象检测和语义图像分割。与卷积神经网络相比,通常发现视觉变压器的较弱的电感偏差会在较小的培训数据集上培训时,会增加对模型正则化或数据增强的依赖(简称为“ AUGREG”)。我们进行了一项系统的实证研究,以便更好地了解培训数据,AUGREG,模型大小和计算预算之间的相互作用。作为这项研究的一个结果,我们发现增加的计算和AUGREG的组合可以产生与在数量级上训练的模型相同的训练数据的模型:我们在公共Imagenet-21K数据集中培训各种尺寸的VIT模型在较大的JFT-300M数据集上匹配或超越其对手的培训。
translated by 谷歌翻译
Transfer learning is a cornerstone of computer vision, yet little work has been done to evaluate the relationship between architecture and transfer. An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform better on other vision tasks. However, this hypothesis has never been systematically tested. Here, we compare the performance of 16 classification networks on 12 image classification datasets. We find that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy (r = 0.99 and 0.96, respectively). In the former setting, we find that this relationship is very sensitive to the way in which networks are trained on ImageNet; many common forms of regularization slightly improve ImageNet accuracy but yield penultimate layer features that are much worse for transfer learning. Additionally, we find that, on two small fine-grained image classification datasets, pretraining on ImageNet provides minimal benefits, indicating the learned features from Ima-geNet do not transfer well to fine-grained tasks. Together, our results show that ImageNet architectures generalize well across datasets, but ImageNet features are less general than previously suggested.
translated by 谷歌翻译
监测原位浮游生物的种群对于保留水生生态系统至关重要。浮游生物微生物实际上易受较小的环境扰动的影响,可以反映出随之而来的形态学和动力学修饰。如今,高级自动或半自动采集系统的可用性已允许生产越来越多的浮游生物图像数据。由于大量获得的数据和浮游生物的数字,因此,采用机器学习算法来对此类数据进行分类。为了应对这些挑战,我们提出了有效的无监督学习管道,以提供浮游生物微生物的准确分类。我们构建一组图像描述符,利用两步过程。首先,对预先训练的神经网络提取的功能进行了跨自动编码器(VAE)的培训。然后,我们将学习的潜在空间用作聚类的图像描述符。我们将方法与最新的无监督方法进行了比较,其中一组预定义的手工特征用于浮游生物图像的聚类。所提出的管道优于我们分析中包含的所有浮游生物数据集的基准算法,提供了更好的图像嵌入属性。
translated by 谷歌翻译
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. 1
translated by 谷歌翻译
由于标记数据的稀缺性,使用在ImageNet上预先培训的模型是遥感场景分类中的事实上标准。虽然最近,几个较大的高分辨率遥感(HRRS)数据集具有建立新基准的目标,但在这些数据集上从头开始的培训模型的尝试是零星的。在本文中,我们显示来自划痕的训练模型在几个较新数据集中产生可比较的结果,可以进行微调在想象中预先培训的模型。此外,在HRRS数据集上学到的表示,更好地或至少与想象中学到的那些类似的人力出现场景分类任务转移到其他HRRS场景分类任务。最后,我们表明,在许多情况下,通过使用域名数据的第二轮预训练,即域 - 自适应预训练,获得最佳表示。源代码和预先训练的模型可用于\ url {https://github.com/risojevicv/rssc-transfer。}
translated by 谷歌翻译
Does the dominant approach to learn representations (as a side effect of optimizing an expected cost for a single training distribution) remain a good approach when we are dealing with multiple distributions. Our thesis is that such scenarios are better served by representations that are "richer" than those obtained with a single optimization episode. This is supported by a collection of empirical results obtained with an apparently na\"ive ensembling technique: concatenating the representations obtained with multiple training episodes using the same data, model, algorithm, and hyper-parameters, but different random seeds. These independently trained networks perform similarly. Yet, in a number of scenarios involving new distributions, the concatenated representation performs substantially better than an equivalently sized network trained from scratch. This proves that the representations constructed by multiple training episodes are in fact different. Although their concatenation carries little additional information about the training task under the training distribution, it becomes substantially more informative when tasks or distributions change. Meanwhile, a single training episode is unlikely to yield such a redundant representation because the optimization process has no reason to accumulate features that do not incrementally improve the training performance.
translated by 谷歌翻译
微调被广泛应用于图像分类任务中,作为转移学习方法。它重新使用源任务中的知识来学习和获得目标任务中的高性能。微调能够减轻培训数据不足和新数据昂贵标签的挑战。但是,标准微调在复杂的数据分布中的性能有限。为了解决这个问题,我们提出了适应性的多调整方法,该方法可适应地确定每个数据样本的微调策略。在此框架中,定义了多个微调设置和一个策略网络。适应性多调整中的策略网络可以动态地调整为最佳权重,以将不同的样本馈入使用不同的微调策略训练的模型。我们的方法的表现优于标准的微调方法1.69%,数据集FGVC-Aircraft和可描述的纹理优于2.79%,在Stanford Cars,CIFAR-10和时尚范围内产生可比的性能。
translated by 谷歌翻译
我们提出“ AITLAS:基准竞技场” - 一个开源基准测试框架,用于评估地球观察中图像分类的最新深度学习方法(EO)。为此,我们介绍了从九种不同的最先进的体系结构得出的400多个模型的全面比较分析,并将它们与来自22个具有不同尺寸的数据集的各种多级和多标签分类任务进行比较和属性。除了完全在这些数据集上训练的模型外,我们还基于在转移学习的背景下训练的模型,利用预训练的模型变体,因为通常在实践中执行。所有提出的方法都是一般的,可以轻松地扩展到本研究中未考虑的许多其他遥感图像分类任务。为了确保可重复性并促进更好的可用性和进一步的开发,所有实验资源在内的所有实验资源,包括训练的模型,模型配置和数据集的处理详细信息(以及用于培训和评估模型的相应拆分)都在存储库上公开可用:HTTPS ://github.com/biasvariancelabs/aitlas-arena。
translated by 谷歌翻译
我们展示了如何通过基于关注的全球地图扩充任何卷积网络,以实现非本地推理。我们通过基于关注的聚合层替换为单个变压器块的最终平均池,重量贴片如何参与分类决策。我们使用2个参数(宽度和深度)使用简单的补丁卷积网络,使用简单的补丁的卷积网络插入学习的聚合层。与金字塔设计相比,该架构系列在所有层上维护输入补丁分辨率。它在准确性和复杂性之间产生了令人惊讶的竞争权衡,特别是在记忆消耗方面,如我们在各种计算机视觉任务所示:对象分类,图像分割和检测的实验所示。
translated by 谷歌翻译
Image classification with small datasets has been an active research area in the recent past. However, as research in this scope is still in its infancy, two key ingredients are missing for ensuring reliable and truthful progress: a systematic and extensive overview of the state of the art, and a common benchmark to allow for objective comparisons between published methods. This article addresses both issues. First, we systematically organize and connect past studies to consolidate a community that is currently fragmented and scattered. Second, we propose a common benchmark that allows for an objective comparison of approaches. It consists of five datasets spanning various domains (e.g., natural images, medical imagery, satellite data) and data types (RGB, grayscale, multispectral). We use this benchmark to re-evaluate the standard cross-entropy baseline and ten existing methods published between 2017 and 2021 at renowned venues. Surprisingly, we find that thorough hyper-parameter tuning on held-out validation data results in a highly competitive baseline and highlights a stunted growth of performance over the years. Indeed, only a single specialized method dating back to 2019 clearly wins our benchmark and outperforms the baseline classifier.
translated by 谷歌翻译
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These highperforming vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption.In this work, we produce competitive convolution-free transformers by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data.More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
translated by 谷歌翻译
Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers. 1
translated by 谷歌翻译
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These highperforming vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption.In this work, we produce competitive convolutionfree transformers trained on ImageNet only using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data.We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks. We will share our code and models.
translated by 谷歌翻译
We introduce submodel co-training, a regularization method related to co-training, self-distillation and stochastic depth. Given a neural network to be trained, for each sample we implicitly instantiate two altered networks, ``submodels'', with stochastic depth: we activate only a subset of the layers. Each network serves as a soft teacher to the other, by providing a loss that complements the regular loss provided by the one-hot label. Our approach, dubbed cosub, uses a single set of weights, and does not involve a pre-trained external model or temporal averaging. Experimentally, we show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation. Our approach is compatible with multiple architectures, including RegNet, ViT, PiT, XCiT, Swin and ConvNext. Our training strategy improves their results in comparable settings. For instance, a ViT-B pretrained with cosub on ImageNet-21k obtains 87.4% top-1 acc. @448 on ImageNet-val.
translated by 谷歌翻译
转移学习是一种经典范式,通过该范式,在大型“上游”数据集上佩戴的模型适于在“下游”专业数据集中产生良好的结果。通常,据了解,“上游”数据集上的更准确的模型将提供更好的转移精度“下游”。在这项工作中,我们在想象的神经网络(CNNS)的背景下对这种现象进行了深入的调查,这些现象已经在想象的数据集上训练的情况下被修剪 - 这是通过缩小它们的连接来压缩。具体地,我们考虑使用通过应用几种最先进的修剪方法而获得的非结构化修剪模型的转移,包括基于幅度的,二阶,重新增长和正规化方法,在12个标准转移任务的上下文中。简而言之,我们的研究表明,即使在高稀稀物质,稀疏的型号也可以匹配或甚至优于致密模型的转移性能,并且在此操作时,可以导致显着的推论甚至培训加速度。与此同时,我们观察和分析不同修剪方法行为的显着差异。
translated by 谷歌翻译
基于注意力的神经网络(例如Vision Transformer(VIT))最近在许多计算机视觉基准上获得了最新结果。尺度是获得出色结果的主要成分,因此,了解模型的缩放属性是有效设计子孙后代的关键。尽管已经研究了用于扩展变压器语言模型的法律,但视觉变压器如何扩展是未知的。为了解决这个问题,我们将VIT模型和数据扩展到上下,并表征错误率,数据和计算之间的关系。在此过程中,我们完善了VIT的体系结构和培训,减少了记忆消耗并提高了所得模型的准确性。结果,我们成功地训练了具有20亿个参数的VIT模型,该模型达到了90.45%TOP-1准确性的新最先进。该模型在几次转移中也表现良好,例如,ImageNet上的Top-1精度达到了84.86%,每个类别仅10个示例。
translated by 谷歌翻译
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -from 1 example per class to 1 M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.
translated by 谷歌翻译
With the ever-growing model size and the limited availability of labeled training data, transfer learning has become an increasingly popular approach in many science and engineering domains. For classification problems, this work delves into the mystery of transfer learning through an intriguing phenomenon termed neural collapse (NC), where the last-layer features and classifiers of learned deep networks satisfy: (i) the within-class variability of the features collapses to zero, and (ii) the between-class feature means are maximally and equally separated. Through the lens of NC, our findings for transfer learning are the following: (i) when pre-training models, preventing intra-class variability collapse (to a certain extent) better preserves the intrinsic structures of the input data, so that it leads to better model transferability; (ii) when fine-tuning models on downstream tasks, obtaining features with more NC on downstream data results in better test accuracy on the given task. The above results not only demystify many widely used heuristics in model pre-training (e.g., data augmentation, projection head, self-supervised learning), but also leads to more efficient and principled fine-tuning method on downstream tasks that we demonstrate through extensive experimental results.
translated by 谷歌翻译
在许多图像分类任务中,诸如夹子之类的开放式摄影模型具有高精度。但是,在某些设置中,他们的零拍摄性能远非最佳。我们研究模型修补程序,目的是提高对特定任务的准确性,而不会在表现已经足够的任务上降低准确性。为了实现这一目标,我们引入了油漆,这是一种修补方法,该方法在微调之前使用模型的权重与要修补的任务进行微调后的权重。在零机夹的性能差的九个任务上,油漆可将精度提高15至60个百分点,同时将ImageNet上的精度保留在零拍模型的一个百分点之内。油漆还允许在多个任务上修补单个模型,并通过模型刻度进行改进。此外,我们确定了广泛转移的案例,即使任务不相交,对一个任务进行修补也会提高其他任务的准确性。最后,我们研究了超出常见基准的应用程序,例如计数或减少印刷攻击对剪辑的影响。我们的发现表明,可以扩展一组任务集,开放式摄影模型可实现高精度,而无需从头开始重新训练它们。
translated by 谷歌翻译