Smart Paper Notes

POViT: Vision Transformer for Multi-objective Design and Characterization of Nanophotonic Devices

Xinyu Chen , Renjie Li , Yueyao Yu , Yuanwen Shen , Wenye Li , Zhaoyu Zhang , Yin Zhang

Category: Machine Learning

2022-05-17

We solve a fundamental challenge in semiconductor IC design: the fast and accurate characterization of nanoscale photonic devices. Much like the fusion between AI and EDA, many efforts have been made to apply DNNs such as convolutional neural networks (CNNs) to prototype and characterize next-gen optoelectronic devices commonly found in photonic integrated circuits (PICs) and LiDAR. These prior works generally strive to predict the quality factor (Q) and modal volume (V) of, for instance, photonic crystals, with ultra-high accuracy and speed. However, state-of-the-art models are still far from being directly applicable in the real world: e.g., the correlation coefficient of V ($V_{coeff}$) is only about 80%, which is much lower than what it takes to generate reliable and reproducible nanophotonic designs. Recently, attention-based transformer models have attracted extensive interest and been widely used in CV and NLP. In this work, we propose the first-ever Transformer model (POViT) to efficiently design and simulate semiconductor photonic devices with multiple objectives. Unlike the standard Vision Transformer (ViT), we supply photonic crystals as data input and change the activation layer from GELU to an absolute-value function (ABS). Our experiments show that POViT significantly exceeds the results reported for previous models. The correlation coefficient $V_{coeff}$ increases by over 12% (i.e., to 92.0%) and the prediction error of Q is reduced by an order of magnitude, among several other key metric improvements. Our work has the potential to drive the expansion of EDA to fully automated photonic design. The complete dataset and code will be released to aid researchers working in the interdisciplinary field of physics and computer science.
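
The activation swap mentioned in the abstract (GELU replaced by an absolute-value function) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the helper names `gelu`, `abs_act`, and `mlp_block` are hypothetical, biases are omitted, and the toy weights are not taken from POViT.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def abs_act(x: float) -> float:
    """Absolute-value activation (ABS)."""
    return abs(x)


def mlp_block(x, w1, w2, activation):
    """Toy two-layer MLP: linear -> activation -> linear (biases omitted).

    w1 and w2 are lists of weight rows, one row per output unit.
    """
    hidden = [activation(sum(xi * wi for xi, wi in zip(x, row))) for row in w1]
    return [sum(hi * wi for hi, wi in zip(hidden, row)) for row in w2]


# Hypothetical toy input and weights, chosen only to make the difference visible.
x = [1.0, -2.0]
w1 = [[1.0, 0.0], [0.0, 1.0]]  # identity, so hidden pre-activations equal x
w2 = [[1.0, 1.0]]              # sums the two hidden units

out_abs = mlp_block(x, w1, w2, abs_act)
out_gelu = mlp_block(x, w1, w2, gelu)
```

With ABS the negative pre-activation contributes its full magnitude, whereas GELU suppresses it almost entirely; only the activation argument changes between the two calls.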

translated by Google Translate

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy , Lucas Beyer , Alexander Kolesnikov , Dirk Weissenborn , Xiaohua Zhai , Thomas Unterthiner , Mostafa Dehghani , Matthias Minderer , Georg Heigold , Sylvain Gelly

Category:

2020-10-22

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
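
The idea of feeding a transformer "sequences of image patches" can be sketched as follows. This is a simplified illustration under stated assumptions: a single-channel image stored as a list of rows, a hypothetical helper named `patchify`, and no learned linear projection or position embedding (both of which a real ViT adds afterwards).

```python
def patchify(image, patch):
    """Split an H x W grid into flattened patch x patch tiles, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# Toy 4 x 4 "image" whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)  # 4 tokens, each a flattened 2 x 2 patch
```

By the same arithmetic, a 224 x 224 image cut into 16 x 16 patches yields (224 / 16)^2 = 196 tokens.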

A lightweight Transformer-based model for fish landmark detection

Alzayat Saleh , David Jones , Dean Jerry , Mostafa Rahimi Azghadi

Category: Computer Vision

2022-09-13

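Transformer-based models, such as the Vision Transformer (ViT), can outperform convolutional neural networks (CNNs) in some vision tasks when there is sufficient training data. However, CNNs have a strong and useful inductive bias for vision tasks (i.e., translation equivariance and locality). In this work, we develop a novel model architecture that we call the Mobile Fish Landmark Detection Network (MFLD-Net). We have built this model using convolution operations based on ViT (i.e., patch embeddings and multi-layer perceptrons). MFLD-Net can achieve competitive or better results while being lightweight, and is therefore suitable for embedded and mobile devices. Furthermore, we show that MFLD-Net can achieve keypoint (landmark) estimation accuracy on par with, or even better than, some state-of-the-art CNNs on a fish image dataset. Additionally, unlike ViT, MFLD-Net does not need a pre-trained model and generalizes well when trained on a small dataset. We provide quantitative and qualitative results that demonstrate the model's generalization capabilities. This work will provide a foundation for future efforts to develop mobile yet efficient fish monitoring systems and devices.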

A Data-scalable Transformer for Medical Image Segmentation: Architecture, Model Efficiency, and Benchmark

Yunhe Gao , Mu Zhou , Di Liu , Zhennan Yan , Shaoting Zhang , Dimitris N. Metaxas

Category: Computer Vision

2022-02-28

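Transformers, as a new generation of neural architectures, have demonstrated remarkable performance in natural language processing and computer vision. However, existing vision Transformers struggle to learn from limited medical data and are unable to generalize across diverse medical image tasks. To tackle these challenges, we present Medformer as a data-scalable Transformer for generalizable medical image segmentation. The key designs incorporate desirable inductive bias, hierarchical modeling with linear-complexity attention, and multi-scale feature fusion in a spatially and semantically global manner. Medformer can learn from tiny- to large-scale data without pre-training. Extensive experiments demonstrate the potential of Medformer as a general segmentation backbone, outperforming CNNs and vision Transformers on three public datasets with multiple modalities (e.g., CT and MRI) and diverse medical targets (e.g., healthy organs, diseased tissue, and tumors). We make the models and evaluation pipeline publicly available, offering solid baselines and unbiased comparisons to promote a wide range of downstream clinical applications.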

ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human Activity Recognition in Videos

James Wensel , Hayat Ullah , Arslan Munir , Erik Blasch

Category: Computer Vision

2022-08-16

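Human activity recognition is an emerging and important area in computer vision that seeks to determine the activity an individual or group of individuals is performing. Applications of this field range from generating highlight videos in sports to intelligent surveillance and gesture recognition. Most activity recognition systems rely on a combination of convolutional neural networks (CNNs) to perform feature extraction from the data and recurrent neural networks (RNNs) to determine the time-dependent nature of the data. This paper proposes and designs two transformer neural networks for human activity recognition: a recurrent transformer (ReT), a specialized neural network used to make predictions on sequences of data, and a vision transformer (ViT), a transformer optimized for extracting salient features from images, to improve the speed and scalability of activity recognition. We provide an extensive comparison of the proposed transformer neural networks with contemporary CNN- and RNN-based human activity recognition models in terms of speed and accuracy.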

SwinCheX: Multi-label classification on chest X-ray images with transformers

Sina Taslimi , Soroush Taslimi , Nima Fathi , Mohammadreza Salehi , Mohammad Hossein Rohban

Category: Computer Vision

2022-06-09

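Given the considerable growth in the use of chest X-ray images for diagnosing various diseases, as well as the collection of extensive datasets, automated diagnosis using deep neural networks has occupied the minds of experts. Most available methods in computer vision use a CNN backbone to achieve high accuracy on classification problems. Nevertheless, recent research shows that transformers, established as the de facto method in NLP, can also outperform many CNN-based models in vision. This paper proposes a multi-label classification deep model based on the Swin Transformer as the backbone to achieve state-of-the-art diagnosis classification. It uses a multi-layer perceptron (MLP) for the head architecture. We evaluate our model on one of the largest and most widely used X-ray datasets, "Chest X-ray14", which comprises more than 100,000 frontal/back-view images from over 30,000 patients with 14 well-known chest diseases. Our model has been tested with several numbers of MLP layers in the head, each of which achieves a competitive AUC score on all classes. Comprehensive experiments on Chest X-ray14 show that a 3-layer head attains state-of-the-art performance with an average AUC score of 0.810, compared to the former SOTA average AUC of 0.799. We propose an experimental setup for the fair benchmarking of existing methods, which could be used as a basis for future studies. Finally, we follow up our results by confirming that the proposed method attends to the pathologically relevant areas of the chest.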

Assessing the Impact of Attention and Self-Attention Mechanisms on the Classification of Skin Lesions

Rafael Pedro , Arlindo L. Oliveira

Category: Computer Vision | Artificial Intelligence | Machine Learning

2021-12-23

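Attention mechanisms have raised significant interest in the research community, since they promise to improve the performance of neural network architectures. However, for any specific problem, we still lack a principled way to choose the specific mechanisms and hyper-parameters that lead to guaranteed improvements. More recently, self-attention has been proposed and widely used in transformer-like architectures, leading to significant breakthroughs in some applications. In this work we focus on two forms of attention mechanisms: attention modules and self-attention. Attention modules are used to reweight the features of each layer's input tensor. Different modules have different ways of performing this reweighting in fully connected or convolutional layers. The attention models studied are completely modular, and in this work they are used with the popular ResNet architecture. Self-attention, originally proposed in the area of natural language processing, makes it possible to relate all the items in an input sequence to one another. Self-attention is becoming increasingly popular in computer vision, where it is sometimes combined with convolutional layers, although some recent architectures do away with convolutions entirely. In this work, we study and perform an objective comparison of a number of different attention mechanisms in a specific computer vision task, the classification of samples in the widely used Skin Cancer MNIST dataset. The results show that attention modules do sometimes improve the performance of convolutional neural network architectures, but also that this improvement, although noticeable and statistically significant, is not consistent across different settings. The results obtained with self-attention mechanisms, on the other hand, show consistent and significant improvements, leading to the best results even in architectures with a reduced number of parameters.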

Learned Queries for Efficient Local Attention

Moab Arar , Ariel Shamir , Amit H. Bermano

Category: Computer Vision

2021-12-21

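Vision Transformers (ViT) serve as powerful vision models. Unlike convolutional neural networks, which dominated vision research in previous years, vision transformers enjoy the ability to capture long-range dependencies in the data. Nonetheless, an integral part of any transformer architecture, the self-attention mechanism, suffers from high latency and inefficient memory utilization, making it less suitable for high-resolution input images. To alleviate these shortcomings, hierarchical vision models apply self-attention locally on non-interleaving windows. This relaxation reduces the complexity with respect to the input size; however, it limits cross-window interaction, hurting model performance. In this paper, we propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner, much like convolutions. The key idea behind QnA is to introduce learned queries, which allow a fast and efficient implementation. We verify the effectiveness of our layer by incorporating it into a hierarchical vision transformer model. We show improvements in speed and memory complexity while achieving accuracy comparable to state-of-the-art models. Finally, our layer scales especially well with window size, requiring up to 10x less memory while also being faster than existing methods.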

Transfer Learning and Vision Transformer based State-of-Health prediction of Lithium-Ion Batteries

Pengyu Fu , Liang Chu , Zhuoran Hou , Jincheng Hu , Yanjun Huang , Yuanjian Zhang

Category: Computer Vision | Artificial Intelligence

2022-09-07

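In recent years, significant progress has been made in transportation electrification. As the main energy storage devices, lithium-ion batteries (LIBs) have received widespread attention. Accurately predicting the state of health (SOH) can not only ease users' anxiety about battery life but also provide important information for battery management. This paper presents an SOH prediction method based on the Vision Transformer (ViT) model. First, discrete charging data from a predefined voltage range is used as the input data matrix. Then, the cycle features of the battery are captured by the ViT, which can obtain global features, and the SOH is obtained by combining the cycle features with a fully connected (FC) layer. In addition, transfer learning (TL) is introduced: the prediction model trained on the source-task battery is further fine-tuned on the early-cycle data of the target-task battery to provide an accurate prediction. Experiments show that, compared with existing deep learning methods, our method obtains better feature representations and therefore achieves better prediction and transfer performance.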

A novel time-frequency Transformer based on self-attention mechanism and its application in fault diagnosis of rolling bearings

Yifei Ding , Minping Jia , Qiuhua Miao , Yudong Cao

Category: Artificial Intelligence | Machine Learning

2021-04-19

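The scope of data-driven fault diagnosis models has been greatly extended by deep learning (DL). However, classical convolutional and recurrent structures have shortcomings in computational efficiency and feature representation, while the latest Transformer architecture based on the attention mechanism has not yet been applied in this field. To solve these problems, we propose a novel time-frequency Transformer (TFT) model inspired by the massive success of the vanilla Transformer in sequence processing. In particular, we design a new tokenizer and encoder module to extract effective abstractions from the time-frequency representation (TFR) of vibration signals. On this basis, this paper presents a new end-to-end fault diagnosis framework based on the time-frequency Transformer. Through case studies on bearing experimental datasets, we construct the optimal Transformer structure and verify its fault diagnosis performance. The superiority of the proposed method is demonstrated in comparison with benchmark models and other state-of-the-art methods.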

Towards Efficient Adversarial Training on Vision Transformers

Boxi Wu , Jindong Gu , Zhifeng Li , Deng Cai , Xiaofei He , Wei Liu

Category: Computer Vision

2022-07-21

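The Vision Transformer (ViT), as a powerful alternative to the convolutional neural network (CNN), has received much attention. Recent work has shown that ViTs, like CNNs, are vulnerable to adversarial examples. To build robust ViTs, an intuitive approach is to apply adversarial training, since it has been shown to be one of the most effective ways to obtain robust CNNs. However, one major limitation of adversarial training is its heavy computational cost. The self-attention mechanism adopted by ViTs is a computationally intensive operation whose cost increases quadratically with the number of input patches, making adversarial training on ViTs even more time-consuming. In this work, we first comprehensively study fast adversarial training on a variety of vision transformers and illustrate the relationship between efficiency and robustness. Then, to speed up adversarial training on ViTs, we propose an efficient attention-guided adversarial training mechanism. Specifically, relying on the properties of self-attention, we actively remove certain patch embeddings at each layer with an attention-guided dropping strategy during adversarial training. The slimmed self-attention modules significantly accelerate adversarial training on ViTs. With only 65% of the fast adversarial training time, we match the state-of-the-art results on the challenging ImageNet benchmark.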
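
The quadratic growth of self-attention cost with the number of patches, and the effect of dropping patches, can be checked with a small sketch. The function names are hypothetical and the dropping rule shown here (keep the highest-scoring fraction of tokens) is an assumption made for illustration; the abstract does not specify the paper's exact policy.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs one self-attention layer must score."""
    return num_tokens * num_tokens


def drop_low_attention(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their original order."""
    keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(ranked)]


# Halving the number of tokens cuts the pairwise work to one quarter.
full_cost = attention_pairs(196)
half_cost = attention_pairs(98)

# Hypothetical attention scores for four patch tokens.
kept = drop_low_attention(["a", "b", "c", "d"], [0.1, 0.4, 0.3, 0.2], 0.5)
```

Because the cost is quadratic, removing a modest share of patch embeddings per layer reduces the attention work by a larger share, which is the intuition behind the speed-up described above.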

Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

Zhiying Lu , Hongtao Xie , Chuanbin Liu , Yongdong Zhang

Category: Computer Vision | Machine Learning

2022-10-12

There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is attributed to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, namely spatial relevance and diverse channel representation. First, on the spatial aspect, objects are locally compact and relevant, so fine-grained features need to be extracted from a token and its neighbors, yet the lack of data hinders ViTs from attending to spatial relevance. Second, on the channel aspect, representation exhibits diversity across different channels, but scarce data cannot enable ViTs to learn representations strong enough for accurate recognition. To this end, we propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution to enhance the two inductive biases. On the spatial aspect, we adopt a hybrid structure in which convolution is integrated into the patch embedding and multi-layer perceptron modules, forcing the model to capture token features as well as their neighboring features. On the channel aspect, we introduce a dynamic feature aggregation module in the MLP and a brand new "head token" design in the multi-head self-attention module to help re-calibrate channel representation and let different channel group representations interact with each other. The fusion of weak channel representations forms a representation strong enough for classification. With this design, we successfully eliminate the performance gap between CNNs and ViTs, and our DHVT achieves a series of state-of-the-art results with a lightweight model: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters. Code is available at https://github.com/ArieSeirack/DHVT.

Convolution-enhanced Evolving Attention Networks

Yujing Wang , Yaming Yang , Zhuo Li , Jiangang Bai , Mingliang Zhang , Xiangtai Li , Jing Yu , Ce Zhang , Gao Huang , Yunhai Tong

Category: Machine Learning | Natural Language Processing | Computer Vision | Neural and Evolutionary Computing

2022-12-16

Attention-based neural networks, such as Transformers, have become ubiquitous in numerous applications, including computer vision, natural language processing, and time-series analysis. In all kinds of attention networks, the attention maps are crucial as they encode semantic dependencies between input tokens. However, most existing attention networks perform modeling or reasoning based on representations, wherein the attention maps of different layers are learned separately without explicit interactions. In this paper, we propose a novel and generic evolving attention mechanism, which directly models the evolution of inter-token relationships through a chain of residual convolutional modules. The major motivations are twofold. On the one hand, the attention maps in different layers share transferable knowledge, thus adding a residual connection can facilitate the information flow of inter-token relationships across layers. On the other hand, there is naturally an evolutionary trend among attention maps at different abstraction levels, so it is beneficial to exploit a dedicated convolution-based module to capture this process. Equipped with the proposed mechanism, the convolution-enhanced evolving attention networks achieve superior performance in various applications, including time-series representation, natural language understanding, machine translation, and image classification. Especially on time-series representation tasks, Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly, achieving an average of 17% improvement compared to the best SOTA. To the best of our knowledge, this is the first work that explicitly models the layer-wise evolution of attention maps. Our implementation is available at https://github.com/pkuyym/EvolvingAttention

LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Ben Graham , Alaaeldin El-Nouby , Hugo Touvron , Pierre Stock , Armand Joulin , Hervé Jégou , Matthijs Douze

Category:

2021-04-02

We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https://github.com/facebookresearch/LeViT.

Sliced Recursive Transformer

Zhiqiang Shen , Zechun Liu , Eric Xing

Category: Computer Vision | Artificial Intelligence | Machine Learning

2021-11-09

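We present a neat yet effective recursive operation on vision transformers that improves parameter utilization without involving additional parameters. This is achieved by sharing weights across the depth of the transformer network. The proposed method obtains a substantial gain (~2%) using only a naive recursive operation, requires no special or sophisticated knowledge of network design principles, and introduces minimal computational overhead to the training procedure. To reduce the additional computation caused by the recursive operation while maintaining superior accuracy, we introduce an approximation method based on multiple sliced group self-attentions across recursive layers, which can reduce the cost by 10~30% with minimal performance loss. We call our model the Sliced Recursive Transformer (SReT); it is compatible with a broad range of other designs for efficient vision transformers. Our best model establishes a significant improvement on ImageNet over state-of-the-art methods while containing fewer parameters. The proposed sliced recursive operation allows us to build a transformer with more than 100 or even 1000 layers while keeping a small size (13~15M), avoiding the optimization difficulties that arise when the model is too large. This flexible scalability shows great potential for scaling up and constructing extremely deep and large-dimension vision transformers. Our code and models are available at https://github.com/szq0214/sret.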
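
Sharing weights across depth means applying the same block more than once, so depth grows while the parameter count does not. The sketch below is a toy illustration under loud assumptions: an affine map stands in for a transformer block, and `make_block` and `run_stacked` are hypothetical names, not part of the released code.

```python
def make_block(scale: float, shift: float):
    """Return a toy 'layer' (an affine map) together with its parameter count."""
    def block(x):
        return [scale * v + shift for v in x]
    return block, 2  # two parameters: scale and shift


def run_stacked(x, blocks):
    """Apply the blocks in order; repeated entries reuse the same weights."""
    for block in blocks:
        x = block(x)
    return x


shared, shared_params = make_block(0.5, 1.0)

# Recursive use: one block applied twice gives depth 2 with only 2 parameters.
recursive_out = run_stacked([2.0, 4.0], [shared, shared])
recursive_params = shared_params

# Conventional stack: two independent blocks give depth 2 with 4 parameters.
second, second_params = make_block(0.5, 1.0)
stacked_params = shared_params + second_params
```

The sliced group self-attention used to offset the extra compute of recursion is not modeled here.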

Transformer-based Hand Gesture Recognition via High-Density EMG Signals: From Instantaneous Recognition to Fusion of Motor Unit Spike Trains

Mansooreh Montazerin , Elahe Rahimian , Farnoosh Naderkhani , S. Farokh Atashzar , Svetlana Yanushkevich , Arash Mohammadi

Category: Machine Learning

2022-11-29

Designing efficient and labor-saving prosthetic hands requires powerful hand gesture recognition algorithms that can achieve high accuracy with limited complexity and latency. In this context, the paper proposes a compact deep learning framework referred to as the CT-HGR, which employs a vision transformer network to conduct hand gesture recognition using high-density sEMG (HD-sEMG) signals. The attention mechanism in the proposed model identifies similarities among different data segments with a greater capacity for parallel computations and addresses the memory limitation problems while dealing with inputs of large sequence lengths. CT-HGR can be trained from scratch without any need for transfer learning and can simultaneously extract both temporal and spatial features of HD-sEMG data. Additionally, the CT-HGR framework can perform instantaneous recognition using an sEMG image spatially composed from HD-sEMG signals. A variant of the CT-HGR is also designed to incorporate microscopic neural drive information in the form of Motor Unit Spike Trains (MUSTs) extracted from HD-sEMG signals using Blind Source Separation (BSS). This variant is combined with its baseline version via a hybrid architecture to evaluate the potential of fusing macroscopic and microscopic neural drive information. The utilized HD-sEMG dataset involves 128 electrodes that collect the signals related to 65 isometric hand gestures of 20 subjects. The proposed CT-HGR framework is applied to window sizes of 31.25, 62.5, 125, and 250 ms of the above-mentioned dataset, utilizing 32, 64, and 128 electrode channels. The average accuracy over all the participants using 32 electrodes and a window size of 31.25 ms is 86.23%, which gradually increases to 91.98% for 128 electrodes and a window size of 250 ms. The CT-HGR achieves an accuracy of 89.13% for instantaneous recognition based on a single frame of HD-sEMG image.

Vision Transformers: State of the Art and Research Challenges

Bo-Kai Ruan , Hong-Han Shuai , Wen-Huang Cheng

Category: Computer Vision

2022-07-07

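Transformers have achieved great success in natural language processing. Owing to the powerful capability of the self-attention mechanism in transformers, researchers have developed vision transformers for a variety of computer vision tasks, such as image recognition, object detection, image segmentation, pose estimation, and 3D reconstruction. This paper presents a comprehensive overview of the literature on different architecture designs and training tricks (including self-supervised learning) for vision transformers. Our goal is to provide a systematic review together with open research opportunities.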

Augmenting Convolutional networks with attention-based aggregation

Hugo Touvron , Matthieu Cord , Alaaeldin El-Nouby , Piotr Bojanowski , Armand Joulin , Gabriel Synnaeve , Hervé Jégou

Category: Computer Vision

2021-12-27

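We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling with an attention-based aggregation layer akin to a single transformer block, which weights how the patches are involved in the classification decision. We plug this learned aggregation layer into a simple patch-based convolutional network parametrized by 2 parameters (width and depth). In contrast with a pyramidal design, this architecture family maintains the input patch resolution across all layers. It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption, as shown by our experiments on various computer vision tasks: object classification, image segmentation, and detection.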
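
Replacing average pooling with attention-based aggregation can be sketched as a softmax-weighted sum over patch features. This is a simplified single-head illustration under stated assumptions: the names `average_pool` and `attention_pool` are hypothetical, the query vector stands in for a learned parameter, and the key/value projections and the rest of the transformer block described in the abstract are omitted.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def average_pool(patches):
    """Uniform mean over patch feature vectors."""
    count = len(patches)
    return [sum(column) / count for column in zip(*patches)]


def attention_pool(patches, query):
    """Weight each patch by softmax(query . patch / sqrt(dim)) and sum."""
    dim = len(query)
    scores = [sum(q * p for q, p in zip(query, patch)) / math.sqrt(dim)
              for patch in patches]
    weights = softmax(scores)
    pooled = [sum(w * patch[j] for w, patch in zip(weights, patches))
              for j in range(dim)]
    return pooled, weights


patches = [[1.0, 0.0], [0.0, 1.0]]

# A zero query scores every patch equally, which reduces to average pooling.
uniform_pooled, uniform_weights = attention_pool(patches, [0.0, 0.0])

# A query aligned with the first patch shifts almost all weight onto it.
focused_pooled, focused_weights = attention_pool(patches, [10.0, 0.0])
```

The returned weights are what make the aggregation inspectable: they indicate how strongly each patch contributes to the pooled representation.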

Efficient deep learning models for land cover image classification

Ioannis Papoutsis , Nikolaos-Ioannis Bountos , Angelos Zavras , Dimitrios Michail , Christos Tryfonopoulos

Category: Computer Vision

2021-11-18

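The availability of the sheer volume of Copernicus Sentinel imagery has created new opportunities for large-scale land use land cover (LULC) mapping using deep learning. Training on such large datasets, however, is a non-trivial task. In this work we experiment with the BigEarthNet dataset for LULC image classification and benchmark different state-of-the-art models, including convolutional neural networks, multi-layer perceptrons, vision transformers, EfficientNets, and Wide Residual Network (WRN) architectures. Our aim is to balance classification accuracy, training time, and inference rate. We propose a framework based on EfficientNets for compound scaling of WRNs in terms of network depth, width, and input data resolution, in order to efficiently train and test different model setups. We design a novel scaled WRN architecture enhanced with an efficient channel attention mechanism. Our proposed lightweight model has fewer trainable parameters, achieves a 4.5% higher average F-score classification accuracy across all 19 LULC classes, and trains two times faster than the state-of-the-art ResNet50 model that we use as a baseline. We provide more than 50 trained models, along with our code for distributed training on multiple GPU nodes.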

Escaping the Big Data Paradigm with Compact Transformers

Ali Hassani , Steven Walton , Nikhil Shah , Abulikemu Abuduweili , Jiachen Li , Humphrey Shi

Category: Computer Vision | Machine Learning

2021-04-12

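With the rise of Transformers as the standard for language processing, and their advancements in computer vision, there has been a corresponding growth in parameter size and amounts of training data. Many have come to believe that, because of this, transformers are not suitable for small sets of data. This trend leads to concerns such as the limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we aim to present an approach for small-scale learning by introducing Compact Transformers. We show for the first time that, with the right size and convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets. Our models are flexible in terms of model size and can have as few as 0.28M parameters while achieving competitive results. Our best model can reach 98% accuracy when trained from scratch on CIFAR-10 with only 3.7M parameters, which is a significant improvement in data efficiency over previous Transformer-based models: it is over 10x smaller than other transformers and 15% the size of ResNet50 while achieving similar performance. CCT also outperforms many modern CNN-based approaches, and even some NAS-based approaches. Additionally, we obtain a new SOTA on Flowers-102 with 99.76% top-1 accuracy, and improve upon the existing baseline on ImageNet (82.71% accuracy with 29% as many parameters as ViT), as well as on NLP tasks. Our simple and compact design for transformers makes them more feasible to study for those with limited computing resources and/or dealing with small datasets, while extending existing research efforts in data-efficient transformers. Our code and pre-trained models are publicly available at https://github.com/shi-labs/compact-transformers.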
