Many state-of-the-art deep learning models for computer vision tasks are based on the transformer architecture. Such models can be computationally expensive and are typically statically configured for a given deployment scenario. However, in real-time applications, the resources available for each inference can vary considerably and may be smaller than what state-of-the-art models require. We can use dynamic models to adapt model execution to meet real-time application resource constraints. While prior work on dynamic models has primarily focused on CNNs and early transformer models such as BERT, and on minimizing resource utilization for less complex input images while maintaining accuracy, we adapt vision transformers to meet the system's dynamic resource constraints, independent of the input image. We find that, unlike early transformer models, recent state-of-the-art vision transformers rely heavily on convolution layers. We show that pretrained models are fairly resilient to skipping computation in the convolution and self-attention layers, enabling us to create a low-overhead system for dynamic real-time inference without additional training. Finally, we create an optimized accelerator for these dynamic vision transformers in a 5nm technology. The PE array occupies 2.26mm$^2$ and is 17 times faster than an NVIDIA TITAN V GPU for state-of-the-art transformer-based models for semantic segmentation.
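The past year has witnessed rapid development in applying the Transformer module to vision problems. While some researchers have demonstrated that Transformer-based models enjoy a favorable ability to fit data, there is still growing evidence that these models suffer from over-fitting, especially when the training data is limited. This paper offers an empirical study that performs step-by-step operations to gradually transition a Transformer-based model into a convolution-based model. The results we obtain during the transition process deliver useful messages for improving visual recognition. Based on these observations, we propose a new architecture named Visformer, which is abbreviated from "Vision-friendly Transformer". With the same computational complexity, Visformer outperforms both the Transformer-based and the convolution-based models in terms of ImageNet classification accuracy, and the advantage becomes more significant when the model complexity is lower or the training set is smaller. The code is available at https://github.com/danczs/visformer.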
Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns; while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x, 86.0x, 10.1x, and 6.8x over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively.
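Self-attention based models such as vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision. Despite increasingly stronger variants with ever-higher recognition accuracies, existing ViTs are typically demanding in computation and model size because of the quadratic complexity of self-attention. Although several successful design choices of prior CNNs (e.g., convolutions and a hierarchical multi-stage structure) have been reintroduced into recent ViTs, they are still not sufficient to meet the limited resource requirements of mobile devices. This has motivated a very recent attempt to develop light ViTs based on the state-of-the-art MobileNet-v2, but a performance gap remains. In this work, pushing further along this under-studied direction, we introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs in the trade-off between accuracy and on-device efficiency. This is realized by introducing a highly cost-effective local-global-local (LGL) information exchange bottleneck based on an optimal integration of self-attention and convolutions. For device-dedicated evaluation, rather than relying on inaccurate proxies such as the number of FLOPs or parameters, we adopt a practical approach that focuses directly on on-device latency and, for the first time, energy efficiency. Specifically, we show that our models are Pareto-optimal when both accuracy-latency and accuracy-energy trade-offs are considered, achieving strict dominance over other ViTs in almost all cases and competing with the most efficient CNNs. Code is available at https://github.com/saic-fi/edgevit.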
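In this paper, we present a new approach to model acceleration that exploits spatial sparsity in visual data. We observe that the final prediction in vision Transformers is based only on a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically, conditioned on the input, to accelerate vision Transformers. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. While the framework is inspired by our observation of sparse attention in vision Transformers, we find that the idea of adaptive and asymmetric computation can be a general solution for accelerating various architectures. We extend our method to hierarchical models, including CNNs and hierarchical vision Transformers, as well as to more complex dense prediction tasks that require structured feature maps, by formulating a more generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations. By applying lightweight fast paths to less informative features and more expressive slow paths to more important locations, we can maintain the structure of the feature maps while significantly reducing the overall computation. Extensive experiments demonstrate the effectiveness of our framework on various modern architectures and different visual recognition tasks. Our results clearly demonstrate that dynamic spatial sparsification offers a new and more effective dimension for model acceleration. Code is available at https://github.com/raoyongming/dynamicvit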
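The hierarchical pruning idea in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the importance scores come from a caller-supplied function (standing in for the learned lightweight prediction module), and the names `keep_top_tokens`, `hierarchical_prune`, `score_fn`, and the keep ratios are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def keep_top_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the positions of the highest-scoring tokens, in original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not scores:
        return []
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    # Rank positions by score, highest first; the stable sort keeps the
    # earlier token when two scores tie.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(ranked[:n_keep])


def hierarchical_prune(
    token_ids: Sequence[int],
    score_fn: Callable[[Sequence[int]], Sequence[float]],
    keep_ratios: Sequence[float],
) -> List[int]:
    """Prune at several stages; each stage re-scores only the survivors."""
    kept = list(token_ids)
    for ratio in keep_ratios:
        scores = score_fn(kept)  # hypothetical stand-in for the learned predictor
        kept = [kept[i] for i in keep_top_tokens(scores, ratio)]
    return kept
```

With two stages that each keep half of the tokens, 16 input tokens shrink to 8 and then to 4, so later layers process a quarter of the original sequence. A real system would also need a differentiable relaxation of the hard top-k selection during training, which this sketch omits.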
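Dense computer vision tasks such as object detection and segmentation require effective multi-scale feature representations for detecting or classifying objects or regions of varying sizes. While convolutional neural networks (CNNs) have been the dominant architectures for such tasks, recently introduced Vision Transformers (ViTs) aim to replace them as the backbone. Similar to CNNs, ViTs build a simple multi-stage structure (i.e., fine-to-coarse) for multi-scale representation with single-scale patches. In this work, taking a different perspective from existing Transformers, we explore multi-scale patch embedding and a multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds features of the same size (i.e., sequence length) with patches of different scales simultaneously by using overlapping convolutional patch embedding. Tokens of different scales are then independently fed into the Transformer encoders via multiple paths, and the resulting features are aggregated, enabling both fine and coarse feature representations at the same feature level. Thanks to the diverse, multi-scale feature representations, our MPViTs, scaling from tiny (5M) to base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation. These extensive results demonstrate that MPViT can serve as a versatile backbone network for various vision tasks. Code will be made publicly available at https://git.io/mpvit.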
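Fine-tuning a pretrained backbone in the encoder part of an image transformer network has been the traditional approach to semantic segmentation. However, such an approach leaves out the semantic context that an image provides during the encoding stage. This paper argues that incorporating semantic information about the image into pretrained hierarchical transformer-based backbones during fine-tuning improves performance considerably. To achieve this, we propose SeMask, a simple and effective framework that incorporates semantic information into the encoder with the help of a semantic attention operation. In addition, we use a lightweight semantic decoder during training to provide supervision for the intermediate semantic prior maps at every stage. Our experiments demonstrate that incorporating semantic priors enhances the performance of established hierarchical encoders with only a slight increase in the number of FLOPs. We provide empirical proof by integrating SeMask into each variant of the Swin Transformer as our encoder, paired with different decoders. Our framework achieves a new state of the art of 58.22% mIoU on the ADE20K dataset and improvements of over 3% in the mIoU metric on the Cityscapes dataset. The code and checkpoints are publicly available at https://github.com/picsart-ai-research/semask-egation.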
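We present Mobile-Former, a parallel design of MobileNet and Transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet at local processing and of the Transformer at global interaction, and the bridge enables bidirectional fusion of local and global features. Unlike recent work on vision transformers, the transformer in Mobile-Former contains very few tokens (e.g., 6 or fewer) that are randomly initialized to learn global priors, resulting in low computational cost. Combined with the proposed light-weight cross attention that models the bridge, Mobile-Former is not only computationally efficient but also has more representation power. It outperforms MobileNetV3 in the low-FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, Mobile-Former achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 while saving 17% of the computation. When transferred to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP in the RetinaNet framework. Furthermore, we build an efficient end-to-end detector by replacing the backbone, encoder, and decoder in DETR with Mobile-Former; it outperforms DETR by 1.1 AP while saving 52% of the computational cost and 36% of the parameters.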
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes, which leads to decreased performance when the testing resolution differs from the training resolution. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, thus combining both local attention and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5× smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on the Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.
ous vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance segmentation, and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinaNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.
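Recently, the vision transformer has achieved great success by pushing the state of the art on various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, in which the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been shown to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to multi-head self-attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Plugging in our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is used as the backbone network, it shows substantial superiority in various vision tasks compared with previous CNN-based and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/p2t.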
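To make the sequence-length argument concrete, the following back-of-the-envelope sketch counts how many key/value tokens remain after pooling a feature map at several ratios, and how many attention-score entries that implies. It is only an illustration: the function names, the ceiling rounding of pooled sizes, and the example pooling ratios are assumptions, not values confirmed by the abstract.

```python
import math
from typing import Sequence, Tuple


def pooled_sequence_length(height: int, width: int, pool_ratios: Sequence[int]) -> int:
    """Key/value tokens left after pooling an H x W map at each ratio.

    Each ratio r is assumed to yield a ceil(H/r) x ceil(W/r) map; the maps
    from all ratios are flattened and concatenated into one sequence.
    """
    return sum(math.ceil(height / r) * math.ceil(width / r) for r in pool_ratios)


def attention_score_counts(
    height: int, width: int, pool_ratios: Sequence[int]
) -> Tuple[int, int]:
    """Return (full self-attention entries, pooled-attention entries) per head."""
    n_queries = height * width
    n_pooled = pooled_sequence_length(height, width, pool_ratios)
    return n_queries * n_queries, n_queries * n_pooled


if __name__ == "__main__":
    # Illustrative setting (assumption): a 56 x 56 map and four pooling ratios.
    full, pooled = attention_score_counts(56, 56, [12, 16, 20, 24])
    print(f"full: {full}, pooled: {pooled}, reduction: {full / pooled:.1f}x")
```

Queries keep the full resolution, so the cost becomes linear in the number of image tokens times a small pooled length instead of quadratic in the number of image tokens.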
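Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer into a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques to boost detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing transformer-based object detectors, and achieves 49.2 AP owing to its high scalability for large models. We will release the code and trained models at https://github.com/naver-ai/vidt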
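Recent advances in vision transformers (ViTs) have achieved great performance in visual recognition tasks. Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations, but these networks are spatially local. ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices. In this paper, we propose cross feature attention (XFA) to bring down the computation cost of transformers, and combine it with efficient mobile CNNs to form a novel, efficient, light-weight CNN-ViT hybrid model, XFormer, which can serve as a general-purpose backbone for learning both global and local representations. Experimental results show that XFormer outperforms numerous CNN-based and ViT-based models across different tasks and datasets. On the ImageNet-1K dataset, XFormer achieves a top-1 accuracy of 78.5% with 5.5 million parameters, which is 2.2% and 6.3% more accurate than EfficientNet-B0 (CNN-based) and DeiT (ViT-based), respectively, for a similar number of parameters. Our model also performs well when transferred to object detection and semantic segmentation tasks. On the MS COCO dataset, XFormer gains 10.5 AP (22.7 -> 33.2 AP) in the YOLOv3 framework with only 6.3M parameters and 3.8G FLOPs. On the Cityscapes dataset, with only a simple all-MLP decoder, XFormer achieves an mIoU of 78.5 at 15.3 FPS, surpassing state-of-the-art lightweight segmentation networks.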
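Visual representation learning is the key to solving various vision problems. Relying on the seminal grid structure priors, convolutional neural networks (CNNs) have been the de facto standard architecture of most deep vision models. For instance, classical semantic segmentation methods often adopt a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on increasing the receptive field, through either dilated (i.e., atrous) convolutions or inserted attention modules. However, the FCN-based architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating visual representation learning as a sequence-to-sequence prediction task. Specifically, we deploy a pure Transformer to encode an image as a sequence of patches, without local convolution and resolution reduction. With the global context modeled in every layer of the Transformer, stronger visual representations can be learned to better tackle vision tasks. In particular, our segmentation model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, the first position on the test leaderboard on the day of submission) and Pascal Context (55.83% mIoU), and reaches competitive results on Cityscapes. Furthermore, we formulate a family of Hierarchical Local-Global (HLG) Transformers characterized by local attention within windows and global attention across windows in a hierarchical and pyramidal architecture. Extensive experiments show that our method achieves appealing performance on a variety of visual recognition tasks (e.g., image classification, object detection, instance segmentation, and semantic segmentation).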
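Recent progress in vision Transformers has brought great success in various tasks, driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range, and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv), which performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $\textit{g}^\textit{n}$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on this operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers by a significant margin with a similar overall architecture and training configuration. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from its effectiveness in visual encoders, we also show that $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that $\textit{g}^\textit{n}$Conv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/hornet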
Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed, and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully devised yet simple spatial attention mechanism performs favorably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks including image-level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is available at: https://git.io/Twins.
Image segmentation is often ambiguous at the level of individual image patches and requires contextual information to reach label consensus. In this paper we introduce Segmenter, a transformer model for semantic segmentation. In contrast to convolution-based methods, our approach models global context already at the first layer and throughout the network. We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation. To do so, we rely on the output embeddings corresponding to image patches and obtain class labels from these embeddings with a point-wise linear decoder or a mask transformer decoder. We leverage models pre-trained for image classification and show that we can fine-tune them on the moderate-sized datasets available for semantic segmentation. The linear decoder already yields excellent results, but the performance can be further improved by a mask transformer generating class masks. We conduct an extensive ablation study to show the impact of the different parameters; in particular, performance is better for large models and small patch sizes. Segmenter attains excellent results for semantic segmentation. It outperforms the state of the art on both the ADE20K and Pascal Context datasets and is competitive on Cityscapes.
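We revisit existing excellent Transformers from the perspective of practical application. Most of them are not even as efficient as the basic ResNet series and deviate from realistic deployment scenarios. This may be because the current criteria for measuring computational efficiency, such as FLOPs or parameters, are one-sided, sub-optimal, and hardware-insensitive. This paper therefore directly treats the TensorRT latency on specific hardware as the efficiency metric, which provides more comprehensive feedback involving computational capacity, memory cost, and bandwidth. Based on a series of controlled experiments, this work derives four practical guidelines for TensorRT-oriented and deployment-friendly network design, e.g., early CNN and late Transformer at the stage level, and early Transformer and late CNN at the block level. Accordingly, a family of TensorRT-oriented Transformers is presented, abbreviated as TRT-ViT. Extensive experiments demonstrate that TRT-ViT significantly outperforms existing ConvNets and vision Transformers across diverse visual tasks, e.g., image classification, object detection, and semantic segmentation. For example, at 82.7% ImageNet-1k top-1 accuracy, TRT-ViT is 2.7$\times$ faster than CSWin and 2.0$\times$ faster than Twins. On the MS-COCO object detection task, TRT-ViT achieves performance comparable to Twins while increasing the inference speed by 2.8$\times$.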
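It is commonly believed that high internal resolution combined with expensive operations (e.g., atrous convolutions) is necessary for accurate semantic segmentation, resulting in slow speed and large memory usage. In this paper, we question this belief and demonstrate that neither high internal resolution nor atrous convolutions are necessary. Our intuition is that although segmentation is a dense per-pixel prediction task, the semantics of each pixel often depend on both nearby neighbors and far-away context; therefore, a more powerful multi-scale feature fusion network plays a critical role. Following this intuition, we revisit the conventional multi-scale feature space (typically capped at P5) and extend it to a much richer space, up to P9, where the smallest features are only 1/512 of the input size and thus have very large receptive fields. To process such a rich feature space, we leverage the recent BiFPN to fuse the multi-scale features. Based on these insights, we develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions. Perhaps surprisingly, our simple method achieves better accuracy at faster speed than prior art across multiple datasets. In real-time settings, ESeg-Lite-S achieves 76.0% mIoU on CityScapes [12] at 189 FPS, outperforming FasterSeg [9] (73.1% mIoU at 170 FPS). Our ESeg-Lite-L runs at 79 FPS and achieves 80.1% mIoU, largely closing the gap between real-time and high-performance segmentation models.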
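The "1/512 of the input size" figure follows from the usual feature-pyramid convention that level P*k* has stride 2^k. The short sketch below works through that arithmetic; the function name `pyramid_feature_sizes`, the ceiling rounding, and the example input resolution are illustrative assumptions rather than details from the abstract.

```python
from typing import Dict, Tuple


def pyramid_feature_sizes(
    height: int, width: int, min_level: int, max_level: int
) -> Dict[str, Tuple[int, int]]:
    """Spatial size of each pyramid level, assuming level k has stride 2**k."""
    sizes = {}
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        # Ceiling division, so inputs that are not multiples of the stride
        # still produce at least one feature position.
        sizes[f"P{level}"] = (-(-height // stride), -(-width // stride))
    return sizes


if __name__ == "__main__":
    # Illustrative input (assumption): a 1024 x 2048 image.
    for name, (h, w) in pyramid_feature_sizes(1024, 2048, 5, 9).items():
        print(f"{name}: {h} x {w}")
```

At P9 a 1024 x 2048 input is reduced to a 2 x 4 grid, which is why each of those features can summarize context from a very large region of the image at little cost.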