Smart Paper Notes

GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond

Yue Cao , Jiarui Xu , Stephen Lin , Fangyun Wei , Han Hu

Categories:

2019-04-25

The Non-Local Network (NLNet) presents a pioneering approach for capturing long-range dependencies by aggregating query-specific global context to each query position. However, through a rigorous empirical analysis, we have found that the global contexts modeled by the non-local network are almost the same for different query positions within an image. In this paper, we take advantage of this finding to create a simplified network based on a query-independent formulation, which maintains the accuracy of NLNet but with significantly less computation. We further observe that this simplified design shares a similar structure with the Squeeze-Excitation Network (SENet). Hence we unify them into a three-step general framework for global context modeling. Within the general framework, we design a better instantiation, called the global context (GC) block, which is lightweight and can effectively model the global context. The lightweight property allows us to apply it to multiple layers in a backbone network to construct a global context network (GCNet), which generally outperforms both the simplified NLNet and SENet on major benchmarks for various recognition tasks. The code and configurations are released at https://github.com/xvjiarui/GCNet.
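
The query-independent idea in the abstract above can be illustrated with a minimal NumPy sketch. It is not the released GC block: the key projection `w_k`, the softmax pooling, and the additive fusion are assumptions made for illustration, and the function name is hypothetical. The point is only that a single attention map is computed once and shared by every position, so the cost grows linearly with the number of positions instead of quadratically.

```python
import numpy as np

def query_independent_context(x, w_k):
    """Share one global context vector across all positions.

    x   : (N, C) features at N positions.
    w_k : (C,) projection producing one attention logit per position
          (hypothetical parameter, assumed for this sketch).
    Returns the fused features (N, C) and the shared weights (N,).
    """
    x = np.asarray(x, dtype=float)
    logits = x @ np.asarray(w_k, dtype=float)   # (N,), one score per position
    logits = logits - logits.max()              # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum()                    # a single map, not one per query
    context = weights @ x                       # (C,) global context vector
    return x + context, weights                 # assumed fusion: broadcast addition

x = np.arange(12, dtype=float).reshape(4, 3)
fused, weights = query_independent_context(x, np.zeros(3))
print(fused.shape)  # (4, 3)
```

With a zero projection the weights are uniform, so the context reduces to the plain average of the features, which is the squeeze step that SENet uses.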

Translated by Google Translate

Non-local neural networks

Categories:

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete with or outperform current competition winners on both the Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.
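
The sentence "computes the response at a position as a weighted sum of the features at all positions" can be made concrete with a short NumPy sketch. The abstract does not say how the weights are obtained, so the dot-product similarity followed by a softmax used here is an assumption, and the function name is hypothetical.

```python
import numpy as np

def non_local(x):
    """Response at each position = weighted sum of features at all positions.

    x : (N, C) features at N positions (e.g. a flattened feature map).
    The dot-product similarity and softmax normalisation are assumed
    choices for this sketch.
    """
    x = np.asarray(x, dtype=float)
    scores = x @ x.T                                      # (N, N) pairwise similarities
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)         # each row sums to 1
    return weights @ x                                    # (N, C) aggregated responses

print(non_local(np.ones((4, 3))).shape)  # (4, 3)
```

Because each row of weights sums to one, every output is a convex combination of the input features. The (N, N) score matrix is what makes the operation quadratic in the number of positions.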


SDA-$x$Net: Selective Depth Attention Networks for Adaptive Multi-scale Feature Representation

Qingbei Guo , Xiao-Jun Wu , Zhiquan Feng , Tianyang Xu , Cong Hu

Categories: Computer Vision

2022-09-21

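Existing multi-scale solutions risk merely increasing the receptive field size while neglecting small receptive fields. Effectively constructing adaptive neural networks that recognize objects at various spatial scales is therefore a challenging problem. To address it, we first introduce a new attention dimension, depth, in addition to existing attention dimensions such as channel, spatial, and branch, and propose a novel selective depth attention network to symmetrically handle multi-scale objects in various vision tasks. Specifically, the blocks within each stage of a given neural network, e.g., ResNet, output hierarchical feature maps that share the same resolution but have different receptive field sizes. Based on this structural property, we design a stage-wise building module, SDA, which includes a trunk branch and an SE-like attention branch. The block outputs of the trunk branch are fused to guide their depth attention allocation through the attention branch. With the proposed attention mechanism, we can dynamically select features from different depths, which helps to adaptively adjust the receptive field size for input objects of variable size. In this way, the cross-block information interaction yields long-range dependencies along the depth direction. Compared with other multi-scale approaches, our SDA method combines multiple receptive fields from previous blocks into the stage output, thus offering a wider and richer range of effective receptive fields. Moreover, our method can serve as a pluggable module for other multi-scale networks as well as attention networks, coined SDA-$x$Net. Their combination further extends the range of effective receptive fields and enables interpretable neural networks. Our source code is available at https://github.com/qingbeiguo/sda-xnet.git.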


GTA: Global Temporal Attention for Video Action Understanding

Bo He , Xitong Yang , Zuxuan Wu , Hao Chen , Ser-Nam Lim , Abhinav Shrivastava

Categories: Computer Vision

2020-12-15

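Self-attention learns pairwise interactions to model long-range dependencies, yielding great improvements for video action recognition. In this paper, we seek a deeper understanding of self-attention for temporal modeling in videos. We first show that the entangled modeling of spatio-temporal information by flattening all pixels is sub-optimal, as it does not explicitly capture temporal relationships among frames. To this end, we introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. We apply GTA on both pixels and semantically similar regions to capture temporal relationships at different levels of spatial granularity. Unlike conventional self-attention, which computes an instance-specific attention matrix, GTA directly learns a global attention matrix that is intended to encode temporal structures that generalize across different samples. We further augment GTA with a cross-channel multi-head design to exploit channel interactions for better temporal modeling. Extensive experiments on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.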


Semiconductor Defect Pattern Classification by Self-Proliferation-and-Attention Neural Network

YuanFu Yang , Min Sun

Categories: Computer Vision

2022-12-01

Semiconductor manufacturing is on the cusp of a revolution: the Internet of Things (IoT). With IoT we can connect all the equipment and feed information back to the factory so that quality issues can be detected. In this situation, more and more edge devices are used in wafer inspection equipment. These edge devices must be able to detect defects quickly. Therefore, the primary task is to develop a high-efficiency architecture for automatic defect classification that is suitable for edge devices. In this paper, we present a novel architecture that can perform defect classification in a more efficient way. The first function is self-proliferation, which uses a series of linear transformations to generate more feature maps at a lower cost. The second function is self-attention, which captures the long-range dependencies of the feature map through channel-wise and spatial-wise attention mechanisms. We name this method the self-proliferation-and-attention neural network (SP&A-Net). This method has been successfully applied to various defect pattern classification tasks. Compared with other recent methods, SP&A-Net has higher accuracy and lower computation cost in many defect inspection tasks.


A Close Look at Spatial Modeling: From Attention to Convolution

Xu Ma , Huan Wang , Can Qin , Kunpeng Li , Xingchen Zhao , Jie Fu , Yun Fu

Categories: Computer Vision | Artificial Intelligence | Machine Learning

2022-12-23

Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a query-irrelevant behavior at deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (also head-irrelevant). Second, the attention maps are intrinsically sparse: a few tokens dominate the attention weights, and introducing knowledge from ConvNets would largely smooth the attention and enhance the performance. Motivated by the above observations, we generalize the self-attention formulation to abstract a query-irrelevant global context directly and further integrate the global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), consists purely of convolutional layers and firmly inherits the merits of both the attention mechanism and convolutions, including dynamic property, weight sharing, and short- and long-range feature modeling. Experimental results demonstrate the effectiveness of FCViT. With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than the previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks, such as object detection, instance segmentation, and semantic segmentation. Code and models are available at https://github.com/ma-xu/FCViT.


HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Yongming Rao , Wenliang Zhao , Yansong Tang , Jie Zhou , Ser-Nam Lim , Jiwen Lu

Categories: Computer Vision

2022-07-28

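Recent progress in vision Transformers has shown great success in various tasks, driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range, and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv), which performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $\textit{g}^\textit{n}$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on this operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from its effectiveness in visual encoders, we also show that $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that $\textit{g}^\textit{n}$Conv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/hornet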


Locally Enhanced Self-Attention: Combining Self-Attention and Convolution as Local and Context Terms

Chenglin Yang , Siyuan Qiao , Adam Kortylewski , Alan Yuille

Categories: Computer Vision

2021-07-12

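Self-attention has become prevalent in computer vision models. Inspired by fully connected Conditional Random Fields (CRFs), we decompose self-attention into local and context terms. They correspond to the unary and binary terms in a CRF and are implemented by attention mechanisms with projection matrices. We observe that the unary terms make only small contributions to the outputs, while standard CNNs that rely solely on the unary terms achieve good performance on a variety of tasks. Therefore, we propose Locally Enhanced Self-Attention (LESA), which enhances the unary term by incorporating it with convolutions and uses a fusion module to dynamically couple the unary and binary operations. In our experiments, we replace the self-attention modules with LESA. The results on ImageNet and COCO show the superiority of LESA over convolution and self-attention baselines for the tasks of image classification, object detection, and instance segmentation. The code is publicly available.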


Dynamic Graph Message Passing Networks for Visual Recognition

Li Zhang , Mohan Chen , Anurag Arnab , Xiangyang Xue , Philip H. S. Torr

Categories: Computer Vision | Machine Learning

2022-09-20

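Modeling long-range dependencies is critical for understanding tasks in computer vision. Although convolutional neural networks (CNNs) have excelled in many vision tasks, they are still limited in capturing long-range structured relationships because they typically consist of layers of local kernels. A fully connected graph, such as the self-attention operation in Transformers, is beneficial for such modeling, but its computational overhead is prohibitive. In this paper, we propose a dynamic graph message passing network that significantly reduces the computational complexity compared to modeling a fully connected graph. This is achieved by adaptively sampling nodes in the graph, conditioned on the input, for message passing. Based on the sampled nodes, we dynamically predict node-dependent filter weights and the affinity matrix for propagating information between them. This formulation allows us to design a self-attention module and, more importantly, a new Transformer-based backbone network that we use both for image classification pretraining and for addressing various downstream tasks (object detection, instance segmentation, and semantic segmentation). Using this model, we show significant improvements over strong, state-of-the-art baselines on four different tasks. Our approach also outperforms fully connected graphs while using fewer floating-point operations and parameters. Code and models will be made publicly available at https://github.com/fudan-zvg/dgmn2.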


Efficient Multi-order Gated Aggregation Network

Siyuan Li , Zedong Wang , Zicheng Liu , Cheng Tan , Haitao Lin , Di Wu , Zhiyuan Chen , Jiangbin Zheng , Stan Z. Li

Categories: Computer Vision | Artificial Intelligence

2022-11-07

Since the recent success of Vision Transformers (ViTs), explorations toward transformer-style architectures have triggered the resurgence of modern ConvNets. In this work, we explore the representation ability of DNNs through the lens of interaction complexities. We empirically show that interaction complexity is an overlooked but essential indicator for visual recognition. Accordingly, a new family of efficient ConvNets, named MogaNet, is presented to pursue informative context mining in pure ConvNet-based models, with preferable complexity-performance trade-offs. In MogaNet, interactions across multiple complexities are facilitated and contextualized by leveraging two specially designed aggregation blocks in both spatial and channel interaction spaces. Extensive studies are conducted on ImageNet classification, COCO object detection, and ADE20K semantic segmentation tasks. The results demonstrate that our MogaNet establishes new state-of-the-art over other popular methods in mainstream scenarios and all model scales. Typically, the lightweight MogaNet-T achieves 80.0% top-1 accuracy with only 1.44G FLOPs using a refined training setup on ImageNet-1K, surpassing ParC-Net-S by 1.4% accuracy but saving 59% (2.04G) FLOPs.


Scaling Local Self-Attention for Parameter Efficient Visual Backbones

Ashish Vaswani , Prajit Ramachandran , Aravind Srinivas , Niki Parmar , Blake Hechtman , Jonathon Shlens

Categories:

2021-03-23

Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions, in contrast to parameter-dependent scaling and content-independent interactions of convolutions. Self-attention models have recently been shown to have encouraging improvements on accuracy-parameter trade-offs compared to baseline convolutional models such as ResNet-50. In this work, we aim to develop self-attention models that can outperform not just the canonical baseline models, but even the high-performing convolutional models. We propose two extensions to self-attention that, in conjunction with a more efficient implementation of self-attention, improve the speed, memory usage, and accuracy of these models. We leverage these improvements to develop a new self-attention model family, HaloNets, which reach state-of-the-art accuracies on the parameter-limited setting of the ImageNet classification benchmark. In preliminary transfer learning experiments, we find that HaloNet models outperform much larger models and have better inference performance. On harder tasks such as object detection and instance segmentation, our simple local self-attention and convolutional hybrids show improvements over very strong baselines. These results mark another step in demonstrating the efficacy of self-attention models on settings traditionally dominated by convolutional models.


Vision Transformers with Hierarchical Attention

Yun Liu , Yu-Huan Wu , Guolei Sun , Le Zhang , Ajad Chhatkuli , Luc Van Gool

Categories: Computer Vision

2021-06-06

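This paper tackles the low-efficiency flaw of vision transformers caused by the high computational/space complexity of Multi-Head Self-Attention (MHSA). To this end, we propose Hierarchical MHSA (H-MHSA), whose representation is computed in a hierarchical manner. Specifically, we first divide the input image into patches, as is commonly done, and each patch is viewed as a token. Then, the proposed H-MHSA learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies for the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only compute attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/hat-net.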
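
The local-then-global procedure in this abstract can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the HAT-Net implementation: it uses a 1D token sequence, a single head with no learned projections, average pooling to merge tokens, and addition to aggregate the two results. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, kv):
    """Single-head scaled dot-product attention without learned projections."""
    return softmax(q @ kv.T / np.sqrt(q.shape[-1])) @ kv

def hierarchical_attention(tokens, window):
    """Local attention inside windows, then global attention on merged tokens.

    tokens : (N, C) token features, with N divisible by `window`.
    Merging by averaging and fusing by addition are assumptions of this sketch.
    """
    tokens = np.asarray(tokens, dtype=float)
    n, c = tokens.shape
    assert n % window == 0, "N must be divisible by the window size"
    groups = tokens.reshape(n // window, window, c)
    # Step 1: attention restricted to each local window.
    local = np.concatenate([attend(g, g) for g in groups])   # (N, C)
    # Step 2: merge each window into one token and attend globally.
    merged = groups.mean(axis=1)                              # (N/window, C)
    global_ctx = attend(merged, merged)                       # (N/window, C)
    # Step 3: aggregate local and global features.
    return local + np.repeat(global_ctx, window, axis=0)     # (N, C)

print(hierarchical_attention(np.ones((8, 4)), window=4).shape)  # (8, 4)
```

For N tokens and window size w, the local step scores N·w token pairs and the global step (N/w)² pairs, compared with N² for full attention. This is the source of the reduced computational load the abstract describes.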


Improved Multiscale Vision Transformers for Classification and Detection

Yanghao Li , Chao-Yuan Wu , Haoqi Fan , Karttikeya Mangalam , Bo Xiong , Jitendra Malik , Christoph Feichtenhofer

Categories: Computer Vision

2021-12-02

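In this paper, we study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it on ImageNet classification, COCO detection, and Kinetics video recognition, where it outperforms prior work. We further compare the pooling attention of MViT to window attention mechanisms, where it outperforms the latter in accuracy/compute. Without bells and whistles, MViT achieves state-of-the-art performance in three domains: 88.8% accuracy on ImageNet classification, 56.1 box AP on COCO object detection, and 86.1% on Kinetics-400 video classification. Code and models will be made publicly available.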


Vision Transformer with Deformable Attention

Zhuofan Xia , Xuran Pan , Shiji Song , Li Erran Li , Gao Huang

Categories: Computer Vision

2022-01-03

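Transformers have recently shown superior performance on various vision tasks. The large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts. Nevertheless, simply enlarging the receptive field also raises several concerns. On the one hand, using dense attention, e.g., in ViT, leads to excessive memory and computational cost, and features can be influenced by irrelevant parts that lie beyond the region of interest. On the other hand, the sparse attention adopted in PVT or Swin Transformer is data-agnostic and may limit the ability to model long-range relations. To mitigate these issues, we propose a novel deformable self-attention module, in which the positions of key and value pairs in self-attention are selected in a data-dependent way. This flexible scheme enables the self-attention module to focus on relevant regions and capture more informative features. On this basis, we present the Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks. Extensive experiments show that our models achieve consistently improved results on comprehensive benchmarks. Code is available at https://github.com/leaplabthu/dat.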


Bottleneck Transformers for Visual Recognition

Aravind Srinivas , Tsung-Yi Lin , Niki Parmar , Jonathon Shlens , Pieter Abbeel , Ashish Vaswani

Categories:

2021-01-27

We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework, surpassing the previous best published single-model and single-scale results of ResNeSt [67] evaluated on the COCO validation set. Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to 1.64x faster in "compute" time than the popular EfficientNet models on TPU-v3 hardware. We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.


ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

Qilong Wang , Banggu Wu , Pengfei Zhu , Peihua Li , Wangmeng Zuo , Qinghua Hu

Categories:

2019-10-08

Recently, the channel attention mechanism has been shown to offer great potential for improving the performance of deep convolutional neural networks (CNNs). However, most existing methods are dedicated to developing more sophisticated attention modules for achieving better performance, which inevitably increases model complexity. To overcome the paradox of the performance and complexity trade-off, this paper proposes an Efficient Channel Attention (ECA) module, which involves only a handful of parameters while bringing clear performance gain. By dissecting the channel attention module in SENet, we empirically show that avoiding dimensionality reduction is important for learning channel attention, and that appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity. Therefore, we propose a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1D convolution. Furthermore, we develop a method to adaptively select the kernel size of the 1D convolution, which determines the coverage of local cross-channel interaction. The proposed ECA module is efficient yet effective, e.g., the parameters and computations of our module against the ResNet50 backbone are 80 vs. 24.37M and 4.7e-4 GFLOPs vs. 3.86 GFLOPs, respectively, and the performance boost is more than 2% in terms of Top-1 accuracy. We extensively evaluate our ECA module on image classification, object detection and instance segmentation with ResNet and MobileNetV2 backbones. The experimental results show that our module is more efficient while performing favorably against its counterparts.
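
The "local cross-channel interaction without dimensionality reduction, implemented via 1D convolution" described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' module: the global average pooling, zero padding, and sigmoid gate are assumptions for this sketch, the kernel size is fixed rather than adaptively selected, and the function name is hypothetical.

```python
import numpy as np

def eca(x, kernel):
    """Channel attention from a 1D convolution over pooled channel descriptors.

    x      : (C, H, W) feature map.
    kernel : (k,) 1D convolution weights with odd k; these k values are the
             only parameters of the module in this sketch.
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    k = kernel.shape[0]
    assert k % 2 == 1, "kernel size must be odd so the output keeps C channels"
    pooled = x.mean(axis=(1, 2))                  # (C,) one descriptor per channel
    padded = np.pad(pooled, k // 2)               # zero padding on both ends
    # Each channel interacts only with its k nearest neighbours; there is no
    # bottleneck that reduces the channel dimension.
    mixed = np.array([padded[i:i + k] @ kernel for i in range(pooled.shape[0])])
    gate = 1.0 / (1.0 + np.exp(-mixed))           # sigmoid, values in (0, 1)
    return x * gate[:, None, None]                # rescale each channel

x = np.arange(24, dtype=float).reshape(4, 2, 3)
print(eca(x, [0.2, 0.5, 0.2]).shape)  # (4, 2, 3)
```

The parameter count equals the kernel size k and does not depend on the number of channels, which is consistent with the abstract's claim of "a handful of parameters".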


A Discriminative Channel Diversification Network for Image Classification

Krushi Patel , Guanghui Wang

Categories: Computer Vision

2021-12-10

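Channel attention mechanisms in convolutional neural networks have been proven effective in various computer vision tasks. However, the performance improvement comes with additional model complexity and computation cost. In this paper, we propose a lightweight and effective attention module, called the channel diversification block, to enhance the global context by establishing channel relationships at the global level. Unlike other channel attention mechanisms, the proposed module focuses on the most discriminative features by giving more attention to spatially distinguishable channels while taking the channel activation into account. Unlike other attention models that insert modules between several intermediate layers, the proposed module is embedded at the end of the backbone network, making it easy to implement. Extensive experiments on the CIFAR-10, SVHN, and Tiny-ImageNet datasets demonstrate that the proposed module improves the performance of the baseline networks by a margin of 3% on average.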


CMT: Convolutional Neural Networks Meet Vision Transformers

Jianyuan Guo , Kai Han , Han Wu , Yehui Tang , Xinghao Chen , Yunhe Wang , Chang Xu

Categories: Computer Vision

2021-07-13

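Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that can outperform not only canonical transformers but also high-performance convolutional models. We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features. Furthermore, we scale it to obtain a family of models, called CMTs, which achieve better accuracy and efficiency than previous convolution-based and transformer-based models. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet while being 14x and 2x smaller in FLOPs than the existing DeiT and EfficientNet, respectively. The proposed CMT-S also generalizes well to CIFAR10 (99.2%), CIFAR100 (91.7%), Flowers (98.7%), and other challenging vision datasets such as COCO (44.3% mAP), with less computational cost.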


Focal Modulation Networks

Jianwei Yang , Chunyuan Li , Xiyang Dai , Lu Yuan , Jianfeng Gao

Categories: Computer Vision | Artificial Intelligence | Machine Learning

2022-03-22

We propose focal modulation networks (FocalNets for short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show that FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational costs on the tasks of image classification, object detection, and segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretraining on ImageNet-22K at resolution 224, they attain 86.5% and 87.3% top-1 accuracy when fine-tuned at resolutions 224 and 384, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with a 1× schedule outperforms the Swin counterpart by 2.1 points and already surpasses Swin trained with a 3× schedule (49.0 vs. 48.5). For semantic segmentation with UPerNet, FocalNet base at single-scale outperforms Swin by 2.4, and beats Swin at multi-scale (50.5 vs. 49.7). Using large FocalNet and Mask2former, we achieve 58.5 mIoU for ADE20K semantic segmentation, and 57.9 PQ for COCO Panoptic Segmentation. Using huge FocalNet and DINO, we achieve 64.3 and 64.4 mAP on COCO minival and test-dev, respectively, establishing a new SoTA on top of much larger attention-based models like Swinv2-G and BEIT-3. Code and checkpoints are available at https://github.com/microsoft/FocalNet.


Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective

Li Zhang , Sixiao Zheng , Jiachen Lu , Xinxuan Zhao , Xiatian Zhu , Yanwei Fu , Tao Xiang , Jianfeng Feng

Categories: Computer Vision

2022-07-19

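Visual representation learning is the key to solving various vision problems. Relying on seminal grid-structure priors, convolutional neural networks (CNNs) have become the de facto standard architecture of most deep vision models. For instance, classical semantic segmentation methods often adopt a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on increasing the receptive field, through either dilated (i.e., atrous) convolutions or inserting attention modules. However, the FCN-based architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating visual representation learning as a sequence-to-sequence prediction task. Specifically, we deploy a pure Transformer to encode an image as a sequence of patches, without local convolution or resolution reduction. With the global context modeled in every layer of the Transformer, stronger visual representations can be learned to better tackle vision tasks. In particular, our segmentation model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, the first position on the test leaderboard on the day of submission) and Pascal Context (55.83% mIoU), and reaches competitive results on Cityscapes. Furthermore, we formulate a family of Hierarchical Local-Global (HLG) Transformers characterized by local attention within windows and global attention across windows in a hierarchical and pyramidal architecture. Extensive experiments show that our method achieves appealing performance on a variety of visual recognition tasks (e.g., image classification, object detection, instance segmentation, and semantic segmentation).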
