Smart Paper Notes

Global Context Vision Transformers

Ali Hatamizadeh , Hongxu Yin , Jan Kautz , Pavlo Molchanov

Categories: Computer Vision | Artificial Intelligence | Machine Learning

2022-06-20

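We propose the global context vision transformer (GC ViT), a novel architecture that improves parameter and compute utilization. Our method combines global context self-attention modules with local self-attention to model both long- and short-range spatial interactions effectively yet efficiently, without expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs by proposing the use of modified fused inverted residual blocks in our architecture. The proposed GC ViT achieves state-of-the-art results across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K classification dataset, the tiny, small, and base variants of GC ViT, with 28M, 51M, and 90M parameters, achieve 83.2%, 83.9%, and 84.4% top-1 accuracy, respectively, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work on the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, sometimes by large margins. Code is available at https://github.com/nvlabs/gcvit.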

Translated by Google Translate

Focal Modulation Networks

Jianwei Yang , Chunyuan Li , Xiyang Dai , Lu Yuan , Jianfeng Gao

Categories: Computer Vision | Artificial Intelligence | Machine Learning

2022-03-22

We propose focal modulation networks (FocalNets for short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show that FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational costs on the tasks of image classification, object detection, and segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretraining on ImageNet-22K at 224 resolution, it attains 86.5% and 87.3% top-1 accuracy when finetuned at resolution 224 and 384, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with a 1× schedule outperforms the Swin counterpart by 2.1 points and already surpasses Swin trained with a 3× schedule (49.0 vs. 48.5). For semantic segmentation with UPerNet, FocalNet base at single-scale outperforms Swin by 2.4, and beats Swin at multi-scale (50.5 vs. 49.7). Using large FocalNet and Mask2Former, we achieve 58.5 mIoU for ADE20K semantic segmentation, and 57.9 PQ for COCO Panoptic Segmentation. Using huge FocalNet and DINO, we achieve 64.3 and 64.4 mAP on COCO minival and test-dev, respectively, establishing a new SoTA on top of much larger attention-based models like SwinV2-G and BEIT-3. Code and checkpoints are available at https://github.com/microsoft/FocalNet.


Token Transformer: Can class token help window-based transformer build better long-range interactions?

Jiawei Mao , Yuanqi Chang , Xuesong Yin

Categories: Computer Vision

2022-11-11

Compared with the vanilla transformer, the window-based transformer offers a better trade-off between accuracy and efficiency. Although the window-based transformer has made great progress, its long-range modeling capabilities are limited by the size of the local window and the window connection scheme. To address this problem, we propose a novel Token Transformer (TT). The core mechanism of TT is the addition of a Class (CLS) token that summarizes window information in each local window. We refer to this type of token interaction as CLS Attention. These CLS tokens interact spatially with the tokens in each window to enable long-range modeling. To preserve the hierarchical design of the window-based transformer, we design a Feature Inheritance Module (FIM) in each phase of TT to deliver the local window information from the previous phase to the CLS token in the next phase. In addition, we design a Spatial-Channel Feedforward Network (SCFFN) in TT, which mixes CLS tokens and embedded tokens in the spatial and channel domains without additional parameters. Extensive experiments show that TT achieves competitive results with a low parameter count in image classification and downstream tasks.


Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu , Yutong Lin , Yue Cao , Han Hu , Yixuan Wei , Zheng Zhang , Stephen Lin , Baining Guo

Categories:

2021-03-25

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
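
The windowing idea in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the 8×8 single-channel feature map, the window size of 4, and the half-window shift are all assumptions chosen for illustration, and the attention masking that a real shifted-window layer needs for wrapped-around tokens is omitted.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) feature map into non-overlapping window x window tiles."""
    H, W, C = x.shape
    assert H % window == 0 and W % window == 0, "map must tile evenly"
    x = x.reshape(H // window, window, W // window, window, C)
    # Result: (num_windows, window * window, C); each row group is the
    # token set that a single local self-attention call would see.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, C)

def shifted_window_partition(x, window):
    """Cyclically shift the map by half a window, then partition it.

    Illustrative only: masking of wrapped-around tokens is not modeled.
    """
    shift = window // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)

# Hypothetical 8x8 map whose single channel stores each token's flat index.
feat = np.arange(8 * 8).reshape(8, 8, 1)
regular = window_partition(feat, 4)
shifted = shifted_window_partition(feat, 4)
```

In this toy setup, `regular[0]` holds the tokens of rows 0 to 3 and columns 0 to 3, while `shifted[0]` holds rows 2 to 5 and columns 2 to 5, so it draws tokens from all four regular windows. Because every window has a fixed number of tokens, the number of attention pairs grows linearly with the number of windows rather than quadratically with the image size.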


CMT: Convolutional Neural Networks Meet Vision Transformers

Jianyuan Guo , Kai Han , Han Wu , Yehui Tang , Xinghao Chen , Yunhe Wang , Chang Xu

Categories: Computer Vision

2021-07-13

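Vision transformers have been successfully applied to image recognition tasks thanks to their ability to capture long-range dependencies within an image. However, gaps in both performance and computational cost remain between transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that outperforms not only canonical transformers but also high-performance convolutional models. We propose a new transformer-based hybrid network that uses transformers to capture long-range dependencies and CNNs to model local features. Furthermore, we scale it to obtain a family of models, called CMTs, which achieve better accuracy and efficiency than previous convolution-based and transformer-based models. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet while requiring 14x and 2x fewer FLOPs than the existing DeiT and EfficientNet, respectively. The proposed CMT-S also generalizes well to CIFAR10 (99.2%), CIFAR100 (91.7%), Flowers (98.7%), and other challenging vision datasets such as COCO (44.3% mAP), at a lower computational cost.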


Vision Transformers with Hierarchical Attention

Yun Liu , Yu-Huan Wu , Guolei Sun , Le Zhang , Ajad Chhatkuli , Luc Van Gool

Categories: Computer Vision

2021-06-06

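This paper tackles the low-efficiency flaw of vision transformers caused by the high computational and space complexity of Multi-Head Self-Attention (MHSA). To this end, we propose Hierarchical MHSA (H-MHSA), whose representation is computed in a hierarchical manner. Specifically, we first divide the input image into patches, as is commonly done, and treat each patch as a token. The proposed H-MHSA then learns token relationships within local patches, which serves as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since attention is computed for only a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net therefore provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/hat-net.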
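
The claim that attending to a limited number of tokens at each step reduces the computational load can be made concrete with a rough count of query-key pairs. The sketch below is a back-of-the-envelope reading of the abstract, not the paper's own cost model: it assumes a local step restricted to square groups of tokens and a global step over one merged token per group, and the 56×56 token grid and group size of 7 are hypothetical values.

```python
def full_attention_pairs(num_tokens):
    """Query-key pairs when every token attends to every token."""
    return num_tokens * num_tokens

def hierarchical_attention_pairs(height, width, group):
    """Query-key pairs for a local step plus a global step (assumed scheme).

    Local step: each token attends to the group * group tokens of its group.
    Global step: one merged token per group attends to all merged tokens.
    """
    num_tokens = height * width
    local_pairs = num_tokens * group * group
    merged_tokens = (height // group) * (width // group)
    global_pairs = merged_tokens * merged_tokens
    return local_pairs + global_pairs

# Hypothetical setting: a 56x56 token grid with 7x7 local groups.
full = full_attention_pairs(56 * 56)
hier = hierarchical_attention_pairs(56, 56, 7)
print(full, hier, round(full / hier, 1))
```

Under these assumptions, full attention needs about 9.8 million pairs and the two-step scheme about 158 thousand. The count ignores projections, feature aggregation, and constant factors, so it indicates the trend rather than measured FLOPs.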


CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Xiaoyi Dong , Jianmin Bao , Dongdong Chen , Weiming Zhang , Nenghai Yu , Lu Yuan , Dong Chen , Baining Guo

Categories: Computer Vision | Machine Learning

2021-07-01

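We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism, which computes self-attention in horizontal and vertical stripes in parallel; together the stripes form a cross-shaped window, and each stripe is obtained by splitting the input feature into stripes of equal width. We provide a mathematical analysis of the effect of the stripe width and vary the stripe width across the layers of the Transformer network, which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions and is thus especially effective and friendly for downstream tasks. Incorporating these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% top-1 accuracy on ImageNet-1K without any extra training data or labels, 53.9 box AP and 46.4 mask AP, and 52.2 mIoU on the ADE20K semantic segmentation task, surpassing the previous state of the art by +1.2, +2.0, +1.4, and +2.0, respectively, under a similar FLOPs setting. By further pretraining on the larger ImageNet-21K dataset, we achieve 87.5% top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55.7 mIoU. The code and models are available at https://github.com/microsoft/cswin-transformer.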


Hire-MLP: Vision MLP via Hierarchical Rearrangement

Jianyuan Guo , Yehui Tang , Kai Han , Xinghao Chen , Han Wu , Chao Xu , Chang Xu , Yunhe Wang

Categories: Computer Vision

2021-08-30

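Previous vision MLPs such as MLP-Mixer and ResMLP accept linearly flattened image patches as input, which makes them inflexible with respect to input size and poor at capturing spatial information. This approach keeps MLPs from matching the performance of their transformer-based counterparts and prevents them from becoming a general backbone for computer vision. This paper presents Hire-MLP, a simple yet competitive vision MLP architecture based on Hierarchical rearrangement, which contains two levels of rearrangement. Specifically, the inner-region rearrangement is proposed to capture local information inside a spatial region, and the cross-region rearrangement is proposed to enable information communication between different regions by circularly shifting all tokens along spatial directions. Extensive experiments demonstrate the effectiveness of Hire-MLP as a versatile backbone for various vision tasks. In particular, Hire-MLP achieves competitive results on image classification, object detection, and semantic segmentation tasks, e.g., 83.8% top-1 accuracy on ImageNet, 51.7% box AP and 44.8% mask AP on COCO val2017, and 49.9% mIoU on ADE20K, surpassing previous transformer-based and MLP-based models with a better trade-off between accuracy and throughput. Code is available at https://github.com/ggjy/hire-wave-mlp.pytorch.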


Dilated Neighborhood Attention Transformer

Ali Hassani , Humphrey Shi

Categories: Computer Vision | Artificial Intelligence | Machine Learning

2022-09-29

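Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts on plain transformers, hierarchical transformers have also gained significant attention thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing the quadratic complexity of self-attention, local attention weakens two of its most desirable properties: long-range inter-dependency modeling and a global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible, and efficient extension of NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and we therefore introduce the Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer. DiNAT variants show significant improvements over attention-based baselines such as NAT and Swin, as well as the modern convolutional baseline ConvNeXt. Our large model is ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation, and it has faster throughput. We believe that combinations of NA and DiNA have the potential to empower various tasks beyond those presented in this paper. To support and encourage research in this direction, in vision and beyond, we open-source our project at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer.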
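
A simplified 1D sketch may help show why dilation enlarges the receptive field without adding cost. It is not the authors' operator: it covers interior tokens only and ignores the border handling of real neighborhood attention. All function names are hypothetical, and the geometric dilation schedule used at the end is an assumption of this sketch rather than a configuration taken from the paper.

```python
def dilated_neighborhood(i, kernel_size, dilation):
    """Indices that interior token i attends to in a 1D dilated neighborhood."""
    half = kernel_size // 2
    return [i + dilation * k for k in range(-half, half + 1)]

def stacked_receptive_field(kernel_size, dilations):
    """Receptive field (in tokens) after stacking one layer per dilation value."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

# The number of attended keys stays at kernel_size regardless of dilation.
dense = dilated_neighborhood(10, kernel_size=3, dilation=1)
sparse = dilated_neighborhood(10, kernel_size=3, dilation=4)

# Three layers without dilation versus three layers with dilations 1, 3, 9.
plain_field = stacked_receptive_field(3, [1, 1, 1])
dilated_field = stacked_receptive_field(3, [1, 3, 9])
```

Here `dense` and `sparse` both contain three indices, so the per-token attention cost is unchanged, while the stacked receptive field grows from 7 tokens to 27 tokens when the dilation is multiplied by the kernel size at each layer.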


ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer

Rui Yang , Hailong Ma , Jie Wu , Yansong Tang , Xuefeng Xiao , Min Zheng , Xiu Li

Categories: Computer Vision | Artificial Intelligence

2022-03-21

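The vanilla self-attention mechanism inherently relies on pre-defined and fixed computational dimensions. This inflexibility prevents it from achieving context-oriented generalization, which can bring more contextual cues and global representations. To mitigate this issue, we propose a Scalable Self-Attention (SSA) mechanism that uses two scaling factors to release the dimensions of the query, key, and value matrices while decoupling them from the input. This scalability yields context-oriented generalization and enhances object sensitivity, pushing the whole network toward a more effective trade-off between accuracy and cost. Furthermore, we propose Interactive Window-based Self-Attention (IWSA), which establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking SSA and IWSA alternately, the Scalable Vision Transformer (ScalableViT) achieves state-of-the-art performance on general-purpose vision tasks. For example, on ImageNet-1K classification, ScalableViT-S outperforms Twins-SVT-S by 1.4% and also surpasses Swin-T.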


Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Xiangxiang Chu , Zhi Tian , Yuqing Wang , Bo Zhang , Haibing Ren , Xiaolin Wei , Huaxia Xia , Chunhua Shen

Categories:

2021-04-28

Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully devised yet simple spatial attention mechanism performs favorably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks including image-level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our Code is available at: https://git.io/Twins.


MaxViT: Multi-Axis Vision Transformer

Zhengzhong Tu , Hossein Talebi , Han Zhang , Feng Yang , Peyman Milanfar , Alan Bovik , Yinxiao Li

Categories: Computer Vision | Artificial Intelligence | Machine Learning

2022-04-04

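Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper, we introduce an efficient and scalable attention model that we call multi-axis attention, which consists of two aspects: blocked local attention and dilated global attention. These design choices allow global-local spatial interactions at arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, built by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in the earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model exhibits strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.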
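
The two aspects of multi-axis attention can be pictured as two ways of grouping the same tokens. The NumPy sketch below shows only the grouping, under stated assumptions: the function names, the 8×8 token map, and the block and grid size of 4 are hypothetical, channels are dropped, and no attention is actually computed.

```python
import numpy as np

def block_partition(x, p):
    """Blocked local groups: each group holds p * p spatially contiguous tokens."""
    H, W = x.shape
    x = x.reshape(H // p, p, W // p, p)
    return x.transpose(0, 2, 1, 3).reshape(-1, p * p)

def grid_partition(x, g):
    """Dilated global groups: each group holds g * g tokens strided across the map."""
    H, W = x.shape
    x = x.reshape(g, H // g, g, W // g)
    return x.transpose(1, 3, 0, 2).reshape(-1, g * g)

# Hypothetical 8x8 map where each entry is the token's flat index.
tokens = np.arange(64).reshape(8, 8)
blocks = block_partition(tokens, 4)
grids = grid_partition(tokens, 4)
```

In this toy case the first block group starts with tokens 0, 1, 2, 3 (neighbors in a row), while the first grid group starts with tokens 0, 2, 4, 6 (every second token), so it spans the whole map. Both groupings keep the group size fixed at 16 tokens, so attending within groups scales linearly with the number of tokens.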


MPViT: Multi-Path Vision Transformer for Dense Prediction

Youngwan Lee , Jonghee Kim , Jeff Willette , Sung Ju Hwang

Categories: Computer Vision

2021-12-21

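Dense computer vision tasks such as object detection and segmentation require effective multi-scale feature representations for detecting or classifying objects or regions of varying sizes. While convolutional neural networks (CNNs) have been the dominant architectures for such tasks, recently introduced vision transformers (ViTs) aim to replace them as backbones. Similar to CNNs, ViTs build a simple multi-stage structure (i.e., fine-to-coarse) for multi-scale representation with single-scale patches. In this work, taking a different perspective from existing transformers, we explore multi-scale patch embedding and a multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds features of the same size (i.e., sequence length) with patches of different scales simultaneously by using overlapping convolutional patch embedding. Tokens of different scales are then independently fed into the transformer encoders via multiple paths, and the resulting features are aggregated, enabling both fine and coarse feature representations at the same feature level. Thanks to these diverse, multi-scale feature representations, our MPViTs, scaling from tiny (5M) to base (73M), consistently achieve superior performance over state-of-the-art vision transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation. These extensive results demonstrate that MPViT can serve as a versatile backbone network for various vision tasks. Code will be made publicly available at https://git.io/mpvit.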


Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective

Li Zhang , Sixiao Zheng , Jiachen Lu , Xinxuan Zhao , Xiatian Zhu , Yanwei Fu , Tao Xiang , Jianfeng Feng

Categories: Computer Vision

2022-07-19

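Visual representation learning is the key to solving various vision problems. Relying on the seminal grid-structure priors, convolutional neural networks (CNNs) have become the de facto standard architecture for most deep vision models. For instance, classical semantic segmentation methods often adopt a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract visual concepts with larger receptive fields. Since context modeling is critical for segmentation, recent efforts have focused on increasing the receptive field, through either dilated (i.e., atrous) convolutions or inserted attention modules. However, the FCN-based architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating visual representation learning as a sequence-to-sequence prediction task. Specifically, we deploy a pure Transformer to encode an image as a sequence of patches, without local convolution or resolution reduction. With global context modeled in every layer of the Transformer, stronger visual representations can be learned to better tackle vision tasks. In particular, our segmentation model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, first place on the test leaderboard on the day of submission) and Pascal Context (55.83% mIoU), and reaches competitive results on Cityscapes. Furthermore, we formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a hierarchical, pyramidal architecture. Extensive experiments show that our method achieves appealing performance on a variety of visual recognition tasks (e.g., image classification, object detection, instance segmentation, and semantic segmentation).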


MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens

Jiemin Fang , Lingxi Xie , Xinggang Wang , Xiaopeng Zhang , Wenyu Liu , Qi Tian

Categories: Computer Vision | Machine Learning

2021-05-31

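Transformers offer a new methodology for designing neural networks for visual recognition. Compared to convolutional networks, Transformers can refer to global features at each stage, yet the attention module brings higher computational overhead, which obstructs the application of Transformers to high-resolution visual data. This paper aims to alleviate the conflict between efficiency and flexibility, for which we propose a specialized token for each region that serves as a messenger (MSG). By manipulating these MSG tokens, visual information can be exchanged flexibly across regions, and the computational complexity is reduced. We then integrate the MSG token into a multi-scale architecture named MSG-Transformer. In standard image classification and object detection, MSG-Transformer achieves competitive performance, and inference on both GPU and CPU is accelerated. Code is available at https://github.com/hustvl/msg-transformer.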


Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Jiashi Li , Xin Xia , Wei Li , Huixia Li , Xing Wang , Xuefeng Xiao , Rui Wang , Min Zheng , Xin Pan

Categories: Computer Vision

2022-07-12

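Due to their complex attention mechanisms and model designs, most existing vision transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g., TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs and perform as strongly as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, but their overall performance is far from satisfactory. To this end, we propose a next-generation vision transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and the Next Transformer Block (NTB) are developed to capture local and global information, respectively, with deployment-friendly mechanisms. The Next Hybrid Strategy (NHS) is then designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of the latency/accuracy trade-off across various vision tasks. On TensorRT, under similar latency, Next-ViT outperforms the baseline by 5.4 mAP (from 40.4 to 45.8) on COCO detection and by 8.2% mIoU (from 38.8% to 47.0%) on ADE20K segmentation. Meanwhile, it achieves performance comparable to CSWin while accelerating inference by 3.6x. On CoreML, under similar latency, Next-ViT outperforms the baseline by 4.6 mAP (from 42.6 to 47.2) on COCO detection and by 3.5% mIoU (from 45.2% to 48.7%) on ADE20K segmentation. Code will be released soon.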


Efficient Multi-order Gated Aggregation Network

Siyuan Li , Zedong Wang , Zicheng Liu , Cheng Tan , Haitao Lin , Di Wu , Zhiyuan Chen , Jiangbin Zheng , Stan Z. Li

Categories: Computer Vision | Artificial Intelligence

2022-11-07

Since the recent success of Vision Transformers (ViTs), explorations toward transformer-style architectures have triggered the resurgence of modern ConvNets. In this work, we explore the representation ability of DNNs through the lens of interaction complexities. We empirically show that interaction complexity is an overlooked but essential indicator for visual recognition. Accordingly, a new family of efficient ConvNets, named MogaNet, is presented to pursue informative context mining in pure ConvNet-based models, with preferable complexity-performance trade-offs. In MogaNet, interactions across multiple complexities are facilitated and contextualized by leveraging two specially designed aggregation blocks in both spatial and channel interaction spaces. Extensive studies are conducted on ImageNet classification, COCO object detection, and ADE20K semantic segmentation tasks. The results demonstrate that our MogaNet establishes a new state of the art over other popular methods in mainstream scenarios and at all model scales. Typically, the lightweight MogaNet-T achieves 80.0% top-1 accuracy with only 1.44G FLOPs using a refined training setup on ImageNet-1K, surpassing ParC-Net-S by 1.4% accuracy while saving 59% (2.04G) FLOPs.


NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition

Hao Liu , Xinghua Jiang , Xin Li , Zhimin Bao , Deqiang Jiang , Bo Ren

Categories: Computer Vision

2021-11-25

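Recently, vision transformers (ViTs), with self-attention (SA) as the de facto ingredient, have demonstrated great potential in the computer vision community. To trade off efficiency against performance, one group of works performs SA only within local patches and abandons global contextual information, which is indispensable for visual recognition tasks. To solve this issue, subsequent global-local ViTs attempt to combine local SA with global SA in a parallel or alternating manner in the model. Nevertheless, exhaustively combined local and global context may be redundant for various visual data, and the receptive field within each layer is fixed. A more graceful alternative is for global and local context to contribute adaptively so as to accommodate different visual data. To achieve this goal, we propose a novel ViT architecture, termed NomMer, which can dynamically nominate the synergistic global-local context in a vision transformer. By investigating the working pattern of the proposed NomMer, we further explore which context information it focuses on. Benefiting from this "dynamic nomination" mechanism, without bells and whistles, NomMer not only achieves 84.5% top-1 classification accuracy on ImageNet with only 73M parameters, but also shows promising performance on dense prediction tasks, i.e., object detection and semantic segmentation. The code and models will be made publicly available at https://github.com/nommer1125/nommer.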


A Close Look at Spatial Modeling: From Attention to Convolution

Xu Ma , Huan Wang , Can Qin , Kunpeng Li , Xingchen Zhao , Jie Fu , Yun Fu

Categories: Computer Vision | Artificial Intelligence | Machine Learning

2022-12-23

Vision Transformers have shown great promise recently for many vision tasks due to their insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a query-irrelevant behavior at deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (also head-irrelevant). Second, the attention maps are intrinsically sparse: a few tokens dominate the attention weights, and introducing the knowledge from ConvNets would largely smooth the attention and enhance the performance. Motivated by the above observations, we generalize the self-attention formulation to abstract a query-irrelevant global context directly and further integrate the global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), consists purely of convolutional layers and firmly inherits the merits of both the attention mechanism and convolutions, including dynamic property, weight sharing, and short- and long-range feature modeling. Experimental results demonstrate the effectiveness of FCViT. With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than the previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks, like object detection, instance segmentation, and semantic segmentation. Codes and models are made available at: https://github.com/ma-xu/FCViT.


HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Yongming Rao , Wenliang Zhao , Yansong Tang , Jie Zhou , Ser-Nam Lim , Jiwen Lu

Categories: Computer Vision

2022-07-28

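Recent progress in vision Transformers has brought great success in various tasks, driven by a new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range, and high-order spatial interactions, can also be implemented efficiently with a convolution-based framework. We present the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv), which performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $\textit{g}^\textit{n}$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on this operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers by a significant margin with a similar overall architecture and training configuration. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from its effectiveness in visual encoders, we also show that $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that $\textit{g}^\textit{n}$Conv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/hornet.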
