Smart Paper Notes

Conv-Adapter: Exploring Parameter Efficient Transfer Learning for ConvNets

Hao Chen , Ran Tao , Han Zhang , Yidong Wang , Wei Ye , Jindong Wang , Guosheng Hu , Marios Savvides

Categories: Computer Vision | Artificial Intelligence

2022-08-15

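Although parameter-efficient tuning (PET) methods have shown great potential on natural language processing (NLP) tasks, their effectiveness remains under-studied for large-scale ConvNets on computer vision (CV) tasks. This paper proposes Conv-Adapter, a PET module designed for ConvNets. Conv-Adapter is lightweight, domain-transferable, and architecture-agnostic, with generalized performance across different tasks. When transferring to downstream tasks, Conv-Adapter learns task-specific feature modulation of the backbone's intermediate representations while keeping the pre-trained parameters frozen. By introducing only a small number of learnable parameters, e.g., only 3.5% of the full fine-tuning parameters of ResNet50, Conv-Adapter outperforms previous PET baseline methods and achieves performance comparable to or surpassing full fine-tuning on 23 classification tasks. It also shows superior performance on few-shot classification, with an average margin of 3.39%. Beyond classification, Conv-Adapter generalizes to detection and segmentation tasks with more than 50% fewer parameters but performance comparable to traditional full fine-tuning.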

Translated by Google Translate

Visual Prompt Tuning

Menglin Jia , Luming Tang , Bor-Chun Chen , Claire Cardie , Serge Belongie , Bharath Hariharan , Ser-Nam Lim

Categories: Computer Vision

2022-03-23

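The current modus operandi in adapting pre-trained models involves updating all backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small number of trainable parameters (a small fraction of the model's parameters) in the input space while keeping the model backbone frozen. Through extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost.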
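The idea of adding trainable parameters only in the input space can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the prompts are extra token vectors placed between a class token and the patch tokens, and the function names and toy sizes are hypothetical.

```python
def build_input_sequence(cls_token, prompt_tokens, patch_tokens):
    """Place the learnable prompt tokens between the class token and the patch tokens."""
    return [cls_token] + prompt_tokens + patch_tokens


def count_values(tokens):
    """Number of scalar values held by a list of token vectors."""
    return sum(len(token) for token in tokens)


embed_dim = 8      # hypothetical embedding width
num_patches = 16   # hypothetical number of image patches
num_prompts = 4    # hypothetical number of prompt tokens

# In this sketch the class token and patch embeddings come from the frozen backbone.
cls_token = [0.0] * embed_dim
patch_tokens = [[0.1 * i] * embed_dim for i in range(num_patches)]

# The prompt tokens are the only values that would receive gradient updates.
prompt_tokens = [[0.0] * embed_dim for _ in range(num_prompts)]

sequence = build_input_sequence(cls_token, prompt_tokens, patch_tokens)
trainable_values = count_values(prompt_tokens)

print(len(sequence), trainable_values)
```

The backbone itself is untouched in this sketch; only the sequence it consumes grows by the number of prompt tokens.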


Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning

Dongze Lian , Daquan Zhou , Jiashi Feng , Xinchao Wang

Categories: Computer Vision

2022-10-17

Existing fine-tuning methods either tune all parameters of the pre-trained model (full fine-tuning), which is not efficient, or only tune the last linear layer (linear probing), which suffers a significant accuracy drop compared to full fine-tuning. In this paper, we propose a new parameter-efficient fine-tuning method termed SSF, meaning that researchers only need to Scale and Shift the deep Features extracted by a pre-trained model to catch up with the performance of full fine-tuning. Surprisingly, SSF also outperforms other parameter-efficient fine-tuning approaches even with a smaller number of tunable parameters. Furthermore, unlike some existing parameter-efficient fine-tuning methods (e.g., Adapter or VPT) that introduce extra parameters and computational cost in both the training and inference stages, SSF only adds learnable parameters during the training stage, and these additional parameters can be merged into the original pre-trained model weights via re-parameterization in the inference phase. With the proposed SSF, our model obtains 2.46% (90.72% vs. 88.54%) and 11.48% (73.10% vs. 65.57%) relative improvement in Top-1 accuracy on FGVC and VTAB-1k compared to full fine-tuning, while fine-tuning only about 0.3M parameters. We also conduct extensive experiments on various model families (CNNs, Transformers, and MLPs) and datasets. Results on 26 image classification datasets in total and 3 robustness & out-of-distribution datasets show the effectiveness of SSF. Code is available at https://github.com/dongzelian/SSF.
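The re-parameterization claim can be illustrated with a minimal sketch. It assumes the scale and shift are applied elementwise to the output of a linear layer, y = gamma * (W x + b) + beta, in which case they fold into W' = diag(gamma) W and b' = gamma * b + beta. The function names and toy values below are hypothetical and are not taken from the SSF code base.

```python
def linear(W, b, x):
    """Compute W @ x + b, with W given as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def scale_shift(y, gamma, beta):
    """Elementwise scale and shift of a feature vector."""
    return [g * y_i + s for y_i, g, s in zip(y, gamma, beta)]


def merge_scale_shift(W, b, gamma, beta):
    """Fold the scale and shift into the linear layer's weight and bias."""
    W_merged = [[g * w for w in row] for row, g in zip(W, gamma)]
    b_merged = [g * b_i + s for b_i, g, s in zip(b, gamma, beta)]
    return W_merged, b_merged


# Frozen pre-trained layer (toy values).
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
b = [0.1, -0.2, 0.3]
# Scale and shift learned during fine-tuning (toy values).
gamma = [2.0, 0.5, -1.0]
beta = [0.0, 1.0, 0.25]
x = [1.5, -2.0]

# Training-time path: frozen layer followed by scale and shift.
train_out = scale_shift(linear(W, b, x), gamma, beta)

# Inference-time path: a single merged layer, no extra operations.
W_merged, b_merged = merge_scale_shift(W, b, gamma, beta)
infer_out = linear(W_merged, b_merged, x)

print(train_out)
print(infer_out)
```

Both paths produce the same output up to floating-point rounding, so under this assumption the merged layer has the same shape and inference cost as the original one.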


Convolutional Bypasses Are Better Vision Transformer Adapters

Shibo Jie , Zhi-Hong Deng

Categories: Computer Vision

2022-07-14

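The pretrain-then-finetune paradigm has been widely adopted in computer vision. However, as the size of Vision Transformers (ViTs) grows exponentially, full fine-tuning becomes prohibitive because of the heavier storage overhead. Motivated by parameter-efficient transfer learning (PETL), recent studies attempt to insert lightweight adaptation modules (e.g., adapter layers or prompt tokens) into a pretrained ViT and fine-tune only these modules while the pretrained weights stay frozen. However, these modules were originally proposed for fine-tuning language models. Although they port well to ViT, their design lacks prior knowledge of visual tasks. In this paper, we propose to construct Convolutional Bypasses (Convpass) in ViT as adaptation modules, introducing only a small number of trainable parameters (less than 0.5% of the model parameters) to adapt a large ViT. Unlike other PETL methods, Convpass benefits from the hard-coded inductive bias of convolutional layers and is therefore better suited to visual tasks, especially in the low-data regime. Experimental results on the VTAB-1K benchmark and few-shot learning datasets show that Convpass outperforms current language-oriented adaptation modules, demonstrating the need to tailor vision-oriented adaptation modules for vision models.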


FacT: Factor-Tuning for Lightweight Adaptation on Vision Transformer

Shibo Jie , Zhi-Hong Deng

Categories: Computer Vision

2022-12-06

Recent work has explored the potential to adapt a pre-trained vision transformer (ViT) by updating only a few parameters so as to improve storage efficiency, called parameter-efficient transfer learning (PETL). Current PETL methods have shown that by tuning only 0.5% of the parameters, ViT can be adapted to downstream tasks with even better performance than full fine-tuning. In this paper, we aim to further promote the efficiency of PETL to meet the extreme storage constraints of real-world applications. To this end, we propose a tensorization-decomposition framework to store the weight increments, in which the weights of each ViT are tensorized into a single 3D tensor, and their increments are then decomposed into lightweight factors. In the fine-tuning process, only the factors need to be updated and stored, termed Factor-Tuning (FacT). On the VTAB-1K benchmark, our method performs on par with NOAH, the state-of-the-art PETL method, while being 5x more parameter-efficient. We also present a tiny version that uses only 8K trainable parameters (0.01% of ViT's parameters) but outperforms full fine-tuning and many other PETL methods such as VPT and BitFit. In few-shot settings, FacT also beats all PETL baselines using the fewest parameters, demonstrating its strong capability in the low-data regime.


HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Yongming Rao , Wenliang Zhao , Yansong Tang , Jie Zhou , Ser-Nam Lim , Jiwen Lu

Categories: Computer Vision

2022-07-28

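Recent progress in vision Transformers has brought great success in various tasks, driven by a new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range, and high-order spatial interactions, can also be implemented efficiently with a convolution-based framework. We present the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv), which performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $\textit{g}^\textit{n}$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on this operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from its effectiveness in visual encoders, we also show that $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that $\textit{g}^\textit{n}$Conv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/hornet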


A ConvNet for the 2020s

Zhuang Liu , Hanzi Mao , Chao-Yuan Wu , Christoph Feichtenhofer , Trevor Darrell , Saining Xie

Categories: Computer Vision

2022-01-10

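The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers rather than the inherent inductive biases of convolutions. In this work, we reexamine the design space and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.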


Parameter-Efficient Image-to-Video Transfer Learning

Junting Pan , Ziyi Lin , Xiatian Zhu , Jing Shao , Hongsheng Li

Categories: Computer Vision

2022-06-27

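Capitalizing on large pre-trained models for various downstream tasks of interest has recently emerged with promising performance. Because model sizes keep growing, the standard task adaptation strategy based on full fine-tuning becomes prohibitively costly in terms of model training and storage. This has led to a new research direction in parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks from the same modality as the pre-trained model (e.g., image understanding). This creates a limitation, because in some specific modalities (e.g., video understanding) strong pre-trained models with sufficient knowledge are scarce or unavailable. In this work, we investigate such a novel cross-modality transfer learning setting, namely parameter-efficient image-to-video transfer learning. To solve this problem, we propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task. With a built-in spatio-temporal reasoning capability in a compact design, ST-Adapter enables a pre-trained image model without temporal knowledge to reason about dynamic video content at a small (~8%) per-task parameter cost, requiring approximately 20 times fewer updated parameters than previous work. Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, while enjoying the advantage of parameter efficiency.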


Parameter-efficient Model Adaptation for Vision Transformers

Xuehai He , Chunyuan Li , Pengchuan Zhang , Jianwei Yang , Xin Eric Wang

Categories: Computer Vision | Artificial Intelligence

2022-03-29

In computer vision, adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks has achieved great transfer learning performance. Common approaches for model adaptation either update all model parameters or leverage linear probes. In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task. We formulate efficient model adaptation as a subspace training problem and perform a comprehensive benchmarking over different efficient adaptation methods. We conduct an empirical study on each efficient model adaptation method, focusing on its performance alongside parameter cost. Furthermore, we propose a parameter-efficient model adaptation framework, which first selects submodules by measuring local intrinsic dimensions and then projects them into a subspace for further decomposition via a novel Kronecker Adaptation (KAdaptation) method. We analyze and compare our method with a diverse set of baseline model adaptation methods (including state-of-the-art methods for pretrained language models). Our method performs the best in terms of the tradeoff between accuracy and parameter efficiency across 20 image classification datasets under the few-shot setting and 7 image classification datasets under the full-shot setting.


GhostNets on Heterogeneous Devices via Cheap Operations

Kai Han , Yunhe Wang , Chang Xu , Jianyuan Guo , Chunjing Xu , Enhua Wu , Qi Tian

Categories: Computer Vision

2022-01-10

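Deploying convolutional neural networks (CNNs) on mobile devices is difficult because of limited memory and computation resources. We aim to design efficient neural networks for heterogeneous devices, including CPUs and GPUs, by exploiting the redundancy in feature maps, which has rarely been investigated in neural architecture design. For CPU-like devices, we propose a novel CPU-efficient Ghost (C-Ghost) module to generate more feature maps from cheap operations. Based on a set of intrinsic feature maps, we apply a series of cheap linear transformations to generate many ghost feature maps that can fully reveal the information underlying the intrinsic features. The proposed C-Ghost module can be used as a plug-and-play component to upgrade existing convolutional neural networks. C-Ghost bottlenecks are designed to stack C-Ghost modules, and the lightweight C-GhostNet can then be easily established. We further consider efficient networks for GPU devices. Without involving too many GPU-inefficient operations (e.g., depth-wise convolution) in a building stage, we propose to utilize stage-wise feature redundancy to formulate a GPU-efficient Ghost (G-Ghost) stage structure. The features in a stage are split into two parts: the first part is processed using the original block with fewer output channels to generate intrinsic features, and the other part is generated using cheap operations by exploiting stage-wise redundancy. Experiments conducted on benchmarks demonstrate the effectiveness of the proposed C-Ghost module and G-Ghost stage. C-GhostNet and G-GhostNet achieve the best trade-off between accuracy and latency for CPU and GPU, respectively. Code is available at https://github.com/huawei-noah/cv-backbones.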
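The saving from generating ghost feature maps with cheap operations can be made concrete with a rough parameter count. This is a sketch under stated assumptions rather than the authors' code: the cheap linear transformations are assumed to be depthwise convolutions with a small kernel, biases are ignored, the output channel count is assumed to be divisible by the ratio, and all names and sizes are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weight count of an ordinary convolution layer (bias ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size


def ghost_module_params(in_channels, out_channels, kernel_size, ratio, cheap_kernel_size):
    """Weight count when only out_channels // ratio maps come from an ordinary convolution.

    The remaining maps are assumed to be produced by depthwise convolutions
    applied to the intrinsic maps, one small kernel per ghost map.
    """
    intrinsic = out_channels // ratio
    primary = conv_params(in_channels, intrinsic, kernel_size)
    cheap = intrinsic * (ratio - 1) * cheap_kernel_size * cheap_kernel_size
    return primary + cheap


# Hypothetical layer: 64 input channels, 128 output channels, 3x3 kernels.
ordinary = conv_params(64, 128, 3)
ghost = ghost_module_params(64, 128, 3, ratio=2, cheap_kernel_size=3)

print(ordinary, ghost, round(ordinary / ghost, 2))
```

With a ratio of 2 the count drops by almost half in this toy setting, because the depthwise kernels add very few weights compared with the ordinary convolution they replace.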


Reversible Column Networks

Yuxuan Cai , Yizhuang Zhou , Qi Han , Jianjian Sun , Xiangwen Kong , Jun Li , Xiangyu Zhang

Categories: Computer Vision

2022-12-22

We propose a new neural network design paradigm, the Reversible Column Network (RevCol). The main body of RevCol is composed of multiple copies of a subnetwork, called columns, between which multi-level reversible connections are employed. This architectural scheme gives RevCol very different behavior from conventional networks: during forward propagation, features in RevCol are gradually disentangled as they pass through each column, and their total information is maintained rather than compressed or discarded as in other networks. Our experiments suggest that CNN-style RevCol models can achieve very competitive performance on multiple computer vision tasks such as image classification, object detection, and semantic segmentation, especially with a large parameter budget and a large dataset. For example, after ImageNet-22K pre-training, RevCol-XL obtains 88.2% ImageNet-1K accuracy. Given more pre-training data, our largest model RevCol-H reaches 90.0% on ImageNet-1K, 63.8% APbox on the COCO detection minival set, and 61.0% mIoU on ADE20k segmentation. To our knowledge, these are the best COCO detection and ADE20k segmentation results among pure (static) CNN models. Moreover, as a general macro architecture, RevCol can also be introduced into transformers or other neural networks, which is demonstrated to improve performance in both computer vision and NLP tasks. We release code and models at https://github.com/megvii-research/RevCol


Pro-tuning: Unified Prompt Tuning for Vision Tasks

Xing Nie , Bolin Ni , Jianlong Chang , Gaomeng Meng , Chunlei Huo , Zhaoxiang Zhang , Shiming Xiang , Qi Tian , Chunhong Pan

Categories: Computer Vision

2022-07-28

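In computer vision, fine-tuning is the de facto approach to leveraging pre-trained vision models for downstream tasks. However, deploying it in practice is quite challenging, because it adopts parameter-inefficient global updates and relies heavily on high-quality downstream data. Recently, prompt-based learning, which adds a task-relevant prompt to adapt downstream tasks to pre-trained models, has drastically boosted the performance of many natural language downstream tasks. In this work, we extend this notable transfer ability of prompts to vision models as an alternative to fine-tuning. To this end, we propose parameter-efficient prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks. The key to Pro-tuning is prompt-based tuning, i.e., learning task-specific vision prompts for downstream input images with the pre-trained model frozen. By training only a few additional parameters, it can work on diverse CNN-based and Transformer-based architectures. Extensive experiments show that Pro-tuning outperforms fine-tuning across a broad range of vision tasks and scenarios, including image classification (generic objects, class imbalance, image corruption, adversarial robustness, and out-of-distribution generalization) and dense prediction tasks such as object detection and semantic segmentation.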


Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving

Xiwen Liang , Yangxin Wu , Jianhua Han , Hang Xu , Chunjing Xu , Xiaodan Liang

Categories: Computer Vision

2022-09-19

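Aiming at a holistic understanding of multiple downstream tasks simultaneously, there is a need to extract features with better transferability. Although many recent self-supervised pre-training methods have achieved impressive performance on various vision tasks under the prevailing pretrain-finetune paradigm, their generalization capacity to multi-task learning scenarios is yet to be explored. In this paper, we extensively investigate the transfer performance of various types of self-supervised methods, e.g., MoCo and SimCLR, on three downstream tasks, namely semantic segmentation, drivable area segmentation, and traffic object detection, on the large-scale driving dataset BDD100K. We surprisingly find that their performance is sub-optimal or even lags far behind the single-task baseline, which may be due to the differences in training objectives and architectural design inherent in the pretrain-finetune paradigm. To overcome this dilemma and avoid redesigning the resource-intensive pre-training stage, we propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training, in which off-the-shelf pretrained models can be effectively adapted without increasing the training overhead. During the adapt stage, we use learnable multi-scale adapters to dynamically adjust the pretrained model weights under the supervision of multi-task objectives while leaving the pretrained knowledge untouched. Furthermore, we regard the vision-language pre-training model CLIP as a strong complement to the pretrain-adapt-finetune paradigm and propose a novel adapter named LV-Adapter, which incorporates language priors into the multi-task model via task-specific prompting and alignment between visual and textual features.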


KronA: Parameter Efficient Tuning with Kronecker Adapter

Ali Edalati , Marzieh Tahaei , Ivan Kobyzev , Vahid Partovi Nia , James J. Clark , Mehdi Rezagholizadeh

Categories: Natural Language Processing

2022-12-20

Fine-tuning a Pre-trained Language Model (PLM) on a specific downstream task has been a well-known paradigm in Natural Language Processing. However, with the ever-growing size of PLMs, training the entire model on several downstream tasks becomes very expensive and resource-hungry. Recently, different Parameter Efficient Tuning (PET) techniques have been proposed to improve the efficiency of fine-tuning PLMs. One popular category of PET methods is low-rank adaptation, which inserts learnable truncated SVD modules into the original model either sequentially or in parallel. However, low-rank decomposition suffers from limited representation power. In this work, we address this problem using the Kronecker product instead of the low-rank representation. We introduce KronA, a Kronecker product-based adapter module for efficient fine-tuning of Transformer-based PLMs. We apply the proposed methods to fine-tuning T5 on the GLUE benchmark to show that incorporating the Kronecker-based modules can outperform state-of-the-art PET methods.
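The contrast between a Kronecker-product update and a low-rank one can be shown with a small sketch. It only illustrates the Kronecker product itself and the resulting parameter count; it is not the KronA module, and the function name and toy matrices are hypothetical.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    rows_a, cols_a = len(A), len(A[0])
    rows_b, cols_b = len(B), len(B[0])
    return [[A[i // rows_b][j // cols_b] * B[i % rows_b][j % cols_b]
             for j in range(cols_a * cols_b)]
            for i in range(rows_a * rows_b)]


# Two small trainable factors (toy values).
A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]

# The weight update they represent is a 4 x 4 matrix.
delta_W = kron(A, B)

factor_params = len(A) * len(A[0]) + len(B) * len(B[0])
dense_params = len(delta_W) * len(delta_W[0])

for row in delta_W:
    print(row)
print(factor_params, dense_params)
```

Because the rank of a Kronecker product is the product of the ranks of its factors, the update built here from two full-rank 2 x 2 factors is itself full rank, whereas a low-rank decomposition caps the rank of the update by construction.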


DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Yongming Rao , Wenliang Zhao , Guangyi Chen , Yansong Tang , Zheng Zhu , Guan Huang , Jie Zhou , Jiwen Lu

Categories: Computer Vision | Artificial Intelligence | Machine Learning

2021-12-02

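Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for learning high-quality visual representations from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been explored. In this work, we present a new framework for dense prediction that implicitly and explicitly leverages the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP into a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to prompt the language model, we help our model better exploit the pre-trained knowledge. Our method is model-agnostic: it can be applied to arbitrary dense prediction systems and various pre-trained visual backbones, including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our method on semantic segmentation, object detection, and instance segmentation tasks. Code is available at https://github.com/raoyongming/denseclip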


Prompt-Matched Semantic Segmentation

Lingbo Liu , Bruce X. B. Yu , Jianlong Chang , Qi Tian , Chang-Wen Chen

Categories: Computer Vision

2022-08-22

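The objective of this work is to explore how to effectively and efficiently adapt pre-trained foundation models to various downstream tasks of image semantic segmentation. Conventional methods usually fine-tune the whole network for each specific dataset, and storing the massive parameters of these networks is burdensome. A few recent works attempt to insert some trainable parameters into the frozen network to learn visual prompts for efficient tuning. However, these works significantly modify the original structure of standard modules, making them inoperable on many existing high-speed inference devices in which standard modules and their parameters have been embedded. To facilitate prompt-based semantic segmentation, we propose a novel Inter-Stage Prompt-Matched Framework, which maintains the original structure of the foundation model while adaptively generating visual prompts for task-oriented tuning. Specifically, the pre-trained model is first divided into multiple stages, whose parameters are frozen and shared across all semantic segmentation tasks. A lightweight module termed the Semantic-aware Prompt Matcher is then introduced to hierarchically interpolate between two stages and learn reasonable prompts for each specific task under the guidance of interim semantic maps. In this way, we can better stimulate the pre-trained knowledge of the frozen model to learn semantic concepts effectively on downstream datasets. Extensive experiments on five benchmarks show that the proposed method achieves a promising trade-off between parameter efficiency and performance.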


InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Wenhai Wang , Jifeng Dai , Zhe Chen , Zhenhang Huang , Zhiqi Li , Xizhou Zhu , Xiaowei Hu , Tong Lu , Lewei Lu , Hongsheng Li

Categories: Computer Vision

2022-11-10

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still at an early stage. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can benefit from increasing parameters and training data like ViTs. Unlike recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has adaptive spatial aggregation conditioned on input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is demonstrated on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record of 65.4 mAP on COCO test-dev. The code will be released at https://github.com/OpenGVLab/InternImage.


Towards a Unified View of Parameter-Efficient Transfer Learning

Junxian He , Chunting Zhou , Xuezhe Ma , Taylor Berg-Kirkpatrick , Graham Neubig

Categories: Natural Language Processing | Machine Learning

2021-10-08

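Fine-tuning large pre-trained language models on downstream tasks has become the de facto learning paradigm in NLP. However, conventional approaches fine-tune all the parameters of the pre-trained model, which becomes prohibitive as the model size and the number of tasks grow. Recent work has proposed a variety of parameter-efficient transfer learning methods that fine-tune only a small number of (extra) parameters to attain strong performance. While effective, the critical ingredients for success and the connections among the various methods are poorly understood. In this paper, we break down the design of state-of-the-art parameter-efficient transfer learning methods and present a unified framework that establishes connections between them. Specifically, we re-frame them as modifications to specific hidden states in pre-trained models, and define a set of design dimensions along which different methods vary, such as the function used to compute the modification and the position at which the modification is applied. Through comprehensive empirical studies across machine translation, text summarization, language understanding, and text classification benchmarks, we use the unified view to identify important design choices in previous methods. Furthermore, our unified framework enables the transfer of design elements across different approaches, and as a result we are able to instantiate new parameter-efficient fine-tuning methods that tune fewer parameters than previous methods while being more effective, achieving results comparable to fine-tuning all parameters on all four tasks.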


Representation Separation for Semantic Segmentation with Vision Transformers

Yuanduo Hong , Huihui Pan , Weichao Sun , Xinghu Yu , Huijun Gao

Categories: Computer Vision | Artificial Intelligence

2022-12-28

Vision transformers (ViTs), which encode an image as a sequence of patches, bring new paradigms for semantic segmentation. We present an efficient framework of representation separation at the local-patch level and the global-region level for semantic segmentation with ViTs. It targets the peculiar over-smoothness of ViTs in semantic segmentation, and therefore differs from the currently popular paradigms of context modeling and from most existing related methods, which reinforce the advantage of attention. We first introduce a decoupled two-pathway network in which a second pathway enhances and passes down local-patch discrepancy complementary to the global representations of transformers. We then propose a spatially adaptive separation module to obtain more separate deep representations, and a discriminative cross-attention that yields more discriminative region representations through novel auxiliary supervisions. The proposed methods achieve some impressive results: 1) combined with large-scale plain ViTs, our methods achieve new state-of-the-art performance on five widely used benchmarks; 2) using masked pre-trained plain ViTs, we achieve 68.9% mIoU on Pascal Context, setting a new record; 3) pyramid ViTs integrated with the decoupled two-pathway network even surpass well-designed high-resolution ViTs on Cityscapes; 4) the representations improved by our framework transfer favorably to images with natural corruptions. The code will be released publicly.


Vision Transformers with Hierarchical Attention

Yun Liu , Yu-Huan Wu , Guolei Sun , Le Zhang , Ajad Chhatkuli , Luc Van Gool

Categories: Computer Vision

2021-06-06

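This paper tackles the low-efficiency flaw of vision transformers caused by the high computational and space complexity of Multi-Head Self-Attention (MHSA). To this end, we propose Hierarchical MHSA (H-MHSA), whose representation is computed in a hierarchical manner. Specifically, we first divide the input image into patches, as is commonly done, and each patch is viewed as a token. The proposed H-MHSA then learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Because attention is computed over only a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/hat-net.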
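The reduction in attention cost can be estimated with a back-of-the-envelope count of pairwise attention scores. This is only a sketch under simplifying assumptions: local attention is assumed to run inside non-overlapping windows, the global step is assumed to attend merged tokens to merged tokens, and the function names, window size, and merge factor are hypothetical rather than taken from HAT-Net.

```python
def full_attention_pairs(num_tokens):
    """Pairwise scores computed by ordinary self-attention over all tokens."""
    return num_tokens * num_tokens


def hierarchical_attention_pairs(num_tokens, window_tokens, merge_factor):
    """Pairwise scores for a local step plus a global step over merged tokens."""
    num_windows = num_tokens // window_tokens
    local_pairs = num_windows * window_tokens * window_tokens
    merged_tokens = num_tokens // merge_factor
    global_pairs = merged_tokens * merged_tokens
    return local_pairs + global_pairs


# Hypothetical setting: a 56 x 56 token grid, 7 x 7 local windows,
# and every 7 x 7 group of tokens merged into one token for the global step.
num_tokens = 56 * 56
full = full_attention_pairs(num_tokens)
hierarchical = hierarchical_attention_pairs(num_tokens, window_tokens=49, merge_factor=49)

print(full, hierarchical, round(full / hierarchical, 1))
```

In this toy setting the hierarchical count is dominated by the local step, which grows linearly with the number of tokens for a fixed window size, while the global step stays small because it sees only the merged tokens.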
