Smart Paper Notes

CAT: Learning to Collaborate Channel and Spatial Attention from Multi-Information Fusion

Zizhang Wu , Man Wang , Weiwei Sun , Yuchen Li , Tianhao Xu , Fan Wang , Keke Huang

Category: Computer Vision

2022-12-13

Channel and spatial attention mechanisms have been shown to provide a clear performance boost for deep convolutional neural networks (CNNs). Most existing methods focus on one of the two or run them in parallel (or in series), neglecting the collaboration between the two attentions. To better establish the feature interaction between the two types of attention, we propose a plug-and-play attention module, which we term "CAT": activating the Collaboration between spatial and channel Attentions based on learned Traits. Specifically, we represent traits as trainable coefficients (i.e., colla-factors) to adaptively combine the contributions of different attention modules so as to better fit different image hierarchies and tasks. Moreover, we propose global entropy pooling (GEP) alongside the global average pooling (GAP) and global maximum pooling (GMP) operators; GEP is an effective component for suppressing noise signals by measuring the information disorder of feature maps. We introduce a three-way pooling operation into attention modules and apply the adaptive mechanism to fuse their outcomes. Extensive experiments on MS COCO, Pascal VOC, CIFAR-100, and ImageNet show that CAT outperforms existing state-of-the-art attention mechanisms in object detection, instance segmentation, and image classification. The model and code will be released soon.
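The three-way pooling and adaptive fusion described in the CAT abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so two choices here are assumptions: entropy pooling is computed as the Shannon entropy of a softmax over spatial positions, and the trainable coefficients (`colla_logits`, a hypothetical name) are normalised with a softmax before blending.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def flatten(feature_map):
    """Flatten an H x W feature map (list of rows) into one list."""
    return [v for row in feature_map for v in row]


def global_average_pool(feature_map):
    flat = flatten(feature_map)
    return sum(flat) / len(flat)


def global_max_pool(feature_map):
    return max(flatten(feature_map))


def global_entropy_pool(feature_map):
    # Assumption: activations become a spatial distribution via softmax,
    # and the pooled value is that distribution's Shannon entropy.
    probs = softmax(flatten(feature_map))
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse_three_way(feature_map, colla_logits):
    """Blend GAP, GMP and GEP with softmax-normalised coefficients."""
    weights = softmax(colla_logits)
    pooled = [
        global_average_pool(feature_map),
        global_max_pool(feature_map),
        global_entropy_pool(feature_map),
    ]
    return sum(w * p for w, p in zip(weights, pooled))


flat_map = [[1.0, 1.0], [1.0, 1.0]]      # no structure: maximal entropy
peaked_map = [[10.0, 0.0], [0.0, 0.0]]   # one dominant response
print(global_entropy_pool(flat_map), global_entropy_pool(peaked_map))
print(fuse_three_way(peaked_map, [0.0, 0.0, 0.0]))
```

In this toy setting the uniform map yields the highest entropy and the peaked map a much lower one, which is the kind of "information disorder" signal the abstract attributes to GEP. In a real module the coefficients would be learned per layer rather than fixed.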


CBAM: Convolutional Block Attention Module

Sanghyun Woo , Jongchan Park , Joon-Young Lee , In So Kweon

Category:

2018-07-17

We propose the Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, and the attention maps are then multiplied with the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architecture seamlessly with negligible overhead and is end-to-end trainable along with base CNNs. We validate CBAM through extensive experiments on the ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show consistent improvements in classification and detection performance with various models, demonstrating the wide applicability of CBAM. The code and models will be publicly available.
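The sequential "channel first, then spatial" refinement described in the CBAM abstract might be sketched as follows. This is a deliberately simplified illustration, not the published module: the learned layers that would normally turn pooled descriptors into attention logits are replaced by a plain sum (an assumption made to keep the example parameter-free), and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(features):
    """One weight per channel from average- and max-pooled descriptors."""
    weights = []
    for channel in features:
        flat = [v for row in channel for v in row]
        avg_desc = sum(flat) / len(flat)
        max_desc = max(flat)
        # Simplification: learned layers are replaced by a plain sum.
        weights.append(sigmoid(avg_desc + max_desc))
    return weights


def spatial_attention(features):
    """One weight per position from channel-wise average and max maps."""
    height, width = len(features[0]), len(features[0][0])
    weights = []
    for i in range(height):
        row = []
        for j in range(width):
            column = [channel[i][j] for channel in features]
            # Simplification: the learned convolution is replaced by a sum.
            row.append(sigmoid(sum(column) / len(column) + max(column)))
        weights.append(row)
    return weights


def cbam_like_refine(features):
    """Apply channel attention first, then spatial attention."""
    ch_w = channel_attention(features)
    scaled = [[[v * w for v in row] for row in channel]
              for channel, w in zip(features, ch_w)]
    sp_w = spatial_attention(scaled)
    return [[[v * sp_w[i][j] for j, v in enumerate(row)]
             for i, row in enumerate(channel)]
            for channel in scaled]


# Input layout: channels x height x width, as nested lists.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
refined = cbam_like_refine(features)
print(refined)
```

The output keeps the input shape, and every value is scaled by a channel weight and a position weight that both lie in (0, 1), which mirrors the multiplicative refinement the abstract describes.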


SDA-$x$Net: Selective Depth Attention Networks for Adaptive Multi-scale Feature Representation

Qingbei Guo , Xiao-Jun Wu , Zhiquan Feng , Tianyang Xu , Cong Hu

Category: Computer Vision

2022-09-21

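Existing multi-scale solutions risk merely increasing receptive field sizes while neglecting small receptive fields. It is therefore challenging to construct adaptive neural networks that effectively recognize objects at various spatial scales. To tackle this issue, we first introduce a new attention dimension, depth, in addition to existing attention dimensions such as channel, spatial, and branch, and present a novel selective depth attention network that symmetrically handles multi-scale objects in various vision tasks. Specifically, the blocks within each stage of a given neural network, i.e., ResNet, output hierarchical feature maps that share the same resolution but have different receptive field sizes. Based on this structural property, we design a stage-wise building module, namely SDA, which includes a trunk branch and an SE-like attention branch. The block outputs of the trunk branch are fused to guide their depth attention allocation through the attention branch. With the proposed attention mechanism, we can dynamically select features from different depths, which helps adaptively adjust the receptive field sizes for variable-sized input objects. In this way, the cross-block information interaction leads to long-range dependencies along the depth direction. Compared with other multi-scale approaches, our SDA method combines multiple receptive fields from previous blocks into the stage output, thus offering a wider and richer range of effective receptive fields. Moreover, our method can serve as a pluggable module for other multi-scale networks as well as attention networks, coined SDA-$x$Net. Their combination further extends the range of effective receptive fields and enables interpretable neural networks. Our source code is available at https://github.com/qingbeiguo/sda-xnet.git.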


An advanced YOLOv3 method for small object detection

Baokai Liu , Fengjie He , Shiqiang Du , Jiacheng Li , Wenjie Liu

Category: Computer Vision

2022-12-06

In recent years, object detection has improved substantially, but detection results for small objects are still not satisfactory. To address this issue, this work proposes a strategy based on feature fusion and dilated convolution, which uses dilated convolution to broaden the receptive field of feature maps at various scales. On the one hand, this improves the detection accuracy of larger objects. On the other hand, it provides more contextual information for small objects, which helps improve their detection accuracy. The shallow semantic information of small objects is obtained by filtering out the noise in the feature map, and more feature information of small objects is preserved by using a multi-scale fusion feature module and an attention mechanism. Fusing this shallow feature information with deep semantic information generates richer feature maps for small object detection. Experiments show that this method achieves higher accuracy than the traditional YOLOv3 network in detecting small objects and occluded objects. In addition, we achieve 32.8% mean Average Precision on the detection of small objects on the MS COCO2017 test set. For 640×640 input, this method reaches 88.76% mAP on the PASCAL VOC2012 dataset.
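How dilation broadens the receptive field can be checked with a little arithmetic. The sketch below uses the standard relation that a kernel of size k with dilation d covers k + (k - 1)(d - 1) input positions per axis; the layer configurations are illustrative assumptions and are not taken from the paper.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span (per axis) covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    layers: iterable of (kernel_size, dilation) pairs, input side first.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += effective_kernel_size(kernel_size, dilation) - 1
    return field


plain = [(3, 1), (3, 1), (3, 1)]     # three ordinary 3x3 convolutions
dilated = [(3, 1), (3, 2), (3, 4)]   # same parameter count, growing dilation

print(receptive_field(plain))    # 7
print(receptive_field(dilated))  # 15
```

With the same number of weights, the dilated stack sees a 15-pixel span per axis instead of 7, which is the extra context the abstract credits for helping small-object detection.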


Dual Complementary Dynamic Convolution for Image Recognition

Longbin Yan , Yunxiao Qin , Shumin Liu , Jie Chen

Category: Computer Vision

2022-11-11

As a powerful engine, vanilla convolution has promoted huge breakthroughs in various computer tasks. However, it often suffers from sample-agnostic and content-agnostic problems, which limit the representation capacity of convolutional neural networks (CNNs). In this paper, we for the first time model scene features as a combination of local spatial-adaptive parts owned by the individual and global shift-invariant parts shared by all individuals, and then propose a novel two-branch dual complementary dynamic convolution (DCDC) operator to flexibly deal with these two types of features. The DCDC operator overcomes the limitations of vanilla convolution and of most existing dynamic convolutions, which capture only spatial-adaptive features, and thus markedly boosts the representation capacity of CNNs. Experiments show that DCDC-based ResNets (DCDC-ResNets) significantly outperform vanilla ResNets and most state-of-the-art dynamic convolutional networks on image classification, as well as on downstream tasks including object detection and instance and panoptic segmentation, with fewer FLOPs and parameters.


Res2Net: A New Multi-scale Backbone Architecture

Shang-Hua Gao , Ming-Ming Cheng , Kai Zhao , Xin-Yu Zhang , Ming-Hsuan Yang , Philip Torr

Category:

2019-04-02

Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layerwise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods. The source code and trained models are available on https://mmcheng.net/res2net/.
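The "hierarchical residual-like connections within one single residual block" can be illustrated with a small sketch. The wiring below (the first split passes through unchanged, and each later split is transformed after adding the previous split's output) follows the commonly cited description of the block and should be read as an assumption, since the abstract does not spell it out; `res2net_forward` and the scalar stand-ins are hypothetical.

```python
def res2net_forward(splits, transforms):
    """Hierarchical residual-like connections over feature splits.

    splits:     s feature groups x_1..x_s (scalars stand in for tensors)
    transforms: s - 1 callables K_2..K_s (stand-ins for 3x3 convolutions)
    """
    if len(transforms) != len(splits) - 1:
        raise ValueError("need exactly one transform per split after the first")
    outputs = [splits[0]]          # y_1 = x_1 (no transform)
    previous = None
    for x_i, k_i in zip(splits[1:], transforms):
        # y_2 = K_2(x_2); y_i = K_i(x_i + y_{i-1}) for i > 2
        y_i = k_i(x_i) if previous is None else k_i(x_i + previous)
        outputs.append(y_i)
        previous = y_i
    return outputs


# Toy transform: adding 1 counts how many "convolutions" a path has crossed.
count_conv = lambda v: v + 1
depths = res2net_forward([0, 0, 0, 0], [count_conv] * 3)
print(depths)  # [0, 1, 2, 3]
```

Each split ends up having passed through a different number of transforms, so a single block mixes several receptive field sizes, which matches the abstract's claim of multi-scale features at a granular level.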


ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

Qilong Wang , Banggu Wu , Pengfei Zhu , Peihua Li , Wangmeng Zuo , Qinghua Hu

Category:

2019-10-08

Recently, the channel attention mechanism has been shown to offer great potential for improving the performance of deep convolutional neural networks (CNNs). However, most existing methods are dedicated to developing more sophisticated attention modules to achieve better performance, which inevitably increases model complexity. To overcome the paradox of the performance and complexity trade-off, this paper proposes an Efficient Channel Attention (ECA) module, which involves only a handful of parameters while bringing a clear performance gain. By dissecting the channel attention module in SENet, we empirically show that avoiding dimensionality reduction is important for learning channel attention, and that appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity. Therefore, we propose a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1D convolution. Furthermore, we develop a method to adaptively select the kernel size of the 1D convolution, which determines the coverage of local cross-channel interaction. The proposed ECA module is efficient yet effective, e.g., the parameters and computations of our modules against a ResNet50 backbone are 80 vs. 24.37M and 4.7e-4 GFLOPs vs. 3.86 GFLOPs, respectively, and the performance boost is more than 2% in terms of Top-1 accuracy. We extensively evaluate our ECA module on image classification, object detection, and instance segmentation with ResNet and MobileNetV2 backbones. The experimental results show that our module is more efficient while performing favorably against its counterparts.
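The local cross-channel interaction via 1D convolution and the adaptive kernel size can be sketched as below. The abstract does not state the selection rule, so the mapping used here (nearest odd value of (log2(C) + b) / gamma with gamma = 2 and b = 1) is an assumption based on the commonly cited formulation; the kernel values are made-up stand-ins for learned weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def eca_kernel_size(channels, gamma=2, b=1):
    # Assumption: nearest-odd mapping k = |(log2(C) + b) / gamma|_odd.
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


def conv1d_same(signal, kernel):
    """Zero-padded 1D sliding dot product that preserves length (odd kernels)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]


def eca_weights(channel_descriptor, kernel):
    """Per-channel attention weights without dimensionality reduction."""
    return [sigmoid(v) for v in conv1d_same(channel_descriptor, kernel)]


# Globally pooled descriptor, one value per channel (toy numbers).
descriptor = [0.5, -1.0, 2.0, 0.0]
kernel = [0.2, 0.6, 0.2]   # stand-in for the k learned weights
weights = eca_weights(descriptor, kernel)
print(eca_kernel_size(64), eca_kernel_size(256), eca_kernel_size(2048))
print(weights)
```

The only parameters are the k kernel weights shared across all channels, which is why the parameter count stays tiny compared with a bottleneck MLP.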


Dual Attention Network for Scene Segmentation

Jun Fu , Jing Liu , Haijie Tian , Yong Li , Yongjun Bao , Zhiwei Fang , Hanqing Lu

Category:

2018-09-09

In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of a dilated FCN, which model the semantic interdependencies in the spatial and channel dimensions respectively. The position attention module selectively aggregates the feature at each position by a weighted sum of the features at all positions. Similar features are related to each other regardless of their distance. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve the feature representation, which contributes to more precise segmentation results. We achieve new state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context, and COCO Stuff. In particular, a Mean IoU score of 81.5% on the Cityscapes test set is achieved without using coarse data.
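The position attention idea, where each position is rebuilt as a similarity-weighted sum over all positions, can be sketched as follows. This is a stripped-down illustration under stated assumptions: the learned projections and residual scaling that a full module would use are omitted, and raw features act as queries, keys, and values at once.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def position_attention(features):
    """features: list of N position vectors, each of length C.

    Assumption: no learned projections; raw features serve as
    queries, keys and values.
    """
    outputs = []
    for query in features:
        # Similarity of this position to every position, normalised to sum to 1.
        weights = softmax([dot(query, key) for key in features])
        aggregated = [
            sum(w * value[c] for w, value in zip(weights, features))
            for c in range(len(query))
        ]
        outputs.append(aggregated)
    return outputs


# Positions 0 and 1 look alike; position 2 differs.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
attended = position_attention(features)
print(attended)
```

Positions 0 and 1 receive identical outputs even though nothing encodes how far apart they are, which illustrates the abstract's point that similar features are related regardless of distance.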


A Close Look at Spatial Modeling: From Attention to Convolution

Xu Ma , Huan Wang , Can Qin , Kunpeng Li , Xingchen Zhao , Jie Fu , Yun Fu

Category: Computer Vision | Artificial Intelligence | Machine Learning

2022-12-23

Vision Transformers have shown great promise recently for many vision tasks due to their insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a query-irrelevant behavior at deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (also head-irrelevant). Second, the attention maps are intrinsically sparse: a few tokens dominate the attention weights, and introducing knowledge from ConvNets largely smooths the attention and enhances performance. Motivated by the above observations, we generalize the self-attention formulation to abstract a query-irrelevant global context directly and further integrate the global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), consists purely of convolutional layers and firmly inherits the merits of both the attention mechanism and convolutions, including dynamic property, weight sharing, and short- and long-range feature modeling. Experimental results demonstrate the effectiveness of FCViT. With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than the previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks such as object detection, instance segmentation, and semantic segmentation. Code and models are available at: https://github.com/ma-xu/FCViT.


DPNet: Dual-Path Network for Real-time Object Detection with Lightweight Attention

Quan Zhou , Huimin Shi , Weikang Xiang , Bin Kang , Xiaofu Wu , Longin Jan Latecki

Category: Computer Vision

2022-09-28

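Recent advances in compressing high-accuracy convolutional neural networks (CNNs) have brought remarkable progress in real-time object detection. To accelerate detection, lightweight detectors typically use a single-path backbone with few convolution layers. A single-path architecture, however, involves successive pooling and downsampling operations, which tend to produce coarse and inaccurate feature maps that are ill-suited to locating objects. On the other hand, due to limited network capacity, recent lightweight networks are often weak at representing large-scale visual data. To address these problems, this paper presents a dual-path network, named DPNet, with a lightweight attention scheme for real-time object detection. The dual-path architecture enables us to extract high-level semantic features and low-level object details in parallel. Although DPNet has a nearly duplicated shape relative to single-path detectors, the computational cost and model size do not increase significantly. To enhance representation capability, a lightweight self-correlation module (LSCM) is designed to capture global interactions with only a small computational overhead and few network parameters. In the neck, LSCM is extended into a lightweight cross-correlation module (LCCM), which captures mutual dependencies among features at neighboring scales. We have conducted exhaustive experiments on the MS COCO and Pascal VOC 2007 datasets. The experimental results demonstrate that DPNet achieves a state-of-the-art trade-off between detection accuracy and implementation efficiency. Specifically, DPNet achieves 30.5% AP on MS COCO test-dev and 81.5% mAP on the Pascal VOC 2007 test set, with a model size of nearly 2.5M, 1.04 GFLOPs, and 164 FPS and 196 FPS for 320 × 320 input images on the two datasets.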


Visual Parser: Representing Part-whole Hierarchies with Transformers

Shuyang Sun , Xiaoyu Yue , Song Bai , Philip Torr

Category: Computer Vision

2021-07-13

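Human vision is able to capture part-whole hierarchical information from an entire scene. This paper presents the Visual Parser (ViP), which explicitly constructs such a hierarchy with transformers. ViP divides visual representations into two levels, the part level and the whole level. The information of each part represents a combination of several independent vectors within the whole. To model the representations of the two levels, we first encode information from the whole into part vectors through an attention mechanism, and then decode the global information within the part vectors back into the whole representation. By iteratively parsing the two levels with the proposed encoder-decoder interaction, the model can gradually refine the features on both levels. Experimental results demonstrate that ViP achieves very competitive performance on three major tasks: classification, detection, and instance segmentation. In particular, it surpasses previous state-of-the-art CNN backbones by a large margin on object detection. The tiny model of the ViP family, with 7.2× fewer parameters and 10.9× fewer FLOPs, performs comparably to ResNeXt-101-64×4d, the largest model of the ResNe(X)t family. Visualization results also demonstrate that the learned parts are highly informative of the predicted class, making ViP more explainable than previous fundamental architectures. Code is available at https://github.com/kevin-ssy/vip.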


Switchable Self-attention Module

Shanshan Zhong , Wushao Wen , Jinghui Qin

Category: Computer Vision

2022-09-13

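The attention mechanism has achieved great success in visual recognition. Many works are devoted to improving the effectiveness of the attention mechanism by carefully designing the structure of the attention operator. These works require extensive experiments to pick the optimal settings whenever the scenario changes, which consumes a great deal of time and computing resources. In addition, a neural network often contains many layers, and most studies use the same attention module to enhance different layers, which hinders further improvement of the self-attention mechanism's performance. To address these problems, we propose a self-attention module, SEM. Based on the input information of the attention module and a set of alternative attention operators, SEM can automatically decide how to select and integrate attention operators to compute attention maps. The effectiveness of SEM is demonstrated by extensive experiments on widely used benchmark datasets and popular self-attention networks.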


GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond

Yue Cao , Jiarui Xu , Stephen Lin , Fangyun Wei , Han Hu

Category:

2019-04-25

The Non-Local Network (NLNet) presents a pioneering approach for capturing long-range dependencies by aggregating query-specific global context to each query position. However, through a rigorous empirical analysis, we have found that the global contexts modeled by the non-local network are almost the same for different query positions within an image. In this paper, we take advantage of this finding to create a simplified network based on a query-independent formulation, which maintains the accuracy of NLNet but with significantly less computation. We further observe that this simplified design shares a similar structure with the Squeeze-Excitation Network (SENet). Hence we unify them into a three-step general framework for global context modeling. Within the general framework, we design a better instantiation, called the global context (GC) block, which is lightweight and can effectively model the global context. The lightweight property allows us to apply it to multiple layers in a backbone network to construct a global context network (GCNet), which generally outperforms both simplified NLNet and SENet on major benchmarks for various recognition tasks. The code and configurations are released at https://github.com/xvjiarui/GCNet.
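The query-independent formulation can be sketched in a few lines: a single attention map pools all positions into one context vector, which is then shared by every position. This is a minimal illustration under stated assumptions; the transform applied to the context in a full block is omitted, and `attention_weight` is a hypothetical stand-in for a learned projection.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_context(features, attention_weight):
    """Pool N position vectors into one context vector.

    The attention scores depend only on each position's own feature,
    never on a query, so one map serves all positions.
    """
    scores = softmax([dot(attention_weight, f) for f in features])
    channels = len(features[0])
    return [sum(s * f[c] for s, f in zip(scores, features))
            for c in range(channels)]


def gc_block(features, attention_weight):
    context = global_context(features, attention_weight)
    # Simplification: the learned transform on the context is omitted;
    # the same context is broadcast-added to every position.
    return [[v + c for v, c in zip(f, context)] for f in features]


features = [[1.0, 2.0], [3.0, 4.0]]
uniform = gc_block(features, [0.0, 0.0])   # zero weights: plain averaging
print(uniform)
```

The context is computed once for N positions instead of N times, which is where the saving over a query-specific non-local block comes from. With zero attention weights the pooling reduces to global averaging, the SENet-style special case that the abstract alludes to.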


Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective

Li Zhang , Sixiao Zheng , Jiachen Lu , Xinxuan Zhao , Xiatian Zhu , Yanwei Fu , Tao Xiang , Jianfeng Feng

Category: Computer Vision

2022-07-19

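Visual representation learning is key to solving various vision problems. Relying on seminal grid-structure priors, convolutional neural networks (CNNs) have become the de facto standard architecture for most deep vision models. For instance, classical semantic segmentation methods often adopt a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on increasing the receptive field, through either dilated (i.e., atrous) convolutions or inserted attention modules. However, the FCN-based architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating visual representation learning as a sequence-to-sequence prediction task. Specifically, we deploy a pure Transformer to encode an image as a sequence of patches, without local convolution or resolution reduction. With the global context modeled in every layer of the Transformer, stronger visual representations can be learned to better tackle vision tasks. In particular, our segmentation model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, first place on the test leaderboard on the day of submission) and Pascal Context (55.83% mIoU), and reaches competitive results on Cityscapes. Furthermore, we formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a hierarchical, pyramidal architecture. Extensive experiments show that our method achieves appealing performance on a variety of visual recognition tasks (e.g., image classification, object detection, instance segmentation, and semantic segmentation).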


Attention Mechanisms in Computer Vision: A Survey

Meng-Hao Guo , Tian-Xing Xu , Jiang-Jiang Liu , Zheng-Ning Liu , Peng-Tao Jiang , Tai-Jiang Mu , Song-Hai Zhang , Ralph R. Martin , Ming-Ming Cheng , Shi-Min Hu

Category: Computer Vision

2021-11-15

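Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks, and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them into channel attention, spatial attention, temporal attention, and branch attention. A related repository, https://github.com/menghaoguo/awesome-vision-tions, is dedicated to collecting related work. We also suggest future directions for attention mechanism research.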


Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Wenhai Wang , Enze Xie , Xiang Li , Deng-Ping Fan , Kaitao Song , Ding Liang , Tong Lu , Ping Luo , Ling Shao

Category:

2021-02-24

ous vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection and instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinaNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope that PVT can serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.


Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Jiashi Li , Xin Xia , Wei Li , Huixia Li , Xing Wang , Xuefeng Xiao , Rui Wang , Min Zheng , Xin Pan

Category: Computer Vision

2022-07-12

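Due to their complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g., TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed that infers as fast as CNNs while still performing strongly? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, but their overall performance is far from satisfactory. To this end, we propose a next-generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs in terms of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and the Next Transformer Block (NTB) are developed to capture local and global information, respectively, with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs, and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, under similar latency, Next-ViT gains 5.4 mAP (from 40.4 to 45.8) on COCO detection and 8.2% mIoU (from 38.8% to 47.0%) on ADE20K segmentation. Meanwhile, it achieves performance comparable to CSWin while accelerating inference by 3.6×. On CoreML, under similar latency, Next-ViT gains 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.2% to 48.7%) on ADE20K segmentation. Code will be released soon.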


Attention in Attention Network for Image Super-Resolution

Haoyu Chen , Jinjin Gu , Zhi Zhang

Category: Computer Vision

2021-04-19

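Convolutional neural networks have enabled remarkable advances in single image super-resolution (SISR) over the last decade. Among recent advances in SISR, attention mechanisms are crucial for high-performance SR models. However, it remains unclear why and how the attention mechanism works in SISR. In this work, we attempt to quantify and visualize attention mechanisms in SISR and show that not all attention modules are equally beneficial. We then propose the attention in attention network (A$^2$N) for more efficient and accurate SISR. Specifically, A$^2$N consists of a non-attention branch and a coupling attention branch. A dynamic attention module is proposed to generate weights for these two branches so as to dynamically suppress unwanted attention adjustments, where the weights change adaptively according to the input features. This allows attention modules to specialize in beneficial examples without otherwise incurring penalties, and thus greatly improves the capacity of the attention network with little parameter overhead. Experimental results demonstrate that our final model, A$^2$N, achieves a superior performance trade-off compared with state-of-the-art networks of similar size. Code is available at https://github.com/haoyuc/a2n.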


Dense Prediction with Attentive Feature Aggregation

Yung-Hsu Yang , Thomas E. Huang , Samuel Rota Bulò , Peter Kontschieder , Fisher Yu

Category: Computer Vision

2021-11-01

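Aggregating information from features across different layers is an essential operation for dense prediction models. Despite its limited expressiveness, feature concatenation dominates the choice of aggregation operations. In this paper, we introduce Attentive Feature Aggregation (AFA) to fuse different network layers with more expressive non-linear operations. AFA exploits both spatial and channel attention to compute a weighted average of the layer activations. Inspired by neural volume rendering, we extend AFA with Scale-Space Rendering (SSR) to perform late fusion of multi-scale predictions. AFA is applicable to a wide range of existing network designs. Our experiments show consistent and significant improvements on challenging semantic segmentation benchmarks, including Cityscapes, BDD100K, and Mapillary Vistas, at negligible computational and parameter overhead. In particular, AFA improves the performance of the Deep Layer Aggregation (DLA) model by nearly 6% mIoU on Cityscapes. Our experimental analyses show that AFA learns to progressively refine segmentation maps and improve boundary details, leading to new state-of-the-art results on the BSDS500 and NYUDv2 boundary detection benchmarks. Code and video resources are available at http://vis.xyz/pub/dla-afa.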


Focal Modulation Networks

Jianwei Yang , Chunyuan Li , Xiyang Dai , Lu Yuan , Jianfeng Gao

Category: Computer Vision | Artificial Intelligence | Machine Learning

2022-03-22

We propose focal modulation networks (FocalNets for short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show that FocalNets outperform their state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) at similar computational cost on image classification, object detection, and segmentation. Specifically, FocalNets of tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretraining on ImageNet-22K at 224 resolution, they attain 86.5% and 87.3% top-1 accuracy when finetuned at resolution 224 and 384, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with a 1× schedule outperforms the Swin counterpart by 2.1 points and already surpasses Swin trained with a 3× schedule (49.0 vs. 48.5). For semantic segmentation with UPerNet, FocalNet base at single-scale outperforms Swin by 2.4, and beats Swin at multi-scale (50.5 vs. 49.7). Using large FocalNet and Mask2former, we achieve 58.5 mIoU for ADE20K semantic segmentation and 57.9 PQ for COCO Panoptic Segmentation. Using huge FocalNet and DINO, we achieve 64.3 and 64.4 mAP on COCO minival and test-dev, respectively, establishing a new SoTA on top of much larger attention-based models like Swinv2-G and BEIT-3. Code and checkpoints are available at https://github.com/microsoft/FocalNet.
