Smart Paper Notes

Efficient Joint Detection and Multiple Object Tracking with Spatially Aware Transformer

Siddharth Sagar Nijhawan , Leo Hoshikawa , Atsushi Irie , Masakazu Yoshimura , Junji Otsuka , Takeshi Ohashi

Category: Computer Vision

2022-11-09

We propose a light-weight and highly efficient Joint Detection and Tracking pipeline for the task of Multi-Object Tracking using a fully-transformer architecture. It is a modified version of TransTrack that overcomes the computational bottleneck associated with its design while achieving a state-of-the-art MOTA score of 73.20%. The model design is driven by a transformer-based backbone instead of a CNN, which is highly scalable with the input resolution. We also propose a drop-in replacement for the Feed Forward Network of the transformer encoder layer, using the Butterfly Transform operation to perform channel fusion and depth-wise convolution to learn spatial context within the feature maps, which is otherwise missing from the attention maps of the transformer. As a result of our modifications, we reduce the overall model size of TransTrack by 58.73% and the complexity by 78.72%. Therefore, we expect our design to provide novel perspectives for architecture optimization in future research related to multi-object tracking.
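
The abstract above relies on depth-wise convolution to recover spatial context inside the feed-forward block. The following is a minimal, dependency-free sketch of what a depth-wise 3x3 convolution computes, not the authors' implementation. The function names, the zero padding, the stride of 1, and the list-of-lists tensor layout are all assumptions made for illustration, and the Butterfly Transform used for channel fusion is not shown.

```python
def depthwise_conv2d(feature_map, kernels):
    """Depth-wise 3x3 convolution (cross-correlation form), zero padding, stride 1.

    feature_map: list of C channels, each an H x W list of floats.
    kernels: list of C kernels, each a 3 x 3 list (one kernel per channel).
    Each channel is filtered independently; channels are never mixed.
    """
    out = []
    for channel, kernel in zip(feature_map, kernels):
        h, w = len(channel), len(channel[0])
        result = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        y, x = i + di, j + dj
                        # Positions outside the map count as zero (zero padding).
                        if 0 <= y < h and 0 <= x < w:
                            acc += channel[y][x] * kernel[di + 1][dj + 1]
                result[i][j] = acc
        out.append(result)
    return out


def standard_conv_params(c_in, c_out, k):
    """Weight count of a dense k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_conv_params(channels, k):
    """Weight count of a depth-wise k x k convolution (bias ignored)."""
    return channels * k * k
```

Because each channel has its own small kernel, the weight count grows linearly with the number of channels rather than quadratically, which is one reason depth-wise layers are a common choice in light-weight designs.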

Translated by Google Translate

Transformers in Vision: A Survey

Salman Khan , Muzammal Naseer , Munawar Hayat , Syed Waqas Zamir , Fahad Shahbaz Khan , Mubarak Shah

Category:

2021-01-04

Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. 
We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
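
Self-attention is named above as one of the fundamental concepts behind Transformers. As a rough illustration, here is a dependency-free sketch of scaled dot-product attention. It assumes the queries, keys, and values are already given as lists of vectors; in a real Transformer they come from learned linear projections of the same input sequence, which are omitted here. The function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors.

    queries: n_q vectors of length d_k
    keys:    n_k vectors of length d_k
    values:  n_k vectors of length d_v
    Returns n_q output vectors of length d_v.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(d_v)])
    return outputs
```

Every query attends to every key in a single step, which is how the mechanism models long-range dependencies without recurrence; the cost is quadratic in sequence length.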

TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers

Qianyu Zhou , Xiangtai Li , Lu He , Yibo Yang , Guangliang Cheng , Yunhai Tong , Lizhuang Ma , Dacheng Tao

Category: Computer Vision

2022-01-13

Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the pipeline of VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow model, relation networks. Besides, benefited from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal transformer consists of two components: Temporal Query Encoder (TQE) to fuse object queries, and Temporal Deformable Transformer Decoder (TDTD) to obtain current frame detection results. These designs boost the strong baseline deformable DETR by a significant margin (2 %-4 % mAP) on the ImageNet VID dataset. TransVOD yields comparable performances on the benchmark of ImageNet VID. Then, we present two improved versions of TransVOD including TransVOD++ and TransVOD Lite. The former fuses object-level information into object query via dynamic convolution while the latter models the entire video clips as the output to speed up the inference time. We give detailed analysis of all three models in the experiment part. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0 % mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7 % mAP while running at around 30 FPS on a single V100 GPU device. Code and models will be available for further research.

ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Hwanjun Song , Deqing Sun , Sanghyuk Chun , Varun Jampani , Dongyoon Han , Byeongho Heo , Wonjae Kim , Ming-Hsuan Yang

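Category: Computer Vision | Machine Learning

2021-10-08

Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer into a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques to boost detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing transformer-based object detectors, and achieves 49.2 AP owing to its high scalability for large models. We will release the code and trained models at https://github.com/naver-ai/vidt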

PatchTrack: Multiple Object Tracking Using Frame Patches

Xiaotong Chen , Seyed Mehdi Iranmanesh , Kuo-Chin Lien

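Category: Computer Vision

2022-01-01

Object motion and object appearance are commonly used information in multiple object tracking (MOT) applications, either for associating detections across frames in tracking-by-detection methods or for direct track predictions in joint-detection-and-tracking methods. However, not only are these two types of information often considered separately, but they also do not help make direct use of the visual information in the current frame of interest. In this paper, we present PatchTrack, a Transformer-based joint-detection-and-tracking system that predicts tracks using patches of the current frame of interest. We use the Kalman filter to predict the locations of existing tracks in the current frame from the previous frame. Patches cropped from the predicted bounding boxes are sent to the Transformer decoder to infer new tracks. By utilizing both the object motion and the object appearance information encoded in patches, the proposed method pays more attention to where new tracks are more likely to occur. We show the effectiveness of PatchTrack on recent MOT benchmarks, including MOT16 (MOTA 73.71%, IDF1 65.77%) and MOT17 (MOTA 73.59%, IDF1 65.23%). The results are published at https://motchallenge.net/method/mot=4725&chl=10.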
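
The abstract above uses a Kalman filter to predict where existing tracks will be in the current frame. As a hedged illustration of that one step, here is a dependency-free sketch of the Kalman prediction for a constant-velocity model. The state layout `[cx, cy, vx, vy]`, the isotropic process noise, and all function names are assumptions for this example; the abstract does not specify PatchTrack's actual state vector or noise settings.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def kalman_predict(state, cov, dt=1.0, process_noise=1e-2):
    """Kalman prediction step for a constant-velocity motion model.

    state: [cx, cy, vx, vy], box centre and its velocity (assumed layout).
    cov:   4 x 4 state covariance.
    Returns (x', P') with x' = F x and P' = F P F^T + Q, where Q is
    process_noise times the identity (a simplifying assumption).
    """
    f = [[1.0, 0.0, dt, 0.0],
         [0.0, 1.0, 0.0, dt],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
    new_state = [sum(f[i][k] * state[k] for k in range(4)) for i in range(4)]
    fpft = matmul(matmul(f, cov), transpose(f))
    new_cov = [[fpft[i][j] + (process_noise if i == j else 0.0)
                for j in range(4)] for i in range(4)]
    return new_state, new_cov
```

In a tracker along these lines, the predicted centre would be combined with the last known box size to crop a patch from the current frame; the update step that corrects the state with a new measurement is omitted here.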

End-to-End Video Instance Segmentation with Transformers

Yuqing Wang , Zhaoliang Xu , Xinlong Wang , Chunhua Shen , Baoshan Cheng , Hao Shen , Huaxia Xia

Category:

2020-11-30

speed among all existing VIS models, and achieves the best result among methods using single model on the YouTube-VIS dataset. For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy. We hope that VisTR can motivate future research for more video understanding tasks.

Unified Transformer Tracker for Object Tracking

Fan Ma , Mike Zheng Shou , Linchao Zhu , Haoqi Fan , Yilei Xu , Yi Yang , Zhicheng Yan

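Category: Computer Vision

2022-03-29

As an important area in computer vision, object tracking has formed two separate communities that respectively study Single Object Tracking (SOT) and Multiple Object Tracking (MOT). However, current methods in one tracking scenario are not easily adapted to the other because the two tasks differ in their training datasets and tracking objects. Although UniTrack \cite{wang2021different} demonstrates that a shared appearance model with multiple heads can be used to tackle individual tracking tasks, it fails to exploit large-scale tracking datasets for training and performs poorly on single object tracking. In this work, we present the Unified Transformer Tracker (UTT) to address tracking problems in different scenarios with one paradigm. A track transformer is developed in our UTT to track the target in both SOT and MOT. The correlation between the target and the tracking frame features is exploited to localize the target. We demonstrate that both SOT and MOT tasks can be solved within this framework. The model can be trained end-to-end on both tasks simultaneously by alternately optimizing the SOT and MOT objectives on the datasets of the individual tasks. Extensive experiments are conducted on several benchmarks with a unified model trained on SOT and MOT datasets. Code will be available at https://github.com/flowerfan/trackron.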

3D Vision with Transformers: A Survey

Jean Lahoud , Jiale Cao , Fahad Shahbaz Khan , Hisham Cholakkal , Rao Muhammad Anwer , Salman Khan , Ming-Hsuan Yang

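Category: Computer Vision

2022-08-08

The success of the transformer architecture in natural language processing has recently attracted attention in the computer vision field. Owing to its ability to learn long-range dependencies, the transformer has been used as a replacement for the widely used convolution operators. This replacement has proven successful in numerous tasks, in which several state-of-the-art methods rely on transformers for better learning. In computer vision, the 3D field has also witnessed an increase in the use of transformers to augment 3D convolutional neural networks and multi-layer perceptron networks. Although a number of surveys have focused on transformers in vision, 3D vision requires special attention because of the differences in data representation and processing compared with 2D vision. In this work, we present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks, including classification, segmentation, detection, completion, pose estimation, and others. We discuss transformer design in 3D vision, which allows it to process data with various 3D representations. For each application, we highlight the key properties and contributions of the transformer-based methods. To assess the competitiveness of these methods, we compare their performance with common non-transformer methods on 12 3D benchmarks. We conclude the survey by discussing different open directions and challenges for transformers in 3D vision. In addition to the presented papers, we aim to frequently update the latest relevant papers along with their corresponding implementations at: https://github.com/lahoud/3d-vision-transformers.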

Efficient Decoder-free Object Detection with Transformers

Peixian Chen , Mengdan Zhang , Yunhang Shen , Kekai Sheng , Yuting Gao , Xing Sun , Ke Li , Chunhua Shen

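Category: Computer Vision

2022-06-14

Vision transformers (ViTs) are changing the landscape of object detection approaches. A natural usage of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is straightforward and effective, at the price of a considerable computational burden for inference. A more subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder that demands an extra-long time to converge. As a result, transformer-based object detection cannot prevail in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector, which for the first time achieves high efficiency in both the training and inference stages. We simplify object detection into an encoder-only, single-level, anchor-based dense prediction problem by centering on two entry points: 1) eliminate the training-inefficient decoder and leverage two strong encoders to preserve the accuracy of single-level feature map prediction; 2) explore low-level semantic features for the detection task with limited computational resources. In particular, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics, based on a well-conceived ablation study. Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL outperforms DETR by 2.5% AP with a 28% reduction in computation cost and 10× fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT_SMALL obtains over 5.5% AP gain while cutting computation cost by 70%.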

TransFiner: A Full-Scale Refinement Approach for Multiple Object Tracking

Bin Sun , Jiale Cao

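Category: Computer Vision

2022-07-26

Multiple object tracking (MOT) is a task comprising detection and association. Plenty of trackers have achieved competitive performance. Unfortunately, because information is not exchanged between these subtasks, trackers are often biased toward one of the two and underperform in complex scenarios, such as the expected false negatives and the mistaken trajectories of targets when they pass each other. In this paper, we propose TransFiner, a transformer-based post-refinement approach for MOT. It is a generic attachment framework that takes the images and the tracking results (locations and class predictions) from the original tracker as inputs, which are then used to initialize TransFiner effectively. Moreover, TransFiner depends on query pairs, which produce pairs of detection and motion through the fusion decoder and achieve comprehensive tracking improvement. We also provide targeted refinement by labeling query pairs according to different refinement levels. Experiments show that our design is effective: on the MOT17 benchmark, we lift CenterTrack from 67.8% MOTA and 64.7% IDF1 to 71.5% MOTA and 66.8% IDF1.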
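
Several abstracts in this list report MOTA and IDF1. For readers unfamiliar with the two scores, the sketch below computes them from already-accumulated counts, following the standard CLEAR MOT and identity-metric definitions. The function and parameter names are illustrative, and the matching of predictions to ground truth that produces these counts is assumed to have been done elsewhere.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy (CLEAR MOT).

    MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over frames.
    The score can be negative when errors outnumber ground-truth objects.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(id_true_positives, id_false_positives, id_false_negatives):
    """Identity F1 score: 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denom = 2 * id_true_positives + id_false_positives + id_false_negatives
    if denom == 0:
        raise ValueError("no detections and no ground truth")
    return 2 * id_true_positives / denom
```

MOTA is dominated by detection errors (misses and false alarms), while IDF1 rewards keeping the same identity on the same object over time, which is why trackers are usually reported on both.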

SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation

Junfeng Wu , Yi Jiang , Wenqing Zhang , Xiang Bai , Song Bai

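Category: Computer Vision

2021-12-15

In this work, we present SeqFormer, a frustratingly simple model for video instance segmentation. SeqFormer follows the principle of the vision transformer, which models instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices to capture a time sequence of instances in a video, but that the attention mechanism should be applied to each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to dynamically predict the mask sequence on each frame. Instance tracking is achieved naturally, without tracking branches or post-processing. On the YouTube-VIS dataset, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone, without bells and whistles. These results exceed the previous state-of-the-art performance by 4.6 and 4.4, respectively. In addition, when integrated with the recently proposed Swin Transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer can be a strong baseline that fosters future research in video instance segmentation and, in the meantime, advances this field with a more robust, accurate, and neat model. The code and pre-trained models are publicly available at https://github.com/wjf5203/seqformer.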

SwinTrack: A Simple and Strong Baseline for Transformer Tracking

Liting Lin , Heng Fan , Yong Xu , Haibin Ling

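Category: Computer Vision

2021-12-02

Transformers have recently demonstrated clear potential for improving visual tracking algorithms. Nevertheless, existing transformer-based trackers mostly use the Transformer to fuse and enhance features generated by convolutional neural networks (CNNs). By contrast, in this paper, we propose a fully attention-based Transformer tracking algorithm, the Swin-Transformer Tracker (SwinTrack). SwinTrack uses the Transformer for both feature extraction and feature fusion, allowing full interaction between the target object and the search region for tracking. To further improve performance, we comprehensively investigate different strategies for feature fusion, position encoding, and training loss. All these efforts make SwinTrack a simple yet solid baseline. In our thorough experiments, SwinTrack sets a new record on LaSOT, surpassing prior work by 4.6% while still running at 45 FPS. In addition, it achieves state-of-the-art performance with 0.483 SUC, 0.832 SUC, and 0.694 AO on the other challenging benchmarks LaSOT_ext, TrackingNet, and GOT-10k. Our implementation and trained models are available at https://github.com/litinglin/swintrack.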

A Survey of Visual Transformers

Yang Liu , Yao Zhang , Yixin Wang , Feng Hou , Jin Yuan , Jiang Tian , Yang Zhang , Zhongchao Shi , Jianping Fan , Zhiqiang He

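Category: Computer Vision

2021-11-11

The Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently adapted Transformer-like architectures to the computer vision (CV) field and have demonstrated their effectiveness on various CV tasks. Relying on their competitive modeling capability, visual Transformers have achieved impressive performance compared with modern convolutional neural networks (CNNs). In this paper, we provide a comprehensive review of three hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), and propose a taxonomy that organizes these methods according to their motivations, structures, and usage scenarios. Because of the differences in training settings and target tasks, we also evaluate these methods on different configurations, for easy and intuitive comparison rather than only on various benchmarks. Furthermore, we reveal a series of essential but underexplored aspects that may enable the Transformer to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investment.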

Tracking Objects as Pixel-wise Distributions

Zelin Zhao , Ze Wu , Yueqing Zhuang , Boxun Li , Jiaya Jia

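Category: Computer Vision | Artificial Intelligence | Machine Learning

2022-07-12

Multi-object tracking (MOT) requires detecting and associating objects across frames. Unlike tracking via detected bounding boxes or tracking objects as points, we propose tracking objects as pixel-wise distributions. We instantiate this idea in a transformer-based architecture, P3AFormer, with pixel-wise propagation, prediction, and association. P3AFormer propagates pixel-wise features, guided by flow information, to pass messages between frames. Furthermore, P3AFormer adopts a meta-architecture to produce multi-scale object feature maps. During inference, a pixel-wise association procedure is proposed to recover object connections across frames based on the pixel-wise predictions. P3AFormer yields 81.2% MOTA on the MOT17 benchmark, making it the first among all transformer networks in the literature to reach 80% MOTA. P3AFormer also outperforms the state of the art on the MOT20 and KITTI benchmarks.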

IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation

Lihua Fu , Haoyue Tian , Xiangping Bryce Zhai , Pan Gao , Xiaojiang Peng

Category: Computer Vision

2022-12-06

Semantic segmentation usually benefits from global contexts, fine localisation information, multi-scale features, etc. To advance Transformer-based segmenters with these aspects, we present a simple yet powerful semantic segmentation architecture, termed IncepFormer. IncepFormer makes two critical contributions. First, it introduces a novel pyramid-structured Transformer encoder which harvests global context and fine localisation features simultaneously. These features are concatenated and fed into a convolution layer for final per-pixel prediction. Second, IncepFormer integrates an Inception-like architecture with depth-wise convolutions, and a light-weight feed-forward module in each self-attention layer, efficiently obtaining rich local multi-scale object features. Extensive experiments on five benchmarks show that our IncepFormer is superior to state-of-the-art methods in both accuracy and speed, e.g., 1) our IncepFormer-S achieves 47.7% mIoU on ADE20K, outperforming the existing best method by 1% with only half the parameters and fewer FLOPs; 2) our IncepFormer-B achieves 82.0% mIoU on the Cityscapes dataset with 39.6M parameters. Code is available at github.com/shendu0321/IncepFormer.

MOTR: End-to-End Multiple-Object Tracking with Transformer

Fangao Zeng , Bin Dong , Yuang Zhang , Tiancai Wang , Xiangyu Zhang , Yichen Wei

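Category: Computer Vision

2021-05-07

Temporal modeling of objects is a key challenge in multiple object tracking (MOT). Existing methods track by associating detections through motion-based and appearance-based similarity heuristics. The post-processing nature of association prevents end-to-end exploitation of temporal variations in video sequences. In this paper, we propose MOTR, which extends DETR and introduces track queries to model the tracked instances throughout the entire video. Track queries are transferred and updated frame by frame to perform iterative prediction over time. We propose tracklet-aware label assignment to train track queries and newborn object queries. We further propose a temporal aggregation network and a collective average loss to enhance temporal relation modeling. Experimental results on DanceTrack show that MOTR significantly outperforms the state-of-the-art method ByteTrack, by 6.5% on the HOTA metric. On MOT17, MOTR outperforms our concurrent works, TrackFormer and TransTrack, on association performance. MOTR can serve as a stronger baseline for future research on temporal modeling and Transformer-based trackers. Code is available at https://github.com/megvii-research/motr.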

DeVIS: Making Deformable Transformers Work for Video Instance Segmentation

Adrià Caelles , Tim Meinhardt , Guillem Brasó , Laura Leal-Taixé

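Category: Computer Vision | Machine Learning | Robotics

2022-07-22

Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences. In the past, VIS methods mirrored the fragmentation of these subtasks in their architectural design and hence missed out on a joint solution. Transformers have recently made it possible to cast the entire VIS task as a single set-prediction problem. Nevertheless, the quadratic complexity of existing Transformer-based methods leads to long training times, high memory requirements, and the processing of low-resolution, single-scale feature maps. Deformable attention provides a more efficient alternative, but its application to the temporal domain or to the segmentation task has not yet been explored. In this work, we present Deformable VIS (DeVIS), a VIS method that capitalizes on the efficiency and performance of deformable Transformers. To reason about all VIS subtasks jointly over multiple frames, we present temporal multi-scale deformable attention with instance-aware object queries. We further introduce a new image and video instance mask head with multi-scale features, and perform near-online video processing with multi-cue clip tracking. DeVIS reduces memory and training time requirements, and achieves state-of-the-art results on YouTube-VIS 2021 as well as the challenging OVIS dataset. Code is available at https://github.com/acaelles97/devis.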

Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

Fengyuan Shi , Ruopeng Gao , Weilin Huang , Limin Wang

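Category: Computer Vision

2022-09-28

Multimodal transformers exhibit high capacity and flexibility in aligning image and text for visual grounding. However, the existing encoder-only grounding framework (e.g., TransVG) suffers from heavy computation because of the quadratic time complexity of the self-attention operation. To address this issue, we present a new multimodal transformer architecture, coined Dynamic MDETR, which decouples the whole grounding process into encoding and decoding phases. The key observation is that there is high spatial redundancy in images. We therefore devise a new dynamic multimodal transformer decoder that exploits this sparsity prior to speed up the visual grounding process. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module selects informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross-attention between image features and text features. These two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, eventually realizing the objective of visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% of the feature points in the decoder, we can reduce the GFLOPs of the multimodal transformer by about 44% while still achieving higher accuracy than the encoder-only counterpart. In addition, to verify its generalization ability and scale up our Dynamic MDETR, we build the first one-stage CLIP-empowered visual grounding framework and achieve state-of-the-art performance on these benchmarks.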

QAHOI: Query-Based Anchors for Human-Object Interaction Detection

Junwen Chen , Keiji Yanai

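Category: Computer Vision

2021-12-16

Human-object interaction (HOI) detection, as a downstream task of object detection, requires localizing pairs of humans and objects and extracting the semantic relationships between them from an image. Recently, one-stage approaches have become a new trend for this task because of their high efficiency. However, these approaches focus on detecting possible interaction points or filtering human-object pairs, ignoring the variability in the location and size of different objects across spatial scales. To address this problem, we propose a transformer-based method, QAHOI (Query-Based Anchors for Human-Object Interaction detection), which leverages a multi-scale architecture to extract features from different spatial scales and uses query-based anchors to predict all the elements of an HOI instance. We further find that a powerful backbone significantly increases the accuracy of QAHOI, and that QAHOI with a transformer-based backbone outperforms recent state-of-the-art methods on the HICO-DET benchmark. The source code is available at https://github.com/cjw2021/qhoii.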

CvT: Introducing Convolutions to Vision Transformers

Haiping Wu , Bin Xiao , Noel Codella , Mengchen Liu , Xiyang Dai , Lu Yuan , Lei Zhang

Category:

2021-03-29

We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pretrained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.
