Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

Yasheng Sun , Hang Zhou , Kaisiyuan Wang , Qianyi Wu , Zhibin Hong , Jingtuo Liu , Errui Ding , Jingdong Wang , Ziwei Liu , Hideki Koike

Categories: Computer Vision | Artificial Intelligence

2022-12-09

Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. However, most of them deform or generate the whole facial area, leading to non-realistic results. In this work, we delve into the formulation of altering only the mouth shapes of the target person. This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames. To this end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality by predicting the masked mouth shapes. Our key insight is to exploit desired contextual information provided in audio and visual modalities thoroughly with delicately designed Transformers. Specifically, we propose a convolution-Transformer hybrid backbone and design an attention-based fusion strategy for filling the masked parts. It uniformly attends to the textural information on the unmasked regions and the reference frame. Then the semantic audio information is involved in enhancing the self-attention computation. Additionally, a refinement network with audio injection improves both image and lip-sync quality. Extensive experiments validate that our model can generate high-fidelity lip-synced results for arbitrary subjects.

EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model

Xinya Ji , Hang Zhou , Kaisiyuan Wang , Qianyi Wu , Wayne Wu , Feng Xu , Xun Cao

Categories: Computer Vision

2022-05-30

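Although significant progress has been made in audio-driven talking face generation, existing methods either neglect facial emotion or cannot be applied to arbitrary subjects. In this paper, we propose the Emotion-Aware Motion Model (EAMM) to generate one-shot emotional talking faces by involving an emotion source video. Specifically, we first propose an Audio2Facial-Dynamics module, which renders talking faces from audio-driven unsupervised zero- and first-order keypoint motion. Then, by exploring the properties of the motion model, we further propose an Implicit Emotion Displacement Learner that represents emotion-related facial dynamics as linearly additive displacements to the previously acquired motion representations. Comprehensive experiments demonstrate that, by incorporating the results of both modules, our method can generate satisfactory talking face results on arbitrary subjects with realistic emotion patterns.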

StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

Jun Ling , Xu Tan , Liyang Chen , Runnan Li , Yuchao Zhang , Sheng Zhao , Li Song

Categories: Computer Vision

2022-08-29

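While previous speech-driven talking face generation methods have made significant progress in improving the visual quality and lip-sync quality of the synthesized videos, they pay less attention to lip motion jitters, which greatly undermine the realism of talking face videos. What causes motion jitters, and how can the problem be mitigated? In this paper, we conduct a systematic analysis of the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and the output video, and we improve motion stability with a series of effective designs. We find that several issues can lead to jitters in synthesized talking face videos: 1) jitters in the input 3D face representations; 2) training-inference mismatch; 3) lack of dependency modeling among video frames. Accordingly, we propose three effective solutions: 1) a Gaussian-based adaptive smoothing module that smooths the 3D face representations to eliminate jitters in the input; 2) augmented erosions added to the input data of the neural renderer during training, which simulate the distortion seen at inference and reduce the mismatch; 3) an audio-fused Transformer generator that models the dependency among video frames. In addition, since there is no off-the-shelf metric for measuring motion jitters in talking face videos, we devise an objective metric (Motion Stability Index, MSI) that quantifies motion jitters by calculating the reciprocal of the variance of acceleration. Extensive experimental results show the superiority of our method in motion-stable talking face video generation, with better quality than previous systems.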
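
The stability metric and the smoothing step described in this abstract can be illustrated with a minimal sketch. It assumes a single landmark coordinate tracked per frame, approximates acceleration by a second-order finite difference, and uses a fixed-width Gaussian kernel rather than the adaptive module the abstract mentions. The function names, the `eps` term, and the toy tracks are illustrative assumptions, not the authors' implementation.

```python
import math
from statistics import pvariance


def acceleration(track):
    """Second-order finite difference of a per-frame 1-D coordinate track."""
    return [track[i + 1] - 2 * track[i] + track[i - 1]
            for i in range(1, len(track) - 1)]


def motion_stability(track, eps=1e-8):
    """Reciprocal of the variance of acceleration (higher means steadier).

    eps is an assumption added here to avoid division by zero.
    """
    return 1.0 / (pvariance(acceleration(track)) + eps)


def gaussian_smooth(track, sigma=1.0, radius=2):
    """Fixed-width Gaussian smoothing with edge clamping (not adaptive)."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(weights)
    smoothed = []
    for i in range(len(track)):
        value = 0.0
        for k, w in zip(offsets, weights):
            j = min(max(i + k, 0), len(track) - 1)  # clamp at the borders
            value += w * track[j]
        smoothed.append(value / total)
    return smoothed


# Toy tracks: constant velocity, and the same motion with alternating jitter.
steady = [0.1 * t for t in range(20)]
jittery = [0.1 * t + (0.05 if t % 2 else -0.05) for t in range(20)]

print(motion_stability(steady) > motion_stability(jittery))
print(motion_stability(gaussian_smooth(jittery)) > motion_stability(jittery))
```

On these toy tracks the jittery sequence scores lower than the steady one, and smoothing raises its score. How the paper aggregates over landmarks or normalizes the value is not stated in the abstract and is left out here.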

One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning

Suzhen Wang , Lincheng Li , Yu Ding , Xin Yu

Categories: Computer Vision

2021-12-06

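Audio-driven one-shot talking face generation methods are usually trained on video resources of various persons. However, the videos they create often suffer from unnatural mouth shapes and asynchronous lips, because those methods struggle to learn a consistent speech style from different speakers. We observe that it is much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework that explores consistent correlations between audio and visual motions from a specific speaker and then transfers the audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that aims to infer talking motions, represented by keypoint-based dense motion fields, from an input audio. In particular, considering that the audio may come from different identities in deployment, we incorporate phonemes to represent the audio signals. In this manner, our AVCT can inherently generalize to audio spoken by other identities. Moreover, as face keypoints are used to represent speakers, AVCT is agnostic to the appearance of the training speaker and thus allows us to readily manipulate face images of different identities. Considering that different face shapes lead to different motions, a motion field transfer module is exploited to reduce the gap in audio-driven dense motion fields between the training identity and the one-shot reference. Once we obtain the dense motion field of the reference image, we employ an image renderer to generate its talking face video from an audio clip. Thanks to the learned consistent speaking style, our method generates authentic mouth shapes and vivid movements. Extensive experiments demonstrate that our synthesized videos outperform the state of the art in terms of visual quality and lip-sync.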

A Survey of Deep Face Restoration: Denoise, Super-Resolution, Deblur, Artifact Removal

Tao Wang , Kaihao Zhang , Xuanxi Chen , Wenhan Luo , Jiankang Deng , Tong Lu , Xiaochun Cao , Wei Liu , Hongdong Li , Stefanos Zafeiriou

Categories: Computer Vision

2022-11-05

Face Restoration (FR) aims to restore High-Quality (HQ) faces from Low-Quality (LQ) input images, which is a domain-specific image restoration problem in the low-level computer vision area. The early face restoration methods mainly use statistical priors and degradation models, which struggle to meet the requirements of real-world applications. In recent years, face restoration has witnessed great progress after stepping into the deep learning era. However, few works study deep learning-based face restoration methods systematically. Thus, this paper comprehensively surveys recent advances in deep learning techniques for face restoration. Specifically, we first summarize different problem formulations and analyze the characteristics of face images. Second, we discuss the challenges of face restoration. Concerning these challenges, we present a comprehensive review of existing FR methods, including prior-based methods and deep learning-based methods. Then, we explore the techniques developed for FR, covering network architectures, loss functions, and benchmark datasets. We also conduct a systematic benchmark evaluation of representative methods. Finally, we discuss future directions, including network designs, metrics, benchmark datasets, applications, etc. We also provide an open-source repository for all the discussed methods, which is available at https://github.com/TaoWangzj/Awesome-Face-Restoration.

DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering

Shunyu Yao , RuiZhe Zhong , Yichao Yan , Guangtao Zhai , Xiaokang Yang

Categories: Computer Vision

2022-01-03

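While recent advances in deep neural networks have made it possible to render high-quality images, generating photo-realistic and personalized talking heads remains challenging. Given an audio input, the key to tackling this task is synchronizing lip movement while simultaneously generating personalized attributes such as head movement and eye blinks. In this work, we observe that the input audio is highly correlated with lip motion but less correlated with other personalized attributes (e.g., head movements). Inspired by this, we propose a novel framework based on neural radiance fields to pursue high-fidelity and personalized talking head generation. Specifically, the neural radiance field takes lip movement features and personalized attributes as two disentangled conditions, where lip movements are predicted directly from the audio input to achieve lip-synchronized generation. Meanwhile, personalized attributes are sampled from a probabilistic model, for which we design a Transformer-based variational autoencoder sampled from a Gaussian process to learn plausible and natural-looking head poses and eye blinks. Experiments on several benchmarks demonstrate that our method achieves better results than state-of-the-art methods.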

StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

Dongchan Min , Minyoung Song , Sung Ju Hwang

Categories: Computer Vision | Machine Learning

2022-08-23

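We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflect the given audio. This is made possible by several newly devised components: 1) a contrastive lip-sync discriminator for accurate lip synchronization; 2) a conditional sequential variational autoencoder that learns a latent motion space disentangled from the lip movements, so that we can independently manipulate the motions and the lip movements while preserving the identity; 3) an auto-regressive prior augmented with normalizing flow to learn a complex audio-to-motion multi-modal latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way when another motion source video is given, but also in a completely audio-driven manner by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model is able to synthesize talking head videos with impressive perceptual quality that are accurately lip-synced with the input audio, largely outperforming state-of-the-art baselines.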

Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

Zhentao Yu , Zixin Yin , Deyu Zhou , Duomin Wang , Finn Wong , Baoyuan Wang

Categories: Computer Vision

2022-12-07

In this paper, we introduce a simple and novel framework for one-shot audio-driven talking head generation. Unlike prior works that require additional driving sources for controlled synthesis in a deterministic manner, we instead probabilistically sample all the holistic lip-irrelevant facial motions (i.e. pose, expression, blink, gaze, etc.) to semantically match the input audio while still maintaining both the photo-realism of audio-lip synchronization and the overall naturalness. This is achieved by our newly proposed audio-to-visual diffusion prior trained on top of the mapping between audio and disentangled non-lip facial representations. Thanks to the probabilistic nature of the diffusion prior, one big advantage of our framework is it can synthesize diverse facial motion sequences given the same audio clip, which is quite user-friendly for many real applications. Through comprehensive evaluations on public benchmarks, we conclude that (1) our diffusion prior outperforms auto-regressive prior significantly on almost all the concerned metrics; (2) our overall system is competitive with prior works in terms of audio-lip synchronization but can effectively sample rich and natural-looking lip-irrelevant facial motions while still semantically harmonized with the audio input.

Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Junwen Xiong , Yu Zhou , Peng Zhang , Lei Xie , Wei Huang , Yufei Zha

Categories: Artificial Intelligence

2022-03-04

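Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Based on their respective characteristics, independently designed architectures have been widely used for each individual task. This may lead to the representations learned by the model being task-specific, and it inevitably limits the generalization ability of features based on multi-modal modeling. More recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Therefore, motivated by bridging the multi-modal associations in audio-visual tasks, this study proposes a unified framework that achieves target speaker detection and speech enhancement through joint learning of audio-visual modeling.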

StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles

Yifeng Ma , Suzhen Wang , Zhipeng Hu , Changjie Fan , Tangjie Lv , Yu Ding , Zhidong Deng , Xin Yu

Categories: Computer Vision

2023-01-03

Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.

Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors

Sindhu B Hegde , Rudrabha Mukhopadhyay , Vinay P Namboodiri , C. V. Jawahar

Categories: Computer Vision

2022-08-17

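In this paper, we explore an interesting question: what can be obtained from an $8 \times 8$ pixel video sequence? Surprisingly, it turns out to be quite a lot. We show that when we process this $8 \times 8$ video with the right set of audio and image priors, we can obtain a full-length $256 \times 256$ video. We achieve this $32\times$ scaling of an extremely low-resolution input using our novel audio-visual upsampling network. The audio prior helps to recover the elemental facial details and precise lip shapes, while a single high-resolution target identity image prior provides rich appearance details. Our approach is an end-to-end multi-stage framework. The first stage produces a coarse intermediate output video, which can then be used to animate the single target identity image and generate realistic, accurate, and high-quality outputs. Our approach is simple and performs exceedingly well compared to previous super-resolution methods (an $8\times$ improvement in FID score). We also extend our model to talking-face video compression and show that we obtain a $3.5\times$ improvement over the previous state of the art. The results of our network are thoroughly analyzed through extensive ablation experiments (in the paper and supplementary material). We also provide the demo video along with code and models on our website: http://cvit.iiit.ac.in/research/project/projects/cvit-projects/talking-face-vace-video-upsmpling.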

Parallel and High-Fidelity Text-to-Lip Generation

Jinglin Liu , Zhiying Zhu , Yi Ren , Wencan Huang , Baoxing Huai , Nicholas Yuan , Zhou Zhao

Categories: Computer Vision

2021-07-14

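As a key component of talking face generation, lip movement generation determines the naturalness and coherence of the generated talking face video. Prior literature mainly focuses on speech-to-lip generation, while text-to-lip (T2L) generation remains under-explored. T2L is a challenging task, and existing end-to-end works depend on the attention mechanism and an autoregressive (AR) decoding manner. However, AR decoding generates the current lip frame conditioned on previously generated frames, which inherently hinders inference speed and also harms the quality of the generated lip frames due to error propagation. This encourages research into parallel T2L generation. In this work, we propose a parallel decoding model for fast and high-fidelity text-to-lip generation (ParaLip). Specifically, we predict the duration of the encoded linguistic features and model the target lip frames conditioned on the encoded linguistic features and their durations in a non-autoregressive manner. Furthermore, we incorporate a structural similarity index loss and adversarial learning to improve the perceptual quality of the generated lip frames and alleviate the blurry prediction problem. Extensive experiments conducted on the GRID and TCD-TIMIT datasets demonstrate the superiority of the proposed method. Video samples are available at https://paralip.github.io/.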
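
The duration-conditioned, non-autoregressive step described above can be pictured with a minimal sketch. It assumes the common length-regulation scheme from non-autoregressive sequence models, where each encoded token is repeated for as many frames as its predicted duration so that all frames can then be decoded in parallel. The abstract does not specify ParaLip's exact mechanism, and `expand_by_duration` and the toy inputs are hypothetical.

```python
def expand_by_duration(features, durations):
    """Repeat each token feature durations[i] times to reach frame length."""
    if len(features) != len(durations):
        raise ValueError("one duration per token feature is required")
    frames = []
    for feature, duration in zip(features, durations):
        frames.extend([feature] * duration)
    return frames


# Toy example: four encoded tokens (letters stand in for feature vectors)
# and their predicted durations in video frames.
tokens = ["h", "e", "l", "o"]
durations = [2, 3, 1, 2]
frame_conditions = expand_by_duration(tokens, durations)
print(frame_conditions)
```

Because the expanded sequence already has one entry per output frame, a decoder can produce every lip frame in a single pass instead of waiting on previously generated frames.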

Talking Head from Speech Audio using a Pre-trained Image Generator

Mohammed M. Alghamdi , He Wang , Andrew J. Bulpitt , David C. Hogg

Categories: Computer Vision

2022-09-09

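We propose a novel method for generating high-resolution talking-head videos from speech audio and a single "identity" image. Our method is based on a convolutional neural network model that incorporates a pre-trained StyleGAN generator. We model each frame as a point in the latent space of StyleGAN, so that a video corresponds to a trajectory through the latent space. The network is trained in two stages. The first stage models trajectories in the latent space conditioned on speech utterances. To do this, we use an existing encoder to invert the generator, mapping each video frame into the latent space. We train a recurrent neural network to map from speech utterances to displacements in the latent space of the image generator. These displacements are relative to the back-projection into the latent space of an identity image chosen from the individuals depicted in the training dataset. In the second stage, we improve the visual quality of the generated videos by tuning the image generator on a single image or a short video of any chosen identity. We evaluate our model on standard measures (PSNR, SSIM, FID, and LMD) and show that it significantly outperforms recent state-of-the-art methods on one of two commonly used datasets and gives comparable performance on the other. Finally, we report ablation experiments that validate the components of the model. The code and videos from the experiments can be found at https://mohammedalghamdi.github.io/talking-heads-acm-mm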
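
The idea of a video as a trajectory of displacements around an identity latent can be written down in a few lines. This is a minimal sketch under strong simplifications: the latent code is a short list of floats, the displacements are given rather than predicted by a recurrent network, and no generator is involved. `latent_trajectory` and the numbers below are hypothetical.

```python
def latent_trajectory(identity_latent, displacements):
    """Add each per-frame displacement to the identity latent code."""
    return [
        [base + delta for base, delta in zip(identity_latent, displacement)]
        for displacement in displacements
    ]


# Hypothetical 3-D latent for readability; real latent codes are far larger.
identity_latent = [0.5, -1.0, 2.0]
displacements = [
    [0.0, 0.0, 0.0],    # frame 0: the identity image itself
    [0.1, 0.0, -0.2],   # frame 1
    [0.2, 0.1, -0.4],   # frame 2
]
trajectory = latent_trajectory(identity_latent, displacements)
print(len(trajectory))
```

Each entry of `trajectory` would be passed to the image generator to render one frame; a zero displacement reproduces the identity image.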

Generalised Image Outpainting with U-Transformer

Penglei Gao , Xi Yang , Rui Zhang , Kaizhu Huang , John Y. Goulermas , Yujie Geng , Yuyao Yan

Categories: Computer Vision

2022-01-27

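While most current image outpainting methods perform horizontal extrapolation, we study the generalised image outpainting problem, which extrapolates visual context on all sides of a given image. To this end, we develop a novel Transformer-based generative adversarial network called U-Transformer, which is able to extend image borders with plausible structure and details even for complicated scenery images. Specifically, we design the generator as an encoder-to-decoder structure embedded with the popular Swin Transformer blocks. As such, our novel framework can better cope with long-range image dependencies, which are crucial for generalised image outpainting. We additionally propose a U-shaped structure and a multi-view Temporal Spatial Predictor network to reinforce image self-reconstruction as well as unknown-part prediction. We experimentally demonstrate that our proposed method produces visually appealing results for generalised image outpainting compared with state-of-the-art image outpainting approaches.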

MAT: Mask-Aware Transformer for Large Hole Image Inpainting

Wenbo Li , Zhe Lin , Kun Zhou , Lu Qi , Yi Wang , Jiaya Jia

Categories: Computer Vision

2022-03-29

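Recent studies have shown the importance of modeling long-range interactions in the inpainting problem. To achieve this goal, existing approaches exploit either standalone attention techniques or Transformers, but usually at low resolution because of the computational cost. In this paper, we present a novel Transformer-based model for large hole inpainting, which unifies the merits of Transformers and convolutions to efficiently process high-resolution images. We carefully design each component of our framework to guarantee the high fidelity and diversity of the recovered images. Specifically, we customize an inpainting-oriented Transformer block, where the attention module aggregates non-local information only from partial valid tokens, indicated by a dynamic mask. Extensive experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets. Code is released at https://github.com/fenglinglwb/mat.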
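
Attention that aggregates only from valid tokens can be sketched with plain scaled dot-product attention plus a validity mask. This is an illustration of the general masking technique, not the MAT block itself: there are no learned projections, no windowing, and the mask is static rather than updated dynamically. `masked_attention` and the toy tensors are hypothetical.

```python
import math


def masked_attention(queries, keys, values, valid):
    """Scaled dot-product attention that ignores keys flagged as invalid."""
    dim = len(keys[0])
    outputs = []
    for query in queries:
        # Invalid (masked-out) keys get a score of -inf, hence weight 0.
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if is_valid else -math.inf
            for key, is_valid in zip(keys, valid)
        ]
        peak = max(scores)
        if peak == -math.inf:
            raise ValueError("at least one key must be valid")
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs


# Toy example: the third token lies inside the hole and must not contribute.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
valid = [True, True, False]
result = masked_attention([[1.0, 0.0]], keys, values, valid)
print(result)
```

The large values stored at the masked token never reach the output, so the result is a convex combination of the two valid tokens only.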

Imitator: Personalized Speech-driven 3D Facial Animation

Balamurugan Thambiraja , Ikhsanul Habibie , Sadegh Aliakbarian , Darren Cosker , Christian Theobalt , Justus Thies

Categories: Computer Vision

2022-12-30

Speech-driven 3D facial animation has been widely explored, with applications in gaming, character animation, virtual reality, and telepresence systems. State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor, thus, resulting in unrealistic and inaccurate lip movements. To address this, we present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video and produces novel facial expressions matching the identity-specific speaking style and facial idiosyncrasies of the target actor. Specifically, we train a style-agnostic transformer on a large facial expression dataset which we use as a prior for audio-driven facial expressions. Based on this prior, we optimize for identity-specific speaking style based on a short reference video. To train the prior, we introduce a novel loss function based on detected bilabial consonants to ensure plausible lip closures and consequently improve the realism of the generated expressions. Through detailed experiments and a user study, we show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.

StyleSwap: Style-Based Generator Empowers Robust Face Swapping

Zhiliang Xu , Hang Zhou , Zhibin Hong , Ziwei Liu , Jiaming Liu , Zhizhi Guo , Junyu Han , Jingtuo Liu , Errui Ding , Jingdong Wang

Categories: Computer Vision

2022-09-27

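Given its wide applications, numerous attempts have been made at the task of person-agnostic face swapping. While existing methods mostly rely on tedious network and loss designs, they still struggle to balance information between the source and target faces and tend to produce visible artifacts. In this work, we introduce a concise and effective framework named StyleSwap. Our core idea is to leverage a style-based generator to empower high-fidelity and robust face swapping, so that the generator's advantages can be adopted for optimizing identity similarity. We find that, with only minimal modifications, a StyleGAN2 architecture can successfully handle the desired information from both the source and the target. Additionally, inspired by the ToRGB layers, a Swapping-Driven Mask Branch is devised to improve information blending. Furthermore, the advantages of StyleGAN inversion can be adopted. In particular, a Swapping-Guided ID Inversion strategy is proposed to optimize identity similarity. Extensive experiments validate that our framework generates high-quality face swapping results that outperform state-of-the-art methods both qualitatively and quantitatively.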

Multimodal Image Synthesis and Editing: A Survey

Fangneng Zhan , Yingchen Yu , Rongliang Wu , Jiahui Zhang , Shijian Lu

Categories: Computer Vision

2021-12-27

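As information exists in various modalities in the real world, effective interaction and fusion among multimodal information plays a key role in the creation and perception of multimodal data in computer vision and deep learning research. With its superb power in modeling the interaction among multimodal information, multimodal image synthesis and editing has become a hot research topic in recent years. Unlike traditional visual guidance, which provides explicit cues, multimodal guidance offers intuitive and flexible means for image synthesis and editing. On the other hand, this field also faces several challenges, such as the alignment of features with inherent modality gaps, the synthesis of high-resolution images, and faithful evaluation metrics. In this survey, we comprehensively contextualize recent advances in multimodal image synthesis and editing and formulate taxonomies according to data modalities and model architectures. We start with an introduction to the different types of guidance modalities in image synthesis and editing. We then describe multimodal image synthesis and editing approaches with detailed frameworks, including Generative Adversarial Networks (GANs), GAN inversion, Transformers, and other methods such as NeRF and diffusion models. This is followed by a comprehensive description of the benchmark datasets and corresponding evaluation metrics widely adopted in multimodal image synthesis and editing, as well as detailed comparisons of different synthesis methods with an analysis of their respective advantages and limitations. Finally, we provide insights into current research challenges and possible directions for future research. A project associated with this survey is available at https://github.com/fnzhan/mise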

Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement

Siddarth Ravichandran , Ondřej Texler , Dimitar Dinev , Hyun Jae Kang

Categories: Computer Vision

2022-09-03

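Over the last few decades, many aspects of the virtual domain have been enhanced, from the advent of digital assistants such as Amazon's Alexa and Apple's Siri to the latest metaverse efforts of the rebranded Meta. These trends underscore the importance of generating photorealistic visual depictions of humans. In recent years, this has led to the rapid growth of so-called deepfake and talking-head generation methods. Despite their impressive results and popularity, they usually lack certain qualitative aspects, such as texture quality, lip synchronization, or resolution, as well as practical aspects such as the ability to run in real time. To allow virtual human avatars to be used in practical scenarios, we propose an end-to-end framework for synthesizing high-quality virtual human faces capable of speech, with a special emphasis on performance. We introduce a novel network that utilizes visemes as an intermediate audio representation, together with a novel data augmentation strategy employing a hierarchical image synthesis approach that allows disentanglement of the different modalities used to control global head motion. Our method runs in real time and delivers superior results compared to the current state of the art.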

Transformers in Vision: A Survey

Salman Khan , Muzammal Naseer , Munawar Hayat , Syed Waqas Zamir , Fahad Shahbaz Khan , Mubarak Shah

Categories:

2021-01-04

Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
