Recently, massive architectures based on Convolutional Neural Networks (CNNs) and self-attention mechanisms have become the norm for audio classification. While these techniques are state-of-the-art, their effectiveness comes only at the price of huge computational cost and parameter counts, heavy data augmentation, transfer from large datasets, and other tricks. By exploiting the lightweight nature of audio, we propose an efficient network structure called the Paired Inverse Pyramid Structure (PIP) and a network built on it, the Paired Inverse Pyramid Structure MLP Network (PIPMN). With only 1 million parameters, PIPMN reaches 96% Environmental Sound Classification (ESC) accuracy on the UrbanSound8K dataset and 93.2% Music Genre Classification (MGC) accuracy on the GTZAN dataset. Both results are achieved without data augmentation or model transfer. Public code is available at: https://github.com/JNAIC/PIPMN
Self-attention has become an integral component of recent network architectures, e.g., the Transformers that dominate major image and video benchmarks, because self-attention can flexibly model long-range information. For the same reason, researchers have recently attempted to revive the multi-layer perceptron (MLP) and have proposed several MLP-like architectures that show great potential. However, current MLP-like architectures are not good at capturing local details and lack a progressive understanding of the core details in images and/or videos. To overcome this issue, we propose a novel MorphMLP architecture that focuses on capturing local details at the low-level layers while gradually shifting to long-range modelling at the high-level layers. Specifically, we design a fully-connected-like layer, dubbed MorphFC, with two morphable filters that gradually grow their receptive fields along the height and width dimensions. More interestingly, we propose to flexibly adapt the MorphFC layer to the video domain. To the best of our knowledge, we are the first to create an MLP-like backbone for learning video representations. Finally, we conduct extensive experiments on image classification, semantic segmentation and video classification. Our MorphMLP, a self-attention-free backbone, can be as powerful as self-attention-based models.
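The chunk-wise fully-connected idea lends itself to a short sketch. The following is one plausible reading of a height-direction mixing layer, not the authors' exact MorphFC: the chunk length `chunk_len`, the shared linear layer, and the (B, C, H, W) layout are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class HeightChunkFC(nn.Module):
    """Mixes features along the height dimension in fixed-length chunks.

    A simplified sketch of a "morphable" fully-connected layer: positions
    inside each chunk of length chunk_len are mixed by one shared linear
    layer, so the receptive field can grow layer by layer by enlarging it.
    """
    def __init__(self, channels: int, chunk_len: int):
        super().__init__()
        self.chunk_len = chunk_len
        self.fc = nn.Linear(chunk_len * channels, chunk_len * channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W), with H divisible by chunk_len for simplicity
        b, c, h, w = x.shape
        l = self.chunk_len
        x = x.permute(0, 3, 2, 1).reshape(b, w, h // l, l * c)  # chunk along H
        x = self.fc(x)                                          # mix inside each chunk
        x = x.reshape(b, w, h, c).permute(0, 3, 2, 1)
        return x

if __name__ == "__main__":
    layer = HeightChunkFC(channels=16, chunk_len=4)
    print(layer(torch.randn(2, 16, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```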
We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pretrained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.
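As a rough illustration of a convolutional projection feeding attention (not the exact CvT block: the depthwise kernels, the stride-2 key/value projection and the absence of the convolutional token-embedding stage are simplifying assumptions):

```python
import torch
import torch.nn as nn

class ConvProjectionAttention(nn.Module):
    """Self-attention whose Q/K/V are produced by depthwise convolutions.

    Sketch only: operates on a (B, C, H, W) feature map, flattens it to
    tokens after the convolutional projection, and applies standard
    multi-head attention. Strided K/V convs shrink the attended sequence.
    """
    def __init__(self, dim: int, heads: int = 4, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        self.q = nn.Conv2d(dim, dim, kernel, padding=pad, groups=dim)
        self.k = nn.Conv2d(dim, dim, kernel, stride=2, padding=pad, groups=dim)
        self.v = nn.Conv2d(dim, dim, kernel, stride=2, padding=pad, groups=dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    @staticmethod
    def _tokens(x: torch.Tensor) -> torch.Tensor:
        return x.flatten(2).transpose(1, 2)  # (B, C, H, W) -> (B, HW, C)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self._tokens(self.q(x))
        k = self._tokens(self.k(x))   # strided conv: fewer key/value tokens
        v = self._tokens(self.v(x))
        out, _ = self.attn(q, k, v)
        return out.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    block = ConvProjectionAttention(dim=64)
    print(block(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```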
Vision Transformers have been successfully applied to image recognition tasks thanks to their ability to capture long-range dependencies within an image. However, gaps in performance and computational cost remain between Transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that can outperform not only the canonical Transformers but also high-performance convolutional models. We propose a new Transformer-based hybrid network that exploits Transformers to capture long-range dependencies and CNNs to model local features. Furthermore, we scale it to obtain a family of models, called CMT, that achieves better accuracy and efficiency than previous convolution-based and Transformer-based models. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet while being 14x and 2x smaller in FLOPs than the existing DeiT and EfficientNet, respectively. The proposed CMT-S also generalizes well to CIFAR10 (99.2%), CIFAR100 (91.7%), Flowers (98.7%) and other challenging vision datasets such as COCO (44.3% mAP), with considerably lower computational cost.
Surface defect detection is an extremely crucial step for ensuring the quality of industrial products. Convolutional neural networks (CNNs) based on encoder-decoder architectures have achieved great success in various defect detection tasks. However, due to the intrinsic locality of convolution, they typically show limitations in explicitly modelling long-range interactions, which are critical for pixel-level defect detection in complex cases such as cluttered backgrounds and illegible pseudo-defects. Recent Transformers are particularly good at learning global image dependencies but retain limited local structural information, which is necessary for detailed defect localisation. To overcome these limitations, we propose an efficient hybrid Transformer architecture, termed Defect Transformer (DefT), for surface defect detection, which incorporates CNNs and Transformers into a unified model to collaboratively capture local and non-local relationships. Specifically, in the encoder module, a convolutional stem block is first adopted to retain more detailed spatial information. Patch aggregation blocks are then used to generate a multi-scale representation with four hierarchies, each followed by a series of DefT blocks, which respectively comprise a locally position-aware block for local position encoding, a lightweight multi-pooling self-attention to model multi-scale global contextual relationships with good computational efficiency, and a convolutional feed-forward network for feature transformation and further location-information learning. Finally, a simple but effective decoder module is proposed to gradually recover spatial details from the skip connections of the encoder. Extensive experiments on three datasets demonstrate the superiority and efficiency of our method compared with other CNN-based networks.
Transformers have seen an unprecedented rise in natural language processing and computer vision tasks. In audio tasks, however, they have been impractical, either because of the extremely long sequence length of raw audio waveforms or when trained on Fourier-based features. In this work, we introduce an architecture, Audiomer, in which we combine 1D residual networks with Performer attention to achieve state-of-the-art performance in keyword spotting with raw audio waveforms, outperforming all previous methods while being computationally cheaper and more parameter-efficient. In addition, our model has practical advantages for speech processing, such as inference on arbitrarily long audio clips owing to the absence of positional encoding. The code is available at https://github.com/the-learning-machines/audiomer
For the past decade, CNNs have reigned supreme in the world of computer vision, but recently Transformers have been on the rise. However, the quadratic computational cost of self-attention has become a serious problem in practical applications. In this context, there has been much research on architectures that use neither CNNs nor self-attention. In particular, MLP-Mixer is a simple architecture designed with MLPs that reaches an accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the token embedding. This leaves open the possibility of incorporating a non-convolutional (or non-local) inductive bias into the architecture, so we use two simple ideas to introduce inductive bias into the MLP-Mixer while retaining its ability to capture global correlations. One is to split the token-mixing block vertically and horizontally. The other is to make spatial correlations denser in some of the token-mixing channels. With this approach, we are able to improve the accuracy of the MLP-Mixer while reducing its parameters and computational complexity. The small model RaftMLP-S is comparable to state-of-the-art global MLP-based models in terms of parameters and computational efficiency. Furthermore, we address the fixed-input-resolution problem of global MLP-based models by utilising interpolation. We demonstrate that these models can be applied as backbones for downstream tasks such as object detection, but without significant performance, which points to the need for MLP-specific architectures for downstream tasks with global MLP-based models. The PyTorch source code is available at https://github.com/okojoalg/raft-mlp
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10× more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast.
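A minimal sketch of attention with pooled key/value tokens, in the spirit of the channel-resolution staging above; the average pooling, stride and layer layout are placeholders rather than the published MViT configuration:

```python
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    """Multi-head attention whose key/value tokens are spatially pooled.

    Sketch: tokens are reshaped to a 2D grid, average-pooled, and flattened
    back, so attention cost drops roughly with the square of the pool stride.
    """
    def __init__(self, dim: int, heads: int = 4, kv_stride: int = 2):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=kv_stride, stride=kv_stride)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, hw) -> torch.Tensor:
        # x: (B, N, C) token sequence laid out on an hw[0] x hw[1] grid
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, *hw)
        kv = self.pool(grid).flatten(2).transpose(1, 2)  # fewer key/value tokens
        out, _ = self.attn(x, kv, kv)
        return out

if __name__ == "__main__":
    attn = PoolingAttention(dim=96)
    tokens = torch.randn(2, 14 * 14, 96)
    print(attn(tokens, (14, 14)).shape)  # torch.Size([2, 196, 96])
```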
In this paper, we present Uformer, an effective and efficient Transformer-based architecture for image restoration, in which we build a hierarchical encoder-decoder network using Transformer blocks. Uformer has two core designs. First, we introduce a novel locally-enhanced window (LeWin) Transformer block, which performs window-based self-attention instead of global self-attention. It significantly reduces the computational complexity on high-resolution feature maps while capturing local context. Second, we propose a learnable multi-scale restoration modulator, in the form of a multi-scale spatial bias, to adjust features in multiple layers of the Uformer decoder. Our modulator demonstrates a superior ability to restore details across various image restoration tasks while introducing marginal extra parameters and computational cost. Powered by these two designs, Uformer enjoys a high capability for capturing both local and global dependencies for image restoration. To evaluate our approach, extensive experiments are conducted on several image restoration tasks, including image denoising, motion deblurring, defocus deblurring and deraining. Without bells and whistles, our Uformer achieves superior or comparable performance compared with state-of-the-art algorithms. The code and models are available at https://github.com/zhendongwang6/uformer
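Window-based self-attention of the kind the LeWin block builds on can be sketched as follows; the window size is arbitrary here, and the locally-enhanced feed-forward part and any window shifting are omitted:

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows of a feature map.

    Sketch: the (B, C, H, W) map is cut into win x win tiles, attention runs
    independently inside each tile, and the tiles are stitched back together.
    """
    def __init__(self, dim: int, heads: int = 4, win: int = 8):
        super().__init__()
        self.win = win
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = self.win
        # (B, C, H, W) -> (B * num_windows, s*s, C)
        t = x.reshape(b, c, h // s, s, w // s, s)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, s * s, c)
        t, _ = self.attn(t, t, t)
        # back to (B, C, H, W)
        t = t.reshape(b, h // s, w // s, s, s, c)
        return t.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

if __name__ == "__main__":
    wsa = WindowSelfAttention(dim=32, win=8)
    print(wsa(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```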
Convolutional neural networks (CNNs) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer- and multi-layer perceptron (MLP)-based models, such as the Vision Transformer and MLP-Mixer, have started to lead new trends, as they show promising results on the ImageNet classification task. In this paper, we conduct an empirical study of these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH that adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they exhibit distinctive behaviours when the network size scales up. Based on our findings, we propose hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPs, already on par with SOTA models with sophisticated designs. The code and models are publicly available at https://github.com/microsoft/spach
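The separation of spatial and channel processing described above can be captured by a small block with a pluggable spatial mixer; the residual/LayerNorm layout below is an assumption, the point being only that the spatial module is interchangeable:

```python
import torch
import torch.nn as nn

class SpatialChannelBlock(nn.Module):
    """A block with an interchangeable spatial mixer and a fixed channel MLP.

    Sketch of the 'separate spatial and channel processing' idea: pass any
    module mapping (B, N, C) -> (B, N, C) as the spatial mixer, e.g. an
    attention layer, a token-mixing MLP, or a convolution wrapper.
    """
    def __init__(self, dim: int, spatial_mixer: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial = spatial_mixer
        self.norm2 = nn.LayerNorm(dim)
        self.channel = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.spatial(self.norm1(x))   # spatial (token) mixing
        x = x + self.channel(self.norm2(x))   # channel mixing
        return x

class TokenMLP(nn.Module):
    """One possible spatial mixer: an MLP over the token dimension."""
    def __init__(self, num_tokens: int):
        super().__init__()
        self.fc = nn.Linear(num_tokens, num_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x.transpose(1, 2)).transpose(1, 2)

if __name__ == "__main__":
    block = SpatialChannelBlock(dim=64, spatial_mixer=TokenMLP(num_tokens=196))
    print(block(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 196, 64])
```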
Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR), outperforming recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolutional neural networks and Transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. To this end, we propose the convolution-augmented Transformer for speech recognition, named Conformer. Conformer significantly outperforms the previous Transformer- and CNN-based models, achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on test/test-other. We also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
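For reference, a compressed sketch of a convolution-augmented (Conformer-style) block; relative positional encoding, dropout and some normalisation details of the published design are left out:

```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Simplified convolution-augmented Transformer block for a (B, T, C) sequence.

    Layout follows the half-FFN / self-attention / convolution / half-FFN
    pattern; relative positional encoding is intentionally omitted.
    """
    def __init__(self, dim: int, heads: int = 4, kernel: int = 31):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, 1), nn.GLU(dim=1),                      # pointwise + GLU
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim),   # depthwise
            nn.BatchNorm1d(dim), nn.SiLU(), nn.Conv1d(dim, dim, 1),
        )
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + 0.5 * self.ffn1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a)[0]
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)

if __name__ == "__main__":
    block = ConformerBlockSketch(dim=144, heads=4)
    print(block(torch.randn(2, 100, 144)).shape)  # torch.Size([2, 100, 144])
```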
The past year has witnessed rapid progress in applying Transformer modules to vision problems. While some researchers have demonstrated that Transformer-based models enjoy a favourable ability to fit data, there is growing evidence that these models suffer from over-fitting, especially when training data is limited. This paper offers an empirical study that performs step-by-step operations to gradually transform a Transformer-based model into a convolution-based one. The results we obtain during this transition process deliver useful messages for improving visual recognition. Based on these observations, we propose a new architecture named Visformer, abbreviated from "Vision-friendly Transformer". With the same computational complexity, Visformer outperforms both Transformer-based and convolution-based models in terms of ImageNet classification accuracy, and the advantage becomes more significant when the model complexity is lower or the training set is smaller. The code is available at https://github.com/danczs/visformer
Vision Transformers (ViTs) are becoming an increasingly popular and dominant technique compared with convolutional neural networks (CNNs). As a demanding technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships. In this paper, we first introduce the fundamental concepts and background of the self-attention mechanism. Next, we provide a comprehensive overview of recent top-performing ViT methods, describing their strengths and weaknesses, computational costs, and training and testing datasets. We thoroughly compare the performance of various ViT algorithms and the most representative CNN methods on popular benchmark datasets. Finally, we explore some limitations with insightful observations and provide directions for further research. The project page, along with the collection of papers, is available at https://github.com/khawar512/vit-survey
The increasing number of surveillance cameras and security concerns have made automatic violent activity detection from surveillance footage an active area of research. Modern deep learning methods have achieved good accuracy in violence detection and proved to be successful because of their applicability in intelligent surveillance systems. However, the models are computationally expensive and large in size because of their inefficient methods for feature extraction. This work presents a novel architecture for violence detection called Two-stream Multi-dimensional Convolutional Network (2s-MDCN), which uses RGB frames and optical flow to detect violence. Our proposed method extracts temporal and spatial information independently by 1D, 2D, and 3D convolutions. Despite combining multi-dimensional convolutional networks, our models are lightweight and efficient due to reduced channel capacity, yet they learn to extract meaningful spatial and temporal information. Additionally, combining RGB frames and optical flow yields 2.2% more accuracy than a single RGB stream. Despite their lower complexity, our models obtain state-of-the-art accuracy of 89.7% on the largest violence detection benchmark dataset.
Accurate and unbiased examination of skin lesions is critical for the early diagnosis and treatment of skin diseases. The visual features of skin lesions vary significantly because images are collected from patients with different lesion colours and morphologies using dissimilar imaging equipment. Recent studies have reported that ensembling convolutional neural networks (CNNs) is practical for classifying such images for the early diagnosis of skin disorders. However, the practical use of these connected CNNs is limited because the networks are heavyweight and inadequate for processing contextual information. Although lightweight networks (e.g., MobileNetV3 and EfficientNet) have been developed to reduce parameters so that deep neural networks can run on mobile devices, insufficient depth of feature representation limits their performance. To address these limitations, we develop a new lightweight and effective neural network, HierAttn. HierAttn applies a novel deep supervision strategy that learns local and global features through multi-stage and multi-branch attention mechanisms with only one training loss. The efficacy of HierAttn is evaluated on the dermoscopy image dataset ISIC2019 and the smartphone photo dataset PAD-UFES-20 (PAD2020). The experimental results show that HierAttn achieves the best accuracy and area under the curve (AUC) among state-of-the-art lightweight networks. The code is available at https://github.com/anthonyweidai/hierattn
Network architecture plays a key role in deep-learning-based computer vision systems. The widely used convolutional neural networks and Transformers treat the image as a grid or a sequence, which is not flexible enough to capture irregular and complex objects. In this paper, we propose to represent the image as a graph structure and introduce a new Vision GNN (ViG) architecture to extract graph-level features for visual tasks. We first split the image into a number of patches that are viewed as nodes, and construct a graph by connecting the nearest neighbours. Based on this graph representation of images, we build the ViG model to transform and exchange information among all the nodes. ViG consists of two basic modules: a Grapher module with graph convolution for aggregating and updating graph information, and an FFN module with two linear layers for node feature transformation. Both isotropic and pyramid architectures of ViG are built with different model sizes. Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture. We hope this pioneering study of GNNs on general visual tasks will provide useful inspiration and experience for future research. The PyTorch code is available at https://github.com/huawei-noah/Efficient-AI-Backbones and the MindSpore code is available at https://gitee.com/mindspore/models
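A toy version of the patch-graph idea: build a k-nearest-neighbour graph over patch features and aggregate neighbours before a linear update. The max aggregation and the `update` projection are simplifications for illustration, not a copy of ViG's Grapher module:

```python
import torch
import torch.nn as nn

class KNNGraphConv(nn.Module):
    """Builds a kNN graph over patch features and aggregates neighbours.

    Sketch: for each node (patch), find the k closest nodes in feature space,
    aggregate them by max pooling, and mix with the node's own feature.
    """
    def __init__(self, dim: int, k: int = 9):
        super().__init__()
        self.k = k
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) patch features treated as graph nodes
        dist = torch.cdist(x, x)                                       # pairwise distances
        idx = dist.topk(self.k + 1, largest=False).indices[:, :, 1:]   # drop self
        b, n, c = x.shape
        batch = torch.arange(b, device=x.device).view(b, 1, 1)
        neighbours = x[batch, idx]                       # (B, N, k, C)
        agg = neighbours.max(dim=2).values               # max over neighbours
        return self.update(torch.cat([x, agg], dim=-1))  # (B, N, C)

if __name__ == "__main__":
    gconv = KNNGraphConv(dim=64, k=8)
    print(gconv(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 196, 64])
```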
Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver, a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.
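The asymmetric attention pattern boils down to a small learned latent array that repeatedly cross-attends to a long input sequence. A bare-bones sketch follows; the latent size, depth, mean pooling and the missing Fourier position features are all simplifications:

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """A small latent array repeatedly cross-attends to a long input sequence.

    Sketch: attention cost is O(num_latents * num_inputs) per layer instead of
    quadratic in the input length, so very large inputs remain affordable.
    """
    def __init__(self, dim: int, num_latents: int = 64, heads: int = 4, depth: int = 2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)])
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)])

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (B, N, C) with N possibly in the tens of thousands
        z = self.latents.expand(inputs.size(0), -1, -1)
        for cross, self_attn in zip(self.cross, self.self_attn):
            z = z + cross(z, inputs, inputs)[0]   # distill inputs into latents
            z = z + self_attn(z, z, z)[0]         # process the latent array
        return z.mean(dim=1)                      # pooled representation

if __name__ == "__main__":
    model = LatentCrossAttention(dim=128)
    flat_inputs = torch.randn(2, 10_000, 128)  # e.g. flattened "pixel" tokens
    print(model(flat_inputs).shape)            # torch.Size([2, 128])
```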
Most machine learning models for audio tasks work on hand-crafted features, namely spectrograms. However, it remains unknown whether the spectrogram can be replaced with deep-learning-based features. In this paper, we address this question by comparing different learnable neural front-ends with successful spectrogram-based models, and we propose a General Audio Feature eXtractor (GAFX) based on dual U-Net (GAFX-U), ResNet (GAFX-R), and attention (GAFX-A) modules. We design experiments to evaluate the music genre classification task on the GTZAN dataset and conduct a detailed ablation study of different configurations of our framework; our model GAFX-U, followed by an Audio Spectrogram Transformer (AST) classifier, achieves competitive performance.
Recently, vision Transformers have achieved great success by pushing the state-of-the-art of various vision tasks. One of the most challenging problems in vision Transformers is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision Transformers, in which the pooled feature extracted by a single pooling operation appears less powerful. To this end, we note that pyramid pooling has proven effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not yet been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to multi-head self-attention (MHSA) in the vision Transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Plugging in our pooling-based MHSA, we build a universal vision Transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is used as the backbone network, it shows substantial advantages in various vision tasks compared with previous CNN- and Transformer-based networks. The code will be released at https://github.com/yuhuan-wu/p2t
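A condensed sketch of multi-scale pooled attention in the spirit described above: keys and values come from concatenated pooled grids, so the attended sequence is much shorter than the query sequence. The pooled output sizes and the omission of P2T's depthwise refinements are assumptions:

```python
import torch
import torch.nn as nn

class PyramidPoolingAttention(nn.Module):
    """Attention whose keys/values concatenate tokens pooled at several scales.

    Sketch: the token grid is adaptively average-pooled to a few small output
    sizes, the pooled grids are flattened and concatenated, and full attention
    runs against this much shorter key/value sequence.
    """
    def __init__(self, dim: int, heads: int = 4, pool_sizes=(1, 2, 4, 8)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in pool_sizes])
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, hw) -> torch.Tensor:
        # x: (B, N, C) tokens arranged on an hw[0] x hw[1] grid
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, *hw)
        kv = torch.cat([p(grid).flatten(2) for p in self.pools], dim=2)
        kv = kv.transpose(1, 2)             # (B, 1 + 4 + 16 + 64, C)
        out, _ = self.attn(x, kv, kv)
        return out

if __name__ == "__main__":
    ppa = PyramidPoolingAttention(dim=64)
    tokens = torch.randn(2, 56 * 56, 64)
    print(ppa(tokens, (56, 56)).shape)  # torch.Size([2, 3136, 64])
```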
Semantic segmentation usually benefits from global contexts, fine localisation information, multi-scale features, etc. To advance Transformer-based segmenters with these aspects, we present a simple yet powerful semantic segmentation architecture, termed IncepFormer. IncepFormer makes two critical contributions as follows. First, it introduces a novel pyramid-structured Transformer encoder which harvests global context and fine localisation features simultaneously. These features are concatenated and fed into a convolution layer for final per-pixel prediction. Second, IncepFormer integrates an Inception-like architecture with depth-wise convolutions, and a light-weight feed-forward module in each self-attention layer, efficiently obtaining rich local multi-scale object features. Extensive experiments on five benchmarks show that our IncepFormer is superior to state-of-the-art methods in both accuracy and speed; e.g., 1) our IncepFormer-S achieves 47.7% mIoU on ADE20K, outperforming the existing best method by 1% while using only half the parameters and fewer FLOPs; 2) our IncepFormer-B achieves 82.0% mIoU on the Cityscapes dataset with 39.6M parameters. Code is available at: github.com/shendu0321/IncepFormer
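The Inception-like depthwise idea lends itself to a compact sketch: parallel depthwise convolutions with different kernel shapes over the same feature map, fused by a pointwise convolution. The kernel set and where exactly such a module sits inside IncepFormer are assumptions:

```python
import torch
import torch.nn as nn

class InceptionDWBlock(nn.Module):
    """Parallel depthwise convolutions with different kernels, Inception-style.

    Sketch: each branch sees the same (B, C, H, W) feature map with a
    different receptive field; branch outputs are summed and fused by a
    pointwise (1x1) convolution.
    """
    def __init__(self, dim: int, kernels=(3, (1, 7), (7, 1))):
        super().__init__()
        def dw(k):
            k = (k, k) if isinstance(k, int) else k
            pad = (k[0] // 2, k[1] // 2)
            return nn.Conv2d(dim, dim, k, padding=pad, groups=dim)
        self.branches = nn.ModuleList([dw(k) for k in kernels])
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(sum(branch(x) for branch in self.branches))

if __name__ == "__main__":
    block = InceptionDWBlock(dim=64)
    print(block(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```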