We apply the vision transformer, a deep machine learning model built around the attention mechanism, to mel-spectrogram representations of raw audio recordings. When adding mel-based data augmentation techniques and sample weighting, we achieve comparable performance on both tasks (PRS and CCS) of the ComParE21 challenge, outperforming most single-model baselines. We further introduce overlapping vertical patching and evaluate the influence of parameter configurations. Index Terms: audio classification, attention, mel-spectrogram, unbalanced datasets, computational paralinguistics
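To make the overlapping vertical patching concrete, here is a minimal sketch (in PyTorch) that cuts a mel-spectrogram into full-height patches overlapping along the time axis and projects each one to a token; the patch width, stride and embedding size are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class OverlappingVerticalPatcher(nn.Module):
    """Cut a mel-spectrogram (batch, n_mels, time) into full-height vertical
    patches that overlap along the time axis, then project each patch to an
    embedding vector. Patch width/stride are illustrative assumptions."""

    def __init__(self, n_mels: int = 128, patch_width: int = 16,
                 stride: int = 8, embed_dim: int = 256):
        super().__init__()
        self.patch_width, self.stride = patch_width, stride
        self.proj = nn.Linear(n_mels * patch_width, embed_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, time) -> (batch, n_patches, n_mels, patch_width)
        patches = mel.unfold(dimension=2, size=self.patch_width, step=self.stride)
        patches = patches.permute(0, 2, 1, 3).flatten(2)  # (batch, n_patches, n_mels*width)
        return self.proj(patches)                         # (batch, n_patches, embed_dim)

mel = torch.randn(4, 128, 256)                # dummy batch of mel-spectrograms
tokens = OverlappingVerticalPatcher()(mel)    # overlapping patches -> token sequence
print(tokens.shape)                           # torch.Size([4, 31, 256])
```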
Deploying deep learning models in time-critical applications with limited computational resources, for instance in edge computing systems and IoT networks, is a challenging task that often relies on dynamic inference methods such as early exiting. In this paper, we introduce a novel architecture for early exiting based on the Vision Transformer architecture, as well as a fine-tuning strategy that significantly improves the accuracy of early-exit branches compared to conventional approaches while introducing less overhead. Through extensive experiments on image and audio classification as well as audiovisual crowd counting, we show that our method works for both classification and regression problems, in single- as well as multi-modal settings. Additionally, we introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis, which can lead to more fine-grained dynamic inference.
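A minimal sketch of the early-exit idea, assuming a generic transformer encoder: a lightweight classifier branch follows each block, and inference stops at the first branch whose confidence clears a threshold. Depth, head design and threshold are placeholders, not the proposed architecture or fine-tuning strategy.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Transformer encoder with a lightweight classifier ("exit") after every
    block; at inference we stop at the first exit whose confidence clears a
    threshold and skip the remaining blocks."""

    def __init__(self, dim=192, depth=6, heads=3, num_classes=10, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
             for _ in range(depth)])
        self.exits = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(depth)])
        self.threshold = threshold

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = tokens
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            logits = exit_head(x.mean(dim=1))          # mean-pool tokens, then classify
            if logits.softmax(-1).max() > self.threshold:
                return logits                          # confident enough: exit early
        return logits                                  # otherwise use the final exit

model = EarlyExitEncoder().eval()
with torch.no_grad():
    out = model(torch.randn(1, 64, 192))               # 64 patch tokens of width 192
print(out.shape)                                       # torch.Size([1, 10])
```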
Future surveys such as the Legacy Survey of Space and Time (LSST) of the Vera C. Rubin Observatory will observe an order of magnitude more astrophysical transient events than any previous survey. With this deluge of photometric data, it will be impossible for all such events to be classified by humans alone. Recent efforts have sought to leverage machine learning methods to tackle the challenge of astronomical transient classification, with ever improving success. Transformers are a recently developed deep learning architecture, first proposed for natural language processing, that have shown a great deal of recent success. In this work we develop a new transformer architecture, which uses multi-head self-attention at its core, for general multi-variate time-series data. Furthermore, the proposed time-series transformer architecture supports the inclusion of an arbitrary number of additional features, while also offering interpretability. We apply the time-series transformer to the task of photometric classification, minimising the reliance on expert domain knowledge for feature selection, while achieving results comparable to state-of-the-art photometric classification methods. We achieve a logarithmic loss of 0.507 on imbalanced data in a representative setting using data from the Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC). Moreover, we achieve a micro-averaged receiver operating characteristic area under curve of 0.98 and a micro-averaged precision-recall area under curve of 0.87.
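A rough sketch of such a multivariate time-series transformer, under the assumption that each time step is linearly embedded, encoded with multi-head self-attention, pooled, and concatenated with extra per-object features before classification; all dimensions and the feature choice are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class TimeSeriesTransformer(nn.Module):
    """Multi-head self-attention over a multivariate light curve, with an
    arbitrary number of additional (non-time-series) features appended
    before the classification head."""

    def __init__(self, n_channels=6, n_extra=2, dim=64, heads=4, depth=2, n_classes=14):
        super().__init__()
        self.embed = nn.Linear(n_channels, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim + n_extra, n_classes)

    def forward(self, series: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
        # series: (batch, time, n_channels); extra: (batch, n_extra), e.g. redshift info
        h = self.encoder(self.embed(series)).mean(dim=1)   # pool over the time axis
        return self.head(torch.cat([h, extra], dim=-1))

model = TimeSeriesTransformer()
logits = model(torch.randn(8, 100, 6), torch.randn(8, 2))
print(logits.shape)                                        # torch.Size([8, 14])
```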
Current keyword spotting systems are typically trained with a large number of predefined keywords. Recognizing keywords in an open-vocabulary setting is crucial for personalized interaction with smart devices. To achieve this, we propose a purely MLP-based neural network built on MLP-Mixer, an MLP model architecture that effectively replaces the attention mechanism in vision transformers. We investigate different ways of adapting the MLP-Mixer architecture to the query-by-example (QbyE) open-vocabulary keyword spotting task. Comparisons with state-of-the-art RNN and CNN models show that our method achieves better performance in challenging conditions (10 dB and 6 dB environments) on both the publicly available Hey-Snips dataset and a larger-scale internal dataset with 400 speakers. Our proposed model also has a smaller number of parameters and MACs than the baseline models.
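For context, a minimal MLP-Mixer block of the kind the work builds on: a token-mixing MLP and a channel-mixing MLP replace self-attention. The frame and feature sizes below are placeholders, not the keyword-spotting configuration.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: an MLP mixes information across tokens (time frames),
    then another MLP mixes across channels, each with a residual connection."""

    def __init__(self, n_tokens: int, dim: int, token_hidden: int = 128, channel_hidden: int = 256):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(n_tokens, token_hidden), nn.GELU(),
                                       nn.Linear(token_hidden, n_tokens))
        self.channel_mlp = nn.Sequential(nn.Linear(dim, channel_hidden), nn.GELU(),
                                         nn.Linear(channel_hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

x = torch.randn(2, 98, 64)                        # e.g. 98 acoustic frames, 64 features each
print(MixerBlock(n_tokens=98, dim=64)(x).shape)   # torch.Size([2, 98, 64])
```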
In recent years, Speech Emotion Recognition (SER) has been investigated mainly by transforming the speech signal into spectrograms, which are then classified using Convolutional Neural Networks pretrained on generic images and fine-tuned with spectrograms. In this paper, we start from this general idea and develop a new learning solution for SER, which is based on Compact Convolutional Transformers (CCTs) combined with a speaker embedding. With CCTs, the learning power of Vision Transformers (ViT) is combined with a diminished need for large volumes of data, as made possible by the convolution. This is important in SER, where large corpora of data are usually not available. The speaker embedding allows the network to extract an identity representation of the speaker, which is then integrated, by means of a self-attention mechanism, with the features that the CCT extracts from the spectrogram. Overall, the solution is capable of operating in real time, showing promising results in a cross-corpus scenario, where training and test datasets are kept separate. Experiments have been performed on several benchmarks in a cross-corpus setting, as rarely done in the literature, with results that are comparable or superior to those obtained with state-of-the-art network architectures. Our code is available at https://github.com/JabuMlDev/Speaker-VGG-CCT.
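A hedged sketch of the fusion step described above: spectrogram tokens (from a stand-in encoder rather than an actual CCT) are concatenated with a projected speaker-embedding token and passed through self-attention before classification. All module sizes are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerConditionedClassifier(nn.Module):
    """Fuse spectrogram tokens with a speaker-embedding token via self-attention,
    then classify emotion from the pooled output."""

    def __init__(self, dim=128, n_heads=4, n_emotions=7):
        super().__init__()
        self.speaker_proj = nn.Linear(192, dim)       # e.g. a 192-d speaker vector (assumed)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, n_emotions)

    def forward(self, spec_tokens: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # spec_tokens: (batch, n_tokens, dim) from a spectrogram encoder (e.g. a CCT)
        # speaker_emb: (batch, 192) identity representation of the speaker
        spk = self.speaker_proj(speaker_emb).unsqueeze(1)          # (batch, 1, dim)
        tokens = torch.cat([spk, spec_tokens], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)               # self-attention over all tokens
        return self.head(fused.mean(dim=1))

model = SpeakerConditionedClassifier()
print(model(torch.randn(2, 49, 128), torch.randn(2, 192)).shape)   # torch.Size([2, 7])
```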
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design of MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through the encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. Code and models will be available at https://github.com/facebookresearch/audiomae.
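A minimal sketch of the masking step, assuming already-flattened spectrogram patches: a high fraction of patches is dropped, only the visible tokens would be encoded, and the returned restore indices let a decoder reinsert mask tokens in the original order. The masking ratio and sizes are placeholders.

```python
import torch

def random_mask(patches: torch.Tensor, mask_ratio: float = 0.8):
    """Keep a random subset of spectrogram patches (the 'visible' tokens).
    patches: (batch, n_patches, dim). Returns the visible patches and the indices
    needed to restore the original order when the decoder reinserts mask tokens."""
    b, n, d = patches.shape
    n_keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n)                     # random score per patch
    shuffle = noise.argsort(dim=1)               # random permutation of patches
    restore = shuffle.argsort(dim=1)             # inverse permutation for the decoder
    keep = shuffle[:, :n_keep]                   # indices of visible patches
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, restore

patches = torch.randn(2, 512, 768)               # e.g. 512 patches from one spectrogram
visible, restore = random_mask(patches)
print(visible.shape)                             # torch.Size([2, 102, 768]) with 80% masked
```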
Motivated by the considerable growth of chest X-ray imaging for diagnosing various diseases, and by the collection of extensive datasets, automated diagnosis using deep neural networks has occupied the minds of experts. Most available methods in computer vision use a CNN backbone to achieve high accuracy on classification problems. Nevertheless, recent studies show that Transformers, which have become the de facto method in NLP, can also outperform many CNN-based models. This paper proposes a multi-label classification deep model based on the Swin Transformer as the backbone to achieve state-of-the-art diagnostic classification. It exploits a head architecture based on multi-layer perceptrons (MLPs). We evaluate our model on ChestX-ray14, one of the most widely used and largest X-ray datasets, which consists of more than 100,000 frontal-view images from over 30,000 patients with 14 well-known chest diseases. Our model has been tested with several numbers of MLP layers in the head, and each configuration achieves a competitive AUC score on all classes. Comprehensive experiments on ChestX-ray14 show that a three-layer head attains an average AUC score of 0.810, compared to the previous SOTA average AUC of 0.799. We also propose an experimental setup for fair benchmarking of existing methods, which can serve as a basis for future research. Finally, we follow up on the results by confirming that the proposed method attends to pathologically relevant regions of the chest.
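A rough sketch of the multi-label setup, with a pooled feature vector standing in for Swin Transformer output: an MLP head produces one logit per pathology and is trained with binary cross-entropy. The layer sizes are assumptions, not the paper's head configuration.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Three-layer MLP head mapping backbone features to 14 independent pathology
    scores; BCEWithLogitsLoss treats each class as its own binary problem, as is
    standard for multi-label chest X-ray tagging."""

    def __init__(self, in_dim=768, hidden=512, n_classes=14):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.mlp(features)                 # raw logits, one per disease

features = torch.randn(4, 768)                    # pooled backbone features (stand-in)
targets = torch.randint(0, 2, (4, 14)).float()    # multi-hot disease labels
logits = MultiLabelHead()(features)
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(logits.shape, loss.item())
```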
With the rise of Transformers as the standard for language processing and their advances in computer vision, parameter sizes and amounts of training data have grown accordingly. Many have come to believe that Transformers are therefore not suitable for small amounts of data. This trend raises concerns such as the limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we aim to present an approach to small-scale learning by introducing Compact Transformers. We show for the first time that, with the right size and convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets. Our models are flexible in model size and can have as few as 0.28 million parameters while achieving competitive results. Our best model reaches 98% accuracy when trained from scratch on CIFAR-10 with only 3.7 million parameters, which is a significant improvement in data efficiency over previous Transformer-based models, being more than 10x smaller than other transformers and 15% the size of ResNet50 while achieving similar performance. CCT also outperforms many modern CNN-based approaches and even some recent NAS-based methods. Additionally, we obtain a new SOTA on Flowers-102 with 99.76% top-1 accuracy, and improve upon existing baselines on ImageNet (82.71% accuracy with 29% of the ViT parameters), as well as on NLP tasks. Our simple and compact design for transformers makes them more feasible to study for those with limited computing resources and/or dealing with small datasets, while extending existing research efforts in data-efficient transformers. Our code and pre-trained models are publicly available at https://github.com/shi-labs/compact-transformers.
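To illustrate the convolutional tokenization, a minimal sketch: a small convolution-plus-pooling stage turns an image into a token sequence for the transformer instead of non-overlapping linear patch projection. Kernel sizes and dimensions are illustrative, not the CCT reference settings.

```python
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    """Convolutional tokenizer: conv + ReLU + max-pool produce a feature map
    whose spatial positions become the transformer's token sequence."""

    def __init__(self, in_channels=3, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feat = self.conv(images)                        # (batch, embed_dim, H/2, W/2)
        return feat.flatten(2).transpose(1, 2)          # (batch, n_tokens, embed_dim)

tokens = ConvTokenizer()(torch.randn(8, 3, 32, 32))     # CIFAR-sized input
print(tokens.shape)                                     # torch.Size([8, 256, 128])
```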
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling with an attention-based aggregation layer akin to a single transformer block, which weights how the patches are involved in the classification decision. We plug this learned aggregation layer into a simple patch-based convolutional network parameterized by two parameters (width and depth). In contrast to a pyramidal design, this architecture family maintains the input patch resolution across all layers. It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption, as shown by our experiments on various computer vision tasks: object classification, image segmentation and detection.
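A sketch of the learned aggregation under simplifying assumptions: a single learned query token cross-attends to the patch features of a convolutional trunk in place of global average pooling, and its attention weights indicate how much each patch contributes to the decision.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Replace average pooling with cross-attention from one learned class token
    to the patch features; returns logits and the per-patch attention weights."""

    def __init__(self, dim=256, n_heads=4, n_classes=1000):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, patches: torch.Tensor):
        # patches: (batch, n_patches, dim) from any convolutional backbone
        cls = self.cls.expand(patches.size(0), -1, -1)
        pooled, weights = self.attn(cls, patches, patches)    # class token attends to patches
        return self.head(pooled.squeeze(1)), weights          # weights: (batch, 1, n_patches)

logits, weights = AttentionPooling()(torch.randn(2, 196, 256))
print(logits.shape, weights.shape)     # torch.Size([2, 1000]) torch.Size([2, 1, 196])
```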
The problem of respiratory sound classification has received good attention from the clinical scientist and medical researcher communities in the past year for the diagnosis of COVID-19 disease. To date, various artificial intelligence (AI) models have been brought into the real world to detect COVID-19 disease from human-generated sounds such as voice/speech, cough and breath. Convolutional neural network (CNN) models are implemented to solve many real-world problems on AI-based machines. In this context, a one-dimensional (1D) CNN is proposed and implemented to diagnose respiratory diseases of COVID-19 from sounds such as voice, cough and breathing. An augmentation-based mechanism is applied to improve the preprocessing performance on the COVID-19 sound dataset and to automate COVID-19 disease diagnosis with the 1D convolutional network. Furthermore, a DDAE (Data De-noising Auto Encoder) technique is used to generate deep sound features as input, instead of adopting the standard MFCC (Mel-frequency cepstral coefficients) input, and it achieves better accuracy and performance than previous models.
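A minimal sketch of a 1D CNN classifier of the kind described, operating on a feature sequence such as DDAE bottleneck features or MFCCs; layer widths, the feature dimension and the binary output are assumptions, not the paper's exact network.

```python
import torch
import torch.nn as nn

class Sound1DCNN(nn.Module):
    """1D CNN over a (batch, features, time) sound representation,
    ending in a single binary COVID / non-COVID logit."""

    def __init__(self, n_features=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                  # collapse the time axis
        )
        self.fc = nn.Linear(128, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.net(x).squeeze(-1))

x = torch.randn(4, 40, 300)                           # 40 features over 300 frames
print(Sound1DCNN()(x).shape)                          # torch.Size([4, 1])
```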
To ensure proper knowledge representation of the kitchen environment, it is vital for kitchen robots to recognize the states of the food items that are being cooked. Although the domain of object detection and recognition has been extensively studied, the task of object state classification has remained relatively unexplored. The high intra-class similarity of ingredients during different states of cooking makes the task even more challenging. Researchers have proposed adopting Deep Learning based strategies in recent times; however, these are yet to achieve high performance. In this study, we utilized the self-attention mechanism of the Vision Transformer (ViT) architecture for the Cooking State Recognition task. The proposed approach encapsulates the globally salient features from images, while also exploiting the weights learned from a larger dataset. This global attention allows the model to withstand the similarities between samples of different cooking objects, while the employment of transfer learning helps to overcome the lack of inductive bias by utilizing pretrained weights. To improve recognition accuracy, several augmentation techniques have been employed as well. In evaluation on the `Cooking State Recognition Challenge Dataset', our proposed framework achieves an accuracy of 94.3%, which significantly outperforms the state-of-the-art.
With the development of the self-attention mechanism, Transformer models have demonstrated excellent performance in the computer vision domain. However, the massive computation brought by the full attention mechanism becomes a heavy burden on memory consumption. Consequently, the memory limitation reduces the possibility of improving Transformer models. To address this issue, we propose a novel memory-economical attention mechanism named Couplformer, which decouples the attention map into two sub-matrices and generates the alignment scores from spatial information. A series of image classification tasks at different scales are used to evaluate the effectiveness of the model. Experimental results show that, on the ImageNet-1K classification task, the Couplformer can significantly reduce memory consumption by 28% compared with the regular Transformer while reaching sufficient accuracy, and outperforms it by 0.92% in Top-1 accuracy when occupying the same memory footprint. As a result, the Couplformer can serve as an efficient backbone for vision tasks and provide a novel perspective on the attention mechanism for researchers.
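The exact Couplformer decomposition is not reproduced here; as a loose illustration of the memory-saving idea, the sketch below routes attention through a small set of learned landmark vectors so that no full n-by-n attention map is ever materialized.

```python
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    """Illustrative low-rank attention: tokens attend to k learned landmark vectors
    instead of to each other, so memory scales with n*k rather than n*n. This is a
    generic stand-in for memory-economical attention, not the Couplformer itself."""

    def __init__(self, dim=64, k=16):
        super().__init__()
        self.landmarks = nn.Parameter(torch.randn(k, dim) / dim ** 0.5)
        self.q, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, dim)
        scores = self.q(x) @ self.landmarks.t() / x.size(-1) ** 0.5   # (batch, n, k)
        weights = scores.softmax(dim=-1)
        summary = weights.transpose(1, 2) @ self.v(x)                 # (batch, k, dim)
        return weights @ summary                                      # (batch, n, dim)

print(FactorizedAttention()(torch.randn(2, 1024, 64)).shape)          # torch.Size([2, 1024, 64])
```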
This paper presents a way of doing large-scale audio understanding without traditional state-of-the-art neural architectures. Ever since deep learning was introduced for understanding audio signals in the past decade, convolutional architectures have been able to achieve state-of-the-art results, surpassing traditional hand-crafted features. In the recent past, there has been a similar shift away from traditional convolutional and recurrent neural networks towards purely end-to-end Transformer architectures. In this work, we explore an approach based on bag-of-features models. Our approach does not use any convolutions, recurrence, attention, transformers or other approaches such as BERT. We utilize micro- and macro-level clustered vanilla embeddings and use an MLP head for classification. We only use a feed-forward encoder-decoder model to obtain bottlenecks of spectral envelopes, spectral patches and slices, as well as multi-resolution spectra. A classification head (a feed-forward layer), similar to the approach in SimCLR, is trained on the learned representations. Using simple codes learned on the latent representations, we show how we surpass traditional convolutional neural network architectures and come strikingly close to outperforming powerful Transformer architectures. This work will hopefully pave the way for exciting advancements in audio understanding without massive end-to-end neural architectures.
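A rough sketch of the clustering-plus-MLP idea: frame-level embeddings are quantized against a codebook (e.g., obtained by k-means on the training set), the normalized histogram of assignments serves as the clip representation, and a small MLP head classifies it. Embedding source, cluster count and class count are assumptions.

```python
import torch
import torch.nn as nn

def bag_of_features(frame_emb: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Quantize frame-level embeddings against a codebook and return a normalized
    histogram of cluster assignments as a fixed-size clip representation."""
    # frame_emb: (n_frames, dim); codebook: (n_clusters, dim)
    assign = torch.cdist(frame_emb, codebook).argmin(dim=1)          # nearest cluster per frame
    hist = torch.bincount(assign, minlength=codebook.size(0)).float()
    return hist / hist.sum()

codebook = torch.randn(64, 128)                   # 64 clusters over 128-d embeddings (assumed)
clip = bag_of_features(torch.randn(500, 128), codebook)
head = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 527))  # MLP classifier head
print(head(clip).shape)                           # torch.Size([527]), e.g. AudioSet-style labels
```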
Purpose. Handwriting is one of the most common patterns in everyday life, and with it come challenging applications such as handwriting recognition (HWR), writer identification and signature verification. In contrast to offline HWR, which uses only spatial information (i.e., images), online HWR (OnHWR) uses richer spatio-temporal information (i.e., trajectory data or inertial data). While many offline HWR datasets exist, there is only little data available for the development of OnHWR methods on paper, as they require hardware-integrated pens. Methods. This paper presents data and benchmark models for real-time sequence-to-sequence (seq2seq) learning and single character-based recognition. Our data are recorded by a sensor-enhanced ballpoint pen, yielding sensor data streams from triaxial accelerometers, a gyroscope, a magnetometer and a force sensor at 100 Hz. We propose a variety of datasets, including equations and words, for both writer-dependent and writer-independent tasks. Our datasets allow a comparison between classical OnHWR on tablets and on sensor-enhanced pens. We provide an evaluation benchmark for seq2seq and single character-based HWR using recurrent and temporal convolutional networks and Transformers, combined with a connectionist temporal classification (CTC) loss and a cross-entropy (CE) loss. Results. Our convolutional networks combined with BiLSTMs outperform Transformer-based architectures, are on par with InceptionTime for sequence-based classification tasks, and yield better results compared with 28 state-of-the-art techniques. Time-series augmentation methods improve the sequence-based tasks, and we show that CE variants can improve the individual classification task.
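A compact sketch of a CNN+BiLSTM model with a CTC loss, as used in the benchmark, operating on the multi-channel pen sensor stream; the channel count, layer sizes and alphabet size are placeholder assumptions.

```python
import torch
import torch.nn as nn

class CNNBiLSTMCTC(nn.Module):
    """Temporal CNN front-end + BiLSTM + per-frame character logits,
    trained with the connectionist temporal classification (CTC) loss."""

    def __init__(self, n_sensors=13, n_chars=60):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv1d(n_sensors, 64, kernel_size=5, padding=2), nn.ReLU())
        self.rnn = nn.LSTM(64, 64, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(128, n_chars + 1)               # +1 for the CTC blank symbol

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_sensors) sensor stream from the pen
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)     # (batch, time, 64)
        h, _ = self.rnn(h)
        return self.fc(h).log_softmax(-1)                   # (batch, time, n_chars+1)

model = CNNBiLSTMCTC()
log_probs = model(torch.randn(2, 400, 13)).transpose(0, 1)  # CTC expects (time, batch, classes)
targets = torch.randint(1, 61, (2, 20))                     # dummy character label sequences
loss = nn.CTCLoss()(log_probs, targets,
                    torch.full((2,), 400), torch.full((2,), 20))
print(loss.item())
```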
The availability of the sheer volume of Copernicus Sentinel imagery has created new opportunities for large-scale land use and land cover (LULC) mapping using deep learning. Training on such large datasets is, however, a non-trivial task. In this work, we experiment with the BigEarthNet dataset for LULC image classification and benchmark different state-of-the-art models, including convolutional neural networks, multi-layer perceptrons, vision transformers, EfficientNets and Wide Residual Network (WRN) architectures. Our aim is to trade off classification accuracy, training time and inference rate. We propose an EfficientNet-style framework based on compound scaling of WRNs with respect to network depth, width and input data resolution, for efficiently training and testing different model setups. We design a novel scaled WRN architecture enhanced with an efficient channel attention mechanism. Our proposed lightweight model has fewer trainable parameters, achieves a 4.5% higher averaged F-score classification accuracy across all 19 LULC classes, and is trained two times faster with respect to the ResNet50 state-of-the-art model that we use as a baseline. We provide access to more than 50 trained models, together with our code for distributed training on multiple GPU nodes.
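As a sketch of the efficient channel attention ingredient: a cheap 1D convolution over globally pooled channel statistics yields per-channel gates that can be dropped into a wide residual block. The kernel size and channel width are illustrative.

```python
import torch
import torch.nn as nn

class EfficientChannelAttention(nn.Module):
    """ECA-style gate: global average pooling followed by a cheap 1D conv
    across channels, then sigmoid re-weighting of the feature map."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W)
        y = x.mean(dim=(2, 3)).unsqueeze(1)                                 # (batch, 1, channels)
        gate = torch.sigmoid(self.conv(y)).transpose(1, 2).unsqueeze(-1)    # (batch, channels, 1, 1)
        return x * gate

x = torch.randn(2, 160, 30, 30)                         # wide-ResNet-style feature map
print(EfficientChannelAttention()(x).shape)             # torch.Size([2, 160, 30, 30])
```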
Time Series Classification (TSC) is an important and challenging problem in data mining. With the increase of time series data availability, hundreds of TSC algorithms have been proposed. Among these methods, only a few have considered Deep Neural Networks (DNNs) to perform this task. This is surprising as deep learning has seen very successful applications in recent years. DNNs have indeed revolutionized the field of computer vision, especially with the advent of novel deeper architectures such as Residual and Convolutional Neural Networks. Apart from images, sequential data such as text and audio can also be processed with DNNs to reach state-of-the-art performance for document classification and speech recognition. In this article, we study the current state-of-the-art performance of deep learning algorithms for TSC by presenting an empirical study of the most recent DNN architectures for TSC. We give an overview of the most successful deep learning applications in various time series domains under a unified taxonomy of DNNs for TSC. We also provide an open source deep learning framework to the TSC community where we implemented each of the compared approaches and evaluated them on a univariate TSC benchmark (the UCR/UEA archive) and 12 multivariate time series datasets. By training 8,730 deep learning models on 97 time series datasets, we propose the most exhaustive study of DNNs for TSC to date.
Attention-based neural networks, such as Transformers, have become ubiquitous in numerous applications, including computer vision, natural language processing, and time-series analysis. In all kinds of attention networks, the attention maps are crucial as they encode semantic dependencies between input tokens. However, most existing attention networks perform modeling or reasoning based on representations, wherein the attention maps of different layers are learned separately without explicit interactions. In this paper, we propose a novel and generic evolving attention mechanism, which directly models the evolution of inter-token relationships through a chain of residual convolutional modules. The major motivations are twofold. On the one hand, the attention maps in different layers share transferable knowledge, thus adding a residual connection can facilitate the information flow of inter-token relationships across layers. On the other hand, there is naturally an evolutionary trend among attention maps at different abstraction levels, so it is beneficial to exploit a dedicated convolution-based module to capture this process. Equipped with the proposed mechanism, the convolution-enhanced evolving attention networks achieve superior performance in various applications, including time-series representation, natural language understanding, machine translation, and image classification. Especially on time-series representation tasks, Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly, achieving an average of 17% improvement compared to the best SOTA. To the best of our knowledge, this is the first work that explicitly models the layer-wise evolution of attention maps. Our implementation is available at https://github.com/pkuyym/EvolvingAttention
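A small single-layer sketch of the mechanism, under simplifying assumptions: the raw attention scores of the current layer are blended with a residual 2D convolution applied to the previous layer's per-head attention maps before the softmax; the mixing weight and sizes are placeholders.

```python
import torch
import torch.nn as nn

class EvolvingAttention(nn.Module):
    """Single-layer sketch: compute raw attention scores, then refine them with a
    residual 2D convolution over the stack of per-head attention maps carried over
    from the previous layer."""

    def __init__(self, dim=64, heads=4, alpha=0.5):
        super().__init__()
        self.heads, self.alpha = heads, alpha
        self.qkv = nn.Linear(dim, 3 * dim)
        self.conv = nn.Conv2d(heads, heads, kernel_size=3, padding=1)

    def forward(self, x, prev_attn=None):
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.heads, -1).transpose(1, 2)       # (b, heads, n, head_dim)
        k = k.view(b, n, self.heads, -1).transpose(1, 2)
        v = v.view(b, n, self.heads, -1).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / (d // self.heads) ** 0.5   # (b, heads, n, n)
        if prev_attn is not None:                              # evolve: residual conv over maps
            scores = self.alpha * scores + (1 - self.alpha) * self.conv(prev_attn)
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return out, attn                                       # pass attn to the next layer

layer = EvolvingAttention()
x = torch.randn(2, 49, 64)
out1, attn1 = layer(x)                                         # first layer: no previous maps
out2, attn2 = layer(out1, prev_attn=attn1)                     # later layer reuses attn1
print(out2.shape, attn2.shape)           # torch.Size([2, 49, 64]) torch.Size([2, 4, 49, 49])
```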
The scope of data-driven fault diagnosis models has been greatly extended by deep learning (DL). However, classical convolutional and recurrent structures have defects in computational efficiency and feature representation, while the latest Transformer architecture based on the attention mechanism has not yet been applied to this field. To address these issues, we propose a novel Time-Frequency Transformer (TFT) model, inspired by the massive success of the vanilla Transformer in sequence processing. In particular, we design a fresh tokenizer and encoder module to extract effective abstractions from the time-frequency representation (TFR) of vibration signals. On this basis, a new end-to-end fault diagnosis framework based on the Time-Frequency Transformer is proposed in this paper. Through case studies on bearing experimental datasets, we construct the optimal Transformer structure and verify its fault diagnosis performance. The superiority of the proposed method is demonstrated in comparison with benchmark models and other state-of-the-art methods.
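A rough sketch of the pipeline described: compute a time-frequency representation of the raw vibration signal, turn each frame into a token, and classify the fault type with a transformer encoder. The STFT settings, tokenization and class count are assumptions, not the paper's TFT design.

```python
import torch
import torch.nn as nn

class VibrationTFClassifier(nn.Module):
    """STFT of a raw vibration signal -> one token per time frame ->
    transformer encoder -> fault-class logits."""

    def __init__(self, n_fft=128, dim=64, heads=4, depth=2, n_faults=10):
        super().__init__()
        self.n_fft = n_fft
        self.embed = nn.Linear(n_fft // 2 + 1, dim)             # frequency bins -> token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_faults)

    def forward(self, signal: torch.Tensor) -> torch.Tensor:
        # signal: (batch, samples) raw accelerometer trace
        tfr = torch.stft(signal, n_fft=self.n_fft,
                         window=torch.hann_window(self.n_fft),
                         return_complex=True).abs()             # (batch, freq, time)
        tokens = self.embed(tfr.transpose(1, 2))                # (batch, time, dim)
        return self.head(self.encoder(tokens).mean(dim=1))

print(VibrationTFClassifier()(torch.randn(4, 2048)).shape)      # torch.Size([4, 10])
```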
Since the introduction of the Transformer architecture in 2017, there have been many attempts to bring the self-attention paradigm into the computer vision field. In this paper, we propose a novel self-attention module that can easily be integrated into virtually every convolutional neural network and that is specifically designed for computer vision: LHC, Local (multi-)Head Channel (self-attention). LHC is based on two main ideas: first, we think that the best way to exploit the self-attention paradigm in computer vision is a channel-wise application rather than the more explored spatial attention, and that convolution will not be replaced by attention as recurrent networks were in NLP; second, a local approach has the potential to better overcome the limitations of convolution than global attention. With LHC-Net we managed to achieve a new state of the art on the well-known FER2013 dataset, with significantly lower complexity and impact on the "host" architecture in terms of computational cost compared with the previous SOTA.
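The exact LHC module is not reproduced here; as a loose illustration of channel-wise (rather than spatial) self-attention, the sketch below lets feature channels attend to one another using their flattened spatial responses as descriptors, with a residual connection.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Illustrative channel-wise self-attention: each channel is described by its
    flattened spatial response, channels attend to each other, and the result is
    added back residually. A generic stand-in, not the exact LHC module."""

    def __init__(self, spatial: int = 7 * 7, heads: int = 7):
        super().__init__()
        self.attn = nn.MultiheadAttention(spatial, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        desc = x.flatten(2)                      # (batch, channels, h*w): one token per channel
        out, _ = self.attn(desc, desc, desc)     # channels attend to channels
        return x + out.view(b, c, h, w)          # residual connection

x = torch.randn(2, 256, 7, 7)                    # late-stage CNN feature map
print(ChannelSelfAttention()(x).shape)           # torch.Size([2, 256, 7, 7])
```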
We propose, for the first time, the use of a convolution-free, multiple-instance-learning-based Transformer model, termed Multiple Instance Neuroimage Transformer (MINiT), to classify T1-weighted (T1w) MRIs. We first present several Transformer models adapted for neuroimages. These models extract non-overlapping 3D blocks from the input volume and perform multi-head self-attention on their linear projections. MINiT, on the other hand, treats each non-overlapping 3D block of the input MRI as its own instance, splitting it further into non-overlapping 3D patches on which multi-head self-attention is computed. As a proof of concept, we evaluate the efficacy of the model by training it to identify sex from the T1w-MRIs of two public datasets: Adolescent Brain Cognitive Development (ABCD) and the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA). The learned attention maps highlight voxels contributing to identifying sex differences in brain morphometry. The code is available at https://github.com/singlaayush/minit.
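A small sketch of the block-and-patch splitting described above: the 3D volume is cut into non-overlapping blocks, each block into non-overlapping patches, and each patch is linearly projected to a token; in the full model, multi-head self-attention would then run within each block (instance). Block and patch sizes are placeholders.

```python
import torch
import torch.nn as nn

def split_blocks_and_patches(volume: torch.Tensor, block: int = 32, patch: int = 8) -> torch.Tensor:
    """volume: (batch, D, H, W) -> (batch, n_blocks, n_patches, patch**3).
    Each non-overlapping 3D block becomes one 'instance'; its non-overlapping
    patches become the tokens that self-attention is computed over."""
    b = volume.shape[0]
    blocks = (volume.unfold(1, block, block)
                    .unfold(2, block, block)
                    .unfold(3, block, block))             # (b, nb_d, nb_h, nb_w, block^3 dims)
    blocks = blocks.reshape(b, -1, block, block, block)   # (b, n_blocks, block, block, block)
    patches = (blocks.unfold(2, patch, patch)
                     .unfold(3, patch, patch)
                     .unfold(4, patch, patch))
    return patches.reshape(b, blocks.shape[1], -1, patch ** 3)

vol = torch.randn(1, 96, 96, 96)                          # toy T1w volume (stand-in size)
tokens = split_blocks_and_patches(vol)
embed = nn.Linear(8 ** 3, 128)                            # per-patch linear projection
print(embed(tokens).shape)                                # torch.Size([1, 27, 64, 128])
```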