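Attention layers are an integral part of modern end-to-end automatic speech recognition systems, for instance as part of the Transformer or Conformer architecture. Attention is typically multi-headed, where each head has an independent set of learned parameters and operates on the same input feature sequence. The output of multi-headed attention is a fusion of the outputs from the individual heads. We empirically analyze the diversity between the representations produced by the different attention heads and demonstrate that the heads become highly correlated during training. We investigate several approaches to increasing attention head diversity, including using a different attention mechanism for each head and auxiliary training loss functions that promote head diversity. We show that introducing diversity-promoting auxiliary loss functions during training is the more effective approach, and obtain improvements of up to 6% relative on the Librispeech corpus. Finally, we draw a connection between the diversity of attention heads and the similarity of the gradients of head parameters.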
translated by Google Translate
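The abstract above does not spell out its auxiliary loss, so the following is only a minimal sketch of one way head diversity could be measured and penalized: the mean absolute pairwise cosine similarity between flattened head outputs. The function names and the choice of cosine similarity are assumptions made for illustration, not the authors' method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def head_similarity_penalty(head_outputs):
    """Mean absolute pairwise cosine similarity between flattened head outputs.

    head_outputs: list of H heads, each a list of T frames, each a list of d floats.
    Lower values mean more diverse heads. Adding this term to the training loss
    with some weight (a hypothetical hyperparameter) would push heads apart.
    """
    flat = [[x for frame in head for x in frame] for head in head_outputs]
    pairs = [(i, j) for i in range(len(flat)) for j in range(i + 1, len(flat))]
    if not pairs:
        return 0.0
    return sum(abs(cosine(flat[i], flat[j])) for i, j in pairs) / len(pairs)

# Two identical heads are maximally redundant; two orthogonal heads are not.
identical = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
orthogonal = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]]
print(head_similarity_penalty(identical))
print(head_similarity_penalty(orthogonal))
```

In a real system this quantity would be computed on tensors inside the training graph so that gradients flow back to the head parameters. The pure-Python version here only illustrates the measurement.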
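Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, this paper introduces the concept of relaxed attention, a simple gradual injection of a uniform distribution into the encoder-decoder attention weights during training that is easily implemented with two lines of code. We investigate the effect of relaxed attention across different AED model architectures and two prominent ASR tasks, Wall Street Journal (WSJ) and Librispeech. We find that transformers trained with relaxed attention consistently outperform the standard baseline models when decoding with external language models. On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming the state of the art by 13.1% relative, while introducing only a single hyperparameter.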
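The injection of a uniform distribution described above can be sketched as a convex combination of each attention row with the uniform distribution over encoder positions. This is a minimal illustration under assumptions: the name `gamma` for the single hyperparameter and the exact blending form are not taken from the abstract.

```python
def relax_attention(weights, gamma):
    """Blend each row of attention weights with a uniform distribution.

    weights: list of rows; each row sums to 1 over T encoder positions.
    gamma:   relaxation coefficient in [0, 1] (hypothetical name).
    """
    relaxed = []
    for row in weights:
        uniform = 1.0 / len(row)
        relaxed.append([(1.0 - gamma) * w + gamma * uniform for w in row])
    return relaxed

# A sharply peaked attention row becomes smoother but still sums to 1.
peaked = [[0.9, 0.1, 0.0, 0.0]]
print(relax_attention(peaked, 0.2))
```

Following the abstract, the blending would be applied during training only, with the unmodified attention weights used at inference time.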
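The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and, for natural language processing tasks, lead to an implicitly learned internal language model in the autoregressive transformer decoder, which complicates the integration of external language models. In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights, which improves the general transformer architecture in two ways. First, relaxed attention provides regularization when applied to the self-attention layers in the encoder. Second, we show that it naturally supports the integration of an external language model, as it suppresses the implicitly learned internal language model by relaxing the cross-attention in the decoder. We demonstrate the benefit of relaxed attention across several tasks, with clear improvements in combination with recent benchmark approaches. Specifically, we exceed the former state-of-the-art performance of 26.90% word error rate on the largest public lip-reading benchmark, LRS3, with a word error rate of 26.31%, and we achieve a top-performing BLEU score of 37.67 on the IWSLT14 (DE $\rightarrow$ EN) machine translation task without external language models and with virtually no additional model parameters. Code and models will be made publicly available.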
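Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that, for many applications, these attention heads learn redundant embeddings, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus efficiently on different parts of the input sequence. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute, while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads attain performance comparable to or better than baseline transformers with 8 heads.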
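Transformer-based speech recognition models have achieved great success thanks to the self-attention (SA) mechanism, which uses every frame in the feature extraction process. In particular, SA heads in the lower layers capture various phonetic characteristics through the query-key dot product, which is designed to compute pairwise relationships between frames. In this paper, we propose a variant of SA to extract more representative phonetic features. The proposed phonetic self-attention (phSA) is composed of two different types of phonetic attention: one is similarity-based and the other is content-based. In short, similarity-based attention captures the correlation between frames, while content-based attention considers each frame on its own, without being affected by other frames. We identify which parts of the original dot-product equation are related to the two attention patterns and improve each part with simple modifications. Our experiments on phoneme classification and speech recognition show that replacing SA with phSA in the lower layers improves recognition performance without increasing latency or parameter size.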
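To make the idea of "parts of the dot-product equation" concrete, the expansion below shows how a query-key dot product with bias terms splits into a pairwise part and a part that depends on a single frame. The affine projections and the symbols $W_Q$, $b_Q$, $W_K$, $b_K$ are notation assumed here for illustration; the abstract does not state which terms the authors actually modify.

```latex
\begin{align}
q_i^\top k_j &= (W_Q x_i + b_Q)^\top (W_K x_j + b_K) \\
 &= \underbrace{x_i^\top W_Q^\top W_K x_j}_{\text{pairwise: depends on frames } i \text{ and } j}
  + \underbrace{b_Q^\top W_K x_j}_{\text{depends on frame } j \text{ only}}
  + \underbrace{x_i^\top W_Q^\top b_K + b_Q^\top b_K}_{\text{constant in } j}
\end{align}
```

For a fixed query frame $i$, the last group is the same for every key frame $j$ and therefore cancels in the softmax over $j$. Under this reading, the first term would correspond to a similarity-based pattern and the second to a content-based one.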
Transformers are among the state of the art for many tasks in speech, vision, and natural language processing, among others. Self-attentions, which are crucial contributors to this performance, have quadratic computational complexity, which makes training on longer input sequences challenging. Prior work has produced state-of-the-art transformer variants with linear attention; however, current models sacrifice performance to achieve efficient implementations. In this work, we develop a novel linear transformer by examining the properties of the key-query product within self-attentions. Our model outperforms state-of-the-art approaches on speech recognition and speech summarization, resulting in 1% absolute WER improvement on the Librispeech-100 speech recognition benchmark and a new INTERVIEW speech recognition benchmark, and 5 points on ROUGE for summarization with How2.
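Conformer has proven to be effective in many speech processing tasks. It combines the benefits of extracting local dependencies using convolutions and global dependencies using self-attention. Inspired by this, we propose a more flexible, interpretable, and customizable encoder alternative, Branchformer, with parallel branches for modeling dependencies of various ranges in end-to-end speech processing. In each encoder layer, one branch employs self-attention or one of its variants to capture long-range dependencies, while the other branch uses an MLP module with convolutional gating (cgMLP) to extract local relationships. We conduct experiments on several speech recognition and spoken language understanding benchmarks. Results show that our model outperforms both Transformer and cgMLP. It also matches state-of-the-art results achieved by Conformer. Furthermore, thanks to the two-branch architecture, we show various strategies for reducing computation, including the ability to have variable inference complexity in a single trained model. The weights learned for merging the branches indicate how local and global dependencies are used in different layers, which benefits model design.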
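The learned merge weights mentioned above can be pictured as a weighted average of the two branch outputs. The sketch below is a minimal illustration assuming one pair of scalar logits per layer normalized by a softmax. The function name and this parameterization are hypothetical and may differ from the actual Branchformer merge module.

```python
import math

def merge_branches(global_out, local_out, logits):
    """Weighted average of two branch outputs for one encoder layer.

    global_out, local_out: lists of T frames, each a list of d floats.
    logits: two learnable scalars (hypothetical parameterization); a softmax
            turns them into merge weights that sum to 1.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    w_global, w_local = (e / sum(exps) for e in exps)
    merged = [[w_global * g + w_local * l for g, l in zip(gf, lf)]
              for gf, lf in zip(global_out, local_out)]
    return merged, (w_global, w_local)

# Equal logits give equal weight to the attention and cgMLP branches.
merged, weights = merge_branches([[2.0, 4.0]], [[0.0, 0.0]], [0.0, 0.0])
print(merged, weights)
```

Inspecting the returned weights layer by layer is what would reveal how much each layer relies on the global branch versus the local one.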
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. * Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research. † Work performed while at Google Brain. ‡ Work performed while at Google Research.
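The scaled dot-product attention mentioned above computes softmax(QK^T / sqrt(d_k)) V. The following is a minimal pure-Python sketch for a single head without masking or learned projections. The function names are illustrative and do not come from the tensor2tensor codebase.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

# A zero query scores all keys equally, so the output is the mean of the values.
print(scaled_dot_product_attention([[0.0, 0.0]],
                                   [[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [0.0, 1.0]]))
```

Multi-head attention would run several such computations in parallel on differently projected inputs and concatenate the results.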
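Knowledge distillation (KD), best known as an effective method for model compression, aims to transfer the knowledge of a larger network (teacher) to a much smaller network (student). Conventional KD methods usually employ a teacher model trained in a supervised manner, where output labels are treated only as targets. Extending this supervised scheme further, we introduce a new type of teacher model for KD, namely Oracle Teacher, which uses the embeddings of both the source inputs and the output labels to extract more accurate knowledge to transfer to the student. The proposed model follows the encoder-decoder attention structure of the Transformer network, which allows the model to attend to relevant information from the output labels. Extensive experiments are conducted on three different sequence learning tasks: speech recognition, scene text recognition, and machine translation. From the experimental results, we empirically show that the proposed model improves the students across these tasks while achieving a considerable speed-up in the teacher model's training time.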
Attention-based autoregressive models have achieved state-of-the-art performance in various sequence-to-sequence tasks, including Text-To-Speech (TTS) and Neural Machine Translation (NMT), but can be difficult to train. The standard training approach, teacher forcing, guides a model with the reference back-history. During inference, the generated back-history must be used. This mismatch limits the evaluation performance. Attention forcing has been introduced to address the mismatch, guiding the model with the generated back-history and reference attention. While successful in tasks with continuous outputs like TTS, attention forcing faces additional challenges in tasks with discrete outputs like NMT. This paper introduces the two extensions of attention forcing to tackle these challenges. (1) Scheduled attention forcing automatically turns attention forcing on and off, which is essential for tasks with discrete outputs. (2) Parallel attention forcing makes training parallel, and is applicable to Transformer-based models. The experiments show that the proposed approaches improve the performance of models based on RNNs and Transformers.
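Conformer has achieved impressive results in automatic speech recognition (ASR) by combining the Transformer's ability to capture content-based global interactions with the convolutional neural network's ability to exploit local features. In Conformer, two macaron-like feed-forward layers with half-step residual connections sandwich the multi-head self-attention and convolution modules, followed by a post layer normalization. We improve Conformer's long-sequence representation ability in two directions, sparser and deeper. We adapt a sparse self-attention mechanism with $\mathcal{O}(L \log L)$ time complexity and memory usage. A deep normalization strategy is used when performing residual connections to ensure that we can train Conformer stacks with on the order of a hundred blocks. On the Japanese CSJ-500h dataset, this deep sparse Conformer achieves error rates of 5.52%, 4.03%, and 4.50% on the three evaluation sets, and 4.16%, 2.84%, and 3.20% when ensembling five deep sparse Conformer variants with 12, 16, 17, 50, and finally 100 encoder layers.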
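State-of-the-art encoder-decoder models (e.g., for machine translation (MT) or speech recognition (ASR)) are constructed and trained end-to-end as an atomic unit. No component of the model can be (re-)used without the others. We describe LegoNN, a procedure for building encoder-decoder architectures with decoder modules that can be reused across various MT and ASR tasks without the need for any fine-tuning. To achieve reusability, the interface between each encoder and decoder module is grounded to a sequence of marginal distributions over a discrete vocabulary pre-defined by the model designer. We present two approaches for ingesting these marginals: one is differentiable, allowing gradients to flow across the entire network, and the other is gradient-isolating. To enable portability of decoder modules between MT tasks with different source languages and across other tasks such as ASR, we introduce a modality-agnostic encoder that includes a length control mechanism to dynamically adapt the encoder's output length to match the expected input length range of pre-trained decoders. We present several experiments to demonstrate the effectiveness of LegoNN models: a trained language generation decoder module from the German-English (De-En) MT task can be reused without fine-tuning for the Europarl English ASR and the Romanian-English (Ro-En) MT tasks to match or beat the respective baseline models. When fine-tuned towards the target task for a few thousand updates, our LegoNN models improve the Ro-En MT task by 1.5 BLEU points and achieve a 12.5% relative WER reduction on the Europarl ASR task. Furthermore, to show its extensibility, we compose a LegoNN ASR model from three modules, each learned within a different end-to-end trained model on three different datasets, boosting the WER reduction to 19.5%.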
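We present a streaming end-to-end automatic speech recognition (ASR) architecture that achieves efficient neural inference through compute cost amortization. Our architecture creates sparse computation pathways dynamically at inference time, resulting in selective use of compute resources throughout decoding and enabling significant reductions in compute with minimal impact on accuracy. The fully differentiable architecture is trained end-to-end with an accompanying lightweight arbitrator mechanism that operates at the frame level to make dynamic decisions on each input, while a tunable loss function is used to regularize the overall level of compute against predictive performance. We report empirical results from experiments with the compute-amortized Transformer-Transducer (T-T) model conducted on LibriSpeech data. Our best model achieves a 60% compute cost reduction with only a 3% relative increase in word error rate (WER).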
Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving force behind many recent state-of-the-art natural language processing (NLP) models such as Transformer-based MT models and BERT. These models apply multiple attention mechanisms in parallel, with each attention "head" potentially focusing on different parts of the input, which makes it possible to express sophisticated functions beyond the simple weighted average. In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. In fact, some layers can even be reduced to a single head. We further examine greedy algorithms for pruning down models, and the potential speed, memory efficiency, and accuracy improvements obtainable therefrom. Finally, we analyze the results with respect to which parts of the model are more reliant on having multiple heads, and provide precursory evidence that training dynamics play a role in the gains provided by multi-head attention. Code to replicate our experiments is provided at https://github.com/pmichel31415/are-16-heads-really-better-than-1
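Removing a head at test time can be simulated by zeroing that head's output before the heads are combined. The sketch below illustrates this masking idea for a single position. The function names are hypothetical and this is not the code from the repository linked above.

```python
def mask_heads(head_outputs, keep):
    """Zero out the outputs of pruned heads before they are combined.

    head_outputs: list of H heads, each a list of d floats for one position.
    keep: list of H booleans; False marks a head as pruned.
    """
    return [[x if k else 0.0 for x in head] for head, k in zip(head_outputs, keep)]

def combine(head_outputs):
    """Concatenate head outputs (the output projection is omitted here)."""
    return [x for head in head_outputs for x in head]

# Keep the first head, prune the second.
heads = [[1.0, 2.0], [3.0, 4.0]]
print(combine(mask_heads(heads, [True, False])))
```

Comparing model quality with and without each mask entry set to False gives a simple estimate of how much a given head matters.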
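One of the important breakthroughs in the history of machine translation is the development of the Transformer model. It has been revolutionary not only for various translation tasks but also for the majority of other NLP tasks. In this paper, we build a Transformer-based system that translates a source sentence in German into its counterpart target sentence in English. We perform experiments on the news commentary German-English parallel sentences from the WMT'13 dataset. In addition, we investigate the effect of including additional general-domain data from the IWSLT'16 dataset in training to improve the Transformer model's performance. We find that including the IWSLT'16 dataset in training yields a gain of 2 BLEU points on the test set of the WMT'13 dataset. A qualitative analysis is introduced to examine how the use of general-domain data helps improve the quality of the produced translations.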
Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads in the encoder to the overall performance of the model and analyze the roles played by them. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the $L_0$ penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU.
Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis. Our code is publicly available.
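I train models for the task of neural machine translation (NMT) using the Hunglish2 corpus. The main contribution of this work is the evaluation of different data augmentation methods during the training of NMT models. I propose 5 different augmentation methods that are structure-aware, meaning that instead of randomly selecting words for blanking or replacement, the dependency tree of each sentence is used as the basis for augmentation. I begin with a detailed literature review on neural networks, sequential modeling, neural machine translation, dependency parsing, and data augmentation. After a detailed exploratory data analysis and preprocessing of the Hunglish2 corpus, I perform experiments with the proposed data augmentation techniques. The best model for Hungarian to English achieves a BLEU score of 33.9, while the best model for English to Hungarian achieves a BLEU score of 28.6.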
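Transformers struggle when attending to long contexts, since the amount of computation grows with the context length, and therefore they cannot model long-term memories effectively. Several variations have been proposed to alleviate this problem, but they all have a finite memory capacity and are forced to drop old information. In this paper, we propose the $\infty$-former, which extends the vanilla transformer with an unbounded long-term memory. By using a continuous-space attention mechanism to attend over the long-term memory, the $\infty$-former's attention complexity becomes independent of the context length. Thus, it is able to model arbitrarily long contexts and maintain "sticky memories" while keeping a fixed computation budget. Experiments on a synthetic sorting task demonstrate the ability of the $\infty$-former to retain information from long sequences. We also perform experiments on language modeling, by training a model from scratch and by fine-tuning a pre-trained language model, which show the benefits of unbounded long-term memories.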
End-to-End automatic speech recognition (ASR) models aim to learn a generalised speech representation to perform recognition. In this domain there is little research to analyse internal representation dependencies and their relationship to modelling approaches. This paper investigates cross-domain language model dependencies within transformer architectures using SVCCA and uses these insights to exploit modelling approaches. It was found that specific neural representations within the transformer layers exhibit correlated behaviour which impacts recognition performance. Altogether, this work provides analysis of the modelling approaches affecting contextual dependencies and ASR performance, and can be used to create or adapt better performing End-to-End ASR models and also for downstream tasks.