Teacher-student (T/S) learning has been shown to be effective for a variety of problems such as domain adaptation and model compression. One shortcoming of T/S learning is that the teacher model is not always perfect: it sporadically produces wrong guidance in the form of posterior probabilities that misleads the student model towards suboptimal performance. To overcome this problem, we propose a conditional T/S learning scheme, in which a "smart" student model selectively chooses to learn from either the teacher model or the ground-truth labels, conditioned on whether the teacher can correctly predict the ground truth. Unlike a naive linear combination of the two knowledge sources, the conditional learning engages exclusively with the teacher model when the teacher's prediction is correct, and otherwise backs off to the ground truth. The student model is thus able to learn effectively from the teacher and even has the potential to outperform it. We examine the proposed learning scheme on two tasks: domain adaptation on the CHiME-3 dataset and speaker adaptation on a Microsoft short message dictation dataset. The proposed method achieves 9.8% and 12.8% relative word error rate reductions over T/S learning for environment adaptation and over a speaker-independent model for speaker adaptation, respectively.
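A minimal PyTorch sketch of the selective loss this abstract describes, assuming frame-level senone posteriors; all tensor shapes and names are illustrative, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def conditional_ts_loss(student_logits, teacher_logits, labels):
    """Conditional T/S loss (sketch): per frame, imitate the teacher's
    posteriors when its prediction matches the ground truth, otherwise
    fall back to cross-entropy against the hard label.
    student_logits, teacher_logits: (batch, frames, senones); labels: (batch, frames)."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)

    # Frames where the teacher predicts the ground-truth label correctly.
    teacher_correct = teacher_logits.argmax(dim=-1).eq(labels)  # (batch, frames)

    # Cross-entropy against the teacher posterior per frame.
    kd_loss = -(p_teacher * log_p_student).sum(dim=-1)
    # Standard cross-entropy against the one-hot ground truth per frame.
    ce_loss = F.nll_loss(log_p_student.transpose(1, 2), labels, reduction="none")

    # Hard switch between the two knowledge sources, not a linear blend.
    return torch.where(teacher_correct, kd_loss, ce_loss).mean()
```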
Teacher-student (T/S) learning has been shown to be effective in unsupervised domain adaptation [1]. It is a form of transfer learning, not in terms of the transfer of recognition decisions, but of the knowledge of posterior probabilities in the source domain as evaluated by the teacher model. It learns to handle the speaker and environment variability inherent in, and restricted to, the speech signal in the target domain without proactively addressing robustness to other likely conditions; performance degradation may thus ensue. In this work, we advance T/S learning by proposing adversarial T/S learning to explicitly achieve condition-robust unsupervised domain adaptation. In this method, a student acoustic model and a condition classifier are jointly optimized to minimize the Kullback-Leibler divergence between the output distributions of the teacher and student models and, simultaneously, to min-maximize the condition classification loss. Through this procedure, a condition-invariant deep feature is learned in the adapted student model. We further propose multi-factorial adversarial T/S learning, which suppresses condition variabilities caused by multiple factors simultaneously. Evaluated on the noisy CHiME-3 test set, the proposed methods achieve relative word error rate improvements of 44.60% and 5.38%, respectively, over a clean source model and a strong T/S learning baseline model.
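For concreteness, one hedged way to write the joint objective described above, using illustrative symbols ($\theta_f$: student feature extractor, $\theta_y$: senone classifier, $\theta_c$: condition classifier, $\lambda$: trade-off weight):

```latex
(\hat{\theta}_f, \hat{\theta}_y)
  = \operatorname*{arg\,min}_{\theta_f, \theta_y}
    \mathcal{L}_{\mathrm{KL}}(\theta_f, \theta_y)
    - \lambda \, \mathcal{L}_{\mathrm{cond}}(\theta_f, \hat{\theta}_c),
\qquad
\hat{\theta}_c
  = \operatorname*{arg\,min}_{\theta_c}
    \mathcal{L}_{\mathrm{cond}}(\hat{\theta}_f, \theta_c)
```

where $\mathcal{L}_{\mathrm{KL}}$ is the teacher-student Kullback-Leibler divergence and $\mathcal{L}_{\mathrm{cond}}$ the condition classification loss; the sign flip on $\mathcal{L}_{\mathrm{cond}}$ in the first problem is what drives the student's deep features toward condition invariance.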
This is a report on the lessons we learned building acoustic models from one million hours of unlabeled speech, with labeled speech restricted to 7,000 hours. We employ student/teacher training on the unlabeled data, which helps scale out target generation compared with confidence-model-based approaches, which require a decoder and a confidence model. To optimize storage and parallelize target generation, we store only high-valued logits from the teacher model. Introducing the notion of scheduled learning, we interleave learning on the unlabeled and labeled data. To scale distributed training across a large number of GPUs, we use BMUF with 64 GPUs, while performing sequence training only on the labeled data with gradient-threshold-compression SGD using 16 GPUs. Our experiments show that extremely large amounts of data are indeed useful: with little hyper-parameter tuning, we obtain relative WER improvements in the 10% to 20% range, with higher gains in noisier conditions.
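As a rough illustration of the storage-side idea, a NumPy sketch of keeping only high-valued teacher logits and rebuilding soft targets later; the value of k, the float16 storage type, and the renormalization are assumptions, not details from the report:

```python
import numpy as np

def compress_teacher_targets(logits, k=20):
    """Keep only the k highest-valued teacher logits per frame for storage.
    logits: (frames, senones) teacher outputs for one utterance."""
    top_idx = np.argsort(logits, axis=-1)[..., -k:]          # (frames, k)
    top_val = np.take_along_axis(logits, top_idx, axis=-1)   # (frames, k)
    return top_idx.astype(np.int32), top_val.astype(np.float16)

def expand_targets(top_idx, top_val, num_senones):
    """Rebuild a posterior target from the stored top-k logits; probability
    mass outside the top-k is simply dropped in this sketch."""
    logits = np.full(top_idx.shape[:-1] + (num_senones,), -np.inf, dtype=np.float32)
    np.put_along_axis(logits, top_idx, top_val.astype(np.float32), axis=-1)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```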
End-to-end acoustic models, such as connectionist temporal classification (CTC) and the attention model, have been studied, and their speech recognition accuracies come close to those of conventional deep neural network (DNN)-hidden Markov models. However, most high-performance end-to-end models are not suitable for real-time (streaming) speech recognition because they are based on bidirectional recurrent neural networks (RNNs). In this study, to improve the performance of unidirectional RNN-based CTC, which is suitable for real-time processing, we investigate knowledge distillation (KD)-based model compression methods for training a CTC acoustic model. We evaluate a frame-level KD method and a sequence-level KD method for the CTC model. Speech recognition experiments on Wall Street Journal tasks demonstrate that the frame-level KD worsens the WERs of the unidirectional CTC model, whereas the sequence-level KD improves them.
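A minimal sketch of the sequence-level KD variant, assuming PyTorch: hypotheses decoded by the bidirectional teacher serve as pseudo-transcripts for the unidirectional student under the ordinary CTC loss. Shapes and the blank index are illustrative:

```python
import torch.nn.functional as F

def sequence_kd_ctc_loss(student_log_probs, teacher_hyps, input_lens, hyp_lens):
    """Sequence-level KD for CTC (sketch): train the student with the CTC
    loss against teacher-decoded hypotheses instead of the references.
    student_log_probs: (T, batch, vocab) log-softmax outputs;
    teacher_hyps: (batch, max_len) teacher-decoded label sequences."""
    return F.ctc_loss(student_log_probs, teacher_hyps,
                      input_lens, hyp_lens, blank=0)
```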
Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state-of-the-art model-based approach, which applies a single neural network to solve this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into three sub-tasks: frame-wise interpreting, utterance-level speaker tracing, and speech recognition. The pretraining regimen uses these modules to solve progressively harder tasks. Transfer learning leverages parallel clean speech to improve the training targets for the network. Our discriminative training formulation is a modification of standard formulations that also penalizes competing outputs of the system. Experiments are conducted on the artificial overlapped Switchboard and hub5e-swb dataset. The proposed framework achieves over 30% relative improvement of WER over both a strong jointly trained system, PIT for ASR, and a separately optimized system, PIT for speech separation with a clean-speech ASR model. The improvement comes from better model generalization, training efficiency, and sequence-level linguistic knowledge integration.
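For concreteness, a hedged PyTorch sketch of the core PIT idea only (not of the full modular system described above): evaluate every assignment of output streams to reference streams and train, per utterance, on the best-scoring permutation.

```python
import itertools
import torch

def pit_loss(estimates, references, pairwise_loss):
    """Permutation invariant training (sketch). estimates, references:
    lists of tensors, one per speaker; pairwise_loss(est, ref) returns
    (batch,) per-utterance losses for one output/reference pairing."""
    best = None
    for perm in itertools.permutations(range(len(estimates))):
        # Average stream loss under this output-to-reference assignment.
        loss = sum(pairwise_loss(estimates[i], references[p])
                   for i, p in enumerate(perm)) / len(estimates)
        # Keep the best permutation independently for each utterance.
        best = loss if best is None else torch.minimum(best, loss)
    return best.mean()
```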
This paper explores the use of adversarial examples in training speech recognition systems to improve the robustness of deep neural network acoustic models. During training, the fast gradient sign method is used to generate adversarial examples that augment the original training data. Unlike conventional data augmentation based on data transformations, these examples are generated dynamically based on the current acoustic model parameters. We evaluate the impact of adversarial data augmentation in experiments on the Aurora-4 and CHiME-4 single-channel tasks, showing improved robustness against noise and channel variation. Further improvements are obtained when combining adversarial examples with teacher/student training, leading to a 23% relative word error rate reduction on Aurora-4.
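A minimal PyTorch sketch of the FGSM augmentation step described above; the `epsilon` value and the interfaces are illustrative assumptions:

```python
import torch

def fgsm_examples(model, features, labels, loss_fn, epsilon=0.3):
    """Fast gradient sign method (sketch): perturb each training feature in
    the direction that increases the current model's loss, so the augmented
    data tracks the evolving acoustic model parameters."""
    features = features.clone().detach().requires_grad_(True)
    loss = loss_fn(model(features), labels)
    grad, = torch.autograd.grad(loss, features)  # gradient w.r.t. the input only
    return (features + epsilon * grad.sign()).detach()
```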
This paper addresses the robust speech recognition problem as a domain adaptation task. Specifically, we introduce an unsupervised deep domain adaptation (DDA) approach to acoustic modeling in order to eliminate the training-testing mismatch that is common in real-world use of speech recognition. Under a multi-task learning framework, the approach jointly learns two discriminative classifiers using one deep neural network (DNN). As the main task, a label predictor predicts phoneme labels and is used during training and at test time. As the second task, a domain classifier discriminates between the source and the target domains during training. The network is optimized by minimizing the loss of the label classifier while simultaneously maximizing the loss of the domain classifier. The proposed approach is easy to implement by modifying a common feed-forward network. Moreover, this unsupervised approach only needs labeled training data from the source domain and some unlabeled raw data from the new domain. Speech recognition experiments on noise/channel distortion and domain shift confirm the effectiveness of the proposed approach. For instance, on the Aurora-4 corpus, compared with the acoustic model trained only on clean data, the DDA approach achieves a relative 37.8% word error rate (WER) reduction.
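Such min-max optimization is commonly implemented with a gradient reversal layer; a hedged PyTorch sketch (the paper may realize the objective differently):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal (sketch): identity in the forward pass; scaled,
    sign-flipped gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```

In training, the shared features feed the label predictor directly and the domain classifier through `grad_reverse`, so a single backward pass minimizes the label loss while maximizing the domain loss with respect to the shared layers.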
Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training; otherwise, the network may fail to reach a good local optimum. This is particularly true for online networks, such as unidirectional LSTMs. Currently, the best strategy for training such systems is to bootstrap training from a tied-triphone system. However, this is time consuming and, more importantly, impossible for languages without a high-quality pronunciation lexicon. In this work, we propose an initialization strategy that uses teacher-student learning to transfer knowledge from a large, well-trained offline end-to-end speech recognition model to an online end-to-end model, eliminating the need for a lexicon or any other linguistic resources. We also explore curriculum learning and label smoothing, and show how they can be combined with the proposed teacher-student learning for further improvements. We evaluate our methods on a Microsoft Cortana personal assistant task and show that the proposed method yields a 19% relative improvement in word error rate over a randomly initialized baseline system.
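Of the ingredients above, label smoothing is easy to make concrete; a hedged PyTorch sketch that mixes the one-hot target with a uniform distribution (the 0.1 weight is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, labels, smoothing=0.1):
    """Label smoothing (sketch): interpolate the one-hot target with a
    uniform distribution over the vocabulary so the model is not pushed
    toward overconfident output distributions."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)  # cross-entropy against uniform targets
    return ((1.0 - smoothing) * nll + smoothing * uniform).mean()
```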
We investigate the feasibility of sequence-level knowledge distillation of sequence-to-sequence (Seq2Seq) models for large vocabulary continuous speech recognition (LVCSR). We first use a pre-trained, larger teacher model to generate multiple hypotheses per utterance via beam search. Using the same inputs, we then train the student model on these teacher-generated hypotheses as pseudo labels, in place of the original ground-truth labels. We evaluate our proposed method on the Wall Street Journal (WSJ) corpus. It achieves a parameter reduction of more than 9.8×, with an accuracy loss of up to a 7.0% increase in word error rate (WER).
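A hedged sketch of the pseudo-label generation step, assuming a PyTorch teacher with an assumed (not library-provided) `beam_search` method:

```python
import torch

def make_pseudo_labels(teacher, dataset, beam_size=4):
    """Sequence-level KD data prep (sketch): run beam search with the
    trained teacher and keep the n-best hypotheses per utterance as
    pseudo-transcripts for student training."""
    pseudo = []
    with torch.no_grad():
        for features, _ground_truth in dataset:
            hyps = teacher.beam_search(features, beam_size=beam_size)
            # Each (features, hypothesis) pair becomes one training example.
            pseudo.extend((features, h) for h in hyps)
    return pseudo
```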
Large-scale machine learning (ML) systems, such as the Alexa automatic speech recognition (ASR) system, keep improving with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we use semi-supervised learning (SSL) to learn acoustic models (AMs) from vast amounts of untranscribed audio data. Learning an AM from one million hours of audio presents unique ML and system design challenges. We present the design and evaluation of a highly scalable and resource-efficient SSL system for AMs. Employing the student/teacher learning paradigm, we focus on the student learning subsystem: a scalable and robust data pipeline that generates features and targets from raw audio, and an efficient model pipeline, including a distributed trainer, that builds the student model. Our evaluations show that, even without extensive hyper-parameter tuning, we obtain relative accuracy improvements in the 10% to 20% range, with higher gains in noisier conditions. The end-to-end processing time of this SSL system is 12 days, and several of its components scale linearly with more compute resources.
Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) subsume the acoustic, pronunciation, and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear whether such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a multi-head attention architecture, which offers improvements over the commonly used single-head attention. On the optimization side, we explore synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, all of which are shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500-hour voice search task, we find that the proposed changes improve the WER from 9.2% to 5.6%, while the best conventional system achieves 6.7%; on a dictation task, our model achieves a WER of 4.1% compared to 5% for the conventional system.
This work presents a broad study on the adaptation of neural network acoustic models by means of learning hidden unit contributions (LHUC), a method that linearly re-combines hidden units in a speaker- or environment-dependent manner using small amounts of unsupervised adaptation data. We also extend LHUC to a speaker adaptive training (SAT) framework, which leads to a more adaptable DNN acoustic model that works in both a speaker-dependent and a speaker-independent manner, without the need to maintain auxiliary speaker-dependent feature extractors or to introduce significant speaker-dependent changes to the DNN structure. Through a series of experiments on four different speech recognition benchmarks (TED talks, Switchboard, AMI meetings, and Aurora-4) comprising 270 test speakers, we show that LHUC, in both its test-only and SAT variants, results in consistent word error rate reductions ranging from 5% to 23% relative, depending on the task and the degree of mismatch between training and test data. In addition, we investigate the effect of the amount of adaptation data per speaker, the quality of unsupervised adaptation targets, the complementarity to other adaptation techniques, one-shot adaptation, and an extension to adapting DNNs trained in a sequence-discriminative manner.
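A minimal PyTorch sketch of an LHUC layer; the 2·sigmoid amplitude parameterization follows the common LHUC formulation, while the module interface is an illustrative assumption:

```python
import torch

class LHUCLayer(torch.nn.Module):
    """Learning Hidden Unit Contributions (sketch): per-speaker amplitudes
    re-scale each hidden unit of a pre-trained layer. Only `r` is updated
    during adaptation; 2*sigmoid bounds each amplitude to (0, 2)."""

    def __init__(self, num_units):
        super().__init__()
        self.r = torch.nn.Parameter(torch.zeros(num_units))  # one vector per speaker

    def forward(self, hidden):
        return hidden * 2.0 * torch.sigmoid(self.r)
```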
Developing a practical speech recognizer for a low-resource language is challenging, not only because of the (possibly unknown) properties of the language, but also because the test data may not come from the same domain as the available training data. In this paper, we focus on the latter challenge, i.e., domain mismatch, for systems trained with a sequence-based criterion. We demonstrate the effectiveness of using a pre-trained English recognizer, which is robust to such mismatched conditions, as a domain-normalizing feature extractor for a low-resource language. In our example, we use Turkish conversational speech and broadcast news data. This enables rapid development of speech recognizers that can easily adapt to any domain. Across a variety of cross-domain scenarios, we achieve relative improvements of around 25% in phoneme error rate, with improvements of around 50% for some domains.
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used 'naïvely' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furthermore, simple, feature-level fusion based extensions to the framework are proposed to improve the integration with the ASR back-end. These yield a best result of 13.76% average word error rate, which is, to our knowledge, the best score to date.
The environmental robustness of DNN-based acoustic models can be significantly improved by using multi-condition training data. However, as data collection is a costly proposition, simulation of the desired conditions is a frequently adopted strategy. In this paper we detail a data augmentation approach for far-field ASR. We examine the impact of using simulated room impulse responses (RIRs), as real RIRs can be difficult to acquire, and also the effect of adding point-source noises. We find that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added. Further we show that the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario. We evaluate our approach on several LVCSR tasks which can adequately represent both scenarios.
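A hedged NumPy/SciPy sketch of the augmentation recipe: convolve clean speech with a (real or simulated) RIR, then mix in a point-source noise sample at a target SNR. The SNR value and the simple power-based scaling are assumptions, not the paper's exact pipeline:

```python
import numpy as np
import scipy.signal

def augment_far_field(speech, rir, noise, snr_db=15.0):
    """Far-field data augmentation (sketch): reverberate clean speech with a
    room impulse response, then add point-source noise scaled to snr_db.
    Assumes `noise` is at least as long as `speech`."""
    reverberant = scipy.signal.fftconvolve(speech, rir)[: len(speech)]
    noise = noise[: len(reverberant)]
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Solve snr_db = 10*log10(speech_power / (scale**2 * noise_power)).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return reverberant + scale * noise
```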
In this paper, we investigate the use of adversarial learning for unsupervised adaptation to unseen recording conditions, more specifically single-microphone far-field speech. We adapt neural-network-based acoustic models trained on close-talk clean speech to new recording conditions using untranscribed adaptation data. Our experimental results on the Italian SPEECON dataset show that the proposed method achieves a 19.8% relative word error rate (WER) reduction compared to the unadapted model. Moreover, this adaptation method is beneficial even when performed on data from another language (namely French), giving a 12.6% relative WER reduction.
Recurrent neural networks (RNNs) have dominated language modeling because of their superior performance over traditional N-gram-based models. In many applications, a large recurrent neural network language model (RNNLM), or an ensemble of several RNNLMs, is used. Such models have a large memory footprint and require heavy computation. In this paper, we examine the effect of applying knowledge distillation to reduce the model size of RNNLMs. In addition, we propose a trust regularization method to improve knowledge distillation training for RNNLMs. Using knowledge distillation with trust regularization, we reduce the parameter size to a third of that of the previously published best model while maintaining state-of-the-art perplexity on Penn Treebank data. In a speech recognition N-best rescoring task, we reduce the RNNLM model size to 18.5% of that of the baseline system, with no degradation in word error rate (WER) on the Wall Street Journal dataset.
Current state-of-the-art automatic speech recognition systems are trained within a specific "domain", defined by factors such as application, sampling rate, and codec. When such a recognizer is used in conditions that do not match the training domain, performance drops significantly. This work explores the idea of building a single domain-invariant model for varied use cases by combining large-scale training data from multiple application domains. Our final system is trained on 162,000 hours of speech. In addition, each utterance is artificially distorted during training to simulate effects such as background noise, codec distortion, and varying sampling rates. Our results show that, even at this scale, a model trained in this way performs almost as well as models fine-tuned to specific subsets: a single model can be robust to multiple application domains and to variations such as codecs and noise. More importantly, such models generalize better to unseen conditions and allow for rapid adaptation: we show that, using as little as 10 hours of data from a new domain, an adapted domain-invariant model can match the performance of a domain-specific model trained from scratch on 70 times as much data. We also highlight some limitations of such models, and areas that need to be addressed in future work.
Conventional automatic speech recognition (ASR) systems trained from frame-level alignments can easily leverage posterior fusion to improve ASR accuracy, and can build a better single model through knowledge distillation. End-to-end ASR systems trained with the connectionist temporal classification (CTC) loss do not require frame-level alignments, which simplifies model training. However, the sparse and arbitrary posterior spike timings of CTC models pose a new set of challenges for posterior fusion across multiple models and for knowledge distillation between CTC models. We propose a method to train a CTC model such that its spike timings are guided to align with those of a pre-trained guiding CTC model; as a result, all models that share the same guiding model have aligned spike timings. We demonstrate the advantage of our method in a variety of scenarios, including posterior fusion of CTC models and knowledge distillation between CTC models with different architectures. With 300 hours of Switchboard training data, a single word-based CTC model distilled from multiple models improves the word error rates on the Hub5 2000 Switchboard/CallHome test sets from 14.9%/24.1% to 13.7%/23.1%, without using any data augmentation, language model, or complex decoder.
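One plausible way to realize the guiding idea, written as a hedged PyTorch sketch: suppress the student's non-blank posteriors at frames where the guiding model emits blank, before applying the CTC loss. The masking scheme here is an assumption, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def guided_ctc_loss(student_logits, guide_logits, targets,
                    input_lens, target_lens, blank=0):
    """Guided CTC training (sketch): where the pre-trained guiding model
    emits blank, damp the student's non-blank posteriors so the student's
    spikes are nudged toward the guide's timings.
    student_logits, guide_logits: (T, batch, vocab)."""
    log_probs = F.log_softmax(student_logits, dim=-1)
    guide_blank = guide_logits.argmax(dim=-1).eq(blank)          # (T, batch)
    non_blank = torch.arange(log_probs.size(-1),
                             device=log_probs.device).ne(blank)  # (vocab,)
    mask = guide_blank.unsqueeze(-1) & non_blank                 # (T, batch, vocab)
    masked = log_probs.masked_fill(mask, -1e4)
    masked = masked - masked.logsumexp(dim=-1, keepdim=True)     # renormalize
    return F.ctc_loss(masked, targets, input_lens, target_lens, blank=blank)
```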