This is a report of our lessons learned from building acoustic models on 1 million hours of unlabeled speech, while the labeled speech is limited to 7,000 hours. We apply student/teacher training on the unlabeled data, which helps scale target generation compared to confidence-model-based methods, which require a decoder and a confidence model. To optimize storage and to parallelize target generation, we store only the high-valued logits from the teacher model. We introduce the notion of scheduled learning, in which we interleave learning on unlabeled and labeled data. To scale distributed training across a large number of GPUs, we use BMUF with 64 GPUs, while performing sequence training on labeled data with gradient-threshold-compressed SGD using only 16 GPUs. Our experiments show that extremely large amounts of data are indeed useful: with little hyper-parameter tuning, we obtain relative WER improvements in the 10% to 20% range, with higher gains in noisier conditions.
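The two engineering tricks highlighted here — keeping only the teacher's high-valued logits and distilling the student from them — can be sketched roughly as follows (a minimal PyTorch sketch under assumed tensor shapes; the truncation width `k` is illustrative, not the paper's value):

```python
import torch
import torch.nn.functional as F

def store_topk_logits(teacher_logits, k=20):
    """Keep only the k largest logits per frame to save storage.

    teacher_logits: (T, C) tensor of per-frame senone logits.
    Returns (values, indices), each of shape (T, k).
    """
    values, indices = teacher_logits.topk(k, dim=-1)
    return values, indices

def sparse_kd_loss(student_logits, topk_values, topk_indices):
    """Cross-entropy of the student against the teacher's renormalized
    top-k posterior (a common approximation when the full distribution
    is not stored)."""
    teacher_probs = F.softmax(topk_values, dim=-1)          # (T, k)
    student_logp = F.log_softmax(student_logits, dim=-1)    # (T, C)
    student_topk = student_logp.gather(-1, topk_indices)    # (T, k)
    return -(teacher_probs * student_topk).sum(-1).mean()
```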
This paper explores the use of adversarial examples in training speech recognition systems to improve the robustness of deep neural network acoustic models. During training, the fast gradient sign method is used to generate adversarial examples that augment the original training data. Unlike conventional data augmentation based on data transformations, these examples are generated dynamically from the current acoustic model parameters. We evaluate the impact of adversarial data augmentation in experiments on the Aurora-4 and CHiME-4 single-channel tasks, showing improved robustness against noise and channel variation. Further improvements are obtained when combining adversarial examples with teacher/student training, leading to a 23% relative word error rate reduction on Aurora-4.
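For concreteness, a minimal PyTorch sketch of FGSM-based augmentation as described (the `epsilon` value and tensor shapes are illustrative assumptions, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def fgsm_augment(model, feats, labels, epsilon=0.3):
    """Generate adversarial acoustic features with the fast gradient
    sign method, using the *current* model parameters so the examples
    evolve as training proceeds. feats: (B, T, D); labels: (B, T)."""
    feats = feats.clone().detach().requires_grad_(True)
    logits = model(feats)                                 # (B, T, C) assumed
    loss = F.cross_entropy(logits.transpose(1, 2), labels)
    grad, = torch.autograd.grad(loss, feats)
    # Step each feature dimension in the direction that increases the loss.
    return (feats + epsilon * grad.sign()).detach()
```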
Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state-of-the-art model-based approach, which applies a single neural network to solve this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into three sub-tasks: frame-wise interpreting, utterance-level speaker tracing, and speech recognition. The pretraining regimen uses these modules to solve progressively harder tasks. Transfer learning leverages parallel clean speech to improve the training targets for the network. Our discriminative training formulation is a modification of standard formulations that also penalizes competing outputs of the system. Experiments are conducted on the artificially overlapped Switchboard and hub5e-swb dataset. The proposed framework achieves over 30% relative improvement in WER over both a strong jointly trained system, PIT for ASR, and a separately optimized system, PIT for speech separation with a clean-speech ASR model. The improvement comes from better model generalization, training efficiency, and sequence-level linguistic knowledge integration.
This paper addresses the robust speech recognition problem as a domain adaptation task. Specifically, we introduce an unsupervised deep domain adaptation (DDA) approach to acoustic modeling in order to eliminate the training-testing mismatch that is common in real-world use of speech recognition. Under a multi-task learning framework, the approach jointly learns two discriminative classifiers using one deep neural network (DNN). As the main task, a label predictor predicts phoneme labels and is used during training and at test time. As the second task, a domain classifier discriminates between the source and the target domains during training. The network is optimized by minimizing the loss of the label classifier while simultaneously maximizing the loss of the domain classifier. The proposed approach is easy to implement by modifying a common feed-forward network. Moreover, this unsupervised approach only needs labeled training data from the source domain and some unlabeled raw data of the new domain. Speech recognition experiments on noise/channel distortion and domain shift confirm the effectiveness of the proposed approach. For instance, on the Aurora-4 corpus, compared with the acoustic model trained only on clean data, the DDA approach achieves a 37.8% relative word error rate (WER) reduction.
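This minimize/maximize optimization is commonly implemented with a gradient reversal layer; a minimal PyTorch sketch of that mechanism (the layer name and `lambd` weighting are conventional choices, not taken from this paper):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the
    backward pass, so the shared feature extractor is trained to
    *maximize* the domain classifier's loss."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```

During training, the shared hidden features pass through `grad_reverse` before the domain classifier, so ordinary gradient descent on the domain loss pushes the feature extractor toward domain-invariant representations.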
We investigate the feasibility of sequence-level knowledge distillation of sequence-to-sequence (Seq2Seq) models for large vocabulary continuous speech recognition (LVCSR). We first use a pre-trained larger teacher model to generate multiple hypotheses per utterance with beam search. Using the same inputs, we then train the student model on these teacher-generated hypotheses as pseudo-labels in place of the original ground-truth labels. We evaluate our proposed approach on the Wall Street Journal (WSJ) corpus. It achieves up to a 9.8x reduction in parameters at the cost of up to a 7.0% increase in word error rate (WER).
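A rough sketch of the pseudo-label generation step, assuming a hypothetical `teacher.beam_search` interface (the actual decoding API and the number of retained hypotheses are assumptions):

```python
def distill_dataset(teacher, dataset, beam_size=10):
    """Replace ground-truth transcripts with teacher beam-search
    hypotheses; each utterance may contribute several pseudo-labeled
    training pairs for the student."""
    pseudo = []
    for feats, _ in dataset:
        hyps = teacher.beam_search(feats, beam_size=beam_size)  # assumed API
        for hyp in hyps:
            pseudo.append((feats, hyp.tokens))
    return pseudo
```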
Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training. Otherwise, the networks may fail to find a good local optimum. This is particularly true for online networks, such as unidirectional LSTMs. Currently, the best strategy to train such systems is to bootstrap the training from a tied-triphone system. However, this is time consuming and, more importantly, is impossible for languages without a high-quality pronunciation lexicon. In this work, we propose an initialization strategy that uses teacher-student learning to transfer knowledge from a large, well-trained, offline end-to-end speech recognition model to an online end-to-end model, eliminating the need for a lexicon or any other linguistic resources. We also explore curriculum learning and label smoothing and show how they can be combined with the proposed teacher-student learning for further improvements. We evaluate our methods on a Microsoft Cortana personal assistant task and show that the proposed method results in a 19% relative improvement in word error rate compared to a randomly-initialized baseline system.
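Of the ingredients above, label smoothing is the easiest to make concrete; a minimal PyTorch sketch using a uniform smoothing distribution (the smoothing weight is illustrative):

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, smoothing=0.1):
    """Cross-entropy with uniform label smoothing: the target
    distribution puts 1 - smoothing on the reference token and
    spreads the remainder uniformly over the vocabulary."""
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    uniform = -logp.mean(dim=-1)  # equals (1/K) * sum of -log p
    return ((1.0 - smoothing) * nll + smoothing * uniform).mean()
```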
End-to-end acoustic models, such as connectionist temporal classification (CTC) and the attention model, have been studied, and their speech recognition accuracies come close to those of conventional deep neural network (DNN)-hidden Markov models. However, most high-performance end-to-end models are not suitable for real-time (streaming) speech recognition because they are based on bidirectional recurrent neural networks (RNNs). In this study, to improve the performance of unidirectional RNN-based CTC, which is suitable for real-time processing, we investigate knowledge distillation (KD)-based model compression for training a CTC acoustic model. We evaluate a frame-level KD method and a sequence-level KD method for the CTC model. Speech recognition experiments on Wall Street Journal tasks demonstrate that frame-level KD worsens the WERs of the unidirectional CTC model, whereas sequence-level KD improves them.
Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a multi-head attention architecture, which offers improvements over the commonly-used single-head attention. On the optimization side, we explore synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500 hour voice search task, we find that the proposed changes improve the WER from 9.2% to 5.6%, while the best conventional system achieves 6.7%; on a dictation task our model achieves a WER of 4.1% compared to 5% for the conventional system.
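The multi-head attention component can be illustrated with PyTorch's built-in module (the dimensions below are illustrative, not the paper's):

```python
import torch

# A single decoder step attending over encoder outputs with 4 heads.
attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
enc_out = torch.randn(8, 200, 256)   # (batch, encoder frames, dim)
dec_state = torch.randn(8, 1, 256)   # one decoder step as the query
context, weights = attn(dec_state, enc_out, enc_out)
print(context.shape, weights.shape)  # (8, 1, 256), (8, 1, 200)
```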
Developing a practical speech recognizer for a low-resource language is challenging, not only because of the (potentially unknown) properties of the language, but also because the test data may not come from the same domain as the available training data. In this paper, we focus on the latter challenge, i.e. domain mismatch, for systems trained with a sequence-based criterion. We demonstrate the effectiveness of using a pre-trained English recognizer, which is robust to such mismatched conditions, as a domain-normalizing feature extractor for a low-resource language. In our example, we use Turkish conversational speech and broadcast news data. This enables rapid development of speech recognizers that can easily adapt to any domain. In various cross-domain scenarios, we achieve a 25% relative improvement in phoneme error rate, with improvements of around 50% for some domains.
This work presents a broad study on the adaptation of neural network acoustic models by means of learning hidden unit contributions (LHUC)-a method that linearly re-combines hidden units in a speaker- or environment-dependent manner using small amounts of unsupervised adaptation data. We also extend LHUC to a speaker adaptive training (SAT) framework that leads to a more adaptable DNN acoustic model, working both in a speaker-dependent and a speaker-independent manner, without the requirements to maintain auxiliary speaker-dependent feature extractors or to introduce significant speaker-dependent changes to the DNN structure. Through a series of experiments on four different speech recognition benchmarks (TED talks, Switchboard, AMI meetings, and Aurora4) comprising 270 test speakers, we show that LHUC in both its test-only and SAT variants results in consistent word error rate reductions ranging from 5% to 23% relative depending on the task and the degree of mismatch between training and test data. In addition, we have investigated the effect of the amount of adaptation data per speaker, the quality of unsupervised adaptation targets, the complementarity to other adaptation techniques, one-shot adaptation, and an extension to adapting DNNs trained in a sequence discriminative manner.
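A minimal PyTorch sketch of the LHUC re-parameterization, where each hidden unit is scaled by an adaptable amplitude a(r) = 2·sigmoid(r) learned per speaker while the base network stays frozen:

```python
import torch

class LHUC(torch.nn.Module):
    """Learning Hidden Unit Contributions: per-speaker amplitude
    scaling of a hidden layer's units, adapted on a small amount of
    (possibly unsupervised) data with the base DNN weights frozen."""
    def __init__(self, hidden_dim):
        super().__init__()
        # r = 0 gives a scale of 1.0, i.e. the unadapted network.
        self.r = torch.nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, h):
        return h * (2.0 * torch.sigmoid(self.r))  # scales lie in (0, 2)
```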
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
Current state-of-the-art automatic speech recognition systems are trained on a specific "domain", defined by factors such as application, sampling rate, and codec. When such a recognizer is used in conditions that do not match the training domain, performance degrades significantly. This work explores the idea of building a single domain-invariant model for varied use cases by combining large-scale training data from multiple application domains. Our final system is trained on 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and sampling rates. Our results show that, even at such a scale, a model trained this way performs almost as well as models fine-tuned to specific subsets: a single model can be robust to multiple application domains as well as to variations like codecs and noise. More importantly, such models generalize better to unseen conditions and allow for rapid adaptation — we show that with as little as 10 hours of data from a new domain, an adapted domain-invariant model can match the performance of a domain-specific model trained from scratch on 70 times as much data. We also highlight some limitations of such models and domains that need to be addressed in future work.
Recurrent neural networks (RNNs) have dominated language modeling because of their superior performance over traditional N-gram based models. In many applications, a large recurrent neural network language model (RNNLM), or an ensemble of several RNNLMs, is used. These models have large memory footprints and require heavy computation. In this paper, we examine the effect of applying knowledge distillation in reducing the model size of RNNLMs. In addition, we propose a trust regularization method to improve knowledge distillation training of RNNLMs. Using knowledge distillation with trust regularization, we reduce the parameter size to a third of that of the previously published best model while maintaining state-of-the-art perplexity results on the Penn Treebank data. In a speech recognition N-best rescoring task, we reduce the RNNLM model size to 18.5% of the baseline system, with no degradation in word error rate (WER) performance on the Wall Street Journal data set.
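Standard distillation for a language model looks roughly as below (a PyTorch sketch; the paper's trust regularization term is not reproduced here, and the temperature and interpolation weight are illustrative):

```python
import torch.nn.functional as F

def lm_distill_loss(student_logits, teacher_logits, targets,
                    T=2.0, alpha=0.5):
    """Interpolate hard-label cross-entropy with a temperature-softened
    KL term against the teacher's next-word distribution."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction='batchmean') * (T * T)
    return alpha * hard + (1 - alpha) * soft
```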
We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used 'naïvely' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furthermore, simple, feature-level fusion based extensions to the framework are proposed to improve the integration with the ASR back-end. These yield a best result of 13.76% average word error rate, which is, to our knowledge, the best score to date.
This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach eliminates the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have recently been proposed as a way to mitigate the problem of limited paired data. These approaches compose a reverse operation with a given transformation, e.g., text-to-speech (TTS) with ASR, to build a loss that only requires unsupervised data, speech in this example. Applying cycle consistency to ASR models is not trivial, since fundamental information such as speaker traits is lost in the intermediate text bottleneck. To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal. This is accomplished by training a text-to-encoder model and defining a loss based on the encoder reconstruction error. Experimental results on the LibriSpeech corpus show that the proposed cycle-consistency training reduces the word error rate by 14.7% relative to an initial model trained with 100 hours of paired data, using an additional 360 hours of audio data without transcriptions. We also investigate the use of text-only data, mainly for language modeling, to further improve performance in the unpaired data training scenario.
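A simplified sketch of the encoder-level cycle loss, assuming hypothetical `asr_encoder` and `text_to_encoder` interfaces and an L2 reconstruction error (the paper's exact loss definition and how it is propagated through the ASR hypotheses may differ):

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(asr_encoder, text_to_encoder, speech, hyp_text):
    """Reconstruct the ASR encoder state sequence from hypothesized
    text and penalize the mismatch, avoiding the raw-signal cycle in
    which speaker traits are lost in the text bottleneck."""
    with torch.no_grad():
        enc_states = asr_encoder(speech)   # (T', D) reference states
    recon = text_to_encoder(hyp_text)      # (T', D) predicted states
    return F.mse_loss(recon, enc_states)
```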
In this paper, we investigate the use of adversarial learning for unsupervised adaptation to unseen recording conditions, more specifically, single-microphone far-field speech. We adapt neural-network-based acoustic models trained on close-talk clean speech to the new recording conditions using untranscribed adaptation data. Our experimental results on the Italian SPEECON dataset show that our proposed method achieves a 19.8% relative word error rate (WER) reduction compared to the unadapted models. Furthermore, this adaptation method is beneficial even when performed on data from another language (i.e., French), giving a 12.6% relative WER reduction.
Recent studies have shown that deep neural networks (DNNs) perform significantly better than shallow networks and Gaussian mixture models (GMMs) on large vocabulary speech recognition tasks. In this paper, we argue that the improved accuracy achieved by the DNNs is the result of their ability to extract discriminative internal representations that are robust to the many sources of variability in speech signals. We show that these representations become increasingly insensitive to small perturbations in the input with increasing network depth, which leads to better speech recognition performance with deeper networks. We also show that DNNs cannot extrapolate to test samples that are substantially different from the training examples. If the training data are sufficiently representative, however, internal features learned by the DNN are relatively stable with respect to speaker differences, bandwidth differences, and environment distortion. This enables DNN-based recognizers to perform as well or better than state-of-the-art systems based on GMMs or shallow networks without the need for explicit model adaptation or feature normalization.
The environmental robustness of DNN-based acoustic models can be significantly improved by using multi-condition training data. However, as data collection is a costly proposition, simulation of the desired conditions is a frequently adopted strategy. In this paper we detail a data augmentation approach for far-field ASR. We examine the impact of using simulated room impulse responses (RIRs), as real RIRs can be difficult to acquire, and also the effect of adding point-source noises. We find that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added. Further we show that the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario. We evaluate our approach on several LVCSR tasks which can adequately represent both scenarios.
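A minimal NumPy/SciPy sketch of the augmentation recipe — convolve clean speech with a (real or simulated) RIR, then mix in a point-source noise at a target SNR (function name and the SNR default are illustrative; the full recipe also reverberates the noise itself):

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, rir, noise=None, snr_db=15.0):
    """Simulate a far-field recording of `speech` using impulse
    response `rir`, optionally adding a point-source noise at the
    requested signal-to-noise ratio."""
    rev = fftconvolve(speech, rir)[:len(speech)]
    if noise is not None:
        noise = noise[:len(rev)]
        # Scale the noise so that 10*log10(P_signal / P_noise) = snr_db.
        gain = np.sqrt((rev**2).sum() / ((noise**2).sum() * 10**(snr_db / 10)))
        rev = rev + gain * noise
    return rev
```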
Domain mismatch between training and testing can lead to significant degradation in performance in many machine learning scenarios. Unfortunately, this is not a rare situation for automatic speech recognition deployments in real-world applications. Research on robust speech recognition can be regarded as trying to overcome this domain mismatch issue. In this paper, we address the unsupervised domain adaptation problem for robust speech recognition, where both source and target domain speech are available, but word transcripts are only available for the source domain speech. We present novel augmentation-based methods that transform speech in a way that does not change the transcripts. Specifically, we first train a variational autoencoder on both source and target domain data (without supervision) to learn a latent representation of speech. We then transform nuisance attributes of speech that are irrelevant to recognition by modifying the latent representations, in order to augment labeled training data with additional data whose distribution is more similar to the target domain. The proposed method is evaluated on the CHiME-4 dataset and reduces the absolute word error rate (WER) by as much as 35% compared to the non-adapted baseline.
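A heavily simplified sketch of the latent-space transformation idea, assuming hypothetical `vae.encode`/`vae.decode` interfaces (the paper modifies specific nuisance attributes of the representation rather than applying the naive mean shift shown here):

```python
import torch

def latent_shift_augment(vae, source_utt, target_utt):
    """Move a labeled source-domain utterance toward the target domain
    in VAE latent space, then decode it back to features; the transcript
    is assumed unchanged, so the result augments the labeled data."""
    z_src = vae.encode(source_utt).mean            # assumed interface
    z_tgt = vae.encode(target_utt).mean
    # Shift by the difference of the domains' average latent codes.
    z_aug = z_src + (z_tgt.mean(dim=0) - z_src.mean(dim=0))
    return vae.decode(z_aug)
```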