This paper explores the use of adversarial examples in training speech recognition systems to improve the robustness of deep neural network acoustic models. During training, the fast gradient sign method (FGSM) is used to generate adversarial examples that augment the original training data. Unlike conventional data augmentation based on data transformations, these examples are generated dynamically based on the current acoustic model parameters. We evaluate the impact of adversarial data augmentation in experiments on the Aurora-4 and CHiME-4 single-channel tasks, showing improved robustness against noise and channel variation. Further improvements are obtained when combining adversarial examples with teacher/student training, leading to a 23% relative word error rate reduction on Aurora-4.
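As a rough illustration of the augmentation step described above, the sketch below shows how FGSM-style adversarial features could be generated on the fly from the current model parameters. It assumes a PyTorch acoustic model and a frame-level training criterion; the function name, the epsilon value, and the way the adversarial batch is mixed back in are illustrative, not details taken from the paper.

```python
import torch

def fgsm_augment(model, loss_fn, feats, labels, epsilon=0.3):
    """Generate adversarial acoustic features with the fast gradient sign method.

    feats:   (batch, time, feat_dim) input features, e.g. log-mel filterbanks
    labels:  frame-level targets for the training criterion
    epsilon: perturbation magnitude (tunable; the value here is illustrative)
    """
    feats = feats.clone().detach().requires_grad_(True)
    loss = loss_fn(model(feats), labels)
    loss.backward()
    # Move each feature a small step in the direction that increases the loss.
    adv_feats = feats + epsilon * feats.grad.sign()
    return adv_feats.detach()

# Because the gradient is taken w.r.t. the *current* model parameters, the
# adversarial batch changes as training progresses; it is typically mixed with
# the original batch, e.g.:
#   adv = fgsm_augment(model, ce_loss, feats, labels)
#   feats_aug = torch.cat([feats, adv]); labels_aug = torch.cat([labels, labels])
```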
Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state-of-the-art model-based approach, which applies a single neural network to solve this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into three sub-tasks: frame-wise interpreting, utterance-level speaker tracing, and speech recognition. The pretraining regimen uses these modules to solve progressively harder tasks. Transfer learning leverages parallel clean speech to improve the training targets for the network. Our discriminative training formulation is a modification of standard formulations that also penalizes competing outputs of the system. Experiments are conducted on the artificially overlapped Switchboard and hub5e-swb datasets. The proposed framework achieves over 30% relative improvement in WER over both a strong jointly trained system, PIT for ASR, and a separately optimized system, PIT for speech separation with a clean-speech ASR model. The improvement comes from better model generalization, training efficiency, and the integration of sequence-level linguistic knowledge.
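For readers less familiar with PIT, the sketch below shows the core permutation-invariant objective on which such systems are built: the loss is evaluated under every assignment of network output streams to speakers and the smallest value is kept. This is a generic PyTorch illustration of the criterion, not the paper's modular implementation.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_cross_entropy(stream_logits, speaker_targets):
    """Permutation-invariant frame-level cross-entropy (simplified sketch).

    stream_logits:   list of S tensors, each (batch, time, num_classes), one per
                     network output stream
    speaker_targets: list of S tensors, each (batch, time), one per speaker
    Returns the loss under the best stream-to-speaker assignment.
    """
    num_streams = len(stream_logits)
    best = None
    for perm in itertools.permutations(range(num_streams)):
        loss = sum(
            F.cross_entropy(stream_logits[o].transpose(1, 2), speaker_targets[t])
            for o, t in enumerate(perm)
        )
        best = loss if best is None else torch.minimum(best, loss)
    return best
```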
This paper addresses the robust speech recognition problem as a domain adaptation task. Specifically, we introduce an unsupervised deep domain adaptation (DDA) approach to acoustic modeling in order to eliminate the training-testing mismatch that is common in real-world use of speech recognition. Under a multi-task learning framework, the approach jointly learns two discriminative classifiers using one deep neural network (DNN). As the main task, a label predictor predicts phoneme labels and is used during training and at test time. As the second task, a domain classifier discriminates between the source and the target domains during training. The network is optimized by simultaneously minimizing the loss of the label classifier and maximizing the loss of the domain classifier. The proposed approach is easy to implement by modifying a common feed-forward network. Moreover, this unsupervised approach only needs labeled training data from the source domain and some unlabeled raw data of the new domain. Speech recognition experiments on noise/channel distortion and domain shift confirm the effectiveness of the proposed approach. For instance, on the Aurora-4 corpus, compared with the acoustic model trained only using clean data, the DDA approach achieves a relative 37.8% word error rate (WER) reduction.
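One standard way to realize the "minimize the label loss while maximizing the domain loss" objective in a single feed-forward network is a gradient reversal layer placed before the domain classifier, as in domain-adversarial training. The PyTorch sketch below illustrates that mechanism; the helper names, the lambda weight, and the assumption that both classifiers share the same feature tensor are illustrative simplifications, not details from the paper.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so the shared layers learn to *fool* the domain classifier."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def dda_loss(features, label_head, domain_head, phone_targets, domain_targets, lam=0.1):
    # Main task: phoneme prediction (in practice only on labeled source-domain frames).
    label_loss = F.cross_entropy(label_head(features), phone_targets)
    # Auxiliary task: source-vs-target discrimination on all frames, with the
    # gradient reversed before it reaches the shared feature layers.
    domain_loss = F.cross_entropy(domain_head(GradReverse.apply(features, lam)), domain_targets)
    return label_loss + domain_loss
```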
Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training; otherwise, the network may settle in a poor local optimum. This is especially true for online networks, such as unidirectional LSTMs. Currently, the best strategy for training such systems is to bootstrap training from a tied-triphone system. However, this is time-consuming and, more importantly, impossible for languages without a high-quality pronunciation lexicon. In this work, we propose an initialization strategy that uses teacher-student learning to transfer knowledge from a large, well-trained offline end-to-end speech recognition model to an online end-to-end model, eliminating the need for a lexicon or any other linguistic resources. We also explore curriculum learning and label smoothing and show how they can be combined with the proposed teacher-student learning for further improvements. We evaluate our methods on a Microsoft Cortana personal assistant task and show that the proposed approach yields a 19% relative improvement in word error rate compared with a randomly initialized baseline system.
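A minimal sketch of the frame-level teacher/student objective that this kind of initialization relies on is given below: the online student is trained to match the soft output distribution of the offline teacher, so no lexicon or transcriptions are needed. It assumes PyTorch and that both models produce logits over the same output units; the temperature parameter is a common knowledge-distillation addition rather than something stated in the abstract.

```python
import torch.nn.functional as F

def teacher_student_loss(student_logits, teacher_logits, temperature=1.0):
    """Train the student to match the teacher's (soft) output posteriors."""
    teacher_post = F.softmax(teacher_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student); the teacher entropy term is constant w.r.t. the
    # student, so this is equivalent to cross-entropy against soft targets.
    return F.kl_div(student_logp, teacher_post, reduction="batchmean")
```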
The environmental robustness of DNN-based acoustic models can be significantly improved by using multi-condition training data. However, as data collection is a costly proposition, simulation of the desired conditions is a frequently adopted strategy. In this paper we detail a data augmentation approach for far-field ASR. We examine the impact of using simulated room impulse responses (RIRs), as real RIRs can be difficult to acquire, and also the effect of adding point-source noises. We find that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added. Further we show that the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario. We evaluate our approach on several LVCSR tasks which can adequately represent both scenarios.
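A simple NumPy sketch of the simulation described above, convolving clean speech with an RIR and then adding a (possibly itself reverberated) point-source noise at a chosen SNR, might look like the following; the function name and SNR convention are illustrative.

```python
import numpy as np

def reverberate_and_add_noise(speech, rir, noise, snr_db):
    """Simulate a far-field recording from close-talk speech.

    speech: clean waveform; rir: room impulse response (simulated or measured);
    noise:  point-source noise waveform (assumed at least as long as the speech);
    snr_db: target signal-to-noise ratio in dB.
    """
    reverb = np.convolve(speech, rir)[: len(speech)]
    noise = noise[: len(reverb)]
    speech_power = np.mean(reverb ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) = snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverb + scale * noise
```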
Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a multi-head attention architecture, which offers improvements over the commonly-used single-head attention. On the optimization side, we explore synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500 hour voice search task, we find that the proposed changes improve the WER from 9.2% to 5.6%, while the best conventional system achieves 6.7%; on a dictation task our model achieves a WER of 4.1% compared to 5% for the conventional system.
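Of the structural changes mentioned above, multi-head attention is the easiest to illustrate compactly: each head attends to the encoder memory with its own learned projections, and the heads' context vectors are concatenated and projected back. The PyTorch module below is a generic sketch, not the exact configuration of the LAS system.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, memory):
        # query: (batch, 1, d_model) decoder state; memory: (batch, T, d_model) encoder output
        B, T, _ = memory.shape
        q = self.q_proj(query).view(B, -1, self.num_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(memory).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(memory).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # one score map per head
        context = torch.softmax(scores, dim=-1) @ v             # per-head context vectors
        context = context.transpose(1, 2).reshape(B, -1, self.num_heads * self.d_head)
        return self.out_proj(context)
```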
End-to-end acoustic models, such as connectionist temporal classification (CTC) and the attention model, have been studied, and their speech recognition accuracies come close to those of conventional deep neural network (DNN)-hidden Markov models. However, most high-performance end-to-end models are not suitable for real-time (streaming) speech recognition because they are based on bidirectional recurrent neural networks (RNNs). In this study, to improve the performance of unidirectional RNN-based CTC, which is suitable for real-time processing, we investigate knowledge distillation (KD)-based model compression methods for training a CTC acoustic model. We evaluate a frame-level KD method and a sequence-level KD method for the CTC model. Speech recognition experiments on Wall Street Journal tasks demonstrate that frame-level KD worsens the WERs of the unidirectional CTC model, whereas sequence-level KD can improve them.
Current state-of-the-art automatic speech recognition systems are trained on a specific "domain", defined by factors such as the application, sampling rate, and codec. When such a recognizer is used in conditions that do not match the training domain, performance degrades significantly. This work explores the idea of building a single domain-invariant model for a variety of use cases by combining large-scale training data from multiple application domains. Our final system is trained on 162,000 hours of speech. In addition, each utterance is artificially distorted during training to simulate effects such as background noise, codec distortion, and sampling rate. Our results show that, even at this scale, a model trained in this way performs almost as well as models fine-tuned to specific subsets: a single model can be robust across multiple application domains and to variations in codec and noise. More importantly, such models generalize better to unseen conditions and allow for rapid adaptation; we show that with as little as 10 hours of data from a new domain, an adapted domain-invariant model can match the performance of a domain-specific model trained from scratch on 70 times as much data. We also highlight some limitations of such models and domains that need to be addressed in future work.
We investigate the feasibility of sequence-level knowledge distillation of sequence-to-sequence (Seq2Seq) models for large vocabulary continuous speech recognition (LVCSR). We first use a pre-trained larger teacher model to generate multiple hypotheses per utterance with beam search. Using the same inputs, we then train the student model using these teacher-generated hypotheses as pseudo-labels in place of the original ground-truth labels. We evaluate our proposed approach on the Wall Street Journal (WSJ) corpus, achieving more than a 9.8x reduction in parameters with an accuracy loss of up to a 7.0% word error rate (WER) increase.
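A hedged sketch of this sequence-level distillation setup is given below: the teacher's beam-search hypotheses act as training targets for the student in place of the reference transcript. The softmax weighting over the K-best list is one plausible choice rather than necessarily the paper's, and `student_log_prob` is a placeholder for the student model's sequence log-probability.

```python
import torch
import torch.nn.functional as F

def sequence_kd_loss(student_log_prob, teacher_hyps):
    """Sequence-level KD loss over a teacher K-best list.

    student_log_prob: callable mapping a token sequence to its log-probability
                      under the student model
    teacher_hyps:     list of (token_sequence, teacher_score) pairs produced by
                      the teacher's beam search for one utterance
    """
    scores = torch.tensor([score for _, score in teacher_hyps])
    weights = F.softmax(scores, dim=0)       # normalize teacher scores over the K-best list
    loss = 0.0
    for (tokens, _), w in zip(teacher_hyps, weights):
        loss = loss - w * student_log_prob(tokens)   # maximize weighted student likelihood
    return loss
```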
This work presents a broad study on the adaptation of neural network acoustic models by means of learning hidden unit contributions (LHUC), a method that linearly re-combines hidden units in a speaker- or environment-dependent manner using small amounts of unsupervised adaptation data. We also extend LHUC to a speaker adaptive training (SAT) framework that leads to a more adaptable DNN acoustic model, working both in a speaker-dependent and a speaker-independent manner, without the requirement to maintain auxiliary speaker-dependent feature extractors or to introduce significant speaker-dependent changes to the DNN structure. Through a series of experiments on four different speech recognition benchmarks (TED talks, Switchboard, AMI meetings, and Aurora4) comprising 270 test speakers, we show that LHUC in both its test-only and SAT variants results in consistent word error rate reductions ranging from 5% to 23% relative, depending on the task and the degree of mismatch between training and test data. In addition, we have investigated the effect of the amount of adaptation data per speaker, the quality of unsupervised adaptation targets, the complementarity to other adaptation techniques, one-shot adaptation, and an extension to adapting DNNs trained in a sequence discriminative manner.
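The core LHUC re-parameterization is compact enough to show directly: every hidden unit is rescaled by a speaker-dependent amplitude r = 2*sigmoid(a), and in test-only adaptation only the per-speaker parameters a are updated on the adaptation data. The PyTorch wrapper below is a generic sketch, not the paper's implementation; the choice of a ReLU base layer is illustrative.

```python
import torch
import torch.nn as nn

class LHUCLayer(nn.Module):
    """Rescale each hidden unit of a base layer with an LHUC amplitude."""

    def __init__(self, base_layer, num_units):
        super().__init__()
        self.base = base_layer
        # One adaptation parameter per hidden unit, initialized so that r = 1.
        self.a = nn.Parameter(torch.zeros(num_units))

    def forward(self, x):
        r = 2.0 * torch.sigmoid(self.a)        # amplitudes constrained to (0, 2)
        return r * torch.relu(self.base(x))    # element-wise re-scaling of the units

# Test-only adaptation sketch: freeze the base network, then optimize only the
# `a` parameters on a small amount of (unsupervised) adaptation data.
```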
Accurate on-device keyword spotting (KWS) with low false accept and false reject rates is critical to customer experience for far-field voice control of conversational agents. It is especially challenging to maintain low false reject rates in real-world conditions, where there is (a) ambient noise from the environment, such as TV, household appliances, or other speech not directed at the device, and (b) imperfect cancellation of the audio playback from the device, leaving residual echo after processing by the Acoustic Echo Cancellation (AEC) system. In this paper, we propose a data augmentation strategy to improve keyword spotting performance under these challenging conditions. The training set audio is artificially corrupted by mixing in music and TV/movie audio at varying signal-to-interference ratios. Our results show that we can achieve a 30-45% relative reduction in false reject rates, over a range of false alarm rates, under audio playback from such devices.
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
Developing a practical speech recognizer for a low-resource language is challenging, not only because of the (possibly unknown) properties of the language, but also because the test data may not come from the same domain as the available training data. In this paper, we focus on the latter challenge, i.e., domain mismatch, for systems trained with sequence-based criteria. We demonstrate the effectiveness of using a pre-trained English recognizer, which is robust to such mismatched conditions, as a domain-normalizing feature extractor for a low-resource language. In our example, we use Turkish conversational speech and broadcast news data. This enables rapid development of speech recognizers that can easily adapt to any domain. Across a variety of cross-domain scenarios, we achieve a 25% relative improvement in phoneme error rate, with improvements of around 50% for some domains.
This paper presents a method for training end-to-end automatic speech recognition (ASR) models using unpaired data. Although end-to-end approaches remove the need for expert knowledge such as pronunciation dictionaries to build ASR systems, they still require a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have recently been proposed as a way to mitigate the problem of limited paired data. These approaches compose a given transform with its inverse operation, e.g., text-to-speech (TTS) with ASR, to build a loss that requires only unsupervised data (speech in this example). Applying cycle consistency to ASR models is non-trivial, because essential information such as speaker characteristics is lost in the intermediate text bottleneck. To solve this problem, this work presents an approach based on the speech encoder state sequence instead of the raw speech signal. This is accomplished by training a text-to-encoder model and defining a loss based on the encoder reconstruction error. Experimental results on the LibriSpeech corpus show that the proposed cycle-consistency training reduces the word error rate by 14.7% from an initial model trained with 100 hours of paired data, using an additional 360 hours of audio data without transcriptions. We also investigate the use of text-only data, mainly for language modeling, to further improve performance in the unpaired data training scenario.
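The encoder-level cycle-consistency idea can be sketched as follows: speech is encoded, decoded into a text hypothesis, mapped back into the encoder space by a text-to-encoder model, and the reconstruction error of the encoder state sequence is penalized. The callables below are placeholders for the corresponding model components, the L2 error is an illustrative choice, and the sketch glosses over how the loss is propagated through the discrete hypothesis in the full method.

```python
import torch

def cycle_consistency_loss(asr_encoder, asr_decoder, text_to_encoder, speech_feats):
    enc_states = asr_encoder(speech_feats)    # (T, D) encoder state sequence
    hyp_tokens = asr_decoder(enc_states)      # decoded hypothesis, no transcript needed
    rec_states = text_to_encoder(hyp_tokens)  # encoder states predicted from text alone
    # Reconstruction error in the encoder space rather than the signal space, so
    # speaker characteristics lost in the text bottleneck need not be recovered.
    T = min(len(enc_states), len(rec_states))
    return torch.mean((enc_states[:T] - rec_states[:T]) ** 2)
```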
In this paper, we investigate the use of adversarial learning for unsupervised adaptation to unseen recording conditions, more specifically single-microphone far-field speech. We adapt a neural-network-based acoustic model, trained on close-talk clean speech, to new recording conditions using untranscribed adaptation data. Our experimental results on the Italian SPEECON dataset show that the proposed method achieves a 19.8% relative word error rate (WER) reduction compared to the unadapted model. Furthermore, this adaptation method is beneficial even when performed on data from another language (i.e., French), giving a 12.6% relative WER reduction.
New efficient measures for estimating the uncertainty of deep neural network (DNN) classifiers are proposed and successfully applied to multistream-based unsupervised adaptation of ASR systems to address uncertainty derived from noise. The first proposed measure is the error of associative memory models trained on the outputs of a DNN; in the present study, autoencoders are used to remember the properties of the data. The other proposed measure is an extension of the M-measure, which computes the divergences of probability estimates spaced at specific time intervals; the extended measure improves reliability by considering the latent information of phoneme duration. Experimental comparisons carried out in a multistream-based ASR paradigm demonstrate that the proposed measures yield improvements over the multi-style trained system and the system selected based on existing measures. Fusion of the proposed measures achieves almost the same performance as oracle system selection.
Recent studies have shown that deep neural networks (DNNs) perform significantly better than shallow networks and Gaussian mixture models (GMMs) on large vocabulary speech recognition tasks. In this paper, we argue that the improved accuracy achieved by the DNNs is the result of their ability to extract discriminative internal representations that are robust to the many sources of variability in speech signals. We show that these representations become increasingly insensitive to small perturbations in the input with increasing network depth, which leads to better speech recognition performance with deeper networks. We also show that DNNs cannot extrapolate to test samples that are substantially different from the training examples. If the training data are sufficiently representative, however, internal features learned by the DNN are relatively stable with respect to speaker differences, bandwidth differences, and environment distortion. This enables DNN-based recognizers to perform as well or better than state-of-the-art systems based on GMMs or shallow networks without the need for explicit model adaptation or feature normalization.
Although great progress has been made in automatic speech recognition (ASR), significant performance degradation still exists in noisy environments. Our previous work has demonstrated the superior noise robustness of very deep convolutional neural networks (VDCNN). Based on our work on very deep CNNs, this paper proposes a more advanced model referred to as the very deep convolutional residual network (VDCRN). This new model incorporates batch normalization and residual learning, showing more robustness than previous VDCNNs. Then, to alleviate the mismatch between the training and testing conditions, model adaptation and adaptive training are developed and compared for the new VDCRN. This work focuses on factor aware training (FAT) and cluster adaptive training (CAT). For factor aware training, a unified framework is explored. For cluster adaptive training, two schemes are first explored to construct the bases in the canonical model; furthermore a factorized version of CAT is designed to address multiple non-speech variabilities in one model. Finally, a complete multi-pass system is proposed to achieve the best system performance in the noisy scenarios. The proposed new approaches are evaluated on three different tasks: Aurora4 (simulated data with additive noise and channel distortion), CHiME4 (both simulated and real data with additive noise and reverberation) and the AMI meeting transcription task (real data with significant reverberation). The evaluation includes not only different noisy conditions, but also covers both simulated and real noisy data. The experiments show that the new VDCRN is more robust, and the adaptation on this model can further significantly reduce the word error rate. The proposed best architecture obtains consistent and very large improvements on all tasks compared to the baseline VDCNN or LSTM. Particularly, on Aurora4 a new milestone 5.67% WER is achieved by only improving acoustic modeling.
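As a rough illustration of the "batch normalization plus residual learning" ingredient of the VDCRN, the PyTorch block below shows a generic residual convolutional block over time-frequency feature maps; the kernel sizes, channel counts, and depth are not those of the paper's architecture.

```python
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):                 # x: (batch, channels, time, freq)
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(x + y)           # identity shortcut (residual learning)
```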
Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers situated in challenging noisy environments recorded using a 6-channel tablet-based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques and perform a number of new experiments in order to separately assess the impact of different noise environments, different numbers and positions of microphones, or simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects the ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.
We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used 'naïvely' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furthermore, simple, feature-level fusion based extensions to the framework are proposed to improve the integration with the ASR back-end. These yield a best result of 13.76% average word error rate, which is, to our knowledge, the best score to date.
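A minimal sketch of an LSTM enhancement front-end of this kind, trained with a mean-squared-error speech reconstruction objective and used to produce enhanced features for the ASR back-end, is shown below; the layer sizes and the choice of regressing clean features directly (rather than, say, a time-frequency mask) are illustrative assumptions.

```python
import torch.nn as nn

class LSTMEnhancer(nn.Module):
    """Map noisy log-spectral features to estimates of their clean counterparts."""

    def __init__(self, feat_dim=40, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, noisy_feats):        # (batch, time, feat_dim)
        h, _ = self.lstm(noisy_feats)
        return self.out(h)

# Training sketch: minimize nn.MSELoss()(enhancer(noisy_feats), clean_feats),
# then feed the enhancer outputs to the recognizer as front-end features.
```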