Existing training criteria in automatic speech recognition(ASR) permit the model to freely explore more than one time alignments between the feature and label sequences. In this paper, we use entropy to measure a model's uncertainty, i.e. how it chooses to distribute the probability mass over the set of allowed alignments. Furthermore, we evaluate the effect of entropy regularization in encouraging the model to distribute the probability mass only on a smaller subset of allowed alignments. Experiments show that entropy regularization enables a much simpler decoding method without sacrificing word error rate, and provides better time alignment quality.
translated by 谷歌翻译
Sequence transducers, such as the RNN-T and the Conformer-T, are one of the most promising models of end-to-end speech recognition, especially in streaming scenarios where both latency and accuracy are important. Although various methods, such as alignment-restricted training and FastEmit, have been studied to reduce the latency, latency reduction is often accompanied with a significant degradation in accuracy. We argue that this suboptimal performance might be caused because none of the prior methods explicitly model and reduce the latency. In this paper, we propose a new training method to explicitly model and reduce the latency of sequence transducer models. First, we define the expected latency at each diagonal line on the lattice, and show that its gradient can be computed efficiently within the forward-backward algorithm. Then we augment the transducer loss with this expected latency, so that an optimal trade-off between latency and accuracy is achieved. Experimental results on the WSJ dataset show that the proposed minimum latency training reduces the latency of causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%, and outperforms conventional alignment-restricted training (110 ms) and FastEmit (67 ms) methods.
translated by 谷歌翻译
Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels) should improve PL performance and convergence. Surprisingly and unexpectedly, we find that soft-labels targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.
translated by 谷歌翻译
在自动语音识别(ASR)研究中,歧视性标准在DNN-HMM系统中取得了出色的性能。鉴于这一成功,采用判别标准是有望提高端到端(E2E)ASR系统的性能。有了这一动机,以前的作品将最小贝叶斯风险(MBR,歧视性标准之一)引入了E2E ASR系统中。但是,基于MBR的方法的有效性和效率受到损害:MBR标准仅用于系统培训,这在训练和解码之间造成了不匹配;基于MBR的方法中的直接解码过程导致需要预先训练的模型和缓慢的训练速度。为此,在这项工作中提出了新的算法,以整合另一种广泛使用的判别标准,无晶格的最大互信息(LF-MMI),不仅在训练阶段,而且在解码过程中。提出的LF-MI训练和解码方法显示了它们对两个广泛使用的E2E框架的有效性:基于注意力的编码器解码器(AEDS)和神经传感器(NTS)。与基于MBR的方法相比,提出的LF-MMI方法:保持训练和解码之间的一致性;避开直立的解码过程;来自具有卓越训练效率的随机初始化模型的火车。实验表明,LF-MI方法的表现优于其MBR对应物,并始终导致各种框架和数据集从30小时到14.3k小时上的统计学意义改进。所提出的方法在Aishell-1(CER 4.10%)和Aishell-2(CER 5.02%)数据集上实现了最先进的结果(SOTA)。代码已发布。
translated by 谷歌翻译
作为语音识别的最流行的序列建模方法之一,RNN-Transducer通过越来越复杂的神经网络模型,以增长的规模和增加训练时代的增长,实现了不断发展的性能。尽管强大的计算资源似乎是培训卓越模型的先决条件,但我们试图通过仔细设计更有效的培训管道来克服它。在这项工作中,我们提出了一条高效的三阶段渐进式训练管道,以在合理的短时间内从头开始建立具有非常有限的计算资源的高效神经传感器模型。每个阶段的有效性在LibrisPeech和Convebobly Corpora上都经过实验验证。拟议的管道能够在短短2-3周内以单个GPU接近最先进的性能来训练换能器模型。我们最好的构型传感器在Librispeech测试中获得4.1%的速度,仅使用35个训练时代。
translated by 谷歌翻译
This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90%/+139% to model inference for English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo (\url{https://github.com/NVIDIA/NeMo}) toolkit.
translated by 谷歌翻译
Recognizing a word shortly after it is spoken is an important requirement for automatic speech recognition (ASR) systems in real-world scenarios. As a result, a large body of work on streaming audio-only ASR models has been presented in the literature. However, streaming audio-visual automatic speech recognition (AV-ASR) has received little attention in earlier works. In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and the visual encoder neural networks are both based on the conformer architecture, which is made streamable using chunk-wise self-attention (CSA) and causal convolution. Streaming recognition with a decoder neural network is realized by using the triggered attention technique, which performs time-synchronous decoding with joint CTC/attention scoring. For frame-level ASR criteria, such as CTC, a synchronized response from the audio and visual encoders is critical for a joint AV decision making process. In this work, we propose a novel alignment regularization technique that promotes synchronization of the audio and visual encoder, which in turn results in better word error rates (WERs) at all SNR levels for streaming and offline AV-ASR models. The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 (LRS3) dataset in an offline and online setup, respectively, which both present state-of-the-art results when no external training data are used.
translated by 谷歌翻译
统一的流和非流式的双通(U2)用于语音识别的端到端模型在流传输能力,准确性,实时因素(RTF)和延迟方面表现出很大的性能。在本文中,我们呈现U2 ++,U2的增强版本,进一步提高了准确性。 U2 ++的核心思想是在训练中同时使用标签序列的前向和向后信息来学习更丰富的信息,并在解码时结合前向和后向预测以提供更准确的识别结果。我们还提出了一种名为SPECSUB的新数据增强方法,以帮助U2 ++模型更准确和强大。我们的实验表明,与U2相比,U2 ++在训练中显示了更快的收敛,更好地鲁棒性对解码方法,以及U2上的一致5 \%-8 \%字错误率降低增益。在Aishell-1的实验中,我们通过u2 ++实现了一个4.63 \%的字符错误率(cer),其中没有流媒体设置和5.05 \%,具有320ms延迟的流设置。据我们所知,5.05 \%是Aishell-1测试集上的最佳发布的流媒体结果。
translated by 谷歌翻译
语言模型(LMS)显着提高端到端模型(E2E)模型在训练过程中很少见的单词的识别准确性,当时在浅融合或重新恢复设置中。在这项工作中,我们介绍了LMS在判别培训框架中学习混合自动回旋传感器(HAT)模型的研究,以减轻有关使用LMS的训练与推理差距。对于浅融合设置,我们在假设生成和损失计算过程中都使用LMS,而LM感知的MWER训练模型可实现10 \%的相对改进,比用标准MWER在语音搜索测试集中培训的模型相对改进,其中包含稀有单词。对于重新设置,我们学会了一个小型神经模块,以数据依赖性方式产生串联的融合权重。该模型与常规MWER训练的模型相同,但无需清除融合重量。
translated by 谷歌翻译
最近,基于注意的编码器 - 解码器(AED)模型对多个任务的端到端自动语音识别(ASR)显示了高性能。在此类模型中解决了过度控制,本文介绍了轻松关注的概念,这是一种简单地逐渐注入对训练期间对编码器 - 解码器注意重量的统一分配,其易于用两行代码实现。我们调查轻松关注跨不同AED模型架构和两个突出的ASR任务,华尔街日志(WSJ)和LibRisPeech的影响。我们发现,在用外部语言模型解码时,随着宽松的注意力训练的变压器始终如一地始终如一地遵循标准基线模型。在WSJ中,我们为基于变压器的端到端语音识别设置了一个新的基准,以3.65%的单词错误率,最优于13.1%的相对状态,同时仅引入单个HyperParameter。
translated by 谷歌翻译
在本文中,我们提出了一种新的双通方法来统一一个模型中的流和非流媒体端到端(E2E)语音识别。我们的型号采用混合CTC /注意架构,其中编码器中的构装层被修改。我们提出了一种基于动态的块的注意力策略,以允许任意右上下文长度。在推理时间,CTC解码器以流式方式生成n最佳假设。只有更改块大小,可以轻松控制推理延迟。然后,CTC假设被注意力解码器重新筛选以获得最终结果。这种有效的备用过程导致句子级延迟非常小。我们在开放的170小时Aishell-1数据集上的实验表明,所提出的方法可以简单有效地统一流和非流化模型。在Aishell-1测试集上,与标准的非流式变压器相比,我们的统一模型在非流式ASR中实现了5.60%的相对字符错误率(CER)减少。同一模型在流式ASR系统中实现了5.42%的CER,640ms延迟。
translated by 谷歌翻译
在长时间到数小时的长时间话语中,提高端到端ASR模型的性能是语音识别的持续挑战。一个常见的解决方案是使用单独的语音活动检测器(VAD)事先将音频分割,该声音活动检测器(VAD)纯粹基于声音/非语音信息来决定段边界位置。但是,VAD细分器可能是现实世界语音的最佳选择,例如,一个完整的句子应该整体上可能包含犹豫(“设置... 5点钟的警报”) 。我们建议用端到端的ASR模型替换VAD,能够以流方式预测段边界,从而使细分决定不仅在更好的声学特征上,而且还可以在解码文本的语义特征上进行,并具有可忽略的额外功能计算。在现实世界长音频(YouTube)的实验中,长度长达30分钟,我们证明了相对改善的8.5%,并且与VAD段基线相比,中位段延迟潜伏期的中位数延迟延迟减少了250毫秒。 - ART构象体RNN-T模型。
translated by 谷歌翻译
经常性的神经网络传感器(RNN-T)目标在建立当今最好的自动语音识别(ASR)系统中发挥着重要作用。与连接员时间分类(CTC)目标类似,RNN-T损失使用特定规则来定义生成一组对准以形成用于全汇训练的格子。但是,如果这些规则是最佳的,则在很大程度上未知,并且会导致最佳ASR结果。在这项工作中,我们介绍了一种新的传感器目标函数,它概括了RNN-T丢失来接受标签的图形表示,从而提供灵活和有效的框架来操纵训练格子,例如用于限制对齐或研究不同的转换规则。我们证明,与标准RNN-T相比,具有CTC样格子的基于传感器的ASR实现了更好的结果,同时确保了严格的单调对齐,这将允许更好地优化解码过程。例如,所提出的CTC样换能器系统对于测试 - LibrisPeech的其他条件,实现了5.9%的字误差率,相对于基于等效的RNN-T系统的提高,对应于4.8%。
translated by 谷歌翻译
最近,端到端(E2E)框架在各种自动语音识别(ASR)任务上取得了显着的结果。但是,无格的最大互信息(LF-MMI),作为在混合ASR系统中显示出卓越性能的鉴别性培训标准之一,很少在E2E ASR框架中采用。在这项工作中,我们提出了一种新的方法,将LF-MMI标准集成到培训和解码阶段的E2E ASR框架中。该方法显示了其在两个最广泛使用的E2E框架上的有效性,包括基于注意的编码器解码器(AED)和神经传感器(NTS)。实验表明,LF-MMI标准的引入始终如一地导致各种数据集和不同E2E ASR框架的显着性能改进。我们最好的模型在Aishell-1开发/测试集上实现了4.1 \%/ 4.4 \%的竞争力;我们还在强大的基线上实现了对Aishell-2和Librispeech数据集的显着误差。
translated by 谷歌翻译
End-to-end speech recognition models trained using joint Connectionist Temporal Classification (CTC)-Attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because of their conditional independence assumption that prevents output tokens from previous time steps to influence future predictions. To tackle this, we propose a novel two-way approach that first biases the encoder with attention over a predefined list of rare long-tail and out-of-vocabulary (OOV) words and then uses dynamic boosting and phone alignment network during decoding to further bias the subword predictions. We evaluate our approach on open-source VoxPopuli and in-house medical datasets to showcase a 60% improvement in F1 score on domain-specific rare words over a strong CTC baseline.
translated by 谷歌翻译
Recently, RNN-Transducers have achieved remarkable results on various automatic speech recognition tasks. However, lattice-free sequence discriminative training methods, which obtain superior performance in hybrid modes, are rarely investigated in RNN-Transducers. In this work, we propose three lattice-free training objectives, namely lattice-free maximum mutual information, lattice-free segment-level minimum Bayes risk, and lattice-free minimum Bayes risk, which are used for the final posterior output of the phoneme-based neural transducer with a limited context dependency. Compared to criteria using N-best lists, lattice-free methods eliminate the decoding step for hypotheses generation during training, which leads to more efficient training. Experimental results show that lattice-free methods gain up to 6.5% relative improvement in word error rate compared to a sequence-level cross-entropy trained model. Compared to the N-best-list based minimum Bayes risk objectives, lattice-free methods gain 40% - 70% relative training time speedup with a small degradation in performance.
translated by 谷歌翻译
梁搜索是端到端模型的主要ASR解码算法,生成树结构化假设。但是,最近的研究表明,通过假设合并进行解码可以通过可比或更好的性能实现更有效的搜索。但是,复发网络中的完整上下文与假设合并不兼容。我们建议在RNN传感器的预测网络中使用矢量定量的长期记忆单元(VQ-LSTM)。通过与ASR网络共同培训离散表示形式,可以积极合并假设以生成晶格。我们在总机语料库上进行的实验表明,提出的VQ RNN传感器改善了具有常规预测网络的换能器的ASR性能,同时还产生了具有相同光束尺寸的Oracle Word错误率(WER)的密集晶格。其他语言模型撤退实验还证明了拟议的晶格生成方案的有效性。
translated by 谷歌翻译
RNN-T模型由于其在线流媒体模式下运营的竞争力和能力,因此在文献和商业系统中广受欢迎。在这项工作中,我们进行了一项广泛的研究,比较了单调和原始RNN-T模型的几种预测网络体系结构。我们根据普通的最新构象编码器比较4种类型的预测网络,并在LibrisPeech和内部医学对话数据集上获得报告结果。我们的研究涵盖了离线批处理模式和在线流媒体方案。与以前的一些作品相反,我们的结果表明,当用作预测网络以及构象异构体编码器时,变压器并不总是胜过LSTM。受分数启发的启发,我们提出了一个新的简单预测网络体系结构N-CONCAT,它在我们在线流式传输基准测试中的表现优于其他。变压器和N-Gram降低的体系结构的表现非常相似,但在先前的上下文方面具有一些重要的不同行为。总体而言,与LSTM基线相比,我们获得了多达4.1%的相对相对改善,同时将预测网络参数降低了几乎数量级(8.4倍)。
translated by 谷歌翻译
在启用语音的应用程序中,一个预定的热词在同时用来激活设备以便进行查询。 toavoid重复一个热词,我们提出了一个端到端的流(E2E)打算查询检测器,该查询检测器识别向设备指向的发音,并滤除针对设备的其他发出内容。提出的方法将预期的查询检测器置于E2E模型中,该模型将语音识别的不同组件折叠成一个神经网络。E2E对台面解码和预期的查询检测进行建模,也使我们可以基于早期的部分偏置检测结果, ,这对于减少潜伏期和使系统响应很重要。我们证明,与独立的预期检测器相比,检测准确性和600个MSLATENCE的相对相对改善的相对提高一级误差率(EER)的相对提高了22%。在我们的实验中,提出的模型检测用户正在用用户开始讲话后,用8.7%的Eerwithin与设备进行对话。
translated by 谷歌翻译
最近,语音界正在看到从基于深神经网络的混合模型移动到自动语音识别(ASR)的端到端(E2E)建模的显着趋势。虽然E2E模型在大多数基准测试中实现最先进的,但在ASR精度方面,混合模型仍然在当前的大部分商业ASR系统中使用。有很多实际的因素会影响生产模型部署决定。传统的混合模型,用于数十年的生产优化,通常擅长这些因素。在不为所有这些因素提供优异的解决方案,E2E模型很难被广泛商业化。在本文中,我们将概述最近的E2E模型的进步,专注于解决行业视角的挑战技术。
translated by 谷歌翻译