Error correction in automatic speech recognition (ASR) aims to correct the incorrect words in sentences generated by ASR models. Since recent ASR models usually have a low word error rate (WER), error correction models should modify only incorrect words to avoid affecting originally correct tokens, so detecting incorrect words is important for error correction. Previous works on error correction either implicitly detect error words through target-source attention or CTC (connectionist temporal classification) loss, or explicitly locate specific deletion/substitution/insertion errors. However, implicit error detection does not provide a clear signal about which tokens are incorrect, and explicit error detection suffers from low detection accuracy. In this paper, we propose SoftCorrect, which uses a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection. Specifically, we first detect whether a token is correct through a probability produced by a specially designed language model, and then design a constrained CTC loss that duplicates only the detected incorrect tokens so that the decoder focuses on correcting error tokens. Compared with implicit error detection with CTC loss, SoftCorrect provides an explicit signal about which words are incorrect and thus needs to duplicate only the incorrect tokens rather than every token; compared with explicit error detection, SoftCorrect does not detect specific deletion/substitution/insertion errors but leaves that to the CTC loss. Experiments on the AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% CER reduction respectively, outperforming previous works by a large margin while retaining the fast speed of parallel generation.
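The selective duplication described in the SoftCorrect abstract can be illustrated with a small sketch. It is not the authors' implementation: the per-token `error_probs`, the `threshold`, the number of `copies`, and both function names are assumptions made for illustration. Only tokens flagged as likely errors are repeated, and the standard CTC collapse rule (merge adjacent repeats, then drop blanks) maps a decoder path back to a sentence.

```python
from typing import List


def build_decoder_input(tokens: List[str], error_probs: List[float],
                        threshold: float = 0.5, copies: int = 3) -> List[str]:
    """Repeat only tokens flagged as likely errors; keep all others once.

    error_probs, threshold, and copies are hypothetical illustration values.
    """
    if len(tokens) != len(error_probs):
        raise ValueError("tokens and error_probs must have the same length")
    expanded: List[str] = []
    for tok, prob in zip(tokens, error_probs):
        expanded.extend([tok] * (copies if prob >= threshold else 1))
    return expanded


def ctc_collapse(path: List[str], blank: str = "<blank>") -> List[str]:
    """Standard CTC collapse: merge adjacent repeats, then drop blanks."""
    out: List[str] = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out


if __name__ == "__main__":
    src = build_decoder_input(["a", "b", "c"], [0.1, 0.9, 0.2])
    print(src)
    # One possible decoder path over the expanded positions.
    print(ctc_collapse(["a", "x", "<blank>", "<blank>", "c"]))
```

Duplicating only the suspect positions gives the decoder room to insert or delete tokens there, while positions judged correct are passed through once.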
Error correction is widely used in automatic speech recognition (ASR) to post-process the generated sentence, and can further reduce the word error rate (WER). Although an ASR system generates multiple candidates through beam search, current error correction approaches can only correct one sentence at a time, failing to leverage the voting effect from multiple candidates to better detect and correct error tokens. In this work, we propose FastCorrect 2, an error correction model that takes multiple ASR candidates as input for better correction accuracy. FastCorrect 2 adopts non-autoregressive generation for fast inference. It consists of an encoder that processes multiple source sentences and a decoder that generates the target sentence in parallel from the adjusted source sentence, where the adjustment is based on the predicted duration of each source token. However, handling multiple source sentences raises two issues. First, it is non-trivial to leverage the voting effect from multiple source sentences since they usually vary in length. Thus, we propose a novel alignment algorithm to maximize the degree of token alignment among multiple sentences in terms of token and pronunciation similarity. Second, the decoder can only take one adjusted source sentence as input, while there are multiple source sentences. Thus, we develop a candidate predictor to detect the most suitable candidate for the decoder. Experiments on our in-house dataset and AISHELL-1 show that FastCorrect 2 can further reduce the WER over the previous single-candidate correction model by 3.2% and 2.6%, demonstrating the effectiveness of leveraging multiple candidates in ASR error correction. FastCorrect 2 achieves better performance than the cascaded re-scoring and correction pipeline and can serve as a unified post-processing module for ASR.
Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER) than the original ASR outputs. Previous works usually use a sequence-to-sequence model to correct an ASR output sentence autoregressively, which causes large latency and cannot be deployed in online ASR services. A straightforward solution to reduce latency, inspired by non-autoregressive (NAR) neural machine translation, is to use an NAR sequence generation model for ASR error correction, which, however, comes at the cost of a significantly increased ASR error rate. In this paper, observing distinctive error patterns and correction operations (i.e., insertion, deletion, and substitution) in ASR, we propose FastCorrect, a novel NAR error correction model based on edit alignment. In training, FastCorrect aligns each source token from an ASR output sentence to the target tokens from the corresponding ground-truth sentence based on the edit distance between the source and target sentences, and extracts the number of target tokens corresponding to each source token during editing/correction. This number is then used to train a length predictor and to adjust the source tokens to match the length of the target sentence for parallel generation. In inference, the token number predicted by the length predictor is used to adjust the source tokens for target sequence generation. Experiments on the public AISHELL-1 dataset and an internal industrial-scale ASR dataset show the effectiveness of FastCorrect for ASR error correction: 1) it speeds up inference by 6-9 times while maintaining accuracy (8-14% WER reduction) compared with the autoregressive correction model; and 2) it outperforms the popular NAR models adopted in neural machine translation and text editing by a large margin.
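The edit-alignment idea in the FastCorrect abstract can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it follows a single minimum-edit-distance path, attaches unmatched target tokens to the preceding source token, and assumes nothing about how FastCorrect itself chooses among equally good alignments. The function names `target_counts` and `adjust_source` are hypothetical.

```python
from typing import List


def target_counts(source: List[str], target: List[str]) -> List[int]:
    """For each source token, count the target tokens aligned to it
    along one minimum-edit-distance path (simplified assumption)."""
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    counts = [0] * n
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (source[i - 1] != target[j - 1])):
            counts[i - 1] += 1          # match or substitution: one-to-one
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            i -= 1                      # source token has no target counterpart
        else:
            if n > 0:                   # extra target token: attach to a neighbor
                counts[max(i - 1, 0)] += 1
            j -= 1
    return counts


def adjust_source(source: List[str], counts: List[int]) -> List[str]:
    """Repeat or drop source tokens so the length matches the target."""
    return [tok for tok, c in zip(source, counts) for _ in range(c)]


if __name__ == "__main__":
    src, tgt = ["a", "c"], ["a", "b", "c"]
    counts = target_counts(src, tgt)
    print(counts, adjust_source(src, counts))
```

In this sketch a count of 0 drops a source token, 1 keeps it, and a larger value repeats it, so the adjusted source has the same length as the target whenever the source is non-empty.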
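Contextual biasing is an important and challenging task for end-to-end automatic speech recognition (ASR) systems. Existing methods mainly include contextual LM biasing and adding a bias encoder to end-to-end ASR models. In this work, we introduce a novel approach that performs contextual biasing by adding a contextual spelling correction model on top of the end-to-end ASR system. We incorporate contextual information into a sequence-to-sequence spelling correction model with a shared context encoder. Our proposed model includes two different mechanisms: autoregressive (AR) and non-autoregressive (NAR). We propose filtering algorithms to handle large context lists, and performance balancing mechanisms to control the biasing degree of the model. We demonstrate that the proposed model is a general biasing solution that is domain-insensitive and can be adopted in different scenarios. Experiments show that the proposed method achieves up to 51% relative word error rate (WER) reduction over the ASR system and outperforms traditional biasing methods. Compared to the AR solution, the proposed NAR model reduces model size by 43.2% and speeds up inference by 2.1 times.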
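Due to recent advances in natural language processing, several works have applied the pre-trained masked language model (MLM) of BERT to the post-correction of speech recognition. However, existing pre-trained models only consider semantic correction while neglecting the phonetic features of words. Semantic-only post-correction consequently degrades performance, since homophone errors are fairly common in Chinese ASR. In this paper, we propose a novel approach that jointly exploits contextualized representations and the phonetic information between an error and its replacement candidates to reduce the error rate of Chinese ASR. Our experimental results on real-world speech recognition datasets show that the proposed method achieves an evidently lower CER than the baseline model, which uses a pre-trained BERT MLM as the corrector.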
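Spelling error correction is one of the topics with a long history in natural language processing. Although previous studies have achieved remarkable results, challenges remain. For Vietnamese, a state-of-the-art method for the task infers a syllable's context from its adjacent syllables. However, the accuracy of this method can be unsatisfactory, because the model may lose the context if two (or more) spelling mistakes stand next to each other. In this paper, we propose a novel method to correct Vietnamese spelling errors. We tackle the problems of mistyped errors and misspelled errors using a deep learning model. In particular, the embedding layer is powered by the byte pair encoding technique. The sequence-to-sequence model based on the Transformer architecture distinguishes our approach from previous approaches to the same problem. In the experiments, we train the model on a large synthetic dataset into which spelling errors are randomly introduced. We test the performance of the proposed method on a realistic dataset. This dataset contains 11,202 human-made misspellings in 9,341 different Vietnamese sentences. The experimental results show that our method achieves encouraging performance, with 86.8% of errors detected and 81.5% corrected, improving on the state-of-the-art approach by 5.6% and 2.2%, respectively.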
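The task of Chinese Spelling Check (CSC) aims to detect and correct spelling errors found in text. Because manually annotating a high-quality dataset is expensive and time-consuming, training datasets are usually very small (e.g., SIGHAN15 contains only 2339 samples for training). Supervised-learning-based models therefore usually suffer from data sparsity and over-fitting, especially in the era of large language models. In this paper, we investigate the unsupervised paradigm for the CSC problem and propose a framework named uChecker to conduct unsupervised spelling error detection and correction. Masked pretrained language models such as BERT are introduced as the backbone model, considering their powerful language diagnosis capability. Benefiting from the varied and flexible masking operations, we propose a confusion-set-guided masking strategy to fine-train the masked language model and further improve the performance of unsupervised detection and correction. Experimental results on standard datasets demonstrate the effectiveness of our proposed model uChecker in terms of character-level and sentence-level accuracy, precision, recall, and F1 on the spelling error detection and correction tasks, respectively.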
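Recently, the speech community has seen a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve state-of-the-art results on most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems today. Many practical factors affect the decision to deploy a model in production. Traditional hybrid models, having been optimized for production for decades, are usually good at these factors. Without excellent solutions to all of these factors, it is hard for E2E models to be widely commercialized. In this paper, we overview recent advances in E2E models, focusing on technologies that address these challenges from an industry perspective.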
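Although modern automatic speech recognition (ASR) systems can achieve high performance, they may produce errors that weaken the reader's experience and harm downstream tasks. To improve the accuracy and reliability of ASR hypotheses, we propose a cross-modal post-processing system for speech recognizers, which 1) fuses acoustic features and textual features from different modalities, 2) joins a confidence estimator and an error corrector in a multi-task learning fashion, and 3) unifies the error correction and utterance rejection modules. Compared with single-modal or single-task models, our proposed system proves to be more effective and efficient. Experimental results show that our post-processing system yields a 10% relative error reduction for both single-speaker and multi-speaker speech on an industrial ASR system, with a latency of about 1.7 ms per token, ensuring that the extra latency introduced by post-processing is acceptable in streaming speech recognition.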
Discriminative pre-trained language models (PLMs) learn to predict original texts from intentionally corrupted ones. Taking the original texts as positive samples and the corrupted ones as negative samples, the PLM can be trained effectively for contextualized representation. However, the training of this type of PLM relies heavily on the quality of the automatically constructed samples. Existing PLMs simply treat all corrupted texts as equally negative without any examination, so the resulting model inevitably suffers from the false negative issue, where training is carried out on pseudo-negative data and leads to lower efficiency and robustness in the resulting PLMs. In this work, after defining the long-ignored false negative issue in discriminative PLMs, we design enhanced pre-training methods that counteract false negative predictions and encourage pre-training language models on true negatives by correcting the harmful gradient updates caused by false negative predictions. Experimental results on the GLUE and SQuAD benchmarks show that our counter-false-negative pre-training methods bring better performance together with stronger robustness.
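Connectionist temporal classification (CTC)-based models are attractive in automatic speech recognition (ASR) because of their non-autoregressive nature. To take advantage of text-only data, language model (LM) integration approaches such as rescoring and shallow fusion have been widely used for CTC. However, they slow down inference and thereby lose CTC's non-autoregressive nature. In this study, we propose an error correction method using a phone-conditioned masked LM (PC-MLM). In the proposed method, less confident word tokens in the greedy-decoded output from CTC are masked. PC-MLM then predicts these masked word tokens given the unmasked words and phones supplementally predicted from CTC. We further extend it to Deletable PC-MLM in order to address insertion errors. Since both CTC and PC-MLM are non-autoregressive models, the method enables fast LM integration. Experimental evaluations on the Corpus of Spontaneous Japanese (CSJ) and TED-LIUM2 in a domain adaptation setting show that our proposed method outperforms rescoring and shallow fusion in terms of inference speed, and also in terms of recognition accuracy on CSJ.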
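The masking step in the PC-MLM abstract (mask the less confident word tokens of a greedy CTC output before the masked LM fills them in) can be sketched as below. The confidence values, the `threshold` of 0.9, the `[MASK]` symbol, and the function name are assumptions for illustration; the abstract does not state how confidence is computed or thresholded.

```python
from typing import List


def mask_low_confidence(tokens: List[str], confidences: List[float],
                        threshold: float = 0.9,
                        mask: str = "[MASK]") -> List[str]:
    """Replace tokens whose confidence is below the threshold with a mask.

    threshold and mask are hypothetical illustration values.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    return [mask if conf < threshold else tok
            for tok, conf in zip(tokens, confidences)]


if __name__ == "__main__":
    words = ["the", "cat", "sad", "down"]
    conf = [0.99, 0.95, 0.42, 0.93]
    print(mask_low_confidence(words, conf))
```

A masked LM would then predict only the masked positions in one parallel pass, which is what keeps the overall pipeline non-autoregressive.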
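Post-editing in automatic speech recognition (ASR) entails automatically correcting common and systematic errors produced by the ASR system. The outputs of an ASR system are largely prone to phonetic and spelling errors. In this paper, we propose to use a powerful pre-trained sequence-to-sequence model, BART, further adaptively trained to serve as a denoising model, to correct errors of these types. The adaptive training is performed on an augmented dataset obtained by synthetically inducing errors as well as by incorporating actual errors from an existing ASR system. We also propose a simple approach to rescore the outputs using word-level alignments. Experimental results on accented speech data demonstrate that our strategy effectively corrects a significant number of ASR errors and produces improved results when compared against a competitive baseline. We also highlight a negative result obtained on the related grammatical error correction task in Hindi, showing the limitation of our proposed model in capturing wider context.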
The lack of labeled data is one of the significant bottlenecks for Chinese Spelling Check (CSC). Existing work expands the supervised corpus by automatically generating data from unlabeled text. However, there is a big gap between the real input scenario and the automatically generated corpus. Thus, we develop a competitive general speller, ECSpell, which adopts an Error Consistent masking strategy to create data for pretraining. This error-consistent masking strategy specifies the error types of automatically generated sentences so that they are consistent with real-world usage. The experimental results indicate that our model outperforms previous state-of-the-art models on the general benchmark. Moreover, spellers often work within a particular domain in real life. Experiments on the domain-specific datasets we built show that general models perform poorly because of the many uncommon domain terms. Inspired by the common practice of input methods, we propose adding an alterable user dictionary to handle the zero-shot domain adaptation problem. Specifically, we attach a User Dictionary guided inference module (UD) to a general token-classification-based speller. Our experiments demonstrate that ECSpell$^{UD}$, namely ECSpell combined with UD, surpasses all other baselines by a large margin, even approaching the performance on the general benchmark.
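In almost all text generation applications, word sequences are constructed in a left-to-right (L2R) or right-to-left (R2L) manner, as natural language sentences are written either L2R or R2L. However, we find that the natural language written order is not essential for text generation. In this paper, we propose Spiral Language Modeling (SLM), a general approach that enables one to construct natural language sentences beyond the L2R and R2L orders. SLM allows one to start from an arbitrary token inside the result text and expand the remaining tokens around the selected one. It makes the decoding order a new optimization objective besides language model perplexity, which further improves the diversity and quality of the generated text. Furthermore, SLM makes it possible to manipulate the text construction process by selecting a proper starting token. SLM also introduces generation orderings as additional regularization to improve model robustness in low-resource scenarios. Experiments on 8 widely studied neural machine translation (NMT) tasks show that SLM yields an increase of up to 4.7 BLEU compared with the conventional L2R decoding approach.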
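Error correction techniques remain effective for refining the outputs of automatic speech recognition (ASR) models. Existing end-to-end error correction methods based on an encoder-decoder architecture process all tokens in the decoding phase, creating undesirable latency. In this paper, we propose an ASR error correction method that uses predictions of correction operations. More specifically, we construct a predictor between the encoder and the decoder to learn whether a token should be kept ("K"), deleted ("D"), or changed ("C"), so that decoding is restricted to only part of the input sequence embeddings (the "C" tokens) for fast inference. Experiments on three public datasets demonstrate the effectiveness of the proposed approach in reducing the latency of the decoding process in ASR correction. Compared with a solid encoder-decoder baseline, our two proposed models improve inference speed by at least three times (3.4 and 5.7 times) while maintaining the same level of accuracy (WER reductions of 0.53% and 1.69%, respectively). In addition, we produce and release a benchmark dataset for the ASR error correction community to foster research along this line.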
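The keep/delete/change scheme in the abstract above can be illustrated with a short sketch. It is a hypothetical illustration rather than the paper's model: the operation labels are taken as given, and `replacements` stands in for whatever the decoder would produce for each "C" position. The point it shows is that only the "C" positions need decoding.

```python
from typing import List


def positions_to_decode(ops: List[str]) -> List[int]:
    """Indices of tokens labeled "C", the only ones sent to the decoder."""
    return [i for i, op in enumerate(ops) if op == "C"]


def apply_operations(tokens: List[str], ops: List[str],
                     replacements: List[List[str]]) -> List[str]:
    """Rebuild the corrected sentence from K/D/C labels.

    replacements holds one (hypothetical) decoder output per "C" token,
    in left-to-right order.
    """
    if len(tokens) != len(ops):
        raise ValueError("tokens and ops must have the same length")
    if ops.count("C") != len(replacements):
        raise ValueError("need exactly one replacement per 'C' token")
    out: List[str] = []
    rep_iter = iter(replacements)
    for tok, op in zip(tokens, ops):
        if op == "K":
            out.append(tok)              # keep the token unchanged
        elif op == "D":
            continue                     # drop the token
        elif op == "C":
            out.extend(next(rep_iter))   # substitute decoder output
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return out


if __name__ == "__main__":
    hyp = ["i", "red", "the", "the", "book"]
    ops = ["K", "C", "K", "D", "K"]
    print(positions_to_decode(ops))
    print(apply_operations(hyp, ops, [["read"]]))
```

Because "K" and "D" tokens bypass the decoder entirely, decoding cost in this sketch scales with the number of "C" tokens rather than the sentence length.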
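Transformers have recently dominated the ASR field. Although able to yield good performance, they involve an autoregressive (AR) decoder that generates tokens one by one, which is computationally inefficient. To speed up inference, non-autoregressive (NAR) methods, e.g. single-step NAR, were designed to enable parallel generation. However, due to an independence assumption among the output tokens, the performance of single-step NAR is inferior to that of AR models, especially with a large-scale corpus. There are two challenges in improving single-step NAR: first, accurately predicting the number of output tokens and extracting hidden variables; second, enhancing the modeling of interdependence between output tokens. To tackle both challenges, we propose a fast and accurate parallel transformer, termed Paraformer. It uses a continuous integrate-and-fire based predictor to predict the number of tokens and generate hidden variables. A glancing language model (GLM) sampler then generates semantic embeddings to enhance the NAR decoder's ability to model context interdependence. Finally, we design a strategy to generate negative samples for minimum word error rate training to further improve performance. Experiments on the public AISHELL-1 and AISHELL-2 benchmarks and an industrial-level 20,000-hour task show that the proposed Paraformer attains performance comparable to the state-of-the-art AR transformer, with more than 10x speedup.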
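The continuous integrate-and-fire (CIF) mechanism named in the Paraformer abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Paraformer's predictor: the per-frame weights are taken as given inputs in [0, 1] (so at most one token fires per frame), the firing threshold is 1.0, and the function name `cif` is hypothetical. Weights are accumulated frame by frame, and each time the running sum reaches the threshold one token-level vector is emitted.

```python
from typing import List


def cif(weights: List[float], features: List[List[float]],
        threshold: float = 1.0) -> List[List[float]]:
    """Integrate frame features by weight and fire one vector per threshold.

    Assumes each weight lies in [0, threshold], so a frame fires at most once.
    """
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    dim = len(features[0]) if features else 0
    acc_w = 0.0                  # accumulated weight since the last firing
    acc_f = [0.0] * dim          # weighted feature sum since the last firing
    fired: List[List[float]] = []
    for w, feat in zip(weights, features):
        if acc_w + w < threshold:
            # Not enough weight yet: keep integrating.
            acc_w += w
            acc_f = [a + w * x for a, x in zip(acc_f, feat)]
        else:
            # Use just enough of this frame to reach the threshold, then fire.
            used = threshold - acc_w
            fired.append([a + used * x for a, x in zip(acc_f, feat)])
            # The remainder of the frame's weight starts the next token.
            rest = w - used
            acc_w = rest
            acc_f = [rest * x for x in feat]
    return fired


if __name__ == "__main__":
    alphas = [0.5, 0.5, 0.25, 0.75]
    frames = [[1.0], [3.0], [2.0], [4.0]]
    print(cif(alphas, frames))
```

In this sketch the number of fired vectors approximates the sum of the weights, which is how a CIF-style predictor can estimate the output token count and produce one hidden vector per token in a single pass.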
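Autoregressive (AR) and non-autoregressive (NAR) models each have their own advantages in performance and latency, and combining them in one model may take advantage of both. Current combination frameworks focus more on integrating multiple decoding paradigms with a unified generative model, e.g. a masked language model. However, this generalization can harm performance because of the gap between the training objective and inference. In this paper, we aim to close the gap by preserving the original objectives of AR and NAR under a unified framework. Specifically, we propose the Directional Transformer (Diformer), which jointly models AR and NAR as three generation directions (left-to-right, right-to-left, and straight) with a newly introduced direction variable that controls the prediction of each token so that it has the specific dependencies of that direction. The unification achieved through direction successfully preserves the original dependency assumptions used in AR and NAR, retaining both generalization and performance. Experiments on 4 WMT benchmarks demonstrate that Diformer outperforms current united-modeling works by more than 1.5 BLEU points for both AR and NAR decoding, and is also competitive with state-of-the-art independent AR and NAR models.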
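Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on less-well-resourced languages. However, these methods rely on manually curated post-correction data, which is relatively scarce compared with the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to use these raw images to improve performance, specifically through self-training, a technique in which a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15-29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements. Data and code are available at https://shrutirij.github.io/ocr-el/.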
End-to-end speech recognition models trained using joint Connectionist Temporal Classification (CTC)-Attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because their conditional independence assumption prevents output tokens from previous time steps from influencing future predictions. To tackle this, we propose a novel two-way approach that first biases the encoder with attention over a predefined list of rare long-tail and out-of-vocabulary (OOV) words, and then uses dynamic boosting and a phone alignment network during decoding to further bias the subword predictions. We evaluate our approach on the open-source VoxPopuli dataset and in-house medical datasets, showing a 60% improvement in F1 score on domain-specific rare words over a strong CTC baseline.
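Text recognition is a long-standing research problem in document digitization. Existing approaches are usually built on CNNs for image understanding and RNNs for character-level text generation. In addition, another language model is usually needed as a post-processing step to improve the overall accuracy. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on printed, handwritten, and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.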