“总机基准”是自动语音识别(ASR)研究中众所周知的测试集,为声称人类水平转录精度的系统建立了创纪录的性能。这项工作突出了该评估的鲜为人知的实际考虑,这表明了单词错误率(WER)的重大提高,通过纠正参考转录并偏离官方评分方法。在这个更详细和可再现的方案中,即使是商业ASR系统也可以评分低于5%,并且研究系统的既定记录降低到2.3%。提出了一个替代的成绩单精度指标,该指标不会惩罚缺失,并且似乎对人类与机器性能更具歧视性。尽管商业ASR系统仍低于此阈值,但研究系统被证明可以清楚地超过商业人类言语识别的准确性。这项工作还使用标准化的评分工具来探讨通过在替代方案列表中选择最佳的计算Oracle WER。将短语替代表示形式与话语级n-tesp列表和单词级数据结构进行比较。使用密集的晶格并添加量量表的单词,这使Oracle达到0.18%。
translated by 谷歌翻译
上下文ASR将偏见项列表与音频一起列出,随着ASR使用变得更加普遍,最近引起了最新的兴趣。我们正在发布上下文偏见列表,以伴随Enation21数据集,为此任务创建公共基准。我们使用WENET工具包中预处理的端到端ASR模型在此基准测试上介绍了基线结果。我们显示了应用于两种不同解码算法的浅融合上下文偏置的结果。我们的基线结果证实了观察到的观察,即端到端模型尤其是在训练过程中很少见或从未见过的单词,并且现有的浅融合技术不能充分解决这个问题。我们提出了一个替代拼写预测模型,与没有其他拼写的上下文偏见相比,相对相对,将稀有单词相对34.7%,而访问量的单词相对97.2%。该模型在概念上与先前工作中使用的模型相似,但是更容易实现,因为它不依赖发音字典或现有的文本对语音系统。
translated by 谷歌翻译
自动语音识别和文本到语音系统主要以监督方式培训,需要高质量,准确标记的语音数据集。在这项工作中,我们研究语音数据的常见问题,并为语音数据集的构建和交互式错误分析引入工具箱。施工工具基于K \“urzinger等。工作,并且,尽我们所知,数据集探索工具是世界上第一个这类开源工具。我们演示了如何应用这些工具来创建一个俄语语音数据集并分析现有语音数据集(多语种LibrisPeech,Mozilla Common语音)。该工具是开放的,作为Nemo框架的一部分。
translated by 谷歌翻译
Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried between an air traffic controller (ATCO) and pilots via very-high frequency radio channels. In order to incorporate these novel technologies into ATC (low-resource domain), large-scale annotated datasets are required to develop the data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, signal-to-noise ratio estimate and English language detection score per sample. Both available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset from the original test set corpus, that we are offering for free at https://www.atco2.org/data. We expect the ATCO2 corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.
translated by 谷歌翻译
我们提出了一种用于计算自动语音识别(ASR)中错误率的新方法。这个新的指标是针对包含半字符的语言,可以以不同形式编写相同的字符。我们在印地语中实施了我们的方法论,这是指示上下文中的主要语言之一,我们认为这种方法可扩展到包含大型字符集的其他类似语言。我们称我们的指标替代单词错误率(AWER)和替代字符错误率(ACER)。我们使用wav2Vec 2.0 \ cite {baevski2020wav2vec}训练我们的ASR模型。此外,我们使用语言模型来改善我们的模型性能。我们的结果表明,在分析单词和角色级别的错误率方面有了显着提高,ASR系统的可解释性提高了高达$ 3 $ \%的AWER,印地语的ACER $ 7 $ \%。我们的实验表明,在具有复杂发音的语言中,有多种写单词而不改变其含义的方式。在这种情况下,Awer和Acer将更有用,而不是将其作为指标。此外,我们通过新的公制脚本为印地语开了一个21小时的新基准测试数据集。
translated by 谷歌翻译
使用未知数量的扬声器数量的单通道远场录制的自动语音识别(ASR)传统上由级联模块解决。最近的研究表明,与模块化系统相比,端到端(E2E)多扬声器ASR模型可以实现卓越的识别准确性。但是,这些模型不会确保由于其对完整音频上下文的依赖性而实时适用性。这项工作采用实时适用性,作为模型设计的第一优先级,并解决了以前的多扬声器经常性神经网络传感器(MS-RNN-T)的几个挑战。首先,我们在训练期间介绍一般的重叠言论模拟,在LibrisPeechMix测试集上产生14%的相对字错误率(WER)改进。其次,我们提出了一种新的多转RNN-T(MT-RNN-T)模型,其具有基于重叠的目标布置策略,其概括为任意数量的扬声器,而没有模型架构的变化。我们调查在Liblics测试集上培训训练期间看到的最大扬声器数量的影响,并在两位扬声器MS-RNN-T上报告28%的相对加速。第三,我们试验丰富的转录战略,共同承认和分割多方言论。通过深入分析,我们讨论所提出的系统的潜在陷阱以及未来的未来研究方向。
translated by 谷歌翻译
Transcription of legal proceedings is very important to enable access to justice. However, speech transcription is an expensive and slow process. In this paper we describe part of a combined research and industrial project for building an automated transcription tool designed specifically for the Justice sector in the UK. We explain the challenges involved in transcribing court room hearings and the Natural Language Processing (NLP) techniques we employ to tackle these challenges. We will show that fine-tuning a generic off-the-shelf pre-trained Automatic Speech Recognition (ASR) system with an in-domain language model as well as infusing common phrases extracted with a collocation detection model can improve not only the Word Error Rate (WER) of the transcribed hearings but avoid critical errors that are specific of the legal jargon and terminology commonly used in British courts.
translated by 谷歌翻译
构建可用的无线电监控自动语音识别(ASR)系统是资源不足的语言的一项挑战性任务,但这在广播是公众沟通和讨论的主要媒介的社会中至关重要。联合国在乌干达的最初努力证明了如何理解被社交媒体排除在社交媒体中的农村人的看法在国家规划中很重要。但是,由于缺乏转录的语音数据集,这些努力正受到挑战。在本文中,Makerere人工智能研究实验室发布了155小时的Luganda Radio演讲语料库。据我们所知,这是撒哈拉以南非洲第一个公开可用的广播数据集。本文描述了语音语料库的开发,并使用开源语音识别工具包Coqui STT Toolkit提出了基线Luganda ASR绩效结果。
translated by 谷歌翻译
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
translated by 谷歌翻译
In this work we propose a novel token-based training strategy that improves Transformer-Transducer (T-T) based speaker change detection (SCD) performance. The conventional T-T based SCD model loss optimizes all output tokens equally. Due to the sparsity of the speaker changes in the training data, the conventional T-T based SCD model loss leads to sub-optimal detection accuracy. To mitigate this issue, we use a customized edit-distance algorithm to estimate the token-level SCD false accept (FA) and false reject (FR) rates during training and optimize model parameters to minimize a weighted combination of the FA and FR, focusing the model on accurately predicting speaker changes. We also propose a set of evaluation metrics that align better with commercial use cases. Experiments on a group of challenging real-world datasets show that the proposed training method can significantly improve the overall performance of the SCD model with the same number of parameters.
translated by 谷歌翻译
上下文偏见是端到端自动语音识别(ASR)系统的一项重要且具有挑战性现有方法主要包括上下文lm偏置,并将偏置编码器添加到端到端的ASR模型中。在这项工作中,我们介绍了一种新颖的方法,通过在端到端ASR系统之上添加上下文拼写校正模型来实现上下文偏见。我们将上下文信息与共享上下文编码器合并到序列到序列拼写校正模型中。我们提出的模型包括两种不同的机制:自动回旋(AR)和非自动回旋(NAR)。我们提出过滤算法来处理大尺寸的上下文列表以及性能平衡机制,以控制模型的偏置程度。我们证明所提出的模型是一种普遍的偏见解决方案,它是对域的不敏感的,可以在不同的情况下采用。实验表明,所提出的方法在ASR系统上的相对单词错误率(WER)降低多达51%,并且优于传统偏见方法。与AR溶液相比,提出的NAR模型可将模型尺寸降低43.2%,并将推断加速2.1倍。
translated by 谷歌翻译
扬声器日流是一个标签音频或视频录制的任务,与扬声器身份或短暂的任务标记对应于扬声器标识的类,以识别“谁谈到何时发表讲话”。在早期,对MultiSpeaker录音的语音识别开发了扬声器日益衰退算法,以使扬声器自适应处理能够实现扬声器自适应处理。这些算法还将自己的价值作为独立应用程序随着时间的推移,为诸如音频检索等下游任务提供特定于扬声器的核算。最近,随着深度学习技术的出现,这在讲话应用领域的研究和实践中引起了革命性的变化,对扬声器日益改善已经进行了快速进步。在本文中,我们不仅审查了扬声器日益改善技术的历史发展,而且还审查了神经扬声器日益改善方法的最新进步。此外,我们讨论了扬声器日复速度系统如何与语音识别应用相结合,以及最近深度学习的激增是如何引领联合建模这两个组件互相互补的方式。通过考虑这种令人兴奋的技术趋势,我们认为本文对社区提供了有价值的贡献,以通过巩固具有神经方法的最新发展,从而促进更有效的扬声器日益改善进一步进展。
translated by 谷歌翻译
Automatic Speech Recognition (ASR) for air traffic control is generally trained by pooling Air Traffic Controller (ATCO) and pilot data into one set. This is motivated by the fact that pilot's voice communications are more scarce than ATCOs. Due to this data imbalance and other reasons (e.g., varying acoustic conditions), the speech from ATCOs is usually recognized more accurately than from pilots. Automatically identifying the speaker roles is a challenging task, especially in the case of the noisy voice recordings collected using Very High Frequency (VHF) receivers or due to the unavailability of the push-to-talk (PTT) signal, i.e., both audio channels are mixed. In this work, we propose to (1) automatically segment the ATCO and pilot data based on an intuitive approach exploiting ASR transcripts and (2) subsequently consider an automatic recognition of ATCOs' and pilots' voice as two separate tasks. Our work is performed on VHF audio data with high noise levels, i.e., signal-to-noise (SNR) ratios below 15 dB, as this data is recognized to be helpful for various speech-based machine-learning tasks. Specifically, for the speaker role identification task, the module is represented by a simple yet efficient knowledge-based system exploiting a grammar defined by the International Civil Aviation Organization (ICAO). The system accepts text as the input, either manually verified annotations or automatically generated transcripts. The developed approach provides an average accuracy in speaker role identification of about 83%. Finally, we show that training an acoustic model for ASR tasks separately (i.e., separate models for ATCOs and pilots) or using a multitask approach is well suited for the noisy data and outperforms the traditional ASR system where all data is pooled together.
translated by 谷歌翻译
单词错误率(WER)是用于评估自动语音识别(ASR)模型质量的主要度量。已经表明,与典型的英语说话者相比,ASR模型的语音障碍者的扬声器往往更高。在如此高的错误率下,很难确定模型是否可以很有用。这项研究调查了BertScore的使用,BertScore是文本生成的评估指标,以提供对ASR模型质量和实用性的更有信息度量。将Bertscore和WER与语言病理学家手动注释以进行错误类型和评估手动注释的预测错误。发现Bertscore与人类的误差类型和评估评估更相关。在保留含义的拼字法变化(收缩和归一化误差)上,Bertscore特别强大。此外,使用顺序逻辑回归和Akaike的信息标准(AIC)测量,Bertscore比WER更好地评估了错误评估。总体而言,我们的发现表明,从实际角度评估ASR模型性能时,Bertscore可以补充,尤其是对于可访问性应用程序,即使模型的精度也比典型语音较低的模型也很有用。
translated by 谷歌翻译
人民的言论是自由下载的30,000小时,并在CC-BY-SA下进行学术和商业用途的许可的受监管的会话英语语音识别数据集(具有CC-by子集)。通过使用现有转录搜索适当许可的音频数据来通过搜索互联网来收集数据。我们描述了我们的数据收集方法,并在Apache 2.0许可证下发布了我们的数据收集系统。我们表明,在此数据集上培训的模型在Librispeech的测试清洁测试集上实现了9.98%的单词错误率。最后,我们讨论了围绕创建一个相当大量的机器学习的法律和道德问题,并计划继续维护项目的计划根据MLCommons的赞助。
translated by 谷歌翻译
我们提出Vakyansh,这是一种用指示语言识别语音识别的端到端工具包。印度拥有近121种语言和大约125亿扬声器。然而,大多数语言在数据和预验证的模型方面都是低资源。通过Vakyansh,我们介绍了自动数据管道,用于数据创建,模型培训,模型评估和部署。我们以23个指示语言和Train Wav2Vec 2.0预验证的模型创建14,000小时的语音数据。然后,对这些预审预告措施的模型进行了修订,以创建18个指示语言的最先进的语音识别模型,其次是语言模型和标点符号修复模型。我们以使命开源所有这些资源,这将激发语音社区使用ASR模型以指示语言开发语音的首次应用程序。
translated by 谷歌翻译
本文介绍了WenetsPeech,一个由10000多小时的高质量标记语音组成的多域普通话语料库,2400多小时弱贴言论,大约100万小时的语音,总共22400多小时。我们收集来自YouTube和Podcast的数据,涵盖各种演讲样式,场景,域名,主题和嘈杂的条件。引入了基于光学字符识别(OCR)的方法,以在其对应的视频字幕上为YouTube数据生成音频/文本分段候选,而高质量的ASR转录系统用于为播客数据生成音频/文本对候选。然后我们提出了一种新的端到端标签错误检测方法,可以进一步验证和过滤候选者。我们还提供三个手动标记的高质量测试集,以及WenetsPeech进行评估 - 开发用于训练中的交叉验证目的,从互联网收集的匹配测试,并从真实会议中记录的测试\ _MEETING,以获得更具挑战性的不匹配测试。使用有线exeeEX培训的基线系统,用于三个流行的语音识别工具包,即Kaldi,Espnet和Wenet,以及三个测试集的识别结果也被提供为基准。据我们所知,WenetsPeech是目前最大的开放式普通话语音语料库,其中有利于生产级语音识别的研究。
translated by 谷歌翻译
接受社会辅助机器人的基本功能之一是其与环境中其他代理商的通信能力。在Robin项目的背景下,调查了通过与机器人的语音互动的情境对话。本文介绍了具有深度神经网络的不同语音识别实验,专注于生产快速(从网络本身的100ms延迟下),而仍然可靠的型号。即使关键所需特性之一是低延迟,最终的深度神经网络模型也能实现识别罗马尼亚语的最新状态,以获得9.91%的字错误率(WER),当与语言模型相结合,从而改善以前的结果同时提供了改进的运行时性能。此外,我们探索了两个模块,用于校正ASR输出(连字符和大写恢复和未知单词校正),针对Robin项目的目标(在封闭的微观世界中对话)。我们根据API设计模块化架构,允许整合引擎(机器人或外部)根据需要将可用模块链接在一起。最后,我们通过将其集成在相关平台中并通过上传文件或录制新的语音来测试所提出的设计。
translated by 谷歌翻译
In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of what could be the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically- vs. manually-generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs allows SLU systems to improve in comparison to the 1-best setup (4% relative improvement). However, crossmodal approaches, i.e., learning from acoustic and text embeddings, obtains performance similar to the oracle setup, and a relative improvement of 18% over the 1-best configuration. Thus, crossmodal architectures represent a good alternative to overcome the limitations of working purely automatically generated textual data.
translated by 谷歌翻译
移动设备正在转换人们与计算机交互的方式,以及对应用程序的语音接口更重要。最近发布的自动语音识别系统非常准确,但通常需要强大的机械(专业图形处理单元)推断,这使得它们在商品设备上运行不切实际,特别是在流模式下运行。通过对(Khassanov等人,2021)的基线哈萨克斯坦模型的推理时间(Khassanov等,2021)的推理时间留下了深刻的印象,我们训练了一个新的基线声学模型(在与上述纸上相同的数据集上)和三种语言模型用于COQUI STT框架。结果看起来很有希望,但进一步训练和参数扫描的时期,或者是限制ASR系统必须支持的词汇,以达到生产水平精度。
translated by 谷歌翻译