Automatic speech recognition (ASR) systems for air traffic control are generally trained by pooling air traffic controller (ATCO) and pilot data into one set. This is motivated by the fact that pilots' voice communications are scarcer than ATCOs'. Due to this data imbalance and other reasons (e.g., varying acoustic conditions), speech from ATCOs is usually recognized more accurately than speech from pilots. Automatically identifying the speaker roles is a challenging task, especially for noisy voice recordings collected with very-high-frequency (VHF) receivers or when the push-to-talk (PTT) signal is unavailable, i.e., both audio channels are mixed. In this work, we propose to (1) automatically segment the ATCO and pilot data with an intuitive approach that exploits ASR transcripts and (2) subsequently treat the automatic recognition of ATCOs' and pilots' voices as two separate tasks. Our work is performed on VHF audio data with high noise levels, i.e., signal-to-noise ratios (SNR) below 15 dB, as this data is recognized to be helpful for various speech-based machine-learning tasks. Specifically, for the speaker role identification task, the module is a simple yet efficient knowledge-based system that exploits a grammar defined by the International Civil Aviation Organization (ICAO). The system accepts text as input, either manually verified annotations or automatically generated transcripts. The developed approach provides an average accuracy in speaker role identification of about 83%. Finally, we show that training acoustic models for ASR separately (i.e., separate models for ATCOs and pilots) or using a multitask approach is well suited to the noisy data and outperforms the traditional ASR system where all data is pooled together.
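As a rough illustration of what a text-based, knowledge-driven role classifier could look like, the sketch below uses one phraseology convention: controllers usually open a transmission with the aircraft callsign, while pilots usually close a readback with it. The word lists, the function name `identify_role`, and the fallback cues are assumptions made for this example and are not the grammar used in the paper.

```python
# Minimal sketch of a rule-based speaker role classifier working on transcripts.
# All word lists below are illustrative assumptions, not the paper's grammar.

AIRLINE_DESIGNATORS = {"lufthansa", "ryanair", "speedbird", "swiss"}
PILOT_CUES = {"wilco", "request", "requesting"}
ATCO_CUES = {"identified", "cleared", "contact"}


def identify_role(transcript: str) -> str:
    """Return 'atco', 'pilot', or 'unknown' for a single transcript."""
    tokens = transcript.lower().split()
    if not tokens:
        return "unknown"

    # Rule 1: position of the callsign (approximated by the airline designator).
    positions = [i for i, tok in enumerate(tokens) if tok in AIRLINE_DESIGNATORS]
    if positions:
        first = positions[0]
        if first == 0:
            return "atco"   # callsign first: typical controller instruction
        if first >= len(tokens) / 2:
            return "pilot"  # callsign near the end: typical pilot readback

    # Rule 2: fall back to weak lexical cues.
    if any(tok in PILOT_CUES for tok in tokens):
        return "pilot"
    if any(tok in ATCO_CUES for tok in tokens):
        return "atco"
    return "unknown"


if __name__ == "__main__":
    print(identify_role("lufthansa three two one descend flight level eight zero"))
    print(identify_role("descending flight level eight zero lufthansa three two one"))
```

A real system would need a much larger callsign inventory and would have to cope with ASR errors in exactly those words, which is one reason the reported accuracy is well below 100%.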
Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very-high-frequency radio channels. To incorporate these novel technologies into ATC, a low-resource domain, large-scale annotated datasets are required to develop data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research in the challenging ATC field, which has lagged behind due to a lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, a signal-to-noise ratio estimate and an English language detection score per sample. Both are available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus that we are offering for free at https://www.atco2.org/data. We expect the ATCO2 corpus to foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.
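The per-sample metadata described above (SNR estimate and English detection score) lends itself to simple quality filtering of the pseudo-labelled data before training. The sketch below shows the idea; the `Sample` fields, the score range, and both thresholds are assumptions made for illustration and do not reflect the corpus's actual file layout.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Sample:
    """Hypothetical record for one pseudo-labelled utterance."""
    sample_id: str
    transcript: str       # automatic transcript from the in-domain recognizer
    snr_db: float         # signal-to-noise ratio estimate in dB
    english_score: float  # English language detection score, assumed in [0, 1]


def select_samples(samples: Iterable[Sample],
                   min_snr_db: float = 10.0,
                   min_english_score: float = 0.5) -> List[Sample]:
    """Keep samples that pass both (illustrative) quality thresholds."""
    return [s for s in samples
            if s.snr_db >= min_snr_db and s.english_score >= min_english_score]


if __name__ == "__main__":
    data = [
        Sample("a", "swiss two four descend", 14.0, 0.9),
        Sample("b", "(noisy)", 3.0, 0.8),
        Sample("c", "(non-english)", 18.0, 0.1),
    ]
    print([s.sample_id for s in select_samples(data)])
```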
This paper describes a simple yet efficient repetition-based modular system for speeding up air traffic controller (ATCo) training. For example, a human pilot is still required in EUROCONTROL's ESCAPE lite simulator (see https://www.eurocontrol.int/simulator/escape) during ATCo training. However, this need can be met by an automatic system that acts as a pilot. In this paper, we aim to develop and integrate a pseudo-pilot agent into the ATCo training pipeline by merging diverse artificial intelligence (AI) powered modules. The system understands the voice communications issued by the ATCo and, in turn, generates a spoken response to the initial communication that follows pilot phraseology. Our system mainly relies on open-source AI tools and air traffic control (ATC) databases, which demonstrates its simplicity and ease of replication. The overall pipeline is composed of the following: (1) a submodule that receives and pre-processes the input stream of raw audio; (2) an automatic speech recognition (ASR) system that transforms audio into a sequence of words; (3) a high-level ATC-related entity parser, which extracts relevant information from the communication, i.e., callsigns and commands; and finally, (4) a speech synthesizer submodule that generates responses based on the high-level ATC entities previously extracted. Overall, we show that this system could pave the way toward developing a real proof-of-concept pseudo-pilot system, speeding up the training of ATCos while drastically reducing its overall cost.
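The four-stage pipeline can be sketched as a chain of small functions. Everything below is a hypothetical skeleton: the ASR and speech synthesis stages are passed in as callables because the abstract does not name the tools used, and the keyword list, the `Entities` type, and the readback rule (command first, callsign last) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Illustrative command keywords; a real parser would use a trained entity model.
COMMAND_WORDS = {"climb", "descend", "turn", "contact", "maintain", "reduce", "cleared"}


@dataclass
class Entities:
    callsign: str
    command: str


def preprocess(audio: Sequence[float]) -> List[float]:
    """Stage 1: peak-normalise raw samples (stand-in for real pre-processing)."""
    peak = max((abs(s) for s in audio), default=0.0)
    return [s / peak for s in audio] if peak > 0 else list(audio)


def parse_entities(words: str) -> Entities:
    """Stage 3: split the word sequence into callsign and command."""
    tokens = words.lower().split()
    for i, tok in enumerate(tokens):
        if tok in COMMAND_WORDS:
            return Entities(" ".join(tokens[:i]), " ".join(tokens[i:]))
    return Entities(" ".join(tokens), "")


def readback_text(entities: Entities) -> str:
    """Build the pilot's reply: command first, callsign last."""
    if not entities.command:
        return f"say again {entities.callsign}".strip()
    return f"{entities.command} {entities.callsign}".strip()


def pseudo_pilot(audio: Sequence[float],
                 asr: Callable[[List[float]], str],
                 tts: Callable[[str], object]) -> object:
    """Run stages 1 to 4; `asr` (stage 2) and `tts` (stage 4) are supplied by the caller."""
    words = asr(preprocess(audio))
    return tts(readback_text(parse_entities(words)))
```

Passing `asr` and `tts` as parameters keeps the skeleton independent of any particular recognizer or synthesizer, which matches the modular design the abstract describes.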
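A cornerstone of AI research has been the creation and adoption of standardized training and test datasets to mark the progress of state-of-the-art models. A particularly successful example is the GLUE dataset for training and evaluating natural language understanding (NLU) models for English. The large body of research around BERT-based language models has revolved around performance improvements on the NLU tasks in GLUE. To evaluate language models in other languages, several language-specific GLUE datasets were created. The area of speech language understanding (SLU) has followed a similar trajectory. The success of large self-supervised models such as wav2vec2 enables the creation of speech models from relatively easy-to-access unlabelled data. These models can then be evaluated on SLU tasks, such as the SUPERB benchmark. In this work, we extend this to Indic languages by releasing the IndicSUPERB benchmark. Specifically, we make the following three contributions. (i) We collect Kathbath, containing 1,684 hours of labelled speech data across 12 Indian languages from 1,218 contributors located in 203 districts in India. (ii) Using Kathbath, we create benchmarks across 6 speech tasks: automatic speech recognition, speaker verification, speaker identification (mono/multi), language identification, query by example, and keyword spotting for 12 languages. (iii) On the released benchmarks, we train and evaluate different self-supervised models alongside the commonly used FBANK baseline. We show that language-specific fine-tuned models are more accurate than the baseline on most tasks, including a large gap of 76% for the language identification task. However, for speaker identification, self-supervised models trained on large datasets demonstrate an advantage. We hope IndicSUPERB contributes to the progress of developing speech language understanding models for Indian languages.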
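Deep learning techniques have been shown to be effective in various tasks, especially in the development of speech recognition systems, that is, systems that aim to transcribe a spoken sentence into a sequence of written words. Despite the progress in the area, speech recognition can still be considered difficult, especially for languages lacking available data, such as Brazilian Portuguese (BP). In this sense, this work presents the development of a public automatic speech recognition (ASR) system using only openly available audio data, obtained by fine-tuning the Wav2vec 2.0 XLSR-53 model, pre-trained on many languages, on BP data. The final model presents an average word error rate of 12.4% over 7 different datasets (10.5% when applying a language model). To the best of our knowledge, this is the best result for BP among open ASR systems.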
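Automatic speech recognition and text-to-speech systems are primarily trained in a supervised fashion and require high-quality, accurately labeled speech datasets. In this work, we examine common problems with speech data and introduce a toolbox for the construction and interactive error analysis of speech datasets. The construction tool is based on the work of Kürzinger et al., and, to the best of our knowledge, the dataset exploration tool is the world's first open-source tool of this kind. We demonstrate how to apply these tools to create a Russian speech dataset and to analyze existing speech datasets (Multilingual LibriSpeech, Mozilla Common Voice). The tools are open-sourced as part of the NeMo framework.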
Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of training data is limited. In this work we propose M2DS2, a simple and sample-efficient finetuning strategy for large pretrained speech models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For evaluation, we collect HParl, a 120-hour speech corpus for Greek, consisting of plenary sessions in the Greek Parliament. We merge HParl with two popular Greek corpora to create GREC-MD, a test-bed for multi-domain evaluation of Greek ASR systems. In our experiments we find that, while other Unsupervised Domain Adaptation baselines fail in this resource-constrained environment, M2DS2 yields significant improvements for cross-domain adaptation, even when only a few hours of in-domain audio are available. When we relax the problem to a weakly supervised setting, we find that independent adaptation for audio using M2DS2 and for language using simple LM augmentation techniques is particularly effective, yielding word error rates comparable to the fully supervised baselines.
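Automatic speech recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. In particular, for the Brazilian Portuguese (BP) language, about 376 hours were publicly available for ASR tasks until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 hours. The existing resources, however, are composed of audio containing only read and prepared speech. There is a lack of datasets including spontaneous speech, which is essential in different ASR applications. This paper presents CORAA (Corpus of Annotated Audios) v1, a publicly available dataset for ASR in BP with 290.77 hours of validated pairs (audio and transcription). CORAA also contains European Portuguese audio (4.69 hours). We also present a public ASR model based on Wav2Vec 2.0 XLSR-53 and fine-tuned on CORAA. Our model achieved a word error rate of 24.18% on the CORAA test set and 20.08% on the Common Voice test set. When measuring the character error rate, we obtained 11.02% and 6.34% for CORAA and Common Voice, respectively. The CORAA corpora were assembled both to improve ASR models in BP with phenomena from spontaneous speech and to motivate young researchers to start their studies on ASR for Portuguese. All the corpora are publicly available at https://github.com/nilc-nlp/coraa under the CC BY-NC-ND 4.0 license.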
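For readers unfamiliar with the two metrics reported above, the following sketch computes word error rate and character error rate as the Levenshtein distance between reference and hypothesis divided by the reference length. It is a generic textbook formulation, not the evaluation script used by the authors; in particular, this version counts spaces as characters in the CER and applies no text normalisation, and both choices are assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because insertions are counted, both rates can exceed 100% when the hypothesis is much longer than the reference.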
This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models. We show that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself. We are also releasing Kaldi scripts that make it easy to build these systems.
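Spoken language understanding (SLU) is a core task in most human-machine interaction systems. With the emergence of smart homes, smartphones and smart speakers, SLU has become a key technology for the industry. In a classical SLU approach, an automatic speech recognition (ASR) module transcribes the speech signal into a textual representation from which a natural language understanding (NLU) module extracts semantic information. Recently, end-to-end SLU (E2E SLU) based on deep neural networks has gained momentum, since it benefits from the joint optimization of the ASR and NLU parts and hence limits the cascade of errors of the pipeline architecture. However, little is known about the actual linguistic properties used by E2E models to predict concepts and intents from speech input. In this paper, we present a study identifying the signal features and other linguistic properties used by an E2E model to perform the SLU task. The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands. The results show that good E2E SLU performance does not always require perfect ASR capability. Furthermore, the results show the superior capabilities of the E2E model in handling background noise and syntactic variation compared to the pipeline model. Finally, a finer-grained analysis suggests that the E2E model uses the pitch information of the input signal to identify voice command concepts. The results and methodology outlined in this paper provide a springboard for further analyses of E2E models in speech processing.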
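Building a usable radio-monitoring automatic speech recognition (ASR) system is a challenging task for under-resourced languages, yet it is paramount in societies where radio is the main medium of public communication and discussion. Initial efforts by the United Nations in Uganda have shown how important it is for national planning to understand the perceptions of rural people who are excluded from social media. However, these efforts are being challenged by the absence of transcribed speech datasets. In this paper, the Makerere Artificial Intelligence research lab releases a Luganda radio speech corpus of 155 hours. To our knowledge, this is the first publicly available radio dataset in sub-Saharan Africa. The paper describes the development of the speech corpus and presents baseline Luganda ASR performance results using the Coqui STT toolkit, an open-source speech recognition toolkit.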
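We present Vakyansh, an end-to-end toolkit for speech recognition in Indic languages. India is home to almost 121 languages and around 1.25 billion speakers. Yet most of the languages are low-resource in terms of data and pretrained models. Through Vakyansh, we introduce automatic data pipelines for data creation, model training, model evaluation and deployment. We create 14,000 hours of speech data in 23 Indic languages and train wav2vec 2.0 based pretrained models. These pretrained models are then fine-tuned to create state-of-the-art speech recognition models for 18 Indic languages, which are followed by language models and punctuation restoration models. We open-source all these resources with the mission of inspiring the speech community to develop speech-first applications using our ASR models in Indic languages.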
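Developing speech technologies is a challenge for low-resource languages for which both annotated and raw speech data are sparse. Maltese is one such language. Recent years have seen increased interest in the computational processing of Maltese, including speech technologies, but resources for the latter remain sparse. In this paper, we consider data augmentation techniques for improving speech recognition for such languages, focusing on Maltese as a test case. We consider three different types of data augmentation: unsupervised training, multilingual training and the use of synthesized speech as training data. The goal is to determine which of these techniques, or which combination of them, is the most effective for improving speech recognition for languages where the starting point is a small corpus of approximately 7 hours of transcribed speech. Our results show that combining the three data augmentation techniques studied here leads to an absolute improvement of 15% without the use of a language model.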
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
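Automatic speech recognition (ASR) for low-resource languages improves the access of linguistic minorities to the technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for Hong Kong Cantonese by creating a new Cantonese dataset. Our dataset, the Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It comprises the philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and run experiments on the two largest ones (MDCC and Common Voice zh-HK). We analyze the existing datasets according to their speech type, data source, total size and availability. The results of experiments with the Fairseq S2T Transformer, a state-of-the-art ASR model, show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.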
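This paper presents the design and development of multi-dialect automatic speech recognition for Arabic. Deep neural networks are becoming an effective tool for solving sequential data problems, particularly when the system is trained end to end. Arabic speech recognition is a complex task because of the existence of multiple dialects, the unavailability of large corpora, and missing vocalization. Thus, the first contribution of this work is the development of a large multi-dialectal corpus with fully or at least partially vocalized transcriptions. Additionally, the open-source corpus has been gathered from multiple sources, and the non-standard Arabic alphabets in the transcriptions are normalized by defining a common character set. The second contribution is the development of a framework for training an acoustic model that achieves state-of-the-art performance. The network architecture comprises a combination of convolutional and recurrent layers. Spectrogram features of the audio data are extracted in the frequency versus time domain and fed into the network. The output frames produced by the recurrent model are further trained to align the audio features with their corresponding transcription sequences. The sequence alignment is performed using a beam search decoder with a tetra-gram language model. The proposed system achieved a 14% error rate, which outperforms previous systems.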
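Speaker diarization is the task of labeling audio or video recordings with classes that correspond to speaker identity, or in short, the task of identifying "who spoke when". In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker-adaptive processing. Over time, these algorithms also gained value as standalone applications that provide speaker-specific meta-information for downstream tasks such as audio retrieval. More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practice across speech application domains, rapid advances have been made in speaker diarization. In this paper, we review not only the historical development of speaker diarization technology but also the recent advances in neural speaker diarization approaches. Furthermore, we discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is leading the way toward jointly modeling these two components so that they complement each other. Considering these exciting technical trends, we believe that this paper is a valuable contribution to the community, consolidating the recent developments with neural methods and thus facilitating further progress toward more efficient speaker diarization.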
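Self-supervised learning (SSL) has achieved great success in speech recognition, while only limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information, including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. In this way, WavLM not only keeps the speech content modeling capability through masked speech prediction, but also improves its potential for non-ASR tasks through speech denoising. In addition, WavLM employs a gated relative position bias in the Transformer structure to better capture the sequence ordering of the input speech. We also scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark and brings significant improvements for various speech processing tasks on their representative benchmarks. The code and pre-trained models are available at https://aka.ms/wavlm.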
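This paper presents a new Romanian speech corpus from the ROBIN project, called the ROBIN Technical Acquisition Speech Corpus (ROBINTASC). Its main purpose is to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. The paper contains a detailed description of the acquisition process, corpus statistics, and an evaluation of the corpus's influence on a low-latency ASR system as well as on a dialogue component.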
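Many automatic speech recognition (ASR) datasets include a single pre-defined test set consisting of one or more speakers whose speech never appears in the training set. This "hold-speaker(s)-out" data partitioning strategy, however, may not be ideal for datasets in which the number of speakers is very small. This study investigates ten different data split methods for five languages with minimal ASR training resources. We find that (1) model performance varies depending on which speaker is selected for testing; (2) the average word error rate (WER) across all held-out speakers is comparable not only to the average WER over multiple random splits but also to that of any given individual random split; (3) WER is also generally comparable when the data is split heuristically or adversarially; (4) utterance duration and intensity are comparatively more predictive factors of variability, regardless of the data split. These results suggest that the widely used hold-speakers-out approach to ASR data partitioning can yield results that do not reflect model performance on unseen data or speakers. Random splits can yield more reliable and generalizable estimates when facing data sparsity.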
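The two families of partitioning strategies compared above can be sketched in a few lines. This is an illustrative sketch only: the utterance record layout (a dict with a `"speaker"` key), the function names, and the default test fraction are assumptions, and the study's ten concrete split methods are not reproduced here.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple

Utterance = Dict[str, object]  # hypothetical record, e.g. {"id": 3, "speaker": "spk1"}


def held_out_speaker_split(utterances: Sequence[Utterance],
                           test_speakers: Set[str]
                           ) -> Tuple[List[Utterance], List[Utterance]]:
    """Put every utterance of the chosen speakers in the test set."""
    train = [u for u in utterances if u["speaker"] not in test_speakers]
    test = [u for u in utterances if u["speaker"] in test_speakers]
    return train, test


def random_split(utterances: Sequence[Utterance],
                 test_fraction: float = 0.2,
                 seed: int = 0
                 ) -> Tuple[List[Utterance], List[Utterance]]:
    """Shuffle utterances and cut off a test portion, ignoring speaker identity."""
    rng = random.Random(seed)
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Repeating `random_split` with several seeds and averaging the resulting WERs gives the kind of multi-split estimate that the abstract contrasts with a single held-out speaker.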