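In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, with 22,400+ hours in total. We collect the data from YouTube and Podcast, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate audio/text segmentation candidates for the YouTube data from the corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. We then propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labeled high-quality test sets along with WenetSpeech for evaluation: Dev, for cross-validation during training; Test_Net, collected from the Internet for matched testing; and Test_Meeting, recorded from real meetings for more challenging mismatched testing. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is currently the largest open-source Mandarin speech corpus, which benefits research on production-level speech recognition.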
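This paper introduces a new corpus for Mandarin-English code-switching speech recognition, the TALCS corpus, suitable for training and evaluating code-switching speech recognition systems. The TALCS corpus is derived from real online one-to-one English teaching scenarios at TAL Education Group and contains roughly 587 hours of speech sampled at 16 kHz. To the best of our knowledge, TALCS is the largest well-labeled open-source Mandarin-English code-switching automatic speech recognition (ASR) dataset in the world. In this paper, we describe the recording procedure in detail, including the audio capturing devices and the corpus environments. The TALCS corpus is freely available for download under a permissive license. Using the TALCS corpus, we conduct ASR experiments with two popular speech recognition toolkits, ESPnet and WeNet, to build baseline systems. The mixed error rate (MER) performance of the two toolkits is compared on TALCS. The experimental results indicate that the quality of the audio recordings and transcriptions is promising and that the baseline systems are workable.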
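As an illustration of the metric named above, the sketch below computes a mixed error rate for Mandarin-English text under a common convention: each Mandarin character counts as one token and each English word counts as one token. The tokenization rule, the lowercasing, and the function names (`tokenize_mixed`, `edit_distance`, `mixed_error_rate`) are assumptions made for this example; the abstract does not specify the scoring details used for TALCS.

```python
import re

# Assumption: one CJK character is one token; one run of Latin letters
# (apostrophes allowed) is one token. Punctuation and digits are ignored.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def tokenize_mixed(text):
    """Split code-switched text into Mandarin characters and English words."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists, using a single DP row."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        # prev_diag holds the value diagonally up-left of the current cell.
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(
                row[j] + 1,             # deletion of a reference token
                row[j - 1] + 1,         # insertion of a hypothesis token
                prev_diag + (r != h),   # substitution, or match at no cost
            )
            prev_diag, row[j] = row[j], cur
    return row[-1]


def mixed_error_rate(reference, hypothesis):
    """Token errors divided by the number of reference tokens."""
    ref = tokenize_mixed(reference)
    hyp = tokenize_mixed(hypothesis)
    if not ref:
        raise ValueError("reference contains no scorable tokens")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: six reference tokens, one English word misrecognized.
mer = mixed_error_rate("我想要 apple 和 banana", "我想要 apply 和 banana")
print(f"MER = {mer:.3f}")
```

Scoring Mandarin by character and English by word avoids needing a Chinese word segmenter, which is why this convention is often preferred for code-switched data.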
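Automatic speech recognition and text-to-speech systems are primarily trained in a supervised fashion and require high-quality, accurately labeled speech datasets. In this work, we examine common problems with speech data and introduce a toolbox for the construction and interactive error analysis of speech datasets. The construction tool is based on the work of Kürzinger et al., and, to the best of our knowledge, the dataset exploration tool is the world's first open-source tool of its kind. We demonstrate how to apply these tools to create a Russian speech dataset and to analyze existing speech datasets (Multilingual LibriSpeech, Mozilla Common Voice). The tools are open-sourced as part of the NeMo framework.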
Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very-high frequency radio channels. In order to incorporate these novel technologies into ATC (a low-resource domain), large-scale annotated datasets are required to develop the data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, signal-to-noise ratio estimate and English language detection score per sample. Both are available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus, which we offer for free at https://www.atco2.org/data. We expect the ATCO2 corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.
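The People's Speech is a free-to-download, 30,000-hour supervised conversational English speech recognition dataset licensed for academic and commercial use under CC-BY-SA (with a CC-BY subset). The data is collected by searching the Internet for appropriately licensed audio data with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache 2.0 license. We show that a model trained on this dataset achieves a 9.98% word error rate on LibriSpeech's test-clean test set. Finally, we discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpus, and our plans for continued maintenance of the project under MLCommons's sponsorship.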
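Recently, we made available WeNet, a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which incorporates future contextual information through a right-to-left attention decoder to improve the representational ability of the shared encoder and the performance of the rescoring stage. (2) We introduce an n-gram based language model and a WFST-based decoder into WeNet 2.0, promoting the use of rich text data in production scenarios. (3) We design a unified contextual biasing framework, which leverages user-specific context (e.g., contact lists) to provide rapid adaptation ability for production and improves ASR accuracy in both with-LM and without-LM scenarios. (4) We design a unified IO to support large-scale data for effective model training. In summary, the brand-new WeNet 2.0 achieves up to 10% relative recognition performance improvement over the original WeNet on various corpora and makes available several important production-oriented features.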
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of training data is limited. In this work we propose M2DS2, a simple and sample-efficient finetuning strategy for large pretrained speech models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For evaluation, we collect HParl, a 120-hour speech corpus for Greek, consisting of plenary sessions in the Greek Parliament. We merge HParl with two popular Greek corpora to create GREC-MD, a test-bed for multi-domain evaluation of Greek ASR systems. In our experiments we find that, while other Unsupervised Domain Adaptation baselines fail in this resource-constrained environment, M2DS2 yields significant improvements for cross-domain adaptation, even when only a few hours of in-domain audio are available. When we relax the problem to a weakly supervised setting, we find that independent adaptation for audio using M2DS2 and for language using simple LM augmentation techniques is particularly effective, yielding word error rates comparable to the fully supervised baselines.
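Recent methods in speech and language technology pretrain very large models which are then fine-tuned for specific tasks. However, the benefits of such large models are often limited to a few resource-rich languages of the world. In this work, we make multiple contributions towards building ASR systems for low-resource languages from the Indian subcontinent. First, we curate 17,000 hours of raw speech data for 40 Indian languages from a wide variety of domains including education, news, technology, and finance. Second, using this raw speech data we pretrain several variants of wav2vec-style models for 40 Indian languages. Third, we analyze the pretrained models to find key features: codebook vectors of similar-sounding phonemes are shared across languages, representations across layers are discriminative of the language family, and attention heads often attend within small local windows. Fourth, we fine-tune downstream ASR models for 9 languages and obtain state-of-the-art results on 3 public datasets, including very low-resource languages such as Sinhala and Nepali. Our work establishes that multilingual pretraining is an effective strategy for building ASR systems for the linguistically diverse speakers of the Indian subcontinent.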
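This paper presents the design and development of multi-dialect automatic speech recognition for Arabic. Deep neural networks are becoming an effective tool for solving sequential data problems, particularly when the system is trained end to end. Arabic speech recognition is a complex task because of the existence of multiple dialects, the non-availability of large corpora, and missing vocalization. The first contribution of this work is therefore the development of a large multi-dialectal corpus with fully or at least partially vocalized transcription. In addition, the open-source corpus has been gathered from multiple sources, and the non-standard Arabic alphabets in the transcriptions are normalized by defining a common character set. The second contribution is the development of a framework for training an acoustic model that achieves state-of-the-art performance. The network architecture comprises a combination of convolutional and recurrent layers. Spectrogram features of the audio data are extracted in the frequency vs. time domain and fed into the network. The output frames produced by the recurrent model are further trained to align the audio features with their corresponding transcription sequences. The sequence alignment is performed using a beam search decoder with a tetra-gram language model. The proposed system achieves a 14% error rate, outperforming previous systems.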
This paper describes a simple yet efficient repetition-based modular system for speeding up the training of air-traffic controllers (ATCos). For example, a human pilot is still required in EUROCONTROL's ESCAPE lite simulator (see https://www.eurocontrol.int/simulator/escape) during ATCo training. However, this need can be met by an automatic system that acts as a pilot. In this paper, we aim to develop and integrate a pseudo-pilot agent into the ATCo training pipeline by merging diverse artificial intelligence (AI) powered modules. The system understands the voice communications issued by the ATCo and, in turn, generates a spoken prompt that follows the pilot's phraseology in response to the initial communication. Our system mainly relies on open-source AI tools and air traffic control (ATC) databases, which demonstrates its simplicity and ease of replicability. The overall pipeline is composed of the following: (1) a submodule that receives and pre-processes the input stream of raw audio; (2) an automatic speech recognition (ASR) system that transforms audio into a sequence of words; (3) a high-level ATC-related entity parser, which extracts relevant information from the communication, i.e., callsigns and commands; and finally, (4) a speech synthesizer submodule that generates responses based on the high-level ATC entities previously extracted. Overall, we show that this system could pave the way toward developing a real proof-of-concept pseudo-pilot system, speeding up the training of ATCos while drastically reducing its overall cost.
This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models. We show that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself. We are also releasing Kaldi scripts that make it easy to build these systems.
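Building a usable radio monitoring automatic speech recognition (ASR) system is a challenging task for under-resourced languages, yet it is paramount in societies where radio is the main medium of public communication and discussion. Initial efforts by the United Nations in Uganda have shown how important it is for national planning to understand the perceptions of rural people who are excluded from social media. However, these efforts are being challenged by the absence of transcribed speech datasets. In this paper, the Makerere Artificial Intelligence research lab releases a Luganda radio speech corpus of 155 hours. To our knowledge, this is the first publicly available radio dataset in sub-Saharan Africa. The paper describes the development of the speech corpus and presents baseline Luganda ASR performance results using the Coqui STT toolkit, an open-source speech recognition toolkit.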
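We present Vakyansh, an end-to-end toolkit for speech recognition in Indic languages. India is home to almost 121 languages and around 1.25 billion (125 crore) speakers. Yet most of these languages are low-resource in terms of data and pretrained models. Through Vakyansh, we introduce automatic data pipelines for data creation, model training, model evaluation and deployment. We create 14,000 hours of speech data in 23 Indic languages and train wav2vec 2.0 based pretrained models. These pretrained models are then fine-tuned to create state-of-the-art speech recognition models for 18 Indic languages, which are followed by language models and punctuation restoration models. We open-source all these resources with the mission of inspiring the speech community to develop speech-first applications using our ASR models in Indic languages.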
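Automatic speech recognition (ASR) for low-resource languages improves the access of linguistic minorities to the technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for Hong Kong Cantonese by creating a new Cantonese dataset. Our dataset, the Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It spans the philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size and availability. We conduct experiments on the two largest datasets (MDCC and Common Voice zh-HK) with the Fairseq S2T Transformer, a state-of-the-art ASR model, and the results show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.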
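Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding tasks, including the use of end-to-end models, but there are fewer annotated datasets for such tasks. At the same time, recent work shows the possibility of pre-training generic representations and then fine-tuning for several tasks using relatively little labeled data. We propose to create a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE) consisting of limited-size labeled training sets and corresponding evaluation sets. This resource would allow the research community to track progress, evaluate pre-trained representations for higher-level tasks, and study open questions such as the utility of pipeline versus end-to-end approaches. We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets. We focus on naturally produced (not read or synthesized) speech and on freely available datasets. We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.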
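Automatic speech recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. For the Brazilian Portuguese (BP) language in particular, about 376 hours of audio were publicly available for ASR tasks as of the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 hours. The existing resources, however, are composed of audios containing only read and prepared speech. There is a lack of datasets that include spontaneous speech, which is essential in different ASR applications. This paper presents CORAA (Corpus of Annotated Audios) v1, a publicly available dataset for ASR in BP with 290.77 hours of validated audio-transcription pairs. CORAA also contains European Portuguese audios (4.69 hours). We also present a public ASR model based on Wav2Vec 2.0 XLSR-53 and fine-tuned on CORAA. Our model achieves a word error rate of 24.18% on the CORAA test set and 20.08% on the Common Voice test set. When measuring the character error rate, we obtain 11.02% and 6.34% for CORAA and Common Voice, respectively. The CORAA corpora were assembled both to improve ASR models in BP with phenomena from spontaneous speech and to motivate young researchers to start their studies on ASR for Portuguese. All the corpora are publicly available at https://github.com/nilc-nlp/coraa under the CC BY-NC-ND 4.0 license.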
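The two metrics reported above differ only in the unit being scored. A minimal sketch, assuming the standard definition (edit distance divided by reference length) and assuming that spaces are dropped for character scoring; the function names and the example sentence are invented for illustration and are not taken from the CORAA evaluation scripts:

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Word error rate (unit='word') or character error rate (unit='char')."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        # Assumption: whitespace is not scored as a character.
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference is empty")
    return levenshtein(ref, hyp) / len(ref)


# Made-up Portuguese example with a single wrong word ("no" -> "na").
reference = "o gato dorme no sofá"
hypothesis = "o gato dorme na sofá"
wer = error_rate(reference, hypothesis, unit="word")
cer = error_rate(reference, hypothesis, unit="char")
print(f"WER = {wer:.2%}, CER = {cer:.2%}")
```

One wrong letter makes a whole word wrong, so the character error rate is usually lower than the word error rate on the same output, consistent with the pattern in the figures above.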
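We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained on large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve on SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset size, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representations of pre-trained networks to achieve SoTA results on non-ASR tasks.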
Speech translation (ST) is the task of directly translating acoustic speech signals in a source language into text in a foreign language. The ST task has long been addressed using a pipeline approach with two modules: first an automatic speech recognition (ASR) system in the source language, followed by a text-to-text machine translation (MT) system. In the past few years, we have seen a paradigm shift towards end-to-end approaches using sequence-to-sequence deep neural network models. This paper presents our efforts towards the development of the first Broadcast News end-to-end Arabic to English speech translation system. Starting from independent ASR and MT LDC releases, we were able to identify about 92 hours of Arabic audio recordings for which the manual transcription was also translated into English at the segment level. These data were used to train and compare pipeline and end-to-end speech translation systems under multiple scenarios, including transfer learning and data augmentation techniques.
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will be publishing for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparisons. In this work, we present the details of data collection and annotation and the performance of the baseline models. We also perform sensitivity analysis of pipeline models' performance (speech recognizer + text model) to the speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.