我们旨在使用大量自动转录语音来改进口语建模(LM)。我们利用INA(法国国家视听学院)的收藏,并在350,000小时的电视节目中应用ASR后获得19GB的文本。由此,通过微调现有的LM(FLAUBERT)或通过从头开始训练LM来培训口语模型。新模型(Flaubert-Oral)与社区共享,并评估了3个下游任务:口语理解,电视节目的分类和语音句法解析。结果表明,与最初的Flaubert版本相比,Flaubert-Oral可能是有益的,表明尽管其固有的嘈杂性,但ASR生成的文本仍可用于构建口头语言模型。
translated by 谷歌翻译
通过共享数据集和基准,已经促进了语音处理的进展。历史上,这些都集中在自动语音识别(ASR),扬声器标识或其他较低级别的任务上。兴趣在更高层次的口语中越来越多,理解任务,包括使用端到端模型,但是此类任务的注释数据集较少。与此同时,最近的工作显示了预先培训通用表示的可能性,然后使用相对较少标记的数据进行微调的多个任务。我们建议为口语语言理解(屠宰)创建一套基准任务,由有限尺寸标记的培训集和相应的评估集组成。该资源将允许研究界跟踪进度,评估高级任务的预先接受预期的表示,并研究开放的问题,例如管道与端到端方法的实用性。我们介绍了雪橇基准套件的第一阶段,包括指定实体识别,情感分析和相应数据集上的ASR。我们专注于自然产生的(未读取或综合)语音和自由可用的数据集。我们为VoxceReb和Voxpopuli数据集的子集提供新的转录和注释,基线模型的评估指标和结果,以及重现基线的开源工具包,并评估新模型。
translated by 谷歌翻译
开发语音技术是对低资源语言的挑战,其中注释和原始语音数据稀疏。马耳他是一种这样的语言。近年来,对马耳他的计算处理有所增加,包括语音技术,但后者的资源仍然稀疏。在本文中,我们考虑提高这些语言的语音识别的数据增强技术,专注于马耳他作为测试用例。我们考虑三种不同类型的数据增强:无监督的培训,多语言培训和合成演讲的使用作为培训数据。目标是确定这些技术或它们的组合,是改善起始点是大约7小时转录语音的语言的语言的最有效。我们的结果表明,在这里研究了三种数据增强技术,导致我们在不使用语言模型的情况下实现15%的绝对增长。
translated by 谷歌翻译
在本文中,我们介绍了从包含超过80,000个小时的未标记的语音的大型数据集预处理捷克单语音频变压器方面的进展,随后使用内域数据组合对自动语音识别任务进行微调,并对模型进行微调。6000小时的跨域转录语音。我们在两个公共数据集(CommunVoice和Voxpopuli)和Malach Project中的一个非常具有挑战性的数据集中评估了各种微调设置的大量实验调色板。我们的结果表明,单语WAV2VEC 2.0模型是强大的ASR系统,它可以利用大型标记和未标记的数据集并成功与最先进的LVCSR系统竞争。此外,当没有用于目标ASR任务的培训数据时,WAV2VEC模型被证明是很好的零射门学习者。
translated by 谷歌翻译
Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried between an air traffic controller (ATCO) and pilots via very-high frequency radio channels. In order to incorporate these novel technologies into ATC (low-resource domain), large-scale annotated datasets are required to develop the data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, signal-to-noise ratio estimate and English language detection score per sample. Both available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset from the original test set corpus, that we are offering for free at https://www.atco2.org/data. We expect the ATCO2 corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.
translated by 谷歌翻译
确保适当的标点符号和字母外壳是朝向应用复杂的自然语言处理算法的关键预处理步骤。这对于缺少标点符号和壳体的文本源,例如自动语音识别系统的原始输出。此外,简短的短信和微博的平台提供不可靠且经常错误的标点符号和套管。本调查概述了历史和最先进的技术,用于恢复标点符号和纠正单词套管。此外,突出了当前的挑战和研究方向。
translated by 谷歌翻译
口语理解(SLU)是大多数人机相互作用系统中的核心任务。随着智能家居,智能手机和智能扬声器的出现,SLU已成为该行业的关键技术。在经典的SLU方法中,自动语音识别(ASR)模块将语音信号转录为文本表示,自然语言理解(NLU)模块从中提取语义信息。最近,基于深神经网络的端到端SLU(E2E SLU)已经获得了动力,因为它受益于ASR和NLU部分的联合优化,因此限制了管道架构的误差效应的级联反应。但是,对于E2E模型用于预测语音输入的概念和意图的实际语言特性知之甚少。在本文中,我们提出了一项研究,以确定E2E模型执行SLU任务的信号特征和其他语言特性。该研究是在必须处理非英语(此处法语)语音命令的智能房屋的应用领域进行的。结果表明,良好的E2E SLU性能并不总是需要完美的ASR功能。此外,结果表明,与管道模型相比,E2E模型在处理背景噪声和句法变化方面具有出色的功能。最后,更细粒度的分析表明,E2E模型使用输入信号的音调信息来识别语音命令概念。本文概述的结果和方法提供了一个跳板,以进一步分析语音处理中的E2E模型。
translated by 谷歌翻译
捷克语是一种非常特殊的语言,因为它在形式和口语形式之间的差异很大。虽然正式(书面)形式主要用于官方文件,文学和公开演讲,但通言(口语)表格在休闲演讲中被广泛使用。该差距引入了ASR系统的严重问题,尤其是在培训或评估包含大量口语语音(例如Malach Project)的数据集上的ASR模型时。在本文中,我们正在根据端到端ASR系统中的新范式解决这个问题,最近引入了自我监督的音频变压器。具体而言,我们正在研究口语语音对WAV2VEC 2.0模型性能的影响及其直接转录口语演讲的能力。我们在培训成绩单,语言模型和评估笔录中以正式和口语形式提出结果。
translated by 谷歌翻译
我们研究应用语言模型(LM)对指示语言自动语音识别(ASR)系统输出的影响。我们微调WAV2VEC $ 2.0 $型号的$ 18 $指示性语言,并通过根据各种来源派生的文本训练的语言模型调整结果。我们的发现表明,平均字符错误率(CER)降低了$ 28 $ \%,平均单词错误率(WER)在解码LM后降低了$ 36 $ \%。我们表明,与多样化的LM相比,大型LM可能无法提供实质性的改进。我们还证明,可以在特定于域的数据上获得高质量的转录,而无需重新培训ASR模型并显示了生物医学领域的结果。
translated by 谷歌翻译
在过去的五年中,基于自动变压器的体系结构的兴起导致了许多自然语言任务的最新表现。尽管这些方法越来越受欢迎,但它们需要大量的数据和计算资源。在数据范围的应用程序条件下,在资源不足的语言上,基准测试方法仍然非常需要对方法进行基准测试。大多数预训练的语言模型都使用英语进行了大规模研究,其中只有少数在法语上进行了评估。在本文中,我们提出了一个统一的基准测试,重点是评估模型质量及其对两个法语口语理解任务的生态影响。尤其是我们基于13个完善的基于变压器的模型基于法语的两个可用语言理解任务:媒体和ATIS-FR。在此框架内,我们表明紧凑的模型可以与较大的模型达到可比的结果,而生态影响却大大降低。但是,此假设是细微的,取决于考虑的压缩方法。
translated by 谷歌翻译
Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition task, respectively.
translated by 谷歌翻译
已经证明了深度学习技术在各种任务中有效,特别是在语音识别系统的发展中,即旨在以一系列写词中的音频句子转录音频句子的系统。尽管该地区进展,但语音识别仍然可以被认为是困难的,特别是对于缺乏可用数据的语言,例如巴西葡萄牙语(BP)。从这个意义上讲,这项工作介绍了仅使用打开可用的音频数据的公共自动语音识别(ASR)系统的开发,从Wav2Vec 2.0 XLSR-53模型的微调,在许多语言中,通过BP数据进行了多种。最终模型在7个不同的数据集中呈现12.4%的平均误差率(在应用语言模型时10.5%)。根据我们的知识,这是开放ASR系统中BP的最佳结果。
translated by 谷歌翻译
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will be publishing for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparisons. In this work, we present the details of data collection and annotation and the performance of the baseline models. We also perform sensitivity analysis of pipeline models' performance (speech recognizer + text model) to the speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.
translated by 谷歌翻译
最近的言语和语言技术的方法预先rain非常大型模型,用于特定任务。然而,这种大型模型的好处通常仅限于世界上少数资源丰富的语言。在这项工作中,我们对来自印度次大陆的低资源语言构建ASR系统进行多种贡献。首先,我们从各种领域策划40个印度语言的17,000小时的原始语音数据,包括教育,新闻,技术和金融。其次,使用这种原始语音数据,我们预先存在于40个印度语言的Wav2Vec样式模型的多个变体。第三,我们分析佩带的模型以查找关键特点:码本矢量的类似探测音素在语言中共享,跨层的表示是语言系列的判别,并且注意力头通常会在小型本地窗口中注意。第四,我们微调了9种语言的下游ASR模型,并在3个公共数据集上获得最先进的结果,包括非常低的资源语言,如Sinhala和Nepali。我们的工作建立了多语言预介质是建立ASR系统的有效策略,为印度次大陆的语言上不同的扬声器建立ASR系统。
translated by 谷歌翻译
链接的开放数据实践导致了过去十年中网络上结构化数据的显着增长。这样的结构化数据以机器可读的方式描述了现实世界实体,并为自然语言处理领域的研究创造了前所未有的机会。但是,缺乏有关如何使用此类数据,哪种任务以及它们在多大程度上对这些任务有用的研究。这项工作着重于电子商务领域,以探索利用此类结构化数据来创建可能用于产品分类和链接的语言资源的方法。我们以RDF N四分之一的形式处理数十亿个结构化数据点,以创建数百万个与产品相关的语料库单词,后来以三种不同的方式用于创建语言资源:培训单词嵌入模型,继续预训练类似于Bert的语言模型和训练机器翻译模型,这些模型被用作生成产品相关的关键字的代理。我们对大量基准测试的评估表明,嵌入单词是提高这两个任务准确性的最可靠和一致的方法(在某些数据集中,宏观 - 平均F1中最高6.9个百分点)。但是,其他两种方法并不那么有用。我们的分析表明,这可能是由于许多原因,包括结构化数据中的偏置域表示以及缺乏词汇覆盖范围。我们分享我们的数据集,并讨论如何将我们所学到的经验教训朝着这一方向介绍未来的研究。
translated by 谷歌翻译
The BERT family of neural language models have become highly popular due to their ability to provide sequences of text with rich context-sensitive token encodings which are able to generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with an mBERT model for the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.
translated by 谷歌翻译
自我监督学习(SSL)在语音识别方面取得了巨大的成功,而有限的探索已尝试完成其他语音处理任务。由于语音信号包含多方面的信息,包括说话者身份,副语言学,口语内容等,学习所有语音任务的通用表示都具有挑战性。为了解决该问题,我们提出了一个新的预培训模型WAVLM,以解决全堆栈的下游语音任务。 Wavlm共同学习了蒙面的语音预测和预训练。通过这种方式,WAVLM不仅可以通过掩盖的语音预测来保持语音内容建模能力,而且还可以通过语音denoing来提高非ASR任务的潜力。此外,WAVLM还采用封闭式的变压器结构的封闭相对位置偏置,以更好地捕获输入语音的序列排序。我们还将培训数据集从60k小时扩展到94K小时。 WAVLM大型在精湛的基准上实现了最先进的性能,并在其代表性基准上为各种语音处理任务带来了重大改进。代码和预培训模型可在https://aka.ms/wavlm上找到。
translated by 谷歌翻译
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
translated by 谷歌翻译
This paper describes a simple yet efficient repetition-based modular system for speeding up air-traffic controllers (ATCos) training. E.g., a human pilot is still required in EUROCONTROL's ESCAPE lite simulator (see https://www.eurocontrol.int/simulator/escape) during ATCo training. However, this need can be substituted by an automatic system that could act as a pilot. In this paper, we aim to develop and integrate a pseudo-pilot agent into the ATCo training pipeline by merging diverse artificial intelligence (AI) powered modules. The system understands the voice communications issued by the ATCo, and, in turn, it generates a spoken prompt that follows the pilot's phraseology to the initial communication. Our system mainly relies on open-source AI tools and air traffic control (ATC) databases, thus, proving its simplicity and ease of replicability. The overall pipeline is composed of the following: (1) a submodule that receives and pre-processes the input stream of raw audio, (2) an automatic speech recognition (ASR) system that transforms audio into a sequence of words; (3) a high-level ATC-related entity parser, which extracts relevant information from the communication, i.e., callsigns and commands, and finally, (4) a speech synthesizer submodule that generates responses based on the high-level ATC entities previously extracted. Overall, we show that this system could pave the way toward developing a real proof-of-concept pseudo-pilot system. Hence, speeding up the training of ATCos while drastically reducing its overall cost.
translated by 谷歌翻译
Modern speech recognition systems exhibits rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where diversity of training data is limited. In this work we propose M2DS2, a simple and sample-efficient finetuning strategy for large pretrained speech models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For evaluation, we collect HParl, a $120$ hour speech corpus for Greek, consisting of plenary sessions in the Greek Parliament. We merge HParl with two popular Greek corpora to create GREC-MD, a test-bed for multi-domain evaluation of Greek ASR systems. In our experiments we find that, while other Unsupervised Domain Adaptation baselines fail in this resource-constrained environment, M2DS2 yields significant improvements for cross-domain adaptation, even when a only a few hours of in-domain audio are available. When we relax the problem in a weakly supervised setting, we find that independent adaptation for audio using M2DS2 and language using simple LM augmentation techniques is particularly effective, yielding word error rates comparable to the fully supervised baselines.
translated by 谷歌翻译