Automatic Speech Recognition (ASR) systems typically yield output in lexical form. However, humans prefer output in written form. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). In previous works, Weighted Finite State Transducers (WFST) have been employed to do ITN. WFSTs are nicely suited to this task, but their size and run-time costs can make deployment on embedded applications challenging. In this paper, we describe the development of an on-device ITN system that is streaming, lightweight and accurate. At the core of our system is a streaming transformer tagger that tags lexical tokens from ASR. The tag indicates which ITN category, if any, should be applied. Following that, we apply an ITN-category-specific WFST, only on the tagged text, to reliably perform the ITN conversion. We show that the proposed ITN solution performs on par with strong baselines, while being significantly smaller in size and retaining customization capabilities.
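To make the tag-then-convert flow concrete, here is a minimal Python sketch under stated assumptions: the tag set (CARDINAL / O), the toy number converter, and the span-merging logic are hypothetical stand-ins; in the paper the tagger is a streaming transformer and each category's converter is a WFST.

```python
# Hypothetical tag-then-convert ITN pass (illustrative only).
NUM = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
       "seven": 7, "eight": 8, "nine": 9, "twenty": 20, "thirty": 30}

def cardinal(span):
    """Toy cardinal converter standing in for the CARDINAL WFST."""
    value = 0
    for w in span:
        value = value * 100 if w == "hundred" else value + NUM[w]
    return [str(value)]

CONVERTERS = {"CARDINAL": cardinal}   # one converter per ITN category

def apply_itn(tagged_tokens):
    """tagged_tokens: list of (token, tag). Contiguous spans sharing a tag are
    rewritten by that tag's converter; 'O' (untagged) tokens pass through."""
    out, span, span_tag = [], [], None
    for token, tag in tagged_tokens + [("", None)]:   # sentinel flushes the last span
        if tag != span_tag:
            if span:
                out.extend(CONVERTERS.get(span_tag, lambda s: s)(span))
            span, span_tag = [], tag
        if token:
            span.append(token)
    return " ".join(out)

print(apply_itn([("i", "O"), ("paid", "O"),
                 ("twenty", "CARDINAL"), ("three", "CARDINAL"),
                 ("dollars", "O")]))   # -> "i paid 23 dollars"
```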
Recently, the speech community has been seeing a significant trend of moving from deep-neural-network-based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve state-of-the-art results on most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems today. There are many practical factors that affect production model deployment decisions. Traditional hybrid models, having been optimized for production for decades, are usually good at these factors. Without providing excellent solutions to all of these factors, it is hard for E2E models to be widely commercialized. In this paper, we provide an overview of recent advances in E2E models, focusing on technologies that address those challenges from an industry perspective.
While speech recognition Word Error Rate (WER) has reached human parity for English, long-form dictation scenarios still suffer from segmentation and punctuation problems resulting from irregular pausing patterns or slow speakers. Transformer sequence tagging models are effective at capturing long bi-directional context, which is crucial for automatic punctuation. Automatic Speech Recognition (ASR) production systems, however, are constrained by real-time requirements, making it hard to incorporate the right context when making punctuation decisions. In this paper, we propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows and measure its impact on punctuation and segmentation accuracy across scenarios. The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%. Streaming punctuation achieves an average BLEU-score improvement of 0.66 for the downstream task of Machine Translation (MT).
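A rough sketch of how a dynamic decoding window could be managed on top of streaming ASR output is given below. The window and commit sizes, the buffering policy, and the `punctuate_window` placeholder are assumptions for illustration, not the paper's algorithm.

```python
# Streaming punctuation with a dynamic decoding window (illustrative sketch).
from collections import deque

WINDOW = 40      # tokens fed to the tagger at once (assumed value)
RIGHT_CTX = 10   # tokens held back so committed decisions see future context

def punctuate_window(tokens):
    """Placeholder for the transformer tagger: one punctuation mark (or '')
    per token. A naive rule is used here purely for demonstration."""
    return ["." if t in {"okay", "right"} else "" for t in tokens]

class StreamingPunctuator:
    def __init__(self):
        self.buffer = deque()

    def push(self, new_tokens):
        """Consume newly recognized tokens and emit (token, punct) pairs whose
        decisions are final; the unstable tail stays buffered for the next call."""
        self.buffer.extend(new_tokens)
        committed = []
        while len(self.buffer) > RIGHT_CTX:
            window = list(self.buffer)[:WINDOW]
            marks = punctuate_window(window)
            n_commit = max(1, len(window) - RIGHT_CTX)
            for _ in range(n_commit):
                committed.append((self.buffer.popleft(), marks.pop(0)))
        return committed

p = StreamingPunctuator()
print(p.push(["so", "we", "met", "yesterday", "okay"]))              # nothing committed yet
print(p.push(["and", "then", "we", "left", "right", "after", "that"]))  # oldest tokens committed
```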
Ensuring proper punctuation and letter casing is a key pre-processing step toward applying complex natural language processing algorithms. This is especially significant for text sources where punctuation and casing are missing, such as the raw output of automatic speech recognition systems. Additionally, short text messages and micro-blogging platforms offer unreliable and often erroneous punctuation and casing. This survey offers an overview of both historical and state-of-the-art techniques for restoring punctuation and correcting word casing. Furthermore, current challenges and research directions are highlighted.
Speech-based inputs have been gaining popularity in our daily lives with the proliferation of smartphones and tablets, since voice is the easiest and most effective way for human-computer interaction. This paper works toward designing more effective speech-based interfaces to query the structured data in relational databases. We first identify a new task named Speech-to-SQL, which aims to understand the information conveyed by human speech and directly translate it into Structured Query Language (SQL) statements. A naive solution to this problem can work in a cascaded manner, i.e., an automatic speech recognition (ASR) component followed by a text-to-SQL component. However, it requires a high-quality ASR system and also suffers from the error compounding problem between the two components, resulting in limited performance. To handle these challenges, we further propose a novel end-to-end neural architecture named SpeechSQLNet to directly translate human speech into SQL queries without an external ASR step. SpeechSQLNet has the advantage of making full use of the rich linguistic information presented in speech. To the best of our knowledge, this is the first attempt to directly synthesize SQL based on arbitrary natural-language questions, rather than a natural-language-based version of SQL or its variants with a limited SQL grammar. To validate the effectiveness of the proposed problem and model, we further construct a dataset named SpeechQL by piggybacking on widely-used text-to-SQL datasets. Extensive experimental evaluations on this dataset show that SpeechSQLNet can directly synthesize high-quality SQL queries from human speech, outperforming various competitive counterparts as well as the cascaded methods in terms of exact-match accuracy.
Contextual biasing is an important and challenging task for end-to-end automatic speech recognition (ASR) systems. Existing methods mainly include contextual LM biasing and adding a bias encoder into end-to-end ASR models. In this work, we introduce a novel approach that achieves contextual biasing by adding a contextual spelling correction model on top of the end-to-end ASR system. We incorporate contextual information into a sequence-to-sequence spelling correction model with a shared context encoder. The proposed model includes two different mechanisms: autoregressive (AR) and non-autoregressive (NAR). We propose filtering algorithms to handle large-size context lists, and a performance balancing mechanism to control the biasing degree of the model. We demonstrate that the proposed model is a general biasing solution that is domain-insensitive and can be adopted in different scenarios. Experiments show that the proposed method achieves as much as 51% relative word error rate (WER) reduction over ASR systems and outperforms traditional biasing methods. Compared with the AR solution, the proposed NAR model reduces the model size by 43.2% and speeds up inference by 2.1 times.
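As an illustration of the kind of filtering such a system needs for large context lists, the sketch below keeps only context phrases that fuzzily match some span of the first-pass hypothesis. The edit-ratio criterion, threshold, and list cap are assumptions; the paper's actual filtering algorithm is not reproduced here.

```python
# Hypothesis-driven filtering of a large context list (illustrative sketch).
from difflib import SequenceMatcher

def filter_context(hypothesis, context_phrases, threshold=0.6, max_kept=100):
    """Rank context phrases by their best fuzzy match against any hypothesis
    n-gram of the same length and keep the top candidates above the threshold."""
    words = hypothesis.lower().split()
    scored = []
    for phrase in context_phrases:
        n = len(phrase.split())
        best = 0.0
        for i in range(max(1, len(words) - n + 1)):
            window = " ".join(words[i:i + n])
            best = max(best, SequenceMatcher(None, phrase.lower(), window).ratio())
        if best >= threshold:
            scored.append((best, phrase))
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:max_kept]]

# Example: only context phrases resembling the (possibly misrecognized) hypothesis survive.
print(filter_context("call jon krasinski now",
                     ["John Krasinski", "Emily Blunt", "Kristen Bell"]))
```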
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU), in which a streaming automatic speech recognition (ASR) model produces a first-pass hypothesis and a second-pass natural language understanding (NLU) component generates the semantic parse by attending to both the ASR's text and audio embeddings. By formulating E2E SLU as a generalized decoder, our system is able to support complex compositional semantic structures. Furthermore, the parameter sharing between ASR and NLU makes the system especially suitable for resource-constrained (on-device) environments; our proposed approach consistently outperforms strong pipeline NLU baselines by 0.60% to 0.65% on the spoken version of the TOPv2 dataset (STOP). We demonstrate that the fusion of text and audio features, coupled with the system's ability to rewrite the first-pass hypothesis, makes our approach more robust to ASR errors. Finally, we show that our approach can significantly reduce the degradation when moving from training on natural speech to synthetic speech, although further work is required to make text-to-speech (TTS) a viable solution for scaling up E2E SLU.
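A minimal sketch of the fusion idea, assuming a standard Transformer decoder that cross-attends over the concatenation of first-pass text embeddings and audio encoder outputs; the dimensions and module choices are illustrative, not the paper's architecture.

```python
# Deliberation-style fusion of text and audio embeddings (illustrative sketch).
import torch
import torch.nn as nn

class DeliberationSLU(nn.Module):
    def __init__(self, d_model=256, vocab=1000):
        super().__init__()
        self.text_proj = nn.Linear(d_model, d_model)    # first-pass hypothesis embeddings
        self.audio_proj = nn.Linear(d_model, d_model)   # shared ASR encoder outputs
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab)            # semantic-parse token logits

    def forward(self, parse_emb, text_emb, audio_emb):
        # Concatenate both sources along the time axis so cross-attention can
        # draw on either textual or acoustic evidence when producing the parse.
        memory = torch.cat([self.text_proj(text_emb), self.audio_proj(audio_emb)], dim=1)
        return self.out(self.decoder(parse_emb, memory))

# Shapes are (batch, time, d_model) for all inputs.
model = DeliberationSLU()
logits = model(torch.randn(2, 12, 256), torch.randn(2, 20, 256), torch.randn(2, 50, 256))
print(logits.shape)   # torch.Size([2, 12, 1000])
```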
Streaming automatic speech recognition (ASR) models are more popular and suitable for voice-based applications. However, non-streaming models provide better performance as they look at the entire audio context. To leverage the benefits of non-streaming models in streaming applications like voice search, they are commonly used in a second-pass re-scoring mode. Candidate hypotheses generated by the streaming model are re-scored using a non-streaming model. In this work, we evaluate attention-based end-to-end ASR models on the Flipkart voice search task in both standalone and re-scoring modes. These models are based on the Listen-Attend-Spell (LAS) encoder-decoder architecture. We experiment with different encoder variants based on LSTM, Transformer, and Conformer. We compare the latency requirements of these models along with their performance. Overall, we show that the Transformer model offers acceptable latency requirements. We report a relative improvement of around 16% with second-pass LAS re-scoring, with a latency overhead below 5 ms. We also highlight the importance of the CNN front-end with the Transformer architecture to achieve comparable word error rates (WER). Furthermore, we observe that in the second-pass re-scoring mode all the encoders provide similar benefits, whereas the performance difference is pronounced in the standalone text generation mode.
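The second-pass re-scoring mode can be summarized with a small sketch: interpolate each first-pass hypothesis score with a score from a non-streaming model and pick the best. The interpolation weight and the `las_score` placeholder are assumptions for illustration, not the system's actual configuration.

```python
# Second-pass re-scoring of streaming n-best hypotheses (illustrative sketch).

def rescore(nbest, las_score, weight=0.5):
    """nbest: list of (hypothesis, first_pass_logprob).
    las_score(hypothesis) -> second-pass log-probability over the full audio."""
    rescored = [(hyp, (1 - weight) * s1 + weight * las_score(hyp)) for hyp, s1 in nbest]
    return max(rescored, key=lambda pair: pair[1])[0]

# Toy usage: the second pass prefers the hypothesis it finds more plausible.
nbest = [("order two shirts", -3.1), ("order to shirts", -2.9)]
best = rescore(nbest, las_score=lambda h: -1.0 if "two" in h else -4.0)
print(best)   # -> "order two shirts"
```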
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems, which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
In this paper, we propose a neural end-to-end system for voice-preserving, lip-synchronous translation of videos. The system is designed to combine multiple component models and produce a video of the original speaker speaking in the target language that is lip-synchronous with the target speech, yet maintains the emphases in speech, the voice characteristics, and the face video of the original speaker. The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model. The translated text is then synthesized by a text-to-speech model that recreates the original emphases mapped onto the translated sentence. The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model. Finally, to synchronize the lips of the speaker with the translated audio, a conditional generative adversarial network-based model generates frames of adapted lip movements with respect to the input face image as well as the output of the voice conversion model. In the end, the system combines the generated video with the converted audio to produce the final output. The result is a video of the speaker speaking in another language without ever actually having known it. To evaluate our design, we present a user study of the complete system as well as separate evaluations of the single components. Since there is no available dataset to evaluate our whole system, we collect a test set and evaluate our system on it. The results indicate that our system is able to generate convincing videos of the original speaker speaking the target language while preserving the original speaker's characteristics. The collected dataset will be shared.
In this paper, we present an open-source, production-first, and production-ready speech recognition toolkit called WeNet, in which a new two-pass approach is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. The main motivation of WeNet is to close the gap between research and production of E2E speech recognition models. WeNet provides an efficient way to ship ASR applications in several real-world scenarios, which is the main difference and advantage over other open-source E2E speech recognition toolkits. In our toolkit, a new two-pass method is implemented. Our approach proposes a dynamic chunk-based attention strategy for the Transformer layers, which allows arbitrary right-context lengths in a modified hybrid CTC/attention architecture. The inference latency can be easily controlled by simply changing the chunk size. The CTC hypotheses are then re-scored by the attention decoder to obtain the final result. Our experiments on the AISHELL-1 dataset using WeNet show that, compared to a standard non-streaming Transformer, our model achieves a 5.03% relative character error rate (CER) reduction in non-streaming ASR. After model quantization, our model achieves reasonable RTF and latency.
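The chunk-based attention strategy can be illustrated by the mask it induces: every frame attends to all frames in earlier chunks plus its own chunk, so the right context, and hence the latency, is bounded by the chunk size. The sketch below uses plain Python lists purely for clarity; WeNet itself builds such masks on tensors, and the training-time dynamic chunk sizes are not shown.

```python
# Chunk-based attention mask (illustrative sketch).

def chunk_attention_mask(num_frames, chunk_size):
    """mask[i][j] is True when frame i may attend to frame j: all previous
    chunks plus frame i's own chunk are visible, nothing beyond it."""
    mask = [[False] * num_frames for _ in range(num_frames)]
    for i in range(num_frames):
        chunk_end = ((i // chunk_size) + 1) * chunk_size   # end of frame i's chunk
        for j in range(min(chunk_end, num_frames)):
            mask[i][j] = True
    return mask

for row in chunk_attention_mask(6, chunk_size=2):
    print("".join("x" if allowed else "." for allowed in row))
# Frames 0-1 see frames 0-1, frames 2-3 see 0-3, frames 4-5 see 0-5:
# latency is controlled purely by the chunk size.
```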
End-to-end speech recognition models trained using joint Connectionist Temporal Classification (CTC)-Attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because of their conditional independence assumption, which prevents output tokens from previous time steps from influencing future predictions. To tackle this, we propose a novel two-way approach that first biases the encoder with attention over a predefined list of rare long-tail and out-of-vocabulary (OOV) words, and then uses dynamic boosting and a phone alignment network during decoding to further bias the subword predictions. We evaluate our approach on the open-source VoxPopuli and in-house medical datasets to showcase a 60% improvement in F1 score on domain-specific rare words over a strong CTC baseline.
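A simplified sketch of decode-time boosting, assuming a prefix set over the bias words: hypotheses whose current partial word extends a bias-word prefix receive a tentative bonus, and completing the word earns the full bonus. The bonus values and interface are illustrative; the paper's dynamic boosting and phone alignment network are not reproduced here.

```python
# Prefix-based score boosting for bias words (illustrative sketch).

def build_prefix_sets(bias_words):
    """Precompute proper prefixes and full forms of the bias words."""
    prefixes, completes = set(), set()
    for w in bias_words:
        completes.add(w)
        for i in range(1, len(w)):
            prefixes.add(w[:i])
    return prefixes, completes

def boost(partial_word, prefixes, completes, bonus=2.0):
    """Additive score adjustment for a hypothesis' current partial word."""
    if partial_word in completes:
        return bonus          # full bias word decoded: full reward
    if partial_word in prefixes:
        return bonus * 0.5    # still on a bias-word path: tentative reward
    return 0.0

prefixes, completes = build_prefix_sets(["metformin", "warfarin"])
print(boost("metf", prefixes, completes))        # 1.0 (partial match)
print(boost("metformin", prefixes, completes))   # 2.0 (complete match)
print(boost("metro", prefixes, completes))       # 0.0 (fell off the bias path)
```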
Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very-high frequency radio channels. In order to incorporate these novel technologies into ATC (a low-resource domain), large-scale annotated datasets are required to develop the data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to the lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, a signal-to-noise ratio estimate and an English language detection score per sample. Both are available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus, which we offer for free at https://www.atco2.org/data. We expect the ATCO2 corpus to foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.
Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition tasks, respectively.
Czech is a very specific language due to its large differences between the formal and the colloquial form of speech. While the formal (written) form is used mainly in official documents, literature, and public speeches, the colloquial (spoken) form is used widely in casual speech. This gap introduces serious problems for ASR systems, especially when training or evaluating ASR models on datasets containing a lot of colloquial speech, such as the MALACH project. In this paper, we are addressing this problem in light of a new paradigm in end-to-end ASR systems, the recently introduced self-supervised audio Transformers. Specifically, we are investigating the influence of colloquial speech on the performance of Wav2Vec 2.0 models and their ability to transcribe colloquial speech directly. We present results with both formal and colloquial forms in the training transcripts, language models, and evaluation transcripts.
Spoken Language Understanding (SLU) is a core task in most human-machine interaction systems. With the emergence of smart homes, smartphones and smart speakers, SLU has become a key technology for the industry. In a classical SLU approach, an Automatic Speech Recognition (ASR) module transcribes the speech signal into a textual representation from which a Natural Language Understanding (NLU) module extracts semantic information. Recently, end-to-end SLU (E2E SLU) based on deep neural networks has gained momentum since it benefits from the joint optimization of the ASR and the NLU parts, hence limiting the cascade of error effects of the pipeline architecture. However, little is known about the actual linguistic properties used by E2E models to predict concepts and intents from speech input. In this paper, we present a study identifying the signal features and other linguistic properties used by an E2E model to perform the SLU task. The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands. The results show that good E2E SLU performance does not always require a perfect ASR capability. Furthermore, the results show the superior capabilities of the E2E model in handling background noise and syntactic variation compared to the pipeline model. Finally, a finer-grained analysis suggests that the E2E model uses the pitch information of the input signal to identify voice command concepts. The results and methodology outlined in this paper provide a springboard for further analyses of E2E models in speech processing.
Accent forms an integral part of identifying cultures, emotions, behaviors, etc. People often perceive each other in a different manner because of their accent. The accent itself can be a conveyor of status, pride, and other emotional information, which can be captured through speech itself. Accent can be defined as "the way in which people in a particular area, country, or social group pronounce words" or "a special emphasis given to a syllable in a word, a word in a sentence, or a note in a set of musical notes". Accent recognition is one of the most important problems in the field of speech recognition. Speech recognition is an interdisciplinary subfield of computer science and linguistics research whose main aim is to develop technologies that can convert speech into text. The speech can be of any form, such as read speech, spontaneous speech, or conversational speech. Unlike text, speech has a lot of diversity. This diversity stems from environmental conditions, speaker-to-speaker variations, channel noise, differences in speech production due to disabilities, and the presence of disfluencies. Speech is therefore indeed a rich source of information waiting to be exploited.
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
Persian is an inflectional subject-object-verb (SOV) language. This fact makes Persian a more ambiguous language. However, using techniques such as Zero-Width Non-Joiner (ZWNJ) recognition, punctuation restoration, and Persian ezafe construction will lead us to a more understandable and precise language. In most works on Persian, these techniques have been addressed individually. Nevertheless, we believe that for text refinement in Persian, all of these tasks are necessary. In this work, we propose the ViraPart framework, which uses embedded ParsBERT at its core for text clarification. First, the BERT variant is used together with a classifier layer for the classification procedure. Next, we combine the model outputs to produce the cleared text. Finally, the proposed models for ZWNJ recognition, punctuation restoration, and Persian ezafe construction achieve average F1 macro scores of 96.90%, 92.13%, and 98.50%, respectively. Experimental results show that our proposed approach is very effective for text refinement in the Persian language.
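Purely as an illustration of the final combination step, the sketch below merges per-token predictions from three taggers back into refined text. The tag conventions, the Latin-script stand-in example, and the joining rules are hypothetical; ViraPart's ParsBERT classifiers and actual post-processing are not shown.

```python
# Merging per-token ZWNJ / punctuation / ezafe predictions (illustrative sketch).

def refine(tokens, zwnj_tags, punct_tags, ezafe_tags):
    """Each *_tags[i] applies to tokens[i]:
    zwnj_tags : True joins the token to the next one with U+200C (ZWNJ)
    punct_tags: a punctuation mark to append, or ''
    ezafe_tags: True appends the ezafe marker (shown here as a plain 'e')"""
    pieces = []
    for tok, zwnj, punct, ezafe in zip(tokens, zwnj_tags, punct_tags, ezafe_tags):
        piece = tok + ("e" if ezafe else "") + punct
        pieces.append(piece + ("\u200c" if zwnj else " "))
    return "".join(pieces).strip()

# Toy Latin-script stand-in for a Persian sentence:
print(refine(["ketab", "man", "inja", "ast"],
             [False, False, False, False],
             ["", ",", "", "."],
             [True, False, False, False]))
# -> "ketabe man, inja ast."
```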
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.