The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several types, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models. Since each of these network architectures has pros and cons, a typical use case is to switch between these separate models depending on the application requirement, which increases the overhead of maintaining all of them. Several methods for integrating two of these complementary models to mitigate the overhead have been proposed; however, integrating more models would yield further benefit from their complementary properties and enable broader applications with a single system. This paper proposes four-decoder joint modeling (4D) of CTC, attention, RNN-T, and mask-predict, which has the following three advantages: 1) The four decoders are jointly trained so that they can be easily switched depending on the application scenario. 2) Joint training may bring model regularization and improve model robustness thanks to their complementary properties. 3) Novel one-pass joint decoding methods using CTC, attention, and RNN-T further improve performance. The experimental results showed that the proposed model consistently reduced the WER.
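A minimal sketch of the joint-training idea, assuming a shared encoder and a weighted sum of the four decoder losses; the module names, the `.loss(...)` interface, and the weights below are illustrative, not the paper's implementation:

```python
import torch.nn as nn

class FourDecoderASR(nn.Module):
    """Hypothetical sketch: shared encoder with CTC, attention, RNN-T,
    and mask-predict decoders trained with a weighted joint loss."""

    def __init__(self, encoder, ctc_head, attn_dec, rnnt_dec, maskpred_dec,
                 weights=(0.3, 0.3, 0.3, 0.1)):
        super().__init__()
        self.encoder = encoder
        self.ctc_head = ctc_head          # frame-level token logits for CTC
        self.attn_dec = attn_dec          # autoregressive attention decoder
        self.rnnt_dec = rnnt_dec          # prediction network + joiner
        self.maskpred_dec = maskpred_dec  # non-autoregressive mask-predict decoder
        self.weights = weights

    def forward(self, speech, speech_lens, tokens, token_lens):
        enc, enc_lens = self.encoder(speech, speech_lens)
        losses = (
            self.ctc_head.loss(enc, enc_lens, tokens, token_lens),
            self.attn_dec.loss(enc, enc_lens, tokens, token_lens),
            self.rnnt_dec.loss(enc, enc_lens, tokens, token_lens),
            self.maskpred_dec.loss(enc, enc_lens, tokens, token_lens),
        )
        # Joint objective: each decoder regularizes the shared encoder.
        return sum(w * l for w, l in zip(self.weights, losses))
```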
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
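A toy sketch of the simplest generation pipeline described above (LM-written yes/no questions with an LM-based filtering step); `lm_generate` is a placeholder for whatever text-generation call is available, and the prompts are illustrative rather than the paper's:

```python
from typing import Callable, List

def generate_yes_no_evals(lm_generate: Callable[[str], str],
                          behavior: str, n: int = 100) -> List[dict]:
    """Hypothetical sketch: prompt an LM to write yes/no questions that test a
    target behavior, then keep only examples the LM labels cleanly (a stand-in
    for the paper's multi-stage generation and filtering)."""
    examples = []
    for _ in range(n):
        prompt = (f"Write a yes/no question that tests whether an AI assistant "
                  f"exhibits the following behavior: {behavior}\nQuestion:")
        question = lm_generate(prompt)
        label = lm_generate(f"{question}\nAnswer that matches the behavior (Yes/No):")
        if label.strip().lower() in {"yes", "no"}:
            examples.append({"question": question,
                             "answer_matching_behavior": label.strip()})
    return examples
```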
The black-box nature of end-to-end speech translation (E2E ST) systems makes it difficult to understand how source language inputs are being mapped to the target language. To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word. A major challenge arises from the fact that translation is a non-monotonic sequence transduction task due to word ordering differences between languages -- this clashes with the monotonic nature of ASR. Therefore, we propose to generate ST tokens out-of-order while remembering how to re-order them later. We achieve this by predicting a sequence of tuples consisting of a source word, the corresponding target words, and post-editing operations dictating the correct insertion points for the target word. We examine two variants of such operation sequences which enable generation of monotonic transcriptions and non-monotonic translations from the same speech input simultaneously. We apply our approach to offline and real-time streaming models, demonstrating that we can provide explainable translations without sacrificing quality or latency. In fact, the delayed re-ordering ability of our approach improves performance during streaming. As an added benefit, our method performs ASR and ST simultaneously, making it faster than using two separate systems to perform these tasks.
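One way to picture the tuple-based output format is as a stream of (source word, target word, insertion point) triples; the sketch below is a hypothetical decoding of such a stream, not the paper's exact operation set:

```python
from typing import List, Tuple

def apply_reordering(tuples: List[Tuple[str, str, int]]) -> Tuple[str, str]:
    """Hypothetical sketch: the transcript is read off monotonically, while the
    translation is rebuilt by inserting each target word at the position
    dictated by its post-editing operation."""
    transcript, translation = [], []
    for src, tgt, pos in tuples:
        transcript.append(src)                  # monotonic ASR output
        if tgt:                                 # some source words emit no target
            translation.insert(min(pos, len(translation)), tgt)  # delayed re-ordering
    return " ".join(transcript), " ".join(translation)

# e.g. a French->English reordering recovered from a monotonic pass:
print(apply_reordering([("je", "I", 0), ("t'", "you", 1), ("aime", "love", 1)]))
# -> ("je t' aime", "I love you")
```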
In this work, we seek to build effective code-switched (CS) automatic speech recognition systems (ASR) under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks which conditionally factorize the bilingual task into its constituent monolingual parts are a promising starting point for leveraging monolingual data efficiently. However, these methods require the monolingual modules to perform language segmentation. That is, each monolingual module has to simultaneously detect CS points and transcribe speech segments of one language while ignoring those of other languages -- not a trivial task. We propose to simplify each monolingual module by allowing them to transcribe all speech segments indiscriminately with a monolingual script (i.e. transliteration). This simple modification passes the responsibility of CS point detection to subsequent bilingual modules which determine the final output by considering multiple monolingual transliterations along with external language model information. We apply this transliteration-based approach in an end-to-end differentiable neural network and demonstrate its efficacy for zero-shot CS ASR on Mandarin-English SEAME test sets.
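A toy beam-search sketch of the combination step, in which each monolingual module transcribes every frame in its own script and a bilingual stage picks which language's token to keep under an external language model; the frame/posterior layout and `lm_score` are assumptions, and the actual system is a single end-to-end differentiable network rather than this post-hoc search:

```python
from typing import Callable, Dict, List

def combine_transliterations(frames: List[Dict[str, Dict[str, float]]],
                             lm_score: Callable[[List[str]], float],
                             beam: int = 4) -> List[str]:
    """Hypothetical sketch: frames[t] maps a language id (e.g. "zh", "en") to
    that monolingual module's token log-posteriors for frame t; the bilingual
    combination keeps the best-scoring mixed-language hypotheses."""
    hyps = [([], 0.0)]
    for frame in frames:
        new_hyps = []
        for tokens, score in hyps:
            for lang, post in frame.items():
                for tok, logp in post.items():
                    cand = tokens + [tok]
                    new_hyps.append((cand, score + logp + lm_score(cand)))
        hyps = sorted(new_hyps, key=lambda h: h[1], reverse=True)[:beam]
    return hyps[0][0]
```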
This paper presents recent progress on integrating speech separation and enhancement (SSE) into the ESPnet toolkit. Compared with the previous ESPnet-SE work, numerous features have been added, including recent state-of-the-art speech enhancement models with their respective training and evaluation recipes. Importantly, a new interface has been designed to flexibly combine speech enhancement front-ends with other tasks, including automatic speech recognition (ASR), speech translation (ST), and spoken language understanding (SLU). To showcase this integration, we performed experiments on carefully designed synthetic datasets for noisy multi-channel ST and SLU tasks, which can be used as benchmark corpora for future research. In addition to these new tasks, we also benchmark multi-channel and single-channel SE approaches on CHiME-4 and WSJ0-2mix. The results show that the integration of SE front-ends with back-end tasks is a promising research direction even for tasks other than ASR, especially in the multi-channel scenario. The code is available online at https://github.com/espnet/espnet. The multi-channel ST and SLU datasets, another contribution of this work, are released on HuggingFace.
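A schematic of the front-end/back-end combination idea, sketched with generic PyTorch modules rather than ESPnet's actual interface; the `.signal_loss`/`.loss` methods and the loss weighting are assumptions:

```python
import torch.nn as nn

class EnhanceThenRecognize(nn.Module):
    """Hypothetical sketch: an enhancement model cleans (possibly multi-channel)
    speech, its output feeds a downstream ASR/ST/SLU model, and the two can be
    trained jointly with a weighted combination of losses."""

    def __init__(self, enhancer, backend, enh_weight=0.3):
        super().__init__()
        self.enhancer = enhancer      # separation/enhancement front-end
        self.backend = backend        # ASR, ST, or SLU back-end
        self.enh_weight = enh_weight

    def forward(self, noisy, clean_ref, target):
        enhanced = self.enhancer(noisy)
        enh_loss = self.enhancer.signal_loss(enhanced, clean_ref)  # e.g. SI-SNR
        task_loss = self.backend.loss(enhanced, target)
        return self.enh_weight * enh_loss + task_loss
```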
End-to-end (E2E) models are becoming increasingly popular for spoken language understanding (SLU) systems and are beginning to achieve performance competitive with pipeline-based approaches. However, recent work has shown that these models struggle to generalize to new phrasings of the same intent, which indicates that the models cannot understand the semantic content of a given utterance. In this work, we incorporate a language model pre-trained on unlabeled text data within an E2E-SLU framework to build strong semantic representations. Combining semantic and acoustic information simultaneously can increase inference time, leading to high latency when deployed in applications such as voice assistants. We develop a two-pass SLU system that makes low-latency predictions using acoustic information from the first few seconds of audio and makes higher-quality predictions in a second pass by combining semantic and acoustic representations. We take inspiration from prior work on two-pass end-to-end speech recognition systems that use a deliberation network to attend to both the audio and the first-pass hypothesis. The proposed two-pass SLU system outperforms acoustic-based SLU models on the Fluent Speech Commands Challenge Set and the SLURP dataset while reducing latency, thereby improving the user experience. Our code and models are publicly available as part of the ESPnet-SLU toolkit.
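A hypothetical sketch of the two-pass structure: a low-latency intent head on the early audio, then a deliberation pass that fuses acoustic and semantic (pretrained-LM) representations; the module names and interfaces are illustrative, not the released code:

```python
import torch.nn as nn

class TwoPassSLU(nn.Module):
    """Hypothetical sketch: a fast first pass predicts the intent from the first
    seconds of audio; a deliberation second pass attends over both the acoustic
    encoding and a text encoding of the first-pass hypothesis."""

    def __init__(self, acoustic_enc, fast_classifier, text_enc, deliberation, classifier):
        super().__init__()
        self.acoustic_enc = acoustic_enc
        self.fast_classifier = fast_classifier   # low-latency intent head
        self.text_enc = text_enc                 # pretrained LM over 1st-pass hypothesis
        self.deliberation = deliberation         # fuses the two representations
        self.classifier = classifier             # higher-quality intent head

    def forward(self, early_audio, full_audio, first_pass_text):
        fast_intent = self.fast_classifier(self.acoustic_enc(early_audio))
        acoustic = self.acoustic_enc(full_audio)
        semantic = self.text_enc(first_pass_text)
        fused = self.deliberation(acoustic, semantic)
        return fast_intent, self.classifier(fused)
```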
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean up a spill may result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding via pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we demonstrate the need for real-world grounding and show that this approach is capable of completing long-horizon, abstract natural language instructions on a mobile manipulator. The project's website and videos can be found at https://say-can.github.io/.
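A minimal sketch of the scoring idea, assuming the language model yields a log-probability of each skill description given the instruction and the skill's value function approximates its success probability in the current state; the function names and the exact combination rule are illustrative:

```python
import math
from typing import Callable, List

def select_skill(instruction: str, history: List[str], skills: List[str],
                 lm_logprob: Callable[[str, str], float],
                 value_fn: Callable[[str], float]) -> str:
    """Hypothetical sketch: the LM scores how useful a skill is for the
    instruction, the value function scores how likely the skill is to succeed
    in the current state, and their combination picks the next action."""
    def score(skill: str) -> float:
        usefulness = lm_logprob(instruction + " " + " ".join(history), skill)
        affordance = value_fn(skill)              # roughly P(success | state)
        return usefulness + math.log(max(affordance, 1e-9))
    return max(skills, key=score)
```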
Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type. In this work, we propose a general framework to jointly model the likelihoods of the monolingual and code-switching sub-tasks that comprise bilingual speech recognition. By defining the monolingual sub-tasks with label-to-frame synchronization, our joint modeling framework can be conditionally factorized such that the final bilingual output, which may or may not be code-switched, is obtained given only monolingual information. We show that this conditionally factorized joint framework can be modeled by an end-to-end differentiable neural network. We demonstrate the efficacy of the proposed model on bilingual Mandarin-English speech recognition across both monolingual and code-switched corpora.
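Schematically, such a conditional factorization can be written as below, where z^en and z^zh denote frame-synchronous monolingual label sequences; the notation is illustrative rather than taken from the paper:

```latex
P\!\left(y^{\mathrm{bi}} \mid x\right)
  = \sum_{z^{\mathrm{en}},\, z^{\mathrm{zh}}}
    P\!\left(y^{\mathrm{bi}} \mid z^{\mathrm{en}}, z^{\mathrm{zh}}, x\right)\,
    P\!\left(z^{\mathrm{en}} \mid x\right)\,
    P\!\left(z^{\mathrm{zh}} \mid x\right)
```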
As automatic speech recognition (ASR) systems get better, there is increasing interest in using ASR output for downstream natural language processing (NLP) tasks. However, few open-source toolkits are available for generating reproducible results on different spoken language understanding (SLU) benchmarks. Hence, there is a need for an open-source standard that enables a faster start into SLU research. We present ESPnet-SLU, which is designed for the quick development of spoken language understanding in a single framework. ESPnet-SLU is a project within the end-to-end speech processing toolkit ESPnet, a widely used open-source standard for various speech processing tasks such as ASR, text-to-speech (TTS), and speech translation (ST). We enhance the toolkit to provide implementations for various SLU benchmarks, enabling researchers to seamlessly mix and match different ASR and NLU models. We also provide pretrained models with intensively tuned hyperparameters that can match or even outperform the current state-of-the-art performance. The toolkit is publicly available at https://github.com/espnet/espnet.
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
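A minimal sketch of the token-level fusion idea, assuming both modalities are already embedded and 3D position-encoded; the layer sizes, query count, and box parameterization below are illustrative, not the released model:

```python
import torch
import torch.nn as nn

class CrossModalDetector(nn.Module):
    """Hypothetical sketch: image tokens and point-cloud tokens are concatenated
    into one sequence, and a transformer decoder with object queries predicts 3D
    boxes directly, without an explicit view transformation."""

    def __init__(self, d_model=256, num_queries=300, num_classes=10):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        self.box_head = nn.Linear(d_model, 10)    # e.g. center, size, yaw, velocity
        self.cls_head = nn.Linear(d_model, num_classes)

    def forward(self, img_tokens, pc_tokens):
        # img_tokens / pc_tokens: (B, N, d_model), already 3D position-encoded
        memory = torch.cat([img_tokens, pc_tokens], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(img_tokens.size(0), -1, -1)
        out = self.decoder(q, memory)
        return self.box_head(out), self.cls_head(out)
```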