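Speech-to-speech translation (S2ST) converts input speech into speech in another language. A challenge in delivering S2ST in real time is the accumulated delay between the translation and speech synthesis modules. Although recent incremental text-to-speech (iTTS) models have shown large quality improvements, they typically require additional future text input to reach optimal performance. In this work, we minimize the initial waiting time of iTTS by adapting the upstream speech translator to generate high-quality pseudo lookahead for the speech synthesizer. After mitigating the initial delay, we demonstrate that the duration of the synthesized speech also plays a crucial role in latency. We formalize this as a latency metric and then propose a simple yet effective duration-scaling approach to reduce latency. Our approaches consistently reduce latency by 0.2-0.5 seconds without sacrificing speech translation quality.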
translated by 谷歌翻译
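We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model in which the generation of the translation is independent of intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units, in which a sequence of discrete representations, learned in an unsupervised manner, is predicted by the model instead of continuous spectrogram features and passed directly to a vocoder for on-the-fly speech synthesis. We also introduce variational monotonic multihead attention (V-MMA) to handle the challenge of inefficient policy learning in simultaneous speech translation. The simultaneous policy then operates on source speech features and target discrete units. We carry out empirical studies comparing cascaded and direct approaches on the Fisher Spanish-English and MuST-C English-Spanish datasets. The direct simultaneous model is shown to outperform the cascaded model by achieving a better trade-off between translation quality and latency.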
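This paper presents a unified end-to-end framework for both streaming and non-streaming speech translation (ST). While training recipes for non-streaming speech translation are mature, recipes for streaming speech translation have yet to be established. In this work, we focus on developing a unified model (UniST) that supports streaming and non-streaming ST from the perspective of fundamental components, including the training objective, attention mechanism, and decoding policy. Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves a better trade-off between BLEU score and latency metrics for streaming ST than standard end-to-end baselines and cascaded models. We will make our code and evaluation tools publicly available.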
Video dubbing aims to translate the original speech in a film or television program into speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation and speech synthesis. To ensure that the translated speech is well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech, which requires strict length control. Previous works usually control the number of words or characters generated by the machine translation model to be similar to the source sentence, without considering the isochronicity of speech, as the speech duration of words/characters varies across languages. In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation, to match the length of source and target speech. Specifically, we control the speech length of the generated sentence by guiding the prediction of each word with duration information, including the speech duration of the word itself as well as how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese <-> English), and the results show that the proposed method achieves better length control on the generated speech than baseline methods. To make up for the lack of real-world datasets, we also construct a real-world test set collected from films to provide comprehensive evaluations on the video dubbing task.
The study of the attention mechanism has sparked interest in many fields, such as language modeling and machine translation. Although its patterns have been exploited to perform different tasks, from neural network understanding to textual alignment, no previous work has analysed the encoder-decoder attention behavior in speech translation (ST) nor used it to improve ST on a specific task. In this paper, we fill this gap by proposing an attention-based policy (EDAtt) for simultaneous ST (SimulST) that is motivated by an analysis of the existing attention relations between audio input and textual output. Its goal is to leverage the encoder-decoder attention scores to guide inference in real time. Results on en->{de, es} show that the EDAtt policy achieves overall better results compared to the SimulST state of the art, especially in terms of computational-aware latency.
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.
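Simultaneous speech translation (SimulST) systems aim to generate their output with the lowest possible latency, which is normally computed in terms of Average Lagging (AL). In this paper, we highlight that, despite its widespread adoption, AL provides underestimated scores for systems that generate longer predictions than the corresponding references. We also show that this problem has practical relevance, as recent SimulST systems do indeed have a tendency to over-generate. As a solution, we propose LAAL (Length-Adaptive Average Lagging), a modified version of the metric that takes the over-generation phenomenon into account and allows for unbiased evaluation of both under- and over-generating systems.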
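The difference between the two metrics can be illustrated with a minimal sketch. It assumes the commonly used speech formulation of AL, in which the i-th target token is compared against an ideal delay of (i - 1) * source_duration / reference_length, and it assumes that LAAL only swaps the reference length for the maximum of the hypothesis and reference lengths. The abstract does not spell out either formula, so the function name, parameters, and example values below are hypothetical and purely illustrative.

```python
from typing import Sequence


def average_lagging(
    delays_ms: Sequence[float],
    source_duration_ms: float,
    reference_length: int,
    length_adaptive: bool = False,
) -> float:
    """Sketch of AL (and LAAL when length_adaptive=True) for one utterance.

    delays_ms[i] is the amount of source audio (in ms) consumed when the
    i-th target token was emitted. All names here are illustrative.
    """
    hypothesis_length = len(delays_ms)
    if hypothesis_length == 0 or reference_length <= 0:
        raise ValueError("need at least one emitted token and a non-empty reference")

    # Assumed difference between the metrics: LAAL normalises by the longer
    # of hypothesis and reference, AL by the reference only.
    if length_adaptive:
        target_length = max(hypothesis_length, reference_length)
    else:
        target_length = reference_length
    ideal_step_ms = source_duration_ms / target_length

    total = 0.0
    counted = 0
    for i, delay in enumerate(delays_ms):
        total += delay - i * ideal_step_ms
        counted += 1
        # Stop at the first token emitted after the whole source was read.
        if delay >= source_duration_ms:
            break
    return total / counted


# An over-generating hypothesis: 6 tokens against a 4-token reference.
delays = [1000.0, 2000.0, 3000.0, 4000.0, 4000.0, 4000.0]
al = average_lagging(delays, source_duration_ms=4000.0, reference_length=4)
laal = average_lagging(
    delays, source_duration_ms=4000.0, reference_length=4, length_adaptive=True
)
print(f"AL = {al:.1f} ms, LAAL = {laal:.1f} ms")
```

In this toy case the length-adaptive variant reports a higher latency (about 1500 ms against 1000 ms), which is consistent with the claim that AL underestimates latency when the system over-generates. When the hypothesis is no longer than the reference, the two values coincide under these assumptions.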
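In this paper, we propose a neural end-to-end system for voice-preserving, lip-synchronous translation of videos. The system is designed to combine multiple component models and produces a video of the original speaker speaking in the target language that is lip-synchronous with the target speech, yet maintains the emphases in speech, the voice characteristics, and the face video of the original speaker. The pipeline starts with automatic speech recognition, including emphasis detection, followed by a translation model. The translated text is then synthesized by a text-to-speech model that recreates the original emphases mapped from the original sentence. The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model. Finally, to synchronize the lips of the speaker with the translated audio, a conditional generative adversarial network-based model generates frames of adapted lip movements with respect to the input face image as well as the output of the voice conversion model. In the end, the system combines the generated video with the converted audio to produce the final output. The result is a video of a speaker speaking in another language without actually knowing it. To evaluate our design, we present a user study of the complete system as well as separate evaluations of the individual components. Since there is no available dataset to evaluate our whole system, we collect a test set and evaluate our system on it. The results indicate that our system is able to generate convincing videos of the original speaker speaking the target language while preserving the original speaker's characteristics. The collected dataset will be shared.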
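We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects the previous three components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pauses. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retaining the source speaker's voice and, unlike the original Translatotron, is not able to generate speech in a different speaker's voice, which makes the model more robust for production deployment by mitigating potential misuse for creating spoofing audio artifacts. When the new method is used together with a simple concatenation-based data augmentation, the trained Translatotron 2 model is able to retain each speaker's voice for input with speaker turns.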
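Simultaneous speech translation (SimulST) is the task in which output generation has to be performed on partial, incremental speech input. In recent years, SimulST has become popular due to the spread of cross-lingual application scenarios, such as international live conferences and streaming lectures, in which on-the-fly speech translation can facilitate users' access to audio-visual content. In this paper, we analyze the characteristics of the SimulST systems developed so far, discussing their strengths and weaknesses. We then concentrate on the evaluation framework required to properly assess systems' effectiveness. To this end, we raise the need for a broader performance analysis that also includes the user experience standpoint. Indeed, SimulST systems should be evaluated not only in terms of quality/latency measures, but also via task-oriented metrics that account, for instance, for the visualization strategy adopted. In light of this, we highlight the goals achieved by the community and what is still missing.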
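This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by using a recently proposed location-based attention mechanism, suitable for arbitrary sentence lengths. During inference, the decoder is unrolled and acoustic feature generation is performed in a streaming manner, allowing for a nearly constant latency that is independent of the sentence length. Experimental results show that the acoustic model can produce feature sequences about 31 times faster than real time on a computer CPU and 6.5 times faster on a mobile CPU, enabling it to meet the conditions required for real-time applications on both devices. The full end-to-end system can generate speech of almost natural quality, which is verified by listening tests.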
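An unsupervised text-to-speech synthesis (TTS) system learns to generate the speech waveform corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language, without access to any transcribed speech. Developing such a system can significantly improve the availability of language technology for languages without a large amount of parallel speech and text data. This paper proposes an unsupervised TTS system based on an alignment module that outputs pseudo-text and a synthesis module that uses pseudo-text for training and real text for inference. Our unsupervised system can achieve performance comparable to the supervised system in seven languages, with about 10-20 hours of speech each. A careful study of the effect of text units and vocoders has also been conducted to better understand which factors may affect unsupervised TTS performance. Samples generated by our models can be found at https://cactuswiththoughts.github.io/unsuptts-demo, and our code can be found at https://github.com/lwang114/unsuptts.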
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains to translation performance. The mined data and models are freely available.
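This paper presents an expressive speech synthesis architecture for modeling and controlling the speaking style at the word level. It attempts to learn word-level stylistic and prosodic representations of the speech data with the aid of two encoders. The first models style by finding a combination of style tokens for each word given the acoustic features, and the second outputs a word-level sequence conditioned only on the phonetic information in order to disentangle it from the style information. The two encoder outputs are aligned and concatenated with the phoneme encoder outputs and then decoded with a Non-Attentive Tacotron model. An extra prior encoder is used to predict the style tokens autoregressively, so that the model can run without a reference utterance. We find that the resulting model gives both word-level and global control over the style, as well as prosody transfer capabilities.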
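We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another and can be built without the need for any text data. Unlike existing work in the literature, we tackle the challenge of modeling multi-speaker target speech and train the system with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which fine-tunes a pre-trained speech encoder with paired audio from multiple speakers and a single reference speaker to reduce the variations due to accents while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain an average gain of 3.2 BLEU when training the S2ST model on the \vp~S2ST dataset, compared to a baseline trained on un-normalized speech targets. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.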
In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
End-to-End speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output and reference into a shared embedding space and computes a score of the translation quality that can be used as a proxy to human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that when evaluated at the sentence level, BLASER correlates significantly better with human judgment compared to ASR-dependent metrics including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows combining speech and text as inputs to BLASER does not increase the correlation with human scores, but best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.
We propose a) a Language Agnostic end-to-end Speech Translation model (LAST), and b) a data augmentation strategy to increase code-switching (CS) performance. With increasing globalization, multiple languages are increasingly used interchangeably during fluent speech. Such CS complicates traditional speech recognition and translation, as we must recognize which language was spoken first and then apply a language-dependent recognizer and subsequent translation component to generate the desired target language output. Such a pipeline introduces latency and errors. In this paper, we eliminate the need for that, by treating speech recognition and translation as one unified end-to-end speech translation problem. By training LAST with both input languages, we decode speech into one target language, regardless of the input language. LAST delivers comparable recognition and speech translation accuracy in monolingual usage, while reducing latency and error rate considerably when CS is observed.
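We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems. Two versions of the translation speech are provided: 1) CVSS-C: all translation speech is in a single high-quality canonical voice; 2) CVSS-T: the translation speech is in voices transferred from the corresponding source speech. In addition, CVSS provides normalized translation text that matches the pronunciation in the translation speech. On each version of CVSS, we built baseline multilingual direct S2ST models and cascade S2ST models, verifying the effectiveness of the corpus. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2 that outperforms the previous state of the art trained on the corpus without extra data. Nevertheless, the performance of the direct S2ST models approaches that of the strong cascade baselines when trained from scratch, with a difference of only 0.1 or 0.7 BLEU on ASR-transcribed translations when initialized from matching ST models.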
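Chinese dialect text-to-speech (TTS) systems can usually be used only by native linguists, because the written form of Chinese dialects differs from Mandarin in characters, idioms, grammar, and usage, and even local speakers cannot input a correct sentence. For Mandarin text input, Chinese dialect TTS can only generate partly meaningful speech with relatively poor prosody and naturalness. To lower the barrier to use and make the technology more practical commercially, we propose a novel Chinese dialect TTS frontend with a translation module. It helps convert Mandarin text into idiomatic expressions with correct orthography and grammar, so that the intelligibility and naturalness of the synthesized speech can be improved. A non-autoregressive neural machine translation model with a glancing sampling strategy is proposed for the translation task. This is the first known work to incorporate translation into a TTS frontend. Our experiments on Cantonese show that the proposed frontend can help a Cantonese TTS system achieve a 0.27 improvement in MOS with Mandarin input.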