End-to-end Speech Translation (E2E ST) aims to translate source speech into the target translation without generating an intermediate transcript. However, existing approaches for E2E ST degrade considerably when only limited ST data are available. We observe that an ST model's performance strongly correlates with the similarity between its speech and transcript embeddings. In this paper, we propose Word-Aligned COntrastive learning (WACO), a novel method for few-shot speech-to-text translation. Our key idea is to bridge word-level representations of the two modalities via contrastive learning. We evaluate WACO and other methods on the MuST-C dataset, a widely used ST benchmark. Our experiments demonstrate that WACO outperforms the best baseline methods by 0.7-8.5 BLEU points with only 1 hour of parallel data. Code is available at https://anonymous.4open.science/r/WACO .
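To make the key idea concrete, here is a minimal PyTorch sketch (not the released WACO code; the pooling scheme, shapes, and temperature are illustrative assumptions) of a word-level contrastive loss: each row is a word embedding obtained by pooling speech frames or text tokens over aligned word spans, and matching speech-text pairs are contrasted against the other words in the batch.

```python
# Illustrative word-level contrastive (InfoNCE) loss between modalities.
import torch
import torch.nn.functional as F

def word_contrastive_loss(speech_word_emb, text_word_emb, temperature=0.1):
    """speech_word_emb, text_word_emb: (num_words, dim); row i of each
    tensor corresponds to the same word after span pooling."""
    s = F.normalize(speech_word_emb, dim=-1)
    t = F.normalize(text_word_emb, dim=-1)
    logits = s @ t.T / temperature        # pairwise cross-modal similarities
    labels = torch.arange(s.size(0))      # matched pairs lie on the diagonal
    return F.cross_entropy(logits, labels)
```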
How can we solve the data scarcity problem for end-to-end speech-to-text translation (ST)? Data augmentation is a well-known, effective way to improve performance on many tasks by enlarging the training set. In this paper, we propose the Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two stages of fine-tuning based on a model pre-trained with external machine translation (MT) data. In the first stage of fine-tuning, we mix the training corpus at three levels, namely the word level, sentence level, and frame level, and fine-tune the entire model on the mixed data. In the second stage of fine-tuning, we feed both the original speech sequences and the original text sequences into the model in parallel and use Jensen-Shannon divergence to regularize their outputs. Experiments and analysis on the MuST-C speech translation benchmark show that M^3ST outperforms current strong baselines and achieves state-of-the-art results in all eight directions with an average BLEU of 29.9.
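As a hedged illustration of the second-stage regularizer, the sketch below computes a Jensen-Shannon divergence between the decoder output distributions obtained from the speech input and from the parallel text input; the function name and tensor shapes are assumptions, not the released M^3ST API.

```python
# Illustrative Jensen-Shannon regularizer between two decoder outputs.
import math
import torch
import torch.nn.functional as F

def js_regularizer(logits_speech, logits_text):
    """logits_*: (batch, seq_len, vocab) decoder logits for the two inputs."""
    p = F.log_softmax(logits_speech, dim=-1)
    q = F.log_softmax(logits_text, dim=-1)
    m = torch.logsumexp(torch.stack([p, q]), dim=0) - math.log(2.0)  # log mixture
    # JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m)
    kl_pm = F.kl_div(m, p, log_target=True, reduction="batchmean")
    kl_qm = F.kl_div(m, q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pm + kl_qm)
```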
We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems. It enhances the model's ability to adapt one modality (i.e., source-language speech) to another (i.e., source-language text). The speech translation model can thus learn from both unlabeled and labeled data, especially when source-language text data is abundant. Beyond this, we present a denoising method to build a robust text encoder that can handle both clean and noisy text data. Our system sets new state-of-the-art results on the MuST-C En-De, En-Fr, and LibriSpeech En-Fr tasks.
This paper describes our end-to-end YiTrans speech translation system submitted to the IWSLT 2022 offline task, which translates English audio into German, Chinese, and Japanese. The YiTrans system is built on large-scale pre-trained encoder-decoder models. More specifically, we first design a multi-stage pre-training strategy to build multi-modality models with large amounts of labeled and unlabeled data. We then fine-tune the corresponding components of the model for the downstream speech translation tasks. Moreover, we make various efforts to improve performance, such as data filtering, data augmentation, speech segmentation, model ensembling, and so on. Experimental results show that our YiTrans system achieves significant improvements over strong baselines in all three translation directions, including a +5.2 BLEU improvement over last year's best end-to-end system on tst2021 English-German. Our final submissions rank first among end-to-end systems on English-German and English-Chinese according to the automatic evaluation metrics. We make our code and models publicly available.
End-to-end speech-to-text translation (E2E-ST) has become increasingly popular due to its potential for reduced error propagation, lower latency, and fewer parameters. Given a triplet training corpus $\langle speech, transcription, translation\rangle$, a conventional high-quality E2E-ST system leverages the $\langle speech, transcription\rangle$ pairs to pre-train the model and then leverages the $\langle speech, translation\rangle$ pairs to optimize it further. However, this process involves only two-tuple data at each stage, and this loose coupling cannot fully exploit the association among the triplet data. In this paper, we attempt to model the joint probability of transcription and translation conditioned on the speech input, so as to exploit such triplet data directly. Based on this, we propose a novel regularization method for model training to improve the agreement of the dual-path decomposition within the triplet data, which should in theory be equal. To achieve this, we introduce two Kullback-Leibler divergence regularization terms into the training objective to reduce the mismatch between the output probabilities of the dual paths. The trained model can then naturally be used as an E2E-ST model via a pre-defined early-stop tag. Experiments on the MuST-C benchmark show that our proposed method significantly outperforms state-of-the-art E2E-ST baselines on all 8 language pairs, while achieving better performance on the automatic speech recognition task. Our code is open-sourced at https://github.com/duyichao/e2e-st-tda.
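The agreement regularizer admits a short sketch (assuming per-token log-probabilities of the same target sequence are available from both factorization paths over a shared vocabulary; names are illustrative, not the repository's API).

```python
# Illustrative dual-path agreement: symmetric KL between the output
# distributions of the two factorizations of the joint probability.
import torch.nn.functional as F

def dual_path_agreement(log_p_path1, log_p_path2):
    """log_p_path*: (batch, seq_len, vocab) log-probabilities of the same
    target tokens under the two decomposition paths."""
    kl_12 = F.kl_div(log_p_path2, log_p_path1, log_target=True, reduction="batchmean")
    kl_21 = F.kl_div(log_p_path1, log_p_path2, log_target=True, reduction="batchmean")
    return kl_12 + kl_21  # added to the usual cross-entropy objectives
```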
To alleviate the data scarcity problem in end-to-end speech translation (ST), pre-training on data for speech recognition and machine translation is considered an important technique. However, the modality gap between speech and text prevents the ST model from efficiently inheriting knowledge from the pre-trained models. In this work, we propose AdaTranS for end-to-end ST. It adapts the speech features with a new shrinking mechanism that mitigates the length mismatch between speech and text features by predicting word boundaries. Experiments on the MuST-C dataset demonstrate that AdaTranS achieves better performance than other shrinking-based methods, with higher inference speed and lower memory usage. Further experiments also show that AdaTranS can be equipped with additional alignment losses to further improve performance.
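One plausible form of such a shrinking mechanism is sketched below, assuming a predictor that outputs a per-frame probability of starting a new word (an illustration, not the AdaTranS implementation): frames within each predicted word span are mean-pooled, so the shrunk speech sequence approaches the text length.

```python
# Illustrative boundary-based shrinking of speech encoder outputs.
import torch

def shrink_by_boundaries(frames, boundary_prob, threshold=0.5):
    """frames: (seq_len, dim) encoder outputs for one utterance;
    boundary_prob: (seq_len,) probability that a frame starts a word."""
    is_boundary = boundary_prob > threshold
    is_boundary[0] = True                          # first frame opens a segment
    segment_id = torch.cumsum(is_boundary.long(), dim=0) - 1
    num_segments = int(segment_id[-1]) + 1
    pooled = torch.zeros(num_segments, frames.size(1))
    pooled.index_add_(0, segment_id, frames)       # sum frames per segment
    counts = torch.bincount(segment_id, minlength=num_segments).unsqueeze(1)
    return pooled / counts.float()                 # mean-pool each word span
```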
We present Mu$^{2}$SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu$^{2}$SLAM trains the speech-text models with a sequence-to-sequence masked denoising objective similar to T5 on the decoder and a masked language modeling (MLM) objective on the encoder, for both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoST AST, Mu$^{2}$SLAM establishes a new state-of-the-art for models trained on public datasets, improving on xx-en translation over the previous best by 1.9 BLEU points and on en-xx translation by 1.1 BLEU points. On Voxpopuli ASR, our model matches the performance of an mSLAM model fine-tuned with an RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks.
End-to-end speech translation (E2E-ST) has received increasing attention due to its potential for reduced error propagation, lower latency, and fewer parameters. However, the effectiveness of neural approaches to this task is severely limited by the available training corpora, especially for domain adaptation, where in-domain triplet training data is scarce or nonexistent. In this paper, we propose a novel non-parametric method that leverages a domain-specific text translation corpus to achieve domain adaptation for E2E-ST systems. To this end, we first incorporate an additional encoder into the pre-trained E2E-ST model to realize text translation modeling, and then unify the output representations of the decoder for the text and speech translation tasks by reducing the correspondence representation mismatch on the available triplet training data. During domain adaptation, a k-nearest-neighbor (kNN) classifier is introduced to produce the final translation distribution using an external datastore constructed from the domain-specific text translation corpus, while the unified output representations are adopted to perform the similarity search. Experiments on the Europarl-ST benchmark demonstrate that, when involving only in-domain text translation data, our proposed approach significantly improves the baseline on average in all translation directions, even outperforming the strong in-domain fine-tuning method.
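A single decoding step of the kNN component can be sketched as follows (a simplified illustration in the style of kNN-MT; the datastore layout, distance temperature, and interpolation weight are assumptions).

```python
# Illustrative kNN interpolation of the translation distribution.
import torch
import torch.nn.functional as F

def knn_interpolate(model_probs, query, keys, values, vocab_size,
                    k=8, temperature=10.0, lam=0.5):
    """model_probs: (vocab_size,) model distribution at this step;
    query: (dim,) unified decoder state; keys: (N, dim) datastore states;
    values: (N,) target-token ids stored alongside each key."""
    dists = ((keys - query) ** 2).sum(dim=-1)         # squared L2 to all keys
    neg_d, idx = torch.topk(-dists, k)                # k nearest neighbours
    weights = F.softmax(neg_d / temperature, dim=-1)  # closer keys weigh more
    p_knn = torch.zeros(vocab_size)
    p_knn.index_add_(0, values[idx], weights)         # scatter onto the vocabulary
    return lam * p_knn + (1 - lam) * model_probs
```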
This paper presents a unified end-to-end framework for both streaming and non-streaming speech translation. While the training recipes for non-streaming speech translation are mature, recipes for streaming speech translation are yet to be established. In this work, we focus on developing a unified model (UniST) that supports both streaming and non-streaming ST from the perspective of its fundamental components, including the training objective, the attention mechanism, and the decoding policy. Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves a better trade-off between BLEU score and latency metrics for streaming ST, compared with end-to-end baselines and cascaded models. We will make our code and evaluation tools publicly available.
End-to-end (E2E) speech-to-text translation (ST) often depends on pre-training its encoder and/or decoder with source transcripts, via speech recognition or text translation tasks, without which translation performance drops substantially. However, transcripts are not always available, and E2E ST without such pre-training has rarely been studied in the literature. In this paper, we revisit this question and explore how far the quality of E2E ST trained on speech-translation pairs alone can be pushed. We re-examine several techniques proven beneficial for ST and offer a set of best practices that biases a Transformer-based E2E ST system toward training from scratch. In addition, we propose a parameterized distance penalty to facilitate the modeling of locality in the self-attention model for speech. On four benchmarks covering 23 languages, our experiments show that, without using any transcripts or pre-training, the proposed system reaches and even outperforms previous studies adopting pre-training, although the gap remains in (extremely) low-resource settings. Finally, we discuss neural acoustic feature modeling, where a neural model is designed to extract acoustic features directly from the raw speech signal, in order to simplify inductive biases and give the model more freedom in describing speech. For the first time, we demonstrate its feasibility and show encouraging results on ST tasks.
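One plausible shape of such a penalty, shown purely as an illustration (the paper's exact parameterization may differ), adds a learnable per-head bias to the self-attention logits that grows with the distance between frames.

```python
# Illustrative parameterized distance penalty for self-attention logits.
import torch
import torch.nn as nn

class DistancePenaltyBias(nn.Module):
    def __init__(self, num_heads):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_heads))  # learned per head

    def forward(self, seq_len):
        pos = torch.arange(seq_len)
        dist = (pos[None, :] - pos[:, None]).abs().float()     # (L, L)
        # Added to attention logits before softmax; distant frames are penalized.
        return -self.scale[:, None, None] * torch.log1p(dist)  # (H, L, L)
```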
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches for achieving fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and subsequently predicts discrete acoustic units. We enhance the model performance through subword prediction in the first-pass decoder, an improved two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder on a self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with a 2.83x decoding speed-up. We show that the proposed methods boost performance even when predicting spectrograms in the second pass; predicting discrete units, however, achieves a 2.51x decoding speed-up compared to that case.
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output, and reference into a shared embedding space, and computes a score of the translation quality that can be used as a proxy to human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that, when evaluated at the sentence level, BLASER correlates significantly better with human judgment than ASR-dependent metrics, including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows that combining speech and text as inputs to BLASER does not increase the correlation with human scores, and that the best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.
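For intuition, an unsupervised BLASER-style score can be sketched as below (the encoder and the way the two similarities are combined are illustrative assumptions; the best-performing variant described above is instead trained on human rating scores).

```python
# Illustrative text-free quality score from a shared embedding space.
import torch.nn.functional as F

def blaser_like_score(src_emb, mt_emb, ref_emb):
    """Each argument: (dim,) speech embedding from a shared multilingual
    multimodal encoder (source, translation output, reference)."""
    sim_src = F.cosine_similarity(src_emb, mt_emb, dim=0)
    sim_ref = F.cosine_similarity(ref_emb, mt_emb, dim=0)
    return (sim_src + sim_ref) / 2  # higher = better quality proxy
```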
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains to translation performance. The mined data and models are freely available.
Speech translation models are unable to directly process long audios, like TED talks, which have to be split into shorter segments. Speech translation datasets provide manual segmentations of the audios, which are not available in real-world scenarios, and existing segmentation methods usually significantly reduce translation quality at inference time. To bridge the gap between the manual segmentation of training and the automatic one at inference, we propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus. First, we train a classifier to identify the frames included in a segmentation, using speech representations from a pre-trained wav2vec 2.0. The optimal splitting points are then found by a probabilistic Divide-and-Conquer algorithm that progressively splits at the frame of lowest probability until all segments are below a pre-specified length. Experiments on MuST-C and mTEDx show that the translations of the segments produced by our method approach the quality of the manual segmentation on 5 language pairs. Namely, SHAS retains 95-98% of the manual segmentation's BLEU score, as opposed to the 87-93% of the best existing methods. Our method additionally generalizes to different domains and achieves high zero-shot performance in unseen languages.
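The probabilistic Divide-and-Conquer step admits a compact sketch (trimming and the minimum-segment-length handling of the full algorithm are omitted here).

```python
# Illustrative probabilistic divide-and-conquer segmentation.
def split_segments(probs, start, end, max_len):
    """probs: per-frame probabilities of being inside a segment (from the
    classifier); returns (start, end) segments covering [start, end)."""
    if end - start <= max_len:
        return [(start, end)]
    # Split at the interior frame least likely to be inside a segment.
    cut = min(range(start + 1, end - 1), key=lambda i: probs[i])
    return (split_segments(probs, start, cut, max_len)
            + split_segments(probs, cut, end, max_len))
```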
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks. Multilingual speech and text are encoded in a joint fixed-size representation space. We then compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities. All our models are trained without the need for cross-modal labeled translation data. Despite a fixed-size representation, we achieve very competitive results on several text and speech translation tasks. In particular, we significantly improve the state of the art for zero-shot speech translation on MuST-C. Incorporating a speech decoder in our framework, we introduce the first results for zero-shot direct speech-to-speech and text-to-speech translation.
Data scarcity is one of the main issues with the end-to-end approach for Speech Translation, as compared to the cascaded one. Although most data resources for Speech Translation are originally document-level, they offer a sentence-level view, which can be directly used during training. But this sentence-level view is single and static, potentially limiting the utility of the data. Our proposed data augmentation method SegAugment challenges this idea and aims to increase data availability by providing multiple alternative sentence-level views of a dataset. Our method heavily relies on an Audio Segmentation system to re-segment the speech of each document, after which we obtain the target text with alignment methods. The Audio Segmentation system can be parameterized with different length constraints, thus giving us access to multiple and diverse sentence-level views for each document. Experiments in MuST-C show consistent gains across 8 language pairs, with an average increase of 2.2 BLEU points, and up to 4.7 BLEU for lower-resource scenarios in mTEDx. Additionally, we find that SegAugment is also applicable to purely sentence-level data, as in CoVoST, and that it enables Speech Translation models to completely close the gap between the gold and automatic segmentation at inference time.
We present Maestro, a self-supervised training method to unify representations learned from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced the representations learned from these two modalities to be aligned in the latent space through multitasking and parameter sharing, or did so explicitly through modality conversion via speech synthesis. While the former suffers from interference between the two modalities, the latter introduces additional complexity. In this paper, we propose Maestro, a novel algorithm to learn unified representations from both modalities simultaneously that can be transferred to diverse downstream tasks such as Automatic Speech Recognition (ASR) and Speech Translation (ST). Maestro learns unified representations through sequence alignment, duration prediction, and matching embeddings in the learned space via an aligned masked language model loss. We establish a new state of the art (SOTA) on VoxPopuli multilingual ASR with an 8% relative reduction in Word Error Rate (WER), on multi-domain SpeechStew ASR (3.7% relative), and on 21-languages-to-English multilingual ST on CoVoST 2, with an improvement of 2.8 BLEU averaged over 21 languages.
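One ingredient, matching duration-upsampled text embeddings to speech representations in the shared space, can be sketched as follows (the duration source and the choice of distance are assumptions made for illustration).

```python
# Illustrative embedding-matching term between upsampled text and speech.
import torch

def matched_embedding_loss(text_emb, durations, speech_emb):
    """text_emb: (num_tokens, dim); durations: (num_tokens,) predicted
    frames per token; speech_emb: (sum(durations), dim) speech outputs."""
    upsampled = torch.repeat_interleave(text_emb, durations, dim=0)
    return ((upsampled - speech_emb) ** 2).mean()  # pull modalities together
```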
Although audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and by the cost of deploying one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than the state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for speech recognition and speaker verification. In particular, our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input.
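Modality dropout itself is simple to sketch (the probabilities and the fusion by concatenation are illustrative assumptions, not the released u-HuBERT code).

```python
# Illustrative modality dropout over paired audio-visual features.
import random
import torch

def modality_dropout(audio_feats, video_feats, p_drop=0.5, p_audio=0.5):
    """audio_feats, video_feats: (seq_len, dim) features for one clip."""
    if random.random() < p_drop:
        if random.random() < p_audio:
            audio_feats = torch.zeros_like(audio_feats)   # drop the audio stream
        else:
            video_feats = torch.zeros_like(video_feats)   # drop the visual stream
    return torch.cat([audio_feats, video_feats], dim=-1)  # fused encoder input
```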
State-of-the-art encoder-decoder models (e.g., for machine translation (MT) or speech recognition (ASR)) are constructed and trained end-to-end as atomic units. No component of one model can be (re-)used by any other. We describe LegoNN, a procedure for building encoder-decoder architectures with decoder modules that can be reused across various MT and ASR tasks without any fine-tuning. To achieve this reusability, the interface between each encoder and decoder module is grounded to a sequence of marginal distributions over a discrete vocabulary pre-defined by the model designer. We present two approaches for ingesting these marginals; one is differentiable, allowing gradients to flow through the entire network, and the other is gradient-isolating. To enable the portability of decoder modules between MT tasks with different source languages and across other tasks like ASR, we introduce a modality-agnostic encoder that consists of a length-control mechanism to dynamically adapt the encoder's output length to match the expected input length range of a pre-trained decoder. We present several experiments demonstrating the effectiveness of LegoNN models: a trained language generation LegoNN decoder module from a German-English (De-En) MT task can be reused, without any fine-tuning, for the Europarl English ASR and Romanian-English (Ro-En) MT tasks, matching or beating the respective baseline models. When fine-tuned for a few thousand updates on the target task, our LegoNN models improved the Ro-En MT task by 1.5 BLEU points and achieved a 12.5% relative WER reduction on the Europarl ASR task. Furthermore, to show its extensibility, we composed a LegoNN ASR model from three modules, each learned within different end-to-end trained models on three different datasets, boosting the WER reduction to 19.5%.
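The differentiable ingestion of marginals can be sketched as taking the expected embedding under each marginal distribution, which keeps gradients flowing across the module boundary (dimensions and names are illustrative).

```python
# Illustrative differentiable ingestor of marginal distributions.
import torch.nn as nn

class MarginalIngestor(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # shared, pre-defined vocabulary

    def forward(self, marginals):
        """marginals: (seq_len, vocab_size), rows sum to 1, produced by the
        previous module; returns (seq_len, dim) soft token embeddings."""
        return marginals @ self.embed.weight  # expectation over the vocabulary
```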
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need for any text data. Different from existing work in the literature, we tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which fine-tunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average a 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech targets. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.