End-to-End speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output and reference into a shared embedding space and computes a score of the translation quality that can be used as a proxy to human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that when evaluated at the sentence level, BLASER correlates significantly better with human judgment compared to ASR-dependent metrics including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows combining speech and text as inputs to BLASER does not increase the correlation with human scores, but best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.
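The idea of scoring a translation directly in a shared embedding space can be illustrated with a minimal sketch. It assumes the source, translation output and reference segments have already been encoded into fixed-size vectors by some shared encoder (not shown here). The function name `embedding_similarity_score` and the plain sum of cosine similarities are illustrative assumptions, not the trained BLASER model, whose best results the abstract attributes to supervision from human rating scores.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embedding_similarity_score(
    src: Sequence[float], mt: Sequence[float], ref: Sequence[float]
) -> float:
    """Hypothetical unsupervised score: how close the translation embedding is
    to both the source embedding and the reference embedding."""
    return cosine(src, mt) + cosine(ref, mt)
```

Under these assumptions the score ranges from -2 to 2, and a translation whose embedding coincides with both source and reference receives the maximum value.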
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains to translation performance. The mined data and models are freely available.
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks. Multilingual speech and text are encoded in a joint fixed-size representation space. Then, we compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities. All our models are trained without the need for cross-modal labeled translation data. Despite a fixed-size representation, we achieve very competitive results on several text and speech translation tasks. In particular, we significantly improve the state-of-the-art for zero-shot speech translation on MuST-C. Incorporating a speech decoder in our framework, we introduce the first results for zero-shot direct speech-to-speech and text-to-speech translation.
As machine translation (MT) metrics improve their correlation with human judgement every year, it is crucial to understand the limitations of such metrics at the segment level. Specifically, it is important to investigate metric behaviour when facing accuracy errors in MT because these can have dangerous consequences in certain contexts (e.g., legal, medical). We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. We use ACES to evaluate a wide range of MT metrics including the submissions to the WMT 2022 metrics shared task and perform several analyses leading to general recommendations for metric developers. We recommend: a) combining metrics with different strengths, b) developing metrics that give more weight to the source and less to surface-level overlap with the reference and c) explicitly modelling additional language-specific information beyond what is available via multilingual embeddings.
End-to-end Speech Translation (E2E ST) aims to translate source speech into target translation without generating the intermediate transcript. However, existing approaches for E2E ST degrade considerably when only limited ST data are available. We observe that an ST model's performance strongly correlates with its embedding similarity from speech and transcript. In this paper, we propose Word-Aligned COntrastive learning (WACO), a novel method for few-shot speech-to-text translation. Our key idea is bridging word-level representations for both modalities via contrastive learning. We evaluate WACO and other methods on the MuST-C dataset, a widely used ST benchmark. Our experiments demonstrate that WACO outperforms the best baseline methods by 0.7-8.5 BLEU points with only 1-hour parallel data. Code is available at https://anonymous.4open.science/r/WACO .
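Bridging word-level representations of two modalities via contrastive learning can be sketched as follows. This is a minimal illustration under stated assumptions, not the WACO implementation: it assumes each word already has one speech vector and one text vector at the same index, uses cosine similarity, and uses an InfoNCE-style loss with a hypothetical `temperature` parameter. The function names are invented for this example.

```python
import math
from typing import Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_contrastive_loss(
    speech_words: Sequence[Sequence[float]],
    text_words: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from the abstract
) -> float:
    """InfoNCE-style loss: the speech vector of word i should be closer to the
    text vector of word i than to the text vectors of the other words."""
    if len(speech_words) != len(text_words) or not speech_words:
        raise ValueError("need the same non-zero number of speech and text word vectors")
    total = 0.0
    for i, speech_vec in enumerate(speech_words):
        logits = [_cosine(speech_vec, t) / temperature for t in text_words]
        # Numerically stable log-sum-exp over all candidate text words.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(speech_words)
```

In this sketch the loss is small when matching speech and text word vectors are aligned and grows when they are mismatched, which is the pressure a contrastive objective puts on the two encoders.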
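We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another and can be built without the need for any text data. Unlike existing work in the literature, we tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which fine-tunes a pre-trained speech encoder with paired audio from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain an average gain of 3.2 BLEU when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech targets. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.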
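Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles by inserting subtitle breaks that comply with specific display guidelines. As in speech translation (ST), model training requires parallel data comprising audio inputs paired with their textual translations. In SubST, however, the text must also be annotated with subtitle breaks. So far, this requirement has been a bottleneck for system development, as confirmed by the scarcity of publicly available SubST corpora. To fill this gap, we propose a method to convert existing ST corpora into SubST resources without human intervention. We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion, achieving high segmentation quality in zero-shot conditions. Comparative experiments with SubST systems trained on manual and automatic segmentations, respectively, result in similar performance, showing the effectiveness of our approach.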
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in crosslingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.
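We hypothesize that existing sentence-level machine translation (MT) metrics become less effective when the human reference contains ambiguities. To verify this hypothesis, we present a very simple method for extending pretrained metrics to incorporate context at the document level. We apply our method to three popular metrics, BERTScore, Prism and COMET, and to the reference-free metric COMET-QE. We evaluate the extended metrics on the WMT 2021 metrics shared task using the provided MQM annotations. Our results show that the extended metrics outperform their sentence-level counterparts in about 85% of the tested conditions, when excluding results on low-quality human references. Additionally, we show that our document-level extension dramatically improves accuracy on discourse phenomena tasks, outperforming a dedicated baseline by up to 6.1%. Our experimental results support our initial hypothesis and show that a simple extension of the metrics allows them to take advantage of context to resolve ambiguities in the reference.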
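We introduce CVSS, a massively multilingual speech-to-speech translation (S2ST) corpus covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems. Two versions of the translation speech are provided: 1) CVSS-C: all the translation speech is in a single high-quality canonical voice; 2) CVSS-T: the translation speech is in voices transferred from the corresponding source speech. In addition, CVSS provides normalized translation text that matches the pronunciation in the translation speech. On each version of CVSS, we build baseline multilingual direct S2ST models and cascade S2ST models, verifying the effectiveness of the corpus. To build strong cascade S2ST baselines, we train an ST model on CoVoST 2 that outperforms the previous state of the art trained on the corpus without extra data. Nevertheless, the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch, with a difference of only 0.1 or 0.7 BLEU on ASR-transcribed translation when initialized from matching ST models.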
Is it possible to leverage large-scale raw and raw parallel corpora to build a general learned metric? Existing learned metrics have gaps to human judgements, are model-dependent, or are limited to the domains or tasks where human ratings are available. In this paper, we propose SEScore2, a model-based metric pretrained on a million-scale synthetic dataset constructed by our novel retrieval-augmented data synthesis pipeline. SEScore2 achieves high correlation with human judgements without any human rating supervision. Importantly, our unsupervised SEScore2 can outperform supervised metrics, which are trained on News human ratings, in the TED domain. We evaluate SEScore2 on four text generation tasks across three languages. SEScore2 outperforms all prior unsupervised evaluation metrics in machine translation, speech translation, data-to-text and dialogue generation, with an average Kendall improvement of 0.158. SEScore2 even outperforms the SOTA supervised BLEURT on data-to-text, dialogue generation and overall correlation.
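Because the result above is reported as a Kendall correlation between metric scores and human judgements, a small worked example of that statistic may help. The sketch below implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant. The abstract does not say which Kendall variant was used, so this choice and the function name `kendall_tau_a` are assumptions.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs.
    Pairs tied in either list contribute zero to the numerator."""
    if len(x) != len(y):
        raise ValueError("both score lists must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two items to form a pair")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For instance, with metric scores `[1, 2, 3]` and human scores `[1, 3, 2]`, two of the three pairs are ordered the same way and one is reversed, giving a correlation of 1/3.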
Speech translation (ST) is the task of directly translating acoustic speech signals in a source language into text in a foreign language. The ST task has long been addressed using a pipeline approach with two modules: first an Automatic Speech Recognition (ASR) system in the source language, followed by a text-to-text Machine Translation (MT) system. In the past few years, we have seen a paradigm shift towards end-to-end approaches using sequence-to-sequence deep neural network models. This paper presents our efforts towards the development of the first Broadcast News end-to-end Arabic to English speech translation system. Starting from independent ASR and MT LDC releases, we were able to identify about 92 hours of Arabic audio recordings for which the manual transcription was also translated into English at the segment level. These data were used to train and compare pipeline and end-to-end speech translation systems under multiple scenarios, including transfer learning and data augmentation techniques.
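Human-translated text displays distinct features from naturally written text in the same language. This phenomenon, known as translationese, has been argued to confound machine translation (MT) evaluation. Yet, we find that existing work on translationese neglects some important factors, and that its conclusions are mostly correlational rather than causal. In this work, we collect CausalMT, a dataset where the MT training data are also labeled with the human translation directions. We inspect two critical factors: the train-test direction match (whether the human translation directions in the training and test sets are aligned) and the data-model direction match (whether the model learns in the same direction as the human translation direction in the dataset). We show that these two factors have a large causal effect on MT performance, in addition to the test-model direction mismatch highlighted by existing work on the impact of translationese. In light of our findings, we provide a set of suggestions for MT training and evaluation. Our code and data are at https://github.com/edisonni-hku/causalmt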
Data scarcity is one of the main issues with the end-to-end approach for Speech Translation, as compared to the cascaded one. Although most data resources for Speech Translation are originally document-level, they offer a sentence-level view, which can be directly used during training. But this sentence-level view is single and static, potentially limiting the utility of the data. Our proposed data augmentation method SegAugment challenges this idea and aims to increase data availability by providing multiple alternative sentence-level views of a dataset. Our method heavily relies on an Audio Segmentation system to re-segment the speech of each document, after which we obtain the target text with alignment methods. The Audio Segmentation system can be parameterized with different length constraints, thus giving us access to multiple and diverse sentence-level views for each document. Experiments in MuST-C show consistent gains across 8 language pairs, with an average increase of 2.2 BLEU points, and up to 4.7 BLEU for lower-resource scenarios in mTEDx. Additionally, we find that SegAugment is also applicable to purely sentence-level data, as in CoVoST, and that it enables Speech Translation models to completely close the gap between the gold and automatic segmentation at inference time.
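Evaluation metrics are a key ingredient for the progress of text generation systems. In recent years, several BERT-based evaluation metrics have been proposed (including BERTScore, MoverScore, BLEURT, etc.) which correlate much better with human assessment of text generation quality than BLEU or ROUGE, which were invented two decades ago. However, little is known about what these metrics, which are based on black-box language model representations, actually capture (it is typically assumed that they model semantic similarity). In this work, we use a simple regression-based global explainability technique to disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap. We show that the different metrics capture all aspects to some degree, but that they are all substantially sensitive to lexical overlap, just like BLEU and ROUGE. This exposes limitations of these newly proposed metrics, which we also highlight in an adversarial test scenario.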
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.
Large language models (LLMs) that have been trained on multilingual but not parallel text exhibit a remarkable ability to translate between languages. We probe this ability in an in-depth study of the pathways language model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly-trained LLMs to date. We investigate various strategies for choosing translation examples for few-shot prompting, concluding that example quality is the most important factor. Using optimized prompts, we revisit previous assessments of PaLM's MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags that of state-of-the-art supervised systems. We conclude by providing an analysis of PaLM's MT output which reveals some interesting properties and prospects for future work.
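This work applies Minimum Bayes Risk (MBR) decoding to optimize diverse automated metrics of translation quality. Automatic metrics in machine translation have made tremendous progress recently. In particular, neural metrics fine-tuned on human ratings (e.g., BLEURT or COMET) outperform surface metrics in terms of correlation with human judgements. Our experiments show that combining a neural translation model with a neural reference-based metric, BLEURT, results in significant improvements in both automatic and human evaluations. This improvement is obtained with translations that differ from classical beam-search output: these translations have much lower likelihood and are less favored by surface metrics like BLEU.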
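The MBR decoding idea can be illustrated with a short sketch: among a set of candidate translations, pick the one with the highest average utility when every other candidate is treated as a pseudo-reference. The work above uses the neural metric BLEURT as the utility; the sketch below swaps in a toy unigram F1 so that it stays self-contained. The names `unigram_f1` and `mbr_decode`, and the choice to exclude each candidate from its own pseudo-references, are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy utility function (a stand-in for a neural metric): unigram F1 overlap."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = unigram_f1,
) -> str:
    """Return the candidate with the highest expected utility, using the
    remaining candidates as pseudo-references."""
    if not candidates:
        raise ValueError("need at least one candidate")

    def expected_utility(index: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != index]
        if not others:
            return 0.0
        return sum(utility(candidates[index], other) for other in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]
```

In practice the candidates would be samples or n-best outputs from the translation model. The selected output need not be the most likely one, which matches the observation above that MBR outputs can have lower model likelihood than beam-search output.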
Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.
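The vast majority of evaluation metrics for machine translation are supervised, i.e., they (i) assume the existence of reference translations, (ii) are trained on human scores, or (iii) leverage parallel data. This hinders their applicability to cases where such supervision signals are not available. In this work, we develop fully unsupervised evaluation metrics. To do so, we leverage similarities and synergies between evaluation metric induction, parallel corpus mining, and MT systems. In particular, we use an unsupervised evaluation metric to mine pseudo-parallel data, which we use to remap deficient underlying vector spaces (in an iterative manner) and to induce an unsupervised MT system, which then provides pseudo-references as an additional component in the metric. Finally, we also induce unsupervised multilingual sentence embeddings from the pseudo-parallel data. We show that our fully unsupervised metrics are effective, i.e., they beat supervised competitors on 4 out of 5 evaluation datasets.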