挖掘低资源语言的高质量bitexts具有挑战性。本文表明,语言模型的句子表示,并用多个否定性排名损失(一个对比目标)有助于检索清洁的bitexts。实验表明,从我们的方法中挖出的并行数据基本上优于低资源语言高价和Pashto的先前最先进的方法。
translated by 谷歌翻译
翻译质量估计(QE)是预测机器翻译(MT)输出质量的任务,而无需任何参考。作为MT实际应用中的重要组成部分,这项任务已越来越受到关注。在本文中,我们首先提出了XLMRScore,这是一种基于使用XLM-Roberta(XLMR)模型计算的BertScore的简单无监督的QE方法,同时讨论了使用此方法发生的问题。接下来,我们建议两种减轻问题的方法:用未知令牌和预训练模型的跨语性对准替换未翻译的单词,以表示彼此之间的一致性单词。我们在WMT21 QE共享任务的四个低资源语言对上评估了所提出的方法,以及本文介绍的新的英语FARSI测试数据集。实验表明,我们的方法可以在两个零射击方案的监督基线中获得可比的结果,即皮尔森相关性的差异少于0.01,同时在所有低资源语言对中的平均低资源语言对中的无人看管竞争对手的平均水平超过8%的平均水平超过8%。 。
translated by 谷歌翻译
我们介绍Samanantar,是最大的公开可用的并行Corpora Collection,用于指示语言。该集合中的英语和11个上线语言之间总共包含4970万句对(来自两种语言系列)。具体而言,我们从现有的公共可用并行基层编译1240万句对,另外,从网络上挖掘3740万句对,导致4倍增加。我们通过组合许多语料库,工具和方法来挖掘网站的并行句子:(a)Web爬行单格式语料库,(b)文档OCR,用于从扫描的文档中提取句子,(c)用于对齐句子的多语言表示模型,以及(d)近似最近的邻居搜索搜索大量句子。人类评估新矿业的Corpora的样本验证了11种语言的高质量平行句子。此外,我们使用英语作为枢轴语言,从英式并行语料库中提取所有55个指示语言对之间的834百万句子对。我们培训了跨越Samanantar上所有这些语言的多语种NMT模型,这在公开可用的基准上表现出现有的模型和基准,例如弗洛雷斯,建立萨曼塔尔的效用。我们的数据和模型可在Https://indicnlp.ai4bharat.org/samanantar/上公开提供,我们希望他们能够帮助推进NMT和Multibingual NLP的研究。
translated by 谷歌翻译
句子嵌入通常用于文本聚类和语义检索任务中。最先进的句子表示方法基于大量手动标记句子对集合的人工神经网络。高资源语言(例如英语或中文)可以使用足够数量的注释数据。在不太受欢迎的语言中,必须使用多语言模型,从而提供较低的性能。在本出版物中,我们通过提出一种培训有效的语言特定句子编码的方法来解决此问题,而无需手动标记数据。我们的方法是从句子对准双语文本语料库中自动构建释义对数据集。然后,我们使用收集的数据来微调具有附加复发池层的变压器语言模型。我们的句子编码器可以在不到一天的时间内在一张图形卡上进行培训,从而在各种句子级的任务上实现高性能。我们在波兰语中评估了八个语言任务的方法,并将其与最佳可用多语言句子编码器进行比较。
translated by 谷歌翻译
Translating training data into many languages has emerged as a practical solution for improving cross-lingual transfer. For tasks that involve span-level annotations, such as information extraction or question answering, an additional label projection step is required to map annotated spans onto the translated texts. Recently, a few efforts have utilized a simple mark-then-translate method to jointly perform translation and projection by inserting special markers around the labeled spans in the original sentence. However, as far as we are aware, no empirical analysis has been conducted on how this approach compares to traditional annotation projection based on word alignment. In this paper, we present an extensive empirical study across 42 languages and three tasks (QA, NER, and Event Extraction) to evaluate the effectiveness and limitations of both methods, filling an important gap in the literature. Experimental results show that our optimized version of mark-then-translate, which we call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods. We analyze several key factors that affect end-task performance, and show EasyProject works well because it can accurately preserve label span boundaries after translation. We will publicly release all our code and data.
translated by 谷歌翻译
监督机器翻译的绝大多数评估指标,即(i)假设参考翻译的存在,(ii)受到人体得分的培训,或(iii)利用并行数据。这阻碍了其适用于此类监督信号的情况。在这项工作中,我们开发了完全无监督的评估指标。为此,我们利用评估指标,平行语料库开采和MT系统之间的相似性和协同作用。特别是,我们使用无监督的评估指标来开采伪并行数据,我们用来重塑缺陷的基础向量空间(以迭代方式),并诱导无监督的MT系统,然后提供伪引用作为伪参考作为在中的附加组件中的附加组件指标。最后,我们还从伪并行数据中诱导无监督的多语言句子嵌入。我们表明,我们完全无监督的指标是有效的,即,他们在5个评估数据集中的4个击败了受监督的竞争对手。
translated by 谷歌翻译
Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE to multilingual settings and reveal that contrastive learning on English data can surprisingly learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly improves previous sentence embedding methods on cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to fully supervised methods in retrieving low-resource languages and multilingual STS. The performance can be further enhanced when cross-lingual NLI data is available. Our code is publicly available at https://github.com/yaushian/mSimCSE.
translated by 谷歌翻译
我们介绍了第一个用于濒危Erzya语言与俄语以及我们为训练和评估它收集的数据集的神经机器翻译系统。BLEU分别分别为Erzya和Russian的BLEU分数分别为17和19,其中一半以上的翻译被以母语为母语的人可以接受。我们还调整了模型以在Erzya和其他10种语言之间转换,但是如果没有其他并行数据,这些方向上的质量仍然很低。我们将翻译模型与收集的文本语料库一起发布,新的语言标识模型以及适合Erzya语言的多语言句子编码器。这些资源将在https://github.com/slone-nlp/myv-nmt上找到。
translated by 谷歌翻译
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in crosslingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.
translated by 谷歌翻译
We present, Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. In each language, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence. We also create manually annotated testsets for 8 languages containing approximately 1000 sentences per language. We demonstrate the utility of the obtained dataset on existing testsets and the Naamapadam-test data for 8 Indic languages. We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set. IndicNER achieves the best F1 on the Naamapadam-test set compared to an mBERT model fine-tuned on existing datasets. IndicNER achieves an F1 score of more than 80 for 7 out of 11 Indic languages. The dataset and models are available under open-source licenses at https://ai4bharat.iitm.ac.in/naamapadam.
translated by 谷歌翻译
Word alignment is to find translationally equivalent words between source and target sentences. Previous work has demonstrated that self-training can achieve competitive word alignment results. In this paper, we propose to use word alignments generated by a third-party word aligner to supervise the neural word alignment training. Specifically, source word and target word of each word pair aligned by the third-party aligner are trained to be close neighbors to each other in the contextualized embedding space when fine-tuning a pre-trained cross-lingual language model. Experiments on the benchmarks of various language pairs show that our approach can surprisingly do self-correction over the third-party supervision by finding more accurate word alignments and deleting wrong word alignments, leading to better performance than various third-party word aligners, including the currently best one. When we integrate all supervisions from various third-party aligners, we achieve state-of-the-art word alignment performances, with averagely more than two points lower alignment error rates than the best third-party aligner. We released our code at https://github.com/sdongchuanqi/Third-Party-Supervised-Aligner.
translated by 谷歌翻译
在本文中,我们介绍了一个高质量的大规模基准数据集,用于英语 - 越南语音翻译,其中有508音频小时,由331k的三胞胎组成(句子长度的音频,英语源笔录句,越南人目标subtitle句子)。我们还使用强基础进行了经验实验,发现传统的“级联”方法仍然优于现代“端到端”方法。据我们所知,这是第一个大规模的英语 - 越南语音翻译研究。我们希望我们的公开数据集和研究都可以作为未来研究和英语语音翻译应用的起点。我们的数据集可从https://github.com/vinairesearch/phost获得
translated by 谷歌翻译
Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of 'nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data. In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.
translated by 谷歌翻译
在本文中,我们建议将不同语言的句子表示对齐到统一的嵌入空间,其中可以用简单的点产品计算语义相似之处(交叉语言和单晶)。预先接受的语言模型与翻译排名任务进行微调。现有工作(Feng等人,2020)使用与批量相同的句子作为否定,这可能会遭受易于否定的问题。我们适应MOCO(赫尔,2020)以进一步提高对准质量。作为实验结果表明,我们的模型产生的句子表示在包括Tatoeba en-Zh的许多任务中实现了新的最先进的,包括STATOEBA EN-ZH类似性搜索(Artetxe和Schwenk,2019b),Bucc en-Zh Bitext Mining,7个数据集上的语义文本相似性。
translated by 谷歌翻译
我们提出了一种两阶段的培训方法,用于开发单个NMT模型,以翻译英语和英语的看不见的语言。对于第一阶段,我们将编码器模型初始化以鉴定XLM-R和Roberta的权重,然后对25种语言的平行数据进行多种语言微调。我们发现该模型可以推广到对看不见的语言的零击翻译。在第二阶段,我们利用这种概括能力从单语数据集生成合成的并行数据,然后用连续的反向翻译训练。最终模型扩展到了英语到许多方向,同时保持了多到英语的性能。我们称我们的方法为ecxtra(以英语为中心的跨语言(x)转移)。我们的方法依次利用辅助并行数据和单语言数据,并且在概念上很简单,仅在两个阶段都使用标准的跨熵目标。最终的ECXTRA模型对8种低资源语言的无监督NMT进行了评估,该语言为英语至哈萨克语(22.3> 10.4 bleu)以及其他15个翻译方向的竞争性能而获得了新的最先进。
translated by 谷歌翻译
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART -a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective . mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show it also enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
translated by 谷歌翻译
我们对真正低资源语言的神经机翻译(NMT)进行了实证研究,并提出了一个训练课程,适用于缺乏并行培训数据和计算资源的情况,反映了世界上大多数世界语言和研究人员的现实致力于这些语言。以前,已经向低资源语言储存了使用后翻译(BT)和自动编码(AE)任务的无监督NMT。我们证明利用可比的数据和代码切换作为弱监管,与BT和AE目标相结合,即使仅使用适度的计算资源,低资源语言也会显着改进。在这项工作中提出的培训课程实现了Bleu分数,可通过+12.2 Bleu为古吉拉特和+3.7 Bleu为哈萨克斯培训的监督NMT培训,展示了弱势监督的巨大监督态度资源语言。在受到监督数据的培训时,我们的培训课程达到了索马里数据集(索马里29.3的BLEU的最先进的结果)。我们还观察到增加更多时间和GPU来培训可以进一步提高性能,强调报告在MT研究中的报告资源使用的重要性。
translated by 谷歌翻译
一些基于变压器的模型可以执行跨语言转移学习:这些模型可以通过一种语言对特定任务进行培训,并以另一种语言的同一任务给予相对良好的结果,尽管仅在单语任务中进行了预先培训。但是,关于这些基于变压器的模型是否学习跨语言的通用模式,目前尚无共识。我们提出了一种单词级的任务不可能的方法,以评估此类模型构建的上下文化表示的对齐方式。我们表明,与以前的方法相比,我们的方法提供了更准确的翻译成对,以评估单词级别对齐。我们的结果表明,基于多语言变压器模型的某些内部层优于其他明确对齐的表示,甚至根据多语言对齐的更严格的定义,更是如此。
translated by 谷歌翻译
在本文中,我们介绍了DOCMT5,这是一种预先培训的多语言序列到序列语言模型,具有大规模并行文档。虽然以前的方法专注于利用句子级并行数据,但我们尝试构建一个可以理解和生成长文件的通用预训练模型。我们提出了一个简单有效的预训练目标 - 文件重新排序机翻译(DRMT),其中需要翻译和屏蔽的输入文件。 DRMT在各种文档级生成任务中对强大基线带来一致的改进,包括超过12个BLEU积分,用于观看语言对文件级MT,超过7个BLEU积分,用于看不见的语言对文件级MT和3胭脂-1位为言语对交叉术概要。我们在WMT20 De-en和IWSLT15 Zh-ZH文档翻译任务中实现了最先进的(SOTA)。我们还对文档预培训的各种因素进行了广泛的分析,包括(1)预培训数据质量的影响和(2)组合单语言和交叉训练的影响。我们计划公开使用我们的模型检查站。
translated by 谷歌翻译
We propose BERTSCORE, an automatic evaluation metric for text generation. Analogously to common metrics, BERTSCORE computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTSCORE correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTSCORE is more robust to challenging examples when compared to existing metrics.
translated by 谷歌翻译