Neural machine translation (NMT) has attracted wide attention due to its impressive quality. Beyond quality, controlling translation style is also an important demand for many languages. Previous studies mainly focus on controlling formality and have achieved some improvements, but they still face two challenges. The first is limited evaluation: style carries abundant information, including lexis and syntax, yet only formality has been well studied. The second is the heavy reliance on iterative fine-tuning whenever a new style is required. Correspondingly, this paper contributes both a benchmark and an approach. First, we revisit this task and propose a multiway stylized machine translation (MSMT) benchmark, which includes multiple categories of styles in four language directions to push the boundary of this task. Second, we propose a method named style activation prompt (StyleAP), which retrieves prompts from a stylized monolingual corpus and needs no extra fine-tuning. Experiments show that StyleAP can effectively control the style of translation and achieves remarkable performance. All of our data and code are released at https://github.com/IvanWang0730/StyleAP.
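Large language models pre-trained with self-supervised learning have demonstrated impressive zero-shot capabilities on a wide spectrum of tasks. In this work, we present WeLM: a well-read pre-trained language model for Chinese that is able to seamlessly perform different types of tasks with zero or few-shot demonstrations. WeLM is trained with 10B parameters by "reading" a curated high-quality corpus covering a wide range of topics. We show that WeLM is equipped with broad knowledge of various domains and languages. On 18 monolingual (Chinese) tasks, WeLM significantly outperforms existing pre-trained models of similar size and matches the performance of models up to 25 times larger. WeLM also exhibits strong capabilities in multilingual and code-switching understanding, outperforming existing multilingual language models pre-trained on 30 languages. Furthermore, we collected human-written prompts for a large set of supervised datasets in Chinese and fine-tuned WeLM with multi-prompted training. The resulting model attains strong generalization on unseen types of tasks and outperforms the unsupervised WeLM in zero-shot learning. Finally, we demonstrate that WeLM has basic skills at explaining and calibrating its own decisions, which may be a promising direction for future research. Our models can be applied for at https://welm.weixin.qq.com/docs/api/.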
Multimodal machine translation (MMT) aims to improve translation quality by incorporating information from other modalities, such as vision. Previous MMT systems mainly focus on better access to and use of visual information and tend to validate their methods on image-related datasets. These studies face two challenges. First, they can only utilize triple data (bilingual texts with images), which is scarce; second, current benchmarks are relatively restricted and do not correspond to realistic scenarios. This paper therefore establishes both new methods and a new dataset for MMT. First, we propose a framework, 2/3-Triplet, with two new approaches that enhance MMT by utilizing large-scale non-triple data: monolingual image-text data and parallel text-only data. Second, we construct an English-Chinese E-commercial Multimodal Translation dataset (including training and testing), named EMMT, whose test set is carefully selected so that some words are ambiguous and would be mistranslated without the help of images. Experiments show that our method is more suitable for real-world scenarios and can significantly improve translation performance by using more non-triple data. In addition, our model also rivals various SOTA models on conventional multimodal translation benchmarks.
Multilingual Pretrained Language Models (MPLMs) have shown their strong multilinguality in recent empirical cross-lingual transfer studies. In this paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC) pipeline to improve the zero-shot performance on low-resource languages (LRLs) by augmenting the context with semantically similar sentences retrieved from a high-resource language (HRL) as prompts. PARC improves the zero-shot performance on three downstream tasks (binary sentiment classification, topic categorization and natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in both unlabeled settings (+5.1%) and labeled settings (+16.3%). PARC-labeled also outperforms the finetuning baseline by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between the high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.
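Despite recent progress in the field of cross-modal retrieval, less research has focused on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use machine translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, corrupting the text embeddings and thereby compromising retrieval performance. To alleviate this, we introduce a multi-view self-distillation method to learn noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets that provide direct supervision from a similarity-based view and a feature-based view. In addition, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancy between original sentences and back-translated sentences to further improve the noise robustness of the text encoder. Extensive experiments are conducted on three video-text and image-text cross-modal retrieval benchmarks across different languages, and the results show that our method significantly improves overall performance without using extra human-labeled data. Moreover, when equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, which shows that our method is compatible with popular pre-trained models. Code and data are available at https://github.com/huiguanlab/nrccr.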
Translating training data into many languages has emerged as a practical solution for improving cross-lingual transfer. For tasks that involve span-level annotations, such as information extraction or question answering, an additional label projection step is required to map annotated spans onto the translated texts. Recently, a few efforts have utilized a simple mark-then-translate method to jointly perform translation and projection by inserting special markers around the labeled spans in the original sentence. However, as far as we are aware, no empirical analysis has been conducted on how this approach compares to traditional annotation projection based on word alignment. In this paper, we present an extensive empirical study across 42 languages and three tasks (QA, NER, and Event Extraction) to evaluate the effectiveness and limitations of both methods, filling an important gap in the literature. Experimental results show that our optimized version of mark-then-translate, which we call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods. We analyze several key factors that affect end-task performance, and show EasyProject works well because it can accurately preserve label span boundaries after translation. We will publicly release all our code and data.
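The mark-then-translate idea described above can be illustrated with a minimal sketch. It assumes square brackets as the span markers and character offsets as the span format; both are illustrative choices, and the function names `mark_spans` and `recover_spans` are hypothetical rather than taken from EasyProject. The translation step itself is replaced by a hand-written string, since no MT system is called here.

```python
def mark_spans(text, spans):
    """Wrap each (start, end) character span of `text` in [ ] markers."""
    out, prev = [], 0
    for start, end in sorted(spans):
        out.append(text[prev:start])
        out.append("[" + text[start:end] + "]")
        prev = end
    out.append(text[prev:])
    return "".join(out)


def recover_spans(marked):
    """Strip markers from a (translated) string; return clean text and spans."""
    clean, spans, start = [], [], None
    for ch in marked:
        if ch == "[" and start is None:
            start = len(clean)
        elif ch == "]" and start is not None:
            spans.append((start, len(clean)))
            start = None
        else:
            clean.append(ch)  # ordinary character (or a stray, unpaired marker)
    return "".join(clean), spans


source = "Barack Obama visited Paris"
marked = mark_spans(source, [(0, 12), (21, 26)])
# Stand-in for the MT system's output on `marked` (hand-written, hypothetical).
translated = "[Barack Obama] a visité [Paris]"
clean, spans = recover_spans(translated)
projected = [clean[s:e] for s, e in spans]
```

This sketch only recovers span boundaries; reattaching labels when the translation reorders or drops markers needs extra handling that is not shown here.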
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, the decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
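A machine translation system (MTS) is an effective tool for converting text or speech from one language into another. The need for an efficient translation system becomes obvious in a large multilingual environment like India, where English and a set of Indian languages (ILs) are officially used. In contrast with English, ILs are still treated as low-resource languages due to the unavailability of corpora. To address this asymmetry, multilingual neural machine translation (MNMT) has evolved as an ideal approach. In this paper, we propose an MNMT system to address the issues related to low-resource language translation. Our model comprises two MNMT systems, one for English-Indic (one-to-many) and the other for Indic-English (many-to-one), with a shared encoder-decoder covering 15 language pairs (30 translation directions). Most IL pairs have only a scanty amount of parallel data, which is not sufficient for training any machine translation model. We explore various augmentation strategies to improve overall translation quality with the proposed model. A state-of-the-art transformer architecture is used to realize the proposed model. Trials over a good amount of data reveal its superiority over conventional models. In addition, the paper addresses the use of language relationships (in terms of dialect, script, etc.), particularly the role of high-resource languages of the same family in boosting the performance of low-resource languages. Moreover, the experimental results show the advantage of back-translation and domain adaptation for ILs in enhancing the translation quality of both source and target languages. Using all these key approaches, our proposed model proves more effective than the baseline model in terms of the evaluation metric, i.e., the BLEU (BiLingual Evaluation Understudy) score, for a set of ILs.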
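Bilingual terminologies are important machine translation resources in the e-commerce domain, and are usually either manually translated or automatically extracted from parallel data. Human translation is costly, and e-commerce parallel corpora are very scarce. However, comparable data in different languages within the same commodity domain is abundant. In this paper, we propose a novel framework for extracting e-commerce bilingual terminologies from comparable data. Benefiting from cross-lingual pre-training in the e-commerce domain, our framework can make full use of the deep semantic relationship between source-side terminology and target-side sentences to extract the corresponding target terminology. Experimental results on various language pairs show that our approach achieves significantly better performance than various strong baselines.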
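The definition generation task aims to automatically generate a definition for a word in a specific context. However, owing to the lack of datasets covering different complexities, the definitions produced by models tend to stay at the same complexity level. This paper proposes a novel task of generating definitions for a word with controllable complexity levels. Correspondingly, we introduce COMPILING, a dataset that gives detailed information about Chinese definitions, in which each definition is labeled with its complexity level. The COMPILING dataset includes 74,303 words and 106,882 definitions. To the best of our knowledge, it is the largest dataset for the Chinese definition generation task. We select various representative generation methods as baselines for this task and conduct evaluations, which show that our dataset plays an outstanding role in helping models generate definitions at different complexity levels. We believe that the COMPILING dataset will benefit further research in complexity-controllable definition generation.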
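We introduce MTG, a new benchmark suite for training and evaluating multilingual text generation. It is the first multilingual multiway text generation dataset and has the largest amount of human-annotated data (400k). It includes four generation tasks (story generation, question generation, title generation and text summarization) across five languages (English, German, French, Spanish and Chinese). The multiway setup makes it possible to test a model's knowledge transfer capabilities across languages and tasks. Using MTG, we train and analyze several popular multilingual generation models from different aspects. Our benchmark suite fosters model performance enhancement with more human-annotated parallel data. It provides comprehensive evaluations across diverse generation scenarios. Code and data are available at https://github.com/zide05/mtg.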
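Translation quality estimation (QE) is the task of predicting the quality of machine translation (MT) output without any reference. As an important component in practical applications of MT, this task has gained increasing attention. In this paper, we first propose XLMRScore, a simple unsupervised QE method based on BERTScore computed using the XLM-RoBERTa (XLMR) model, and discuss the issues that arise when using this method. Next, we suggest two approaches to mitigate these issues: replacing untranslated words with the unknown token, and cross-lingual alignment of the pre-trained model so that aligned words are represented closer to each other. We evaluate the proposed method on four low-resource language pairs of the WMT21 QE shared task, as well as on a new English-Farsi test dataset introduced in this paper. Experiments show that our method achieves results comparable to the supervised baseline in two zero-shot scenarios, i.e., with a difference of less than 0.01 in Pearson correlation, while outperforming the unsupervised rivals by more than 8% on average across all the low-resource language pairs.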
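As a rough illustration of the BERTScore-style scoring this abstract builds on, the sketch below computes the greedy-matching precision, recall and F1 between source-token and translation-token vectors. It is a sketch under stated assumptions: the tiny hand-made vectors stand in for XLM-RoBERTa token embeddings (no model is loaded), and the function name `greedy_match_f1` is hypothetical, not from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(src_vecs, mt_vecs):
    """BERTScore-style F1: each token is matched to its most similar
    token on the other side, and the best similarities are averaged."""
    sim = [[cosine(s, m) for m in mt_vecs] for s in src_vecs]
    recall = sum(max(row) for row in sim) / len(src_vecs)
    precision = sum(
        max(sim[i][j] for i in range(len(src_vecs)))
        for j in range(len(mt_vecs))
    ) / len(mt_vecs)
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy 2-d vectors standing in for contextual token embeddings (assumption).
src = [[1.0, 0.0], [0.0, 1.0]]
full_mt = [[1.0, 0.0], [0.0, 1.0]]   # every source token has a match
partial_mt = [[1.0, 0.0]]            # second source token is uncovered

score_full = greedy_match_f1(src, full_mt)
score_partial = greedy_match_f1(src, partial_mt)
```

In this toy setting the translation that leaves a source token unmatched receives the lower score, which is the behaviour a reference-free QE signal relies on.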
Large-scale generative models show an impressive ability to perform a wide range of Natural Language Processing (NLP) tasks using in-context learning, where a few examples are used to describe a task to the model. For Machine Translation (MT), these examples are typically randomly sampled from the development dataset with a similar distribution as the evaluation set. However, it is unclear how the choice of these in-context examples and their ordering impact the output translation quality. In this work, we aim to understand the properties of good in-context examples for MT in both in-domain and out-of-domain settings. We show that the translation quality and the domain of the in-context examples matter and that a single noisy, unrelated 1-shot example can have a catastrophic impact on output quality. While concatenating multiple random examples reduces the effect of noise, a single good prompt optimized to maximize translation quality on the development dataset can elicit learned information from the pre-trained language model. Adding similar examples based on n-gram overlap with the test source significantly and consistently improves the translation quality of the outputs, outperforming a strong kNN-MT baseline in 2 out of 4 out-of-domain datasets.
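Selecting in-context examples by n-gram overlap with the test source can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization, a recall-style overlap over unigrams and bigrams, and a plain "Source:/Target:" prompt template; none of these details are stated in the abstract, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(test_src, cand_src, max_n=2):
    """Fraction of the test source's 1..max_n-grams found in the candidate."""
    test, cand = test_src.lower().split(), cand_src.lower().split()
    total = hits = 0
    for n in range(1, max_n + 1):
        test_grams = ngrams(test, n)
        total += len(test_grams)
        hits += len(test_grams & ngrams(cand, n))
    return hits / total if total else 0.0


def select_examples(test_src, pool, k=1):
    """Return the k (source, target) pairs whose source overlaps most."""
    ranked = sorted(pool, key=lambda pair: overlap(test_src, pair[0]),
                    reverse=True)
    return ranked[:k]


def build_prompt(test_src, examples):
    """Concatenate the chosen examples and the test source into a prompt."""
    blocks = [f"Source: {s}\nTarget: {t}" for s, t in examples]
    blocks.append(f"Source: {test_src}\nTarget:")
    return "\n\n".join(blocks)


# Toy example pool (hand-written, for illustration only).
pool = [
    ("stock prices fell sharply", "die Aktienkurse fielen stark"),
    ("the cat sleeps", "die Katze schläft"),
]
chosen = select_examples("the cat eats", pool, k=1)
prompt = build_prompt("the cat eats", chosen)
```

For realistic pool sizes a retrieval index would normally narrow the candidates before this kind of scoring; the linear scan here is only for clarity.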
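Domain adaptation is an important challenge for neural machine translation. However, the traditional fine-tuning solution requires multiple rounds of extra training and incurs a high cost. In this paper, we propose a non-tuning paradigm that addresses domain adaptation with a prompt-based method. Specifically, we construct a bilingual phrase-level database and retrieve relevant pairs from it as a prompt for the input sentence. By utilizing Retrieved Phrase-level Prompts (RePP), we effectively boost translation quality. Experiments show that our method improves domain-specific machine translation by 6.2 BLEU points and improves the accuracy of translation constraints by 11.5%, without additional training.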
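Word-level quality estimation (QE) of machine translation (MT) aims to find potential translation errors in a translated sentence without a reference. Typically, conventional work on word-level QE is designed to predict translation quality in terms of post-editing effort, where the word labels ("OK" and "BAD") are generated automatically by comparing the words of the MT sentence with those of the post-edited sentence using a Translation Error Rate (TER) toolkit. While post-editing effort can measure translation quality to some extent, we find that it often conflicts with human judgement on whether a word is well or poorly translated. To overcome this limitation, we first create a golden benchmark dataset, HJQE (Human Judgement on Quality Estimation), in which expert translators directly annotate the poorly translated words according to their own judgement. In addition, to make further use of parallel corpora, we propose self-supervised pre-training with two tag-correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to bring the TER-based artificial QE corpus closer to HJQE. We conduct substantial experiments on the publicly available WMT En-De and En-Zh corpora. The results not only show that our proposed dataset is more consistent with human judgement but also confirm the effectiveness of the proposed tag-correcting strategies.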
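The conventional labeling scheme described above, where each MT word is tagged by comparing the MT output with its post-edited version, can be approximated in a few lines. This sketch substitutes Python's `difflib.SequenceMatcher` for a real TER toolkit, which is an assumption made for illustration: TER also models block shifts, which this alignment does not. The function name `word_tags` is hypothetical.

```python
from difflib import SequenceMatcher


def word_tags(mt_tokens, pe_tokens):
    """Tag each MT token OK if it is kept in the post-edit, else BAD."""
    tags = ["BAD"] * len(mt_tokens)
    matcher = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Tokens inside a matching block were left untouched by the editor.
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"
    return tags


mt = "he go to school yesterday".split()
post_edit = "he went to school yesterday".split()
tags = word_tags(mt, post_edit)
```

Tags produced this way reflect editing effort only, which is exactly the mismatch with human judgement that the abstract points out.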
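Multilingual sequence-to-sequence pretrained language models (multilingual Seq2Seq PLMs), e.g. mBART, run their self-supervised pretraining task on a wide range of monolingual data, e.g. 25 languages from CommonCrawl, while downstream cross-lingual tasks are generally carried out on a bilingual subset, e.g. English-German. As a result, there is both a data discrepancy (a domain discrepancy) and a cross-lingual learning objective discrepancy (a task discrepancy) between the pretraining and finetuning stages. To bridge these cross-lingual domain and task gaps, we extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task. Specifically, the first stage employs the self-supervised code-switching restore task as a pretext task, allowing the multilingual Seq2Seq PLM to acquire some in-domain alignment information. In the second stage, we finetune the model on the downstream data as usual. Experiments on both NLG evaluation (12 bilingual translation tasks, 30 zero-shot tasks and 2 cross-lingual summarization tasks) and NLU evaluation (7 cross-lingual natural language inference tasks) show that our model consistently outperforms the strong mBART baseline with the standard finetuning strategy. Analyses indicate that our approach can narrow the Euclidean distance between cross-lingual sentence representations and improve model generalization at trivial computational cost. We release the code at: https://github.com/zanchangtong/csr4mbart.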
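The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness in diverse zero-shot scenarios. However, this type of resource is scarce for languages other than English. In this work, we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 13 languages, created using machine translation. We evaluate mMARCO by finetuning monolingual and multilingual re-ranking models, as well as a dense multilingual model, on this dataset. Experimental results show that multilingual models finetuned on our translated dataset achieve better effectiveness than models finetuned on the original English version alone. Our distilled multilingual re-ranker is competitive with non-distilled models while having 5.4 times fewer parameters. Lastly, we show a positive correlation between translation quality and retrieval effectiveness, providing evidence that improvements in translation methods might lead to improvements in multilingual information retrieval. The translated datasets and finetuned models are available at https://github.com/unicamp-dl/mmarco.git.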
The word alignment task, despite its prominence in the era of statistical machine translation (SMT), is niche and under-explored today. In this two-part tutorial, we argue for the continued relevance of word alignment. The first part provides a historical background to word alignment as a core component of the traditional SMT pipeline. We zero in on GIZA++, an unsupervised, statistical word aligner with surprising longevity. Jumping forward to the era of neural machine translation (NMT), we show how insights from word alignment inspired the attention mechanism fundamental to present-day NMT. The second part shifts to a survey approach. We cover neural word aligners, showing the slow but steady progress towards surpassing GIZA++ performance. Finally, we cover the present-day applications of word alignment, from cross-lingual annotation projection to improving translation.
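Text style transfer is an important task in natural language generation, which aims to control certain attributes of the generated text, such as politeness, emotion, humor, and many others. It has a long history in the field of natural language processing and has recently regained significant attention thanks to the promising performance brought by deep neural models. In this paper, we present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017. We discuss the task formulation, existing datasets and subtasks, evaluation, and the rich methodologies in the presence of parallel and non-parallel data. We also provide discussions on a variety of important topics regarding the future development of this task. Our curated paper list is at https://github.com/zhijing-jin/text_style_transfer_survey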
Large language models (LLMs) that have been trained on multilingual but not parallel text exhibit a remarkable ability to translate between languages. We probe this ability in an in-depth study of the pathways language model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly-trained LLMs to date. We investigate various strategies for choosing translation examples for few-shot prompting, concluding that example quality is the most important factor. Using optimized prompts, we revisit previous assessments of PaLM's MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags that of state-of-the-art supervised systems. We conclude by providing an analysis of PaLM's MT output which reveals some interesting properties and prospects for future work.