This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.
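For readers who want to try it, a minimal usage sketch of the Python API follows (the file names, vocabulary size, and example sentence are placeholders; the exact pieces produced depend on the trained model):

```python
import sentencepiece as spm

# Train a subword model directly on raw, untokenized text
# (one sentence per line; no pre-tokenization required).
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # placeholder path to raw training text
    model_prefix="mymodel",   # writes mymodel.model and mymodel.vocab
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="mymodel.model")

pieces = sp.encode("Hello world.", out_type=str)  # subword strings
ids = sp.encode("Hello world.", out_type=int)     # vocabulary ids

# Detokenization is lossless: decoding the pieces restores the raw text.
assert sp.decode(pieces) == "Hello world."
```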
Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness the segmentation ambiguity as a noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements especially on low resource and out-of-domain settings.
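SentencePiece exposes this sampling in its API, so a training loop can draw a fresh segmentation for every example. A small illustration (the model path is a placeholder and the segmentations in the comment are only examples):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="mymodel.model")  # unigram LM model

# With sampling enabled, the same sentence can segment differently on each
# call; this variability is the "noise" that subword regularization exploits.
for _ in range(3):
    print(sp.encode("New York", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
# e.g. ['▁New', '▁York'] on one call, ['▁New', '▁Y', 'or', 'k'] on another
```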
Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are pretrained after their tokenizers are fixed; therefore it is difficult to change the vocabulary after pretraining. When we extend the pretrained models to new languages, we must modify the tokenizers simultaneously. In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages (Inuktitut in this paper). In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of already pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation.
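The SentencePiece repository documents one way to extend a tokenizer's vocabulary by editing the serialized model proto; the sketch below follows that recipe, but the path, pieces, and scores are illustrative, and the paper's exact procedure may differ:

```python
from sentencepiece import sentencepiece_model_pb2 as sp_model

m = sp_model.ModelProto()
with open("spm.model", "rb") as f:   # existing pretrained tokenizer (placeholder path)
    m.ParseFromString(f.read())

for piece in ["▁ᓄᓇᕗᑦ", "ᓂᒃ"]:        # new Inuktitut subwords (illustrative)
    p = sp_model.ModelProto.SentencePiece()
    p.piece = piece
    p.score = -10.0  # scores are log-probabilities; they control how eagerly
                     # the new pieces compete with existing ones at segmentation time
    m.pieces.append(p)

with open("spm_extended.model", "wb") as f:
    f.write(m.SerializeToString())
```

Appending pieces (rather than reordering existing ones) keeps the ids of already pretrained subwords stable, so existing embeddings can be reused; whether segmentations of the original languages change then depends on the scores given to the new pieces.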
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.
We describe an open-source toolkit for neural machine translation (NMT). The toolkit prioritizes efficiency, modularity, and extensibility with the goal of supporting NMT research into model architectures, feature representations, and source modalities, while maintaining competitive performance and reasonable training requirements. The toolkit consists of modeling and translation support, as well as detailed pedagogical documentation about the underlying techniques.
Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English→German and English→Russian by up to 1.1 and 1.3 BLEU, respectively.
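The paper itself includes a compact Python listing of the merge-learning loop; a close variant for illustration, run on a toy vocabulary:

```python
import re, collections

def get_stats(vocab):
    # Count frequencies of adjacent symbol pairs across the vocabulary.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge every occurrence of the given pair into one new symbol.
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated symbols with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):  # the number of merges determines the subword vocabulary size
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # first merges: ('e', 's'), ('es', 't'), ('est', '</w>'), ...
```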
What are the units of text that we want our models to operate on? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level models or byte-level processing? In this survey, we connect several lines of work by showing and evaluating hybrid approaches that combine words and characters, as well as subword-based approaches built on learned segmentation. We conclude that there is not, and likely never will be, a silver-bullet singular solution for all applications, and that thinking seriously about tokenization remains important for many applications.
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference, sometimes prohibitively so in the case of very large data sets and large models. Several authors have also charged that NMT systems lack robustness, particularly when input sentences contain rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using residual connections as well as attention connections from the decoder network to the encoder. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. To directly optimize the translation BLEU scores, we consider refining the models by using reinforcement learning, but we found that the improvement in the BLEU scores did not reflect in the human evaluation. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
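For reference, the length-normalized, coverage-penalized beam score described above takes roughly the following form in our transcription of the paper, where α and β are tunable strengths and p_{i,j} is the attention of the j-th target word on the i-th source word:

```latex
s(Y, X) = \frac{\log P(Y \mid X)}{lp(Y)} + cp(X; Y), \qquad
lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}, \qquad
cp(X; Y) = \beta \sum_{i} \log\!\left(\min\!\left(\sum_{j} p_{i,j},\; 1.0\right)\right)
```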
Machine translation systems (MTS) are effective tools for converting text or speech from one language to another. In a large multilingual environment like India, where English and a set of Indian languages (ILs) are in official use, the need for effective translation systems is evident. In contrast to English, ILs are still treated as low-resource languages due to the unavailability of corpora. To address this asymmetry, multilingual neural machine translation (MNMT) has emerged as an ideal approach. In this paper, we propose an MNMT system to address the issues surrounding low-resource language translation. Our model comprises two MNMT systems: English-to-Indic (one-to-many) and Indic-to-English (many-to-one), with a shared encoder-decoder covering 15 language pairs (30 translation directions). Since most IL pairs have little parallel data, insufficient to train any machine translation model, we explore various augmentation strategies to improve overall translation quality with the proposed model. A state-of-the-art Transformer architecture is used to realize the proposed model. Experiments over large amounts of data reveal its superiority over conventional models. In addition, the paper addresses the use of language relatedness (in terms of dialect, script, etc.), in particular the role of high-resource languages of the same family in improving the performance of low-resource languages. Moreover, the experimental results also show the advantages of backtranslation and domain adaptation for ILs in enhancing the translation quality of both source and target languages. Using all these key approaches, our proposed model is more effective than the baseline models in terms of evaluation metrics, i.e., BLEU (bilingual evaluation understudy) scores, for a set of ILs.
There is little research on neural machine translation (NMT) for Azerbaijani. In this paper, we benchmark the performance of Azerbaijani-English NMT systems across a range of techniques and datasets. We evaluate which segmentation techniques work best for Azerbaijani translation and benchmark the performance of Azerbaijani NMT models across several text domains. Our results show that while unigram segmentation improves NMT performance and Azerbaijani translation models scale better with quality than with quantity of data, cross-domain generalization remains a challenge.
The BERT family of neural language models have become highly popular due to their ability to provide sequences of text with rich context-sensitive token encodings which are able to generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with an mBERT model for the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.
Comments are an important part of source code and a primary source of documentation. This has driven interest in using large collections of comments to train or evaluate tools that consume or produce them, such as generating oracles, or even generating code from comments, or automatically producing code summaries. Most of this work makes strong assumptions about the structure and quality of comments, for instance that they consist mostly of proper English sentences. However, we know little about the actual quality of existing comments for these use cases. Comments often contain unique structures and elements that are not seen in other kinds of text, and filtering or extracting information from them requires extra care. This paper explores the contents and quality of Python comments drawn from 840 of the most popular open-source projects on GitHub and 8422 projects from the SriLab dataset, and the impact of naive versus in-depth filtering on the ability to use existing comments for training and evaluating systems that generate comments.
We present an open-source toolkit for neural machine translation (NMT). The new toolkit is based mainly on the Transformer architecture (Vaswani et al., 2017), along with many other improvements detailed below, in order to create a self-contained, simple-to-use, consistent, and comprehensive framework for machine translation tasks across various domains. The toolkit supports both bilingual and multilingual translation tasks, from building models on the respective corpora, to inference on new predictions, to packaging models into a serving-ready JIT format.
With recent developments in the field of natural language processing, the use of different architectures for neural machine translation has risen. Transformer architectures are used to achieve state-of-the-art accuracy, but they are very expensive to train. Not everyone can afford a setup consisting of high-end GPUs and other resources. We train our models on low computational resources and investigate the results. As expected, Transformers outperformed the other architectures, but there were some surprising findings: Transformers with more encoders and decoders took more time to train yet scored lower BLEU. LSTMs performed well in the experiments and took comparatively less time to train than Transformers, making them suitable for use under time constraints.
Transliteration is an NLP task in which the output is a similar-sounding word written in the script of another language. Today, such systems have been developed for several language pairs that involve English as either the source or the target, and are deployed in several places such as Google Translate and chatbots. However, little research has been done on transliterating Indic languages into other Indic languages. This paper presents a multilingual model based on the Transformer (with some modifications) that achieves significantly higher performance and accuracy than all existing models in this domain and obtains better results than the state of the art. The model can transliterate between any pair of the following five languages: English, Hindi, Bengali, Kannada, and Tamil. It is applicable in situations where language is a barrier to communication in any written task. The model beats the state of the art (for all pairs among the five languages mentioned above) and achieves a top-1 accuracy score of 80.7%, improving on the current best results. Furthermore, the model achieves 93.5% phonetic accuracy (transliteration is primarily a phonetic/sound-based task).
Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of 'nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data. In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.
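A toy sketch of the first approach as we read it: deterministically replace each distinct word with a 'nonsense' token, so the corpus keeps its co-occurrence structure while losing all real lexical content. The function name, token format, and vocabulary size are our own choices:

```python
def obfuscate(corpus, vocab_size=30000):
    # Map each distinct word to a stable nonsense token, reusing the same
    # token for repeated words so distributional structure is preserved.
    mapping = {}
    obfuscated = []
    for sentence in corpus:
        out = []
        for word in sentence.split():
            if word not in mapping:
                mapping[word] = f"<tok{len(mapping) % vocab_size}>"
            out.append(mapping[word])
        obfuscated.append(" ".join(out))
    return obfuscated, mapping

src, _ = obfuscate(["the cat sat", "the dog sat"])
print(src)  # ['<tok0> <tok1> <tok2>', '<tok0> <tok3> <tok2>']
```

Applying mappings of this kind to both sides of a parallel corpus yields obfuscated sentence pairs on which a translation model can be pre-trained.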
Generally, speech processing models include a language model as well as an acoustic model. Regardless of the language model's complexity and variants, it requires three critical preprocessing steps: cleaning, normalization, and tokenization. Among these steps, normalization is necessary for format unification in purely textual applications. However, for language models embedded in speech processing modules, normalization is not limited to format unification; it must also convert every readable symbol, number, etc., into the way it is pronounced. To the best of our knowledge, there is no Persian normalization toolkit for language models embedded in speech processing modules, so in this paper we propose an open-source normalization toolkit for text processing in speech applications. In brief, we consider different kinds of readable Persian text, such as symbols (common currencies, #, @, URLs, etc.) and numbers (dates, times, phone numbers, national codes, etc.). Comparison with other available Persian text normalization tools demonstrates the superiority of the proposed method for speech processing. In addition, comparing the model's performance against common natural language libraries such as Hazm and Parsivar indicates the correct behavior of the proposed method. Moreover, its evaluation on some Persian Wikipedia data confirms the method's proper performance.
We propose a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no changes to the model architecture from a standard NMT system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. The rest of the model, which includes an encoder, decoder and attention module, remains unchanged and is shared across all languages. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model without any increase in parameters, which is significantly simpler than previous proposals for Multilingual NMT. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English→French and surpasses state-of-the-art results for English→German. Similarly, a single multilingual model surpasses state-of-the-art results for French→English and German→English on WMT'14 and WMT'15 benchmarks, respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. In addition to improving the translation quality of language pairs that the model was trained with, our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation are possible for neural translation. Finally, we show analyses that hint at a universal interlingua representation in our models and show some interesting examples when mixing languages.
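The mechanism is a one-line preprocessing step: prepend an artificial token naming the target language to each source sentence. A sketch (the <2xx> token spelling follows the paper's examples; the helper itself is illustrative):

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    # The only change to the data: an artificial token telling the shared
    # model which language to decode into. The model itself is unmodified.
    return f"<2{target_lang}> {source_sentence}"

print(add_target_token("Hello, how are you?", "es"))
# <2es> Hello, how are you?   -> decoded into Spanish by the shared model
```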
In this paper, we describe the joint Samsung Research Philippines-Konvergen AI team's submission to the WMT'21 Large Scale Multilingual Translation Task - Small Track 2. We submit a standard Seq2Seq Transformer model to the shared task without any training or architecture tricks, relying mainly on the strength of our data preprocessing techniques to boost performance. Our final submission model scored 22.92 average BLEU on the FLORES-101 devtest set, and scored 22.97 average BLEU on the contest's hidden test set, ranking sixth overall. Despite using only a standard Transformer, our model ranked first in Indonesian to Javanese, showing that data preprocessing matters as much as, if not more than, cutting-edge model architectures and training techniques.
We present a subword regularization method for WordPiece, which uses a maximum matching algorithm for tokenization. The proposed method, MaxMatch-Dropout, randomly drops candidate words during the maximum matching search. It enables finetuning with subword regularization for popular pretrained language models such as BERT-base. The experimental results demonstrate that MaxMatch-Dropout improves the performance of text classification and machine translation tasks, as do other subword regularization methods. Moreover, we provide a comparative analysis of subword regularization methods: subword regularization with SentencePiece (Unigram), BPE-Dropout, and MaxMatch-Dropout.
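A simplified sketch of the idea (the real method operates inside WordPiece's longest-match search; the vocabulary, '##' continuation prefix, and drop rule below are illustrative):

```python
import random

def maxmatch_dropout(word, vocab, p=0.1, unk="[UNK]"):
    # Greedy longest-match (MaxMatch) tokenization in which each in-vocabulary
    # candidate match is randomly rejected with probability p, so the same
    # word can tokenize differently across training epochs.
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            # Never drop single characters, so tokenization always terminates.
            if cand in vocab and not (end - start > 1 and random.random() < p):
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##able", "u", "##n", "##a", "##b", "##l", "##e"}
print(maxmatch_dropout("unable", vocab, p=0.5))
# e.g. ['un', '##able'] or ['un', '##a', '##b', '##l', '##e'], depending on drops
```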