Directly training a document-to-document (Doc2Doc) neural machine translation (NMT) model via Transformer from scratch, especially on small datasets, usually fails to converge. Our dedicated probing tasks show that 1) both absolute and relative position information is gradually weakened or even vanishes once it reaches the upper encoder layers, and 2) the vanishing of absolute position information in the encoder output causes the training failure of Doc2Doc NMT. To alleviate this problem, we propose a position-aware Transformer (P-Transformer) to enhance both the absolute and relative position information in both self-attention and cross-attention. Specifically, we integrate absolute positional information, i.e., position embeddings, into the query-key pairs in both self-attention and cross-attention through a simple yet effective addition operation. Moreover, we also integrate relative position encoding in self-attention. The proposed P-Transformer utilizes sinusoidal position encoding and does not require any task-specific position embedding, segment embedding, or attention mechanism. Through the above methods, we build a Doc2Doc NMT model with P-Transformer, which ingests the source document and completely generates the target document in a sequence-to-sequence (seq2seq) way. In addition, P-Transformer can be applied to seq2seq-based document-to-sentence (Doc2Sent) and sentence-to-sentence (Sent2Sent) translation. Extensive experimental results on Doc2Doc NMT show that P-Transformer significantly outperforms strong baselines on 9 widely-used document-level datasets in 7 language pairs, covering small, middle, and large scales, and achieves a new state-of-the-art. Experimentation on discourse phenomena shows that our Doc2Doc NMT models improve translation quality in terms of both BLEU and discourse coherence. We make our code available on Github.
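To make the "addition into query-key pairs" idea concrete, here is a minimal PyTorch sketch of position-aware attention with sinusoidal encodings; the single-head form, shapes, and exact injection point are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): add sinusoidal absolute position
# embeddings to queries and keys before scoring, as the abstract describes.
import math
import torch
import torch.nn.functional as F

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal position encoding, shape [seq_len, d_model]."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def position_aware_attention(q, k, v):
    """q, k, v: [batch, len, d_model]; absolute PE is added to q and k before scoring."""
    d_model = q.size(-1)
    q = q + sinusoidal_pe(q.size(1), d_model)      # position-aware queries
    k = k + sinusoidal_pe(k.size(1), d_model)      # position-aware keys
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)
    return F.softmax(scores, dim=-1) @ v

# toy usage (self-attention case)
x = torch.randn(2, 5, 16)
print(position_aware_attention(x, x, x).shape)     # torch.Size([2, 5, 16])
```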
Hallucination, a pathological translation problem that plagues neural machine translation (NMT), has recently attracted much attention. In short, hallucinated translations are fluent sentences that bear little relation to the source input. How hallucinations occur arguably remains an open question. In this paper, we propose to use probing methods to investigate the causes of hallucinations from the perspective of model architecture, aiming to avoid such problems in future architecture designs. By conducting experiments over various NMT datasets, we find that hallucinations are often accompanied by a deficient encoder, especially its embeddings, and vulnerable cross-attention, while, interestingly, cross-attention mitigates some of the errors caused by the encoder.
The Transformer architecture, which stacks a series of encoder and decoder network layers, has achieved significant progress in neural machine translation. However, assuming that the lower layers provide trivial or redundant information, the vanilla Transformer mainly exploits the top-layer representation, thereby ignoring potentially valuable bottom-layer features. In this work, we propose the Group-Transformer model (GTrans), which divides the multi-layer representations of both the encoder and decoder into different groups and then fuses these group features to generate target words. To corroborate the effectiveness of the proposed method, extensive experiments and analyses are conducted on three bilingual translation benchmarks and two multilingual translation tasks, including the IWSLT-14, IWSLT-17, LDC, WMT-14, and OPUS-100 benchmarks. Experimental and analytical results demonstrate that our model outperforms its Transformer counterparts with consistent gains. Furthermore, it can be successfully scaled up to 60 encoder layers and 36 decoder layers.
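As a rough illustration of the group-then-fuse idea, the sketch below averages each group of encoder layer outputs and combines the groups with learned weights; the group sizes and the fusion function are assumptions, not the paper's exact design.

```python
# Minimal sketch: divide layer outputs into groups, then fuse the group features.
import torch
import torch.nn as nn

class GroupFusion(nn.Module):
    def __init__(self, num_layers: int, num_groups: int):
        super().__init__()
        assert num_layers % num_groups == 0
        self.num_groups = num_groups
        self.group_size = num_layers // num_groups
        self.weights = nn.Parameter(torch.zeros(num_groups))   # learned fusion weights

    def forward(self, layer_outputs):
        # layer_outputs: list of [batch, len, d_model], one per encoder layer
        stacked = torch.stack(layer_outputs, dim=0)                         # [L, B, T, D]
        groups = stacked.view(self.num_groups, self.group_size, *stacked.shape[1:])
        group_feats = groups.mean(dim=1)                                     # [G, B, T, D]
        alpha = torch.softmax(self.weights, dim=0)                           # normalized weights
        return (alpha.view(-1, 1, 1, 1) * group_feats).sum(dim=0)            # [B, T, D]

# toy usage: 6 layer outputs fused into 3 groups
outs = [torch.randn(2, 7, 16) for _ in range(6)]
print(GroupFusion(num_layers=6, num_groups=3)(outs).shape)   # torch.Size([2, 7, 16])
```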
In multilingual neural machine translation models that fully share parameters across all languages, an artificial language token is usually used to guide translation into the desired target language. However, recent studies have shown that prepended language tokens sometimes fail to navigate multilingual neural machine translation models to the correct translation direction, especially on zero-shot translation. To mitigate this issue, we propose two methods, language embedding embodiment and language-aware multi-head attention, to learn informative language representations that steer translation into the right direction. The former embodies language embeddings at several key switching points along the information flow from source to target, aiming to amplify the translation direction guiding signals. The latter exploits a matrix, instead of a vector, to represent a language in continuous space. The matrix is split into multiple heads so as to learn language representations in multiple subspaces. Experimental results on two datasets for massively multilingual neural machine translation demonstrate that language-aware multi-head attention benefits both supervised and zero-shot translation and significantly alleviates the off-target translation issue. Further linguistic typology prediction experiments show that the matrix-based language representations learned by our methods are capable of capturing rich linguistic typology features.
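The "matrix split into heads" idea can be pictured with the hedged sketch below, where each language is a [num_heads, head_dim] matrix and each head's slice is added to that head's keys; the injection point and shapes are assumptions for illustration, not the paper's formulation.

```python
# Sketch of language-aware multi-head attention with a matrix language representation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAwareMHA(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_langs: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.dk = num_heads, d_model // num_heads
        self.lang_repr = nn.Embedding(num_langs, d_model)   # flattened [h, dk] matrix per language

    def forward(self, q, k, v, lang_id):
        b, n, _ = q.shape
        def split(x):   # [b, n, d] -> [b, h, n, dk]
            return x.view(b, n, self.h, self.dk).transpose(1, 2)
        qh, kh, vh = split(q), split(k), split(v)
        lang = self.lang_repr(lang_id).view(1, self.h, 1, self.dk)   # per-head language slice
        kh = kh + lang                                               # language-aware keys
        scores = qh @ kh.transpose(-2, -1) / math.sqrt(self.dk)
        out = F.softmax(scores, dim=-1) @ vh
        return out.transpose(1, 2).reshape(b, n, self.h * self.dk)

# toy usage: attend with the representation of language id 3
x = torch.randn(2, 5, 32)
print(LanguageAwareMHA(32, 4, num_langs=10)(x, x, x, torch.tensor(3)).shape)  # torch.Size([2, 5, 32])
```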
Neural machine translation models assume that syntactic knowledge can be learned from bilingual corpora via self-attention networks. However, attention networks trained under weak supervision actually fail to capture the deep structure of sentences. Naturally, we would like to introduce external syntactic knowledge to guide the learning of the attention network. Thus, we propose a novel, parameter-free, dependency-scaled self-attention network, which integrates explicit syntactic dependencies into the attention network to dispel the dispersion of the attention distribution. Finally, two knowledge sparsification techniques are proposed to prevent the model from overfitting noisy syntactic dependencies. Experiments and extensive analyses on the IWSLT14 German-English and WMT16 German-English translation tasks validate the effectiveness of our approach.
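As a hedged sketch of dependency-scaled self-attention, the code below down-weights attention logits for token pairs that are far apart in the dependency tree; the exact scaling function is an assumption based on the abstract, not the paper's formula.

```python
# Attention scores scaled by (an assumed function of) dependency-tree distance.
import math
import torch
import torch.nn.functional as F

def dependency_scaled_attention(q, k, v, dep_distance):
    """
    q, k, v:       [batch, len, d]
    dep_distance:  [batch, len, len] hop distance between tokens in the dependency tree
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    scale = 1.0 / (1.0 + dep_distance.float())     # closer in the tree -> larger weight
    return F.softmax(scores * scale, dim=-1) @ v

# toy usage with a random distance matrix
x = torch.randn(1, 4, 8)
dist = torch.randint(0, 4, (1, 4, 4))
print(dependency_scaled_attention(x, x, x, dist).shape)   # torch.Size([1, 4, 8])
```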
The quadratic computational and memory complexity of the Transformer's attention mechanism has limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. Compared to more traditional attention mechanisms, Luna introduces an additional sequence with a fixed length as input, together with an additional corresponding output, which allows Luna to perform the attention operations linearly while still storing adequate contextual information. We perform extensive evaluations on three benchmark sequence modeling tasks: long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of strong baselines.
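A compact sketch of the pack/unpack nesting described above is given below: a fixed-length extra sequence first attends over the input, and the input then attends over that packed sequence, so no full length-by-length attention matrix is ever formed. The single-head form and dimensions are simplifications, not Luna's actual implementation.

```python
# Nested linear-cost attention: pack with a fixed-length sequence p, then unpack.
import math
import torch
import torch.nn.functional as F

def attend(q, k, v):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return F.softmax(scores, dim=-1) @ v

def luna_style_attention(x, p):
    """x: [batch, n, d] input; p: [batch, m, d] fixed-length extra sequence (m << n)."""
    packed = attend(p, x, x)         # pack:   m x n attention, cost O(n * m)
    out = attend(x, packed, packed)  # unpack: n x m attention, cost O(n * m)
    return out, packed               # packed doubles as the updated extra sequence

# toy usage: a 128-token input packed through a 16-slot sequence
x = torch.randn(2, 128, 32)
p = torch.randn(2, 16, 32)
out, packed = luna_style_attention(x, p)
print(out.shape, packed.shape)       # torch.Size([2, 128, 32]) torch.Size([2, 16, 32])
```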
Data scarcity is one of the main issues with the end-to-end approach for Speech Translation, as compared to the cascaded one. Although most data resources for Speech Translation are originally document-level, they offer a sentence-level view, which can be directly used during training. But this sentence-level view is single and static, potentially limiting the utility of the data. Our proposed data augmentation method SegAugment challenges this idea and aims to increase data availability by providing multiple alternative sentence-level views of a dataset. Our method heavily relies on an Audio Segmentation system to re-segment the speech of each document, after which we obtain the target text with alignment methods. The Audio Segmentation system can be parameterized with different length constraints, thus giving us access to multiple and diverse sentence-level views for each document. Experiments in MuST-C show consistent gains across 8 language pairs, with an average increase of 2.2 BLEU points, and up to 4.7 BLEU for lower-resource scenarios in mTEDx. Additionally, we find that SegAugment is also applicable to purely sentence-level data, as in CoVoST, and that it enables Speech Translation models to completely close the gap between the gold and automatic segmentation at inference time.
One of the important breakthroughs in the history of machine translation is the development of the Transformer model, which has been revolutionary not only for various translation tasks but also for most other NLP tasks. In this paper, we target a Transformer-based system capable of translating a German source sentence into its corresponding English target sentence. We conduct experiments on the News Commentary German-English parallel sentences from the WMT'13 dataset. In addition, we investigate the effect of including additional general-domain data from the IWSLT'16 dataset in training to improve the performance of the Transformer model. We find that including the IWSLT'16 dataset in training helps to gain 2 BLEU score points on the test set of the WMT'13 dataset. A qualitative analysis is introduced to examine how the use of general-domain data helps to improve the quality of the generated translation sentences.
I train models for the neural machine translation task using the Hunglish2 corpus. The main contribution of this work is evaluating different data augmentation methods during the training of NMT models. I propose 5 different augmentation methods that are structure-aware, meaning that instead of randomly selecting words for blanking or replacement, the dependency tree of the sentence is used as the basis for augmentation. I start with a detailed literature review on neural networks, sequential modeling, neural machine translation, dependency parsing, and data augmentation. After a detailed exploratory data analysis and preprocessing of the Hunglish2 corpus, I perform experiments with the proposed data augmentation techniques. The best model for Hungarian-to-English achieves a BLEU score of 33.9, while the best model for English-to-Hungarian achieves a BLEU score of 28.6.
GPT-2 and BERT have demonstrated the effectiveness of using pre-trained language models (LMs) on various natural language processing tasks. However, LM fine-tuning often suffers from catastrophic forgetting when applied to resource-rich tasks. In this work, we introduce a concerted training framework (CTNMT) that is key to integrating pre-trained LMs into neural machine translation (NMT). Our proposed CTNMT consists of three techniques: a) asymptotic distillation to ensure that the NMT model can retain the previous pre-trained knowledge; b) a dynamic switching gate to avoid catastrophic forgetting of the pre-trained knowledge; and c) a strategy to adjust the learning paces according to a scheduled policy. Our experiments in machine translation show that CTNMT gains up to 3 BLEU score points on the WMT14 English-German pair and even surpasses the previous state-of-the-art pre-training aided NMT. Moreover, even for the large WMT14 English-French task with 40 million sentence pairs, our base model still significantly improves upon the state-of-the-art Transformer big model by more than 1 BLEU score. The code and models can be downloaded from https://github.com/bytedance/neurst/tree/master/examples/ctnmt.
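The dynamic switching gate can be pictured with the minimal sketch below, where a sigmoid gate computed from the LM state and the NMT state decides how much of each to keep per position; the layer sizes and gate parameterization are assumptions in the spirit of the abstract, not CTNMT's exact design.

```python
# Gated fusion of pre-trained LM features and NMT features.
import torch
import torch.nn as nn

class SwitchingGate(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_lm, h_nmt):
        # h_lm, h_nmt: [batch, len, d_model]
        g = torch.sigmoid(self.gate(torch.cat([h_lm, h_nmt], dim=-1)))
        return g * h_lm + (1.0 - g) * h_nmt   # convex mix of LM and NMT features

# toy usage
h_lm, h_nmt = torch.randn(2, 6, 32), torch.randn(2, 6, 32)
print(SwitchingGate(32)(h_lm, h_nmt).shape)   # torch.Size([2, 6, 32])
```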
Simultaneous translation, which begins translating each sentence after receiving only a few words of the source sentence, plays an important role in many scenarios. Although the previous prefix-to-prefix framework is considered suitable for simultaneous translation and achieves good performance, it still has two inevitable drawbacks: the high cost in computational resources caused by the need to train a separate model for each latency $k$, and the insufficient ability to encode information, since each target token can only attend to a specific source prefix. We propose a novel framework that adopts a simple but effective decoding strategy designed for full-sentence models. Within this framework, training a single full-sentence model can accommodate any given latency and saves computational resources. Moreover, with the full-sentence model's ability to encode the whole sentence, our decoding strategy can enhance the information maintained in the decoding states in real time. Experimental results show that our method achieves better translation quality than baselines on 4 directions: Zh$\rightarrow$En, En$\rightarrow$Ro, and En$\leftrightarrow$De.
Previous work has mainly focused on improving cross-lingual transfer for NLU tasks with a multilingual pretrained encoder (MPE), or on improving supervised machine translation with BERT. However, it has been under-explored whether an MPE can help facilitate the cross-lingual transferability of NMT models. In this paper, we focus on a zero-shot cross-lingual transfer task in NMT. In this task, the NMT model is trained with a parallel dataset of only one language pair and an off-the-shelf MPE, and is then directly tested on zero-shot language pairs. We propose SixT, a simple yet effective model for this task. SixT leverages the MPE with a two-stage training schedule and gets further improvement with a position-disentangled encoder and a capacity-enhanced decoder. Using this method, SixT significantly outperforms mBART, a pretrained multilingual encoder-decoder model explicitly designed for NMT, with an average improvement of 7.1 BLEU on zero-shot any-to-English test sets across 14 source languages. Furthermore, with much less training computation cost and training data, our model achieves better performance on 15 any-to-English test sets than CRISS and m2m-100, two strong multilingual NMT baselines.
Pre-training on monolingual data followed by fine-tuning on machine translation (MT) has recently achieved success, but it remains unclear how to best leverage a pre-trained model for a given MT task. This paper investigates the benefits and drawbacks of freezing parameters when fine-tuning pre-trained models on MT. We focus on 1) fine-tuning a model trained only on English monolingual data, BART, and 2) fine-tuning a model trained on monolingual data from 25 languages, mBART. For BART, we obtain the best performance by freezing most of the model parameters and adding extra positional embeddings. For mBART, we match the performance of naive fine-tuning for most language pairs with the encoder, and most of the decoder, frozen. The encoder-decoder attention parameters are the most important to fine-tune. When constraining ourselves to an out-of-domain training set for Vietnamese to English, we see the largest improvements over the baseline.
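A minimal sketch of such a freezing recipe is shown below: most pre-trained parameters are frozen and only a small set of added or selected modules is trained. The stand-in model, the "train the layer norms" whitelist, and the extra positional embedding are assumptions for illustration, not the paper's exact recipe.

```python
# Freeze most of a pre-trained body; keep only selected modules trainable.
import torch
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
extra_pos = nn.Embedding(512, 64)   # newly added positional embeddings (trainable by default)

# freeze everything in the pre-trained body except the layer norms
for name, param in model.named_parameters():
    param.requires_grad = "norm" in name

trainable = [n for n, p in list(model.named_parameters()) + list(extra_pos.named_parameters())
             if p.requires_grad]
print(f"{len(trainable)} trainable parameter tensors, e.g. {trainable[:3]}")
```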
Video dubbing aims to translate the original speech in a film or television program into speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation and speech synthesis. To ensure that the translated speech is well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech, which requires strict length control. Previous works usually control the number of words or characters generated by the machine translation model to be similar to the source sentence, without considering the isochronicity of speech, as the speech duration of words/characters in different languages varies. In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation to match the length of source and target speech. Specifically, we control the speech length of the generated sentence by guiding the prediction of each word with duration information, including the speech duration of the word itself as well as how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese <-> English), and the results show that the proposed method achieves better length control ability on the generated speech than baseline methods. To make up for the lack of real-world datasets, we also construct a real-world test set collected from films to provide comprehensive evaluations on the video dubbing task.
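One way to picture the duration guidance is the hedged sketch below, where the decoder input at each step is augmented with two scalar features, the duration of the previous token and the remaining time budget; the projection layer and feature choice are illustrative assumptions, not the paper's model.

```python
# Decoder input conditioned on per-token duration and remaining duration.
import torch
import torch.nn as nn

class DurationConditionedInput(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model + 2, d_model)   # +2 for (token_duration, remaining_duration)

    def forward(self, prev_tokens, token_durations, remaining_durations):
        # prev_tokens: [batch, len] ids; *_durations: [batch, len] seconds
        feats = torch.stack([token_durations, remaining_durations], dim=-1)
        return self.proj(torch.cat([self.embed(prev_tokens), feats], dim=-1))

# toy usage with a 3-second speech budget
inp = DurationConditionedInput(vocab_size=100, d_model=32)
tokens = torch.randint(0, 100, (2, 5))
durs = torch.rand(2, 5)
remaining = torch.clamp(3.0 - durs.cumsum(dim=-1), min=0.0)   # time left after each token
print(inp(tokens, durs, remaining).shape)                     # torch.Size([2, 5, 32])
```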
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
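To illustrate the kind of denoising objective referred to here, the sketch below applies BART-style noise (sentence permutation plus span masking) to a toy document; the noise probabilities, span lengths, and mask token are assumptions, not mBART's exact hyper-parameters.

```python
# Toy BART-style noising: permute sentences, then replace random spans with a mask token.
import random

def bart_style_noise(sentences, mask_token="<mask>", mask_prob=0.3, seed=0):
    rng = random.Random(seed)
    sentences = sentences[:]
    rng.shuffle(sentences)                      # 1) permute sentence order
    noised = []
    for sent in sentences:                      # 2) mask short spans of tokens
        tokens = sent.split()
        out, i = [], 0
        while i < len(tokens):
            if rng.random() < mask_prob:
                out.append(mask_token)          # a single mask replaces the whole span
                i += rng.randint(1, 3)
            else:
                out.append(tokens[i])
                i += 1
        noised.append(" ".join(out))
    return noised

doc = ["the cat sat on the mat", "it was a sunny day", "birds were singing outside"]
print(bart_style_noise(doc))
```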
The word alignment task, despite its prominence in the era of statistical machine translation (SMT), is niche and under-explored today. In this two-part tutorial, we argue for the continued relevance of word alignment. The first part provides a historical background to word alignment as a core component of the traditional SMT pipeline. We zero in on GIZA++, an unsupervised, statistical word aligner with surprising longevity. Jumping forward to the era of neural machine translation (NMT), we show how insights from word alignment inspired the attention mechanism fundamental to present-day NMT. The second part shifts to a survey approach. We cover neural word aligners, showing the slow but steady progress towards surpassing GIZA++ performance. Finally, we cover the present-day applications of word alignment, from cross-lingual annotation projection, to improving translation.
One key challenge in multi-document summarization (MDS) is capturing the relations among input documents that distinguish MDS from single-document summarization (SDS). Few existing MDS works address this issue. One effective way is to encode document positional information, helping the model capture cross-document relations. However, existing MDS models, such as Transformer-based models, only consider token-level positional information. Moreover, these models fail to capture the linguistic structure of sentences, which inevitably causes confusion in the generated summaries. Therefore, in this paper, we propose document-aware positional encoding and linguistic-guided encoding that can be fused with the Transformer architecture for MDS. For document-aware positional encoding, we introduce a general protocol to guide the selection of document positional encoding functions. For linguistic-guided encoding, we propose to embed syntactic dependency relations into a dependency-relation mask with a simple but effective non-linear encoding learner for feature learning. Extensive experiments show that the proposed model can generate summaries of high quality.
In Neural Machine Translation (NMT), each token prediction is conditioned on the source sentence and the target prefix (what has been previously translated at a decoding step). However, previous work on interpretability in NMT has focused mainly on source sentence tokens' attributions. Therefore, we lack a full understanding of the influence of every input token (source sentence and target prefix) on the model predictions. In this work, we propose an interpretability method that tracks input tokens' attributions for both contexts. Our method, which can be extended to any encoder-decoder Transformer-based model, allows us to better comprehend the inner workings of current NMT models. We apply the proposed method to both bilingual and multilingual Transformers and present insights into their behaviour.
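As a hedged sketch of attributing a prediction to both contexts, the code below computes a gradient-times-input saliency with respect to the source embeddings and the target-prefix embeddings of a generic encoder-decoder model; the tiny Transformer is a stand-in and the attribution rule is illustrative, not the paper's method.

```python
# Gradient-times-input attributions for source tokens and target-prefix tokens.
import torch
import torch.nn as nn

d_model, vocab = 32, 50
src_embed, tgt_embed = nn.Embedding(vocab, d_model), nn.Embedding(vocab, d_model)
model = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
out_proj = nn.Linear(d_model, vocab)

src = torch.randint(0, vocab, (1, 6))
prefix = torch.randint(0, vocab, (1, 3))

src_e = src_embed(src).detach().requires_grad_(True)      # leaf tensors so .grad is populated
tgt_e = tgt_embed(prefix).detach().requires_grad_(True)

logits = out_proj(model(src_e, tgt_e))[:, -1]              # next-token distribution
logits[0, logits[0].argmax()].backward()                   # attribute the top prediction

src_attr = (src_e.grad * src_e).norm(dim=-1).squeeze(0)    # one score per source token
tgt_attr = (tgt_e.grad * tgt_e).norm(dim=-1).squeeze(0)    # one score per prefix token
print(src_attr.tolist(), tgt_attr.tolist())
```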
Existing document-level neural machine translation (NMT) models have sufficiently explored different context settings to provide guidance for target generation. However, little attention is paid to uncovering more diverse contexts for abundant contextual information. In this paper, we propose a selective memory-augmented neural document translation model to deal with documents containing a large hypothesis space of contexts. Specifically, we retrieve similar bilingual sentence pairs from the training corpus to augment the global context, and then extend the two-stream attention model with a selective mechanism to capture the local context and diverse global contexts. This unified approach allows our model to be trained elegantly on three publicly available document-level machine translation datasets and significantly outperforms previous document-level NMT models.
Transformer models are permutation equivariant. To supply the order and type information of the input tokens, position and segment embeddings are usually added to the input. Recent works have proposed variations of positional encodings, with relative position encodings achieving better performance. Our analysis shows that the gain actually comes from moving positional information from the input to the attention layer. Motivated by this, we introduce Decoupled Positional Attention for Transformers (DIET), a simple yet effective mechanism to encode position and segment information into the Transformer models. The proposed method has faster training and inference time, while achieving competitive performance on the GLUE, XTREME, and WMT benchmarks. We further generalize our method to long-range transformers and show performance gains.
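The core observation, handling position inside the attention layer rather than at the input, can be illustrated with the generic sketch below, where a learned bias indexed by relative distance is added directly to the attention logits; this is a generic relative-position bias, not the paper's exact parameterization.

```python
# Attention with a learned positional bias added to the logits (no PE on the input).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionalBiasAttention(nn.Module):
    def __init__(self, d_model: int, max_distance: int = 128):
        super().__init__()
        self.max_distance = max_distance
        self.bias = nn.Embedding(2 * max_distance + 1, 1)   # one bias per relative offset

    def forward(self, q, k, v):
        n = q.size(1)
        rel = torch.arange(n).unsqueeze(0) - torch.arange(n).unsqueeze(1)   # [n, n] offsets
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        scores = scores + self.bias(rel).squeeze(-1)          # positional bias on the logits
        return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 10, 16)
print(PositionalBiasAttention(16)(x, x, x).shape)   # torch.Size([2, 10, 16])
```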