Domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. However, traditional adaptation by further training on in-domain data rapidly weakens the model's ability to generalize to other domains, making open-ended deployments of the adapted models prone to errors. This work introduces novel training objectives built upon the semantic similarity of the predicted tokens to the reference. Our results show that (1) avoiding the common assumption of a single correct prediction by constructing the training target from the tokens' semantic similarity can mitigate catastrophic forgetting during domain adaptation, while (2) preserving the quality of the adaptation, (3) with negligible additions to compute costs. From a broader perspective, objectives grounded in a soft token alignment pioneer the exploration of the middle ground between the efficient but naive exact-match token-level objectives and the expressive but computationally and resource-intensive sequential objectives.
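To make the idea concrete, here is a minimal sketch of a token-level objective of this kind: the one-hot target is replaced by a distribution over the vocabulary, weighted by each token's embedding similarity to the reference token. All names are hypothetical and PyTorch is assumed; the paper's actual objectives may construct the soft target differently.

```python
import torch
import torch.nn.functional as F

def soft_similarity_loss(logits, target_ids, token_embeddings, temperature=0.1):
    """Cross-entropy against a soft target built from embedding similarity
    to the reference token, instead of a one-hot vector (illustrative sketch).

    logits:           (batch, seq, vocab) raw model outputs
    target_ids:       (batch, seq) reference token ids
    token_embeddings: (vocab, dim) embedding table used to measure similarity
    """
    emb = F.normalize(token_embeddings, dim=-1)          # (vocab, dim)
    ref = emb[target_ids]                                # (batch, seq, dim) reference embeddings
    sims = ref @ emb.T                                   # cosine similarity to every vocab token
    soft_target = F.softmax(sims / temperature, dim=-1)  # similarity-weighted target distribution
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_target * log_probs).sum(-1).mean()
```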
Despite their outstanding performance, large language models (LLMs) have a notorious flaw: they favour simple, surface-level textual relations over the full semantic complexity of the problem. This proposal investigates a common denominator of this problem: their weak ability to generalize outside of the training domain. We survey diverse research directions that provide estimates of a model's generalization ability, and find that incorporating some of these measures into training objectives leads to enhanced distributional robustness of neural models. Based on these findings, we propose future research directions for enhancing the robustness of LLMs.
The ever-growing volume of data leads to ever larger generic models. Specific use cases are often left out, since generic models tend to perform poorly in domain-specific settings. Our work addresses this gap with a method for selecting in-domain data from generic-domain (parallel text) corpora for the task of machine translation. The proposed method ranks the sentences of the parallel generic-domain data according to their cosine similarity with a monolingual domain-specific dataset. We then select the top-k sentences with the highest similarity scores to train a new machine translation system tuned to the specific in-domain data. Our experimental results show that models trained on the selected in-domain data outperform models trained on generic data, or on a mixture of generic and domain data. That is, our method selects high-quality domain-specific training instances at low computational cost and with a small data size.
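A minimal sketch of this selection procedure, assuming sentence embeddings and the centroid of the in-domain sample as the similarity target; the abstract does not fix the embedding model, so sentence-transformers is used here purely for illustration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def select_top_k(generic_sents, domain_sents, k, model_name="all-MiniLM-L6-v2"):
    """Rank generic-domain sentences by cosine similarity to the centroid of
    an in-domain monolingual sample, and keep the top-k for fine-tuning."""
    model = SentenceTransformer(model_name)
    gen = model.encode(generic_sents, normalize_embeddings=True)  # (N, dim)
    dom = model.encode(domain_sents, normalize_embeddings=True)   # (M, dim)
    centroid = dom.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    scores = gen @ centroid              # cosine similarity (embeddings are unit-norm)
    top = np.argsort(-scores)[:k]        # indices of the k most in-domain-like sentences
    return [generic_sents[i] for i in top]
```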
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
We propose BERTSCORE, an automatic evaluation metric for text generation. Analogously to common metrics, BERTSCORE computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTSCORE correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTSCORE is more robust to challenging examples when compared to existing metrics.
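A condensed sketch of the greedy-matching computation BERTSCORE performs, omitting the optional IDF weighting described in the paper:

```python
import torch
import torch.nn.functional as F

def bertscore_f1(cand_emb, ref_emb):
    """Greedy-matching similarity in the spirit of BERTSCORE.

    cand_emb: (m, dim) contextual embeddings of candidate tokens
    ref_emb:  (n, dim) contextual embeddings of reference tokens
    """
    cand = F.normalize(cand_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)
    sim = cand @ ref.T                        # (m, n) pairwise cosine similarities
    precision = sim.max(dim=1).values.mean()  # each candidate token matched to its closest reference token
    recall = sim.max(dim=0).values.mean()     # each reference token matched to its closest candidate token
    return 2 * precision * recall / (precision + recall)
```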
Non-autoregressive Transformers (NATs) are a family of text generation models that aim to reduce decoding latency by predicting the whole sentence in parallel. However, this latency reduction sacrifices the ability to capture left-to-right dependencies, making NAT learning very challenging. In this paper, we present theoretical and empirical analyses to reveal the challenges of NAT learning and propose a unified perspective for understanding existing successes. First, we show that simply training a NAT by maximizing likelihood can lead to an approximation of the marginal distributions while dropping all dependencies between tokens, where the dropped information can be measured by the dataset's conditional total correlation. Second, we formalize many previous objectives in a unified framework and show that their success can be attributed to maximizing the likelihood on a proxy distribution, which reduces the information loss. Empirical studies show that our perspective can explain the phenomena observed in NAT learning and guide the design of new training methods.
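For reference, the conditional total correlation of a target sequence Y given a source X can be written as follows (standard information-theoretic notation, ours; it measures exactly the inter-token dependency that a per-token factorization drops):

```latex
\mathcal{C}(Y \mid X)
  \;=\; \sum_{t} H(y_t \mid X) \,-\, H(Y \mid X)
  \;=\; \mathbb{E}_{X}\!\left[\,
        \mathrm{KL}\!\left( p(Y \mid X) \,\middle\|\, \textstyle\prod_{t} p(y_t \mid X) \right)
      \right]
```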
The word alignment task, despite its prominence in the era of statistical machine translation (SMT), is niche and under-explored today. In this two-part tutorial, we argue for the continued relevance of word alignment. The first part provides a historical background to word alignment as a core component of the traditional SMT pipeline. We zero in on GIZA++, an unsupervised, statistical word aligner with surprising longevity. Jumping forward to the era of neural machine translation (NMT), we show how insights from word alignment inspired the attention mechanism fundamental to present-day NMT. The second part shifts to a survey approach. We cover neural word aligners, showing the slow but steady progress towards surpassing GIZA++ performance. Finally, we cover the present-day applications of word alignment, from cross-lingual annotation projection to improving translation.
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that mBART enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
In any translation workflow, preserving domain knowledge from source to target is crucial. In the translation industry, it is common to receive highly specialized projects for which there is hardly any parallel in-domain data. In such scenarios, where there is insufficient in-domain data to fine-tune machine translation (MT) models, producing translations consistent with the relevant context is challenging. In this work, we propose a novel approach to domain adaptation, leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate large amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we use the state-of-the-art Transformer architecture. We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, in both scenarios, our proposed methods achieve improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on the Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results.
Connecting vision and language plays an essential role in generative intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. Over these years, both components have evolved considerably through the exploitation of object regions and attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results, research in image captioning has not yet reached a conclusive answer. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.
Multilingual language models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero-shot transfer learning, a large body of work has emerged on (i) building larger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analysing the performance of MLLMs on monolingual, zero-shot cross-lingual, and bilingual tasks, (iv) understanding the universal language patterns (if any) learnt by MLLMs, and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering the above broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions for future research.
This paper presents a new data augmentation method for neural machine translation that can enforce stronger semantic consistency both within and across languages. Our method is based on the conditional masked language model (CMLM), which is bidirectional and can be conditioned on both the left and right context as well as on labels. We demonstrate that CMLMs are a good technique for generating context-dependent word distributions. In particular, we show that CMLMs are capable of enforcing semantic consistency by conditioning on both the source and the target during substitution. In addition, to enhance diversity, we incorporate the idea of soft word substitution into data augmentation, which replaces a word with a probabilistic distribution over the vocabulary. Experiments on four translation datasets of different scales show that the overall solution results in more realistic data augmentation and better translation quality. Our approach consistently achieves the best performance in comparison with recent works and improves upon the baseline by up to 1.90 BLEU points.
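A minimal sketch of the soft word substitution step: a substituted position feeds the expected embedding under a vocabulary distribution (e.g., one produced by the CMLM) instead of a single token's embedding. Names are hypothetical; PyTorch assumed:

```python
import torch

def soft_substitute(input_embeds, positions, word_probs, embedding_table):
    """Replace token embeddings at the given positions with the expectation
    of the embedding table under a per-position vocabulary distribution.

    input_embeds:    (batch, seq, dim) ordinary token embeddings
    positions:       (batch, seq) bool mask of positions to substitute
    word_probs:      (batch, seq, vocab) distributions, e.g. from a CMLM
    embedding_table: (vocab, dim)
    """
    soft_embeds = word_probs @ embedding_table   # (batch, seq, dim) expected embeddings
    return torch.where(positions.unsqueeze(-1), soft_embeds, input_embeds)
```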
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
Recent work has improved language models (LMs) remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce memories at testing time or represent them using a separately trained encoder, resulting in suboptimal training of the language model. In this work, we present TRIME, a novel yet simple training approach designed for training LMs with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories--local, long-term, and external memory--at testing time. We evaluate TRIME on multiple language modeling and machine translation benchmarks and show that it is able to achieve significant improvements across all the settings. Concretely, TRIME reduces the perplexity from 18.70 to 15.37 on WIKITEXT-103, by effectively leveraging a large memory set from the training corpus. Compared to standard LM training, TRIME adds negligible computational overhead and is compatible with different neural architectures, making it a versatile solution for training memory-augmented LMs.
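A condensed sketch of an in-batch memory objective in the spirit of TRIME: other positions in the batch whose next token matches the target act as additional positives in a joint softmax over vocabulary tokens and memories. This is a simplification; the actual method also covers long-term and external memories and a dedicated batching scheme:

```python
import torch

def trime_style_loss(hidden, target_ids, output_emb, temp=1.0):
    """Joint softmax over vocabulary tokens and in-batch context memories
    (illustrative reading of an in-batch memory objective).

    hidden:     (N, dim) contextual representations of N training positions
    target_ids: (N,)     id of the next (gold) token at each position
    output_emb: (V, dim) output token embedding table
    """
    N, V = hidden.size(0), output_emb.size(0)
    token_logits = hidden @ output_emb.T                 # (N, V) standard LM logits
    ctx_logits = (hidden @ hidden.T) / temp              # (N, N) similarity to in-batch contexts
    not_self = ~torch.eye(N, dtype=torch.bool, device=hidden.device)
    ctx_logits = ctx_logits.masked_fill(~not_self, float("-inf"))  # a position is not its own memory
    log_probs = torch.log_softmax(torch.cat([token_logits, ctx_logits], dim=-1), dim=-1)
    # Positives: the gold token itself, plus every other in-batch context with the same next token.
    pos = torch.zeros(N, V + N, dtype=torch.bool, device=hidden.device)
    pos[torch.arange(N), target_ids] = True
    pos[:, V:] = (target_ids.unsqueeze(0) == target_ids.unsqueeze(1)) & not_self
    return -torch.logsumexp(log_probs.masked_fill(~pos, float("-inf")), dim=-1).mean()
```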
Machine translation systems (MTS) are effective tools for communication, translating text or speech from one language into another. In a large multilingual environment like India, where English and a set of Indian languages (ILs) are officially used, the need for an efficient translation system becomes obvious. In contrast with English, ILs are still treated as low-resource languages due to the unavailability of corpora. To address this asymmetry, multilingual neural machine translation (MNMT) emerges as an ideal approach. In this paper, we propose an MNMT system to address the issues of low-resource language translation. Our model comprises two MNMT systems, English-Indic (one-to-many) and Indic-English (many-to-one), with a shared encoder-decoder covering 15 language pairs (30 translation directions). Since most IL pairs have only a scarce amount of parallel corpora, insufficient for training any machine translation model, we explore various augmentation strategies to improve the overall translation quality of the proposed model. A state-of-the-art Transformer architecture is used to realize the proposed model. Trials on a substantial amount of data reveal its superiority over conventional models. In addition, the paper addresses the use of language relationships (in terms of dialect, script, etc.), in particular the role of high-resource languages of the same family in boosting the performance of low-resource languages. Moreover, the experimental results also show the advantages of back-translation and domain adaptation for ILs in enhancing the translation quality of both source and target languages. Using all these key approaches, our proposed model turns out to be more efficient than the baseline models in terms of evaluation metrics, i.e., BLEU (BiLingual Evaluation Understudy) score, for a set of ILs.
Recently, contrastive learning has attracted increasing interest in neural text generation as a new solution to alleviate the exposure bias problem. It introduces a sequence-level training signal which is crucial to generation tasks that always rely on auto-regressive decoding. However, previous methods using contrastive learning in neural text generation usually lead to inferior performance. In this paper, we analyse the underlying reasons and propose a new Contrastive Neural Text generation framework, CoNT. CoNT addresses the bottlenecks that prevent contrastive learning from being widely adopted in generation tasks from three aspects -- the construction of contrastive examples, the choice of the contrastive loss, and the strategy in decoding. We validate CoNT on five generation tasks with ten benchmarks, including machine translation, summarization, code comment generation, data-to-text generation and commonsense generation. Experimental results show that CoNT clearly outperforms the conventional training framework on all ten benchmarks with a convincing margin. In particular, CoNT surpasses the previously most competitive contrastive learning method for text generation by 1.50 BLEU on machine translation and 1.77 ROUGE-1 on summarization, respectively. It achieves new state-of-the-art on summarization, code comment generation (without external data) and data-to-text generation.
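A generic sequence-level contrastive loss of the kind this line of work builds on: InfoNCE with the reference as the positive and generated candidates as negatives. CoNT's actual loss and example construction differ, so this is only an illustration with hypothetical names:

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(src_repr, ref_repr, cand_repr, temp=0.1):
    """Pull the reference towards the source representation, push sampled
    candidates away (sequence-level InfoNCE sketch).

    src_repr:  (batch, dim)    pooled source representations
    ref_repr:  (batch, dim)    pooled reference (gold) representations
    cand_repr: (batch, K, dim) pooled representations of K generated candidates
    """
    src = F.normalize(src_repr, dim=-1)
    pos = (src * F.normalize(ref_repr, dim=-1)).sum(-1, keepdim=True)      # (batch, 1)
    neg = torch.einsum("bd,bkd->bk", src, F.normalize(cand_repr, dim=-1))  # (batch, K)
    logits = torch.cat([pos, neg], dim=-1) / temp
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # the reference (index 0) is the positive
```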
Non-autoregressive (NAR) machine translation has recently achieved significant improvements, and now outperforms autoregressive (AR) models on some benchmarks, providing an efficient alternative to AR inference. However, while AR translation is often implemented with multilingual models that benefit from transfer between languages and from improved serving efficiency, multilingual NAR models remain relatively unexplored. Taking Connectionist Temporal Classification (CTC) as an example NAR model and a Transformer variant as a semi-NAR model, we present a comprehensive empirical study of multilingual NAR. We test its capabilities with respect to positive transfer between related languages and negative transfer under capacity constraints. As NAR models require distilled training sets, we carefully study the impact of bilingual versus multilingual teachers. Finally, we fit a scaling law for multilingual NAR, quantifying how its performance relative to the AR model changes as the model scale increases.
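As a concrete anchor for the CTC objective named above, a minimal PyTorch usage sketch; shapes are illustrative, and in NAR translation the "time" axis is typically an upsampled copy of the source positions:

```python
import torch
import torch.nn as nn

# CTC marginalizes over all monotonic alignments between the per-position
# output frames and the target token sequence, with a reserved blank symbol.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

log_probs = torch.randn(60, 8, 1000).log_softmax(-1)  # (src_len * upsample, batch, vocab)
targets = torch.randint(1, 1000, (8, 25))             # target token ids (0 is the blank)
input_lengths = torch.full((8,), 60, dtype=torch.long)
target_lengths = torch.full((8,), 25, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```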
There has been recent success in pre-training on monolingual data and fine-tuning for machine translation (MT), but it remains unclear how best to leverage a pre-trained model for a given MT task. This paper studies the benefits and drawbacks of freezing parameters when fine-tuning pre-trained models for MT. We focus on (1) fine-tuning BART, a model trained only on English monolingual data, and (2) fine-tuning mBART, a model trained on monolingual data in 25 languages. For BART, we obtain the best performance by freezing most of the model parameters and adding extra positional embeddings. For mBART, we match the performance of naive fine-tuning for most language pairs with the encoder, and most of the decoder, frozen. The attention parameters of the encoder are the most important ones to fine-tune. When restricting ourselves to an out-of-domain training set for Vietnamese-to-English, we see the largest improvements over the baseline.
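An illustrative freezing recipe in the spirit of this study: freeze everything except attention parameters. Which exact subsets to unfreeze is precisely the paper's empirical question, and the module names below follow the Hugging Face mBART implementation rather than the paper:

```python
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
for name, param in model.named_parameters():
    # Keep only self-/cross-attention blocks (and their layer norms) trainable.
    param.requires_grad = ("self_attn" in name) or ("encoder_attn" in name)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```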
We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT), and proximal policy optimization (PPO) in terms of both training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (on account of using the mean absolute deviation in the importance weighting computation), uses distributed data generators that sample multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends on two variance reduction strategies: (1) a conditional reward normalization method that ensures each source sentence has both positively and negatively rewarded translation examples, and (2) a new robust importance weighting scheme that acts as a conditional entropy regularizer. Experiments on a variety of translation tasks show that policies learned with the MAD algorithm perform well with both greedy decoding and beam search, and that the learned policies are sensitive to the specific reward used during training.
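A minimal sketch of per-source reward normalization in the spirit of MAD's first variance-reduction strategy, using the mean absolute deviation as the scale; this is a hypothetical formulation, and the paper's exact computation may differ:

```python
import numpy as np

def normalize_rewards_per_source(rewards):
    """Standardize the rewards of the candidates sampled for ONE source
    sentence, so the source contributes both positively and negatively
    weighted samples (unless all rewards are identical).

    rewards: (K,) raw rewards, e.g. sentence BLEU, of K samples for one source
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    centered = rewards - rewards.mean()
    scale = np.abs(centered).mean() + 1e-8  # mean absolute deviation, per the algorithm's name
    return centered / scale
```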
Diffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact that the underlying physical phenomena are continuous. For inherently discrete and categorical data such as language, various diffusion-inspired alternatives have been proposed. However, the continuous nature of diffusion models conveys many benefits, and in this work we endeavour to preserve it. We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. We demonstrate its efficacy on several language modelling tasks.