在本文中,我们通过生成的对抗网络(GAN)架构探索机器翻译改进。我们从Relgan,一个文本制造模型和鼻孔机械翻译模型中获取灵感,实现了一个学习将尴尬,非流利的英语句子转换为流利的模型,同时只培训在单梅换语料库上。我们利用参数$ \ lambda $来控制从输入句子的偏差量,即保持原始令牌和修改它更流利之间的权衡。在某些情况下,我们的结果改进了基于短语的机器翻译。特别是,带变压器发生器的GaN显示出一些有希望的结果。我们建议将来的一些方向建立在这种概念上建立。
translated by 谷歌翻译
近年来,文本的风格特性吸引了计算语言学研究人员。具体来说,研究人员研究了文本样式转移(TST)任务,该任务旨在在保留其样式独立内容的同时改变文本的风格属性。在过去的几年中,已经开发了许多新颖的TST算法,而该行业利用这些算法来实现令人兴奋的TST应用程序。由于这种共生,TST研究领域迅速发展。本文旨在对有关文本样式转移的最新研究工作进行全面审查。更具体地说,我们创建了一种分类法来组织TST模型,并提供有关最新技术状况的全面摘要。我们回顾了针对TST任务的现有评估方法,并进行了大规模的可重复性研究,我们在两个公开可用的数据集上实验基准了19个最先进的TST TST算法。最后,我们扩展了当前趋势,并就TST领域的新开发发展提供了新的观点。
translated by 谷歌翻译
本文对过去二十年来对自然语言生成(NLG)的研究提供了全面的审查,特别是与数据到文本生成和文本到文本生成深度学习方法有关,以及NLG的新应用技术。该调查旨在(a)给出关于NLG核心任务的最新综合,以及该领域采用的建筑;(b)详细介绍各种NLG任务和数据集,并提请注意NLG评估中的挑战,专注于不同的评估方法及其关系;(c)强调一些未来的强调和相对近期的研究问题,因为NLG和其他人工智能领域的协同作用而增加,例如计算机视觉,文本和计算创造力。
translated by 谷歌翻译
This work presents a thorough review concerning recent studies and text generation advancements using Generative Adversarial Networks. The usage of adversarial learning for text generation is promising as it provides alternatives to generate the so-called "natural" language. Nevertheless, adversarial text generation is not a simple task as its foremost architecture, the Generative Adversarial Networks, were designed to cope with continuous information (image) instead of discrete data (text). Thus, most works are based on three possible options, i.e., Gumbel-Softmax differentiation, Reinforcement Learning, and modified training objectives. All alternatives are reviewed in this survey as they present the most recent approaches for generating text using adversarial-based techniques. The selected works were taken from renowned databases, such as Science Direct, IEEEXplore, Springer, Association for Computing Machinery, and arXiv, whereas each selected work has been critically analyzed and assessed to present its objective, methodology, and experimental results.
translated by 谷歌翻译
文本样式传输是自然语言生成中的重要任务,旨在控制生成的文本中的某些属性,例如礼貌,情感,幽默和许多其他特性。它在自然语言处理领域拥有悠久的历史,最近由于深神经模型带来的有希望的性能而重大关注。在本文中,我们对神经文本转移的研究进行了系统调查,自2017年首次神经文本转移工作以来跨越100多个代表文章。我们讨论了任务制定,现有数据集和子任务,评估,以及丰富的方法在存在并行和非平行数据存在下。我们还提供关于这项任务未来发展的各种重要主题的讨论。我们的策据纸张列表在https://github.com/zhijing-jin/text_style_transfer_survey
translated by 谷歌翻译
通过大型预训练模型传输学习已经改变了自然语言处理中当前应用程序的景观(NLP)。最近的最佳优化,结合了两个预先训练的模型,BERT和GPT-2的变形AutoEncoder(VAE),并且其与生成的逆境网络(GANs)的组合已被证明是生产小说,但非常人性化的文本。 Optimus和GAN组合避免了GAN到文本的离散领域的麻烦,并防止了标准最大似然方法的曝光偏差。我们将GAN的培训结合在潜在的空间中,并为单词生成的Optimus解码器的FineTuning。这种方法可以让我们模拟句子的高级功能,以及低级Word-By-Word生成。我们通过利用GPT-2的结构以及将基于熵的内在动机奖励添加到质量和多样性之间的平衡来使用加强学习(RL)。我们基准测试VAE-GaN模型的结果,并显示了我们RL FineTuning在三个广泛使用的文本生成数据集中带来的改进,结果结果大大超越了当前最先进的所生成文本的质量。
translated by 谷歌翻译
机器翻译系统(MTS)是通过将文本或语音从一种语言转换为另一种语言的有效工具。在像印度这样的大型多语言环境中,对有效的翻译系统的需求变得显而易见,英语和一套印度语言(ILS)正式使用。与英语相反,由于语料库的不可用,IL仍然被视为低资源语言。为了解决不对称性质,多语言神经机器翻译(MNMT)系统会发展为在这个方向上的理想方法。在本文中,我们提出了一个MNMT系统,以解决与低资源语言翻译有关的问题。我们的模型包括两个MNMT系统,即用于英语印度(一对多),另一个用于指示英语(多一对多),其中包含15个语言对(30个翻译说明)的共享编码器码头。由于大多数IL对具有很少的平行语料库,因此不足以训练任何机器翻译模型。我们探索各种增强策略,以通过建议的模型提高整体翻译质量。最先进的变压器体系结构用于实现所提出的模型。大量数据的试验揭示了其优越性比常规模型的优势。此外,本文解决了语言关系的使用(在方言,脚本等方面),尤其是关于同一家族的高资源语言在提高低资源语言表现方面的作用。此外,实验结果还表明了ILS的倒退和域适应性的优势,以提高源和目标语言的翻译质量。使用所有这些关键方法,我们提出的模型在评估指标方面比基线模型更有效,即一组ILS的BLEU(双语评估研究)得分。
translated by 谷歌翻译
The machine translation mechanism translates texts automatically between different natural languages, and Neural Machine Translation (NMT) has gained attention for its rational context analysis and fluent translation accuracy. However, processing low-resource languages that lack relevant training attributes like supervised data is a current challenge for Natural Language Processing (NLP). We incorporated a technique known Active Learning with the NMT toolkit Joey NMT to reach sufficient accuracy and robust predictions of low-resource language translation. With active learning, a semi-supervised machine learning strategy, the training algorithm determines which unlabeled data would be the most beneficial for obtaining labels using selected query techniques. We implemented two model-driven acquisition functions for selecting the samples to be validated. This work uses transformer-based NMT systems; baseline model (BM), fully trained model (FTM) , active learning least confidence based model (ALLCM), and active learning margin sampling based model (ALMSM) when translating English to Hindi. The Bilingual Evaluation Understudy (BLEU) metric has been used to evaluate system results. The BLEU scores of BM, FTM, ALLCM and ALMSM systems are 16.26, 22.56 , 24.54, and 24.20, respectively. The findings in this paper demonstrate that active learning techniques helps the model to converge early and improve the overall quality of the translation system.
translated by 谷歌翻译
The word alignment task, despite its prominence in the era of statistical machine translation (SMT), is niche and under-explored today. In this two-part tutorial, we argue for the continued relevance for word alignment. The first part provides a historical background to word alignment as a core component of the traditional SMT pipeline. We zero-in on GIZA++, an unsupervised, statistical word aligner with surprising longevity. Jumping forward to the era of neural machine translation (NMT), we show how insights from word alignment inspired the attention mechanism fundamental to present-day NMT. The second part shifts to a survey approach. We cover neural word aligners, showing the slow but steady progress towards surpassing GIZA++ performance. Finally, we cover the present-day applications of word alignment, from cross-lingual annotation projection, to improving translation.
translated by 谷歌翻译
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
translated by 谷歌翻译
机器翻译历史上的重要突破之一是变压器模型的发展。不仅对于各种翻译任务,而且对于大多数其他NLP任务都是革命性的。在本文中,我们针对一个基于变压器的系统,该系统能够将德语用源句子转换为其英语的对应目标句子。我们对WMT'13数据集的新闻评论德语 - 英语并行句子进行实验。此外,我们研究了来自IWSLT'16数据集的培训中包含其他通用域数据以改善变压器模型性能的效果。我们发现,在培训中包括IWSLT'16数据集,有助于在WMT'13数据集的测试集中获得2个BLEU得分点。引入定性分析以分析通用域数据的使用如何有助于提高产生的翻译句子的质量。
translated by 谷歌翻译
我们介绍了第一个用于濒危Erzya语言与俄语以及我们为训练和评估它收集的数据集的神经机器翻译系统。BLEU分别分别为Erzya和Russian的BLEU分数分别为17和19,其中一半以上的翻译被以母语为母语的人可以接受。我们还调整了模型以在Erzya和其他10种语言之间转换,但是如果没有其他并行数据,这些方向上的质量仍然很低。我们将翻译模型与收集的文本语料库一起发布,新的语言标识模型以及适合Erzya语言的多语言句子编码器。这些资源将在https://github.com/slone-nlp/myv-nmt上找到。
translated by 谷歌翻译
数据增强是通过转换为机器学习的人工创建数据的人工创建,是一个跨机器学习学科的研究领域。尽管它对于增加模型的概括功能很有用,但它还可以解决许多其他挑战和问题,从克服有限的培训数据到正规化目标到限制用于保护隐私的数据的数量。基于对数据扩展的目标和应用的精确描述以及现有作品的分类法,该调查涉及用于文本分类的数据增强方法,并旨在为研究人员和从业者提供简洁而全面的概述。我们将100多种方法划分为12种不同的分组,并提供最先进的参考文献来阐述哪种方法可以通过将它们相互关联,从而阐述了哪种方法。最后,提供可能构成未来工作的基础的研究观点。
translated by 谷歌翻译
我们提出了两种小型无监督方法,用于消除文本中的毒性。我们的第一个方法结合了最近的两个想法:(1)使用小型条件语言模型的生成过程的指导和(2)使用释义模型进行风格传输。我们使用良好的令人措辞的令人愉快的释放器,由风格培训的语言模型引导,以保持文本内容并消除毒性。我们的第二种方法使用BERT用他们的非攻击性同义词取代毒性单词。我们通过使BERT替换具有可变数量的单词的屏蔽令牌来使该方法更灵活。最后,我们介绍了毒性去除任务的风格转移模型的第一个大规模比较研究。我们将模型与许多用于样式传输的方法进行比较。使用无监督的样式传输指标的组合以可参考方式评估该模型。两种方法都建议产生新的SOTA结果。
translated by 谷歌翻译
Neural Machine Translation (NMT) has obtained state-of-the art performance for several language pairs, while only using parallel data for training. Targetside monolingual data plays an important role in boosting fluency for phrasebased statistical machine translation, and we investigate the use of monolingual data for NMT. In contrast to previous work, which combines NMT models with separately trained language models, we note that encoder-decoder NMT architectures already have the capacity to learn the same information as a language model, and we explore strategies to train with monolingual data without changing the neural network architecture. By pairing monolingual training data with an automatic backtranslation, we can treat it as additional parallel training data, and we obtain substantial improvements on the WMT 15 task English↔German (+2.8-3.7 BLEU), and for the low-resourced IWSLT 14 task Turkish→English (+2.1-3.4 BLEU), obtaining new state-of-the-art results. We also show that fine-tuning on in-domain monolingual and parallel data gives substantial improvements for the IWSLT 15 task English→German.
translated by 谷歌翻译
神经机器翻译(NMT)模型在大型双语数据集上已有效。但是,现有的方法和技术表明,该模型的性能高度取决于培训数据中的示例数量。对于许多语言而言,拥有如此数量的语料库是一个牵强的梦想。我们从单语言词典探索新语言的单语扬声器中汲取灵感,我们研究了双语词典对具有极低或双语语料库的语言的适用性。在本文中,我们使用具有NMT模型的双语词典探索方法,以改善资源极低的资源语言的翻译。我们将此工作扩展到多语言系统,表现出零拍的属性。我们详细介绍了字典质量,培训数据集大小,语言家族等对翻译质量的影响。多种低资源测试语言的结果表明,我们的双语词典方法比基线相比。
translated by 谷歌翻译
深度神经语言模型的最新进展与大规模数据集的能力相结合,加速了自然语言生成系统的发展,这些系统在多种任务和应用程序上下文中产生流利和连贯的文本(在各种成功程度上)。但是,为所需的用户控制这些模型的输出仍然是一个开放的挑战。这不仅对于自定义生成语言的内容和样式至关重要,而且对于他们在现实世界中的安全可靠部署至关重要。我们提出了一项关于受约束神经语言生成的新兴主题的广泛调查,在该主题中,我们通过区分条件和约束(后者是在输出文本上而不是输入的可检验条件),正式定义和分类自然语言生成问题,目前是可检验的)约束文本生成任务,并查看受限文本生成的现有方法和评估指标。我们的目的是强调这个新兴领域的最新进展和趋势,以告知最有希望的方向和局限性,以推动受约束神经语言生成研究的最新作品。
translated by 谷歌翻译
我们为神经机翻译(NMT)提供了一个开源工具包。新工具包主要基于拱形变压器(Vaswani等,2017)以及下面详述的许多其他改进,以便创建一个独立的,易于使用,一致和全面的各个领域的机器翻译任务框架。它是为了支持双语和多语言翻译任务的工具,从构建各个语料库的模型开始推断新的预测或将模型打包给提供功能的JIT格式。
translated by 谷歌翻译
Pre-trained models have achieved remarkable success in natural language processing (NLP). However, existing pre-training methods underutilize the benefits of language understanding for generation. Inspired by the idea of Generative Adversarial Networks (GANs), we propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator, unifying the ability of language understanding and generation in a single model. Our model, named as GanLM, is trained with two pre-training objectives: replaced token detection and replaced token denoising. Specifically, given masked source sentences, the generator outputs the target distribution and the discriminator predicts whether the target sampled tokens from distribution are incorrect. The target sentence is replaced with misclassified tokens to construct noisy previous context, which is used to generate the gold sentence. In general, both tasks improve the ability of language understanding and generation by selectively using the denoising data. Extensive experiments in language generation benchmarks show that GanLM with the powerful language understanding capability outperforms various strong pre-trained language models (PLMs) and achieves state-of-the-art performance.
translated by 谷歌翻译
作为有效的策略,数据增强(DA)减轻了深度学习技术可能失败的数据稀缺方案。它广泛应用于计算机视觉,然后引入自然语言处理并实现了许多任务的改进。DA方法的主要重点之一是提高培训数据的多样性,从而帮助模型更好地推广到看不见的测试数据。在本调查中,我们根据增强数据的多样性,将DA方法框架为三类,包括释义,注释和采样。我们的论文根据上述类别,详细分析了DA方法。此外,我们还在NLP任务中介绍了他们的应用以及挑战。
translated by 谷歌翻译