We present a novel approach to generating news headlines in Finnish for a given news story. We model this as a summarization task where a model is given a news article, and its task is to produce a concise headline describing the main topic of the article. Because there are no openly available GPT-2 models for Finnish, we will first build such a model using several corpora. The model is then fine-tuned for the headline generation task using a massive news corpus. The system is evaluated by 3 expert journalists working in a Finnish media house. The results showcase the usability of the presented approach as a headline suggestion tool to facilitate the news production process.
translated by 谷歌翻译
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
translated by 谷歌翻译
Controllable Text Generation (CTG) is emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints in practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods need to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches have emerged in the recent 3-4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review on the common tasks, main approaches and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.
translated by 谷歌翻译
We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-ofthe-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous nonsparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
translated by 谷歌翻译
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on taskspecific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset -matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
translated by 谷歌翻译
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by ( 1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new stateof-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
translated by 谷歌翻译
当前的语言模型可以产生高质量的文本。他们只是复制他们之前看到的文本,或者他们学习了普遍的语言抽象吗?要取笑这些可能性,我们介绍了乌鸦,这是一套评估生成文本的新颖性,专注于顺序结构(n-gram)和句法结构。我们将这些分析应用于四种神经语言模型(LSTM,变压器,变换器-XL和GPT-2)。对于本地结构 - 例如,单个依赖性 - 模型生成的文本比来自每个模型的测试集的人类生成文本的基线显着不那么新颖。对于大规模结构 - 例如,总句结构 - 模型生成的文本与人生成的基线一样新颖甚至更新颖,但模型仍然有时复制,在某些情况下,在训练集中重复超过1000字超过1,000字的通道。我们还表现了广泛的手动分析,表明GPT-2的新文本通常在形态学和语法中形成良好,但具有合理的语义问题(例如,是自相矛盾)。
translated by 谷歌翻译
食物对于人类生存至关重要。如此之多,以至于我们开发了不同的食谱来满足我们的口味需求。在这项工作中,我们提出了一种新颖的方式,可以使用变压器(特别是自动回归语言模型)从头开始创建新的细餐食谱。考虑到一小部分食物食谱数据集,我们尝试训练模型以识别烹饪技术,提出新颖的食谱并测试用最小数据进行微调的功能。
translated by 谷歌翻译
由于在开放式文本生成中取得了重大进展,衡量机器生成的文本是如何对人类语言的关键问题。我们介绍紫红色,一个开放式文本生成的比较措施,它直接将文本生成模型的学习分布与使用发散边界的分发进行了分布到人写的文本。淡紫色通过计算量化嵌入空间中的信息分流来缩放到现代文本生成模型。通过对三个开放式发电任务的广泛实证研究,我们发现紫红色标识了所生成文本的已知属性,天然存在模型大小,并与人类判断相关,而不是现有的分布评估度量的限制较少。
translated by 谷歌翻译
It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data.We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.
translated by 谷歌翻译
深度神经语言模型的最新进展与大规模数据集的能力相结合,加速了自然语言生成系统的发展,这些系统在多种任务和应用程序上下文中产生流利和连贯的文本(在各种成功程度上)。但是,为所需的用户控制这些模型的输出仍然是一个开放的挑战。这不仅对于自定义生成语言的内容和样式至关重要,而且对于他们在现实世界中的安全可靠部署至关重要。我们提出了一项关于受约束神经语言生成的新兴主题的广泛调查,在该主题中,我们通过区分条件和约束(后者是在输出文本上而不是输入的可检验条件),正式定义和分类自然语言生成问题,目前是可检验的)约束文本生成任务,并查看受限文本生成的现有方法和评估指标。我们的目的是强调这个新兴领域的最新进展和趋势,以告知最有希望的方向和局限性,以推动受约束神经语言生成研究的最新作品。
translated by 谷歌翻译
我们提出了两种小型无监督方法,用于消除文本中的毒性。我们的第一个方法结合了最近的两个想法:(1)使用小型条件语言模型的生成过程的指导和(2)使用释义模型进行风格传输。我们使用良好的令人措辞的令人愉快的释放器,由风格培训的语言模型引导,以保持文本内容并消除毒性。我们的第二种方法使用BERT用他们的非攻击性同义词取代毒性单词。我们通过使BERT替换具有可变数量的单词的屏蔽令牌来使该方法更灵活。最后,我们介绍了毒性去除任务的风格转移模型的第一个大规模比较研究。我们将模型与许多用于样式传输的方法进行比较。使用无监督的样式传输指标的组合以可参考方式评估该模型。两种方法都建议产生新的SOTA结果。
translated by 谷歌翻译
Task agnostic generative pretraining (GPT) has recently proved promising for zero- and few-shot learning, gradually diverting attention from the expensive supervised learning paradigm. Although the community is accumulating knowledge as to capabilities of English-language autoregressive models such as GPT-3 adopting this generative approach, scholarship about these models remains acutely Anglocentric. Consequently, the community currently has serious gaps in its understanding of this class of models, their potential, and their societal impacts in diverse settings, linguistic traditions, and cultures. To alleviate this issue for Arabic, a collection of diverse languages and language varieties with more than $400$ million population, we introduce JASMINE, a suite of powerful Arabic autoregressive Transformer language models ranging in size between 300 million-13 billion parameters. We pretrain our new models with large amounts of diverse data (400GB of text) from different Arabic varieties and domains. We evaluate JASMINE extensively in both intrinsic and extrinsic settings, using a comprehensive benchmark for zero- and few-shot learning across a wide range of NLP tasks. We also carefully develop and release a novel benchmark for both automated and human evaluation of Arabic autoregressive models focused at investigating potential social biases, harms, and toxicity in these models. We aim to responsibly release our models with interested researchers, along with code for experimenting with them
translated by 谷歌翻译
Privacy preserving deep learning is an emerging field in machine learning that aims to mitigate the privacy risks in the use of deep neural networks. One such risk is training data extraction from language models that have been trained on datasets , which contain personal and privacy sensitive information. In our study, we investigate the extent of named entity memorization in fine-tuned BERT models. We use single-label text classification as representative downstream task and employ three different fine-tuning setups in our experiments, including one with Differentially Privacy (DP). We create a large number of text samples from the fine-tuned BERT models utilizing a custom sequential sampling strategy with two prompting strategies. We search in these samples for named entities and check if they are also present in the fine-tuning datasets. We experiment with two benchmark datasets in the domains of emails and blogs. We show that the application of DP has a huge effect on the text generation capabilities of BERT. Furthermore, we show that a fine-tuned BERT does not generate more named entities entities specific to the fine-tuning dataset than a BERT model that is pre-trained only. This suggests that BERT is unlikely to emit personal or privacy sensitive named entities. Overall, our results are important to understand to what extent BERT-based services are prone to training data extraction attacks.
translated by 谷歌翻译
随着越来越多的可用文本数据,能够自动分析,分类和摘要这些数据的算法的开发已成为必需品。在本研究中,我们提出了一种用于关键字识别的新颖算法,即表示给定文档的关键方面的一个或多字短语的提取,称为基于变压器的神经标记器,用于关键字识别(TNT-KID)。通过将变压器架构适用于手头的特定任务并利用域特定语料库上的预先磨损的语言模型,该模型能够通过提供竞争和强大的方式克服监督和无监督的最先进方法的缺陷在各种不同的数据集中的性能,同时仅需要最佳执行系统所需的手动标记的数据。本研究还提供了彻底的错误分析,具有对模型内部运作的有价值的见解和一种消融研究,测量关键字识别工作流程的特定组分对整体性能的影响。
translated by 谷歌翻译
鉴于大型语言模型的广泛能力,应该有可能朝着一般的文本的助手工作,这些助手与人类价值一致,这意味着它是有帮助,诚实的和无害的。在此方向上的初始遗传,我们研究简单的基线技术和评估,例如提示。我们发现,从模型规模增加适度的干预措施的好处,概括为各种对准评估,并不会损害大型模型的性能。接下来,我们调查与对齐,比较仿制,二进制歧视和排名偏好建模相关的几个培训目标的缩放趋势。我们发现排名优先级模型比模仿学习更好地表现得多,并且通常以模型大小更有利地缩放。相比之下,二进制歧视通常与模仿学习非常类似地执行和缩放。最后,我们研究了一种“偏好模型预训练阶段的培训阶段,其目的是在对人偏好的芬明时提高样本效率。
translated by 谷歌翻译
我们提出了Pangu-Coder,这是一种仅预读的解码器语言模型,该模型采用pangu-alpha架构进行文本到代码生成,即给定自然语言问题描述的编程语言解决方案的合成。我们使用两阶段策略训练Pangu-Coder:第一阶段采用因果语言建模(CLM)来预先培训原始编程语言数据,而第二阶段则使用因果语言建模和掩盖语言建模(MLM)的组合培训目标,专注于文本到代码生成的下游任务,并培训松散的自然语言程序定义和代码功能。最后,我们讨论了pangu-coder-ft,该pander the是通过竞争性编程问题和代码与持续集成测试的结合进行了微调的。我们评估了pangu-coder,重点是它是否生成功能上正确的程序,并证明它在参加较小的上下文窗口和较少的数据培训的同时,它比诸如Codex之类的类似大小的模型(例如Codex)实现等效性或更好的性能。
translated by 谷歌翻译
大型预训练的语言模型能够产生多种多样的文本。从提示开始,这些模型产生了一种可以不可预测的叙述。现有的可控文本生成方法,该方法指导用户指定方向的文本中的叙述,需要创建培训语料库和额外的耗时培训程序。本文提出并调查了Contocation2Text,这是一种用于俄罗斯自动可控文本生成的插件方法,不需要微调。该方法基于两个交互模型:自回归语言Rugpt-3模型和自动编码语言Ruroberta模型。该方法的想法是根据自动编码模型的输出分布将自回归模型的输出分布移动,以确保文本中叙事的连贯过渡向指南短语,其中可以包含单个单词或搭配。能够考虑到令牌的左和右下方的自动编码模型“告诉”“自动回归模型”在当前一代步骤中,该模型是令牌最不合逻辑的,从而增加或降低了相应令牌的概率。使用该方法生成新闻文章的实验显示了其对自动生成的流利文本的有效性,这些文本包含用户指定的短语之间的连贯过渡。
translated by 谷歌翻译