Smart Paper Notes

Automatic Generation of Factual News Headlines in Finnish

Maximilian Koppatz , Khalid Alnajjar , Mika Hämäläinen , Thierry Poibeau

Category: Natural Language Processing

2022-12-05

We present a novel approach to generating news headlines in Finnish for a given news story. We model this as a summarization task where a model is given a news article, and its task is to produce a concise headline describing the main topic of the article. Because there are no openly available GPT-2 models for Finnish, we will first build such a model using several corpora. The model is then fine-tuned for the headline generation task using a massive news corpus. The system is evaluated by 3 expert journalists working in a Finnish media house. The results showcase the usability of the presented approach as a headline suggestion tool to facilitate the news production process.


Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel , Noam Shazeer , Adam Roberts , Katherine Lee , Sharan Narang , Michael Matena , Yanqi Zhou , Wei Li , Peter J. Liu


2019-10-23

Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
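The text-to-text framing described in this abstract can be illustrated with a small sketch. It is not the authors' code: the helper name `to_text_to_text` and the `prefix: text` layout are assumptions made for illustration, showing only how different tasks can share one string-in, string-out interface.

```python
def to_text_to_text(task_prefix: str, text: str) -> str:
    """Cast one example as a single input string of the form '<prefix>: <text>'.

    Hypothetical helper: the prefix names the task, so one model can serve
    several tasks through the same text-in, text-out interface.
    """
    return f"{task_prefix}: {text.strip()}"


# Two different tasks expressed in the same format.
examples = [
    to_text_to_text("summarize", "  A long news article ...  "),
    to_text_to_text("translate English to German", "That is good."),
]
```

Under this framing the target for every task is also a plain string (a summary, a translation, a class label written out as text), so the training objective does not change between tasks.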


A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models

Hanqing Zhang , Haolin Song , Shaoyu Li , Ming Zhou , Dawei Song

Category: Natural Language Processing

2022-01-14

Controllable Text Generation (CTG) is an emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints in practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods needs to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches have emerged in the recent 3-4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review on the common tasks, main approaches and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.


Language models are few-shot learners


We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
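As a rough illustration of what "tasks and few-shot demonstrations specified purely via text" can look like, the sketch below assembles a prompt string from an instruction, a few demonstrations, and a query. The function name and the `Input:`/`Output:` layout are assumptions chosen for this example, not a format taken from the paper.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Assemble a plain-text prompt: instruction, K worked examples, then the query.

    No model weights are touched; the demonstrations only condition the
    model through its input text. The layout here is a hypothetical one.
    """
    lines = [instruction, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model is expected to continue from here
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
```

The number of demonstrations is limited in practice by the model's context window, so a caller would usually cap the list before building the prompt.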


Language models are unsupervised multitask learners


Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset, matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.


Grammatical Error Correction: A Survey of the State of the Art

Christopher Bryant , Zheng Yuan , Muhammad Reza Qorib , Hannan Cao , Hwee Tou Ng , Ted Briscoe

Category: Natural Language Processing | Artificial Intelligence

2022-11-09

Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.


Survey of Hallucination in Natural Language Generation

Ziwei Ji , Nayeon Lee , Rita Frieske , Tiezheng Yu , Dan Su , Yan Xu , Etsuko Ishii , Yejin Bang , Wenliang Dai , Andrea Madotto

Category: Natural Language Processing

2022-02-08

Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.


BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Mike Lewis , Yinhan Liu , Naman Goyal , Marjan Ghazvininejad , Abdelrahman Mohamed , Omer Levy , Ves Stoyanov , Luke Zettlemoyer


2019-10-29

We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
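The two noising operations singled out in this abstract, sentence shuffling and span in-filling, can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: the function names, the `<mask>` string, and the caller-chosen span position are assumptions, and how spans are sampled is left out.

```python
import random

MASK = "<mask>"  # placeholder mask token; the real vocabulary entry is an assumption


def infill_span(tokens, start, length):
    """Replace tokens[start:start+length] with ONE mask token.

    The model must then recover both the content and the length of the span.
    """
    return tokens[:start] + [MASK] + tokens[start + length:]


def shuffle_sentences(sentences, seed=0):
    """Return the sentences in a random order (seeded for reproducibility)."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


tokens = "the quick brown fox jumps".split()
corrupted = infill_span(tokens, 1, 3)  # ['the', '<mask>', 'jumps']
permuted = shuffle_sentences(["First.", "Second.", "Third."])
```

The training pair is then (corrupted input, original text): the encoder reads the noised sequence and the decoder is trained to reconstruct the original.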


How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN

R. Thomas McCoy , Paul Smolensky , Tal Linzen , Jianfeng Gao , Asli Celikyilmaz

Category: Natural Language Processing

2021-11-18

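Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure (e.g., individual dependencies), model-generated text is substantially less novel than our baseline of human-generated text from each model's test set. For larger-scale structure (e.g., overall sentence structure), model-generated text is as novel as or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from the training set. We also perform extensive manual analysis showing that GPT-2's novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues (e.g., being self-contradictory).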
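One simple way to make the n-gram notion of novelty concrete is to count how many n-grams in a generated text never occur in the training text. The sketch below does exactly that; it is a minimal illustration with hypothetical function names, not the RAVEN implementation, and it ignores tokenization details and the syntactic analyses.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_fraction(generated, training, n):
    """Fraction of n-grams in `generated` that never appear in `training`."""
    seen = set(ngrams(training, n))
    candidates = ngrams(generated, n)
    if not candidates:
        return 0.0  # text shorter than n tokens: nothing to score
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)


training = "the cat sat on the mat".split()
generated = "the cat sat on a mat".split()
fraction = novel_ngram_fraction(generated, training, 2)  # 2 of 5 bigrams are unseen
```

Computing the same fraction for human-written held-out text gives a baseline to compare against, which is the kind of comparison the abstract reports.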


Towards Fine-Dining Recipe Generation with Generative Pre-trained Transformers

Konstantinos Katserelis , Konstantinos Skianis

Category: Natural Language Processing | Artificial Intelligence | Machine Learning

2022-09-26

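Food is essential to human survival, so much so that we have developed different recipes to suit our taste needs. In this work, we propose a novel way of creating new fine-dining recipes from scratch using Transformers, specifically auto-regressive language models. Given a small dataset of food recipes, we try to train models to identify cooking techniques, propose novel recipes, and test the power of fine-tuning with minimal data.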


MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers

Krishna Pillutla , Swabha Swayamdipta , Rowan Zellers , John Thickstun , Sean Welleck , Yejin Choi , Zaid Harchaoui

Category: Natural Language Processing

2021-02-02

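As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem. We introduce MAUVE, a comparison measure for open-ended text generation, which directly compares the learnt distribution of a text generation model to the distribution of human-written text using divergence frontiers. MAUVE scales up to modern text generation models by computing information divergences in a quantized embedding space. Through an extensive empirical study on three open-ended generation tasks, we find that MAUVE identifies known properties of generated text, scales naturally with model size, and correlates with human judgments, with fewer restrictions than existing distributional evaluation metrics.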


Extracting Training Data from Large Language Models

Nicholas Carlini , Florian Tramer , Eric Wallace , Matthew Jagielski , Ariel Herbert-Voss , Katherine Lee , Adam Roberts , Tom Brown , Dawn Song , Ulfar Erlingsson


2020-12-14

It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences is included in just one document in the training data. We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.


Why is constrained neural language generation particularly challenging?

Cristina Garbacea , Qiaozhu Mei

Category: Natural Language Processing | Artificial Intelligence

2022-06-11

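Recent advances in deep neural language models, combined with the capacity of large-scale datasets, have accelerated the development of natural language generation systems that produce fluent and coherent text (with varying degrees of success) across a multitude of tasks and application contexts. However, controlling the output of these models to meet user needs remains an open challenge. This is crucial not only for customizing the content and style of the generated language, but also for the safe and reliable deployment of these models in the real world. We present an extensive survey on the emerging topic of constrained neural language generation, in which we formally define and categorize the problems of natural language generation by distinguishing between conditions and constraints (the latter being testable conditions on the output text rather than the input), present constrained text generation tasks, and review existing methods and evaluation metrics for constrained text generation. Our aim is to highlight recent progress and trends in this emerging field, pointing out the most promising directions and the limitations, in order to advance the state of the art in constrained neural language generation research.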


Text Detoxification using Large Pre-trained Neural Models

David Dale , Anton Voronov , Daryna Dementieva , Varvara Logacheva , Olga Kozlova , Nikita Semenov , Alexander Panchenko

Category: Natural Language Processing | Machine Learning

2021-09-18

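We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. We use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. Our second method uses BERT to replace toxic words with their non-offensive synonyms. We make the method more flexible by enabling BERT to replace mask tokens with a variable number of words. Finally, we present the first large-scale comparative study of style transfer models on the task of toxicity removal. We compare our models with a number of methods for style transfer. The models are evaluated in a reference-free way using a combination of unsupervised style transfer metrics. Both methods we suggest yield new SOTA results.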


JASMINE: Arabic GPT Models for Few-Shot Learning

El Moatez Billah Nagoudi , Muhammad Abdul-Mageed , AbdelRahim Elmadany , Alcides Alcoba Inciarte , Md Tawkat Islam Khondaker

Category: Natural Language Processing

2022-12-21

Task-agnostic generative pretraining (GPT) has recently proved promising for zero- and few-shot learning, gradually diverting attention from the expensive supervised learning paradigm. Although the community is accumulating knowledge as to the capabilities of English-language autoregressive models such as GPT-3 adopting this generative approach, scholarship about these models remains acutely Anglocentric. Consequently, the community currently has serious gaps in its understanding of this class of models, their potential, and their societal impacts in diverse settings, linguistic traditions, and cultures. To alleviate this issue for Arabic, a collection of diverse languages and language varieties with a population of more than 400 million, we introduce JASMINE, a suite of powerful Arabic autoregressive Transformer language models ranging in size from 300 million to 13 billion parameters. We pretrain our new models with large amounts of diverse data (400GB of text) from different Arabic varieties and domains. We evaluate JASMINE extensively in both intrinsic and extrinsic settings, using a comprehensive benchmark for zero- and few-shot learning across a wide range of NLP tasks. We also carefully develop and release a novel benchmark for both automated and human evaluation of Arabic autoregressive models focused on investigating potential social biases, harms, and toxicity in these models. We aim to responsibly release our models to interested researchers, along with code for experimenting with them.


A Study on Extracting Named Entities from Fine-tuned vs. Differentially Private Fine-tuned BERT Models

Andor Diera , Nicolas Lell , Aygul Garifullina , Ansgar Scherp

Category: Natural Language Processing

2022-12-07

Privacy-preserving deep learning is an emerging field in machine learning that aims to mitigate the privacy risks in the use of deep neural networks. One such risk is training data extraction from language models that have been trained on datasets which contain personal and privacy-sensitive information. In our study, we investigate the extent of named entity memorization in fine-tuned BERT models. We use single-label text classification as a representative downstream task and employ three different fine-tuning setups in our experiments, including one with Differential Privacy (DP). We create a large number of text samples from the fine-tuned BERT models utilizing a custom sequential sampling strategy with two prompting strategies. We search in these samples for named entities and check if they are also present in the fine-tuning datasets. We experiment with two benchmark datasets in the domains of emails and blogs. We show that the application of DP has a huge effect on the text generation capabilities of BERT. Furthermore, we show that a fine-tuned BERT does not generate more named entities specific to the fine-tuning dataset than a BERT model that is pre-trained only. This suggests that BERT is unlikely to emit personal or privacy-sensitive named entities. Overall, our results are important to understand to what extent BERT-based services are prone to training data extraction attacks.


TNT-KID: Transformer-based Neural Tagger for Keyword Identification

Matej Martinc , Blaž Škrlj , Senja Pollak

Category: Natural Language Processing

2020-03-20

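With growing amounts of available textual data, the development of algorithms capable of automatically analyzing, categorizing, and summarizing these data has become a necessity. In this research, we present a novel algorithm for keyword identification, i.e., the extraction of one- or multi-word phrases representing key aspects of a given document, called Transformer-based Neural Tagger for Keyword Identification (TNT-KID). By adapting the Transformer architecture to the specific task at hand and leveraging language model pretraining on a domain-specific corpus, the model is able to overcome deficiencies of both supervised and unsupervised state-of-the-art approaches to keyword extraction, offering competitive and robust performance on a variety of different datasets while requiring only a fraction of the manually labeled data required by the best-performing systems. This study also offers a thorough error analysis with valuable insights into the inner workings of the model, and an ablation study measuring the influence of specific components of the keyword identification workflow on overall performance.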


A General Language Assistant as a Laboratory for Alignment

Amanda Askell , Yuntao Bai , Anna Chen , Dawn Drain , Deep Ganguli , Tom Henighan , Andy Jones , Nicholas Joseph , Ben Mann , Nova DasSarma

Category: Natural Language Processing | Machine Learning

2021-12-01

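Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction, we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next, we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally, we study a "preference model pre-training" stage of training, with the goal of improving sample efficiency when fine-tuning on human preferences.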


PanGu-Coder: Program Synthesis with Function-Level Language Modeling

Fenia Christopoulou , Gerasimos Lampouras , Milan Gritta , Guchun Zhang , Yinpeng Guo , Zhongqi Li , Qi Zhang , Meng Xiao , Bo Shen , Lin Li

Category: Machine Learning | Artificial Intelligence | Natural Language Processing

2022-07-22

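We present PanGu-Coder, a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation, i.e., the synthesis of programming language solutions given a natural language problem description. We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling (CLM) to pre-train on raw programming language data, while the second stage uses a combination of Causal Language Modelling and Masked Language Modelling (MLM) training objectives that focus on the downstream task of text-to-code generation and train on loosely curated pairs of natural language program definitions and code functions. Finally, we discuss PanGu-Coder-FT, which is fine-tuned on a combination of competitive programming problems and code with continuous integration tests. We evaluate PanGu-Coder with a focus on whether it generates functionally correct programs and demonstrate that it achieves equivalent or better performance than similarly sized models, such as Codex, while attending to a smaller context window and training on less data.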


Collocation2Text: Controllable Text Generation from Guide Phrases in Russian

Sergey Vychegzhanin , Evgeny Kotelnikov

Category: Natural Language Processing

2022-06-18

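Large pre-trained language models are capable of generating varied and fluent texts. Starting from a prompt, these models generate a narrative that can develop unpredictably. Existing methods of controllable text generation, which guide the narrative in the text in a user-specified direction, require creating a training corpus and an additional, time-consuming training procedure. This paper proposes and investigates Collocation2Text, a plug-and-play method for automatic controllable text generation in Russian that does not require fine-tuning. The method is based on two interacting models: the autoregressive language model ruGPT-3 and the autoencoding language model ruRoBERTa. The idea of the method is to shift the output distribution of the autoregressive model according to the output distribution of the autoencoding model, in order to ensure a coherent transition of the narrative towards the guide phrase, which can contain single words or collocations. The autoencoding model, which is able to take into account both the left and right context of a token, "tells" the autoregressive model which tokens are the most and least logical at the current generation step, increasing or decreasing the probabilities of the corresponding tokens. Experiments on generating news articles with the proposed method showed its effectiveness for automatically generating fluent texts that contain coherent transitions between user-specified phrases.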
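The abstract does not state how the autoregressive distribution is shifted, so the following is only one plausible sketch of the general idea: combine the two next-token distributions in log space and renormalize. The function name, the weight `alpha`, and the log-linear combination rule are all assumptions for illustration, not the paper's formula.

```python
import math


def shift_distribution(p_ar, p_ae, alpha=1.0):
    """Shift autoregressive probabilities p_ar using autoencoder probabilities p_ae.

    Assumed rule: p_new(t) is proportional to p_ar(t) * p_ae(t) ** alpha.
    Both inputs are lists over the same vocabulary with strictly positive
    entries; alpha = 0 leaves p_ar unchanged.
    """
    scores = [math.log(a) + alpha * math.log(e) for a, e in zip(p_ar, p_ae)]
    top = max(scores)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Toy two-token vocabulary: the autoregressive model is undecided,
# while the autoencoding model prefers the first token.
shifted = shift_distribution([0.5, 0.5], [0.8, 0.2])
unchanged = shift_distribution([0.7, 0.3], [0.1, 0.9], alpha=0.0)
```

Tokens that the autoencoding model rates as likely gain probability and the rest lose it, which matches the "increasing or decreasing the probabilities" behaviour the abstract describes at a high level.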
