Smart Paper Notes

Neural Transfer Learning for Repairing Security Vulnerabilities in C Code

Zimin Chen , Steve Kommrusch , Martin Monperrus

Category: Machine Learning

2021-04-16

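In this paper, we address the problem of automatically repairing software vulnerabilities with deep learning. The main obstacle to data-driven vulnerability repair is that the few existing datasets of known, confirmed vulnerabilities contain only a few thousand examples, whereas training a deep learning model usually requires hundreds of thousands. In this work, we build on the intuition that bug fixing and vulnerability fixing are related tasks, so knowledge learned from bug fixes can be transferred to fixing vulnerabilities. In the machine learning community, this technique is called transfer learning. We propose VRepair, an approach for repairing security vulnerabilities that is based on transfer learning. VRepair is first trained on a large bug-fix corpus and then tuned on a vulnerability-fix dataset that is an order of magnitude smaller. In our experiments, we show that a model trained only on the bug-fix corpus can already fix some vulnerabilities. We then demonstrate that transfer learning improves the ability to repair vulnerable C functions. We also show that the transfer learning model performs better than a model pre-trained with a denoising task and fine-tuned on the vulnerability-fixing task. In summary, this paper shows that, compared with learning on a small dataset alone, transfer learning works well for repairing security vulnerabilities in C.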

translated by Google Translate

Can OpenAI Codex and Other Large Language Models Help Us Fix Security Bugs?

Hammond Pearce , Benjamin Tan , Baleegh Ahmad , Ramesh Karri , Brendan Dolan-Gavitt

Category: Artificial Intelligence

2021-12-03

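Human developers can produce code with cybersecurity weaknesses. Can emerging "smart" code completion tools help repair those weaknesses? In this work, we examine the use of large language models (LLMs) for code, such as OpenAI's Codex and AI21's Jurassic J-1, for zero-shot vulnerability repair. We investigate the challenges of designing prompts that coax LLMs into generating repaired versions of insecure code. This is difficult because natural language offers many ways to phrase key information, both semantically and syntactically. Through a large-scale study of four commercially available, black-box, "off-the-shelf" LLMs, as well as a locally trained model, on a mix of synthetic, hand-crafted, and real-world security bug scenarios, our experiments show that the LLMs could collectively repair 100% of our synthetically generated and hand-crafted scenarios, as well as 58% of the vulnerabilities in a selection of historical bugs from real-world open-source projects.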

JEMMA: An Extensible Java Dataset for ML4Code Applications

Anjan Karmakar , Miltiadis Allamanis , Romain Robbes

Category: Machine Learning

2022-12-18

Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.

Improving Automated Program Repair with Domain Adaptation

Armin Zirak , Hadi Hemati

Category: Artificial Intelligence | Machine Learning

2022-12-21

Automated Program Repair (APR) is the process of fixing a bug or defect in source code with an automated tool. APR tools have recently shown promising results by leveraging state-of-the-art Natural Language Processing (NLP) techniques. APR tools such as TFix and CodeXGLUE, which combine text-to-text transformers with software-specific techniques, currently outperform the alternatives. However, in most APR studies the training and test sets are drawn from the same set of projects, whereas in practice APR models are meant to generalize to new and different projects. There is therefore a risk that APR models reported as highly effective will perform poorly when the characteristics of a new project or its bugs differ from those of the training set (domain shift). In this study, we first define and measure the domain shift problem in automated program repair. We then propose a domain adaptation framework that can adapt an APR model to a given target project. We conduct an empirical study with three domain adaptation methods (FullFineTuning, TuningWithLightWeightAdapterLayers, and CurriculumLearning) using two state-of-the-art domain adaptation tools (TFix and CodeXGLUE) and two APR models on 611 bugs from 19 projects. The results show that our proposed framework can improve the effectiveness of TFix by 13.05% and of CodeXGLUE by 23.4%. Another contribution of this study is a data synthesis method that addresses the lack of labelled data in APR. We leverage transformers to create a bug generator model, and use the generated synthetic data to domain-adapt TFix and CodeXGLUE on projects with no data (zero-shot learning), which yields an average improvement of 5.76% for TFix and 24.42% for CodeXGLUE.

Neural Program Repair: Systems, Challenges and Solutions

Wenkang Zhong , Chuanyi Li , Jidong Ge , Bin Luo

Category: Neural and Evolutionary Computing

2022-02-22

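Automated Program Repair (APR) aims to fix bugs in source code automatically. Recently, with advances in deep learning (DL), there has been a rise in Neural Program Repair (NPR) studies, which formulate APR as a translation task from buggy code to correct code and adopt neural networks based on an encoder-decoder architecture. Compared with other APR techniques, NPR approaches have a great advantage in applicability because they do not need any specification (i.e., a test suite). Although NPR has been a hot research direction, there is not yet any overview of the field. To help interested readers understand the architectures, challenges, and corresponding solutions of existing NPR systems, we conduct a literature review of the latest studies in this paper. We begin by introducing the background knowledge of the field. Next, for ease of understanding, we decompose the NPR procedure into a series of modules and explain the various design choices for each module. Furthermore, we identify several challenges and discuss the effect of existing solutions. Finally, we conclude and offer some promising directions for future research.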

VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements

Yangruibo Ding , Sahil Suneja , Yunhui Zheng , Jim Laredo , Alessandro Morari , Gail Kaiser , Baishakhi Ray

Category: Machine Learning

2021-12-20

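Automatically locating vulnerable statements in source code is crucial for assuring software security and easing developers' debugging efforts. This is even more important in today's software ecosystem, where vulnerable code can flow easily and unwittingly within and across software repositories such as GitHub. Across these millions of lines of code, traditional static and dynamic approaches struggle to scale. Although machine-learning-based approaches look promising in such a setting, most work detects vulnerable code at a coarser granularity, at the method or file level. Developers therefore still need to inspect a significant amount of code to find the vulnerable statements that need to be fixed. This paper presents VELVET, a novel ensemble learning approach for locating vulnerable statements. Our model combines graph-based and sequence-based neural networks to capture the local and global context of a program graph and to effectively understand code semantics and vulnerable patterns. To study VELVET's effectiveness, we use an off-the-shelf synthetic dataset and a recently published real-world dataset. In the static analysis setting, where vulnerable functions are not detected in advance, VELVET achieves 4.5x better performance than the baseline static analyzers on the real-world data. For the isolated vulnerability localization task, where we assume that a function is known to be vulnerable but the specific vulnerable statement is unknown, we compare VELVET with several neural networks that also attend to the local and global context of code. VELVET achieves 99.6% and 43.6% top-1 accuracy on the synthetic and real-world data, respectively, outperforming the baseline deep learning models by 5.3-29.0%.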

Learning to Parallelize in a Shared-Memory Environment with Transformers

Re'em Harel , Yuval Pinter , Gal Oren

Category: Natural Language Processing | Machine Learning

2022-04-27

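In recent years, the world has switched to many-core and multi-core shared-memory architectures. As a result, there is a growing need to exploit these architectures by introducing shared-memory parallelization schemes into software applications. OpenMP is the most comprehensive API that implements such schemes, and it is characterized by a readable interface. Nevertheless, introducing OpenMP into code is challenging because of pervasive pitfalls in the management of parallel shared memory. To make this task easier, many source-to-source (S2S) compilers have been created over the years, tasked with inserting OpenMP directives into code automatically. Besides having limited robustness to their input format, these compilers still do not achieve satisfactory coverage and precision in locating parallelizable code and generating appropriate directives. In this work, we propose leveraging recent advances in ML techniques, specifically in natural language processing (NLP), to replace S2S compilers altogether. We create a database (corpus), Open-OMP, specifically for this goal. Open-OMP contains over 28,000 code snippets, half of which contain OpenMP directives while the other half do not need parallelization at all. We use the corpus to train systems that automatically classify code segments in need of parallelization and suggest individual OpenMP clauses. We train several transformer models, named PragFormer, for these tasks and show that they outperform statistically trained baselines and automatic S2S parallelization compilers both in classifying the overall need for an OpenMP directive and in introducing private and reduction clauses. Our source code and database are available at: https://github.com/scientific-computing-lab-nrcn/pragformer.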

Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes

Cedric Richter , Heike Wehrheim

Category: Machine Learning

2022-07-01

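Real bug fixes found in open-source repositories seem to be the perfect source for learning to localize and repair real bugs. However, the absence of large-scale bug-fix collections has made it difficult in the past to exploit real bug fixes effectively when training larger neural models. In contrast, artificial bugs, produced by mutating existing source code, can easily be obtained at sufficient scale and are therefore often preferred when training existing approaches. Still, localization and repair models trained on artificial bugs usually underperform when faced with real bugs. This raises the question of whether bug localization and repair models trained on real bug fixes are more effective at localizing and repairing real bugs. We address this question by introducing RealiT, a pre-train-and-fine-tune approach for effectively learning to localize and repair real bugs from real bug fixes. RealiT is first pre-trained on a large number of artificial bugs produced by traditional mutation operators and then fine-tuned on a smaller set of real bug fixes. Fine-tuning does not require any modification of the learning algorithm and can therefore be adopted easily in various training scenarios for bug localization or repair (even when real training data is scarce). In addition, we find that training on real bug fixes with RealiT is empirically powerful: it nearly doubles the localization performance of an existing model on real bugs while maintaining or even improving repair performance.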
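To make the idea of artificial bugs "produced by mutating existing source code" concrete, here is a minimal sketch of a single mutation operator written with Python's standard `ast` module. The class name `SwapComparison`, the helper `make_artificial_bug`, and the choice of mutation (turning the first `<` into `<=`) are illustrative assumptions, not the operators used by RealiT.

```python
import ast


class SwapComparison(ast.NodeTransformer):
    """Hypothetical mutation operator: turn the first `<` into `<=`."""

    def __init__(self):
        self.mutated = False

    def visit_Compare(self, node):
        self.generic_visit(node)  # handle nested comparisons first
        if not self.mutated:
            for i, op in enumerate(node.ops):
                if isinstance(op, ast.Lt):
                    node.ops[i] = ast.LtE()  # inject an off-by-one style bug
                    self.mutated = True
                    break
        return node


def make_artificial_bug(source: str) -> str:
    """Return a mutated copy of `source` (requires Python 3.9+ for ast.unparse)."""
    tree = ast.parse(source)
    mutator = SwapComparison()
    tree = mutator.visit(tree)
    if not mutator.mutated:
        raise ValueError("no mutation site found")
    return ast.unparse(tree)


CORRECT = """
def count_below(values, limit):
    return sum(1 for v in values if v < limit)
"""

# (buggy, correct) is the kind of pair a repair model could be pre-trained on.
BUGGY = make_artificial_bug(CORRECT)
```

Pairs generated this way are cheap to produce in bulk, which is why they suit a pre-training stage; the abstract's point is that a smaller set of real fixes is still needed afterwards.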

Beyond the C: Retargetable Decompilation using Neural Machine Translation

Iman Hosseini , Brendan Dolan-Gavitt

Category: Natural Language Processing

2022-12-17

The problem of reversing the compilation process, decompilation, is an important tool in reverse engineering of computer software. Recently, researchers have proposed using techniques from neural machine translation to automate the process in decompilation. Although such techniques hold the promise of targeting a wider range of source and assembly languages, to date they have primarily targeted C code. In this paper we argue that existing neural decompilers have achieved higher accuracy at the cost of requiring language-specific domain knowledge such as tokenizers and parsers to build an abstract syntax tree (AST) for the source language, which increases the overhead of supporting new languages. We explore a different tradeoff that, to the extent possible, treats the assembly and source languages as plain text, and show that this allows us to build a decompiler that is easily retargetable to new languages. We evaluate our prototype decompiler, Beyond The C (BTC), on Go, Fortran, OCaml, and C, and examine the impact of parameters such as tokenization and training data selection on the quality of decompilation, finding that it achieves comparable decompilation results to prior work in neural decompilation with significantly less domain knowledge. We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.

FixEval: Execution-based Evaluation of Program Fixes for Competitive Programming Problems

Md Mahim Anjum Haque , Wasi Uddin Ahmad , Ismini Lourentzou , Chris Brown

Category: Machine Learning

2022-06-15

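Source code repositories consist of large codebases that often contain error-prone programs. The increasing complexity of software has led to a drastic rise in the time and cost of identifying and fixing these defects. Various methods exist for automatically generating fixes for buggy code. However, because the combinatorial space of possible solutions for a particular bug is large, few tools and datasets are available for evaluating generated code effectively. In this work, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their respective fixes. We introduce a rich test suite to evaluate and assess the correctness of model-generated program fixes. We take two Transformer language models pre-trained on programming languages as our baselines and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not accurately reflect the quality of model-generated program fixes, while execution-based methods evaluate programs against all the cases and scenarios specifically designed for that solution. We therefore believe FixEval provides a step towards real-world automatic bug fixing and the evaluation of model-generated code.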
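The difference between match-based and execution-based evaluation can be illustrated with a small sketch. The function names `exact_match` and `passes_tests`, the toy `total` problem, and its test cases are assumptions made up for this example; they are not part of the FixEval benchmark or its tooling.

```python
def exact_match(candidate: str, reference: str) -> bool:
    """Match-based metric: compare whitespace-normalized source text."""
    return candidate.split() == reference.split()


def passes_tests(candidate: str, func_name: str, tests) -> bool:
    """Execution-based metric: run the candidate on every (args, expected) pair."""
    namespace = {}
    try:
        # exec is acceptable only for trusted toy code; real submissions need a sandbox.
        exec(candidate, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


REFERENCE_FIX = "def total(xs):\n    return sum(xs)\n"
# A correct fix that is textually different from the reference.
MODEL_FIX = (
    "def total(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc\n"
)
# An incorrect fix that looks superficially close to the reference.
WRONG_FIX = "def total(xs):\n    return len(xs)\n"

TESTS = [(([1, 2, 3],), 6), (([],), 0), (([5],), 5)]

MATCH_SCORE = exact_match(MODEL_FIX, REFERENCE_FIX)
EXEC_SCORE = passes_tests(MODEL_FIX, "total", TESTS)
```

Here the match-based metric rejects `MODEL_FIX` even though it is functionally correct, while the execution-based check accepts it and rejects `WRONG_FIX`, which is the kind of disagreement the abstract describes.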

Proceedings of the 3rd International Workshop on Reading Music Systems

Jorge Calvo-Zaragoza , Alexander Pacha

Category: Computer Vision | Machine Learning

2022-12-01

The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.

Assessing Project-Level Fine-Tuning of ML4SE Models

Egor Bogomolov , Sergey Zhuravlev , Egor Spirin , Timofey Bryksin

Category: Machine Learning

2022-06-07

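Machine Learning for Software Engineering (ML4SE) is an actively growing research area that focuses on methods that help programmers in their work. To be applied in practice, the developed methods need to reach reasonable quality so that they help rather than distract developers. While new approaches to code representation and data collection improve the overall quality of the models, they do not take into account the information that can be obtained from the project at hand. In this work, we investigate how a model's quality can be improved if we target a specific project. We develop a framework for assessing the quality improvements that models can gain after fine-tuning for the method name prediction task on a particular project. We evaluate three models of different complexity and compare their quality in three settings: trained on a large dataset of Java projects, further fine-tuned on data from a particular project, and trained from scratch on that data. We show that per-project fine-tuning can greatly improve the models' quality because they capture the project's domain and naming conventions. We open-source the tool we used for data collection, as well as the code to run the experiments: https://zenodo.org/record/6040745.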

AdaptivePaste: Code Adaptation through Learning Semantics-aware Variable Usage Representations

Xiaoyu Liu , Jinu Jang , Neel Sundaresan , Miltiadis Allamanis , Alexey Svyatkovskiy

Category: Natural Language Processing

2022-05-23

In software development, it is common for programmers to copy-paste or port code snippets and then adapt them to their use case. This scenario motivates the code adaptation task, a variant of program repair which aims to adapt variable identifiers in a pasted snippet of code to the surrounding, preexisting source code. However, no existing approach has been shown to effectively address this task. In this paper, we introduce AdaptivePaste, a learning-based approach to source code adaptation, based on transformers and a dedicated dataflow-aware deobfuscation pre-training task to learn meaningful representations of variable usage patterns. We evaluate AdaptivePaste on a dataset of code snippets in Python. Results suggest that our model can learn to adapt source code with 79.8% accuracy. To evaluate how valuable AdaptivePaste is in practice, we perform a user study with 10 Python developers on a hundred real-world copy-paste instances. The results show that AdaptivePaste reduces the dwell time to nearly half the time it takes for manual code adaptation, and helps to avoid bugs. In addition, we utilize the participant feedback to identify potential avenues for improvement of AdaptivePaste.

DeepPERF: A Deep Learning-Based Approach For Improving Software Performance

Spandan Garg , Roshanak Zilouchian Moghaddam , Colin B. Clement , Neel Sundaresan , Chen Wu

Category: Artificial Intelligence

2022-06-27

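Improving software performance is an important yet challenging part of the software development cycle. Today, most performance inefficiencies are identified and patched by performance experts. Recent advances in deep learning approaches and the widespread availability of open-source data create a great opportunity to automate the identification and patching of performance problems. In this paper, we present DeepPERF, a transformer-based approach for suggesting performance improvements for C# applications. We pre-train DeepPERF on English and source code corpora and then fine-tune it on the task of generating performance improvement patches for C# applications. Our evaluation shows that our model can generate the same performance improvement suggestion as the developer's fix in about 53% of cases, with about 34% of them matching verbatim, on our expert-verified dataset of performance changes made by C# developers. Additionally, we evaluate DeepPERF on 50 open-source C# repositories on GitHub using both benchmarks and unit tests, and find that our model can suggest valid performance improvements that reduce both CPU usage and memory allocations. So far, we have submitted 19 pull requests with 28 different performance optimizations, and 11 of these PRs have been approved by the project owners.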

Transformer-Based Language Models for Software Vulnerability Detection

Chandra Thapa , Seung Ick Jang , Muhammad Ejaz Ahmed , Seyit Camtepe , Josef Pieprzyk , Surya Nepal

Category: Artificial Intelligence | Machine Learning

2022-04-07

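Large transformer-based language models demonstrate excellent performance in natural language processing. Considering the transferability of the knowledge these models gain in one domain to other related domains, and the closeness of natural languages to high-level programming languages such as C/C++, this work studies how to leverage (large) transformer-based language models to detect software vulnerabilities and how good these models are at vulnerability detection tasks. To this end, we first present a systematic (cohesive) framework that details source code translation, model preparation, and inference. We then perform an empirical analysis on software vulnerability datasets of C/C++ source code containing multiple vulnerabilities related to library function calls, pointer usage, array usage, and arithmetic expressions. Our empirical results demonstrate the good performance of language models in vulnerability detection. Moreover, these language models achieve better performance metrics, such as F1-score, than contemporary models, namely bidirectional long short-term memory and bidirectional gated recurrent unit networks. Experimenting with language models is always challenging because of the computing resources, platforms, libraries, and dependencies they require. This paper therefore also analyses popular platforms for fine-tuning these models efficiently and offers recommendations for choosing a platform.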

Syntax-Aware On-the-Fly Code Completion

Wannita Takerngsaksiri , Chakkrit Tantithamthavorn , Yuan-Fang Li

Category: Artificial Intelligence

2022-11-09

Code completion aims to help improve developers' productivity by suggesting the next code tokens from a given context. Various approaches have been proposed to incorporate abstract syntax tree (AST) information for model training, ensuring that code completion is aware of the syntax of the programming languages. However, existing syntax-aware code completion approaches are not on-the-fly, as we found that for every two-thirds of characters that developers type, AST fails to be extracted because it requires the syntactically correct source code, limiting its practicality in real-world scenarios. On the other hand, existing on-the-fly code completion does not consider syntactic information yet. In this paper, we propose PyCoder to leverage token types, a kind of lightweight syntactic information, which is readily available and aligns with the natural order of source code. Our PyCoder is trained in a multi-task training manner so that by learning the supporting task of predicting token types during the training phase, the models achieve better performance on predicting tokens and lines of code without the need for token types in the inference phase. Comprehensive experiments show that PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for the token-level predictions, which is 0.43%-24.25% more accurate than baselines. In addition, PyCoder achieves an exact match of 43.37% for the line-level predictions, which is 3.63%-84.73% more accurate than baselines. These results lead us to conclude that token type information (an alternative to syntactic information) that is rarely used in the past can greatly improve the performance of code completion approaches, without requiring the syntactically correct source code like AST-based approaches do. Our PyCoder is publicly available on HuggingFace.
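As a rough illustration of what "token types" can look like, the sketch below uses Python's standard `tokenize` module to pair each token with a coarse type name. This is only an assumption about one way to obtain such labels; the helper name `tokens_with_types` is hypothetical, and the type vocabulary shown here is not necessarily the one PyCoder uses.

```python
import io
import tokenize

# Layout-only tokens that carry no completion-relevant text.
SKIP = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def tokens_with_types(source: str):
    """Return (type_name, text) pairs using the tokenizer only, without building an AST."""
    pairs = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type not in SKIP:
                pairs.append((tokenize.tok_name[tok.type], tok.string))
    except (tokenize.TokenError, SyntaxError):
        # Stop at a tokenization error and keep whatever was collected so far.
        pass
    return pairs


PAIRS = tokens_with_types("total = price * 2\n")
```

In a multi-task setup of the kind the abstract describes, the type names would serve as labels for the supporting task during training, while the token texts remain the targets of the main completion task.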

NatGen: Generative pre-training by "Naturalizing" source code

Saikat Chakraborty , Toufique Ahmed , Yangruibo Ding , Premkumar Devanbu , Baishakhi Ray

Category: Artificial Intelligence | Machine Learning

2022-06-15

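Pre-trained generative language models for source code (e.g., PLBART, CodeT5, SPT-Code) have yielded strong results on several tasks in the past few years, including code generation and translation. These models adopt varying pre-training objectives to learn the statistics of code construction from very large-scale corpora in a self-supervised fashion, and the success of pre-trained models largely hinges on these objectives. This paper proposes a new pre-training objective, "naturalizing" source code, which exploits code's bimodal, dual-channel (formal and natural channels) nature. Unlike natural language, code's bimodal, dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantics-preserving transformations to produce unnatural forms of code, and then force our model to produce the more natural original programs written by developers. Learning to generate equivalent but more natural code, at scale, over large corpora of open-source code and without explicit manual supervision, helps the model learn to both ingest and generate code. We fine-tune our model on three generative software engineering tasks (code generation, code translation, and code refinement) with limited human-curated labeled data and achieve state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow).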
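The "naturalizing" objective described above can be sketched as building (unnatural, natural) pairs with a semantics-preserving transformation. The example below inserts dead code into function bodies using Python's `ast` module. The class `InsertDeadCode` and the helper `denaturalize` are hypothetical names, and this single transformation is an illustration only; it does not reproduce the paper's six transformation classes.

```python
import ast


class InsertDeadCode(ast.NodeTransformer):
    """Hypothetical transformation: add an unreachable loop to each function body."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        dead = ast.parse("while False:\n    pass").body[0]
        # Keep a leading docstring in place so that __doc__ is unchanged.
        position = 1 if ast.get_docstring(node) is not None else 0
        node.body.insert(position, dead)
        return node


def denaturalize(source: str) -> str:
    """Return a semantically equivalent but less natural version of `source`."""
    tree = InsertDeadCode().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


NATURAL = """
def area(width, height):
    return width * height
"""

UNNATURAL = denaturalize(NATURAL)

# Model input is the unnatural form; the training target is the developer-written original.
TRAINING_PAIR = (UNNATURAL, NATURAL)
```

Because the transformation is applied mechanically, such pairs can be produced from any parseable code without manual labeling, which matches the self-supervised setting the abstract describes.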

Neurosymbolic Repair for Low-Code Formula Languages

Rohan Bavishi , Harshit Joshi , José Pablo Cambronero Sánchez , Anna Fariha , Sumit Gulwani , Vu Le , Ivan Radicek , Ashish Tiwari

Category: Artificial Intelligence

2022-07-24

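Most users of low-code platforms, such as Excel and PowerApps, write programs in domain-specific formula languages to carry out nontrivial tasks. Users can often write most of the program they want but introduce small mistakes that yield broken formulas. These mistakes, which can be both syntactic and semantic, are hard for low-code users to identify and fix, even though they can be resolved with just a few edits. We formalize the problem of producing such edits as the last-mile repair problem. To address it, we developed LaMirage, a last-mile repair-engine generator that combines symbolic and neural techniques to perform last-mile repair in low-code formula languages. LaMirage takes a grammar and a set of domain-specific constraints/rules, which jointly approximate the target language, and uses them to generate a repair engine that can fix formulas in that language. To tackle the challenges of localizing errors and ranking candidate repairs, LaMirage leverages neural techniques, whereas it relies on symbolic methods to generate candidate repairs. This combination allows LaMirage to find repairs that satisfy the provided grammar and constraints and then pick the most natural one. We compare LaMirage with state-of-the-art neural and symbolic approaches on 400 real Excel and PowerFx formulas, where LaMirage outperforms all baselines. We release these benchmarks to encourage subsequent work in low-code domains.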

Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions

Hammond Pearce , Baleegh Ahmad , Benjamin Tan , Brendan Dolan-Gavitt , Ramesh Karri

Category: Artificial Intelligence

2021-08-20

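There is burgeoning interest in designing AI-based systems that assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these is the first self-described "AI pair programmer", GitHub Copilot, a language model trained on open-source GitHub code. However, code often contains bugs, so given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns about the security of Copilot's code contributions. In this work, we systematically investigate the prevalence of, and the conditions that can cause, GitHub Copilot recommending insecure code. To perform this analysis, we prompt Copilot to generate code in scenarios relevant to high-risk CWEs (e.g., those from MITRE's "Top 25" list). We explore Copilot's performance along three distinct code generation axes, examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, yielding 1,689 programs. Of these, we found approximately 40% to be vulnerable.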

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel , Noam Shazeer , Adam Roberts , Katherine Lee , Sharan Narang , Michael Matena , Yanqi Zhou , Wei Li , Peter J. Liu

Category:

2019-10-23

Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
