预审前的语言模型已被证明在许多与软件有关的一代任务中都是有效的。但是,它们不适合编辑任务,因为它们不是为了推理编辑的原因。为了解决这个问题,我们提出了一个新颖的预处理目标,该目标明确地对编辑进行了建模并使用它来构建Coditt5,这是一种用于软件相关编辑任务的大型语言模型,该任务是在大量源代码和自然语言评论中鉴定的。我们将其对各种下游编辑任务进行微调,包括评论更新,错误修复和自动代码审核。通过优于基于纯生成的模型,我们证明了方法的普遍性及其对编辑任务的适用性。我们还展示了纯生成模型和我们的基于编辑的模型如何通过简单的重读策略相互补充,我们可以通过该策略实现三个下游编辑任务的最新性能。
translated by 谷歌翻译
源代码的预训练的生成语言模型(例如PLBART,CODET5,SPT-CODE)在过去几年中对多个任务(包括代码生成和翻译)产生了强劲的结果。这些模型采用了不同的训练前目标,以自我监督的方式从非常大规模的语料库中学习代码构建的统计数据。预训练模型的成功很大程度上取决于这些预训练的目标。本文提出了一个新的预训练目标,即“归化”源代码,利用代码的双峰,双通道(正式和自然渠道)性质。与自然语言不同,代码的双峰,双通道的性质使我们能够大规模生成语义上等效的代码。我们介绍了六类的语义保存转换,以引入非自然的代码形式,然后强迫我们的模型制作开发人员编写的更自然的原创程序。学习在没有明确的手动监督的情况下,通过大型的开源代码来生成等效但更自然的代码,有助于模型学习摄入和生成代码。我们将模型在三个生成软件工程任务中微调:代码生成,代码翻译和代码改进,具有有限的人类策划标记数据并实现最先进的性能与CODET5。我们表明,我们的预训练模型在零射门和少数学习方面特别有竞争力,并且在学习代码属性(例如语法,数据流)方面更好。
translated by 谷歌翻译
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
translated by 谷歌翻译
As the complexity of modern software continues to escalate, software engineering has become an increasingly daunting and error-prone endeavor. In recent years, the field of Neural Code Intelligence (NCI) has emerged as a promising solution, leveraging the power of deep learning techniques to tackle analytical tasks on source code with the goal of improving programming efficiency and minimizing human errors within the software industry. Pretrained language models have become a dominant force in NCI research, consistently delivering state-of-the-art results across a wide range of tasks, including code summarization, generation, and translation. In this paper, we present a comprehensive survey of the NCI domain, including a thorough review of pretraining techniques, tasks, datasets, and model architectures. We hope this paper will serve as a bridge between the natural language and programming language communities, offering insights for future research in this rapidly evolving field.
translated by 谷歌翻译
我们提出了Pangu-Coder,这是一种仅预读的解码器语言模型,该模型采用pangu-alpha架构进行文本到代码生成,即给定自然语言问题描述的编程语言解决方案的合成。我们使用两阶段策略训练Pangu-Coder:第一阶段采用因果语言建模(CLM)来预先培训原始编程语言数据,而第二阶段则使用因果语言建模和掩盖语言建模(MLM)的组合培训目标,专注于文本到代码生成的下游任务,并培训松散的自然语言程序定义和代码功能。最后,我们讨论了pangu-coder-ft,该pander the是通过竞争性编程问题和代码与持续集成测试的结合进行了微调的。我们评估了pangu-coder,重点是它是否生成功能上正确的程序,并证明它在参加较小的上下文窗口和较少的数据培训的同时,它比诸如Codex之类的类似大小的模型(例如Codex)实现等效性或更好的性能。
translated by 谷歌翻译
评论或源代码的自然语言描述是软件开发人员中的标准实践。通过传达代码的重要方面,例如功能和用法,评论有助于软件项目维护。但是,当对代码进行修改而无需对该注释的纠正进行更正时,就会出现评论和代码之间的不一致性,这为开发人员混淆和错误提供了可能性。在本文中,我们提出了两个基于Bert(Devlin等,2019)和Longformer(Beltagy等,2020)的模型,以在自然语言推理(NLI)上下文中检测这种不一致之处。通过在代码更改期间和之后对先前建立的评论方法对的评估,我们证明了我们的模型的表现优于多个基准,并产生与排除语言和词汇特征的最先进模型的可比结果。我们进一步讨论了未来研究的想法,以使用预读的语言模型进行不一致的检测和自动评论更新。
translated by 谷歌翻译
GitHub提交的记录,该代码随着自然语言消息的描述而变化,对于软件开发人员来说,在理解软件演变方面起着至关重要的作用。为了促进开源软件社区的开发,我们收集了一个提交基准,包括790万次跨7种编程语言的投入。基于此基准测试,我们提出了Citsbart,这是GitHub提交的大型预训练的编码器变压器模型。该模型由三个类别(即,为了学习提交碎片表示的六个预训练任务)预先培训(即,剥夺目标,跨模式生成和对比度学习)。此外,我们将一个“委托智能”框架与一项理解任务和提交的三个世代任务统一。这些任务的综合实验表明,提案巴特大大优于以前的代码预先培训作品。进一步的分析还揭示了每个预训练任务可增强模型性能。我们鼓励后续研究人员在将来为我们的框架贡献更多与承诺相关的下游任务。
translated by 谷歌翻译
文本内容通常是协作写作过程的输出:我们从初始草稿开始,提出建议并反复进行更改。不可知的是,当今的语言模型只能产生最终结果。结果,他们缺乏对协作写作至关重要的几种能力:他们无法更新现有文本,难以控制和无法进行口头计划或解释其行为。为了解决这些缺点,我们介绍了Peer,这是一种协作语言模型,经过训练以模仿整个写作过程本身:Peer可以编写草稿,添加建议,提出编辑并为其行为提供解释。至关重要的是,我们训练多个同伴能够填补写作过程的各个部分的实例,从而可以使用自训练技术来提高培训数据的质量,数量和多样性。这通过使其适用于没有编辑历史的域,并提高其遵循说明,编写有用的评论并解释其动作的能力,从而释放了Peer的全部潜力。我们表明,同行在各个领域和编辑任务上取得了强大的性能。
translated by 谷歌翻译
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by ( 1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new stateof-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
translated by 谷歌翻译
预训练模型已在许多代码智能任务中有效。这些模型在大规模未标记的语料库中进行了预训练,然后在下游任务中进行了微调。但是,由于预训练和下游任务的输入是不同的形式,因此很难充分探索预训练模型的知识。此外,微调的性能强烈依赖于下游数据的量,而实际上,具有稀缺数据的场景很常见。自然语言处理(NLP)领域的最新研究表明,迅速调整,一种调整的新范式,减轻上述问题并在各种NLP任务中实现了有希望的结果。在迅速调整中,在调整过程中插入的提示提供了特定于任务的知识,这对于具有相对较少数据的任务特别有益。在本文中,我们凭经验评估了代码智能任务中迅速调整的用法和效果。我们对流行的预训练模型Codebert和codet5进行及时调整,并尝试三个代码智能任务,包括缺陷预测,代码摘要和代码翻译。我们的实验结果表明,在所有三个任务中,迅速调整始终优于微调。此外,及时调整在低资源场景中显示出很大的潜力,例如,对于代码摘要,平均将微调的BLEU分数提高了26%以上。我们的结果表明,我们可以调整代码智能任务的迅速调整,以实现更好的性能,尤其是在缺乏特定于任务的数据时,我们可以调整及时调整。
translated by 谷歌翻译
源代码存储库由大型代码库组成,通常包含容易发生的程序。软件的复杂性日益增加导致时间和识别这些缺陷的时间和成本急剧上升。存在各种方法可以自动生成错误代码的修复程序。但是,由于特定错误的可能解决方案的组合空间很大,因此没有很多工具和数据集可以有效地评估生成的代码。在这项工作中,我们介绍了FixeVal,这是一个基准,其中包括竞争性编程问题及其各自修复程序的基准。我们引入了丰富的测试套件,以评估和评估模型生成程序修复的正确性。我们将两种在编程语言上鉴定的变压器语言模型视为我们的基准,并使用基于匹配和基于执行的评估指标对其进行比较。我们的实验表明,基于匹配的指标不能准确反映模型生成的程序修复,而基于执行的方法通过专门为该解决方案设计的所有情况和场景评估程序。因此,我们认为FixeVal提供了朝着实际自动错误修复和模型生成的代码评估的步骤。
translated by 谷歌翻译
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
translated by 谷歌翻译
Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We will make our code and pre-trained models publicly available.
translated by 谷歌翻译
训练有素的机器学习模型,利用大量的开源软件数据,现在已经成为自动化许多软件工程任务的有趣方法。几个硒任务都受到这种方法,在过去的几年里,性能逐渐改善,具有更好的模型和培训方法。更多,更多样化,清洁,标记数据更好的培训;但构建高质量的数据集是耗时和挑战。增强清洁量和多样性的方法,标记数据通常具有广泛的适用性。对于某些语言(例如,Ruby)标记的数据不那么丰富;在其他(例如,JavaScript)中,可用数据可能更多地关注某些应用域,从而更加多样化。作为围绕此类数据瓶颈,我们提出了证据表明,不同语言(执行相同功能)的人写代码相当相似,特别是保留标识符命名模式;我们进一步提出了证据表明标识符是软件工程任务培训数据的一个非常重要的要素。我们利用这种相当偶然的现象来查找可用的多语言训练数据(跨不同语言)的证据可用于放大性能。我们研究这一点3个不同的任务:代码摘要,代码检索和功能命名。我们注意到,这种数据增强方法与不同的任务,语言和机器学习模型广泛兼容。
translated by 谷歌翻译
在本文中,我们解决了深入学习的软件漏洞自动修复问题。数据驱动漏洞修复的主要问题是已知确认漏洞的少数现有数据集仅由几千例组成。然而,培训深度学习模型通常需要数十万例的例子。在这项工作中,我们利用了错误修复任务和漏洞修复任务的直觉相关,并且可以传输来自错误修复的知识可以传输到修复漏洞。在机器学习界中,这种技术称为转移学习。在本文中,我们提出了一种修复名为Vreepair的安全漏洞的方法,该方法是基于转移学习。 vreepair首先在大型错误修复语料库上培训,然后在漏洞修复数据集上调整,这是一个较小的数量级。在我们的实验中,我们表明,仅在错误修复语料库上培训的模型可能已经修复了一些漏洞。然后,我们证明转移学习改善了修复易受攻击的C功能的能力。我们还表明,转移学习模型比具有去噪任务训练的模型更好,并在漏洞固定任务上进行微调。总而言之,本文表明,与在小型数据集上的学习相比,转移学习适用于修复C中的安全漏洞。
translated by 谷歌翻译
上下文:堆栈溢出对于寻求编程问题答案的软件开发人员非常有帮助。先前的研究表明,越来越多的问题质量低,因此从潜在的答案者那里获得了更少的关注。 Gao等。提出了一个基于LSTM的模型(即BilstM-CC),以自动从代码片段中生成问题标题,以提高问题质量。但是,只有在问题主体中使用代码段无法为标题生成提供足够的信息,而LSTMS无法捕获令牌之间的远程依赖性。目的:本文提出了基于深度学习的新型模型CCBERT,旨在通过充分利用整个问题主体的双模式信息来增强问题标题生成的性能。方法:CCBERT遵循编码器范式范式,并使用Codebert将问题主体编码为隐藏的表示形式,堆叠的变压器解码器以生成预测的代币,以及附加的复制注意层来完善输出分布。编码器和解码器都执行多头自我注意操作,以更好地捕获远程依赖性。本文构建了一个数据集,该数据集包含大约200,000个高质量问题,该数据从Stack Overflow正式发布的数据中滤除,以验证CCBERT模型的有效性。结果:CCBERT优于数据集上的所有基线模型。对仅代码和低资源数据集进行的实验表明,CCBERT的优势性能较小。人类评估还显示了CCBERT关于可读性和相关标准的出色表现。
translated by 谷歌翻译
最近,人们对使用深度学习自动化软件工程任务的兴趣激增。这项工作解决了代码生成问题的问题,该问题是在其中以不同的语言或自然语言描述生成目标代码的目标代码。代码生成的大多数最先进的深度学习模型都使用主要是为自然语言设计的培训策略。但是,理解和生成代码需要对代码语法和语义的更严格理解。通过这种动机,我们开发了一个编码器变压器模型,其中训练编码器和解码器分别识别源和目标代码中的语法和数据流。我们不仅通过利用源代码的语法树和数据流程图来使编码器结构感知,而且我们还确保我们的解码器通过引入两个辅助任务来保留目标代码的语法和数据流:路径预测和数据流预测。据我们所知,这是第一项引入结构感知的变压器解码器,以通过对目标语法和数据流进行建模来增强生成代码的质量。所提出的结构编码器模型在CodexGlue基准测试中实现了代码翻译和文本对代码生成任务的最新性能。
translated by 谷歌翻译
我们提出了一项实证研究,以适应现有的经过验证的文本对文本模型,以备长期输入。通过沿预训练管道的三个轴的全面研究 - 模型架构,优化目标和训练式语料库,我们提出了一种有效的食谱,以从现有的短篇小说模型中构建长篇小说模型。具体而言,我们用汇总仪的块关注替换了变压器中的全部注意力,并使用蒙版的跨度预测任务为模型预算,长度不同。就训练训练的语料库而言,我们发现,与使用通常在其域覆盖范围中通常受到限制的现有长文档语料库相比,使用大型开放域语料库的随机串联的短篇小说可以提高性能。通过这些发现,我们建立了一个长篇文本模型,该模型可以在长篇文本质量检查任务上实现竞争性能,并在五个长文本摘要数据集上建立新的最新技术,通常优于先前的方法,具有较大的模型大小。
translated by 谷歌翻译
Controllable Text Generation (CTG) is emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints in practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods need to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches have emerged in the recent 3-4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review on the common tasks, main approaches and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.
translated by 谷歌翻译
Code completion aims to help improve developers' productivity by suggesting the next code tokens from a given context. Various approaches have been proposed to incorporate abstract syntax tree (AST) information for model training, ensuring that code completion is aware of the syntax of the programming languages. However, existing syntax-aware code completion approaches are not on-the-fly, as we found that for every two-thirds of characters that developers type, AST fails to be extracted because it requires the syntactically correct source code, limiting its practicality in real-world scenarios. On the other hand, existing on-the-fly code completion does not consider syntactic information yet. In this paper, we propose PyCoder to leverage token types, a kind of lightweight syntactic information, which is readily available and aligns with the natural order of source code. Our PyCoder is trained in a multi-task training manner so that by learning the supporting task of predicting token types during the training phase, the models achieve better performance on predicting tokens and lines of code without the need for token types in the inference phase. Comprehensive experiments show that PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for the token-level predictions, which is 0.43%-24.25% more accurate than baselines. In addition, PyCoder achieves an exact match of 43.37% for the line-level predictions, which is 3.63%-84.73% more accurate than baselines. These results lead us to conclude that token type information (an alternative to syntactic information) that is rarely used in the past can greatly improve the performance of code completion approaches, without requiring the syntactically correct source code like AST-based approaches do. Our PyCoder is publicly available on HuggingFace.
translated by 谷歌翻译