我们提出了Pangu-Coder,这是一种仅预读的解码器语言模型,该模型采用pangu-alpha架构进行文本到代码生成,即给定自然语言问题描述的编程语言解决方案的合成。我们使用两阶段策略训练Pangu-Coder:第一阶段采用因果语言建模(CLM)来预先培训原始编程语言数据,而第二阶段则使用因果语言建模和掩盖语言建模(MLM)的组合培训目标,专注于文本到代码生成的下游任务,并培训松散的自然语言程序定义和代码功能。最后,我们讨论了pangu-coder-ft,该pander the是通过竞争性编程问题和代码与持续集成测试的结合进行了微调的。我们评估了pangu-coder,重点是它是否生成功能上正确的程序,并证明它在参加较小的上下文窗口和较少的数据培训的同时,它比诸如Codex之类的类似大小的模型(例如Codex)实现等效性或更好的性能。
translated by 谷歌翻译
程序合成或代码生成旨在生成满足问题规范的程序。使用大规模预处理的语言模型(LMS)的最新方法显示出令人鼓舞的结果,但它们有一些关键的局限性。特别是,他们经常遵循标准监督的微调程序,仅从对自然语言问题描述和基础真相计划对培训代码生成模型。这种范式在很大程度上忽略了问题规范中的一些重要但潜在的信号,例如单位测试,因此在求解复杂的看不见的编码任务时通常会导致性能差。为了解决这些局限性,我们提出了“ Coderl”,这是通过验证的LMS和深入强化学习(RL)实现程序合成任务的新框架。具体而言,在培训期间,我们将代码生成的LM视为参与者网络,并引入批评网络,该网络经过培训,以预测生成的程序的功能正确性,并为演员提供密集的反馈信号。在推理期间,我们引入了一种新一代程序,具有关键的抽样策略,该过程允许模型根据示例单位测试和评论家分数的反馈自动重新生成程序。对于模型骨架,我们扩展了Codet5的编码器架构,具有增强的学习目标,更大的模型大小和更好的预处理数据。我们的方法不仅在具有挑战性的应用程序基准上实现了新的SOTA结果,而且还显示出强大的零弹性传输能力,并在简单的MBPP基准上具有新的SOTA结果。
translated by 谷歌翻译
As the complexity of modern software continues to escalate, software engineering has become an increasingly daunting and error-prone endeavor. In recent years, the field of Neural Code Intelligence (NCI) has emerged as a promising solution, leveraging the power of deep learning techniques to tackle analytical tasks on source code with the goal of improving programming efficiency and minimizing human errors within the software industry. Pretrained language models have become a dominant force in NCI research, consistently delivering state-of-the-art results across a wide range of tasks, including code summarization, generation, and translation. In this paper, we present a comprehensive survey of the NCI domain, including a thorough review of pretraining techniques, tasks, datasets, and model architectures. We hope this paper will serve as a bridge between the natural language and programming language communities, offering insights for future research in this rapidly evolving field.
translated by 谷歌翻译
大型语言模型已经证明了能够在自然语言和编程语言文本上进行条件和生成的能力。这样的模型打开了多语言代码生成的可能性:代码生成模型是否可以将知识从一种语言推广到另一种语言?尽管当代代码生成模型可以生成语义上正确的Python代码,但对它们使用其他语言的能力知之甚少。我们通过提出Multipl-E来促进该主题的探索,这是自然语言到代码生成的第一个多语言平行基准。 Multipl-E扩展了HumaneVal基准(Chen等,2021),以支持另外18种编程语言,涵盖了一系列编程范式和受欢迎程度。我们在Multipl-E:Codex和Incoder上评估了两个最先进的代码生成模型。我们发现,在几种语言上,法典匹配,甚至超过了其在Python上的性能。在多型E中表示的编程语言范围使我们能够探索语言频率和语言功能对模型性能的影响。最后,将代码生成基准分配给新编程语言的多重方法既可扩展又可扩展。我们描述了一种通用方法,可以轻松地增加对新基准和语言的支持。
translated by 谷歌翻译
源代码的预训练的生成语言模型(例如PLBART,CODET5,SPT-CODE)在过去几年中对多个任务(包括代码生成和翻译)产生了强劲的结果。这些模型采用了不同的训练前目标,以自我监督的方式从非常大规模的语料库中学习代码构建的统计数据。预训练模型的成功很大程度上取决于这些预训练的目标。本文提出了一个新的预训练目标,即“归化”源代码,利用代码的双峰,双通道(正式和自然渠道)性质。与自然语言不同,代码的双峰,双通道的性质使我们能够大规模生成语义上等效的代码。我们介绍了六类的语义保存转换,以引入非自然的代码形式,然后强迫我们的模型制作开发人员编写的更自然的原创程序。学习在没有明确的手动监督的情况下,通过大型的开源代码来生成等效但更自然的代码,有助于模型学习摄入和生成代码。我们将模型在三个生成软件工程任务中微调:代码生成,代码翻译和代码改进,具有有限的人类策划标记数据并实现最先进的性能与CODET5。我们表明,我们的预训练模型在零射门和少数学习方面特别有竞争力,并且在学习代码属性(例如语法,数据流)方面更好。
translated by 谷歌翻译
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
translated by 谷歌翻译
这项工作表明了如何以编程难题的形式使用大规模语言模型(LMS)与经过验证的解决方案合成编程问题,然后可以将其用于微调相同的模型,从而提高其性能。这项工作以最近的两项发展为基础。首先,LMS在非平凡的推理和算法实施中取得了突破,生成可以解决某些中级竞争性编程问题的代码。但是,培训代码LMS涉及策划的一组自然语言问题描述以及源代码测试和解决方案,这些测试和解决方案的大小有限。其次,引入了一种新的编程挑战格式,称为编程难题,该格式不需要自然语言描述,并通过源代码测试直接指定。在这项工作中,我们展示了如何使用Python解释器验证的合成编程难题和解决方案,可用于改善从P3求解测试难题的性能,P3是一套Python公共基准的Python编程难题。此外,我们发布了由Codex模型生成的100万个难题和解决方案的数据集,我们证明可以通过微调改善较小的模型。
translated by 谷歌翻译
Code completion aims to help improve developers' productivity by suggesting the next code tokens from a given context. Various approaches have been proposed to incorporate abstract syntax tree (AST) information for model training, ensuring that code completion is aware of the syntax of the programming languages. However, existing syntax-aware code completion approaches are not on-the-fly, as we found that for every two-thirds of characters that developers type, AST fails to be extracted because it requires the syntactically correct source code, limiting its practicality in real-world scenarios. On the other hand, existing on-the-fly code completion does not consider syntactic information yet. In this paper, we propose PyCoder to leverage token types, a kind of lightweight syntactic information, which is readily available and aligns with the natural order of source code. Our PyCoder is trained in a multi-task training manner so that by learning the supporting task of predicting token types during the training phase, the models achieve better performance on predicting tokens and lines of code without the need for token types in the inference phase. Comprehensive experiments show that PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for the token-level predictions, which is 0.43%-24.25% more accurate than baselines. In addition, PyCoder achieves an exact match of 43.37% for the line-level predictions, which is 3.63%-84.73% more accurate than baselines. These results lead us to conclude that token type information (an alternative to syntactic information) that is rarely used in the past can greatly improve the performance of code completion approaches, without requiring the syntactically correct source code like AST-based approaches do. Our PyCoder is publicly available on HuggingFace.
translated by 谷歌翻译
预训练模型已在许多代码智能任务中有效。这些模型在大规模未标记的语料库中进行了预训练,然后在下游任务中进行了微调。但是,由于预训练和下游任务的输入是不同的形式,因此很难充分探索预训练模型的知识。此外,微调的性能强烈依赖于下游数据的量,而实际上,具有稀缺数据的场景很常见。自然语言处理(NLP)领域的最新研究表明,迅速调整,一种调整的新范式,减轻上述问题并在各种NLP任务中实现了有希望的结果。在迅速调整中,在调整过程中插入的提示提供了特定于任务的知识,这对于具有相对较少数据的任务特别有益。在本文中,我们凭经验评估了代码智能任务中迅速调整的用法和效果。我们对流行的预训练模型Codebert和codet5进行及时调整,并尝试三个代码智能任务,包括缺陷预测,代码摘要和代码翻译。我们的实验结果表明,在所有三个任务中,迅速调整始终优于微调。此外,及时调整在低资源场景中显示出很大的潜力,例如,对于代码摘要,平均将微调的BLEU分数提高了26%以上。我们的结果表明,我们可以调整代码智能任务的迅速调整,以实现更好的性能,尤其是在缺乏特定于任务的数据时,我们可以调整及时调整。
translated by 谷歌翻译
We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop Code-BERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both "bimodal" data of NL-PL pairs and "unimodal" data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing. 1
translated by 谷歌翻译
大型语言模型在各种任务上显示出令人印象深刻的几次结果。但是,当知识是此类结果的关键时,就像问题回答和事实检查之类的任务一样,似乎需要存储知识的大量参数计数。众所周知,检索增强模型可以在不需要多个参数的情况下在知识密集的任务上表现出色,但是目前尚不清楚它们是否在几个弹药设置中工作。在这项工作中,我们介绍了地图集,这是一个经过精心设计和预先训练的增强语言模型,能够通过很少的培训示例学习知识密集型任务。我们对包括MMLU,苏格兰短裙和归类等各种任务进行评估,并研究文档索引内容的影响,表明它可以很容易地进行更新。值得注意的是,在自然问题上仅使用64个示例在自然问题上达到超过42 \%的准确性,尽管参数少了50倍,但比540B参数模型的表现优于540b参数模型。
translated by 谷歌翻译
Large language models (LLMs) have demonstrated an impressive ability to generate code for various programming tasks. In many instances, LLMs can generate a correct program for a task when given numerous trials. Consequently, a recent trend is to do large scale sampling of programs using a model and then filtering/ranking the programs based on the program execution on a small number of known unit tests to select one candidate solution. However, these approaches assume that the unit tests are given and assume the ability to safely execute the generated programs (which can do arbitrary dangerous operations such as file manipulations). Both of the above assumptions are impractical in real-world software development. In this paper, we propose CodeRanker, a neural ranker that can predict the correctness of a sampled program without executing it. Our CodeRanker is fault-aware i.e., it is trained to predict different kinds of execution information such as predicting the exact compile/runtime error type (e.g., an IndexError or a TypeError). We show that CodeRanker can significantly increase the pass@1 accuracy of various code generation models (including Codex, GPT-Neo, GPT-J) on APPS, HumanEval and MBPP datasets.
translated by 谷歌翻译
Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.
translated by 谷歌翻译
训练有素的机器学习模型,利用大量的开源软件数据,现在已经成为自动化许多软件工程任务的有趣方法。几个硒任务都受到这种方法,在过去的几年里,性能逐渐改善,具有更好的模型和培训方法。更多,更多样化,清洁,标记数据更好的培训;但构建高质量的数据集是耗时和挑战。增强清洁量和多样性的方法,标记数据通常具有广泛的适用性。对于某些语言(例如,Ruby)标记的数据不那么丰富;在其他(例如,JavaScript)中,可用数据可能更多地关注某些应用域,从而更加多样化。作为围绕此类数据瓶颈,我们提出了证据表明,不同语言(执行相同功能)的人写代码相当相似,特别是保留标识符命名模式;我们进一步提出了证据表明标识符是软件工程任务培训数据的一个非常重要的要素。我们利用这种相当偶然的现象来查找可用的多语言训练数据(跨不同语言)的证据可用于放大性能。我们研究这一点3个不同的任务:代码摘要,代码检索和功能命名。我们注意到,这种数据增强方法与不同的任务,语言和机器学习模型广泛兼容。
translated by 谷歌翻译
Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.
translated by 谷歌翻译
Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We will make our code and pre-trained models publicly available.
translated by 谷歌翻译
源代码存储库由大型代码库组成,通常包含容易发生的程序。软件的复杂性日益增加导致时间和识别这些缺陷的时间和成本急剧上升。存在各种方法可以自动生成错误代码的修复程序。但是,由于特定错误的可能解决方案的组合空间很大,因此没有很多工具和数据集可以有效地评估生成的代码。在这项工作中,我们介绍了FixeVal,这是一个基准,其中包括竞争性编程问题及其各自修复程序的基准。我们引入了丰富的测试套件,以评估和评估模型生成程序修复的正确性。我们将两种在编程语言上鉴定的变压器语言模型视为我们的基准,并使用基于匹配和基于执行的评估指标对其进行比较。我们的实验表明,基于匹配的指标不能准确反映模型生成的程序修复,而基于执行的方法通过专门为该解决方案设计的所有情况和场景评估程序。因此,我们认为FixeVal提供了朝着实际自动错误修复和模型生成的代码评估的步骤。
translated by 谷歌翻译
大规模的,预训练的语言模型几乎没有学习的方法是回答有关代码问题的有力方法,例如,如何完成给定的代码示例,甚至从头开始生成代码段。这些模型的成功提出了一个问题,它们是否可以作为构建广泛代码生成工具的基础。传统上,此类工具是为每个任务手动和单独构建的。取而代之的是,只需提供一些示例或对预期工具行为的自然语言描述,就可以从单个预训练的语言模型中获取不同的工具。本文研究了代码的最先进的,预先训练的代码模型,Codex可能会达到此目的。我们考虑通过一系列传统工具针对的三个代码操纵和代码生成任务:(i)代码突变; (ii)从自然语言文档中测试甲骨文的生成; (iii)测试案例生成。对于每个任务,我们将几杆学习与手动构建的工具进行比较。我们的结果表明,基于模型的工具补充(代码突变),在PAR上(测试Oracle生成),甚至超越了其各自的传统构建的工具(测试案例生成),同时施加了开发它们的努力。通过比较基于模型的工具的不同变体的有效性,我们提供了有关如何将适当输入(“提示”)设计到模型以及模型大小的影响的见解。例如,我们发现,提供对代码生成任务的小型自然语言描述是改善预测的一种简单方法。总体而言,我们得出的结论是,很少有语言模型令人惊讶地有效,但是还有更多的工作要做,例如探索更多样化的方式来促使和解决更多有关任务。
translated by 谷歌翻译
我们介绍了一种称为编程拼图的新型编程挑战,作为方案合成的客观和全面评估,并释放Python编程拼图的开源数据集(P3)。每个拼图由短Python程序$ F $定义,目标是找到一个使$ F $返回true的输入。谜题是目的,因为每个人都由其验证者$ F $的源代码完全指定,因此评估为测试候选解决方案所需的$ F $。它们不需要答案密钥或输入/输出示例,也不依赖于自然语言理解。该数据集是全面的,因为它跨越一系列困难和域的问题,从琐碎的字符串操纵问题,经典编程谜题(例如,河内塔),用于采访/竞争编程问题(例如,动态编程),在算法和数学中的长期开放问题(例如,因子)。我们开发基准枚举程序合成,GPT-3和能够解决难题的食盒求解器 - 即使没有访问任何参考解决方案 - 通过从他们自己的过去的解决方案中学习。 Codex表现最佳,解决高达18%的397个测试问题的测试问题,每次尝试和80%的问题占1,000个问题。在一个小的用户学习中,我们发现拼图解决性能和编码体验之间的正相关性,以及人类和AI求解器的难题难度之间。因此,P3的进一步改进可能对许多程序合成区域产生重大影响。
translated by 谷歌翻译
Natural language processing for programming, which aims to use NLP techniques to assist programming, has experienced an explosion in recent years. However, there is no literature that systematically reviews related work from the full spectrum. In this paper, we comprehensively investigate existing work, ranging from early deductive models to the latest competition-level models. Another advantage of this paper is the completeness of the technique category, which provides easy access to locating and comparing future works.
translated by 谷歌翻译