图形神经网络已被证明可以为各种软件工程任务产生令人印象深刻的结果。但是,现有技术仍然有两个问题:(1)长期依赖性和(2)不同的代码组件在不应该的情况下被视为平等。为了解决这些问题,我们提出了一种表示代码为层次结构(代码层次结构)的方法,其中不同的代码组件在各个粒度级别分别表示。然后,为了处理每个表示级别的表示,我们设计了一个新颖的网络体系结构Echelon,它结合了异质图形变压器网络和基于树的卷积神经网络的优势,以学习具有代码依赖性信息丰富的抽象语法树。我们还提出了一个新颖的预处理目标,称为缺失子树预测以补充我们的代码层次结构。评估结果表明,我们的方法在三个任务中大大优于其他基准:任何代码完成,代码分类和代码克隆检测。
translated by 谷歌翻译
Automated software debugging is a crucial task for improving the productivity of software developers. Many neural-based techniques have been proven effective for debugging-related tasks such as bug localization and program repair (or bug fixing). However, these techniques often focus only on either one of them or approach them in a stage-wise manner, ignoring the mutual benefits between them. In this work, we propose a novel unified \emph{Detect-Localize-Repair} framework based on a pretrained programming language model CodeT5 to seamlessly address these tasks, named CodeT5-DLR. Specifically, we propose three objectives to adapt the generic CodeT5 for debugging: a bug detection objective to determine whether a given code snippet is buggy or not, a bug localization objective to identify the buggy lines, and a program repair objective to translate the buggy code to its fixed version. We evaluate it on each of these tasks and their combined setting on two newly collected line-level debugging datasets in Java and Python. Extensive results show that our model significantly outperforms existing baselines from both NLP and software engineering domains.
translated by 谷歌翻译
As the complexity of modern software continues to escalate, software engineering has become an increasingly daunting and error-prone endeavor. In recent years, the field of Neural Code Intelligence (NCI) has emerged as a promising solution, leveraging the power of deep learning techniques to tackle analytical tasks on source code with the goal of improving programming efficiency and minimizing human errors within the software industry. Pretrained language models have become a dominant force in NCI research, consistently delivering state-of-the-art results across a wide range of tasks, including code summarization, generation, and translation. In this paper, we present a comprehensive survey of the NCI domain, including a thorough review of pretraining techniques, tasks, datasets, and model architectures. We hope this paper will serve as a bridge between the natural language and programming language communities, offering insights for future research in this rapidly evolving field.
translated by 谷歌翻译
Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.
translated by 谷歌翻译
当使用深度学习技术对程序语言进行建模时,广泛采用了带有树或图形结构的神经网络,以捕获程序抽象语法树(AST)中的丰富结构信息。但是,计划中广泛存在长期/全球依赖性,大多数这些神经体系结构无法捕获这些依赖性。在本文中,我们提出了Tree-Transformer,这是一种新型的递归树结构神经网络,旨在克服上述局限性。树转化器利用两个多头注意单元来建模兄弟姐妹和父子节点对之间的依赖关系。此外,我们提出了一个双向传播策略,以允许节点信息向两个方向传递:沿树木的自下而上和自上而下。通过结合自下而上和自上而下的传播,树转化器可以同时学习全局上下文和有意义的节点特征。广泛的实验结果表明,我们的树转换器在具有树级和节点级别的预测任务中,在与程序相关的任务中优于现有基于树或基于图的神经网络,这表明Tree-Transformer在学习两个树级时都表现良好和节点级表示。
translated by 谷歌翻译
深度学习在各种软件工程任务中广泛使用,例如,节目分类和缺陷预测。虽然该技术消除了特征工程所需的过程,但源代码模型的构建显着影响了这些任务的性能。最近的作品主要集中在通过引入从CFG提取的上下文依赖项来补充基于AST的源代码模型。但是,所有这些都关注基本块的表示,这是上下文依赖性的基础。在本文中,我们集成了AST和CFG,并提出了一种嵌入了分层依赖项的新型源代码模型。基于此,我们还设计了一种神经网络,这取决于图表关注机制。特殊地,我们介绍了基本块的句法结构,即其对应的AST,在源代码模型中提供足够的信息并填补间隙。我们在三种实际软件工程任务中评估了该模型,并将其与其他最先进的方法进行了比较。结果表明,我们的模型可以显着提高性能。例如,与最佳性能的基线相比,我们的模型将参数的比例降低了50 \%并实现了对程序分类任务的准确性的4 \%改进。
translated by 谷歌翻译
代码搜索目标是根据自然语言查询检索相关的代码片段,以提高软件生产力和质量。但是,由于源代码和查询之间的语义间隙,自动代码搜索是具有挑战性的。大多数现有方法主要考虑嵌入的顺序信息,其中文本背后的结构信息不完全考虑。在本文中,我们设计了一个名为GraphsearchNet的新型神经网络框架,通过共同学习源代码和查询的富集语义来启用有效和准确的源代码搜索。具体地,我们建议将源代码和查询编码为两个图,其中双向GGNN以捕获图表的本地结构信息。此外,我们通过利用有效的多主题来增强BigGNN,以补充BigGNN错过的全球依赖。关于Java和Python数据集的广泛实验说明了GraphSearchNet优于当前最先进的工作原位。
translated by 谷歌翻译
Natural language processing for programming, which aims to use NLP techniques to assist programming, has experienced an explosion in recent years. However, there is no literature that systematically reviews related work from the full spectrum. In this paper, we comprehensively investigate existing work, ranging from early deductive models to the latest competition-level models. Another advantage of this paper is the completeness of the technique category, which provides easy access to locating and comparing future works.
translated by 谷歌翻译
代码摘要可帮助开发人员理解程序并减少在软件维护过程中推断程序功能的时间。最近的努力诉诸深度学习技术,例如序列到序列模型,以生成准确的代码摘要,其中基于变压器的方法已实现了有希望的性能。但是,在此任务域中,有效地将代码结构信息集成到变压器中的情况不足。在本文中,我们提出了一种名为SG-Trans的新方法,将代码结构属性纳入变压器。具体而言,我们将局部符号信息(例如,代码令牌和语句)和全局句法结构(例如,数据流程图)注入变压器的自我发项模块中。为了进一步捕获代码的层次结构特征,局部信息和全局结构旨在分布在下层和变压器高层的注意力头中。广泛的评估表明,SG-trans的表现优于最先进的方法。与表现最佳的基线相比,SG-Trans在流星评分方面仍然可以提高1.4%和2.0%,这是一个广泛用于测量发电质量的度量,分别在两个基准数据集上。
translated by 谷歌翻译
Pre-trained language models for programming languages have shown a powerful ability on processing many Software Engineering (SE) tasks, e.g., program synthesis, code completion, and code search. However, it remains to be seen what is behind their success. Recent studies have examined how pre-trained models can effectively learn syntax information based on Abstract Syntax Trees. In this paper, we figure out what role the self-attention mechanism plays in understanding code syntax and semantics based on AST and static analysis. We focus on a well-known representative code model, CodeBERT, and study how it can learn code syntax and semantics by the self-attention mechanism and Masked Language Modelling (MLM) at the token level. We propose a group of probing tasks to analyze CodeBERT. Based on AST and static analysis, we establish the relationships among the code tokens. First, Our results show that CodeBERT can acquire syntax and semantics knowledge through self-attention and MLM. Second, we demonstrate that the self-attention mechanism pays more attention to dependence-relationship tokens than to other tokens. Different attention heads play different roles in learning code semantics; we show that some of them are weak at encoding code semantics. Different layers have different competencies to represent different code properties. Deep CodeBERT layers can encode the semantic information that requires some complex inference in the code context. More importantly, we show that our analysis is helpful and leverage our conclusions to improve CodeBERT. We show an alternative approach for pre-training models, which makes fully use of the current pre-training strategy, i.e, MLM, to learn code syntax and semantics, instead of combining features from different code data formats, e.g., data-flow, running-time states, and program outputs.
translated by 谷歌翻译
从结构化数据中学习是一项核心机器学习任务。通常,此类数据表示为图,通常仅考虑(键入)节点对之间的二进制关系。对于具有高度结构化数据的许多域而言,这是一个实质性的限制。一个重要的域是源代码,基于超图的表示可以更好地捕获代码的语义丰富和结构化的性质。在这项工作中,我们提出了热量,这是一种能够代表键入和合格的超图的神经模型,在该模型中,每个Hyperede都明确地符合参与节点的贡献。它可以看作是传递神经网络和变压器的消息的概括。我们使用新型程序代表程序来评估知识库完成和错误检测和维修的热量。在这两种情况下,它都优于强大的基线,表明其力量和通用性。
translated by 谷歌翻译
由不同类型的节点和边缘组成的学习异质图增强了均匀图技术的结果。这样的图形的一个有趣示例是代表可能的软件代码执行流的控制流图。由于此类图代表了代码的更多语义信息,因此为这些图形开发技术和工具可能对检测软件中的漏洞的可靠性非常有益。但是,现有的异质图技术仍然不足以处理复杂的图形,在处理复杂的图形中,不同类型的节点和边缘数量较大且可变。本文集中于以太坊智能合约作为由构建在控制流图和包含不同类型的节点和链接的呼叫图的异质合同图表示的软件代码样本。我们提出了曼多(Mando),这是一种新的异质图表示,以学习这种异质合同图的结构。 Mando提取自定义的Metapaths,该Metapaths在不同类型的节点及其邻居之间建立了关系连接。此外,它开发了一个多米达异构图注意网络,以学习不同类型的节点及其在异质合同图中的多层嵌入,可以更准确地捕获智能合约的代码语义,并便利两者。 - 水平和粗粒合同级别的漏洞检测。我们对大型智能合同数据集的广泛评估表明,曼多(Mando)在粗粒合同水平上改善了其他技术的脆弱性检测结果。更重要的是,它是第一种基于学习的方法,能够在细粒度的线条层面上识别漏洞,并在F1分数方面将基于代码分析的传统漏洞检测方法显着提高了11.35%至70.81%。
translated by 谷歌翻译
源代码的表示学习对于将机器学习应用于软件工程任务至关重要。已经显示,跨不同编程语言的学习代码表示比从单语言数据集中学习更有效,因为来自多语言数据集的更多培训数据可提高该模型从源代码中提取语言 - 不平衡信息的能力。但是,现有的多语言模型忽略了特定于语言的信息,这对于在多语言数据集中培训的下游任务至关重要,同时仅着眼于学习不同语言之间的共享参数。为了解决这个问题,我们提出了MetatPtrans,这是一种用于多语言代码表示学习的元学习方法。 MetAtPtrans根据输入源代码段的特定编程语言为特征提取器生成不同的参数,从而使模型能够同时学习语言 - 语言和特定于语言的信息。实验结果表明,MetAtPtrans可将最新方法的F1得分显着提高到2.40个百分点,以汇总代码摘要,这是一项语言不可或缺的任务;以及TOP-1(TOP-5)的预测准确性高达7.32(13.15)百分点,以完成代码完成,这是一项特定于语言的任务。
translated by 谷歌翻译
The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) into a form that is suitable for learning. Many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations. Yet, in the context of Android problems, existing models are either limited to coarse-grained whole-app level (e.g., apk2vec) or conducted for one specific downstream task (e.g., smali2vec). Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both of these two limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class-level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences, in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications. We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in two distinct class-level software engineering tasks: Malicious Code Localization and Defect Prediction. We also experiment with strategies to deal with the problem of catering to apps having vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task.
translated by 谷歌翻译
动态类型的语言如JavaScript和Python已成为最受欢迎的使用中的使用中。重要的优势可以从动态类型的程序中的类型注释累积。逐渐键入的这种方法是由Querecript编程系统示例,允许程序员指定部分键入的程序,然后使用静态分析来推断剩余类型。然而,通常,静态类型推断的有效性受到限制,取决于程序结构和初始注释的复杂性。结果,对于可以在动态类型的程序中可以在静态预测类型中推进本领域的新​​方法的强大动机,并且该具有可接受的性能用于交互式编程环境。以前的工作表明了使用深度学习的概率类型推断的承诺。在本文中,我们通过引入一系列图形的神经网络(GNN)模型来推进过去的工作,该模型在新型流程图(TFG)表示上运行。 TFG表示输入程序的元素,作为与语法边缘和数据流边缘连接的图表节点,并且我们的GNN模型训练以预测给定输入程序的TFG中的类型标签。我们为我们的评估数据集中的100种最常见类型的GNN模型研究了不同的设计选择,并显示了我们最佳的准确性的两个GNN配置,分别实现了87.76%和86.89%的前1个精度,优于两个最密切相关的深度学习型推断从过去的工作 - 矮人的前进剂,顶级1的精度为84.62%,兰丹特精确为79.45%。此外,这两种配置的平均推理吞吐量为353.8和1,303.9文件/秒,而DeepTyper的186.7个文件/秒和LambDanet的1,050.3文件/秒。
translated by 谷歌翻译
While pre-trained language models (LM) for code have achieved great success in code completion, they generate code conditioned only on the contents within the file, i.e., in-file context, but ignore the rich semantics in other files within the same project, i.e., cross-file context, a critical source of information that is especially useful in modern modular software development. Such overlooking constrains code language models' capacity in code completion, leading to unexpected behaviors such as generating hallucinated class member functions or function calls with unexpected arguments. In this work, we develop a cross-file context finder tool, CCFINDER, that effectively locates and retrieves the most relevant cross-file context. We propose CoCoMIC, a framework that incorporates cross-file context to learn the in-file and cross-file context jointly on top of pretrained code LMs. CoCoMIC successfully improves the existing code LM with a 19.30% relative increase in exact match and a 15.41% relative increase in identifier matching for code completion when the cross-file context is provided.
translated by 谷歌翻译
在本文中,我们试图通过引入深度学习模型的句法归纳偏见来建立两所学校之间的联系。我们提出了两个归纳偏见的家族,一个家庭用于选区结构,另一个用于依赖性结构。选区归纳偏见鼓励深度学习模型使用不同的单位(或神经元)分别处理长期和短期信息。这种分离为深度学习模型提供了一种方法,可以从顺序输入中构建潜在的层次表示形式,即更高级别的表示由高级表示形式组成,并且可以分解为一系列低级表示。例如,在不了解地面实际结构的情况下,我们提出的模型学会通过根据其句法结构组成变量和运算符的表示来处理逻辑表达。另一方面,依赖归纳偏置鼓励模型在输入序列中找到实体之间的潜在关系。对于自然语言,潜在关系通常被建模为一个定向依赖图,其中一个单词恰好具有一个父节点和零或几个孩子的节点。将此约束应用于类似变压器的模型之后,我们发现该模型能够诱导接近人类专家注释的有向图,并且在不同任务上也优于标准变压器模型。我们认为,这些实验结果为深度学习模型的未来发展展示了一个有趣的选择。
translated by 谷歌翻译
代码克隆是实现类似功能的代码段对。克隆检测是自动源代码理解的基本分支,在重构建议,窃检测和代码摘要中具有许多应用程序。克隆检测的一个特别有趣的案例是检测语义克隆,即具有相同功能但实现方面有显着差异的代码段。检测语义克隆的一种有希望的方法是对比度学习(CL),这是一种在计算机视觉中流行的机器学习范式,但尚未用于代码处理。我们的工作旨在评估最受欢迎的CL算法以及两个任务上的三个源代码表示形式。第一个任务是代码克隆检测,我们在包含104个算法的实现的POJ-104数据集上进行了评估。第二个任务是窃检测。为了评估此任务上的模型,我们介绍了CodeTransFormator,这是用于转换源代码的工具。我们使用它来创建一个基于竞争性编程解决方案模仿窃代码的数据集。我们为这两项任务培训了九个模型,并将其与现有的六种方法进行了比较,包括传统工具和现代培训的神经模型。我们评估的结果表明,提议的模型在每个任务中都具有多样性,但是基于图的模型的性能通常高于其他模型。在CL算法中,SIMCLR和SWAV带来更好的结果,而MoCo是最强大的方法。我们的代码和训练有素的模型可在https://doi.org/10.5281/zenodo.6360627,https://doi.org/10.5281/zenodo.5596345获得。
translated by 谷歌翻译
我们提出了Pangu-Coder,这是一种仅预读的解码器语言模型,该模型采用pangu-alpha架构进行文本到代码生成,即给定自然语言问题描述的编程语言解决方案的合成。我们使用两阶段策略训练Pangu-Coder:第一阶段采用因果语言建模(CLM)来预先培训原始编程语言数据,而第二阶段则使用因果语言建模和掩盖语言建模(MLM)的组合培训目标,专注于文本到代码生成的下游任务,并培训松散的自然语言程序定义和代码功能。最后,我们讨论了pangu-coder-ft,该pander the是通过竞争性编程问题和代码与持续集成测试的结合进行了微调的。我们评估了pangu-coder,重点是它是否生成功能上正确的程序,并证明它在参加较小的上下文窗口和较少的数据培训的同时,它比诸如Codex之类的类似大小的模型(例如Codex)实现等效性或更好的性能。
translated by 谷歌翻译
在源代码中自动定位易受攻击的陈述至关重要,以确保软件安全性和缓解开发人员的调试工作。这在当今软件生态系统中变得更加重要,其中易受攻击的代码可以在像GitHub这样的软件存储库中轻松且无意中流动。在这类数百万的代码行中,传统的静态和动态方法争取缩放。虽然基于机器学习的方法在这样的设置中看起来很有希望,但大多数工作都在较高的粒度下检测到脆弱的代码 - 在方法或文件级别。因此,开发人员仍然需要检查大量代码以找到需要修复的弱势陈述。本文提出了一种新的集合学习方法来定位脆弱的陈述。我们的模型结合了基于图形的基于序列的神经网络,以成功捕获程序图的本地和全局上下文,并有效地了解代码语义和易受攻击的模式。为了研究天鹅绒的效果,我们使用了一个现成的合成数据集和最近发布的现实世界数据集。在静态分析设置中,未提前检测到易受攻击功能,Velvet可以实现4.5倍的性能,而不是真实世界数据上的基线静态分析仪。对于孤立的漏洞本地化任务,在我们假设特定漏洞声明未知的同时知道函数的漏洞,我们将天鹅绒与几个神经网络进行比较,这些内部网络也参加了本地和全局代码背景。天鹅绒分别达到99.6%和43.6%的13.6%,分别在合成数据和现实世界数据上实现了高精度,优于基线深度学习模型5.3-29.0%。
translated by 谷歌翻译