Code pre-trained models (CodePTMs) have recently demonstrated significant success in code intelligence. To interpret these models, some probing methods have been applied. However, these methods fail to consider the inherent characteristics of codes. In this paper, to address the problem, we propose a novel probing method CAT-probing to quantitatively interpret how CodePTMs attend code structure. We first denoise the input code sequences based on the token types pre-defined by the compilers to filter those tokens whose attention scores are too small. After that, we define a new metric CAT-score to measure the commonality between the token-level attention scores generated in CodePTMs and the pair-wise distances between corresponding AST nodes. The higher the CAT-score, the stronger the ability of CodePTMs to capture code structure. We conduct extensive experiments to integrate CAT-probing with representative CodePTMs for different programming languages. Experimental results show the effectiveness of CAT-probing in CodePTM interpretation. Our codes and data are publicly available at https://github.com/nchen909/CodeAttention.
translated by 谷歌翻译
在源代码处理的领域中,基于变压器的表示模型表现出强大的功能,并在许多任务中都实现了最先进的(SOTA)性能。尽管变压器模型处理了顺序源代码,但证据表明,它们也可以捕获结构信息(\ eg,在语法树,数据流,控制流,\等)。我们提出了汇总的注意力评分,这是一种研究变压器学到的结构信息的方法。我们还提出了汇总的注意图,这是一种从预训练模型中提取程序图的新方法。我们从多个角度测量我们的方法。此外,根据我们的经验发现,我们使用自动提取的图形来替换可变滥用任务中那些巧妙的手动设计图。实验结果表明,我们自动提取的语义图非常有意义且有效,这为我们提供了一个新的观点,可以理解和使用模型中包含的信息。
translated by 谷歌翻译
We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop Code-BERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both "bimodal" data of NL-PL pairs and "unimodal" data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing. 1
translated by 谷歌翻译
预训练的语言模型的目的是学习文本数据的上下文表示。预训练的语言模型已成为自然语言处理和代码建模的主流。使用探针,一种研究隐藏矢量空间的语言特性的技术,以前的作品表明,这些预训练的语言模型在其隐藏表示中编码简单的语言特性。但是,以前的工作都没有评估这些模型是否编码编程语言的整个语法结构。在本文中,我们证明了\ textit {句法子空间}的存在,该{语法子空间}位于预训练的语言模型的隐藏表示中,其中包含编程语言的句法信息。我们表明,可以从模型的表示形式中提取此子空间,并定义一种新颖的探测方法AST-Probe,该方法可以恢复输入代码段的整个抽象语法树(AST)。在我们的实验中,我们表明这种句法子空间存在于五个最先进的预训练的语言模型中。此外,我们强调说,模型的中间层是编码大多数AST信息的模型。最后,我们估计该句法子空间的最佳大小,并表明其尺寸大大低于模型的表示空间。这表明,预训练的语言模型使用其表示空间的一小部分来编码编程语言的句法信息。
translated by 谷歌翻译
Pre-trained language models for programming languages have shown a powerful ability on processing many Software Engineering (SE) tasks, e.g., program synthesis, code completion, and code search. However, it remains to be seen what is behind their success. Recent studies have examined how pre-trained models can effectively learn syntax information based on Abstract Syntax Trees. In this paper, we figure out what role the self-attention mechanism plays in understanding code syntax and semantics based on AST and static analysis. We focus on a well-known representative code model, CodeBERT, and study how it can learn code syntax and semantics by the self-attention mechanism and Masked Language Modelling (MLM) at the token level. We propose a group of probing tasks to analyze CodeBERT. Based on AST and static analysis, we establish the relationships among the code tokens. First, Our results show that CodeBERT can acquire syntax and semantics knowledge through self-attention and MLM. Second, we demonstrate that the self-attention mechanism pays more attention to dependence-relationship tokens than to other tokens. Different attention heads play different roles in learning code semantics; we show that some of them are weak at encoding code semantics. Different layers have different competencies to represent different code properties. Deep CodeBERT layers can encode the semantic information that requires some complex inference in the code context. More importantly, we show that our analysis is helpful and leverage our conclusions to improve CodeBERT. We show an alternative approach for pre-training models, which makes fully use of the current pre-training strategy, i.e, MLM, to learn code syntax and semantics, instead of combining features from different code data formats, e.g., data-flow, running-time states, and program outputs.
translated by 谷歌翻译
源代码的表示学习对于将机器学习应用于软件工程任务至关重要。已经显示,跨不同编程语言的学习代码表示比从单语言数据集中学习更有效,因为来自多语言数据集的更多培训数据可提高该模型从源代码中提取语言 - 不平衡信息的能力。但是,现有的多语言模型忽略了特定于语言的信息,这对于在多语言数据集中培训的下游任务至关重要,同时仅着眼于学习不同语言之间的共享参数。为了解决这个问题,我们提出了MetatPtrans,这是一种用于多语言代码表示学习的元学习方法。 MetAtPtrans根据输入源代码段的特定编程语言为特征提取器生成不同的参数,从而使模型能够同时学习语言 - 语言和特定于语言的信息。实验结果表明,MetAtPtrans可将最新方法的F1得分显着提高到2.40个百分点,以汇总代码摘要,这是一项语言不可或缺的任务;以及TOP-1(TOP-5)的预测准确性高达7.32(13.15)百分点,以完成代码完成,这是一项特定于语言的任务。
translated by 谷歌翻译
预训练模型已在许多代码智能任务中有效。这些模型在大规模未标记的语料库中进行了预训练,然后在下游任务中进行了微调。但是,由于预训练和下游任务的输入是不同的形式,因此很难充分探索预训练模型的知识。此外,微调的性能强烈依赖于下游数据的量,而实际上,具有稀缺数据的场景很常见。自然语言处理(NLP)领域的最新研究表明,迅速调整,一种调整的新范式,减轻上述问题并在各种NLP任务中实现了有希望的结果。在迅速调整中,在调整过程中插入的提示提供了特定于任务的知识,这对于具有相对较少数据的任务特别有益。在本文中,我们凭经验评估了代码智能任务中迅速调整的用法和效果。我们对流行的预训练模型Codebert和codet5进行及时调整,并尝试三个代码智能任务,包括缺陷预测,代码摘要和代码翻译。我们的实验结果表明,在所有三个任务中,迅速调整始终优于微调。此外,及时调整在低资源场景中显示出很大的潜力,例如,对于代码摘要,平均将微调的BLEU分数提高了26%以上。我们的结果表明,我们可以调整代码智能任务的迅速调整,以实现更好的性能,尤其是在缺乏特定于任务的数据时,我们可以调整及时调整。
translated by 谷歌翻译
代码摘要可帮助开发人员理解程序并减少在软件维护过程中推断程序功能的时间。最近的努力诉诸深度学习技术,例如序列到序列模型,以生成准确的代码摘要,其中基于变压器的方法已实现了有希望的性能。但是,在此任务域中,有效地将代码结构信息集成到变压器中的情况不足。在本文中,我们提出了一种名为SG-Trans的新方法,将代码结构属性纳入变压器。具体而言,我们将局部符号信息(例如,代码令牌和语句)和全局句法结构(例如,数据流程图)注入变压器的自我发项模块中。为了进一步捕获代码的层次结构特征,局部信息和全局结构旨在分布在下层和变压器高层的注意力头中。广泛的评估表明,SG-trans的表现优于最先进的方法。与表现最佳的基线相比,SG-Trans在流星评分方面仍然可以提高1.4%和2.0%,这是一个广泛用于测量发电质量的度量,分别在两个基准数据集上。
translated by 谷歌翻译
图形神经网络已被证明可以为各种软件工程任务产生令人印象深刻的结果。但是,现有技术仍然有两个问题:(1)长期依赖性和(2)不同的代码组件在不应该的情况下被视为平等。为了解决这些问题,我们提出了一种表示代码为层次结构(代码层次结构)的方法,其中不同的代码组件在各个粒度级别分别表示。然后,为了处理每个表示级别的表示,我们设计了一个新颖的网络体系结构Echelon,它结合了异质图形变压器网络和基于树的卷积神经网络的优势,以学习具有代码依赖性信息丰富的抽象语法树。我们还提出了一个新颖的预处理目标,称为缺失子树预测以补充我们的代码层次结构。评估结果表明,我们的方法在三个任务中大大优于其他基准:任何代码完成,代码分类和代码克隆检测。
translated by 谷歌翻译
Natural language processing for programming, which aims to use NLP techniques to assist programming, has experienced an explosion in recent years. However, there is no literature that systematically reviews related work from the full spectrum. In this paper, we comprehensively investigate existing work, ranging from early deductive models to the latest competition-level models. Another advantage of this paper is the completeness of the technique category, which provides easy access to locating and comparing future works.
translated by 谷歌翻译
Pre-trained programming language (PL) models (such as CodeT5, CodeBERT, GraphCodeBERT, etc.,) have the potential to automate software engineering tasks involving code understanding and code generation. However, these models operate in the natural channel of code, i.e., they are primarily concerned with the human understanding of the code. They are not robust to changes in the input and thus, are potentially susceptible to adversarial attacks in the natural channel. We propose, CodeAttack, a simple yet effective black-box attack model that uses code structure to generate effective, efficient, and imperceptible adversarial code samples and demonstrates the vulnerabilities of the state-of-the-art PL models to code-specific adversarial attacks. We evaluate the transferability of CodeAttack on several code-code (translation and repair) and code-NL (summarization) tasks across different programming languages. CodeAttack outperforms state-of-the-art adversarial NLP attack models to achieve the best overall drop in performance while being more efficient, imperceptible, consistent, and fluent. The code can be found at https://github.com/reddy-lab-code-research/CodeAttack.
translated by 谷歌翻译
最近,人们对使用深度学习自动化软件工程任务的兴趣激增。这项工作解决了代码生成问题的问题,该问题是在其中以不同的语言或自然语言描述生成目标代码的目标代码。代码生成的大多数最先进的深度学习模型都使用主要是为自然语言设计的培训策略。但是,理解和生成代码需要对代码语法和语义的更严格理解。通过这种动机,我们开发了一个编码器变压器模型,其中训练编码器和解码器分别识别源和目标代码中的语法和数据流。我们不仅通过利用源代码的语法树和数据流程图来使编码器结构感知,而且我们还确保我们的解码器通过引入两个辅助任务来保留目标代码的语法和数据流:路径预测和数据流预测。据我们所知,这是第一项引入结构感知的变压器解码器,以通过对目标语法和数据流进行建模来增强生成代码的质量。所提出的结构编码器模型在CodexGlue基准测试中实现了代码翻译和文本对代码生成任务的最新性能。
translated by 谷歌翻译
Pre-trained language models have achieved promising success in code retrieval tasks, where a natural language documentation query is given to find the most relevant existing code snippet. However, existing models focus only on optimizing the documentation code pairs by embedding them into latent space, without the association of external knowledge. In this paper, we propose a generation-augmented query expansion framework. Inspired by the human retrieval process - sketching an answer before searching, in this work, we utilize the powerful code generation model to benefit the code retrieval task. Specifically, we demonstrate that rather than merely retrieving the target code snippet according to the documentation query, it would be helpful to augment the documentation query with its generation counterpart - generated code snippets from the code generation model. To the best of our knowledge, this is the first attempt that leverages the code generation model to enhance the code retrieval task. We achieve new state-of-the-art results on the CodeSearchNet benchmark and surpass the baselines significantly.
translated by 谷歌翻译
训练有素的机器学习模型,利用大量的开源软件数据,现在已经成为自动化许多软件工程任务的有趣方法。几个硒任务都受到这种方法,在过去的几年里,性能逐渐改善,具有更好的模型和培训方法。更多,更多样化,清洁,标记数据更好的培训;但构建高质量的数据集是耗时和挑战。增强清洁量和多样性的方法,标记数据通常具有广泛的适用性。对于某些语言(例如,Ruby)标记的数据不那么丰富;在其他(例如,JavaScript)中,可用数据可能更多地关注某些应用域,从而更加多样化。作为围绕此类数据瓶颈,我们提出了证据表明,不同语言(执行相同功能)的人写代码相当相似,特别是保留标识符命名模式;我们进一步提出了证据表明标识符是软件工程任务培训数据的一个非常重要的要素。我们利用这种相当偶然的现象来查找可用的多语言训练数据(跨不同语言)的证据可用于放大性能。我们研究这一点3个不同的任务:代码摘要,代码检索和功能命名。我们注意到,这种数据增强方法与不同的任务,语言和机器学习模型广泛兼容。
translated by 谷歌翻译
GitHub提交的记录,该代码随着自然语言消息的描述而变化,对于软件开发人员来说,在理解软件演变方面起着至关重要的作用。为了促进开源软件社区的开发,我们收集了一个提交基准,包括790万次跨7种编程语言的投入。基于此基准测试,我们提出了Citsbart,这是GitHub提交的大型预训练的编码器变压器模型。该模型由三个类别(即,为了学习提交碎片表示的六个预训练任务)预先培训(即,剥夺目标,跨模式生成和对比度学习)。此外,我们将一个“委托智能”框架与一项理解任务和提交的三个世代任务统一。这些任务的综合实验表明,提案巴特大大优于以前的代码预先培训作品。进一步的分析还揭示了每个预训练任务可增强模型性能。我们鼓励后续研究人员在将来为我们的框架贡献更多与承诺相关的下游任务。
translated by 谷歌翻译
上下文:堆栈溢出对于寻求编程问题答案的软件开发人员非常有帮助。先前的研究表明,越来越多的问题质量低,因此从潜在的答案者那里获得了更少的关注。 Gao等。提出了一个基于LSTM的模型(即BilstM-CC),以自动从代码片段中生成问题标题,以提高问题质量。但是,只有在问题主体中使用代码段无法为标题生成提供足够的信息,而LSTMS无法捕获令牌之间的远程依赖性。目的:本文提出了基于深度学习的新型模型CCBERT,旨在通过充分利用整个问题主体的双模式信息来增强问题标题生成的性能。方法:CCBERT遵循编码器范式范式,并使用Codebert将问题主体编码为隐藏的表示形式,堆叠的变压器解码器以生成预测的代币,以及附加的复制注意层来完善输出分布。编码器和解码器都执行多头自我注意操作,以更好地捕获远程依赖性。本文构建了一个数据集,该数据集包含大约200,000个高质量问题,该数据从Stack Overflow正式发布的数据中滤除,以验证CCBERT模型的有效性。结果:CCBERT优于数据集上的所有基线模型。对仅代码和低资源数据集进行的实验表明,CCBERT的优势性能较小。人类评估还显示了CCBERT关于可读性和相关标准的出色表现。
translated by 谷歌翻译
代码搜索目标是根据自然语言查询检索相关的代码片段,以提高软件生产力和质量。但是,由于源代码和查询之间的语义间隙,自动代码搜索是具有挑战性的。大多数现有方法主要考虑嵌入的顺序信息,其中文本背后的结构信息不完全考虑。在本文中,我们设计了一个名为GraphsearchNet的新型神经网络框架,通过共同学习源代码和查询的富集语义来启用有效和准确的源代码搜索。具体地,我们建议将源代码和查询编码为两个图,其中双向GGNN以捕获图表的本地结构信息。此外,我们通过利用有效的多主题来增强BigGNN,以补充BigGNN错过的全球依赖。关于Java和Python数据集的广泛实验说明了GraphSearchNet优于当前最先进的工作原位。
translated by 谷歌翻译
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones -- the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance -- an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.
translated by 谷歌翻译
Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We will make our code and pre-trained models publicly available.
translated by 谷歌翻译
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
translated by 谷歌翻译