在软件开发过程中,开发人员需要回答有关代码语义方面的查询。即使已经用神经方法进行了广泛的自然语言研究,但尚未探索使用神经网络对代码回答语义查询的问题。这主要是因为没有现有的数据集,具有提取性问答和答案对,涉及复杂概念和较长推理的代码。我们通过构建一个名为Codequeries的新的,策划的数据集并提出了一种关于代码的神经问题方法来弥合这一差距。我们基于最先进的预训练的代码模型,以预测答案和支持事实跨度。给定查询和代码,只有一些代码可能与回答查询有关。我们首先在理想的环境下进行实验,其中仅给出了模型的相关代码,并表明我们的模型做得很好。然后,我们在三个务实的考虑因素下进行实验:(1)扩展到大尺寸的代码,(2)从有限数量的示例中学习,(3)代码中对次要语法错误的鲁棒性。我们的结果表明,虽然神经模型可以抵御代码中的次要语法错误,代码的大小增加,与查询无关的代码的存在以及减少的培训示例数量限制了模型性能。我们正在释放数据和模型,以促进未来关于回答代码语义查询的问题的工作。
translated by 谷歌翻译
Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.
translated by 谷歌翻译
在过去的几年中,世界已转向多核和多核共享内存体系结构。结果,通过将共享内存并行化方案引入软件应用程序,越来越需要利用这些体系结构。 OpenMP是实现此类方案的最全面的API,其特征是可读接口。然而,由于平行共享内存的管理中普遍存在的陷阱,将OpenMP引入代码很具有挑战性。为了促进此任务的性能,多年来创建了许多源代码(S2S)编译器,任务是将OpenMP指令自动插入代码。除了对输入格式的鲁棒性有限外,这些编译器仍然无法在定位可行的代码和生成适当指令时获得令人满意的覆盖范围和精确度。在这项工作中,我们建议利用ML技术的最新进展,特别是自然语言处理(NLP),以完全替换S2S编译器。我们创建一个数据库(语料库),专门用于此目标。 Open-Opm包含28,000多个代码片段,其中一半包含OpenMP指令,而另一半根本不需要并行化。我们使用语料库来培训系统来自动对需要并行化的代码段进行分类,并建议单个OpenMP条款。我们为这些任务培训了几个名为Bragformer的变压器模型,并表明它们的表现优于统计训练的基线和自动S2S并行化编译器,这既可以分类OpenMP指令的总体需求,又要介绍私人和还原条款。我们的源代码和数据库可在以下网址获得:https://github.com/scientific-computing-lab-nrcn/pragformer。
translated by 谷歌翻译
源代码的最先进的神经模型倾向于在代码的生成时进行评估,并且通常在长地平任务中的产生,例如整个方法体的产生。我们建议使用静态程序分析仪的弱监督来解决这一缺陷。我们的神经统计方法允许深入的生成模型来象征地计算它已经生成的代码中的静态分析工具,长距离语义关系。在培训期间,该模型观察这些关系,并学习生成条件上的程序。考虑到包含该方法的类的剩余部分,我们将我们的方法应用于生成整个Java方法的问题。我们的实验表明,该方法显着地优于最先进的变换器和模型,明确试图在制作程序中没有基本语义错误的程序以及在句法匹配地面真理方面来学习此任务的模型。
translated by 谷歌翻译
In software development, it is common for programmers to copy-paste or port code snippets and then adapt them to their use case. This scenario motivates the code adaptation task -- a variant of program repair which aims to adapt variable identifiers in a pasted snippet of code to the surrounding, preexisting source code. However, no existing approach has been shown to effectively address this task. In this paper, we introduce AdaptivePaste, a learning-based approach to source code adaptation, based on transformers and a dedicated dataflow-aware deobfuscation pre-training task to learn meaningful representations of variable usage patterns. We evaluate AdaptivePaste on a dataset of code snippets in Python. Results suggest that our model can learn to adapt source code with 79.8% accuracy. To evaluate how valuable is AdaptivePaste in practice, we perform a user study with 10 Python developers on a hundred real-world copy-paste instances. The results show that AdaptivePaste reduces the dwell time to nearly half the time it takes for manual code adaptation, and helps to avoid bugs. In addition, we utilize the participant feedback to identify potential avenues for improvement of AdaptivePaste.
translated by 谷歌翻译
评论是源代码的重要组成部分,是文档的主要来源。这引起了人们对使用大量注释的兴趣训练或评估消耗或生产它们的工具,例如生成甲骨文,甚至是从注释中生成代码,或自动生成代码摘要。这项工作大部分对评论的结构和质量做出了强烈的假设,例如假设它们主要由适当的英语句子组成。但是,我们对这些用例的现有评论的实际质量知之甚少。评论通常包含在其他类型的文本中看不到的独特结构和元素,并且从中过滤或提取信息需要额外的谨慎。本文探讨了来自GitHub的840个最受欢迎的开源项目和Srilab数据集的8422个项目的Python评论的内容和质量,并且Na \“ Ive vs.深入过滤的影响都可以使用现有注释来用于使用现有注释。培训和评估产生评论的系统。
translated by 谷歌翻译
在本文中,我们解决了深入学习的软件漏洞自动修复问题。数据驱动漏洞修复的主要问题是已知确认漏洞的少数现有数据集仅由几千例组成。然而,培训深度学习模型通常需要数十万例的例子。在这项工作中,我们利用了错误修复任务和漏洞修复任务的直觉相关,并且可以传输来自错误修复的知识可以传输到修复漏洞。在机器学习界中,这种技术称为转移学习。在本文中,我们提出了一种修复名为Vreepair的安全漏洞的方法,该方法是基于转移学习。 vreepair首先在大型错误修复语料库上培训,然后在漏洞修复数据集上调整,这是一个较小的数量级。在我们的实验中,我们表明,仅在错误修复语料库上培训的模型可能已经修复了一些漏洞。然后,我们证明转移学习改善了修复易受攻击的C功能的能力。我们还表明,转移学习模型比具有去噪任务训练的模型更好,并在漏洞固定任务上进行微调。总而言之,本文表明,与在小型数据集上的学习相比,转移学习适用于修复C中的安全漏洞。
translated by 谷歌翻译
反向工程师受益于二进制中的标识符(例如函数名称)的存在,但通常将其删除以释放。训练机器学习模型自动预测功能名称是有希望的,但从根本上讲很难:与自然语言中的单词不同,大多数函数名称仅出现一次。在本文中,我们通过引入极端功能标签(XFL)来解决此问题,这是一种极端的多标签学习方法,可为二进制功能选择适当的标签。 XFL将函数名称分为代币,将每个功能视为具有自然语言标记文本的问题的信息标签。我们将二进制代码的语义与通过dexter进行标签,这是一种新颖的函数,将基于静态分析的特征与来自呼叫图的本地上下文和整个二进制的全局上下文相结合。我们证明,XFL/Dexter在Debian Project的10,047个二进制数据集上的功能标签上优于最新技术,获得了83.5%的精度。我们还研究了XFL与文献中的替代二进制嵌入的组合,并表明Dexter始终为这项任务做得最好。结果,我们证明了二进制函数标记可以通过多标签学习有效地措辞,并且二进制函数嵌入得益于包括明确的语义特征。
translated by 谷歌翻译
改善软件性能是软件开发周期中重要但充满挑战的部分。如今,大多数性能效率低下是由绩效专家确定和修补的。深度学习方法的最新进展和开源数据的广泛可用性为自动化绩效问题的识别和修补提供了一个绝佳的机会。在本文中,我们提出了Deepperf,这是一种基于变压器的方法,以建议针对C#应用程序进行性能改进。我们在英语和源代码语料库上预告了Deepperf,然后进行了Finetuning的任务,以生成C#应用程序的性能改进补丁。我们的评估表明,我们的模型可以在约53%的案例中生成与开发人员修复相同的性能改进建议,在我们专家验证的C#开发人员进行的绩效更改的数据集中,逐字化约34%。此外,我们使用基准测试和单元测试在GitHub上在50个开源C#存储库上评估Deepperf,并发现我们的模型能够提出有效的性能改进,以改善CPU使用和内存分配。到目前为止,我们已经提交了19个带有28种不同性能优化的拉装重新要求,其中11个PR已获得项目所有者的批准。
translated by 谷歌翻译
Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BIGBIRD, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BIGBIRD is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BIGBIRD drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
translated by 谷歌翻译
自动问题应答(QA)系统的目的是以时间有效的方式向用户查询提供答案。通常在数据库(或知识库)或通常被称为语料库的文件集合中找到答案。在过去的几十年里,收购知识的扩散,因此生物医学领域的新科学文章一直是指数增长。因此,即使对于领域专家,也难以跟踪域中的所有信息。随着商业搜索引擎的改进,用户可以在某些情况下键入其查询并获得最相关的一小组文档,以及在某些情况下从文档中的相关片段。但是,手动查找所需信息或答案可能仍然令人疑惑和耗时。这需要开发高效的QA系统,该系统旨在为用户提供精确和精确的答案提供了生物医学领域的自然语言问题。在本文中,我们介绍了用于开发普通域QA系统的基本方法,然后彻底调查生物医学QA系统的不同方面,包括使用结构化数据库和文本集合的基准数据集和几种提出的方​​法。我们还探讨了当前系统的局限性,并探索潜在的途径以获得进一步的进步。
translated by 谷歌翻译
近年来,在挑战的多跳QA任务方面有令人印象深刻的进步。然而,当面对输入文本中的一些干扰时,这些QA模型可能会失败,并且它们进行多跳推理的可解释性仍然不确定。以前的逆势攻击作品通常编辑整个问题句,这对测试基于实体的多跳推理能力有限。在本文中,我们提出了一种基于多跳推理链的逆势攻击方法。我们将从查询实体开始的多跳推理链与构造的图表中的答案实体一起制定,这使我们能够将问题对齐到每个推理跳跃,从而攻击任何跃点。我们将问题分类为不同的推理类型和对应于所选推理跳的部分问题,以产生分散注意力的句子。我们在HotpotQA DataSet上的三个QA模型上测试我们的对抗方案。结果表明,对答案和支持事实预测的显着性能降低,验证了我们推理基于链条推理模型的攻击方法的有效性以及它们的脆弱性。我们的对抗重新培训进一步提高了这些模型的性能和鲁棒性。
translated by 谷歌翻译
我们提出了Pangu-Coder,这是一种仅预读的解码器语言模型,该模型采用pangu-alpha架构进行文本到代码生成,即给定自然语言问题描述的编程语言解决方案的合成。我们使用两阶段策略训练Pangu-Coder:第一阶段采用因果语言建模(CLM)来预先培训原始编程语言数据,而第二阶段则使用因果语言建模和掩盖语言建模(MLM)的组合培训目标,专注于文本到代码生成的下游任务,并培训松散的自然语言程序定义和代码功能。最后,我们讨论了pangu-coder-ft,该pander the是通过竞争性编程问题和代码与持续集成测试的结合进行了微调的。我们评估了pangu-coder,重点是它是否生成功能上正确的程序,并证明它在参加较小的上下文窗口和较少的数据培训的同时,它比诸如Codex之类的类似大小的模型(例如Codex)实现等效性或更好的性能。
translated by 谷歌翻译
Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HOTPOTQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HOTPOTQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
translated by 谷歌翻译
While pre-trained language models (LM) for code have achieved great success in code completion, they generate code conditioned only on the contents within the file, i.e., in-file context, but ignore the rich semantics in other files within the same project, i.e., cross-file context, a critical source of information that is especially useful in modern modular software development. Such overlooking constrains code language models' capacity in code completion, leading to unexpected behaviors such as generating hallucinated class member functions or function calls with unexpected arguments. In this work, we develop a cross-file context finder tool, CCFINDER, that effectively locates and retrieves the most relevant cross-file context. We propose CoCoMIC, a framework that incorporates cross-file context to learn the in-file and cross-file context jointly on top of pretrained code LMs. CoCoMIC successfully improves the existing code LM with a 19.30% relative increase in exact match and a 15.41% relative increase in identifier matching for code completion when the cross-file context is provided.
translated by 谷歌翻译
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
translated by 谷歌翻译
多跳的推理(即跨两个或多个文档的推理)是NLP模型的关键要素,该模型利用大型语料库表现出广泛的知识。为了检索证据段落,多跳模型必须与整个啤酒花的快速增长的搜索空间抗衡,代表结合多个信息需求的复杂查询,并解决有关在训练段落之间跳出的最佳顺序的歧义。我们通过Baleen解决了这些问题,Baleen可以提高多跳检索的准确性,同时从多跳的训练信号中学习强大的训练信号的准确性。为了驯服搜索空间,我们提出了凝结的检索,该管道总结了每个跃点后检索到单个紧凑型上下文的管道。为了建模复杂的查询,我们引入了一个重点的后期相互作用检索器,该检索器允许同一查询表示的不同部分匹配不同的相关段落。最后,为了推断无序的训练段落中的跳跃依赖性,我们设计了潜在的跳跃订购,这是一种弱者的策略,在该策略中,受过训练的检索员本身选择了啤酒花的顺序。我们在检索中评估Baleen的两跳问答和多跳的要求验证,并确定最先进的绩效。
translated by 谷歌翻译
人类开发人员可以使用网络安全缺陷生产代码。可以新兴'智能'代码完成工具有助于修复这些缺点吗?在这项工作中,我们研究了对零拍摄漏洞修复的代码(如Openai的Codex和AI21的侏罗纪J-1)使用大型语言模型(如Openai的Codex和AI21的J-1)。我们调查设计方面的挑战,提示将Coax LLMS进入生成不安全代码的修复版本。由于许多方法来短语和句法 - 具有自然语言,这很困难。通过对四个商业,黑盒子,“现成的”典型的模型进行大规模研究,以及局部训练的模型,在合成,手工制作和现实世界的安全错误场景的混合中,我们的实验表明,LLMS可以共同修复100%的综合生成和手工制作的情景,以及58%的脆弱性,在真实的开源项目中的历史错误中选择。
translated by 谷歌翻译
Semantic code search is the task of retrieving a code snippet given a textual description of its functionality. Recent work has been focused on using similarity metrics between neural embeddings of text and code. However, current language models are known to struggle with longer, compositional text, and multi-step reasoning. To overcome this limitation, we propose supplementing the query sentence with a layout of its semantic structure. The semantic layout is used to break down the final reasoning decision into a series of lower-level decisions. We use a Neural Module Network architecture to implement this idea. We compare our model - NS3 (Neuro-Symbolic Semantic Search) - to a number of baselines, including state-of-the-art semantic code retrieval methods, and evaluate on two datasets - CodeSearchNet and Code Search and Question Answering. We demonstrate that our approach results in more precise code retrieval, and we study the effectiveness of our modular design when handling compositional queries.
translated by 谷歌翻译
The problem of reversing the compilation process, decompilation, is an important tool in reverse engineering of computer software. Recently, researchers have proposed using techniques from neural machine translation to automate the process in decompilation. Although such techniques hold the promise of targeting a wider range of source and assembly languages, to date they have primarily targeted C code. In this paper we argue that existing neural decompilers have achieved higher accuracy at the cost of requiring language-specific domain knowledge such as tokenizers and parsers to build an abstract syntax tree (AST) for the source language, which increases the overhead of supporting new languages. We explore a different tradeoff that, to the extent possible, treats the assembly and source languages as plain text, and show that this allows us to build a decompiler that is easily retargetable to new languages. We evaluate our prototype decompiler, Beyond The C (BTC), on Go, Fortran, OCaml, and C, and examine the impact of parameters such as tokenization and training data selection on the quality of decompilation, finding that it achieves comparable decompilation results to prior work in neural decompilation with significantly less domain knowledge. We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
translated by 谷歌翻译