恢复程序的呼叫图对于基于流程间分析任务和应用程序至关重要。核心挑战是识别间接呼叫的目标(即,间接分支机构)。由于二进制文件中的信息丢失,如果目标程序以二元形式为二元形式,则变得更具挑战性。二进制文件的现有间接Callee识别解决方案都具有高误报和负面,使呼叫图不准确。在本文中,我们提出了一种基于暹罗神经网络的新解决方案,受到质疑答案应用的进步的启发。关键洞察力是,神经网络可以学习通过理解其上下文,即附近呼叫和分支机构的指示是间接代表的潜在目标。在此洞察力之后,我们首先预处理目标二进制文件,以提取电话和分支的上下文。然后,我们构建适用于汇编语言的自定义自然语言处理(NLP)模型。此外,我们收集了丰富的呼叫和分支,并将其上下文与NLP模型嵌入,然后培训暹罗网络和分类器以回答电呼叫路上的问题。我们已经实施了Inclelee的原型,并在几组目标上进行了评估。评价结果表明,我们的解决方案可以将手段与F1措施相匹配93.7%,召回的93.8%,精度为93.5%,比最先进的解决方案好得多。为了展示其有用性,我们将iCallee应用于两个特定的应用 - 二进制代码相似性检测和二进制程序硬化,并发现它可以大大提高最先进的解决方案。
translated by 谷歌翻译
二进制代码分析的最新趋势促进了基于教学嵌入模型的神经解决方案的使用。指令嵌入模型是一个神经网络,将汇编指令序列转换为嵌入向量。如果对嵌入式网络进行了训练,从而使从代码到向量的翻译部分保留了语义,则该网络有效地代表了汇编代码模型。在本文中,我们介绍了Binbert,这是一种新颖的装配代码模型。 Binbert建立在汇编指令序列和符号执行信息的庞大数据集中的预训练的变压器上。 Binbert可以应用于汇编指令序列,并且可以微调,即可以作为任务特定数据的神经体系结构的一部分进行重新训练。通过微调,Binbert学会了如何将获得预培训获得的通用知识应用于特定任务。我们根据多任务基准评估了Binbert,我们专门设计了用于测试组装代码的理解。基准是由几个任务组成的,其中一些是从文献中获得的,以及我们设计的一些新颖任务,并结合了内在和下游任务。我们的结果表明,Binbert优于二进制指令嵌入的最先进模型,提高了二进制代码理解的标准。
translated by 谷歌翻译
反向工程师受益于二进制中的标识符(例如函数名称)的存在,但通常将其删除以释放。训练机器学习模型自动预测功能名称是有希望的,但从根本上讲很难:与自然语言中的单词不同,大多数函数名称仅出现一次。在本文中,我们通过引入极端功能标签(XFL)来解决此问题,这是一种极端的多标签学习方法,可为二进制功能选择适当的标签。 XFL将函数名称分为代币,将每个功能视为具有自然语言标记文本的问题的信息标签。我们将二进制代码的语义与通过dexter进行标签,这是一种新颖的函数,将基于静态分析的特征与来自呼叫图的本地上下文和整个二进制的全局上下文相结合。我们证明,XFL/Dexter在Debian Project的10,047个二进制数据集上的功能标签上优于最新技术,获得了83.5%的精度。我们还研究了XFL与文献中的替代二进制嵌入的组合,并表明Dexter始终为这项任务做得最好。结果,我们证明了二进制函数标记可以通过多标签学习有效地措辞,并且二进制函数嵌入得益于包括明确的语义特征。
translated by 谷歌翻译
Decompilation aims to transform a low-level program language (LPL) (eg., binary file) into its functionally-equivalent high-level program language (HPL) (e.g., C/C++). It is a core technology in software security, especially in vulnerability discovery and malware analysis. In recent years, with the successful application of neural machine translation (NMT) models in natural language processing (NLP), researchers have tried to build neural decompilers by borrowing the idea of NMT. They formulate the decompilation process as a translation problem between LPL and HPL, aiming to reduce the human cost required to develop decompilation tools and improve their generalizability. However, state-of-the-art learning-based decompilers do not cope well with compiler-optimized binaries. Since real-world binaries are mostly compiler-optimized, decompilers that do not consider optimized binaries have limited practical significance. In this paper, we propose a novel learning-based approach named NeurDP, that targets compiler-optimized binaries. NeurDP uses a graph neural network (GNN) model to convert LPL to an intermediate representation (IR), which bridges the gap between source code and optimized binary. We also design an Optimized Translation Unit (OTU) to split functions into smaller code fragments for better translation performance. Evaluation results on datasets containing various types of statements show that NeurDP can decompile optimized binaries with 45.21% higher accuracy than state-of-the-art neural decompilation frameworks.
translated by 谷歌翻译
构建静态呼叫图需要在健全和精度之间进行权衡。不幸的是,用于构建呼叫图的程序分析技术通常不精确。为了解决这个问题,研究人员最近提出了通过机器学习为静态分析构建的后处理呼叫图所授权的呼叫图。机器学习模型的构建是为了通过在随机森林分类器中提取结构特征来捕获呼叫图中的信息。然后,它消除了预测为误报的边缘。尽管机器学习模型显示了改进,但它们仍然受到限制,因为它们不考虑源代码语义,因此通常无法有效地区分真实和误报。在本文中,我们提出了一种新颖的呼叫图修剪技术AutoRoprouner,用于通过统计语义和结构分析消除呼叫图中的假阳性。给定一个由传统静态分析工具构建的呼叫图,AutoProuner采用基于变压器的方法来捕获呼叫者与呼叫图中每个边缘相关的呼叫者和Callee函数之间的语义关系。为此,AutoProuner微型调节模型是在大型语料库上预先训练的代码模型,以根据其语义的描述表示源代码。接下来,该模型用于从与呼叫图中的每个边缘相关的功能中提取语义特征。 AutoProuner使用这些语义功能以及从呼叫图提取的结构特征通过馈送前向神经网络分类。我们在现实世界程序的基准数据集上进行的经验评估表明,AutoProuner的表现优于最先进的基线,从而改善了F量级,在识别静态呼叫图中识别错误阳性边缘方面,高达13%。
translated by 谷歌翻译
深度学习在各种软件工程任务中广泛使用,例如,节目分类和缺陷预测。虽然该技术消除了特征工程所需的过程,但源代码模型的构建显着影响了这些任务的性能。最近的作品主要集中在通过引入从CFG提取的上下文依赖项来补充基于AST的源代码模型。但是,所有这些都关注基本块的表示,这是上下文依赖性的基础。在本文中,我们集成了AST和CFG,并提出了一种嵌入了分层依赖项的新型源代码模型。基于此,我们还设计了一种神经网络,这取决于图表关注机制。特殊地,我们介绍了基本块的句法结构,即其对应的AST,在源代码模型中提供足够的信息并填补间隙。我们在三种实际软件工程任务中评估了该模型,并将其与其他最先进的方法进行了比较。结果表明,我们的模型可以显着提高性能。例如,与最佳性能的基线相比,我们的模型将参数的比例降低了50 \%并实现了对程序分类任务的准确性的4 \%改进。
translated by 谷歌翻译
基于变压器的大型语言模型在自然语言处理中表现出色。通过考虑这些模型在一个领域中获得的知识的可传递性,以及自然语言与高级编程语言(例如C/C ++)的亲密关系,这项工作研究了如何利用(大)基于变压器语言模型检测软件漏洞以及这些模型在漏洞检测任务方面的良好程度。在这方面,首先提出了一个系统的(凝聚)框架,详细介绍了源代码翻译,模型准备和推理。然后,使用具有多个漏洞的C/C ++源代码的软件漏洞数据集进行经验分析,该数据集对应于库功能调用,指针使用,数组使用情况和算术表达式。我们的经验结果证明了语言模型在脆弱性检测中的良好性能。此外,这些语言模型具有比当代模型更好的性能指标,例如F1得分,即双向长期记忆和双向封闭式复发单元。由于计算资源,平台,库和依赖项的要求,对语言模型进行实验始终是具有挑战性的。因此,本文还分析了流行的平台,以有效地微调这些模型并在选择平台时提出建议。
translated by 谷歌翻译
在本文中,我们解决了深入学习的软件漏洞自动修复问题。数据驱动漏洞修复的主要问题是已知确认漏洞的少数现有数据集仅由几千例组成。然而,培训深度学习模型通常需要数十万例的例子。在这项工作中,我们利用了错误修复任务和漏洞修复任务的直觉相关,并且可以传输来自错误修复的知识可以传输到修复漏洞。在机器学习界中,这种技术称为转移学习。在本文中,我们提出了一种修复名为Vreepair的安全漏洞的方法,该方法是基于转移学习。 vreepair首先在大型错误修复语料库上培训,然后在漏洞修复数据集上调整,这是一个较小的数量级。在我们的实验中,我们表明,仅在错误修复语料库上培训的模型可能已经修复了一些漏洞。然后,我们证明转移学习改善了修复易受攻击的C功能的能力。我们还表明,转移学习模型比具有去噪任务训练的模型更好,并在漏洞固定任务上进行微调。总而言之,本文表明,与在小型数据集上的学习相比,转移学习适用于修复C中的安全漏洞。
translated by 谷歌翻译
Binary code similarity detection (BCSD) is widely used in various binary analysis tasks such as vulnerability search, malware detection, clone detection, and patch analysis. Recent studies have shown that the learning-based binary code embedding models perform better than the traditional feature-based approaches. In this paper, we proposed a novel transformer-based binary code embedding model, named UniASM, to learn representations of the binary functions. We designed two new training tasks to make the spatial distribution of the generated vectors more uniform, which can be used directly in BCSD without any fine-tuning. In addition, we proposed a new tokenization approach for binary functions, increasing the token's semantic information while mitigating the out-of-vocabulary (OOV) problem. The experimental results show that UniASM outperforms state-of-the-art (SOTA) approaches on the evaluation dataset. We achieved the average scores of recall@1 on cross-compilers, cross-optimization-levels and cross-obfuscations are 0.72, 0.63, and 0.77, which is higher than existing SOTA baselines. In a real-world task of known vulnerability searching, UniASM outperforms all the current baselines.
translated by 谷歌翻译
深度学习方法的最新突破引发了人们对基于学习的错误探测器的兴趣。与传统的静态分析工具相比,这些错误检测器是直接从数据中学到的,因此更容易创建。另一方面,它们很难训练,需要大量数据,而这些数据不容易获得。在本文中,我们提出了一种称为Meta Bug检测的新方法,该方法比现有基于学习的错误探测器具有三个至关重要的优势:Bug-Type通用(即,能够捕获在培训期间完全没有观察到的错误类型),可以自我解释(即能够在没有任何外部可解释方法的情况下解释其自身的预测)和样本有效(即,比标准错误检测器所需的培训数据要少得多)。我们的广泛评估表明,我们的元错误检测器(MBD)有效地捕获了各种错误,包括NULL指针解除,阵列索引外部漏洞,文件句柄泄漏甚至是并发程序中的数据竞赛;在此过程中,MBD还大大优于几个值得注意的基线,包括Facebook推断,一种著名的静态分析工具和FICS,即最新的异常检测方法。
translated by 谷歌翻译
代码克隆是实现类似功能的代码段对。克隆检测是自动源代码理解的基本分支,在重构建议,窃检测和代码摘要中具有许多应用程序。克隆检测的一个特别有趣的案例是检测语义克隆,即具有相同功能但实现方面有显着差异的代码段。检测语义克隆的一种有希望的方法是对比度学习(CL),这是一种在计算机视觉中流行的机器学习范式,但尚未用于代码处理。我们的工作旨在评估最受欢迎的CL算法以及两个任务上的三个源代码表示形式。第一个任务是代码克隆检测,我们在包含104个算法的实现的POJ-104数据集上进行了评估。第二个任务是窃检测。为了评估此任务上的模型,我们介绍了CodeTransFormator,这是用于转换源代码的工具。我们使用它来创建一个基于竞争性编程解决方案模仿窃代码的数据集。我们为这两项任务培训了九个模型,并将其与现有的六种方法进行了比较,包括传统工具和现代培训的神经模型。我们评估的结果表明,提议的模型在每个任务中都具有多样性,但是基于图的模型的性能通常高于其他模型。在CL算法中,SIMCLR和SWAV带来更好的结果,而MoCo是最强大的方法。我们的代码和训练有素的模型可在https://doi.org/10.5281/zenodo.6360627,https://doi.org/10.5281/zenodo.5596345获得。
translated by 谷歌翻译
尽管不断努力提高代码搜索的有效性和效率,但仍未解决两个问题。首先,编程语言具有固有的牢固结构链接,并且代码的特征是文本表单将省略其中包含的结构信息。其次,代码和查询之间存在潜在的语义关系,跨序列对齐代码和文本是具有挑战性的,因此在相似性匹配期间,向量在空间上保持一致。为了解决这两个问题,在本文中,提出了一个名为CSSAM的代码搜索模型(代码语义和结构注意匹配)。通过引入语义和结构匹配机制,CSSAM有效提取并融合了多维代码功能。具体而言,开发了交叉和残留层,以促进代码和查询的高纬度空间比对。通过利用残差交互,匹配模块旨在保留更多的代码语义和描述性功能,从而增强了代码及其相应查询文本之间的附着力。此外,为了提高模型对代码固有结构的理解,提出了一个名为CSRG的代码表示结构(代码语义表示图),用于共同表示抽象语法树节点和代码的数据流。根据两个包含540K和330K代码段的公开可用数据集的实验结果,CSSAM在两个数据集中分别在获得最高的SR@1/5/10,MRR和NDCG@50方面大大优于基本线。此外,进行消融研究是为了定量衡量CSSAM每个关键组成部分对代码搜索效率和有效性的影响,这为改进高级代码搜索解决方案提供了见解。
translated by 谷歌翻译
The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) into a form that is suitable for learning. Many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations. Yet, in the context of Android problems, existing models are either limited to coarse-grained whole-app level (e.g., apk2vec) or conducted for one specific downstream task (e.g., smali2vec). Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both of these two limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class-level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences, in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications. We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in two distinct class-level software engineering tasks: Malicious Code Localization and Defect Prediction. We also experiment with strategies to deal with the problem of catering to apps having vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task.
translated by 谷歌翻译
Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.
translated by 谷歌翻译
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.
translated by 谷歌翻译
Memory-safety bugs introduce critical software-security issues. Rust provides memory-safe mechanisms to avoid memory-safety bugs in programming, while still allowing unsafe escape hatches via unsafe code. However, the unsafe code that enhances the usability of Rust provides clear spots for finding memory-safety bugs in Rust source code. In this paper, we claim that these unsafe spots can still be identifiable in Rust binary code via machine learning and be leveraged for finding memory-safety bugs. To support our claim, we propose the tool textttrustspot, that enables reverse engineering to learn an unsafe classifier that proposes a list of functions in Rust binaries for downstream analysis. We empirically show that the function proposals by textttrustspot can recall $92.92\%$ of memory-safety bugs, while it covers only $16.79\%$ of the entire binary code. As an application, we demonstrate that the function proposals are used in targeted fuzzing on Rust packages, which contribute to reducing the fuzzing time compared to non-targeted fuzzing.
translated by 谷歌翻译
The problem of reversing the compilation process, decompilation, is an important tool in reverse engineering of computer software. Recently, researchers have proposed using techniques from neural machine translation to automate the process in decompilation. Although such techniques hold the promise of targeting a wider range of source and assembly languages, to date they have primarily targeted C code. In this paper we argue that existing neural decompilers have achieved higher accuracy at the cost of requiring language-specific domain knowledge such as tokenizers and parsers to build an abstract syntax tree (AST) for the source language, which increases the overhead of supporting new languages. We explore a different tradeoff that, to the extent possible, treats the assembly and source languages as plain text, and show that this allows us to build a decompiler that is easily retargetable to new languages. We evaluate our prototype decompiler, Beyond The C (BTC), on Go, Fortran, OCaml, and C, and examine the impact of parameters such as tokenization and training data selection on the quality of decompilation, finding that it achieves comparable decompilation results to prior work in neural decompilation with significantly less domain knowledge. We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
translated by 谷歌翻译
最近的工作通过从上下文重建令牌来了解源代码的上下文表示。对于诸如英语中汇总代码的下游语义理解任务,这些表示应该理想地捕获程序功能。但是,我们表明流行的基于重建的BERT模型对源代码编辑敏感,即使编辑保存语义。我们提出了僵局:一种学习代码功能的对比预训练任务,而不是形成。触发预先训练神经网络,以识别许多不等效的干扰者之间的程序的功能类似的变体。我们使用自动源到源编译器作为数据增强的形式来缩放可扩展这些变体。对比预训练将JavaScript摘要和打字类型推理准确性提高2%至13%。我们还提出了一个新的零拍摄JavaScript代码克隆检测数据集,显示施加均比更强大和语义有意义。就此而言,我们以39%的Auroc在普发的环境中以39%的AUROC倾斜,高达5%的自然代码。
translated by 谷歌翻译
大多数自动化软件测试任务可以从测试用例的抽象表示中受益。传统上,这是通过基于测试案例的代码覆盖范围来完成的。规范级别的标准可以替换代码覆盖范围以更好地表示测试用例的行为,但通常不具有成本效益。在本文中,我们假设测试用例的执行痕迹可以使其在自动测试任务中抽象其行为的好选择。我们提出了一种新颖的嵌入方法Test2VEC,该方法将测试执行映射到潜在空间。我们在测试案例的优先级(TP)任务中评估了此表示形式。我们的默认TP方法基于嵌入式向量与历史失败测试向量的相似性。我们还根据测试向量的多样性研究了一种替代方案。最后,我们提出了一种决定给定测试套件的方法,以决定选择哪种TP。该实验基于几个真实和种子故障,具有超过一百万个执行痕迹。结果表明,就第一个失败测试案例(FFR)的中位数等级而言,我们提议的TP将最佳替代品提高了41.80%。就中位数APFD和中位数归一化FFR而言,它的表现优于传统代码覆盖范围的方法25.05%和59.25%。
translated by 谷歌翻译
功能级二进制代码相似性检测在网络空间安全性领域至关重要。它可以帮助我们在发布的软件中找到错误并检测专利侵权,并在预防供应链攻击中起关键作用。一个实用的嵌入学习框架依赖于矢量表示系统的鲁棒性以及功能对注释的准确性。传统上,基于学习的方法是基于学习的方法。但是,用准确的标签对不同的功能对进行注释非常困难。这些监督的学习方法很容易被过度训练,并且遭受了鲁棒性问题的困扰。为了减轻这些问题,我们提出了FUN2VEC:二进制功能级表示的对比学习框架。我们采用一种无监督的学习方法,并将二进制代码相似性检测作为实例歧视。 FUN2VEC直接用于分解的二进制功能,并且可以使用任何编码器实现。它不需要标记类似或不同信息的手动。我们使用编译器优化选项和代码混淆技术来生成增强数据。我们的实验结果表明,我们的方法超过了准确性的最先进,并且在几次射击设置中具有很大的优势。
translated by 谷歌翻译