现代深度学习框架提供嵌入在Python中的必要的急切执行编程接口,以提供生产的开发体验。但是,深度学习从业者有时需要捕获和转换程序结构以进行性能优化,可视化,分析和硬件集成。我们研究了深度学习中使用的程序捕获和转型的不同设计。通过设计典型的深度学习用例而不是长尾部,可以为程序捕获和转换创建更简单的框架。我们在Torch.fx中应用了这一原理,是一个完全在Python写入的Pytorch的程序捕获和转换库,并通过ML从业者进行高开发人员生产力优化。我们存在案例研究,展示了Torch.fx如何实现先前在Pytorch生态系统中无法访问的工作流程。
translated by 谷歌翻译
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.
translated by 谷歌翻译
深度学习的快速进步正在导致一系列快速变化的模型,对计算的需求急剧增长。但是,随着框架将性能优化专门针对流行网络的模式,它们隐含地限制了推动研究进展的新颖和多样化的模型。我们通过定义灵活和用户可定制的管道来优化基于数据运动最小化的任意深神经网络的培训来赋予深度学习研究人员的能力。管道始于Pytorch或ONNX中的标准网络,并通过逐步降低转换计算。我们定义了四个级别的通用转换级别,从局部操作员优化到全球数据运动减少。这些在以数据为中心的图形中间表示上运行,该表示在各个级别的抽象级别表达计算和数据移动,包括扩展基本运算符,例如其基础计算的卷积。设计的核心是管道的互动性和内省性质。每个部分都可以通过Python API扩展,并且可以使用GUI进行交互调整。我们在十个不同的网络上展示了竞争性能或加速性,交互式优化发现了高效网络中的新机会。
translated by 谷歌翻译
随着机器学习系统的计算要求以及机器学习框架的规模和复杂性的增加,基本框架创新变得具有挑战性。尽管计算需求驱动了最近的编译器,网络和硬件的进步,但通过机器学习工具对这些进步的利用却以较慢的速度发生。这部分是由于与现有框架原型制作新的计算范式有关的困难。大型框架将机器学习研究人员和从业人员作为最终用户的优先级优先,并且很少关注能够向前推动框架的系统研究人员 - 我们认为两者都是同等重要的利益相关者。我们介绍了手电筒,这是一个开源库,旨在通过优先考虑开放式,模块化,可定制的内部设备以及最新的,可用于研究的模型和培训设置,以刺激机器学习工具和系统的创新。手电筒使系统研究人员能够快速原型并尝试机器学习计算中的新思想,并且开销低,与其他流行的机器学习框架竞争并经常超过其他流行的机器学习框架。我们将手电筒视为一种工具,可以使可以使广泛使用的图书馆受益,并使机器学习和系统研究人员更加紧密地结合在一起。手电筒可从https://github.com/flashlight/flashlight获得。
translated by 谷歌翻译
深度神经网络(DNN)模型和数据集的快速增长大小引起了各种分布策略,如数据,张量模型,管道并行和其混合组合。这些策略中的每一个都提供自己的权衡,并在不同的模型和硬件拓扑上展示最佳性能。选择给定设置的最佳策略集是具有挑战性的,因为搜索空间组合增长,并且在群集上调试和测试昂贵。在这项工作中,我们提出了DISTIR,对于分布式DNN计算,这是针对有效分析而定制的分布式DNN计算的表达中间表示,例如模拟。这使得能够自动识别顶级执行策略,而无需在物理硬件上执行。与事先工作不同,Distir自然可以表达许多分发策略,包括管道并行性,具有任意时间表。我们对MLP培训和GPT-2推理模型的评估演示了DISTIR及其模拟器启用快速网格在跨越1000多种配置的复杂分配空间上搜索,以某些制度的数量级递减优化时间。
translated by 谷歌翻译
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. Tensor-Flow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, generalpurpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that Tensor-Flow achieves for several real-world applications.
translated by 谷歌翻译
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-ofthe-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator.The system is open sourced and in production use inside several major companies.
translated by 谷歌翻译
一般矩阵乘法或GEMM内核在高性能计算和机器学习中占据中心位置。最近的NVIDIA GPU包括Gemm加速器,如Nvidia的张量核心。他们的剥削受到双语言问题的阻碍:它需要低级编程,这意味着低程序员的工作效率或使用只提供有限组件集的库。由于建立的组件方面的REPRASING算法经常引入开销,因此图书馆缺乏灵活性限制了探索新算法的自由。因此,使用GEMMS的研究人员无法立即享受编程生产力,高性能和研究灵活性。在本文中,我们解决了这个问题。我们在科学朱莉娅编程语言中展示了三组抽象和接口来编程宝石。界面和抽象共同设计用于研究人员的需求和朱莉娅的特征,以实现足够的担忧和灵活性的充分分离,以便在不支付性能价格的情况下轻松地扩展基本宝石。将我们的Gemms与最先进的图书馆Cublas和Cutlass进行比较,我们证明我们的性能在图书馆的相同球场中,并且在某些情况下甚至超过它,而无需在CUDA C ++中编写单行代码或者组装,而不面临灵活限制。
translated by 谷歌翻译
越来越多地用于优化深度神经网络(DNN)模型,以满足性能,资源利用和其他要求,越来越多地使用深入学习(DL)编译器(例如TVM和Tensorrt)。这些编译器中的错误可以产生优化的模型,其语义与原始模型不同,并产生不正确的结果,影响了下流应用程序的正确性。但是,由于其复杂性,在这些编译器中找到错误是具有挑战性的。在这项工作中,我们提出了一种新的模糊测试方法,用于在深入学习编译器中查找错误。我们的核心方法使用(i)轻重量操作员规范来生成多样化但有效的DNN模型,使我们能够行使编译器的大部分转换逻辑; (ii)基于梯度的搜索过程,用于查找模型输入,该过程避免在模型执行过程中避免任何浮点异常值,从而减少了错过错误或错误警报的机会; (iii)差异测试以识别错误。我们在NNSmith中实施了这种方法,该方法在过去的七个月中为TVM,Tensorrt,OnxRuntime和Pytorch发现了65个新错误。在这52个已得到证实,项目维护者已确定了44个。
translated by 谷歌翻译
在过去十年中,已经开发出新的深度学习(DL)算法,工作负载和硬件来解决各种问题。尽管工作量和硬件生态系统的进步,DL系统的编程方法是停滞不前的。 DL工作负载从DL库中的高度优化,特定于平台和不灵活的内核,或者在新颖的操作员的情况下,通过具有强大性能的DL框架基元建立参考实现。这项工作介绍了Tensor加工基元(TPP),一个编程抽象,用于高效的DL工作负载的高效,便携式实现。 TPPS定义了一组紧凑而多才多艺的2D张镜操作员(或虚拟张量ISA),随后可以用作构建块,以在高维张量上构建复杂的运算符。 TPP规范是平台 - 不可行的,因此通过TPPS表示的代码是便携式的,而TPP实现是高度优化的,并且特定于平台。我们展示了我们使用独立内核和端到端DL&HPC工作负载完全通过TPPS表达的方法的效力和生存性,这在多个平台上优于最先进的实现。
translated by 谷歌翻译
人工智能(AI)对计算的巨大需求正在推动对AI的硬件和软件系统的无与伦比的投资。这导致了专用硬件设备数量的爆炸,现在由主要的云供应商提供。通过通过基于张量的界面隐藏低级复杂性,张量计算运行时间(TCR)(例如Pytorch)允许数据科学家有效利用新硬件提供的令人兴奋的功能。在本文中,我们探讨了数据库管理系统如何在AI空间中乘坐创新浪潮。我们设计,构建和评估张量查询处理器(TQP):TQP将SQL查询转换为张量程序,并在TCR上执行它们。 TQP能够通过在张量例程中实现与关系运算符的新颖算法来运行完整的TPC-H基准。同时,TQP可以支持各种硬件,而仅需要通常的开发工作。实验表明,与专用CPU和仅GPU的系统相比,TQP可以将查询执行时间提高到10美元$ \ times $。最后,TQP可以加速查询ML预测和SQL端到端,并在CPU基线上输送高达9 $ \ times $速度。
translated by 谷歌翻译
我们呈现GSPMD,一种用于公共机器学习计算的自动,基于编译的并行化系统。它允许用户以与单个设备的方式相同的方式编写程序,然后通过关于如何分发Tensors的一些注释来提供提示,基于哪个GSPMD将并行化计算。其分区的表示简单尚不一般,允许它在各种模型上表达并行性的不同或混合范式。GSPMD基于有限的用户注释为每个运算符的分区Inventing,使得缩放现有的单设备程序方便。它解决了生产使用的几种技术挑战,允许GSPMD实现50%至62%的计算利用率,用于高达2048个云TPUv3核心,适用于高达1万亿参数的模型。
translated by 谷歌翻译
这项工作侧重于特定于域的加速器的有效敏捷设计方法。我们采用垂直开发堆栈的功能逐个功能增强,并将其应用于TVM / VTA推理加速器。我们已经增强了VTA设计空间,并启用了用于额外工作负载的端到端支持。这是通过增强VTA微架构和指令集架构(ISA)来实现的,以及通过增强TVM编译堆栈来支持各种VTA配置。 VTA TSIM实现(基于凿子)已通过ALU / GEMM执行单元的完全流水线版本增强。在TSIM中,内存宽度现在可以在8-64字节之间。对于支持较大的刮板,已经使场宽度更加灵活。已添加新的说明:元素 - WISE 8位乘法,支持深度卷积,并使用焊盘值的选择加载以支持最大池。还添加了对更多层和更好的双缓冲。完全管制的ALU / GEMM有助于显着帮助:4.9倍的循环较少,最小区域更改为在默认配置下运行RESET-18。可以实例化特征在于11.5倍的循环计数的配置,以12倍的循环计数更大的区域。显示了区域性能帕累托曲线上的许多点,展示了执行单元尺寸,内存接口宽度和刻痕尺寸的余额。最后,VTA现在能够运行MobileNet 1.0和所有层进行Resnet,包括先前禁用的池和完全连接的图层。 TVM / VTA架构始终在几分钟内以RTL呈现端到端工作量评估。通过我们的修改,它现在提供了更大的可行配置,具有广泛的成本与性能。所有提到的所有功能都可以在OpenSource叉中提供,而这些功能的子集已经上游。
translated by 谷歌翻译
我们提出了一个自动静态分析仪Pytea,可检测Pytorch码中的张量误差。张量误差在深度神经网络代码中是至关重要的;一旦在训练阶段中间发生张量形状不匹配,就会丢失大部分训练成本和中间结果。鉴于输入Pytorch源,Pytea静态跟踪每个可能的执行路径,收集路径的张量操作序列所需的张量形状约束,并决定如果约束是不匹配的(因此发生形状误差)。 Pytea的可扩展性和精确铰链对现实世界的Pytorch应用的特点:Pytea保守修剪后的执行路径数很少爆炸,循环简单,以通过我们的符号抽象来限制。我们测试了Pytea在官方Pytorch存储库中的项目中,以及在StackOverflow中发现的一些张计错误代码。 Pytea在几秒钟内成功地检测这些代码中的张量形状误差。
translated by 谷歌翻译
随着深度学习模型的速度较大,需要进行大型型号培训的系统级解决方案。我们展示了Amazon Sagemaker模型并行性,这是一个与Pytorch集成的软件库,并且可以使用模型并行性和其他内存节省功能轻松培训大型模型。与现有解决方案相比,Sagemaker库的实现更通用,灵活,因为它可以自动分区和运行具有最小代码的任意模型架构上的管道并行性,并且还为张量并行度提供一般和可扩展的框架,它支持更广泛的用例,并且可以轻松应用于新培训脚本的模块化。该库还将本机Pytorch用户体验保留到更大的程度,支持模块重复使用和动态图形,同时让用户完全控制训练步骤的细节。我们评估GPT-3,Roberta,BERT和神经协作过滤的性能,并表现出对现有解决方案的竞争性能。
translated by 谷歌翻译
我们向开放的神经网络交换(ONNX)中间表示格式提出扩展,以表示任意量化的量化神经网络。我们首先通过利用整数剪辑来引入对现有基于ONX的量化格式低精度量化的支持,从而产生了两个新的向后兼容的变体:带有剪辑和量化clip-dequantize(QCDQ)格式的量化运算符格式。然后,我们引入了一种新型的高级ONNX格式,称为量化ONNX(QONNX),该格式介绍了三个新运算符 - Quant,Biporlquant和Trunc,以表示均匀的量化。通过保持QONNX IR高级和灵活性,我们可以针对更广泛的平台。我们还介绍了与QONNX合作的实用程序,以及其在FINN和HLS4ML工具链中使用的示例。最后,我们介绍了QONNX模型动物园,以共享低精确的量化神经网络。
translated by 谷歌翻译
离散的傅立叶变换(DFT)库是科学计算的最关键的软件组件之一。受FFTW的启发,FFTW是一个广泛用于DFT HPC计算的库,我们将编译器技术应用于HPC傅立叶变换库的开发。在这项工作中,我们介绍了基于多级中间表示(MLIR)的FFTC(一种特定领域的语言),用于表达傅立叶变换算法。我们介绍FFTC的初始设计,实现和初步结果。
translated by 谷歌翻译
Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.
translated by 谷歌翻译
Artificial Intelligence (AI) is having a tremendous impact across most areas of science. Applications of AI in healthcare have the potential to improve our ability to detect, diagnose, prognose, and intervene on human disease. For AI models to be used clinically, they need to be made safe, reproducible and robust, and the underlying software framework must be aware of the particularities (e.g. geometry, physiology, physics) of medical data being processed. This work introduces MONAI, a freely available, community-supported, and consortium-led PyTorch-based framework for deep learning in healthcare. MONAI extends PyTorch to support medical data, with a particular focus on imaging, and provide purpose-specific AI model architectures, transformations and utilities that streamline the development and deployment of medical AI models. MONAI follows best practices for software-development, providing an easy-to-use, robust, well-documented, and well-tested software framework. MONAI preserves the simple, additive, and compositional approach of its underlying PyTorch libraries. MONAI is being used by and receiving contributions from research, clinical and industrial teams from around the world, who are pursuing applications spanning nearly every aspect of healthcare.
translated by 谷歌翻译
Array programming provides a powerful, compact, expressive syntax for accessing, manipulating, and operating on data in vectors, matrices, and higher-dimensional arrays [1]. NumPy is the primary array programming library for the Python language [2,3,4,5]. It plays an essential role in research analysis pipelines in fields as diverse as physics, chemistry, astronomy, geoscience, biology, psychology, material science, engineering, finance, and economics. For example, in astronomy, NumPy was an important part of the software stack used in the discovery of gravitational waves [6] and the first imaging of a black hole [7].Here we show how a few fundamental array concepts lead to a simple and powerful programming paradigm for organizing, exploring, and analyzing scientific data. NumPy is the foundation upon which the entire scientific Python universe is constructed. It is so pervasive that several projects, targeting audiences with specialized needs, have developed their own NumPy-like interfaces and array objects. Because of its central position in the ecosystem, NumPy increasingly plays the role of an interoperability layer between these new array computation libraries.
translated by 谷歌翻译