深度神经网络(DNNS)的边缘训练是持续学习的理想目标。但是,这受到训练所需的巨大计算能力的阻碍。硬件近似乘数表明,它们在获得DNN推理加速器中获得资源效率的有效性;但是,使用近似乘数的培训在很大程度上尚未开发。为了通过支持DNN培训的近似乘数来构建有效的资源加速器,需要对不同DNN体系结构和不同近似乘数进行彻底评估。本文介绍了近似值,这是一个开源框架,允许使用模拟近似乘数快速评估DNN训练和推理。近似值与TensorFlow(TF)一样用户友好,仅需要对DNN体系结构的高级描述以及近似乘数的C/C ++功能模型。我们通过使用GPU(AMSIM)上的基于基于LUT的近似浮点(FP)乘数模拟器来提高乘数在乘数级别的模拟速度。近似值利用CUDA并有效地将AMSIM集成到张量库中,以克服商业GPU中的本机硬件近似乘数的缺乏。我们使用近似值来评估使用LENET和RESNETS体系结构的小型和大型数据集(包括Imagenet)的近似乘数的DNN训练的收敛性和准确性。与FP32和BFLOAT16乘数相比,评估表明测试准确性相似的收敛行为和可忽略不计的变化。与训练和推理中基于CPU的近似乘数模拟相比,GPU加速近似值快2500倍以上。基于具有本地硬件乘数的高度优化的闭合源Cudnn/Cublas库,原始张量量仅比近似值快8倍。
translated by 谷歌翻译
我们日常生活中的深度学习是普遍存在的,包括自驾车,虚拟助理,社交网络服务,医疗服务,面部识别等,但是深度神经网络在训练和推理期间需要大量计算资源。该机器学习界主要集中在模型级优化(如深度学习模型的架构压缩),而系统社区则专注于实施级别优化。在其间,在算术界中提出了各种算术级优化技术。本文在模型,算术和实施级技术方面提供了关于资源有效的深度学习技术的调查,并确定了三种不同级别技术的资源有效的深度学习技术的研究差距。我们的调查基于我们的资源效率度量定义,阐明了较低级别技术的影响,并探讨了资源有效的深度学习研究的未来趋势。
translated by 谷歌翻译
本文介绍了有关如何架构,设计和优化深神经网络(DNN)的最新概述,以提高性能并保留准确性。该论文涵盖了一组跨越整个机器学习处理管道的优化。我们介绍两种类型的优化。第一个改变了DNN模型,需要重新训练,而第二个则不训练。我们专注于GPU优化,但我们认为提供的技术可以与其他AI推理平台一起使用。为了展示DNN模型优化,我们在流行的Edge AI推理平台(Nvidia Jetson Agx Xavier)上改善了光流的最先进的深层网络体系结构之一,RAFT ARXIV:2003.12039。
translated by 谷歌翻译
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems.This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry.The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
translated by 谷歌翻译
基于von-neumann架构的传统计算系统,数据密集型工作负载和应用程序(如机器学习)和应用程序都是基本上限制的。随着数据移动操作和能量消耗成为计算系统设计中的关键瓶颈,对近数据处理(NDP),机器学习和特别是神经网络(NN)的加速器等非传统方法的兴趣显着增加。诸如Reram和3D堆叠的新兴内存技术,这是有效地架构基于NN的基于NN的加速器,因为它们的工作能力是:高密度/低能量存储和近记忆计算/搜索引擎。在本文中,我们提出了一种为NN设计NDP架构的技术调查。通过基于所采用的内存技术对技术进行分类,我们强调了它们的相似之处和差异。最后,我们讨论了需要探索的开放挑战和未来的观点,以便改进和扩展未来计算平台的NDP架构。本文对计算机学习领域的计算机架构师,芯片设计师和研究人员来说是有价值的。
translated by 谷歌翻译
在过去十年中,已经开发出新的深度学习(DL)算法,工作负载和硬件来解决各种问题。尽管工作量和硬件生态系统的进步,DL系统的编程方法是停滞不前的。 DL工作负载从DL库中的高度优化,特定于平台和不灵活的内核,或者在新颖的操作员的情况下,通过具有强大性能的DL框架基元建立参考实现。这项工作介绍了Tensor加工基元(TPP),一个编程抽象,用于高效的DL工作负载的高效,便携式实现。 TPPS定义了一组紧凑而多才多艺的2D张镜操作员(或虚拟张量ISA),随后可以用作构建块,以在高维张量上构建复杂的运算符。 TPP规范是平台 - 不可行的,因此通过TPPS表示的代码是便携式的,而TPP实现是高度优化的,并且特定于平台。我们展示了我们使用独立内核和端到端DL&HPC工作负载完全通过TPPS表达的方法的效力和生存性,这在多个平台上优于最先进的实现。
translated by 谷歌翻译
对将AI功能从云上的数据中心转移到边缘或最终设备的需求越来越大,这是由在智能手机,AR/VR设备,自动驾驶汽车和各种汽车上运行的快速实时AI的应用程序举例说明的。物联网设备。然而,由于DNN计算需求与边缘或最终设备上的计算能力之间的较大增长差距,这种转变受到了严重的阻碍。本文介绍了XGEN的设计,这是DNN的优化框架,旨在弥合差距。 XGEN将横切共同设计作为其一阶考虑。它的全栈AI面向AI的优化包括在DNN软件堆栈的各个层的许多创新优化,所有这些优化都以合作的方式设计。独特的技术使XGEN能够优化各种DNN,包括具有极高深度的DNN(例如Bert,GPT,其他变形金刚),并生成代码比现有DNN框架中的代码快几倍,同时提供相同的准确性水平。
translated by 谷歌翻译
原则上,稀疏的神经网络应该比传统的密集网络更有效。大脑中的神经元表现出两种类型的稀疏性;它们稀疏地相互连接和稀疏活跃。当组合时,这两种类型的稀疏性,称为重量稀疏性和激活稀疏性,提出了通过两个数量级来降低神经网络的计算成本。尽管存在这种潜力,但今天的神经网络只使用重量稀疏提供适度的性能益处,因为传统的计算硬件无法有效地处理稀疏网络。在本文中,我们引入了互补稀疏性,这是一种显着提高现有硬件对双稀疏网络性能的新技术。我们证明我们可以实现高性能运行的重量稀疏网络,我们可以通过结合激活稀疏性来乘以这些加速。采用互补稀疏性,我们显示出对FPGA的推断的吞吐量和能效提高了100倍。我们分析了典型的商业卷积网络等各种内核的可扩展性和资源权衡,例如Resnet-50和MobileNetv2。我们的互补稀疏性的结果表明,重量加激活稀疏性可以是有效的缩放未来AI模型的有效组合。
translated by 谷歌翻译
一般矩阵乘法或GEMM内核在高性能计算和机器学习中占据中心位置。最近的NVIDIA GPU包括Gemm加速器,如Nvidia的张量核心。他们的剥削受到双语言问题的阻碍:它需要低级编程,这意味着低程序员的工作效率或使用只提供有限组件集的库。由于建立的组件方面的REPRASING算法经常引入开销,因此图书馆缺乏灵活性限制了探索新算法的自由。因此,使用GEMMS的研究人员无法立即享受编程生产力,高性能和研究灵活性。在本文中,我们解决了这个问题。我们在科学朱莉娅编程语言中展示了三组抽象和接口来编程宝石。界面和抽象共同设计用于研究人员的需求和朱莉娅的特征,以实现足够的担忧和灵活性的充分分离,以便在不支付性能价格的情况下轻松地扩展基本宝石。将我们的Gemms与最先进的图书馆Cublas和Cutlass进行比较,我们证明我们的性能在图书馆的相同球场中,并且在某些情况下甚至超过它,而无需在CUDA C ++中编写单行代码或者组装,而不面临灵活限制。
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
深度神经网络(DNN)的记录断裂性能具有沉重的参数化,导致外部动态随机存取存储器(DRAM)进行存储。 DRAM访问的禁用能量使得在资源受限的设备上部署DNN是不普遍的,呼叫最小化重量和数据移动以提高能量效率。我们呈现SmartDeal(SD),算法框架,以进行更高成本的存储器存储/访问的较低成本计算,以便在推理和培训中积极提高存储和能量效率。 SD的核心是一种具有结构约束的新型重量分解,精心制作以释放硬件效率潜力。具体地,我们将每个重量张量分解为小基矩阵的乘积以及大的结构稀疏系数矩阵,其非零被量化为-2的功率。由此产生的稀疏和量化的DNN致力于为数据移动和重量存储而大大降低的能量,因为由于稀疏的比特 - 操作和成本良好的计算,恢复原始权重的最小开销。除了推理之外,我们采取了另一次飞跃来拥抱节能培训,引入创新技术,以解决培训时出现的独特障碍,同时保留SD结构。我们还设计专用硬件加速器,充分利用SD结构来提高实际能源效率和延迟。我们在不同的设置中对多个任务,模型和数据集进行实验。结果表明:1)应用于推理,SD可实现高达2.44倍的能效,通过实际硬件实现评估; 2)应用于培训,储存能量降低10.56倍,减少了10.56倍和4.48倍,与最先进的训练基线相比,可忽略的准确性损失。我们的源代码在线提供。
translated by 谷歌翻译
深神经网络(DNNS)在各种机器学习(ML)应用程序中取得了巨大成功,在计算机视觉,自然语言处理和虚拟现实等中提供了高质量的推理解决方案。但是,基于DNN的ML应用程序也带来计算和存储要求的增加了很多,对于具有有限的计算/存储资源,紧张的功率预算和较小形式的嵌入式系统而言,这尤其具有挑战性。挑战还来自各种特定应用的要求,包括实时响应,高通量性能和可靠的推理准确性。为了应对这些挑战,我们介绍了一系列有效的设计方法,包括有效的ML模型设计,定制的硬件加速器设计以及硬件/软件共同设计策略,以启用嵌入式系统上有效的ML应用程序。
translated by 谷歌翻译
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power.Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; Exploiting sparsity saves 10×; Weight sharing gives 8×; Skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88×10 4 frames/sec with a power dissipation of only 600mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.
translated by 谷歌翻译
We introduce a method to train Quantized Neural Networks (QNNs) -neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At traintime the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
translated by 谷歌翻译
在小型电池约束的物流设备上部署现代TinyML任务需要高计算能效。使用非易失性存储器(NVM)的模拟内存计算(IMC)承诺在深神经网络(DNN)推理中的主要效率提高,并用作DNN权重的片上存储器存储器。然而,在系统级别尚未完全理解IMC的功能灵活性限制及其对性能,能量和面积效率的影响。为了目标实际的端到端的IOT应用程序,IMC阵列必须括在异构可编程系统中,引入我们旨在解决这项工作的新系统级挑战。我们介绍了一个非均相紧密的聚类架构,整合了8个RISC-V核心,内存计算加速器(IMA)和数字加速器。我们在高度异构的工作负载上基准测试,例如来自MobileNetv2的瓶颈层,显示出11.5倍的性能和9.5倍的能效改进,而在核心上高度优化并行执行相比。此外,我们通过将我们的异构架构缩放到多阵列加速器,探讨了在IMC阵列资源方面对全移动级DNN(MobileNetv2)的端到端推断的要求。我们的结果表明,我们的解决方案在MobileNetv2的端到端推断上,在执行延迟方面比现有的可编程架构更好,比最先进的异构解决方案更好的数量级集成内存计算模拟核心。
translated by 谷歌翻译
When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupied most of the execution time. Accordingly, extensive research has been done to reduce the computational burden of the convolution or linear layers. In recent mobile-friendly DNNs, however, the relative number of operations involved in processing these layers has significantly reduced. As a result, the proportion of the execution time of other layers, such as batch normalization layers, has increased. Thus, in this work, we conduct a detailed analysis of the batch normalization layer to efficiently reduce the runtime overhead in the batch normalization process. Backed up by the thorough analysis, we present an extremely efficient batch normalization, named LightNorm, and its associated hardware module. In more detail, we fuse three approximation techniques that are i) low bit-precision, ii) range batch normalization, and iii) block floating point. All these approximate techniques are carefully utilized not only to maintain the statistics of intermediate feature maps, but also to minimize the off-chip memory accesses. By using the proposed LightNorm hardware, we can achieve significant area and energy savings during the DNN training without hurting the training accuracy. This makes the proposed hardware a great candidate for the on-device training.
translated by 谷歌翻译
胶囊网络(CAPSNET)是图像处理的新兴趋势。与卷积神经网络相反,CAPSNET不容易受到对象变形的影响,因为对象的相对空间信息在整个网络中保存。但是,它们的复杂性主要与胶囊结构和动态路由机制有关,这使得以其原始形式部署封闭式以由小型微控制器(MCU)供电的设备几乎是不合理的。在一个智力从云到边缘迅速转移的时代,这种高复杂性对在边缘的采用capsnets的采用构成了严重的挑战。为了解决此问题,我们提出了一个API,用于执行ARM Cortex-M和RISC-V MCUS中的量化capsnet。我们的软件内核扩展了ARM CMSIS-NN和RISC-V PULP-NN,以用8位整数作为操作数支持胶囊操作。随之而来的是,我们提出了一个框架,以执行CAPSNET的训练后量化。结果显示,记忆足迹的减少近75%,准确性损失范围从0.07%到0.18%。在吞吐量方面,我们的ARM Cortex-M API可以分别在仅119.94和90.60毫秒(MS)的中型胶囊和胶囊层执行(STM32H7555ZIT6U,Cortex-M7 @ 480 MHz)。对于GAP-8 SOC(RISC-V RV32IMCXPULP @ 170 MHz),延迟分别降至7.02和38.03 ms。
translated by 谷歌翻译
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
translated by 谷歌翻译
这项工作专注于通过使用直接稀疏算法来提高一些卷积神经网络(CNNS)并提高图形处理单元(GPU)的效率。 NVIDIA深神经网络(CUDNN)图书馆是GPU的深度学习(DL)算法的最有效实现。 GPU是深度学习计算最常用的加速器。提高CNN模型效率的最常用技术之一是重量修剪和量化。修剪有两种主要类型:结构和非结构性。首先,可以在许多类型的加速器上更容易地加速,但是通过这种类型,难以实现稀疏度水平和高精度,与第二种类型一样高。在一些深入的CNN模型中,刷新的非结构修剪可以产生高达90%或更多的重量张量,其稀疏性。在本文中,提出了修剪算法,这使得可以在没有精确下降的情况下实现高稀疏水平。在下一阶段,线性和非线性量化适用于进一步的时间和占地面积。本文是一篇关于有效修剪技术的先前发表的论文,并且目前具有高稀疏性的实际模型和降低精度,可以实现比CUDNN图书馆更好的性能。
translated by 谷歌翻译
培训广泛和深度神经网络(DNN)需要大量的存储资源,例如内存,因为在转发传播期间必须在存储器中保存中间激活数据,然后恢复以便向后传播。然而,由于硬件设计约束,诸如GPU之类的最先进的加速器(例如GPU)仅配备了非常有限的存储容量,这显着限制了在训练大规模DNN时的最大批量大小和性能加速。传统的记忆保存技术均受性能开销或受限互连带宽或特定互连技术的约束。在本文中,我们提出了一种新颖的记忆高效的CNN训练框架(称为Comet),利用错误界限的损耗压缩来显着降低训练的内存要求,以允许培训更大的模型或加速培训。不同于采用基于图像的有损压缩机(例如JPEG)的最先进的解决方案来压缩激活数据,我们的框架故意采用严格的错误控制机制来采用错误界限的损耗压缩。具体而言,我们对从改变的激活数据传播到梯度的压缩误差传播的理论分析,并经验探讨改变梯度对训练过程的影响。基于这些分析,我们优化了误报的损耗压缩,并提出了一种用于激活数据压缩的自适应误差控制方案。我们评估我们对最先进的解决方案的设计,其中包含五个广泛采用的CNN和Imagenet DataSet。实验表明,我们所提出的框架可以在基线训练中显着降低13.5倍,并分别在另一个最先进的基于压缩框架上的1.8倍,几乎没有准确性损失。
translated by 谷歌翻译