复杂的深层神经网络(例如胶囊网络(CAPSNET))以计算密集型操作为代价表现出较高的学习能力。为了使其在边缘设备上的部署,我们建议利用近似计算来设计诸如SoftMax和Squash等复杂操作的近似变体。在我们的实验中,与确切功能相比,我们评估了通过ASIC设计流实施的设计和量化capsnets的准确性的区域,功耗和关键路径延迟之间的权衡。
translated by 谷歌翻译
胶囊网络(CAPSNET)是图像处理的新兴趋势。与卷积神经网络相反,CAPSNET不容易受到对象变形的影响,因为对象的相对空间信息在整个网络中保存。但是,它们的复杂性主要与胶囊结构和动态路由机制有关,这使得以其原始形式部署封闭式以由小型微控制器(MCU)供电的设备几乎是不合理的。在一个智力从云到边缘迅速转移的时代,这种高复杂性对在边缘的采用capsnets的采用构成了严重的挑战。为了解决此问题,我们提出了一个API,用于执行ARM Cortex-M和RISC-V MCUS中的量化capsnet。我们的软件内核扩展了ARM CMSIS-NN和RISC-V PULP-NN,以用8位整数作为操作数支持胶囊操作。随之而来的是,我们提出了一个框架,以执行CAPSNET的训练后量化。结果显示,记忆足迹的减少近75%,准确性损失范围从0.07%到0.18%。在吞吐量方面,我们的ARM Cortex-M API可以分别在仅119.94和90.60毫秒(MS)的中型胶囊和胶囊层执行(STM32H7555ZIT6U,Cortex-M7 @ 480 MHz)。对于GAP-8 SOC(RISC-V RV32IMCXPULP @ 170 MHz),延迟分别降至7.02和38.03 ms。
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
In this work, we demonstrate the offline FPGA realization of both recurrent and feedforward neural network (NN)-based equalizers for nonlinearity compensation in coherent optical transmission systems. First, we present a realization pipeline showing the conversion of the models from Python libraries to the FPGA chip synthesis and implementation. Then, we review the main alternatives for the hardware implementation of nonlinear activation functions. The main results are divided into three parts: a performance comparison, an analysis of how activation functions are implemented, and a report on the complexity of the hardware. The performance in Q-factor is presented for the cases of bidirectional long-short-term memory coupled with convolutional NN (biLSTM + CNN) equalizer, CNN equalizer, and standard 1-StpS digital back-propagation (DBP) for the simulation and experiment propagation of a single channel dual-polarization (SC-DP) 16QAM at 34 GBd along 17x70km of LEAF. The biLSTM+CNN equalizer provides a similar result to DBP and a 1.7 dB Q-factor gain compared with the chromatic dispersion compensation baseline in the experimental dataset. After that, we assess the Q-factor and the impact of hardware utilization when approximating the activation functions of NN using Taylor series, piecewise linear, and look-up table (LUT) approximations. We also show how to mitigate the approximation errors with extra training and provide some insights into possible gradient problems in the LUT approximation. Finally, to evaluate the complexity of hardware implementation to achieve 400G throughput, fixed-point NN-based equalizers with approximated activation functions are developed and implemented in an FPGA.
translated by 谷歌翻译
我们提出了一个新颖的框架,用于设计无乘数内核机器,该机器可以在智能边缘设备等资源约束平台上使用。该框架使用基于边缘传播(MP)技术的分段线性(PWL)近似值,仅使用加法/减法,移位,比较和寄存器底流/溢出操作。我们建议使用针对现场可编程门阵列(FPGA)平台进行优化的基于硬件的MP推理和在线培训算法。我们的FPGA实施消除了对DSP单元的需求,并减少了LUT的数量。通过重复使用相同的硬件进行推理和培训,我们表明该平台可以克服由MP近似产生的分类错误和本地最小值。该提议的无乘数MP-Kernel机器在FPGA上的实施导致估计的能源消耗为13.4 PJ,功率消耗为107 MW,每台均具有〜9K LUTS和FFS,每张均具有256 x 32个大小的核与其他可比实现相比,区域和区域。
translated by 谷歌翻译
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems.This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry.The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
translated by 谷歌翻译
在当今智能网络物理系统时代,由于它们在复杂的现实世界应用中的最新性能,深度神经网络(DNN)已无处不在。这些网络的高计算复杂性转化为增加的能源消耗,这是在资源受限系统中部署大型DNN的首要障碍。通过培训后量化实现的定点(FP)实现通常用于减少这些网络的能源消耗。但是,FP中的均匀量化间隔将数据结构的位宽度限制为大值,因为需要以足够的分辨率来表示大多数数字并避免较高的量化误差。在本文中,我们利用了关键见解,即(在大多数情况下)DNN的权重和激活主要集中在零接近零,只有少数几个具有较大的幅度。我们提出了Conlocnn,该框架是通过利用来实现节能低精度深度卷积神经网络推断的框架:(1)重量的不均匀量化,以简化复杂的乘法操作的简化; (2)激活值之间的相关性,可以在低成本的情况下以低成本进行部分补偿,而无需任何运行时开销。为了显着从不均匀的量化中受益,我们还提出了一种新颖的数据表示格式,编码低精度二进制签名数字,以压缩重量的位宽度,同时确保直接使用编码的权重来使用新颖的多重和处理 - 积累(MAC)单元设计。
translated by 谷歌翻译
We introduce a method to train Quantized Neural Networks (QNNs) -neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At traintime the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
translated by 谷歌翻译
诸如GELU,LIZESION和SOFTMAX之类的非线性操作是变压器模型的必备且昂贵的构建块。有几种先前的作品简化了这些操作,使用查找表或整数计算,但是这种近似值遭受了更低的精度或相当大的硬件成本,并且长期延迟。本文提出了一种精确且硬件友好的近似框架,用于高效变压器推断。我们的框架采用简单的神经网络作为通用近似器,其结构等效地转换成LUT。拟议的框架,称为NN-LUT可以准确地更换流行伯特模型中的所有非线性操作,在面积,功耗和延迟中显着降低。
translated by 谷歌翻译
基于von-neumann架构的传统计算系统,数据密集型工作负载和应用程序(如机器学习)和应用程序都是基本上限制的。随着数据移动操作和能量消耗成为计算系统设计中的关键瓶颈,对近数据处理(NDP),机器学习和特别是神经网络(NN)的加速器等非传统方法的兴趣显着增加。诸如Reram和3D堆叠的新兴内存技术,这是有效地架构基于NN的基于NN的加速器,因为它们的工作能力是:高密度/低能量存储和近记忆计算/搜索引擎。在本文中,我们提出了一种为NN设计NDP架构的技术调查。通过基于所采用的内存技术对技术进行分类,我们强调了它们的相似之处和差异。最后,我们讨论了需要探索的开放挑战和未来的观点,以便改进和扩展未来计算平台的NDP架构。本文对计算机学习领域的计算机架构师,芯片设计师和研究人员来说是有价值的。
translated by 谷歌翻译
深神经网络(DNNS)在各种机器学习(ML)应用程序中取得了巨大成功,在计算机视觉,自然语言处理和虚拟现实等中提供了高质量的推理解决方案。但是,基于DNN的ML应用程序也带来计算和存储要求的增加了很多,对于具有有限的计算/存储资源,紧张的功率预算和较小形式的嵌入式系统而言,这尤其具有挑战性。挑战还来自各种特定应用的要求,包括实时响应,高通量性能和可靠的推理准确性。为了应对这些挑战,我们介绍了一系列有效的设计方法,包括有效的ML模型设计,定制的硬件加速器设计以及硬件/软件共同设计策略,以启用嵌入式系统上有效的ML应用程序。
translated by 谷歌翻译
在本文中,我们提出了一种方法,以最大程度地减少训练有素的卷积神经网络(Convnet)的计算复杂性。这个想法是要近似给定的Convnet的所有元素,并替换原始的卷积过滤器和参数(汇总和偏置系数;以及激活函数),并有效地近似计算复杂性。低复杂性卷积过滤器是通过基于Frobenius Norm的二进制(零)线性编程方案获得的,该方案在一组二元理性的集合上获得。最终的矩阵允许无乘法计算,仅需要添加和位移动操作。这样的低复杂性结构为低功率,高效的硬件设计铺平了道路。我们将方法应用于三种不同复杂性的用例中:(i)“轻”但有效的转换供面部检测(约有1000个参数); (ii)另一个用于手写数字分类的(超过180000个参数); (iii)一个明显更大的Convnet:Alexnet,$ \ $ \ $ 120万美元。我们评估了不同近似级别的各个任务的总体绩效。在所有考虑的应用中,都得出了非常低的复杂性近似值,以保持几乎相等的分类性能。
translated by 谷歌翻译
当今的大多数计算机视觉管道都是围绕深神经网络构建的,卷积操作需要大部分一般的计算工作。与标准算法相比,Winograd卷积算法以更少的MAC计算卷积,当使用具有2x2尺寸瓷砖$ F_2 $的版本时,3x3卷积的操作计数为2.25倍。即使收益很大,Winograd算法具有较大的瓷砖尺寸,即$ f_4 $,在提高吞吐量和能源效率方面具有更大的潜力,因为它将所需的MAC降低了4倍。不幸的是,具有较大瓷砖尺寸的Winograd算法引入了数值问题,这些问题阻止了其在整数域特异性加速器上的使用和更高的计算开销,以在空间和Winograd域之间转换输入和输出数据。为了解锁Winograd $ F_4 $的全部潜力,我们提出了一种新颖的Tap-Wise量化方法,该方法克服了使用较大瓷砖的数值问题,从而实现了仅整数的推断。此外,我们介绍了以功率和区域效率的方式处理Winograd转换的自定义硬件单元,并展示了如何将此类自定义模块集成到工业级,可编程的DSA中。对大量最先进的计算机视觉基准进行了广泛的实验评估表明,Tap-Wise量化算法使量化的Winograd $ F_4 $网络几乎与FP32基线一样准确。 Winograd增强的DSA可实现高达1.85倍的能源效率,最高可用于最先进的细分和检测网络的端到端速度高达1.83倍。
translated by 谷歌翻译
本文介绍了有关如何架构,设计和优化深神经网络(DNN)的最新概述,以提高性能并保留准确性。该论文涵盖了一组跨越整个机器学习处理管道的优化。我们介绍两种类型的优化。第一个改变了DNN模型,需要重新训练,而第二个则不训练。我们专注于GPU优化,但我们认为提供的技术可以与其他AI推理平台一起使用。为了展示DNN模型优化,我们在流行的Edge AI推理平台(Nvidia Jetson Agx Xavier)上改善了光流的最先进的深层网络体系结构之一,RAFT ARXIV:2003.12039。
translated by 谷歌翻译
Resistive Random-Access Memory (RRAM) is well-suited to accelerate neural network (NN) workloads as RRAM-based Processing-in-Memory (PIM) architectures natively support highly-parallel multiply-accumulate (MAC) operations that form the backbone of most NN workloads. Unfortunately, NN workloads such as transformers require support for non-MAC operations (e.g., softmax) that RRAM cannot provide natively. Consequently, state-of-the-art works either integrate additional digital logic circuits to support the non-MAC operations or offload the non-MAC operations to CPU/GPU, resulting in significant performance and energy efficiency overheads due to data movement. In this work, we propose NEON, a novel compiler optimization to enable the end-to-end execution of the NN workload in RRAM. The key idea of NEON is to transform each non-MAC operation into a lightweight yet highly-accurate neural network. Utilizing neural networks to approximate the non-MAC operations provides two advantages: 1) We can exploit the key strength of RRAM, i.e., highly-parallel MAC operation, to flexibly and efficiently execute non-MAC operations in memory. 2) We can simplify RRAM's microarchitecture by eliminating the additional digital logic circuits while reducing the data movement overheads. Acceleration of the non-MAC operations in memory enables NEON to achieve a 2.28x speedup compared to an idealized digital logic-based RRAM. We analyze the trade-offs associated with the transformation and demonstrate feasible use cases for NEON across different substrates.
translated by 谷歌翻译
已经证明量化是提高深神经网络推理效率的重要方法(DNN)。然而,在将DNN权重或从高精度格式从高精度格式量化到它们量化的对应物的同时,在准确性和效率之间取得良好的平衡仍然具有挑战性。我们提出了一种称为弹性显着位量化(ESB)的新方法,可控制量化值的有效位数,以获得具有更少资源的更好的推理准确性。我们设计一个统一的数学公式,以限制ESB的量化值,具有灵活的有效位。我们还引入了分布差对准器(DDA),以定量对齐全精密重量或激活值和量化值之间的分布。因此,ESB适用于各种重量和DNN的激活的各种钟形分布,从而保持高推理精度。从较少的量化值中受益于较少的量化值,ESB可以降低乘法复杂性。我们将ESB实施为加速器,并定量评估其对FPGA的效率。广泛的实验结果表明,ESB量化始终如一地优于最先进的方法,并分别通过AlexNet,Resnet18和MobileNetv2的平均精度提高4.78%,1.92%和3.56%。此外,ESB作为加速器可以在Xilinx ZCU102 FPGA平台上实现1K LUT的10.95 GOPS峰值性能。与FPGA上的CPU,GPU和最先进的加速器相比,ESB加速器可以分别将能效分别提高到65倍,11x和26倍。
translated by 谷歌翻译
定制硬件(HW)的快速进展,用于加速深神经网络(DNN)的推广速度。以前,软MAX层不是DNN加速HW的主要关注点,因为其部分在多层的感知或卷积神经网络中相对较小。然而,由于注意机制广泛用于各种现代DNN,因此软墨幅层的成本效益实现变得非常重要。在本文中,我们提出了两种方法来近似软MAX计算,这是基于查找表(LUT)的使用。所需的LUT大小非常小(约700字节),因为如果将归一化应用于输入,则Softmax的分子和分母的范围是稳定的。我们通过各种基准(Coco17,WMT14,WMT17,胶水)验证了通过不同AI任务(对象检测,机器翻译,情感分析和语义等效)和DNN型号(DETR,Transformer,BERT)的提出的技术(Detr,Transformer,Bert)。我们表明,8位近似允许获得可接受的精度损失低于1.0 \%$。
translated by 谷歌翻译
代表低精度的深度神经网络(DNN)是一种有希望的方法来实现有效的加速和记忆力。以前的方法在低精度中培训DNN的方法通常在重量更新期间在高精度中保持重量的重量副本。由于低精度数字系统与学习算法之间的复杂相互作用,直接具有低精度重量的培训导致精度下降。为了解决这个问题,我们开发了一个共同设计的低精度训练框架,被称为LNS-MADAM,我们共同设计了对数号系统(LNS)和乘法权重算法(MADAM)。我们证明了LNS-MADAM在重量更新期间导致低量化误差,即使精度有限,也导致稳定的收敛。我们进一步提出了LNS-MADAM的硬件设计,可以解决实现LNS计算的有效数据路径的实际挑战。我们的实现有效地降低了LNS - 整数转换和部分总和累积所产生的能量开销。实验结果表明,LNS-MADAM为全精密对应物达到了可比的准确性,只有8位对流行的计算机视觉和自然语言任务。与全精密浮点实施相比,LNS-MADAM将能耗降低超过90。
translated by 谷歌翻译
在小型电池约束的物流设备上部署现代TinyML任务需要高计算能效。使用非易失性存储器(NVM)的模拟内存计算(IMC)承诺在深神经网络(DNN)推理中的主要效率提高,并用作DNN权重的片上存储器存储器。然而,在系统级别尚未完全理解IMC的功能灵活性限制及其对性能,能量和面积效率的影响。为了目标实际的端到端的IOT应用程序,IMC阵列必须括在异构可编程系统中,引入我们旨在解决这项工作的新系统级挑战。我们介绍了一个非均相紧密的聚类架构,整合了8个RISC-V核心,内存计算加速器(IMA)和数字加速器。我们在高度异构的工作负载上基准测试,例如来自MobileNetv2的瓶颈层,显示出11.5倍的性能和9.5倍的能效改进,而在核心上高度优化并行执行相比。此外,我们通过将我们的异构架构缩放到多阵列加速器,探讨了在IMC阵列资源方面对全移动级DNN(MobileNetv2)的端到端推断的要求。我们的结果表明,我们的解决方案在MobileNetv2的端到端推断上,在执行延迟方面比现有的可编程架构更好,比最先进的异构解决方案更好的数量级集成内存计算模拟核心。
translated by 谷歌翻译
When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupied most of the execution time. Accordingly, extensive research has been done to reduce the computational burden of the convolution or linear layers. In recent mobile-friendly DNNs, however, the relative number of operations involved in processing these layers has significantly reduced. As a result, the proportion of the execution time of other layers, such as batch normalization layers, has increased. Thus, in this work, we conduct a detailed analysis of the batch normalization layer to efficiently reduce the runtime overhead in the batch normalization process. Backed up by the thorough analysis, we present an extremely efficient batch normalization, named LightNorm, and its associated hardware module. In more detail, we fuse three approximation techniques that are i) low bit-precision, ii) range batch normalization, and iii) block floating point. All these approximate techniques are carefully utilized not only to maintain the statistics of intermediate feature maps, but also to minimize the off-chip memory accesses. By using the proposed LightNorm hardware, we can achieve significant area and energy savings during the DNN training without hurting the training accuracy. This makes the proposed hardware a great candidate for the on-device training.
translated by 谷歌翻译