We introduce a method to train Quantized Neural Networks (QNNs) -neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At traintime the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
translated by 谷歌翻译
In this work we introduce a binarized deep neural network (BDNN) model. BDNNs are trained using a novel binarized back propagation algorithm (BBP), which uses binary weights and binary neurons during the forward and backward propagation, while retaining precision of the stored weights in which gradients are accumulated. At test phase, BDNNs are fully binarized and can be implemented in hardware with low circuit complexity. The proposed binarized networks can be implemented using binary convolutions and proxy matrix multiplications with only standard binary XNOR and population count (popcount) operations. BBP is expected to reduce energy consumption by at least two orders of magnitude when compared to the hardware implementation of existing training algorithms. We obtained near state-of-the-art results with BDNNs on the permutation-invariant MNIST, CIFAR-10 and SVHN datasets.
translated by 谷歌翻译
Deep Neural Networks (DNN) have achieved state-of-the-art results in a wide range of tasks, with the best results obtained with large training sets and large models. In the past, GPUs enabled these breakthroughs because of their greater computational speed. In the future, faster computation at both training and test time is likely to be crucial for further progress and for consumer applications on low-power devices. As a result, there is much interest in research and development of dedicated hardware for Deep Learning (DL). Binary weights, i.e., weights which are constrained to only two possible values (e.g. -1 or 1), would bring great benefits to specialized DL hardware by replacing many multiply-accumulate operations by simple accumulations, as multipliers are the most space and powerhungry components of the digital implementation of neural networks. We introduce BinaryConnect, a method which consists in training a DNN with binary weights during the forward and backward propagations, while retaining precision of the stored weights in which gradients are accumulated. Like other dropout schemes, we show that BinaryConnect acts as regularizer and we obtain near state-of-the-art results with BinaryConnect on the permutation-invariant MNIST, CIFAR-10 and SVHN.
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32× memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of number of the high precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.
translated by 谷歌翻译
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems.This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry.The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
translated by 谷歌翻译
我们日常生活中的深度学习是普遍存在的,包括自驾车,虚拟助理,社交网络服务,医疗服务,面部识别等,但是深度神经网络在训练和推理期间需要大量计算资源。该机器学习界主要集中在模型级优化(如深度学习模型的架构压缩),而系统社区则专注于实施级别优化。在其间,在算术界中提出了各种算术级优化技术。本文在模型,算术和实施级技术方面提供了关于资源有效的深度学习技术的调查,并确定了三种不同级别技术的资源有效的深度学习技术的研究差距。我们的调查基于我们的资源效率度量定义,阐明了较低级别技术的影响,并探讨了资源有效的深度学习研究的未来趋势。
translated by 谷歌翻译
Although weight and activation quantization is an effective approach for Deep Neural Network (DNN) compression and has a lot of potentials to increase inference speed leveraging bit-operations, there is still a noticeable gap in terms of prediction accuracy between the quantized model and the full-precision model. To address this gap, we propose to jointly train a quantized, bit-operation-compatible DNN and its associated quantizers, as opposed to using fixed, handcrafted quantization schemes such as uniform or logarithmic quantization. Our method for learning the quantizers applies to both network weights and activations with arbitrary-bit precision, and our quantizers are easy to train. The comprehensive experiments on CIFAR-10 and ImageNet datasets show that our method works consistently well for various network structures such as AlexNet, VGG-Net, GoogLeNet, ResNet, and DenseNet, surpassing previous quantization methods in terms of accuracy by an appreciable margin. Code available at https://github.com/Microsoft/LQ-Nets
translated by 谷歌翻译
混合精确的深神经网络达到了硬件部署所需的能源效率和吞吐量,尤其是在资源有限的情况下,而无需牺牲准确性。但是,不容易找到保留精度的最佳每层钻头精度,尤其是在创建巨大搜索空间的大量模型,数据集和量化技术中。为了解决这一困难,最近出现了一系列文献,并且已经提出了一些实现有希望的准确性结果的框架。在本文中,我们首先总结了文献中通常使用的量化技术。然后,我们对混合精液框架进行了彻底的调查,该调查是根据其优化技术进行分类的,例如增强学习和量化技术,例如确定性舍入。此外,讨论了每个框架的优势和缺点,我们在其中呈现并列。我们最终为未来的混合精液框架提供了指南。
translated by 谷歌翻译
日益复杂的机器学习模型的不断增长的计算需求通常需要使用强大的基于云的基础架构进行培训。已知二元神经网络由于其极端的计算和内存节省了更高精确的替代方案,因此有望进行现场推断。但是,他们现有的训练方法需要同时存储所有层的高精度激活,这通常使在内存受限的设备上学习不可行。在本文中,我们证明了二进制神经网络训练所需的向后传播操作对量化非常强大,从而使现代模型的现场学习成为实用命题。我们介绍了一种低成本的二元神经网络训练策略,该策略表现出相当大的记忆范围减少,同时几乎没有准确的损失与Courbariaux&Bengio的标准方法。这些减少主要是通过仅以二进制格式保留激活来实现的。在后一种算法上,我们的置换替换量看到记忆需求减少3--5 $ \ times $,同时在可比时间内达到相似的测试准确性,这些型号跨越了一系列经过培训的小型模型,用于对流行数据集进行分类。我们还展示了对二进制RESNET-18的从划痕成像网训练,并实现了3.78 $ \ times $减少内存。我们的工作是开源的,包括覆盆子Pi靶向原型,我们用来验证建模的内存降低并捕获相关的能量滴。这样的节省将避免不必要的云下载,减少延迟,提高能源效率和保护最终用户的隐私。
translated by 谷歌翻译
由于神经网络变得更加强大,因此在现实世界中部署它们的愿望是一个上升的愿望;然而,神经网络的功率和准确性主要是由于它们的深度和复杂性,使得它们难以部署,尤其是在资源受限的设备中。最近出现了神经网络量化,以满足这种需求通过降低网络的精度来降低神经网络的大小和复杂性。具有较小和更简单的网络,可以在目标硬件的约束中运行神经网络。本文调查了在过去十年中开发的许多神经网络量化技术。基于该调查和神经网络量化技术的比较,我们提出了该地区的未来研究方向。
translated by 谷歌翻译
When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupied most of the execution time. Accordingly, extensive research has been done to reduce the computational burden of the convolution or linear layers. In recent mobile-friendly DNNs, however, the relative number of operations involved in processing these layers has significantly reduced. As a result, the proportion of the execution time of other layers, such as batch normalization layers, has increased. Thus, in this work, we conduct a detailed analysis of the batch normalization layer to efficiently reduce the runtime overhead in the batch normalization process. Backed up by the thorough analysis, we present an extremely efficient batch normalization, named LightNorm, and its associated hardware module. In more detail, we fuse three approximation techniques that are i) low bit-precision, ii) range batch normalization, and iii) block floating point. All these approximate techniques are carefully utilized not only to maintain the statistics of intermediate feature maps, but also to minimize the off-chip memory accesses. By using the proposed LightNorm hardware, we can achieve significant area and energy savings during the DNN training without hurting the training accuracy. This makes the proposed hardware a great candidate for the on-device training.
translated by 谷歌翻译
在本文中,我们提出了一种方法,以最大程度地减少训练有素的卷积神经网络(Convnet)的计算复杂性。这个想法是要近似给定的Convnet的所有元素,并替换原始的卷积过滤器和参数(汇总和偏置系数;以及激活函数),并有效地近似计算复杂性。低复杂性卷积过滤器是通过基于Frobenius Norm的二进制(零)线性编程方案获得的,该方案在一组二元理性的集合上获得。最终的矩阵允许无乘法计算,仅需要添加和位移动操作。这样的低复杂性结构为低功率,高效的硬件设计铺平了道路。我们将方法应用于三种不同复杂性的用例中:(i)“轻”但有效的转换供面部检测(约有1000个参数); (ii)另一个用于手写数字分类的(超过180000个参数); (iii)一个明显更大的Convnet:Alexnet,$ \ $ \ $ 120万美元。我们评估了不同近似级别的各个任务的总体绩效。在所有考虑的应用中,都得出了非常低的复杂性近似值,以保持几乎相等的分类性能。
translated by 谷歌翻译
已经证明量化是提高深神经网络推理效率的重要方法(DNN)。然而,在将DNN权重或从高精度格式从高精度格式量化到它们量化的对应物的同时,在准确性和效率之间取得良好的平衡仍然具有挑战性。我们提出了一种称为弹性显着位量化(ESB)的新方法,可控制量化值的有效位数,以获得具有更少资源的更好的推理准确性。我们设计一个统一的数学公式,以限制ESB的量化值,具有灵活的有效位。我们还引入了分布差对准器(DDA),以定量对齐全精密重量或激活值和量化值之间的分布。因此,ESB适用于各种重量和DNN的激活的各种钟形分布,从而保持高推理精度。从较少的量化值中受益于较少的量化值,ESB可以降低乘法复杂性。我们将ESB实施为加速器,并定量评估其对FPGA的效率。广泛的实验结果表明,ESB量化始终如一地优于最先进的方法,并分别通过AlexNet,Resnet18和MobileNetv2的平均精度提高4.78%,1.92%和3.56%。此外,ESB作为加速器可以在Xilinx ZCU102 FPGA平台上实现1K LUT的10.95 GOPS峰值性能。与FPGA上的CPU,GPU和最先进的加速器相比,ESB加速器可以分别将能效分别提高到65倍,11x和26倍。
translated by 谷歌翻译
深度神经网络(DNN)的记录断裂性能具有沉重的参数化,导致外部动态随机存取存储器(DRAM)进行存储。 DRAM访问的禁用能量使得在资源受限的设备上部署DNN是不普遍的,呼叫最小化重量和数据移动以提高能量效率。我们呈现SmartDeal(SD),算法框架,以进行更高成本的存储器存储/访问的较低成本计算,以便在推理和培训中积极提高存储和能量效率。 SD的核心是一种具有结构约束的新型重量分解,精心制作以释放硬件效率潜力。具体地,我们将每个重量张量分解为小基矩阵的乘积以及大的结构稀疏系数矩阵,其非零被量化为-2的功率。由此产生的稀疏和量化的DNN致力于为数据移动和重量存储而大大降低的能量,因为由于稀疏的比特 - 操作和成本良好的计算,恢复原始权重的最小开销。除了推理之外,我们采取了另一次飞跃来拥抱节能培训,引入创新技术,以解决培训时出现的独特障碍,同时保留SD结构。我们还设计专用硬件加速器,充分利用SD结构来提高实际能源效率和延迟。我们在不同的设置中对多个任务,模型和数据集进行实验。结果表明:1)应用于推理,SD可实现高达2.44倍的能效,通过实际硬件实现评估; 2)应用于培训,储存能量降低10.56倍,减少了10.56倍和4.48倍,与最先进的训练基线相比,可忽略的准确性损失。我们的源代码在线提供。
translated by 谷歌翻译
在资源受限的嵌入式系统上部署卷积神经网络的关键推动力是二进制神经网络(BNN)。 BNNS通过将功能和权重进行分配来保存内存并简化计算。不幸的是,二进制不可避免地伴随着准确性的严重降低。为了减少二进制和完整精确网络之间的准确性差距,最近提出了许多维修方法,我们已经将其分类并在本章中进行了单一概述。维修方法分为两个主要分支,培训技术和网络拓扑变化,可以进一步分为较小的类别。后一个类别为嵌入式系统引入了额外的成本(能源消耗或额外的面积),而前者则没有。从我们的概述中,我们可以观察到在减少准确性差距方面取得了进展,但是BNN论文并不对应使用哪种修复方法进行对齐,以获得高度准确的BNN。因此,本章包含一项经验综述,该综述评估了许多维修方法的好处,而不是Resnet-20 \&Cifar10和Resnet-18 \&Cifar100基准。我们发现三个维修类别最有益:功能二进制器,功能归一化和双重残留。基于这篇评论,我们讨论未来的方向和研究机会。我们勾勒出与BNN在嵌入式系统上相关的收益和成本,因为BNN是否能够缩小准确性差距,同时在资源受限的嵌入式系统上保持高能效率仍然有待观察。
translated by 谷歌翻译
代表低精度的深度神经网络(DNN)是一种有希望的方法来实现有效的加速和记忆力。以前的方法在低精度中培训DNN的方法通常在重量更新期间在高精度中保持重量的重量副本。由于低精度数字系统与学习算法之间的复杂相互作用,直接具有低精度重量的培训导致精度下降。为了解决这个问题,我们开发了一个共同设计的低精度训练框架,被称为LNS-MADAM,我们共同设计了对数号系统(LNS)和乘法权重算法(MADAM)。我们证明了LNS-MADAM在重量更新期间导致低量化误差,即使精度有限,也导致稳定的收敛。我们进一步提出了LNS-MADAM的硬件设计,可以解决实现LNS计算的有效数据路径的实际挑战。我们的实现有效地降低了LNS - 整数转换和部分总和累积所产生的能量开销。实验结果表明,LNS-MADAM为全精密对应物达到了可比的准确性,只有8位对流行的计算机视觉和自然语言任务。与全精密浮点实施相比,LNS-MADAM将能耗降低超过90。
translated by 谷歌翻译
胶囊网络(CAPSNET)是图像处理的新兴趋势。与卷积神经网络相反,CAPSNET不容易受到对象变形的影响,因为对象的相对空间信息在整个网络中保存。但是,它们的复杂性主要与胶囊结构和动态路由机制有关,这使得以其原始形式部署封闭式以由小型微控制器(MCU)供电的设备几乎是不合理的。在一个智力从云到边缘迅速转移的时代,这种高复杂性对在边缘的采用capsnets的采用构成了严重的挑战。为了解决此问题,我们提出了一个API,用于执行ARM Cortex-M和RISC-V MCUS中的量化capsnet。我们的软件内核扩展了ARM CMSIS-NN和RISC-V PULP-NN,以用8位整数作为操作数支持胶囊操作。随之而来的是,我们提出了一个框架,以执行CAPSNET的训练后量化。结果显示,记忆足迹的减少近75%,准确性损失范围从0.07%到0.18%。在吞吐量方面,我们的ARM Cortex-M API可以分别在仅119.94和90.60毫秒(MS)的中型胶囊和胶囊层执行(STM32H7555ZIT6U,Cortex-M7 @ 480 MHz)。对于GAP-8 SOC(RISC-V RV32IMCXPULP @ 170 MHz),延迟分别降至7.02和38.03 ms。
translated by 谷歌翻译
用于压缩神经网络的非均匀量化策略通常实现的性能比其对应于对应物,即统一的策略,因为其优越的代表性能力。然而,许多非均匀量化方法在实现不均匀量化的权重/激活时忽略了复杂的投影过程,这在硬件部署中引起了不可忽略的时间和空间开销。在这项研究中,我们提出了非均匀致均匀的量化(N2UQ),一种方法,其能够保持非均匀方法的强表示能力,同时硬件友好且有效地作为模型推理的均匀量化。我们通过学习灵活的等距输入阈值来实现这一目标,以更好地拟合潜在的分布,同时将这些实值输入量化为等距输出电平。要使用可学习的输入阈值训练量化网络,我们将广义直通估计器(G-STE)介绍,用于难以应答的后向衍生计算W.r.t.阈值参数。此外,我们考虑熵保持正则化,以进一步降低重量量化的信息损失。即使在这种不利约束的施加均匀量化的重量和激活的情况下,我们的N2UQ也经历了最先进的非均匀量化方法,在想象中达到了0.7〜1.8%,展示了N2UQ设计的贡献。代码将公开可用。
translated by 谷歌翻译
神经网络的外包计算允许用户访问艺术模型的状态,而无需投资专门的硬件和专业知识。问题是用户对潜在的隐私敏感数据失去控制。通过同性恋加密(HE)可以在加密数据上执行计算,而不会显示其内容。在这种知识的系统化中,我们深入了解与隐私保留的神经网络相结合的方法。我们将更改分类为神经网络模型和架构,使其在他和这些变化的影响方面提供影响。我们发现众多挑战是基于隐私保留的深度学习,例如通过加密方案构成的计算开销,可用性和限制。
translated by 谷歌翻译