This paper proposes a training method with multiple cyclic training stages that achieves enhanced performance in low-bit quantized convolutional neural networks (CNNs). Quantization is a popular approach for obtaining lightweight CNNs, and initialization from a pretrained model is widely used to overcome the degraded performance of low-bit quantization. However, the large quantization error between real values and their low-bit quantized counterparts makes it difficult to achieve acceptable performance on complex networks and large datasets. The proposed training method softly delivers the knowledge of the pretrained model to the low-bit quantized model over multiple quantization steps. In each quantization step, the trained weights of one model are used to initialize the weights of the next model, whose quantization bit depth is reduced by one. With only a small change in quantization bit depth, the performance gap can be bridged, providing better weight initialization. In cyclic training, after a low-bit quantized model is trained, its trained weights are used to initialize its higher-precision model to be trained. By iteratively exploiting the better training ability of the higher-precision model, the method produces enhanced trained weights for the low-bit quantized model at each cycle. Notably, the training method improves the Top-1 and Top-5 accuracy of a binarized ResNet-18 on the ImageNet dataset by 5.80% and 6.85%, respectively.
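As a rough illustration of the schedule described above, the sketch below wraps an arbitrary quantization-aware training routine; `train_qat` and the bit-depth range are assumptions standing in for the paper's actual training setup.

    import copy

    def cyclic_progressive_qat(model_fp, train_qat, high_bits=8, low_bits=1, cycles=2):
        """Minimal sketch of the cyclic, progressively lower-bit schedule.

        model_fp  : a pretrained full-precision model used for initialization.
        train_qat : any QAT routine train_qat(model, bits) -> trained model.
        Downward pass: drop the bit depth one bit at a time so each model is
        warm-started from its (bits + 1) predecessor; cyclic step: re-train at
        higher precision from the low-bit weights, then repeat the descent.
        """
        model = copy.deepcopy(model_fp)
        for _ in range(cycles):
            for bits in range(high_bits, low_bits - 1, -1):
                model = train_qat(model, bits)   # warm start from previous step
            model = train_qat(model, high_bits)  # lift weights back up for next cycle
        return model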
We introduce a method to train Quantized Neural Networks (QNNs), i.e., neural networks with extremely low precision (e.g., 1-bit) weights and activations at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6 bits as well, which enables gradient computation using only bit-wise operations. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved accuracy comparable to their 32-bit counterparts using only 4 bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
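The core training trick, binarized forward passes with a straight-through gradient estimator, can be sketched in a few lines of PyTorch; the clipping window |x| <= 1 follows the standard BinaryNet-style estimator, and sign(0) = 0 in PyTorch is a corner case a real implementation would handle.

    import torch

    class BinarizeSTE(torch.autograd.Function):
        """sign() in the forward pass; straight-through gradient, zeroed
        outside |x| <= 1, in the backward pass."""
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return torch.sign(x)

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            return grad_out * (x.abs() <= 1).to(grad_out.dtype)

    def binary_linear(x, weight_fp):
        # Binarize only for the forward pass; the latent full-precision
        # weights keep accumulating the small gradient updates.
        return BinarizeSTE.apply(x) @ BinarizeSTE.apply(weight_fp).t()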
Although considerable progress has been made in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices, as one dedicated model needs to be trained, transmitted, and stored for each specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism that allows us to obtain multiple quantized networks from one full-precision source model by progressively mapping the higher-precision weights to their adjacent lower-precision counterparts. Then, with networks of different bit-widths derived from one source model, multi-objective optimization is employed to train the shared source weights such that they are updated simultaneously, considering the performance of all networks. In this way, the shared weights are optimized to balance the performance of different quantized models, making them transferable among different bit-widths. Experiments show that the proposed vertical-layered representation and the developed once-QAT scheme are effective in embodying multiple quantized networks into a single one and allow one-time training, while delivering performance comparable to that of quantized models tailored to any specific bit-width. Code will be available.
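One plausible reading of the cascade downsampling step is a bit-plane view of the shared integer weights, sketched below; treating "adjacent lower precision" as dropping one least-significant bit is our assumption, not necessarily the paper's exact mapping.

    import torch

    def cascade_downsample(q_int: torch.Tensor, bits_from: int, bits_to: int):
        """Map an integer weight tensor at bits_from bits to its adjacent
        lower-precision counterparts by repeatedly dropping the LSB, so every
        low-bit network is a 'vertical layer' (prefix) of the high-bit one."""
        q = q_int
        for _ in range(bits_from - bits_to):
            q = torch.div(q, 2, rounding_mode="floor")  # drop one LSB per step
        return q

    # Training would then sum the task losses of all derived bit-widths and
    # update the single shared source model, i.e., the multi-objective step.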
Although weight and activation quantization is an effective approach for Deep Neural Network (DNN) compression and has great potential to increase inference speed by leveraging bit operations, there is still a noticeable gap in prediction accuracy between the quantized model and the full-precision model. To address this gap, we propose to jointly train a quantized, bit-operation-compatible DNN and its associated quantizers, as opposed to using fixed, handcrafted quantization schemes such as uniform or logarithmic quantization. Our method for learning the quantizers applies to both network weights and activations with arbitrary-bit precision, and our quantizers are easy to train. The comprehensive experiments on CIFAR-10 and ImageNet datasets show that our method works consistently well for various network structures such as AlexNet, VGG-Net, GoogLeNet, ResNet, and DenseNet, surpassing previous quantization methods in terms of accuracy by an appreciable margin. Code available at https://github.com/Microsoft/LQ-Nets
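To make the learned-quantizer idea concrete, here is a hedged sketch of its representational structure: a K-dimensional learnable basis whose binary combinations form the 2^K quantization levels. The alternating encoding/basis-refitting update and the straight-through training details are omitted.

    import itertools
    import torch

    def lq_levels(basis: torch.Tensor) -> torch.Tensor:
        """All 2^K representable levels: inner products of the learnable
        basis with every binary code in {-1, +1}^K."""
        K = basis.numel()
        codes = torch.tensor(list(itertools.product([-1.0, 1.0], repeat=K)))
        return codes @ basis  # shape: (2^K,)

    def lq_quantize(x: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
        """Nearest-level assignment given the current basis; during training
        the basis itself is refit so the levels adapt to the data."""
        levels = lq_levels(basis)
        idx = (x.reshape(-1, 1) - levels).abs().argmin(dim=1)
        return levels[idx].reshape(x.shape)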
While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques, as well as the potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs, and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
Model binarization is an effective method for compressing neural networks and accelerating their inference. However, a significant performance gap remains between 1-bit and 32-bit models. Empirical studies show that binarization causes information loss in both forward and backward propagation. We propose a novel Distribution-sensitive Information Retention Network (DIR-Net) that retains information in forward and backward propagation by improving internal propagation and introducing external representations. DIR-Net rests on three technical contributions: (1) Information Maximized Binarization (IMB), which simultaneously minimizes the information loss and the binarization error of weights/activations through weight balancing and standardization; (2) Distribution-sensitive Two-stage Estimator (DTE), which retains the information of gradients through a distribution-sensitive soft approximation that jointly considers update capability and gradient accuracy; (3) Representative Binarization-aware Distillation (RBD), which retains representation information by distilling representations between the full-precision and binarized networks. DIR-Net investigates both the forward and backward processes of BNNs from a unified information perspective, providing new insight into the mechanism of network binarization. The three techniques in our DIR-Net are versatile and effective and can be applied in various structures to improve BNNs. Comprehensive experiments on image classification and object detection tasks show that our DIR-Net consistently outperforms state-of-the-art binarization methods under mainstream and compact architectures such as ResNet, VGG, EfficientNet, DARTS, and MobileNet. In addition, we deploy DIR-Net on real-world resource-limited devices, achieving an 11.1x storage saving and a 5.4x speedup.
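Of the three components, the IMB step admits a compact sketch; the moment-based balancing below is our reading of "weight balancing and standardization", with the MSE-optimal scalar for sign() used as the scale.

    import torch

    def imb_binarize(w: torch.Tensor) -> torch.Tensor:
        """Balance (zero-mean) and standardize the weights so their signs are
        near-evenly distributed (maximum-entropy binary codes), then binarize
        with the scale that minimizes the l2 binarization error."""
        w_std = (w - w.mean()) / (w.std() + 1e-8)  # balancing + standardization
        alpha = w_std.abs().mean()                 # argmin_a ||w_std - a*sign(w_std)||^2
        return alpha * torch.sign(w_std)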
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values, resulting in a 32× memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of the number of high-precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.
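The binary-weight convolution reduces to a sign convolution plus a per-filter rescale, since alpha = mean(|W|) is the closed-form minimizer of ||W - alpha * B||^2 for B = sign(W). A minimal sketch (full XNOR-Networks also binarize the input and apply an input-dependent factor K):

    import torch
    import torch.nn.functional as F

    def bwn_conv2d(x, weight_fp, stride=1, padding=0):
        """Binary-Weight-Network convolution: W ~ alpha * sign(W), with one
        alpha per output filter."""
        bw = torch.sign(weight_fp)
        alpha = weight_fp.abs().mean(dim=(1, 2, 3))        # per-filter scale
        y = F.conv2d(x, bw, stride=stride, padding=padding)
        return y * alpha.view(1, -1, 1, 1)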
Adder Neural Network (AdderNet) provides a new way to develop energy-efficient neural networks by replacing the expensive multiplications in convolution with cheaper additions (i.e., the l1-norm). To achieve higher hardware efficiency, it is necessary to further study the low-bit quantization of AdderNet. Owing to the limitation that the commutative law of multiplication does not hold for the l1-norm, well-established quantization methods for convolutional networks cannot be applied to AdderNets. Thus, existing AdderNet quantization techniques propose to use only one shared scale to quantize both the weights and the activations simultaneously. Admittedly, such an approach can keep the commutative law in the l1-norm quantization process, but the accuracy drop after low-bit quantization cannot be ignored. To this end, we first thoroughly analyze the difference between the distributions of weights and activations in AdderNet and then propose a new quantization algorithm that redistributes the weights and the activations. Specifically, the pre-trained full-precision weights in different kernels are clustered into different groups, so that intra-group shared and inter-group independent scales can be adopted. To further compensate for the accuracy drop caused by the distribution difference, we then develop a lossless range clamp scheme for weights and a simple yet effective outlier clamp strategy for activations. Thus, the functionality of the full-precision weights and the representation ability of the full-precision activations can be fully preserved. The effectiveness of the proposed quantization method for AdderNet is well verified on several benchmarks; e.g., our 4-bit post-training quantized adder ResNet-18 achieves a 66.5% top-1 accuracy on ImageNet with comparable energy efficiency, which is about 8.5% higher than that of previous AdderNet quantization methods.
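A hedged sketch of the grouped-scale idea follows; clustering kernels simply by magnitude quantiles is a stand-in for the paper's clustering procedure, and the range/outlier clamp schemes are omitted.

    import torch

    def group_scales(weight, n_groups=4, bits=4):
        """Assign each conv kernel to a magnitude group and give each group
        its own quantization scale; within a group the scale is shared by
        weights and activations so the l1-norm commutativity constraint
        discussed above still holds."""
        qmax = 2 ** (bits - 1) - 1
        mags = weight.flatten(1).abs().amax(dim=1)          # one value per kernel
        edges = torch.quantile(mags, torch.linspace(0, 1, n_groups + 1))
        group = torch.bucketize(mags, edges[1:-1])          # kernel -> group id
        scales = torch.stack([mags[group == g].max() / qmax
                              if (group == g).any() else mags.new_tensor(1.0)
                              for g in range(n_groups)])
        return group, scales   # quantize kernel k with scales[group[k]]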
In this paper, we compress convolutional neural network (CNN) weights via transform quantization. Previous CNN quantization techniques tend to ignore the joint statistics of weights and activations, producing suboptimal CNN performance at a given quantization bit rate, or consider their joint statistics during training and do not facilitate efficient compression of already-trained CNN models. We optimally transform (decorrelate) and quantize post-trained weights using a rate-distortion framework to improve compression at any given quantization bit rate. Transform quantization unifies quantization and dimensionality reduction (decorrelation) techniques in a single framework to facilitate low-bit-rate compression of CNNs and efficient inference in the transform domain. We first introduce a rate-distortion theory for CNN quantization and formulate optimal quantization as a rate-distortion optimization problem. We then show that this problem can be solved using an optimal bit-depth allocation with the end-to-end learned transform (ELT) derived in this paper. Experiments show that transform quantization advances the state of the art in CNN compression in both retrained and non-retrained quantization settings. In particular, we find that transform quantization with retraining is able to compress CNN models such as AlexNet, ResNet, and DenseNet to very low bit rates (1-2 bits).
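As a toy, non-learned stand-in for the paper's end-to-end learned transform, the sketch below decorrelates a weight matrix with the KLT and allocates per-component bit depths by the classical reverse water-filling rule; both choices are illustrative assumptions only.

    import numpy as np

    def uniform_q(x, bits):
        """Symmetric uniform quantizer; 0 bits drops the component."""
        if bits <= 0:
            return np.zeros_like(x)
        if bits == 1:
            return np.abs(x).mean() * np.sign(x)
        s = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
        return np.round(x / s) * s

    def transform_quantize(W, avg_bits=2.0):
        C = np.cov(W, rowvar=False)            # second-order weight statistics
        _, U = np.linalg.eigh(C)
        Z = W @ U                              # decorrelated coefficients
        var = Z.var(axis=0) + 1e-12
        b = avg_bits + 0.5 * (np.log2(var) - np.log2(var).mean())
        b = np.clip(np.round(b), 0, 8)         # per-component bit depths
        Zq = np.stack([uniform_q(Z[:, i], int(b[i])) for i in range(Z.shape[1])], 1)
        return Zq, U                           # inference can stay in the U-domain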
Binary neural networks (BNNs) represent the original full-precision weights and activations with 1 bit via the sign function. Since the gradient of the conventional sign function is almost zero everywhere and cannot be used for back-propagation, several attempts have been made to alleviate the optimization difficulty by using approximate gradients. However, these approximations corrupt the main direction of the true gradient. To this end, we propose to estimate the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs, namely the Frequency Domain Approximation (FDA). The proposed approach does not affect the low-frequency information of the original sign function, which occupies most of the overall energy, and high-frequency coefficients are ignored to avoid huge computational overhead. In addition, we embed a noise adaptation module into the training phase to compensate for the approximation error. Experiments on several benchmark datasets and neural architectures illustrate that binary networks learned with our method achieve state-of-the-art accuracy. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/fda-bnn.
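The truncated sine-series gradient can be written directly as a custom autograd function; the base frequency and truncation order below are assumptions, and the paper's noise-adaptation module is omitted.

    import math
    import torch

    class FDASign(torch.autograd.Function):
        """sign() forward; backward uses the derivative of the n lowest-
        frequency terms of the Fourier (sine) series of sign:
        d/dx (4/pi) * sum_i sin((2i+1) w x) / (2i+1)
          = (4w/pi) * sum_i cos((2i+1) w x)."""
        N_TERMS, OMEGA = 4, math.pi

        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return torch.sign(x)

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            w = FDASign.OMEGA
            g = sum(torch.cos((2 * i + 1) * w * x) for i in range(FDASign.N_TERMS))
            return grad_out * (4.0 * w / math.pi) * g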
To deploy deep models in a computationally efficient manner, model quantization approaches have been frequently used. In addition, as new hardware supporting mixed-bit arithmetic operations emerges, recent research on mixed-precision quantization (MPQ) has begun to fully exploit the representational capacity by searching for optimized bit-widths for different layers and modules in a network. However, previous studies mainly search for the MPQ strategy with expensive schemes based on reinforcement learning, neural architecture search, and the like, or simply utilize partial prior knowledge for bit-width assignment, which may be biased and sub-optimal. In this work, we present a novel Stochastic Differentiable Quantization (SDQ) method that can automatically learn the MPQ strategy in a more flexible and globally optimized space with a smoother gradient approximation. In particular, differentiable bit-width parameters (DBPs) are employed as the probability factors in stochastic quantization between adjacent bit-width choices. After acquiring the optimal MPQ strategy, we further train the network with entropy-aware bin regularization and knowledge distillation. We extensively evaluate our method on multiple networks, different hardware (GPUs and FPGAs), and datasets. SDQ outperforms all state-of-the-art mixed- or single-precision quantization methods, and even outperforms the full-precision counterparts of various ResNet and MobileNet families at lower bit-widths, demonstrating the effectiveness and superiority of our method.
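A hedged sketch of the stochastic bit-width choice follows; the sigmoid parameterization and the expectation-based surrogate gradient are our assumptions about how a DBP could be trained, not the paper's exact estimator.

    import torch

    def uniform_q(w, bits):
        """Plain symmetric uniform fake-quantizer used by the sketch below."""
        qmax = 2 ** (bits - 1) - 1
        s = w.abs().max() / qmax
        return (w / s).round().clamp(-qmax, qmax) * s

    def sdq_quantize(w, bit_lo, bit_hi, theta):
        """theta is the differentiable bit-width parameter (DBP);
        sigmoid(theta) is the probability of taking the higher of the two
        adjacent bit-widths in the sampled forward pass."""
        p = torch.sigmoid(theta)
        q_hi, q_lo = uniform_q(w, bit_hi), uniform_q(w, bit_lo)
        q = q_hi if torch.rand(()) < p else q_lo      # stochastic forward
        q_soft = p * q_hi + (1 - p) * q_lo            # differentiable surrogate
        return q.detach() + q_soft - q_soft.detach()  # grads flow via q_soft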
This paper studies binary neural networks (BNNs), in which both the weights and the activations are binarized into 1-bit values, greatly reducing memory usage and computational complexity. Since modern deep neural networks are of sophisticated design, with complex architectures for good accuracy, the diversity in the distributions of weights and activations is very high. Therefore, the conventional sign function cannot effectively binarize the full-precision values in BNNs. To this end, we present a simple yet effective approach called AdaBin to adaptively obtain the optimal binary sets $\{b_1, b_2\}$ ($b_1, b_2 \in \mathbb{R}$) of weights and activations for each layer instead of a fixed set (i.e., $\{-1, +1\}$). In this way, the proposed method can better fit different distributions and increase the representation ability of binarized features. In practice, we use the center position and the distance of the 1-bit values to define a new binary quantization function. For the weights, we propose an equalization method that aligns the symmetric center of the binary distribution with that of the real-valued distribution and minimizes their Kullback-Leibler divergence. Meanwhile, we introduce a gradient-based optimization method to obtain these two parameters for the activations, which are jointly trained in an end-to-end manner. Experimental results on benchmark models and datasets demonstrate that the proposed AdaBin achieves state-of-the-art performance. For instance, we obtain 66.4% Top-1 accuracy on ImageNet with the ResNet-18 architecture and 69.4 mAP on PASCAL VOC using SSD300.
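The binary quantization function parameterized by a center and a half-distance is easy to state; the moment-match for the weight parameters below is a stand-in assumption for the paper's KL-divergence minimization, and the activation parameters would instead be learned by gradient descent end-to-end.

    import torch

    def adabin_binarize(x, beta, alpha):
        """Binarize into the adaptive set {beta - alpha, beta + alpha}
        instead of the fixed {-1, +1}."""
        return beta + alpha * torch.sign(x - beta)

    def adabin_weight_params(w):
        """Equalization sketch: align the binary set's center with the
        real-valued center and match its spread."""
        beta = w.mean()
        alpha = (w - beta).abs().mean()
        return beta, alpha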
Neural network quantization aims to transform the high-precision weights and activations of a given neural network into low-precision weights/activations to reduce memory usage and computation while preserving the performance of the original model. However, extreme quantization (1-bit weights/1-bit activations) of compactly designed backbone architectures (e.g., MobileNets), which are often used for edge-device deployment, results in severe performance degeneration. This paper proposes a novel quantization-aware training (QAT) method that can effectively alleviate performance degeneration even with extreme quantization by focusing on the inter-weight dependencies among the weights of each layer. To minimize the quantization impact of each weight on the others, we perform an orthonormal transformation of the weights at each layer by training an input-dependent correlation matrix and importance vector, such that each weight is disentangled from the others. We then quantize the weights based on their importance to minimize the loss of information from the original weights/activations. We further perform progressive layer-wise quantization from the bottom layer to the top, so that the quantization at each layer reflects the quantized distributions of the weights and activations at the previous layers. We validate the effectiveness of our method on various benchmark datasets against strong neural quantization baselines, demonstrating that it alleviates the performance degeneration on ImageNet and successfully preserves the full-precision model performance with compact backbone networks on CIFAR-100.
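A hedged sketch of the disentangling idea: build an orthonormal matrix from an unconstrained parameter, rotate the weights, quantize in the rotated basis, and rotate back. The matrix-exponential parameterization and the sign-based binarizer are stand-ins; the paper's input-dependent correlation matrix, importance vector, and importance-ordered quantization are richer than this.

    import torch

    def rotated_binarize(w_mat: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        """w_mat: (out, in) weight matrix; A: trainable (in, in) parameter."""
        S = A - A.t()                     # skew-symmetric => exp(S) is orthogonal
        Q = torch.matrix_exp(S)
        w_rot = w_mat @ Q                 # disentangled coordinates
        w_q = w_rot.abs().mean() * torch.sign(w_rot)  # extreme (1-bit) quantization
        return w_q @ Q.t()                # back to the original basis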
Nonuniform quantization strategies for compressing neural networks typically achieve better performance than their uniform counterparts due to their superior representational capacity. However, many nonuniform quantization methods overlook the complicated projection process required to implement nonuniformly quantized weights/activations, which incurs non-negligible time and space overhead in hardware deployment. In this study, we propose Nonuniform-to-Uniform Quantization (N2UQ), a method that maintains the strong representation ability of nonuniform methods while being as hardware-friendly and efficient as uniform quantization for model inference. We achieve this by learning flexible, non-equidistant input thresholds to better fit the underlying distributions, while quantizing these real-valued inputs into equidistant output levels. To train the quantized network with learnable input thresholds, we introduce a generalized straight-through estimator (G-STE) for the otherwise intractable backward derivative computation with respect to the threshold parameters. In addition, we consider entropy-preserving regularization to further reduce the information loss in weight quantization. Even under this adverse constraint of imposing uniformly quantized weights and activations, our N2UQ outperforms state-of-the-art nonuniform quantization methods by 0.7~1.8% on ImageNet, demonstrating the contribution of the N2UQ design. Code will be made publicly available.
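The forward mapping from learnable, non-equidistant thresholds to equidistant integer levels is a one-liner with bucketize; the backward pass below is a plain pass-through sketch, whereas the paper's G-STE additionally makes the thresholds themselves trainable.

    import torch

    class N2UQQuantize(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, thresholds):   # thresholds: sorted, length 2^b - 1
            ctx.save_for_backward(x, thresholds)
            return torch.bucketize(x, thresholds).to(x.dtype)  # levels 0..2^b-1

        @staticmethod
        def backward(ctx, grad_out):
            x, t = ctx.saved_tensors
            in_range = (x >= t[0]) & (x <= t[-1])
            # Pass-through for x only; G-STE would also return a gradient
            # for the thresholds instead of None.
            return grad_out * in_range.to(grad_out.dtype), None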
Zero-shot quantization is a promising approach for developing lightweight deep neural networks when data is inaccessible owing to various reasons, including cost and issues related to privacy. By utilizing the learned parameters (statistics) of FP32-pre-trained models, zero-shot quantization schemes focus on generating synthetic data by minimizing the distance between the learned parameters ($\mu$ and $\sigma$) and distributions of intermediate activations. Subsequently, they distill knowledge from the pre-trained model (teacher) to the quantized model (student) such that the quantized model can be optimized with the synthetic dataset. In general, zero-shot quantization comprises two major elements: synthesizing datasets and quantizing models. However, thus far, zero-shot quantization has primarily been discussed in the context of quantization-aware training methods, which require task-specific losses and long-term optimization as much as retraining. We thus introduce a post-training quantization scheme for zero-shot quantization that produces high-quality quantized networks within a few hours, or even half an hour. Furthermore, we propose a framework called Genie that generates data suited for post-training quantization. With the data synthesized by Genie, we can produce high-quality quantized models without real datasets, which is comparable to few-shot quantization. We also propose a post-training quantization algorithm to enhance the performance of quantized models. By combining them, we can bridge the gap between zero-shot and few-shot quantization while significantly improving the quantization performance compared to that of existing approaches. In other words, we can obtain a unique state-of-the-art zero-shot quantization approach.
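The statistics-matching step common to this line of work can be sketched as optimizing random inputs against each BatchNorm layer's stored running mean/variance; Genie's generator and the subsequent distillation/PTQ stages are omitted, so this is only the basic synthesis loop.

    import torch
    import torch.nn as nn

    def synthesize(model: nn.Module, shape=(64, 3, 224, 224), steps=500, lr=0.1):
        for p in model.parameters():
            p.requires_grad_(False)
        x = torch.randn(shape, requires_grad=True)
        opt = torch.optim.Adam([x], lr=lr)
        bn_losses = []

        def hook(m, inp, out):
            a = inp[0]
            mu = a.mean(dim=(0, 2, 3))
            var = a.var(dim=(0, 2, 3), unbiased=False)
            bn_losses.append(((mu - m.running_mean) ** 2).mean()
                             + ((var - m.running_var) ** 2).mean())

        handles = [m.register_forward_hook(hook)
                   for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
        model.eval()
        for _ in range(steps):
            bn_losses.clear()
            opt.zero_grad()
            model(x)
            torch.stack(bn_losses).sum().backward()
            opt.step()
        for h in handles:
            h.remove()
        return x.detach()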
Data clipping is crucial for reducing noise in quantization operations and improving the accuracy of quantization-aware training (QAT). Current practices rely on heuristics to set the clipping threshold scalars and cannot be shown to be optimal. We propose Optimally Clipped Tensors And Vectors (OCTAV), a recursive algorithm that determines MSE-optimal clipping scalars. Derived from the fast Newton-Raphson method, OCTAV finds optimal clipping scalars on the fly, for every tensor, at every iteration of the QAT routine. Thus, QAT algorithms are formulated with provably minimum quantization noise at each step. In addition, we reveal limitations of common gradient estimation techniques in QAT and propose magnitude-aware differentiation as a remedy to further improve accuracy. Experimentally, OCTAV-enabled QAT achieves state-of-the-art accuracy on multiple tasks, including training from scratch and retraining ResNets and MobileNets on ImageNet, and fine-tuning BERT models, where OCTAV-enabled QAT consistently preserves accuracy at low precision (4-to-6 bits). Our results require no modifications to the baseline training recipe, other than the insertion of quantization operations where appropriate.
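Our reading of the recursion makes for a short fixed-point loop; it follows from setting the derivative of the clipped-quantizer MSE to zero, with (4^-B)/3 the in-range noise factor of a B-bit uniform quantizer.

    import torch

    def octav_clip(x: torch.Tensor, bits: int, iters: int = 20) -> torch.Tensor:
        """MSE-optimal clipping scalar via Newton-Raphson:
        s <- E[|x| * 1{|x|>s}] / ((4^-B / 3) * E[1{|x|<=s}] + E[1{|x|>s}])."""
        a = x.abs().flatten()
        s = a.mean()                      # any positive initialization works
        for _ in range(iters):
            over = a > s
            num = (a * over).sum()
            den = (4.0 ** -bits / 3.0) * (~over).sum() + over.sum()
            s = num / den
        return s  # quantize with range [-s, s], e.g., clamp then round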
Model quantization has become an indispensable technique to accelerate deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization works are often unreproducible and undeployable, because researchers do not choose consistent training pipelines and ignore the requirements for hardware deployment. In this work, we propose the Model Quantization Benchmark (MQBench), a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms. We select multiple different platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate an extensive set of state-of-the-art quantization algorithms under a unified training pipeline. MQBench acts like a bridge connecting algorithms and hardware. We conduct a comprehensive analysis and find considerable intuitive and counter-intuitive insights. By aligning the training settings, we find that existing algorithms have roughly the same performance on the conventional academic track, while for hardware-deployable quantization a huge accuracy gap remains unsettled. Surprisingly, no existing algorithm wins every challenge in MQBench, and we hope this work can inspire future research directions.
Large DNNs with mixed-precision quantization can achieve ultra-high compression while retaining high classification performance. However, because of the challenge of finding an accurate metric to guide the optimization process, these methods either sacrifice significant performance compared to the 32-bit floating-point (FP-32) baseline or rely on a compute-expensive, iterative training policy that requires the availability of a pre-trained baseline. To address this issue, this paper presents BMPQ, a training method that uses bit gradients to analyze layer sensitivities and yield mixed-precision quantized models. BMPQ requires a single training iteration and no pre-trained baseline. It uses an integer linear program (ILP) to dynamically adjust the precision of the layers during training, subject to a fixed hardware budget. To evaluate the efficacy of BMPQ, we conduct extensive experiments with VGG16 and ResNet18 on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. Compared to the baseline FP-32 models, BMPQ yields models that require 15.4x fewer parameter bits with a negligible drop in accuracy. Compared to the SOTA "during training" mixed-precision training scheme, our models achieve compression ratios of 2.1x, 2.2x, and 2.9x on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, with improved accuracy of up to 14.54%.
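In place of the paper's ILP, the sketch below greedily demotes layers under a parameter-bit budget using a 4^-b error proxy per layer; both the proxy and the greedy rule are illustrative assumptions, with `sens[l]` standing in for the bit-gradient-based sensitivity.

    def allocate_bits(sizes, sens, choices=(2, 4, 8), budget_bits=None):
        """Pick per-layer bit-widths under a total parameter-bit budget."""
        bits = [max(choices)] * len(sizes)
        total = lambda: sum(s * b for s, b in zip(sizes, bits))
        if budget_bits is None:
            budget_bits = total() // 2        # e.g., halve the model size
        while total() > budget_bits:
            best, best_ratio, new_b = None, float("inf"), None
            for l, b in enumerate(bits):
                lower = [c for c in choices if c < b]
                if not lower:
                    continue
                nb = max(lower)
                d_err = sens[l] * (4.0 ** -nb - 4.0 ** -b)  # error increase proxy
                ratio = d_err / (sizes[l] * (b - nb))       # per bit saved
                if ratio < best_ratio:
                    best, best_ratio, new_b = l, ratio, nb
            if best is None:
                break                          # budget unreachable with choices
            bits[best] = new_b
        return bits

    # e.g., allocate_bits(sizes=[1e6, 2e5, 5e4], sens=[0.5, 3.0, 1.0])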
The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware. We also co-design a training procedure to preserve end-to-end model accuracy post quantization. As a result, the proposed quantization scheme improves the tradeoff between accuracy and on-device latency. The improvements are significant even on MobileNets, a model family known for run-time efficiency, and are demonstrated in ImageNet classification and COCO detection on popular CPUs.
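The integer-only scheme represents each real value as r = S * (q - Z); a matmul then needs one int32 accumulation and a single fixed-point rescale. The Q0.31 multiplier emulation below simplifies the rounding details of a production kernel.

    import numpy as np

    def quantized_matmul(q1, Z1, S1, q2, Z2, S2, Z3, S3):
        acc = (q1.astype(np.int32) - Z1) @ (q2.astype(np.int32) - Z2)
        M = S1 * S2 / S3                    # real multiplier, in (0, 1) in practice
        M0 = int(round(M * (1 << 31)))      # quantized multiplier (Q0.31)
        q3 = Z3 + ((acc.astype(np.int64) * M0) >> 31)
        return np.clip(q3, 0, 255).astype(np.uint8)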
A key enabler for deploying convolutional neural networks on resource-constrained embedded systems is the binary neural network (BNN). BNNs save memory and simplify computation by binarizing both features and weights. Unfortunately, binarization is inevitably accompanied by a severe degradation in accuracy. To reduce the accuracy gap between binary and full-precision networks, many repair methods have been proposed in recent years, which we have classified and put into a single overview in this chapter. The repair methods are divided into two main branches, training techniques and network topology changes, which can further be split into smaller categories. The latter category introduces additional cost (energy consumption or extra area) for an embedded system, while the former does not. From our overview, we observe that progress has been made in reducing the accuracy gap, but BNN papers are not aligned on which repair methods should be used to obtain highly accurate BNNs. Therefore, this chapter contains an empirical review that evaluates the benefits of many repair methods over the ResNet-20 & CIFAR10 and ResNet-18 & CIFAR100 benchmarks. We find three repair categories most beneficial: feature binarizers, feature normalization, and double residuals. Based on this review, we discuss future directions and research opportunities. We sketch the benefits and costs associated with BNNs on embedded systems, since it remains to be seen whether BNNs will be able to close the accuracy gap while staying highly energy-efficient on resource-constrained embedded systems.