具有混合精度量化的大DNN可以实现超高压缩,同时保持高分类性能。但是,由于找到了可以引导优化过程的准确度量的挑战,与32位浮点(FP-32)基线相比,这些方法牺牲了显着性能,或者依赖于计算昂贵的迭代培训政策这需要预先训练的基线的可用性。要解决此问题,本文提出了BMPQ,一种使用位梯度来分析层敏感性的训练方法,并产生混合精度量化模型。 BMPQ需要单一的训练迭代,但不需要预先训练的基线。它使用整数线性程序(ILP)来动态调整培训期间层的精度,但经过固定的硬件预算。为了评估BMPQ的功效,我们对CiFar-10,CiFar-100和微小想象数据集的VGG16和Reset18进行了广泛的实验。与基线FP-32型号相比,BMPQ可以产生具有15.4倍的参数比特的模型,精度可忽略不计。与SOTA“在培训期间”相比,混合精确训练方案,我们的模型分别在CiFar-10,CiFar-100和微小想象中分别为2.1倍,2.2倍2.9倍,具有提高的精度高达14.54%。
translated by 谷歌翻译
由于神经网络变得更加强大,因此在现实世界中部署它们的愿望是一个上升的愿望;然而,神经网络的功率和准确性主要是由于它们的深度和复杂性,使得它们难以部署,尤其是在资源受限的设备中。最近出现了神经网络量化,以满足这种需求通过降低网络的精度来降低神经网络的大小和复杂性。具有较小和更简单的网络,可以在目标硬件的约束中运行神经网络。本文调查了在过去十年中开发的许多神经网络量化技术。基于该调查和神经网络量化技术的比较,我们提出了该地区的未来研究方向。
translated by 谷歌翻译
混合精确的深神经网络达到了硬件部署所需的能源效率和吞吐量,尤其是在资源有限的情况下,而无需牺牲准确性。但是,不容易找到保留精度的最佳每层钻头精度,尤其是在创建巨大搜索空间的大量模型,数据集和量化技术中。为了解决这一困难,最近出现了一系列文献,并且已经提出了一些实现有希望的准确性结果的框架。在本文中,我们首先总结了文献中通常使用的量化技术。然后,我们对混合精液框架进行了彻底的调查,该调查是根据其优化技术进行分类的,例如增强学习和量化技术,例如确定性舍入。此外,讨论了每个框架的优势和缺点,我们在其中呈现并列。我们最终为未来的混合精液框架提供了指南。
translated by 谷歌翻译
Mixed-precision quantization has been widely applied on deep neural networks (DNNs) as it leads to significantly better efficiency-accuracy tradeoffs compared to uniform quantization. Meanwhile, determining the exact precision of each layer remains challenging. Previous attempts on bit-level regularization and pruning-based dynamic precision adjustment during training suffer from noisy gradients and unstable convergence. In this work, we propose Continuous Sparsification Quantization (CSQ), a bit-level training method to search for mixed-precision quantization schemes with improved stability. CSQ stabilizes the bit-level mixed-precision training process with a bi-level gradual continuous sparsification on both the bit values of the quantized weights and the bit selection in determining the quantization precision of each layer. The continuous sparsification scheme enables fully-differentiable training without gradient approximation while achieving an exact quantized model in the end.A budget-aware regularization of total model size enables the dynamic growth and pruning of each layer's precision towards a mixed-precision quantization scheme of the desired size. Extensive experiments show CSQ achieves better efficiency-accuracy tradeoff than previous methods on multiple models and datasets.
translated by 谷歌翻译
深神经网络(DNN)的庞大计算和记忆成本通常排除了它们在资源约束设备中的使用。将参数和操作量化为较低的位精确,为神经网络推断提供了可观的记忆和能量节省,从而促进了在边缘计算平台上使用DNN。量化DNN的最新努力采用了一系列技术,包括渐进式量化,步进尺寸的适应性和梯度缩放。本文提出了一种针对边缘计算的混合精度卷积神经网络(CNN)的新量化方法。我们的方法在模型准确性和内存足迹上建立了一个新的Pareto前沿,展示了一系列量化模型,可提供低于4.3 MB的权重(WGTS。)和激活(ACTS。)。我们的主要贡献是:(i)用张量学的学习精度,(ii)WGTS的靶向梯度修饰,(i)硬件感知的异质可区分量化。和行为。为了减轻量化错误,以及(iii)多相学习时间表,以解决从更新到学习的量化器和模型参数引起的学习不稳定性。我们证明了我们的技术在Imagenet数据集上的有效性,包括高效网络lite0(例如,WGTS。的4.14MB和ACTS。以67.66%的精度)和MobilenEtV2(例如3.51MB WGTS。 % 准确性)。
translated by 谷歌翻译
Although weight and activation quantization is an effective approach for Deep Neural Network (DNN) compression and has a lot of potentials to increase inference speed leveraging bit-operations, there is still a noticeable gap in terms of prediction accuracy between the quantized model and the full-precision model. To address this gap, we propose to jointly train a quantized, bit-operation-compatible DNN and its associated quantizers, as opposed to using fixed, handcrafted quantization schemes such as uniform or logarithmic quantization. Our method for learning the quantizers applies to both network weights and activations with arbitrary-bit precision, and our quantizers are easy to train. The comprehensive experiments on CIFAR-10 and ImageNet datasets show that our method works consistently well for various network structures such as AlexNet, VGG-Net, GoogLeNet, ResNet, and DenseNet, surpassing previous quantization methods in terms of accuracy by an appreciable margin. Code available at https://github.com/Microsoft/LQ-Nets
translated by 谷歌翻译
Although considerable progress has been obtained in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices as one dedicated model needs to be trained, transmitted, and stored for one specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism which allows us to obtain multiple quantized networks from one full precision source model by progressively mapping the higher precision weights to their adjacent lower precision counterparts. Then, with networks of different bit-widths from one source model, multi-objective optimization is employed to train the shared source model weights such that they can be updated simultaneously, considering the performance of all networks. By doing this, the shared weights will be optimized to balance the performance of different quantized models, thus making the weights transferable among different bit widths. Experiments show that the proposed vertical-layered representation and developed once QAT scheme are effective in embodying multiple quantized networks into a single one and allow one-time training, and it delivers comparable performance as that of quantized models tailored to any specific bit-width. Code will be available.
translated by 谷歌翻译
为了以计算有效的方式部署深层模型,经常使用模型量化方法。此外,由于新的硬件支持混合的位算术操作,最近对混合精度量化(MPQ)的研究开始通过搜索网络中不同层和模块的优化位低宽,从而完全利用表示的能力。但是,先前的研究主要是在使用强化学习,神经体系结构搜索等的昂贵方案中搜索MPQ策略,或者简单地利用部分先验知识来进行位于刻度分配,这可能是有偏见和优势的。在这项工作中,我们提出了一种新颖的随机量化量化(SDQ)方法,该方法可以在更灵活,更全球优化的空间中自动学习MPQ策略,并具有更平滑的梯度近似。特别是,可区分的位宽参数(DBP)被用作相邻位意选择之间随机量化的概率因素。在获取最佳MPQ策略之后,我们将进一步训练网络使用熵感知的bin正则化和知识蒸馏。我们广泛评估了不同硬件(GPU和FPGA)和数据集的多个网络的方法。 SDQ的表现优于所有最先进的混合或单个精度量化,甚至比较低的位置量化,甚至比各种重新网络和Mobilenet家族的全精度对应物更好,这表明了我们方法的有效性和优势。
translated by 谷歌翻译
Quantization has become a predominant approach for model compression, enabling deployment of large models trained on GPUs onto smaller form-factor devices for inference. Quantization-aware training (QAT) optimizes model parameters with respect to the end task while simulating quantization error, leading to better performance than post-training quantization. Approximation of gradients through the non-differentiable quantization operator is typically achieved using the straight-through estimator (STE) or additive noise. However, STE-based methods suffer from instability due to biased gradients, whereas existing noise-based methods cannot reduce the resulting variance. In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of task loss gradient to approximate the effect of a quantization operator. We show this method combines gradient scale and quantization noise in a better optimized way, providing finer-grained estimation of gradients at each weight and activation layer's quantizer bin size. Our controlled noise also contains an implicit curvature term that could encourage flatter minima, which we show is indeed the case in our experiments. Experiments training ResNet architectures on the CIFAR-10, CIFAR-100 and ImageNet benchmarks show that our method obtains state-of-the-art top-1 classification accuracy for uniform (non mixed-precision) quantization, out-performing previous methods by 0.5-1.2% absolute.
translated by 谷歌翻译
量化广泛用于云和边缘系统,以减少深层神经网络的记忆占用,潜伏期和能耗。特别是,混合精液量化,即,在网络的不同部分中使用不同的位宽度,已被证明可以提供出色的效率提高,尤其是通过自动化神经体系结构确定的优化的位宽度分配,尤其是通过自动化的位宽度分配(NAS)工具。最先进的混合精液在层面上,即,它对每个网络层的权重和激活张量使用不同的位宽度。在这项工作中,我们扩大了搜索空间,提出了一种新颖的NA,该NAS独立选择每个重量张量通道的位宽度。这为工具提供了额外的灵活性,即仅针对与最有用的功能相关的权重分配更高的精度。在MLPERF微小的基准套件上进行测试,我们获得了精确度大小与精度与能量空间的帕累托最佳模型的丰富集合。当部署在MPIC RISC-V边缘处理器上时,我们的网络将记忆和能量分别减少了63%和27%,而与层的方法相比,以相同的精度为单位。
translated by 谷歌翻译
模型量化已成为加速深度学习推理的不可或缺的技术。虽然研究人员继续推动量化算法的前沿,但是现有量化工作通常是不可否认的和不可推销的。这是因为研究人员不选择一致的训练管道并忽略硬件部署的要求。在这项工作中,我们提出了模型量化基准(MQBench),首次尝试评估,分析和基准模型量化算法的再现性和部署性。我们为实际部署选择多个不同的平台,包括CPU,GPU,ASIC,DSP,并在统一培训管道下评估广泛的最新量化算法。 MQBENCK就像一个连接算法和硬件的桥梁。我们进行全面的分析,并找到相当大的直观或反向直观的见解。通过对齐训练设置,我们发现现有的算法在传统的学术轨道上具有大致相同的性能。虽然用于硬件可部署量化,但有一个巨大的精度差距,仍然不稳定。令人惊讶的是,没有现有的算法在MQBench中赢得每一项挑战,我们希望这项工作能够激发未来的研究方向。
translated by 谷歌翻译
深度神经网络(DNN)的记录断裂性能具有沉重的参数化,导致外部动态随机存取存储器(DRAM)进行存储。 DRAM访问的禁用能量使得在资源受限的设备上部署DNN是不普遍的,呼叫最小化重量和数据移动以提高能量效率。我们呈现SmartDeal(SD),算法框架,以进行更高成本的存储器存储/访问的较低成本计算,以便在推理和培训中积极提高存储和能量效率。 SD的核心是一种具有结构约束的新型重量分解,精心制作以释放硬件效率潜力。具体地,我们将每个重量张量分解为小基矩阵的乘积以及大的结构稀疏系数矩阵,其非零被量化为-2的功率。由此产生的稀疏和量化的DNN致力于为数据移动和重量存储而大大降低的能量,因为由于稀疏的比特 - 操作和成本良好的计算,恢复原始权重的最小开销。除了推理之外,我们采取了另一次飞跃来拥抱节能培训,引入创新技术,以解决培训时出现的独特障碍,同时保留SD结构。我们还设计专用硬件加速器,充分利用SD结构来提高实际能源效率和延迟。我们在不同的设置中对多个任务,模型和数据集进行实验。结果表明:1)应用于推理,SD可实现高达2.44倍的能效,通过实际硬件实现评估; 2)应用于培训,储存能量降低10.56倍,减少了10.56倍和4.48倍,与最先进的训练基线相比,可忽略的准确性损失。我们的源代码在线提供。
translated by 谷歌翻译
当量化神经网络以进行有效推断时,低位整数是效率的首选格式。但是,低位浮点数具有额外的自由度,分配了一些以指数级的工作。本文深入研究了神经网络推断的浮点格式的这种好处。我们详细介绍了可以为FP8格式做出的选择,包括对Mantissa和Exponent的位数的重要选择,并通过分析显示这些选择可以提供更好的性能。然后,我们展示了这些发现如何转化为真实网络,为FP8模拟提供有效的实现,以及一种新算法,该算法能够学习比例参数和FP8格式中的指数位数。我们的主要结论是,在对各种网络进行培训后量化时,就准确性而言,FP8格式优于INT8,并且指数位数量的选择是由网络中异常值的严重性驱动的。我们还通过量化感知训练进行实验,在训练网络以降低离群值的效果时,格式的差异消失。
translated by 谷歌翻译
已经证明量化是提高深神经网络推理效率的重要方法(DNN)。然而,在将DNN权重或从高精度格式从高精度格式量化到它们量化的对应物的同时,在准确性和效率之间取得良好的平衡仍然具有挑战性。我们提出了一种称为弹性显着位量化(ESB)的新方法,可控制量化值的有效位数,以获得具有更少资源的更好的推理准确性。我们设计一个统一的数学公式,以限制ESB的量化值,具有灵活的有效位。我们还引入了分布差对准器(DDA),以定量对齐全精密重量或激活值和量化值之间的分布。因此,ESB适用于各种重量和DNN的激活的各种钟形分布,从而保持高推理精度。从较少的量化值中受益于较少的量化值,ESB可以降低乘法复杂性。我们将ESB实施为加速器,并定量评估其对FPGA的效率。广泛的实验结果表明,ESB量化始终如一地优于最先进的方法,并分别通过AlexNet,Resnet18和MobileNetv2的平均精度提高4.78%,1.92%和3.56%。此外,ESB作为加速器可以在Xilinx ZCU102 FPGA平台上实现1K LUT的10.95 GOPS峰值性能。与FPGA上的CPU,GPU和最先进的加速器相比,ESB加速器可以分别将能效分别提高到65倍,11x和26倍。
translated by 谷歌翻译
我们日常生活中的深度学习是普遍存在的,包括自驾车,虚拟助理,社交网络服务,医疗服务,面部识别等,但是深度神经网络在训练和推理期间需要大量计算资源。该机器学习界主要集中在模型级优化(如深度学习模型的架构压缩),而系统社区则专注于实施级别优化。在其间,在算术界中提出了各种算术级优化技术。本文在模型,算术和实施级技术方面提供了关于资源有效的深度学习技术的调查,并确定了三种不同级别技术的资源有效的深度学习技术的研究差距。我们的调查基于我们的资源效率度量定义,阐明了较低级别技术的影响,并探讨了资源有效的深度学习研究的未来趋势。
translated by 谷歌翻译
Adder Neural Network (AdderNet) provides a new way for developing energy-efficient neural networks by replacing the expensive multiplications in convolution with cheaper additions (i.e.l1-norm). To achieve higher hardware efficiency, it is necessary to further study the low-bit quantization of AdderNet. Due to the limitation that the commutative law in multiplication does not hold in l1-norm, the well-established quantization methods on convolutional networks cannot be applied on AdderNets. Thus, the existing AdderNet quantization techniques propose to use only one shared scale to quantize both the weights and activations simultaneously. Admittedly, such an approach can keep the commutative law in the l1-norm quantization process, while the accuracy drop after low-bit quantization cannot be ignored. To this end, we first thoroughly analyze the difference on distributions of weights and activations in AdderNet and then propose a new quantization algorithm by redistributing the weights and the activations. Specifically, the pre-trained full-precision weights in different kernels are clustered into different groups, then the intra-group sharing and inter-group independent scales can be adopted. To further compensate the accuracy drop caused by the distribution difference, we then develop a lossless range clamp scheme for weights and a simple yet effective outliers clamp strategy for activations. Thus, the functionality of full-precision weights and the representation ability of full-precision activations can be fully preserved. The effectiveness of the proposed quantization method for AdderNet is well verified on several benchmarks, e.g., our 4-bit post-training quantized adder ResNet-18 achieves an 66.5% top-1 accuracy on the ImageNet with comparable energy efficiency, which is about 8.5% higher than that of the previous AdderNet quantization methods.
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
二进制神经网络(BNNS)已经证明了它们能够以可比精度(DNNS)的准确性来解决复杂任务的能力,同时还降低了计算能力和存储要求并提高处理速度。这些属性使它们成为开发和部署基于DNN的应用程序(IOT)设备的吸引人的替代方法。尽管最近有所改善,但它们的压缩因素可能会导致一些资源非常有限的设备可能导致不足。在这项工作中,我们提出了稀疏的二进制神经网络(SBNNS),这是一种新颖的模型和训练方案,它引入了BNN中的稀疏性和一种新的量化函数,以使网络的权重进行二进制。提出的SBNN能够达到高压因子,并减少了推理时的操作和参数数量。我们还提供工具来协助SBNN设计,同时尊重硬件资源约束。我们通过三个数据集的线性和卷积网络上的一组实验来研究我们方法对不同压缩因子的概括属性。我们的实验证实,SBNNS可以达到高压缩率,而不会损害概括,同时进一步降低了BNNS的操作,使SBNNS成为以廉价,低成本,限量资源的IOT设备和传感器部署DNN的可行选择。
translated by 谷歌翻译
Model quantization enables the deployment of deep neural networks under resource-constrained devices. Vector quantization aims at reducing the model size by indexing model weights with full-precision embeddings, i.e., codewords, while the index needs to be restored to 32-bit during computation. Binary and other low-precision quantization methods can reduce the model size up to 32$\times$, however, at the cost of a considerable accuracy drop. In this paper, we propose an efficient framework for ternary quantization to produce smaller and more accurate compressed models. By integrating hyperspherical learning, pruning and reinitialization, our proposed Hyperspherical Quantization (HQ) method reduces the cosine distance between the full-precision and ternary weights, thus reducing the bias of the straight-through gradient estimator during ternary quantization. Compared with existing work at similar compression levels ($\sim$30$\times$, $\sim$40$\times$), our method significantly improves the test accuracy and reduces the model size.
translated by 谷歌翻译
基于变压器的模型用于实现各种深度学习任务的最新性能。由于基于变压器的模型具有大量参数,因此在下游任务上进行微调是计算密集型和饥饿的能量。此类型号的自动混合精液FP32/FP16微调以前已用于降低计算资源需求。但是,随着低位整数背面传播的最新进展,有可能进一步减少计算和记忆脚印。在这项工作中,我们探索了一种新颖的整数训练方法,该方法使用整数算术来进行正向传播和梯度计算,对基于变压器的模型中的线性,卷积,层和层和嵌入层的梯度计算。此外,我们研究了各种整数位宽度的效果,以找到基于变压器模型的整数微调所需的最小位宽度。我们使用整数层对流行的下游任务进行了微调和VIT模型。我们表明,16位整数模型与浮点基线性能匹配。将位宽度降低到10,我们观察到0.5平均得分下降。最后,将位宽度的进一步降低到8的平均得分下降为1.7分。
translated by 谷歌翻译