Post-training quantization (PTQ) attracts increasing attention due to its convenience in deploying quantized neural networks. Rounding, the primary source of quantization error, is optimized only for model weights, while activations still use the round-to-nearest operation. In this work, we demonstrate for the first time that a well-chosen activation rounding scheme can improve the final accuracy. To address the dynamicity challenge of activation rounding, we adapt the rounding border with a simple function to generate the rounding scheme at the inference stage. The border function covers the impact of weight error, activation error, and propagation error to eliminate the bias of the element-wise error, which further benefits model accuracy. We also make the border aware of the global error to better fit different arriving activations. Finally, we propose the AQuant framework to learn the border function. Extensive experiments show that AQuant achieves noticeable improvements over state-of-the-art works with negligible overhead and pushes the accuracy of ResNet-18 up to 60.3% under 2-bit weight and activation post-training quantization.
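To make the rounding-border idea concrete, here is a minimal numpy sketch (the tanh border below is hypothetical; AQuant learns the border function from weight, activation, and propagation errors rather than using this toy form):

```python
import numpy as np

def quantize_round_to_nearest(x, scale):
    # Baseline: a fixed rounding border of 0.5 for every element.
    return np.round(x / scale) * scale

def quantize_with_border(x, scale, border):
    # Border-adjusted rounding: an element rounds up only when its
    # fractional part exceeds the (possibly element-wise) border.
    q = x / scale
    return (np.floor(q) + (q - np.floor(q) > border)) * scale

x = np.random.randn(8).astype(np.float32)
scale = 0.25
border = 0.5 + 0.1 * np.tanh(x)   # hypothetical activation-dependent border
print(quantize_round_to_nearest(x, scale))
print(quantize_with_border(x, scale, border))
```

Shifting the border away from 0.5 changes which elements round up, which is the degree of freedom the learned border function exploits.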
We introduce a power-of-two low-bit post-training quantization (PTQ) method for deep neural networks that meets hardware requirements and does not call for long-time retraining. Power-of-two quantization can convert the multiplications introduced by quantization and dequantization into bit shifts, which are adopted by many efficient accelerators. However, power-of-two scale factors have fewer candidate values, which leads to more rounding or clipping errors. We propose a novel power-of-two PTQ framework, dubbed RAPQ, which dynamically adjusts the power-of-two scales of the whole network instead of statically determining them layer by layer. It can theoretically trade off the rounding error and clipping error of the whole network. Meanwhile, the reconstruction method in RAPQ is based on the BN information of each unit. Extensive experiments on ImageNet prove the excellent performance of our proposed method. Without bells and whistles, RAPQ reaches accuracies of 65% and 48% on ResNet-18 and MobileNetV2, respectively, with INT2 weights and INT4 activations. We are the first to propose this more constrained but hardware-friendly power-of-two quantization scheme for low-bit PTQ and to prove that it can achieve nearly the same accuracy as SOTA PTQ methods. The code has been released.
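To see why power-of-two scales are hardware-friendly, consider this sketch of quantization with scale s = 2^-k (it illustrates the bit-shift property only, not RAPQ's joint scale optimization or BN-based reconstruction):

```python
import numpy as np

def quantize_pow2(x, k, n_bits=4):
    # Scale constrained to a power of two: s = 2**-k, so x / s = x * 2**k.
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x * (1 << k)), -qmax - 1, qmax).astype(np.int32)

x = np.random.randn(6).astype(np.float32)
k = 3                      # hypothetical exponent; RAPQ searches these jointly
q = quantize_pow2(x, k)
# Dequantization q * 2**-k is an arithmetic right shift in a fixed-point
# pipeline; we emulate it in float here for clarity.
x_hat = q / float(1 << k)
print(q, x_hat)
```

A larger k shrinks rounding error but clips more of the tail; a smaller k does the opposite, which is exactly the rounding-versus-clipping trade-off the abstract describes.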
Model binarization is an effective method for compressing neural networks and accelerating their inference. However, a significant performance gap still exists between 1-bit and 32-bit models. Empirical studies show that binarization causes information loss in both forward and backward propagation. We present a novel Distribution-sensitive Information Retention Network (DIR-Net) that retains information in the forward and backward propagation by improving internal propagation and introducing external representations. DIR-Net relies on three technical contributions: (1) Information Maximized Binarization (IMB): minimizing the information loss and the binarization error of weights/activations simultaneously through weight balance and standardization; (2) Distribution-sensitive Two-stage Estimator (DTE): retaining gradient information through distribution-sensitive soft approximation, jointly considering updating capability and gradient accuracy; (3) Representation Binarization-aware Distillation (RBD): retaining representation information by distilling the representations between the full-precision and binarized networks. DIR-Net investigates both the forward and backward processes of BNNs from a unified information perspective, providing new insight into the mechanism of network binarization. The three techniques in our DIR-Net are versatile and effective and can be applied to various structures to improve BNNs. Comprehensive experiments on image classification and object detection tasks show that our DIR-Net consistently outperforms state-of-the-art binarization methods under mainstream and compact architectures such as ResNet, VGG, EfficientNet, DARTS, and MobileNet. Additionally, we deploy DIR-Net on real-world resource-limited devices, achieving 11.1x storage saving and a 5.4x speedup.
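A hedged sketch of the IMB step as described above: balancing (zero-meaning) and standardizing the weights before taking the sign pushes the binary codes toward maximum entropy (near-equal +1/-1 counts). The mean-absolute-value scaling factor below is a common choice in binarization work and is an assumption, not necessarily the paper's exact formulation:

```python
import numpy as np

def imb_binarize(w, eps=1e-8):
    # Balance: remove the mean so +1 and -1 are (near) equally likely,
    # which maximizes the entropy of the binary weights.
    w = w - w.mean()
    # Standardize: unit variance stabilizes the binarization error.
    w = w / (w.std() + eps)
    # Binarize with a per-tensor scale (assumed mean-|w| choice here).
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)

w = np.random.randn(64, 32)
wb = imb_binarize(w)
print(np.unique(wb), (wb > 0).mean())  # two levels, roughly balanced
```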
As a neural network compression technique, post-training quantization (PTQ) transforms a pre-trained model into a quantized model using a lower-precision data type. However, the prediction accuracy will decrease because of the quantization noise, especially in extremely low-bit settings. How to determine the appropriate quantization parameters (e.g., scaling factors and rounding of weights) is the main problem at hand. Many existing methods determine the quantization parameters by minimizing the distance between features before and after quantization, but using this distance as the metric only considers local information. We analyze the problem of minimizing local metrics and show that it does not yield optimal quantization parameters. Furthermore, the quantized model suffers from overfitting due to the small number of calibration samples in PTQ. In this paper, we propose PD-Quant to solve these problems. PD-Quant uses the difference between network predictions before and after quantization to determine the quantization parameters. To mitigate the overfitting problem, PD-Quant adjusts the distribution of activations in PTQ. Experiments show that PD-Quant leads to better quantization parameters and improves the prediction accuracy of quantized models, especially in low-bit settings. For example, PD-Quant pushes the accuracy of ResNet-18 up to 53.08% and RegNetX-600MF up to 40.92% with 2-bit weights and 2-bit activations. The code will be released at https://github.com/hustvl/PD-Quant.
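The contrast between a local feature metric and a prediction-difference metric can be made concrete as follows (the loss shapes are illustrative; PD-Quant's exact objective and regularization may differ):

```python
import numpy as np

def local_metric(feat_fp, feat_q):
    # Local objective: match intermediate features layer by layer.
    return np.mean((feat_fp - feat_q) ** 2)

def prediction_difference(logits_fp, logits_q):
    # Global objective: match the network's final predictions,
    # here via KL divergence between softmax outputs.
    p = np.exp(logits_fp - logits_fp.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    q = np.exp(logits_q - logits_q.max(-1, keepdims=True))
    q /= q.sum(-1, keepdims=True)
    return np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))

logits_fp = np.random.randn(4, 10)
logits_q = logits_fp + 0.1 * np.random.randn(4, 10)
print(prediction_difference(logits_fp, logits_q))
```

Optimizing quantization parameters against the second metric ties them to the quantity that actually matters, the model's predictions, rather than to any single layer's reconstruction error.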
Model quantization has emerged as an indispensable technique for accelerating deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization work is often unreproducible and undeployable, because researchers do not choose consistent training pipelines and ignore the requirements of hardware deployment. In this work, we propose the Model Quantization Benchmark (MQBench), a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms. We choose multiple different platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate an extensive set of state-of-the-art quantization algorithms under a unified training pipeline. MQBench acts as a bridge connecting algorithms and hardware. We conduct a comprehensive analysis and find considerable intuitive and counter-intuitive insights. By aligning the training settings, we find that existing algorithms have roughly the same performance on the conventional academic track, while for hardware-deployable quantization a huge accuracy gap remains unsettled. Surprisingly, no existing algorithm wins every challenge in MQBench, and we hope this work can inspire future research directions.
Mixed-precision deep neural networks achieve the energy efficiency and throughput needed for hardware deployment, particularly when resources are limited, without sacrificing accuracy. However, the optimal per-layer bit precision that preserves accuracy is not easy to find, especially given the abundance of models, datasets, and quantization techniques, which creates an enormous search space. To tackle this difficulty, a body of literature has emerged recently, and several frameworks achieving promising accuracy results have been proposed. In this paper, we first summarize the quantization techniques commonly used in the literature. We then present a thorough survey of mixed-precision frameworks, categorized by their optimization techniques (e.g., reinforcement learning) and quantization techniques (e.g., deterministic rounding). Furthermore, the advantages and shortcomings of each framework are discussed and presented side by side. We finally give guidelines for future mixed-precision frameworks.
Data clipping is crucial for reducing noise in quantization operations and improving the accuracy of quantization-aware training (QAT). Current practices rely on heuristics to set the clipping threshold scalars and cannot be shown to be optimal. We propose Optimally Clipped Tensors And Vectors (OCTAV), a recursive algorithm that determines MSE-optimal clipping scalars. Derived from the fast Newton-Raphson method, OCTAV finds optimal clipping scalars on the fly, for every tensor, at every iteration of the QAT routine. The QAT algorithm is thus formulated with provably minimum quantization noise at each step. In addition, we reveal limitations of common gradient estimation techniques in QAT and propose magnitude-aware differentiation to further improve accuracy. Experimentally, OCTAV-enabled QAT achieves state-of-the-art accuracy on multiple tasks, including training ResNets and MobileNets from scratch and retraining them on ImageNet, as well as fine-tuning with BERT models, where OCTAV-enabled QAT consistently preserves accuracy at low precision (4 to 6 bits). Our results require no modifications to the baseline training recipe other than inserting quantization operations where appropriate.
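As a concrete reading of the recursion (hedged: the update below is our sample-estimate form of the Newton-Raphson step, so consult the paper for the exact derivation), the MSE-optimal clipping scalar for a B-bit quantizer balances rounding noise inside the clip range against clipping noise outside it:

```python
import numpy as np

def octav_clipping_scalar(x, num_bits=4, iters=20):
    # Fixed-point / Newton-Raphson iteration for the MSE-optimal
    # clipping scalar s of a B-bit quantizer (our reading of OCTAV):
    #   s <- E[|x|; |x|>s] / ((4**-B / 3) * E[1; |x|<=s] + E[1; |x|>s])
    a = np.abs(x.ravel())
    s = a.mean()                     # any positive init works in practice
    c = 4.0 ** (-num_bits) / 3.0     # rounding-noise coefficient
    for _ in range(iters):
        clipped = a > s
        s = a[clipped].sum() / (c * (~clipped).sum() + clipped.sum() + 1e-12)
    return s

x = np.random.randn(10000)
print(octav_clipping_scalar(x, num_bits=4))
```

Because each update uses only sums and counts over the tensor, the scalar can be recomputed cheaply at every training step, which is what makes the on-the-fly usage in QAT practical.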
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy or do not run efficiently on hardware. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs that can be implemented efficiently. We observe that systematic outliers appear at fixed activation channels. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the GEMMs in LLMs, including OPT-175B, BLOOM-176B, and GLM-130B. SmoothQuant has better hardware efficiency than existing techniques using mixed-precision activation quantization or weight-only quantization. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. Thanks to the hardware-friendly design, we integrate SmoothQuant into FasterTransformer, a state-of-the-art LLM serving framework, and achieve faster inference speed with half the number of GPUs compared to FP16. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at: https://github.com/mit-han-lab/smoothquant.
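The core transformation is easy to state: for a linear layer Y = XW, pick a per-input-channel smoothing factor s and rewrite Y = (X diag(s)^-1)(diag(s) W), shrinking activation outliers at the cost of slightly harder-to-quantize weights. A minimal numpy sketch of this smoothing step, using the paper's max-based formula with migration strength alpha (calibration and the INT8 kernels are out of scope here):

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    # Per-channel smoothing factor: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
    s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))
    X_hat = X / s           # activations become easier to quantize
    W_hat = W * s[:, None]  # the difficulty is migrated into the weights
    return X_hat, W_hat

X = np.random.randn(16, 64)
X[:, 7] *= 50.0             # a systematic outlier channel
W = np.random.randn(64, 32)
X_hat, W_hat = smooth(X, W)
assert np.allclose(X @ W, X_hat @ W_hat)   # mathematically equivalent
```

Since the rewrite is exact, the smoothing can be folded offline into the previous layer's parameters, and quantization is then applied to X_hat and W_hat instead of X and W.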
Neural network quantization aims to transform the high-precision weights and activations of a given network into low-precision weights/activations, reducing memory usage and computation while preserving the performance of the original model. However, extreme quantization (1-bit weights / 1-bit activations) of compactly designed backbone architectures (e.g., MobileNets), which are often used for edge-device deployment, results in severe performance degradation. This paper proposes a novel quantization-aware training (QAT) method that effectively alleviates this degradation even under extreme quantization by focusing on the inter-weight dependencies between the weights within and across layers. To minimize the quantization impact of each weight on the others, we perform an orthonormal transformation of the weights at each layer by training an input-dependent correlation matrix and an importance vector, such that each weight is disentangled from the others. We then quantize the weights based on their importance so as to minimize the loss of information from the original weights/activations. We further perform progressive layer-wise quantization from the bottom layer to the top, so that the quantization at each layer reflects the quantized distributions of the weights and activations of the preceding layers. We validate the effectiveness of our method on various benchmark datasets against strong neural quantization baselines, showing that it alleviates the performance degradation on ImageNet and successfully preserves the full-precision model performance with compact backbone networks on CIFAR-100.
Although considerable progress has been obtained in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices as one dedicated model needs to be trained, transmitted, and stored for one specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism which allows us to obtain multiple quantized networks from one full precision source model by progressively mapping the higher precision weights to their adjacent lower precision counterparts. Then, with networks of different bit-widths from one source model, multi-objective optimization is employed to train the shared source model weights such that they can be updated simultaneously, considering the performance of all networks. By doing this, the shared weights will be optimized to balance the performance of different quantized models, thus making the weights transferable among different bit widths. Experiments show that the proposed vertical-layered representation and developed once QAT scheme are effective in embodying multiple quantized networks into a single one and allow one-time training, and it delivers comparable performance as that of quantized models tailored to any specific bit-width. Code will be available.
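One way to picture the cascade downsampling is the hedged sketch below, under the assumption that "mapping to the adjacent lower precision" means re-rounding the higher-precision reconstruction at the next bit-width (the paper's exact mapping and the multi-objective training loop are omitted):

```python
import numpy as np

def quantize_uniform(w, n_bits):
    # Symmetric uniform quantizer returning integer codes and the scale.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32), scale

def cascade_downsample(w, bit_widths=(8, 6, 4, 2)):
    # Progressively map higher-precision weights to the adjacent lower
    # precision, so every bit-width is derived from one source model.
    models = {}
    w_cur = w
    for b in bit_widths:
        q, s = quantize_uniform(w_cur, b)
        w_cur = q * s          # the next level downsamples this reconstruction
        models[b] = (q, s)
    return models

models = cascade_downsample(np.random.randn(128, 128))
print({b: q.dtype for b, (q, s) in models.items()})
```

In the vertical-layered view, the shared source weights are then trained so that every entry of `models` performs well simultaneously, which is what makes one stored model servable at any of the bit-widths.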
Although weight and activation quantization is an effective approach for Deep Neural Network (DNN) compression and has a lot of potential to increase inference speed leveraging bit-operations, there is still a noticeable gap in terms of prediction accuracy between the quantized model and the full-precision model. To address this gap, we propose to jointly train a quantized, bit-operation-compatible DNN and its associated quantizers, as opposed to using fixed, handcrafted quantization schemes such as uniform or logarithmic quantization. Our method for learning the quantizers applies to both network weights and activations with arbitrary-bit precision, and our quantizers are easy to train. The comprehensive experiments on CIFAR-10 and ImageNet datasets show that our method works consistently well for various network structures such as AlexNet, VGG-Net, GoogLeNet, ResNet, and DenseNet, surpassing previous quantization methods in terms of accuracy by an appreciable margin. Code available at https://github.com/Microsoft/LQ-Nets
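Our reading of the learned-quantizer construction, sketched below: a K-bit quantizer is parameterized by a basis vector v whose 2^K signed binary combinations form the quantization levels. The basis values are illustrative, and the training loop that updates v alongside the network weights is omitted:

```python
import numpy as np

def lq_levels(v):
    # Levels of a K-bit learned quantizer: q = sum_k v_k * b_k with
    # b_k in {-1, +1} (our reading of the bit-operation-compatible form).
    K = len(v)
    codes = np.array(np.meshgrid(*[[-1, 1]] * K)).reshape(K, -1).T
    return codes @ v

def lq_quantize(x, v):
    levels = np.sort(lq_levels(v))
    idx = np.abs(x[..., None] - levels).argmin(axis=-1)  # nearest level
    return levels[idx]

v = np.array([0.5, 0.25])      # hypothetical learned basis for 2 bits
print(np.sort(lq_levels(v)))   # [-0.75, -0.25, 0.25, 0.75]
print(lq_quantize(np.random.randn(5), v))
```

Because each quantized value is a signed combination of the basis entries, inner products with such values decompose into bit-operations plus a few scalar multiplies, which is what keeps the learned scheme hardware-compatible.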
Recently, low-precision deep learning accelerators (DLAs) have become popular due to their advantages in chip area and energy consumption, yet low-precision quantized models on these DLAs suffer severe accuracy degradation. One way to achieve both high accuracy and efficient inference is to deploy high-precision neural networks on low-precision DLAs, which is rarely studied. In this paper, we propose the parallel low-precision quantization (PalQuant) method, which approximates high-precision computation by learning parallel low-precision representations from scratch. In addition, we propose a novel cyclic shuffle module to boost cross-group information communication between the parallel low-precision groups. Extensive experiments demonstrate that PalQuant outperforms state-of-the-art quantization methods in both accuracy and inference speed; for example, for ResNet-18 quantization, PalQuant obtains 0.52% higher accuracy and a 1.78x speedup simultaneously over its 4-bit counterpart on a state-of-the-art 2-bit accelerator. Code is available at https://github.com/huqinghao/palquant.
Quantization has proven to be a vital method for improving the inference efficiency of deep neural networks (DNNs). However, it remains challenging to strike a good balance between accuracy and efficiency when quantizing DNN weights or activations from high-precision formats to their quantized counterparts. We propose a new method called elastic significant bit quantization (ESB) that controls the number of significant bits of quantized values to obtain better inference accuracy with fewer resources. We design a unified mathematical formula to constrain the quantized values of ESB with a flexible number of significant bits. We also introduce a distribution difference aligner (DDA) to quantitatively align the distributions of the full-precision weight or activation values and the quantized values. Consequently, ESB is suitable for the various bell-shaped distributions of DNN weights and activations, thus maintaining high inference accuracy. Benefiting from fewer significant bits in the quantized values, ESB reduces multiplication complexity. We implement ESB as an accelerator and quantitatively evaluate its efficiency on FPGAs. Extensive experimental results show that ESB quantization consistently outperforms state-of-the-art methods, improving average accuracy by 4.78%, 1.92%, and 3.56% on AlexNet, ResNet18, and MobileNetV2, respectively. Furthermore, ESB as an accelerator achieves a peak performance of 10.95 GOPS per 1k LUTs on the Xilinx ZCU102 FPGA platform. Compared with CPUs, GPUs, and state-of-the-art accelerators on FPGAs, the ESB accelerator improves energy efficiency by up to 65x, 11x, and 26x, respectively.
Quantization is a technique for reducing the computation and memory cost of DNN models, which keep growing larger. Existing quantization solutions use fixed-point integer or floating-point types, whose benefits are limited because both require more bits to maintain the accuracy of the original model. On the other hand, variable-length quantization uses low-bit quantization for normal values and high precision for the fraction of values that are outliers. Even though this line of work brings algorithmic benefits, it introduces significant hardware overhead due to variable-length encoding and decoding. In this work, we propose a fixed-length adaptive numerical data type called ANT to achieve low-bit quantization with tiny hardware overhead. Our data type exploits two key innovations to leverage the intra-tensor and inter-tensor adaptive opportunities in DNN models. First, we propose a particular data type, flint, that combines the advantages of float and int to adapt to the importance of different values within a tensor. Second, we propose an adaptive framework that selects the best type for each tensor according to its distribution characteristics. We design a unified processing-element architecture for ANT and show its ease of integration with existing DNN accelerators. Our design achieves a 2.8x speedup and a 2.5x improvement in energy efficiency over state-of-the-art quantization accelerators.
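The inter-tensor half of the idea can be sketched without reproducing the flint bit layout (which we do not attempt here); `quant_pow2` below is only a stand-in for an exponent-heavy type, and the per-tensor selection rule is plain MSE:

```python
import numpy as np

def quant_int(x, n_bits=4):
    # Fixed-point stand-in: uniform levels, good for flat distributions.
    qmax = 2 ** (n_bits - 1) - 1
    s = np.abs(x).max() / qmax
    return np.clip(np.round(x / s), -qmax, qmax) * s

def quant_pow2(x, n_bits=4):
    # Float-like stand-in: values snapped to signed powers of two,
    # good for long-tailed distributions with outliers.
    mag = np.abs(x) + 1e-12
    e_max = np.round(np.log2(mag.max()))
    e = np.clip(np.round(np.log2(mag)), e_max - 2 ** (n_bits - 1) + 1, e_max)
    return np.sign(x) * 2.0 ** e

def select_type(x, candidates=(quant_int, quant_pow2)):
    # Adaptive per-tensor selection: keep the candidate data type
    # with the lowest mean-squared reconstruction error.
    return min(candidates, key=lambda q: np.mean((x - q(x)) ** 2))

print(select_type(np.random.randn(1024)).__name__)
```

The fixed-length property matters because every value occupies the same number of bits regardless of which type wins, so the hardware decoder needs no variable-length bookkeeping.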
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
It has been witnessed that learned image compression has outperformed conventional image coding techniques and tends to be practical in industrial applications. One of the most critical issues that need to be considered is the non-deterministic calculation, which makes the probability prediction cross-platform inconsistent and frustrates successful decoding. We propose to solve this problem by introducing well-developed post-training quantization and making the model inference integer-arithmetic-only, which is much simpler than presently existing training and fine-tuning based approaches yet still keeps the superior rate-distortion performance of learned image compression. Based on that, we further improve the discretization of the entropy parameters and extend the deterministic inference to fit Gaussian mixture models. With our proposed methods, the current state-of-the-art image compression models can infer in a cross-platform consistent manner, which makes the further development and practice of learned image compression more promising.
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational cost. However, existing methods either suffer from severe performance degradation at ultra-low precisions of 4 bits or lower, or require a heavy fine-tuning process to recover performance. To the best of our knowledge, this vulnerability to low precision stems from two statistical observations about feature map values. First, the distribution of feature map values varies significantly per channel and per input image. Second, feature maps contain outliers that can dominate the quantization error. Based on these observations, we propose a novel distribution-aware quantization scheme (DAQ) that facilitates accurate training-free quantization at ultra-low precision. A simple function in DAQ determines the dynamic range of feature maps and weights with low computational burden. Furthermore, our method enables mixed-precision quantization by calculating the relative sensitivity of each channel, without involving any training process. Nonetheless, quantization-aware training is also applicable for additional performance gains. Our new method outperforms recent training-free and even training-based quantization methods on state-of-the-art image super-resolution networks at ultra-low precision.
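A hedged sketch of what such a "simple function" for the dynamic range can look like: per-channel mean and standard deviation give a clipping range that tracks the bulk of the distribution rather than its outliers (the statistic and the k factor below are assumptions, not necessarily the paper's exact choice):

```python
import numpy as np

def daq_range(feat, k=3.0):
    # Per-channel dynamic range from cheap statistics (mean +/- k*std),
    # so a handful of outliers cannot dominate the quantization step.
    mu = feat.mean(axis=(0, 2, 3), keepdims=True)     # NCHW layout
    sigma = feat.std(axis=(0, 2, 3), keepdims=True)
    return mu - k * sigma, mu + k * sigma

def quantize(feat, n_bits=4, k=3.0):
    lo, hi = daq_range(feat, k)
    scale = (hi - lo) / (2 ** n_bits - 1)
    q = np.round((np.clip(feat, lo, hi) - lo) / scale)
    return q * scale + lo

x = np.random.randn(2, 8, 16, 16) * np.random.rand(1, 8, 1, 1)
x_hat = quantize(x)
print(np.abs(x - x_hat).mean())
```

Because the statistics are recomputed per channel and per input, the range adapts to exactly the two sources of variation the abstract identifies.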
The record-breaking performance of deep neural networks (DNNs) comes with heavy parameterization, leading to external dynamic random-access memory (DRAM) for storage. The prohibitive energy of DRAM accesses makes it non-trivial to deploy DNNs on resource-constrained devices, calling for minimizing weight and data movement to improve energy efficiency. We present SmartDeal (SD), an algorithmic framework that trades higher-cost memory storage/access for lower-cost computation, to aggressively boost storage and energy efficiency in both inference and training. The core of SD is a novel weight decomposition with structural constraints, carefully crafted to unleash the hardware-efficiency potential. Specifically, we decompose each weight tensor as the product of a small basis matrix and a large, structurally sparse coefficient matrix whose non-zeros are quantized to powers of 2. The resulting sparse and quantized DNNs enjoy greatly reduced energy for data movement and weight storage, with minimal overhead for recovering the original weights thanks to sparse bit-operations and cost-favorable computations. Beyond inference, we take a further leap to embrace energy-efficient training, introducing innovative techniques to address the unique roadblocks that arise in training while preserving the SD structure. We also design a dedicated hardware accelerator that fully utilizes the SD structure to improve real energy efficiency and latency. We conduct experiments on multiple tasks, models, and datasets in different settings. The results show that: 1) applied to inference, SD achieves up to 2.44x energy efficiency as evaluated via real hardware implementations; 2) applied to training, SD achieves 10.56x and 4.48x reductions in storage and training energy, respectively, with negligible accuracy loss compared to state-of-the-art training baselines. Our source code is available online.
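The decomposition can be sketched as alternating least squares with a projection that keeps the coefficient matrix sparse and power-of-two (our simplification: the paper imposes additional structural constraints and trains the factors; the rank, threshold, and function names below are illustrative):

```python
import numpy as np

def project_coeff(C, prune_thresh=0.05):
    # Coefficient-matrix constraint: structurally sparse, with the
    # surviving non-zeros rounded to the nearest signed power of two.
    C = np.where(np.abs(C) < prune_thresh, 0.0, C)
    nz = C != 0
    C[nz] = np.sign(C[nz]) * 2.0 ** np.round(np.log2(np.abs(C[nz])))
    return C

def smartdeal_decompose(W, rank=8, iters=10):
    # W (out x in) ~= C (out x rank) @ B (rank x in): B is a small dense
    # basis; C is large but sparse with power-of-two non-zeros, so
    # reconstructing W needs only shifts and adds.
    rng = np.random.default_rng(0)
    B = rng.standard_normal((rank, W.shape[1]))
    for _ in range(iters):
        C = project_coeff(np.linalg.lstsq(B.T, W.T, rcond=None)[0].T)
        B = np.linalg.lstsq(C, W, rcond=None)[0]
    return C, B

W = np.random.randn(64, 64)
C, B = smartdeal_decompose(W)
print(np.linalg.norm(W - C @ B) / np.linalg.norm(W))  # reconstruction error
```

The energy argument follows from the shapes: only the small dense basis and the compactly encoded sparse coefficients move through memory, while full weights are recovered on-chip with cheap shift-add arithmetic.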
Deep learning has achieved promising results in a wide spectrum of AI applications. Larger datasets and models consistently yield better performance, but generally at the cost of longer training time, more computation, and more communication. In this survey, we aim to provide a clear sketch of the optimizations for large-scale deep learning with respect to model accuracy and model efficiency. We investigate the algorithms most commonly used for optimization, elaborate on the debatable topic of the generalization gap that arises in large-batch training, and review the SOTA strategies for addressing communication overhead and reducing memory footprint.
When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating-point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper investigates in depth this benefit of the floating-point format for neural network inference. We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent, and show analytically in which settings these choices give better performance. We then show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and present a new algorithm that can learn both the scale parameters and the number of exponent bits in the FP8 format. Our chief conclusion is that for post-training quantization of a wide range of networks, the FP8 format is better than INT8 in terms of accuracy, and the choice of the number of exponent bits is driven by the severity of outliers in the network. We also conduct experiments with quantization-aware training, where the difference between formats disappears as the network is trained to reduce the effect of outliers.
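A minimal fake-quantization sketch of a configurable FP8 format (simplified: no NaN/Inf encodings and a simple subnormal rule, so the representable maxima differ slightly from the standard E4M3/E5M2 variants; the paper's simulator is more complete):

```python
import numpy as np

def fp8_quantize(x, n_exp=4, n_man=3, bias=None):
    # Simulate an FP8 format with n_exp exponent and n_man mantissa bits
    # (plus 1 sign bit). More exponent bits widen the dynamic range for
    # outliers; more mantissa bits give finer resolution per binade.
    if bias is None:
        bias = 2 ** (n_exp - 1) - 1
    e_min = 1 - bias                       # minimum normal exponent
    mag = np.abs(x)
    # Per-value exponent, floored into the subnormal range.
    e = np.maximum(np.floor(np.log2(np.maximum(mag, 1e-45))), e_min)
    step = 2.0 ** (e - n_man)              # quantization step at that exponent
    max_val = (2 - 2.0 ** -n_man) * 2.0 ** (2 ** n_exp - 2 - bias)
    return np.sign(x) * np.minimum(np.round(mag / step) * step, max_val)

x = np.random.randn(8) * 10
print(fp8_quantize(x, n_exp=4, n_man=3))   # E4M3-like: finer, narrower range
print(fp8_quantize(x, n_exp=5, n_man=2))   # E5M2-like: coarser, wider range
```

Comparing the two calls makes the paper's conclusion tangible: the exponent/mantissa split is a dial between outlier coverage and resolution, which is why the best choice depends on how heavy-tailed the network's tensors are.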