通过移除昂贵的乘法操作并将连续权重量化成低比特离散值来减少计算复杂性,与传统的神经网络相比,这是快速且节能的低比特离散值。然而,现有的换档网络对重量初始化敏感,并且还产生由消失梯度和重量率冻结问题引起的降级性能。为了解决这些问题,我们提出了一种低点重新参数化,这是一种用于训练低位换档网络的新技术。我们的方法以符号稀疏偏移3倍的方式分解离散参数。以这种方式,它有效地学习了一个低比特网络,其权重动力学类似于全精密网络并对重量初始化不敏感。我们所提出的培训方法推动移位神经网络的界限,并以在想象中的前1个精度方面显示出3位换档网络。
translated by 谷歌翻译
由于其不断增加的资源需求,在低资源边缘设备上部署深层神经网络是具有挑战性的。最近的研究提出了无倍数的神经网络,以减少计算和记忆消耗。 Shift神经网络是这些减少的最有效工具之一。但是,现有的低位换档网络不如其完整的精度对应物准确,并且由于其固有的设计缺陷,无法有效地转移到广泛的任务中。我们提出了利用以下新颖设计的光泽网络。首先,我们证明低位移位网络中的零重量值既不有用,也不简化模型推断。因此,我们建议使用零移动机制来简化推理,同时增加模型容量。其次,我们设计了一个新的指标,以测量训练低位移位网络中的重量冻结问题,并提出一个符号尺度分解以提高训练效率。第三,我们提出了低变化的随机初始化策略,以提高模型在转移学习方案中的性能。我们对各种计算机视觉和语音任务进行了广泛的实验。实验结果表明,光泽网络明显胜过现有的低位乘法网络,并可以实现全精度对应物的竞争性能。它还表现出强大的转移学习表现,没有准确性下降。
translated by 谷歌翻译
模型二进制化是一种压缩神经网络并加速其推理过程的有效方法。但是,1位模型和32位模型之间仍然存在显着的性能差距。实证研究表明,二进制会导致前进和向后传播中的信息损失。我们提出了一个新颖的分布敏感信息保留网络(DIR-NET),该网络通过改善内部传播和引入外部表示,将信息保留在前后传播中。 DIR-NET主要取决于三个技术贡献:(1)最大化二进制(IMB)的信息:最小化信息损失和通过重量平衡和标准化同时同时使用权重/激活的二进制误差; (2)分布敏感的两阶段估计器(DTE):通过共同考虑更新能力和准确的梯度来通过分配敏感的软近似来保留梯度的信息; (3)代表性二进制 - 意识蒸馏(RBD):通过提炼完整精确和二元化网络之间的表示来保留表示信息。 DIR-NET从统一信息的角度研究了BNN的前进过程和后退过程,从而提供了对网络二进制机制的新见解。我们的DIR-NET中的三种技术具有多功能性和有效性,可以在各种结构中应用以改善BNN。关于图像分类和客观检测任务的综合实验表明,我们的DIR-NET始终优于主流和紧凑型体系结构(例如Resnet,vgg,vgg,EfficityNet,darts和mobilenet)下最新的二进制方法。此外,我们在现实世界中的资源有限设备上执行DIR-NET,该设备可实现11.1倍的存储空间和5.4倍的速度。
translated by 谷歌翻译
深度神经网络(DNN)在多个领域取得了令人印象深刻的成功。多年来,随着更深层次,更复杂的体系结构的扩散,这些模型的准确性已经提高。因此,最新的解决方案通常在计算上很昂贵,这使得它们不适合在边缘计算平台上部署。为了减轻推断卷积神经网络(CNN)的高计算,内存和功率要求,我们提出了两次量化量化的使用,该量化量量化量的功率量化将连续参数量化为低点的两个值值。这通过删除昂贵的乘法操作和使用低位权重来降低计算复杂性。 Resnet被用作解决方案的基础,并根据口语理解(SLU)任务评估了建议的模型。实验结果表明,在测试集中,我们的低位量化实现了换档神经网络体系结构的性能,其低位量化达到了98.76 \%,这与其完整精确的对应物和最先进的解决方案相当。
translated by 谷歌翻译
二进制神经网络(BNNS)将原始的全精度权重和激活为1位,带有符号功能。由于传统符号函数的梯度几乎归零,因此不能用于反向传播,因此已经提出了几次尝试来通过使用近似梯度来缓解优化难度。然而,这些近似损坏了事实梯度的主要方向。为此,我们建议使用用于训练BNN的正弦函数的组合来估计傅立叶频域中的符号功能的梯度,即频域近似(FDA)。该提出的方法不会影响占据大部分整体能量的原始符号功能的低频信息,并且将忽略高频系数以避免巨大的计算开销。此外,我们将噪声适配模块嵌入到训练阶段以补偿近似误差。关于多个基准数据集和神经架构的实验说明了使用我们的方法学习的二进制网络实现了最先进的准确性。代码将在\ texit {https://gitee.com/mindspore/models/tree/master/research/cv/fda-bnn}上获得。
translated by 谷歌翻译
本文研究了重量和激活都将二进制神经网络(BNN)二进制为1位值,从而大大降低了记忆使用率和计算复杂性。由于现代深层神经网络具有复杂的设计,具有复杂的架构,其准确性,因此权重和激活分布的多样性非常高。因此,传统的符号函数不能很好地用于有效地在BNN中进行全精度值。为此,我们提出了一种称为Adabin的简单而有效的方法,可自适应获得最佳的二进制集$ \ {b_1,b_2 \} $($ b_1,b_1,b_2 \ in \ mathbb {r} $)的重量和激活而不是固定集(即$ \ { - 1,+1 \} $)。通过这种方式,提出的方法可以更好地拟合不同的分布,并提高二进制特征的表示能力。实际上,我们使用中心位置和1位值的距离来定义新的二进制量化函数。对于权重,我们提出了一种均衡方法,将对称分布的对称中心与实价分布相对,并最大程度地减少它们的kullback-leibler差异。同时,我们引入了一种基于梯度的优化方法,以获取这两个激活参数,这些参数以端到端的方式共同训练。基准模型和数据集的实验结果表明,拟议的Adabin能够实现最新性能。例如,我们使用RESNET-18体系结构在Imagenet上获得66.4 \%TOP-1的精度,并使用SSD300获得了Pascal VOC的69.4映射。
translated by 谷歌翻译
Although weight and activation quantization is an effective approach for Deep Neural Network (DNN) compression and has a lot of potentials to increase inference speed leveraging bit-operations, there is still a noticeable gap in terms of prediction accuracy between the quantized model and the full-precision model. To address this gap, we propose to jointly train a quantized, bit-operation-compatible DNN and its associated quantizers, as opposed to using fixed, handcrafted quantization schemes such as uniform or logarithmic quantization. Our method for learning the quantizers applies to both network weights and activations with arbitrary-bit precision, and our quantizers are easy to train. The comprehensive experiments on CIFAR-10 and ImageNet datasets show that our method works consistently well for various network structures such as AlexNet, VGG-Net, GoogLeNet, ResNet, and DenseNet, surpassing previous quantization methods in terms of accuracy by an appreciable margin. Code available at https://github.com/Microsoft/LQ-Nets
translated by 谷歌翻译
Although considerable progress has been obtained in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices as one dedicated model needs to be trained, transmitted, and stored for one specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism which allows us to obtain multiple quantized networks from one full precision source model by progressively mapping the higher precision weights to their adjacent lower precision counterparts. Then, with networks of different bit-widths from one source model, multi-objective optimization is employed to train the shared source model weights such that they can be updated simultaneously, considering the performance of all networks. By doing this, the shared weights will be optimized to balance the performance of different quantized models, thus making the weights transferable among different bit widths. Experiments show that the proposed vertical-layered representation and developed once QAT scheme are effective in embodying multiple quantized networks into a single one and allow one-time training, and it delivers comparable performance as that of quantized models tailored to any specific bit-width. Code will be available.
translated by 谷歌翻译
Model quantization enables the deployment of deep neural networks under resource-constrained devices. Vector quantization aims at reducing the model size by indexing model weights with full-precision embeddings, i.e., codewords, while the index needs to be restored to 32-bit during computation. Binary and other low-precision quantization methods can reduce the model size up to 32$\times$, however, at the cost of a considerable accuracy drop. In this paper, we propose an efficient framework for ternary quantization to produce smaller and more accurate compressed models. By integrating hyperspherical learning, pruning and reinitialization, our proposed Hyperspherical Quantization (HQ) method reduces the cosine distance between the full-precision and ternary weights, thus reducing the bias of the straight-through gradient estimator during ternary quantization. Compared with existing work at similar compression levels ($\sim$30$\times$, $\sim$40$\times$), our method significantly improves the test accuracy and reduces the model size.
translated by 谷歌翻译
当通过模拟量化训练神经网络时,我们观察到,量化的权重可以意外地在两个网格点之间振荡。这种效果的重要性及其对量化感知培训(QAT)的影响并未在文献中得到充分理解或研究。在本文中,我们更深入地研究了重量振荡现象,并表明由于推理过程中错误估计的批次纳入统计量和训练期间的噪声增加,它可能导致明显的准确性降解。这些效果在低位($ \ leq $ 4位)的高效网络中尤其明显,具有深度可分开的层,例如mobilenets和效率网络。在我们的分析中,我们研究了一些先前提出的QAT算法,并表明其中大多数无法克服振荡。最后,我们提出了两种新型的QAT算法来克服训练期间的振荡:振荡衰减和迭代重量冻结。我们证明,我们的算法对于低位(3&4位)的重量(3&4位)的最新精度以及有效体系结构的激活量化,例如MobilenetV2,MobilenetV3和Imagenet上的EfficentNet-Lite。我们的源代码可在{https://github.com/qualcomm-ai-research/oscillations-qat}上获得。
translated by 谷歌翻译
深度神经网络(DNN)的记录断裂性能具有沉重的参数化,导致外部动态随机存取存储器(DRAM)进行存储。 DRAM访问的禁用能量使得在资源受限的设备上部署DNN是不普遍的,呼叫最小化重量和数据移动以提高能量效率。我们呈现SmartDeal(SD),算法框架,以进行更高成本的存储器存储/访问的较低成本计算,以便在推理和培训中积极提高存储和能量效率。 SD的核心是一种具有结构约束的新型重量分解,精心制作以释放硬件效率潜力。具体地,我们将每个重量张量分解为小基矩阵的乘积以及大的结构稀疏系数矩阵,其非零被量化为-2的功率。由此产生的稀疏和量化的DNN致力于为数据移动和重量存储而大大降低的能量,因为由于稀疏的比特 - 操作和成本良好的计算,恢复原始权重的最小开销。除了推理之外,我们采取了另一次飞跃来拥抱节能培训,引入创新技术,以解决培训时出现的独特障碍,同时保留SD结构。我们还设计专用硬件加速器,充分利用SD结构来提高实际能源效率和延迟。我们在不同的设置中对多个任务,模型和数据集进行实验。结果表明:1)应用于推理,SD可实现高达2.44倍的能效,通过实际硬件实现评估; 2)应用于培训,储存能量降低10.56倍,减少了10.56倍和4.48倍,与最先进的训练基线相比,可忽略的准确性损失。我们的源代码在线提供。
translated by 谷歌翻译
模型量化已成为加速深度学习推理的不可或缺的技术。虽然研究人员继续推动量化算法的前沿,但是现有量化工作通常是不可否认的和不可推销的。这是因为研究人员不选择一致的训练管道并忽略硬件部署的要求。在这项工作中,我们提出了模型量化基准(MQBench),首次尝试评估,分析和基准模型量化算法的再现性和部署性。我们为实际部署选择多个不同的平台,包括CPU,GPU,ASIC,DSP,并在统一培训管道下评估广泛的最新量化算法。 MQBENCK就像一个连接算法和硬件的桥梁。我们进行全面的分析,并找到相当大的直观或反向直观的见解。通过对齐训练设置,我们发现现有的算法在传统的学术轨道上具有大致相同的性能。虽然用于硬件可部署量化,但有一个巨大的精度差距,仍然不稳定。令人惊讶的是,没有现有的算法在MQBench中赢得每一项挑战,我们希望这项工作能够激发未来的研究方向。
translated by 谷歌翻译
内存处理(PIM)是一种越来越多地研究的神经形态硬件,承诺能量和吞吐量改进以进行深度学习推断。 PIM利用大量平行,有效的模拟计算在内存内部,绕过传统数字硬件中数据移动的瓶颈。但是,需要额外的量化步骤(即PIM量化),通常由于硬件约束而导致的分辨率有限,才能将模拟计算结果转换为数字域。同时,由于不完善的类似物到数字界面,PIM量化中的非理想效应广泛存在,这进一步损害了推理的准确性。在本文中,我们提出了一种培训量化网络的方法,以合并PIM量化,这对所有PIM系统无处不在。具体而言,我们提出了PIM量化意识培训(PIM-QAT)算法,并通过分析训练动力学以促进训练收敛,从而在向后传播期间引入重新传播技术。我们还提出了两种技术,即批处理归一化(BN)校准和调整精度训练,以抑制实际PIM芯片中涉及的非理想线性和随机热噪声的不利影响。我们的方法在三个主流PIM分解方案上进行了验证,并在原型芯片上进行了物理上的验证。与直接在PIM系统上部署常规训练的量化模型相比,该模型没有考虑到此额外的量化步骤并因此失败,我们的方法提供了重大改进。它还可以在CIFAR10和CIFAR100数据集上使用各种网络深度来获得最受欢迎的网络拓扑结构,在CIFAR10和CIFAR100数据集上,在PIM系统上达到了可比的推理精度。
translated by 谷歌翻译
由于稀疏神经网络通常包含许多零权重,因此可以在不降低网络性能的情况下潜在地消除这些不必要的网络连接。因此,设计良好的稀疏神经网络具有显着降低拖鞋和计算资源的潜力。在这项工作中,我们提出了一种新的自动修剪方法 - 稀疏连接学习(SCL)。具体地,重量被重新参数化为可培训权重变量和二进制掩模的元素方向乘法。因此,由二进制掩模完全描述网络连接,其由单位步进函数调制。理论上,从理论上证明了使用直通估计器(STE)进行网络修剪的基本原理。这一原则是STE的代理梯度应该是积极的,确保掩模变量在其最小值处收敛。在找到泄漏的Relu后,SoftPlus和Identity Stes可以满足这个原理,我们建议采用SCL的身份STE以进行离散面膜松弛。我们发现不同特征的面具梯度非常不平衡,因此,我们建议将每个特征的掩模梯度标准化以优化掩码变量训练。为了自动训练稀疏掩码,我们将网络连接总数作为我们的客观函数中的正则化术语。由于SCL不需要由网络层设计人员定义的修剪标准或超级参数,因此在更大的假设空间中探讨了网络,以实现最佳性能的优化稀疏连接。 SCL克服了现有自动修剪方法的局限性。实验结果表明,SCL可以自动学习并选择各种基线网络结构的重要网络连接。 SCL培训的深度学习模型以稀疏性,精度和减少脚波特的SOTA人类设计和自动修剪方法训练。
translated by 谷歌翻译
Mixed-precision quantization has been widely applied on deep neural networks (DNNs) as it leads to significantly better efficiency-accuracy tradeoffs compared to uniform quantization. Meanwhile, determining the exact precision of each layer remains challenging. Previous attempts on bit-level regularization and pruning-based dynamic precision adjustment during training suffer from noisy gradients and unstable convergence. In this work, we propose Continuous Sparsification Quantization (CSQ), a bit-level training method to search for mixed-precision quantization schemes with improved stability. CSQ stabilizes the bit-level mixed-precision training process with a bi-level gradual continuous sparsification on both the bit values of the quantized weights and the bit selection in determining the quantization precision of each layer. The continuous sparsification scheme enables fully-differentiable training without gradient approximation while achieving an exact quantized model in the end.A budget-aware regularization of total model size enables the dynamic growth and pruning of each layer's precision towards a mixed-precision quantization scheme of the desired size. Extensive experiments show CSQ achieves better efficiency-accuracy tradeoff than previous methods on multiple models and datasets.
translated by 谷歌翻译
最近,低精确的深度学习加速器(DLA)由于其在芯片区域和能源消耗方面的优势而变得流行,但是这些DLA上的低精确量化模型导致严重的准确性降解。达到高精度和高效推断的一种方法是在低精度DLA上部署高精度神经网络,这很少被研究。在本文中,我们提出了平行的低精确量化(PALQUANT)方法,该方法通过从头开始学习并行低精度表示来近似高精度计算。此外,我们提出了一个新型的循环洗牌模块,以增强平行低精度组之间的跨组信息通信。广泛的实验表明,PALQUANT的精度和推理速度既优于最先进的量化方法,例如,对于RESNET-18网络量化,PALQUANT可以获得0.52 \%的准确性和1.78 $ \ times $ speedup同时获得在最先进的2位加速器上的4位反片机上。代码可在\ url {https://github.com/huqinghao/palquant}中获得。
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
We introduce a method to train Quantized Neural Networks (QNNs) -neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At traintime the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
translated by 谷歌翻译
用于压缩神经网络的非均匀量化策略通常实现的性能比其对应于对应物,即统一的策略,因为其优越的代表性能力。然而,许多非均匀量化方法在实现不均匀量化的权重/激活时忽略了复杂的投影过程,这在硬件部署中引起了不可忽略的时间和空间开销。在这项研究中,我们提出了非均匀致均匀的量化(N2UQ),一种方法,其能够保持非均匀方法的强表示能力,同时硬件友好且有效地作为模型推理的均匀量化。我们通过学习灵活的等距输入阈值来实现这一目标,以更好地拟合潜在的分布,同时将这些实值输入量化为等距输出电平。要使用可学习的输入阈值训练量化网络,我们将广义直通估计器(G-STE)介绍,用于难以应答的后向衍生计算W.r.t.阈值参数。此外,我们考虑熵保持正则化,以进一步降低重量量化的信息损失。即使在这种不利约束的施加均匀量化的重量和激活的情况下,我们的N2UQ也经历了最先进的非均匀量化方法,在想象中达到了0.7〜1.8%,展示了N2UQ设计的贡献。代码将公开可用。
translated by 谷歌翻译
With the advent of deep learning application on edge devices, researchers actively try to optimize their deployments on low-power and restricted memory devices. There are established compression method such as quantization, pruning, and architecture search that leverage commodity hardware. Apart from conventional compression algorithms, one may redesign the operations of deep learning models that lead to more efficient implementation. To this end, we propose EuclidNet, a compression method, designed to be implemented on hardware which replaces multiplication, $xw$, with Euclidean distance $(x-w)^2$. We show that EuclidNet is aligned with matrix multiplication and it can be used as a measure of similarity in case of convolutional layers. Furthermore, we show that under various transformations and noise scenarios, EuclidNet exhibits the same performance compared to the deep learning models designed with multiplication operations.
translated by 谷歌翻译