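Activation compressed training (ACT) has been shown to be a promising way to reduce memory consumption when training deep neural networks (DNNs). However, existing ACT work relies on searching for optimal bit-widths during DNN training to reduce quantization noise, which makes the procedure complicated and less transparent. To this end, we propose a simple and effective method for compressing DNN training. Our method is motivated by an observation: \emph{DNN backward propagation mainly depends on the low-frequency component (LFC) of the activation maps, rather than the high-frequency component (HFC)}. This indicates that the HFC of activation maps is highly redundant and compressible during DNN training, which inspires our proposed Dual Activation Precision (DIVISION). During training, DIVISION estimates the LFC and HFC of the activation maps and compresses the HFC into a low-precision copy to eliminate the redundancy. This significantly reduces memory consumption without negatively affecting the precision of DNN backward propagation. In this way, DIVISION achieves performance comparable to normal training. Experimental results on three benchmark datasets show that DIVISION outperforms state-of-the-art baseline methods in terms of memory consumption, model accuracy, and running speed.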
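The dual-precision idea in the abstract above can be illustrated with a small sketch. The abstract does not say how the LFC and HFC are estimated or how the HFC is compressed, so the choices here are assumptions made only for illustration: the LFC is taken as per-block means of a 1-D signal, the HFC is the residual, and the HFC is stored with a 2-bit uniform min-max quantizer. All function names are hypothetical.

```python
from typing import List, Tuple


def split_lfc_hfc(x: List[float], block: int) -> Tuple[List[float], List[float]]:
    """Assumed estimator: LFC = per-block mean, HFC = residual around that mean."""
    lfc = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        lfc.append(sum(chunk) / len(chunk))
    hfc = [v - lfc[i // block] for i, v in enumerate(x)]
    return lfc, hfc


def quantize(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Uniform min-max quantizer: returns integer codes plus offset and step."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def compress(x: List[float], block: int = 4, bits: int = 2):
    """Keep the LFC in full precision and the HFC as low-bit codes."""
    lfc, hfc = split_lfc_hfc(x, block)
    codes, lo, scale = quantize(hfc, bits)
    return lfc, codes, lo, scale


def reconstruct(lfc, codes, lo, scale, block: int = 4) -> List[float]:
    """Rebuild an approximate activation for use in the backward pass."""
    return [lfc[i // block] + lo + c * scale for i, c in enumerate(codes)]


activation = [0.0, 0.2, 0.1, 0.3, 1.0, 1.2, 1.1, 1.3]
lfc, codes, lo, scale = compress(activation)
approx = reconstruct(lfc, codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(activation, approx))
```

In this sketch the reconstruction error is bounded by half a quantization step of the HFC, while the storage cost is one full-precision value per block plus a few bits per element.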
translated by 谷歌翻译
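Training wide and deep neural networks (DNNs) requires large amounts of storage resources such as memory, because intermediate activation data must be saved in memory during forward propagation and then restored for backward propagation. However, state-of-the-art accelerators such as GPUs are equipped with only very limited memory capacity due to hardware design constraints, which significantly limits the maximum batch size and hence the performance speedup when training large-scale DNNs. Traditional memory-saving techniques either suffer from performance overhead or are constrained by limited interconnect bandwidth or specific interconnect technology. In this paper, we propose a novel memory-efficient CNN training framework (called Comet) that leverages error-bounded lossy compression to significantly reduce the memory requirement for training, allowing larger models to be trained or training to be accelerated. Unlike state-of-the-art solutions that adopt image-based lossy compressors (such as JPEG) to compress the activation data, our framework purposely adopts error-bounded lossy compression with a strict error-controlling mechanism. Specifically, we perform a theoretical analysis of how compression error propagates from the altered activation data to the gradients, and empirically investigate the impact of altered gradients on the training process. Based on these analyses, we optimize the error-bounded lossy compression and propose an adaptive error-bound control scheme for activation data compression. We evaluate our design against state-of-the-art solutions with five widely adopted CNNs and the ImageNet dataset. Experiments demonstrate that our proposed framework can significantly reduce training memory consumption, by 13.5× relative to baseline training and by 1.8× relative to another state-of-the-art compression-based framework, with little or no accuracy loss.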
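Training large neural network (NN) models requires extensive memory resources, and activation compressed training (ACT) is a promising approach to reducing the training memory footprint. This paper presents GACT, an ACT framework that supports a broad range of machine learning tasks for generic NN architectures with limited domain knowledge. By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of GACT without prior knowledge of operator type or model architecture. To keep training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time. We implement GACT as a PyTorch library that readily applies to any NN architecture. GACT reduces the activation memory of convolutional NNs, transformers, and graph NNs by up to 8.1×, enabling training with a 4.2× to 24.7× larger batch size, with negligible accuracy loss.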
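Binary neural networks (BNNs) represent the original full-precision weights and activations as 1-bit values using the sign function. Since the gradient of the conventional sign function is zero almost everywhere and therefore cannot be used for back-propagation, several attempts have been made to alleviate the optimization difficulty by using approximate gradients. However, those approximations corrupt the main direction of the actual gradient. To this end, we propose to estimate the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs, namely frequency domain approximation (FDA). The proposed approach does not affect the low-frequency information of the original sign function, which occupies most of the overall energy, and high-frequency coefficients are ignored to avoid huge computational overhead. In addition, we embed a noise adaptation module into the training phase to compensate for the approximation error. Experiments on several benchmark datasets and neural architectures show that binary networks learned using our method achieve state-of-the-art accuracy. Code will be available at \textit{https://gitee.com/mindspore/models/tree/master/research/cv/fda-bnn}.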
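To make the "combination of sine functions" concrete, the sketch below uses the textbook Fourier series of a square wave, which equals the sign function on one half-period, together with its term-by-term derivative. The number of terms, the frequency `omega`, and the function names are illustrative assumptions. The abstract does not give the exact formulation, and the noise adaptation module is not modeled here.

```python
import math


def sign_fourier(t: float, n_terms: int, omega: float = 1.0) -> float:
    """Truncated Fourier series of a square wave with period 2*pi/omega.

    On (0, pi/omega) the square wave equals +1, on (-pi/omega, 0) it equals -1.
    """
    total = sum(math.sin((2 * i + 1) * omega * t) / (2 * i + 1) for i in range(n_terms))
    return 4.0 / math.pi * total


def sign_fourier_grad(t: float, n_terms: int, omega: float = 1.0) -> float:
    """Derivative of the truncated series: a sum of cosines, non-zero almost everywhere."""
    total = sum(math.cos((2 * i + 1) * omega * t) for i in range(n_terms))
    return 4.0 * omega / math.pi * total


# Keeping only the first few (low-frequency) terms gives a smooth surrogate gradient.
value_at_quarter_period = sign_fourier(math.pi / 2, 50)
surrogate_grad_at_zero = sign_fourier_grad(0.0, 3)
```

Truncating the series keeps the low-frequency terms that carry most of the energy of the square wave, which is the property the abstract relies on.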
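There has been an explosion of interest in designing high-performance Transformers. While Transformers have delivered significant performance improvements, training such networks is extremely memory-intensive, because all the intermediate activations needed for gradient computation during backpropagation must be stored, especially for long sequences. To this end, we present MESA, a memory-saving, resource-efficient training framework for Transformers. Specifically, MESA uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training. The low-precision activations are then dequantized during backpropagation to compute gradients. In addition, to address the heterogeneous activation distributions in multi-head self-attention layers, we propose a head-wise activation quantization strategy, which quantizes activations based on the statistics of each head to minimize the approximation error. To further boost training efficiency, we learn the quantization parameters by running estimates. More importantly, by reinvesting the saved memory in a larger batch size or a larger model, we can further improve performance under constrained computational resources. Extensive experiments on ImageNet, CIFAR-100, and ADE20K show that MESA can halve the memory footprint during training while achieving comparable or even better performance. Code is available at https://github.com/zhuang-group/mesa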
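The forward/backward asymmetry described above can be sketched with a toy layer. Everything below is an illustrative assumption rather than the released implementation: the layer is a single learnable scalar `w` applied to per-head activations, the stored copy uses 8-bit min-max quantization with separate statistics for each head, and the names `ScaleLayer`, `quantize_per_head`, and `dequantize_per_head` are hypothetical.

```python
from typing import List, Tuple

Packed = Tuple[List[int], float, float]  # (integer codes, offset, step) for one head


def quantize_per_head(heads: List[List[float]], bits: int = 8) -> List[Packed]:
    """Quantize each head with its own min/max so heads with different ranges
    do not share a single step size."""
    levels = (1 << bits) - 1
    packed = []
    for h in heads:
        lo, hi = min(h), max(h)
        scale = (hi - lo) / levels if hi > lo else 1.0
        packed.append(([round((v - lo) / scale) for v in h], lo, scale))
    return packed


def dequantize_per_head(packed: List[Packed]) -> List[List[float]]:
    return [[lo + c * scale for c in codes] for codes, lo, scale in packed]


class ScaleLayer:
    """Toy layer y = w * x; the weight gradient is sum(grad_y * x)."""

    def __init__(self, w: float):
        self.w = w
        self.saved: List[Packed] = []

    def forward(self, heads: List[List[float]]) -> List[List[float]]:
        self.saved = quantize_per_head(heads)            # only the low-precision copy is kept
        return [[self.w * v for v in h] for h in heads]  # output uses the exact activations

    def backward(self, grad_heads: List[List[float]]):
        x_hat = dequantize_per_head(self.saved)          # dequantize for the gradient
        grad_w = sum(g * x for gh, xh in zip(grad_heads, x_hat) for g, x in zip(gh, xh))
        grad_x = [[self.w * g for g in gh] for gh in grad_heads]
        return grad_w, grad_x


layer = ScaleLayer(w=2.0)
heads = [[0.0, 1.0, 2.0], [100.0, 300.0]]   # two heads with very different ranges
out = layer.forward(heads)
grad_w, grad_x = layer.backward([[1.0, 1.0, 1.0], [1.0, 1.0]])
```

The forward output is unaffected by quantization; only the weight gradient sees the approximation, and per-head statistics keep the small-range head from being swamped by the large-range one.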
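Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks. Despite recent intensive studies of gradient compression for data-parallel training, compressing the activations of models trained with pipeline parallelism remains an open problem. In this paper, we propose AC-SGD, a novel activation compression algorithm for communication-efficient pipeline-parallel training over slow networks. Unlike previous efforts in activation compression, AC-SGD compresses the changes of the activations rather than the activation values directly. This allows us to show, to the best of our knowledge for the first time, that one can still achieve an $O(1/\sqrt{T})$ convergence rate for non-convex objectives under activation compression, without making assumptions on gradient unbiasedness that do not hold for deep learning models with non-linear activation functions. We then show that AC-SGD can be optimized and implemented efficiently, without additional end-to-end runtime overhead. We evaluate AC-SGD by fine-tuning language models with up to 1.5 billion parameters, compressing activations to 2-4 bits. AC-SGD provides up to 4.3× end-to-end speed-up in slower networks, without sacrificing model quality. Moreover, we show that AC-SGD can be combined with state-of-the-art gradient compression algorithms to enable "end-to-end communication compression": all communications between machines, including model gradients, forward activations, and backward gradients, are compressed into lower precision. This provides up to 4.9× end-to-end speed-up, without sacrificing model quality.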
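The idea of compressing the change of an activation rather than its value can be sketched as follows. This is a simplified illustration under assumptions not stated in the abstract: both pipeline stages keep an identical running copy of the last reconstructed activation, the difference is quantized with a symmetric uniform quantizer, and the class name `DeltaChannel` is hypothetical.

```python
from typing import List, Tuple


def quantize_delta(delta: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric uniform quantizer; the step size adapts to the largest change."""
    max_abs = max(abs(d) for d in delta) or 1.0   # avoid a zero step when nothing changed
    levels = (1 << (bits - 1)) - 1
    scale = max_abs / levels
    return [round(d / scale) for d in delta], scale


class DeltaChannel:
    """Sender and receiver each hold the same running copy of the activation."""

    def __init__(self, size: int):
        self.sender_copy = [0.0] * size
        self.receiver_copy = [0.0] * size

    def send(self, activation: List[float], bits: int = 4) -> List[float]:
        # Only the quantized change relative to the shared copy crosses the network.
        delta = [a - m for a, m in zip(activation, self.sender_copy)]
        codes, scale = quantize_delta(delta, bits)
        update = [c * scale for c in codes]
        self.sender_copy = [m + u for m, u in zip(self.sender_copy, update)]
        self.receiver_copy = [m + u for m, u in zip(self.receiver_copy, update)]
        return list(self.receiver_copy)


channel = DeltaChannel(size=3)
activation = [1.0, -0.5, 0.25]
first = channel.send(activation)
second = channel.send(activation)   # same sample seen again, e.g. in a later epoch
err_first = max(abs(a - r) for a, r in zip(activation, first))
err_second = max(abs(a - r) for a, r in zip(activation, second))
```

When the activation of a sample changes little between visits, the quantization step shrinks with the change, so the reconstruction error in this sketch decreases on the second transmission.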
While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
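Binary neural networks (BNNs) have demonstrated their ability to solve complex tasks with accuracy comparable to that of deep neural networks (DNNs), while also reducing computational power and storage requirements and increasing processing speed. These properties make them an attractive alternative for developing and deploying DNN-based applications on Internet-of-Things (IoT) devices. Despite recent improvements, their compression factor may be insufficient for some devices with very limited resources. In this work, we propose sparse binary neural networks (SBNNs), a novel model and training scheme that introduces sparsity in BNNs and a new quantization function for binarizing the network's weights. The proposed SBNN is able to reach high compression factors and reduces the number of operations and parameters at inference time. We also provide tools to assist SBNN design while respecting hardware resource constraints. We study the generalization properties of our method for different compression factors through a set of experiments on linear and convolutional networks on three datasets. Our experiments confirm that SBNNs can achieve high compression rates without compromising generalization, while further reducing the operations of BNNs, making SBNNs a viable option for deploying DNNs on cheap, low-cost, limited-resource IoT devices and sensors.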
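The record-breaking performance of deep neural networks (DNNs) comes with heavy parameterization, which requires external dynamic random-access memory (DRAM) for storage. The prohibitive energy of DRAM accesses makes it non-trivial to deploy DNNs on resource-constrained devices, calling for minimizing weight and data movement to improve energy efficiency. We present SmartDeal (SD), an algorithm framework that trades higher-cost memory storage/access for lower-cost computation, in order to aggressively boost storage and energy efficiency for both inference and training. The core of SD is a novel weight decomposition with structural constraints, carefully crafted to unleash the hardware efficiency potential. Specifically, we decompose each weight tensor as the product of a small basis matrix and a large structurally sparse coefficient matrix whose non-zeros are quantized to powers of 2. The resulting sparse and quantized DNNs require much less energy for data movement and weight storage, while incurring minimal overhead to recover the original weights thanks to the sparse bit-operations and cost-favorable computations. Beyond inference, we take another leap to embrace energy-efficient training, introducing innovative techniques to address the unique roadblocks that arise in training while preserving the SD structures. We also design a dedicated hardware accelerator that fully utilizes the SD structure to improve real energy efficiency and latency. We conduct experiments on multiple tasks, models, and datasets in different settings. Results show that: 1) applied to inference, SD achieves up to 2.44× energy efficiency, as evaluated via real hardware implementations; 2) applied to training, SD reduces storage and training energy by 10.56× and 4.48×, respectively, with negligible accuracy loss compared to state-of-the-art training baselines. Our source code is available online.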
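Although the Shapley value provides an effective explanation for a DNN model's prediction, its computation relies on enumerating all possible coalitions of input features, which leads to exponentially growing complexity. To address this problem, we propose a novel method, SHEAR, to significantly accelerate Shapley explanation for DNN models, where only a few coalitions of input features are involved in the computation. The selection of feature coalitions follows our proposed Shapley chain rule to minimize the absolute error from the ground-truth Shapley values, so that the computation is both efficient and accurate. To demonstrate its effectiveness, we comprehensively evaluate SHEAR across multiple metrics, including the absolute error from the ground-truth Shapley value, the faithfulness of the explanations, and running speed. Experimental results indicate that SHEAR consistently outperforms state-of-the-art baseline methods across different evaluation metrics, which demonstrates its potential in real-world applications where computational resources are limited.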
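For reference, the exact computation that the abstract describes as exponentially expensive can be written directly from the standard Shapley value definition. The sketch below is that brute-force baseline, not the proposed accelerated method; the toy value function `toy_value` is a made-up example.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def exact_shapley(value_fn: Callable[[FrozenSet[int]], float], n: int) -> List[float]:
    """Exact Shapley values by enumerating every coalition of the other features.

    Each feature requires 2**(n - 1) marginal contributions, so the cost grows
    exponentially with the number of features n.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(coalition: FrozenSet[int]) -> float:
    """Additive contributions 1, 2, 3 plus a bonus of 4 when features 0 and 1 co-occur."""
    base = sum((1.0, 2.0, 3.0)[j] for j in coalition)
    return base + (4.0 if {0, 1} <= coalition else 0.0)


phi = exact_shapley(toy_value, 3)
```

In the toy game the interaction bonus is split equally between features 0 and 1, and the values sum to the payoff of the full coalition, which is the efficiency property of the Shapley value.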
We introduce a method to train Quantized Neural Networks (QNNs), neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6 bits as well, which enables gradient computation using only bit-wise operations. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved accuracy comparable to their 32-bit counterparts using only 4 bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32× memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of number of the high precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.
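A minimal sketch of approximating a filter with binary values follows. The abstract does not state how the binary filter is scaled, so the scaling factor used here is an assumption chosen on mathematical grounds: for a fixed sign pattern, the mean absolute weight is the scale that minimizes the squared reconstruction error. The function names are hypothetical.

```python
from typing import List, Tuple


def binarize_filter(weights: List[float]) -> Tuple[float, List[int]]:
    """Approximate weights as alpha * signs with signs in {-1, +1}."""
    alpha = sum(abs(w) for w in weights) / len(weights)   # least-squares optimal scale
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def squared_error(weights: List[float], alpha: float, signs: List[int]) -> float:
    """Squared distance between the original filter and its binary approximation."""
    return sum((w - alpha * s) ** 2 for w, s in zip(weights, signs))


filter_weights = [0.5, -1.5, 1.0, -1.0]
alpha, signs = binarize_filter(filter_weights)
best = squared_error(filter_weights, alpha, signs)
```

Storing one sign bit per weight instead of a 32-bit float is where the 32× memory saving quoted in the abstract comes from; the single scale per filter adds negligible overhead.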
Model quantization enables the deployment of deep neural networks on resource-constrained devices. Vector quantization aims at reducing the model size by indexing model weights with full-precision embeddings, i.e., codewords, while the index needs to be restored to 32-bit during computation. Binary and other low-precision quantization methods can reduce the model size by up to 32$\times$, but at the cost of a considerable accuracy drop. In this paper, we propose an efficient framework for ternary quantization to produce smaller and more accurate compressed models. By integrating hyperspherical learning, pruning and reinitialization, our proposed Hyperspherical Quantization (HQ) method reduces the cosine distance between the full-precision and ternary weights, thus reducing the bias of the straight-through gradient estimator during ternary quantization. Compared with existing work at similar compression levels ($\sim$30$\times$, $\sim$40$\times$), our method significantly improves the test accuracy and reduces the model size.
Inference time, model size, and accuracy are three key factors in deep model compression. Most of the existing work addresses these three key factors separately as it is difficult to optimize them all at the same time. For example, low-bit quantization aims at obtaining a faster model; weight sharing quantization aims at improving compression ratio and accuracy; and mixed-precision quantization aims at balancing accuracy and inference time. To simultaneously optimize bit-width, model size, and accuracy, we propose pruning ternary quantization (PTQ): a simple, effective, symmetric ternary quantization method. We integrate L2 normalization, pruning, and the weight decay term to reduce the weight discrepancy in the gradient estimator during quantization, thus producing highly compressed ternary weights. Our method brings the highest test accuracy and the highest compression ratio. For example, it produces a 939kb (49$\times$) 2bit ternary ResNet-18 model with only 4\% accuracy drop on the ImageNet dataset. It compresses 170MB Mask R-CNN to 5MB (34$\times$) with only 2.8\% average precision drop. Our method is verified on image classification, object detection/segmentation tasks with different network structures such as ResNet-18, ResNet-50, and MobileNetV2.
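Mixed-precision deep neural networks achieve the energy efficiency and throughput needed for hardware deployment, particularly when resources are limited, without sacrificing accuracy. However, the optimal per-layer bit precision that preserves accuracy is not easy to find, especially given the abundance of models, datasets, and quantization techniques, which creates an enormous search space. To tackle this difficulty, a body of literature has emerged recently, and several frameworks that achieve promising accuracy results have been proposed. In this paper, we start by summarizing the quantization techniques generally used in the literature. Then, we present a thorough survey of mixed-precision frameworks, categorized according to their optimization techniques, such as reinforcement learning, and their quantization techniques, such as deterministic rounding. Furthermore, we discuss the advantages and shortcomings of each framework and present a side-by-side comparison. Finally, we give guidelines for future mixed-precision frameworks.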
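Deep learning has achieved promising results across a wide range of AI applications. Larger datasets and models consistently yield better performance. However, we generally spend longer training time on more computation and communication. In this survey, we aim to provide a clear sketch of optimization for large-scale deep learning with regard to model accuracy and model efficiency. We investigate the algorithms most commonly used for optimization, elaborate on the debatable topic of the generalization gap that arises in large-batch training, and review the SOTA strategies for addressing communication overhead and reducing the memory footprint.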
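Deep learning is pervasive in our daily life, including self-driving cars, virtual assistants, social network services, healthcare services, face recognition, and so on. However, deep neural networks demand substantial compute resources during training and inference. The machine learning community has mainly focused on model-level optimizations, such as architectural compression of deep learning models, while the system community has focused on implementation-level optimization. In between, various arithmetic-level optimization techniques have been proposed in the arithmetic community. This article provides a survey of resource-efficient deep learning techniques in terms of model-, arithmetic-, and implementation-level techniques, and identifies the research gaps for resource-efficient deep learning techniques across the three levels. Based on our definition of resource-efficiency metrics, our survey clarifies the impact of lower-level techniques and discusses future trends in resource-efficient deep learning research.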
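Model binarization is an effective method of compressing neural networks and accelerating their inference process. However, a significant performance gap still exists between 1-bit models and 32-bit models. Empirical studies show that binarization causes information loss in both forward and backward propagation. We present a novel Distribution-sensitive Information Retention Network (DIR-Net) that retains information in the forward and backward propagation by improving internal propagation and introducing external representations. DIR-Net mainly relies on three technical contributions: (1) Information Maximized Binarization (IMB): minimizing the information loss and the binarization error of weights/activations simultaneously through weight balance and standardization; (2) Distribution-sensitive Two-stage Estimator (DTE): retaining the information of gradients through a distribution-sensitive soft approximation that jointly considers updating capability and gradient accuracy; (3) Representation-align Binarization-aware Distillation (RBD): retaining the representation information by distilling the representations between full-precision and binarized networks. DIR-Net investigates both the forward and backward processes of BNNs from a unified information perspective, thereby providing new insight into the mechanism of network binarization. The three techniques in DIR-Net are versatile and effective and can be applied in various structures to improve BNNs. Comprehensive experiments on image classification and object detection tasks show that DIR-Net consistently outperforms state-of-the-art binarization approaches under mainstream and compact architectures such as ResNet, VGG, EfficientNet, DARTS, and MobileNet. Additionally, we run DIR-Net on real-world resource-limited devices, where it achieves 11.1× storage saving and 5.4× speedup.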
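Post-training quantization (PTQ) has attracted increasing attention due to its convenience in deploying quantized neural networks. Rounding, the primary source of quantization error, has been optimized only for model weights, while activations still use the rounding-to-nearest operation. In this work, for the first time, we demonstrate that a well-chosen rounding scheme for activations can improve the final accuracy. To deal with the challenge posed by the dynamic nature of the activation rounding scheme, we adaptively adjust the rounding border through a simple function to generate rounding schemes at the inference stage. The border function covers the impact of weight errors, activation errors, and propagated errors to eliminate the bias of the element-wise error, which further benefits model accuracy. We also make the border aware of global errors to better fit different arriving activations. Finally, we propose the Aquant framework to learn the border function. Extensive experiments show that Aquant achieves notable improvements over state-of-the-art works with negligible overhead and raises the accuracy of ResNet-18 to as much as 60.3\% under 2-bit weight and activation post-training quantization.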
Although considerable progress has been obtained in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices as one dedicated model needs to be trained, transmitted, and stored for one specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism which allows us to obtain multiple quantized networks from one full precision source model by progressively mapping the higher precision weights to their adjacent lower precision counterparts. Then, with networks of different bit-widths from one source model, multi-objective optimization is employed to train the shared source model weights such that they can be updated simultaneously, considering the performance of all networks. By doing this, the shared weights will be optimized to balance the performance of different quantized models, thus making the weights transferable among different bit widths. Experiments show that the proposed vertical-layered representation and developed once QAT scheme are effective in embodying multiple quantized networks into a single one and allow one-time training, and it delivers comparable performance as that of quantized models tailored to any specific bit-width. Code will be available.