In trained deep neural networks, unstructured pruning can remove redundant weights and thereby lower storage cost, but it requires customized hardware to accelerate practical inference. Another trend accelerates sparse-model inference on general-purpose hardware by adopting coarse-grained sparsity, pruning or regularizing consecutive weights for efficient computation; this approach, however, usually sacrifices model accuracy. In this paper, we propose a new fine-grained sparsity method, Balanced Sparsity, to efficiently achieve high model accuracy on commodity hardware. Our approach adapts to the high parallelism of GPUs and shows strong potential for sparsity in widely deployed deep learning services. Experimental results show that, for model inference on GPU, Balanced Sparsity achieves up to 3.1x practical speedup while retaining the same high model accuracy as fine-grained sparsity.
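The core idea of balanced sparsity, that every block of a weight row keeps the same number of non-zeros so GPU thread groups get equal work, can be illustrated with a small masking routine. This is only a sketch under that reading of the abstract; the function name, block_size, and sparsity values are illustrative and not taken from the paper's code.

```python
import numpy as np

def balanced_prune(weights, block_size=32, sparsity=0.75):
    """Prune each row of a 2-D weight matrix in equal-sized blocks so that
    every block keeps the same number of non-zeros (a balanced, fine-grained
    pattern that maps well onto GPU thread blocks)."""
    out = weights.copy()
    keep = int(round(block_size * (1.0 - sparsity)))  # non-zeros kept per block
    rows, cols = out.shape
    assert cols % block_size == 0, "pad the matrix so columns divide evenly"
    for r in range(rows):
        for b in range(0, cols, block_size):
            block = out[r, b:b + block_size]
            # zero everything below the per-block magnitude threshold
            threshold = np.sort(np.abs(block))[-keep] if keep > 0 else np.inf
            block[np.abs(block) < threshold] = 0.0
    return out

w = np.random.randn(4, 128).astype(np.float32)
w_pruned = balanced_prune(w, block_size=32, sparsity=0.75)
print((w_pruned != 0).sum(axis=1))  # every row keeps the same number of non-zeros
```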
Sparsity helps reduce the computational complexity of deep neural networks by skipping zeros. Taking advantage of sparsity is listed as a high priority in next-generation DNN accelerators such as TPU[1]. The structure of sparsity, i.e., the granularity of pruning, affects the efficiency of hardware accelerator design as well as the prediction accuracy. Coarse-grained pruning brings more regular sparsity patterns, making it more amenable to hardware acceleration, but more challenging to maintain the same accuracy. In this paper we quantitatively measure the trade-off between sparsity regularity and prediction accuracy, providing insights into how to maintain accuracy while having a more structured sparsity pattern. Our experimental results show that coarse-grained pruning can achieve a similar sparsity ratio as unstructured pruning given no loss of accuracy. Moreover, due to the index saving effect, coarse-grained pruning is able to obtain a better compression ratio than fine-grained sparsity at the same accuracy threshold. Based on the recent sparse convolutional neural network accelerator (SCNN), our experiments further demonstrate that coarse-grained sparsity saves ∼2× of the memory references compared with fine-grained sparsity. Since memory references are more than two orders of magnitude more expensive than arithmetic operations, the regularity of the sparse structure leads to more efficient hardware design.
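The granularity trade-off discussed above can be made concrete with a toy magnitude-pruning routine that supports both an element-wise (fine-grained) and a kernel-wise (coarse-grained) mask. The function name and the choice of kernel-level granularity are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def prune_by_granularity(weight, sparsity, granularity="fine"):
    """Magnitude pruning of a conv weight tensor (out_ch, in_ch, kH, kW) at two
    granularities: 'fine' removes individual weights, 'kernel' removes whole
    kH*kW kernels ranked by their L2 norm (a coarser, more regular pattern)."""
    w = weight.copy()
    if granularity == "fine":
        scores = np.abs(w).reshape(-1)
        k = int(sparsity * scores.size)
        thr = np.partition(scores, k)[k]          # k-th smallest magnitude
        w[np.abs(w) < thr] = 0.0
    else:  # 'kernel': one score per (out_ch, in_ch) kernel
        norms = np.linalg.norm(w.reshape(w.shape[0], w.shape[1], -1), axis=-1)
        k = int(sparsity * norms.size)
        thr = np.partition(norms.reshape(-1), k)[k]
        w[norms < thr] = 0.0                      # zero whole kernels at once
    return w
```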
Deep neural networks (DNNs), although achieving human-level performance in many domains, have very large model sizes that hinder their broad deployment on edge computing devices. Extensive research has been devoted to DNN model compression and pruning, but most prior work takes heuristic approaches. This work proposes a progressive weight pruning method based on ADMM (Alternating Direction Method of Multipliers), a technique that can handle non-convex optimization problems with potentially combinatorial constraints. Motivated by dynamic programming, the proposed method reaches an extremely high pruning rate by using partial prunings with moderate pruning rates, thereby resolving the accuracy degradation and long convergence time that arise when pursuing extremely high pruning ratios. It achieves up to a 34x pruning rate on the ImageNet dataset and 167x on the MNIST dataset, significantly higher than the rates reported in the literature. Under the same number of epochs, the proposed method also achieves faster convergence and higher compression rates. The code and pruned DNN models are released at bit.ly/2zxdlss.
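A rough sketch of one ADMM pruning iteration may help make the abstract concrete: the weight update takes a step on the loss plus an augmented quadratic term, the auxiliary variable is the Euclidean projection onto the sparsity constraint, and the dual variable accumulates the gap. All names (admm_step, project_to_sparsity, keep_ratio) are illustrative; the actual method runs many SGD steps per W-update and tightens the pruning ratio progressively across rounds.

```python
import numpy as np

def project_to_sparsity(w, keep_ratio):
    """Euclidean projection onto {at most keep_ratio of entries non-zero}:
    keep the largest-magnitude entries, zero the rest."""
    flat = np.abs(w).reshape(-1)
    k = int(keep_ratio * flat.size)
    thr = np.partition(flat, flat.size - k)[flat.size - k]
    out = w.copy()
    out[np.abs(out) < thr] = 0.0
    return out

def admm_step(w, z, u, rho, grad_loss, lr, keep_ratio):
    """One (highly simplified) ADMM iteration for pruning."""
    w = w - lr * (grad_loss(w) + rho * (w - z + u))  # primal W update
    z = project_to_sparsity(w + u, keep_ratio)       # primal Z update (projection)
    u = u + w - z                                    # dual update
    return w, z, u
```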
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding that together reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing, and finally we apply Huffman coding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits representing each connection from 32 to 5. On the ImageNet dataset, our method reduces the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy, and reduces the size of VGG-16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU, and mobile GPU, the compressed networks have 3x to 4x layer-wise speedup and 3x to 7x better energy efficiency.
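The trained-quantization stage relies on clustering each layer's weights into a small codebook (5 bits gives 32 shared values). A minimal k-means weight-sharing sketch, assuming the layer has already been pruned, might look as follows; the function name and iteration count are illustrative, not the authors' code.

```python
import numpy as np

def share_weights_kmeans(weights, n_clusters=32, n_iter=20):
    """Cluster the non-zero weights of a pruned layer into a shared codebook
    with plain k-means, then snap each weight to its centroid. A toy stand-in
    for the trained-quantization stage."""
    nz = weights[weights != 0]
    # linear initialization over the weight range (the initialization the paper favors)
    centroids = np.linspace(nz.min(), nz.max(), n_clusters)
    for _ in range(n_iter):
        assign = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centroids[c] = nz[assign == c].mean()
    assign = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
    quantized = weights.copy()
    quantized[weights != 0] = centroids[assign]   # only the codebook index need be stored
    return quantized, centroids
```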
The deployment of deep convolutional neural networks (CNNs) in many real-world applications is largely hindered by their high computational cost. In this paper we propose a novel learning scheme for CNNs that simultaneously 1) reduces the model size, 2) decreases the run-time memory footprint, and 3) lowers the number of computing operations, without compromising accuracy. This is achieved by enforcing channel-level sparsity in the network in a simple but effective way. Unlike many existing approaches, the proposed method applies directly to modern CNN architectures, introduces minimal overhead to the training process, and requires no special software/hardware accelerators for the resulting models. We call our approach network slimming: it takes wide and large networks as input models, but during training insignificant channels are automatically identified and subsequently pruned, yielding thin and compact models with comparable accuracy. We empirically demonstrate the effectiveness of our approach on several state-of-the-art CNN models, including VGGNet, ResNet, and DenseNet, across various image classification datasets. For VGGNet, a multi-pass version of network slimming gives a 20x reduction in model size and a 5x reduction in computing operations.
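Channel-level sparsity in network slimming is induced by an L1 penalty on the batch-normalization scale factors and realized by thresholding them after training. A hedged PyTorch sketch of those two pieces (penalty term and channel-mask extraction) is given below; the helper names and the single global threshold are illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model, lam=1e-4):
    """L1 penalty on every BatchNorm scale factor (gamma), added to the training
    loss so that unimportant channels' gammas are pushed toward zero."""
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()
    return lam * penalty

def channel_masks(model, prune_ratio=0.5):
    """After training, rank all gammas globally and keep channels above the
    global threshold; the returned masks say which channels survive per layer."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)
    return {name: (m.weight.data.abs() > threshold)
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}
```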
DNNs have been quickly and widely exploited to improve data-analysis quality in many complex science and engineering applications. Today's DNNs are becoming deeper and wider because of increasingly demanding analysis quality and increasingly complex applications. Wide and deep DNNs, however, require large amounts of resources, severely restricting their use on resource-constrained systems. Although several network simplification methods have been proposed to address this issue, they suffer from either low compression ratios or high compression errors, which may require an expensive retraining process to reach the target accuracy. In this paper we propose DeepSZ, an accuracy-loss-bounded neural network compression framework that involves four key steps: network pruning, error-bound assessment, error-bound configuration optimization, and compressed-model generation, featuring a high compression ratio and low encoding time. The contribution is threefold. (1) We develop an adaptive method to select feasible error bounds for each layer. (2) We build a model to estimate the overall accuracy loss based on the accuracy degradation caused by individually compressed layers. (3) We develop an efficient optimization algorithm to determine the best-fit configuration of error bounds so as to maximize the compression ratio under a user-set accuracy constraint. Experiments show that DeepSZ can compress AlexNet and VGG-16 on ImageNet by 46X and 116X, respectively, and compress LeNet-300-100 and LeNet-5 on MNIST by 57X and 56X, respectively, with at most 0.3% accuracy loss. Compared with other state-of-the-art methods, DeepSZ improves the compression ratio by up to 1.43X, DNN encoding performance by up to 4.0X (with four Nvidia Tesla V100 GPUs), and decoding performance by up to 6.2X.
To facilitate efficient embedded and hardware implementations of deep neural networks (DNNs), two important classes of DNN model compression techniques have been studied: weight pruning and weight quantization. The former exploits the redundancy in the number of weights, while the latter exploits the redundancy in the bit representation of weights. However, a systematic framework for joint weight pruning and quantization of DNNs has been lacking, which limits the attainable model compression ratio. Moreover, computation reduction, energy efficiency improvement, and hardware performance overhead need to be considered beyond simple model size reduction. To address these limitations, we propose ADMM-NN, the first algorithm-hardware co-optimization framework for DNNs using the Alternating Direction Method of Multipliers (ADMM), a powerful technique for handling non-convex optimization problems with possibly combinatorial constraints. The first part of ADMM-NN is a systematic joint framework of DNN weight pruning and quantization using ADMM. It can be understood as a smart regularization technique in which the regularization target is dynamically updated in each ADMM iteration, resulting in higher model-compression performance than prior work. The second part is hardware-aware DNN optimization to facilitate hardware-level implementation. Without accuracy loss, we achieve 85x and 24x pruning on the LeNet-5 and AlexNet models, respectively, significantly higher than prior work. The improvement becomes more significant when focusing on computation reduction. Combining weight pruning and quantization, we achieve 1,910x and 231x reductions in overall model size on these two benchmarks when focusing on data storage. Highly promising results are also observed on other representative DNNs such as VGGNet and ResNet-50.
Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve the energy efficiency and performance while maintaining accuracy. For DNNs, the model size is an important factor affecting performance, scalability and energy efficiency. Weight pruning achieves good compression ratios but suffers from three drawbacks: 1) the irregular network structure after pruning, which affects performance and throughput; 2) the increased training complexity; and 3) the lack of rigorous guarantee of compression ratio and inference accuracy. To overcome these limitations, this paper proposes CirCNN, a principled approach to represent weights and process neural networks using block-circulant matrices. CirCNN utilizes the Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (both in inference and training) from O(n^2) to O(n log n) and the storage complexity from O(n^2) to O(n), with negligible accuracy loss. Compared to other approaches, CirCNN is distinct due to its mathematical rigor: the DNNs based on CirCNN can converge to the same "effectiveness" as DNNs without compression. We propose the CirCNN architecture, a universal DNN inference engine that can be implemented in various hardware/software platforms with configurable network architecture (e.g., layer type, size, scales, etc.). In the CirCNN architecture: 1) Due to the recursive property, FFT can be used as the key computing kernel, which ensures universal and small-footprint implementations. 2) The compressed but regular network structure avoids the pitfalls of network pruning and facilitates high performance and throughput with a highly pipelined and parallel design. To demonstrate the performance and energy efficiency, we test CirCNN in FPGA, ASIC and embedded processors. Our results show that the CirCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on the FPGA implementation and ASIC synthesis results, CirCNN achieves 6-102X energy efficiency improvements compared with the best state-of-the-art results.
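The O(n log n) multiplication at the heart of CirCNN follows from the fact that a circulant block times a vector is a circular convolution, computable with FFTs. A small NumPy sketch of the block-circulant matrix-vector product is given below; it is software-only, not the hardware kernel, and the data layout chosen for first_cols is an assumption.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply a circulant matrix (defined by its first column c) with x in
    O(n log n): circ(c) @ x = IFFT(FFT(c) * FFT(x))."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(first_cols, x, block):
    """y = W @ x where W is a grid of circulant blocks of size `block`;
    first_cols[i][j] holds the defining vector of block (i, j), so storage per
    block is O(n) instead of O(n^2)."""
    q = len(first_cols[0])
    y_blocks = []
    for row in first_cols:
        acc = np.zeros(block)
        for j in range(q):
            acc += circulant_matvec(row[j], x[j * block:(j + 1) * block])
        y_blocks.append(acc)
    return np.concatenate(y_blocks)
```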
Sparsity helps reduce the computation complexity of DNNs by skipping the multiplication with zeros. The granularity of sparsity affects the efficiency of the hardware architecture and the prediction accuracy. In this paper we quantitatively measure the accuracy-sparsity relationship at different granularities. Coarse-grained sparsity brings a more regular sparsity pattern, making it easier for hardware acceleration, and our experimental results show that coarse-grained sparsity has very small impact on the sparsity ratio given no loss of accuracy. Moreover, due to the index saving effect, coarse-grained sparsity is able to obtain similar or even better compression rates than fine-grained sparsity at the same accuracy threshold. Our analysis, which is based on the framework of a recent sparse convolutional neural network (SCNN) accelerator, further demonstrates that it saves 30% − 35% of memory references compared with fine-grained sparsity.
Representations of various forms can arise in the many layers of a trained deep neural network (DNN). Among them, where can we find the most compact representation? We propose to answer this question with a pruning framework: how compactly can each layer be compressed without losing performance? Most existing DNN compression methods do not consider the relative compressibility of individual layers; they either apply a single target sparsity uniformly to all layers or adjust per-layer sparsity using heuristics and additional training. We propose a principled method that automatically determines the sparsity of individual layers, derived from the importance of each layer. To do this, we consider a metric that measures the importance of each layer based on the layer's capacity. Given a trained model and a total target sparsity, we first evaluate the importance of each layer from the model, and from the evaluated importance we compute the layer-wise sparsities. The proposed method can be applied to any DNN architecture and can be combined with any pruning method that takes the total target sparsity as a parameter. To validate the proposed method, we performed image classification with two types of DNN architectures on two benchmark datasets and compressed them with three pruning methods. For the VGG-16 model with weight pruning on the ImageNet dataset, we achieved up to 75% (17.5% on average) better top-5 accuracy than the baseline at the same total target sparsity. Furthermore, we analyze the maximum compression that can occur in the network; this analysis can help identify the most compact representation in a deep neural network.
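Since the abstract does not spell out the capacity-based importance metric, the sketch below only illustrates the surrounding mechanics: given per-layer parameter counts, some importance scores, and a global target sparsity, it distributes per-layer sparsities so that a global kept-parameter budget is met. The proportional allocation rule and all names are placeholders, not the paper's formula.

```python
import numpy as np

def allocate_layer_sparsity(layer_sizes, importance, total_sparsity):
    """Distribute a global target sparsity across layers: more important layers
    keep a larger share of the global kept-parameter budget. The importance
    scores are just an input array here (the paper derives them from layer
    capacity). Note the final clip can slightly loosen the exact budget."""
    sizes = np.asarray(layer_sizes, dtype=float)
    imp = np.asarray(importance, dtype=float)
    budget = (1.0 - total_sparsity) * sizes.sum()        # total weights to keep
    weights_kept = budget * (imp * sizes) / np.sum(imp * sizes)
    keep_ratio = np.clip(weights_kept / sizes, 0.0, 1.0)
    return 1.0 - keep_ratio                              # per-layer sparsity

print(allocate_layer_sparsity([1e6, 4e6, 16e6], [3.0, 2.0, 1.0], total_sparsity=0.9))
```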
Deep convolutional neural networks (CNNs) are indispensable to state-of-the-art computer vision algorithms. However, they are still rarely deployed on battery-powered mobile devices, such as smartphones and wearable gadgets, where vision algorithms can enable many revolutionary real-world applications. The key limiting factor is the high energy consumption of CNN processing due to its high computational complexity. While there are many previous efforts that try to reduce the CNN model size or the amount of computation, we find that they do not necessarily result in lower energy consumption. Therefore, these targets do not serve as a good metric for energy cost estimation. To close the gap between CNN design and energy consumption optimization, we propose an energy-aware pruning algorithm for CNNs that directly uses the energy consumption of a CNN to guide the pruning process. The energy estimation methodology uses parameters extrapolated from actual hardware measurements. The proposed layer-by-layer pruning algorithm also prunes more aggressively than previously proposed pruning methods by minimizing the error in the output feature maps instead of the filter weights. For each layer, the weights are first pruned and then locally fine-tuned with a closed-form least-square solution to quickly restore the accuracy. After all layers are pruned, the entire network is globally fine-tuned using back-propagation. With the proposed pruning method, the energy consumption of AlexNet and GoogLeNet is reduced by 3.7× and 1.6×, respectively, with less than 1% top-5 accuracy loss. We also show that reducing the number of target classes in AlexNet greatly decreases the number of weights, but has a limited impact on energy consumption.
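The local fine-tuning step described above solves a closed-form least-squares problem so that the surviving weights reproduce the layer's original outputs. A simplified NumPy stand-in, fitting one least-squares problem per output unit of a fully connected layer, is sketched below; the function name and per-column formulation are assumptions, not the paper's exact procedure.

```python
import numpy as np

def local_least_squares_finetune(x, y_target, mask):
    """Re-fit the surviving weights of a pruned FC layer in closed form.
    x: (N, n_in) input activations, y_target: (N, n_out) outputs of the original
    unpruned layer, mask: (n_in, n_out) boolean matrix of kept weights."""
    n_in, n_out = mask.shape
    w_new = np.zeros((n_in, n_out))
    for j in range(n_out):
        keep = mask[:, j]
        if keep.any():
            sol, *_ = np.linalg.lstsq(x[:, keep], y_target[:, j], rcond=None)
            w_new[keep, j] = sol   # pruned positions stay exactly zero
    return w_new
```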
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems. Also, conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections. Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections. On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9×, from 61 million to 6.7 million, without incurring accuracy loss. Similar experiments with VGG-16 found that the total number of parameters can be reduced by 13×, from 138 million to 10.3 million, again with no loss of accuracy.
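The three-step recipe (train, prune by magnitude, retrain with the mask fixed) is straightforward to sketch in PyTorch; the helper below covers the pruning step, and the trailing comment shows where the mask is re-applied during retraining. The helper name and the per-layer threshold choice are illustrative, not the authors' implementation.

```python
import torch

def magnitude_prune_(model, sparsity=0.9):
    """Zero out the smallest-magnitude weights of each weight tensor in place and
    return the masks, which are re-applied after every optimizer step during
    retraining so pruned connections stay at zero."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:  # prune weight matrices/tensors, leave biases alone
            k = int(sparsity * p.numel())
            threshold = p.detach().abs().flatten().kthvalue(k).values
            masks[name] = (p.detach().abs() > threshold).float()
            p.data.mul_(masks[name])
    return masks

# Retraining loop (sketch): after loss.backward() and optimizer.step(), do
#   for name, p in model.named_parameters():
#       if name in masks:
#           p.data.mul_(masks[name])
```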
Model compression is essential for serving large deep neural networks on devices with limited resources or in applications that require real-time responses. As a case study, state-of-the-art neural language models usually consist of one or more recurrent layers sandwiched between an embedding layer used to represent input tokens and a softmax layer used to generate output tokens. For problems with very large vocabulary sizes, the embedding and softmax matrices can account for more than half of the model size. For example, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with a vocabulary of about 800k words; its word embedding and softmax matrices use more than 6GB of space and account for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models based on vocabulary-partition (block) low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words). Experimental results show that our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieves a 6.6x compression rate for the embedding and softmax matrices, and when combined with quantization it achieves a 26x compression rate, which amounts to a 12.8x compression of the entire model, with very little degradation in perplexity.
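The key mechanism, frequency-sorted vocabulary blocks each with its own low-rank factorization, can be sketched with plain SVD. The block count, the decreasing rank schedule, and all names below are illustrative assumptions; GroupReduce chooses the partition and ranks with its own refinement procedure.

```python
import numpy as np

def group_lowrank_embedding(emb, token_freq, n_blocks=4, base_rank=8):
    """Split the vocabulary into frequency-sorted blocks and give each block its
    own truncated SVD, assigning higher rank to more frequent tokens."""
    order = np.argsort(-np.asarray(token_freq))          # most frequent first
    blocks = np.array_split(order, n_blocks)
    factors = []
    for i, idx in enumerate(blocks):
        rank = max(1, base_rank * (n_blocks - i))        # rarer block -> smaller rank
        u, s, vt = np.linalg.svd(emb[idx], full_matrices=False)
        factors.append((idx, u[:, :rank] * s[:rank], vt[:rank]))
    return factors   # each entry reconstructs emb[idx] ≈ A @ B
```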
Parameter pruning is a promising approach for CNN compression and acceleration that eliminates redundant model parameters with tolerable performance degradation. Despite its effectiveness, existing regularization-based parameter pruning methods usually drive weights towards zero with large and constant regularization factors, which neglects the fragility of the expressiveness of CNNs and thus calls for a more gentle regularization scheme so that the networks can adapt during pruning. To achieve this, we propose a novel regularization-based pruning method, named IncReg, that incrementally assigns different regularization factors to different weights based on their relative importance. Empirical analysis on the CIFAR-10 dataset verifies the merits of IncReg. Further extensive experiments with popular CNNs on the CIFAR-10 and ImageNet datasets show that IncReg achieves comparable or even better results than state-of-the-art methods. Our source code and trained models are available here: https://github.com/mingsun-tse/caffe_increg.
Low-rank tensor approximation is very promising for compressing deep neural networks. We propose a new, simple, and efficient iterative method that alternates low-rank factorization with smart rank selection and fine-tuning. We demonstrate the efficiency of our method compared with non-iterative ones. Our approach improves the compression rate while maintaining accuracy on a variety of tasks.
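The factorization step that such an iteration alternates with rank selection and fine-tuning is essentially a truncated SVD of a layer's weight matrix. A minimal sketch follows; the function name and the fixed rank are illustrative, and the rank-selection and fine-tuning stages of the paper are not shown.

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Replace a dense matrix W (n_out x n_in) by two thin factors A (n_out x r)
    and B (r x n_in) via truncated SVD, so W ≈ A @ B and the parameter count
    drops from n_out*n_in to r*(n_out + n_in)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]     # absorb singular values into the left factor
    b = vt[:rank, :]
    return a, b

w = np.random.randn(512, 1024)
a, b = low_rank_factorize(w, rank=64)
print(np.linalg.norm(w - a @ b) / np.linalg.norm(w))  # relative approximation error
```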
Deep neural networks (DNNs) have achieved great success in various real-world applications. However, the huge number of parameters in these networks limits their efficiency because of large model sizes and intensive computation. To address this issue, various compression and acceleration techniques have been investigated, among which low-rank filters and sparse filters have been studied extensively. In this paper we propose a unified framework that compresses convolutional neural networks by combining these two strategies while taking the nonlinear activation into account. The filter of a layer is approximated by the sum of a sparse component and a low-rank component, both of which favor model compression; in particular, we constrain the sparse component to be structured sparse, which facilitates acceleration. The performance of the network is retained by minimizing the reconstruction error of the feature maps after the activation of each layer, using the Alternating Direction Method of Multipliers (ADMM). Experimental results show that the proposed method can compress VGG-16 and AlexNet by more than 4x. In addition, 2.2x and 1.1x speedups are achieved on VGG-16 and AlexNet, respectively, with negligible increase in error rate.
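The sparse-plus-low-rank split of a weight matrix can be illustrated with a naive alternating projection: truncated SVD for the low-rank part, magnitude thresholding for the sparse part. The paper instead minimizes the post-activation feature-map reconstruction error with ADMM and uses a structured sparsity constraint, so the code below is only the basic decomposition idea, with illustrative names.

```python
import numpy as np

def sparse_plus_low_rank(w, rank, keep_ratio, n_iter=20):
    """Decompose W ≈ L + S with L low-rank and S sparse by alternating a
    truncated SVD of (W - S) with magnitude thresholding of the residual."""
    s = np.zeros_like(w)
    for _ in range(n_iter):
        u, sv, vt = np.linalg.svd(w - s, full_matrices=False)
        l = (u[:, :rank] * sv[:rank]) @ vt[:rank, :]
        r = w - l
        k = int(keep_ratio * r.size)              # non-zeros kept in S
        thr = np.partition(np.abs(r).reshape(-1), r.size - k)[r.size - k]
        s = np.where(np.abs(r) >= thr, r, 0.0)
    return l, s
```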
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps with the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations and dominates the required power. The previously proposed "Deep Compression" makes it possible to fit large DNNs (AlexNet and VGGNet) entirely in on-chip SRAM. This compression is achieved by pruning redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes the FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600mW. It is 24,000x and 3,400x more energy efficient than a CPU and a GPU, respectively. Compared with DaDianNao, EIE has 2.9x, 19x, and 3x better throughput, energy efficiency, and area efficiency.
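EIE operates on a compressed weight format: column-wise sparse storage whose stored values are small indices into a shared-weight codebook, with zero input activations skipped. A software-only NumPy sketch of that sparse matrix-vector product is below; the argument layout is an assumption, and nothing here models the hardware scheduling.

```python
import numpy as np

def shared_weight_spmv(n_rows, indptr, row_idx, codes, codebook, x):
    """y = W @ x for a CSC-stored sparse W whose non-zeros are codebook indices.
    indptr[j]:indptr[j+1] delimits column j; row_idx holds row positions and
    codes holds the shared-weight indices. Columns with zero activation are
    skipped, mirroring the ReLU-skipping described in the abstract."""
    y = np.zeros(n_rows)
    for j, xj in enumerate(x):
        if xj == 0.0:              # skip zero activations entirely
            continue
        start, end = indptr[j], indptr[j + 1]
        y[row_idx[start:end]] += codebook[codes[start:end]] * xj
    return y
```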
The deep learning revolution has brought us a wide range of neural network architectures that achieve state-of-the-art performance on a variety of computer vision tasks, including classification, detection, and segmentation. At the same time, we have been observing unprecedented demands on computation and memory, making the efficient use of neural networks on low-power devices nearly unattainable. To this end, we propose a three-stage compression and acceleration pipeline that sparsifies, quantizes, and entropy-encodes the activation maps of convolutional neural networks. Sparsification increases the representational power of the activation maps, leading to both inference acceleration and higher model accuracy: Inception-V3 and MobileNet-V1 can be accelerated by up to 1.6x with accuracy increases of 0.38% and 0.54% on the ImageNet and CIFAR-10 datasets, respectively. Quantization and entropy coding of the sparser activation maps yield higher compression over the baseline, reducing the memory cost of network execution: Inception-V3 and MobileNet-V1 activation maps quantized to 16 bits are compressed by up to 6x with accuracy losses of 0.36% and 0.55%, respectively.
How to develop slim and accurate deep neural networks has become crucial for real-world applications, especially for those employed in embedded systems. Though previous work along this research line has shown some promising results, most existing methods either fail to significantly compress a well-trained deep network or require a heavy retraining process for the pruned deep network to re-boost its prediction performance. In this paper, we propose a new layer-wise pruning method for deep neural networks. In our proposed method, parameters of each individual layer are pruned independently based on second order derivatives of a layer-wise error function with respect to the corresponding parameters. We prove that the final prediction performance drop after pruning is bounded by a linear combination of the reconstructed errors caused at each layer. By controlling layer-wise errors properly, one only needs to perform a light retraining process on the pruned network to resume its original prediction performance. We conduct extensive experiments on benchmark datasets to demonstrate the effectiveness of our pruning method compared with several state-of-the-art baseline methods. Codes of our work are released at: https://github.com/csyhhu/L-OBS.
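The layer-wise second-order pruning described above follows the Optimal Brain Surgeon recipe applied to a per-layer quadratic error model. A single-step sketch for one weight vector, assuming the inverse layer-wise Hessian h_inv is already available (for a linear layer with squared reconstruction error, H = X^T X / N over the layer inputs X), is shown below with illustrative names; it is not the released L-OBS code.

```python
import numpy as np

def lobs_prune_one(w, h_inv):
    """One OBS-style step: pick the weight with the smallest saliency
    w_q^2 / (2 [H^-1]_qq), zero it, and compensate the remaining weights with
    the closed-form update delta = -(w_q / [H^-1]_qq) * H^-1[:, q]."""
    diag = np.diag(h_inv)
    saliency = w ** 2 / (2.0 * diag)
    q = int(np.argmin(saliency))
    delta = -(w[q] / h_inv[q, q]) * h_inv[:, q]   # compensation for all weights
    w_new = w + delta                             # drives w_new[q] to zero
    w_new[q] = 0.0
    return w_new, q
```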