By forcing at most N out of M consecutive weights to be non-zero, the recent N:M fine-grained network sparsity has received increasing attention for its two attractive advantages: 1) promising performance at high sparsity ratios; 2) significant speedups on NVIDIA A100 GPUs. However, recent methods require either an expensive pre-training phase or heavy dense-gradient computation. In this paper, we show that N:M learning can be naturally characterized as a combinatorial problem that searches for the best combination candidate within a finite collection. Motivated by this characteristic, we solve N:M sparsity in an efficient divide-and-conquer manner. First, we divide the weight vector into $C_{\text{M}}^{\text{N}}$ combination subsets of a fixed size N. Then, we conquer the combinatorial problem by assigning each combination a learnable score that is jointly optimized with its associated weights. We prove that the introduced scoring mechanism can well model the relative importance between combination subsets, and by gradually removing low-scored subsets, N:M fine-grained sparsity can be efficiently optimized during the normal training phase. Comprehensive experiments demonstrate that our learning best combination (LBC) consistently performs better than off-the-shelf N:M sparsity methods. Our code is released at https://github.com/zyxxmu/lbc.
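To make the combinatorial view above concrete, here is a minimal sketch (my own illustration, not the released LBC code) of treating each group of M consecutive weights as a choice among the $C_{\text{M}}^{\text{N}}$ candidate masks, each paired with a learnable score; the hard argmax stands in for the paper's gradual removal of low-scored subsets.

```python
# Illustrative sketch only: 2:4 sparsity viewed as picking one of C(4,2)=6
# candidate masks per group of 4 weights, guided by learnable scores.
from itertools import combinations
import torch

M, N = 4, 2
# All C(M, N) binary masks of length M with exactly N ones.
CANDIDATES = torch.tensor(
    [[1.0 if i in comb else 0.0 for i in range(M)]
     for comb in combinations(range(M), N)]
)  # shape: (6, 4)

def select_combination(weight_group: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """Apply the highest-scored candidate mask to one group of M weights.

    weight_group: shape (M,); scores: shape (C(M, N),), trained jointly with
    the weights. The argmax here is a stand-in for gradually removing
    low-scored subsets during training.
    """
    best = scores.argmax()
    return weight_group * CANDIDATES[best]

# Toy usage: one group of 4 weights and its 6 learnable candidate scores.
w = torch.randn(M, requires_grad=True)
s = torch.zeros(len(CANDIDATES), requires_grad=True)
print(select_combination(w, s))
```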
Network sparsity is popular mostly owing to its capability to reduce network complexity. Extensive studies have excavated gradient-driven sparsity. Typically, these methods are constructed upon the premise of weight independence, which, however, contradicts the fact that weights mutually influence one another; thus, their performance remains to be improved. In this paper, we propose to further optimize gradient-driven sparsity (OptG) by solving this independence paradox. Our motivation comes from recent advances in supermask training, which show that sparse subnetworks can be located in a randomly initialized network simply by updating mask values, without modifying any weight. We prove that supermask training accumulates the weight gradients and can partly solve the independence paradox. Consequently, OptG integrates supermask training into gradient-driven sparsity, and a specialized mask optimizer is designed to solve the independence paradox. Experiments show that OptG surpasses many existing state-of-the-art competitors. Our code is available at https://github.com/zyxxmu/optg.
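As a rough illustration of the supermask-training idea the abstract builds on, the sketch below freezes the weights of a linear layer and trains only per-weight scores, using a straight-through top-k mask; all class and variable names are assumptions, not OptG's actual API.

```python
# Minimal supermask-training sketch: weights frozen, only scores are trained.
import torch

class SupermaskLinear(torch.nn.Module):
    def __init__(self, in_f, out_f, sparsity=0.9):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)
        self.score = torch.nn.Parameter(torch.randn(out_f, in_f))  # only this is trained
        self.k = int((1 - sparsity) * in_f * out_f)                # connections kept

    def forward(self, x):
        flat = self.score.flatten()
        mask = torch.zeros_like(flat)
        mask[flat.topk(self.k).indices] = 1.0
        mask = mask.view_as(self.score)
        # Straight-through estimator: forward uses the hard mask,
        # backward passes gradients to the scores.
        mask = mask + self.score - self.score.detach()
        return x @ (self.weight * mask).t()

layer = SupermaskLinear(16, 8)
out = layer(torch.randn(2, 16))
out.sum().backward()
print(layer.score.grad is not None, layer.weight.grad is None)  # True True
```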
Network pruning is an effective approach to reduce network complexity with an acceptable performance compromise. Existing studies achieve the sparsity of neural networks via time-consuming weight tuning or complex searching on networks with expanded width, which greatly limits the applications of network pruning. In this paper, we show that high-performing and sparse sub-networks that require no weight tuning, termed "lottery jackpots", exist in pre-trained models with unexpanded width. For example, we obtain a lottery jackpot that has only 10% of the parameters yet still reaches the performance of the original dense VGGNet-19 without any modification of the pre-trained weights on CIFAR-10. Furthermore, we observe that the sparse masks derived from many existing pruning criteria have a high overlap with the searched mask of our lottery jackpot, and among them, magnitude-based pruning yields the mask most similar to ours. Based on this insight, we initialize our sparse mask with magnitude-based pruning, which cuts the lottery jackpot search cost by at least 3x while achieving comparable or better performance. Specifically, our magnitude-based lottery jackpot removes 90% of the weights in ResNet-50 and still easily obtains more than 70% top-1 accuracy using only 10 search epochs on ImageNet. Our code is available at https://github.com/zyxxmu/lottery-jackpots.
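The magnitude-based initialization described above can be pictured with the following minimal sketch (assumed helper names, not the released code): a 0/1 mask keeps the largest-magnitude fraction of a pre-trained layer, and only this mask would then be refined during the short search phase.

```python
# Initialize the searched sparse mask from magnitude pruning of a pre-trained layer.
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask keeping the largest-magnitude (1 - sparsity) fraction."""
    k = int(weight.numel() * (1 - sparsity))
    threshold = weight.abs().flatten().topk(k).values.min()
    return (weight.abs() >= threshold).float()

pretrained = torch.randn(64, 128)          # stand-in for a pre-trained weight
mask = magnitude_mask(pretrained, 0.9)     # 90% of weights removed
print(mask.mean())                          # ~0.1 of entries kept
# The mask (not the weights) would then be refined for a few search epochs.
```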
The mainstream approach for filter pruning is usually either to force a hard-coded importance estimation upon a computation-heavy pretrained model to select "important" filters, or to impose a hyperparameter-sensitive sparse constraint on the loss objective to regularize the network training. In this paper, we present a novel filter pruning method, dubbed dynamic-coded filter fusion (DCFF), to derive compact CNNs in a computation-economical and regularization-free manner for efficient image classification. Each filter in our DCFF is firstly given an inter-similarity distribution with a temperature parameter as a filter proxy, on top of which, a fresh Kullback-Leibler divergence based dynamic-coded criterion is proposed to evaluate the filter importance. In contrast to simply keeping high-score filters in other methods, we propose the concept of filter fusion, i.e., the weighted averages using the assigned proxies, as our preserved filters. We obtain a one-hot inter-similarity distribution as the temperature parameter approaches infinity. Thus, the relative importance of each filter can vary along with the training of the compact CNN, leading to dynamically changeable fused filters without both the dependency on the pretrained model and the introduction of sparse constraints. Extensive experiments on classification benchmarks demonstrate the superiority of our DCFF over the compared counterparts. For example, our DCFF derives a compact VGGNet-16 with only 72.77M FLOPs and 1.06M parameters while reaching top-1 accuracy of 93.47% on CIFAR-10. A compact ResNet-50 is obtained with 63.8% FLOPs and 58.6% parameter reductions, retaining 75.60% top-1 accuracy on ILSVRC-2012. Our code, narrower models and training logs are available at https://github.com/lmbxmu/DCFF.
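A simplified sketch of the filter-fusion idea may help: each preserved filter is a weighted average of all filters, with weights given by a temperature-controlled softmax over an inter-filter similarity. The distance and importance score below are illustrative stand-ins; DCFF's actual criterion is Kullback-Leibler-divergence based, and its temperature schedule drives the distribution toward one-hot by the end of training.

```python
# Illustrative filter fusion: kept filters are proxy-weighted averages of all filters.
import torch

def fuse_filters(filters: torch.Tensor, keep: int, temperature: float) -> torch.Tensor:
    """filters: (num_filters, c, k, k). Returns (keep, c*k*k) fused filters."""
    flat = filters.flatten(1)
    dist = torch.cdist(flat, flat)                     # pairwise distances (stand-in)
    proxy = torch.softmax(-dist / temperature, dim=1)  # inter-similarity distribution
    importance = proxy.max(dim=1).values               # stand-in importance score
    kept = importance.topk(keep).indices
    # Weighted average of all filters using the kept filters' proxies.
    return proxy[kept] @ flat

fused = fuse_filters(torch.randn(32, 16, 3, 3), keep=16, temperature=1.0)
print(fused.shape)  # (16, 144); a sharper (more one-hot) proxy makes each fused
                    # filter collapse toward a single original filter.
```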
Although network sparsity has emerged as a promising direction to overcome the drastically increasing size of neural networks, it remains an open problem to simultaneously maintain model accuracy and achieve significant speedups on general CPUs. In this paper, we propose the novel concept of a $1\times N$ block sparsity pattern (block pruning) to break this limitation. In particular, consecutive $N$ output kernels sharing the same input channel index are grouped into one block, which serves as the basic pruning granularity of our pruning pattern. Our $1\times N$ sparsity pattern prunes the blocks considered unimportant. We also provide a filter rearrangement workflow that first rearranges the weight matrix in the output channel dimension to derive more influential blocks for accuracy improvement, and then applies a similar rearrangement to the next-layer weights in the input channel dimension to ensure correct convolution operations. Moreover, the output computation after our $1\times N$ block sparsity can be realized via parallelized block-wise vectorized operations, leading to significant speedups on general CPU-based platforms. The efficacy of our pruning pattern is demonstrated by experiments on ILSVRC-2012. For example, at 50% sparsity and $N=4$, our pattern obtains an improvement of around 3.0% over filter pruning in the top-1 accuracy of MobileNet-V2. Meanwhile, it obtains 56.04 ms of inference savings on a Cortex-A7 CPU over weight pruning. Code is available at https://github.com/lmbxmu/1xn.
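The 1xN grouping can be sketched as follows (an assumed helper, not the released kernels): for each input channel, every N consecutive output kernels form a block, and whole blocks with the smallest L1 norm are pruned.

```python
# Build a 1xN block mask: blocks of N consecutive output kernels per input channel.
import torch

def one_by_n_mask(weight: torch.Tensor, n: int, sparsity: float) -> torch.Tensor:
    """weight: (out_ch, in_ch, k, k) with out_ch divisible by n."""
    out_ch, in_ch = weight.shape[:2]
    blocks = weight.abs().sum(dim=(2, 3))               # (out_ch, in_ch) kernel norms
    blocks = blocks.view(out_ch // n, n, in_ch).sum(1)  # (out_ch/n, in_ch) block norms
    k = int(blocks.numel() * (1 - sparsity))
    keep = torch.zeros_like(blocks).flatten()
    keep[blocks.flatten().topk(k).indices] = 1.0         # keep highest-norm blocks
    keep = keep.view(out_ch // n, 1, in_ch).expand(-1, n, -1).reshape(out_ch, in_ch)
    return keep[:, :, None, None]                         # broadcastable 0/1 mask

w = torch.randn(32, 16, 3, 3)
mask = one_by_n_mask(w, n=4, sparsity=0.5)
print((w * mask).count_nonzero() / w.numel())             # ~0.5 of weights kept
```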
Sparsity has become one of the promising methods to compress and accelerate deep neural networks (DNNs). Among different categories of sparsity, structured sparsity has gained more attention due to its efficient execution on modern accelerators. In particular, N:M sparsity is attractive because hardware accelerator architectures already exist that can leverage certain forms of N:M structured sparsity to yield higher compute efficiency. In this work, we focus on N:M sparsity and extensively study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and compute cost (FLOPs). Building upon this study, we propose two new decay-based pruning methods, namely "pruning mask decay" and "sparse structure decay". Our evaluations indicate that these proposed methods consistently deliver state-of-the-art (SOTA) model accuracy, comparable to unstructured sparsity, on a Transformer-based model for a translation task. The increase in sparse model accuracy from the new training recipes comes at the cost of a marginal increase in total training compute (FLOPs).
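To illustrate the flavor of "pruning mask decay", here is a hedged sketch in which the mask value of pruned positions fades from 1 to 0 over training instead of being hard-zeroed immediately; the linear schedule and N:M selection below are assumptions, not the paper's exact recipe.

```python
# Soft N:M mask whose pruned entries decay toward zero over training steps.
import torch

def decayed_nm_mask(weight: torch.Tensor, n: int, m: int,
                    step: int, total_steps: int) -> torch.Tensor:
    """2D weight whose last dim is divisible by m; returns a soft N:M mask."""
    groups = weight.abs().view(-1, m)
    keep_idx = groups.topk(n, dim=1).indices
    hard = torch.zeros_like(groups)
    hard.scatter_(1, keep_idx, 1.0)
    decay = max(0.0, 1.0 - step / total_steps)    # pruned entries fade out linearly
    soft = hard + (1.0 - hard) * decay
    return soft.view_as(weight)

w = torch.randn(8, 16)
print(decayed_nm_mask(w, n=2, m=4, step=0, total_steps=100).min())    # 1.0: nothing pruned yet
print(decayed_nm_mask(w, n=2, m=4, step=100, total_steps=100).min())  # 0.0: full 2:4 sparsity
```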
The lottery ticket hypothesis (LTH) suggests that dense models contain highly sparse subnetworks (i.e., winning tickets) that can be trained in isolation to full accuracy. Despite many exciting efforts, one piece of "common sense" is rarely challenged: a winning ticket is found by iterative magnitude pruning (IMP), and the resulting pruned subnetwork therefore has only unstructured sparsity. This gap limits the appeal of winning tickets in practice, because highly irregular sparse patterns are challenging to accelerate on hardware. Meanwhile, directly substituting structured pruning for unstructured pruning hurts performance more severely and typically fails to find winning tickets. In this paper, we demonstrate the first positive result that structurally sparse winning tickets can, in general, be found effectively. The core idea is to append "post-processing techniques" after each round of (unstructured) IMP to enforce the formation of structured sparsity. Specifically, we first "refill" pruned elements back into some channels deemed important, and then "regroup" non-zero elements to create flexible group-wise structured patterns. Both our identified channel-wise and group-wise structured subnetworks win the lottery, and they come with substantial inference speedups readily supported by existing hardware. Extensive experiments, conducted on diverse datasets across multiple network backbones, consistently validate our proposal, indicating that the hardware-acceleration roadblock of LTH has now been removed. Specifically, the structurally sparse winning tickets obtain up to {64.93%, 64.84%, 60.23%} runtime savings at {36%~80%, 74%, 58%} sparsity on {CIFAR, Tiny-ImageNet, ImageNet}, respectively, while maintaining comparable accuracy. Code is at https://github.com/vita-group/structure-lth.
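The "refill" post-processing step can be sketched roughly as follows (my own simplification): after unstructured IMP, the channels that retain the most surviving weights are densified again and all other channels are dropped, converting the irregular mask into a channel-structured one.

```python
# Refill sketch: turn an unstructured mask into a channel-structured one.
import torch

def refill_channels(weight: torch.Tensor, mask: torch.Tensor, keep_channels: int):
    """weight, mask: (out_ch, in_ch, k, k); mask is the unstructured 0/1 mask."""
    survivors = mask.sum(dim=(1, 2, 3))                  # surviving weights per filter
    kept = survivors.topk(keep_channels).indices
    structured = torch.zeros_like(mask)
    structured[kept] = 1.0                                # refill whole important channels
    return weight * structured, structured

w = torch.randn(32, 16, 3, 3)
unstructured = (torch.rand_like(w) > 0.8).float()         # stand-in IMP mask (~80% pruned)
pruned_w, channel_mask = refill_channels(w, unstructured, keep_channels=8)
print(channel_mask.sum(dim=(1, 2, 3)).nonzero().numel())  # 8 dense channels remain
```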
Recently, sparse training has emerged as a promising paradigm for efficient deep learning on edge devices. Current research mainly devotes itself to reducing training cost by further increasing model sparsity. However, increasing sparsity is not always ideal, since it inevitably introduces severe accuracy degradation at extremely high sparsity levels. This paper intends to explore other possible directions to effectively and efficiently reduce sparse training cost while preserving accuracy. To this end, we investigate two techniques, namely layer freezing and data sieving. First, the layer freezing approach has shown success in dense model training and fine-tuning, yet it has never been adopted in the sparse training domain. Nevertheless, the unique characteristics of sparse training may hinder the incorporation of layer freezing techniques. We therefore analyze the feasibility and potential of using layer freezing in sparse training and find that it can save considerable training cost. Second, we propose a data sieving method for dataset-efficient training, which further reduces training cost by ensuring that only a partial dataset is used throughout the entire training process. We show that both techniques can be well incorporated into a sparse training algorithm to form a generic framework, which we dub SpFDE. Our extensive experiments demonstrate that SpFDE can significantly reduce training cost while preserving accuracy along three dimensions: weight sparsity, layer freezing, and dataset sieving.
In recent years, image restoration tasks have witnessed great performance improvements through the development of large deep models. Despite their outstanding performance, the heavy computation demanded by deep models limits the application of image restoration. To lift this restriction, the network size must be reduced while accuracy is maintained. Recently, N:M structured pruning has appeared as one of the effective and practical pruning approaches for making a model efficient under accuracy constraints. However, it fails to account for the different computational complexities and performance requirements of the different layers in an image restoration network. To further optimize the trade-off between efficiency and restoration accuracy, we propose a novel pruning method that determines the pruning ratio for N:M structured sparsity at each layer. Extensive experimental results on super-resolution and deblurring tasks demonstrate the efficacy of our method, which outperforms previous pruning methods. A PyTorch implementation of the proposed method will be publicly available at https://github.com/junghunoh/sls_cvpr2022.
Sparse training is a natural idea to accelerate the training of deep neural networks and to save memory usage, especially since large modern neural networks are significantly over-parameterized. However, most existing methods cannot achieve this goal in practice, because the chain-rule-based gradient estimators (w.r.t. the structure parameters) adopted by previous methods require dense computation at least in the backward propagation step. This paper solves this problem by proposing an efficient sparse training method with completely sparse forward and backward passes. We first formulate the training process as a continuous minimization problem under a global sparsity constraint. We then separate the optimization process into two steps, corresponding to the weight update and the structure parameter update. For the former step, we use the conventional chain rule, which can be made sparse by exploiting the sparse structure. For the latter step, instead of using chain-rule-based gradient estimators as in existing methods, we propose a variance-reduced policy gradient estimator, which requires only two forward passes and no backward propagation, thus achieving completely sparse training. We prove that the variance of our gradient estimator is bounded. Extensive experimental results on real-world datasets demonstrate that, compared to previous methods, our algorithm is much more effective at accelerating the training process and achieves substantially faster speed.
Structured channel pruning has been shown to significantly accelerate inference time for convolution neural networks (CNNs) on modern hardware, with a relatively minor loss of network accuracy. Recent works permanently zero these channels during training, which we observe to significantly hamper final accuracy, particularly as the fraction of the network being pruned increases. We propose Soft Masking for cost-constrained Channel Pruning (SMCP) to allow pruned channels to adaptively return to the network while simultaneously pruning towards a target cost constraint. By adding a soft mask re-parameterization of the weights and channel pruning from the perspective of removing input channels, we allow gradient updates to previously pruned channels and the opportunity for the channels to later return to the network. We then formulate input channel pruning as a global resource allocation problem. Our method outperforms prior works on both the ImageNet classification and PASCAL VOC detection datasets.
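A rough sketch of the soft-masking idea: a per-input-channel gate multiplies the weights and is re-chosen periodically from an importance score, so a channel zeroed earlier can re-enter the network. The cost-aware global resource allocation of SMCP is omitted, and the names below are illustrative.

```python
# Soft channel gating: pruned input channels can later be re-selected.
import torch

class SoftChannelGate(torch.nn.Module):
    def __init__(self, in_channels: int, keep: int):
        super().__init__()
        self.gate = torch.nn.Parameter(torch.ones(in_channels), requires_grad=False)
        self.keep = keep

    @torch.no_grad()
    def reselect(self, conv_weight: torch.Tensor):
        """Re-pick kept input channels from current weight magnitudes."""
        importance = conv_weight.abs().sum(dim=(0, 2, 3))   # per input channel
        self.gate.zero_()
        self.gate[importance.topk(self.keep).indices] = 1.0

    def forward(self, conv_weight: torch.Tensor) -> torch.Tensor:
        return conv_weight * self.gate[None, :, None, None]

conv = torch.nn.Conv2d(16, 32, 3, bias=False)
gate = SoftChannelGate(16, keep=8)
gate.reselect(conv.weight)     # dense weights are kept underneath, so channels
masked = gate(conv.weight)     # zeroed now can be re-selected at a later call
print(int(gate.gate.sum()))    # 8 input channels currently active
```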
The limited and dynamically varying resources on edge devices motivate us to deploy an optimized deep neural network that can adapt its sub-networks to fit different resource constraints. However, existing works often build sub-networks by searching for different network architectures in a hand-crafted sampling space, which not only can result in subpar performance but may also cause on-device re-configuration overhead. In this paper, we propose a novel training algorithm, Dynamic REal-time Sparse Subnets (DRESS). DRESS samples multiple sub-networks from the same backbone network through row-based unstructured sparsity, and trains these sub-networks jointly in parallel with a weighted loss. DRESS also exploits strategies including parameter reuse and row-based fine-grained sampling for efficient storage consumption and efficient on-device adaptation. Extensive experiments on public vision datasets show that DRESS yields significantly higher accuracy than state-of-the-art sub-network methods.
Weight pruning is an effective model compression technique for tackling the challenge of achieving real-time deep neural network (DNN) inference on mobile devices. However, prior pruning schemes have limited application scenarios due to accuracy degradation, difficulty in leveraging hardware acceleration, and/or restriction to certain types of DNN layers. In this paper, we propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations that are applicable to any type of DNN layer while achieving high accuracy and hardware inference performance. With the flexibility of applying different pruning schemes to different layers, enabled by our compiler optimizations, we further study the new problem of determining the best-suited pruning scheme, considering the different acceleration and accuracy performance of the various schemes. Two pruning scheme mapping methods, one search-based and one rule-based, are proposed to automatically derive the best-suited pruning regularity and block size for each layer of any given DNN. Experimental results demonstrate that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework, with up to 2.48$\times$ and 1.73$\times$ DNN inference acceleration on the CIFAR-10 and ImageNet datasets without accuracy loss.
Since sparse neural networks usually contain many zero weights, these unnecessary network connections can potentially be eliminated without degrading network performance. Well-designed sparse neural networks therefore have the potential to significantly reduce FLOPs and computational resources. In this work, we propose a new automatic pruning method: sparse connectivity learning (SCL). Specifically, a weight is re-parameterized as the element-wise multiplication of a trainable weight variable and a binary mask, so the network connectivity is fully described by the binary masks, which are modulated by a unit step function. We theoretically prove the fundamental principle of using a straight-through estimator (STE) for network pruning: the proxy gradient of the STE should be positive, ensuring that the mask variables converge at their minima. After finding that the leaky ReLU, softplus, and identity STEs satisfy this principle, we propose to adopt the identity STE in SCL for discrete mask relaxation. We also find that the mask gradients of different features are very unbalanced, so we propose to normalize the mask gradient of each feature to optimize mask variable training. To train sparse masks automatically, we include the total number of network connections as a regularization term in our objective function. Because SCL does not require pruning criteria or hyperparameters defined per network layer by designers, it explores the network in a larger hypothesis space to achieve optimized sparse connectivity with the best performance. SCL overcomes the limitations of existing automatic pruning methods. Experimental results show that SCL can automatically learn and select important network connections for various baseline network structures, and deep learning models trained with SCL outperform SOTA human-designed and automatic pruning methods in sparsity, accuracy, and FLOPs reduction.
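The re-parameterization described above can be sketched minimally as follows: each weight is multiplied by a binary mask produced by a unit step over a trainable mask variable, with an identity straight-through estimator passing gradients to that variable; the connection-count regularization term is omitted for brevity.

```python
# Unit-step binary mask with an identity straight-through estimator (STE).
import torch

class StepMaskSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, mask_var):
        return (mask_var > 0).float()        # unit step function

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                   # identity STE

class MaskedWeight(torch.nn.Module):
    def __init__(self, shape):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(*shape))
        self.mask_var = torch.nn.Parameter(torch.full(shape, 0.01))  # start "on"

    def forward(self):
        return self.weight * StepMaskSTE.apply(self.mask_var)

layer = MaskedWeight((8, 4))
loss = layer().sum()
loss.backward()
print(layer.mask_var.grad.shape)   # mask variables receive (STE) gradients
```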
Filter pruning methods introduce structured sparsity by removing selected filters and are therefore particularly effective at reducing complexity. Previous works empirically prune networks from the perspective that filters with smaller norms contribute less to the final results. However, such criteria have been proven sensitive to the distribution of filters, and accuracy may be hard to recover because the capacity gap is fixed once pruning is done. In this paper, we propose a novel filter pruning method, called asymptotic soft cluster pruning (ASCP), to identify the redundancy of a network based on the similarity of its filters. Each filter of an over-parameterized network is first distinguished by clustering and then reconstructed to manually introduce redundancy into it. Several clustering guidelines are also proposed to better preserve the feature extraction ability. After reconstruction, filters are allowed to be updated in order to eliminate the effect of mistaken selections. In addition, various decay strategies for the pruning rate are adopted to stabilize the pruning process and improve the final performance. By gradually generating more identical filters within each cluster, ASCP can remove them through a channel addition operation with almost no accuracy drop. Extensive experiments on the CIFAR-10 and ImageNet datasets show that our method achieves competitive results compared with many state-of-the-art algorithms.
Low-rankness plays an important role in traditional machine learning, but is not so popular in deep learning. Most previous low-rank network compression methods compress the networks by approximating pre-trained models and re-training. However, the optimal solution in the Euclidean space may be quite different from the one in the low-rank manifold. A well-pre-trained model is not a good initialization for the model with low-rank constraints. Thus, the performance of a low-rank compressed network degrades significantly. Compared to other network compression methods such as pruning, low-rank methods have attracted less attention in recent years. In this paper, we devise a new training method, low-rank projection with energy transfer (LRPET), that trains low-rank compressed networks from scratch and achieves competitive performance. First, we propose to alternately perform stochastic gradient descent training and projection onto the low-rank manifold. Compared to re-training on the compact model, this enables full utilization of model capacity since solution space is relaxed back to Euclidean space after projection. Second, the matrix energy (the sum of squares of singular values) reduction caused by projection is compensated by energy transfer. We uniformly transfer the energy of the pruned singular values to the remaining ones. We theoretically show that energy transfer eases the trend of gradient vanishing caused by projection. Third, we propose batch normalization (BN) rectification to cut off its effect on the optimal low-rank approximation of the weight matrix, which further improves the performance. Comprehensive experiments on CIFAR-10 and ImageNet have justified that our method is superior to other low-rank compression methods and also outperforms recent state-of-the-art pruning methods. Our code is available at https://github.com/BZQLin/LRPET.
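One projection step with energy transfer might look like the sketch below (my reading of the description above, not the released LRPET code): the weight matrix is truncated to rank r and the kept singular values are uniformly rescaled so that the total energy, the sum of squared singular values, is preserved.

```python
# Low-rank projection of a weight matrix with uniform energy transfer.
import torch

def low_rank_project_energy_transfer(weight: torch.Tensor, rank: int) -> torch.Tensor:
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    kept, pruned = s[:rank], s[rank:]
    # Rescale kept singular values so total energy (sum of squares) is preserved.
    scale = torch.sqrt((kept.square().sum() + pruned.square().sum()) / kept.square().sum())
    return u[:, :rank] @ torch.diag(kept * scale) @ vh[:rank]

w = torch.randn(64, 128)
w_lr = low_rank_project_energy_transfer(w, rank=16)
print(torch.linalg.matrix_rank(w_lr).item(), torch.norm(w).item(), torch.norm(w_lr).item())
# rank 16, and the Frobenius norm (energy) of the original matrix is preserved
```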
Deep neural networks (DNNs) are effective at solving many real-world problems. Larger DNN models usually exhibit better quality (e.g., accuracy, precision), but their excessive computation results in long inference time. Model sparsification can reduce computation and memory cost while maintaining model quality. Most existing sparsification algorithms remove weights unidirectionally, while others randomly or greedily explore a small subset of weights in each layer for pruning. The limitations of these algorithms reduce the achievable sparsity level. In addition, many algorithms still require a pre-trained dense model and therefore suffer from a large memory footprint. In this paper, we propose a novel scheduled grow-and-prune (GaP) methodology that does not require pre-training a dense model. It addresses the shortcomings of previous works by repeatedly growing a subset of layers to dense and then pruning them back to sparse after some training. Experiments show that models pruned with the proposed methods match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, object detection, 3D object segmentation, and translation. They also outperform other state-of-the-art (SOTA) methods for model sparsification. As an example, a 90% non-uniform sparse ResNet-50 model obtained via GaP achieves 77.9% top-1 accuracy on ImageNet, improving the previous SOTA result by 1.5%. All code will be publicly released.
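The scheduled grow-and-prune cycle can be pictured with this high-level sketch (partition choice and schedule are assumptions): in each round one group of layers is grown back to dense, trained briefly, then pruned back to the target sparsity by weight magnitude.

```python
# Scheduled grow-and-prune cycle over layer partitions (training loop elided).
import torch

def prune_by_magnitude(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    k = int(weight.numel() * (1 - sparsity))
    mask = torch.zeros_like(weight).flatten()
    mask[weight.abs().flatten().topk(k).indices] = 1.0
    return mask.view_as(weight)

layers = {name: torch.randn(64, 64) for name in ["l0", "l1", "l2", "l3"]}
masks = {name: prune_by_magnitude(w, 0.8) for name, w in layers.items()}

for grown in ["l0", "l1", "l2", "l3"]:                    # one partition per round
    masks[grown] = torch.ones_like(layers[grown])         # grow: temporarily dense
    # ... train for a few epochs with the current masks applied ...
    masks[grown] = prune_by_magnitude(layers[grown], 0.8)  # prune back to sparse
print({n: float(m.mean()) for n, m in masks.items()})      # ~0.2 density everywhere
```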
This paper proposes a Soft Filter Pruning (SFP) method to accelerate the inference procedure of deep Convolutional Neural Networks (CNNs). Specifically, the proposed SFP enables the pruned filters to be updated when training the model after pruning. SFP has two advantages over previous works: (1) Larger model capacity. Updating previously pruned filters provides our approach with larger optimization space than fixing the filters to zero. Therefore, the network trained by our method has a larger model capacity to learn from the training data. (2) Less dependence on the pretrained model. Large capacity enables SFP to train from scratch and prune the model simultaneously. In contrast, previous filter pruning methods should be conducted on the basis of the pre-trained model to guarantee their performance. Empirically, SFP from scratch outperforms the previous filter pruning methods. Moreover, our approach has been demonstrated effective for many advanced CNN architectures. Notably, on ILSVRC-2012, SFP reduces more than 42% FLOPs on ResNet-101 with even 0.2% top-5 accuracy improvement, which has advanced the state-of-the-art. Code is publicly available on GitHub: https://github.com/he-y/softfilter-pruning
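A short sketch of the soft pruning step: at the end of each epoch the lowest-norm filters are zeroed but remain trainable, so they can recover in later epochs. The epoch loop and optimizer are omitted; the helper name is an assumption.

```python
# Soft filter pruning: zero low-norm filters without freezing them.
import torch

def soft_prune_filters(conv: torch.nn.Conv2d, prune_ratio: float) -> None:
    with torch.no_grad():
        norms = conv.weight.flatten(1).norm(p=2, dim=1)        # per-filter L2 norm
        num_prune = int(prune_ratio * norms.numel())
        idx = norms.topk(num_prune, largest=False).indices
        conv.weight[idx] = 0.0  # zeroed now, but still updated by future gradients

conv = torch.nn.Conv2d(16, 64, 3)
soft_prune_filters(conv, prune_ratio=0.3)
zeroed = (conv.weight.flatten(1).norm(dim=1) == 0).sum().item()
print(zeroed)  # 19 of 64 filters zeroed this epoch; training then continues
```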
Neural network pruning offers a promising prospect to facilitate deploying deep neural networks on resource-limited devices. However, existing methods are still challenged by the training inefficiency and labor cost in pruning designs, due to missing theoretical guidance of non-salient network components. In this paper, we propose a novel filter pruning method by exploring the High Rank of feature maps (HRank). Our HRank is inspired by the discovery that the average rank of multiple feature maps generated by a single filter is always the same, regardless of the number of image batches CNNs receive. Based on HRank, we develop a method that is mathematically formulated to prune filters with low-rank feature maps. The principle behind our pruning is that low-rank feature maps contain less information, and thus pruned results can be easily reproduced. Besides, we experimentally show that weights with high-rank feature maps contain more important information, such that even when a portion is not updated, very little damage would be done to the model performance. Without introducing any additional constraints, HRank leads to significant improvements over the state-of-the-arts in terms of FLOPs and parameters reduction, with similar accuracies. For example, with ResNet-110, we achieve a 58.2%-FLOPs reduction by removing 59.2% of the parameters, with only a small loss of 0.14% in top-1 accuracy on CIFAR-10. With ResNet-50, we achieve a 43.8%-FLOPs reduction by removing 36.7% of the parameters, with only a loss of 1.17% in the top-1 accuracy on ImageNet. The code is available at https://github.com/lmbxmu/HRank.
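The HRank criterion can be sketched as follows (assumed helper, simplified from the paper): feed a few batches, record the average matrix rank of each filter's output feature map, and treat the filters with the lowest average rank as pruning candidates.

```python
# Average feature-map rank per filter, used as a pruning criterion.
import torch

@torch.no_grad()
def average_feature_map_rank(conv: torch.nn.Conv2d, batches) -> torch.Tensor:
    totals = torch.zeros(conv.out_channels)
    count = 0
    for x in batches:
        fmap = conv(x)                                # (b, out_channels, h, w)
        for sample in fmap:
            totals += torch.linalg.matrix_rank(sample).float()
        count += fmap.size(0)
    return totals / count                             # average rank per filter

conv = torch.nn.Conv2d(3, 16, 3)
ranks = average_feature_map_rank(conv, [torch.randn(4, 3, 32, 32) for _ in range(2)])
low_rank_filters = ranks.argsort()[:4]                # candidates to prune first
print(ranks.shape, low_rank_filters.tolist())
```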
Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for multiple network architectures, datasets, and tasks, which imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model, 2) learned "important" weights of the large model are typically not useful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency in the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods. We also compare with the "Lottery Ticket Hypothesis" (Frankle & Carbin, 2019), and find that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization.