我们为神经网络提出了一种新颖,结构化修剪算法 - 迭代,稀疏结构修剪算法,称为I-Spasp。从稀疏信号恢复的思想启发,I-Spasp通过迭代地识别网络内的较大的重要参数组(例如,滤波器或神经元),这些参数组大多数对修剪和密集网络输出之间的残差贡献,然后基于这些组阈值以较小的预定定义修剪比率。对于具有Relu激活的双层和多层网络架构,我们展示了通过多项式修剪修剪诱导的错误,该衰减是基于密集网络隐藏表示的稀疏性任意大的。在我们的实验中,I-Spasp在各种数据集(即MNIST和ImageNet)和架构(即馈送前向网络,Resnet34和MobileNetv2)中进行评估,其中显示用于发现高性能的子网和改进经过几种数量级的可提供基线方法的修剪效率。简而言之,I-Spasp很容易通过自动分化实现,实现强大的经验结果,具有理论收敛保证,并且是高效的,因此将自己区分开作为少数几个计算有效,实用,实用,实用,实用,实用,实用,实用,实用和可提供的修剪算法之一。
translated by 谷歌翻译
神经网络修剪对于在预训练的密集网络架构中发现有效,高性能的子网有用。然而,更常见的是,它涉及三步过程 - 预先训练,修剪和重新训练 - 这是计算昂贵的,因为必须完全预先训练的密集模型。幸运的是,已经经过了多种作品,证明可以通过修剪发现高性能的子网,而无需完全预先训练密集网络。旨在理论上分析修剪网络表现良好的密集网络预培训量,我们发现在两层全连接网络上的SGD预训练迭代数量中发现了一个理论界限,超出了由此进行修剪贪婪的前瞻性选择产生了一个达到良好训练错误的子网。该阈值显示在对数上依赖于数据集的大小,这意味着具有较大数据集的实验需要更好地训练通过修剪以执行良好执行的子网。我们经验展示了我们在各种架构和数据集中的理论结果的有效性,包括在Mnist上培训的全连接网络以及在CIFAR10和ImageNet上培训的几个深度卷积神经网络(CNN)架构。
translated by 谷歌翻译
Structural pruning of neural network parameters reduces computation, energy, and memory transfer costs during inference. We propose a novel method that estimates the contribution of a neuron (filter) to the final loss and iteratively removes those with smaller scores. We describe two variations of our method using the first and secondorder Taylor expansions to approximate a filter's contribution. Both methods scale consistently across any network layer without requiring per-layer sensitivity analysis and can be applied to any kind of layer, including skip connections. For modern networks trained on ImageNet, we measured experimentally a high (>93%) correlation between the contribution computed by our methods and a reliable estimate of the true importance. Pruning with the proposed methods leads to an improvement over state-ofthe-art in terms of accuracy, FLOPs, and parameter reduction. On ResNet-101, we achieve a 40% FLOPS reduction by removing 30% of the parameters, with a loss of 0.02% in the top-1 accuracy on ImageNet. Code is available at https://github.com/NVlabs/Taylor_pruning.
translated by 谷歌翻译
结构化修剪是一种常用的技术,用于将深神经网络(DNN)部署到资源受限的设备上。但是,现有的修剪方法通常是启发式,任务指定的,并且需要额外的微调过程。为了克服这些限制,我们提出了一个框架,将DNN压缩成纤薄的架构,具有竞争性表现,并且仅通过列车 - 一次(OTO)减少重大拖车。 OTO包含两个键:(i)我们将DNN的参数分区为零不变组,使我们能够修剪零组而不影响输出; (ii)促进零群,我们制定了结构性稀疏优化问题,提出了一种新颖的优化算法,半空间随机投影梯度(HSPG),以解决它,这优于组稀疏性探索的标准近端方法和保持可比的收敛性。为了展示OTO的有效性,我们从划痕上同时培训和压缩全模型,而无需微调推理加速和参数减少,并且在CIFAR10的VGG16实现最先进的结果,为CIFAR10和Squad的BERT为BERT竞争结果在resnet50上为想象成。源代码可在https://github.com/tianyic/only_train_once上获得。
translated by 谷歌翻译
有效地近似损失函数的局部曲率信息是用于深神经网络的优化和压缩的关键工具。然而,大多数现有方法近似二阶信息具有高计算或存储成本,这可以限制其实用性。在这项工作中,我们调查矩阵,用于估计逆象征的矢量产品(IHVPS)的矩阵线性时间方法,因为当Hessian可以近似为乘语 - 一个矩阵的总和时,如Hessian的经典近似由经验丰富的Fisher矩阵。我们提出了两个新的算法作为称为M-FAC的框架的一部分:第一个算法朝着网络压缩量身定制,如果Hessian给出了M $等级的总和,则可以计算Dimension $ D $的IHVP。 ,使用$ O(DM ^ 2)$预压制,$ O(DM)$代价计算IHVP,并查询逆Hessian的任何单个元素的费用$ O(m)$。第二算法针对优化设置,我们希望在反向Hessian之间计算产品,估计在优化步骤的滑动窗口和给定梯度方向上,根据预先说明的SGD所需的梯度方向。我们为计算IHVP和OHVP和O(DM + M ^ 3)$ of $ o(dm + m ^ 2)$提供算法,以便从滑动窗口添加或删除任何渐变。这两种算法产生最先进的结果,用于网络修剪和相对于现有二阶方法的计算开销的优化。在[9]和[17]可用实现。
translated by 谷歌翻译
深度神经网络(DNN)的计算要求增加导致获得稀疏,且准确的DNN模型的兴趣。最近的工作已经调查了稀疏训练的更加困难的情况,其中DNN重量尽可能稀少,以减少训练期间的计算成本。现有的稀疏训练方法通常是经验的,并且可以具有相对于致密基线的准确性较低。在本文中,我们介绍了一种称为交替压缩/解压缩(AC / DC)训练DNN的一般方法,证明了算法变体的收敛,并表明AC / DC在类似的计算预算中准确地表现出现有的稀疏训练方法;在高稀疏水平下,AC / DC甚至优于现有的现有方法,依赖于准确的预训练密集模型。 AC / DC的一个重要属性是它允许联合培训密集和稀疏的型号,在训练过程结束时产生精确的稀疏密集模型对。这在实践中是有用的,其中压缩变体可能是为了在资源受限的设置中进行部署而不重新执行整个训练流,并且还为我们提供了深入和压缩模型之间的精度差距的见解。代码可在:https://github.com/ist-daslab/acdc。
translated by 谷歌翻译
We propose a new formulation for pruning convolutional kernels in neural networks to enable efficient inference. We interleave greedy criteria-based pruning with finetuning by backpropagation-a computationally efficient procedure that maintains good generalization in the pruned network. We propose a new criterion based on Taylor expansion that approximates the change in the cost function induced by pruning network parameters. We focus on transfer learning, where large pretrained networks are adapted to specialized tasks. The proposed criterion demonstrates superior performance compared to other criteria, e.g. the norm of kernel weights or feature map activation, for pruning large CNNs after adaptation to fine-grained classification tasks (Birds-200 and Flowers-102) relaying only on the first order gradient information. We also show that pruning can lead to more than 10× theoretical reduction in adapted 3D-convolutional filters with a small drop in accuracy in a recurrent gesture classifier. Finally, we show results for the largescale ImageNet dataset to emphasize the flexibility of our approach.
translated by 谷歌翻译
We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network during the early stages of training. Thus, the computational cost of subsequent training iterations, besides that of inference, is considerably reduced. Our method, based on variational inference principles using Gaussian scale mixture priors on neural network weights, learns the variational posterior distribution of Bernoulli random variables multiplying the units/filters similarly to adaptive dropout. Our algorithm, ensures that the Bernoulli parameters practically converge to either 0 or 1, establishing a deterministic final network. We analytically derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection and leads to consistent pruning levels and prediction accuracy regardless of weight initialization or the size of the starting network. We prove the convergence properties of our algorithm establishing theoretical and practical pruning conditions. We evaluate the proposed algorithm on the MNIST and CIFAR-10 data sets and the commonly used fully connected and convolutional LeNet and VGG16 architectures. The simulations show that our method achieves pruning levels on par with state-of the-art methods for structured pruning, while maintaining better test-accuracy and more importantly in a manner robust with respect to network initialization and initial size.
translated by 谷歌翻译
The ability to dynamically adapt neural networks to newly-available data without performance deterioration would revolutionize deep learning applications. Streaming learning (i.e., learning from one data example at a time) has the potential to enable such real-time adaptation, but current approaches i) freeze a majority of network parameters during streaming and ii) are dependent upon offline, base initialization procedures over large subsets of data, which damages performance and limits applicability. To mitigate these shortcomings, we propose Cold Start Streaming Learning (CSSL), a simple, end-to-end approach for streaming learning with deep networks that uses a combination of replay and data augmentation to avoid catastrophic forgetting. Because CSSL updates all model parameters during streaming, the algorithm is capable of beginning streaming from a random initialization, making base initialization optional. Going further, the algorithm's simplicity allows theoretical convergence guarantees to be derived using analysis of the Neural Tangent Random Feature (NTRF). In experiments, we find that CSSL outperforms existing baselines for streaming learning in experiments on CIFAR100, ImageNet, and Core50 datasets. Additionally, we propose a novel multi-task streaming learning setting and show that CSSL performs favorably in this domain. Put simply, CSSL performs well and demonstrates that the complicated, multi-step training pipelines adopted by most streaming methodologies can be replaced with a simple, end-to-end learning approach without sacrificing performance.
translated by 谷歌翻译
深度神经网络(DNN)在解决许多真实问题方面都有效。较大的DNN模型通常表现出更好的质量(例如,精度,精度),但它们的过度计算会导致长期推理时间。模型稀疏可以降低计算和内存成本,同时保持模型质量。大多数现有的稀疏算法是单向移除的重量,而其他人则随机或贪婪地探索每层进行修剪的小权重子集。这些算法的局限性降低了可实现的稀疏性水平。此外,许多算法仍然需要预先训练的密集模型,因此遭受大的内存占地面积。在本文中,我们提出了一种新颖的预定生长和修剪(间隙)方法,而无需预先培训密集模型。它通过反复生长一个层次的层来解决以前的作品的缺点,然后在一些训练后修剪回到稀疏。实验表明,使用所提出的方法修剪模型匹配或击败高度优化的密集模型的质量,在各种任务中以80%的稀疏度,例如图像分类,客观检测,3D对象分段和翻译。它们还优于模型稀疏的其他最先进的(SOTA)方法。作为一个例子,通过间隙获得的90%不均匀的稀疏resnet-50模型在想象中实现了77.9%的前1个精度,提高了先前的SOTA结果1.5%。所有代码将公开发布。
translated by 谷歌翻译
Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for multiple network architectures, datasets, and tasks, which imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model, 2) learned "important" weights of the large model are typically not useful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency in the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods. We also compare with the "Lottery Ticket Hypothesis" (Frankle & Carbin, 2019), and find that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization. * Equal contribution. † Work done while visiting UC Berkeley.
translated by 谷歌翻译
训练有素的神经网络的性能至关重要。加上深度学习模型的不断增长的规模,这种观察激发了对学习稀疏模型的广泛研究。在这项工作中,我们专注于控制稀疏学习时的稀疏水平的任务。基于稀疏性惩罚的现有方法涉及对罚款因素的昂贵反复试验调整,因此缺乏直接控制所得模型的稀疏性。作为响应,我们采用了一个约束的公式:使用Louizos等人提出的栅极机制。 (2018年),我们制定了一个受约束的优化问题,其中稀疏以训练目标和所需的稀疏目标以端到端的方式指导。使用WIDERESNET和RESNET {18,50}模型进行了CIFAR-10/100,Tinyimagenet和ImageNet的实验验证了我们的提案的有效性,并证明我们可以可靠地实现预定的稀疏目标,而不会损害预测性能。
translated by 谷歌翻译
过滤器修剪的目标是搜索不重要的过滤器以删除以便使卷积神经网络(CNNS)有效而不牺牲过程中的性能。挑战在于找到可以帮助确定每个过滤器关于神经网络的最终输出的重要或相关的信息的信息。在这项工作中,我们分享了我们的观察说,预先训练的CNN的批量标准化(BN)参数可用于估计激活输出的特征分布,而无需处理训练数据。在观察时,我们通过基于预先训练的CNN的BN参数评估每个滤波器的重要性来提出简单而有效的滤波修剪方法。 CiFar-10和Imagenet的实验结果表明,该方法可以在准确性下降和计算复杂性的计算复杂性和降低的折衷方面具有和不进行微调的卓越性能。
translated by 谷歌翻译
To reduce the significant redundancy in deep Convolutional Neural Networks (CNNs), most existing methods prune neurons by only considering statistics of an individual layer or two consecutive layers (e.g., prune one layer to minimize the reconstruction error of the next layer), ignoring the effect of error propagation in deep networks. In contrast, we argue that it is essential to prune neurons in the entire neuron network jointly based on a unified goal: minimizing the reconstruction error of important responses in the "final response layer" (FRL), which is the secondto-last layer before classification, for a pruned network to retrain its predictive power. Specifically, we apply feature ranking techniques to measure the importance of each neuron in the FRL, and formulate network pruning as a binary integer optimization problem and derive a closed-form solution to it for pruning neurons in earlier layers. Based on our theoretical analysis, we propose the Neuron Importance Score Propagation (NISP) algorithm to propagate the importance scores of final responses to every neuron in the network. The CNN is pruned by removing neurons with least importance, and then fine-tuned to retain its predictive power. NISP is evaluated on several datasets with multiple CNN models and demonstrated to achieve significant acceleration and compression with negligible accuracy loss.
translated by 谷歌翻译
事实证明,稀疏的深度神经网络在大规模研究中对于预测模型构建有效。尽管几项作品研究了稀疏神经体系结构的理论和数值特性,但它们主要集中在边缘选择上。通过优势选择的稀疏性可能具有直觉上的吸引力;但是,它不一定会降低网络的结构复杂性。相反,修剪过多的节点会导致一个结构稀疏的网络,并在推理过程中具有显着的计算加速。为此,我们建议使用Spike and-Slab Gaussian先验者提出贝叶斯稀疏溶液,以允许在训练过程中选择自动节点。使用Spike and-Slab先验减轻了对修剪的临时阈值规则的需求。此外,我们采用了一种差异贝叶斯方法来规避传统马尔可夫链蒙特卡洛(MCMC)实施的计算挑战。在节点选择的背景下,我们建立了变异后一致性的基本结果,以及先前参数的表征。与以前的作品相反,我们的理论发展放宽了所有网络权重的节点和均匀界限的假设,从而适应具有层依赖性节点结构或系数边界的稀疏网络。通过对先前纳入概率的层表表征,我们讨论了后部变异的最佳收缩率。我们从经验上证明,我们所提出的方法的表现优于计算复杂性的边缘选择方法,具有相似或更好的预测性能。我们的实验证据进一步证明了我们的理论工作有助于层面上的最佳节点恢复。
translated by 谷歌翻译
修剪深度神经网络的现有方法专注于去除训练有素的网络的不必要参数,然后微调模型,找到恢复训练模型的初始性能的良好解决方案。与其他作品不同,我们的方法特别注意通过修剪神经元的压缩模型和推理计算时间的解决方案的质量。通过探索Hessian的光谱半径,所提出的算法通过探索Hessian的光谱半径来指示压缩模型的参数,这导致了更好地推广了未经看涨的数据。此外,该方法不适用于预先训练的网络,并同时执行训练和修剪。我们的结果表明,它改善了神经元压缩的最先进的结果。该方法能够在不同神经网络模型上实现具有小精度下降的非常小的网络。
translated by 谷歌翻译
Pruning large neural networks while maintaining their performance is often desirable due to the reduced space and time complexity. In existing methods, pruning is done within an iterative optimization procedure with either heuristically designed pruning schedules or additional hyperparameters, undermining their utility. In this work, we present a new approach that prunes a given network once at initialization prior to training. To achieve this, we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task. This eliminates the need for both pretraining and the complex pruning schedule while making it robust to architecture variations. After pruning, the sparse network is trained in the standard way. Our method obtains extremely sparse networks with virtually the same accuracy as the reference network on the MNIST, CIFAR-10, and Tiny-ImageNet classification tasks and is broadly applicable to various architectures including convolutional, residual and recurrent networks. Unlike existing methods, our approach enables us to demonstrate that the retained connections are indeed relevant to the given task.
translated by 谷歌翻译
The importance of learning rate (LR) schedules on network pruning has been observed in a few recent works. As an example, Frankle and Carbin (2019) highlighted that winning tickets (i.e., accuracy preserving subnetworks) can not be found without applying a LR warmup schedule and Renda, Frankle and Carbin (2020) demonstrated that rewinding the LR to its initial state at the end of each pruning cycle improves performance. In this paper, we go one step further by first providing a theoretical justification for the surprising effect of LR schedules. Next, we propose a LR schedule for network pruning called SILO, which stands for S-shaped Improved Learning rate Optimization. The advantages of SILO over existing state-of-the-art (SOTA) LR schedules are two-fold: (i) SILO has a strong theoretical motivation and dynamically adjusts the LR during pruning to improve generalization. Specifically, SILO increases the LR upper bound (max_lr) in an S-shape. This leads to an improvement of 2% - 4% in extensive experiments with various types of networks (e.g., Vision Transformers, ResNet) on popular datasets such as ImageNet, CIFAR-10/100. (ii) In addition to the strong theoretical motivation, SILO is empirically optimal in the sense of matching an Oracle, which exhaustively searches for the optimal value of max_lr via grid search. We find that SILO is able to precisely adjust the value of max_lr to be within the Oracle optimized interval, resulting in performance competitive with the Oracle with significantly lower complexity.
translated by 谷歌翻译
Many applications require sparse neural networks due to space or inference time restrictions. There is a large body of work on training dense networks to yield sparse networks for inference, but this limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-tosparse training methods. Our method updates the topology of the sparse network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. We demonstrate state-of-the-art sparse training results on a variety of networks and datasets, including ResNet-50, MobileNets on Imagenet-2012, and RNNs on WikiText-103. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static * .
translated by 谷歌翻译
彩票假设引发了通过识别大型随机初始化神经网络的稀疏子网来实现结构学习的修剪算法的快速发展。这些“胜利门票”的存在理论上已被证明,但在次优稀疏水平。当代修剪算法还在努力确定复杂的学习任务的稀疏彩票票。这个次优稀疏仅仅是存在证明和算法的文物还是修剪方法的一般限制?并且,如果存在非常稀疏的罚单,则当前算法是能够找到它们的当前算法,或者是实现有效网络压缩所需的进一步改进吗?为了系统地回答这些问题,我们推导了一个框架来植物并隐藏大型随机初始化的神经网络中的目标架构。对于机器学习中的三个共同挑战,我们手工制作极其稀疏的网络拓扑,将它们植入大型神经网络,并评估最先进的彩票修剪方法。我们发现,修剪算法的当前局限性识别极其稀疏的票证是算法的,而不是基本的性质,并且预期我们的种植框架将促进有效修剪算法的未来发展,因为我们已经解决了所提出的领域缺失基线的问题Frankle等人。
translated by 谷歌翻译