Many applications require sparse neural networks due to space or inference time restrictions. There is a large body of work on training dense networks to yield sparse networks for inference, but this limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-tosparse training methods. Our method updates the topology of the sparse network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. We demonstrate state-of-the-art sparse training results on a variety of networks and datasets, including ResNet-50, MobileNets on Imagenet-2012, and RNNs on WikiText-103. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static * .
translated by 谷歌翻译
关于稀疏神经网络训练(稀疏训练)的最新研究表明,通过从头开始训练本质上稀疏的神经网络可以实现绩效和效率之间的令人信服的权衡。现有的稀疏训练方法通常努力在一次跑步中找到最佳的稀疏子网,而无需涉及任何昂贵的密集或预训练步骤。例如,作为最突出的方向之一,动态稀疏训练(DST)能够通过在训练过程中迭代发展稀疏拓扑来实现竞争性训练的竞争性能。在本文中,我们认为最好分配有限的资源来创建多个低损失的稀疏子网并将其超级置于更强的基因,而不是完全分配所有资源以找到单个子网络。为了实现这一目标,需要两个Desiderata:(1)在一个培训过程中有效生产许多低损失的子网,即所谓的廉价门票,仅限于用于密集培训的标准培训时间; (2)将这些廉价的门票有效地超级为一个更强的子网,而无需超越约束参数预算。为了证实我们的猜想,我们提出了一种新颖的稀疏训练方法,称为\ textbf {sup-tickets},可以在单个稀疏到较小的训练过程中同时满足上述两个desiderata。在CIFAR-10/100和Imagenet上的各种现代体系结构中,我们表明,SUP-Tickets与现有的稀疏训练方法无缝集成,并显示出一致的性能提高。
translated by 谷歌翻译
最近对稀疏神经网络的作品已经证明了独立从头开始训练稀疏子网,以匹配其相应密集网络的性能。然而,识别这种稀疏的子网(获奖票)涉及昂贵的迭代火车 - 培训 - 培训过程(例如,彩票票证假设)或过度扩展的训练时间(例如,动态稀疏训练)。在这项工作中,我们在稀疏神经网络训练和深度合并技术之间汲取了独特的联系,产生了一个名为FreeTickets的新型集合学习框架。 FreeTickets而不是从密集的网络开始,随机初始化稀疏的子网,然后在动态调整其稀疏掩码的同时列举子网,从而在整个训练过程中产生许多不同的稀疏子网。 FreeTickets被定义为这些稀疏子网的集合,在这种单次通过,稀疏稀疏训练中自由获得,其仅使用Vanilla密集培训所需的计算资源的一小部分。此外,尽管是模型的集合,但与单一密集模型相比,FreeTickets的参数和训练拖鞋更少:这种看似反向直观的结果是由于每个子网的高稀疏性。与标准致密基线相比,观察到惯性基因术,以预测准确性,不确定度估计,鲁棒性和效率相比表现出显着的全面改进。 FreeTickets在ImageNet上只使用后者所需的四分之一的培训拖鞋,可以轻松地表达Naive Deep EndleBe。我们的结果提供了对稀疏神经网络的强度的见解,并表明稀疏性的好处超出了通常预期的推理效率。
translated by 谷歌翻译
近年来,稀疏神经网络的使用迅速增长,尤其是在计算机视觉中。它们的吸引力在很大程度上源于培训和存储所需的参数数量以及学习效率的提高。有些令人惊讶的是,很少有努力探索他们在深度强化学习中的使用(DRL)。在这项工作中,我们进行了系统的调查,以在各种DRL代理和环境上应用许多现有的稀疏培训技术。我们的结果证实了计算机视觉域中稀疏训练的发现 - 稀疏网络在DRL域中对相同的参数计数的稀疏网络表现更好。我们提供了有关DRL中各种组件如何受到稀疏网络的影响的详细分析,并通过建议有希望的途径提高稀疏训练方法的有效性以及推进其在DRL中的使用来结论。
translated by 谷歌翻译
由于稀疏神经网络通常包含许多零权重,因此可以在不降低网络性能的情况下潜在地消除这些不必要的网络连接。因此,设计良好的稀疏神经网络具有显着降低拖鞋和计算资源的潜力。在这项工作中,我们提出了一种新的自动修剪方法 - 稀疏连接学习(SCL)。具体地,重量被重新参数化为可培训权重变量和二进制掩模的元素方向乘法。因此,由二进制掩模完全描述网络连接,其由单位步进函数调制。理论上,从理论上证明了使用直通估计器(STE)进行网络修剪的基本原理。这一原则是STE的代理梯度应该是积极的,确保掩模变量在其最小值处收敛。在找到泄漏的Relu后,SoftPlus和Identity Stes可以满足这个原理,我们建议采用SCL的身份STE以进行离散面膜松弛。我们发现不同特征的面具梯度非常不平衡,因此,我们建议将每个特征的掩模梯度标准化以优化掩码变量训练。为了自动训练稀疏掩码,我们将网络连接总数作为我们的客观函数中的正则化术语。由于SCL不需要由网络层设计人员定义的修剪标准或超级参数,因此在更大的假设空间中探讨了网络,以实现最佳性能的优化稀疏连接。 SCL克服了现有自动修剪方法的局限性。实验结果表明,SCL可以自动学习并选择各种基线网络结构的重要网络连接。 SCL培训的深度学习模型以稀疏性,精度和减少脚波特的SOTA人类设计和自动修剪方法训练。
translated by 谷歌翻译
网络修剪是一种广泛使用的技术,用于有效地压缩深神经网络,几乎没有在推理期间在性能下降低。迭代幅度修剪(IMP)是由几种迭代训练和修剪步骤组成的网络修剪的最熟悉的方法之一,其中在修剪后丢失了大量网络的性能,然后在随后的再培训阶段中恢复。虽然常用为基准参考,但经常认为a)通过不将稀疏纳入训练阶段来达到次优状态,b)其全球选择标准未能正确地确定最佳层面修剪速率和c)其迭代性质使它变得缓慢和不竞争。根据最近提出的再培训技术,我们通过严格和一致的实验来调查这些索赔,我们将Impr到培训期间的训练算法进行比较,评估其选择标准的建议修改,并研究实际需要的迭代次数和总培训时间。我们发现IMP与SLR进行再培训,可以优于最先进的修剪期间,没有或仅具有很少的计算开销,即全局幅度选择标准在很大程度上具有更复杂的方法,并且只有几个刷新时期在实践中需要达到大部分稀疏性与IMP的诽谤 - 与性能权衡。我们的目标既可以证明基本的进攻已经可以提供最先进的修剪结果,甚至优于更加复杂或大量参数化方法,也可以为未来的研究建立更加现实但易于可实现的基线。
translated by 谷歌翻译
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance.We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the lottery ticket hypothesis: dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that-when trained in isolationreach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
translated by 谷歌翻译
深度神经网络(DNN)在解决许多真实问题方面都有效。较大的DNN模型通常表现出更好的质量(例如,精度,精度),但它们的过度计算会导致长期推理时间。模型稀疏可以降低计算和内存成本,同时保持模型质量。大多数现有的稀疏算法是单向移除的重量,而其他人则随机或贪婪地探索每层进行修剪的小权重子集。这些算法的局限性降低了可实现的稀疏性水平。此外,许多算法仍然需要预先训练的密集模型,因此遭受大的内存占地面积。在本文中,我们提出了一种新颖的预定生长和修剪(间隙)方法,而无需预先培训密集模型。它通过反复生长一个层次的层来解决以前的作品的缺点,然后在一些训练后修剪回到稀疏。实验表明,使用所提出的方法修剪模型匹配或击败高度优化的密集模型的质量,在各种任务中以80%的稀疏度,例如图像分类,客观检测,3D对象分段和翻译。它们还优于模型稀疏的其他最先进的(SOTA)方法。作为一个例子,通过间隙获得的90%不均匀的稀疏resnet-50模型在想象中实现了77.9%的前1个精度,提高了先前的SOTA结果1.5%。所有代码将公开发布。
translated by 谷歌翻译
Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for multiple network architectures, datasets, and tasks, which imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model, 2) learned "important" weights of the large model are typically not useful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency in the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods. We also compare with the "Lottery Ticket Hypothesis" (Frankle & Carbin, 2019), and find that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization. * Equal contribution. † Work done while visiting UC Berkeley.
translated by 谷歌翻译
深度神经网络(DNN)的计算要求增加导致获得稀疏,且准确的DNN模型的兴趣。最近的工作已经调查了稀疏训练的更加困难的情况,其中DNN重量尽可能稀少,以减少训练期间的计算成本。现有的稀疏训练方法通常是经验的,并且可以具有相对于致密基线的准确性较低。在本文中,我们介绍了一种称为交替压缩/解压缩(AC / DC)训练DNN的一般方法,证明了算法变体的收敛,并表明AC / DC在类似的计算预算中准确地表现出现有的稀疏训练方法;在高稀疏水平下,AC / DC甚至优于现有的现有方法,依赖于准确的预训练密集模型。 AC / DC的一个重要属性是它允许联合培训密集和稀疏的型号,在训练过程结束时产生精确的稀疏密集模型对。这在实践中是有用的,其中压缩变体可能是为了在资源受限的设置中进行部署而不重新执行整个训练流,并且还为我们提供了深入和压缩模型之间的精度差距的见解。代码可在:https://github.com/ist-daslab/acdc。
translated by 谷歌翻译
Sparse neural networks attract increasing interest as they exhibit comparable performance to their dense counterparts while being computationally efficient. Pruning the dense neural networks is among the most widely used methods to obtain a sparse neural network. Driven by the high training cost of such methods that can be unaffordable for a low-resource device, training sparse neural networks sparsely from scratch has recently gained attention. However, existing sparse training algorithms suffer from various issues, including poor performance in high sparsity scenarios, computing dense gradient information during training, or pure random topology search. In this paper, inspired by the evolution of the biological brain and the Hebbian learning theory, we present a new sparse training approach that evolves sparse neural networks according to the behavior of neurons in the network. Concretely, by exploiting the cosine similarity metric to measure the importance of the connections, our proposed method, Cosine similarity-based and Random Topology Exploration (CTRE), evolves the topology of sparse neural networks by adding the most important connections to the network without calculating dense gradient in the backward. We carried out different experiments on eight datasets, including tabular, image, and text datasets, and demonstrate that our proposed method outperforms several state-of-the-art sparse training algorithms in extremely sparse neural networks by a large gap. The implementation code is available on https://github.com/zahraatashgahi/CTRE
translated by 谷歌翻译
Pruning large neural networks while maintaining their performance is often desirable due to the reduced space and time complexity. In existing methods, pruning is done within an iterative optimization procedure with either heuristically designed pruning schedules or additional hyperparameters, undermining their utility. In this work, we present a new approach that prunes a given network once at initialization prior to training. To achieve this, we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task. This eliminates the need for both pretraining and the complex pruning schedule while making it robust to architecture variations. After pruning, the sparse network is trained in the standard way. Our method obtains extremely sparse networks with virtually the same accuracy as the reference network on the MNIST, CIFAR-10, and Tiny-ImageNet classification tasks and is broadly applicable to various architectures including convolutional, residual and recurrent networks. Unlike existing methods, our approach enables us to demonstrate that the retained connections are indeed relevant to the given task.
translated by 谷歌翻译
有效地近似损失函数的局部曲率信息是用于深神经网络的优化和压缩的关键工具。然而,大多数现有方法近似二阶信息具有高计算或存储成本,这可以限制其实用性。在这项工作中,我们调查矩阵,用于估计逆象征的矢量产品(IHVPS)的矩阵线性时间方法,因为当Hessian可以近似为乘语 - 一个矩阵的总和时,如Hessian的经典近似由经验丰富的Fisher矩阵。我们提出了两个新的算法作为称为M-FAC的框架的一部分:第一个算法朝着网络压缩量身定制,如果Hessian给出了M $等级的总和,则可以计算Dimension $ D $的IHVP。 ,使用$ O(DM ^ 2)$预压制,$ O(DM)$代价计算IHVP,并查询逆Hessian的任何单个元素的费用$ O(m)$。第二算法针对优化设置,我们希望在反向Hessian之间计算产品,估计在优化步骤的滑动窗口和给定梯度方向上,根据预先说明的SGD所需的梯度方向。我们为计算IHVP和OHVP和O(DM + M ^ 3)$ of $ o(dm + m ^ 2)$提供算法,以便从滑动窗口添加或删除任何渐变。这两种算法产生最先进的结果,用于网络修剪和相对于现有二阶方法的计算开销的优化。在[9]和[17]可用实现。
translated by 谷歌翻译
神经网络的架构和参数通常独立优化,这需要每当修改体系结构时对参数的昂贵再次再次再次进行验证。在这项工作中,我们专注于在不需要昂贵的再培训的情况下越来越多。我们提出了一种在训练期间添加新神经元的方法,而不会影响已经学到的内容,同时改善了培训动态。我们通过最大化新重量的梯度来实现后者,并通过奇异值分解(SVD)有效地找到最佳初始化。我们称这种技术渐变最大化增长(Gradmax),并展示其各种视觉任务和架构的效力。
translated by 谷歌翻译
最近,稀疏的培训方法已开始作为事实上的人工神经网络的培训和推理效率的方法。然而,这种效率只是理论上。在实践中,每个人都使用二进制掩码来模拟稀疏性,因为典型的深度学习软件和硬件已针对密集的矩阵操作进行了优化。在本文中,我们采用正交方法,我们表明我们可以训练真正稀疏的神经网络以收获其全部潜力。为了实现这一目标,我们介绍了三个新颖的贡献,这些贡献是专门为稀疏神经网络设计的:(1)平行训练算法及其相应的稀疏实现,(2)具有不可训练的参数的激活功能,以支持梯度流动,以支持梯度流量, (3)隐藏的神经元对消除冗余的重要性指标。总而言之,我们能够打破记录并训练有史以来最大的神经网络在代表力方面训练 - 达到蝙蝠大脑的大小。结果表明,我们的方法具有最先进的表现,同时为环保人工智能时代开辟了道路。
translated by 谷歌翻译
彩票票证假设(LTH)表明,密集的模型包含高度稀疏的子网(即获奖门票),可以隔离培训以完全准确。尽管做出了许多激动人心的努力,但仍有一个“常识”很少受到挑战:通过迭代级修剪(IMP)发现了一张获胜的票,因此由此产生的修剪子网仅具有非结构化的稀疏性。这一差距限制了在实践中赢得门票的吸引力,因为高度不规则的稀疏模式在硬件上加速的挑战是挑战性的。同时,直接将结构化修剪替换为非结构化的修剪,以更严重地损害绩效,并且通常无法找到获胜的票。在本文中,我们证明了第一个积极的结果是,总体上可以有效地找到结构上稀疏的获胜票。核心思想是在每一轮(非结构化)IMP之后附加“后处理技术”,以实施结构稀疏的形成。具体而言,我们首先在某些被认为很重要的通道中“重新填充”修剪元素,然后“重新组”非零元素以创建灵活的群体结构模式。我们确定的渠道和团体结构子网都赢得了彩票,并以现有硬件很容易支持的大量推理加速。广泛的实验,在多个网络骨架的不同数据集上进行,一致验证了我们的建议,表明LTH的硬件加速障碍现在已被删除。具体而言,结构上的获胜票最多可获得{64.93%,64.84%,60.23%}的运行时间节省,以{36%〜80%,74%,58%}的稀疏性在{Cifar,cifar,tiny-imageNet,imageNet}上保持可比较的精度。代码在https://github.com/vita-group/structure-lth上。
translated by 谷歌翻译
人们通常认为,修剪网络不仅会降低深网的计算成本,而且还可以通过降低模型容量来防止过度拟合。但是,我们的工作令人惊讶地发现,网络修剪有时甚至会加剧过度拟合。我们报告了出乎意料的稀疏双后裔现象,随着我们通过网络修剪增加模型稀疏性,首先测试性能变得更糟(由于过度拟合),然后变得更好(由于过度舒适),并且终于变得更糟(由于忘记了有用的有用信息)。尽管最近的研究集中在模型过度参数化方面,但他们未能意识到稀疏性也可能导致双重下降。在本文中,我们有三个主要贡献。首先,我们通过广泛的实验报告了新型的稀疏双重下降现象。其次,对于这种现象,我们提出了一种新颖的学习距离解释,即$ \ ell_ {2} $稀疏模型的学习距离(从初始化参数到最终参数)可能与稀疏的双重下降曲线良好相关,并更好地反映概括比最小平坦。第三,在稀疏的双重下降的背景下,彩票票假设中的获胜票令人惊讶地并不总是赢。
translated by 谷歌翻译
We propose a new formulation for pruning convolutional kernels in neural networks to enable efficient inference. We interleave greedy criteria-based pruning with finetuning by backpropagation-a computationally efficient procedure that maintains good generalization in the pruned network. We propose a new criterion based on Taylor expansion that approximates the change in the cost function induced by pruning network parameters. We focus on transfer learning, where large pretrained networks are adapted to specialized tasks. The proposed criterion demonstrates superior performance compared to other criteria, e.g. the norm of kernel weights or feature map activation, for pruning large CNNs after adaptation to fine-grained classification tasks (Birds-200 and Flowers-102) relaying only on the first order gradient information. We also show that pruning can lead to more than 10× theoretical reduction in adapted 3D-convolutional filters with a small drop in accuracy in a recurrent gesture classifier. Finally, we show results for the largescale ImageNet dataset to emphasize the flexibility of our approach.
translated by 谷歌翻译
Neural network pruning-the task of reducing the size of a network by removing parameters-has been the subject of a great deal of work in recent years. We provide a meta-analysis of the literature, including an overview of approaches to pruning and consistent findings in the literature. After aggregating results across 81 papers and pruning hundreds of models in controlled conditions, our clearest finding is that the community suffers from a lack of standardized benchmarks and metrics. This deficiency is substantial enough that it is hard to compare pruning techniques to one another or determine how much progress the field has made over the past three decades. To address this situation, we identify issues with current practices, suggest concrete remedies, and introduce ShrinkBench, an open-source framework to facilitate standardized evaluations of pruning methods. We use ShrinkBench to compare various pruning techniques and show that its comprehensive evaluation can prevent common pitfalls when comparing pruning methods.
translated by 谷歌翻译
The deployment of deep convolutional neural networks (CNNs) in many real world applications is largely hindered by their high computational cost. In this paper, we propose a novel learning scheme for CNNs to simultaneously 1) reduce the model size; 2) decrease the run-time memory footprint; and 3) lower the number of computing operations, without compromising accuracy. This is achieved by enforcing channel-level sparsity in the network in a simple but effective way. Different from many existing approaches, the proposed method directly applies to modern CNN architectures, introduces minimum overhead to the training process, and requires no special software/hardware accelerators for the resulting models. We call our approach network slimming, which takes wide and large networks as input models, but during training insignificant channels are automatically identified and pruned afterwards, yielding thin and compact models with comparable accuracy. We empirically demonstrate the effectiveness of our approach with several state-of-the-art CNN models, including VGGNet, ResNet and DenseNet, on various image classification datasets. For VGGNet, a multi-pass version of network slimming gives a 20× reduction in model size and a 5× reduction in computing operations.
translated by 谷歌翻译