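Recent advances in deep learning optimization have shown that only a subset of a model's parameters is really necessary to train it successfully. Potentially, this finding has broad impact from theory to application. However, finding these trainable sub-networks is known to be a costly process. This inhibits practical applications and raises the question: can the learned sub-graph structures in deep learning models be found at training time? In this work, we explore this possibility, observing and motivating why common approaches typically fail in the extreme scenarios of interest, and we propose an approach that potentially enables training with reduced computational effort. Experiments on challenging architectures and datasets suggest that such a computational gain is algorithmically accessible, and in particular that a trade-off emerges between the accuracy achieved and the training complexity deployed.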
translated by 谷歌翻译
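Recent advances in deep learning optimization have shown that, with some a-posteriori information on fully trained models, it is possible to match their performance by training only a subset of their parameters. This finding has broad impact from theory to application, driving research towards methods that identify the minimum subset of parameters to train without exploiting look-ahead information. However, the methods proposed so far do not match state-of-the-art performance and rely on unstructured, sparsely connected models. In this work, we shift the focus from single parameters to the behavior of the whole neuron, exploiting the concept of neuronal equilibrium (NEq). When a neuron is at equilibrium (meaning that it has learned a specific input-output relationship), we can halt its update; conversely, when a neuron is at non-equilibrium, we let its state evolve towards equilibrium by updating its parameters. The proposed approach has been tested on different state-of-the-art learning strategies and tasks, validating NEq and showing that neuronal equilibrium depends on the specific learning setup.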
Pruning refers to the elimination of trivial weights from neural networks. The sub-networks within an overparameterized model produced after pruning are often called lottery tickets. This research aims to generate winning lottery tickets from a set of lottery tickets that can achieve accuracy similar to that of the original unpruned network. We introduce a novel winning ticket called Cyclic Overlapping Lottery Ticket (COLT), obtained by data splitting and cyclic retraining of the pruned network from scratch. We apply a cyclic pruning algorithm that keeps only the overlapping weights of different pruned models trained on different data segments. Our results demonstrate that COLT can achieve accuracies similar to those of the unpruned model while maintaining high sparsities. We show that the accuracy of COLT is on par with the winning tickets of the Lottery Ticket Hypothesis (LTH) and, at times, is better. Moreover, COLTs can be generated using fewer iterations than tickets generated by the popular Iterative Magnitude Pruning (IMP) method. In addition, we notice that COLTs generated on large datasets can be transferred to small ones without compromising performance, demonstrating their generalizing capability. We conduct all our experiments on the CIFAR-10, CIFAR-100, and TinyImageNet datasets and report performance superior to that of the state-of-the-art methods.
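The overlap step described in this abstract (keeping only weights that survive pruning in models trained on different data segments) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `magnitude_mask` and `overlap_mask` are hypothetical, pruning is assumed to be per-tensor magnitude pruning, and the cyclic retraining loop is omitted.

```python
import numpy as np

def magnitude_mask(weights: np.ndarray, prune_fraction: float) -> np.ndarray:
    """Boolean mask that drops the `prune_fraction` smallest-magnitude weights."""
    n_prune = int(prune_fraction * weights.size)
    mask = np.ones(weights.size, dtype=bool)
    # Flattened indices sorted by ascending |w|; the first n_prune are removed.
    order = np.argsort(np.abs(weights), axis=None)
    mask[order[:n_prune]] = False
    return mask.reshape(weights.shape)

def overlap_mask(weights_a: np.ndarray, weights_b: np.ndarray,
                 prune_fraction: float) -> np.ndarray:
    """Keep a weight only if it survives pruning in BOTH models (assumed rule)."""
    return (magnitude_mask(weights_a, prune_fraction)
            & magnitude_mask(weights_b, prune_fraction))

# Two toy weight tensors, standing in for models trained on two data segments.
w_a = np.array([0.1, -2.0, 3.0, 0.5])
w_b = np.array([1.0, 0.2, -3.0, 0.4])
shared = overlap_mask(w_a, w_b, prune_fraction=0.5)
```

Because the intersection of two masks is at least as sparse as either mask alone, each overlap step can only increase sparsity.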
We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. We use this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy. We find that these subnetworks only reach full accuracy when they are stable to SGD noise, which either occurs at initialization for small-scale settings (MNIST) or early in training for large-scale settings (ResNet-50 and Inception-v3 on ImageNet).
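Whether two solutions lie in the same linearly connected region can be checked by evaluating the error along the straight line between them. The sketch below assumes flattened parameter vectors and a caller-supplied `error_fn` (hypothetical); it measures the barrier as the highest error on the path minus the mean error of the two endpoints, which is one common convention and is an assumption here.

```python
import numpy as np

def interpolate(theta_a: np.ndarray, theta_b: np.ndarray, alpha: float) -> np.ndarray:
    """Point on the straight line between two parameter vectors."""
    return (1.0 - alpha) * theta_a + alpha * theta_b

def error_barrier(theta_a, theta_b, error_fn, num_points: int = 11) -> float:
    """Highest error along the path minus the mean endpoint error."""
    alphas = np.linspace(0.0, 1.0, num_points)
    errors = np.array([error_fn(interpolate(theta_a, theta_b, a)) for a in alphas])
    endpoint_mean = 0.5 * (errors[0] + errors[-1])
    return float(errors.max() - endpoint_mean)

# Toy error surfaces: a convex bowl (no barrier) and a double well (barrier).
bowl = lambda t: float(np.sum(t ** 2))
double_well = lambda t: float((t[0] ** 2 - 1.0) ** 2)

a, b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
barrier_bowl = error_barrier(a, b, bowl)
barrier_well = error_barrier(a, b, double_well)
```

A barrier near zero suggests that the two runs ended in the same linearly connected region; in practice `error_fn` would evaluate a network on held-out data.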
Many applications require sparse neural networks due to space or inference time restrictions. There is a large body of work on training dense networks to yield sparse networks for inference, but this limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. Our method updates the topology of the sparse network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. We demonstrate state-of-the-art sparse training results on a variety of networks and datasets, including ResNet-50, MobileNets on Imagenet-2012, and RNNs on WikiText-103. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static.
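A topology update driven by parameter magnitudes and gradients, as described in this abstract, might look like the sketch below. It is a simplified illustration under stated assumptions: the name `update_topology` is hypothetical, candidates for growth are restricted to connections that were inactive before the drop, and newly grown weights start at zero. The abstract itself does not fix these details.

```python
import numpy as np

def update_topology(weights, grads, mask, drop_fraction):
    """Drop the smallest-magnitude active weights and grow the same number of
    inactive connections with the largest gradient magnitude, so the number
    of active parameters stays fixed."""
    flat_w = weights.ravel().copy()
    flat_g = grads.ravel()
    flat_m = mask.ravel().astype(bool)  # astype returns a copy
    active = np.flatnonzero(flat_m)
    inactive = np.flatnonzero(~flat_m)
    n_swap = min(int(drop_fraction * active.size), inactive.size)
    # Active connections with the smallest |w| are dropped.
    drop = active[np.argsort(np.abs(flat_w[active]))[:n_swap]]
    # Inactive connections with the largest |grad| are grown.
    grow = inactive[np.argsort(-np.abs(flat_g[inactive]))[:n_swap]]
    flat_m[drop] = False
    flat_m[grow] = True
    flat_w[~flat_m] = 0.0  # pruned connections carry no weight
    flat_w[grow] = 0.0     # assumption: grown connections start at zero
    return flat_w.reshape(weights.shape), flat_m.reshape(mask.shape)

w = np.array([0.5, -0.1, 0.0, 0.0, 2.0, 0.0])
g = np.array([0.0, 0.0, 0.3, -0.9, 0.0, 0.1])
m = np.array([True, True, False, False, True, False])
new_w, new_m = update_topology(w, g, m, drop_fraction=0.34)
```

Because every drop is paired with a grow, the parameter count is the same before and after the update, which is what keeps the training cost fixed.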
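Existing approaches to pruning deep neural networks focus on removing unnecessary parameters from a trained network and then fine-tuning the model to find a good solution that recovers the initial performance of the trained model. Unlike other works, our method pays special attention to the quality of the solution in the compressed model and to inference computation time by pruning neurons. The proposed algorithm steers the parameters of the compressed model by exploring the spectral radius of the Hessian, which leads to better generalization on unseen data. Moreover, the method does not work with a pre-trained network and performs training and pruning simultaneously. Our results show that it improves the state-of-the-art results on neuron compression. The method is able to achieve very small networks with small accuracy degradation across different neural network models.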
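Deep neural networks have been used successfully in a wide variety of applications. However, their highly complex nature, owing to their millions of parameters, causes problems when they are deployed in pipelines with low latency requirements. As a result, it is desirable to obtain lightweight neural networks that have the same performance during inference. In this work, we propose a weight-based pruning approach in which weights are pruned gradually based on their momentum over previous iterations. Each layer of the neural network is assigned an importance value based on its relative sparsity, followed by the magnitude of its weights in previous iterations. We evaluate our approach on networks such as AlexNet, VGG16, and ResNet50 with image classification datasets such as CIFAR-10 and CIFAR-100. We find that the results outperform previous methods with respect to accuracy and compression ratio. Our method is able to obtain a compression of 15% for the same degradation in accuracy on both datasets.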
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the lottery ticket hypothesis: dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
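The idea of identifying a winning ticket by pruning a trained network and returning the surviving connections to their initial values can be sketched as below. This is a hedged illustration, not the paper's code: `find_winning_ticket` and `train_fn` are hypothetical names, the training routine is supplied by the caller, and pruning is assumed to remove a fixed fraction of the surviving weights by magnitude in each round.

```python
import numpy as np

def find_winning_ticket(init_weights, train_fn, rounds, prune_per_round):
    """Iteratively train, prune the smallest-magnitude surviving weights,
    and reset the survivors to their initial values."""
    flat_mask = np.ones(init_weights.size, dtype=bool)
    for _ in range(rounds):
        mask = flat_mask.reshape(init_weights.shape)
        # Every round restarts from the ORIGINAL initialization, masked.
        trained = train_fn(init_weights * mask, mask)
        alive = np.flatnonzero(flat_mask)
        n_prune = int(prune_per_round * alive.size)
        magnitudes = np.abs(trained.ravel()[alive])
        flat_mask[alive[np.argsort(magnitudes)[:n_prune]]] = False
    mask = flat_mask.reshape(init_weights.shape)
    return init_weights * mask, mask

# Toy stand-in for training: scales the weights and respects the mask.
toy_train = lambda w, m: 2.0 * w * m

init = np.array([0.1, -0.4, 0.3, -0.2])
ticket_1, mask_1 = find_winning_ticket(init, toy_train, rounds=1, prune_per_round=0.5)
ticket_2, mask_2 = find_winning_ticket(init, toy_train, rounds=2, prune_per_round=0.5)
```

The returned ticket pairs the mask with the initial weights, not the trained ones, which reflects the abstract's claim that the initialization is what makes the subnetwork trainable.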
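Introducing sparsity in a neural network is an efficient way to reduce its complexity while keeping its performance almost intact. Most of the time, sparsity is introduced using a three-stage pipeline: 1) train the model to convergence, 2) prune the model according to some criterion, 3) fine-tune the pruned model to recover performance. The last two steps are often performed iteratively, leading to reasonable results but also to a time-consuming and complex process. In our work, we propose to get rid of the first step of the pipeline and to combine the other two steps in a single pruning-training cycle, allowing the model to jointly learn the optimal weights while being pruned. We do this by introducing a novel pruning schedule, named One-Cycle Pruning, which starts pruning at the beginning of training and continues until its very end. Adopting such a schedule not only leads to better-performing pruned models but also drastically reduces the training budget required to prune a model. Experiments are conducted on a variety of architectures (VGG-16 and ResNet-18) and datasets (CIFAR-10, CIFAR-100, and Caltech-101), and for relatively high sparsity values (80%, 90%, and 95% of weights removed). Our results show that, on a fixed training budget, One-Cycle Pruning consistently outperforms commonly used pruning schedules such as One-Shot Pruning, Iterative Pruning, and Automated Gradual Pruning.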
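Network pruning is a widely used technique for effectively compressing deep neural networks with little to no degradation in performance during inference. Iterative Magnitude Pruning (IMP) is one of the most established approaches to network pruning, consisting of several iterative training and pruning steps, in which a significant amount of the network's performance is lost after pruning and then recovered in the subsequent retraining phase. While it is commonly used as a benchmark reference, it is often argued that a) it reaches suboptimal states by not incorporating sparsification into the training phase, b) its global selection criterion fails to properly determine optimal layer-wise pruning rates, and c) its iterative nature makes it slow and non-competitive. In light of recently proposed retraining techniques, we investigate these claims through rigorous and consistent experiments in which we compare IMP to pruning-during-training algorithms, evaluate proposed modifications of its selection criterion, and study the number of iterations and total training time actually required. We find that IMP with SLR for retraining can outperform state-of-the-art pruning-during-training approaches with little or no computational overhead, that the global magnitude selection criterion is largely competitive with more complex approaches, and that only a few retraining epochs are needed in practice to achieve most of the sparsity-vs.-performance trade-off of IMP. Our goals are both to demonstrate that basic IMP can already provide state-of-the-art pruning results, on par with or even outperforming more complex or heavily parameterized approaches, and to establish a more realistic yet easily realizable baseline for future research.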
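It is commonly believed that network pruning not only reduces the computational cost of deep networks but also prevents overfitting by decreasing model capacity. However, our work surprisingly discovers that network pruning sometimes even aggravates overfitting. We report an unexpected sparse double descent phenomenon: as we increase model sparsity via network pruning, test performance first gets worse (due to overfitting), then gets better (due to relieved overfitting), and finally gets worse again (due to forgetting useful information). While recent studies have focused on deep double descent with respect to model overparameterization, they failed to recognize that sparsity may also cause double descent. In this paper, we make three main contributions. First, we report the novel sparse double descent phenomenon through extensive experiments. Second, for this phenomenon, we propose a novel learning distance interpretation: the curve of the $\ell_{2}$ learning distance of sparse models (from initialized parameters to final parameters) may correlate well with the sparse double descent curve and reflect generalization better than minima flatness. Third, in the context of sparse double descent, a winning ticket in the lottery ticket hypothesis surprisingly may not always win.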
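The $\ell_{2}$ learning distance mentioned above, i.e. the distance from initialized parameters to final parameters, can be computed as in this short sketch. The function name `learning_distance` is hypothetical, parameters are assumed to be given as per-layer arrays, and whether pruned coordinates are included in the distance is left to the caller, since the abstract does not specify it.

```python
import numpy as np

def learning_distance(theta_init, theta_final) -> float:
    """l2 norm of (final - initial) over all layers, treated as one vector."""
    diffs = [np.ravel(np.asarray(final) - np.asarray(init))
             for init, final in zip(theta_init, theta_final)]
    return float(np.linalg.norm(np.concatenate(diffs)))

# Two toy "layers" with different shapes.
init_params = [np.zeros(2), np.zeros((1, 2))]
final_params = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
dist = learning_distance(init_params, final_params)
```

Computing this value for models at a range of sparsity levels gives the learning-distance curve that the abstract compares against the test-performance curve.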
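Efficiently approximating local curvature information of the loss function is a key tool for the optimization and compression of deep neural networks. Yet most existing methods for approximating second-order information have high computational or storage costs, which can limit their practicality. In this work, we investigate matrix-free, linear-time approaches for estimating inverse-Hessian vector products (IHVPs) for the case in which the Hessian can be approximated as a sum of rank-one matrices, as in the classic approximation of the Hessian by the empirical Fisher matrix. We propose two new algorithms as part of a framework called M-FAC. The first algorithm is tailored towards network compression: if the Hessian is given as a sum of $m$ rank-one matrices, it can compute the IHVP for dimension $d$ using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian. The second algorithm targets an optimization setting, in which we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction, as required for preconditioned SGD. We give an algorithm with cost $O(dm + m^2)$ for computing the IHVP and $O(dm + m^3)$ for adding or removing any gradient from the sliding window. These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods. Implementations are available at [9] and [17].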
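To make the rank-one setting concrete, the sketch below computes an IHVP for a damped empirical Fisher matrix $F = \lambda I + \frac{1}{m}\sum_i g_i g_i^\top$ without ever forming the $d \times d$ matrix. It uses the textbook Woodbury identity, not the M-FAC algorithms themselves; the function name `ihvp_woodbury` and the damping term `damp` are assumptions introduced for illustration.

```python
import numpy as np

def ihvp_woodbury(grads: np.ndarray, vec: np.ndarray, damp: float) -> np.ndarray:
    """Return F^{-1} vec for F = damp * I + (1/m) * grads.T @ grads.

    grads has shape (m, d): one gradient per row. By the Woodbury identity,
    F^{-1} v = (v - G^T (m * damp * I_m + G G^T)^{-1} G v) / damp.
    """
    m = grads.shape[0]
    gram = grads @ grads.T                        # (m, m) Gram matrix, O(d m^2)
    small = gram + m * damp * np.eye(m)           # only an m x m system is solved
    coeffs = np.linalg.solve(small, grads @ vec)  # O(m^3 + d m)
    return (vec - grads.T @ coeffs) / damp

rng = np.random.default_rng(0)
G = rng.normal(size=(3, 6))   # m = 3 gradients of dimension d = 6
v = rng.normal(size=6)
fast = ihvp_woodbury(G, v, damp=0.1)

# Dense reference, feasible only because d is tiny here.
F = 0.1 * np.eye(6) + G.T @ G / 3
reference = np.linalg.solve(F, v)
```

The dense reference costs $O(d^3)$ and $O(d^2)$ memory, whereas the Woodbury form needs only the $m \times d$ gradient matrix and an $m \times m$ solve, which is what makes the rank-one structure attractive when $m \ll d$.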
Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for multiple network architectures, datasets, and tasks, which imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model, 2) learned "important" weights of the large model are typically not useful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency in the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods. We also compare with the "Lottery Ticket Hypothesis" (Frankle & Carbin, 2019), and find that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization. * Equal contribution. † Work done while visiting UC Berkeley.
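Emerging edge intelligence applications require the server to retrain and update deep neural networks deployed on remote edge nodes in order to leverage newly collected data samples. Unfortunately, it may be impossible in practice to continuously send fully updated weights to these edge nodes because communication resources are highly constrained. In this paper, we propose the weight-wise deep partial updating paradigm, which smartly selects a small subset of weights to update in each server-to-edge communication round while achieving performance similar to that of full updating. Our method is established by analytically upper-bounding the loss difference between partial updating and full updating, and it updates only the weights that make the largest contributions to the upper bound. Extensive experimental results demonstrate the efficacy of our partial updating methodology, which achieves high inference accuracy while updating only a small number of weights.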
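Current deep neural networks (DNNs) are overparameterized and use most of their neuronal connections during inference for each task. The human brain, however, has developed specialized regions for different tasks and performs inference with a small fraction of its neuronal connections. We propose an iterative pruning strategy that introduces a simple importance-score metric to deactivate unimportant connections, tackling overparameterization in DNNs and modulating the firing patterns. The aim is to find the smallest number of connections that is still capable of solving a given task with comparable accuracy, i.e. a simpler subnetwork. We achieve comparable performance for LeNet architectures on MNIST, and significantly higher parameter compression than state-of-the-art algorithms for VGG and ResNet architectures on CIFAR-10/100 and Tiny-ImageNet. Our approach also performs well for the two optimizers considered, Adam and SGD. The algorithm is not designed to minimize FLOPs when considering current hardware and software implementations, although it performs reasonably when compared to the state of the art.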
Pruning large neural networks to create high-quality, independently trainable sparse masks, which can maintain similar performance to their dense counterparts, is very desirable due to the reduced space and time complexity. As research effort is focused on increasingly sophisticated pruning methods that lead to sparse subnetworks trainable from scratch, we argue for an orthogonal, under-explored theme: improving training techniques for pruned sub-networks, i.e. sparse training. Apart from the popular belief that only the quality of sparse masks matters for sparse training, in this paper we demonstrate an alternative opportunity: one can carefully customize the sparse training techniques to deviate from the default dense network training protocols, consisting of introducing "ghost" neurons and skip connections at the early stage of training, and strategically modifying the initialization as well as labels. Our new sparse training recipe is generally applicable to improving training from scratch with various sparse masks. By adopting our newly curated techniques, we demonstrate significant performance gains across various popular datasets (CIFAR-10, CIFAR-100, TinyImageNet), architectures (ResNet-18/32/104, VGG16, MobileNet), and sparse mask options (lottery ticket, SNIP/GRASP, SynFlow, or even random pruning), compared to the default training protocols, especially at high sparsity levels. Code is at https://github.com/VITA-Group/ToST.
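Network pruning is an effective approach to reducing network complexity with an acceptable performance compromise. Existing studies achieve sparsity in neural networks via time-consuming weight tuning or complex search on networks with expanded width, which greatly limits the applications of network pruning. In this paper, we show that high-performing and sparse sub-networks that require no weight tuning, termed "lottery jackpots", exist in pre-trained models with unexpanded width. For example, we obtain a lottery jackpot that has only 10% of the parameters and still reaches the performance of the original dense VGGNet-19 on CIFAR-10, without any modification of the pre-trained weights. Furthermore, we observe that the sparse masks derived from many existing pruning criteria overlap heavily with the searched mask of our lottery jackpot, among which magnitude-based pruning yields the mask most similar to ours. Based on this insight, we initialize our sparse mask using magnitude-based pruning, which reduces the cost of the lottery jackpot search by at least 3x while achieving comparable or even better performance. Specifically, our magnitude-based lottery jackpot removes 90% of the weights in ResNet-50 while easily obtaining more than 70% top-1 accuracy on ImageNet using only 10 search epochs. Our code is available at https://github.com/zyxxmu/lottery-jackpots.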
Pruning large neural networks while maintaining their performance is often desirable due to the reduced space and time complexity. In existing methods, pruning is done within an iterative optimization procedure with either heuristically designed pruning schedules or additional hyperparameters, undermining their utility. In this work, we present a new approach that prunes a given network once at initialization prior to training. To achieve this, we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task. This eliminates the need for both pretraining and the complex pruning schedule while making it robust to architecture variations. After pruning, the sparse network is trained in the standard way. Our method obtains extremely sparse networks with virtually the same accuracy as the reference network on the MNIST, CIFAR-10, and Tiny-ImageNet classification tasks and is broadly applicable to various architectures including convolutional, residual and recurrent networks. Unlike existing methods, our approach enables us to demonstrate that the retained connections are indeed relevant to the given task.
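A saliency criterion based on connection sensitivity can be sketched as follows. The sketch assumes the commonly cited form in which each connection is scored by the magnitude of weight times loss gradient, normalized over all connections, and the top-scoring fraction is kept. The function name `snip_mask`, the parameter `keep_fraction`, and the use of precomputed gradients from a single mini-batch are assumptions for illustration.

```python
import numpy as np

def snip_mask(weights: np.ndarray, grads: np.ndarray, keep_fraction: float):
    """Score each connection by |w * dL/dw| (normalized to sum to 1) and keep
    the `keep_fraction` highest-scoring connections."""
    scores = np.abs(weights * grads)
    total = scores.sum()
    saliency = scores / total if total > 0 else scores
    n_keep = int(keep_fraction * weights.size)
    # Flattened indices sorted by descending saliency.
    order = np.argsort(-saliency, axis=None)
    mask = np.zeros(weights.size, dtype=bool)
    mask[order[:n_keep]] = True
    return saliency, mask.reshape(weights.shape)

# Toy weights at initialization and gradients from one (hypothetical) mini-batch.
w0 = np.array([1.0, -2.0, 0.5, 3.0])
g0 = np.array([0.1, 0.5, -0.2, 0.0])
saliency, keep = snip_mask(w0, g0, keep_fraction=0.25)
```

Note that the largest weight (3.0) is not kept here because its gradient is zero: the score depends on how much the loss reacts to a connection, not on weight magnitude alone. After the mask is computed once at initialization, the sparse network would be trained in the standard way.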
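Deep neural networks (DNNs) are effective in solving many real-world problems. Larger DNN models usually exhibit better quality (e.g., accuracy), but their excessive computation results in long inference times. Model sparsification can reduce computation and memory costs while maintaining model quality. Most existing sparsification algorithms remove weights unidirectionally, while others randomly or greedily explore a small subset of weights in each layer for pruning. The limitations of these algorithms reduce the level of achievable sparsity. In addition, many algorithms still require pre-trained dense models and thus suffer from a large memory footprint. In this paper, we propose a novel scheduled grow-and-prune (GaP) methodology that does not require pre-training a dense model. It addresses the shortcomings of previous works by repeatedly growing a subset of layers to dense and then pruning them back to sparse after some training. Experiments show that models pruned using the proposed method match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, object detection, 3D object segmentation, and translation. They also outperform other state-of-the-art (SOTA) methods for model sparsification. As an example, a 90% non-uniform sparse ResNet-50 model obtained via GaP achieves 77.9% top-1 accuracy on ImageNet, improving the previous SOTA result by 1.5%. All code will be publicly released.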
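Network sparsity is popular mainly because of its ability to reduce network complexity. Extensive studies have explored gradient-driven sparsity. Typically, these methods are built on the premise of weight independence, which, however, contradicts the fact that weights influence one another. Thus, their performance remains to be improved. In this paper, we propose to further optimize gradient-driven sparsity (OptG) by solving this independence paradox. Our motivation comes from recent advances in supermask training, which show that sparse subnetworks can be located in a randomly initialized network by simply updating mask values without modifying any weights. We prove that supermask training accumulates the weight gradients and can partly solve the independence paradox. Consequently, OptG integrates supermask training into gradient-driven sparsity, and a specialized mask optimizer is designed to solve the independence paradox. Experiments show that OptG can well surpass many existing state-of-the-art competitors. Our code is available at https://github.com/zyxxmu/optg.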