Over-parameterization of deep neural networks (DNNs) has shown high prediction accuracy for many applications. Although effective, the large number of parameters hinders adoption on resource-limited devices and has an outsize environmental impact. Sparse training (using a fixed number of nonzero weights in each iteration) could significantly mitigate the training costs by reducing the model size. However, existing sparse training methods mainly use either random-based or greedy-based drop-and-grow strategies, resulting in local minima and low accuracy. In this work, to assist explainable sparse training, we propose important weights Exploitation and coverage Exploration to characterize Dynamic Sparse Training (DST-EE), and provide quantitative analysis of these two metrics. We further design an acquisition function, provide theoretical guarantees for the proposed method, and clarify its convergence property. Experimental results show that sparse models (up to 98% sparsity) obtained by our proposed method outperform the SOTA sparse training methods on a wide variety of deep learning tasks. On VGG-19 / CIFAR-100, ResNet-50 / CIFAR-10, and ResNet-50 / CIFAR-100, our method has even higher accuracy than dense models. On ResNet-50 / ImageNet, the proposed method has up to 8.2% accuracy improvement compared to SOTA sparse training methods.
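Deep neural networks (DNNs) are effective in solving many real-world problems. Larger DNN models usually exhibit better quality (e.g., accuracy), but their excessive computation results in long inference time. Model sparsification can reduce the computation and memory cost while maintaining model quality. Most existing sparsification algorithms remove weights unidirectionally, while others randomly or greedily explore a small subset of weights in each layer for pruning. The limitations of these algorithms reduce the level of achievable sparsity. In addition, many algorithms still require pre-trained dense models and thus suffer from a large memory footprint. In this paper, we propose a novel scheduled grow-and-prune (GaP) methodology that does not require pre-training a dense model. It addresses the shortcomings of previous works by repeatedly growing a subset of layers to dense and then pruning them back to sparse after some training. Experiments show that models pruned using the proposed method match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, object detection, 3D object segmentation, and translation. They also outperform other state-of-the-art (SOTA) methods for model sparsification. As an example, a 90% non-uniform sparse ResNet-50 model obtained via GaP achieves 77.9% top-1 accuracy on ImageNet, improving the previous SOTA result by 1.5%. All code will be publicly released.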
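As real-world graphs grow in size, larger GNN models with billions of parameters are being deployed. The high parameter count of such models makes training and inference on graphs expensive and challenging. To reduce the computational and memory costs of GNNs, optimization methods such as pruning redundant nodes and edges in the input graph have been commonly adopted. However, model compression, which directly targets the sparsification of model layers, has mostly been limited to traditional deep neural networks (DNNs) used for tasks such as image classification and object detection. In this paper, we apply two state-of-the-art model compression methods, (1) train and prune and (2) sparse training, to the sparsification of weight layers in GNNs. We evaluate and compare the efficiency of both methods in terms of accuracy, training sparsity, and training FLOPs on real-world graphs. Our experimental results show that on the IA-Email, Wiki-Talk, and Stackoverflow datasets for link prediction, sparse training achieves accuracy comparable to the train-and-prune method with much lower training FLOPs. On the Brain dataset for node classification, sparse training uses fewer FLOPs (less than 1/7 of those of the train-and-prune method) and preserves much better accuracy under extreme model sparsity.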
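Recent works on sparse neural networks have demonstrated that it is possible to train a sparse subnetwork independently from scratch to match the performance of its corresponding dense network. However, identifying such sparse subnetworks (winning tickets) involves either a costly iterative train-prune-retrain process (e.g., the Lottery Ticket Hypothesis) or an over-extended training time (e.g., Dynamic Sparse Training). In this work, we draw a unique connection between sparse neural network training and the deep ensembling technique, yielding a novel ensemble learning framework called FreeTickets. Instead of starting from a dense network, FreeTickets randomly initializes a sparse subnetwork and then trains the subnetwork while dynamically adjusting its sparse mask, resulting in many diverse sparse subnetworks throughout the training process. FreeTickets is defined as the ensemble of these sparse subnetworks, freely obtained during this one-pass, sparse-to-sparse training, which uses only a fraction of the computational resources required by vanilla dense training. Moreover, despite being an ensemble of models, FreeTickets has fewer parameters and training FLOPs than a single dense model: this seemingly counter-intuitive outcome is due to the high sparsity of each subnetwork. Compared to standard dense baselines, FreeTickets is observed to deliver a significant all-round improvement in prediction accuracy, uncertainty estimation, robustness, and efficiency. On ImageNet, FreeTickets easily outperforms the naive deep ensemble using only a quarter of the training FLOPs required by the latter. Our results provide insights into the strength of sparse neural networks and suggest that the benefits of sparsity go beyond the usually expected inference efficiency.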
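Recent works on sparse neural network training (sparse training) have shown that a compelling trade-off between performance and efficiency can be achieved by training intrinsically sparse neural networks from scratch. Existing sparse training methods usually strive to find the best possible sparse subnetwork in a single run, without involving any expensive dense or pre-training steps. For instance, dynamic sparse training (DST), one of the most prominent directions, can reach performance competitive with dense training by iteratively evolving the sparse topology during the course of training. In this paper, we argue that it is better to allocate the limited resources to create multiple low-loss sparse subnetworks and superpose them into a stronger one, instead of allocating all resources to finding a single subnetwork. To achieve this, two desiderata are required: (1) efficiently producing many low-loss subnetworks, the so-called cheap tickets, within one training process limited to the standard training time used in dense training; (2) effectively superposing these cheap tickets into one stronger subnetwork without exceeding the constrained parameter budget. To corroborate our conjecture, we present a novel sparse training approach, termed Sup-tickets, which can satisfy the above two desiderata concurrently in a single sparse-to-sparse training process. Across various modern architectures on CIFAR-10/100 and ImageNet, we show that Sup-tickets integrates seamlessly with existing sparse training methods and demonstrates consistent performance improvement.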
Many applications require sparse neural networks due to space or inference time restrictions. There is a large body of work on training dense networks to yield sparse networks for inference, but this limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. Our method updates the topology of the sparse network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. We demonstrate state-of-the-art sparse training results on a variety of networks and datasets, including ResNet-50, MobileNets on Imagenet-2012, and RNNs on WikiText-103. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static.
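The topology update mentioned in this abstract (parameter magnitudes for dropping, gradients for growing) might be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `drop_and_grow`, the `fraction` parameter, and the choice to initialize regrown connections to zero are all hypothetical choices made for the example.

```python
import numpy as np

def drop_and_grow(weights, mask, grads, fraction=0.3):
    """One hypothetical topology update that keeps the number of active weights fixed.

    Drops the `fraction` of active connections with the smallest |weight| and
    grows the same number of inactive connections with the largest |gradient|.
    """
    w = weights.ravel().copy()
    m = mask.ravel().astype(bool)  # astype returns a copy, so `mask` is untouched
    g = grads.ravel()

    active = np.flatnonzero(m)
    inactive = np.flatnonzero(~m)
    k = min(int(fraction * active.size), inactive.size)

    if k > 0:
        # Drop: the k active connections with the smallest magnitude.
        drop = active[np.argsort(np.abs(w[active]))[:k]]
        # Grow: the k previously inactive connections with the largest gradient magnitude.
        grow = inactive[np.argsort(-np.abs(g[inactive]))[:k]]
        m[drop] = False
        m[grow] = True
        w[grow] = 0.0  # assumption: regrown connections start at zero

    w[~m] = 0.0  # inactive connections carry no weight
    return w.reshape(weights.shape), m.reshape(mask.shape)
```

In a training loop, such an update would be called only every few hundred steps, which is one way to read "infrequent gradient calculations": dense gradients are needed only at the update steps.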
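Since sparse neural networks usually contain many zero weights, these unnecessary network connections can potentially be eliminated without degrading network performance. Well-designed sparse neural networks therefore have the potential to significantly reduce FLOPs and computational resources. In this work, we propose a new automatic pruning method, Sparse Connectivity Learning (SCL). Specifically, a weight is re-parameterized as the element-wise multiplication of a trainable weight variable and a binary mask. Network connectivity is thus fully described by the binary mask, which is modulated by a unit step function. We theoretically prove the fundamental principle of using a straight-through estimator (STE) for network pruning. This principle is that the proxy gradients of the STE should be positive, ensuring that mask variables converge at their minima. After finding that Leaky ReLU, Softplus, and Identity STEs satisfy this principle, we propose adopting the Identity STE in SCL for discrete mask relaxation. We find that the mask gradients of different features are very unbalanced; hence, we propose normalizing the mask gradients of each feature to optimize mask variable training. To train sparse masks automatically, we include the total number of network connections as a regularization term in our objective function. Because SCL does not require pruning criteria or hyper-parameters defined by designers for network layers, the network is explored in a larger hypothesis space to achieve optimized sparse connectivity for the best performance. SCL overcomes the limitations of existing automatic pruning methods. Experimental results demonstrate that SCL can automatically learn and select important network connections for various baseline network structures. Deep learning models trained by SCL outperform SOTA human-designed and automatic pruning methods in sparsity, accuracy, and FLOPs reduction.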
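The re-parameterization in the SCL abstract (weight times a unit-step mask, trained with an identity STE and a connection-count regularizer) can be illustrated with hand-written forward and backward passes. This is a rough sketch under assumptions, not the paper's code: the function names, the convention that the step is 1 only for strictly positive mask variables, and the regularization strength `lam` are hypothetical.

```python
import numpy as np

def masked_forward(w, m):
    """Effective weight = w * step(m), with step(m) = 1 where m > 0 (assumed convention)."""
    gate = (m > 0).astype(w.dtype)
    return w * gate, gate

def masked_backward(grad_eff, w, gate, lam=0.0):
    """Gradients for w and m, given the gradient w.r.t. the effective weight.

    The step function has zero gradient almost everywhere, so an identity
    straight-through estimator treats d step(m) / dm as 1.
    `lam` weights a regularizer on the number of active connections,
    sum(step(m)), whose STE gradient w.r.t. each m is simply `lam`.
    """
    grad_w = grad_eff * gate       # exact: d(w * gate) / dw = gate
    grad_m = grad_eff * w + lam    # identity STE plus connection-count penalty
    return grad_w, grad_m
```

With a positive `lam`, gradient descent pushes every mask variable downward unless the task loss pushes back, which is how a sparsity level could emerge without a hand-set pruning ratio.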
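Network pruning is an effective approach to reducing network complexity with an acceptable performance compromise. Existing studies achieve the sparsity of neural networks via time-consuming weight tuning or complex search on networks with expanded width, which greatly limits the applications of network pruning. In this paper, we show that high-performing and sparse sub-networks that require no weight tuning, termed "lottery jackpots", exist in pre-trained models with unexpanded width. For example, we obtain a lottery jackpot that has only 10% of the parameters and still reaches the performance of the original dense VGGNet-19 on CIFAR-10, without any modification of the pre-trained weights. Furthermore, we observe that the sparse masks derived from many existing pruning criteria have a high overlap with the searched mask of our lottery jackpot, among which magnitude-based pruning results in the mask most similar to ours. Based on this insight, we initialize our sparse mask using magnitude-based pruning, resulting in at least a 3x cost reduction for the lottery jackpot search while achieving comparable or even better performance. Specifically, our magnitude-based lottery jackpot removes 90% of the weights in ResNet-50 while easily obtaining more than 70% top-1 accuracy on ImageNet using only 10 search epochs. Our code is available at https://github.com/zyxxmu/lottery-jackpots.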
Pruning large neural networks to create high-quality, independently trainable sparse masks, which can maintain similar performance to their dense counterparts, is very desirable due to the reduced space and time complexity. As research effort is focused on increasingly sophisticated pruning methods that lead to sparse subnetworks trainable from scratch, we argue for an orthogonal, under-explored theme: improving training techniques for pruned sub-networks, i.e. sparse training. Contrary to the popular belief that only the quality of sparse masks matters for sparse training, in this paper we demonstrate an alternative opportunity: one can carefully customize the sparse training techniques to deviate from the default dense network training protocols, by introducing "ghost" neurons and skip connections at the early stage of training, and by strategically modifying the initialization as well as the labels. Our new sparse training recipe is generally applicable to improving training from scratch with various sparse masks. By adopting our newly curated techniques, we demonstrate significant performance gains across various popular datasets (CIFAR-10, CIFAR-100, TinyImageNet), architectures (ResNet-18/32/104, Vgg16, MobileNet), and sparse mask options (lottery ticket, SNIP/GRASP, SynFlow, or even random pruning), compared to the default training protocols, especially at high sparsity levels. Code is at https://github.com/VITA-Group/ToST.
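Lottery tickets (LTs) are able to discover accurate and sparse subnetworks that can be trained in isolation to match the performance of dense networks. Ensembling, in parallel, is one of the oldest time-proven tricks in machine learning for improving performance by combining the outputs of multiple independent models. However, the benefits of ensembling in the context of LTs are diluted, since an ensemble does not directly lead to stronger sparse subnetworks but instead leverages their predictions to make a better decision. In this work, we first observe that directly averaging the weights of adjacent learned subnetworks significantly boosts the performance of LTs. Encouraged by this observation, we further propose an alternative way to perform an "ensemble" over the subnetworks identified by iterative magnitude pruning, via a simple interpolating strategy. We call our method Lottery Pools. In contrast to the naive ensemble, which brings no performance gain to any single subnetwork, Lottery Pools yields much stronger sparse subnetworks than the original LTs without requiring any extra training or inference cost. Across various modern architectures on CIFAR-10/100 and ImageNet, we show that our method achieves significant performance gains in both in-distribution and out-of-distribution scenarios. Impressively, evaluated with VGG-16 and ResNet-18, the produced sparse subnetworks outperform the original LTs by up to 1.88% on CIFAR-100 and 2.36% on CIFAR-100-C; the resulting dense network surpasses the pre-trained dense model by up to 2.22% on CIFAR-100 and also improves on it on CIFAR-100-C.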
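The idea of interpolating the weights of adjacent subnetworks, as described in the Lottery Pools abstract, could look roughly like the sketch below. The abstract does not spell out the procedure, so everything here is an assumption for illustration: the function name, the single coefficient `alpha`, and in particular the final magnitude-pruning step, which is added so that the result keeps the target sparsity.

```python
import numpy as np

def interpolate_and_prune(w_a, w_b, alpha, sparsity):
    """Blend two weight tensors, then keep only the largest-magnitude entries.

    alpha    -- interpolation coefficient in [0, 1] (hypothetical parameter)
    sparsity -- fraction of entries that must be zero in the result
    """
    blended = alpha * w_a + (1.0 - alpha) * w_b
    n_keep = int(round((1.0 - sparsity) * blended.size))

    mask = np.zeros(blended.size, dtype=bool)
    if n_keep > 0:
        # Indices of the n_keep entries with the largest magnitude.
        keep = np.argsort(-np.abs(blended).ravel(), kind="stable")[:n_keep]
        mask[keep] = True
    mask = mask.reshape(blended.shape)
    return blended * mask, mask
```

Because the output has the same shape and sparsity as a single ticket, inference cost stays unchanged, unlike a prediction-level ensemble that must run every member.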
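The lottery ticket hypothesis (LTH) has shown that dense models contain highly sparse subnetworks (i.e., winning tickets) that can be trained in isolation to match full accuracy. Despite many exciting efforts, one piece of "common sense" is rarely challenged: a winning ticket is found by iterative magnitude pruning (IMP), and hence the resulting pruned subnetworks have only unstructured sparsity. This gap limits the appeal of winning tickets in practice, since highly irregular sparse patterns are challenging to accelerate on hardware. Meanwhile, directly substituting structured pruning for unstructured pruning in IMP damages performance more severely and usually fails to locate winning tickets. In this paper, we demonstrate the first positive result that a structurally sparse winning ticket can be effectively found in general. The core idea is to append "post-processing techniques" after each round of (unstructured) IMP to enforce the formation of structural sparsity. Specifically, we first "re-fill" pruned elements back into some channels deemed important, and then "re-group" non-zero elements to create flexible group-wise structural patterns. Both our identified channel-wise and group-wise structural subnetworks win the lottery, with substantial inference speedups readily supported by existing hardware. Extensive experiments, conducted on diverse datasets across multiple network backbones, consistently validate our proposal, showing that the hardware acceleration roadblock of LTH is now removed. Specifically, the structural winning tickets obtain up to {64.93%, 64.84%, 60.23%} running time savings at {36%~80%, 74%, 58%} sparsity on {CIFAR, Tiny-ImageNet, ImageNet}, while maintaining comparable accuracy. Code is at https://github.com/vita-group/structure-lth.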
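Recently, sparse training has emerged as a promising paradigm for efficient deep learning on edge devices. Current research mainly devotes its efforts to reducing training costs by further increasing model sparsity. However, increasing sparsity is not always ideal, since it inevitably introduces severe accuracy degradation at extremely high sparsity levels. This paper explores other possible directions for effectively and efficiently reducing sparse training costs while preserving accuracy. To this end, we investigate two techniques, namely layer freezing and data sieving. First, the layer freezing approach has succeeded in dense model training and fine-tuning, yet it has never been adopted in the sparse training domain. Moreover, the unique characteristics of sparse training may hinder the incorporation of layer freezing techniques. We therefore analyze the feasibility and potential of using layer freezing in sparse training and find that it can save considerable training costs. Second, we propose a data sieving method for dataset-efficient training, which further reduces training costs by ensuring that only part of the dataset is used throughout the entire training process. We show that both techniques can be well incorporated into the sparse training algorithm to form a generic framework, which we dub SpFDE. Our extensive experiments demonstrate that SpFDE can significantly reduce training costs while preserving accuracy along three dimensions: weight sparsity, layer freezing, and dataset sieving.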
Neural network pruning has been a well-established compression technique to enable deep learning models on resource-constrained devices. The pruned model is usually specialized to meet specific hardware platforms and training tasks (defined as deployment scenarios). However, existing pruning approaches rely heavily on training data to trade off model size, efficiency, and accuracy, which becomes ineffective for federated learning (FL) over distributed and confidential datasets. Moreover, the memory- and compute-intensive pruning process of most existing approaches cannot be handled by most FL devices with resource limitations. In this paper, we develop FedTiny, a novel distributed pruning framework for FL, to obtain specialized tiny models for memory- and computing-constrained participating devices with confidential local data. To alleviate biased pruning due to unseen heterogeneous data over devices, FedTiny introduces an adaptive batch normalization (BN) selection module to adaptively obtain an initially pruned model to fit deployment scenarios. Besides, to further improve the initial pruning, FedTiny develops a lightweight progressive pruning module for local finer pruning under tight memory and computational budgets, where the pruning policy for each layer is gradually determined rather than evaluating the overall deep model structure. Extensive experimental results demonstrate the effectiveness of FedTiny, which outperforms state-of-the-art baseline approaches, especially when compressing deep models to extremely sparse tiny models.
Pruning refers to the elimination of trivial weights from neural networks. The sub-networks within an overparameterized model produced after pruning are often called lottery tickets. This research aims to generate winning lottery tickets from a set of lottery tickets that can achieve accuracy similar to that of the original unpruned network. We introduce a novel winning ticket called Cyclic Overlapping Lottery Ticket (COLT), obtained by data splitting and cyclic retraining of the pruned network from scratch. We apply a cyclic pruning algorithm that keeps only the overlapping weights of different pruned models trained on different data segments. Our results demonstrate that COLT can achieve accuracies similar to those of the unpruned model while maintaining high sparsities. We show that the accuracy of COLT is on par with the winning tickets of the Lottery Ticket Hypothesis (LTH) and, at times, is better. Moreover, COLTs can be generated using fewer iterations than tickets generated by the popular Iterative Magnitude Pruning (IMP) method. In addition, we notice that COLTs generated on large datasets can be transferred to small ones without compromising performance, demonstrating their generalizing capability. We conduct all our experiments on the Cifar-10, Cifar-100 & TinyImageNet datasets and report performance superior to the state-of-the-art methods.
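People usually believe that network pruning not only reduces the computational cost of deep networks but also prevents overfitting by decreasing model capacity. However, our work surprisingly discovers that network pruning sometimes even aggravates overfitting. We report an unexpected sparse double descent phenomenon: as we increase model sparsity via network pruning, test performance first gets worse (due to overfitting), then gets better (due to relieved overfitting), and finally gets worse again (due to forgetting useful information). While recent studies have focused on double descent with respect to model over-parameterization, they failed to recognize that sparsity may also cause double descent. In this paper, we make three main contributions. First, we report the novel sparse double descent phenomenon through extensive experiments. Second, for this phenomenon, we propose a novel learning distance interpretation: the curve of the $\ell_{2}$ learning distance of sparse models (from initialized parameters to final parameters) may correlate well with the sparse double descent curve and reflect generalization better than minima flatness. Third, in the context of sparse double descent, a winning ticket in the lottery ticket hypothesis surprisingly may not always win.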
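The learning distance named in this abstract is an $\ell_{2}$ distance between initial and final parameters. A minimal way to compute such a quantity, assuming parameters are stored as dictionaries of NumPy arrays keyed by layer name (a hypothetical layout chosen for this example), is:

```python
import numpy as np

def learning_distance(init_params, final_params):
    """L2 distance between initial and final parameters, summed over all layers."""
    squared = 0.0
    for name, w0 in init_params.items():
        diff = final_params[name] - w0
        squared += float(np.sum(diff ** 2))
    return squared ** 0.5
```

Plotting this value against sparsity, next to test accuracy, is one way the correlation described in the abstract could be inspected; whether the paper restricts the distance to unpruned weights is not stated here.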
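Network pruning is a widely used technique for effectively compressing deep neural networks with little to no degradation in performance during inference. Iterative Magnitude Pruning (IMP) is one of the most established approaches to network pruning; it consists of several iterative training and pruning steps, where a significant amount of the network's performance is lost after pruning and then recovered in the subsequent retraining phase. While IMP is commonly used as a benchmark reference, it is often argued that a) it reaches suboptimal states by not incorporating sparsification into the training phase, b) its global selection criterion fails to properly determine optimal layer-wise pruning rates, and c) its iterative nature makes it slow and non-competitive. In light of recently proposed retraining techniques, we investigate these claims through rigorous and consistent experiments in which we compare IMP to pruning-during-training algorithms, evaluate proposed modifications of its selection criterion, and study the number of iterations and total training time actually required. We find that IMP with SLR for retraining can outperform state-of-the-art pruning-during-training approaches with little or no computational overhead, that the global magnitude selection criterion is largely competitive with more complex approaches, and that only a few retraining epochs are needed in practice to achieve most of the sparsity-vs.-performance tradeoff of IMP. Our goals are both to demonstrate that basic IMP can already provide state-of-the-art pruning results, on par with or even outperforming more complex or heavily parameterized approaches, and to establish a more realistic yet easily realizable baseline for future research.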
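The global magnitude selection criterion discussed in this abstract ranks all weights of the network together rather than layer by layer. A small sketch of that selection step, assuming layers are given as a dictionary of NumPy arrays (the function name and data layout are hypothetical), could be:

```python
import numpy as np

def global_magnitude_masks(layers, sparsity):
    """Boolean keep-masks that remove the globally smallest-magnitude weights.

    A single threshold is computed over all layers, so per-layer pruning
    rates fall out of the weight distribution instead of being set by hand.
    Ties at the threshold are pruned too, so slightly more than the
    requested fraction can be removed.
    """
    all_mags = np.concatenate([np.abs(w).ravel() for w in layers.values()])
    n_prune = int(sparsity * all_mags.size)
    if n_prune == 0:
        return {name: np.ones(w.shape, dtype=bool) for name, w in layers.items()}
    # Magnitude of the n_prune-th smallest weight across the whole network.
    threshold = np.partition(all_mags, n_prune - 1)[n_prune - 1]
    return {name: np.abs(w) > threshold for name, w in layers.items()}
```

In an IMP loop this selection would alternate with retraining: train, compute masks at a higher sparsity, zero the pruned weights, retrain, and repeat until the target sparsity is reached.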
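Network sparsity is popular mainly because of its ability to reduce network complexity. Extensive studies have explored gradient-driven sparsity. Typically, these methods are built on the premise of weight independence, which is contrary to the fact that weights influence one another. Their performance therefore remains to be improved. In this paper, we propose to further optimize gradient-driven sparsity (OptG) by solving this independence paradox. Our motivation comes from recent advances in supermask training, which show that high-performing sparse subnetworks can be located within a randomly initialized network by simply updating mask values, without modifying any weights. We prove that supermask training accumulates weight gradients and can partly solve the independence paradox. Consequently, OptG integrates supermask training into gradient-driven sparsity, and a specialized mask optimizer is designed to solve the independence paradox. Experiments show that OptG can surpass many existing state-of-the-art competitors. Our code is available at https://github.com/zyxxmu/optg.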
Low-rankness plays an important role in traditional machine learning, but is not so popular in deep learning. Most previous low-rank network compression methods compress networks by approximating pre-trained models and re-training. However, the optimal solution in the Euclidean space may be quite different from the one on the low-rank manifold. A well-pre-trained model is not a good initialization for a model with low-rank constraints. Thus, the performance of a low-rank compressed network degrades significantly. Compared to other network compression methods such as pruning, low-rank methods have attracted less attention in recent years. In this paper, we devise a new training method, low-rank projection with energy transfer (LRPET), that trains low-rank compressed networks from scratch and achieves competitive performance. First, we propose to alternately perform stochastic gradient descent training and projection onto the low-rank manifold. Compared to re-training on the compact model, this enables full utilization of model capacity, since the solution space is relaxed back to Euclidean space after projection. Second, the reduction in matrix energy (the sum of squares of singular values) caused by projection is compensated by energy transfer. We uniformly transfer the energy of the pruned singular values to the remaining ones. We theoretically show that energy transfer eases the trend of gradient vanishing caused by projection. Third, we propose batch normalization (BN) rectification to cut off its effect on the optimal low-rank approximation of the weight matrix, which further improves the performance. Comprehensive experiments on CIFAR-10 and ImageNet show that our method is superior to other low-rank compression methods and also outperforms recent state-of-the-art pruning methods. Our code is available at https://github.com/BZQLin/LRPET.
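The projection-plus-energy-transfer step described in this abstract can be illustrated with a truncated SVD. The sketch below is one possible reading, not the LRPET code: it assumes that "uniformly" means rescaling all kept singular values by a common factor so that the sum of squared singular values is unchanged, and the function name is hypothetical.

```python
import numpy as np

def low_rank_project_with_energy_transfer(W, rank):
    """Project W onto rank-`rank` matrices, then restore the lost matrix energy.

    Matrix energy here is the sum of squared singular values, which equals
    the squared Frobenius norm of W.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    total_energy = np.sum(s ** 2)

    kept = s[:rank]                      # singular values that survive projection
    kept_energy = np.sum(kept ** 2)
    if kept_energy > 0:
        # Assumed transfer rule: scale every kept singular value by the same factor.
        kept = kept * np.sqrt(total_energy / kept_energy)

    return (U[:, :rank] * kept) @ Vt[:rank]
```

In the alternating scheme the abstract describes, a step like this would be applied to each weight matrix periodically between ordinary SGD updates, so the weights are free to leave the low-rank manifold in between projections.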
Neural network pruning, the task of reducing the size of a network by removing parameters, has been the subject of a great deal of work in recent years. We provide a meta-analysis of the literature, including an overview of approaches to pruning and consistent findings in the literature. After aggregating results across 81 papers and pruning hundreds of models in controlled conditions, our clearest finding is that the community suffers from a lack of standardized benchmarks and metrics. This deficiency is substantial enough that it is hard to compare pruning techniques to one another or determine how much progress the field has made over the past three decades. To address this situation, we identify issues with current practices, suggest concrete remedies, and introduce ShrinkBench, an open-source framework to facilitate standardized evaluations of pruning methods. We use ShrinkBench to compare various pruning techniques and show that its comprehensive evaluation can prevent common pitfalls when comparing pruning methods.
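Pruning, the task of sparsifying deep neural networks, has received increasing attention recently. Although state-of-the-art pruning methods extract highly sparse models, they neglect two main challenges: (1) the process of finding these sparse models is often very expensive; (2) unstructured pruning does not provide benefits in terms of GPU memory, training time, or carbon emissions. We propose Early Compression via Gradient Flow Preservation (EarlyCroP), which efficiently extracts state-of-the-art sparse models early in training, addressing challenge (1), and can be applied in a structured manner, addressing challenge (2). This enables us to train sparse networks on commodity GPUs whose dense versions would be too large, thereby saving costs and reducing hardware requirements. We empirically show that EarlyCroP outperforms a rich set of baselines on many tasks (including classification and regression) and domains (including computer vision, natural language processing, and reinforcement learning). EarlyCroP leads to accuracy comparable to dense training while outperforming pruning baselines.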