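Large, high-performing neural networks are often over-parameterized, and pruning can drastically reduce their size and complexity. Pruning is a family of methods that seek to remove redundant or unnecessary weights, or groups of weights, from a network. These techniques make it possible to build lightweight networks, which are particularly important for embedded and mobile applications. In this paper, we devise an alternative pruning method that extracts effective subnetworks from larger untrained ones. Our method is stochastic and extracts subnetworks by exploring different topologies sampled with Gumbel-Softmax. The latter is also used to train probability distributions that measure the relevance of the weights in the sampled topologies. The resulting subnetworks are further enhanced with a highly efficient rescaling mechanism that reduces training time and improves performance. Extensive experiments conducted on CIFAR show that our subnetwork extraction method outperforms related work.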
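The sampling step described above can be illustrated with a minimal sketch. It assumes one pair of (keep, prune) logits per weight and a temperature `tau`; the function names, the 0.5 cut-off used to binarize the relaxed sample, and the toy logits are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def gumbel_noise(rng):
    # g = -log(-log(u)) with u ~ Uniform(0, 1), clamped away from 0 and 1.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_keep_prob(logit_keep, logit_prune, tau, rng):
    # Relaxed (differentiable) sample of the "keep" probability for one weight.
    a = (logit_keep + gumbel_noise(rng)) / tau
    b = (logit_prune + gumbel_noise(rng)) / tau
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)


def sample_mask(logits, tau=0.5, seed=0):
    # logits: list of (keep, prune) pairs, one pair per weight (assumed layout).
    rng = random.Random(seed)
    soft = [gumbel_softmax_keep_prob(k, p, tau, rng) for k, p in logits]
    hard = [1 if s > 0.5 else 0 for s in soft]  # binarized topology
    return soft, hard


logits = [(4.0, -4.0), (0.0, 0.0), (-4.0, 4.0)]
soft, hard = sample_mask(logits)
```

Each call with a different seed yields a different candidate topology; lowering `tau` pushes the relaxed samples closer to 0 or 1.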
Translated by Google Translate
In this paper, we design lightweight graph convolutional networks (GCNs) using a particular class of regularizers, dubbed phase-field models (PFMs). PFMs exhibit a bi-phase behavior using a particular ultra-local term that allows training both the topology and the weight parameters of GCNs as part of a single "end-to-end" optimization problem. Our proposed solution also relies on a reparametrization that pushes the mask of the topology towards binary values, leading to effective topology selection and high generalization while implementing any targeted pruning rate. Both masks and weights share the same set of latent variables, and this further enhances the generalization power of the resulting lightweight GCNs. Extensive experiments conducted on the challenging task of skeleton-based recognition show that PFMs outperform other staple regularizers as well as related lightweight design methods.
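Modern deep neural networks are often too large to use in many practical scenarios. Neural network pruning is an important technique for reducing the size of such models and accelerating inference. Gibbs pruning is a novel framework for expressing and designing neural network pruning methods. Combining approaches from statistical physics and stochastic regularization, it can train and prune a network simultaneously, in such a way that the learned weights and the pruning mask are well adapted to each other. It can be used for structured or unstructured pruning, and we propose a number of specific methods for each. We compare our proposed methods with a number of contemporary neural network pruning methods and find that Gibbs pruning outperforms them. In particular, we achieve a new state-of-the-art result for pruning ResNet-56 on the CIFAR-10 dataset.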
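Since sparse neural networks usually contain many zero weights, these unnecessary network connections can potentially be eliminated without degrading network performance. Well-designed sparse neural networks therefore have the potential to significantly reduce FLOPs and computational resources. In this work, we propose a new automatic pruning method, Sparse Connectivity Learning (SCL). Specifically, a weight is re-parameterized as the element-wise multiplication of a trainable weight variable and a binary mask. Network connectivity is thus fully described by the binary mask, which is modulated by a unit step function. We theoretically prove the fundamental principle for using a straight-through estimator (STE) in network pruning: the proxy gradients of the STE should be positive, ensuring that the mask variables converge at their minima. After finding that the Leaky ReLU, Softplus, and Identity STEs satisfy this principle, we propose to adopt the Identity STE in SCL for discrete mask relaxation. We find that the mask gradients of different features are very unbalanced, so we propose to normalize the mask gradients of each feature to optimize mask variable training. To train sparse masks automatically, we include the total number of network connections as a regularization term in our objective function. Because SCL does not require pruning criteria or hyper-parameters defined by designers for individual network layers, the network is explored in a larger hypothesis space to achieve optimized sparse connectivity for the best performance. SCL overcomes the limitations of existing automatic pruning methods. Experimental results demonstrate that SCL can automatically learn and select important network connections for various baseline network structures. Deep learning models trained with SCL outperform state-of-the-art human-designed and automatic pruning methods in sparsity, accuracy, and FLOPs reduction.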
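The re-parameterization described above can be sketched for a single scalar weight. This is a minimal illustration, assuming a squared-error loss and a step function that returns 1 for non-negative mask variables; the function names and the toy loss are assumptions, and the gradient normalization and connection-count regularizer mentioned in the abstract are left out.

```python
def unit_step(m):
    # Binary mask derived from the continuous mask variable m.
    return 1.0 if m >= 0.0 else 0.0


def forward_backward(w, m, x, y):
    # Forward pass: effective weight = trainable weight * binary mask.
    mask = unit_step(m)
    w_eff = w * mask
    err = w_eff * x - y
    loss = 0.5 * err * err

    # Backward pass.
    grad_w_eff = err * x
    grad_w = grad_w_eff * mask
    # Identity STE: the derivative of the step function w.r.t. m is
    # replaced by 1, so the mask variable still receives a gradient.
    grad_m = grad_w_eff * w * 1.0
    return loss, grad_w, grad_m


# A pruned connection (m < 0) that would reduce the loss if re-enabled.
loss, grad_w, grad_m = forward_backward(w=2.0, m=-0.5, x=1.0, y=2.0)
```

In this toy case `grad_m` is negative, so a gradient-descent step increases `m` and moves the pruned connection back towards being active, even though the weight itself receives no gradient while masked.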
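Graph convolutional networks (GCNs) aim to extend deep learning to arbitrary irregular domains, namely graphs. Their success depends heavily on how the topology of the input graphs is defined, and most existing GCN architectures rely on predefined or handcrafted graph structures. In this paper, we introduce a novel method that learns the topology (or connectivity) of input graphs as part of GCN design. The main contribution of our method resides in building an orthogonal connectivity basis that optimally aggregates nodes, through their neighborhoods, before convolution is applied. Our method also considers a stochasticity criterion that acts as a regularizer, making the learned basis and the underlying GCNs lightweight while still highly effective. Experiments conducted on the challenging task of skeleton-based hand-gesture recognition show the high effectiveness of the learned GCNs w.r.t. related work.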
Setting weights to zero when training a neural network helps reduce the computational complexity at inference. To progressively increase the sparsity ratio in the network without causing sharp weight discontinuities during training, our work combines soft-thresholding and straight-through gradient estimation to update the raw, i.e. non-thresholded, version of zeroed weights. Our method, named ST-3 for straight-through/soft-thresholding/sparse-training, obtains SoA results, both in terms of accuracy/sparsity and accuracy/FLOPS trade-offs, when progressively increasing the sparsity ratio in a single training cycle. In particular, despite its simplicity, ST-3 compares favorably to the most recent methods, which adopt differentiable formulations or bio-inspired neuroregeneration principles. This suggests that the key ingredients for effective sparsification primarily lie in the ability to give the weights the freedom to evolve smoothly across the zero state while progressively increasing the sparsity ratio. Source code and weights are available at https://github.com/vanderschuea/stthree
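The combination of soft-thresholding and straight-through updates can be sketched as follows. This is a minimal illustration on a flat list of weights: the way the threshold is derived from a target sparsity, the helper names, and the toy gradient are assumptions rather than details taken from the ST-3 code base.

```python
def threshold_for_sparsity(weights, sparsity):
    # Assumed rule: use the magnitude of the k-th smallest weight as threshold.
    mags = sorted(abs(w) for w in weights)
    k = int(sparsity * len(weights))
    return mags[k - 1] if k > 0 else 0.0


def soft_threshold(w, t):
    # Shrink the magnitude by t and clamp at zero, keeping the sign.
    mag = abs(w) - t
    if mag <= 0.0:
        return 0.0
    return mag if w > 0 else -mag


def ste_step(raw_weights, grads_wrt_sparse, lr):
    # Straight-through: the gradient computed w.r.t. the thresholded weights
    # is applied unchanged to the raw weights, including the zeroed ones.
    return [w - lr * g for w, g in zip(raw_weights, grads_wrt_sparse)]


raw = [0.5, -0.1, 0.05, -0.8]
t = threshold_for_sparsity(raw, sparsity=0.5)
sparse = [soft_threshold(w, t) for w in raw]

# Toy gradient: only the second (currently zeroed) weight receives a signal.
new_raw = ste_step(raw, [0.0, -1.0, 0.0, 0.0], lr=0.5)
```

Because the raw value of a zeroed weight keeps being updated, it can cross the threshold again later and re-enter the network without a jump in its effective value.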
We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network during the early stages of training. Thus, the computational cost of subsequent training iterations, besides that of inference, is considerably reduced. Our method, based on variational inference principles using Gaussian scale mixture priors on neural network weights, learns the variational posterior distribution of Bernoulli random variables multiplying the units/filters, similarly to adaptive dropout. Our algorithm ensures that the Bernoulli parameters practically converge to either 0 or 1, establishing a deterministic final network. We analytically derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection and leads to consistent pruning levels and prediction accuracy regardless of weight initialization or the size of the starting network. We prove the convergence properties of our algorithm, establishing theoretical and practical pruning conditions. We evaluate the proposed algorithm on the MNIST and CIFAR-10 data sets and the commonly used fully connected and convolutional LeNet and VGG16 architectures. The simulations show that our method achieves pruning levels on par with state-of-the-art methods for structured pruning, while maintaining better test accuracy and, more importantly, doing so in a manner robust with respect to network initialization and initial size.
Many applications require sparse neural networks due to space or inference time restrictions. There is a large body of work on training dense networks to yield sparse networks for inference, but this limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. Our method updates the topology of the sparse network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. We demonstrate state-of-the-art sparse training results on a variety of networks and datasets, including ResNet-50, MobileNets on ImageNet-2012, and RNNs on WikiText-103. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static.
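A topology update driven by parameter magnitudes and gradients can be sketched as a drop-and-grow step that keeps the number of active connections fixed. The sketch below assumes that the lowest-magnitude active weights are dropped, that the inactive positions with the largest gradient magnitude are grown, and that grown weights start at zero; the function name and these specifics are illustrative assumptions, not a restatement of the paper's exact procedure.

```python
def update_topology(weights, grads, mask, n_swap):
    # Indices of currently active and inactive connections.
    active = [i for i, m in enumerate(mask) if m == 1]
    inactive = [i for i, m in enumerate(mask) if m == 0]

    # Drop candidates: smallest weight magnitude among active connections.
    drop = sorted(active, key=lambda i: abs(weights[i]))[:n_swap]
    # Grow candidates: largest gradient magnitude among inactive connections.
    grow = sorted(inactive, key=lambda i: -abs(grads[i]))[:n_swap]

    # Swap the same number in each direction so the parameter count is fixed.
    n = min(len(drop), len(grow))
    new_mask = list(mask)
    new_weights = list(weights)
    for i in drop[:n]:
        new_mask[i] = 0
        new_weights[i] = 0.0
    for i in grow[:n]:
        new_mask[i] = 1
        new_weights[i] = 0.0  # assumed: newly grown connections start at zero
    return new_weights, new_mask


weights = [0.9, 0.01, 0.0, 0.0, -0.5]
mask = [1, 1, 0, 0, 1]
grads = [0.1, 0.2, 0.05, -0.7, 0.3]
new_weights, new_mask = update_topology(weights, grads, mask, n_swap=1)
```

The gradients for inactive positions only need to be computed when this step runs, which matches the abstract's mention of infrequent gradient calculations.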
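Neural network pruning is useful for discovering efficient, high-performing subnetworks within pre-trained, dense network architectures. More often than not, however, it involves a three-step process (pre-training, pruning, and re-training) that is computationally expensive, because the dense model must be fully pre-trained. Fortunately, several works have shown empirically that high-performing subnetworks can be discovered via pruning without fully pre-training the dense network. Aiming to analyze theoretically how much dense-network pre-training is needed for a pruned network to perform well, we derive a theoretical bound on the number of SGD pre-training iterations on a two-layer, fully connected network, beyond which pruning via greedy forward selection yields a subnetwork that achieves good training error. This threshold is shown to depend logarithmically on the size of the dataset, meaning that experiments with larger datasets require more pre-training for subnetworks obtained via pruning to perform well. We empirically demonstrate the validity of our theoretical results across a variety of architectures and datasets, including fully connected networks trained on MNIST and several deep convolutional neural network (CNN) architectures trained on CIFAR10 and ImageNet.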
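Network pruning is an effective approach to reducing network complexity with an acceptable performance compromise. Existing studies achieve sparsity in neural networks via time-consuming weight tuning or complex search on networks with expanded width, which greatly limits the applications of network pruning. In this paper, we show that high-performing, sparse subnetworks that require no weight tuning, termed "lottery jackpots", exist in pre-trained models with unexpanded width. For example, we obtain a lottery jackpot that has only 10% of the parameters and still reaches the performance of the original dense VGGNet-19 on CIFAR-10, without any modification of the pre-trained weights. Furthermore, we observe that the sparse masks derived from many existing pruning criteria have a high overlap with the searched mask of our lottery jackpot, among which magnitude-based pruning yields the mask most similar to ours. Based on this insight, we initialize our sparse mask using magnitude-based pruning, resulting in at least a 3x cost reduction in the lottery jackpot search while achieving comparable or even better performance. Specifically, our magnitude-based lottery jackpot removes 90% of the weights in ResNet-50 while easily obtaining more than 70% top-1 accuracy on ImageNet using only 10 search epochs. Our code is available at https://github.com/zyxxmu/lottery-jackpots.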
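Magnitude-based mask initialization and the mask overlap mentioned above can be illustrated with a small sketch. It works on a flat list of weights; the function names, the definition of overlap (the fraction of positions kept in the first mask that are also kept in the second), and the toy values are assumptions for illustration, not the released code.

```python
def magnitude_mask(weights, sparsity):
    # Prune the given fraction of weights with the smallest magnitude.
    n_prune = int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def mask_overlap(mask_a, mask_b):
    # Fraction of positions kept in mask_a that are also kept in mask_b.
    kept_a = [i for i, m in enumerate(mask_a) if m == 1]
    if not kept_a:
        return 0.0
    shared = sum(1 for i in kept_a if mask_b[i] == 1)
    return shared / len(kept_a)


weights = [0.3, -1.2, 0.05, 0.7, -0.01, 0.2, -0.4, 0.9, 0.0, -0.6]
init_mask = magnitude_mask(weights, sparsity=0.9)   # keeps 1 of 10 weights
half_mask = magnitude_mask(weights, sparsity=0.5)   # keeps 5 of 10 weights
overlap = mask_overlap(init_mask, half_mask)
```

A mask built this way leaves the pre-trained weights untouched and only serves as a starting point for a subsequent mask search.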
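Learning graph convolutional networks (GCNs) is an emerging field that aims to generalize convolutional operations to arbitrary non-regular domains. In particular, GCNs operating in the spatial domain show superior performance compared with spectral ones, but their success depends heavily on how the topology of the input graphs is defined. In this paper, we introduce a novel framework for graph convolutional networks that learns the topological properties of graphs. The design principle of our method is based on the optimization of a constrained objective function that learns not only the usual convolutional parameters of GCNs but also a transformation basis that conveys the most relevant topological relationships in these graphs. Experiments conducted on the challenging task of skeleton-based action recognition show the superiority of the proposed method compared with handcrafted graph design as well as related work.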
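Helmholtz Machines (HMs) are a class of generative models composed of two Sigmoid Belief Networks (SBNs), acting as an encoder and a decoder, respectively. These models are commonly trained using a two-step optimization algorithm called Wake-Sleep (WS) and, more recently, by improved versions such as Reweighted Wake-Sleep (RWS) and Bidirectional Helmholtz Machines (BiHM). The locality of the connections in an SBN induces sparsity in the Fisher information matrices associated with the probabilistic models, in the form of a finely grained block-diagonal structure. In this paper, we exploit this property to train SBNs and HMs efficiently using the natural gradient. We present a novel algorithm, called Natural Reweighted Wake-Sleep (NRWS), that corresponds to the geometric adaptation of its standard version. In a similar manner, we also introduce the Natural Bidirectional Helmholtz Machine (NBiHM). Unlike previous work, we show how the natural gradient can be computed efficiently without introducing any approximation of the structure of the Fisher information matrix. Experiments performed on standard datasets from the literature show a consistent improvement of NRWS and NBiHM, not only with respect to their non-geometric baselines but also with respect to state-of-the-art training algorithms for HMs. The improvement is quantified both in terms of convergence speed and in terms of the log-likelihood reached after training.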
Pruning large neural networks to create high-quality, independently trainable sparse masks, which can maintain similar performance to their dense counterparts, is very desirable due to the reduced space and time complexity. As research effort is focused on increasingly sophisticated pruning methods that lead to sparse subnetworks trainable from scratch, we argue for an orthogonal, under-explored theme: improving training techniques for pruned sub-networks, i.e. sparse training. Contrary to the popular belief that only the quality of sparse masks matters for sparse training, in this paper we demonstrate an alternative opportunity: one can carefully customize the sparse training techniques to deviate from the default dense network training protocols, which consists of introducing "ghost" neurons and skip connections at the early stage of training, and strategically modifying the initialization as well as the labels. Our new sparse training recipe is generally applicable to improving training from scratch with various sparse masks. By adopting our newly curated techniques, we demonstrate significant performance gains across various popular datasets (CIFAR-10, CIFAR-100, TinyImageNet), architectures (ResNet-18/32/104, VGG16, MobileNet), and sparse mask options (lottery ticket, SNIP/GRASP, SynFlow, or even random pruning), compared to the default training protocols, especially at high sparsity levels. Code is at https://github.com/VITA-Group/ToST.
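The performance of trained neural networks is robust to harsh levels of pruning. Coupled with the ever-growing size of deep learning models, this observation has motivated extensive research on learning sparse models. In this work, we focus on the task of controlling the level of sparsity when performing sparse learning. Existing methods based on sparsity-inducing penalties involve expensive trial-and-error tuning of the penalty factor, and thus lack direct control over the sparsity of the resulting model. In response, we adopt a constrained formulation: using the gate mechanism proposed by Louizos et al. (2018), we formulate a constrained optimization problem in which sparsification is guided by the training objective and the desired sparsity target in an end-to-end fashion. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet using WideResNet and ResNet{18, 50} models validate the effectiveness of our proposal and demonstrate that we can reliably achieve pre-determined sparsity targets without compromising predictive performance.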
Sparse neural networks attract increasing interest as they exhibit comparable performance to their dense counterparts while being computationally efficient. Pruning dense neural networks is among the most widely used methods to obtain a sparse neural network. Driven by the high training cost of such methods, which can be unaffordable for a low-resource device, training sparse neural networks sparsely from scratch has recently gained attention. However, existing sparse training algorithms suffer from various issues, including poor performance in high-sparsity scenarios, computing dense gradient information during training, or pure random topology search. In this paper, inspired by the evolution of the biological brain and the Hebbian learning theory, we present a new sparse training approach that evolves sparse neural networks according to the behavior of neurons in the network. Concretely, by exploiting the cosine similarity metric to measure the importance of the connections, our proposed method, Cosine similarity-based and Random Topology Exploration (CTRE), evolves the topology of sparse neural networks by adding the most important connections to the network without calculating dense gradients in the backward pass. We carried out different experiments on eight datasets, including tabular, image, and text datasets, and demonstrate that our proposed method outperforms several state-of-the-art sparse training algorithms in extremely sparse neural networks by a large gap. The implementation code is available at https://github.com/zahraatashgahi/CTRE
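Scoring candidate connections by cosine similarity can be sketched as follows. The sketch assumes that each neuron is represented by its vector of activations over a batch, that the absolute value of the similarity is used as the importance score, and that only currently inactive connections are candidates; these choices and all names are illustrative assumptions rather than details confirmed by the abstract.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # a silent neuron carries no similarity signal
    return dot / (na * nb)


def top_candidates(pre_acts, post_acts, mask, k):
    # pre_acts[i] / post_acts[j]: activations of one neuron across a batch.
    # mask[i][j] == 0 marks a connection that is currently absent.
    scored = []
    for i, a in enumerate(pre_acts):
        for j, b in enumerate(post_acts):
            if mask[i][j] == 0:
                scored.append((abs(cosine_similarity(a, b)), i, j))
    scored.sort(reverse=True)
    return [(i, j) for _, i, j in scored[:k]]


pre_acts = [[1.0, 2.0, 3.0], [1.0, 0.0, -1.0]]
post_acts = [[2.0, 4.0, 6.0], [0.0, 1.0, 0.0]]
mask = [[0, 0], [0, 0]]
best = top_candidates(pre_acts, post_acts, mask, k=1)
```

The scores only need forward activations, so no dense gradient is required to pick which connections to add.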
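It has recently been shown that sparse connectivity between consecutive layers of deep neural networks provides benefits for large state-of-the-art models. However, network connectivity also plays a significant role in the learning curves of shallow networks, such as the classic Restricted Boltzmann Machine (RBM). A fundamental problem is how to efficiently find connectivity patterns that improve the learning curve. Recent principled approaches explicitly include network connections as parameters that must be optimized in the model, but they often rely on continuous functions to represent connections and on explicit penalization. This work presents a method for finding optimal connectivity patterns for RBMs based on the idea of network gradients: computing the gradient of every possible connection, given a specific connectivity pattern, and using that gradient to drive a continuous connection-strength parameter, which is in turn used to determine the connectivity pattern. Learning the RBM parameters and learning the network connections are thus truly performed jointly, albeit with different learning rates, and without changing the objective function. The method is applied to the MNIST data set, showing that better RBM models are found for the benchmark tasks of sample generation and input classification.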
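Pruning neural networks reduces inference time and memory costs. On standard hardware, these benefits are especially prominent when coarse-grained structures, such as feature maps, are pruned. We devise two novel saliency-based methods for second-order structured pruning (SOSP) that include correlations among all structures and layers. Our main method, SOSP-H, employs an innovative second-order approximation that enables saliency evaluations through fast Hessian-vector products. SOSP-H therefore scales like a first-order method despite taking the full Hessian into account. We validate SOSP-H by comparing it with our second method, which uses a well-established Hessian approximation, and with numerous state-of-the-art methods. While SOSP-H performs on par or better in terms of accuracy, it has clear advantages in terms of scalability and efficiency. This allowed us to scale SOSP-H to large-scale vision tasks, even though it captures correlations across all layers of the network. To underscore the global nature of our pruning methods, we evaluate their performance not only by removing structures from a pretrained network but also by detecting architectural bottlenecks. We show that our algorithms make it possible to systematically reveal architectural bottlenecks, which we then remove to further increase the accuracy of the networks.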
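The reason Hessian-vector products scale like first-order computations can be shown with a small sketch: the product is obtained from two gradient evaluations, without ever forming the Hessian. The central-difference scheme, the toy quadratic objective, and the function names below are assumptions for illustration; the abstract does not state how the products are computed, and deep learning frameworks typically use automatic differentiation instead.

```python
def grad_f(theta):
    # Gradient of the toy objective f(theta) = 0.5 * theta^T A theta,
    # with A = [[2, 1], [1, 3]], so the exact Hessian is A.
    x, y = theta
    return [2.0 * x + 1.0 * y, 1.0 * x + 3.0 * y]


def hvp(grad_fn, theta, v, eps=1e-4):
    # Central difference of gradients: H v ~ (g(theta + eps v) - g(theta - eps v)) / (2 eps).
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


theta = [0.5, -1.0]
hv = hvp(grad_f, theta, [1.0, 0.0])  # approximately the first column of A
```

The cost is two gradient calls regardless of the number of parameters, whereas storing the full Hessian would require memory quadratic in that number.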
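Nested dropout is a variant of the dropout operation that can order network parameters or features according to a pre-defined importance during training. It has been explored for: (i) constructing nested nets: nested nets are neural networks whose architectures can be adjusted instantly at test time, e.g. based on computational constraints. Nested dropout implicitly ranks the network parameters, generating a set of sub-networks such that any smaller sub-network forms the basis of a larger one; (ii) learning ordered representations: nested dropout applied to the latent representation of a generative model (e.g. an auto-encoder) ranks the features, enforcing an explicit order of the dense representation over dimensions. However, the dropout rate is fixed as a hyper-parameter throughout the whole training process. For nested nets, when network parameters are removed, performance decays along a human-specified trajectory rather than along a trajectory learned from data. For generative models, the importance of the features is specified as a constant vector, which restricts the flexibility of representation learning. To address this problem, we focus on the probabilistic counterpart of nested dropout. We propose a variational nested dropout (VND) operation that draws samples of multi-dimensional ordered masks at low cost and provides useful gradients to the parameters of nested dropout. Based on this approach, we design a Bayesian nested neural network that learns the order knowledge of the parameter distributions. We further exploit VND under different generative models to learn ordered latent distributions. In experiments, we show that the proposed approach outperforms the nested network in terms of accuracy, calibration, and out-of-domain detection in classification tasks. It also outperforms the related generative models on data generation tasks.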
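The ordered masks used by standard (non-variational) nested dropout can be sketched as follows: a cut-off index is sampled and every unit after it is dropped, so earlier units are kept more often than later ones. The geometric-style sampling with a fixed stop probability `p`, the rule of always keeping the first unit, and the function name are assumptions for illustration; the learned, variational version proposed in the abstract is not reproduced here.

```python
import random


def sample_nested_mask(n_units, p, rng):
    # Always keep the first unit, then keep extending the prefix
    # until a stop event (probability p per step) or the last unit.
    k = 1
    while k < n_units and rng.random() > p:
        k += 1
    # Ordered mask: a prefix of ones followed by zeros.
    return [1] * k + [0] * (n_units - k)


rng = random.Random(0)
masks = [sample_nested_mask(8, 0.3, rng) for _ in range(5)]
```

Because every sampled mask is a prefix, any smaller sub-network obtained this way is contained in every larger one, which is the nesting property described above. Here `p` is the fixed hyper-parameter that the abstract identifies as a limitation.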
While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature, which can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques, as well as their potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
Catastrophic forgetting occurs when a neural network loses the information learned in a previous task after training on subsequent tasks. This problem remains a hurdle for artificial intelligence systems with sequential learning capabilities. In this paper, we propose a task-based hard attention mechanism that preserves previous tasks' information without affecting the current task's learning. A hard attention mask is learned concurrently with every task, through stochastic gradient descent, and previous masks are exploited to condition such learning. We show that the proposed mechanism is effective for reducing catastrophic forgetting, cutting current rates by 45 to 80%. We also show that it is robust to different hyperparameter choices, and that it offers a number of monitoring capabilities. The approach makes it possible to control both the stability and the compactness of the learned knowledge, which we believe also makes it attractive for online learning or network compression applications.