We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network during the early stages of training. Thus, the computational cost of subsequent training iterations, besides that of inference, is considerably reduced. Our method, based on variational inference principles using Gaussian scale mixture priors on neural network weights, learns the variational posterior distribution of Bernoulli random variables multiplying the units/filters, similarly to adaptive dropout. Our algorithm ensures that the Bernoulli parameters practically converge to either 0 or 1, establishing a deterministic final network. We analytically derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection and leads to consistent pruning levels and prediction accuracy regardless of weight initialization or the size of the starting network. We prove the convergence properties of our algorithm, establishing theoretical and practical pruning conditions. We evaluate the proposed algorithm on the MNIST and CIFAR-10 data sets and the commonly used fully connected and convolutional LeNet and VGG16 architectures. The simulations show that our method achieves pruning levels on par with state-of-the-art methods for structured pruning while maintaining better test accuracy and, more importantly, does so in a manner robust to network initialization and initial size.
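Sparse deep neural networks have proven effective for predictive model building in large-scale studies. Although several works have studied the theoretical and numerical properties of sparse neural architectures, they have focused primarily on edge selection. Sparsity through edge selection may be intuitively appealing; however, it does not necessarily reduce the structural complexity of a network. Pruning excess nodes instead yields a structurally sparse network with significant computational speedup during inference. To this end, we propose a Bayesian sparse solution using spike-and-slab Gaussian priors to allow automatic node selection during training. The spike-and-slab prior removes the need for an ad hoc thresholding rule for pruning. In addition, we adopt a variational Bayes approach to circumvent the computational challenges of a traditional Markov chain Monte Carlo (MCMC) implementation. In the context of node selection, we establish the fundamental result of variational posterior consistency, together with a characterization of the prior parameters. In contrast to previous works, our theoretical development relaxes the assumptions of an equal number of nodes and of uniform bounds on all network weights, thereby accommodating sparse networks with layer-dependent node structures or coefficient bounds. With a layer-wise characterization of the prior inclusion probabilities, we discuss the optimal contraction rates of the variational posterior. We demonstrate empirically that our proposed approach outperforms the edge selection method in computational complexity, with similar or better predictive performance. Our experimental evidence further substantiates that our theoretical work facilitates layer-wise optimal node recovery.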
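The goal of this survey is to present an explanatory review of the approximation properties of deep neural networks. Specifically, we aim to understand how and why deep neural networks outperform other classical linear and nonlinear approximation methods. The survey consists of three chapters. In Chapter 1, we review the key ideas and concepts underlying deep networks and their compositional nonlinear structure. We formalize the neural network problem by formulating it as an optimization problem when solving regression and classification problems. We briefly discuss the stochastic gradient descent algorithm and the back-propagation formulas used to solve the optimization problem, and we address a few issues related to the performance of neural networks, including the choice of activation functions, cost functions, overfitting, and regularization. In Chapter 2, we shift our focus to the approximation theory of neural networks. We start by introducing the concept of density in polynomial approximation, and in particular study the Stone-Weierstrass theorem for real-valued continuous functions. Then, within the framework of linear approximation, we review some classical results on the density and convergence rate of feedforward networks, followed by more recent developments on the complexity of deep networks in approximating Sobolev functions. In Chapter 3, drawing on nonlinear approximation theory, we further elaborate on the power of depth and the approximation advantages of deep networks over other classical nonlinear approximation methods.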
While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
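This is an introductory machine learning course developed specifically for STEM students. Our goal is to give the interested reader the basics needed to use machine learning in their own projects and to become familiar with the terminology as a foundation for further reading of the relevant literature. In these lecture notes, we discuss supervised, unsupervised, and reinforcement learning. The notes start with an exposition of machine learning methods that do not use neural networks, such as principal component analysis, t-SNE, clustering, and linear regression and linear classifiers. We continue with an introduction to both basic and advanced neural network structures, such as dense feed-forward and conventional neural networks, recurrent neural networks, restricted Boltzmann machines, (variational) autoencoders, and generative adversarial networks. Questions of interpretability are discussed for latent-space representations, using the examples of dreaming and adversarial attacks. The final section is dedicated to reinforcement learning, where we introduce the basic notions of value functions and policy learning.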
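Since sparse neural networks usually contain many zero weights, these unnecessary network connections can potentially be eliminated without degrading network performance. Well-designed sparse neural networks therefore have the potential to significantly reduce FLOPs and computational resources. In this work, we propose a new automatic pruning method, Sparse Connectivity Learning (SCL). Specifically, each weight is re-parameterized as the element-wise product of a trainable weight variable and a binary mask. Network connectivity is thus fully described by the binary mask, which is modulated by a unit step function. We theoretically prove the fundamental principle for using a straight-through estimator (STE) in network pruning: the proxy gradients of the STE should be positive, ensuring that the mask variables converge at their minima. After finding that the Leaky ReLU, Softplus, and Identity STEs satisfy this principle, we propose adopting the Identity STE in SCL for discrete mask relaxation. We find that the mask gradients of different features are highly unbalanced; we therefore propose normalizing the mask gradients of each feature to optimize mask variable training. To train sparse masks automatically, we include the total number of network connections as a regularization term in our objective function. Because SCL does not require pruning criteria or hyper-parameters defined by designers for individual network layers, the network is explored in a larger hypothesis space to reach an optimized sparse connectivity for the best performance. SCL overcomes the limitations of existing automatic pruning methods. Experimental results show that SCL can automatically learn and select the important network connections for a variety of baseline network structures. Deep learning models trained with SCL outperform state-of-the-art human-designed and automatic pruning methods in sparsity, accuracy, and FLOPs reduction.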
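The mask re-parameterization and identity STE described in this abstract can be illustrated with a minimal single-connection sketch. It is not the authors' implementation: the squared-error loss, the names `unit_step`, `ste_grads`, and `prune_demo`, and the values of `lam` and `lr` are all assumptions made for illustration, and the per-feature gradient normalization is omitted.

```python
# Toy sketch (assumed setup, not the SCL code): one connection with
# effective weight w * step(m), loss 0.5 * err**2 + lam * step(m).

def unit_step(m):
    """Binary mask: 1.0 keeps the connection, 0.0 prunes it."""
    return 1.0 if m > 0.0 else 0.0

def ste_grads(w, m, x, y, lam):
    """Gradients of the toy loss with respect to w and the mask variable m."""
    mask = unit_step(m)
    err = w * mask * x - y        # residual of the masked prediction
    grad_w = err * mask * x       # exact gradient for the weight variable
    # Identity STE: the derivative of step(m) is replaced by 1 in the
    # backward pass, for both the data term and the connection penalty.
    grad_m = err * w * x + lam
    return grad_w, grad_m

def prune_demo(x, y, w=0.5, m=0.1, lam=0.05, lr=0.1, steps=50):
    """Plain gradient descent on (w, m); returns the final pair."""
    for _ in range(steps):
        grad_w, grad_m = ste_grads(w, m, x, y, lam)
        w -= lr * grad_w
        m -= lr * grad_m
    return w, m

if __name__ == "__main__":
    # The target is always zero, so the connection carries no signal.
    w, m = prune_demo(x=1.0, y=0.0)
    print("mask after training:", unit_step(m))
```

In this toy setting, a connection whose target is always zero sees only positive mask gradients, so its mask variable drifts below zero and the connection is pruned. The sketch is meant to show the gradient flow rather than a stable training recipe.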
We investigate a local reparameterization technique for greatly reducing the variance of stochastic gradients for variational Bayesian inference (SGVB) of a posterior over model parameters, while retaining parallelizability. This local reparameterization translates uncertainty about global parameters into local noise that is independent across datapoints in the minibatch. Such parameterizations can be trivially parallelized and have variance that is inversely proportional to the minibatch size, generally leading to much faster convergence. Additionally, we explore a connection with dropout: Gaussian dropout objectives correspond to SGVB with local reparameterization, a scale-invariant prior and proportionally fixed posterior variance. Our method allows inference of more flexibly parameterized posteriors; specifically, we propose variational dropout, a generalization of Gaussian dropout where the dropout rates are learned, often leading to better models. The method is demonstrated through several experiments.
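As a rough illustration of the idea, assume a fully factorized Gaussian posterior over the weights of one linear unit (an assumption; the abstract does not fix the posterior family). The pre-activation is then itself Gaussian, so noise can be drawn once per datapoint instead of once per weight. The function names below are hypothetical.

```python
import math
import random

def preactivation_moments(x, mu, sigma):
    """Mean and variance of b = sum_i x_i * w_i when w_i ~ N(mu_i, sigma_i**2)
    independently (assumed factorized Gaussian posterior)."""
    mean = sum(xi * mi for xi, mi in zip(x, mu))
    var = sum((xi * si) ** 2 for xi, si in zip(x, sigma))
    return mean, var

def sample_preactivation(x, mu, sigma, rng):
    """Draw b directly: one noise variable per datapoint and unit,
    rather than one per weight."""
    mean, var = preactivation_moments(x, mu, sigma)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)

if __name__ == "__main__":
    rng = random.Random(0)
    x = [1.0, 2.0]
    mu = [0.5, -1.0]
    sigma = [0.1, 0.2]
    print(preactivation_moments(x, mu, sigma))
    print(sample_preactivation(x, mu, sigma, rng))
```

Calling `sample_preactivation` separately for each datapoint in a minibatch gives noise that is independent across datapoints, which is the property the abstract relies on.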
These notes were compiled as lecture notes for a course developed and taught at the University of Southern California. They should be accessible to a typical engineering graduate student with a strong background in Applied Mathematics. The main objective of these notes is to introduce a student who is familiar with concepts in linear algebra and partial differential equations to select topics in deep learning. These lecture notes exploit the strong connections between deep learning algorithms and the more conventional techniques of computational physics to achieve two goals. First, they use concepts from computational physics to develop an understanding of deep learning algorithms. Not surprisingly, many concepts in deep learning can be connected to similar concepts in computational physics, and one can utilize this connection to better understand these algorithms. Second, several novel deep learning algorithms can be used to solve challenging problems in computational physics. Thus, they offer someone who is interested in modeling physical phenomena a complementary set of tools.
We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies within the class of exponential-family variational posterior distributions; we discuss the Gaussian case extensively, for which the updates take a rather simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is simple to implement, as the updates of the variational posterior involve neither gradients with respect to the model parameters nor the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details of its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.
We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.
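To make the Kronecker structure concrete, the following standard linear-algebra identity (not taken from the abstract) shows why such a factored block is cheap to invert. The symbols are notation assumed here: $A$ and $G$ stand for the two small factors of one layer's Fisher block, $V$ for that layer's gradient arranged as a matrix, and $\mathrm{vec}$ for column stacking.

$$
(A \otimes G)^{-1} = A^{-1} \otimes G^{-1},
\qquad
(A \otimes G)^{-1}\,\mathrm{vec}(V) = \mathrm{vec}\!\left(G^{-1} V A^{-\top}\right).
$$

Only the two small factors are inverted, so the cost scales with their sizes rather than with the size of the full block.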
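We study methods for estimating model uncertainty for neural networks (NNs) in regression. To isolate the effect of model uncertainty, we focus on a noiseless setting with scarce training data. We introduce five important desiderata regarding model uncertainty that any method should satisfy. However, we find that established benchmarks often fail to reliably capture some of these desiderata, even those required by Bayesian theory. To address this, we introduce a new approach for capturing model uncertainty for NNs, which we call Neural Optimization-based Model Uncertainty (NOMU). The main idea of NOMU is to design a network architecture consisting of two connected sub-NNs, one for model prediction and one for model uncertainty, and to train it using a carefully designed loss function. Importantly, our design enforces that NOMU satisfies our five desiderata. Owing to its modular architecture, NOMU can provide model uncertainty for any given (previously trained) NN, provided its training data is accessible. We evaluate NOMU on various regression tasks and in noiseless Bayesian optimization (BO) with costly evaluations. In regression, NOMU performs at least as well as state-of-the-art methods. In BO, NOMU even outperforms all considered benchmarks.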
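The increasing computational requirements of deep neural networks (DNNs) have led to significant interest in obtaining DNN models that are sparse yet accurate. Recent work has investigated the even harder case of sparse training, where the DNN weights are kept as sparse as possible in order to reduce computational costs during training. Existing sparse training methods are often empirical and can have lower accuracy than the dense baseline. In this paper, we present a general approach called Alternating Compressed/DeCompressed (AC/DC) training of DNNs, demonstrate convergence for a variant of the algorithm, and show that AC/DC outperforms existing sparse training methods in accuracy at similar computational budgets; at high sparsity levels, AC/DC even outperforms existing methods that rely on accurate pre-trained dense models. An important property of AC/DC is that it allows co-training of dense and sparse models, yielding accurate sparse-dense model pairs at the end of the training process. This is useful in practice, where compressed variants may be desirable for deployment in resource-constrained settings without redoing the entire training flow, and it also gives us insight into the accuracy gap between dense and compressed models. Code is available at: https://github.com/ist-daslab/acdc.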
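Deep learning has powered the recent successes of artificial intelligence (AI). However, the deep neural network, as the basic model of deep learning, suffers from issues such as local traps and miscalibration. In this paper, we provide a new framework for sparse deep learning that addresses these issues in a coherent way. In particular, we lay down a theoretical foundation for sparse deep learning and propose prior annealing algorithms for learning sparse neural networks. The former brings the sparse deep neural network into the framework of statistical modeling, enabling prediction uncertainty to be correctly quantified. The latter is asymptotically guaranteed to converge to the global optimum, ensuring the validity of downstream statistical inference. Numerical results indicate the superiority of the proposed method over existing ones.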
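Helmholtz Machines (HMs) are a class of generative models composed of two Sigmoid Belief Networks (SBNs), acting as an encoder and a decoder, respectively. These models are commonly trained using a two-step optimization algorithm called Wake-Sleep (WS), and more recently by improved versions such as Reweighted Wake-Sleep (RWS) and Bidirectional Helmholtz Machines (BiHM). The locality of the connections in an SBN induces sparsity in the Fisher information matrices associated with the probabilistic models, in the form of a finely grained block-diagonal structure. In this paper, we exploit this property to train SBNs and HMs efficiently using the natural gradient. We present a novel algorithm, called Natural Reweighted Wake-Sleep (NRWS), which corresponds to the geometric adaptation of its standard version. In a similar manner, we also introduce the Natural Bidirectional Helmholtz Machine (NBiHM). Unlike previous work, we show how the natural gradient can be computed efficiently without introducing any approximation to the structure of the Fisher information matrix. Experiments on standard datasets from the literature show a consistent improvement of NRWS and NBiHM, not only over their non-geometric baselines but also over state-of-the-art training algorithms for HMs. The improvement is quantified in terms of both the speed of convergence and the value of the log-likelihood reached after training.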
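Aiming at a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step sizes in the mini-batch case. To do so, curvature is estimated based on a local quadratic model, using only noisy gradient approximations. This yields a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set, and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at intermediate stages, yielding better results than SGD, RMSprop, or ADAM.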
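For orientation, the classical deterministic Barzilai-Borwein step that the abstract refers to can be sketched on a toy diagonal quadratic. This is the textbook method, not the stochastic Step-Tuned SGD variant the paper proposes; the function names, the quadratic test problem, and the fallback step are assumptions made for illustration.

```python
def bb_step(x_prev, x_curr, g_prev, g_curr, fallback=0.1):
    """Classical Barzilai-Borwein step size (s.s) / (s.y), with
    s = x_curr - x_prev and y = g_curr - g_prev."""
    s = [c - p for c, p in zip(x_curr, x_prev)]
    y = [c - p for c, p in zip(g_curr, g_prev)]
    sy = sum(si * yi for si, yi in zip(s, y))
    if sy <= 0.0:                 # no usable curvature estimate
        return fallback
    return sum(si * si for si in s) / sy

def quadratic_grad(x, a):
    """Gradient of f(x) = 0.5 * sum_i a_i * x_i**2."""
    return [ai * xi for ai, xi in zip(a, x)]

def minimize_quadratic(a, x0, first_step=0.1, iters=20):
    """Gradient descent whose step size is re-estimated by bb_step."""
    x_prev, g_prev = list(x0), quadratic_grad(x0, a)
    x = [xi - first_step * gi for xi, gi in zip(x_prev, g_prev)]
    for _ in range(iters):
        g = quadratic_grad(x, a)
        step = bb_step(x_prev, x, g_prev, g)
        x_prev, g_prev = x, g
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

if __name__ == "__main__":
    print(minimize_quadratic(a=[4.0], x0=[1.0]))
```

In one dimension the estimated step equals the inverse curvature, so the iterate reaches the minimizer after a single Barzilai-Borwein step. Replacing exact gradients with noisy mini-batch estimates is where the paper's contribution begins.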
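Approximate Bayesian deep learning methods hold significant promise for addressing several issues that arise when deploying deep learning components in intelligent systems, including mitigating over-confident errors and providing enhanced robustness to out-of-distribution examples. However, the computational requirements of existing approximate Bayesian inference methods can make them ill-suited for deployment in intelligent IoT systems that include low-power edge devices. In this paper, we present a range of approximate Bayesian inference methods for supervised deep learning and highlight the challenges and opportunities of applying these methods on current edge hardware. We highlight several potential solutions for reducing model storage requirements and improving computational scalability, including model pruning and distillation methods.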
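Motivated by the recent increase in interest in optimization algorithms for non-convex optimization, as applied to training deep neural networks and other optimization problems in data analysis, we give an overview of recent theoretical results on global performance guarantees of optimization algorithms for non-convex optimization. We start with classical arguments showing that general non-convex problems cannot be solved efficiently in a reasonable time. We then give a list of problems for which the global minimizer can be found efficiently by exploiting the structure of the problem as much as possible. Another way to deal with non-convexity is to relax the goal from finding the global minimum to finding a stationary point or a local minimum. For this setting, we first present known results on the convergence rates of deterministic first-order methods, followed by a general theoretical analysis of optimal stochastic and randomized gradient schemes and an overview of stochastic first-order methods. After that, we discuss quite general classes of non-convex problems, such as minimization of $\alpha$-weakly-quasi-convex functions and of functions that satisfy the Polyak-Lojasiewicz condition, which still allow theoretical convergence guarantees for first-order methods. We then consider higher-order and zeroth-order/derivative-free methods and their convergence rates for non-convex optimization problems.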
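This digital book contains a practical and comprehensive introduction to everything related to deep learning in the context of physical simulations. As far as possible, all topics come with hands-on code examples in the form of Jupyter notebooks for getting started quickly. Beyond standard supervised learning from data, we look at physical loss constraints, more tightly coupled learning algorithms with differentiable simulations, as well as reinforcement learning and uncertainty modeling. We live in exciting times: these methods have huge potential to fundamentally change what computer simulations can achieve.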
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
Normalizing flows provide a general mechanism for defining expressive probability distributions, only requiring the specification of a (usually simple) base distribution and a series of bijective transformations. There has been much recent work on normalizing flows, ranging from improving their expressive power to expanding their application. We believe the field has now matured and is in need of a unified perspective. In this review, we attempt to provide such a perspective by describing flows through the lens of probabilistic modeling and inference. We place special emphasis on the fundamental principles of flow design, and discuss foundational topics such as expressive power and computational trade-offs. We also broaden the conceptual framing of flows by relating them to more general probability transformations. Lastly, we summarize the use of flows for tasks such as generative modeling, approximate inference, and supervised learning.
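The mechanism described in this abstract, a simple base distribution pushed through a bijection, rests on the change-of-variables formula. A minimal sketch with a single affine transformation and a standard normal base (both chosen here for illustration, not taken from the review) looks like this; the function names are hypothetical.

```python
import math

def std_normal_logpdf(z):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_flow_logpdf(x, scale, shift):
    """Log-density of x = scale * z + shift with z ~ N(0, 1):
    log p_x(x) = log p_z(z) - log|dx/dz|, where z = (x - shift) / scale."""
    if scale == 0.0:
        raise ValueError("scale must be non-zero for the map to be bijective")
    z = (x - shift) / scale       # inverse transformation
    return std_normal_logpdf(z) - math.log(abs(scale))

if __name__ == "__main__":
    print(affine_flow_logpdf(1.0, scale=2.0, shift=1.0))
```

For this affine map the result coincides with the log-density of a Gaussian with mean `shift` and standard deviation `|scale|`. Practical flows compose many such bijections and sum their log-determinant terms.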