Denoising autoencoders have been previously shown to be competitive alternatives to Restricted Boltzmann Machines for unsupervised pre-training of each layer of a deep architecture. We show that a simple denoising autoencoder training criterion is equivalent to matching the score (with respect to the data) of a specific energy based model to that of a non-parametric Parzen density estimator of the data. This yields several useful insights. It defines a proper probabilistic model for the denoising autoencoder technique which makes it in principle possible to sample from them or to rank examples by their energy. It suggests a different way to apply score matching that is related to learning to denoise and does not require computing second derivatives. It justifies the use of tied weights between the encoder and decoder, and suggests ways to extend the success of denoising autoencoders to a larger family of energy-based models.
translated by 谷歌翻译
We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.
translated by 谷歌翻译
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
translated by 谷歌翻译
Hopfield networks and Boltzmann machines (BMs) are fundamental energy-based neural network models. Recent studies on modern Hopfield networks have broaden the class of energy functions and led to a unified perspective on general Hopfield networks including an attention module. In this letter, we consider the BM counterparts of modern Hopfield networks using the associated energy functions, and study their salient properties from a trainability perspective. In particular, the energy function corresponding to the attention module naturally introduces a novel BM, which we refer to as attentional BM (AttnBM). We verify that AttnBM has a tractable likelihood function and gradient for a special case and is easy to train. Moreover, we reveal the hidden connections between AttnBM and some single-layer models, namely the Gaussian--Bernoulli restricted BM and denoising autoencoder with softmax units. We also investigate BMs introduced by other energy functions, and in particular, observe that the energy function of dense associative memory models gives BMs belonging to Exponential Family Harmoniums.
translated by 谷歌翻译
One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data.
translated by 谷歌翻译
How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
translated by 谷歌翻译
这项正在进行的工作旨在为统计学习提供统一的介绍,从诸如GMM和HMM等经典模型到现代神经网络(如VAE和扩散模型)缓慢地构建。如今,有许多互联网资源可以孤立地解释这一点或新的机器学习算法,但是它们并没有(也不能在如此简短的空间中)将这些算法彼此连接起来,或者与统计模型的经典文献相连现代算法出现了。同样明显缺乏的是一个单一的符号系统,尽管对那些已经熟悉材料的人(如这些帖子的作者)不满意,但对新手的入境造成了重大障碍。同样,我的目的是将各种模型(尽可能)吸收到一个用于推理和学习的框架上,表明(以及为什么)如何以最小的变化将一个模型更改为另一个模型(其中一些是新颖的,另一些是文献中的)。某些背景当然是必要的。我以为读者熟悉基本的多变量计算,概率和统计以及线性代数。这本书的目标当然不是​​完整性,而是从基本知识到过去十年中极强大的新模型的直线路径或多或少。然后,目标是补充而不是替换,诸如Bishop的\ emph {模式识别和机器学习}之类的综合文本,该文本现在已经15岁了。
translated by 谷歌翻译
这是一门专门针对STEM学生开发的介绍性机器学习课程。我们的目标是为有兴趣的读者提供基础知识,以在自己的项目中使用机器学习,并将自己熟悉术语作为进一步阅读相关文献的基础。在这些讲义中,我们讨论受监督,无监督和强化学习。注释从没有神经网络的机器学习方法的说明开始,例如原理分析,T-SNE,聚类以及线性回归和线性分类器。我们继续介绍基本和先进的神经网络结构,例如密集的进料和常规神经网络,经常性的神经网络,受限的玻尔兹曼机器,(变性)自动编码器,生成的对抗性网络。讨论了潜在空间表示的解释性问题,并使用梦和对抗性攻击的例子。最后一部分致力于加强学习,我们在其中介绍了价值功能和政策学习的基本概念。
translated by 谷歌翻译
概率分布允许从业者发现数据中的隐藏结构,并构建模型,以使用有限的数据解决监督的学习问题。该报告的重点是变异自动编码器,这是一种学习大型复杂数据集概率分布的方法。该报告提供了对变异自动编码器的理论理解,并巩固了该领域的当前研究。该报告分为多个章节,第一章介绍了问题,描述了变异自动编码器并标识了该领域的关键研究方向。第2、3、4和5章深入研究了每个关键研究领域的细节。第6章总结了报告,并提出了未来工作的指示。具有机器学习基本思想但想了解机器学习研究中的一般主题的读者可以从报告中受益。该报告解释了有关学习概率分布的中心思想,人们为使这种危险做些什么,并介绍了有关当前如何应用深度学习的细节。该报告还为希望为这个子场做出贡献的人提供了温和的介绍。
translated by 谷歌翻译
我们研究是否使用两个条件型号$ p(x | z)$和$ q(z | x)$,以使用循环的两个条件型号,我们如何建模联合分配$ p(x,z)$。这是通过观察到深入生成模型的动机,除了可能的型号$ p(x | z)$,通常也使用推理型号$ q(z | x)$来提取表示,但它们通常依赖不表征的先前分配$ P(z)$来定义联合分布,这可能会使后塌和歧管不匹配等问题。为了探讨仅使用$ p(x | z)$和$ q(z | x)$模拟联合分布的可能性,我们研究其兼容性和确定性,对应于其条件分布一致的联合分布的存在和唯一性跟他们。我们为可操作的等价标准开发了一般理论,以实现兼容性,以及足够的确定条件。基于该理论,我们提出了一种新颖的生成建模框架来源,仅使用两个循环条件模型。我们开发方法以实现兼容性和确定性,并使用条件模型适合和生成数据。通过预先删除的约束,Cygen更好地适合数据并捕获由合成和现实世界实验支持的更多代表性特征。
translated by 谷歌翻译
已经引入了生成流量网络(GFlowNETS)作为在主动学习背景下采样多样化候选的方法,具有培训目标,其使它们与给定奖励功能成比例地进行比例。在本文中,我们显示了许多额外的GFLOWN的理论特性。它们可用于估计联合概率分布和一些变量未指定的相应边际分布,并且特别感兴趣地,可以代表像集合和图形的复合对象的分布。 Gflownets摊销了通常通过计算昂贵的MCMC方法在单个但训练有素的生成通行证中进行的工作。它们还可用于估计分区功能和自由能量,给定子集(子图)的超标(超图)的条件概率,以及给定集合(图)的所有超标仪(超图)的边际分布。我们引入了熵和相互信息估计的变体,从帕累托前沿采样,与奖励最大化策略的连接,以及随机环境的扩展,连续动作和模块化能量功能。
translated by 谷歌翻译
Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.
translated by 谷歌翻译
We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.
translated by 谷歌翻译
基于能量的模型(EBMS)提供灵活的分布参数化。然而,由于难以应变的分隔功能,它们通常通过对比发散培训,以获得最大似然估计。在本文中,我们提出了伪球形对比偏差(PS-CD)来概括eBM的最大似然学习。 PS-CD源自严格适当的同质评分规则系列的最大化,这避免了难以处理分区功能的计算,并提供了包括对比分歧的广义学习目标作为特殊情况。此外,PS-CD允许我们灵活地选择各种学习目标,以便在没有额外的计算成本或变分性最低限度优化的情况下培训EBM。关于合成数据和常用图像数据集的提出方法和广泛实验的理论分析证明了PS-CD的有效性和建模灵活性,以及​​其对数据污染的鲁棒性,从而显示出其最大可能性和$ F $的优势 - ebms。
translated by 谷歌翻译
We present a new estimation principle for parameterized statistical models. The idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity. We show that this leads to a consistent (convergent) estimator of the parameters, and analyze the asymptotic variance. In particular, the method is shown to directly work for unnormalized models, i.e. models where the density function does not integrate to one. The normalization constant can be estimated just like any other parameter. For a tractable ICA model, we compare the method with other estimation methods that can be used to learn unnormalized models, including score matching, contrastive divergence, and maximum-likelihood where the normalization constant is estimated with importance sampling. Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency. The method is then applied to the modeling of natural images: We show that the method can successfully estimate a large-scale two-layer model and a Markov random field.
translated by 谷歌翻译
我们正式地用密度$ p_x $中的未知分发问题映射了从$ \ mathbb {r} ^ d $上学习和采样$ p_ \ mathbf {y} $ in $ \ mathbb {r} ^ {使用固定因子内核将$ P_X $获得的MD} $获取:$ p_ \ mathbf {y} $被称为m密度和因子内核作为多索静音噪声模型(MNM)。 m-litess比$ p_x $更顺畅,更容易学习和示例,但对于大量的$ m $来说,由于估计$ x $来估计$ \ mathbf {y} = \ mathbf {y $使用贝叶斯估算器$ \ widehat {x}(\ mathbf {y})= \ mathbb {e} [x \ vert \ mathbf {y} = \ mathbf {y}。为了制定问题,我们从无通知$ P_ \ MATHBF {Y} $以封闭式表达以封闭式表示的泊松和高斯MNMS获得$ \ widehat {x}(\ mathbf {y})$。这导致了用于学习参数能量和得分功能的简单最小二乘目标。我们展示了各种兴趣的参数化方案,包括研究高斯M密度直接导致多营养的自动化器 - 这是在文献中的去噪自动化器和经验贝叶斯之间进行的第一个理论连接。来自$ P_X $的示例由步行跳转采样(Saremi&Hyvarinen,2019)通过欠款Langevin MCMC(Walk)从$ P_ \ Mathbf {Y} $和Multimeasurement Bayes估算$ x $(跳转)。我们研究Mnist,CiFar-10和FFHQ-256数据集上的置换不变高斯M密度,并证明了该框架的有效性,以实现高尺寸的快速混合稳定的马尔可夫链。
translated by 谷歌翻译
For distributions $\mathbb{P}$ and $\mathbb{Q}$ with different supports or undefined densities, the divergence $\textrm{D}(\mathbb{P}||\mathbb{Q})$ may not exist. We define a Spread Divergence $\tilde{\textrm{D}}(\mathbb{P}||\mathbb{Q})$ on modified $\mathbb{P}$ and $\mathbb{Q}$ and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. We also give examples of using a Spread Divergence to train implicit generative models, including linear models (Independent Components Analysis) and non-linear models (Deep Generative Networks).
translated by 谷歌翻译
指数族在机器学习中广泛使用,包括连续和离散域中的许多分布(例如,通过SoftMax变换,Gaussian,Dirichlet,Poisson和分类分布)。这些家庭中的每个家庭的分布都有固定的支持。相比之下,对于有限域而言,最近在SoftMax稀疏替代方案(例如Sparsemax,$ \ alpha $ -entmax和Fusedmax)的稀疏替代方案中导致了带有不同支持的分布。本文基于几种技术贡献,开发了连续分布的稀疏替代方案:首先,我们定义了$ \ omega $ regultion的预测图和任意域的Fenchel-young损失(可能是无限或连续的)。对于线性参数化的家族,我们表明,Fenchel-Young损失的最小化等效于统计的矩匹配,从而概括了指数家族的基本特性。当$ \ omega $是带有参数$ \ alpha $的Tsallis negentropy时,我们将获得````trabormed rompential指数)'',其中包括$ \ alpha $ -entmax和sparsemax和sparsemax($ \ alpha = 2 $)。对于二次能量函数,产生的密度为$ \ beta $ -Gaussians,椭圆形分布的实例,其中包含特殊情况,即高斯,双重量级,三人级和epanechnikov密度,我们为差异而得出了差异的封闭式表达式, Tsallis熵和Fenchel-Young损失。当$ \ Omega $是总变化或Sobolev正常化程序时,我们将获得Fusedmax的连续版本。最后,我们引入了连续的注意机制,从\ {1、4/3、3/3、3/2、2 \} $中得出有效的梯度反向传播算法。使用这些算法,我们证明了我们的稀疏连续分布,用于基于注意力的音频分类和视觉问题回答,表明它们允许参加时间间隔和紧凑区域。
translated by 谷歌翻译
变异推理(VI)的核心原理是将计算复杂后概率密度计算的统计推断问题转换为可拖动的优化问题。该属性使VI比几种基于采样的技术更快。但是,传统的VI算法无法扩展到大型数据集,并且无法轻易推断出越野数据点,而无需重新运行优化过程。该领域的最新发展,例如随机,黑框和摊销VI,已帮助解决了这些问题。如今,生成的建模任务广泛利用摊销VI来实现其效率和可扩展性,因为它利用参数化函数来学习近似的后验密度参数。在本文中,我们回顾了各种VI技术的数学基础,以构成理解摊销VI的基础。此外,我们还概述了最近解决摊销VI问题的趋势,例如摊销差距,泛化问题,不一致的表示学习和后验崩溃。最后,我们分析了改善VI优化的替代差异度量。
translated by 谷歌翻译
量子哈密顿学习和量子吉布斯采样的双重任务与物理和化学中的许多重要问题有关。在低温方案中,这些任务的算法通常会遭受施状能力,例如因样本或时间复杂性差而遭受。为了解决此类韧性,我们将量子自然梯度下降的概括引入了参数化的混合状态,并提供了稳健的一阶近似算法,即量子 - 固定镜下降。我们使用信息几何学和量子计量学的工具证明了双重任务的数据样本效率,因此首次将经典Fisher效率的开创性结果推广到变异量子算法。我们的方法扩展了以前样品有效的技术,以允许模型选择的灵活性,包括基于量子汉密尔顿的量子模型,包括基于量子的模型,这些模型可能会规避棘手的时间复杂性。我们的一阶算法是使用经典镜下降二元性的新型量子概括得出的。两种结果都需要特殊的度量选择,即Bogoliubov-Kubo-Mori度量。为了从数值上测试我们提出的算法,我们将它们的性能与现有基准进行了关于横向场ISING模型的量子Gibbs采样任务的现有基准。最后,我们提出了一种初始化策略,利用几何局部性来建模状态的序列(例如量子 - 故事过程)的序列。我们从经验上证明了它在实际和想象的时间演化的经验上,同时定义了更广泛的潜在应用。
translated by 谷歌翻译