Autoencoders are a popular model in many branches of machine learning and lossy data compression. However, their fundamental limits, the performance of gradient methods and the features learnt during optimization remain poorly understood, even in the two-layer setting. In fact, earlier work has considered either linear autoencoders or specific training regimes (leading to vanishing or diverging compression rates). Our paper addresses this gap by focusing on non-linear two-layer autoencoders trained in the challenging proportional regime in which the input dimension scales linearly with the size of the representation. Our results characterize the minimizers of the population risk, and show that such minimizers are achieved by gradient methods; their structure is also unveiled, thus leading to a concise description of the features obtained via training. For the special case of a sign activation function, our analysis establishes the fundamental limits for the lossy compression of Gaussian sources via (shallow) autoencoders. Finally, while the results are proved for Gaussian data, numerical simulations on standard datasets display the universality of the theoretical predictions.
translated by 谷歌翻译
We consider the problem of reconstructing the signal and the hidden variables from observations coming from a multi-layer network with rotationally invariant weight matrices. The multi-layer structure models inference from deep generative priors, and the rotational invariance imposed on the weights generalizes the i.i.d.\ Gaussian assumption by allowing for a complex correlation structure, which is typical in applications. In this work, we present a new class of approximate message passing (AMP) algorithms and give a state evolution recursion which precisely characterizes their performance in the large system limit. In contrast with the existing multi-layer VAMP (ML-VAMP) approach, our proposed AMP -- dubbed multi-layer rotationally invariant generalized AMP (ML-RI-GAMP) -- provides a natural generalization beyond Gaussian designs, in the sense that it recovers the existing Gaussian AMP as a special case. Furthermore, ML-RI-GAMP exhibits a significantly lower complexity than ML-VAMP, as the computationally intensive singular value decomposition is replaced by an estimation of the moments of the design matrices. Finally, our numerical results show that this complexity gain comes at little to no cost in the performance of the algorithm.
translated by 谷歌翻译
In a mixed generalized linear model, the objective is to learn multiple signals from unlabeled observations: each sample comes from exactly one signal, but it is not known which one. We consider the prototypical problem of estimating two statistically independent signals in a mixed generalized linear model with Gaussian covariates. Spectral methods are a popular class of estimators which output the top two eigenvectors of a suitable data-dependent matrix. However, despite the wide applicability, their design is still obtained via heuristic considerations, and the number of samples $n$ needed to guarantee recovery is super-linear in the signal dimension $d$. In this paper, we develop exact asymptotics on spectral methods in the challenging proportional regime in which $n, d$ grow large and their ratio converges to a finite constant. By doing so, we are able to optimize the design of the spectral method, and combine it with a simple linear estimator, in order to minimize the estimation error. Our characterization exploits a mix of tools from random matrices, free probability and the theory of approximate message passing algorithms. Numerical simulations for mixed linear regression and phase retrieval display the advantage enabled by our analysis over existing designs of spectral methods.
translated by 谷歌翻译
Artificial neural networks are functions depending on a finite number of parameters typically encoded as weights and biases. The identification of the parameters of the network from finite samples of input-output pairs is often referred to as the \emph{teacher-student model}, and this model has represented a popular framework for understanding training and generalization. Even if the problem is NP-complete in the worst case, a rapidly growing literature -- after adding suitable distributional assumptions -- has established finite sample identification of two-layer networks with a number of neurons $m=\mathcal O(D)$, $D$ being the input dimension. For the range $D<m<D^2$ the problem becomes harder, and truly little is known for networks parametrized by biases as well. This paper fills the gap by providing constructive methods and theoretical guarantees of finite sample identification for such wider shallow networks with biases. Our approach is based on a two-step pipeline: first, we recover the direction of the weights, by exploiting second order information; next, we identify the signs by suitable algebraic evaluations, and we recover the biases by empirical risk minimization via gradient descent. Numerical results demonstrate the effectiveness of our approach.
translated by 谷歌翻译
神经切线内核(NTK)已成为提供记忆,优化和泛化的强大工具,可保证深度神经网络。一项工作已经研究了NTK频谱的两层和深网,其中至少具有$ \ omega(n)$神经元的层,$ n $是培训样本的数量。此外,有越来越多的证据表明,只要参数数量超过样品数量,具有亚线性层宽度的深网是强大的记忆和优化器。因此,一个自然的开放问题是NTK是否在如此充满挑战的子线性设置中适应得很好。在本文中,我们以肯定的方式回答了这个问题。我们的主要技术贡献是对最小的深网的最小NTK特征值的下限,最小可能的过度参数化:参数的数量大约为$ \ omega(n)$,因此,神经元的数量仅为$ $ $ \ omega(\ sqrt {n})$。为了展示我们的NTK界限的适用性,我们为梯度下降训练提供了两个有关记忆能力和优化保证的结果。
translated by 谷歌翻译
在本文中,我们研究了具有N节点的目标两层神经网络的压缩到具有M <n节点的压缩网络中。更确切地说,我们考虑目标网络权重为I.I.D的设置。在高斯输入的假设下,次高斯次级高斯,我们最大程度地减少了目标和压缩网络的输出之间的L_2损失。通过使用高维概率的工具,我们表明,当目标网络充分过度参数化时,可以简化此非凸问题,并提供此近似值作为输入维度和N的函数。平均场限制,简化的目标以及压缩网络的最佳权重不取决于目标网络的实现,而仅取决于预期的缩放因素。此外,对于具有relu激活的网络,我们猜想通过在等缘紧密框架(ETF)上取重量来实现简化优化问题的最佳,而权重的缩放和ETF的方向取决于ETF的方向目标网络。提供数值证据以支持此猜想。
translated by 谷歌翻译
我们考虑通过旋转不变设计矩阵定义的广义线性模型中信号估计的问题。由于这些矩阵可以具有任意光谱分布,因此该模型非常适合于捕获在应用中经常出现的复杂相关结构。我们提出了一种新颖的近似消息,用于通过(AMP)算法用于信号估计,并且经由状态演进递归严格地表征其在高维极限中的性能。假设设计矩阵频谱的知识,我们的旋转不变放大器具有与高斯矩阵的现有放大器相同的顺序的复杂性;它还恢复现有的放大器作为一个特例。数值结果展示了靠近向量放大器的性能(在某些设置中猜测贝叶斯 - 最佳),但随着所提出的算法不需要计算昂贵的奇异值分解,可以获得更低的复杂性。
translated by 谷歌翻译
了解通过随机梯度下降(SGD)训练的神经网络的特性是深度学习理论的核心。在这项工作中,我们采取了平均场景,并考虑通过SGD培训的双层Relu网络,以实现一个非变量正则化回归问题。我们的主要结果是SGD偏向于简单的解决方案:在收敛时,Relu网络实现输入的分段线性图,以及“结”点的数量 - 即,Relu网络估计器的切线变化的点数 - 在两个连续的训练输入之间最多三个。特别地,随着网络的神经元的数量,通过梯度流的解决方案捕获SGD动力学,并且在收敛时,重量的分布方法接近相关的自由能量的独特最小化器,其具有GIBBS形式。我们的主要技术贡献在于分析了这一最小化器产生的估计器:我们表明其第二阶段在各地消失,除了代表“结”要点的一些特定地点。我们还提供了经验证据,即我们的理论预测的不同可能发生与数据点不同的位置的结。
translated by 谷歌翻译
最近的一项工作已经通过神经切线核(NTK)分析了深神经网络的理论特性。特别是,NTK的最小特征值与记忆能力,梯度下降算法的全球收敛性和深网的概括有关。但是,现有结果要么在两层设置中提供边界,要么假设对于多层网络,将NTK矩阵的频谱从0界限为界限。在本文中,我们在无限宽度和有限宽度的限制情况下,在最小的ntk矩阵的最小特征值上提供了紧密的界限。在有限宽度的设置中,我们认为的网络体系结构相当笼统:我们需要大致订购$ n $神经元的宽层,$ n $是数据示例的数量;剩余层宽度的缩放是任意的(取决于对数因素)。为了获得我们的结果,我们分析了各种量的独立兴趣:我们对隐藏特征矩阵的最小奇异值以及输入输出特征图的Lipschitz常数上的上限给出了下限。
translated by 谷歌翻译
Computational units in artificial neural networks follow a simplified model of biological neurons. In the biological model, the output signal of a neuron runs down the axon, splits following the many branches at its end, and passes identically to all the downward neurons of the network. Each of the downward neurons will use their copy of this signal as one of many inputs dendrites, integrate them all and fire an output, if above some threshold. In the artificial neural network, this translates to the fact that the nonlinear filtering of the signal is performed in the upward neuron, meaning that in practice the same activation is shared between all the downward neurons that use that signal as their input. Dendrites thus play a passive role. We propose a slightly more complex model for the biological neuron, where dendrites play an active role: the activation in the output of the upward neuron becomes optional, and instead the signals going through each dendrite undergo independent nonlinear filterings, before the linear combination. We implement this new model into a ReLU computational unit and discuss its biological plausibility. We compare this new computational unit with the standard one and describe it from a geometrical point of view. We provide a Keras implementation of this unit into fully connected and convolutional layers and estimate their FLOPs and weights change. We then use these layers in ResNet architectures on CIFAR-10, CIFAR-100, Imagenette, and Imagewoof, obtaining performance improvements over standard ResNets up to 1.73%. Finally, we prove a universal representation theorem for continuous functions on compact sets and show that this new unit has more representational power than its standard counterpart.
translated by 谷歌翻译