Smart Paper Notes

Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization

Daesung Kim , Hye Won Chung

Categories: Machine Learning (Statistics) | Machine Learning

2022-12-19

The nonconvex formulation of the matrix completion problem has received significant attention in recent years due to its affordable complexity compared to the convex formulation. Gradient descent (GD) is the simplest yet efficient baseline algorithm for solving nonconvex optimization problems. The success of GD has been witnessed in many different problems in both theory and practice when it is combined with random initialization. However, previous works on matrix completion require either careful initialization or regularizers to prove the convergence of GD. In this work, we study rank-1 symmetric matrix completion and prove that GD converges to the ground truth when small random initialization is used. We show that within a logarithmic number of iterations, the trajectory enters the region where local convergence occurs. We provide an upper bound on the initialization size that is sufficient to guarantee convergence and show that a larger initialization can be used as more samples become available. We observe that the implicit regularization effect of GD plays a critical role in the analysis, and that over the entire trajectory it prevents each entry from becoming much larger than the others.
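
The setting in this abstract can be sketched numerically. The snippet below is only an illustration under stated assumptions, not the authors' code: the loss scaling 1/(4p), the step size `eta`, the initialization scale `alpha`, the iteration count, and the ±1/√n ground-truth vector are all choices made here, and the function name `gd_rank1_completion` is hypothetical.

```python
import numpy as np

def gd_rank1_completion(M_obs, mask, p, alpha=1e-3, eta=0.1, iters=1000, seed=0):
    """Plain GD on f(x) = (1/(4p)) * ||mask * (x x^T - M)||_F^2 (assumed loss)."""
    rng = np.random.default_rng(seed)
    n = M_obs.shape[0]
    # Small random initialization: Gaussian vector with norm of order alpha.
    x = alpha * rng.standard_normal(n) / np.sqrt(n)
    for _ in range(iters):
        residual = mask * (np.outer(x, x) - M_obs)  # error on observed entries only
        x = x - eta * (residual @ x) / p            # gradient step (mask is symmetric)
    return x

# Synthetic instance: unit-norm ground truth with entries +-1/sqrt(n) (assumption).
rng = np.random.default_rng(1)
n, p = 100, 0.5
x_star = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)
upper = np.triu(rng.random((n, n)) < p)   # sample the upper triangle ...
mask = (upper | upper.T).astype(float)    # ... and mirror it so the mask is symmetric
M_obs = mask * np.outer(x_star, x_star)

x_hat = gd_rank1_completion(M_obs, mask, p)
# The ground truth is identifiable only up to a global sign.
err = min(np.linalg.norm(x_hat - x_star), np.linalg.norm(x_hat + x_star))
```

Here `err` measures the distance to the ground truth up to sign; neither a regularizer nor a spectral initialization is used in this sketch.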

Translated by Google Translate

Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction

Dominik Stöger , Mahdi Soltanolkotabi

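Categories: Machine Learning | Machine Learning (Statistics)

2021-06-28

Recently there has been significant theoretical progress on understanding the convergence and generalization of gradient-based methods on nonconvex losses with overparameterized models. Nevertheless, many aspects of optimization and generalization, and in particular the critical role of small random initialization, are not fully understood. In this paper, we take a step towards demystifying this role by proving that small random initialization followed by a few iterations of gradient descent behaves akin to popular spectral methods. We also show that this implicit spectral bias from small random initialization, which is provably more prominent for overparameterized models, also puts the gradient descent iterations on a particular trajectory towards solutions that are not only globally optimal but also generalize well. Concretely, we focus on the problem of reconstructing a low-rank matrix from a few measurements via a natural nonconvex formulation. In this setting, we show that the trajectory of the gradient descent iterations from small random initialization can be approximately decomposed into three phases: (I) a spectral or alignment phase, where we show that the iterates have an implicit spectral bias akin to spectral initialization, allowing us to show that at the end of this phase the column spaces of the iterates and of the underlying low-rank matrix are sufficiently aligned; (II) a saddle avoidance/refinement phase, where we show that the trajectory of the gradient iterates moves away from certain degenerate saddle points; and (III) a local refinement phase, where we show that after avoiding the saddles the iterates converge quickly to the underlying low-rank matrix. Underlying our analysis are insights for the analysis of overparameterized nonconvex optimization schemes that may have implications for computational problems beyond low-rank reconstruction.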

Algorithmic Regularization in Model-free Overparametrized Asymmetric Matrix Factorization

Liwei Jiang , Yudong Chen , Lijun Ding

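Categories: Machine Learning | Machine Learning (Statistics)

2022-03-06

We study the asymmetric matrix factorization problem under a natural nonconvex formulation with arbitrary overparametrization. We consider the model-free setting, with minimal assumptions on the rank or singular values of the observed matrix, in which the global optima provably overfit. We show that vanilla gradient descent with small random initialization sequentially recovers the principal components of the observed matrix. Consequently, when equipped with proper early stopping, gradient descent produces the best low-rank approximation of the observed matrix without explicit regularization. We provide a sharp characterization of the relationship between the approximation error, iteration complexity, initialization size, and step size. Our complexity bound is almost dimension-free and depends logarithmically on the approximation error, with significantly more lenient requirements on the step size and initialization than prior work. Our theoretical results provide accurate predictions of the behavior of gradient descent and show good agreement with numerical experiments.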

Big-Step-Little-Step: Efficient Gradient Methods for Objectives with Multiple Scales

Jonathan Kelner , Annie Marsden , Vatsal Sharan , Aaron Sidford , Gregory Valiant , Honglin Yuan

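Categories: Machine Learning | Machine Learning (Statistics)

2021-11-04

We provide new gradient-based methods for efficiently solving a broad class of ill-conditioned optimization problems. We consider the problem of minimizing a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ that is implicitly decomposable as the sum of $m$ unknown, non-interacting, smooth, strongly convex functions, and we provide a method that solves this problem with a number of gradient evaluations that scales (up to logarithmic factors) as the product of the square roots of the condition numbers of the components. This complexity bound (which we prove is nearly optimal) can improve almost exponentially on that of accelerated gradient methods, which grows as the square root of the condition number of $f$. Additionally, we provide efficient methods for solving stochastic, quadratic variants of this multiscale optimization problem. Rather than learn the decomposition of $f$ (which would be prohibitively expensive), our methods apply a clean recursive "Big-Step-Little-Step" interleaving of standard methods. The resulting algorithms use $\tilde{\mathcal{O}}(d m)$ space, are numerically stable, and open the door to a more fine-grained understanding of the complexity of convex optimization beyond condition number.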

Scaling and Scalability: Provable Nonconvex Low-Rank Tensor Estimation from Incomplete Measurements

Tian Tong , Cong Ma , Ashley Prater-Bennette , Erin Tripp , Yuejie Chi

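Categories: Machine Learning | Machine Learning (Statistics)

2021-04-29

Tensors, which provide a powerful and flexible model for representing multi-attribute data and multi-way interactions, play an indispensable role in modern data science across many fields of science and engineering. A fundamental task is to faithfully recover a tensor from highly incomplete measurements in a statistically and computationally efficient manner. Harnessing the low-rank structure of tensors in the Tucker decomposition, this paper develops a scaled gradient descent (ScaledGD) algorithm that directly recovers the tensor factors with tailored spectral initializations, and shows that it provably converges at a linear rate independent of the condition number of the ground truth tensor for two canonical problems, tensor completion and tensor regression, as soon as the sample size is above the order of $n^{3/2}$ (ignoring other parameter dependencies), where $n$ is the dimension of the tensor. This leads to an extremely scalable approach to low-rank tensor estimation compared with prior art, which suffers from at least one of the following drawbacks: extreme sensitivity to ill-conditioning, high per-iteration costs in memory and computation, or poor sample complexity guarantees. To the best of our knowledge, ScaledGD is the first algorithm that simultaneously achieves near-optimal statistical and computational complexity for low-rank tensor completion with the Tucker decomposition. Our algorithm highlights the power of appropriate preconditioning in accelerating nonconvex statistical estimation, where the iteration-varying preconditioners promote desirable invariance properties of the trajectory with respect to the underlying symmetry in low-rank tensor factorization.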

Preconditioned Gradient Descent for Overparameterized Nonconvex Burer--Monteiro Factorization with Global Optimality Certification

Gavin Zhang , Salar Fattahi , Richard Y. Zhang

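Categories: Machine Learning | Machine Learning (Statistics)

2022-06-07

We consider using gradient descent to minimize the nonconvex function $f(X) = \phi(XX^{T})$ over an $n \times r$ factor matrix $X$, where $\phi$ is an underlying smooth convex cost function defined over $n \times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally rank deficient, then its rank deficiency certifies it as globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate when $r = r^{\star}$ to a sublinear rate when $r > r^{\star}$, even when $\phi$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$.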

Alternating minimization for generalized rank one matrix sensing: Sharp predictions from a random initialization

Kabir Aladin Chandrasekher , Mengqi Lou , Ashwin Pananjady

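Categories: Machine Learning (Statistics)

2022-07-20

We consider the problem of estimating the factors of a rank-$1$ matrix from i.i.d. Gaussian, rank-$1$ measurements that are nonlinearly transformed and corrupted by noise. Considering two prototypical choices for the nonlinearity, we study the convergence properties of a natural alternating update rule for this nonconvex optimization problem starting from a random initialization. We show sharp convergence guarantees for a sample-split version of the algorithm by deriving a deterministic recursion that is accurate even in high-dimensional problems. Notably, while the infinite-sample population update is uninformative and suggests exact recovery in a single step, the algorithm (and our deterministic prediction) converges geometrically fast from a random initialization. Our sharp, non-asymptotic analysis also exposes several other fine-grained properties of this problem, including how the nonlinearity and noise level affect convergence behavior. On a technical level, our results are enabled by showing that the empirical error recursion can be predicted by our deterministic sequence within fluctuations of order $n^{-1/2}$ when each iteration is run with $n$ observations. Our technique leverages leave-one-out tools originating in the literature on high-dimensional $M$-estimation and provides an avenue for sharply analyzing higher-order iterative algorithms from a random initialization in other high-dimensional optimization problems with random data.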

Understanding Implicit Regularization in Over-Parameterized Single Index Model

Jianqing Fan , Zhuoran Yang , Mengxin Yu

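Categories: Machine Learning (Statistics) | Machine Learning

2020-07-16

In this paper, we leverage over-parameterization to design regularization-free algorithms for the high-dimensional single index model and provide theoretical guarantees for the induced implicit regularization phenomenon. Specifically, we study both vector and matrix single index models where the link function is nonlinear and unknown, the signal parameter is either a sparse vector or a low-rank symmetric matrix, and the response variable can be heavy-tailed. To gain a better understanding of the role played by implicit regularization without excess technicality, we assume that the distribution of the covariates is known a priori. For both the vector and matrix settings, we construct an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data. We propose to estimate the true parameter by applying regularization-free gradient descent to the loss function. When the initialization is close to the origin and the step size is sufficiently small, we prove that the obtained solution achieves minimax-optimal statistical rates of convergence in both the vector and matrix cases. In addition, our experimental results support our theoretical findings and demonstrate that our methods empirically outperform classical methods with explicit regularization in terms of both the $\ell_2$-statistical rate and variable selection consistency.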

Robust Matrix Completion with Heavy-tailed Noise

Bingyan Wang , Jianqing Fan

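Categories: Machine Learning | Machine Learning (Statistics)

2022-06-09

This paper studies low-rank matrix completion in the presence of heavy-tailed and possibly asymmetric noise, where we aim to estimate an underlying low-rank matrix given a set of highly incomplete noisy entries. Although the matrix completion problem has attracted much attention in the past decade, theoretical understanding is still lacking when the observations are contaminated by heavy-tailed noise. Prior theory falls short of explaining the empirical results and is unable to capture the optimal dependence of the estimation error on the noise level. In this paper, we adopt an adaptive Huber loss to accommodate heavy-tailed noise, which is robust against large and possibly asymmetric errors when the parameter in the loss function is carefully designed to balance the Huberization bias and robustness to outliers. We then propose an efficient nonconvex algorithm via a balanced low-rank Burer-Monteiro matrix factorization and gradient descent with robust spectral initialization. We prove that under merely a bounded second moment condition on the error distribution, rather than a sub-Gaussian assumption, the Euclidean error of the iterates generated by the proposed algorithm decreases geometrically fast until it reaches a minimax-optimal statistical estimation error, which has the same order as in the sub-Gaussian case. The key technique behind this significant advancement is a powerful leave-one-out analysis framework. Our simulation studies corroborate the theoretical results.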
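
For readers unfamiliar with the Huber loss mentioned above, the following minimal sketch shows the standard (textbook) definition and its derivative. The threshold name `tau` is an assumption made here, and the abstract does not say how the paper adapts it to the data, so it is a plain argument in this sketch.

```python
def huber_loss(residual, tau):
    """Standard Huber loss: quadratic for |residual| <= tau, linear beyond."""
    a = abs(residual)
    if a <= tau:
        return 0.5 * a * a
    return tau * a - 0.5 * tau * tau

def huber_grad(residual, tau):
    """Derivative with respect to the residual: the residual clipped to [-tau, tau]."""
    return max(-tau, min(tau, residual))

small = huber_loss(0.5, tau=1.0)   # inside the quadratic region
large = huber_loss(10.0, tau=1.0)  # inside the linear region
```

Because the derivative is clipped, a single very large residual contributes at most `tau` to a gradient step, which is the source of the robustness to heavy-tailed noise.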

Lower Bounds for the Convergence of Tensor Power Iteration on Random Overcomplete Models

Yuchen Wu , Kangjie Zhou

Categories: Machine Learning | Machine Learning (Statistics)

2022-11-07

Tensor decomposition serves as a powerful primitive in statistics and machine learning. In this paper, we focus on using power iteration to decompose an overcomplete random tensor. Past work studying the properties of tensor power iteration either requires a non-trivial data-independent initialization, or is restricted to the undercomplete regime. Moreover, several papers implicitly suggest that logarithmically many iterations (in terms of the input dimension) are sufficient for the power method to recover one of the tensor components. In this paper, we analyze the dynamics of tensor power iteration from random initialization in the overcomplete regime. Surprisingly, we show that polynomially many steps are necessary for convergence of tensor power iteration to any of the true components, which refutes the previous conjecture. On the other hand, our numerical experiments suggest that tensor power iteration successfully recovers tensor components for a broad range of parameters, even though it takes at least polynomially many steps to converge. To further complement our empirical evidence, we prove that a popular objective function for tensor decomposition is strictly increasing along the power iteration path. Our proof is based on the Gaussian conditioning technique, which has been applied to analyze the approximate message passing (AMP) algorithm. The major ingredient of our argument is a conditioning lemma that allows us to generalize AMP-type analysis to the non-proportional limit and polynomially many iterations of the power method.
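
The power iteration update itself is short. The sketch below assumes a symmetric order-3 tensor T = Σ_i a_i ⊗ a_i ⊗ a_i stored through its component matrix `A` (the tensor order and the name `tensor_power_iteration` are assumptions made here). The demo deliberately uses orthonormal components, an undercomplete case where convergence is well understood, rather than the overcomplete regime (more components than dimensions) that the paper analyzes.

```python
import numpy as np

def tensor_power_iteration(A, x0, iters=50):
    """Power iteration x <- T(I, x, x) / ||T(I, x, x)|| for T = sum_i a_i^{(x)3}.

    A has shape (k, d); its rows are the components a_i.
    """
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        y = A.T @ (A @ x) ** 2      # T(I, x, x) = sum_i <a_i, x>^2 a_i
        x = y / np.linalg.norm(y)   # renormalize to the unit sphere
    return x

# Demo with k = 5 orthonormal components in dimension d = 20 (illustrative choice).
rng = np.random.default_rng(0)
d, k = 20, 5
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
A = Q.T
x = tensor_power_iteration(A, rng.standard_normal(d))
overlaps = np.abs(A @ x)  # |<a_i, x>| for each component
```

To experiment with the regime discussed in the abstract, pass a random `A` with k > d; the number of iterations needed there is exactly what the paper's lower bound concerns.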

Implicit Regularization and Convergence for Weight Normalization

Xiaoxia Wu , Edgar Dobriban , Tongzheng Ren , Shanshan Wu , Zhiyuan Li , Suriya Gunasekar , Rachel Ward , Qiang Liu

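Categories: Machine Learning | Computer Vision | Machine Learning (Statistics)

2019-11-18

Normalization methods such as batch [Ioffe and Szegedy, 2015], weight [Salimans and Kingma, 2016], instance [Ulyanov et al., 2016], and layer normalization [Ba et al., 2016] have been widely used in modern machine learning. Here, we study the weight normalization (WN) method [Salimans and Kingma, 2016] and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least-squares regression. WN and rPGD reparametrize the weights with a scale g and a unit vector w, so the objective function becomes nonconvex. We show that this nonconvex formulation has beneficial regularization effects compared to gradient descent on the original objective. These methods adaptively regularize the weights and converge close to the minimum L2-norm solution, even for initializations far from zero. For certain step sizes of g and w, we show that they can converge close to the minimum-norm solution. This differs from the behavior of gradient descent, which converges to the minimum-norm solution only when started at a point in the range space of the feature matrix, and is thus more sensitive to initialization.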
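
The last claim, about plain gradient descent rather than WN or rPGD, is easy to check numerically. The sketch below covers only that baseline: for an overparametrized least-squares problem, GD started at zero reaches the minimum-norm interpolating solution, while GD started at a generic random point reaches a different interpolating solution with larger norm. The problem sizes, step size rule, and function name are assumptions made for illustration.

```python
import numpy as np

def gd_least_squares(X, y, w0, iters=3000):
    """Plain GD on 0.5 * ||X w - y||^2 with step size 1 / lambda_max(X^T X)."""
    eta = 1.0 / np.linalg.norm(X, 2) ** 2
    w = w0.copy()
    for _ in range(iters):
        w = w - eta * X.T @ (X @ w - y)
    return w

rng = np.random.default_rng(0)
n, d = 5, 20                       # fewer samples than features
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_min = np.linalg.pinv(X) @ y      # minimum L2-norm interpolating solution
w_from_zero = gd_least_squares(X, y, np.zeros(d))
w_from_rand = gd_least_squares(X, y, rng.standard_normal(d))
```

GD never changes the component of the iterate that is orthogonal to the rows of `X`, which is why the starting point matters: `w_from_rand` still fits the data but keeps the off-row-space part of its initialization.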

Low-rank matrix recovery with non-quadratic loss: projected gradient method and regularity projection oracle

Lijun Ding , Yuqian Zhang , Yudong Chen

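Categories: Machine Learning (Statistics) | Machine Learning

2020-08-31

Existing results for low-rank matrix recovery largely focus on quadratic loss, which enjoys favorable properties such as restricted strong convexity/smoothness (RSC/RSM) and good conditioning over all low-rank matrices. However, many interesting problems involve more general, non-quadratic losses, which do not satisfy such properties. For these problems, standard nonconvex approaches such as rank-constrained projected gradient descent (a.k.a. iterative hard thresholding) and Burer-Monteiro factorization can have poor empirical performance, and there is no satisfactory theory guaranteeing global and fast convergence for these algorithms. In this paper, we show that a critical component in provable low-rank recovery with non-quadratic loss is a regularity projection oracle. This oracle restricts iterates to low-rank matrices within an appropriate bounded set, over which the loss function is well behaved and satisfies a set of approximate RSC/RSM conditions. Accordingly, we analyze an (averaged) projected gradient method equipped with such an oracle, and prove that it converges globally and linearly. Our results apply to a wide range of non-quadratic low-rank estimation problems, including one-bit matrix sensing/completion, individualized rank aggregation, and, more broadly, generalized linear models with rank constraints.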

How to Escape Saddle Points Efficiently

Chi Jin , Rong Ge , Praneeth Netrapalli , Sham M. Kakade , Michael I. Jordan

Categories:

2017-03-02

This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free. Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the non-convex optimization community.
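
The idea of perturbing gradient descent near stationary points can be illustrated on a toy function. The sketch below is a simplified variant written for illustration, not the paper's exact procedure: it omits the paper's termination test, and the thresholds `g_thresh`, `radius`, `t_thresh`, the step size, and the test function f(x, y) = 0.5x² − 0.5y² + 0.25y⁴ (saddle at the origin, minima at (0, ±1)) are all assumptions chosen here.

```python
import numpy as np

def grad_f(z):
    """Gradient of f(x, y) = 0.5*x**2 - 0.5*y**2 + 0.25*y**4."""
    x, y = z
    return np.array([x, y ** 3 - y])

def perturbed_gd(z0, eta=0.1, g_thresh=1e-3, radius=1e-2,
                 t_thresh=50, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float).copy()
    last_kick = -t_thresh
    for t in range(iters):
        # When the gradient is tiny and no perturbation happened recently,
        # add noise drawn uniformly from a small ball around the iterate.
        if np.linalg.norm(grad_f(z)) <= g_thresh and t - last_kick >= t_thresh:
            direction = rng.standard_normal(z.size)
            direction /= np.linalg.norm(direction)
            z = z + radius * rng.random() ** (1.0 / z.size) * direction
            last_kick = t
        z = z - eta * grad_f(z)  # ordinary gradient step
    return z

z_plain = perturbed_gd([0.0, 0.0], g_thresh=-1.0)  # threshold never met: plain GD
z_pert = perturbed_gd([0.0, 0.0])                  # perturbed GD
```

Started exactly at the saddle, plain gradient descent never moves because the gradient is zero, while the perturbed variant leaves the saddle and settles within the perturbation radius of one of the two minima.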

Perturbation Analysis of Randomized SVD and its Applications to High-dimensional Statistics

Yichi Zhang , Minh Tang

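Categories: Machine Learning (Statistics)

2022-03-19

Randomized singular value decomposition (RSVD) is a class of computationally efficient algorithms for computing the truncated SVD of large data matrices. Given an $n \times n$ symmetric matrix $\mathbf{M}$, the prototypical RSVD algorithm outputs an approximation of the $k$ leading singular vectors of $\mathbf{M}$ by computing the SVD of $\mathbf{M}^{g} \mathbf{G}$; here $g \geq 1$ is an integer and $\mathbf{G} \in \mathbb{R}^{n \times k}$ is a random Gaussian sketching matrix. In this paper, we study the statistical properties of RSVD under a general "signal-plus-noise" framework, i.e., the observed matrix $\hat{\mathbf{M}}$ is assumed to be an additive perturbation of some true but unknown signal matrix $\mathbf{M}$. We first derive upper bounds for the $\ell_2$ (spectral norm) and $\ell_{2 \to \infty}$ (maximum row-wise $\ell_2$ norm) distances between the approximate singular vectors of $\hat{\mathbf{M}}$ and the true singular vectors of the signal matrix $\mathbf{M}$. These upper bounds depend on the signal-to-noise ratio (SNR) and the number of power iterations $g$. A phase transition phenomenon is observed in which a smaller SNR requires larger values of $g$ to guarantee convergence of the $\ell_2$ and $\ell_{2 \to \infty}$ distances. We also show that the thresholds for $g$ at which these phase transitions occur are sharp whenever the noise matrices satisfy a certain trace growth condition. Finally, we derive normal approximations for the row-wise fluctuations of the approximate singular vectors and the entrywise fluctuations of the approximate matrix. We illustrate our theoretical results by deriving nearly optimal performance guarantees for RSVD when applied to three statistical inference problems, namely community detection, matrix completion, and principal component analysis with missing data.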
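
The prototypical RSVD step described above (take the SVD of M^g G for a Gaussian sketch G) can be sketched in a few lines. The function name `rsvd_leading_vectors` is hypothetical, and the demo uses an exactly rank-k symmetric matrix without noise so that the outcome is easy to check; the paper's signal-plus-noise analysis concerns the case where a perturbed matrix is passed in instead.

```python
import numpy as np

def rsvd_leading_vectors(M, k, g=2, seed=0):
    """Approximate the k leading singular vectors of a symmetric M via SVD of M^g G."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    G = rng.standard_normal((n, k))            # Gaussian sketching matrix
    Y = np.linalg.matrix_power(M, g) @ G       # g power iterations applied to the sketch
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U                                   # n x k matrix with orthonormal columns

# Noise-free demo: symmetric rank-3 matrix with known leading subspace V.
rng = np.random.default_rng(1)
n, k = 100, 3
V, _ = np.linalg.qr(rng.standard_normal((n, k)))
M = V @ np.diag([5.0, 4.0, 3.0]) @ V.T
U = rsvd_leading_vectors(M, k, g=2)
# Spectral-norm distance between the two projection matrices.
subspace_gap = np.linalg.norm(U @ U.T - V @ V.T, 2)
```

For large matrices one would normally apply `M` to the sketch g times instead of forming `M**g` explicitly; the explicit power is used here only to mirror the formula in the abstract.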

Exact Matrix Completion via Convex Optimization

Emmanuel J. Candes , Benjamin Recht

Categories:

2008-05-29

We consider a problem of considerable practical interest: the recovery of a data matrix from a sampling of its entries. Suppose that we observe $m$ entries selected uniformly at random from a matrix $M$. Can we complete the matrix and recover the entries that we have not seen? We show that one can perfectly recover most low-rank matrices from what appears to be an incomplete set of entries. We prove that if the number $m$ of sampled entries obeys $m \geq C\, n^{1.2}\, r \log n$ for some positive numerical constant $C$, then with very high probability, most $n \times n$ matrices of rank $r$ can be perfectly recovered by solving a simple convex optimization program. This program finds the matrix with minimum nuclear norm that fits the data. The condition above assumes that the rank is not too large. However, if one replaces the 1.2 exponent with 1.25, then the result holds for all values of the rank. Similar results hold for arbitrary rectangular matrices as well. Our results are connected with the recent literature on compressed sensing, and show that objects other than signals and images can be perfectly reconstructed from very limited information.
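
The convex program referred to above, finding the matrix of minimum nuclear norm that fits the data, is usually written as follows. The symbol $\Omega$ for the set of observed index pairs is notation introduced here for illustration; $\|X\|_*$ denotes the nuclear norm, i.e. the sum of the singular values $\sigma_k(X)$.

```latex
\begin{aligned}
\text{minimize}\quad   & \|X\|_* = \sum_{k} \sigma_k(X) \\
\text{subject to}\quad & X_{ij} = M_{ij}, \qquad (i,j) \in \Omega .
\end{aligned}
```

The nuclear norm plays the role of a convex surrogate for the rank, in the same way that the $\ell_1$ norm stands in for sparsity in compressed sensing.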

How and When Random Feedback Works: A Case Study of Low-Rank Matrix Factorization

Shivam Garg , Santosh S. Vempala

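Categories: Neural and Evolutionary Computing | Machine Learning | Machine Learning (Statistics)

2021-11-17

The success of gradient descent in ML, and especially for learning neural networks, is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equi-weighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bio-plausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure: low-rank matrix factorization. In this problem, given a matrix $Y_{n \times m}$, the goal is to find a low-rank factorization $Z_{n \times r} W_{r \times m}$ that minimizes the error $\|ZW - Y\|_F$. Gradient descent solves this problem optimally. We show that FA converges to the optimal solution when $r \ge \mbox{rank}(Y)$. We also shed light on how FA works. It is observed empirically that the forward weight matrices and the (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates the convergence of FA. We also show that FA can be far from optimal when $r < \mbox{rank}(Y)$. This is the first provable separation result between gradient descent and FA. Moreover, the representations found by gradient descent and FA can be almost orthogonal even when their errors $\|ZW - Y\|_F$ are approximately equal.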

Finite Sample Identification of Wide Shallow Neural Networks with Biases

Massimo Fornasier , Timo Klock , Marco Mondelli , Michael Rauchensteiner

Categories: Machine Learning | Machine Learning (Statistics)

2022-11-08

Artificial neural networks are functions depending on a finite number of parameters typically encoded as weights and biases. The identification of the parameters of the network from finite samples of input-output pairs is often referred to as the \emph{teacher-student model}, and this model has represented a popular framework for understanding training and generalization. Even if the problem is NP-complete in the worst case, a rapidly growing literature -- after adding suitable distributional assumptions -- has established finite sample identification of two-layer networks with a number of neurons $m=\mathcal O(D)$, $D$ being the input dimension. For the range $D<m<D^2$ the problem becomes harder, and truly little is known for networks parametrized by biases as well. This paper fills the gap by providing constructive methods and theoretical guarantees of finite sample identification for such wider shallow networks with biases. Our approach is based on a two-step pipeline: first, we recover the direction of the weights, by exploiting second order information; next, we identify the signs by suitable algebraic evaluations, and we recover the biases by empirical risk minimization via gradient descent. Numerical results demonstrate the effectiveness of our approach.

A Non-Asymptotic Framework for Approximate Message Passing in Spiked Models

Gen Li , Yuting Wei

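Categories: Machine Learning | Machine Learning (Statistics)

2022-08-05

Approximate message passing (AMP) is an effective iterative paradigm for solving high-dimensional statistical problems. However, existing AMP theory falls short of predicting the AMP dynamics once the number of iterations exceeds $o\big(\frac{\log n}{\log\log n}\big)$ (with $n$ the problem dimension). To address this inadequacy, this paper develops a non-asymptotic framework for understanding AMP in spiked matrix estimation. Building on a new decomposition of the AMP updates and controllable residual terms, we lay out an analysis recipe to characterize the finite-sample behavior of AMP in the presence of an independent initialization, which is further generalized to allow for spectral initialization. As two concrete consequences of the proposed analysis recipe: (i) when solving $\mathbb{Z}_2$ synchronization, we predict the behavior of spectrally initialized AMP for up to $O\big(\frac{n}{\mathrm{poly}\log n}\big)$ iterations, showing that the algorithm succeeds without the need for a subsequent refinement stage (as recently conjectured by Celentano et al., 2021); (ii) we characterize the non-asymptotic behavior of AMP in sparse PCA (in the spiked Wigner model) for a broad range of signal-to-noise ratios.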

The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training

Andrea Montanari , Yiqiao Zhong

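Categories: Machine Learning (Statistics) | Machine Learning

2020-07-25

Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if the actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and that they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd \gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd \gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression, including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a "self-induced" term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).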

How Does Adaptive Optimization Impact Local Neural Network Geometry?

Kaiqi Jiang , Dhruv Malik , Yuanzhi Li

Categories: Machine Learning | Machine Learning (Statistics)

2022-11-04

Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint in optimization, particularly in convex optimization, explains this improved performance by arguing that, unlike vanilla gradient schemes, adaptive algorithms mimic the behavior of a second-order method by adapting to the global geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a local trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce $R^{\text{OPT}}_{\text{med}}$, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments, we show that adaptive methods such as Adam bias the trajectories towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, where one might expect faster convergence. By contrast, vanilla gradient methods like SGD bias the trajectories towards regions where $R^{\text{SGD}}_{\text{med}}$ is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network. We view our findings as evidence for the need of a new explanation of the success of adaptive methods, one that is different than the conventional wisdom.
