本文研究了拟牛顿方法求解强凸强凹鞍点问题(SPP)。我们提出了SPP一般贪婪Broyden族更新,其中有$明确的局部超线性收敛速度的变体{\mathcalØ}\大(\大(1\压裂{1}{N\卡帕^2}\大)^ {K(K-1)/ 2}\大)$,其中$N $是问题的尺寸,$ \卡帕$是条件数和$$ķ是迭代次数。设计和算法的分析是基于估计不定Hessian矩阵的平方,这是从在凸优化古典准牛顿方法的不同。我们还提出两个具体Broyden族算法与BFGS型和SR1型更新,其享受的$更快的局部收敛速度\mathcalØ\大(\大(1\压裂{1} {N}\大)^{K(K-1)/ 2}\大)$。
translated by 谷歌翻译
在本文中,我们研究并证明了拟牛顿算法的Broyden阶级的非渐近超线性收敛速率,包括Davidon - Fletcher - Powell(DFP)方法和泡沫 - 弗莱彻 - 夏诺(BFGS)方法。这些准牛顿方法的渐近超线性收敛率在文献中已经广泛研究,但它们明确的有限时间局部会聚率未得到充分调查。在本文中,我们为Broyden Quasi-Newton算法提供了有限时间(非渐近的)收敛分析,在目标函数强烈凸起的假设下,其梯度是Lipschitz连续的,并且其Hessian在最佳解决方案中连续连续。我们表明,在最佳解决方案的本地附近,DFP和BFGS生成的迭代以$(1 / k)^ {k / 2} $的超连线率收敛到最佳解决方案,其中$ k $是迭代次数。我们还证明了类似的本地超连线收敛结果,因为目标函数是自我协调的情况。几个数据集的数值实验证实了我们显式的收敛速度界限。我们的理论保证是第一个为准牛顿方法提供非渐近超线性收敛速率的效果之一。
translated by 谷歌翻译
We study the smooth minimax optimization problem $\min_{\bf x}\max_{\bf y} f({\bf x},{\bf y})$, where $f$ is $\ell$-smooth, strongly-concave in ${\bf y}$ but possibly nonconvex in ${\bf x}$. Most of existing works focus on finding the first-order stationary points of the function $f({\bf x},{\bf y})$ or its primal function $P({\bf x})\triangleq \max_{\bf y} f({\bf x},{\bf y})$, but few of them focus on achieving second-order stationary points. In this paper, we propose a novel approach for minimax optimization, called Minimax Cubic Newton (MCN), which could find an $\big(\varepsilon,\kappa^{1.5}\sqrt{\rho\varepsilon}\,\big)$-second-order stationary point of $P({\bf x})$ with calling ${\mathcal O}\big(\kappa^{1.5}\sqrt{\rho}\varepsilon^{-1.5}\big)$ times of second-order oracles and $\tilde{\mathcal O}\big(\kappa^{2}\sqrt{\rho}\varepsilon^{-1.5}\big)$ times of first-order oracles, where $\kappa$ is the condition number and $\rho$ is the Lipschitz continuous constant for the Hessian of $f({\bf x},{\bf y})$. In addition, we propose an inexact variant of MCN for high-dimensional problems to avoid calling expensive second-order oracles. Instead, our method solves the cubic sub-problem inexactly via gradient descent and matrix Chebyshev expansion. This strategy still obtains the desired approximate second-order stationary point with high probability but only requires $\tilde{\mathcal O}\big(\kappa^{1.5}\ell\varepsilon^{-2}\big)$ Hessian-vector oracle calls and $\tilde{\mathcal O}\big(\kappa^{2}\sqrt{\rho}\varepsilon^{-1.5}\big)$ first-order oracle calls. To the best of our knowledge, this is the first work that considers the non-asymptotic convergence behavior of finding second-order stationary points for minimax problems without the convex-concave assumptions.
translated by 谷歌翻译
我们介绍了一种牛顿型方法,可以从任何初始化和带有Lipschitz Hessians的任意凸面目标收敛。通过将立方规范化与某种自适应levenberg - Marquardt罚款合并来实现这一目标。特别地,我们表明由$ x ^ {k + 1} = x ^ k - \ bigl(\ nabla ^ 2 f(x ^ k)+ \ sqrt {h \ | \ nabla f(x ^ k)给出的迭代)\ |} \ mathbf {i} \ bigr)^ { - 1} \ nabla f(x ^ k)$,其中$ h> 0 $是一个常数,用$ \ mathcal {o}全球收敛(\ frac{1} {k ^ 2})$率。我们的方法是牛顿方法的第一个变体,具有廉价迭代和可怕的全球融合。此外,我们证明当目的强烈凸起时,本地我们的方法会收敛超连续。为了提高方法的性能,我们提供了一种不需要超参数的线路搜索程序,并且可提供高效。
translated by 谷歌翻译
我们分析了牛顿方法的变体的性能,并通过二次正则化来解决复合凸最小化问题。在我们方法的每个步骤中,我们选择正规化参数与当前点的梯度标准的某些功率成正比。我们介绍了一个以h \ h \“第二或第三个衍生物的较旧连续性为特征的问题类别。然后,我们使用简单的自适应搜索步骤介绍该方法,允许自动调整问题类,并以最佳的全球复杂性界限,而无需知道问题的特定参数。特别是,对于Lipschitz连续第三个导数的函数类别,我们获得了全局$ o(1/k^3)$ rate,以前归因于三阶张量方法。功能是均匀凸的,我们证明我们方案的自动加速度是合理的,导致全局速率和局部超线性收敛。不同的速率(sublinear,linear和superlinear)之间的切换是自动的。同样,没有先验的先验需要了解参数。
translated by 谷歌翻译
标准梯度下降(GDA) - 型算法只能在非凸极小优化中找到固定点,这比局部minimax点比局部最佳。在这项工作中,我们开发了GDA型算法,这些算法在非convex-rong-concave minimax优化中全球收敛到局部minimax点。我们首先观察到局部最小点等效于某个包膜函数的二阶固定点。然后,受到经典立方正则化算法的启发,我们提出了Cubic-GDA(一种用于查找局部最小值点的立方体规范化的GDA算法),并通过利用其内在潜在功能来提供全面的收敛分析。具体而言,我们以sublinear收敛速率建立了立方GDA与局部最小点的全球收敛。我们进一步分析了在局部梯度显性型非凸几何形状的整个频谱中立方GDA的渐近收敛速率,比标准GDA更快地建立秩序的渐近收敛速率。此外,我们提出了用于大规模最小优化的立方GDA的随机变体,并在随机子采样下表征其样品复杂性。
translated by 谷歌翻译
我们考虑使用梯度下降来最大程度地减少$ f(x)= \ phi(xx^{t})$在$ n \ times r $因件矩阵$ x $上,其中$ \ phi是一种基础平稳凸成本函数定义了$ n \ times n $矩阵。虽然只能在合理的时间内发现只有二阶固定点$ x $,但如果$ x $的排名不足,则其排名不足证明其是全球最佳的。这种认证全球最优性的方式必然需要当前迭代$ x $的搜索等级$ r $,以相对于级别$ r^{\ star} $过度参数化。不幸的是,过度参数显着减慢了梯度下降的收敛性,从$ r = r = r = r^{\ star} $的线性速率到$ r> r> r> r> r^{\ star} $,即使$ \ phi $是$ \ phi $强烈凸。在本文中,我们提出了一项廉价的预处理,该预处理恢复了过度参数化的情况下梯度下降回到线性的收敛速率,同时也使在全局最小化器$ x^{\ star} $中可能不良条件变得不可知。
translated by 谷歌翻译
We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given an oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as Subsampled Newton and Newton Sketch. Despite using second-order information, these existing methods do not exhibit superlinear convergence, unless the stochastic noise is gradually reduced to zero during the iteration, which would lead to a computational blow-up in the per-iteration cost. We propose to address this limitation with Hessian averaging: instead of using the most recent Hessian estimate, our algorithm maintains an average of all the past estimates. This reduces the stochastic noise while avoiding the computational blow-up. We show that this scheme exhibits local $Q$-superlinear convergence with a non-asymptotic rate of $(\Upsilon\sqrt{\log (t)/t}\,)^{t}$, where $\Upsilon$ is proportional to the level of stochastic noise in the Hessian oracle. A potential drawback of this (uniform averaging) approach is that the averaged estimates contain Hessian information from the global phase of the method, i.e., before the iterates converge to a local neighborhood. This leads to a distortion that may substantially delay the superlinear convergence until long after the local neighborhood is reached. To address this drawback, we study a number of weighted averaging schemes that assign larger weights to recent Hessians, so that the superlinear convergence arises sooner, albeit with a slightly slower rate. Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still exhibits a superlinear convergence rate nearly (up to a logarithmic factor) matching that of uniform Hessian averaging.
translated by 谷歌翻译
We analyze Newton's method with lazy Hessian updates for solving general possibly non-convex optimization problems. We propose to reuse a previously seen Hessian for several iterations while computing new gradients at each step of the method. This significantly reduces the overall arithmetical complexity of second-order optimization schemes. By using the cubic regularization technique, we establish fast global convergence of our method to a second-order stationary point, while the Hessian does not need to be updated each iteration. For convex problems, we justify global and local superlinear rates for lazy Newton steps with quadratic regularization, which is easier to compute. The optimal frequency for updating the Hessian is once every $d$ iterations, where $d$ is the dimension of the problem. This provably improves the total arithmetical complexity of second-order algorithms by a factor $\sqrt{d}$.
translated by 谷歌翻译
我们提出了一种基于优化的基于优化的框架,用于计算差异私有M估算器以及构建差分私立置信区的新方法。首先,我们表明稳健的统计数据可以与嘈杂的梯度下降或嘈杂的牛顿方法结合使用,以便分别获得具有全局线性或二次收敛的最佳私人估算。我们在局部强大的凸起和自我协调下建立当地和全球融合保障,表明我们的私人估算变为对非私人M估计的几乎最佳附近的高概率。其次,我们通过构建我们私有M估计的渐近方差的差异私有估算来解决参数化推断的问题。这自然导致近​​似枢轴统计,用于构建置信区并进行假设检测。我们展示了偏置校正的有效性,以提高模拟中的小样本实证性能。我们说明了我们在若干数值例子中的方法的好处。
translated by 谷歌翻译
在本文中,我们通过推断在歧管上的迭代来提出一种简单的加速度方案,用于利曼梯度方法。我们显示何时从Riemannian梯度下降法生成迭代元素,加速方案是渐近地达到最佳收敛速率,并且比最近提出的Riemannian Nesterov加速梯度方法在计算上更有利。我们的实验验证了新型加速策略的实际好处。
translated by 谷歌翻译
本文提出了一个新的算法系列,用于在线优化复合目标。该算法可以解释为凸起梯度和$ p $ - 纳米算法的组合。结合适应性和乐观的算法思想,所提出的算法获得了序列依赖的遗憾上限,与稀疏目标决策变量的最著名界限相匹配。此外,该算法具有对流行的复合目标和约束的有效实现,并且可以通过最佳加速速率转换为随机优化算法,以实现流畅的目标。
translated by 谷歌翻译
我们应用随机顺序二次编程(STOSQP)算法来求解受约束的非线性优化问题,在该问题是随机的,并且约束是确定性的。我们研究了一个完全随机的设置,其中每次迭代中只有一个样本可用于估计物镜的梯度和黑森州。我们允许stosqp选择一个随机架子$ \ bar {\ alpha} _t $适应性,使得$ \ beta_t \ leq \ leq \ bar {\ alpha} _t \ leq \ leq \ beta_t+beta_t+\ chi_t+\ chi_t $,wither = o(\ beta_t)$是预定的确定性序列。我们还允许STOSQP通过随机迭代求解器(例如,使用草图和项目方法)求解牛顿系统。而且我们不需要不精确的牛顿方向的近似误差即可消失。对于这个一般的STOSQP框架,我们建立了其最后一次迭代的渐近收敛速率,最差的案例迭代复杂性是副产品。我们执行统计推断。特别是,有了适当的衰减$ \ beta_t,\ chi_t $,我们表明:(i)STOSQP方案最多可以采用$ o(1/\ epsilon^4)$ iterations $ iterations $ iTerations以实现$ \ epsilon $ -Stationarity; (ii)几乎毫无疑问,$ \ |(x_t -x^\ star,\ lambda_t- \ lambda^\ star)\ | | = o(\ sqrt {\ beta_t \ log(1/\ beta_t)})+o(\ chi_t/\ beta_t)$,其中$(x_t,\ lambda_t)$是primal-dimal-dimal-dialal-dialal-dialal-dual stosqp itselmate; (iii)序列$ 1/\ sqrt {\ beta_t} \ cdot(x_t -x^\ star,\ lambda_t- \ lambda_t- \ lambda^\ star)$收敛到平均零高斯分布,具有非琐事的共价矩阵。此外,我们建立了$(x_t,\ lambda_t)$的Berry-Esseen,以定量地测量其分布功能的收敛性。我们还为协方差矩阵提供了实用的估计器,可以使用iTerates $ \ {(x_t,\ lambda_t)\} _ t $构建$(x^\ star,\ lambda^\ star)$的置信区间(x^\ star,\ lambda^\ star)$。我们的定理使用最可爱的测试集中的非线性问题验证。
translated by 谷歌翻译
通过在线规范相关性分析的问题,我们提出了\ emph {随机缩放梯度下降}(SSGD)算法,以最小化通用riemannian歧管上的随机功能的期望。 SSGD概括了投影随机梯度下降的思想,允许使用缩放的随机梯度而不是随机梯度。在特殊情况下,球形约束的特殊情况,在广义特征向量问题中产生的,我们建立了$ \ sqrt {1 / t} $的令人反感的有限样本,并表明该速率最佳最佳,直至具有积极的积极因素相关参数。在渐近方面,一种新的轨迹平均争论使我们能够实现局部渐近常态,其速率与鲁普特 - Polyak-Quaditsky平均的速率匹配。我们将这些想法携带在一个在线规范相关分析,从事文献中的第一次获得了最佳的一次性尺度算法,其具有局部渐近融合到正常性的最佳一次性尺度算法。还提供了用于合成数据的规范相关分析的数值研究。
translated by 谷歌翻译
在本文中,我们提出了SC-REG(自助正规化)来学习过共同的前馈神经网络来学习\ EMPH {牛顿递减}框架的二阶信息进行凸起问题。我们提出了具有自助正规化(得分-GGN)算法的广义高斯 - 牛顿,其每次接收到新输入批处理时都会更新网络参数。所提出的算法利用Hessian矩阵中的二阶信息的结构,从而减少训练计算开销。虽然我们的目前的分析仅考虑凸面的情况,但数值实验表明了我们在凸和非凸面设置下的方法和快速收敛的效率,这对基线一阶方法和准牛顿方法进行了比较。
translated by 谷歌翻译
Influence diagnostics such as influence functions and approximate maximum influence perturbations are popular in machine learning and in AI domain applications. Influence diagnostics are powerful statistical tools to identify influential datapoints or subsets of datapoints. We establish finite-sample statistical bounds, as well as computational complexity bounds, for influence functions and approximate maximum influence perturbations using efficient inverse-Hessian-vector product implementations. We illustrate our results with generalized linear models and large attention based models on synthetic and real data.
translated by 谷歌翻译
This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the wellknown convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free.Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the non-convex optimization community.
translated by 谷歌翻译
The implicit stochastic gradient descent (ISGD), a proximal version of SGD, is gaining interest in the literature due to its stability over (explicit) SGD. In this paper, we conduct an in-depth analysis of the two modes of ISGD for smooth convex functions, namely proximal Robbins-Monro (proxRM) and proximal Poylak-Ruppert (proxPR) procedures, for their use in statistical inference on model parameters. Specifically, we derive nonasymptotic point estimation error bounds of both proxRM and proxPR iterates and their limiting distributions, and propose on-line estimators of their asymptotic covariance matrices that require only a single run of ISGD. The latter estimators are used to construct valid confidence intervals for the model parameters. Our analysis is free of the generalized linear model assumption that has limited the preceding analyses, and employs feasible procedures. Our on-line covariance matrix estimators appear to be the first of this kind in the ISGD literature.* Equal contribution 1 Kakao Entertainment Corp.
translated by 谷歌翻译
我们研究了具有有限和结构的平滑非凸化优化问题的随机重新洗脱(RR)方法。虽然该方法在诸如神经网络的训练之类的实践中广泛利用,但其会聚行为仅在几个有限的环境中被理解。在本文中,在众所周知的Kurdyka-LojasiewiCz(KL)不等式下,我们建立了具有适当递减步长尺寸的RR的强极限点收敛结果,即,RR产生的整个迭代序列是会聚并会聚到单个静止点几乎肯定的感觉。 In addition, we derive the corresponding rate of convergence, depending on the KL exponent and the suitably selected diminishing step sizes.当KL指数在$ [0,\ FRAC12] $以$ [0,\ FRAC12] $时,收敛率以$ \ mathcal {o}(t ^ { - 1})$的速率计算,以$ t $ counting迭代号。当KL指数属于$(\ FRAC12,1)$时,我们的派生收敛速率是FORM $ \ MATHCAL {O}(T ^ { - Q})$,$ Q \ IN(0,1)$取决于在KL指数上。基于标准的KL不等式的收敛分析框架仅适用于具有某种阶段性的算法。我们对基于KL不等式的步长尺寸减少的非下降RR方法进行了新的收敛性分析,这概括了标准KL框架。我们总结了我们在非正式分析框架中的主要步骤和核心思想,这些框架是独立的兴趣。作为本框架的直接应用,我们还建立了类似的强极限点收敛结果,为重组的近端点法。
translated by 谷歌翻译
梯度下降(GDA)方法是生成对抗网络(GAN)中最小值优化的主流算法。 GDA的收敛特性引起了最近文献的重大兴趣。具体而言,对于$ \ min _ {\ mathbf {x}} \ max _ {\ mathbf {y}} f(\ mathbf {x}; \ m m缩y} $以及$ \ mathbf {x} $,(lin等,2020)中的nonConvex证明了GDA的收敛性,带有sptepize的比率$ \ eta _ {\ mathbf {y}}}}/\ eta _ { }} = \ theta(\ kappa^2)$ with $ \ eta _ {\ mathbf {x}} $和$ \ eta _ {\ eta _ {\ mathbf {y}} $是$ \ mathbf {x}} $和$ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \\ Mathbf {y} $和$ \ kappa $是$ \ mathbf {y} $的条件号。尽管该步骤大比表明对最小玩家进行缓慢的训练,但实用的GAN算法通常对两个变量采用类似的步骤,表明理论和经验结果之间存在较大差距。在本文中,我们的目标是通过分析常规\ emph {nonconvex-nonconcave} minimax问题的\ emph {local contergence}来弥合这一差距。我们证明,$ \ theta(\ kappa)$的得分比是必要且足够的,足以使GDA局部收敛到Stackelberg equilibrium,其中$ \ kappa $是$ \ mathbf {y} $的本地条件号。我们证明了与匹配的下限几乎紧密的收敛速率。我们进一步将收敛保证扩展到随机GDA和额外梯度方法(例如)。最后,我们进行了几项数值实验来支持我们的理论发现。
translated by 谷歌翻译