We introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. We show that an approximate version of the bilevel problem can be solved by taking into explicit account the optimization dynamics for the inner objective. Depending on the specific setting, the outer variables take either the meaning of hyperparameters in a supervised learning problem or parameters of a meta-learner. We provide sufficient conditions under which solutions of the approximate problem converge to those of the exact problem. We instantiate our approach for meta-learning in the case of deep learning where representation layers are treated as hyperparameters shared across a set of training episodes. In experiments, we confirm our theoretical findings, present encouraging results for few-shot learning and contrast the bilevel approach against classical approaches for learning-to-learn.
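As a minimal sketch of what "taking into explicit account the optimization dynamics for the inner objective" can look like, the snippet below differentiates an outer loss through a fixed number of inner gradient-descent steps in forward mode. The scalar objectives, the function name `unrolled_hypergradient`, and all constants are illustrative assumptions, not the paper's setup.

```python
def unrolled_hypergradient(lam, a=2.0, b=1.0, eta=0.1, steps=500):
    """Differentiate an outer loss through `steps` inner gradient-descent updates.

    Toy inner objective (assumed): 0.5*(w - a)**2 + 0.5*lam*w**2, minimized over w.
    Toy outer objective (assumed): 0.5*(w_T - b)**2, a function of lam through w_T.
    """
    w, dw_dlam = 0.0, 0.0  # inner iterate and its derivative with respect to lam
    for _ in range(steps):
        grad_w = (w - a) + lam * w
        # Chain rule applied to the update w - eta*grad_w (uses the pre-update w).
        dw_dlam = dw_dlam - eta * ((1.0 + lam) * dw_dlam + w)
        w = w - eta * grad_w
    outer_loss = 0.5 * (w - b) ** 2
    hypergrad = (w - b) * dw_dlam  # d(outer_loss)/d(lam) via the unrolled dynamics
    return w, outer_loss, hypergrad


w_T, loss, hg = unrolled_hypergradient(lam=0.5)
```

For this toy problem the inner minimizer is `a / (1 + lam)` in closed form, so the unrolled hypergradient can be checked against the exact one as the number of inner steps grows.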
translated by Google Translate
Many meta-learning approaches for few-shot learning rely on simple base learners such as nearest-neighbor classifiers. However, even in the few-shot regime, discriminatively trained linear predictors can offer better generalization. We propose to use these predictors as base learners to learn representations for few-shot learning and show they offer better tradeoffs between feature size and performance across a range of few-shot recognition benchmarks. Our objective is to learn feature embeddings that generalize well under a linear classification rule for novel categories. To efficiently solve the objective, we exploit two properties of linear classifiers: implicit differentiation of the optimality conditions of the convex problem and the dual formulation of the optimization problem. This allows us to use high-dimensional embeddings with improved generalization at a modest increase in computational overhead. Our approach, named MetaOptNet, achieves state-of-the-art performance on miniImageNet, tieredImageNet, CIFAR-FS, and FC100 few-shot learning benchmarks. Our code is available online.
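In recent years, a variety of gradient-based methods have been developed to solve Bi-Level Optimization (BLO) problems in machine learning and computer vision. However, the theoretical correctness and practical effectiveness of these existing approaches always rely on restrictive conditions (e.g., Lower-Level Singleton, LLS), which can hardly be satisfied in real-world applications. Moreover, previous literature only proves theoretical results for specific iteration strategies and thus lacks a general recipe for uniformly analyzing the convergence behavior of different gradient-based BLO methods. In this work, we formulate BLO from an optimistic bi-level viewpoint and establish a new gradient-based algorithmic framework, named Bi-level Descent Aggregation (BDA), to partially address the above issues. Specifically, BDA provides a modularized structure that hierarchically aggregates the upper- and lower-level subproblems to generate our bi-level iterative dynamics. Theoretically, we establish a general convergence analysis template and derive a new proof recipe to investigate the essential theoretical properties of gradient-based BLO methods. Furthermore, this work systematically explores the convergence behavior of BDA in different optimization scenarios, i.e., considering the various solution qualities (global, local, or stationary solutions) returned from solving the approximation subproblems. Extensive experiments justify our theoretical results and demonstrate the superiority of the proposed algorithm on hyperparameter optimization and meta-learning tasks. Source code is available at https://github.com/vis-opt-group/bda.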
A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner-loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.
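Model-agnostic meta-learning (MAML), a popular gradient-based meta-learning framework, assumes that every task or instance contributes equally to the meta-learner. As a result, it cannot address the domain shift between base and novel classes in few-shot learning. In this work, we propose a novel robust meta-learning algorithm, NestedMAML, which learns to assign weights to training tasks or instances. We treat the weights as hyperparameters and iteratively optimize them using a small set of validation tasks in a nested bi-level optimization approach (in contrast to the standard bi-level optimization in MAML). We then apply NestedMAML in the meta-training stage, which involves (1) several tasks sampled from a distribution different from the meta-test task distribution, or (2) some data samples with noisy labels. Extensive experiments on synthetic and real-world datasets demonstrate that NestedMAML effectively mitigates the effects of "unwanted" tasks or instances, leading to significant improvement over state-of-the-art robust meta-learning methods.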
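Finding the optimal hyperparameters of a model can be cast as a bilevel optimization problem, typically solved using zero-order techniques. In this work we study first-order methods when the inner optimization problem is convex but non-smooth. We show that forward-mode differentiation of proximal gradient descent and proximal coordinate descent yields sequences of Jacobians that converge to the exact Jacobian. Using implicit differentiation, we show that it is possible to leverage the non-smoothness of the inner problem to speed up the computation. Finally, we provide a bound on the error made on the hypergradient when the inner optimization problem is solved approximately. Results on regression and classification problems reveal computational benefits for hyperparameter optimization, especially when multiple hyperparameters are required.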
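A minimal sketch of forward-mode differentiation of proximal gradient descent, assuming a one-dimensional Lasso-type inner problem `0.5*(w - y)**2 + lam*abs(w)`: the Jacobian of the iterate with respect to `lam` is propagated alongside the iterate itself. The function name, step size `gamma`, and iteration count are illustrative assumptions.

```python
def prox_gd_with_jacobian(y, lam, gamma=0.5, steps=60):
    """Proximal gradient descent on 0.5*(w - y)**2 + lam*abs(w), tracking dw/dlam."""
    w, jac = 0.0, 0.0  # iterate and its derivative with respect to lam
    for _ in range(steps):
        z = w - gamma * (w - y)      # gradient step on the smooth part
        dz = (1.0 - gamma) * jac     # derivative of z with respect to lam
        tau = gamma * lam            # soft-thresholding level
        if abs(z) > tau:
            sign = 1.0 if z > 0 else -1.0
            w = z - sign * tau       # soft-thresholding, active branch
            jac = dz - gamma * sign  # chain rule through the prox
        else:
            w, jac = 0.0, 0.0        # thresholded to zero: locally constant in lam
    return w, jac


w_hat, dw_dlam = prox_gd_with_jacobian(y=2.0, lam=0.5)
```

In this toy case the exact solution is the soft-thresholding of `y` at level `lam`, whose derivative with respect to `lam` is `-sign(y)` on the support and `0` off it, so the limit of the propagated Jacobian can be checked directly. The zero branch also hints at why non-smoothness can help: coordinates outside the support contribute nothing to the Jacobian.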
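Gradient-based optimization methods for hyperparameter tuning guarantee theoretical convergence to stationary solutions when, for fixed upper-level variable values, the lower level of the bilevel program is strongly convex (LLSC) and smooth (LLS). This condition is not satisfied for bilevel programs arising from tuning hyperparameters in many machine learning algorithms. In this work, we develop a sequentially convergent Value Function based Difference-of-Convex Algorithm with inexactness (VF-iDCA). We show that this algorithm achieves stationary solutions without the LLSC and LLS assumptions for bilevel programs from a broad class of hyperparameter tuning applications. Our extensive experiments confirm our theoretical findings and show that the proposed VF-iDCA yields superior performance when applied to tune hyperparameters.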
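We analyze a general class of bilevel problems in which the upper-level problem consists of minimizing a smooth objective function and the lower-level problem is to find the fixed point of a smooth contraction map. This type of problem includes instances of meta-learning, equilibrium models, hyperparameter optimization, and data poisoning adversarial attacks. Several recent works have proposed algorithms that warm-start the lower-level problem, i.e., they use the previous lower-level approximate solution as a starting point for the lower-level solver. This warm-start procedure improves the sample complexity in both the stochastic and deterministic settings, achieving in some cases the order-wise optimal sample complexity. However, there are situations, e.g., meta-learning and equilibrium models, in which the warm-start procedure is not well suited or is ineffective. In this work we show that, without warm-start, it is still possible to achieve order-wise optimal or near-optimal sample complexity. In particular, we propose a simple method that uses stochastic fixed-point iterations at the lower level and projected inexact gradient descent at the upper level, and that reaches an $\epsilon$-stationary point using $O(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-1})$ samples in the stochastic and deterministic settings, respectively. Finally, compared to methods using warm-start, our approach yields a simpler analysis that does not need to study the coupled interactions between the upper-level and lower-level iterates.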
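The scheme described above (fixed-point iterations at the lower level, projected inexact gradient descent at the upper level, no warm-start) can be sketched in the deterministic scalar case. The contraction map `Phi(w, lam) = 0.5*w + lam`, the upper-level objective `0.5*(w - 1)**2`, the feasible interval, and all function names are assumptions chosen so that the answer is known in closed form.

```python
def solve_lower_level(lam, iters=30):
    """Fixed-point iterations w <- Phi(w, lam) = 0.5*w + lam from a cold start w = 0."""
    w = 0.0
    for _ in range(iters):
        w = 0.5 * w + lam
    return w


def approximate_hypergradient(lam, iters=30):
    """Inexact hypergradient of 0.5*(w(lam) - 1)**2, where w(lam) is the fixed point."""
    w = solve_lower_level(lam, iters)
    grad_w = w - 1.0  # derivative of the upper-level objective with respect to w
    # Solve v = (dPhi/dw) * v + grad_w by fixed-point iterations, also from a cold start.
    v = 0.0
    for _ in range(iters):
        v = 0.5 * v + grad_w
    return 1.0 * v    # multiply by dPhi/dlam, which equals 1 for this map


def projected_inexact_gd(lam0=0.0, step=0.1, outer_iters=100, lo=0.0, hi=1.0):
    """Projected gradient descent on lam over [lo, hi] using the inexact hypergradient."""
    lam = lam0
    for _ in range(outer_iters):
        lam = lam - step * approximate_hypergradient(lam)
        lam = min(max(lam, lo), hi)  # projection onto the interval
    return lam


lam_hat = projected_inexact_gd()
```

Here the fixed point is `w = 2*lam`, so the upper-level objective is minimized at `lam = 0.5`. Each outer iteration restarts the lower-level solver and the linear-system solver from zero, which is the "no warm-start" aspect of the sketch.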
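Bilevel optimization, the problem of minimizing a value function that involves the arg-minimum of another function, appears in many areas of machine learning. In a large-scale empirical risk minimization setting where the number of samples is huge, it is crucial to develop stochastic methods, which use only a few samples at a time to make progress. However, computing the gradient of the value function involves solving a linear system, which makes it difficult to derive unbiased stochastic estimates. To overcome this problem, we introduce a novel framework in which the solution of the inner problem, the solution of the linear system, and the main variable evolve at the same time. These directions are written as a sum, making it straightforward to derive unbiased estimates. The simplicity of our approach allows us to develop global variance reduction algorithms, in which the dynamics of all variables are subject to variance reduction. We demonstrate that SABA, an adaptation of the celebrated SAGA algorithm to our framework, has an $O(\frac{1}{T})$ convergence rate and achieves linear convergence under the Polyak-Łojasiewicz assumption. This is the first stochastic algorithm for bilevel optimization that verifies either of these properties. Numerical experiments validate the usefulness of our method.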
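Bilevel optimization has found extensive applications in modern machine learning problems such as hyperparameter optimization, neural architecture search, and meta-learning. While bilevel problems with a unique inner minimal point (e.g., where the inner function is strongly convex) are well understood, bilevel problems with multiple inner minimal points remain challenging and open. Existing algorithms designed for such problems are applicable only to restricted situations and do not come with a full guarantee of convergence. In this paper, we adopt a reformulation of bilevel optimization as constrained optimization and solve the problem via a primal-dual bilevel optimization (PDBO) algorithm. PDBO not only addresses the challenge of multiple inner minima but also features fully first-order efficiency, without involving second-order Hessian and Jacobian computations, in contrast to most existing gradient-based bilevel algorithms. We further characterize the convergence rate of PDBO, which serves as the first known non-asymptotic convergence guarantee for bilevel optimization with multiple inner minima. Our experiments demonstrate the desired performance of the proposed approach.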
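In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, in which we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are not satisfactory: they are either asymptotic, or the convergence rates are slow and sub-optimal. To address this issue, in this paper we introduce a generalization of the Frank-Wolfe (FW) method to solve the considered problem. The main idea of our method is to locally approximate the solution set of the lower-level problem via a cutting plane and then run a FW-type update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires $\mathcal{O}(\max\{1/\epsilon_f, 1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires $\mathcal{O}(\max\{1/\epsilon_f^2, 1/(\epsilon_f \epsilon_g)\})$ iterations to find an $(\epsilon_f, \epsilon_g)$-optimal solution. We further prove stronger convergence guarantees under a Hölderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered class of bilevel problems. We also present numerical experiments to showcase the superior performance of our method compared with state-of-the-art methods.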
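In this paper, we consider the framework of multi-task representation (MTR) learning, where the goal is to use source tasks to learn a representation that reduces the sample complexity of solving a target task. We start by reviewing recent advances in MTR theory and show that they can provide novel insights for popular meta-learning algorithms when analyzed within this framework. In particular, we highlight a fundamental difference between gradient-based and metric-based algorithms in practice and put forward a theoretical analysis to explain it. Finally, we use the derived insights to improve the performance of meta-learning methods via a new spectral-based regularization term and confirm its efficiency through experimental studies on few-shot classification benchmarks. To the best of our knowledge, this is the first contribution that puts the most recent learning bounds of MTR theory into practice for the task of few-shot classification.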
Two-level stochastic optimization formulations have become instrumental in a number of machine learning contexts such as continual learning, neural architecture search, adversarial learning, and hyperparameter tuning. Practical stochastic bilevel optimization problems become challenging in optimization or learning scenarios where the number of variables is high or there are constraints. In this paper, we introduce a bilevel stochastic gradient method for bilevel problems with lower-level constraints. We also present a comprehensive convergence theory that covers all inexact calculations of the adjoint gradient (also called hypergradient) and addresses both the lower-level unconstrained and constrained cases. To promote the use of bilevel optimization in large-scale learning, we introduce a practical bilevel stochastic gradient method (BSG-1) that does not require second-order derivatives and, in the lower-level unconstrained case, dismisses any system solves and matrix-vector products.
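Bilevel optimization (BO) is useful for solving a variety of important machine learning problems including, but not limited to, hyperparameter optimization, meta-learning, continual learning, and reinforcement learning. Conventional BO methods need to differentiate through the lower-level optimization process with implicit differentiation, which requires expensive calculations related to the Hessian matrix. There has been a recent quest for first-order methods for BO, but the methods proposed to date tend to be complicated and impractical for large-scale deep learning applications. In this work, we propose a simple first-order BO algorithm that depends only on first-order gradient information, requires no implicit differentiation, and is practical and efficient for large-scale non-convex functions. We provide a non-asymptotic convergence analysis of the proposed method to stationary points for non-convex objectives and present empirical results that show its superior practical performance.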
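We introduce SubGD, a novel few-shot learning method based on the recent finding that stochastic gradient descent updates tend to live in a low-dimensional parameter subspace. In experimental and theoretical analyses, we show that models confined to a suitable predefined subspace generalize well for few-shot learning. A suitable subspace fulfills three criteria across the given tasks: it (a) allows the training error to be reduced by gradient flow, (b) leads to models that generalize well, and (c) can be identified by stochastic gradient descent. SubGD identifies these subspaces from an eigendecomposition of the auto-correlation matrix of update directions across different tasks. Demonstrably, we can identify low-dimensional suitable subspaces for few-shot learning of dynamical systems whose varying properties are described by one or a few parameters of the analytical system description. Such systems are ubiquitous among real-world applications in science and engineering. We experimentally corroborate the advantages of SubGD on three distinct dynamical systems problem settings, significantly outperforming popular few-shot learning methods in terms of both sample efficiency and performance.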
We consider minimizing the average of a very large number of smooth and possibly non-convex functions. This optimization problem has received much attention in the past years due to its many applications in different fields, the most challenging being training Machine Learning models. Widely used approaches for solving this problem are mini-batch gradient methods which, at each iteration, update the decision vector moving along the gradient of a mini-batch of the component functions. We consider the Incremental Gradient (IG) and the Random Reshuffling (RR) methods, which proceed in cycles, picking batches in a fixed order or by reshuffling the order after each epoch. Convergence properties of these schemes have been proved under different assumptions, usually quite strong. We aim to define ease-controlled modifications of the IG/RR schemes, which require a light additional computational effort and can be proved to converge under very weak and standard assumptions. In particular, we define two algorithmic schemes, monotone or non-monotone, in which the IG/RR iteration is controlled by using a watchdog rule and a derivative-free line search that activates only sporadically to guarantee convergence. The two schemes also allow controlling the updating of the stepsize used in the main IG/RR iteration, avoiding the use of preset rules. We prove convergence under the sole assumption of Lipschitz continuity of the gradients of the component functions and perform extensive computational analysis using Deep Neural Architectures and a benchmark of datasets. We compare our implementation with both full batch gradient methods and online standard implementation of IG/RR methods, showing that the computational effort is comparable with the corresponding online methods and that the control on the learning rate may allow faster decrease.
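To make the difference between the two baseline schemes concrete, the sketch below runs plain IG (fixed order) and plain RR (order reshuffled every epoch) on a toy sum of scalar quadratics. It does not implement the watchdog rule or the derivative-free line search; the preset `1/(t+1)` stepsize, the component functions, and the function name `run_epochs` are assumptions for illustration.

```python
import random


def run_epochs(centers, epochs, reshuffle, seed=0):
    """Minimize the average of f_i(x) = 0.5*(x - centers[i])**2 one component at a time.

    reshuffle=False gives Incremental Gradient (fixed order);
    reshuffle=True gives Random Reshuffling (new order each epoch).
    """
    rng = random.Random(seed)
    order = list(range(len(centers)))
    x, t = 0.0, 0
    for _ in range(epochs):
        if reshuffle:
            rng.shuffle(order)
        for i in order:
            stepsize = 1.0 / (t + 1)         # preset diminishing rule (assumed)
            x -= stepsize * (x - centers[i])  # gradient step on component i only
            t += 1
    return x


centers = [0.0, 1.0, 4.0, 7.0]
x_ig = run_epochs(centers, epochs=5, reshuffle=False)
x_rr = run_epochs(centers, epochs=5, reshuffle=True)
```

With this particular stepsize the iterate is the running mean of the components visited so far, so both schemes land on the minimizer of the average (the mean of `centers`) at the end of every epoch, up to floating-point error. On less benign problems the stepsize rule matters far more, which is the kind of sensitivity the controlled schemes are meant to reduce.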
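Many real-world optimization problems involve uncertain parameters with probability distributions that can be estimated using contextual feature information. In contrast to the standard approach of first estimating the distribution of the uncertain parameters and then optimizing the objective based on the estimate, we propose an integrated conditional estimation-optimization (ICEO) framework that estimates the underlying conditional distribution of the random parameter while considering the structure of the optimization problem. We directly model the relationship between the conditional distribution of the random parameter and the contextual features, and then estimate the probabilistic model with an objective that aligns with the downstream optimization problem. We show that our ICEO approach is asymptotically consistent under moderate regularity conditions and further provide finite-sample performance guarantees in the form of generalization bounds. Computationally, performing estimation with the ICEO approach is a non-convex and often non-differentiable optimization problem. We propose a general methodology for approximating the potentially non-differentiable mapping from the estimated conditional distribution to the optimal decision by a differentiable function, which greatly improves the performance of gradient-based algorithms applied to the non-convex problem. We also provide a polynomial optimization solution approach in the semi-algebraic case. Numerical experiments are also conducted to show the empirical success of our approach in different situations, including limited data samples and model mismatch.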
A central capability of intelligent systems is the ability to continuously build upon previous experiences to speed up and enhance learning of new tasks. Two distinct research paradigms have studied this question. Meta-learning views this problem as learning a prior over model parameters that is amenable for fast adaptation on a new task, but typically assumes the tasks are available together as a batch. In contrast, online (regret based) learning considers a setting where tasks are revealed one after the other, but conventionally trains a single model without task-specific adaptation. This work introduces an online meta-learning setting, which merges ideas from both paradigms to better capture the spirit and practice of continual lifelong learning. We propose the follow the meta leader (FTML) algorithm which extends the MAML algorithm to this setting. Theoretically, this work provides an O(log T) regret guarantee with one additional higher order smoothness assumption (in comparison to the standard online setting). Our experimental evaluation on three different large-scale problems suggests that the proposed algorithm significantly outperforms alternatives based on traditional online learning approaches.
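We study a class of algorithms for solving bilevel optimization problems in both stochastic and deterministic settings when the inner-level objective is strongly convex. Specifically, we consider algorithms based on inexact implicit differentiation, and we exploit a warm-start strategy to amortize the estimation of the exact gradient. We then introduce a unified theoretical framework, inspired by the study of singularly perturbed systems (Habets, 1974), to analyze such amortized algorithms. Using this framework, our analysis shows that these algorithms match the computational complexity of oracle methods that have access to an unbiased estimate of the gradient, thus outperforming many existing results for bilevel optimization. We illustrate these findings on synthetic experiments and demonstrate the efficiency of these algorithms on hyperparameter optimization experiments involving several thousand variables.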
We describe an algorithm that learns two-layer residual units using rectified linear unit (ReLU) activation: suppose the input $\mathbf{x}$ is from a distribution with support space $\mathbb{R}^d$ and the ground-truth generative model is a residual unit of this type, given by $\mathbf{y} = \boldsymbol{B}^\ast\left[\left(\boldsymbol{A}^\ast\mathbf{x}\right)^+ + \mathbf{x}\right]$, where ground-truth network parameters $\boldsymbol{A}^\ast \in \mathbb{R}^{d\times d}$ represent a full-rank matrix with nonnegative entries and $\boldsymbol{B}^\ast \in \mathbb{R}^{m\times d}$ is full-rank with $m \geq d$ and for $\boldsymbol{c} \in \mathbb{R}^d$, $[\boldsymbol{c}^{+}]_i = \max\{0, c_i\}$. We design layer-wise objectives as functionals whose analytic minimizers express the exact ground-truth network in terms of its parameters and nonlinearities. Following this objective landscape, learning residual units from finite samples can be formulated using convex optimization of a nonparametric function: for each layer, we first formulate the corresponding empirical risk minimization (ERM) as a positive semi-definite quadratic program (QP), then we show the solution space of the QP can be equivalently determined by a set of linear inequalities, which can then be efficiently solved by linear programming (LP). We further prove the strong statistical consistency of our algorithm, and demonstrate its robustness and sample efficiency through experimental results on synthetic data and a set of benchmark regression datasets.