A central capability of intelligent systems is the ability to continuously build upon previous experiences to speed up and enhance learning of new tasks. Two distinct research paradigms have studied this question. Meta-learning views this problem as learning a prior over model parameters that is amenable to fast adaptation on a new task, but typically assumes the tasks are available together as a batch. In contrast, online (regret-based) learning considers a setting where tasks are revealed one after the other, but conventionally trains a single model without task-specific adaptation. This work introduces an online meta-learning setting, which merges ideas from both paradigms to better capture the spirit and practice of continual lifelong learning. We propose the follow the meta leader (FTML) algorithm, which extends the MAML algorithm to this setting. Theoretically, this work provides an $O(\log T)$ regret guarantee with one additional higher-order smoothness assumption (in comparison to the standard online setting). Our experimental evaluation on three different large-scale problems suggests that the proposed algorithm significantly outperforms alternatives based on traditional online learning approaches.
translated by 谷歌翻译
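Online optimization is a well-established optimization paradigm that aims to make a sequence of correct decisions given knowledge of the correct answers to previous decision tasks. Bilevel programming involves a hierarchical optimization problem in which the feasible region of the so-called outer problem is restricted by the solution set mapping of the inner problem. This paper brings these two ideas together and studies an online bilevel optimization setting in which a sequence of time-varying bilevel problems is revealed one after the other. We extend the known regret bounds for single-level online algorithms to the bilevel setting. Specifically, we introduce new notions of bilevel regret, develop an online alternating time-averaged gradient method that is able to leverage smoothness, and provide regret bounds in terms of the path length of the inner and outer minimizer sequences.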
Lack of performance when it comes to continual learning over non-stationary distributions of data remains a major challenge in scaling neural network learning to more human realistic settings. In this work we propose a new conceptualization of the continual learning problem in terms of a temporally symmetric trade-off between transfer and interference that can be optimized by enforcing gradient alignment across examples. We then propose a new algorithm, Meta-Experience Replay (MER), that directly exploits this view by combining experience replay with optimization-based meta-learning. This method learns parameters that make interference based on future gradients less likely and transfer based on future gradients more likely. We conduct experiments across continual lifelong supervised learning benchmarks and non-stationary reinforcement learning environments, demonstrating that our approach consistently outperforms recently proposed baselines for continual learning. Our experiments show that the gap between the performance of MER and baseline algorithms grows both as the environment gets more non-stationary and as the fraction of the total experiences stored gets smaller.
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on two few-shot image classification benchmarks, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.
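As a rough illustration of the idea of training parameters so that a few gradient steps adapt well, the following is a minimal sketch on a toy problem, not the authors' implementation. It assumes a scalar parameter `w` and hypothetical per-task losses $0.5\,(w - \text{target})^2$; the function names, learning rates, and targets are all illustrative assumptions. The meta-gradient differentiates through one inner gradient step, which for this loss contributes the factor $1 - \alpha$.

```python
def task_loss_grad(w, target):
    """Gradient of the toy task loss 0.5 * (w - target)**2 (an assumed loss)."""
    return w - target


def maml_meta_gradient(w, targets, inner_lr):
    """Average gradient of the post-adaptation loss with respect to w."""
    total = 0.0
    for target in targets:
        # Inner loop: one gradient step on the task loss.
        adapted = w - inner_lr * task_loss_grad(w, target)
        # Chain rule through the inner step; the Hessian of the toy loss is 1.
        d_adapted_d_w = 1.0 - inner_lr * 1.0
        total += task_loss_grad(adapted, target) * d_adapted_d_w
    return total / len(targets)


def maml_train(targets, inner_lr=0.1, outer_lr=0.5, steps=200, w0=0.0):
    """Outer loop: plain gradient descent on the meta-objective."""
    w = w0
    for _ in range(steps):
        w -= outer_lr * maml_meta_gradient(w, targets, inner_lr)
    return w


if __name__ == "__main__":
    print(maml_train([1.0, 2.0, 6.0]))
```

For this symmetric toy loss the learned initialization settles at the mean of the task targets; with real models the inner-step Jacobian is computed by automatic differentiation rather than by hand.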
A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner-loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.
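The claim that the meta-gradient can depend only on the inner solution can be checked on a toy case. The sketch below is an illustration under stated assumptions, not the paper's code: it assumes a scalar parameter, a hypothetical quadratic training loss $0.5\,a\,(\phi - c)^2$, a proximal term $0.5\,\lambda\,(\phi - \theta)^2$, and a test loss $0.5\,(\phi - \text{test\_target})^2$. Under those assumptions the implicit meta-gradient is the test gradient at the inner solution divided by $1 + a/\lambda$, and the code compares it with a finite-difference estimate.

```python
def inner_solution(theta, a, c, lam):
    """Minimizer of 0.5*a*(phi - c)**2 + 0.5*lam*(phi - theta)**2."""
    return (a * c + lam * theta) / (a + lam)


def outer_loss(theta, a, c, lam, test_target):
    """Test loss evaluated at the inner solution (toy quadratic, assumed)."""
    phi = inner_solution(theta, a, c, lam)
    return 0.5 * (phi - test_target) ** 2


def implicit_meta_gradient(theta, a, c, lam, test_target):
    """Meta-gradient from the inner solution only, via implicit differentiation."""
    phi = inner_solution(theta, a, c, lam)
    test_grad = phi - test_target      # gradient of the test loss at phi
    hessian = a                        # second derivative of the training loss
    return test_grad / (1.0 + hessian / lam)


def finite_difference(theta, a, c, lam, test_target, h=1e-5):
    """Central-difference estimate of d(outer_loss)/d(theta), for comparison."""
    up = outer_loss(theta + h, a, c, lam, test_target)
    down = outer_loss(theta - h, a, c, lam, test_target)
    return (up - down) / (2.0 * h)


if __name__ == "__main__":
    args = (0.5, 2.0, 1.0, 1.0, 3.0)
    print(implicit_meta_gradient(*args), finite_difference(*args))
```

No inner-loop trajectory is stored here: only the solution `phi` and the curvature at that point enter the meta-gradient.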
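We develop a new continual meta-learning method to address challenges in sequential multi-task learning. In this setting, the agent's goal is to quickly achieve high reward over any sequence of tasks. Prior meta-reinforcement learning algorithms have shown promising results in accelerating the acquisition of new tasks. However, they require access to all tasks during training. Beyond simply transferring past experience to new tasks, our goal is to devise continual reinforcement learning algorithms that learn to learn, using their experience on previous tasks to learn new tasks more quickly. We introduce a new method, continual meta-policy search (CoMPS), that removes this limitation by meta-training in an incremental fashion, over each task in a sequence, without revisiting prior tasks. CoMPS continuously repeats two subroutines: learning a new task using RL, and using the experience from RL to perform completely offline meta-learning to prepare for subsequent task learning. We find that CoMPS outperforms prior continual learning and off-policy meta-reinforcement learning methods on several sequences of challenging continuous control tasks.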
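One of the key challenges in learning an online recommendation model is temporal domain shift, which causes a mismatch between the training and test data distributions and hence domain generalization error. To overcome this, we propose to learn a future gradient generator that forecasts the gradient information of the future data distribution for training, so that the recommendation model can be trained as if we were able to look ahead into the future of its deployment. Compared with batch update, our theory suggests that the proposed algorithm achieves a smaller temporal domain generalization error, measured by a gradient variation term in a local regret. We demonstrate the empirical advantage by comparing with various representative baselines.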
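In this paper, we consider the problem of finding a meta-learning online control algorithm that can learn across tasks when faced with a sequence of $n$ (similar) control tasks. Each task involves controlling a linear dynamical system over a finite horizon of $t$ time steps. The cost function and system noise at each time step are adversarial and unknown to the controller before it takes the control action. Meta-learning is a broad approach whose goal is to prescribe an online policy for any new, unseen task by exploiting information from other tasks and the similarity between tasks. We propose a meta-learning online control algorithm for the control setting and characterize its performance by \textit{meta-regret}, the average cumulative regret across the tasks. We show that when the number of tasks is sufficiently large, our proposed approach achieves a meta-regret that is smaller by a factor of $d/d^{*}$ compared to an independent-learning online control algorithm that does not learn across tasks, where $d$ is a problem constant and $d^{*}$ is a scalar that decreases as the similarity between tasks increases. Thus, when the tasks in the sequence are similar, the regret of the proposed meta-learning online control is significantly lower than that of a naive approach without meta-learning. We also present experimental results that demonstrate the superior performance achieved by our meta-learning algorithm.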
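Model-agnostic meta-learning (MAML), a popular gradient-based meta-learning framework, assumes that each task or instance contributes equally to the meta-learner. Hence, it fails to address the domain shift between base and novel classes in few-shot learning. In this work, we propose a novel robust meta-learning algorithm, NestedMAML, which learns to assign weights to training tasks or instances. We treat the weights as hyperparameters and iteratively optimize them using a small set of validation tasks in a nested bi-level optimization approach (in contrast to the standard bi-level optimization in MAML). We then apply NestedMAML in the meta-training stage, which involves (1) several tasks sampled from a distribution different from the meta-test task distribution, or (2) some data samples with noisy labels. Extensive experiments on synthetic and real-world datasets demonstrate that NestedMAML efficiently mitigates the effects of "unwanted" tasks or instances, leading to significant improvement over state-of-the-art robust meta-learning methods.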
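In this paper, we study the generalization properties of model-agnostic meta-learning (MAML) algorithms for supervised learning problems. We focus on the setting in which we train the MAML model over $m$ tasks, each with $n$ data points, and characterize its generalization error from two points of view. First, we assume the new task at test time is one of the training tasks, and we show that, for strongly convex objective functions, the expected excess population loss is bounded by ${\mathcal{O}}(1/mn)$. Second, we consider the MAML algorithm's generalization to an unseen task and show that the resulting generalization error depends on the total variation distance between the underlying distributions of the new task and the tasks observed during training. Our proof techniques rely on the connections between algorithmic stability and generalization bounds of algorithms. In particular, we propose a new definition of stability for meta-learning algorithms, which allows us to capture the role of both the number of tasks $m$ and the number of samples per task $n$ in the generalization error of MAML.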
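A fundamental assumption of most machine learning algorithms is that the training and test data are drawn from the same underlying distribution. However, this assumption is violated in almost all practical applications: machine learning systems are regularly tested under distribution shift, due to changing temporal correlations, atypical end users, or other factors. In this work, we consider the problem setting of domain generalization, where the training data are structured into domains and there may be multiple test-time shifts, corresponding to new domains or domain distributions. Most prior methods aim to learn a single robust model or invariant feature space that performs well on all domains. In contrast, we aim to learn models that adapt at test time to domain shift using unlabeled test points. Our primary contribution is to introduce the framework of adaptive risk minimization (ARM), in which models are directly optimized for effective adaptation to shift by learning to adapt on the training domains. Compared to prior methods for robustness, invariance, and adaptation, ARM methods provide performance gains of 1-4% test accuracy on a number of image classification problems exhibiting domain shift.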
We introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. We show that an approximate version of the bilevel problem can be solved by taking into explicit account the optimization dynamics for the inner objective. Depending on the specific setting, the outer variables take either the meaning of hyperparameters in a supervised learning problem or parameters of a meta-learner. We provide sufficient conditions under which solutions of the approximate problem converge to those of the exact problem. We instantiate our approach for meta-learning in the case of deep learning where representation layers are treated as hyperparameters shared across a set of training episodes. In experiments, we confirm our theoretical findings, present encouraging results for few-shot learning and contrast the bilevel approach against classical approaches for learning-to-learn.
Many real-world learning scenarios face the challenge of slow concept drift, where data distributions change gradually over time. In this setting, we pose the problem of learning temporally sensitive importance weights for training data, in order to optimize predictive accuracy. We propose a class of temporal reweighting functions that can capture multiple timescales of change in the data, as well as instance-specific characteristics. We formulate a bi-level optimization criterion, and an associated meta-learning algorithm, by which these weights can be learned. In particular, our formulation trains an auxiliary network to output weights as a function of training instances, thereby compactly representing the instance weights. We validate our temporal reweighting scheme on a large real-world dataset of 39M images spread over a 9-year period. Our extensive experiments demonstrate the necessity of instance-based temporal reweighting in the dataset, and achieve significant improvements over classical batch-learning approaches. Further, our proposal easily generalizes to a streaming setting and shows significant gains compared to recent continual learning methods.
We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.
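The adaptive idea is often illustrated with a diagonal, per-coordinate update in which each coordinate's step is scaled by the root of its accumulated squared gradients. The sketch below shows that commonly used simplification; it is not the full proximal-function framework described in the abstract. The function name `adagrad`, the learning rate, the `eps` stabilizer, and the badly scaled quadratic objective are all illustrative assumptions.

```python
import math


def adagrad(grad_fn, x0, lr=0.5, steps=500, eps=1e-8):
    """Diagonal adaptive subgradient update (a simplified, assumed variant)."""
    x = list(x0)
    accum = [0.0] * len(x)  # running sum of squared gradients per coordinate
    for _ in range(steps):
        g = grad_fn(x)
        for i, gi in enumerate(g):
            accum[i] += gi * gi
            # Coordinates with historically small gradients get larger steps.
            x[i] -= lr * gi / (math.sqrt(accum[i]) + eps)
    return x


def scaled_quadratic_grad(x):
    """Gradient of 0.5 * (100 * x[0]**2 + x[1]**2), a badly scaled toy objective."""
    return [100.0 * x[0], x[1]]


if __name__ == "__main__":
    print(adagrad(scaled_quadratic_grad, [1.0, 1.0]))
```

Because each step is normalized by the coordinate's own gradient history, the two coordinates of this toy objective follow almost the same trajectory despite the factor of 100 in curvature, which is why a single learning rate suffices.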
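Recently, model-agnostic meta-learning (MAML) has garnered tremendous attention. However, stochastic optimization of MAML is still immature. Existing algorithms for MAML are based on the "episode" idea: at each iteration they sample a number of tasks and a number of data points for each sampled task to update the meta-model. However, they either do not necessarily guarantee convergence with a constant mini-batch size, or require processing a large number of tasks at every iteration, which is not viable for continual learning or cross-device federated learning, where only a small number of tasks are available per iteration or per round. This paper addresses these issues by (i) proposing efficient memory-based stochastic algorithms for MAML with a vanishing convergence error, which only require sampling a constant number of tasks and a constant number of data samples per iteration; and (ii) proposing communication-efficient distributed memory-based MAML algorithms for personalized federated learning in both the cross-device (with client sampling) and cross-silo (without client sampling) settings. The theoretical results significantly improve the optimization theory for MAML, and the empirical results corroborate the theory.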
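In many practical applications of machine learning, data arrives sequentially over time in large chunks. Practitioners then have to decide how to allocate their computational budget in order to obtain the best performance at any point in time. Online learning theory for convex optimization suggests that the best strategy is to use data as soon as it arrives. However, this might not be the best strategy when using deep non-linear networks, particularly when these perform multiple passes over each chunk of data, rendering the overall distribution non-i.i.d. In this paper, we formalize this learning setting in the simplest scenario, in which each data chunk is drawn from the same underlying distribution, and make a first attempt at empirically answering the following questions: How long should the learner wait before training on newly arrived chunks? What architecture should the learner adopt? Should the learner increase capacity over time as more data is observed? We probe this learning setting using convolutional neural networks trained on classic computer vision benchmarks, as well as a large transformer model trained on a large-scale language modeling task. Code is available at \url{www.github.com/facebookresearch/alma}.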
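We consider online imitation learning (OIL), where the task is to find a policy that imitates the behavior of an expert via active interaction with the environment. We aim to bridge the gap between the theory and practice of policy optimization algorithms for OIL by analyzing one of the most popular OIL algorithms, DAgger. Specifically, if the class of policies is sufficiently expressive to contain the expert policy, we prove that DAgger achieves constant regret. Unlike previous bounds that require the losses to be strongly convex, our result only requires the weaker assumption that the losses be strongly convex with respect to the policy's sufficient statistics (not its parameterization). To ensure convergence for a wider class of policies and losses, we augment DAgger with an additional regularization term. In particular, we propose a variant of Follow-the-Regularized-Leader (FTRL) and its adaptive variant for OIL, and develop a memory-efficient implementation that matches the memory requirements of FTL. Assuming that the loss functions are smooth and convex with respect to the parameters of the policy, we also prove that FTRL achieves constant regret for any sufficiently expressive policy class, while retaining $O(\sqrt{T})$ regret in the worst case. We demonstrate the effectiveness of these algorithms with experiments on synthetic and high-dimensional control tasks.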
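Model-agnostic meta-learning (MAML) has become increasingly popular for training models that can quickly adapt to new tasks via one or a few stochastic gradient descent steps. However, the MAML objective is significantly more difficult to optimize than standard non-adaptive learning (NAL), and little is understood about how much MAML improves over NAL in terms of the fast adaptability of their solutions in various scenarios. We address this issue analytically in a linear regression setting consisting of a mixture of easy and hard tasks, where hardness is related to the rate at which gradient descent converges on the task. Specifically, we prove that for MAML to achieve a substantial gain over NAL, (i) there must be some discrepancy in hardness among the tasks, and (ii) the center of the optimal solutions of the hard tasks must lie far from the center of the optimal solutions of the easy tasks. We also give numerical and analytical results suggesting that these insights apply to two-layer neural networks. Finally, we provide few-shot image classification experiments that support our insights on when MAML should be used and that emphasize the importance of training MAML on hard tasks in practice.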
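Meta-learning of shared initialization parameters has been shown to be highly effective in solving few-shot learning tasks. However, extending the framework to many-shot scenarios, which may further enhance its practicality, has been relatively overlooked due to the technical difficulties of meta-learning over long chains of inner-gradient steps. In this paper, we first show that allowing the meta-learner to take a larger number of inner-gradient steps better captures the structure of heterogeneous and large-scale task distributions, and thus yields better initialization points. Further, in order to increase the frequency of meta-updates even with excessively long inner-optimization trajectories, we propose to estimate the required shift of the task-specific parameters with respect to the change of the initialization parameters. By doing so, we can arbitrarily increase the frequency of meta-updates and thus greatly improve meta-level convergence as well as the quality of the learned initializations. We validate our method on a heterogeneous set of large-scale tasks and show that the algorithm largely outperforms previous first-order meta-learning methods, as well as multi-task learning and fine-tuning baselines, in terms of both generalization performance and convergence.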
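What is learning? $20^{th}$ century formalizations of learning theory, which precipitated revolutions in artificial intelligence, focus primarily on $\mathit{in-distribution}$ learning, that is, learning under the assumption that the training data are sampled from the same distribution as the evaluation distribution. This assumption renders these theories inadequate for characterizing $21^{st}$ century real-world data problems, which are typically characterized by evaluation distributions that differ from the training data distributions (referred to as out-of-distribution learning). We therefore make a small change to existing formal definitions of learnability by relaxing that assumption. We then introduce $\mathbf{learning\ efficiency}$ (LE) to quantify the amount a learner is able to leverage data for a given problem, regardless of whether it is an in- or out-of-distribution problem. We then define and prove the relationship between the generalized notions of learnability, and show how this framework is sufficiently general to characterize transfer, multitask, meta, continual, and lifelong learning. We hope this unification helps bridge the gap between empirical practice and theoretical guidance in real-world problems. Finally, because biological learning continues to outperform machine learning algorithms on certain challenges, we discuss the limitations of this framework vis-à-vis its ability to formalize biological learning, suggesting multiple avenues for future research.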