Many machine learning problems encode their data as a matrix with a possibly very large number of rows and columns. In several applications like neuroscience, image compression or deep reinforcement learning, the principal subspace of such a matrix provides a useful, low-dimensional representation of individual data. Here, we are interested in determining the $d$-dimensional principal subspace of a given matrix from sample entries, i.e. from small random submatrices. Although a number of sample-based methods exist for this problem (e.g. Oja's rule \citep{oja1982simplified}), these assume access to full columns of the matrix or particular matrix structure such as symmetry and cannot be combined as-is with neural networks \citep{baldi1989neural}. In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace is represented by a neural network, and hence can be scaled to datasets with an effectively infinite number of rows and columns. Our method consists in defining a loss function whose minimizer is the desired principal subspace, and constructing a gradient estimate of this loss whose bias can be controlled. We complement our theoretical analysis with a series of experiments on synthetic matrices, the MNIST dataset \citep{lecun2010mnist} and the reinforcement learning domain PuddleWorld \citep{sutton1995generalization} demonstrating the usefulness of our approach.
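For reference, Oja's subspace rule cited above operates on full columns of the matrix; a minimal NumPy sketch on synthetic data is given below (the matrix, step size and re-orthonormalisation schedule are illustrative choices, and this is the classical baseline rather than the paper's entry-sampling method).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3                                                # ambient and subspace dimensions
A = rng.normal(size=(n, 10)) @ rng.normal(size=(10, 500)) / 10.0  # synthetic low-rank-ish matrix

W = np.linalg.qr(rng.normal(size=(n, d)))[0]                # initial orthonormal basis estimate
eta = 1e-2                                                  # illustrative step size

for t in range(50000):
    x = A[:, rng.integers(A.shape[1])][:, None]             # sample one full column
    y = W.T @ x
    W += eta * (x @ y.T - W @ (y @ y.T))                    # Oja's subspace rule
    if t % 100 == 0:
        W, _ = np.linalg.qr(W)                              # occasional re-orthonormalisation (practical stabilisation)
W, _ = np.linalg.qr(W)

U = np.linalg.svd(A, full_matrices=False)[0][:, :d]         # true top-d left singular subspace
print("subspace error:", np.linalg.norm(U @ U.T - W @ W.T))
```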
We study the learning dynamics of self-predictive learning for reinforcement learning, a family of algorithms that learn representations by minimizing the prediction error of their own future latent representations. Despite their recent empirical success, such algorithms have an apparent defect: trivial representations (such as constants) minimize the prediction error, yet it is obviously undesirable to converge to such solutions. Our central insight is that careful design of the optimization dynamics is critical to learning meaningful representations. We identify that a faster-paced optimization of the predictor, together with semi-gradient updates on the representation, is crucial to preventing representation collapse. Then, in an idealized setup, we show that the self-predictive learning dynamics carry out spectral decomposition on the state transition matrix, effectively capturing information about the transition dynamics. Building on these theoretical insights, we propose bidirectional self-predictive learning, a novel self-predictive algorithm that learns two representations simultaneously. We examine the robustness of our theoretical insights with a number of small-scale experiments and showcase the promise of the novel representation learning algorithm with large-scale experiments.
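A minimal sketch of the two ingredients highlighted above, a faster-paced predictor and semi-gradient (stop-gradient) updates of the representation, on a toy tabular Markov chain with a linear representation; the chain, dimensions and step sizes are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 10, 3
T = rng.random((S, S)); T /= T.sum(axis=1, keepdims=True)   # toy ergodic transition matrix

Phi = rng.normal(scale=0.1, size=(d, S))   # state representation (one column per state)
P = np.eye(d)                              # latent predictor
lr_phi, lr_pred = 1e-2, 1e-1               # predictor learns on a faster timescale

s = 0
for step in range(100000):
    s_next = rng.choice(S, p=T[s])
    err = P @ Phi[:, s] - Phi[:, s_next]   # latent prediction error
    # Fast predictor update: full gradient of ||err||^2 with respect to P.
    P -= lr_pred * 2 * np.outer(err, Phi[:, s])
    # Semi-gradient representation update: the target Phi[:, s_next] is treated
    # as a constant (stop-gradient); only the online branch is differentiated.
    Phi[:, s] -= lr_phi * 2 * (P.T @ err)
    s = s_next

print("per-state representation norms:", np.linalg.norm(Phi, axis=0))  # inspect for collapse to ~0
```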
We consider reinforcement learning in an episodic, finite, stage-dependent Markov decision process of horizon $H$ with $S$ states and $A$ actions. The performance of an agent is measured by the regret after interacting with the environment for $T$ episodes. We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of posterior sampling that only needs a number of posterior samples logarithmic in $H$, $S$, $A$ and $T$ per state-action pair. For OPSRL we guarantee a high-probability regret bound of order at most $\widetilde{\mathcal{O}}(\sqrt{H^3 S A T})$, ignoring $\mathrm{poly}\log(HSAT)$ terms. A key novel technical ingredient is a new anti-concentration inequality for linear forms, which may be of independent interest. Specifically, we extend the normal-approximation-based lower bound for Beta distributions of Alfers and Dinges [1984] to Dirichlet distributions. Our bound matches the lower bound of order $\Omega(\sqrt{H^3 S A T})$, thereby answering an open problem raised by Agrawal and Jia [2017b] for the episodic setting.
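A schematic sketch of the optimistic posterior-sampling idea: draw several transition models from a Dirichlet posterior, run backward induction under each, and act greedily with respect to the most optimistic value estimate. The prior, the number of samples and the toy MDP below are illustrative assumptions, not the exact OPSRL specification.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 2, 5
R = rng.random((S, A))                          # known deterministic rewards (simplification)
true_P = rng.dirichlet(np.ones(S), size=(S, A)) # unknown environment dynamics

counts = np.ones((S, A, S))                     # Dirichlet prior pseudo-counts (assumption)
J = 7                                           # number of posterior samples (illustrative)

def optimistic_values(counts):
    """Backward induction under each sampled model; keep the most optimistic Q."""
    Q_opt = np.full((H, S, A), -np.inf)
    for _ in range(J):
        P = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)] for s in range(S)])
        V = np.zeros(S)
        for h in reversed(range(H)):
            Q = R + P @ V                       # (S, A): expectation over next states
            V = Q.max(axis=1)
            Q_opt[h] = np.maximum(Q_opt[h], Q)
    return Q_opt

for episode in range(200):
    Q_opt = optimistic_values(counts)
    s = 0
    for h in range(H):
        a = int(Q_opt[h, s].argmax())           # act greedily w.r.t. the optimistic values
        s_next = rng.choice(S, p=true_P[s, a])
        counts[s, a, s_next] += 1               # posterior update from the observed transition
        s = s_next
```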
The designs of many of today's large-scale systems, from traffic routing environments to smart grids, rely on game-theoretic equilibrium concepts. However, as the size of an $N$-player game typically grows exponentially with $N$, standard game-theoretic analysis becomes effectively infeasible. Recent approaches circumvent this limitation by considering mean-field games, approximations of anonymous $N$-player games in which the number of players is infinite and the population's state distribution, rather than every individual player's state, is the object of interest. The practical computability of mean-field Nash equilibria, the most studied mean-field equilibrium concept to date, however, typically depends on beneficial non-generic structural properties, such as monotonicity or contraction properties, which are required for known algorithms to converge. In this work, we study an alternative route for mean-field games by developing the notions of mean-field correlated and coarse correlated equilibria. We show that they can be efficiently learned in \emph{all games}, without requiring any additional assumptions on the structure of the game, using three classical algorithms. Furthermore, we establish correspondences between our notions and those already present in the literature, derive optimality bounds for the mean-field to $N$-player transition, and empirically demonstrate the convergence of these algorithms on simple games.
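As an illustration of the kind of classical no-regret procedure referred to above, the sketch below runs regret matching for a representative player in a toy one-shot mean-field setting, where the payoff depends on the player's own action and the population's action distribution; the payoff function and the symmetric-population assumption are illustrative, not taken from the paper.

```python
import numpy as np

# Toy one-shot mean-field game: a representative player picks one of K actions;
# its payoff depends on its action and on the population distribution mu over actions.
K = 3

def payoff(a, mu):
    # Illustrative congestion-style payoff: an action is worth less the more the
    # population uses it (this payoff is an assumption, not from the paper).
    base = np.array([1.0, 0.8, 0.6])
    return base[a] - mu[a]

regrets = np.zeros(K)
avg_policy = np.zeros(K)

for t in range(1, 5001):
    pos = np.maximum(regrets, 0.0)
    pi = pos / pos.sum() if pos.sum() > 0 else np.ones(K) / K   # regret matching
    mu = pi                                    # symmetric population: everyone plays pi
    u = np.array([payoff(a, mu) for a in range(K)])
    expected = pi @ u
    regrets += u - expected                    # regret of each fixed action vs. current play
    avg_policy += (pi - avg_policy) / t        # time-averaged play

print("time-averaged policy (approximate mean-field CCE candidate):", avg_policy)
```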
We study the multi-step off-policy learning approach to distributional reinforcement learning. Despite the apparent similarity between value-based RL and distributional RL, our study reveals intriguing and fundamental differences between the two cases in the multi-step setting. We identify a novel notion of path-dependent distributional TD error, which is indispensable for principled multi-step distributional RL. The distinction from the value-based case bears important implications for concepts such as backward-view algorithms. Our work provides the first theoretical guarantees for multi-step off-policy distributional RL algorithms, including results that apply to existing multi-step distributional RL methods. In addition, we derive a novel algorithm, Quantile Regression-Retrace, which leads to a deep RL agent, QR-DQN-Retrace, that shows empirical improvements over QR-DQN on the Atari-57 benchmark. Collectively, we shed light on how unique challenges in multi-step distributional RL can be addressed in both theory and practice.
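The quantile-regression component can be made concrete with the standard quantile Huber loss used by QR-DQN; a small NumPy sketch is given below. The Retrace-corrected multi-step targets are the paper's contribution and are not reproduced here; plain samples stand in for them.

```python
import numpy as np

def quantile_huber_loss(theta, targets, kappa=1.0):
    """Standard QR-DQN loss between N predicted quantiles `theta` and a set of
    target samples `targets` (here plain samples; Retrace-corrected multi-step
    targets would be plugged in at this point)."""
    N = len(theta)
    taus = (np.arange(N) + 0.5) / N                     # quantile midpoints
    u = targets[None, :] - theta[:, None]               # pairwise TD errors
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    rho = np.abs(taus[:, None] - (u < 0)) * huber / kappa
    return rho.sum(axis=0).mean()                       # sum over quantiles, mean over targets

theta = np.linspace(-1.0, 1.0, 51)                      # predicted quantile locations
targets = np.random.default_rng(0).normal(size=32)      # stand-in target samples
print(quantile_huber_loss(theta, targets))
```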
We introduce DeepNash, an autonomous agent capable of learning to play the imperfect-information game Stratego from scratch, up to a human expert level. Stratego is one of the few iconic board games that artificial intelligence (AI) has not yet mastered. This popular game has an enormous game tree on the order of $10^{535}$ nodes, i.e., $10^{175}$ times larger than that of Go. It has the additional complexity of requiring decision-making under imperfect information, similar to Texas hold'em poker, which has a significantly smaller game tree (on the order of $10^{164}$ nodes). Decisions in Stratego are made over a large number of discrete actions with no obvious link between action and outcome. Episodes are long, with often hundreds of moves before a player wins, and situations in Stratego cannot easily be broken down into manageably sized sub-problems as in poker. For these reasons, Stratego has been a grand challenge for the field of AI for decades, and existing AI methods barely reach an amateur level of play. DeepNash uses a game-theoretic, model-free deep reinforcement learning method, without search, that learns to master Stratego via self-play. The Regularised Nash Dynamics (R-NaD) algorithm, a key component of DeepNash, converges to an approximate Nash equilibrium, instead of "cycling" around it, by directly modifying the underlying multi-agent learning dynamics. DeepNash beats existing state-of-the-art AI methods in Stratego and achieved a yearly (2022) and all-time top-3 ranking on the Gravon games platform, competing with human expert players.
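A heavily simplified sketch of the reward-regularisation idea behind R-NaD as described above: the reward the learner optimises is penalised by the divergence of its current policy from a regularisation policy that is periodically refreshed in an outer loop. The exact transformation, update rule and hyperparameters used by DeepNash are not reproduced here; the function below is only an assumption-laden illustration.

```python
import numpy as np

def regularised_reward(r, pi, pi_reg, a, eta=0.2):
    """Sketch of a reward transformation in the spirit of R-NaD: the reward is
    penalised when the current policy pi over-weights action a relative to a
    fixed regularisation policy pi_reg (eta and the exact functional form are
    illustrative assumptions, not DeepNash's)."""
    return r - eta * np.log(pi[a] / pi_reg[a])

# Illustrative call: action 0 under a policy that over-weights it relative to pi_reg.
print(regularised_reward(1.0, np.array([0.9, 0.1]), np.array([0.5, 0.5]), a=0))
```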
We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policies' GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on challenging deep RL continuous-control tasks. We also provide an analysis of GHM training methods, proving a novel convergence result for previously proposed methods, and show how these models can be trained stably in deep RL settings.
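A worked identity makes the role of a GHM concrete: writing $\mu_\gamma^\pi(\cdot \mid s)$ for the discounted state-visitation distribution of policy $\pi$ started at $s$, and assuming rewards depend only on the state, we have

$$ \mu_\gamma^\pi(s' \mid s) = (1-\gamma)\sum_{t \ge 0} \gamma^{t}\,\Pr{}^{\pi}(s_t = s' \mid s_0 = s), \qquad V^\pi(s) = \frac{1}{1-\gamma}\,\mathbb{E}_{s' \sim \mu_\gamma^\pi(\cdot \mid s)}\big[r(s')\big], $$

so evaluating the switching non-Markov policies described above reduces to sampling states from a suitably composed GHM and averaging their rewards.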
Model-agnostic meta reinforcement learning requires estimating the Hessian matrix of value functions. This is challenging from an implementation perspective, as repeatedly differentiating policy gradient estimates may lead to biased Hessian estimates. In this work, we provide a unified framework for estimating higher-order derivatives of value functions, based on off-policy evaluation. Our framework interprets a number of existing approaches as special cases and elucidates the bias and variance trade-offs of Hessian estimates. The framework also opens the door to a new family of estimates, which can be easily implemented with automatic differentiation libraries and which lead to performance gains in practice.
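For intuition about why repeated differentiation is delicate, note that the exact Hessian of the expected return $J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$ can be written via the log-derivative trick as

$$ \nabla_\theta^2 J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\Big[ R(\tau)\,\big( \nabla_\theta^2 \log p_\theta(\tau) + \nabla_\theta \log p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)^{\top} \big) \Big], $$

so a sample-based estimator must account for both terms; naively differentiating a first-order policy-gradient estimator can drop or mis-weight one of them, which is one source of the bias discussed above.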
We present a new behavioural distance over the state space of a Markov decision process, and demonstrate the use of this distance as an effective means of shaping the learnt representations of deep reinforcement learning agents. While existing notions of state similarity are typically difficult to learn at scale due to high computational cost and a lack of sample-based algorithms, our newly proposed distance addresses both of these issues. In addition to providing a detailed theoretical analysis, we provide empirical evidence that learning this distance alongside the value function yields structured and informative representations, including strong results on the Arcade Learning Environment benchmark.
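One sample-friendly behavioural distance of this flavour is the fixed point of $U(x,y) = |r(x) - r(y)| + \gamma\,\mathbb{E}_{x' \sim P(\cdot\mid x),\, y' \sim P(\cdot\mid y)}[U(x',y')]$ with independently sampled next states; the tabular NumPy sketch below computes it by fixed-point iteration. It is given as an illustration of how such a distance admits sample-based estimation, not as the paper's exact operator.

```python
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 6, 0.9
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)   # toy transition matrix
r = rng.random(S)                                           # toy state rewards

# Fixed-point iteration for
#   U(x, y) = |r(x) - r(y)| + gamma * E_{x'~P(x), y'~P(y)}[U(x', y')]
# with independent next-state sampling, which is what makes the expectation
# estimable from single sampled transitions.
U = np.zeros((S, S))
for _ in range(500):
    U = np.abs(r[:, None] - r[None, :]) + gamma * P @ U @ P.T

print(np.round(U, 2))
```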
Whilst deep neural networks have shown great empirical success, there is still much work to be done to understand their theoretical properties. In this paper, we study the relationship between random, wide, fully connected, feedforward networks with more than one hidden layer and Gaussian processes with a recursive kernel definition. We show that, under broad conditions, as we make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks. To evaluate convergence rates empirically, we use maximum mean discrepancy. We then compare finite Bayesian deep networks from the literature to Gaussian processes in terms of the key predictive quantities of interest, finding that in some cases the agreement can be very close. We discuss the desirability of Gaussian process behaviour and review non-Gaussian alternative models from the literature.
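As one concrete instance of the recursive kernel definition, the covariance of a wide fully connected ReLU network with weight variance $\sigma_w^2/\text{fan-in}$ and bias variance $\sigma_b^2$ can be computed layer by layer using the closed-form Gaussian expectation of ReLU activations; a small NumPy sketch follows (depth and hyperparameters are illustrative).

```python
import numpy as np

def nngp_relu_kernel(x1, x2, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    """Recursive NNGP kernel for a fully connected ReLU network.
    Layer 0 covariance is the scaled inner product of the inputs; each deeper
    layer applies the closed-form Gaussian expectation of ReLU activations."""
    d = len(x1)
    k11 = sigma_b2 + sigma_w2 * x1 @ x1 / d
    k22 = sigma_b2 + sigma_w2 * x2 @ x2 / d
    k12 = sigma_b2 + sigma_w2 * x1 @ x2 / d
    for _ in range(depth):
        c = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
        theta = np.arccos(c)
        # E[relu(u) relu(v)] for (u, v) jointly Gaussian with the current covariance:
        e12 = np.sqrt(k11 * k22) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
        e11, e22 = k11 / 2.0, k22 / 2.0          # E[relu(u)^2] = k/2 for u ~ N(0, k)
        k11 = sigma_b2 + sigma_w2 * e11
        k22 = sigma_b2 + sigma_w2 * e22
        k12 = sigma_b2 + sigma_w2 * e12
    return k12

x1, x2 = np.array([1.0, 0.5, -0.2]), np.array([0.3, -1.0, 0.8])
print(nngp_relu_kernel(x1, x2))
```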