In this paper, we present Exploitability Descent, a new algorithm for computing approximate equilibria in two-player zero-sum extensive-form games with imperfect information, by direct policy optimization against worst-case opponents. We prove that when following this optimization, a player's exploitability converges asymptotically to zero, and hence when both players employ this optimization, the joint policies converge to a Nash equilibrium. Unlike fictitious play (XFP) and counterfactual regret minimization (CFR), our convergence result pertains to the policies being optimized rather than the average policies. Our experiments demonstrate convergence rates comparable to XFP and CFR in four benchmark games. Using function approximation, we find that our algorithm outperforms the tabular version in two games, which, to our knowledge, is the first such result in imperfect information games among this class of algorithms.
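As a rough illustration of the idea (not the authors' implementation), the sketch below applies the same principle to a one-shot zero-sum matrix game: each player's softmax policy is updated by gradient ascent on its value against the opponent's current best response, and exploitability is tracked on the current policies rather than on any averaged policy. The learning-rate schedule and starting logits are arbitrary choices for the toy example.

```python
import numpy as np

# Row player's payoff matrix for rock-paper-scissors (a zero-sum matrix game).
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def exploitability(x, y):
    # Sum of both players' best-response values; zero exactly at a Nash equilibrium.
    return (A @ y).max() - (x @ A).min()

theta_x = np.array([1.0, 0.0, -0.5])   # arbitrary non-uniform starting logits
theta_y = np.array([-0.3, 0.8, 0.0])

for t in range(2000):
    x, y = softmax(theta_x), softmax(theta_y)
    # Each player's payoff vector against the opponent's current best response.
    u_x = A[:, (x @ A).argmin()]       # column best response minimizes the row payoff
    u_y = -A[(A @ y).argmax(), :]      # row best response maximizes the row payoff
    # Gradient ascent on the worst-case value of the *current* policy (softmax gradient).
    lr = 1.0 / np.sqrt(t + 1)
    theta_x += lr * x * (u_x - x @ u_x)
    theta_y += lr * y * (u_y - y @ u_y)

x, y = softmax(theta_x), softmax(theta_y)
print("policies:", np.round(x, 3), np.round(y, 3),
      "exploitability:", round(float(exploitability(x, y)), 4))
```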
Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.
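The abstract's specific update rules (and their regret-minimization connections) are not reproduced here; the following is only a generic one-shot tabular actor-critic sketch of the kind of policy update being studied: a critic tracks action values from sampled payoffs, and the actor takes a softmax policy-gradient step on the critic's advantages. The payoff numbers and step sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A one-shot decision from one agent's point of view: two actions with noisy payoffs.
true_payoffs = np.array([0.2, 0.8])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(2)        # actor: softmax policy parameters
q = np.zeros(2)            # critic: action-value estimates
alpha_q, alpha_pi = 0.1, 0.05

for t in range(5000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    r = true_payoffs[a] + 0.1 * rng.normal()
    q[a] += alpha_q * (r - q[a])           # critic update toward the sampled payoff
    advantage = q - pi @ q                 # A(a) = Q(a) - V under the current policy
    theta += alpha_pi * pi * advantage     # all-actions softmax policy-gradient step

print("policy:", np.round(softmax(theta), 3), "critic:", np.round(q, 3))
```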
Learning strategies for imperfect information games from samples of interaction is a challenging problem. A common method for this setting, Monte Carlo Counterfactual Regret Minimization (MCCFR), suffers from slow long-term convergence rates due to high variance. In this paper, we introduce a variance reduction technique (VR-MCCFR) that applies to any sampling variant of MCCFR. Using this technique, the per-iteration estimated values and updates are reformulated as a function of sampled values and state-action baselines, similar to their use in policy gradient reinforcement learning. The new formulation allows estimates to be bootstrapped from other estimates within the same episode, propagating the benefit of baselines along the sampled trajectory; the estimates remain unbiased even when bootstrapping from other estimates. Finally, we show that given a perfect baseline, the variance of the value estimates can be reduced to zero. Experimental evaluation shows that VR-MCCFR brings an order-of-magnitude speedup, while the empirical variance decreases by three orders of magnitude. The reduced variance allows CFR+ to be used with sampling for the first time, increasing the speedup to two orders of magnitude.
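A one-decision-point toy (not a full MCCFR implementation) can show the mechanism: the sampled-value estimator is recentred around a state-action baseline in control-variate form, which leaves it unbiased while shrinking its variance when the baseline is close to the true values. The action values, policy, and sampling distribution below are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

v_true = np.array([1.0, 3.0, -2.0])      # true action values at one decision point
pi = np.array([0.2, 0.5, 0.3])           # current policy
q = np.array([1/3, 1/3, 1/3])            # sampling distribution

def node_value_estimate(baseline):
    a = rng.choice(3, p=q)               # sample a single action, as in outcome sampling
    v_sample = v_true[a]                 # noiseless leaf value for simplicity
    v_hat = baseline.copy()              # unsampled actions fall back to the baseline
    v_hat[a] = baseline[a] + (v_sample - baseline[a]) / q[a]   # control-variate correction
    return pi @ v_hat                    # estimated value of the decision point

for name, b in [("no baseline", np.zeros(3)),
                ("near-perfect baseline", v_true + 0.1)]:
    samples = np.array([node_value_estimate(b) for _ in range(20000)])
    print(f"{name:>22}: mean={samples.mean():+.3f} (true {pi @ v_true:+.3f}), "
          f"var={samples.var():.4f}")
```

Both estimators average to the true node value, but the baseline version's empirical variance is orders of magnitude smaller, mirroring the effect the abstract reports.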
Deep reinforcement learning (RL) has achieved several high profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN) as it starts with better scores on the first million steps on 41 of 42 games and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
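A minimal sketch of the two loss terms the abstract combines, computed per demonstration transition: a 1-step temporal-difference loss and a large-margin supervised loss that pushes the demonstrated action above the alternatives. The n-step return, L2 regularization, double-DQN target, and prioritized replay are omitted, and the Q-value arrays are hypothetical placeholders for network outputs.

```python
import numpy as np

def dqfd_losses(q_s, q_next, a_demo, reward, gamma=0.99, margin=0.8):
    """Per-transition losses for a demonstration sample, given Q-value rows
    q_s = Q(s, .) and q_next = Q(s', .) (one entry per action)."""
    # 1-step temporal-difference loss on the demonstrated action.
    td_target = reward + gamma * q_next.max()
    td_loss = (td_target - q_s[a_demo]) ** 2
    # Large-margin supervised loss: the demonstrated action must beat every
    # other action by at least `margin`, steering the greedy policy toward it.
    margins = np.full_like(q_s, margin)
    margins[a_demo] = 0.0
    supervised_loss = np.max(q_s + margins) - q_s[a_demo]
    return td_loss, supervised_loss

td, sup = dqfd_losses(q_s=np.array([0.1, 0.5, 0.3]),
                      q_next=np.array([0.2, 0.0, 0.4]),
                      a_demo=2, reward=1.0)
print(f"TD loss: {td:.3f}, supervised margin loss: {sup:.3f}")
```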
In this work we introduce a differentiable version of the Compositional Pattern Producing Network, called the DPPN. Unlike a standard CPPN, the topology of a DPPN is evolved but the weights are learned. A Lamarckian algorithm, that combines evolution and learning, produces DPPNs to reconstruct an image. Our main result is that DPPNs can be evolved/trained to compress the weights of a denoising autoencoder from 157684 to roughly 200 parameters, while achieving a reconstruction accuracy comparable to a fully connected network with more than two orders of magnitude more parameters. The regularization ability of the DPPN allows it to rediscover (approximate) convolutional network architectures embedded within a fully connected architecture. Such convolutional architectures are the current state of the art for many computer vision applications, so it is satisfying that DPPNs are capable of discovering this structure rather than having to build it in by design. DPPNs exhibit better generalization when tested on the Omniglot dataset after being trained on MNIST, than directly encoded fully connected autoencoders. DPPNs are therefore a new framework for integrating learning and evolution.
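The compression mechanism can be sketched in miniature, assuming a fixed two-layer pattern network rather than an evolved topology: instead of storing a large weight matrix directly, its entries are generated by a small network from the (row, column) coordinates of each entry. The matrix shape and hidden size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def pattern_network(coords, params):
    """Tiny coordinate -> weight function: each entry of a large weight matrix
    is produced from its normalized (row, col) coordinates."""
    W1, b1, W2, b2 = params
    h = np.tanh(coords @ W1 + b1)
    return (h @ W2 + b2).ravel()

# Coordinates of every entry of a 784 x 64 weight matrix, normalized to [-1, 1].
rows, cols = np.meshgrid(np.linspace(-1, 1, 784), np.linspace(-1, 1, 64), indexing="ij")
coords = np.stack([rows.ravel(), cols.ravel()], axis=1)

# A few dozen pattern parameters generate ~50,000 weights. In the DPPN these
# small parameters are learned by backprop while the pattern network's
# topology is evolved; here they are just random for illustration.
hidden = 16
params = (rng.normal(size=(2, hidden)), np.zeros(hidden),
          rng.normal(size=(hidden, 1)), np.zeros(1))
n_params = sum(p.size for p in params)

big_weight_matrix = pattern_network(coords, params).reshape(784, 64)
print(f"{n_params} pattern parameters -> {big_weight_matrix.size} generated weights")
```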
In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is the ability to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state of the art on the Atari 2600 domain.
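The aggregation of the two estimators can be shown in a few lines. The sketch below is framework-free NumPy with hypothetical weight matrices standing in for the two output heads; the mean-subtracted combination is the identifiable form used by the dueling aggregation layer.

```python
import numpy as np

def dueling_q_values(features, w_value, w_adv):
    """Combine the value head and advantage head of a dueling architecture
    into Q-values for one state."""
    value = features @ w_value            # scalar V(s)
    advantages = features @ w_adv         # one A(s, a) per action
    # Subtracting the mean advantage makes the V/A decomposition identifiable.
    return value + advantages - advantages.mean()

rng = np.random.default_rng(0)
features = rng.normal(size=8)             # shared torso output for one state
q = dueling_q_values(features, rng.normal(size=(8,)), rng.normal(size=(8, 4)))
print("Q(s, .):", np.round(q, 3), "greedy action:", int(q.argmax()))
```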
Policy gradient methods are powerful reinforcement learning algorithms and have been shown to solve many complex tasks. However, these methods are also data-inefficient, suffer from high-variance gradient estimates, and frequently get stuck in local optima. This work addresses these weaknesses by combining recent improvements in the reuse of off-policy data and exploration in parameter space with deterministic behavioral policies. The resulting objective is amenable to standard neural network optimization strategies such as stochastic gradient descent or stochastic gradient Hamiltonian Monte Carlo. Incorporating previous rollouts via importance sampling greatly improves data efficiency, while the stochastic optimization schemes help escape local optima. We evaluate the proposed approach on a series of continuous control benchmark tasks. The results show that the algorithm is able to successfully and reliably learn solutions using fewer system interactions than standard policy gradient methods.
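A minimal, hypothetical sketch of the two ingredients being combined (not the paper's full objective): exploration happens through a Gaussian search distribution over the parameters of a deterministic policy, and earlier rollouts are reused by re-weighting them with self-normalized importance weights when evaluating a shifted search distribution. `rollout_return`, `mu_old`, and `mu_new` are made-up stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(theta):
    # Stand-in for running the deterministic policy with parameters `theta`
    # on the real system and recording the (noisy) episode return.
    return -np.sum((theta - 1.5) ** 2) + 0.1 * rng.normal()

def gaussian_logpdf(x, mean, std):
    return np.sum(-0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi))

# Old rollouts: parameters drawn from a Gaussian search distribution centred
# at mu_old (exploration in parameter space; the policy itself is deterministic).
mu_old, std = np.zeros(3), 0.5
thetas = mu_old + std * rng.normal(size=(200, 3))
returns = np.array([rollout_return(th) for th in thetas])

# Estimate the expected return of a *new* search distribution centred at
# mu_new from the *old* rollouts via self-normalized importance weights.
mu_new = np.full(3, 0.4)
log_w = np.array([gaussian_logpdf(th, mu_new, std) - gaussian_logpdf(th, mu_old, std)
                  for th in thetas])
w = np.exp(log_w - log_w.max())
w /= w.sum()
print("estimated return at mu_new:", round(float(w @ returns), 3))
```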
Deep Gaussian processes (DGPs) can model complex marginal densities as well as complex mappings. Non-Gaussian marginals are essential for modeling real-world data, and can be generated from the DGP by incorporating the relevant variables into the model. Previous work on DGP models has introduced noise additively and used variational inference, where a combination of sparse Gaussian processes and mean-field Gaussians is used to approximate the posterior. Additive noise attenuates the signal, and the Gaussian form of the variational distribution may lead to an inaccurate posterior. We instead incorporate the noise variables as latent covariates and propose a novel importance-weighted objective, which leverages analytic results and provides a mechanism to trade off computation for improved accuracy. Our results demonstrate that the importance-weighted objective works well in practice and consistently outperforms classical variational inference, especially for deeper models.
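The computation-for-accuracy trade-off of an importance-weighted objective can be illustrated on a one-dimensional toy latent-variable model (not a DGP): with K importance samples, log (1/K) sum_k p(x, h_k)/q(h_k) is a lower bound on log p(x) that tightens as K grows, recovering the classical ELBO at K = 1. The model and the deliberately imperfect q below are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy model: h ~ N(0, 1), x | h ~ N(h, 1), so marginally x ~ N(0, 2).
x = 1.3
true_log_marginal = log_normal(x, 0.0, 2.0)

# A deliberately imperfect Gaussian approximate posterior q(h | x).
mu_q, var_q = 0.4, 1.5

def importance_weighted_bound(K, n_rep=5000):
    h = mu_q + np.sqrt(var_q) * rng.normal(size=(n_rep, K))
    log_w = (log_normal(h, 0.0, 1.0) + log_normal(x, h, 1.0)
             - log_normal(h, mu_q, var_q))
    # log (1/K) sum_k w_k, averaged over repetitions (log-sum-exp for stability).
    m = log_w.max(axis=1, keepdims=True)
    return float(np.mean(m.squeeze() + np.log(np.mean(np.exp(log_w - m), axis=1))))

for K in (1, 5, 50):
    print(f"K={K:>2}: bound = {importance_weighted_bound(K):.4f}")
print(f"true log p(x) = {true_log_marginal:.4f}")
```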
Differential privacy is concerned with prediction quality while measuring the privacy impact on the individuals whose information is contained in the data. We consider differentially private risk minimization problems with regularizers that induce structured sparsity. These regularizers are known to be convex, but they are often non-differentiable. We analyze standard differentially private algorithms, such as output perturbation, Frank-Wolfe, and objective perturbation. Output perturbation is a differentially private algorithm that is known to perform well for minimizing risks that are strongly convex, and previous work has derived excess risk bounds that are independent of the dimensionality. In this paper, we assume a particular class of convex but non-smooth regularizers that induce structured sparsity, together with loss functions for generalized linear models. We also consider a differentially private Frank-Wolfe algorithm to optimize the dual of the risk minimization problem. We derive excess risk bounds for both algorithms; both bounds depend on the Gaussian width of the unit ball of the dual norm. We also show that objective perturbation of the risk minimization problem is equivalent to output perturbation of a dual optimization problem. This is the first work that analyzes the dual optimization problem of risk minimization problems in the context of differential privacy.
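For orientation, here is a textbook output-perturbation sketch on a strongly convex, smooth problem (L2-regularized logistic regression), i.e. the well-understood setting the abstract contrasts with, not the structured-sparsity regularizers analyzed in the paper. It uses the standard sensitivity bound 2L/(n*lambda) for the regularized minimizer (with L = 1 for logistic loss and unit-norm features) and the usual Gaussian-mechanism noise scale; the data and hyperparameters are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with features scaled so ||x_i|| <= 1 (required by the
# sensitivity bound used below).
n, d = 500, 10
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))

lam = 0.1                  # L2 regularization -> lam-strongly convex objective
eps, delta = 1.0, 1e-5     # target (eps, delta)-differential privacy

def train_logreg(X, y, lam, steps=2000, lr=0.5):
    # Plain gradient descent on the L2-regularized logistic loss.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0) + lam * w
        w -= lr * grad
    return w

w = train_logreg(X, y, lam)

# Output perturbation: release the minimizer plus Gaussian noise calibrated to
# its L2 sensitivity under a one-record change of the dataset.
sensitivity = 2.0 / (n * lam)
sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps
w_private = w + sigma * rng.normal(size=d)
print("noise std:", round(sigma, 5),
      "||w - w_private||:", round(float(np.linalg.norm(w - w_private)), 5))
```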
In this work, we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach in which a single convolutional neural network plays a dual role: it is simultaneously a dense feature descriptor and a feature detector. By postponing detection to a later stage, the obtained keypoints are more stable than their traditional counterparts based on early detection of low-level structures. We show that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotation. The proposed method obtains state-of-the-art performance on the difficult Aachen Day-Night localization dataset and the InLoc indoor localization benchmark, as well as competitive performance on other benchmarks for image matching and 3D reconstruction.
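A heavily simplified sketch of the describe-then-detect idea (not the paper's exact scoring function): descriptors are the L2-normalized per-pixel vectors of a dense feature map, and a detection score is built from per-channel ratios so that keypoints and descriptors come from the same map rather than from a separate low-level detector. A random array stands in for the CNN output, and wrap-around neighborhoods are used for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense feature map standing in for the output of the single CNN: H x W x C.
H, W, C = 32, 32, 64
feat = np.abs(rng.normal(size=(H, W, C)))

# Descriptor branch: the per-pixel feature vector, L2-normalized.
descriptors = feat / np.linalg.norm(feat, axis=-1, keepdims=True)

# Detection branch (simplified): a pixel scores highly when some channel is
# both dominant across channels at that pixel and a local spatial maximum.
channel_ratio = feat / feat.max(axis=-1, keepdims=True)
neigh = np.stack([np.roll(feat, (dy, dx), axis=(0, 1))
                  for dy in (-1, 0, 1) for dx in (-1, 0, 1)])   # 3x3 neighborhoods
spatial_ratio = feat / neigh.max(axis=0)
score = (channel_ratio * spatial_ratio).max(axis=-1)

# Keep the strongest responses as keypoints; their descriptors come from the same map.
k = 100
flat = np.argsort(score.ravel())[-k:]
ys, xs = np.unravel_index(flat, (H, W))
keypoints = np.stack([ys, xs], axis=1)
print("keypoints:", keypoints.shape, "descriptor dim:", descriptors.shape[-1])
```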