In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in the Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach. The source code is publicly available at github.com/deepmind/scalable_agent.
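The abstract above names V-trace but does not state its formula. The following is a minimal sketch, assuming the truncated importance-weight form of the V-trace target as it is commonly stated for IMPALA; the function name, argument names, and default thresholds are illustrative assumptions, not taken from the released code.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets for one trajectory (illustrative sketch).

    rewards[s]      reward received after step s
    values[s]       learner's value estimate V(x_s)
    bootstrap_value learner's value estimate for the state after the last step
    ratios[s]       importance ratio pi(a_s | x_s) / mu(a_s | x_s)
    rho_bar, c_bar  truncation thresholds (assumed defaults)
    """
    n = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * n
    correction = 0.0  # holds v_{s+1} - V(x_{s+1}) while iterating backwards
    for s in reversed(range(n)):
        rho = min(rho_bar, ratios[s])  # truncated weight on the TD error
        c = min(c_bar, ratios[s])      # truncated weight on the trace
        delta = rho * (rewards[s] + gamma * next_values[s] - values[s])
        correction = delta + gamma * c * correction
        targets[s] = values[s] + correction
    return targets
```

When every ratio equals 1 and both thresholds are 1, the targets reduce to ordinary n-step returns, which is a convenient sanity check for the on-policy case.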
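Neural posterior estimation methods for simulation-based inference can be ill-suited to posterior distributions obtained by conditioning on multiple observations, because they may require a large number of simulator calls to produce accurate approximations. Neural likelihood estimation methods handle multiple observations naturally but require a separate inference step, which can hurt their efficiency and performance. We introduce a new method for simulation-based inference that enjoys the benefits of both approaches. We propose to model the scores of the posterior distributions induced by individual observations, and introduce a sampling algorithm that combines the learned scores to sample efficiently from the target.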
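Training large-scale mixture-of-experts models efficiently on modern hardware requires assigning datapoints to different experts, each with a limited capacity. Recently proposed assignment procedures lack a probabilistic interpretation and use biased estimators for training. As an alternative, we propose two unbiased estimators based on principled stochastic assignment procedures: one that skips datapoints that exceed expert capacity, and one that samples perfectly balanced assignments using an extension of the Gumbel-Matching distribution [29]. Both estimators are unbiased because they correct for the sampling procedure used. In a toy experiment, we find that the 'skip' estimator is more effective than balanced sampling, and that both are more robust in solving the task than biased alternatives.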
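Finding different solutions to the same problem is a key aspect of intelligence, associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of successor features while ensuring that they are near-optimal. We formalize the problem as a Constrained Markov Decision Process (CMDP) in which the goal is to find policies that maximize diversity, characterized by an intrinsic diversity reward, while remaining near-optimal with respect to the extrinsic reward of the MDP. We also analyze the performance of recently proposed robustness and discrimination rewards and find that they are sensitive to the initialization of the procedure and can converge to sub-optimal solutions. To alleviate this, we propose new explicit diversity rewards that aim to minimize the correlation between the successor features of the policies in the set. We compare the different diversity mechanisms in the DeepMind Control Suite and find that the type of explicit diversity we propose is important for discovering distinct behaviors, such as different locomotion patterns.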
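To make the notion of successor features concrete, here is a minimal sketch, assuming the usual definition as the expected discounted sum of state features along a policy's trajectory. It estimates them from a single trajectory and compares two policies by cosine similarity. The function names are hypothetical, and the similarity measure is a stand-in for illustration, not the diversity reward proposed in the abstract.

```python
import math


def discounted_feature_sum(features, gamma):
    """Single-trajectory estimate of successor features.

    features is a list of feature vectors phi(s_0), phi(s_1), ...
    Returns sum_t gamma**t * phi(s_t), computed backwards.
    """
    psi = [0.0] * len(features[0])
    for phi in reversed(features):
        psi = [f + gamma * p for f, p in zip(phi, psi)]
    return psi


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two hypothetical policies visiting states with different features.
psi_a = discounted_feature_sum([[1.0, 0.0], [1.0, 0.0]], gamma=0.5)
psi_b = discounted_feature_sum([[0.0, 1.0], [0.0, 1.0]], gamma=0.5)
similarity = cosine_similarity(psi_a, psi_b)
```

Under this stand-in measure, a lower similarity between two policies' successor features indicates that they spend their discounted time in different regions of feature space.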
We define and address the problem of unsupervised learning of disentangled representations on data generated from independent factors of variation. We propose FactorVAE, a method that disentangles by encouraging the distribution of representations to be factorial and hence independent across the dimensions. We show that it improves upon β-VAE by providing a better trade-off between disentanglement and reconstruction quality. Moreover, we highlight the problems of a commonly used disentanglement metric and introduce a new metric that does not suffer from them.
The reparameterization trick enables optimizing large scale stochastic computation graphs via gradient descent. The essence of the trick is to refactor each stochastic node into a differentiable function of its parameters and a random variable with fixed distribution. After refactoring, the gradients of the loss propagated by the chain rule through the graph are low variance unbiased estimators of the gradients of the expected loss. While many continuous random variables have such reparameterizations, discrete random variables lack useful reparameterizations due to the discontinuous nature of discrete states. In this work we introduce CONCRETE random variables: CONtinuous relaxations of disCRETE random variables. The Concrete distribution is a new family of distributions with closed form densities and a simple reparameterization. Whenever a discrete stochastic node of a computation graph can be refactored into a one-hot bit representation that is treated continuously, Concrete stochastic nodes can be used with automatic differentiation to produce low-variance biased gradients of objectives (including objectives that depend on the log-probability of latent stochastic nodes) on the corresponding discrete graph. We demonstrate the effectiveness of Concrete relaxations on density estimation and structured prediction tasks using neural networks.
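The "simple reparameterization" mentioned above can be sketched as follows, assuming the commonly stated form: add independent Gumbel noise to the log-weights, divide by a temperature, and apply a softmax. The function name, argument names, and clamping constants are illustrative assumptions.

```python
import math
import random


def sample_concrete(log_alpha, temperature, rng):
    """Draw one relaxed one-hot sample (illustrative sketch).

    log_alpha   unnormalized log-weights, one per category
    temperature positive scalar; smaller values give samples nearer one-hot
    rng         a random.Random instance supplying the fixed noise source
    """
    logits = []
    for la in log_alpha:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        logits.append((la + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


sample = sample_concrete([0.0, 1.0, -1.0], temperature=0.5,
                         rng=random.Random(0))
```

The sample is a deterministic, differentiable function of `log_alpha` once the noise is fixed, which is what lets an automatic differentiation framework propagate gradients through it; this pure-Python version only illustrates the sampling path.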
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using visual input.
Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
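The abstract describes training with a variant of Q-learning but gives no equations. As a minimal sketch, assuming the standard one-step Q-learning regression target, the quantity the network is trained to match can be written as below; the function names and the squared-error loss are illustrative assumptions rather than details taken from the abstract.

```python
def q_learning_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target (illustrative sketch).

    reward        reward observed after taking the action
    next_q_values estimated Q-values for every action in the next state
    done          True if the next state is terminal
    """
    if done:
        return reward  # no future reward after a terminal state
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value, target):
    """Squared difference between the current estimate and the target."""
    return (q_value - target) ** 2


target = q_learning_target(1.0, [0.5, 2.0], done=False, gamma=0.9)
loss = squared_td_error(2.0, target)
```

In a deep variant, `next_q_values` would come from the convolutional network's output for the next frame stack, and the loss would be minimized by gradient descent on the network parameters.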