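We consider the problem of learning the optimal threshold policy for control problems. Threshold policies make control decisions by evaluating whether an element of the system state exceeds a certain threshold, whose value is determined by the other elements of the system state. By leveraging the monotone property of threshold policies, we prove that their policy gradients have a surprisingly simple expression. We use this simple expression to build an off-policy actor-critic algorithm for learning the optimal threshold policy. Simulation results show that our policy significantly outperforms other reinforcement learning algorithms because it is able to exploit the monotone property. In addition, we show that the Whittle index, a powerful tool for restless multi-armed bandit problems, is equivalent to the optimal threshold policy for an alternative problem. This observation leads to a simple algorithm that finds the Whittle index by learning the optimal threshold policy in the alternative problem. Simulation results show that our algorithm learns much faster than several recent studies that learn the Whittle index through indirect means.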
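As an illustration of the threshold-policy structure described in this abstract, here is a minimal sketch. The function names and the linear threshold function are hypothetical and chosen only for illustration; they are not taken from the paper, and no learning is performed here.

```python
from typing import Callable, Sequence


def threshold_policy(
    scalar_state: float,
    other_state: Sequence[float],
    threshold_fn: Callable[[Sequence[float]], float],
) -> int:
    """Return 1 (act) if the scalar state element exceeds the threshold, else 0.

    The threshold itself is a function of the remaining state elements.
    """
    return 1 if scalar_state > threshold_fn(other_state) else 0


def linear_threshold(other_state: Sequence[float]) -> float:
    # Hypothetical threshold function (an assumption for this sketch):
    # an affine function of the other state elements.
    return 0.5 + 0.1 * sum(other_state)


# With other_state = [1.0, 2.0] the threshold is roughly 0.8, so the decision
# switches from 0 to 1 once the scalar state rises above that value.
decisions = [threshold_policy(x, [1.0, 2.0], linear_threshold) for x in (0.2, 0.9, 1.5)]
```

The decision is monotone in the scalar state: once it switches to 1 it stays at 1 for every larger value, which is the property the abstract says the policy gradient exploits.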
Translated by Google Translate
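We introduce robustness in restless multi-armed bandits (RMABs), a popular model for constrained resource allocation among independent stochastic processes (arms). Nearly all RMAB techniques assume that the stochastic dynamics are precisely known. However, in many real-world settings, the dynamics are estimated with significant uncertainty, e.g., from historical data, which can lead to bad outcomes if ignored. To address this, we develop an algorithm to compute minimax regret-robust policies for RMABs. Our approach uses a double oracle framework (oracles for the agent and for nature), which is often used for single-process robust planning but requires significant new techniques to accommodate the combinatorial nature of RMABs. Specifically, we design a deep reinforcement learning (RL) algorithm, DDLPO, which tackles the combinatorial challenge by learning an auxiliary "$\lambda$-network" in tandem with per-arm policy networks, greatly reducing sample complexity, with guarantees on convergence. DDLPO, which is of general interest, implements our reward-maximizing agent oracle. We then tackle the challenging regret-maximizing nature oracle, a non-stationary RL challenge, by formulating it as a multi-agent RL problem between a policy optimizer and adversarial nature. This formulation is of general interest; we solve it for RMABs by creating a multi-agent extension of DDLPO with a shared critic. We show that our approaches work well in three experimental domains.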
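Offline reinforcement learning (learning a policy from a batch of data) is known to be hard: without making strong assumptions, it is easy to construct counterexamples on which existing algorithms fail. In this work, we instead consider a property of certain real-world problems where offline reinforcement learning should be effective: those where actions have only limited impact on a part of the state. We formalize and introduce this Action Impact Regularity (AIR) property. We further propose an algorithm that assumes and exploits the AIR property, and we bound the suboptimality of the output policy when the MDP satisfies AIR. Finally, we demonstrate that our algorithm outperforms existing offline reinforcement learning algorithms across different data collection policies in two simulated environments where the regularity holds.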
Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. Solving RMABs requires information on transition dynamics, which are often unknown upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we estimate confidence bounds of the transition probabilities and formulate a bilinear program to compute optimistic Whittle indices using these estimates. Our algorithm, UCWhittle, achieves sublinear $O(H \sqrt{T \log T})$ frequentist regret to solve RMABs with unknown transitions in $T$ episodes with a constant horizon $H$. Empirically, we demonstrate that UCWhittle leverages the structure of RMABs and the Whittle index policy solution to achieve better performance than existing online learning baselines across three domains, including one constructed via sampling from a real-world maternal and childcare dataset.
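To make the idea of confidence bounds on transition probabilities concrete, the sketch below computes a Hoeffding-style interval for a single transition probability from visit counts. The function name, the choice of Hoeffding's inequality, and the `delta` parameter are assumptions made for illustration; the abstract does not specify the exact bound UCWhittle uses, and the bilinear program for optimistic Whittle indices is not shown.

```python
import math
from typing import Tuple


def transition_confidence_interval(
    successes: int, visits: int, delta: float = 0.05
) -> Tuple[float, float]:
    """Two-sided Hoeffding interval for a Bernoulli transition probability.

    successes: number of times the transition of interest was observed
    visits:    number of times the (state, action) pair was visited
    delta:     allowed failure probability of the interval
    """
    if visits == 0:
        # No data yet: the probability could be anything in [0, 1].
        return 0.0, 1.0
    p_hat = successes / visits
    # Hoeffding: P(|p_hat - p| >= r) <= 2 * exp(-2 * visits * r^2)
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * visits))
    return max(0.0, p_hat - radius), min(1.0, p_hat + radius)


low, high = transition_confidence_interval(successes=30, visits=100)
```

An optimistic planner would then pick, within each interval, the transition probabilities that yield the most favorable index, and the intervals shrink as more transitions are observed.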
This work provides a Deep Reinforcement Learning approach to solving a periodic review inventory control system with stochastic vendor lead times, lost sales, correlated demand, and price matching. While this dynamic program has historically been considered intractable, our results show that several policy learning approaches are competitive with or outperform classical methods. In order to train these algorithms, we develop novel techniques to convert historical data into a simulator. On the theoretical side, we present learnability results on a subclass of inventory control problems, where we provide a provable reduction of the reinforcement learning problem to that of supervised learning. On the algorithmic side, we present a model-based reinforcement learning procedure (Direct Backprop) to solve the periodic review inventory control problem by constructing a differentiable simulator. Under a variety of metrics Direct Backprop outperforms model-free RL and newsvendor baselines, in both simulations and real-world deployments.
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies "end-to-end": directly from raw pixel inputs.
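Learning optimal behavior from existing data is one of the most important problems in reinforcement learning (RL). This is known as "off-policy control" in RL, where an agent's objective is to compute an optimal policy based on data obtained from a given policy (known as the behavior policy). Because the optimal policy can be very different from the behavior policy, learning optimal behavior is much harder in the off-policy setting than in the "on-policy" setting, where new data from the policy updates are used in learning. This work proposes an off-policy natural actor-critic algorithm that uses state-action distribution correction to handle the off-policy behavior and the natural policy gradient for sample efficiency. Existing natural-gradient-based actor-critic algorithms with convergence guarantees require fixed features for approximating both the policy and the value function. This often leads to suboptimal learning in many RL applications. In contrast, our proposed algorithm uses compatible features, which allow arbitrary neural networks to approximate the policy and the value function while guaranteeing convergence to a locally optimal policy. We illustrate the benefit of the proposed off-policy natural gradient algorithm by comparing it with the vanilla gradient actor-critic algorithm on benchmark RL tasks.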
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy, that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
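The maximum entropy objective that this abstract refers to is commonly written as follows. The notation is an assumption, since the abstract itself introduces no symbols: $\rho_\pi$ is the state-action distribution induced by the policy $\pi$, $r$ the reward, $\mathcal{H}$ the entropy, and $\alpha$ a temperature weighting the entropy term against the reward.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```

Setting $\alpha = 0$ recovers the standard expected-return objective; larger values of $\alpha$ favor more random policies.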
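We develop a reinforcement learning (RL) framework for applications that deal with sequential decisions and exogenous uncertainty, such as resource allocation and inventory management. In these applications, the uncertainty is due only to exogenous variables such as future demand. A popular approach is to predict the exogenous variables from historical data and then plan with the predictions. However, this indirect approach requires a high-fidelity model of the exogenous process to guarantee good downstream decisions, which can be impractical when the exogenous process is complex. In this work, we propose an alternative approach based on hindsight learning, which sidesteps modeling the exogenous process. Our key insight is that, unlike in Sim2Real RL, in these applications we can revisit past decisions in the historical data and derive counterfactual consequences for other actions. Our framework uses hindsight-optimal actions as the policy training signal and has strong theoretical guarantees on decision-making performance. We use the framework to develop an algorithm that allocates compute resources for real-world Microsoft Azure workloads. The results show that our approach learns better policies than domain-specific heuristics and Sim2Real RL baselines.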
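The experience replay mechanism allows agents to use experiences multiple times. In prior work, the sampling probability of transitions was adjusted according to their importance. Reassigning sampling probabilities for every transition in the replay buffer after each iteration is highly inefficient. Therefore, to gain computational efficiency, prioritized experience replay algorithms recalculate the significance of a transition only when that transition is sampled. However, the importance of transitions changes dynamically as the agent's policy and value function are updated. In addition, the replay buffer stores transitions generated by the agent's previous policies, which may deviate significantly from the agent's most recent policy. Greater deviation from the most recent policy leads to more off-policy updates, which is detrimental to the agent. In this paper, we develop a novel algorithm, Batch Prioritizing Experience Replay via KL Divergence (KLPER), which prioritizes batches of transitions rather than prioritizing each transition directly. Moreover, to reduce the off-policyness of the updates, our algorithm selects one batch from a certain number of batches and forces the agent to learn from the batch that is most likely to have been generated by the agent's most recent policy. We combine our algorithm with Deep Deterministic Policy Gradient and Twin Delayed Deep Deterministic Policy Gradient and evaluate it on various continuous control tasks. KLPER provides promising improvements for deep deterministic continuous control algorithms in terms of sample efficiency, final performance, and stability of the policy during training.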
We present temporally layered architecture (TLA), a biologically inspired system for temporally adaptive distributed control. TLA layers a fast and a slow controller together to achieve temporal abstraction that allows each layer to focus on a different time-scale. Our design is biologically inspired and draws on the architecture of the human brain which executes actions at different timescales depending on the environment's demands. Such distributed control design is widespread across biological systems because it increases survivability and accuracy in certain and uncertain environments. We demonstrate that TLA can provide many advantages over existing approaches, including persistent exploration, adaptive control, explainable temporal behavior, compute efficiency and distributed control. We present two different algorithms for training TLA: (a) Closed-loop control, where the fast controller is trained over a pre-trained slow controller, allowing better exploration for the fast controller and closed-loop control where the fast controller decides whether to "act-or-not" at each timestep; and (b) Partially open loop control, where the slow controller is trained over a pre-trained fast controller, allowing for open loop-control where the slow controller picks a temporally extended action or defers the next n-actions to the fast controller. We evaluated our method on a suite of continuous control tasks and demonstrate the advantages of TLA over several strong baselines.
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
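The "expected gradient of the action-value function" mentioned in this abstract is usually stated as below. The symbols are assumptions, since the abstract introduces none: $\mu_\theta$ is the deterministic policy with parameters $\theta$, $Q^{\mu}$ its action-value function, and $\rho^{\mu}$ the discounted state distribution under $\mu_\theta$.

```latex
\nabla_\theta J(\mu_\theta) =
  \mathbb{E}_{s \sim \rho^{\mu}}
  \left[ \nabla_\theta \mu_\theta(s) \,
         \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \right]
```

The expectation is taken over states only, whereas the stochastic policy gradient also integrates over actions; this is the source of the estimation efficiency the abstract claims.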
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.