We develop a simple framework to learn bio-inspired foraging policies using human data. We conduct an experiment where humans are virtually immersed in an open field foraging environment and are trained to collect the highest amount of rewards. A Markov Decision Process (MDP) framework is introduced to model the human decision dynamics. Then, Imitation Learning (IL) based on maximum likelihood estimation is used to train Neural Networks (NN) that map human decisions to observed states. The results show that passive imitation substantially underperforms humans. We further refine the human-inspired policies via Reinforcement Learning (RL) using the on-policy Proximal Policy Optimization (PPO) algorithm which shows better stability than other algorithms and can steadily improve the policies pretrained with IL. We show that the combination of IL and RL can match human results and that good performance strongly depends on combining the allocentric information with an egocentric representation of the environment.
translated by 谷歌翻译
深度强化学习(RL)导致了许多最近和开创性的进步。但是,这些进步通常以培训的基础体系结构的规模增加以及用于训练它们的RL算法的复杂性提高,而均以增加规模的成本。这些增长反过来又使研究人员更难迅速原型新想法或复制已发表的RL算法。为了解决这些问题,这项工作描述了ACME,这是一个用于构建新型RL算法的框架,这些框架是专门设计的,用于启用使用简单的模块化组件构建的代理,这些组件可以在各种执行范围内使用。尽管ACME的主要目标是为算法开发提供一个框架,但第二个目标是提供重要或最先进算法的简单参考实现。这些实现既是对我们的设计决策的验证,也是对RL研究中可重复性的重要贡献。在这项工作中,我们描述了ACME内部做出的主要设计决策,并提供了有关如何使用其组件来实施各种算法的进一步详细信息。我们的实验为许多常见和最先进的算法提供了基准,并显示了如何为更大且更复杂的环境扩展这些算法。这突出了ACME的主要优点之一,即它可用于实现大型,分布式的RL算法,这些算法可以以较大的尺度运行,同时仍保持该实现的固有可读性。这项工作提出了第二篇文章的版本,恰好与模块化的增加相吻合,对离线,模仿和从演示算法学习以及作为ACME的一部分实现的各种新代理。
translated by 谷歌翻译
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
translated by 谷歌翻译
在各种控制任务域中,现有控制器提供了基线的性能水平,虽然可能是次优的 - 应维护。依赖于国家和行动空间的广泛探索的强化学习(RL)算法可用于优化控制策略。但是,完全探索性的RL算法可能会在训练过程中降低低于基线水平的性能。在本文中,我们解决了控制政策的在线优化问题,同时最大程度地减少了遗憾的W.R.T基线政策绩效。我们提出了一个共同的仿制学习框架,表示乔尔。 JIRL中的学习过程假设了基线策略的可用性,并设计了两个目标\ textbf {(a)}利用基线的在线演示,以最大程度地减少培训期间的遗憾W.R.T的基线策略,\ textbf {(b) }最终超过了基线性能。 JIRL通过最初学习模仿基线策略并逐渐将控制从基线转移到RL代理来解决这些目标。实验结果表明,JIRR有效地实现了几个连续的动作空间域中的上述目标。结果表明,JIRL在最终性能中与最先进的算法相当,同时在所有提出的域中训练期间都会降低基线后悔。此外,结果表明,对于最先进的基线遗憾最小化方法,其基线后悔的减少因素最高为21美元。
translated by 谷歌翻译
我们专注于一个典型的物流部门的卸载问题,该问题被建模为顺序的选择任务。在这种类型的任务中,现代的机器学习技术已经显示出比经典系统更好的工作,因为它们更适合随机性,并且能够更好地应对大型不确定性。更具体地说,在这方面,有监督和模仿学习取得了出色的成果,因为需要某种形式的监督,这对于所有设置并不总是可获得的。另一方面,加固学习(RL)需要许多更温和的监督形式,但由于其效率低下仍然不切实际。在本文中,我们提出并理论上激励了一种新颖的无监督奖励构成算法,从专家的观察结果中塑造了算法,该算法放宽了代理商所需的监督水平,并致力于改善我们任务中的RL绩效。
translated by 谷歌翻译
强化学习算法在解决稀疏和延迟奖励的复杂分层任务时需要许多样本。对于此类复杂的任务,最近提出的方向舵使用奖励再分配来利用与完成子任务相关的Q功能中的步骤。但是,由于当前的探索策略无法在合理的时间内发现它们,因此通常只有很少有具有高回报的情节作为示范。在这项工作中,我们介绍了Align-rudder,该王牌利用了一个配置文件模型来进行奖励重新分布,该模型是从多个示范序列比对获得的。因此,Align-Rudder有效地采用了奖励再分配,从而大大改善了很少的演示学习。 Align-rudder在复杂的人工任务上的竞争者优于竞争对手,延迟的奖励和几乎没有示威的竞争者。在Minecraft获得Diamond的任务上,Align Rudder能够挖掘钻石,尽管不经常。代码可在https://github.com/ml-jku/align-rudder上找到。 YouTube:https://youtu.be/ho-_8zul-uy
translated by 谷歌翻译
机器学习算法中多个超参数的最佳设置是发出大多数可用数据的关键。为此目的,已经提出了几种方法,例如进化策略,随机搜索,贝叶斯优化和启发式拇指规则。在钢筋学习(RL)中,学习代理在与其环境交互时收集的数据的信息内容严重依赖于许多超参数的设置。因此,RL算法的用户必须依赖于基于搜索的优化方法,例如网格搜索或Nelder-Mead单简单算法,这对于大多数R1任务来说是非常效率的,显着减慢学习曲线和离开用户的速度有目的地偏见数据收集的负担。在这项工作中,为了使RL算法更加用户独立,提出了一种使用贝叶斯优化的自主超参数设置的新方法。来自过去剧集和不同的超参数值的数据通过执行行为克隆在元学习水平上使用,这有助于提高最大化获取功能的加强学习变体的有效性。此外,通过紧密地整合在加强学习代理设计中的贝叶斯优化,还减少了收敛到给定任务的最佳策略所需的状态转换的数量。与其他手动调整和基于优化的方法相比,计算实验显示了有希望的结果,这突出了改变算法超级参数来增加所生成数据的信息内容的好处。
translated by 谷歌翻译
Reinforcement Learning has emerged as a strong alternative to solve optimization tasks efficiently. The use of these algorithms highly depends on the feedback signals provided by the environment in charge of informing about how good (or bad) the decisions made by the learned agent are. Unfortunately, in a broad range of problems the design of a good reward function is not trivial, so in such cases sparse reward signals are instead adopted. The lack of a dense reward function poses new challenges, mostly related to exploration. Imitation Learning has addressed those problems by leveraging demonstrations from experts. In the absence of an expert (and its subsequent demonstrations), an option is to prioritize well-suited exploration experiences collected by the agent in order to bootstrap its learning process with good exploration behaviors. However, this solution highly depends on the ability of the agent to discover such trajectories in the early stages of its learning process. To tackle this issue, we propose to combine imitation learning with intrinsic motivation, two of the most widely adopted techniques to address problems with sparse reward. In this work intrinsic motivation is used to encourage the agent to explore the environment based on its curiosity, whereas imitation learning allows repeating the most promising experiences to accelerate the learning process. This combination is shown to yield an improved performance and better generalization in procedurally-generated environments, outperforming previously reported self-imitation learning methods and achieving equal or better sample efficiency with respect to intrinsic motivation in isolation.
translated by 谷歌翻译
Deep reinforcement learning (DRL) provides a new way to generate robot control policy. However, the process of training control policy requires lengthy exploration, resulting in a low sample efficiency of reinforcement learning (RL) in real-world tasks. Both imitation learning (IL) and learning from demonstrations (LfD) improve the training process by using expert demonstrations, but imperfect expert demonstrations can mislead policy improvement. Offline to Online reinforcement learning requires a lot of offline data to initialize the policy, and distribution shift can easily lead to performance degradation during online fine-tuning. To solve the above problems, we propose a learning from demonstrations method named A-SILfD, which treats expert demonstrations as the agent's successful experiences and uses experiences to constrain policy improvement. Furthermore, we prevent performance degradation due to large estimation errors in the Q-function by the ensemble Q-functions. Our experiments show that A-SILfD can significantly improve sample efficiency using a small number of different quality expert demonstrations. In four Mujoco continuous control tasks, A-SILfD can significantly outperform baseline methods after 150,000 steps of online training and is not misled by imperfect expert demonstrations during training.
translated by 谷歌翻译
资产分配(或投资组合管理)是确定如何最佳将有限预算的资金分配给一系列金融工具/资产(例如股票)的任务。这项研究调查了使用无模型的深RL代理应用于投资组合管理的增强学习(RL)的性能。我们培训了几个RL代理商的现实股票价格,以学习如何执行资产分配。我们比较了这些RL剂与某些基线剂的性能。我们还比较了RL代理,以了解哪些类别的代理表现更好。从我们的分析中,RL代理可以执行投资组合管理的任务,因为它们的表现明显优于基线代理(随机分配和均匀分配)。四个RL代理(A2C,SAC,PPO和TRPO)总体上优于最佳基线MPT。这显示了RL代理商发现更有利可图的交易策略的能力。此外,基于价值和基于策略的RL代理之间没有显着的性能差异。演员批评者的表现比其他类型的药物更好。同样,在政策代理商方面的表现要好,因为它们在政策评估方面更好,样品效率在投资组合管理中并不是一个重大问题。这项研究表明,RL代理可以大大改善资产分配,因为它们的表现优于强基础。基于我们的分析,在政策上,参与者批评的RL药物显示出最大的希望。
translated by 谷歌翻译
我们开发了一种新的持续元学习方法,以解决连续多任务学习中的挑战。在此设置中,代理商的目标是快速通过任何任务序列实现高奖励。先前的Meta-Creenifiltive学习算法已经表现出有希望加速收购新任务的结果。但是,他们需要在培训期间访问所有任务。除了简单地将过去的经验转移到新任务,我们的目标是设计学习学习的持续加强学习算法,使用他们以前任务的经验更快地学习新任务。我们介绍了一种新的方法,连续的元策略搜索(Comps),通过以增量方式,在序列中的每个任务上,通过序列的每个任务来消除此限制,而无需重新访问先前的任务。 Comps持续重复两个子程序:使用RL学习新任务,并使用RL的经验完全离线Meta学习,为后续任务学习做好准备。我们发现,在若干挑战性连续控制任务的旧序列上,Comps优于持续的持续学习和非政策元增强方法。
translated by 谷歌翻译
由于数据量增加,金融业的快速变化已经彻底改变了数据处理和数据分析的技术,并带来了新的理论和计算挑战。与古典随机控制理论和解决财务决策问题的其他分析方法相比,解决模型假设的财务决策问题,强化学习(RL)的新发展能够充分利用具有更少模型假设的大量财务数据并改善复杂的金融环境中的决策。该调查纸目的旨在审查最近的资金途径的发展和使用RL方法。我们介绍了马尔可夫决策过程,这是许多常用的RL方法的设置。然后引入各种算法,重点介绍不需要任何模型假设的基于价值和基于策略的方法。连接是用神经网络进行的,以扩展框架以包含深的RL算法。我们的调查通过讨论了这些RL算法在金融中各种决策问题中的应用,包括最佳执行,投资组合优化,期权定价和对冲,市场制作,智能订单路由和Robo-Awaring。
translated by 谷歌翻译
模仿学习在有效地学习政策方面对复杂的决策问题有着巨大的希望。当前的最新算法经常使用逆增强学习(IRL),在给定一组专家演示的情况下,代理会替代奖励功能和相关的最佳策略。但是,这种IRL方法通常需要在复杂控制问题上进行实质性的在线互动。在这项工作中,我们提出了正规化的最佳运输(ROT),这是一种新的模仿学习算法,基于最佳基于最佳运输轨迹匹配的最新进展。我们的主要技术见解是,即使只有少量演示,即使只有少量演示,也可以自适应地将轨迹匹配的奖励与行为克隆相结合。我们对横跨DeepMind Control Suite,OpenAI Robotics和Meta-World基准的20个视觉控制任务进行的实验表明,与先前最新的方法相比,平均仿真达到了90%的专家绩效的速度,达到了90%的专家性能。 。在现实世界的机器人操作中,只有一次演示和一个小时的在线培训,ROT在14个任务中的平均成功率为90.1%。
translated by 谷歌翻译
仅国家模仿学习的最新进展将模仿学习的适用性扩展到现实世界中的范围,从而减轻了观察专家行动的需求。但是,现有的解决方案只学会从数据中提取州对行动映射策略,而无需考虑专家如何计划到目标。这阻碍了利用示威游行并限制政策的灵活性的能力。在本文中,我们介绍了解耦政策优化(DEPO),该策略优化(DEPO)明确将策略脱离为高级状态计划者和逆动力学模型。借助嵌入式的脱钩策略梯度和生成对抗训练,DEPO可以将知识转移到不同的动作空间或状态过渡动态,并可以将规划师推广到无示威的状态区域。我们的深入实验分析表明,DEPO在学习最佳模仿性能的同时学习通用目标状态计划者的有效性。我们证明了DEPO通过预训练跨任务转移的吸引力,以及与各种技能共同培训的潜力。
translated by 谷歌翻译
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an offpolicy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
translated by 谷歌翻译
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policybased methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
translated by 谷歌翻译
强化学习的主要方法是根据预期的回报将信贷分配给行动。但是,我们表明回报可能取决于政策,这可能会导致价值估计的过度差异和减慢学习的速度。取而代之的是,我们证明了优势函数可以解释为因果效应,并与因果关系共享相似的属性。基于此洞察力,我们提出了直接优势估计(DAE),这是一种可以对优势函数进行建模并直接从政策数据进行估算的新方法,同时同时最大程度地减少了返回的方差而无需(操作 - )值函数。我们还通过显示如何无缝整合到DAE中来将我们的方法与时间差异方法联系起来。所提出的方法易于实施,并且可以通过现代参与者批评的方法很容易适应。我们对三个离散控制域进行经验评估DAE,并表明它可以超过广义优势估计(GAE),这是优势估计的强大基线,当将大多数环境应用于策略优化时。
translated by 谷歌翻译
在实践中,只要可以设计教学代理以提供专家监督,仿制学习就是纯粹的加强学习。但是,我们表明,当教学代理商决定与学生无法访问的特权信息时,在模仿学习期间,此信息被边缘化,导致“模仿差距”,导致潜在,差距。先前的工作通过仿制学习的仿制学习来弥合这一差距。虽然经常成功,但逐步的进展失败,需要频繁切换勘探和记忆之间的频繁交换。为了更好地解决这些任务并减轻模仿缺口,我们提出“适应性不管”(顾问)。顾问在培训期间动态重量仿制和奖励的加固学习损失,在模仿和探索之间启用了在线切换。在Gridworlds中设置的一套充满挑战的任务,多代理粒子环境和高保真3D模拟器,我们展示了与顾问的在线交换,优于纯粹的模仿,纯粹的加固学习以及它们的顺序和并行组合。
translated by 谷歌翻译
Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard offpolicy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning without data correlated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.
translated by 谷歌翻译
与政策策略梯度技术相比,使用先前收集的数据的无模型的无模型深钢筋学习(RL)方法可以提高采样效率。但是,当利益政策的分布与收集数据的政策之间的差异时,非政策学习变得具有挑战性。尽管提出了良好的重要性抽样和范围的政策梯度技术来补偿这种差异,但它们通常需要一系列长轨迹,以增加计算复杂性并引起其他问题,例如消失或爆炸梯度。此外,由于需要行动概率,它们对连续动作领域的概括严格受到限制,这不适合确定性政策。为了克服这些局限性,我们引入了一种替代的非上政策校正算法,用于连续作用空间,参与者 - 批判性非政策校正(AC-OFF-POC),以减轻先前收集的数据引入的潜在缺陷。通过由代理商对随机采样批次过渡的状态的最新动作决策计算出的新颖差异度量,该方法不需要任何策略的实际或估计的行动概率,并提供足够的一步重要性抽样。理论结果表明,引入的方法可以使用固定的独特点获得收缩映射,从而可以进行“安全”的非政策学习。我们的经验结果表明,AC-Off-POC始终通过有效地安排学习率和Q学习和政策优化的学习率,以比竞争方法更少的步骤改善最新的回报。
translated by 谷歌翻译