Offline reinforcement-learning (RL) algorithms learn to make decisions using a given, fixed training dataset without the possibility of additional online data collection. This problem setting is captivating because it holds the promise of utilizing previously collected datasets without any costly or risky interaction with the environment. However, this promise also bears the drawback of this setting. The restricted dataset induces subjective uncertainty because the agent can encounter unfamiliar sequences of states and actions that the training data did not cover. Moreover, inherent system stochasticity further increases uncertainty and aggravates the offline RL problem, preventing the agent from learning an optimal policy. To mitigate the destructive uncertainty effects, we need to balance the aspiration to take reward-maximizing actions with the incurred risk due to incorrect ones. In financial economics, modern portfolio theory (MPT) is a method that risk-averse investors can use to construct diversified portfolios that maximize their returns without unacceptable levels of risk. We integrate MPT into the agent's decision-making process to present a simple-yet-highly-effective risk-aware planning algorithm for offline RL. Our algorithm allows us to systematically account for the \emph{estimated quality} of specific actions and their \emph{estimated risk} due to the uncertainty. We show that our approach can be coupled with the Transformer architecture to yield a state-of-the-art planner for offline RL tasks, maximizing the return while significantly reducing the variance.
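The trade-off between the estimated quality of an action and its estimated risk can be illustrated with a small mean-variance sketch. This is not the authors' planner: it assumes each candidate action comes with a handful of sampled return estimates, and the names `risk_adjusted_score`, `select_action`, and `risk_aversion` are hypothetical.

```python
from statistics import mean, pvariance


def risk_adjusted_score(returns, risk_aversion=1.0):
    """Mean-variance criterion: expected return minus a variance penalty."""
    return mean(returns) - risk_aversion * pvariance(returns)


def select_action(candidates, risk_aversion=1.0):
    """Pick the action whose sampled returns give the best risk-adjusted score.

    `candidates` maps an action label to a list of sampled return estimates
    (an assumed input format, chosen only for illustration).
    """
    return max(
        candidates,
        key=lambda action: risk_adjusted_score(candidates[action], risk_aversion),
    )


# Toy data: "risky" has a higher mean return but a much larger spread.
candidates = {
    "safe": [1.0, 1.0, 1.0, 1.0],
    "risky": [4.0, -2.0, 4.0, -1.0],
}

risk_neutral_choice = select_action(candidates, risk_aversion=0.0)
risk_averse_choice = select_action(candidates, risk_aversion=1.0)
```

With `risk_aversion=0.0` the criterion reduces to picking the highest mean; raising it shifts the choice toward low-variance actions.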
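Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize problems in time. However, we can also view RL as a generic sequence modeling problem, with the goal of producing a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether high-capacity sequence prediction models that work well in other domains, such as natural language processing, can also provide effective solutions to the RL problem. To this end, we explore how RL can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as a sequence modeling problem simplifies a range of design decisions, allowing us to dispense with many of the components common in offline RL algorithms. We demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL. Further, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.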
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task. 34th Conference on Neural Information Processing Systems (NeurIPS 2020).
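The idea of penalizing rewards by the uncertainty of the dynamics can be sketched in a few lines. The abstract does not say how the uncertainty is estimated; the sketch below assumes, purely for illustration, that it is the largest per-dimension disagreement among an ensemble of next-state predictions. The names `ensemble_uncertainty`, `penalized_reward`, and `penalty_coef` are hypothetical.

```python
from statistics import pstdev


def ensemble_uncertainty(predictions):
    """Largest per-dimension standard deviation across ensemble members.

    `predictions` is a list of next-state vectors, one per ensemble member.
    Using ensemble disagreement as the uncertainty signal is an assumption
    made for this sketch, not a detail taken from the abstract.
    """
    per_dimension = zip(*predictions)  # group values by state dimension
    return max(pstdev(values) for values in per_dimension)


def penalized_reward(reward, predictions, penalty_coef=1.0):
    """Reward minus a penalty proportional to the dynamics uncertainty."""
    return reward - penalty_coef * ensemble_uncertainty(predictions)


# Two toy ensembles: one where members agree, one where they disagree.
agreeing = [[0.0, 1.0], [0.0, 1.0]]
disagreeing = [[0.0, 1.0], [2.0, 1.0]]

reward_in_support = penalized_reward(1.0, agreeing, penalty_coef=0.5)
reward_out_of_support = penalized_reward(1.0, disagreeing, penalty_coef=0.5)
```

A policy trained on the penalized reward is discouraged from visiting state-action pairs where the learned dynamics are unreliable, which is the trade-off between gain and risk that the abstract describes.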
Behavioural cloning (BC) is a commonly used imitation learning method to infer a sequential decision-making policy from expert demonstrations. However, when the quality of the data is not optimal, the resulting behavioural policy also performs sub-optimally once deployed. Recently, there has been a surge in offline reinforcement learning methods that hold the promise to extract high-quality policies from sub-optimal historical data. A common approach is to perform regularisation during training, encouraging updates during policy evaluation and/or policy improvement to stay close to the underlying data. In this work, we investigate whether an offline approach to improving the quality of the existing data can lead to improved behavioural policies without any changes in the BC algorithm. The proposed data improvement approach - Trajectory Stitching (TS) - generates new trajectories (sequences of states and actions) by `stitching' pairs of states that were disconnected in the original data and generating their connecting new action. By construction, these new transitions are guaranteed to be highly plausible according to probabilistic models of the environment, and to improve a state-value function. We demonstrate that the iterative process of replacing old trajectories with new ones incrementally improves the underlying behavioural policy. Extensive experimental results show that significant performance gains can be achieved using TS over BC policies extracted from the original data. Furthermore, using the D4RL benchmarking suite, we demonstrate that state-of-the-art results are obtained by combining TS with two existing offline learning methodologies reliant on BC, model-based offline planning (MBOP) and policy constraint (TD3+BC).
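Recent work has shown that offline reinforcement learning (RL) can be formulated as a sequence modeling problem (Chen et al., 2021; Janner et al., 2021) and solved via approaches similar to large-scale language modeling. However, any practical instantiation of RL also involves an online component, in which policies pretrained on passive offline datasets are finetuned via task-specific interactions with the environment. We propose Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online finetuning in a unified framework. Our framework uses sequence-level entropy regularizers in conjunction with autoregressive modeling objectives for sample-efficient exploration and finetuning. Empirically, we show that ODT is competitive with the state of the art in absolute performance on the D4RL benchmark, but shows much more significant gains during the finetuning procedure.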
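Reinforcement learning (RL) solves sequential decision-making problems via a trial-and-error process of interacting with the environment. While RL has achieved great success in playing complex video games, making errors is always undesirable in the real world. To improve sample efficiency and thus reduce errors, model-based reinforcement learning (MBRL) is believed to be a promising direction: it builds environment models in which trial and error can take place without real costs. In this survey, we review MBRL with a focus on recent progress in deep RL. For non-tabular environments, there is always a generalization error between the learned environment model and the real environment. It is therefore important to analyze the discrepancy between policy training in the environment model and in the real environment, which in turn guides the algorithm design for better model learning, model usage, and policy training. We also discuss recent advances in other forms of RL, including offline RL, goal-conditioned RL, multi-agent RL, and meta-RL. Moreover, we discuss the applicability and advantages of MBRL in real-world tasks. Finally, we end this survey by discussing the prospects for the future development of MBRL. We believe that MBRL has great potential and advantages in real-world applications that have been overlooked, and we hope this survey will attract more research on MBRL.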
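Impressive results in natural language processing (NLP) based on the Transformer neural network architecture have inspired researchers to explore viewing offline reinforcement learning (RL) as a generic sequence modeling problem. Recent works based on this paradigm have achieved state-of-the-art results on several of the mostly deterministic offline Atari and D4RL benchmarks. However, because these methods jointly model states and actions as a single sequencing problem, they struggle to disentangle the effects of the policy and the world dynamics on the return. Thus, in adversarial or stochastic environments, these methods lead to overly optimistic behavior, which can be dangerous in settings such as autonomous driving. In this work, we propose a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows us at test time to search for policies that are robust to multiple possible futures in the environment. We demonstrate our method's superior performance on a variety of autonomous driving tasks in simulation.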
Offline reinforcement learning (RL) is suitable for safety-critical domains where online exploration is too costly or dangerous. In safety-critical settings, decision-making should take into consideration the risk of catastrophic outcomes. In other words, decision-making should be risk-sensitive. Previous works on risk in offline RL combine together offline RL techniques, to avoid distributional shift, with risk-sensitive RL algorithms, to achieve risk-sensitivity. In this work, we propose risk-sensitivity as a mechanism to jointly address both of these issues. Our model-based approach is risk-averse to both epistemic and aleatoric uncertainty. Risk-aversion to epistemic uncertainty prevents distributional shift, as areas not covered by the dataset have high epistemic uncertainty. Risk-aversion to aleatoric uncertainty discourages actions that may result in poor outcomes due to environment stochasticity. Our experiments show that our algorithm achieves competitive performance on deterministic benchmarks, and outperforms existing approaches for risk-sensitive objectives in stochastic domains.
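Although planning-based sequence modeling methods have shown great potential in continuous control, scaling them to high-dimensional state-action sequences remains an open challenge due to the high computational complexity and innate difficulty of planning in high-dimensional spaces. We propose the Trajectory Autoencoding Planner (TAP), a planning-based sequence modeling RL method that scales to high state-action dimensionalities. Using a state-conditional Vector-Quantized Variational Autoencoder (VQ-VAE), TAP models the conditional distribution of trajectories given the current state. When deployed as an RL agent, TAP avoids planning step by step in a high-dimensional continuous action space and instead looks for the optimal latent code sequences by beam search. Unlike the $O(D^3)$ complexity of the Trajectory Transformer, TAP enjoys constant $O(C)$ planning computational complexity with respect to the state-action dimensionality $D$. Our empirical evaluation also shows that TAP performs increasingly strongly as the dimensionality grows. On Adroit robotic hand manipulation tasks with high state and action dimensionality, TAP surpasses model-based methods, including TT, by a large margin and also beats strong model-free actor-critic baselines.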
The Transformer, originally devised for natural language processing, has also achieved significant success in computer vision. Thanks to its strong expressive power, researchers are investigating ways to apply Transformers to reinforcement learning (RL), and Transformer-based models have shown their potential on representative RL benchmarks. In this paper, we collect and dissect recent advances in Transformer-based RL (TRL) in order to explore its development trajectory and future trends. We group existing developments into two categories, architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation, and autonomous driving. Architecture-enhancement methods consider how to apply the powerful Transformer structure to RL problems under the traditional RL framework; they model agents and environments much more precisely than deep RL methods, but are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". Trajectory-optimization methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework; they are able to extract policies from static datasets and fully use the long-sequence modeling capability of the Transformer. Given these advancements, we review extensions and challenges in TRL and discuss proposals for future directions. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
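In offline reinforcement learning (offline RL), one of the main challenges is dealing with the distributional shift between the learning policy and the given dataset. To address this problem, recent offline RL methods attempt to introduce a conservatism bias that encourages learning in high-confidence areas. Model-free approaches directly encode such a bias into policy or value function learning using conservative regularization or special network structures, but their constrained policy search limits generalization beyond the offline dataset. Model-based approaches learn forward dynamics models with conservatism quantifications and then generate imaginary trajectories to extend the offline dataset. However, due to the limited samples in offline datasets, conservatism quantifications often suffer from overgeneralization in out-of-support regions. The unreliable conservative measures mislead forward model-based imagination into undesired areas, leading to overaggressive behaviors. To encourage more conservatism, we propose a novel model-based offline RL framework called Reverse Offline Model-based Imagination (ROMI). We learn a reverse dynamics model in conjunction with a novel reverse policy, which can generate rollouts leading to the target goal states within the offline dataset. These reverse imaginations provide informed data augmentation for model-free policy learning and enable conservative generalization beyond the offline dataset. ROMI can be combined effectively with off-the-shelf model-free algorithms to enable model-based generalization with proper conservatism. Empirical results show that our method generates more conservative behaviors and achieves state-of-the-art performance on offline RL benchmark tasks.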
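Model-free offline reinforcement learning (RL) methods typically constrain the learned policy to the support of the dataset in order to avoid possibly dangerous out-of-distribution actions or states, which makes it challenging to handle regions outside the support. Model-based RL methods offer a richer dataset and benefit generalization by generating imaginary trajectories with either a trained forward or a trained reverse dynamics model. However, the imagined transitions may be inaccurate, which degrades the performance of the underlying offline RL method. In this paper, we propose to augment the offline dataset by using trained bidirectional dynamics models and rollout policies with a double check. We introduce conservatism by trusting only samples on which the forward model and the backward model agree. Our method, confidence-based bidirectional offline model-based imagination, generates reliable samples and can be combined with any model-free offline RL method. Experimental results on the D4RL benchmarks demonstrate that our method significantly boosts the performance of existing model-free offline RL algorithms and achieves competitive or better scores against baseline methods.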
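One way to picture the double check between a forward and a backward model is the filter below. It is a sketch under stated assumptions, not the paper's procedure: the forward model maps (state, action) to a next state, the backward model maps (next state, action) back to a state, and a transition is trusted only if the reconstructed state lies within a tolerance of the original. The toy models and the names `double_check_filter` and `tolerance` are hypothetical.

```python
import math


def l2_distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def double_check_filter(state_action_pairs, forward_model, backward_model, tolerance):
    """Keep imagined transitions on which the forward and backward models agree."""
    trusted = []
    for state, action in state_action_pairs:
        next_state = forward_model(state, action)
        reconstructed = backward_model(next_state, action)
        if l2_distance(state, reconstructed) <= tolerance:
            trusted.append((state, action, next_state))
    return trusted


def toy_forward(state, action):
    """Toy forward dynamics: the next state is state plus action."""
    return [s + a for s, a in zip(state, action)]


def toy_backward(next_state, action):
    """Toy backward model that is only accurate for small actions."""
    if max(abs(a) for a in action) <= 1.0:
        return [s - a for s, a in zip(next_state, action)]
    return list(next_state)  # ignores large actions, so it disagrees with toy_forward


pairs = [([0.0], [0.5]), ([0.0], [2.0])]
trusted = double_check_filter(pairs, toy_forward, toy_backward, tolerance=0.1)
```

Only the small-action transition survives the filter, because the two toy models disagree on the large action.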
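Deep reinforcement learning (RL) has led to many recent and groundbreaking advances. However, these advances have often come at the cost of both increased scale in the underlying architectures being trained and increased complexity of the RL algorithms used to train them. These increases have in turn made it more difficult for researchers to rapidly prototype new ideas or reproduce published RL algorithms. To address these concerns, this work describes Acme, a framework for constructing novel RL algorithms that is specifically designed to enable agents built from simple, modular components that can be used at various scales of execution. While the primary goal of Acme is to provide a framework for algorithm development, a secondary goal is to provide simple reference implementations of important or state-of-the-art algorithms. These implementations serve both as a validation of our design decisions and as an important contribution to reproducibility in RL research. In this work, we describe the major design decisions made within Acme and give further details on how its components can be used to implement various algorithms. Our experiments provide baselines for a number of common and state-of-the-art algorithms and show how these algorithms can be scaled up for much larger and more complex environments. This highlights one of the primary advantages of Acme, namely that it can be used to implement large, distributed RL algorithms that run at massive scale while still maintaining the inherent readability of the implementation. This work presents a second version of the paper, which coincides with an increase in modularity, additional emphasis on offline, imitation, and learning-from-demonstrations algorithms, and various new agents implemented as part of Acme.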
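Humans can leverage prior experience and learn novel tasks from a handful of demonstrations. In contrast to offline meta-reinforcement learning, which aims to achieve quick adaptation through better algorithm design, we investigate the effect of architectural inductive bias on few-shot learning capability. We propose the Prompt-based Decision Transformer (Prompt-DT), which leverages the sequential modeling ability of the Transformer architecture and the prompt framework to achieve few-shot adaptation in offline RL. We design the trajectory prompt, which contains segments of the few-shot demonstrations and encodes task-specific information to guide policy generation. Our experiments on five MuJoCo control benchmarks show that Prompt-DT is a strong few-shot learner without any extra finetuning on unseen target tasks. Prompt-DT outperforms its variants and strong meta offline RL baselines with a trajectory prompt containing only a few timesteps. Prompt-DT is also robust to changes in prompt length and can generalize to out-of-distribution (OOD) environments.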
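The optimal setting of several hyperparameters in machine learning algorithms is key to making the most of the available data. To this end, several methods have been proposed, such as evolutionary strategies, random search, Bayesian optimization, and heuristic rules of thumb. In reinforcement learning (RL), the information content of the data gathered by the learning agent while interacting with its environment depends heavily on the setting of many hyperparameters. The user of an RL algorithm therefore has to rely on search-based optimization methods, such as grid search or the Nelder-Mead simplex algorithm, which are very inefficient for most RL tasks, significantly slow down the learning curve, and leave the user with the burden of purposefully biasing data gathering. In this work, in order to make RL algorithms more user-independent, a novel approach for autonomous hyperparameter setting using Bayesian optimization is proposed. Data from past episodes and different hyperparameter values are used at a meta-learning level by performing behavioral cloning, which helps improve the effectiveness of maximizing a reinforcement learning variant of an acquisition function. In addition, by tightly integrating Bayesian optimization into the design of the reinforcement learning agent, the number of state transitions needed to converge to the optimal policy for a given task is reduced. Computational experiments show promising results compared with other manual tuning and optimization-based approaches, which highlights the benefit of changing the algorithm's hyperparameters to increase the information content of the generated data.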
Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.
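Learned models and policies can generalize effectively when evaluated within the distribution of the training data, but can produce unpredictable and erroneous outputs on out-of-distribution inputs. To avoid distribution shift when deploying learning-based control algorithms, we seek a mechanism to constrain the agent to states and actions that resemble those it was trained on. In control theory, Lyapunov stability and control-invariant sets allow us to make guarantees about controllers that stabilize the system around specific states, while in machine learning, density models allow us to estimate the training data distribution. Can we combine these two concepts to produce learning-based control algorithms that constrain the system to in-distribution states using only in-distribution actions? In this work, we propose to do this by combining concepts from Lyapunov stability and density estimation, introducing Lyapunov density models: a generalization of control Lyapunov functions and density models that provides guarantees on an agent's ability to stay in-distribution over its entire trajectory.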
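While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high-dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representations of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective that jointly optimizes a latent-space model and a policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL on policy exploration or model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample efficiency of the best prior model-based and model-free RL methods. While such sample-efficient methods are typically computationally demanding, our method reduces wall-clock time by 50%.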
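The rapid changes in the finance industry due to the increasing amount of data have revolutionized the techniques of data processing and data analysis and brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches to financial decision-making problems, which rely heavily on model assumptions, new developments in reinforcement learning (RL) are able to make full use of large amounts of financial data with fewer model assumptions and to improve decisions in complex financial environments. This survey paper aims to review the recent developments and use of RL approaches in finance. We give an introduction to Markov decision processes, which are the setting for many of the commonly used RL approaches. Various algorithms are then introduced, with a focus on value-based and policy-based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. Our survey concludes by discussing the application of these RL algorithms to a variety of decision-making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising.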
Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classification to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning.