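Crop management, including nitrogen (N) fertilization and irrigation management, has a significant impact on crop yield, economic profit, and the environment. Although management guidelines exist, it is challenging to find the optimal management practices for a specific planting environment and crop. Previous work used reinforcement learning (RL) and crop simulators to solve the problem, but the trained policies either have limited performance or are not deployable in the real world. In this paper, we present an intelligent crop management system that optimizes N fertilization and irrigation simultaneously via RL, imitation learning (IL), and crop simulations using the Decision Support System for Agrotechnology Transfer (DSSAT). We first use deep RL, in particular deep Q-network, to train management policies that require all state information from the simulator as observations (denoted as full observation). We then invoke IL to train management policies that need only a limited amount of state information that can be easily obtained (denoted as partial observation), by mimicking the policies previously trained with RL under full observation. We conduct experiments on a case study using maize in Florida and compare the trained policies with a maize management guideline. Our trained policies under both full and partial observation achieve better outcomes, resulting in a higher profit or a similar profit with a smaller environmental impact. Moreover, the partial-observation management policies are directly deployable in the real world because they use readily available information.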
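The full-to-partial observation step described above can be illustrated with a minimal behavior-cloning sketch. Everything here is an assumption made for illustration: `expert_policy` is a hand-written rule standing in for an RL-trained policy, the three state variables and the `partial_view` features are invented, and the student is a simple lookup table rather than a neural network.

```python
import random
from collections import Counter, defaultdict


def expert_policy(full_state):
    """Hypothetical stand-in for an RL-trained policy that sees the full state."""
    day, soil_water, soil_nitrogen = full_state
    if soil_nitrogen < 0.3 and day < 60:
        return "fertilize"
    if soil_water < 0.4:
        return "irrigate"
    return "wait"


def partial_view(full_state):
    """Keep only what is assumed easy to measure: a 10-day bin and a dry-soil flag."""
    day, soil_water, _ = full_state
    return (day // 10, soil_water < 0.4)


def clone(states):
    """Tabular behavior cloning: majority expert action for each partial view."""
    votes = defaultdict(Counter)
    for s in states:
        votes[partial_view(s)][expert_policy(s)] += 1
    return {view: counts.most_common(1)[0][0] for view, counts in votes.items()}


rng = random.Random(0)
# Made-up full states: (day of season, soil water, soil nitrogen).
states = [(rng.randrange(120), rng.random(), rng.random()) for _ in range(5000)]
student = clone(states)

# Fraction of states where the partial-observation student matches the expert.
agreement = sum(student[partial_view(s)] == expert_policy(s) for s in states) / len(states)
```

Because the student cannot see soil nitrogen in this toy setup, its agreement with the expert stays below 1, which mirrors the trade-off between deployability and performance that the abstract describes.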
translated by Google Translate
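Addressing a real-world sequential decision problem with reinforcement learning (RL) usually starts with a simulated environment that mimics real conditions. We present a novel open-source RL environment for realistic crop management tasks. Gym-DSSAT is a Gym interface to the Decision Support System for Agrotechnology Transfer (DSSAT), a high-fidelity crop simulator. DSSAT has been developed over the last 30 years and is widely recognized by agronomists. Gym-DSSAT comes with predefined simulations based on real-world maize experiments. The environment is as easy to use as any Gym environment. We provide performance baselines using basic RL algorithms. We also briefly outline how the monolithic DSSAT simulator, written in Fortran, has been turned into a Python RL environment. Our methodology is generic and may be applied to similar simulators. We report very preliminary experimental results which suggest that RL can help researchers improve the sustainability of fertilization and irrigation practices.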
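To make "as easy to use as any Gym environment" concrete, here is a minimal sketch of the classic Gym interaction loop. `ToyCropEnv` is a hypothetical stand-in written for this example; it is not the Gym-DSSAT API, and its dynamics and reward are made up.

```python
import random


class ToyCropEnv:
    """Hypothetical crop environment; NOT the Gym-DSSAT API.

    Follows the classic Gym convention: reset() returns an observation and
    step(action) returns (observation, reward, done, info).
    """

    def __init__(self, season_length=10, seed=0):
        self.season_length = season_length
        self.rng = random.Random(seed)
        self.day = 0
        self.soil_water = 0.0

    def reset(self):
        self.day = 0
        self.soil_water = 0.5
        return (self.day, self.soil_water)

    def step(self, action):
        # action: irrigation amount in arbitrary units between 0.0 and 1.0
        rain = self.rng.uniform(0.0, 0.2)
        self.soil_water = min(1.0, max(0.0, self.soil_water + action + rain - 0.15))
        # Made-up reward: best near 0.6 soil water, minus a cost for water used.
        reward = 1.0 - abs(self.soil_water - 0.6) - 0.5 * action
        self.day += 1
        done = self.day >= self.season_length
        return (self.day, self.soil_water), reward, done, {}


def run_episode(env, policy):
    """Roll out one season and return the summed reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total


def threshold_policy(obs):
    """Irrigate a fixed amount whenever the soil is drier than 0.5."""
    _, soil_water = obs
    return 0.2 if soil_water < 0.5 else 0.0


episode_return = run_episode(ToyCropEnv(), threshold_policy)
```

Any agent that only calls `reset()` and `step()` could be swapped in for `threshold_policy` without changing the loop, which is the practical benefit of wrapping a simulator behind this interface.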
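We present a crop simulation environment with an OpenAI Gym interface and apply modern deep reinforcement learning (DRL) algorithms to optimize yield. We show empirically that DRL algorithms can be used to discover new policies and approaches that help optimize crop yield while minimizing constraining factors such as water and fertilizer usage. We propose that this hybrid of plant modeling and data-driven methods for discovering new strategies to optimize crop yield may help meet the growing global food demand caused by population expansion and climate change.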
Deep reinforcement learning has considerable potential to improve irrigation scheduling in many cropping systems by applying adaptive amounts of water based on various measurements over time. The goal is to discover an intelligent decision rule that processes information available to growers and prescribes sensible irrigation amounts for the time steps considered. Due to the technical novelty, however, the research on the technique remains sparse and impractical. To accelerate the progress, the paper proposes a general framework and actionable procedure that allow researchers to formulate their own optimisation problems and implement solution algorithms based on deep reinforcement learning. The effectiveness of the framework was demonstrated using a case study of irrigated wheat grown in a productive region of Australia where profits were maximised. Specifically, the decision rule takes nine state variable inputs: crop phenological stage, leaf area index, extractable soil water for each of the five top layers, cumulative rainfall and cumulative irrigation. It returns a probabilistic prescription over five candidate irrigation amounts (0, 10, 20, 30 and 40 mm) every day. The production system was simulated at Goondiwindi using the APSIM-Wheat crop model. After training in the learning environment using 1981--2010 weather data, the learned decision rule was tested individually for each year of 2011--2020. The results were compared against the benchmark profits obtained using irrigation schedules optimised individually for each of the considered years. The discovered decision rule prescribed daily irrigation amounts that achieved more than 96% of the benchmark profits. The framework is general and applicable to a wide range of cropping systems with realistic optimisation problems.
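The shape of such a decision rule can be sketched as follows: nine state variables go in, and a probability for each of the five candidate irrigation amounts comes out. The single linear layer, the random weights, and the example state values are assumptions chosen for brevity; the abstract does not specify the network, and no training is shown.

```python
import math
import random

# Candidate daily irrigation amounts in mm, as listed in the abstract.
IRRIGATION_MM = [0, 10, 20, 30, 40]
# Phenological stage, leaf area index, soil water in five layers,
# cumulative rainfall, cumulative irrigation.
N_STATE = 9


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prescribe(state, weights, biases):
    """Return one probability per candidate irrigation amount.

    `weights` is a 5 x 9 matrix and `biases` a length-5 vector; a linear
    layer is an illustrative assumption, not the paper's architecture.
    """
    if len(state) != N_STATE:
        raise ValueError("expected 9 state variables")
    logits = [sum(w * s for w, s in zip(row, state)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def sample_amount(probs, rng):
    """Draw one irrigation amount according to the prescribed probabilities."""
    return rng.choices(IRRIGATION_MM, weights=probs, k=1)[0]


rng = random.Random(0)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(N_STATE)] for _ in IRRIGATION_MM]
biases = [0.0] * len(IRRIGATION_MM)
state = [0.5, 2.0, 0.3, 0.3, 0.25, 0.2, 0.2, 0.1, 0.0]  # made-up, scaled values
probs = prescribe(state, weights, biases)
amount = sample_amount(probs, rng)
```

Sampling from the distribution is what makes the prescription probabilistic; taking the most likely amount instead would give a deterministic rule.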
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
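Cost-effective asset management is an area of interest across several industries. Specifically, this paper develops a deep reinforcement learning (DRL) solution to automatically determine an optimal rehabilitation policy for continuously deteriorating water pipes. We approach the problem of rehabilitation planning in both online and offline DRL settings. In online DRL, the agent interacts with a simulated environment of multiple pipes with distinct lengths, materials, and failure rate characteristics. We train the agent using deep Q-learning (DQN) to learn an optimal policy with minimal average costs and reduced failure probability. In offline learning, the agent uses static data, such as DQN replay data, to learn an optimal policy via a conservative Q-learning algorithm without further interaction with the environment. We demonstrate that DRL-based policies improve over standard preventive, corrective, and greedy planning alternatives. Additionally, learning from the fixed DQN replay dataset surpasses the online DQN setting. The results indicate that existing deterioration profiles of water pipes, consisting of large state and action trajectories, provide a valuable avenue for learning rehabilitation policies in the offline setting without the need for a simulator.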
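Imitation learning (IL) is an effective learning paradigm that exploits the interactions between agents and environments. It does not require explicit reward signals and instead tries to recover the desired policy from expert demonstrations. In general, IL methods can be categorized into behavioral cloning (BC) and inverse reinforcement learning (IRL). In this work, a novel reward function based on probability density estimation is proposed for IRL, which can significantly reduce the complexity of existing IRL methods. Furthermore, we prove that the theoretically optimal policy derived from our reward function is identical to the expert policy as long as the expert policy is deterministic. Consequently, an IRL problem can be gracefully transformed into a probability density estimation problem. Based on the proposed reward function, we present a "watch-try-learn" style framework named Probability Density Estimation based Imitation Learning (PDEIL), which works in both discrete and continuous action spaces. Finally, comprehensive experiments in the Gym environment show that PDEIL is much more efficient than existing algorithms at recovering rewards close to the ground truth.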
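Road maintenance planning is an integral part of road asset management. One of the main challenges in maintenance and rehabilitation (M&R) practice is determining the type and timing of maintenance. This study proposes a reinforcement learning (RL) framework based on the Long-Term Pavement Performance (LTPP) database to determine the type and timing of M&R practices. A predictive DNN model is first developed in the proposed algorithm and serves as the environment for the RL algorithm. For policy estimation in the RL model, both DQN and PPO models are developed. However, PPO was selected in the end because of its better convergence and higher sample efficiency. The indicators used in this study are the International Roughness Index (IRI) and rutting depth (RD). Initially, we considered a cracking metric (CM) as a third indicator, but it was excluded because it had far less data than the other indicators, which lowered the accuracy of the results. Furthermore, in the cost-effectiveness calculation (the reward), we consider both the economic and the environmental impacts of M&R treatments. Costs and environmental impacts were evaluated with the PaLATE 2.0 software. Our method is tested on a hypothetical case study of a 23 km six-lane highway in Texas. The results propose a 20-year M&R plan in which the road condition remains in the excellent range. Because the road starts at a good level of service, no heavy maintenance is needed in the first years. Later, after heavy M&R actions, there are several 1-2 year treatments. All of this shows that the proposed plan yields logical results. Decision-makers and transportation agencies can use this plan to carry out better maintenance practices that prevent budget waste while minimizing environmental impacts.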
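Asset allocation (or portfolio management) is the task of determining how to optimally allocate the funds of a finite budget across a range of financial instruments or assets, such as stocks. This study investigates the performance of reinforcement learning (RL) applied to portfolio management using model-free deep RL agents. We trained several RL agents on real-world stock prices to learn how to perform asset allocation. We compared the performance of these RL agents against several baseline agents. We also compared the RL agents among themselves to understand which classes of agents perform better. Our analysis shows that RL agents can perform the task of portfolio management, since they significantly outperformed the baseline agents (random allocation and uniform allocation). Four RL agents (A2C, SAC, PPO, and TRPO) outperformed the best baseline, MPT, overall. This shows the ability of RL agents to uncover more profitable trading strategies. Furthermore, there were no significant performance differences between value-based and policy-based RL agents. Actor-critic agents performed better than other types of agents. Likewise, on-policy agents performed better than off-policy agents, because they are better at policy evaluation and sample efficiency is not a significant problem in portfolio management. This study shows that RL agents can substantially improve asset allocation, since they outperform strong baselines. Based on our analysis, on-policy actor-critic RL agents showed the most promise.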
Drug dosing is an important application of AI, which can be formulated as a Reinforcement Learning (RL) problem. In this paper, we identify two major challenges of using RL for drug dosing: delayed and prolonged effects of administering medications, which break the Markov assumption of the RL framework. We focus on prolongedness and define PAE-POMDP (Prolonged Action Effect-Partially Observable Markov Decision Process), a subclass of POMDPs in which the Markov assumption does not hold specifically due to prolonged effects of actions. Motivated by the pharmacology literature, we propose a simple and effective approach to converting drug dosing PAE-POMDPs into MDPs, enabling the use of the existing RL algorithms to solve such problems. We validate the proposed approach on a toy task, and a challenging glucose control task, for which we devise a clinically-inspired reward function. Our results demonstrate that: (1) the proposed method to restore the Markov assumption leads to significant improvements over a vanilla baseline; (2) the approach is competitive with recurrent policies which may inherently capture the prolonged effect of actions; (3) it is remarkably more time and memory efficient than the recurrent baseline and hence more suitable for real-time dosing control systems; and (4) it exhibits favorable qualitative behavior in our policy analysis.
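Recently, imitation learning (IL) that uses expert states has seen a variety of successful applications. However, another IL setting, IL from visual inputs (ILfVI), which holds greater promise for real-world use because it can draw on online visual resources, suffers from low data efficiency and poor performance resulting from its on-policy learning manner and high-dimensional visual inputs. We propose OPIfVI (Off-Policy Imitation from Visual Inputs), which combines an off-policy learning manner, data augmentation, and encoder techniques to tackle these challenges, respectively. More specifically, to improve data efficiency, OPIfVI conducts IL in an off-policy manner, so that sampled data can be used multiple times. In addition, we enhance the stability of OPIfVI with spectral normalization to mitigate the side effects of off-policy training. We believe the core factor behind the poor performance of ILfVI is that the agent may fail to extract meaningful features from visual inputs. Hence, OPIfVI employs data augmentation from computer vision to help train encoders that can better extract features from visual inputs. In addition, a specific structure of gradient backpropagation for the encoder is designed to stabilize encoder training. Finally, through extensive experiments using the DeepMind Control Suite, we demonstrate that OPIfVI achieves expert-level performance and outperforms existing baselines regardless of whether visual demonstrations or visual observations are provided.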
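Reinforcement learning (RL) solves sequential decision-making problems through a trial-and-error process of interacting with the environment. Although RL has achieved great success in playing complex video games, making errors is always undesirable in the real world. To improve sample efficiency and thereby reduce errors, model-based reinforcement learning (MBRL) is believed to be a promising direction: it builds environment models in which trial and error can take place without real costs. In this survey, we review MBRL with a focus on recent progress in deep RL. For non-tabular environments, there is always a generalization error between the learned environment model and the real environment. It is therefore very important to analyze the discrepancy between policy training in the environment model and in the real environment, which in turn guides algorithm design for better model learning, model usage, and policy training. We also discuss recent advances in other forms of RL, including offline RL, goal-conditioned RL, multi-agent RL, and meta-RL. Moreover, we discuss the applicability and advantages of MBRL in real-world tasks. Finally, we end this survey by discussing the prospects for the future development of MBRL. We believe that MBRL has great potential and advantages in real-world applications that have been overlooked, and we hope this survey will attract more research on MBRL.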
Compared with model-based control and optimization methods, reinforcement learning (RL) provides a data-driven, learning-based framework to formulate and solve sequential decision-making problems. The RL framework has become promising due to largely improved data availability and computing power in the aviation industry. Many aviation-based applications can be formulated or treated as sequential decision-making problems. Some of them are offline planning problems, while others need to be solved online and are safety-critical. In this survey paper, we first describe standard RL formulations and solutions. Then we survey the landscape of existing RL-based applications in aviation. Finally, we summarize the paper, identify the technical gaps, and suggest future directions of RL research in aviation.
We develop a simple framework to learn bio-inspired foraging policies using human data. We conduct an experiment where humans are virtually immersed in an open field foraging environment and are trained to collect the highest amount of rewards. A Markov Decision Process (MDP) framework is introduced to model the human decision dynamics. Then, Imitation Learning (IL) based on maximum likelihood estimation is used to train Neural Networks (NN) that map human decisions to observed states. The results show that passive imitation substantially underperforms humans. We further refine the human-inspired policies via Reinforcement Learning (RL) using the on-policy Proximal Policy Optimization (PPO) algorithm which shows better stability than other algorithms and can steadily improve the policies pretrained with IL. We show that the combination of IL and RL can match human results and that good performance strongly depends on combining the allocentric information with an egocentric representation of the environment.
This paper is a technical overview of DeepMind and Google's recent work on reinforcement learning for controlling commercial cooling systems. Building on expertise that began with cooling Google's data centers more efficiently, we recently conducted live experiments on two real-world facilities in partnership with Trane Technologies, a building management system provider. These live experiments had a variety of challenges in areas such as evaluation, learning from offline data, and constraint satisfaction. Our paper describes these challenges in the hope that awareness of them will benefit future applied RL work. We also describe the way we adapted our RL system to deal with these challenges, resulting in energy savings of approximately 9% and 13% respectively at the two live experiment sites.
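In the autonomous driving field, fusing human knowledge into deep reinforcement learning (DRL) is often based on human demonstrations recorded in a simulated environment. This limits generalization and the feasibility of application in real-world traffic. We propose a two-stage DRL method that learns from real-world human driving and achieves performance superior to that of a pure DRL agent. The DRL agent is trained within a framework that connects CARLA with the Robot Operating System (ROS). For evaluation, we designed different real-world driving scenarios to compare the proposed two-stage DRL agent with the pure DRL agent. After extracting "good" behavior from the human driver, such as anticipation at a signalized intersection, the agent becomes more efficient and drives more safely, which makes this autonomous agent better adapted to human-robot interaction (HRI) traffic.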
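The dynamic job-shop scheduling problem (DJSP) is a class of scheduling tasks that specifically consider inherent uncertainties, such as changing order requirements and possible machine breakdowns, in realistic smart manufacturing settings. Since traditional methods cannot dynamically generate effective scheduling strategies in the face of environmental disturbances, we formulate the DJSP as a Markov decision process (MDP) to be tackled by reinforcement learning (RL). To this end, we propose a flexible hybrid framework that takes disjunctive graphs as states and a set of general dispatching rules as the action space, with minimal prior domain knowledge. An attention mechanism is used as the graph representation learning (GRL) module for feature extraction from states, and a double dueling deep Q-network with prioritized replay and noisy networks (D3QPN) is employed to map each state to the most appropriate dispatching rule. Furthermore, we present Gymjsp, a public benchmark based on the well-known OR-Library, to provide a standardized off-the-shelf facility for the RL and DJSP research communities. Comprehensive experiments on various DJSP instances confirm that the proposed framework outperforms baseline algorithms, with a smaller makespan across all instances, and provide empirical justification for the validity of the various components of the hybrid framework.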
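Dialogue policy learning is a key component of a task-oriented dialogue system (TDS): it decides the next action of the system given the dialogue state at each turn. Reinforcement learning (RL) is commonly chosen to learn the dialogue policy, treating the user as the environment and the system as the agent. Many benchmark datasets and algorithms have been created to facilitate the development and evaluation of RL-based dialogue policies. In this paper, we survey recent advances and challenges in dialogue policy from the perspective of RL. More specifically, we identify the major problems and summarize the corresponding solutions for RL-based dialogue policy learning. In addition, we provide a comprehensive survey of applying RL to dialogue policy learning by categorizing recent methods according to the basic elements of RL. We believe this survey can shed light on future research in dialogue management.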
Deep reinforcement learning algorithms have succeeded in several challenging domains. Classic online RL job schedulers can learn efficient scheduling strategies but often take thousands of timesteps to explore the environment and adapt from a randomly initialized DNN policy. Existing RL schedulers overlook the importance of learning from historical data and improving upon custom heuristic policies. Offline reinforcement learning presents the prospect of policy optimization from pre-recorded datasets without online environment interaction. Following the recent success of data-driven learning, we explore two RL methods: 1) Behaviour Cloning and 2) Offline RL, which aim to learn policies from logged data without interacting with the environment. These methods address the challenges concerning the cost of data collection and safety, particularly pertinent to real-world applications of RL. Although the data-driven RL methods generate good results, we show that the performance is highly dependent on the quality of the historical datasets. Finally, we demonstrate that by effectively incorporating prior expert demonstrations to pre-train the agent, we short-circuit the random exploration phase to learn a reasonable policy with online training. We utilize Offline RL as a launchpad to learn effective scheduling policies from prior experience collected using Oracle or heuristic policies. Such a framework is effective for pre-training from historical datasets and well suited to continuous improvement with online data collection.
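In various control task domains, existing controllers provide a baseline level of performance that, though possibly suboptimal, should be maintained. Reinforcement learning (RL) algorithms that rely on extensive exploration of the state and action space can be used to optimize a control policy. However, fully exploratory RL algorithms may reduce performance below the baseline level during training. In this paper, we address the problem of online optimization of a control policy while minimizing regret w.r.t. the baseline policy performance. We present a joint imitation-reinforcement learning framework, denoted JIRL. The learning process in JIRL assumes the availability of a baseline policy and is designed with two objectives: (a) leveraging the baseline's online demonstrations to minimize regret w.r.t. the baseline policy during training, and (b) eventually surpassing the baseline performance. JIRL addresses these objectives by initially learning to imitate the baseline policy and gradually shifting control from the baseline to an RL agent. Experimental results show that JIRL effectively accomplishes these objectives in several continuous action-space domains. The results demonstrate that JIRL is comparable to a state-of-the-art algorithm in final performance while incurring significantly lower baseline regret during training in all of the presented domains. Moreover, the results show a reduction factor of up to 21 in baseline regret over a state-of-the-art baseline regret minimization approach.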
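The idea of gradually shifting control from a baseline to an RL agent can be sketched with a simple schedule. This is only an illustration under a strong assumption: the linear, step-based hand-over below is invented for this example, and the abstract does not say how JIRL actually decides when to transfer control. The two policies are placeholders.

```python
import random


def rl_control_probability(step, handover_steps):
    """Linear hand-over (an assumption): 0 at the start, 1 after handover_steps."""
    return min(1.0, max(0.0, step / handover_steps))


def choose_action(step, obs, baseline_policy, rl_policy, rng, handover_steps=1000):
    """Pick which controller acts at this step and return (action, controller name)."""
    if rng.random() < rl_control_probability(step, handover_steps):
        return rl_policy(obs), "rl"
    return baseline_policy(obs), "baseline"


def baseline_policy(obs):
    """Placeholder existing controller: proportional response."""
    return -0.5 * obs


def rl_policy(obs):
    """Placeholder learned controller."""
    return -0.8 * obs


rng = random.Random(0)
early = [choose_action(step, 1.0, baseline_policy, rl_policy, rng)[1] for step in range(0, 100)]
late = [choose_action(step, 1.0, baseline_policy, rl_policy, rng)[1] for step in range(900, 1000)]
early_rl_share = early.count("rl") / len(early)
late_rl_share = late.count("rl") / len(late)
```

Early in training the baseline acts almost all the time, which keeps performance near the baseline level; late in training the RL agent acts almost all the time, which is what allows it to eventually exceed the baseline.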