Reinforcement learning (RL) techniques have been developed to optimize industrial cooling systems, offering substantial energy savings compared to traditional heuristic policies. A major challenge in industrial control is that learned behaviors must be feasible in the real world given mechanical constraints. For example, certain operations may only be executed once every few hours, while other actions can be taken more frequently. Without extensive reward engineering and experimentation, an RL agent may not learn to operate the machinery realistically. To address this, we use hierarchical reinforcement learning with multiple agents that control subsets of actions according to their operational time scales. Our hierarchical approach achieves energy savings over an existing baseline while maintaining constraints, such as operating chillers within safe bounds, in a simulated HVAC control environment.
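The timescale-splitting idea in the abstract above can be sketched in a few lines. This is an illustrative toy, not the authors' system: the agent names, the periods, and the random placeholder policy are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class TimescaleAgent:
    """Controls one subset of actions but may only decide every `period` steps.

    Between decision points the last chosen action is held, mirroring
    equipment that can only be re-staged every few hours.
    """
    def __init__(self, n_actions, period):
        self.n_actions = n_actions
        self.period = period
        self.current = 0

    def act(self, step):
        if step % self.period == 0:  # a decision point for this timescale
            # placeholder for a learned policy at this level
            self.current = int(rng.integers(self.n_actions))
        return self.current

# Hypothetical split: a fast agent (e.g. fan setpoints, every step) and a
# slow agent (e.g. chiller staging, every 6 steps); names are illustrative.
fast = TimescaleAgent(n_actions=4, period=1)
slow = TimescaleAgent(n_actions=2, period=6)

trace = [(fast.act(t), slow.act(t)) for t in range(12)]
slow_actions = [s for _, s in trace]
changes = sum(a != b for a, b in zip(slow_actions, slow_actions[1:]))
```

Replacing the placeholder random choice with a learned policy per level yields the structure the abstract describes: slowly switchable equipment is only re-decided at its own decision points.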
translated by Google Translate
We present a hybrid industrial cooling system model that embeds analytical solutions within a multi-physics simulation. The model is designed for reinforcement learning (RL) applications and balances simplicity against simulation fidelity and interpretability. The model's fidelity is evaluated against real-world data from a large-scale cooling system. This is followed by a case study illustrating how the model can be used for RL research. To this end, we develop an industrial task suite that allows specifying different problem settings and levels of complexity, and use it to evaluate the performance of different RL algorithms.
This paper is a technical overview of DeepMind and Google's recent work on reinforcement learning for controlling commercial cooling systems. Building on expertise that began with cooling Google's data centers more efficiently, we recently conducted live experiments on two real-world facilities in partnership with Trane Technologies, a building management system provider. These live experiments had a variety of challenges in areas such as evaluation, learning from offline data, and constraint satisfaction. Our paper describes these challenges in the hope that awareness of them will benefit future applied RL work. We also describe the way we adapted our RL system to deal with these challenges, resulting in energy savings of approximately 9% and 13% respectively at the two live experiment sites.
Heating in private households is a major contributor to today's emissions. Heat pumps are a promising alternative for heat generation and a key technology for achieving the goals of the German energy transition and becoming less dependent on fossil fuels. Today, the majority of heat pumps in the field are controlled by a simple heating curve, a naive mapping of the current outdoor temperature to a control action. A more advanced control approach is model predictive control (MPC), which has been applied to heat pump control in multiple research works. However, MPC is heavily dependent on the building model, which has several disadvantages. Motivated by this and by recent breakthroughs in the field, this work applies deep reinforcement learning (DRL) to heat pump control in a simulated environment. Through a comparison with MPC, we show that DRL can be applied in a model-free manner to achieve MPC-like performance. This work extends other works that have applied DRL to building heating operation by performing an in-depth analysis of the learned control strategies and by giving a detailed comparison of the two state-of-the-art control methods.
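A heating curve of the kind the abstract uses as a baseline can be sketched as a linear map from outdoor temperature to a flow-temperature setpoint. The linear form, slope, and clamping bounds below are illustrative assumptions; real curves are vendor- and building-specific.

```python
def heating_curve(t_out, t_room_set=21.0, slope=1.2, t_flow_max=60.0):
    """Naive heating curve: the colder it is outdoors, the hotter the
    supply (flow) water setpoint, clamped to a plausible range."""
    t_flow = t_room_set + slope * (t_room_set - t_out)
    return max(t_room_set, min(t_flow, t_flow_max))

cold_day = heating_curve(-10.0)   # colder outdoors -> hotter supply water
mild_day = heating_curve(10.0)
```

Both MPC and DRL controllers in the paper's setting replace exactly this one-input rule with decisions that also account for the building's thermal dynamics.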
We present the PowerGridWorld software package, which provides users with a lightweight, modular, and customizable framework for creating power-systems-focused multi-agent Gym environments that readily integrate with existing training frameworks for reinforcement learning (RL). Although many frameworks exist for training multi-agent RL (MARL) policies, none can rapidly prototype and develop the environments themselves, especially in the context of heterogeneous (composite, multi-device) power systems where power flow solutions are required to define grid-level variables and costs. PowerGridWorld is an open-source software package that helps to fill this gap. To highlight PowerGridWorld's key features, we present two case studies and demonstrate learning MARL policies using both OpenAI's multi-agent deep deterministic policy gradient (MADDPG) and RLlib's proximal policy optimization (PPO) algorithms. In both cases, at least some subset of agents incorporates elements of the power flow solution at each time step as part of their reward (negative cost) structures.
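A minimal multi-agent environment skeleton in the spirit described above might look like the following. It uses the Gym-style dict-of-agents convention but is an illustrative sketch, not the PowerGridWorld API; the class name, agent names, and reward shape are invented.

```python
class MultiAgentGridEnv:
    """Toy multi-agent environment keyed by agent id, as in multi-agent
    Gym-style frameworks. Illustrative only."""
    def __init__(self, agent_ids, horizon=24):
        self.agent_ids = list(agent_ids)
        self.horizon = horizon
        self.t = 0

    def _obs(self, agent_id):
        # a real grid environment would expose power-flow quantities
        # (e.g. bus voltages) here; we expose only the timestep
        return [float(self.t)]

    def reset(self):
        self.t = 0
        return {a: self._obs(a) for a in self.agent_ids}

    def step(self, actions):
        self.t += 1
        obs = {a: self._obs(a) for a in self.agent_ids}
        # reward as negative cost, echoing the paper's reward structure
        rewards = {a: -abs(actions[a]) for a in self.agent_ids}
        dones = {a: self.t >= self.horizon for a in self.agent_ids}
        return obs, rewards, dones

env = MultiAgentGridEnv(["battery", "pv"])
first_obs = env.reset()
obs, rew, dones = env.step({"battery": 0.5, "pv": -0.2})
```

The dict-keyed observation/reward structure is what lets heterogeneous devices plug into standard MARL trainers such as MADDPG or PPO implementations.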
The decarbonization of buildings presents new challenges for the reliability of the electrical grid as a result of the intermittency of renewable energy sources and the increase in grid load brought about by end-use electrification. To restore reliability, grid-interactive efficient buildings can provide flexibility services to the grid through demand response. Residential demand response programs are hindered by the need for manual intervention by customers. To maximize the energy flexibility potential of residential buildings, an advanced control architecture is needed. Reinforcement learning (RL) is well-suited for the control of flexible resources because, unlike expert systems, it can adapt to unique building characteristics. Yet factors hindering the adoption of RL in real-world applications include its large data requirements for training, control security, and generalizability. Here we address these challenges by proposing the MERLIN framework and using a digital twin of a real-world 17-building grid-interactive residential community in CityLearn. We show that 1) independent RL controllers for batteries improve building and district-level KPIs compared to a reference RBC by tailoring their policies to individual buildings, 2) despite unique occupant behaviours, transferring the RL policy of any one of the buildings to other buildings provides comparable performance while reducing the cost of training, and 3) training RL controllers on limited temporal data that does not capture full seasonality in occupant behaviour has little effect on performance. Although the zero-net-energy (ZNE) condition of the buildings could be maintained or worsened as a result of controlled batteries, KPIs that are typically improved by the ZNE condition (electricity price and carbon emissions) improve further when the batteries are managed by an advanced controller.
Optimizing prices for energy demand response requires a controller flexible enough to navigate complex environments. We propose a reinforcement learning controller with surprise-minimizing modifications to its architecture. We suggest that surprise minimization can be used to improve learning speed by exploiting the predictability of people's energy usage. Our architecture performs well in a simulation of energy demand response. We propose this modification to improve functionality and achieve savings in a large-scale experiment.
Reinforcement learning (RL) is a promising optimal control technique for multi-energy management systems. It does not require a model a priori, reducing upfront and ongoing project-specific engineering effort, and it is capable of learning better representations of the underlying system dynamics. However, vanilla RL does not provide constraint satisfaction guarantees, leading to various unsafe interactions in safety-critical environments. In this paper, we present two novel safe RL methods, SafeFallback and GiveSafe, in which the safety constraint formulation is decoupled from the RL formulation, and which provide hard constraint satisfaction guarantees both during training (exploration) and during exploitation of the (near-)optimal policy. In a simulated multi-energy system case study, we show that both methods start with a significantly higher utility (i.e., useful policy) than a vanilla RL benchmark (94.6% and 82.8%, versus 35.5%). The proposed SafeFallback method can even outperform the vanilla RL benchmark (102.9% versus 100%). We conclude that both methods are viable constraint-handling techniques applicable beyond RL, as demonstrated with random agents, while still providing hard guarantees. Finally, we propose fundamental future work to, among other things, improve the constraint functions themselves as more data becomes available.
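The decoupling of safety from the RL formulation described above can be caricatured as a wrapper that checks each proposed action against a constraint function and substitutes a safe backup when the check fails. This is a generic safety-layer sketch in that spirit, not the authors' exact SafeFallback algorithm; the constraint check and fallback policy are assumed to be supplied by the engineer.

```python
def safe_action(rl_action, state, constraint_ok, fallback_policy):
    """Pass the RL action through only if it satisfies the constraint
    check; otherwise substitute a known-safe backup action."""
    if constraint_ok(state, rl_action):
        return rl_action
    return fallback_policy(state)

# Toy constraint: battery state of charge must stay within [0, 1].
def soc_ok(soc, power):              # power > 0 charges, < 0 discharges
    return 0.0 <= soc + 0.1 * power <= 1.0

idle = lambda soc: 0.0               # doing nothing is always safe here

blocked = safe_action(0.5, 0.99, soc_ok, idle)    # would overcharge
allowed = safe_action(-0.5, 0.5, soc_ok, idle)    # feasible discharge
```

Because the check runs on every action, the guarantee holds during exploration as well as exploitation, which is the property the abstract emphasizes.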
This paper studies a model for online job scheduling in green datacenters, where resource availability depends on the power supply from renewables. Intermittent power supply from renewables leads to intermittent resource availability, inducing job delays (and associated costs). Green datacenter operators must intelligently manage their workloads and available power supply to extract maximum benefit. The scheduler's objective is to schedule jobs on a set of resources to maximize the total value (revenue) while minimizing the overall job delay; a trade-off therefore exists between achieving high job value on the one hand and low expected delays on the other. In addition, datacenter operators often prioritize multiple objectives, including high system utilization and job completion. To accomplish the opposing goals of maximizing total job value and minimizing job delays, we apply Proportional-Integral-Derivative (PID) Lagrangian methods in deep reinforcement learning to the job scheduling problem in the green datacenter environment. Lagrangian methods are widely used algorithms for constrained optimization problems. We adopt a controls perspective to learn the Lagrange multiplier with proportional, integral, and derivative control, achieving favorable learning dynamics. Feedback control defines cost terms for the learning agent, monitors the cost limits during training, and continuously adjusts the learning parameters to achieve stable performance. Our experiments demonstrate improved performance compared to scheduling policies without the PID Lagrangian methods, and illustrate the effectiveness of the Constraint-Controlled Reinforcement Learning (CoCoRL) scheduler, which simultaneously satisfies multiple objectives.
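The PID treatment of the Lagrange multiplier described above can be sketched as follows. The gains and cost limit are illustrative values, not tuned numbers from the paper; the general shape (proportional/integral/derivative action on the constraint violation, multiplier clamped at zero) is the standard PID Lagrangian construction.

```python
class PIDLagrangian:
    """PID control of a Lagrange multiplier for one cost constraint.

    The multiplier reacts to the constraint violation (cost - limit)
    with proportional, integral, and derivative terms."""
    def __init__(self, kp=0.05, ki=0.01, kd=0.01, cost_limit=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_violation = 0.0

    def update(self, measured_cost):
        violation = measured_cost - self.cost_limit
        self.integral = max(0.0, self.integral + violation)
        derivative = violation - self.prev_violation
        self.prev_violation = violation
        # the multiplier is clamped at zero: it only penalizes, never rewards
        return max(0.0, self.kp * violation
                   + self.ki * self.integral
                   + self.kd * derivative)

violated = PIDLagrangian(cost_limit=1.0)
lam_hi = [violated.update(2.0) for _ in range(10)][-1]   # sustained violation
satisfied = PIDLagrangian(cost_limit=1.0)
lam_lo = [satisfied.update(0.5) for _ in range(10)][-1]  # within limit
```

The integral term is what removes steady-state constraint violation, while the derivative term damps the oscillations that plain dual ascent tends to produce.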
Reinforcement Learning (RL) generally suffers from poor sample complexity, mostly due to the need to exhaustively explore the state space to find good policies. On the other hand, we postulate that expert knowledge of the system to control often allows us to design simple rules we expect good policies to follow at all times. In this work, we hence propose a simple yet effective modification of continuous actor-critic RL frameworks to incorporate such prior knowledge in the learned policies and constrain them to regions of the state space that are deemed interesting, thereby significantly accelerating their convergence. Concretely, we saturate the actions chosen by the agent if they do not comply with our intuition and, critically, modify the gradient update step of the policy to ensure the learning process does not suffer from the saturation step. On a room temperature control simulation case study, these modifications allow agents to converge to well-performing policies up to one order of magnitude faster than classical RL agents while retaining good final performance.
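The saturation step described above can be illustrated with a toy room-heating example. The expert rule, thresholds, and bounds are invented for illustration; the paper's accompanying modification of the policy-gradient update is only noted in a comment, since it depends on the actor-critic implementation.

```python
def expert_bounds(t_room):
    """Hypothetical expert rule bounding the heating-power action.

    Simple prior knowledge like this restricts the policy to regions of
    the state-action space deemed interesting."""
    if t_room < 20.0:
        return 0.5, 1.0   # cold room: heat at least at half power
    if t_room > 24.0:
        return 0.0, 0.2   # warm room: nearly off
    return 0.0, 1.0       # comfortable band: unconstrained

def saturate(action, t_room):
    low, high = expert_bounds(t_room)
    # In the paper's actor-critic setting, the gradient update step of
    # the policy is additionally modified so that this clipping does not
    # stall learning.
    return min(max(action, low), high)

a_cold = saturate(0.1, 18.0)   # pushed up to the expert lower bound
a_warm = saturate(0.9, 25.0)   # capped by the expert upper bound
```

Without the gradient fix, actions stuck at a bound would receive no useful learning signal, which is exactly the failure mode the abstract says the method avoids.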
Energy consumption in buildings, both residential and commercial, accounts for approximately 40% of all energy usage in the U.S., and similar numbers are reported from countries around the world. This significant amount of energy is used to maintain a comfortable, secure, and productive environment for the occupants. It is therefore crucial that energy consumption in buildings be optimized while maintaining satisfactory levels of occupant comfort, health, and safety. Recently, machine learning has proven to be an invaluable tool for deriving important insights from data and optimizing various systems. In this work, we review the ways in which machine learning has been leveraged to make buildings smart and energy-efficient. For the convenience of readers, we provide a brief introduction to several machine learning paradigms and to the components and functioning of each smart building system we cover. Finally, we discuss challenges faced when implementing machine learning algorithms in smart buildings and provide future avenues for research at the intersection of smart buildings and machine learning.
Heating and cooling systems in buildings account for 31% of global energy use, much of which is regulated by rule-based controllers (RBCs) that neither maximize energy efficiency nor minimize emissions by interacting optimally with the grid. Control via reinforcement learning (RL) has been shown to significantly improve building energy efficiency, but existing solutions require pre-training in simulators that are prohibitively expensive to obtain for every building in the world. In response, we show that buildings can be controlled safely and zero-shot by combining ideas from system identification and model-based RL. We call this combination PEARL (Probabilistic Emission-Abating Reinforcement Learning) and show that it can reduce emissions without pre-training, requiring only a three-hour commissioning period. In experiments across three varied building energy simulations, we show that PEARL outperforms existing RBCs and popular RL baselines in all cases, reducing building emissions by as much as 31% while maintaining thermal comfort.
Heating and cooling systems in buildings account for 31% of global energy use, much of which is regulated by rule-based controllers (RBCs) that neither maximize energy efficiency nor minimize emissions by interacting optimally with the grid. Control via reinforcement learning (RL) has been shown to significantly improve building energy efficiency, but existing solutions require access to building-specific simulators or data, which cannot be expected for every building in the world. In response, we show that emission-reducing policies can be obtained without such knowledge, a paradigm we call zero-shot building control. We combine ideas from system identification and model-based RL to create PEARL (Probabilistic Emission-Abating Reinforcement Learning) and show that a short period of active exploration is all that is required to build a performant model. In experiments across three varied building energy simulations, we show that PEARL outperforms existing RBCs and popular RL baselines in all cases, reducing building emissions by as much as 31% while maintaining thermal comfort. Our source code is available online at https://enjeener.io/projects/pearl.
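The pipeline of short active exploration, system identification, and model-based action selection can be caricatured as follows. The linear thermal model, its coefficients, and the one-step comfort-constrained planner are all illustrative assumptions, not PEARL itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Short active-exploration phase on a toy linear thermal model
# T' = a*T + b*u + c (coefficients invented for illustration).
a_true, b_true, c_true = 0.9, 2.0, 1.0
T, records = 20.0, []
for _ in range(50):
    u = rng.uniform(0.0, 1.0)                 # random exploratory heating
    T_next = a_true * T + b_true * u + c_true
    records.append((T, u, T_next))
    T = T_next

# System identification by least squares on the logged transitions.
X = np.array([[t, u, 1.0] for t, u, _ in records])
y = np.array([t_next for _, _, t_next in records])
a_hat, b_hat, c_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# One-step model-based action selection: the least heating (a proxy for
# least emissions) that the fitted model predicts keeps comfort.
def plan(T_now, T_min=21.0, candidates=np.linspace(0.0, 1.0, 11)):
    feasible = [u for u in candidates
                if a_hat * T_now + b_hat * u + c_hat >= T_min]
    return min(feasible) if feasible else candidates[-1]
```

Even this toy shows the zero-shot flavor: a few dozen exploratory transitions suffice to fit a model that is then used directly for constrained control, with no simulator pre-training.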
In this paper, multi-agent reinforcement learning is used to control a hybrid energy storage system, reducing the energy costs of a microgrid by maximizing the value of renewable energy and trading. The agents must learn to control three different types of energy storage system amid fluctuating demand, dynamic wholesale energy prices, and unpredictable renewable generation. Two case studies are considered: the first examines how energy storage systems can better integrate renewable generation under dynamic pricing; the second examines how these same agents can work with an aggregator to sell energy to self-interested external microgrids in order to reduce their own energy bills. This work finds that multi-agent deep deterministic policy gradients with centralized learning and decentralized execution, along with its state-of-the-art variants, allow the multi-agent approach to perform significantly better than control by a single global agent. It is also found that using separate reward functions in the multi-agent approach performs better than using a single control agent. Being able to trade with other microgrids, rather than selling back to the utility grid, is also found to significantly increase the grid's savings.
During the operation of a chemical plant, product quality must be maintained at all times, and the production of off-specification products should be minimized. Process variables related to product quality, such as the temperature and composition of materials in various parts of the plant, must therefore be measured, and appropriate operations (i.e., control) must be performed based on the measurements. Some process variables, such as temperature and flow rate, can be measured continuously and instantaneously. Other variables, however, such as composition and viscosity, can only be obtained through time-consuming analysis after sampling substances from the plant. Soft sensors have been proposed for estimating such process variables in real time from easily measurable variables. However, the estimation accuracy of conventional statistical soft sensors, which are constructed from recorded measurements, can be very poor in unrecorded situations (extrapolation). In this study, we estimate the internal state variables of a plant by using a dynamic simulator that can estimate and predict even unrecorded situations on the basis of chemical engineering knowledge and an artificial intelligence (AI) technology called reinforcement learning, and we propose using the estimated internal state variables of the plant as soft sensors. In addition, we describe the prospects for plant operation and control using such soft sensors, and methods for obtaining the prediction models (i.e., simulators) necessary for the proposed system.
The energy sector is facing rapid changes in the transition towards clean renewable sources. However, the growing share of volatile, fluctuating renewable generation such as wind or solar energy has already led to an increase in power grid congestion and network security concerns. Grid operators mitigate these by modifying either generation or demand (redispatching, curtailment, flexible loads). Unfortunately, redispatching of fossil generators leads to excessive grid operation costs and higher emissions, which is in direct opposition to the decarbonization of the energy sector. In this paper, we propose an AlphaZero-based grid topology optimization agent as a non-costly, carbon-free congestion management alternative. Our experimental evaluation confirms the potential of topology optimization for power grid operation, achieves a reduction of the average amount of required redispatching by 60%, and shows the interoperability with traditional congestion management methods. Our approach also ranked 1st in the WCCI 2022 Learning to Run a Power Network (L2RPN) competition. Based on our findings, we identify and discuss open research problems as well as technical challenges for a productive system on a real power grid.
The standard formulation of reinforcement learning lacks a practical way of specifying which behaviors are admissible and which are forbidden. Most often, practitioners specify behavioral requirements through manual reward engineering, a counter-intuitive process that requires several iterations and is prone to reward hacking by the agent. In this work, we argue that constrained RL, which has been used almost exclusively for safe RL, also has the potential to significantly reduce the effort spent on reward specification in applied RL projects. To this end, we propose specifying behavioral preferences in the CMDP framework and using Lagrangian methods, which seek to solve a min-max problem between the agent's policy and the Lagrange multipliers, to automatically weigh each of the behavioral constraints. Specifically, we investigate how CMDPs can be adapted to solve goal-based tasks while adhering to a set of behavioral constraints, and we propose modifications to the SAC-Lagrangian algorithm to handle the challenging case of several constraints. We evaluate this framework on a set of continuous control tasks relevant to the application of reinforcement learning for NPC design in video games.
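The automatic weighting of several behavioral constraints can be sketched with plain dual ascent on a vector of Lagrange multipliers. This is a generic sketch of the min-max idea, not the paper's exact SAC-Lagrangian update; the learning rate, costs, and limits are illustrative.

```python
import numpy as np

def dual_ascent(lams, costs, limits, lr=0.1):
    """One dual step: each multiplier grows while its behavioral
    constraint is violated and decays toward zero once satisfied."""
    lams = np.asarray(lams) + lr * (np.asarray(costs) - np.asarray(limits))
    return np.maximum(lams, 0.0)

def penalized_reward(reward, lams, costs):
    # the policy maximizes reward minus automatically weighted costs
    return reward - float(np.dot(lams, costs))

lams = np.zeros(2)
for _ in range(20):   # constraint 0 violated, constraint 1 satisfied
    lams = dual_ascent(lams, costs=[0.8, 0.1], limits=[0.5, 0.3])

shaped = penalized_reward(1.0, lams, [0.8, 0.1])
```

The appeal over manual reward engineering is visible even here: the designer states limits per behavior, and the relative penalty weights emerge from the optimization rather than from hand tuning.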
Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision-making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces the speed of learning and the achieved reward. In this work, we present an off-policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high level. The novelty of this work is the theoretical motivation for adding entropy to the RL objective in the HRL setting. We empirically show that entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablative study to analyze the effects of entropy on the hierarchy, in which adding entropy to the high level emerged as the most desirable configuration. Furthermore, a higher temperature in the low level leads to Q-value overestimation and increases the stochasticity of the environment that the high level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
Hierarchical methods in reinforcement learning have the potential to reduce the number of decisions that the agent needs to make when learning new tasks. However, finding reusable, useful temporal abstractions that facilitate fast learning remains a challenging problem. Recently, several deep learning approaches were proposed to learn such temporal abstractions in the form of options in an end-to-end manner. In this work, we point out several shortcomings of these methods and discuss their potential negative consequences. Subsequently, we formulate the desiderata for reusable options and use these to frame the problem of learning options as a gradient-based meta-learning problem. This allows us to formulate an objective that explicitly incentivizes options which allow a higher-level decision maker to adjust to different tasks in few steps. Experimentally, we show that our method is able to learn transferable components which accelerate learning, and that it performs better than prior methods developed for this setting. Additionally, we perform ablations to quantify the impact of using gradient-based meta-learning as well as the other proposed changes.
Many real-world applications can be formulated as multi-agent cooperation problems, such as network packet routing and the coordination of autonomous vehicles. The emergence of deep reinforcement learning (DRL) provides a promising approach to multi-agent cooperation through the interaction of agents with the environment. However, traditional DRL solutions suffer from high dimensionality during policy search when multiple agents have continuous action spaces. Moreover, the dynamics of the agents' policies make training non-stationary. To address these issues, we propose a hierarchical reinforcement learning approach with high-level decision-making and low-level individual control for efficient policy search. In particular, cooperation among multiple agents can be learned efficiently in the high-level discrete action space, while low-level individual control reduces to single-agent reinforcement learning. In addition to hierarchical reinforcement learning, we propose an opponent modeling network that models the policies of the other agents during learning. In contrast to end-to-end DRL approaches, our approach reduces learning complexity by decomposing the overall task into sub-tasks in a hierarchical way. To evaluate the efficiency of our approach, we conduct a real-world case study in a cooperative lane-change scenario. Both simulation and real-world experiments show the superiority of our approach in collision rate and convergence speed.