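In edge computing, users' service profiles must be migrated in response to user mobility. Reinforcement learning (RL) frameworks have been proposed for this purpose. However, these frameworks do not consider occasional server failures, which, although rare, can prevent the smooth and safe functioning of edge computing users' latency-sensitive applications (such as autonomous driving and real-time obstacle detection), because users' computing jobs can no longer be completed. Because these failures occur with low probability, it is difficult for RL algorithms, which are inherently data-driven, to learn an optimal service migration solution for both typical and rare-event scenarios. We therefore introduce FIRE, a rare-event adaptive resilience framework that integrates importance sampling into reinforcement learning to place backup services. We sample rare events at a rate proportional to their contribution to the value function in order to learn an optimal policy. Our framework balances the trade-offs of service migration and its migration costs against the cost of failure and the costs of backup placement and migration. We propose an importance-sampling-based Q-learning algorithm and prove its boundedness and convergence to optimality. We then propose novel eligibility-trace, linear function approximation, and deep Q-learning versions of our algorithm to ensure that it scales to real-world scenarios. We extend the framework to accommodate users with different risk tolerances toward failure. Finally, we use trace-driven experiments to show that our algorithm reduces costs when failures occur.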
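The idea of sampling rare failures more often than they really occur, and then correcting with an importance weight, can be illustrated with a minimal sketch. This is not the FIRE algorithm itself: the single state-action pair, the failure probability, the proposal probability, and the costs below are all hypothetical values chosen for illustration.

```python
import random

# All constants are hypothetical, chosen only for illustration.
P_FAIL = 0.01                     # true (rare) failure probability
Q_FAIL = 0.5                      # proposal failure probability used while learning
COST_OK, COST_FAIL = 1.0, 100.0   # cost without / with a server failure


def sample_transition(rng):
    """Draw an outcome from the proposal; return its cost and importance weight."""
    if rng.random() < Q_FAIL:
        return COST_FAIL, P_FAIL / Q_FAIL
    return COST_OK, (1.0 - P_FAIL) / (1.0 - Q_FAIL)


def train(num_steps=10_000, seed=0):
    """Importance-weighted Q-learning on one (state, action) pair.

    The next state is terminal, so the target has no bootstrap term.
    """
    rng = random.Random(seed)
    q = 0.0
    for n in range(1, num_steps + 1):
        cost, weight = sample_transition(rng)
        target = cost
        # Step size 1/n; the weight corrects for sampling from the proposal.
        q += (1.0 / n) * (weight * target - q)
    return q


estimate = train()
true_value = P_FAIL * COST_FAIL + (1.0 - P_FAIL) * COST_OK
print(f"estimate={estimate:.4f}, true expected cost={true_value:.4f}")
```

Because the proposal is roughly proportional to each outcome's contribution to the expected cost, every weighted sample is close to the true value, so the estimate has very low variance even though failures are rare under the true distribution.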
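Multi-access edge computing (MEC) is an emerging computing paradigm that extends cloud computing to the network edge to support resource-intensive applications on mobile devices. As a key problem in MEC, service migration must decide how to migrate user services so as to maintain quality of service as users roam between MEC servers with limited coverage and capacity. However, finding an optimal migration policy is intractable because of the dynamic MEC environment and user mobility. Many existing studies make centralized migration decisions based on complete system-level information, which is time-consuming and lacks the desired scalability. To address these challenges, we propose a novel learning-driven method that is user-centric and can make effective online migration decisions using incomplete system-level information. Specifically, the service migration problem is modeled as a partially observable Markov decision process (POMDP). To solve the POMDP, we design a new encoder network that combines a long short-term memory (LSTM) network and an embedding matrix to extract hidden information effectively, and we further propose a tailored off-policy actor-critic algorithm for efficient training. Extensive experimental results based on real-world mobility traces show that this new method consistently outperforms both heuristic and state-of-the-art learning-driven algorithms and can achieve near-optimal results in various MEC scenarios.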
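The future Internet involves several emerging technologies such as 5G and beyond-5G networks, vehicular networks, unmanned aerial vehicle (UAV) networks, and the Internet of Things (IoT). Moreover, the future Internet is becoming heterogeneous and decentralized, with a large number of network entities involved. Each entity may need to make local decisions to improve network performance in dynamic and uncertain network environments. Standard learning algorithms such as single-agent reinforcement learning (RL) or deep reinforcement learning (DRL) have recently been used to enable each network entity, acting as an agent, to learn an optimal decision-making policy adaptively by interacting with an unknown environment. However, such algorithms fail to model cooperation or competition among network entities and simply treat other entities as part of the environment, which may lead to non-stationarity. Multi-agent reinforcement learning (MARL) allows each network entity to learn its optimal policy by observing not only the environment but also the policies of other entities. As a result, MARL can significantly improve the learning efficiency of network entities, and it has recently been used to solve various problems in emerging networks. In this paper, we therefore review the applications of MARL in emerging networks. In particular, we provide a tutorial on MARL and a comprehensive survey of its applications in the next-generation Internet. We first introduce single-agent RL and MARL. We then review a number of applications of MARL to emerging problems in the future Internet. These problems include network access, transmit power control, computation offloading, content caching, packet routing, trajectory design for UAV-aided networks, and network security.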
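Rapid changes in the finance industry, driven by the increasing amount of data, have revolutionized techniques for data processing and data analysis and have brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches to financial decision-making problems, which rely heavily on model assumptions, new developments in reinforcement learning (RL) can make full use of large amounts of financial data with fewer model assumptions and improve decisions in complex financial environments. This survey paper aims to review recent developments in, and uses of, RL approaches in finance. We introduce Markov decision processes, which are the setting for many commonly used RL approaches. Various algorithms are then introduced, with a focus on value-based and policy-based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to deep RL algorithms. Our survey concludes by discussing applications of these RL algorithms to a variety of decision-making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising.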
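We present an approach, inspired by event-triggered control (ETC) techniques, to reduce the communication of information required by a distributed Q-learning system. We consider a baseline scenario of a distributed Q-learning problem on a Markov decision process (MDP). Following an event-based approach, N agents explore the MDP and communicate experiences only when necessary to a central learner, which performs the updates of the actor Q-functions. We design an event-based distributed Q-learning system (EBD-Q) and derive convergence guarantees with respect to a vanilla Q-learning algorithm. We present experimental results showing that event-based communication yields a substantial reduction in data transmission rates in such distributed systems. In addition, we discuss the effects these event-based approaches have on the learning processes studied and how they can be applied to more complex multi-agent systems.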
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
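Autonomous vehicles (AVs) must operate safely and efficiently in dynamic environments. To this end, AVs equipped with joint radar-communications (JRC) functions can enhance driving safety by using both radar detection and data communication. However, optimizing the performance of an AV system with two different functions under the uncertainty and dynamics of the surrounding environment is very challenging. In this work, we first propose an intelligent optimization framework based on the Markov decision process (MDP) to help the AV make optimal decisions when selecting JRC operation functions under the dynamics and uncertainty of the surrounding environment. We then develop an effective learning algorithm that leverages recent advances in deep reinforcement learning to find the optimal policy for the AV without requiring any prior information about the surrounding environment. Furthermore, to make the proposed framework more scalable, we develop a transfer learning (TL) mechanism that enables the AV to leverage valuable experience to accelerate the training process. Extensive simulations show that, compared with other conventional deep reinforcement learning approaches, the proposed transferable deep reinforcement learning framework reduces the AV's obstacle miss-detection probability by up to 67%.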
We consider infinite horizon Markov decision processes (MDPs) with fast-slow structure, meaning that certain parts of the state space move "fast" (and in a sense, are more influential) while other parts transition more "slowly." Such structure is common in real-world problems where sequential decisions need to be made at high frequencies, yet information that varies at a slower timescale also influences the optimal policy. Examples include: (1) service allocation for a multi-class queue with (slowly varying) stochastic costs, (2) a restless multi-armed bandit with an environmental state, and (3) energy demand response, where both day-ahead and real-time prices play a role in the firm's revenue. Models that fully capture these problems often result in MDPs with large state spaces and large effective time horizons (due to frequent decisions), rendering them computationally intractable. We propose an approximate dynamic programming algorithmic framework based on the idea of "freezing" the slow states, solving a set of simpler finite-horizon MDPs (the lower-level MDPs), and applying value iteration (VI) to an auxiliary MDP that transitions on a slower timescale (the upper-level MDP). We also extend the technique to a function approximation setting, where a feature-based linear architecture is used. On the theoretical side, we analyze the regret incurred by each variant of our frozen-state approach. Finally, we give empirical evidence that the frozen-state approach generates effective policies using just a fraction of the computational cost, while illustrating that simply omitting slow states from the decision modeling is often not a viable heuristic.
Batch reinforcement learning is a subfield of dynamic programming-based reinforcement learning. Originally defined as the task of learning the best possible policy from a fixed set of a priori-known transition samples, the (batch) algorithms developed in this field can be easily adapted to the classical online case, where the agent interacts with the environment while learning. Due to the efficient use of collected data and the stability of the learning process, this research area has attracted a lot of attention recently. In this chapter, we introduce the basic principles and the theory behind batch reinforcement learning, describe the most important algorithms, exemplarily discuss ongoing research within this field, and briefly survey real-world applications of batch reinforcement learning.
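We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that, by carefully composing the GHMs of the base policies, we can evaluate, without any additional learning, any non-Markov policy that switches between a set of base Markov policies with fixed probability. We can then apply generalized policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result for previously proposed methods and showing how to train these models stably in deep RL settings.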
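With the proliferation of mobile networks, we are seeing a strong diversification of services, which demands greater flexibility from existing networks. Network slicing has been proposed as a resource utilization solution for 5G and future networks to address this pressing need. In network slicing, dynamic resource orchestration and network slice management are crucial for maximizing resource utilization. Unfortunately, this process is too complex for traditional approaches because accurate models are lacking and the underlying structures are dynamic and hidden. We formulate the problem as a constrained Markov decision process (CMDP) without knowledge of the models or hidden structures. In addition, we propose to solve the problem using Clara, a constrained reinforcement learning based resource allocation algorithm. In particular, we handle cumulative and instantaneous constraints using adaptive interior-point policy optimization and a projection layer, respectively. Evaluations show that Clara clearly outperforms baselines in resource allocation with service demand guarantees.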
In many sequential decision-making problems one is interested in minimizing an expected cumulative cost while taking into account risk, i.e., increased awareness of events of small probability and high consequences. Accordingly, the objective of this paper is to present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented via a chance constraint or a constraint on the conditional value-at-risk (CVaR) of the cumulative cost. We collectively refer to such problems as percentile risk-constrained MDPs. Specifically, we first derive a formula for computing the gradient of the Lagrangian function for percentile risk-constrained MDPs. Then, we devise policy gradient and actor-critic algorithms that (1) estimate such gradient, (2) update the policy in the descent direction, and (3) update the Lagrange multiplier in the ascent direction. For these algorithms we prove convergence to locally optimal policies. Finally, we demonstrate the effectiveness of our algorithms in an optimal stopping problem and an online marketing application.
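To make the CVaR risk measure mentioned above concrete, here is a minimal sketch that computes the empirical value-at-risk and CVaR of a sample of cumulative costs. It uses the standard representation CVaR_α(X) = VaR_α(X) + E[(X − VaR_α(X))⁺] / (1 − α); the function name and the sample data are illustrative assumptions and are not taken from the paper.

```python
import math


def empirical_var_cvar(costs, alpha):
    """Return the empirical (VaR, CVaR) of a cost sample at level alpha in (0, 1).

    VaR is taken as the empirical alpha-quantile; CVaR adds the average
    excess over VaR, rescaled by the tail mass (1 - alpha).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    xs = sorted(costs)
    n = len(xs)
    k = max(math.ceil(alpha * n) - 1, 0)  # index of the alpha-quantile
    var = xs[k]
    excess = sum(max(x - var, 0.0) for x in xs)
    cvar = var + excess / ((1.0 - alpha) * n)
    return var, cvar


# Hypothetical cumulative costs from ten episodes.
sample_costs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
var_50, cvar_50 = empirical_var_cvar(sample_costs, alpha=0.5)
print(var_50, cvar_50)
```

For this sample at α = 0.5, the CVaR equals the mean of the worst half of the costs, which is how the measure emphasizes low-probability, high-cost outcomes rather than the average case.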
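Experience replay \citep{Lin1993ReInforcement,Mnih2015human} is a widely used technique for achieving efficient use of data and improved performance in RL algorithms. In experience replay, past transitions are stored in a memory buffer and reused during learning. Previous works have proposed various schemes for sampling from the replay buffer, attempting to select the experiences that contribute most to convergence to an optimal policy. Here, we give conditions on the replay sampling scheme that ensure convergence, focusing on the well-known Q-learning algorithm in the tabular setting. After establishing sufficient conditions for convergence, we turn to a slightly different use of experience replay: replaying experiences in a biased manner as a means of changing the properties of the resulting policy. We initiate a rigorous study of experience replay as a tool for controlling and modifying the properties of the resulting policy. In particular, we show that using an appropriate biased sampling scheme can allow us to achieve a \emph{safe} policy. We believe that using experience replay as a biasing mechanism that allows the resulting policy to be controlled in desirable ways is an idea with promising potential for many applications.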
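As a rough illustration of non-uniform replay sampling in tabular Q-learning, the sketch below replays a small fixed buffer with hand-picked sampling weights. The toy transitions, the weights, and the requirement that every transition keeps a positive sampling probability are assumptions made for this example; they are not the paper's conditions or its biased sampling scheme.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1)  # hypothetical action set shared by all states


def replay_q_learning(buffer, weights, num_updates=5000,
                      alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning driven only by weighted sampling from a replay buffer."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(num_updates):
        # Draw one stored transition with probability proportional to its weight.
        s, a, r, s_next, done = rng.choices(buffer, weights=weights, k=1)[0]
        bootstrap = 0.0 if done else max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * bootstrap - q[(s, a)])
    return q


# Toy deterministic transitions: (state, action, reward, next_state, done).
buffer = [
    (0, 0, 0.0, 1, False),  # move from state 0 to state 1
    (0, 1, 0.5, 2, True),   # exit immediately with a small reward
    (1, 0, 1.0, 2, True),   # exit from state 1 with a larger reward
]
weights = [1.0, 1.0, 3.0]   # replay the last transition three times as often

q = replay_q_learning(buffer, weights)
print(q[(0, 0)], q[(0, 1)], q[(1, 0)])
```

In this deterministic toy problem the skewed weights change only how quickly each entry is learned, not the values reached, since every stored transition is still replayed many times.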
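We develop a reinforcement learning (RL) framework for applications that involve sequential decisions and exogenous uncertainty, such as resource allocation and inventory management. In these applications, the uncertainty is due only to exogenous variables such as future demand. A popular approach is to predict the exogenous variables from historical data and then plan with the predictions. However, this indirect approach requires a high-fidelity model of the exogenous process to guarantee good downstream decisions, which can be impractical when the exogenous process is complex. In this work, we propose an alternative approach based on hindsight learning that sidesteps modeling the exogenous process. Our key insight is that, unlike in Sim2Real RL, we can revisit past decisions in the historical data and derive counterfactual consequences for other actions in these applications. Our framework uses hindsight-optimal actions as the policy training signal and has strong theoretical guarantees on decision-making performance. Using our framework, we develop an algorithm to allocate compute resources for real-world Microsoft Azure workloads. The results show that our approach learns better policies than domain-specific heuristics and Sim2Real RL baselines.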
Recent advances in distributed artificial intelligence (AI) have led to tremendous breakthroughs in various communication services, from fault-tolerant factory automation to smart cities. When distributed learning is run over a set of wirelessly connected devices, random channel fluctuations and the incumbent services running on the same network impact the performance of both distributed learning and the coexisting service. In this paper, we investigate a mixed service scenario where distributed AI workflow and ultra-reliable low latency communication (URLLC) services run concurrently over a network. Consequently, we propose a risk sensitivity-based formulation for device selection to minimize the AI training delays during its convergence period while ensuring that the operational requirements of the URLLC service are met. To address this challenging coexistence problem, we transform it into a deep reinforcement learning problem and address it via a framework based on soft actor-critic algorithm. We evaluate our solution with a realistic and 3GPP-compliant simulator for factory automation use cases. Our simulation results confirm that our solution can significantly decrease the training delay of the distributed AI service while keeping the URLLC availability above its required threshold and close to the scenario where URLLC solely consumes all network resources.
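The polling system with switch-over durations is a useful model with several practical applications. It is classified as a discrete event dynamic system (DEDS), for which no single agreed-upon modeling approach exists. Furthermore, DEDS are quite complex. To date, the most sophisticated approach to modeling the polling system of interest has been a continuous-time Markov decision process (CTMDP). This paper presents a semi-Markov decision process (SMDP) formulation of the polling system in order to introduce additional modeling power. This power comes at the expense of truncation errors and expensive numerical integration, which naturally raises the question of whether the SMDP policy provides a worthwhile advantage. To add to this discussion, we show how sparsity can be exploited in the CTMDP to develop a computationally efficient model. The discounted performance of the SMDP and CTMDP policies is evaluated using a semi-Markov process simulator. The two policies are accompanied by a heuristic policy developed specifically for this polling system, as well as an exhaustive service policy. Parametric and non-parametric hypothesis tests are used to test whether the differences in performance are statistically significant.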
Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classification to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning.
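This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates with model-free updates, similar to the Dyna architecture. Background planning with learned models often performs worse than model-free alternatives such as Double DQN, even though the former uses more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated over many steps. In this paper, we avoid this limitation by restricting background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning, and avoids learning the transition dynamics entirely. We show that our GSP algorithm learns significantly faster than a Double DQN baseline in a variety of settings.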
Recent technological advancements in space, air and ground components have made possible a new network paradigm called "space-air-ground integrated network" (SAGIN). Unmanned aerial vehicles (UAVs) play a key role in SAGINs. However, due to UAVs' high dynamics and complexity, the real-world deployment of a SAGIN becomes a major barrier for realizing such SAGINs. Compared to the space and terrestrial components, UAVs are expected to meet performance requirements with high flexibility and dynamics using limited resources. Therefore, employing UAVs in various usage scenarios requires well-designed planning in algorithmic approaches. In this paper, we provide a comprehensive review of recent learning-based algorithmic approaches. We consider possible reward functions and discuss the state-of-the-art algorithms for optimizing the reward functions, including Q-learning, deep Q-learning, multi-armed bandit (MAB), particle swarm optimization (PSO) and satisfaction-based learning algorithms. Unlike other survey papers, we focus on the methodological perspective of the optimization problem, which can be applicable to various UAV-assisted missions on a SAGIN using these algorithms. We simulate users and environments according to real-world scenarios and compare the learning-based and PSO-based methods in terms of throughput, load, fairness, computation time, etc. We also implement and evaluate the 2-dimensional (2D) and 3-dimensional (3D) variations of these algorithms to reflect different deployment cases. Our simulation suggests that the 3D satisfaction-based learning algorithm outperforms the other approaches for various metrics in most cases. We discuss some open challenges at the end and our findings aim to provide design guidelines for algorithm selections while optimizing the deployment of UAV-assisted SAGINs.
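In the standard data analysis framework, data is first collected (all at once), and data analysis is then carried out. Moreover, the data-generating process is usually assumed to be exogenous. This approach is natural when the data analyst has no influence on how the data is generated. However, advances in digital technology have made it easier for firms to learn from data and make decisions at the same time. As these decisions generate new data, the data analyst (a business manager or an algorithm) also becomes the data generator. This interaction creates a new type of bias, reinforcement bias, which exacerbates the endogeneity problem in static data analysis. Causal inference techniques should be incorporated into reinforcement learning to address such issues.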