The development of guidance, navigation and control frameworks/algorithms for swarms has attracted significant attention in recent years. That said, planning swarm allocations/trajectories for engaging with enemy swarms remains a largely understudied problem. Although small-scale scenarios can be addressed with tools from differential game theory, existing approaches fail to scale to large-scale multi-agent pursuit-evasion (PE) scenarios. In this work, we propose a reinforcement learning (RL) based framework to decompose large-scale swarm engagement problems into a number of independent multi-agent pursuit-evasion games. We simulate a variety of multi-agent PE scenarios, where finite-time capture is guaranteed under certain conditions. The resulting PE statistics are provided as a reward signal to the high-level allocation layer, which uses an RL algorithm to allocate controlled swarm units to eliminate enemy swarm units with maximum efficiency. We verify our approach in large-scale swarm-to-swarm engagement simulations.
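As an editorial illustration of the two-layer idea in the abstract above (a high-level allocation policy rewarded by statistics coming back from low-level pursuit-evasion simulations), here is a minimal Python sketch. All names, the capture-efficiency proxy, and the reward definition are assumptions made for this sketch, not the authors' implementation.

```python
# Minimal sketch: a high-level allocation assigns pursuers to enemy clusters,
# low-level PE simulations return capture statistics, and those statistics are
# fed back as the allocation reward. All quantities below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def simulate_pe_game(n_pursuers: int, n_evaders: int) -> float:
    """Stand-in for a low-level PE simulation: returns a capture-efficiency
    score in [0, 1] that grows with the pursuer/evader ratio."""
    ratio = n_pursuers / max(n_evaders, 1)
    return float(np.clip(ratio / (1.0 + ratio) + 0.05 * rng.standard_normal(), 0.0, 1.0))

def allocation_reward(allocation: np.ndarray, enemy_sizes: np.ndarray) -> float:
    """High-level reward: mean capture efficiency over all enemy clusters,
    given allocation[j] pursuers sent against enemy cluster j."""
    return float(np.mean([simulate_pe_game(a, e) for a, e in zip(allocation, enemy_sizes)]))

# One allocation decision: 12 pursuers split across 3 enemy clusters.
enemy_sizes = np.array([4, 2, 6])
allocation = np.array([5, 2, 5])
print("allocation reward:", allocation_reward(allocation, enemy_sizes))
```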
Many real-world applications can be formulated as multi-agent cooperation problems, such as network packet routing and the coordination of autonomous vehicles. The emergence of deep reinforcement learning (DRL) provides a promising approach to multi-agent cooperation through interaction between agents and the environment. However, traditional DRL solutions suffer from high dimensionality during policy search when multiple agents have continuous action spaces. Furthermore, the dynamics of the agents' policies make training non-stationary. To address these issues, we propose a hierarchical reinforcement learning approach that employs high-level decision making and low-level individual control for efficient policy search. In particular, cooperation among multiple agents can be learned efficiently in the high-level discrete action space, while low-level individual control reduces to single-agent reinforcement learning. In addition to hierarchical reinforcement learning, we propose an opponent modeling network to model the other agents' policies during the learning process. In contrast to end-to-end DRL approaches, our approach reduces the learning complexity by decomposing the overall task into sub-tasks in a hierarchical structure. To evaluate the efficiency of our approach, we conduct a real-world case study on a cooperative lane-change scenario. Both simulation and real-world experiments show the superiority of our approach in terms of collision rate and convergence speed.
State-of-the-art multi-agent reinforcement learning (MARL) methods have provided promising solutions to a variety of complex problems. Yet, these methods all assume that agents perform synchronized primitive-action executions, so they are not truly scalable to long-horizon real-world multi-agent/robot tasks that inherently require agents/robots to asynchronously reason about high-level action selection over varying time durations. The macro-action decentralized partially observable Markov decision process (MacDec-POMDP) is a general formalization of asynchronous decision making under uncertainty in fully cooperative multi-agent tasks. In this thesis, we first propose a set of value-based RL approaches for MacDec-POMDPs, in which agents are allowed to perform asynchronous learning and decision making with macro-action-value functions under three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE). Building on this work, we formulate a set of macro-action-based policy gradient algorithms under the same three training paradigms, in which agents are allowed to directly optimize their parameterized policies in an asynchronous manner. We evaluate our methods both in simulation and on real robots. Empirical results demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality and asynchronous solutions with macro-actions.
Non-cooperative and cooperative games with a very large number of players have many applications but generally remain intractable when the number of players increases. Introduced by Lasry and Lions, and by Huang, Caines and Malhamé, mean field games (MFGs) rely on a mean-field approximation to allow the number of players to grow to infinity. Traditional methods for solving these games generally rely on solving partial or stochastic differential equations with full knowledge of the model. Recently, reinforcement learning (RL) has appeared promising for solving complex problems. By combining MFGs and RL, we hope to solve games at a very large scale, both in terms of population size and environment complexity. In this survey, we review the recent literature on learning Nash equilibria in MFGs. We first identify the most common settings (static, stationary and evolutive). We then present a general framework for classical iterative methods (based on best-response computation or policy evaluation) to solve MFGs exactly. Building on these algorithms and their connection to Markov decision processes, we explain how RL can be used to learn MFG solutions in a model-free way. Finally, we present numerical illustrations on benchmark problems and conclude with some perspectives.
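To make the "classical iterative methods" mentioned above concrete, the following toy sketch runs a damped best-response fixed-point iteration for a static mean field game with a simple congestion cost. The action set, cost values and damping factor are assumptions chosen purely for illustration, not taken from the survey.

```python
# Toy static mean field game: each representative agent best-responds to the
# current population distribution mu over actions; mu is then updated with
# damping until it stops changing (an approximate equilibrium of the game).
import numpy as np

n_actions = 5
base_cost = np.array([1.0, 0.8, 0.6, 0.9, 1.2])   # intrinsic cost of each action
congestion = 2.0                                   # penalty for choosing crowded actions

mu = np.full(n_actions, 1.0 / n_actions)           # initial mean field
for it in range(500):
    cost = base_cost + congestion * mu             # cost felt by a single agent
    best_response = np.zeros(n_actions)
    best_response[np.argmin(cost)] = 1.0           # pure best response
    new_mu = 0.9 * mu + 0.1 * best_response        # damped best-response update
    if np.max(np.abs(new_mu - mu)) < 1e-6:
        break
    mu = new_mu

print("approximate equilibrium distribution:", np.round(mu, 3))
```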
We present a novel reinforcement learning (RL) based task allocation and decentralized navigation algorithm for mobile robots in warehouse environments. Our approach is designed for scenarios in which multiple robots perform a variety of pick-up and delivery tasks. We consider the problem of joint decentralized task allocation and navigation and present a two-level approach to solve it. At the higher level, we solve task allocation by formulating it as a Markov decision process and choosing an appropriate reward that minimizes the total travel delay (TTD). At the lower level, we use an ORCA-based decentralized navigation scheme that enables each robot to execute these tasks independently while avoiding collisions with other robots and dynamic obstacles. We couple the lower and upper levels by defining the higher-level reward as feedback from the lower-level navigation algorithm. We perform extensive evaluations in complex warehouse layouts with a large number of agents and highlight the benefits over state-of-the-art algorithms based on myopic pickup-distance minimization and regret-based task selection. We observe improvements of up to 14% in task completion time and up to 40% in computing collision-free trajectories for the robots.
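A hedged sketch of how the two levels described above could be coupled: the high-level allocator's reward is the negative travel delay reported back by the low-level navigator. The Task structure, the distance-based travel-time proxy, and all names are illustrative assumptions; the paper's ORCA-based navigator is not reproduced here.

```python
# High-level allocation reward as feedback from a low-level navigator:
# reward = -(delay contributed by assigning this task to this robot now).
from dataclasses import dataclass
import math

@dataclass
class Task:
    pickup: tuple
    dropoff: tuple
    release_time: float

def travel_time(a: tuple, b: tuple, speed: float = 1.0) -> float:
    """Placeholder for the travel time reported by the low-level navigator."""
    return math.dist(a, b) / speed

def allocation_reward(robot_pos: tuple, task: Task, now: float) -> float:
    """Negative contribution of this assignment to the total travel delay (TTD)."""
    finish = now + travel_time(robot_pos, task.pickup) + travel_time(task.pickup, task.dropoff)
    return -(finish - task.release_time)

print(allocation_reward((0.0, 0.0), Task((3.0, 4.0), (6.0, 8.0), release_time=0.0), now=1.0))
```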
Safety is the primary concern when it comes to air traffic. Flight safety between unmanned aerial vehicles (UAVs) is ensured through pairwise separation minima, making use of conflict detection and resolution methods. Existing methods mainly deal with pairwise conflicts, but due to the expected increase in traffic density, encounters involving more than two UAVs are likely to occur. In this paper, we model multi-UAV conflict resolution as a multi-agent reinforcement learning problem. We implement an algorithm based on graph neural networks, in which cooperating agents can communicate to jointly generate resolution maneuvers. The model is evaluated in scenarios with 3 and 4 present agents. Results show that the agents are able to successfully resolve multi-UAV conflicts through a cooperative strategy.
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high-dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in the real-world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, and inverse reinforcement learning, which are related but are not classical RL algorithms. The role of simulators in training agents, as well as methods to validate, test and robustify existing solutions in RL, is also discussed.
Deep reinforcement learning (RL) makes it possible to solve complex robotics problems using neural networks as function approximators. However, policies trained in a nominal environment suffer in terms of generalization when transferred from one environment to another. In this work, we use robust Markov decision processes (RMDPs), which combine ideas from robust control and RL, to train a drone control policy. We opt for pessimistic optimization to handle the potential gap that arises when a policy is transferred from one environment to another. The trained control policy is tested on the task of quadrotor position control. The RL agent is trained in the MuJoCo simulator. During testing, different environment parameters (unseen during training) are used to validate the robustness of the trained policy under transfer from one environment to another. The robust policy outperforms standard agents in these environments, suggesting that the added robustness increases generality and allows adaptation to non-stationary environments. Code: https://github.com/adipandas/gym_multirotor
Obstacle avoidance for small unmanned aircraft is vital for the safety of future urban air mobility (UAM) and unmanned aircraft system (UAS) traffic management (UTM). There are many techniques for real-time robust drone guidance, but many of them operate in discretized airspace and control, which would require an additional path-smoothing step to provide flexible commands for the UAS. To provide operationally safe and efficient computational guidance for unmanned aircraft, we explore the use of a deep reinforcement learning algorithm based on proximal policy optimization (PPO) to guide an autonomous UAS to its destination while avoiding obstacles through continuous control. The proposed scenario state representation and reward function can map the continuous state space to continuous control of both heading angle and speed. To verify the performance of the proposed learning framework, we conducted numerical experiments with static and moving obstacles. Uncertainties associated with the environment and safety operation bounds are investigated in detail. Results show that the proposed model can provide accurate and robust guidance and resolve conflicts with a success rate of over 99%.
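The following minimal sketch illustrates the kind of continuous Gaussian policy head a PPO-style guidance agent could use to output a heading-angle change and a speed command. The observation features, the linear "network", and the action bounds are assumptions made for this example, not the paper's actual model.

```python
# Gaussian policy head mapping an observation vector to (heading change, speed).
import numpy as np

rng = np.random.default_rng(1)
obs_dim = 8                                   # e.g. relative goal + nearest-intruder features
W = 0.01 * rng.standard_normal((2, obs_dim))  # tiny stand-in "network": 2 action means
log_std = np.array([-0.5, -0.5])              # learned log standard deviations

def policy(obs: np.ndarray) -> np.ndarray:
    mean = W @ obs                                           # unbounded action means
    action = mean + np.exp(log_std) * rng.standard_normal(2)  # Gaussian sample
    heading_change = np.tanh(action[0]) * np.deg2rad(10.0)    # bounded to +/- 10 deg per step
    speed = 20.0 + np.tanh(action[1]) * 10.0                  # bounded to [10, 30] m/s
    return np.array([heading_change, speed])

print(policy(rng.standard_normal(obs_dim)))
```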
The analysis and control of large-population systems is of great interest to diverse areas of research and engineering, ranging from epidemiology and robotic swarms to economics and finance. An increasingly popular and effective approach to realizing sequential decision making in multi-agent systems is multi-agent reinforcement learning, as it allows an automatic and model-free analysis of highly complex systems. However, the key issue of scalability complicates the design of control and reinforcement learning algorithms, especially in systems with large populations of agents. Although reinforcement learning has found empirical success in many scenarios, problems with many agents quickly become intractable and require special consideration. In this survey, we shed light on current approaches to tractably understanding and analyzing large-population systems, both through multi-agent reinforcement learning and through adjacent areas of research such as mean field games, collective intelligence, or complex network theory. These classically independent subject areas offer a variety of approaches to understanding or modeling large-population systems, which may be well suited to the formulation of tractable MARL algorithms in the future. Finally, we survey potential application areas for large-scale control and identify fruitful future applications of learning algorithms in practical systems. We hope that our survey provides insight and future directions to junior and senior researchers in theoretical and applied sciences alike.
Asset allocation (or portfolio management) is the task of determining how to optimally allocate a finite budget of funds to a range of financial instruments/assets (such as stocks). This study investigates the performance of reinforcement learning (RL) applied to portfolio management using model-free deep RL agents. We trained several RL agents on real-world stock prices to learn how to perform asset allocation. We compared the performance of these RL agents against some baseline agents, and also compared the RL agents among themselves to understand which classes of agents performed better. From our analysis, RL agents can perform the task of portfolio management, since they significantly outperformed the baseline agents (random allocation and uniform allocation). Four RL agents (A2C, SAC, PPO and TRPO) outperformed the best baseline, MPT, overall, which demonstrates the ability of RL agents to uncover more profitable trading strategies. Furthermore, there were no significant performance differences between value-based and policy-based RL agents. Actor-critic agents performed better than the other types of agents. Likewise, on-policy agents performed better, because they are better at policy evaluation and sample efficiency is not a significant issue in portfolio management. This study shows that RL agents can substantially improve asset allocation, since they outperform strong baselines. Based on our analysis, on-policy actor-critic RL agents show the most promise.
Reinforcement Learning (RL) is currently one of the most commonly used techniques for traffic signal control (TSC), as it can adaptively adjust traffic signal phase and duration according to real-time traffic data. However, a fully centralized RL approach is beset with difficulties in a multi-network scenario because of the exponential growth of the state-action space as the number of intersections increases. Multi-agent reinforcement learning (MARL) can overcome this high-dimensionality problem by distributing the global control task among local RL agents, but it also brings new challenges, such as failure to converge caused by the non-stationary Markov Decision Process (MDP). In this paper, we introduce an off-policy Nash deep Q-Network (OPNDQN) algorithm, which mitigates the weaknesses of both the fully centralized and the MARL approaches. OPNDQN addresses the problem that traditional algorithms cannot handle large state-action-space traffic models by using a fictitious play approach at each iteration to find the Nash equilibrium among neighboring intersections, from which no intersection has an incentive to unilaterally deviate. One of the main advantages of OPNDQN is that it mitigates the non-stationarity of the multi-agent Markov process, because it accounts for the mutual influence among neighboring intersections by sharing their actions. On the other hand, for training a large traffic network, the convergence rate of OPNDQN is higher than that of existing MARL approaches because it does not incorporate all state information of every agent. We conduct extensive experiments using the Simulation of Urban MObility (SUMO) simulator and show the clear superiority of OPNDQN over several existing MARL approaches in terms of average queue length, episode training reward and average waiting time.
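As an illustration of the fictitious-play step that OPNDQN is described as using to find a Nash equilibrium between neighboring intersections, here is a self-contained sketch for two agents with a shared-interest Q-table over joint phase choices. The payoff values and the identical-interest simplification are assumptions of this sketch, not the paper's setup.

```python
# Fictitious play between two neighboring intersections: each repeatedly
# best-responds to the empirical frequency of the other's phase choices.
# Fictitious play is known to converge in identical-interest games like this one.
import numpy as np

rng = np.random.default_rng(2)
n_phases = 4
Q = rng.standard_normal((n_phases, n_phases))  # shared payoff over joint actions (a1, a2)

counts1 = np.ones(n_phases)   # empirical action counts of agent 1 (seen by agent 2)
counts2 = np.ones(n_phases)   # empirical action counts of agent 2 (seen by agent 1)
for _ in range(500):
    belief2 = counts2 / counts2.sum()
    belief1 = counts1 / counts1.sum()
    a1 = int(np.argmax(Q @ belief2))     # agent 1 best-responds to its belief about agent 2
    a2 = int(np.argmax(Q.T @ belief1))   # agent 2 best-responds to its belief about agent 1
    counts1[a1] += 1
    counts2[a2] += 1

print("phase pair played most often:", int(np.argmax(counts1)), int(np.argmax(counts2)))
```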
Recently, some challenging tasks in multi-agent systems have been solved by hierarchical reinforcement learning methods. Inspired by the intra-level and inter-level coordination in the human nervous system, we propose HAVEN, a novel value decomposition framework based on hierarchical reinforcement learning for fully cooperative multi-agent problems. To address the instability arising from the concurrent optimization of policies across levels and agents, we introduce a dual coordination mechanism for inter-level and inter-agent strategies by designing reward functions in a two-level hierarchy. HAVEN does not require domain knowledge or pre-training, and can be applied to any value decomposition variant. Our method achieves desirable results on different decentralized partially observable Markov decision process domains and outperforms other popular multi-agent hierarchical reinforcement learning algorithms.
Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces both the speed of learning and the achieved reward. In this work, we present an off-policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high level. The novelty of this work is the theoretical motivation for adding entropy to the RL objective in the HRL setting. We empirically show that entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablation study to analyze the effects of entropy on the hierarchy, in which adding entropy to the high level emerged as the most desirable configuration. Furthermore, a higher temperature in the low level leads to Q-value overestimation and increases the stochasticity of the environment that the high level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
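For reference, adding entropy to a level of the hierarchy amounts to using an entropy-regularized ("soft") bootstrap target of the form V(s') = E_{a'~pi}[Q(s', a') - alpha log pi(a'|s')], so a higher temperature alpha rewards a more stochastic policy at that level. The short sketch below computes such a target for a discrete action space; the numbers and the temperature value are arbitrary, not taken from the paper.

```python
# Entropy-regularized ("soft") TD target for a discrete action space.
import numpy as np

def soft_state_value(q_next: np.ndarray, pi_next: np.ndarray, alpha: float) -> float:
    """Soft value of the next state: E[Q(s',a') - alpha * log pi(a'|s')]."""
    return float(np.sum(pi_next * (q_next - alpha * np.log(pi_next + 1e-8))))

def soft_td_target(reward: float, q_next: np.ndarray, pi_next: np.ndarray,
                   alpha: float = 0.1, gamma: float = 0.99) -> float:
    return reward + gamma * soft_state_value(q_next, pi_next, alpha)

q_next = np.array([1.0, 0.5, 0.2])    # Q-values of next-state actions
pi_next = np.array([0.7, 0.2, 0.1])   # policy probabilities at the next state
print(soft_td_target(reward=0.3, q_next=q_next, pi_next=pi_next))
```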
Smart energy networks provide an effective means to accommodate high penetrations of variable renewable energy sources (such as solar and wind), which is key to the deep decarbonization of energy production. However, given the variability of renewable generation as well as energy demand, effective control and energy storage schemes must be developed to manage variable energy generation and achieve the desired system economics and environmental objectives. In this paper, we introduce a hybrid energy storage system composed of battery and hydrogen energy storage to handle the uncertainties related to electricity prices, renewable energy production, and consumption. We aim to improve renewable energy utilization and minimize energy costs and carbon emissions while ensuring energy reliability and stability within the network. To achieve this, we propose a multi-agent deep deterministic policy gradient approach, a reinforcement learning based control strategy, to optimize the scheduling of the hybrid energy storage system and energy demand in real time. The proposed approach is model-free and does not require explicit knowledge of, or a rigorous mathematical model of, the smart energy network environment. Simulation results based on real-world data show that: (i) the integrated and optimized operation of the hybrid energy storage system and energy demand reduces carbon emissions by 78.69%, improves cost savings by 23.5%, and raises renewable energy utilization by more than 13.2% compared with other baseline models, and (ii) the proposed algorithm outperforms state-of-the-art self-learning algorithms such as deep Q-networks.
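As a rough illustration of the multi-agent deep deterministic policy gradient machinery named above, the sketch below computes the centralized-critic target y_i = r_i + gamma * Q_i^target(s', a'_1, ..., a'_N) with a linear stand-in for the critic. The dimensions and weights are placeholders of this sketch, not the paper's models.

```python
# Centralized-critic target: each agent's critic sees the global state and all
# agents' actions, while each actor would only see its own observation.
import numpy as np

def critic(state: np.ndarray, joint_action: np.ndarray, w: np.ndarray) -> float:
    """Linear stand-in for Q_i(s, a_1, ..., a_N)."""
    return float(w @ np.concatenate([state, joint_action]))

def maddpg_target(reward: float, next_state: np.ndarray,
                  target_actions: np.ndarray, w_target: np.ndarray,
                  gamma: float = 0.99) -> float:
    """y_i = r_i + gamma * Q_i^target(s', a'_1, ..., a'_N)."""
    return reward + gamma * critic(next_state, target_actions, w_target)

state_dim, n_agents = 6, 3
w_target = 0.1 * np.ones(state_dim + n_agents)
print(maddpg_target(1.0, np.zeros(state_dim), np.array([0.2, -0.1, 0.4]), w_target))
```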
Recently, model-based agents have achieved better performance than model-free ones using the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is difficult to learn a model of the environment, and the significant compounding error may hinder the learning process when model-based methods are applied to multi-agent tasks. This paper proposes an implicit model-based multi-agent reinforcement learning method built on value decomposition methods. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states in the latent space, giving the agents foresight. Our approach can be applied to any multi-agent value decomposition method. The experimental results show that our method improves sample efficiency in different partially observable Markov decision process domains.
Despite significant progress in multi-agent reinforcement learning (MARL) in recent years, coordination in complex domains remains a challenge. Work in MARL often focuses on solving tasks where agents interact with all other agents and entities in the environment; however, we observe that real-world tasks are often composed of several isolated instances of local agent interactions (sub-tasks), and each agent can meaningfully focus on one sub-task to the exclusion of everything else in the environment. In these composite tasks, successful policies can often be decomposed into two levels of decision making: agents are allocated to specific sub-tasks, and each agent acts effectively only with respect to its assigned sub-task. This decomposed decision making provides a strong structural inductive bias, significantly reduces agent observation spaces, and encourages sub-task-specific policies to be reused and composed during training rather than treating each new composition of sub-tasks as unique. We introduce ALMA, a general learning method for taking advantage of these structured tasks. ALMA simultaneously learns a high-level sub-task allocation policy and low-level agent policies. We demonstrate that ALMA learns sophisticated coordination behavior in a number of challenging environments, outperforming strong baselines. ALMA's modularity also enables it to generalize better to new environment configurations. Finally, we find that while ALMA can integrate separately trained allocation and action policies, the best performance is obtained only by training all components jointly. Our code is available at https://github.com/shariqiqbal2810/alma
In this paper, we consider the problem of path finding for a set of homogeneous and autonomous agents navigating a previously unknown stochastic environment. In our problem setting, each agent attempts to maximize a given utility function while respecting safety properties. Our solution is based on ideas from evolutionary game theory, namely replicating policies that perform well and diminishing those that do not. We perform a comprehensive comparison with related multi-agent planning methods and show that our technique beats state-of-the-art RL algorithms in minimizing path length by nearly 30% in large spaces. We show that our algorithm is computationally faster than deep RL methods by at least an order of magnitude. We also show that it scales better with an increase in the number of agents compared to other methods, path-planning methods in particular. Lastly, we empirically demonstrate that the policies we learn are evolutionarily stable and thus impervious to invasion by any other policy.
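The evolutionary-game-theoretic update alluded to above (replicating policies that perform well and diminishing those that do not) can be written as discrete-time replicator dynamics. The sketch below is an illustrative assumption of how such an update might look; the fitness values are invented and do not come from the paper.

```python
# Discrete-time replicator dynamics over a small population of candidate policies.
import numpy as np

def replicator_step(p: np.ndarray, fitness: np.ndarray, dt: float = 0.1) -> np.ndarray:
    """p_i grows in proportion to how much its fitness exceeds the population average:
    p_i += dt * p_i * (f_i - mean fitness)."""
    avg = p @ fitness
    p = p + dt * p * (fitness - avg)
    return p / p.sum()   # renormalize against numerical drift

p = np.array([0.25, 0.25, 0.25, 0.25])     # population shares of 4 candidate policies
fitness = np.array([1.0, 1.4, 0.8, 1.1])   # e.g. negative mean path length (made up)
for _ in range(200):
    p = replicator_step(p, fitness)
print(np.round(p, 3))   # mass concentrates on the fittest policy
```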
Over the past decade, reinforcement learning has successfully solved complex control tasks and decision-making problems, such as the board game Go. However, there are few success stories of deploying these algorithms in real-world scenarios. One of the reasons is the lack of guarantees when handling and avoiding unsafe states, a fundamental requirement for critical control engineering systems. In this paper, we introduce Guided Safe Shooting (GuSS), a model-based RL approach that can learn to control systems with minimal violations of safety constraints. The model is learned from data collected during system operation in an iterative batch fashion, and is then used to plan the best action to perform at each time step. We propose three different safe planners, one based on a simple random shooting strategy and two based on MAP-Elites, a more advanced divergent-search algorithm. Experiments show that these planners help the learning agent avoid unsafe situations while maximally exploring the state space, a necessary aspect of learning an accurate model of the system. Furthermore, compared to model-free approaches, learning a model allows reducing the number of interactions with the real system while still reaching high rewards, a fundamental requirement when dealing with engineering systems.
Technology advancements in wireless communications and high-performance Extended Reality (XR) have empowered the development of the Metaverse. The demand for Metaverse applications, and hence real-time digital twinning of real-world scenes, is increasing. Nevertheless, the replication of 2D physical-world images into 3D virtual-world scenes is computationally intensive and requires computation offloading. The disparity in transmitted scene dimension (2D as opposed to 3D) leads to asymmetric data sizes in the uplink (UL) and downlink (DL). To ensure the reliability and low latency of the system, we consider an asynchronous joint UL-DL scenario where, in the UL stage, the smaller data size of the physical-world scenes captured by multiple extended reality users (XUs) is uploaded to the Metaverse Console (MC) to be construed and rendered. In the DL stage, the larger-size 3D virtual-world scenes need to be transmitted back to the XUs. The decisions pertaining to computation offloading and channel assignment are optimized in the UL stage, and the MC optimizes power allocation for users assigned a channel in the UL transmission stage. Several problems arise therefrom: (i) an interactive multi-process chain, specifically an Asynchronous Markov Decision Process (AMDP), (ii) joint optimization across multiple processes, and (iii) high-dimensional objective functions, or hybrid reward scenarios. To address these challenges, we design a novel multi-agent reinforcement learning algorithm structure, namely Asynchronous Actors Hybrid Critic (AAHC). Extensive experiments demonstrate that, compared to the proposed baselines, AAHC obtains better solutions with favorable training time.