An efficient exploration strategy is one of the essential issues for cooperative multi-agent reinforcement learning (MARL) algorithms that require complex coordination. In this study, we introduce a new exploration method based on strangeness that can be easily incorporated into any centralized training with decentralized execution (CTDE)-based MARL algorithm. Strangeness refers to the degree of unfamiliarity of the observations an agent visits. To give the observation strangeness a global perspective, it is also augmented with the degree of unfamiliarity of the visited entire state. The exploration bonus is obtained from this strangeness, and the proposed exploration method is not much affected by the stochastic transitions commonly observed in MARL tasks. To prevent a high exploration bonus from making MARL training insensitive to extrinsic rewards, we also propose a separate action-value function, trained on both the extrinsic reward and the exploration bonus, on which the behavioral policy that generates transitions is based. This makes CTDE-based MARL algorithms more stable when they are combined with an exploration method. Through a comparative evaluation on didactic examples and the StarCraft Multi-Agent Challenge, we show that the proposed exploration method achieves significant performance improvements in CTDE-based MARL algorithms.
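A minimal sketch of how such a strangeness bonus could be computed, assuming an autoencoder whose reconstruction error scores the unfamiliarity of an input; the network sizes and the mixing weight `lam` are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class StrangenessBonus(nn.Module):
    """Exploration bonus from the reconstruction error of simple autoencoders."""

    def __init__(self, obs_dim: int, state_dim: int, hidden: int = 64):
        super().__init__()
        self.obs_ae = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, obs_dim))
        self.state_ae = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, state_dim))

    def forward(self, obs: torch.Tensor, state: torch.Tensor,
                lam: float = 0.5) -> torch.Tensor:
        # Per-sample reconstruction error: inputs the autoencoder has rarely
        # been trained on reconstruct poorly, i.e., they are "strange".
        obs_err = ((self.obs_ae(obs) - obs) ** 2).mean(dim=-1)
        # Local observation strangeness augmented with a global full-state term.
        state_err = ((self.state_ae(state) - state) ** 2).mean(dim=-1)
        return obs_err + lam * state_err
```

One plausible reading of the robustness to stochastic transitions claimed above: reconstructing the current input, unlike predicting the next state, has no irreducible error under stochastic dynamics, so the bonus decays once a region is familiar regardless of transition noise.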
Efficient exploration in deep cooperative multi-agent reinforcement learning (MARL) still remains challenging in complex coordination problems. In this paper, we introduce a novel episodic multi-agent reinforcement learning method with curiosity-driven exploration, called EMC. We leverage an insight of popular factorized MARL algorithms: the "induced" individual Q-values, i.e., the individual utility functions used for local execution, are embeddings of local action-observation histories and can capture the interaction between agents due to reward backpropagation during centralized training. Therefore, we use the prediction errors of individual Q-values as intrinsic rewards for coordinated exploration, and utilize episodic memory to exploit explored informative experience to boost policy training. As the dynamics of an agent's individual Q-value function captures the novelty of states and the influence of other agents, our intrinsic reward can induce coordinated exploration toward new or promising states. We illustrate the advantages of our method through didactic examples and demonstrate that it significantly outperforms state-of-the-art MARL baselines on challenging tasks in the StarCraft II micromanagement benchmark.
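A hedged sketch of the curiosity signal described above: the intrinsic reward is the prediction error between each agent's individual Q-values and a slowly updated predictor network. The tensor shapes and the choice of a squared error are assumptions for illustration:

```python
import torch

def emc_intrinsic_reward(q_values: torch.Tensor,
                         predicted_q: torch.Tensor) -> torch.Tensor:
    """q_values, predicted_q: [n_agents, n_actions] tensors for one timestep."""
    # Squared prediction error per agent, averaged over actions and summed over
    # agents: large when individual Q-values are still changing, i.e., when
    # states are novel or other agents' behavior has shifted.
    per_agent = ((q_values - predicted_q) ** 2).mean(dim=-1)
    return per_agent.sum()
```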
Recently, model-based agents have achieved better performance than model-free ones using the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is difficult to learn a model of the environment, and the significant compounding error may hinder the learning process when model-based methods are applied to multi-agent tasks. This paper proposes an implicit model-based multi-agent reinforcement learning method built on value decomposition methods. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states in the latent space, giving the agents foresight. Our approach can be applied to any multi-agent value decomposition method. The experimental results show that our method improves sample efficiency in different partially observable Markov decision process domains.
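A minimal sketch, under assumptions, of evaluating a state with latent imagination: a learned dynamics model is rolled forward a few steps and the critic's estimates over the imagined trajectory are averaged. The module names (`dynamics`, `policy`, `critic`) are hypothetical stand-ins for a learned world model, not the paper's API:

```python
import torch

def imagined_value(z: torch.Tensor, dynamics, policy, critic,
                   horizon: int = 5, gamma: float = 0.99) -> torch.Tensor:
    """z: latent state [B, d] -> foresight-augmented value estimate [B, 1]."""
    total, discount = critic(z), 1.0
    for _ in range(horizon):
        z = dynamics(z, policy(z))        # imagine one step in latent space
        discount *= gamma
        total = total + discount * critic(z)
    return total / (horizon + 1)          # average over the imagined rollout
```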
Deep reinforcement learning (DRL) and deep multi-agent reinforcement learning (MARL) have achieved great success across a wide range of domains, including game AI, autonomous vehicles, and robotics. However, DRL and deep MARL agents are widely known to be sample-inefficient: millions of interactions are usually required even for relatively simple problem settings, which prevents wide application and deployment in real-world scenarios. One bottleneck challenge behind this is the well-known exploration problem, i.e., how to efficiently explore the environment and collect informative experience that benefits policy learning toward the optimum. The problem becomes more challenging in complex environments with sparse rewards, noisy distractions, long horizons, and non-stationary co-learners. In this paper, we conduct a comprehensive survey of existing exploration methods for both single-agent and multi-agent RL. We begin the survey by identifying several key challenges to efficient exploration and then review the main branches of exploration methods; beyond these, we also cover other notable exploration methods built on different ideas and techniques. In addition to algorithmic analysis, we provide a comprehensive and unified empirical comparison of exploration methods for DRL on a set of commonly used benchmarks. Based on our algorithmic and empirical investigation, we conclude by summarizing the open problems of exploration in DRL and deep MARL and pointing out a few future directions.
This work considers the problem of learning cooperative policies in complex, partially observable domains without explicit communication. We extend three classes of single-agent deep reinforcement learning algorithms based on policy gradient, temporal-difference error, and actor-critic methods to cooperative multi-agent systems. We introduce a set of cooperative control tasks that includes tasks with discrete and continuous actions, as well as tasks that involve hundreds of agents. The three approaches are evaluated against each other using different neural architectures, training procedures, and reward structures. Using deep reinforcement learning with a curriculum learning scheme, our approach can solve problems that were previously considered intractable by most multi-agent reinforcement learning algorithms. We show that policy gradient methods tend to outperform both temporal-difference and actor-critic methods when using feed-forward neural architectures. We also show that recurrent policies, while more difficult to train, outperform feed-forward policies on our evaluation tasks.
In multi-agent reinforcement learning (MARL), many popular methods, such as VDN and QMIX, are susceptible to a critical multi-agent pathology known as relative overgeneralization (RO), which arises when the optimal joint action's utility falls below that of a sub-optimal joint action in cooperative tasks. RO can cause the agents to get stuck in local optima or fail to solve tasks that require significant coordination between agents within a given timestep. Recent value-based MARL algorithms such as QPLEX and WQMIX can overcome RO to some extent. However, our experimental results show that they can still fail to solve cooperative tasks that exhibit strong RO. In this work, we propose a novel approach called curriculum learning for relative overgeneralization (CURO) to better overcome RO. To solve a target task that exhibits strong RO, CURO first fine-tunes the reward function of the target task to generate source tasks tailored to the current ability of the learning agent, and trains the agent on these source tasks. Then, to effectively transfer the knowledge acquired in one task to the next, we use a novel transfer learning method that combines value function transfer with buffer transfer, which enables more efficient exploration in the target task. We demonstrate that, when applied to QMIX, CURO overcomes the severe RO problem and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
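A minimal sketch of the two transfer steps CURO combines when moving between curriculum tasks: copying network weights (value function transfer) and seeding the next task's replay buffer (buffer transfer). The learner attributes below are hypothetical, not the paper's code:

```python
def transfer(src_learner, dst_learner) -> None:
    """Carry knowledge from a source curriculum task to the target task."""
    # Value function transfer: initialise the target-task networks from the
    # source task instead of from scratch.
    dst_learner.agent_net.load_state_dict(src_learner.agent_net.state_dict())
    dst_learner.mixer.load_state_dict(src_learner.mixer.state_dict())
    # Buffer transfer: start exploration on the target task from experience
    # already gathered on the easier source task.
    for episode in src_learner.replay_buffer:
        dst_learner.replay_buffer.add(episode)
```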
In this paper, we consider cooperative multi-agent reinforcement learning (MARL) with sparse rewards. To tackle this problem, we propose a novel method named MASER: MARL with subgoals generated from the experience replay buffer. Under the widely used assumption of centralized training with decentralized execution and consistent Q-value decomposition for MARL, MASER automatically generates proper subgoals for multiple agents from the experience replay buffer by considering both individual Q-values and the total Q-value. Then, MASER designs an individual intrinsic reward for each agent based on an actionable representation relevant to Q-learning, so that the agents reach their subgoals while maximizing the joint action value. Numerical results show that MASER significantly outperforms other state-of-the-art MARL algorithms on the StarCraft II micromanagement benchmark.
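A hedged sketch of subgoal generation in this spirit: observations in the replay buffer are scored by a mix of individual and total Q-values, and each agent is assigned the highest-scoring one as its subgoal. The weight `alpha` and the linear scoring rule are illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

def select_subgoals(buffer_obs: np.ndarray, indiv_q: np.ndarray,
                    total_q: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """buffer_obs: [B, n_agents, obs_dim]; indiv_q: [B, n_agents]; total_q: [B].

    Returns one subgoal observation per agent, shape [n_agents, obs_dim].
    """
    # Score each stored observation by its individual and total Q-values.
    score = alpha * indiv_q + (1 - alpha) * total_q[:, None]   # [B, n_agents]
    best = score.argmax(axis=0)                                # index per agent
    return np.stack([buffer_obs[best[i], i]
                     for i in range(buffer_obs.shape[1])])
```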
In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. We structurally enforce that the joint action-value is monotonic in the per-agent values, which allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralised and decentralised policies. We evaluate QMIX on a challenging set of StarCraft II micromanagement tasks, and show that QMIX significantly outperforms existing value-based multi-agent reinforcement learning methods.
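A condensed sketch of the QMIX mixing network: hypernetworks conditioned on the global state produce the mixing weights, and taking their absolute value enforces monotonicity of Q_tot in each agent's utility. Layer sizes here are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """Monotonic mixing of per-agent utilities into Q_tot, conditioned on state."""

    def __init__(self, n_agents: int, state_dim: int, embed: int = 32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed), nn.ReLU(), nn.Linear(embed, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        """agent_qs: [B, n_agents]; state: [B, state_dim] -> Q_tot: [B, 1]."""
        # Absolute values keep the mixing weights non-negative, so Q_tot is
        # monotonic in every agent's utility and per-agent argmax decentralises.
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed)
        hidden = F.elu(agent_qs.unsqueeze(1) @ w1 + b1)       # [B, 1, embed]
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        return (hidden @ w2 + b2).view(-1, 1)
```

The biases are left unconstrained: monotonicity only requires the weights, not the offsets, to be non-negative.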
We develop a multi-agent reinforcement learning (MARL) method to learn scalable control policies for target tracking. Our method can handle an arbitrary number of pursuers and targets; we show results for tasks consisting of up to 1000 pursuers tracking 1000 targets. We use a decentralized, partially observable Markov decision process framework to model the pursuers as agents that receive partial observations (range and bearing) about targets, which move using fixed, unknown policies. An attention mechanism is used to parameterize the value function of the agents; this mechanism allows us to handle an arbitrary number of targets. An entropy-regularized off-policy RL method is used to train a stochastic policy, and we discuss how it enables a hedging behavior between pursuers that leads to a weak form of cooperation in spite of completely decentralized control execution. We further develop a masking heuristic that allows training on smaller problems with few pursuers and targets and execution on much larger problems. Thorough simulation experiments, ablation studies, and comparisons to state-of-the-art algorithms are performed to study the scalability of the approach and the robustness of performance to varying numbers of agents and targets.
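A hedged sketch of the attention idea: each pursuer's value head attends over an arbitrary number of target embeddings, so the same parameters handle 10 or 1000 targets. The single-head design and dimensions are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TargetAttentionValue(nn.Module):
    """Value function that pools a variable number of targets via attention."""

    def __init__(self, obs_dim: int, tgt_dim: int, d: int = 64):
        super().__init__()
        self.query = nn.Linear(obs_dim, d)   # from the pursuer's own observation
        self.key = nn.Linear(tgt_dim, d)     # from each (range, bearing) target
        self.value = nn.Linear(tgt_dim, d)
        self.head = nn.Linear(d, 1)

    def forward(self, own_obs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        """own_obs: [B, obs_dim]; targets: [B, n_targets, tgt_dim]; n_targets varies."""
        q = self.query(own_obs).unsqueeze(1)                    # [B, 1, d]
        attn = torch.softmax(q @ self.key(targets).transpose(1, 2)
                             / q.shape[-1] ** 0.5, dim=-1)      # [B, 1, n_targets]
        pooled = (attn @ self.value(targets)).squeeze(1)        # [B, d]
        return self.head(pooled)                                # scalar value
```

Because the softmax pooling is permutation-invariant and independent of n_targets, the same weights transfer from small training problems to much larger execution problems.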
Recently, some challenging tasks in multi-agent systems have been solved by some hierarchical reinforcement learning methods. Inspired by the intra-level and inter-level coordination in the human nervous system, we propose a novel value decomposition framework HAVEN based on hierarchical reinforcement learning for fully cooperative multi-agent problems. To address the instability arising from the concurrent optimization of policies between various levels and agents, we introduce the dual coordination mechanism of inter-level and inter-agent strategies by designing reward functions in a two-level hierarchy. HAVEN does not require domain knowledge and pre-training, and can be applied to any value decomposition variant. Our method achieves desirable results on different decentralized partially observable Markov decision process domains and outperforms other popular multi-agent hierarchical reinforcement learning algorithms.
Cooperative multi-agent reinforcement learning (MARL) has made prominent progress in recent years. For training efficiency and scalability, most of the MARL algorithms make all agents share the same policy or value network. However, in many complex multi-agent tasks, different agents are expected to possess specific abilities to handle different subtasks. In those scenarios, sharing parameters indiscriminately may lead to similar behavior across all agents, which will limit the exploration efficiency and degrade the final performance. To balance the training complexity and the diversity of agent behavior, we propose a novel framework to learn dynamic subtask assignment (LDSA) in cooperative MARL. Specifically, we first introduce a subtask encoder to construct a vector representation for each subtask according to its identity. To reasonably assign agents to different subtasks, we propose an ability-based subtask selection strategy, which can dynamically group agents with similar abilities into the same subtask. In this way, agents dealing with the same subtask share their learning of specific abilities and different subtasks correspond to different specific abilities. We further introduce two regularizers to increase the representation difference between subtasks and stabilize the training by discouraging agents from frequently changing subtasks, respectively. Empirical results show that LDSA learns reasonable and effective subtask assignment for better collaboration and significantly improves the learning performance on the challenging StarCraft II micromanagement benchmark and Google Research Football.
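A minimal sketch, under assumptions, of ability-based subtask selection: subtask identities are embedded by an encoder, each agent produces an ability vector from its trajectory encoding, and agents sample the subtask whose representation best matches their ability. All shapes are illustrative:

```python
import torch
import torch.nn as nn

class SubtaskSelector(nn.Module):
    """Dynamically group agents into subtasks by ability-representation match."""

    def __init__(self, n_subtasks: int, agent_dim: int, d: int = 32):
        super().__init__()
        self.subtask_enc = nn.Embedding(n_subtasks, d)  # representation per subtask id
        self.ability = nn.Linear(agent_dim, d)          # agent ability vector

    def forward(self, agent_hidden: torch.Tensor) -> torch.Tensor:
        """agent_hidden: [n_agents, agent_dim] -> one subtask index per agent."""
        # Agents with similar hidden states score the same subtasks highly,
        # so they tend to land in the same group and share its learning.
        logits = self.ability(agent_hidden) @ self.subtask_enc.weight.t()
        return torch.distributions.Categorical(logits=logits).sample()
```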
In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life. We formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. Our formulation scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and, critically, ignores the aspects of the environment that cannot affect the agent. The proposed approach is evaluated in two environments: VizDoom and Super Mario Bros. Three broad settings are investigated: 1) sparse extrinsic reward, where curiosity allows for far fewer interactions with the environment to reach the goal; 2) exploration with no extrinsic reward, where curiosity pushes the agent to explore more efficiently; and 3) generalization to unseen scenarios (e.g. new levels of the same game) where the knowledge gained from earlier experience helps the agent explore new places much faster than starting from scratch.
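The curiosity signal in compact form: a forward model predicts the next feature embedding and its error is the intrinsic reward, while the features themselves are shaped by an inverse model that predicts the taken action, filtering out factors the agent cannot influence. In this sketch an MLP stands in for the paper's convolutional encoder, and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    """Curiosity bonus = forward-model error in an inverse-dynamics feature space."""

    def __init__(self, obs_dim: int, n_actions: int, d: int = 64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, d), nn.ReLU(), nn.Linear(d, d))
        self.inverse = nn.Linear(2 * d, n_actions)       # (phi_s, phi_s') -> action
        self.forward_model = nn.Linear(d + n_actions, d)

    def forward(self, obs, next_obs, action_onehot):
        f, f_next = self.phi(obs), self.phi(next_obs)
        pred_next = self.forward_model(torch.cat([f, action_onehot], dim=-1))
        # Intrinsic reward: how badly the agent predicted its action's effect.
        curiosity = 0.5 * ((pred_next - f_next.detach()) ** 2).mean(dim=-1)
        # Inverse-model logits; the cross-entropy loss on these trains phi to
        # keep only the features the agent's actions can influence.
        inv_logits = self.inverse(torch.cat([f, f_next], dim=-1))
        return curiosity, inv_logits
```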
State-of-the-art multi-agent reinforcement learning (MARL) methods have provided promising solutions to a variety of complex problems. However, these methods all assume that agents perform synchronized primitive-action executions, so they are not genuinely scalable to long-horizon real-world multi-agent/robot tasks that inherently require agents/robots to asynchronously reason about high-level action selection at varying time durations. The macro-action decentralized partially observable Markov decision process (MacDec-POMDP) is a general formalization for asynchronous decision-making under uncertainty in fully cooperative multi-agent tasks. In this thesis, we first propose a set of value-based RL approaches for MacDec-POMDPs, in which agents are allowed to perform asynchronous learning and decision-making with macro-action-value functions under three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE). Building on this work, we formulate a set of macro-action-based policy gradient algorithms under the same three training paradigms, where agents are allowed to directly optimize their parameterized policies in an asynchronous manner. We evaluate our methods both in simulation and on real robots. Empirical results demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality and asynchronous solutions with macro-actions.
Multi-agent reinforcement learning (MARL) has witnessed significant progress with the development of value function factorization methods. Thanks to monotonicity, a joint action-value function can be optimized by maximizing factorized per-agent utilities. In this paper, we show that in partially observable MARL problems, an agent's ordering over its own actions can impose concurrent constraints (across different states) on the representable function class, causing significant estimation error during training. We tackle this limitation and propose PAC, a new framework that leverages assistive information generated from counterfactual predictions of optimal joint action selection, enabling explicit assistance to value function factorization through a novel counterfactual loss. A variational inference-based information encoding method is developed to collect and encode the counterfactual predictions from an estimated baseline. To enable decentralized execution, we also derive factorized per-agent policies inspired by the maximum-entropy MARL framework. We evaluate PAC on multi-agent predator-prey and a set of StarCraft II micromanagement tasks. Empirical results demonstrate improved performance of PAC over state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms on all benchmarks.
Many real-world applications can be formulated as multi-agent cooperation problems, such as network packet routing and the coordination of autonomous vehicles. The emergence of deep reinforcement learning (DRL) provides a promising approach to multi-agent cooperation through the interaction of agents and environments. However, traditional DRL solutions suffer from the high dimensionality of multiple agents with continuous action spaces during policy search. Moreover, the dynamics of the agents' policies make training non-stationary. To tackle these issues, we propose a hierarchical reinforcement learning approach with high-level decision-making and low-level individual control for efficient policy search. In particular, the cooperation of multiple agents can be learned efficiently in a high-level discrete action space, while low-level individual control can be reduced to single-agent reinforcement learning. In addition to the hierarchical structure, we propose an opponent modeling network to model the other agents' policies during the learning process. In contrast to end-to-end DRL approaches, our approach reduces the learning complexity by decomposing the overall task into sub-tasks in a hierarchical way. To evaluate the efficiency of our approach, we conduct a real-world case study in a cooperative lane change scenario. Both simulation and real-world experiments show the superiority of our approach in collision rate and convergence speed.
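A hedged, deliberately simple sketch of the two-level decomposition: a high-level policy picks a discrete cooperative decision (e.g., which lane-change maneuver) for each agent, and a low-level single-agent controller turns it into continuous control. All names are hypothetical stand-ins:

```python
def hierarchical_step(high_policy, low_controllers, joint_obs):
    """One control step: discrete cooperative choice, then continuous execution."""
    options = high_policy(joint_obs)          # one discrete option per agent
    return [low_controllers[opt](obs)         # continuous action per agent
            for opt, obs in zip(options, joint_obs)]
```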
We extend the Remember and Forget for Experience Replay (ReF-ER) algorithm to multi-agent reinforcement learning (MARL). ReF-ER was shown to outperform state-of-the-art algorithms for continuous control in problems ranging from the OpenAI Gym to complex fluid flows. In MARL, the dependencies between agents are included in the state-value estimator, and the environment dynamics are modeled via the importance weights used by ReF-ER. In collaborative environments, we find the best performance when the value is estimated using individual rewards and the effects of the other agents' actions on the transition map are ignored. We benchmark the performance of ReF-ER MARL on the Stanford Intelligent Systems Laboratory (SISL) environments. We find that employing a single feed-forward neural network for the policy and the value function in ReF-ER MARL outperforms state-of-the-art algorithms that rely on complex neural network architectures.
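A hedged sketch of ReF-ER's core rule: importance weights classify stored samples as near- or far-policy, far-policy samples are masked out of the gradient, and a penalty keeps the policy close to the replay buffer's behavior. The threshold `c_max` and the squared log-ratio standing in for the trust-region penalty are assumptions:

```python
import torch

def refer_mask(logp_now: torch.Tensor, logp_behavior: torch.Tensor,
               c_max: float = 4.0):
    """Classify replayed samples by importance weight and penalize drift."""
    rho = torch.exp(logp_now - logp_behavior)            # importance weight
    near = (rho > 1.0 / c_max) & (rho < c_max)           # near-policy samples
    # Simple stand-in for ReF-ER's trust-region penalty toward the buffer.
    penalty = ((logp_now - logp_behavior) ** 2).mean()
    return near.float(), penalty
```

The returned mask would multiply per-sample losses, so far-policy experiences neither drive nor destabilize the update.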
We study model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent, i.e., where all actions have pre-set execution durations. During these execution durations, the environment changes are influenced by, but not synchronized with, action execution. Such a setting is ubiquitous in many real-world problems. However, most MARL methods assume that actions are executed immediately after inference, which is often unrealistic and can lead to catastrophic failures of multi-agent coordination. To fill this gap, we develop an algorithmic framework for MARL with off-beat actions. We then propose a novel episodic memory, LeGEM, for model-free MARL algorithms. LeGEM builds agents' episodic memories from their individual experiences. It boosts multi-agent learning by addressing the challenging temporal credit assignment problem raised by off-beat actions via a novel reward redistribution scheme, alleviating the issue of non-Markovian rewards. We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including the Stag-Hunter Game, the Quarry Game, the Afforestation Game, and StarCraft II micromanagement tasks. Empirical results show that LeGEM significantly boosts multi-agent coordination and achieves leading performance and improved sample efficiency.
Distributed multi-agent reinforcement learning (MARL) algorithms have recently attracted a surge of interest, mainly due to the recent advancements of deep neural networks (DNNs). Conventional model-based (MB) or model-free (MF) RL algorithms are not directly applicable to MARL problems due to their use of a fixed reward model for learning the underlying value function. While DNN-based solutions perform well when a single agent is involved, such methods fail to fully generalize to the complexities of MARL problems. In other words, although recently developed DNN-based approaches for multi-agent environments have achieved superior performance, they remain prone to overfitting, high sensitivity to parameter selection, and sample inefficiency. This paper proposes the multi-agent adaptive Kalman temporal difference (MAK-TD) framework and its successor-representation-based variant, referred to as MAK-SR. Intuitively, the main objective is to capitalize on unique characteristics of Kalman filtering (KF), such as uncertainty modeling and online second-order learning. The proposed MAK-TD/SR frameworks consider the continuous nature of the action space associated with high-dimensional multi-agent environments and exploit Kalman temporal difference (KTD) to address parameter uncertainty. By leveraging the KTD framework, the SR learning procedure is modeled as a filtering problem, where radial basis function (RBF) estimators are used to encode the continuous space into feature vectors. For learning localized reward functions, we resort to multiple-model adaptive estimation (MMAE) to deal with the lack of prior knowledge of the observation noise covariance and the observation mapping function. The proposed MAK-TD/SR frameworks are evaluated via several experiments implemented on the OpenAI Gym MARL benchmarks.
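A compact sketch of Kalman temporal difference with RBF features, the machinery the framework builds on: the value weights play the role of the Kalman state, the TD relation r ≈ (φ(s) − γφ(s'))ᵀw serves as the observation model, and each transition triggers one filter update that also tracks parameter uncertainty. The noise levels and RBF width are illustrative assumptions:

```python
import numpy as np

def rbf(x: np.ndarray, centers: np.ndarray, width: float = 0.5) -> np.ndarray:
    """Encode a continuous state as Gaussian RBF activations over fixed centers."""
    return np.exp(-np.sum((x - centers) ** 2, axis=1) / (2 * width ** 2))

def ktd_update(w, P, s, r, s_next, centers,
               gamma=0.99, q_noise=1e-4, r_noise=1e-2):
    """One Kalman filter step treating the value weights w as the hidden state."""
    h = rbf(s, centers) - gamma * rbf(s_next, centers)  # observation vector
    P = P + q_noise * np.eye(len(w))                    # predict: inflate covariance
    innov = r - h @ w                                   # TD error = innovation
    s_var = h @ P @ h + r_noise                         # innovation variance
    k = P @ h / s_var                                   # Kalman gain
    w = w + k * innov                                   # correct the weights
    P = P - np.outer(k, h @ P)                          # shrink uncertainty
    return w, P
```

The covariance P is what a DNN-based TD update lacks: it provides the online uncertainty estimate the abstract highlights.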
The StarCraft II Multi-Agent Challenge (SMAC) was created as a challenging benchmark problem for cooperative multi-agent reinforcement learning (MARL). SMAC focuses on the problem of StarCraft micromanagement and assumes that each unit is controlled individually by a learning agent that acts independently and possesses only local information; centralized training with decentralized execution (CTDE) is assumed. To perform well in SMAC, MARL algorithms must handle the dual problems of multi-agent credit assignment and joint action evaluation. This paper introduces a new architecture, TransMix, a transformer-based joint action-value mixing network that we show to be efficient and scalable compared to other state-of-the-art cooperative MARL solutions. TransMix leverages the ability of transformers to learn a richer mixing function for combining the agents' individual value functions. It performs comparably to prior work on easy SMAC scenarios and outperforms other techniques on hard scenarios, as well as on scenarios corrupted with Gaussian noise that simulates the fog of war.
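A hedged sketch of a transformer-style mixing network in this spirit: per-agent Q-values and a global-state token pass through self-attention, and a head on the state token emits Q_tot. The layer count and dimensions are assumptions, not the TransMix architecture verbatim:

```python
import torch
import torch.nn as nn

class TransformerMixer(nn.Module):
    """Mix per-agent Q-values into Q_tot with self-attention over agent tokens."""

    def __init__(self, state_dim: int, d: int = 32, heads: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(1, d)          # each agent's scalar Q -> token
        self.s_proj = nn.Linear(state_dim, d)  # global state -> extra token
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, 1)

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        """agent_qs: [B, n_agents]; state: [B, state_dim] -> Q_tot: [B, 1]."""
        tokens = torch.cat([self.q_proj(agent_qs.unsqueeze(-1)),
                            self.s_proj(state).unsqueeze(1)], dim=1)
        out = self.encoder(tokens)             # attention mixes agents and state
        return self.head(out[:, -1])           # read Q_tot off the state token
```

Unlike a monotonic hypernetwork mixer, attention imposes no structural constraint on the mixing function, which is the richer-function-class trade-off the abstract points to.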
Recent progress in reinforcement learning has demonstrated its ability to solve hard agent-environment interaction tasks at a superhuman level. However, the application of RL methods to practical and real-world tasks is currently limited due to the sample inefficiency of most state-of-the-art RL algorithms, i.e., the need for a vast number of training episodes. For example, the OpenAI Five algorithm that beat human players in Dota 2 was trained for thousands of years of game time. Several approaches exist that tackle the sample inefficiency problem, either by making more efficient use of already-gathered experience or by acquiring more relevant and diverse experience via better exploration of the environment. However, to our knowledge, no such exploration approach exists for model-based algorithms, despite their high sample efficiency in solving hard control tasks with high-dimensional state spaces. This work connects exploration techniques and model-based reinforcement learning. We design a novel exploration method that takes into account the features of the model-based approach. We also demonstrate experimentally that our method significantly improves the performance of the model-based algorithm Dreamer.