我们研究了普遍存在的动作,即所有动作都有预设执行持续时间的环境中,研究了无模型的多机械加固学习(MARL)。在执行期间,环境变化受到动作执行的影响但不同步。在许多现实世界中,这种设置无处不在。但是,大多数MAL方法都假定推断后立即执行动作,这通常是不现实的,并且可能导致多机构协调的灾难性失败。为了填补这一空白,我们为MARL开发了一个算法的算法框架。然后,我们为无模型的MARL算法提出了一种新颖的情节记忆,legeM。 Legem通过利用代理人的个人经历来建立代理商的情节记忆。它通过解决了通过我们的新型奖励再分配计划提出的具有挑战性的时间信用分配问题来提高多机构学习,从而减轻了非马克维亚奖励的问题。我们在各种多代理方案上评估了Legem,其中包括猎鹿游戏,采石场游戏,造林游戏和Starcraft II微管理任务。经验结果表明,LegeM显着提高了多机构的协调,并提高了领先的绩效并提高了样本效率。
translated by 谷歌翻译
在复杂的协调问题中,深层合作多智能经纪增强学习(Marl)的高效探索仍然依然存在挑战。在本文中,我们介绍了一种具有奇妙驱动的探索的新型情节多功能钢筋学习,称为EMC。我们利用对流行分解的MARL算法的洞察力“诱导的”个体Q值,即用于本地执行的单个实用程序功能,是本地动作观察历史的嵌入,并且可以捕获因奖励而捕获代理之间的相互作用在集中培训期间的反向化。因此,我们使用单独的Q值的预测误差作为协调勘探的内在奖励,利用集肠内存来利用探索的信息经验来提高政策培训。随着代理商的个人Q值函数的动态捕获了国家的新颖性和其他代理人的影响,我们的内在奖励可以促使对新或有前途的国家的协调探索。我们通过教学实例说明了我们的方法的优势,并展示了在星际争霸II微互动基准中挑战任务的最先进的MARL基础上的其显着优势。
translated by 谷歌翻译
最先进的多机构增强学习(MARL)方法为各种复杂问题提供了有希望的解决方案。然而,这些方法都假定代理执行同步的原始操作执行,因此它们不能真正可扩展到长期胜利的真实世界多代理/机器人任务,这些任务固有地要求代理/机器人以异步的理由,涉及有关高级动作选择的理由。不同的时间。宏观行动分散的部分可观察到的马尔可夫决策过程(MACDEC-POMDP)是在完全合作的多代理任务中不确定的异步决策的一般形式化。在本论文中,我们首先提出了MacDec-Pomdps的一组基于价值的RL方法,其中允许代理在三个范式中使用宏观成果功能执行异步学习和决策:分散学习和控制,集中学习,集中学习和控制,以及分散执行的集中培训(CTDE)。在上述工作的基础上,我们在三个训练范式下制定了一组基于宏观行动的策略梯度算法,在该训练范式下,允许代理以异步方式直接优化其参数化策略。我们在模拟和真实的机器人中评估了我们的方法。经验结果证明了我们在大型多代理问题中的方法的优势,并验证了我们算法在学习具有宏观actions的高质量和异步溶液方面的有效性。
translated by 谷歌翻译
在本文中,我们认为合作的多代理强化学习(MARL)具有稀疏的奖励。为了解决这个问题,我们提出了一种名为Maser:MARL的新方法,并具有从经验重播缓冲区产生的子目标。在广泛使用的集中式培训的假设下,通过分散执行和对MARL的Q值分解的一致性,Maser通过考虑单个Q值和总Q值来自动为多个代理人生成适当的子目标。然后,Maser根据与Q学习相关的可行表示为每个代理设计个人固有奖励,以便代理人达到其子目标,同时最大化联合行动值。数值结果表明,与其他最先进的MARL算法相比,Maser的表现明显优于Starcraft II微管理基准。
translated by 谷歌翻译
Recently, some challenging tasks in multi-agent systems have been solved by some hierarchical reinforcement learning methods. Inspired by the intra-level and inter-level coordination in the human nervous system, we propose a novel value decomposition framework HAVEN based on hierarchical reinforcement learning for fully cooperative multi-agent problems. To address the instability arising from the concurrent optimization of policies between various levels and agents, we introduce the dual coordination mechanism of inter-level and inter-agent strategies by designing reward functions in a two-level hierarchy. HAVEN does not require domain knowledge and pre-training, and can be applied to any value decomposition variant. Our method achieves desirable results on different decentralized partially observable Markov decision process domains and outperforms other popular multi-agent hierarchical reinforcement learning algorithms.
translated by 谷歌翻译
在本文中,我们提出了一个名为“星际争霸多代理挑战”的新颖基准,代理商学习执行多阶段任务并使用没有精确奖励功能的环境因素。以前的挑战(SMAC)被认为是多名强化学习的标准基准,主要涉及确保所有代理人仅通过具有明显的奖励功能的精细操纵而合作消除接近对手。另一方面,这一挑战对MARL算法的探索能力有效地学习隐式多阶段任务和环境因素以及微控制感兴趣。这项研究涵盖了进攻和防御性场景。在进攻情况下,代理商必须学会先寻找对手,然后消除他们。防御性场景要求代理使用地形特征。例如,代理需要将自己定位在保护结构后面,以使敌人更难攻击。我们研究了SMAC+下的MARL算法,并观察到最近的方法在与以前的挑战类似,但在进攻情况下表现不佳。此外,我们观察到,增强的探索方法对性能有积极影响,但无法完全解决所有情况。这项研究提出了未来研究的新方向。
translated by 谷歌翻译
Though transfer learning is promising to increase the learning efficiency, the existing methods are still subject to the challenges from long-horizon tasks, especially when expert policies are sub-optimal and partially useful. Hence, a novel algorithm named EASpace (Enhanced Action Space) is proposed in this paper to transfer the knowledge of multiple sub-optimal expert policies. EASpace formulates each expert policy into multiple macro actions with different execution time period, then integrates all macro actions into the primitive action space directly. Through this formulation, the proposed EASpace could learn when to execute which expert policy and how long it lasts. An intra-macro-action learning rule is proposed by adjusting the temporal difference target of macro actions to improve the data efficiency and alleviate the non-stationarity issue in multi-agent settings. Furthermore, an additional reward proportional to the execution time of macro actions is introduced to encourage the environment exploration via macro actions, which is significant to learn a long-horizon task. Theoretical analysis is presented to show the convergence of the proposed algorithm. The efficiency of the proposed algorithm is illustrated by a grid-based game and a multi-agent pursuit problem. The proposed algorithm is also implemented to real physical systems to justify its effectiveness.
translated by 谷歌翻译
本文考虑了多智能经纪人强化学习(MARL)任务,代理商在集会结束时获得共享全球奖励。这种奖励的延迟性质影响了代理商在中间时间步骤中评估其行动质量的能力。本文侧重于开发学习焦点奖励的时间重新分布的方法,以获得密集奖励信号。解决这些MARL问题需要解决两个挑战:识别(1)沿着集发作(沿时间)的长度相对重要性,以及(2)在任何单一时间步骤(代理商中)的相对重要性。在本文中,我们介绍了奖励中的奖励再分配,在整容多智能体加固学习(Arel)中奖励再分配,以解决这两个挑战。 Arel使用注意机制来表征沿着轨迹(时间关注)对状态转换的动作的影响,以及每个代理在每个时间步骤(代理人注意)的影响。 Arel预测的重新分配奖励是密集的,可以与任何给定的MARL算法集成。我们评估了粒子世界环境的具有挑战性的任务和星际争霸多功能挑战。 arel导致粒子世界的奖励较高,并改善星际争端的胜利率与三个最先进的奖励再分配方法相比。我们的代码可在https://github.com/baicenxiao/arel获得。
translated by 谷歌翻译
Offline multi-agent reinforcement learning (MARL) aims to learn effective multi-agent policies from pre-collected datasets, which is an important step toward the deployment of multi-agent systems in real-world applications. However, in practice, each individual behavior policy that generates multi-agent joint trajectories usually has a different level of how well it performs. e.g., an agent is a random policy while other agents are medium policies. In the cooperative game with global reward, one agent learned by existing offline MARL often inherits this random policy, jeopardizing the performance of the entire team. In this paper, we investigate offline MARL with explicit consideration on the diversity of agent-wise trajectories and propose a novel framework called Shared Individual Trajectories (SIT) to address this problem. Specifically, an attention-based reward decomposition network assigns the credit to each agent through a differentiable key-value memory mechanism in an offline manner. These decomposed credits are then used to reconstruct the joint offline datasets into prioritized experience replay with individual trajectories, thereafter agents can share their good trajectories and conservatively train their policies with a graph attention network (GAT) based critic. We evaluate our method in both discrete control (i.e., StarCraft II and multi-agent particle environment) and continuous control (i.e, multi-agent mujoco). The results indicate that our method achieves significantly better results in complex and mixed offline multi-agent datasets, especially when the difference of data quality between individual trajectories is large.
translated by 谷歌翻译
在现实设置中跨多个代理的决策同步是有问题的,因为它要求代理等待其他代理人终止和交流有关终止的终止。理想情况下,代理应该学习和执行异步。这样的异步方法还允许暂时扩展的动作,这些操作可能会根据执行的情况和操作花费不同的时间。不幸的是,当前的策略梯度方法不适用于异步设置,因为他们认为代理在每个时间步骤中都同步推理了动作选择。为了允许异步学习和决策,我们制定了一组异步的多代理参与者 - 批判性方法,这些方法使代理可以在三个标准培训范式中直接优化异步策略:分散的学习,集中学习,集中学习和集中培训以进行分解执行。各种现实域中的经验结果(在模拟和硬件中)证明了我们在大型多代理问题中的优势,并验证了我们算法在学习高质量和异步解决方案方面的有效性。
translated by 谷歌翻译
Recently, model-based agents have achieved better performance than model-free ones using the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is tough to learn the model of the environment. The significant compounding error may hinder the learning process when model-based methods are applied to multi-agent tasks. This paper proposes an implicit model-based multi-agent reinforcement learning method based on value decomposition methods. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states in the latent space, making agents have the foresight. Our approach can be applied to any multi-agent value decomposition method. The experimental results show that our method improves the sample efficiency in different partially observable Markov decision process domains.
translated by 谷歌翻译
Efficient exploration strategy is one of essential issues in cooperative multi-agent reinforcement learning (MARL) algorithms requiring complex coordination. In this study, we introduce a new exploration method with the strangeness that can be easily incorporated into any centralized training and decentralized execution (CTDE)-based MARL algorithms. The strangeness refers to the degree of unfamiliarity of the observations that an agent visits. In order to give the observation strangeness a global perspective, it is also augmented with the the degree of unfamiliarity of the visited entire state. The exploration bonus is obtained from the strangeness and the proposed exploration method is not much affected by stochastic transitions commonly observed in MARL tasks. To prevent a high exploration bonus from making the MARL training insensitive to extrinsic rewards, we also propose a separate action-value function trained by both extrinsic reward and exploration bonus, on which a behavioral policy to generate transitions is designed based. It makes the CTDE-based MARL algorithms more stable when they are used with an exploration method. Through a comparative evaluation in didactic examples and the StarCraft Multi-Agent Challenge, we show that the proposed exploration method achieves significant performance improvement in the CTDE-based MARL algorithms.
translated by 谷歌翻译
深度强化学习(DRL)和深度多机构的强化学习(MARL)在包括游戏AI,自动驾驶汽车,机器人技术等各种领域取得了巨大的成功。但是,众所周知,DRL和Deep MARL代理的样本效率低下,即使对于相对简单的问题设置,通常也需要数百万个相互作用,从而阻止了在实地场景中的广泛应用和部署。背后的一个瓶颈挑战是众所周知的探索问题,即如何有效地探索环境和收集信息丰富的经验,从而使政策学习受益于最佳研究。在稀疏的奖励,吵闹的干扰,长距离和非平稳的共同学习者的复杂环境中,这个问题变得更加具有挑战性。在本文中,我们对单格和多代理RL的现有勘探方法进行了全面的调查。我们通过确定有效探索的几个关键挑战开始调查。除了上述两个主要分支外,我们还包括其他具有不同思想和技术的著名探索方法。除了算法分析外,我们还对一组常用基准的DRL进行了全面和统一的经验比较。根据我们的算法和实证研究,我们终于总结了DRL和Deep Marl中探索的公开问题,并指出了一些未来的方向。
translated by 谷歌翻译
Many real-world problems, such as network packet routing and the coordination of autonomous vehicles, are naturally modelled as cooperative multi-agent systems. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actorcritic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
translated by 谷歌翻译
In multi-agent reinforcement learning (MARL), many popular methods, such as VDN and QMIX, are susceptible to a critical multi-agent pathology known as relative overgeneralization (RO), which arises when the optimal joint action's utility falls below that of a sub-optimal joint action in cooperative tasks. RO can cause the agents to get stuck into local optima or fail to solve tasks that require significant coordination between agents within a given timestep. Recent value-based MARL algorithms such as QPLEX and WQMIX can overcome RO to some extent. However, our experimental results show that they can still fail to solve cooperative tasks that exhibit strong RO. In this work, we propose a novel approach called curriculum learning for relative overgeneralization (CURO) to better overcome RO. To solve a target task that exhibits strong RO, in CURO, we first fine-tune the reward function of the target task to generate source tasks that are tailored to the current ability of the learning agent and train the agent on these source tasks first. Then, to effectively transfer the knowledge acquired in one task to the next, we use a novel transfer learning method that combines value function transfer with buffer transfer, which enables more efficient exploration in the target task. We demonstrate that, when applied to QMIX, CURO overcomes severe RO problem and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
translated by 谷歌翻译
Cooperative multi-agent reinforcement learning (MARL) has made prominent progress in recent years. For training efficiency and scalability, most of the MARL algorithms make all agents share the same policy or value network. However, in many complex multi-agent tasks, different agents are expected to possess specific abilities to handle different subtasks. In those scenarios, sharing parameters indiscriminately may lead to similar behavior across all agents, which will limit the exploration efficiency and degrade the final performance. To balance the training complexity and the diversity of agent behavior, we propose a novel framework to learn dynamic subtask assignment (LDSA) in cooperative MARL. Specifically, we first introduce a subtask encoder to construct a vector representation for each subtask according to its identity. To reasonably assign agents to different subtasks, we propose an ability-based subtask selection strategy, which can dynamically group agents with similar abilities into the same subtask. In this way, agents dealing with the same subtask share their learning of specific abilities and different subtasks correspond to different specific abilities. We further introduce two regularizers to increase the representation difference between subtasks and stabilize the training by discouraging agents from frequently changing subtasks, respectively. Empirical results show that LDSA learns reasonable and effective subtask assignment for better collaboration and significantly improves the learning performance on the challenging StarCraft II micromanagement benchmark and Google Research Football.
translated by 谷歌翻译
协调图是一种有前途的模型代理协作在多智能体增强学习中的合作方法。它将一个大的多代理系统分解为代表底层协调依赖性的重叠组套件。此范例中的一个危急挑战是计算基于图形的值分子的最大值动作的复杂性。它指的是分散的约束优化问题(DCOP),其恒定比率近似是NP - 硬问题。为了绕过这一基本硬度,提出了一种新的方法,命名为自组织的多项式协调图(SOP-CG),它使用结构化图表来保证具有足够功能表达的所致DCOP的最优性。我们将图形拓扑扩展为状态依赖性,将图形选择作为假想的代理商,最终从统一的Bellman Optimaly方程中获得端到端的学习范例。在实验中,我们表明我们的方法了解可解释的图形拓扑,诱导有效的协调,并提高各种合作多功能机构任务的性能。
translated by 谷歌翻译
We explore value-based solutions for multi-agent reinforcement learning (MARL) tasks in the centralized training with decentralized execution (CTDE) regime popularized recently. However, VDN and QMIX are representative examples that use the idea of factorization of the joint actionvalue function into individual ones for decentralized execution. VDN and QMIX address only a fraction of factorizable MARL tasks due to their structural constraint in factorization such as additivity and monotonicity. In this paper, we propose a new factorization method for MARL, QTRAN, which is free from such structural constraints and takes on a new approach to transforming the original joint action-value function into an easily factorizable one, with the same optimal actions. QTRAN guarantees more general factorization than VDN or QMIX, thus covering a much wider class of MARL tasks than does previous methods. Our experiments for the tasks of multi-domain Gaussian-squeeze and modified predator-prey demonstrate QTRAN's superior performance with especially larger margins in games whose payoffs penalize non-cooperative behavior more aggressively.
translated by 谷歌翻译
通过回顾一封来自情节记忆的过去的经验,可以通过回忆过去的经验来实现钢筋学习的样本效率。我们提出了一种新的基于模型的轨迹的集体记忆,解决了集体控制的当前限制。我们的记忆估计轨迹值,指导代理人朝着良好的政策。基于内存构建,我们通过动态混合控制统一模型的基于动态和习惯学习来构建互补学习模型,进入单个架构。实验表明,我们的模型可以比各种环境中的其他强力加强学习代理更快,更好地学习,包括随机和非马尔可夫环境。
translated by 谷歌翻译
协作多代理增强学习(MARL)已在许多实际应用中广泛使用,在许多实际应用中,每个代理商都根据自己的观察做出决定。大多数主流方法在对分散的局部实用程序函数进行建模时,将每个局部观察结果视为完整的。但是,他们忽略了这样一个事实,即可以将局部观察信息进一步分为几个实体,只有一部分实体有助于建模推理。此外,不同实体的重要性可能会随着时间而变化。为了提高分散政策的性能,使用注意机制用于捕获本地信息的特征。然而,现有的注意模型依赖于密集的完全连接的图,并且无法更好地感知重要状态。为此,我们提出了一个稀疏的状态MARL(S2RL)框架,该框架利用稀疏的注意机制将无关的信息丢弃在局部观察中。通过自我注意力和稀疏注意机制估算局部效用函数,然后将其合并为标准的关节价值函数和中央评论家的辅助关节价值函数。我们将S2RL框架设计为即插即用的模块,使其足够一般,可以应用于各种方法。关于Starcraft II的广泛实验表明,S2RL可以显着提高许多最新方法的性能。
translated by 谷歌翻译