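Almost all multi-agent reinforcement learning algorithms without communication follow the principle of centralized training with decentralized execution. During centralized training, agents can be guided by the same signals, such as the global state. During decentralized execution, however, agents lack a shared signal. Inspired by viewpoint invariance and contrastive learning, we propose consensus learning for cooperative multi-agent reinforcement learning. Although they rely only on local observations, different agents can infer the same consensus in a discrete space. During decentralized execution, we feed the inferred consensus to the agents' networks as an explicit input, thereby fostering cooperation. Our proposed method can be extended to various multi-agent reinforcement learning algorithms with small model changes. Moreover, we evaluate it on several fully cooperative tasks and obtain convincing results.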
translated by 谷歌翻译
Recently, model-based agents have achieved better performance than model-free ones under the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is difficult to learn a model of the environment, and significant compounding error may hinder learning when model-based methods are applied to multi-agent tasks. This paper proposes an implicit model-based multi-agent reinforcement learning method built on value decomposition. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states in the latent space, which gives agents foresight. Our approach can be applied to any multi-agent value decomposition method. The experimental results show that our method improves sample efficiency in different partially observable Markov decision process domains.
Recently, some challenging tasks in multi-agent systems have been solved by some hierarchical reinforcement learning methods. Inspired by the intra-level and inter-level coordination in the human nervous system, we propose a novel value decomposition framework HAVEN based on hierarchical reinforcement learning for fully cooperative multi-agent problems. To address the instability arising from the concurrent optimization of policies between various levels and agents, we introduce the dual coordination mechanism of inter-level and inter-agent strategies by designing reward functions in a two-level hierarchy. HAVEN does not require domain knowledge and pre-training, and can be applied to any value decomposition variant. Our method achieves desirable results on different decentralized partially observable Markov decision process domains and outperforms other popular multi-agent hierarchical reinforcement learning algorithms.
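Multi-agent deep reinforcement learning has been applied to a variety of complex problems with either discrete or continuous action spaces and has achieved great success. However, most real-world environments cannot be described by a discrete action space alone or by a continuous action space alone, and few works have used deep reinforcement learning (DRL) to solve multi-agent problems with hybrid action spaces. We therefore propose a novel algorithm, Deep Multi-Agent Hybrid Soft Actor-Critic (MAHSAC), to fill this gap. The algorithm follows the centralized training with decentralized execution (CTDE) paradigm and extends the Soft Actor-Critic (SAC) algorithm to handle hybrid action space problems in multi-agent environments based on maximum entropy. Our experiments run on a simple multi-agent particle world with continuous observations and a discrete action space, along with some basic simulated physics. The experimental results show that MAHSAC performs well in terms of training speed, stability, and robustness to interference. It also outperforms existing independent deep learning methods in both cooperative and competitive scenarios.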
Cooperative multi-agent reinforcement learning (MARL) has made prominent progress in recent years. For training efficiency and scalability, most of the MARL algorithms make all agents share the same policy or value network. However, in many complex multi-agent tasks, different agents are expected to possess specific abilities to handle different subtasks. In those scenarios, sharing parameters indiscriminately may lead to similar behavior across all agents, which will limit the exploration efficiency and degrade the final performance. To balance the training complexity and the diversity of agent behavior, we propose a novel framework to learn dynamic subtask assignment (LDSA) in cooperative MARL. Specifically, we first introduce a subtask encoder to construct a vector representation for each subtask according to its identity. To reasonably assign agents to different subtasks, we propose an ability-based subtask selection strategy, which can dynamically group agents with similar abilities into the same subtask. In this way, agents dealing with the same subtask share their learning of specific abilities and different subtasks correspond to different specific abilities. We further introduce two regularizers to increase the representation difference between subtasks and stabilize the training by discouraging agents from frequently changing subtasks, respectively. Empirical results show that LDSA learns reasonable and effective subtask assignment for better collaboration and significantly improves the learning performance on the challenging StarCraft II micromanagement benchmark and Google Research Football.
Cooperative multi-agent reinforcement learning (MARL) has achieved significant results, most notably by leveraging the representation-learning abilities of deep neural networks. However, large centralized approaches quickly become infeasible as the number of agents grows, and fully decentralized approaches can miss important opportunities for information sharing and coordination. Furthermore, not all agents are equal -- in some cases, individual agents may not even have the ability to send communication to other agents or explicitly model other agents. This paper considers the case where there is a single, powerful, \emph{central agent} that can observe the entire observation space, and there are multiple, low-powered \emph{local agents} that can only receive local observations and are not able to communicate with each other. The central agent's job is to learn what message needs to be sent to different local agents based on the global observations, not by centrally solving the entire problem and sending action commands, but by determining what additional information an individual agent should receive so that it can make a better decision. In this work we present our MARL algorithm \algo, describe where it would be most applicable, and implement it in the cooperative navigation and multi-agent walker domains. Empirical results show that 1) learned communication does indeed improve system performance, 2) results generalize to heterogeneous local agents, and 3) results generalize to different reward structures.
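Research on extending deep reinforcement learning (DRL) to the multi-agent domain has solved many complex problems and produced significant achievements. However, almost all of these studies focus only on discrete or continuous action spaces, and few works have applied multi-agent deep reinforcement learning to real-world environment problems, which mostly have hybrid action spaces. In this paper, we therefore propose two algorithms to fill this gap: Deep Multi-Agent Hybrid Soft Actor-Critic (MAHSAC) and Multi-Agent Hybrid Deep Deterministic Policy Gradient (MAHDDPG). Both algorithms follow the centralized training with decentralized execution (CTDE) paradigm and can handle hybrid action space problems. Our experiments run on the multi-agent particle environment, a simple multi-agent particle world with some basic simulated physics. The experimental results show that these algorithms perform well.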
Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in single-agent settings. We present an actor-critic algorithm that trains decentralized policies in multi-agent settings, using centrally computed critics that share an attention mechanism which selects relevant information for each agent at every timestep. This attention mechanism enables more effective and scalable learning in complex multi-agent environments, when compared to recent approaches. Our approach is applicable not only to cooperative settings with shared rewards, but also individualized reward settings, including adversarial settings, as well as settings that do not provide global states, and it makes no assumptions about the action spaces of the agents. As such, it is flexible enough to be applied to most multi-agent learning problems.
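To make the idea of an attention mechanism that "selects relevant information for each agent" concrete, here is a minimal sketch of generic scaled dot-product attention in plain Python. It is not the architecture of the paper above. The function names, the two-dimensional embeddings, and the toy inputs are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weight the other agents' value vectors by their relevance to one agent.

    query:  embedding of the agent whose critic is being computed (hypothetical)
    keys:   one key vector per other agent
    values: one value vector per other agent
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Toy inputs: the first key is aligned with the query, the third opposes it.
QUERY = [1.0, 0.0]
KEYS = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]
VALUES = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
WEIGHTS, CONTEXT = attend(QUERY, KEYS, VALUES)
```

In this toy setting the agent whose key is most aligned with the query receives the largest weight, so its value vector dominates the context passed on to the critic.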
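In cooperative multi-agent reinforcement learning (MARL), where agents only have access to partial observations, leveraging local information efficiently is critical. Over long periods of observation, agents can build \textit{awareness} of their teammates to alleviate the partial observability problem. However, previous MARL methods usually neglect this use of local information. To address this problem, we propose a novel framework, multi-agent \textit{Local INformation Decomposition for Awareness of teammates} (LINDA), with which agents learn to decompose local information and build awareness of each teammate. We model awareness as stochastic random variables and perform representation learning to ensure that the awareness representations are informative, by maximizing the mutual information between the awareness and the actual trajectory of the corresponding agent. LINDA is agnostic to specific algorithms and can be flexibly integrated into different MARL methods. Extensive experiments show that the proposed framework learns informative awareness from local partial observations for better collaboration and significantly improves learning performance, especially on challenging tasks.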
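Policy learning in multi-agent reinforcement learning (MARL) is challenging because the joint state-action space grows with the number of agents. To achieve higher scalability, the paradigm of centralized training with decentralized execution (CTDE) is widely adopted together with factorized structures in MARL. However, we observe that existing CTDE algorithms in cooperative MARL cannot achieve optimality even in simple matrix games. To understand this phenomenon, we introduce a framework of Generalized Multi-Agent Actor-Critic with Policy Factorization (GPF-MAC), which characterizes the learning of factorized joint policies, i.e., each agent's policy depends only on its own observation-action history. We show that most popular CTDE MARL algorithms are special instances of GPF-MAC and may get stuck in a suboptimal joint policy. To address this issue, we present a novel transformation framework that reformulates a multi-agent MDP as a special "single-agent" MDP with a sequential structure, which allows off-the-shelf single-agent reinforcement learning (SARL) algorithms to efficiently learn the corresponding multi-agent tasks. This transformation carries the optimality guarantees of SARL algorithms over to cooperative MARL. To instantiate this transformation framework, we propose a transformed PPO, called T-PPO, which can theoretically perform optimal policy learning in finite multi-agent MDPs and significantly outperforms existing methods on a range of cooperative multi-agent tasks.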
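Policy gradient methods have become popular in multi-agent reinforcement learning, but they suffer from high variance due to environmental stochasticity and exploring agents (i.e., non-stationarity), which may be worsened by the difficulty of credit assignment. There is therefore a need for a method that not only addresses these two problems efficiently but is also robust enough to solve a variety of tasks. To this end, we propose a new multi-agent policy gradient method called Robust Local Advantage (ROLA) Actor-Critic. ROLA lets each agent learn an individual action-value function as a local critic and mitigates environment non-stationarity through a novel centralized training approach based on a centralized critic. Using this local critic, each agent computes a baseline that reduces the variance of its policy gradient estimate, which yields an expected advantage action-value over the other agents' choices and thereby implicitly improves credit assignment. We evaluate ROLA on diverse benchmarks and show its robustness and effectiveness relative to a number of state-of-the-art multi-agent policy gradient algorithms.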
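Value function factorization via centralized training and decentralized execution is a promising approach to cooperative multi-agent reinforcement learning tasks. One method in this area, QMIX, has become state of the art and achieves the best performance on the StarCraft II micromanagement benchmark. However, the monotonic mixing of per-agent estimates in QMIX is known to restrict the joint action Q-values it can represent, and the global state information available for individual agent value function estimation is insufficient, which often leads to suboptimality. To this end, we present LSF-SAC, a novel framework featuring a variational-inference-based information-sharing mechanism that provides extra state information to assist individual agents in value function factorization. We demonstrate that such latent individual state information sharing can significantly expand the power of value function factorization, while fully decentralized execution can still be maintained in LSF-SAC through a soft actor-critic design. We evaluate LSF-SAC on the StarCraft II micromanagement challenge and demonstrate that it outperforms several state-of-the-art methods on challenging collaborative tasks. We further conduct extensive ablation studies to locate the key factors accounting for its performance improvements. We believe that this new insight can lead to new local value estimation methods and variational deep learning algorithms. A demo video and implementation code can be found at https://sites.google.com/view/sacmm.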
We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multiagent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.
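As one of the solutions to decentralized partially observable Markov decision process (Dec-POMDP) problems, value decomposition methods have recently achieved significant results. However, most value decomposition methods require the fully observable state of the environment during training, which is not feasible in scenarios where only incomplete and noisy observations can be obtained. We therefore propose a novel value decomposition framework, named State Inference for value DEcomposition (SIDE), which eliminates the need to know the global state by simultaneously seeking solutions to the two problems of optimal control and state inference. SIDE can be extended to any value decomposition method to tackle partially observable problems. By comparing the performance of different algorithms on StarCraft II micromanagement tasks, we verify that, even without access to the state, SIDE can infer from past local observations a current state that benefits the reinforcement learning process, and can even achieve superior results to many baselines in some complex scenarios.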
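Decentralized learning has shown great promise for cooperative multi-agent reinforcement learning (MARL). However, non-stationarity remains a significant challenge in decentralized learning. In this paper, we tackle the non-stationarity problem in the simplest and most fundamental way and propose \textit{multi-agent alternate Q-learning} (MA2QL), where agents take turns updating their Q-functions by Q-learning. MA2QL is a \textit{minimalist} approach to fully decentralized cooperative MARL, yet it is theoretically grounded. We prove that when each agent guarantees $\varepsilon$-convergence at each turn, the agents' joint policy converges to a Nash equilibrium. In practice, MA2QL requires only minimal changes to independent Q-learning (IQL). We empirically evaluate MA2QL on a variety of cooperative multi-agent tasks. The results show that MA2QL consistently outperforms IQL despite such minimal changes, which verifies the effectiveness of MA2QL.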
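The turn-taking idea can be sketched on a one-shot cooperative matrix game. This is only an illustration under strong assumptions, not the authors' implementation. The game is stateless, the learner sweeps over all of its actions instead of exploring with an epsilon-greedy policy, and the payoff matrix, function names, and hyperparameters are hypothetical.

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def alternate_q_learning(payoff, init_q=None, turns=6, sweeps_per_turn=30, alpha=0.5):
    """Two agents take turns; only the current learner updates its Q-values.

    payoff[a0][a1] is the shared reward for the joint action (a0, a1).
    While one agent learns, the other keeps playing its fixed greedy action,
    so the learner faces a stationary problem during its turn.
    """
    n_actions = [len(payoff), len(payoff[0])]
    if init_q is None:
        init_q = [[0.0] * n for n in n_actions]
    q = [list(row) for row in init_q]
    for turn in range(turns):
        learner = turn % 2
        other_action = argmax(q[1 - learner])  # frozen for the whole turn
        for _ in range(sweeps_per_turn):
            for a in range(n_actions[learner]):
                joint = (a, other_action) if learner == 0 else (other_action, a)
                reward = payoff[joint[0]][joint[1]]
                # Stateless Q-learning update (no next state, so no bootstrap).
                q[learner][a] += alpha * (reward - q[learner][a])
    return (argmax(q[0]), argmax(q[1])), q


def is_nash(payoff, joint):
    """True if neither agent can gain by deviating unilaterally."""
    a0, a1 = joint
    best_for_agent0 = max(payoff[i][a1] for i in range(len(payoff)))
    best_for_agent1 = max(payoff[a0][j] for j in range(len(payoff[0])))
    return payoff[a0][a1] >= best_for_agent0 and payoff[a0][a1] >= best_for_agent1


# Hypothetical shared payoff matrix with two pure Nash equilibria.
GAME = [[11, -30, 0], [-30, 7, 6], [0, 0, 5]]
JOINT_DEFAULT, _ = alternate_q_learning(GAME)
# Starting with agent 1 biased toward its last action leads to another equilibrium.
JOINT_BIASED, _ = alternate_q_learning(GAME, init_q=[[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
```

Both runs end at a joint action from which no single agent can improve, but the second equilibrium has a lower payoff. This matches the abstract's claim of convergence to a Nash equilibrium rather than to the global optimum.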
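The idea of conservatism has led to significant progress in offline reinforcement learning (RL), where an agent learns from pre-collected datasets. However, because many real-world scenarios involve interaction among multiple agents, offline RL in the more practical multi-agent setting remains an open question. Given the recent success of transferring online RL algorithms to the multi-agent setting, one might expect offline RL algorithms to transfer directly as well. Surprisingly, when conservatism-based algorithms are applied to the multi-agent setting, performance degrades significantly as the number of agents increases. To mitigate this degradation, we identify a key issue: the landscape of the value function can be non-concave, and policy gradient improvements are prone to local optima. Multiple agents exacerbate the problem, since a suboptimal policy of any one agent can lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), which tackles this critical challenge by effectively combining first-order policy gradients with zeroth-order optimization so that the actor better optimizes the conservative value function. Despite its simplicity, OMAR significantly outperforms strong baselines and achieves state-of-the-art performance on multi-agent continuous control benchmarks.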
Adequate strategizing of agents' behaviors is essential to solving cooperative MARL problems. One intuitively beneficial yet uncommon method in this domain is predicting agents' future behaviors and planning accordingly. Leveraging this point, we propose a two-level hierarchical architecture that combines a novel information-theoretic objective with a trajectory prediction model to learn a strategy. To this end, we introduce a latent policy that learns two types of latent strategies: individual $z_A$, and relational $z_R$ using a modified Graph Attention Network module to extract interaction features. We encourage each agent to behave according to the strategy by conditioning its local $Q$ functions on $z_A$, and we further equip agents with a shared $Q$ function that conditions on $z_R$. Additionally, we introduce two regularizers to allow predicted trajectories to be accurate and rewarding. Empirical results on Google Research Football (GRF) and StarCraft (SC) II micromanagement tasks show that our method establishes a new state of the art, being, to the best of our knowledge, the first MARL algorithm to solve all super hard SC II scenarios as well as the GRF full game with a win rate higher than $95\%$, thus outperforming all existing methods. Videos and a brief overview of the methods and results are available at: https://sites.google.com/view/hier-strats-marl/home.
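Learning to collaborate is critical in multi-agent reinforcement learning (MARL). Previous works promote collaboration by maximizing the correlation of agents' behaviors, which is typically characterized by mutual information (MI) in different forms. However, we reveal that sub-optimal collaborative behaviors also exhibit strong correlations, and that simply maximizing MI can hinder learning toward better collaboration. To address this issue, we propose a novel MARL framework, called Progressive Mutual Information Collaboration (PMIC), for more effective MI-driven collaboration. PMIC uses a new collaboration criterion measured by the MI between global states and joint actions. Based on this criterion, the key idea of PMIC is to maximize the MI associated with superior collaborative behaviors and to minimize the MI associated with inferior ones. The two MI objectives play complementary roles by facilitating better collaboration while avoiding sub-optimal collaboration. Experiments on a wide range of MARL benchmarks show the superior performance of PMIC compared with other algorithms.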
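For intuition about the criterion "MI between global states and joint actions", the following sketch computes a plug-in estimate of mutual information from discrete samples. It is a textbook counting estimator for small discrete spaces, not the estimator used by PMIC. The state labels and sample sets are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(S; A) in nats from (state, joint_action) samples."""
    n = len(pairs)
    joint_counts = Counter(pairs)
    state_counts = Counter(state for state, _ in pairs)
    action_counts = Counter(action for _, action in pairs)
    mi = 0.0
    for (state, action), count in joint_counts.items():
        p_joint = count / n
        # The ratio p(s, a) / (p(s) * p(a)) written with raw counts.
        ratio = count * n / (state_counts[state] * action_counts[action])
        mi += p_joint * math.log(ratio)
    return mi


# Joint action fully determined by the state: MI equals the state entropy.
CORRELATED = [("s0", (0, 0)), ("s1", (1, 1))] * 50
# Joint action chosen independently of the state: MI is zero.
INDEPENDENT = [(s, a) for s in ("s0", "s1") for a in ((0, 0), (1, 1))] * 25

MI_CORRELATED = mutual_information(CORRELATED)
MI_INDEPENDENT = mutual_information(INDEPENDENT)
```

A high value only says that joint actions are strongly tied to states. As the abstract notes, it does not say whether the correlated behavior is a good one, which is why the MI of superior and inferior behaviors is treated differently.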
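Multi-agent reinforcement learning (MARL) has witnessed significant progress with the development of value function factorization methods. Thanks to monotonicity, such methods can optimize a joint action-value function by maximizing the factorized per-agent utilities. In this paper, we show that in partially observable MARL problems, an agent's ordering over its own actions can impose concurrent constraints (across different states) on the representable function class, causing significant estimation error during training. We tackle this limitation and propose PAC, a new framework that leverages assistive information generated from counterfactual predictions of optimal joint action selection, which enables explicit assistance to value function factorization through a novel counterfactual loss. A variational-inference-based information encoding method is developed to collect and encode the counterfactual predictions from an estimated baseline. To enable decentralized execution, we also derive factorized per-agent policies inspired by a maximum-entropy MARL framework. We evaluate PAC on multi-agent predator-prey and a set of StarCraft II micromanagement tasks. Empirical results show that PAC improves on state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms on all benchmarks.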
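The role of monotonicity can be checked on a toy example. The sketch below uses a simplified linear mixer with non-negative weights, which is an assumption made for brevity. QMIX-style mixers are state-conditioned and non-linear, and all names and numbers here are hypothetical. With such a mixer, the joint action that maximizes the mixed value is the tuple of each agent's own greedy action.

```python
from itertools import product


def mix(agent_qs, weights, bias=0.0):
    """Simplified monotonic mixer: Q_tot = sum_i |w_i| * Q_i + bias."""
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


def greedy_decentralized(per_agent_q):
    """Each agent independently picks the action with its highest utility."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)


def greedy_centralized(per_agent_q, weights):
    """Brute-force search over all joint actions for the largest mixed value."""
    joint_actions = product(*(range(len(q)) for q in per_agent_q))
    return max(
        joint_actions,
        key=lambda joint: mix([q[a] for q, a in zip(per_agent_q, joint)], weights),
    )


# Hypothetical per-agent utilities for three agents with 3, 2 and 3 actions.
PER_AGENT_Q = [[0.2, 1.5, -0.3], [2.0, 0.1], [0.0, 0.4, 0.9]]
# Raw mixer weights; the absolute value enforces monotonicity.
WEIGHTS = [0.7, -1.3, 2.0]

DECENTRALIZED = greedy_decentralized(PER_AGENT_Q)
CENTRALIZED = greedy_centralized(PER_AGENT_Q, WEIGHTS)
```

The brute-force search grows exponentially with the number of agents, while the decentralized argmax stays linear. That is the practical benefit of monotonic factorization, and the restriction to monotonic mixers is the representational cost discussed in the abstract.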
In multi-agent reinforcement learning (MARL), many popular methods, such as VDN and QMIX, are susceptible to a critical multi-agent pathology known as relative overgeneralization (RO), which arises when the optimal joint action's utility falls below that of a sub-optimal joint action in cooperative tasks. RO can cause the agents to get stuck in local optima or fail to solve tasks that require significant coordination between agents within a given timestep. Recent value-based MARL algorithms such as QPLEX and WQMIX can overcome RO to some extent. However, our experimental results show that they can still fail to solve cooperative tasks that exhibit strong RO. In this work, we propose a novel approach called curriculum learning for relative overgeneralization (CURO) to better overcome RO. To solve a target task that exhibits strong RO, in CURO, we first fine-tune the reward function of the target task to generate source tasks that are tailored to the current ability of the learning agent and train the agent on these source tasks first. Then, to effectively transfer the knowledge acquired in one task to the next, we use a novel transfer learning method that combines value function transfer with buffer transfer, which enables more efficient exploration in the target task. We demonstrate that, when applied to QMIX, CURO overcomes severe RO problems and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
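A small matrix game shows how relative overgeneralization can arise. The sketch assumes one common reading of RO: each agent scores its own action by the average shared reward against a uniformly exploring partner. The payoff matrix and function names are hypothetical and are not taken from the CURO experiments.

```python
def expected_utility(payoff, action):
    """Average shared reward of a row action against a uniformly random partner."""
    row = payoff[action]
    return sum(row) / len(row)


def preferred_under_random_partner(payoff):
    """Action the row agent prefers when it averages over the partner's actions."""
    return max(range(len(payoff)), key=lambda a: expected_utility(payoff, a))


def optimal_joint_action(payoff):
    """Joint action with the highest shared reward."""
    cells = [(i, j) for i in range(len(payoff)) for j in range(len(payoff[0]))]
    return max(cells, key=lambda cell: payoff[cell[0]][cell[1]])


# Hypothetical cooperative game: the best joint action (0, 0) is surrounded
# by heavy penalties for miscoordination.
PAYOFF = [[8, -12, -12], [-12, 0, 0], [-12, 0, 0]]

PREFERRED = preferred_under_random_partner(PAYOFF)
OPTIMAL = optimal_joint_action(PAYOFF)
```

Here the row agent's averaged estimate favors a safe action that is not part of the optimal joint action. Agents that learn from such averaged utilities drift away from the coordinated optimum, which is the failure mode the curriculum in CURO is meant to counter.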