Modelling the behaviours of other agents is essential for understanding how agents interact and making effective decisions. Existing methods for agent modelling commonly assume knowledge of the local observations and chosen actions of the modelled agents during execution. To eliminate this assumption, we extract representations from the local information of the controlled agent using encoder-decoder architectures. Using the observations and actions of the modelled agents during training, our models learn to extract representations about the modelled agents conditioned only on the local observations of the controlled agent. These representations are used to augment the controlled agent's decision policy, which is trained via deep reinforcement learning; thus, during execution, the policy does not require access to other agents' information. We provide a comprehensive evaluation and ablation studies in cooperative, competitive, and mixed multi-agent environments, showing that our method achieves higher returns than baseline methods that do not use the learned representations.
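A minimal sketch of the encoder-decoder idea above, assuming a PyTorch implementation (all module names and layer sizes here are illustrative, not the authors' code): the encoder consumes only the controlled agent's local observation, while the decoder is used purely as a training-time target for reconstructing the modelled agents' observations and actions.

```python
import torch
import torch.nn as nn

class AgentModel(nn.Module):
    def __init__(self, local_obs_dim, embed_dim, other_obs_dim, other_act_dim):
        super().__init__()
        # Encoder: controlled agent's local observation -> representation z.
        self.encoder = nn.Sequential(
            nn.Linear(local_obs_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))
        # Decoder: z -> reconstruction of the modelled agents' obs and actions.
        # Used only during training; discarded at execution time.
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(),
            nn.Linear(64, other_obs_dim + other_act_dim))

    def forward(self, local_obs):
        return self.encoder(local_obs)

    def reconstruction_loss(self, local_obs, other_obs, other_acts):
        pred = self.decoder(self.encoder(local_obs))
        target = torch.cat([other_obs, other_acts], dim=-1)
        return ((pred - target) ** 2).mean()

# At execution time the policy consumes [local_obs, z] only:
# z = model(local_obs); action = policy(torch.cat([local_obs, z], dim=-1))
```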
Successful deployment of multi-agent reinforcement learning often requires agents to adapt their behaviour. In this work, we discuss the problem of teamwork adaptation, in which a team of agents needs to adapt their policies to solve novel tasks with limited fine-tuning. Motivated by the intuition that agents need to be able to identify and distinguish tasks in order to adapt their behaviour to the current task, we propose to learn multi-agent task embeddings (MATE). These task embeddings are trained using an encoder-decoder architecture optimised for reconstruction of the transition and reward functions, which uniquely identify tasks. We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE, which vary in the information used for the task encoding. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.
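A hedged sketch of how such a task embedding might be trained, assuming PyTorch (the recurrent encoder and layer sizes are assumptions): the decoder is trained to reconstruct the transition and reward functions from the embedding, since these uniquely identify the task.

```python
import torch
import torch.nn as nn

class MATE(nn.Module):
    def __init__(self, obs_dim, act_dim, z_dim):
        super().__init__()
        # Encoder: a recent trajectory of (obs, act, reward) -> task embedding z.
        self.encoder = nn.GRU(obs_dim + act_dim + 1, z_dim, batch_first=True)
        # Decoder: predicts (next_obs, reward) from (z, obs, act) -- the
        # quantities that uniquely identify a task.
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + obs_dim + act_dim, 64), nn.ReLU(),
            nn.Linear(64, obs_dim + 1))

    def embed(self, trajectory):          # trajectory: (B, T, obs+act+reward)
        _, h = self.encoder(trajectory)
        return h.squeeze(0)               # task embedding z: (B, z_dim)

    def loss(self, trajectory, obs, act, next_obs, reward):  # reward: (B, 1)
        z = self.embed(trajectory)
        pred = self.decoder(torch.cat([z, obs, act], dim=-1))
        target = torch.cat([next_obs, reward], dim=-1)
        return ((pred - target) ** 2).mean()
```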
Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we provide a systematic evaluation and comparison of three different classes of MARL algorithms (independent learning, centralised multi-agent policy gradient, value decomposition) in a diverse range of cooperative multi-agent learning tasks. Our experiments serve as a reference for the expected performance of algorithms across different learning tasks, and we provide insights regarding the effectiveness of different learning approaches. We open-source EPyMARL, which extends the PyMARL codebase to include additional algorithms and allows for flexible configuration of algorithm implementation details such as parameter sharing. Finally, we open-source two environments for multi-agent research focusing on coordination under sparse rewards.
We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.
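The core of this actor-critic adaptation is a per-agent deterministic policy gradient computed with a centralised critic that conditions on all agents' actions:

```latex
% Deterministic policy gradient for agent i with a centralised critic.
% D is a replay buffer of joint observations x and joint actions a:
\nabla_{\theta_i} J(\mu_i) =
  \mathbb{E}_{x, a \sim \mathcal{D}}\!\left[
    \nabla_{\theta_i}\mu_i(o_i)\,
    \nabla_{a_i} Q_i^{\mu}\!\left(x, a_1, \dots, a_N\right)
    \big|_{a_i=\mu_i(o_i)}
  \right]
```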
Equilibrium selection in multi-agent games refers to the problem of selecting a Pareto-optimal equilibrium. It has been shown that many state-of-the-art multi-agent reinforcement learning (MARL) algorithms are prone to converging to Pareto-dominated equilibria due to the uncertainty each agent has about the policies of the other agents during training. To address sub-optimal equilibrium selection, we propose Pareto Actor-Critic (PAC), an actor-critic algorithm that utilises a simple property of no-conflict games (a superset of cooperative games with identical rewards): each agent can assume the others will choose actions that lead to a Pareto-optimal equilibrium. We evaluate PAC in a diverse set of multi-agent games and show that it converges to higher episodic returns compared to alternative MARL algorithms, and successfully converges to a Pareto-optimal equilibrium in a range of matrix games. Finally, we propose a graph neural network extension which is shown to scale efficiently in games with up to 15 agents.
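A toy no-conflict game (hypothetical payoffs, chosen purely for illustration) makes the equilibrium-selection problem and PAC's principle concrete:

```latex
% Identical rewards for both agents:
%
%            b_1      b_2
%   a_1       4      -20
%   a_2       0        2
%
% Both (a_1, b_1) and (a_2, b_2) are Nash equilibria, but (a_2, b_2) is
% Pareto-dominated. Under a uniform prior over the other agent's action,
% E[a_1] = (4 - 20)/2 = -8 < E[a_2] = (0 + 2)/2 = 1, so an uncertainty-averse
% learner drifts to the inferior equilibrium. PAC's principle instead
% evaluates each action optimistically over the other agent's choices:
\arg\max_{a} \; \max_{b} \; Q(s, a, b) = a_1 .
```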
Cooperative multi-agent reinforcement learning (MARL) has achieved significant results, most notably by leveraging the representation-learning abilities of deep neural networks. However, large centralized approaches quickly become infeasible as the number of agents scales, and fully decentralized approaches can miss important opportunities for information sharing and coordination. Furthermore, not all agents are equal -- in some cases, individual agents may not even have the ability to send communication to other agents or explicitly model other agents. This paper considers the case where there is a single, powerful, \emph{central agent} that can observe the entire observation space, and there are multiple, low-powered \emph{local agents} that can only receive local observations and are not able to communicate with each other. The central agent's job is to learn what message needs to be sent to different local agents based on the global observations, not by centrally solving the entire problem and sending action commands, but by determining what additional information an individual agent should receive so that it can make a better decision. In this work we present our MARL algorithm \algo, describe where it would be most applicable, and implement it in the cooperative navigation and multi-agent walker domains. Empirical results show that 1) learned communication does indeed improve system performance, 2) results generalize to heterogeneous local agents, and 3) results generalize to different reward structures.
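A hedged sketch of the central/local split, assuming PyTorch (module names and sizes are illustrative, and the algorithm itself is abbreviated as \algo above): the central agent maps the global observation to one learned message per local agent, rather than to action commands.

```python
import torch
import torch.nn as nn

class CentralAgent(nn.Module):
    """Maps the global observation to one learned message per local agent."""
    def __init__(self, global_obs_dim, msg_dim, n_local_agents):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_obs_dim, 128), nn.ReLU(),
            nn.Linear(128, msg_dim * n_local_agents))
        self.msg_dim = msg_dim
        self.n = n_local_agents

    def forward(self, global_obs):
        # One msg_dim-sized message per local agent; no action commands.
        return self.net(global_obs).view(-1, self.n, self.msg_dim)

class LocalAgent(nn.Module):
    """Acts on its own observation plus the message it receives."""
    def __init__(self, local_obs_dim, msg_dim, n_actions):
        super().__init__()
        self.pi = nn.Sequential(
            nn.Linear(local_obs_dim + msg_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, local_obs, msg):
        return self.pi(torch.cat([local_obs, msg], dim=-1))
```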
This paper considers multi-agent reinforcement learning (MARL) tasks where agents receive a shared global reward at the end of an episode. The delayed nature of this reward affects the ability of the agents to assess the quality of their actions at intermediate time-steps. This paper focuses on developing methods to learn a temporal redistribution of the episodic reward to obtain a dense reward signal. Solving such MARL problems requires addressing two challenges: identifying (1) the relative importance of states along the length of an episode (along time), and (2) the relative importance of individual agents' states at any single time-step (among agents). In this paper, we introduce Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning (AREL) to address these two challenges. AREL uses attention mechanisms to characterize the influence of actions on state transitions along trajectories (temporal attention), and how each agent is affected at each time-step (agent attention). The redistributed rewards predicted by AREL are dense and can be integrated with any given MARL algorithm. We evaluate AREL on challenging tasks from the Particle World environment and the StarCraft Multi-Agent Challenge. AREL results in higher rewards in Particle World and improved win rates in StarCraft, in comparison to three state-of-the-art reward redistribution methods. Our code is available at https://github.com/baicenxiao/arel.
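A hedged sketch of attention-based reward redistribution in this spirit, assuming PyTorch (head counts, pooling, and dimensions are assumptions, not AREL's exact architecture):

```python
import torch
import torch.nn as nn

class RewardRedistributor(nn.Module):
    def __init__(self, obs_dim, d_model=64):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        # Agent attention: mixes information across agents at each time-step.
        self.agent_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        # Temporal attention: mixes information along the trajectory.
        self.time_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, obs):                     # obs: (B, T, N, obs_dim)
        B, T, N, _ = obs.shape
        x = self.embed(obs)
        x = x.view(B * T, N, -1)
        x, _ = self.agent_attn(x, x, x)         # attend among agents
        x = x.view(B, T, N, -1).mean(dim=2)     # pool agents -> (B, T, d)
        x, _ = self.time_attn(x, x, x)          # attend along time
        return self.head(x).squeeze(-1)         # dense per-step rewards (B, T)

# Training: regress the sum of predicted per-step rewards onto the episodic
# return, e.g.  loss = (r.sum(dim=1) - episodic_return).pow(2).mean()
```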
Recently, model-based agents have achieved better performance than model-free ones using the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is tough to learn the model of the environment. The significant compounding error may hinder the learning process when model-based methods are applied to multi-agent tasks. This paper proposes an implicit model-based multi-agent reinforcement learning method based on value decomposition methods. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states in the latent space, making agents have the foresight. Our approach can be applied to any multi-agent value decomposition method. The experimental results show that our method improves the sample efficiency in different partially observable Markov decision process domains.
Designing an effective communication mechanism among agents in reinforcement learning has been a challenging task, especially for real-world applications, where the number of agents can grow or the environment sometimes needs to interact with a changing number of agents. To this end, a multi-agent framework needs to handle various scenarios of agents, in terms of both scale and dynamics, to be practical for real-world applications. We formulate multi-agent environments with different numbers of agents as a multi-task problem and propose a meta reinforcement learning (meta-RL) framework to tackle it. The proposed framework employs a meta-learned Communication Pattern Recognition (CPR) module to identify communication behaviour and extract information that facilitates the training process. Experimental results demonstrate that the proposed framework (a) generalises to an unseen larger number of agents and (b) allows the number of agents to change between episodes. Ablation studies are also provided to justify the proposed CPR design and show that the design is effective.
In artificial multi-agent systems, the ability to learn collaborative policies is predicated upon the agents' communication skills: they must be able to encode the information received from the environment and learn how to share it with other agents as required by the task at hand. We present a deep reinforcement learning approach, Connectivity Driven Communication (CDC), that facilitates the emergence of multi-agent collaborative behaviour through experience alone. The agents are modelled as nodes of a weighted graph whose state-dependent edges encode pair-wise messages that can be exchanged. We introduce a graph-dependent attention mechanism that controls how the agents' incoming messages are weighted. This mechanism takes into full account the current state of the system as represented by the graph, and builds upon a diffusion process that captures how information flows across the graph. The graph topology is not assumed to be known a priori, but depends dynamically on the agents' observations and is learnt concurrently with the attention mechanism and policy in an end-to-end fashion. Our empirical results show that CDC is able to learn effective collaborative policies and can outperform competing learning algorithms on cooperative navigation tasks.
Almost all multi-agent reinforcement learning algorithms without communication follow the principle of centralized training with decentralized execution. During centralized training, agents can be guided by the same signals, such as the global state. During decentralized execution, however, agents lack the shared signal. Inspired by viewpoint invariance and contrastive learning, we propose consensus learning for cooperative multi-agent reinforcement learning in this paper. Although based only on local observations, different agents can infer the same consensus in a discrete space. During decentralized execution, we feed the inferred consensus to the agents' networks as an explicit input, which fosters their spirit of cooperation. Our proposed method can be extended to various multi-agent reinforcement learning algorithms with small model changes. Moreover, we apply it to several fully cooperative tasks and obtain convincing results.
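One plausible realisation of a shared discrete consensus, sketched in PyTorch (the codebook quantisation below is an assumption about one way to realise the mechanism, not the paper's exact design):

```python
import torch
import torch.nn as nn

class ConsensusEncoder(nn.Module):
    def __init__(self, obs_dim, z_dim=32, n_codes=16):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, z_dim))
        # Codebook shared by all agents; training should pull codes inferred
        # from different agents' views of the same state together.
        self.codebook = nn.Parameter(torch.randn(n_codes, z_dim))

    def forward(self, local_obs):
        z = self.enc(local_obs)                  # (B, z_dim)
        dists = torch.cdist(z, self.codebook)    # (B, n_codes)
        idx = dists.argmin(dim=-1)               # discrete consensus id
        return self.codebook[idx]                # consensus vector for the policy

# During decentralized execution, each agent concatenates the inferred
# consensus vector with its local observation as an explicit policy input.
```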
Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in single-agent settings. We present an actor-critic algorithm that trains decentralized policies in multi-agent settings, using centrally computed critics that share an attention mechanism which selects relevant information for each agent at every timestep. This attention mechanism enables more effective and scalable learning in complex multi-agent environments, when compared to recent approaches. Our approach is applicable not only to cooperative settings with shared rewards, but also individualized reward settings, including adversarial settings, as well as settings that do not provide global states, and it makes no assumptions about the action spaces of the agents. As such, it is flexible enough to be applied to most multi-agent learning problems.
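A hedged single-head sketch of such an attention critic in PyTorch (sizes and the masking scheme are assumptions): each agent's Q-value combines its own encoding with an attention-weighted summary of the other agents' encodings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, d=64):
        super().__init__()
        self.embed = nn.Linear(obs_dim + act_dim, d)   # per-agent (o, a) encoder
        self.query = nn.Linear(d, d, bias=False)
        self.key = nn.Linear(d, d, bias=False)
        self.value = nn.Linear(d, d, bias=False)
        self.q_head = nn.Linear(2 * d, 1)              # Q_i from (own, attended)

    def forward(self, obs, act):                       # (B, N, obs), (B, N, act)
        e = self.embed(torch.cat([obs, act], dim=-1))  # (B, N, d)
        q, k, v = self.query(e), self.key(e), self.value(e)
        logits = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5   # (B, N, N)
        # Mask self-attention so agent i attends only to the other agents.
        mask = torch.eye(e.shape[1], device=e.device).unsqueeze(0)
        attended = F.softmax(logits - mask * 1e9, dim=-1) @ v  # (B, N, d)
        return self.q_head(torch.cat([e, attended], dim=-1))   # (B, N, 1)
```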
We consider the problem of multi-agent navigation and collision avoidance when observations are limited to the local neighborhood of each agent. We propose InforMARL, a novel architecture for multi-agent reinforcement learning (MARL) which uses local information intelligently to compute paths for all the agents in a decentralized manner. Specifically, InforMARL aggregates information about the local neighborhood of agents for both the actor and the critic using a graph neural network and can be used in conjunction with any standard MARL algorithm. We show that (1) in training, InforMARL has better sample efficiency and performance than baseline approaches, despite using less information, and (2) in testing, it scales well to environments with arbitrary numbers of agents and obstacles.
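A hedged sketch of the neighbourhood aggregation step in PyTorch (a generic mean-aggregation message-passing layer; InforMARL's exact graph network may differ):

```python
import torch
import torch.nn as nn

class NeighborhoodAggregator(nn.Module):
    """One round of message passing over each agent's local neighbourhood."""
    def __init__(self, feat_dim, d=64):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * feat_dim, d), nn.ReLU())
        self.upd = nn.Sequential(nn.Linear(feat_dim + d, d), nn.ReLU())

    def forward(self, x, adj):
        # x: (N, feat_dim) node features; adj: (N, N) 0/1 neighbourhood mask,
        # adj[i, j] = 1 iff node j is in node i's local neighbourhood.
        N = x.shape[0]
        pairs = torch.cat([x.unsqueeze(1).expand(N, N, -1),
                           x.unsqueeze(0).expand(N, N, -1)], dim=-1)
        m = self.msg(pairs) * adj.unsqueeze(-1)          # messages from neighbours
        agg = m.sum(dim=1) / adj.sum(dim=1, keepdim=True).clamp(min=1)
        return self.upd(torch.cat([x, agg], dim=-1))     # aggregated embedding

# The embedding feeds both the actor and the critic of any standard MARL
# algorithm; because it depends only on local neighbourhoods, the same
# network scales to arbitrary numbers of agents and obstacles.
```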
Many real-world problems, such as network packet routing and the coordination of autonomous vehicles, are naturally modelled as cooperative multi-agent systems. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
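COMA's counterfactual baseline can be stated compactly; for agent $a$ with joint action $\mathbf{u}$ and action-observation history $\tau^a$:

```latex
% Counterfactual advantage: compare the joint-action value against a baseline
% that marginalises out agent a's own action while keeping the other agents'
% actions u^{-a} fixed:
A^{a}(s, \mathbf{u}) = Q(s, \mathbf{u})
  - \sum_{u'^{a}} \pi^{a}\!\left(u'^{a} \mid \tau^{a}\right)
    Q\!\left(s, \left(\mathbf{u}^{-a}, u'^{a}\right)\right)
```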
Recent advances in reinforcement learning have demonstrated its ability to solve hard agent-environment interaction tasks at a super-human level. However, the application of RL methods to practical and real-world tasks is currently limited due to the sample inefficiency of most state-of-the-art RL algorithms, i.e., the need for a vast number of training samples. For example, OpenAI Five, the algorithm that defeated human players in Dota 2, was trained for thousands of years of game time. Several approaches exist that tackle the sample inefficiency problem, either by making more efficient use of experience or by aiming to acquire more relevant and diverse experience via better exploration of the environment. However, to our knowledge, no such approaches exist for model-based algorithms, which are distinguished by their high sample efficiency in solving hard control tasks with high-dimensional state spaces. This work connects exploration techniques and model-based reinforcement learning. We design a novel exploration method that takes into account the features of model-based approaches. We also demonstrate through experiments that our method significantly improves the performance of the model-based algorithm Dreamer.
In cooperative multi-agent reinforcement learning (MARL), where agents can only obtain partial observations, efficiently leveraging local information is critical. Over long-horizon observations, agents can build awareness of their teammates to alleviate the problem of partial observability. However, previous MARL methods usually neglect this kind of use of local information. To address this, we propose a novel framework, multi-agent \textit{Local INformation Decomposition for Awareness of teammates} (LINDA), with which agents learn to decompose local information and build awareness of each teammate. We model the awareness as stochastic random variables and perform representation learning to ensure the informativeness of the awareness representations, by maximizing the mutual information between awareness and the actual trajectory of the corresponding agent. LINDA is agnostic to specific algorithms and can be flexibly integrated into different MARL methods. Extensive experiments show that the proposed framework learns informative awareness from local partial observations, leading to better collaboration and significantly improved learning performance, especially on challenging tasks.
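In schematic form, the awareness objective described above is a mutual-information maximisation, typically made tractable with a variational lower bound (the bound below is the standard Barber-Agakov form; whether LINDA uses exactly this estimator is an assumption):

```latex
% Agent i's awareness c_{ij} of teammate j (a stochastic latent inferred from
% i's local trajectory) is made informative by maximising mutual information
% with j's actual trajectory tau_j, via an auxiliary posterior q:
\max \; I\!\left(c_{ij}; \tau_j\right)
  \;\geq\; \mathbb{E}\!\left[\log q\!\left(c_{ij} \mid \tau_j\right)\right]
  + \mathcal{H}\!\left(c_{ij}\right)
```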
Multi-robot manipulation tasks involve various control entities that can be separated into dynamically independent parts. A typical example of such real-world tasks is dual-arm manipulation. Learning to naively solve such tasks with reinforcement learning is often unfeasible due to the sample complexity and exploration requirements growing with the dimensionality of the action and state spaces. Instead, we would like to handle such environments as multi-agent systems and have several agents control parts of the whole. However, decentralizing the generation of actions requires coordination across agents through a channel limited to information central to the task. This paper proposes an approach to coordinating multi-robot manipulation through learned latent action spaces that are shared across different agents. We validate our method in simulated multi-robot manipulation tasks and demonstrate improvement over previous baselines in terms of sample efficiency and learning performance.
Although deep reinforcement learning (RL) has recently enjoyed many successes, its methods are still data inefficient, which makes many problems prohibitively expensive to solve in terms of data. We aim to remedy this by taking advantage of the rich supervisory signal in unlabeled data for learning state representations. This paper introduces three different representation learning algorithms that have access to different subsets of the data sources used by traditional RL algorithms: (i) GRICA is inspired by independent component analysis (ICA) and trains a deep neural network to output statistically independent features of the input. GRICA does this by minimizing the mutual information between each feature and the other features. Additionally, GRICA only requires an unsorted collection of environment states. (ii) Latent Representation Prediction (LARP) requires more context: in addition to requiring a state as input, it also needs the previous state and the action connecting them. This method learns a state representation by predicting the next state of the environment from the current state and action. The predictor is used together with a graph search algorithm. (iii) The third method learns a state representation by training a deep neural network to learn a smoothed version of the reward function. The representation is used to preprocess inputs to deep RL, while the reward predictor is used for reward shaping. This method requires only state-reward pairs from the environment for learning the representation. We find that each method has its strengths and weaknesses, and conclude from our experiments that including unsupervised representation learning in RL problem-solving pipelines can speed up learning.
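A hedged sketch of the third method, assuming PyTorch (layer sizes and the smoothing scheme are illustrative): a network trunk is trained to regress a smoothed reward from states, then reused as a feature extractor for the downstream RL agent.

```python
import torch
import torch.nn as nn

class RewardRepresentation(nn.Module):
    def __init__(self, state_dim, feat_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.reward_head = nn.Linear(feat_dim, 1)

    def forward(self, state):
        return self.trunk(state)          # representation for the RL agent

    def loss(self, state, smoothed_reward):
        # Only (state, reward) pairs are needed; the smoothing (e.g. a
        # moving average of raw rewards) is computed outside the network.
        pred = self.reward_head(self.trunk(state)).squeeze(-1)
        return ((pred - smoothed_reward) ** 2).mean()

# trunk(state) preprocesses inputs for the deep RL algorithm, while the
# trained reward_head doubles as a reward-shaping signal.
```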
Independent reinforcement learning algorithms have no theoretical guarantees of finding the best policy in multi-agent settings. However, in practice, prior works have reported good performance with independent algorithms in some domains and poor performance in others. Moreover, a comprehensive study of the strengths and weaknesses of independent algorithms is lacking in the literature. In this paper, we carry out an empirical comparison of the performance of independent algorithms on four PettingZoo environments that span the three main categories of multi-agent environments, i.e., cooperative, competitive, and mixed. We show that in fully-observable environments, independent algorithms can perform on par with multi-agent algorithms in cooperative and competitive settings. For the mixed environments, we show that agents trained via independent algorithms learn to perform well individually, but fail to learn to cooperate with allies and compete with enemies. We also show that adding recurrence improves the learning of independent algorithms in cooperative partially observable environments.
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
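For reference, the deep Q-network objective that anchors the value-based stream of this survey, in its standard form with target network parameters $\theta^{-}$ and replay buffer $\mathcal{D}$:

```latex
% Regress Q(s, a; \theta) onto a bootstrapped target computed with a
% periodically updated target network \theta^{-}:
L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}
  \left[\left(r + \gamma \max_{a'} Q\!\left(s', a'; \theta^{-}\right)
  - Q\!\left(s, a; \theta\right)\right)^{2}\right]
```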