Advancements in reinforcement learning (RL) have inspired new directions in intelligent automation of network defense. However, many of these advancements have either outpaced their application to network security or have not considered the challenges associated with implementing them in the real world. To understand these problems, this work evaluates several RL approaches implemented in the second edition of the CAGE Challenge, a public competition to build an autonomous network defender agent in a high-fidelity network simulator. Our approaches all build on the Proximal Policy Optimization (PPO) family of algorithms, and include hierarchical RL, action masking, custom training, and ensemble RL. We find that the ensemble RL technique performs strongest, outperforming our other models and taking second place in the competition. To understand applicability to real environments, we evaluate each method's ability to generalize to unseen networks and against an unknown attack strategy. In unseen environments, all of our approaches perform worse, with degradation varying based on the type of environmental change. Against an unknown attacker strategy, we found that our models had reduced overall performance even though the new strategy was less efficient than the ones our models trained on. Together, these results highlight promising research directions for autonomous network defense in the real world.
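As a concrete illustration of how the ensemble and action-masking ideas can be combined at decision time, here is a minimal sketch (not the authors' implementation; `action_probs` and `valid_mask` are assumed placeholders): several independently trained PPO policies average their action distributions, and the best still-valid action is taken.

```python
# Hedged sketch: ensemble of PPO policies with action masking at decision time.
import numpy as np

def ensemble_act(policies, obs, valid_mask):
    """Average the action distributions of several trained PPO policies and
    pick the highest-probability action that is currently valid."""
    probs = np.mean([p.action_probs(obs) for p in policies], axis=0)  # hypothetical API
    probs = np.where(valid_mask, probs, 0.0)                          # action masking
    return int(np.argmax(probs))
```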
Defending computer networks from cyber attacks requires timely responses to alerts and threat intelligence. Deciding how to respond involves coordinating actions across multiple nodes based on indicators of compromise, while minimizing disruption to network operations. Currently, playbooks are used to automate parts of the response process, but they typically leave complex decision-making to human analysts. In this work, we present a deep reinforcement learning approach to autonomous response and recovery in large industrial control networks. We propose an attention-based neural architecture that is flexible with respect to the size of the network under protection. To train and evaluate the autonomous defender agent, we present an industrial control network simulation environment suitable for reinforcement learning. Experiments show that the learned agent can effectively mitigate advanced attacks that progress for several months before execution while producing few observable signals. The proposed deep reinforcement learning approach outperforms a fully automated playbook method in simulation, taking fewer disruptive actions while preserving more nodes on the network. The learned policy is also more robust to changes in attacker behavior than the playbook approach.
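To illustrate how an attention-based policy can stay flexible in the number of defended nodes, here is a hedged sketch (not the paper's architecture; all dimensions and layer choices are assumptions) in which self-attention over per-node features yields per-node action logits.

```python
# Hedged sketch: attention over node embeddings, agnostic to the node count.
import torch
import torch.nn as nn

class NodeAttentionPolicy(nn.Module):
    def __init__(self, node_feat_dim=32, hidden=64, actions_per_node=4):
        super().__init__()
        self.embed = nn.Linear(node_feat_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, actions_per_node)

    def forward(self, node_feats):            # (batch, n_nodes, node_feat_dim)
        h = torch.relu(self.embed(node_feats))
        h, _ = self.attn(h, h, h)             # nodes attend to each other
        return self.head(h)                   # (batch, n_nodes, actions_per_node)
```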
Reinforcement learning (RL) operating on attack graphs that leverage cyber terrain principles is used to develop the reward and state associated with determining surveillance detection routes (SDR). This work extends previous efforts on developing RL methods for path analysis within enterprise networks. It focuses on building SDR in which the routes explore network services while trying to evade risk. RL supports the development of these routes through a reward mechanism that helps realize these paths. The RL algorithm is modified with a novel warm-up phase that decides, during initial exploration, which areas of the network are safe to explore based on the rewards and a penalty scale factor.
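The warm-up idea can be pictured with a rough sketch like the one below (the environment interface, thresholds, and the way "safe" regions are recorded are assumptions, not the paper's implementation): random exploration scores states by reward minus a scaled penalty, and only states clearing a threshold are marked safe for later exploration.

```python
# Hedged sketch of a warm-up phase that flags safe regions before learning SDR.
from collections import defaultdict

def warm_up(env, episodes=50, penalty_scale=1.0, safe_threshold=-5.0):
    returns = defaultdict(list)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = env.sample_action()                     # random exploration
            next_state, reward, penalty, done = env.step(action)  # assumed interface
            returns[state].append(reward - penalty_scale * penalty)
            state = next_state
    # states whose average (reward - scaled penalty) clears the threshold are "safe"
    return {s for s, rs in returns.items() if sum(rs) / len(rs) >= safe_threshold}
```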
Reinforcement learning allows machines to learn from their own experience. Nowadays, it is used in safety-critical applications, such as autonomous driving, despite being vulnerable to attacks carefully crafted either to prevent the reinforcement learning algorithm from learning an effective and reliable policy, or to induce the trained agent to make a wrong decision. The literature on the security of reinforcement learning is rapidly growing, and some surveys have been proposed to shed light on this field. However, their categorizations are insufficient for choosing an appropriate defense given the kind of system at hand. In our survey, we not only overcome this limitation by considering a different perspective, but we also discuss the applicability of state-of-the-art attacks and defenses when reinforcement learning algorithms are used in the context of autonomous driving.
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
Multi-agent reinforcement learning (MARL) has seen significant progress over the past decade, but many challenges, such as high sample complexity and slow convergence to stable policies, remain to be overcome before widespread deployment is possible. However, in practice, many real-world environments already deploy sub-optimal or heuristic approaches for generating policies. An interesting question is how best to use such approaches as advisors to help improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online sub-optimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in general-sum stochastic game environments and present two novel Q-learning-based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM) and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which allow us to improve learning by appropriately incorporating advice from an advisor (ADMIRAL-DM) and to evaluate the effectiveness of an advisor (ADMIRAL-AE). We analyze the algorithms theoretically and provide fixed-point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms can be used in a variety of environments, have performance that compares favorably with related baselines, scale to large state-action spaces, and are robust to poor advice from advisors.
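A simplified, single-agent sketch of the "learn with an advisor" idea behind ADMIRAL-DM is given below (the real algorithms are multi-agent and differ in their update rules; the environment and advisor interfaces here are assumptions): the agent follows the advisor's suggestion with a decaying probability and otherwise acts greedily on its own Q-values.

```python
# Hedged sketch: tabular Q-learning that consults a (possibly sub-optimal) advisor.
import random
from collections import defaultdict

def q_learning_with_advisor(env, advisor, episodes=500,
                            alpha=0.1, gamma=0.99, follow0=1.0, decay=0.995):
    Q = defaultdict(lambda: defaultdict(float))
    follow = follow0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < follow:
                a = advisor(s)                                  # possibly sub-optimal advice
            else:
                a = max(env.actions(s), key=lambda a_: Q[s][a_])
            s2, r, done = env.step(a)                           # assumed interface
            best_next = max((Q[s2][a_] for a_ in env.actions(s2)), default=0.0)
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s2
        follow *= decay                                         # rely less on the advisor over time
    return Q
```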
Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have met problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple input distributions, typically in classification, lifelong reinforcement learning (LRL) must also deal with variations in the state and transition distributions, and in the reward functions. Modulating masks, recently developed for classification, are particularly suitable to deal with such a large spectrum of task variations. In this paper, we adapted modulating masks to work with deep LRL, specifically PPO and IMPALA agents. The comparison with LRL baselines in both discrete and continuous RL tasks shows competitive performance. We further investigated the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, the algorithm solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge to learn increasingly complex tasks, and to knowledge reuse for efficient and faster learning.
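The mask mechanism can be sketched as follows (an illustration only, not the paper's exact PPO/IMPALA integration; soft sigmoid masks and mask averaging are simplifying assumptions): a shared weight matrix is element-wise modulated by a task-specific mask, and a new task can be initialized from previously learned masks.

```python
# Hedged sketch: a linear layer whose weights are modulated by per-task masks.
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.01)  # shared weights
        self.masks = nn.ParameterDict()                              # one mask per task

    def add_task(self, task, prev_tasks=()):
        init = torch.zeros_like(self.weight)
        if prev_tasks:  # reuse knowledge: start from a combination of old masks
            init = torch.stack([self.masks[t].data for t in prev_tasks]).mean(0)
        self.masks[task] = nn.Parameter(init)

    def forward(self, x, task):
        mask = torch.sigmoid(self.masks[task])   # soft mask in (0, 1)
        return x @ (self.weight * mask).t()
```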
Software testing activities aim to find possible defects in a software product and to ensure that the product meets its intended requirements. Some approaches to software testing lack automation or are only partially automated, which increases testing time and the overall cost of software testing. Recently, reinforcement learning (RL) has been used successfully in complex testing tasks, such as game testing, regression testing, and test case prioritization, to automate the process and provide continuous adaptation. Practitioners can use RL by implementing RL algorithms from scratch or by using RL frameworks. Developers have widely used these frameworks to solve problems in various domains, including software testing. However, to the best of our knowledge, no study has empirically evaluated the effectiveness and performance of the algorithms implemented in RL frameworks. In this paper, we empirically investigate the application of carefully selected RL algorithms to two important software testing tasks: test case prioritization in the context of continuous integration (CI), and game testing. For the game testing task, we conduct experiments on a simple game and use RL algorithms to explore the game and detect bugs. The results show that some of the selected RL frameworks, such as Tensorforce, outperform state-of-the-art approaches from the literature. For test case prioritization, we run experiments in a CI environment, where RL algorithms from different frameworks are used to rank test cases. Our results show that, in some cases, the performance differences among the pre-implemented algorithms are considerable, motivating further investigation. Moreover, an empirical evaluation on a few benchmark problems is recommended for researchers who wish to select an RL framework, to make sure its RL algorithms perform as expected.
Deep ensembles have been shown to extend the positive effects seen in typical ensemble learning to neural networks and to reinforcement learning (RL). However, much remains to be done to improve the efficiency of such ensemble models. In this work, we present Diverse Ensembles for Fast Transfer in RL (DEFT), a new ensemble-based method for reinforcement learning in highly multimodal environments with improved transfer to unseen environments. The algorithm has two main phases: training of the ensemble members, followed by synthesis (or fine-tuning) of those members to act in a new environment. The first phase involves training regular policy gradient or actor-critic agents in parallel, with an added loss term that encourages the policies to differ from one another. This causes the individual unimodal agents to explore the space of optimal policies and to capture more of the multimodality of the environment than a single actor could. The second phase of DEFT involves synthesizing the component policies into a new policy that works well in a modified environment, in one of two ways. To evaluate the performance of DEFT, we start from a base version of the Proximal Policy Optimization (PPO) algorithm and extend it with the DEFT modifications. Our results show that the pretraining phase is effective at producing diverse policies in multimodal environments. DEFT often converges to high reward significantly faster than alternatives such as random initialization without DEFT and fine-tuning of individual ensemble members. While there is certainly more work to be done to analyze DEFT theoretically and to extend it to be more robust, we believe it provides a strong framework for capturing multimodality in environments while still using RL methods with simple policy representations.
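A sketch of the phase-one diversity objective is shown below (the exact form and weighting of the term are assumptions, not necessarily DEFT's): each member keeps its usual PPO loss and adds a bonus for the KL divergence between its action distribution and the other members' on a shared batch of states.

```python
# Hedged sketch of a pairwise-KL diversity bonus added to each member's PPO loss.
import torch

def diversity_bonus(logits_per_member, member_idx):
    """Mean KL divergence from member_idx's policy to every other member,
    computed on a shared batch of observations. Each element is (batch, actions)."""
    log_p = torch.log_softmax(logits_per_member[member_idx], dim=-1)
    kls = []
    for j, other in enumerate(logits_per_member):
        if j == member_idx:
            continue
        log_q = torch.log_softmax(other, dim=-1)
        kls.append((log_p.exp() * (log_p - log_q)).sum(-1).mean())
    return torch.stack(kls).mean()

# total_loss = ppo_loss - beta * diversity_bonus(logits, i)   # reward being different
```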
The scale of Internet-connected systems has increased considerably, and these systems are exposed to cyber attacks more than ever before. The complexity and dynamics of cyber attacks require protection mechanisms to be responsive, adaptive, and scalable. Machine learning, or more specifically deep reinforcement learning (DRL), methods have been proposed widely to address these issues. By incorporating deep learning into traditional RL, DRL is highly capable of solving complex, dynamic, and especially high-dimensional cyber defense problems. This paper presents a survey of DRL approaches developed for cyber security. We touch on different vital aspects, including DRL-based security methods for cyber-physical systems, autonomous intrusion detection techniques, and multiagent DRL-based game-theoretic simulations for defense strategies against cyber attacks. Extensive discussions and future research directions on DRL-based cyber security are also given. We expect this comprehensive review to provide foundations for, and facilitate, future studies exploring the potential of emerging DRL to cope with increasingly complex cyber security problems.
Self-play reinforcement learning has achieved state-of-the-art, and often superhuman, performance in a variety of zero-sum games. However, prior work has found that policies that perform well against regular opponents can fail catastrophically against adversarial policies: opponents trained explicitly against the victim. Prior defenses using adversarial training were able to make the victim robust to a specific adversary, but the victim remained vulnerable to new ones. We conjecture that this limitation is due to insufficient diversity of adversaries seen during training. We propose a defense using population-based training to pit the victim against a diverse set of opponents. We evaluate this defense's robustness against new adversaries in two low-dimensional environments. Our defense increases robustness against adversaries, as measured by the number of attacker training timesteps needed to exploit the victim. Furthermore, we show that robustness is correlated with the size of the opponent population.
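Schematically, the defense replaces a fixed training opponent with sampling from a population, as in this hedged sketch (function names are placeholders, not an actual library API):

```python
# Hedged sketch: train the victim against opponents sampled from a growing population.
import random

def train_victim_against_population(victim, population, env, iterations=10_000):
    for it in range(iterations):
        opponent = random.choice(population)          # diverse opposition each episode
        trajectory = env.play_episode(victim, opponent)
        victim.update(trajectory)                     # e.g., a PPO update
        if it % 1000 == 0:                            # periodically grow the population
            population.append(victim.snapshot())
    return victim
```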
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs, which raises concerns about deploying such agents in the real world. To address this issue, we propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against $l_p$-norm bounded adversarial attacks. Our framework is compatible with popular deep reinforcement learning algorithms, and we demonstrate its performance with deep Q-learning, A3C, and PPO. We experiment on three deep RL benchmarks (Atari, MuJoCo, and Procgen) to show the effectiveness of our robust training algorithm. Our RADIAL-RL agents consistently outperform existing methods when evaluated against attacks of varying strength, and are more computationally efficient to train. In addition, we propose a new evaluation method, called Greedy Worst-Case Reward (GWC), to measure the attack-agnostic robustness of deep RL agents. We show that GWC can be evaluated efficiently and is a good estimate of the reward under the worst possible sequence of adversarial attacks. All code used for our experiments is available at https://github.com/tuomaso/radial_rl_v2.
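A simplified sketch of the Greedy Worst-Case Reward idea follows (the `inducible_actions` helper stands in for the certified-bound computation used in the paper and is an assumption): roll out an episode assuming that, at every step, the agent is fooled into the lowest-value action an eps-bounded perturbation could induce.

```python
# Hedged sketch: greedy worst-case rollout under an eps-bounded observation attack.
def greedy_worst_case_reward(env, agent, eps):
    obs, done, total = env.reset(), False, 0.0
    while not done:
        candidates = agent.inducible_actions(obs, eps)       # actions reachable under attack
        action = min(candidates, key=lambda a: agent.q_value(obs, a))
        obs, reward, done = env.step(action)                 # assumed interface
        total += reward
    return total
```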
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
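For reference, here is a hedged sketch of the return-conditioned Hindsight Credit Assignment loss that this thesis builds on (base method only, not the causal-structure variant; tensor shapes and the clamping are assumptions): the usual return weight Z is replaced by (1 - pi(a|s)/h(a|s,Z)) * Z, where h is a learned hindsight distribution over actions given the state and the observed return.

```python
# Hedged sketch of a return-conditioned HCA policy loss on batched torch tensors.
def hca_policy_loss(log_pi, pi_a, h_a, returns):
    """log_pi: log pi(a|s); pi_a: pi(a|s); h_a: h(a|s, Z); returns: Z."""
    advantage = (1.0 - pi_a / h_a.clamp(min=1e-6)) * returns   # hindsight-weighted credit
    return -(log_pi * advantage.detach()).mean()
```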
In this article we introduce the Arcade Learning Environment (ALE): both a challenge problem and a platform and methodology for evaluating the development of general, domain-independent AI technology. ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. Most importantly, it provides a rigorous testbed for evaluating and comparing approaches to these problems. We illustrate the promise of ALE by developing and benchmarking domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning. In doing so, we also propose an evaluation methodology made possible by ALE, reporting empirical results on over 55 different games. All of the software, including the benchmark agents, is publicly available.
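For orientation, one common way to access ALE game environments today is through the Gymnasium bindings; the snippet below is a sketch that assumes the gymnasium and ale-py packages and the Atari ROMs are installed.

```python
# Hedged sketch: a random-agent rollout in an ALE environment via Gymnasium.
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)            # registers the ALE/* ids (needed in recent versions)
env = gym.make("ALE/Breakout-v5")
obs, info = env.reset(seed=0)
episode_return, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    episode_return += reward
    done = terminated or truncated
env.close()
print(episode_return)
```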
Digitization and remote connectivity have enlarged the attack surface and made cyber systems more vulnerable. As attackers become increasingly sophisticated and resourceful, relying solely on traditional cyber protection, such as intrusion detection, firewalls, and encryption, is insufficient to secure cyber systems. Cyber resilience provides a new security paradigm that complements inadequate protection with resilience mechanisms. A cyber-resilient mechanism (CRM) adapts to known or zero-day threats and uncertainties in real time and responds to them strategically in order to maintain the critical functions of a cyber system in the event of a successful attack. Feedback architectures play a pivotal role in enabling the online sensing, reasoning, and actuation processes of a CRM. Reinforcement learning (RL) is an important tool for constructing feedback architectures for cyber resilience. It allows a CRM to provide sequential responses to attacks with limited or no prior knowledge of the environment and the attacker. In this work, we review the literature on RL for cyber resilience and discuss cyber resilience against three major types of vulnerabilities: posture-related, information-related, and human-related vulnerabilities. We introduce three application domains of CRMs: moving target defense, defensive cyber deception, and assistive human security technologies. RL algorithms, however, have vulnerabilities of their own. We explain three vulnerabilities of RL and present attack models in which the attacker targets the information exchanged between the environment and the agent: the rewards, the state observations, and the action commands. We show that the attacker can mislead the RL agent into learning a nefarious policy with minimal attack effort. Finally, we discuss the future challenges of RL for cyber security and resilience, as well as emerging applications of RL-based CRMs.
In this paper, we consider the problem of path finding for a set of homogeneous and autonomous agents navigating a previously unknown stochastic environment. In our problem setting, each agent attempts to maximize a given utility function while respecting safety properties. Our solution is based on ideas from evolutionary game theory, namely replicating policies that perform well and diminishing ones that do not. We do a comprehensive comparison with related multiagent planning methods, and show that our technique beats state of the art RL algorithms in minimizing path length by nearly 30% in large spaces. We show that our algorithm is computationally faster than deep RL methods by at least an order of magnitude. We also show that it scales better with an increase in the number of agents as compared to other methods, path planning methods in particular. Lastly, we empirically prove that the policies that we learn are evolutionarily stable and thus impervious to invasion by any other policy.
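The replicator idea the abstract refers to can be written down generically as below (textbook evolutionary game theory, not the authors' full multi-agent algorithm; the discrete learning-rate form is an assumption): policies whose observed utility beats the population average gain probability mass, and the rest lose it.

```python
# Hedged sketch: one discrete-time replicator-dynamics update over candidate policies.
import numpy as np

def replicator_step(probs, fitness, lr=0.1):
    """probs: current mixture over candidate policies; fitness: their observed utilities."""
    probs = np.asarray(probs, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    avg = probs @ fitness
    probs = probs * (1.0 + lr * (fitness - avg))   # replicate winners, diminish losers
    probs = np.clip(probs, 1e-8, None)
    return probs / probs.sum()
```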
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
Recent work has shown that deep reinforcement learning (DRL) policies are vulnerable to adversarial perturbations. Adversaries can mislead the policies of DRL agents by perturbing the environment observed by the agent. Existing attacks are feasible in principle but face challenges in practice, for example by being too slow to fool DRL policies in real time. We show that using the Universal Adversarial Perturbation (UAP) method to compute perturbations, independent of the individual inputs to which they are applied, can effectively fool DRL policies. We describe three such attack variants. Through an extensive evaluation using three Atari 2600 games, we show that our attacks are effective, as they fully degrade the performance of three different DRL agents (by up to 100%, even when the $l_\infty$ bound on the perturbation is as small as 0.01). Our attacks are faster than the response time (0.6 ms on average) of the DRL policies and considerably faster than prior attacks using adversarial perturbations (1.8 ms on average). We also show that our attack technique is efficient, incurring an online computational cost of 0.027 ms on average. Using two further tasks involving robotic movement, we confirm that our results generalize to more complex DRL tasks. Furthermore, we demonstrate that the effectiveness of known defenses diminishes against universal perturbations. We propose an effective technique that detects all known adversarial perturbations against DRL policies, including all the universal perturbations presented in this paper.
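A hedged sketch of computing such a universal perturbation is given below (it illustrates the generic UAP idea rather than the paper's three specific attack variants; the policy interface is an assumption): a single perturbation is optimized over a batch of collected states to steer the policy away from its original actions, then clipped to the $l_\infty$ budget.

```python
# Hedged sketch: one universal perturbation optimized over a batch of observed states.
import torch

def universal_perturbation(policy, states, eps=0.01, steps=200, lr=1e-3):
    states = torch.as_tensor(states, dtype=torch.float32)
    with torch.no_grad():
        clean_actions = policy(states).argmax(dim=-1)        # actions to steer away from
    delta = torch.zeros_like(states[0], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = policy(states + delta)
        loss = -torch.nn.functional.cross_entropy(logits, clean_actions)  # maximize error
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                          # stay inside the l_inf ball
    return delta.detach()
```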
The combination of reinforcement learning (RL) with deep learning has led to a series of impressive feats, and many believe (deep) RL provides a path towards generally capable agents. However, the success of RL agents is often highly sensitive to design choices in the training process, which may require tedious and error-prone manual tuning. This makes it challenging to use RL for new problems and also limits its full potential. In many other areas of machine learning, AutoML has shown that it is possible to automate such design choices, and it has also yielded promising initial results when applied to RL. However, Automated Reinforcement Learning (AutoRL) involves not only standard applications of AutoML but also additional challenges unique to RL, which naturally produce a different set of methods. As such, AutoRL has emerged as an important area of research in RL, providing promise in a variety of applications from RNA design to playing games such as Go. Given the diversity of methods and environments considered in RL, much of the research has been conducted in distinct subfields, ranging from meta-learning to evolution. In this survey, we seek to unify the field of AutoRL, provide a common taxonomy, discuss each area in detail, and pose open problems of interest to researchers going forward.
While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security and robustness of RL systems require more attention. A recent work has shown that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. a Trojan agent), which can lead to catastrophic failure as soon as it sees the backdoor trigger action. We propose the problem of RL backdoor detection, aiming to address this security vulnerability. An interesting observation we drew from extensive empirical studies is a trigger smoothness property: normal actions that are similar to the backdoor trigger actions can also trigger low performance of the Trojan agent. Inspired by this observation, we propose a reinforcement learning solution, TrojanSeeker, to find approximate trigger actions for Trojan agents, and we further propose an efficient approach to mitigate the Trojan agents based on machine unlearning. Experiments show that our approach can correctly distinguish and mitigate all the Trojan agents across various types of agents and environments.